How do I interpret Input Size / Records in the Spark Stage UI?

I'm looking at the Spark UI (Spark v1.6.0) for a stage of a job I'm currently running, and I don't understand how to interpret what it's telling me (see the Spark stage UI screenshot). The numbers in the "Shuffle Write Size / Records" column make sense; they are consistent with the data I'm processing.

What I do not understand is the numbers in "Input Size / Records". They indicate that the incoming data has only ~6-7 records in each partition; the job has 200 partitions, so ~1200 records in all. I don't know what that is referring to: none of the input datasets to this job (which was implemented using SparkSQL) have ~1200 records in them.
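For reference, here is a minimal sketch of how one could count records per partition and compare the totals against what the UI reports, assuming the SparkSQL result is a DataFrame named df (the name is hypothetical):

    // Count the records in each partition of a hypothetical DataFrame `df`
    // and compare with the "Input Size / Records" column in the stage UI.
    val perPartition = df.rdd
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()

    perPartition.foreach { case (idx, n) => println(s"partition $idx: $n records") }
    println(s"total records: ${perPartition.map(_._2).sum}")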

So, I'm flummoxed as to what those numbers are referring to. Can anyone enlighten me?



Solution 1:[1]

Your Input Size / Records is too low. It means that each task is only processing approximately 14 MB of data at a time, which is too little. A common rule of thumb is roughly 128 MB per task.

You can change this by increasing the HDFS block size to 128 MB, i.e. setting dfs.blocksize to 134217728, or, if you are reading from AWS S3, by setting fs.s3a.block.size to 134217728 in core-site.xml.

Changing this will also bring down the number of partitions.
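As a minimal sketch (one option, not the only way), the same settings can also be applied programmatically on the Hadoop configuration instead of editing core-site.xml; the application name and the s3a path below are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("block-size-example"))

    // 128 MB = 134217728 bytes; mirrors the core-site.xml settings above
    sc.hadoopConfiguration.set("dfs.blocksize", "134217728")     // HDFS block size for newly written files
    sc.hadoopConfiguration.set("fs.s3a.block.size", "134217728") // block size the S3A connector reports for splits

    // Files read after this point should be split into roughly 128 MB input partitions
    val lines = sc.textFile("s3a://my-bucket/my-data/") // hypothetical path
    println(s"partitions: ${lines.partitions.length}")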

Next, your Shuffle Write Size / Records is too high. This means that a lot of data is being exchanged during the shuffle, which is an expensive operation. You need to look at your code to see whether you can optimize it or express your operations differently so that it doesn't shuffle as much.
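As an illustration only (not taken from the asker's job), one common change of this kind is to aggregate on the map side with reduceByKey instead of shuffling every raw value with groupByKey; the data below is made up, and sc is reused from the sketch above:

    // Hypothetical (key, value) pairs
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    // groupByKey ships every individual value across the network before summing
    val totalsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values per key on each executor first,
    // so far fewer records end up in "Shuffle Write Size / Records"
    val totalsViaReduce = pairs.reduceByKey(_ + _)

    totalsViaReduce.collect().foreach(println)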

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Arun