'Spark partition size greater than the executor memory

I have four questions. Suppose in spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 gb memory. (Total 6 executors, 27 cores and 15gb memory). What will happen if:

  • I have 30 data partitions. Each partition is of size 6 gb. Optimally, the number of partitions must be equal to number of cores, since each core executes one partition/task (One task per partition). Now in this case, how will each executor-core will process the partition since partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(), it's simply that i'm applying some narrow transformations like map() and filter() on my rdd.

  • Will spark automatically try to store the partitions on disk? (I'm not calling cache() or persist() but merely just transformations are happening after an action is called)

  • Since I have partitions (30) greater than the number of available cores (27) so at max, my cluster can process 27 partitions, what will happen to the remaining 3 partitions? Will they wait for the occupied cores to get freed?

  • If i'm calling persist() whose storage level is set to MEMORY_AND_DISK, then if partition size is greater than memory, it will spill data to the disk? On which disk this data will be stored? The worker node's external HDD?



Solution 1:[1]

I answer as I know things on each part, possibly disregarding a few of your assertions:

I have four questions. Suppose in spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 gb memory. (Total 6 executors, 27 cores and 15gb memory). What will happen if: >>> I would use 1 Executor, 1 Core. That is the generally accepted paradigm afaik.

  • I have 30 data partitions. Each partition is of size 6 gb. Optimally, the number of partitions must be equal to number of cores, since each core executes one partition/task (One task per partition). Now in this case, how will each executor-core will process the partition since partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(), it's simply that I'm applying some narrow transformations like map() and filter() on my rdd. >>> The number of partitions being the same of number of cores is not true. You can service 1000 partitions with 10 cores, processing one at a time. What if you have 100K partition and on-prem? Unlikely you will get 100K Executors. >>> Moving on and leaving Driver-side collect issues to one side: You may not have enough memory for a given operation on an Executor; Spark can spill to files to disk at the expense of speed of processing. However, the partition size should not exceed a maximum size, was beefed up some time ago. Using multi-core Executors failure can occur, i.e. OOM's, also a result of GC-issues, a difficult topic.

  • Will spark automatically try to store the partitions on disk? (I'm not calling cache() or persist() but merely just transformations are happening after an action is called) >>> Not if it can avoid it, but when memory is tight, eviction / spilling to disk can and will occur, and in some cases re-computation from source or last checkpoint will occur.

  • Since I have partitions (30) greater than the number of available cores (27) so at max, my cluster can process 27 partitions, what will happen to the remaining 3 partitions? Will they wait for the occupied cores to get freed? >>> They will be serviced by a free Executor at a point in time.

  • If I'm calling persist() whose storage level is set to MEMORY_AND_DISK, then if partition size is greater than memory, it will spill data to the disk? On which disk this data will be stored? The worker node's external HDD? >>> Yes, and it will be spilled to the local file system. I think you can configure for HDFS via a setting, but local disks are faster.

This an insightful blog: https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d

Solution 2:[2]

  1. Your data partition size looks bigger than your Core memory. Your Core memory is ~1.6 GB (5GB/3 Core). This will be a problem as your partition will not be able to process in the Core. To resolve this, you can try:

    • increasing the number of partitions such that each partition is < Core memory ~1.6 GB. So increase them to something like 150 partitions.
    • If you keep the partitions the same, you should try increasing your Executor memory and maybe also reducing number of Cores in your Executors.
  2. If everything goes well it will not need to store partitions on disk. However, if it is not able to find enough memory, it will find disk as a backup. If you want to store your data on Disk and persist it for some reason, you need to call persist(DISK_ONLY).

  3. They will wait until one of the Cores is available.

  4. Yes, it will spill on Disk. Where will depend on your cluster configuration I believe.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 convex1