'Spark partition size greater than the executor memory
I have four questions. Suppose in spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 gb memory. (Total 6 executors, 27 cores and 15gb memory). What will happen if:
- I have 30 data partitions. Each partition is of size 6 gb. Optimally, the number of partitions must be equal to number of cores, since each core executes one partition/task (One task per partition). Now in this case, how will each executor-core will process the partition since partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(), it's simply that i'm applying some narrow transformations like map() and filter() on my rdd. 
- Will spark automatically try to store the partitions on disk? (I'm not calling cache() or persist() but merely just transformations are happening after an action is called) 
- Since I have partitions (30) greater than the number of available cores (27) so at max, my cluster can process 27 partitions, what will happen to the remaining 3 partitions? Will they wait for the occupied cores to get freed? 
- If i'm calling persist() whose storage level is set to MEMORY_AND_DISK, then if partition size is greater than memory, it will spill data to the disk? On which disk this data will be stored? The worker node's external HDD? 
Solution 1:[1]
I answer as I know things on each part, possibly disregarding a few of your assertions:
I have four questions. Suppose in spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 gb memory. (Total 6 executors, 27 cores and 15gb memory). What will happen if: >>> I would use 1 Executor, 1 Core. That is the generally accepted paradigm afaik.
- I have 30 data partitions. Each partition is of size 6 gb. Optimally, the number of partitions must be equal to number of cores, since each core executes one partition/task (One task per partition). Now in this case, how will each executor-core will process the partition since partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(), it's simply that I'm applying some narrow transformations like map() and filter() on my rdd. >>> The number of partitions being the same of number of cores is not true. You can service 1000 partitions with 10 cores, processing one at a time. What if you have 100K partition and on-prem? Unlikely you will get 100K Executors. >>> Moving on and leaving Driver-side collect issues to one side: You may not have enough memory for a given operation on an Executor; Spark can spill to files to disk at the expense of speed of processing. However, the partition size should not exceed a maximum size, was beefed up some time ago. Using multi-core Executors failure can occur, i.e. OOM's, also a result of GC-issues, a difficult topic. 
- Will spark automatically try to store the partitions on disk? (I'm not calling cache() or persist() but merely just transformations are happening after an action is called) >>> Not if it can avoid it, but when memory is tight, eviction / spilling to disk can and will occur, and in some cases re-computation from source or last checkpoint will occur. 
- Since I have partitions (30) greater than the number of available cores (27) so at max, my cluster can process 27 partitions, what will happen to the remaining 3 partitions? Will they wait for the occupied cores to get freed? >>> They will be serviced by a free Executor at a point in time. 
- If I'm calling persist() whose storage level is set to MEMORY_AND_DISK, then if partition size is greater than memory, it will spill data to the disk? On which disk this data will be stored? The worker node's external HDD? >>> Yes, and it will be spilled to the local file system. I think you can configure for HDFS via a setting, but local disks are faster. 
This an insightful blog: https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d
Solution 2:[2]
- Your data partition size looks bigger than your Core memory. Your Core memory is ~1.6 GB (5GB/3 Core). This will be a problem as your partition will not be able to process in the Core. To resolve this, you can try: - increasing the number of partitions such that each partition is < Core memory ~1.6 GB. So increase them to something like 150 partitions.
- If you keep the partitions the same, you should try increasing your Executor memory and maybe also reducing number of Cores in your Executors.
 
- If everything goes well it will not need to store partitions on disk. However, if it is not able to find enough memory, it will find disk as a backup. If you want to store your data on Disk and persist it for some reason, you need to call persist(DISK_ONLY). 
- They will wait until one of the Cores is available. 
- Yes, it will spill on Disk. Where will depend on your cluster configuration I believe. 
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source | 
|---|---|
| Solution 1 | |
| Solution 2 | convex1 | 
