Job aborted due to stage failure: ShuffleMapStage 20 (repartition at data_prep.scala:87) has failed the maximum allowable number of times: 4

I am submitting a Spark job with the following specification (the same program has been used to run data sizes ranging from 50GB to 400GB):

/usr/hdp/2.6.0.3-8/spark2/bin/spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --driver-memory 5G \
 --executor-memory 10G \
 --num-executors 60 \
 --conf spark.yarn.executor.memoryOverhead=4096 \
 --conf spark.shuffle.registration.timeout=1500 \
 --executor-cores 3 \
 --class classname /home//target/scala-2.11/test_2.11-0.13.5.jar

I have tried repartitioning the data while reading, and I have also applied a repartition before doing any count-by-key operation on the RDD:

// Swap the inner pair's fields, deduplicate, and spread the result over 300 partitions
val rdd1 = rdd.map(x => (x._2._2, x._2._1)).distinct.repartition(300)
// Count the distinct second elements of rdd1; distinct and count trigger further shuffles
val receiver_count = rdd1.map(x => x._2).distinct.count

User class threw exception:

org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 20 (repartition at data_prep.scala:87) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 9



Solution 1:[1]

In my case I gave my executors a little more memory and the job went through fine. You should definitely look at which stage your job is failing at and determine accordingly whether increasing or decreasing the executors' memory would help.
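As a minimal sketch of that change applied to the submission in the question, the executor memory and the YARN memory overhead could be raised; the specific values below are illustrative assumptions, not figures taken from the answer:

/usr/hdp/2.6.0.3-8/spark2/bin/spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --driver-memory 5G \
 --executor-memory 14G \
 --num-executors 60 \
 --conf spark.yarn.executor.memoryOverhead=6144 \
 --conf spark.shuffle.registration.timeout=1500 \
 --executor-cores 3 \
 --class classname /home//target/scala-2.11/test_2.11-0.13.5.jar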

Solution 2:[2]

I had a similar issue when running with 48G of driver memory, 48G of executor memory, and 32 executors (50% of the cluster's resources). It was solved by decreasing the configuration to 32G, 32G, and 32 executors (36% of the cluster's resources).
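Expressed in the same spark-submit form as the question, those reduced numbers would replace the corresponding flags in the submission above roughly as follows (mapping the answer's "memory worker" to --executor-memory is an assumption):

 --driver-memory 32G \
 --executor-memory 32G \
 --num-executors 32 \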

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: rishab137
Solution 2: zonna