OutOfMemoryError: Java heap space in Spark

I'm facing a memory issue that I'm unable to solve; any help is highly appreciated. I am new to Spark and PySpark, and I am trying to read a large JSON file (around 5 GB) and build an RDD from it using

df = spark.read.json("example.json")

Every time I run the above statement, I get the following error:

java.lang.OutOfMemoryError : Java heap space

I need to get the JSON data in the form of an RDD and then use Spark SQL to manipulate and analyse it, but I get the error at the very first step (reading the JSON). I am aware that reading such large files requires changes to the Spark session's configuration. I followed the answers given at Apache Spark: Job aborted due to stage failure: "TID x failed for unknown reasons" and Spark java.lang.OutOfMemoryError: Java heap space.
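For reference, the kind of pipeline I am trying to build looks roughly like this (the view name, column name and query below are only placeholders for my real data):

# Read the JSON into a DataFrame, register it for Spark SQL, then drop down to an RDD
df = spark.read.json("example.json")
df.createOrReplaceTempView("events")   # "events" is a placeholder view name
result = spark.sql("SELECT status, COUNT(*) AS cnt FROM events GROUP BY status")
rdd = result.rdd                       # the underlying RDD, for RDD-level operations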

I tried to change my SparkSession's configuration, but I think I may have misunderstood some of the settings. The following is my Spark configuration.

spark = SparkSession \
  .builder \
  .appName("Python Spark SQL basic example") \
  .config("spark.memory.fraction", 0.8) \
  .config("spark.executor.memory", "14g") \
  .config("spark.driver.memory", "12g") \
  .config("spark.sql.shuffle.partitions", "8000") \
  .getOrCreate()

Is there any mistake in the values I have set for the different parameters, such as driver memory and executor memory? And do I need to set any config parameters other than these?
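One way to check whether these values are actually being applied is to print the configuration of the running session. As far as I understand, spark.driver.memory may not take effect if it is set after the driver JVM has already started (e.g. it may need to be passed on the spark-submit command line or in spark-defaults.conf instead), so verifying the effective values seems worthwhile:

# Print the values the running session actually uses; if spark.driver.memory
# still shows a small or default value, it probably needs to be set before the
# driver JVM starts (spark-defaults.conf or the spark-submit command line)
conf = spark.sparkContext.getConf()
for key in ["spark.driver.memory", "spark.executor.memory",
            "spark.memory.fraction", "spark.sql.shuffle.partitions"]:
    print(key, "=", conf.get(key, "not set"))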



Solution 1:[1]

Try to use:

df = spark.read.json("example.json").repartition(100)

The error comes from shuffling data across too many small partitions; the memory overhead of holding all of those partitions in heap memory adds up and exhausts the heap.

My suggestion is to reduce the spark.sql.shuffle.partitions value and instead use repartition (or the parallelism settings) to control the number of partitions of your input/intermediate DataFrames:

spark = SparkSession \
  .builder \
  .appName("Python Spark SQL basic example") \
  .config("spark.memory.fraction", 0.8) \
  .config("spark.executor.memory", "14g") \
  .config("spark.driver.memory", "12g")\
  .config("spark.sql.shuffle.partitions" , "800") \
  .getOrCreate()
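To confirm that the repartitioning actually takes effect, you can also check the partition count of the resulting DataFrame (a quick sanity check using the standard API):

# Sanity check: number of partitions after reading and repartitioning
df = spark.read.json("example.json").repartition(100)
print(df.rdd.getNumPartitions())  # should print 100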

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Shaido