How to handle a memory issue while writing data in which a particular column contains very large data in each record, in Databricks with PySpark

I have a set of records with 10 columns. One column, 'x', contains an array of float values, and the array can be very long (for example, 25,000,000, 50,000,000, or 80,000,000 elements). I am trying to read the data and write it as Delta, partitioned on the id column, in Azure Databricks using PySpark, but it is failing with an out-of-memory error. Can anyone suggest an optimization method for handling huge data inside a single cell?
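
The write looks roughly like this (df, the source path, and the output path below are placeholders, not my actual code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder read; the real source format and path differ
df = spark.read.parquet("/mnt/source/records")

# Write as Delta partitioned by 'id'; this is the step that runs out of memory
df.write.format("delta").mode("overwrite").partitionBy("id").save("/mnt/delta/records")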



Solution 1:[1]

You can set Spark properties using SparkConf().setAll() before instantiating the SparkContext.

First, open the pyspark shell and check the current settings:

sc.getConf().getAll()

You first have to create the configuration object and then create the SparkContext from it.

import pyspark

# Recreate the SparkContext with higher executor and driver memory
config = pyspark.SparkConf().setAll([('spark.executor.memory', '8g'), ('spark.executor.cores', '3'), ('spark.cores.max', '3'), ('spark.driver.memory', '8g')])
sc.stop()
sc = pyspark.SparkContext(conf=config)

You can try higher spark.executor.memory values and check which one suits your requirements.
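
After recreating the context, you can read a property back to confirm it was applied, for example:

sc.getConf().get('spark.executor.memory')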

You can also try this example:

from pyspark import SparkContext

# spark.executor.memory must be set before the SparkContext is created
SparkContext.setSystemProperty('spark.executor.memory', '2g')
sc = SparkContext("local", "App Name")
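
On Databricks specifically, the cluster creates the SparkContext for you, so stopping and recreating it inside a notebook is generally not supported; the same properties can instead be supplied in the cluster's Spark config (Advanced options), for example:

spark.executor.memory 8g
spark.driver.memory 8g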

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 UtkarshPal-MT