Spark OutOfMemory error is remedied by repartition
I have a highly compressed, non-splittable gzip archive, ~100 MB in size with ~10 million records. I'm trying to read it into a Spark DataFrame and then write it out as Parquet. I have one driver and one executor (16 GB RAM, 8 vCPU; in fact, it's a Glue job with 2 G.1X nodes).
Reading the gzipped CSV and writing Parquet directly leads to an OOM:
df = spark.read.option("sep", "|") \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("mode", "FAILFAST") \
    .csv("path.gz")

df.write \
    .format("parquet") \
    .mode("overwrite") \
    .save("path")
And I can understand this: the DataFrame is loaded into a single executor's memory, it doesn't fit, and the OOM appears. But if I call .repartition(8) (same hardware setup) before the write, then everything is OK and no OOM occurs. I don't understand why this happens; don't we still have to load the entire DataFrame into executor memory?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow