Spark: Compacting files from multiple partitions into a single partition with larger files

I would like to take small Parquet files that are spread across multiple partition layers on S3 and compact them into larger files, written back out to S3 with a single partition layer.

So in this example, I have 3 partition layers (part1, part2, part3). I would like to take this data and write it back out partitioned only by part2.

For my first run through I used:

df = (
    spark.read
        .option("basePath", "s3://some_bucket/base/location/in/s3/")
        .parquet("s3://some_bucket/base/location/in/s3/part1=*/part2=*/part3=*/")
)

df.write.partitionBy("part2").parquet("s3://some_bucket/different/location/")

This worked for the most part, but it still seems to create small files, since I'm not running a coalesce or repartition. This brings me to my question: is there a way I can easily compact these files into larger files based on size/row counts?

Thanks in advance!



Solution 1:[1]

Is there a way I can easily compact these files into larger files based on size/row counts?

Not really. Spark doesn't provide any utility that can be used to limit the size of the output files, as each file generally corresponds to a single partition.

So repartitioning by the same column used for partitionBy is your best bet, as sketched below.
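For example, a minimal sketch of that approach, reusing the paths and partition column from the question (adjust them to your layout):

# Repartition by the same column used in partitionBy, so all rows for a given
# part2 value land in the same shuffle partition and are written as one larger
# file per value instead of many small ones.
df.repartition("part2") \
    .write \
    .partitionBy("part2") \
    .parquet("s3://some_bucket/different/location/")

If a single part2 value holds too much data for one comfortable file, you can spread it over a fixed number of output files instead, e.g. df.repartition(10, "part2").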

Solution 2:[2]

option("maxRecordsPerFile", 400000)

Use this option while writing the file.
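For instance, a sketch combining this option with the write from the question (the 400000 record cap is the value suggested in this answer; tune it to your row size):

# Cap each output file at roughly 400,000 rows; Spark rolls over to a new
# file within the same write task once the limit is reached.
df.write \
    .option("maxRecordsPerFile", 400000) \
    .partitionBy("part2") \
    .parquet("s3://some_bucket/different/location/")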

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: yedukondalu annangi