Spark: compressing files from multiple partitions into a single partition with larger files
I would like to take small parquet files that are spread across multiple partition layers on S3 and compress them into larger files, written back out to S3 with a single partition layer.
So in this example, I have 3 partition layers (part1, part2, part3). I would like to take this data and write it back out partitioned only by part2.
For my first run through I used:
df = (spark.read
    .option("basePath", "s3://some_bucket/base/location/in/s3/")
    .parquet("s3://some_bucket/base/location/in/s3/part1=*/part2=*/part3=*/"))
df.write.partitionBy("part2").parquet("s3://some_bucket/different/location/")
This worked for the most part, but it still seems to create small files, since I'm not running a coalesce or repartition. That brings me to my question: is there a way I can easily compress these files into larger files based on size/row counts?
Thanks in advance!
Solution 1:[1]
Is there a way I can easily compress these files into larger files based on size/row counts?
Not really. Spark doesn't provide any utility that can be used to limit the size of the output files, as each file generally corresponds to a single partition.
So repartitioning by the same column you use for partitionBy is your best bet.
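A minimal sketch of that suggestion, reusing the question's placeholder S3 paths: repartitioning on part2 before the partitioned write shuffles all rows for a given part2 value into one Spark partition, so each part2=... output directory ends up with a single, larger file instead of many small ones.

df = (spark.read
    .option("basePath", "s3://some_bucket/base/location/in/s3/")
    .parquet("s3://some_bucket/base/location/in/s3/part1=*/part2=*/part3=*/"))

# Shuffle so that all rows sharing a part2 value land in the same partition,
# then write partitioned by part2: each output directory gets one larger file.
(df.repartition("part2")
   .write
   .partitionBy("part2")
   .parquet("s3://some_bucket/different/location/"))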
Solution 2:[2]
option("maxRecordsPerFile", 400000)
use this option while writing the file.
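A minimal sketch of how this option could fit into the question's write, again assuming the same placeholder paths: maxRecordsPerFile caps the number of rows Spark writes into any single output file, so file size is controlled indirectly through row count. It is combined here with the repartition from Solution 1.

# Cap each output file at 400,000 rows; Spark rolls over to a new file
# once a task has written that many records.
(df.repartition("part2")
   .write
   .option("maxRecordsPerFile", 400000)
   .partitionBy("part2")
   .parquet("s3://some_bucket/different/location/"))

Note that this option only splits files that would otherwise grow too large; on its own it won't merge many small input files into fewer larger ones, which is why it pairs naturally with the repartition from Solution 1.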
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | yedukondalu annangi