pandas df.to_parquet write to multiple smaller files
Is it possible to use pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M rows x 100 columns), and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write to a single file, but this results in a file that's about 4GB. I'd instead like it split into many ~100MB files.
Solution 1:[1]
I ended up using Dask:
import dask.dataframe as dd
# Split the DataFrame into partitions of 5,000,000 rows each
ddf = dd.from_pandas(df, chunksize=5000000)
save_dir = '/path/to/save/'
# Writes one parquet file per partition into save_dir
ddf.to_parquet(save_dir)
This saves to multiple parquet files inside save_dir, where the number of rows of each sub-DataFrame is the chunksize. Depending on your dtypes and number of columns, you can adjust chunksize to get files of the desired size.
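If you want to target a specific output size rather than tune chunksize by hand, you can estimate it from the DataFrame's in-memory footprint. A rough sketch, assuming a ~100MB target and a guessed 4x parquet/snappy compression ratio (both numbers are assumptions, not from the original answer; measure on your own data):
import dask.dataframe as dd

target_file_size = 100 * 1024**2  # assumed target: ~100MB per output file
bytes_per_row = df.memory_usage(deep=True).sum() / len(df)
compression_factor = 4  # rough guess at how much parquet + snappy shrinks the data
rows_per_file = int(target_file_size * compression_factor / bytes_per_row)

ddf = dd.from_pandas(df, chunksize=rows_per_file)
ddf.to_parquet('/path/to/save/')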
Solution 2:[2]
One other option is to use the partition_cols argument of pyarrow.parquet.write_to_dataset():
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

# df is your dataframe
n_partition = 100
# Assign each row a random partition index
df["partition_idx"] = np.random.choice(range(n_partition), size=df.shape[0])
table = pa.Table.from_pandas(df, preserve_index=False)
# Writes one subdirectory (partition_idx=N) per partition value
pq.write_to_dataset(table, root_path="{path to dir}/", partition_cols=["partition_idx"])
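Reading the partitioned dataset back is straightforward, since pandas can treat the whole directory as a single dataset; a minimal sketch (the directory placeholder is carried over from the code above):
import pandas as pd

# pyarrow reconstructs partition_idx from the subdirectory names
df_back = pd.read_parquet("{path to dir}/", engine="pyarrow")
df_back = df_back.drop(columns=["partition_idx"])  # optionally drop the helper column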
Solution 3:[3]
Slice the DataFrame and save each chunk to a folder, using just the pandas API (without dask and without calling pyarrow directly). You can pass extra params to the parquet engine if you wish.
import os

def df_to_parquet(df, target_dir, chunk_size=1000000, **parquet_kwargs):
    """Writes pandas DataFrame to parquet format with pyarrow.

    Args:
        df: DataFrame
        target_dir: local directory where parquet files are written to
        chunk_size: number of rows stored in one chunk of parquet file. Defaults to 1000000.
        parquet_kwargs: extra keyword arguments passed through to DataFrame.to_parquet.
    """
    for i in range(0, len(df), chunk_size):
        slc = df.iloc[i : i + chunk_size]
        chunk = i // chunk_size
        fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
        slc.to_parquet(fname, engine="pyarrow", **parquet_kwargs)
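A hypothetical call (the chunk size, directory, and compression choice here are just illustrations; compression is forwarded to DataFrame.to_parquet via parquet_kwargs):
df_to_parquet(df, target_dir="/path/to/save/", chunk_size=500000, compression="snappy")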
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0. Source: Stack Overflow.
| Solution | Source |
|---|---|
| Solution 1 | Austin |
| Solution 2 | Random Certainty |
| Solution 3 | Maciej S. |