Best way to store many pandas dataframes in a single file
I have 20,000 ~1000-row dataframes, each of which has a name, in a 170GB pickle file at the moment. I'd like to write these to a single file so I can load them individually, by name. I won't need to query for subsets of these dataframes, only read them into memory in their entireties.
I've considered:
- shelve (slow, not portable)
- hdf5 (not well suited to many small dataframes?)
- parquet / feather / pyarrow (no way to consolidate these in one file?)
- SQL?
Goals:
- I'm looking for something I can set up in just a few lines of code.
- I'd like a file I can scp around
- Decent read/write speeds
- Parallel read/write (e.g. with multiprocessing)
Tried so far:
- HDF5 via pandas df.to_hdf():
  - it takes up more space on disk than CSVs (to be fair, I'm not using compression yet; see the sketch after this list). (Why do my hdf5 files seem so unnecessarily large?)
  - it's slow to write many keys (Limit on number of HDF5 Datasets)
- Many parquet files:
  - Ran into quota limits for inodes on the cluster.
- HDF5 via h5py: serializing strings is a pain.
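On the compression point above, a minimal sketch of what enabling it in to_hdf might look like (the filename and the complevel/complib choices here are only illustrative examples):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=['col_1', 'col_2'])
# Illustrative only: the compression settings are examples, not a recommendation.
df.to_hdf('compressed.h5', key='name_0', mode='a',
          format='table', complevel=9, complib='blosc')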
Trying next:
SQL via df.to_sql()
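A minimal sketch of what that might look like, assuming SQLite as the backend and one table per dataframe (the database filename and table name are illustrative):

import sqlite3
import numpy as np
import pandas as pd

# One SQLite table per dataframe, keyed by the dataframe's name.
con = sqlite3.connect('frames.db')

df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=['col_1', 'col_2'])
df.to_sql('name_0', con, if_exists='replace', index=False)

# Load a single dataframe back by name.
df_back = pd.read_sql('SELECT * FROM name_0', con)
con.close()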
Solution 1:[1]
I like HDF5. It is simple enough to write a Pandas dataframe to HDF5 and read the data back in. (Note: I prefer using numpy and h5py to work with HDF5; they create a smaller file. That's worth a look if you don't have to use Pandas; a sketch follows the pandas examples below.) Here is a very basic example with pandas. I used variables to size the dataframe and the number of names/keys so you can experiment.
Code to create the dataframes and write to HDF5:
import pandas as pd
import numpy as np

a0, a1 = 100, 2   # rows and columns per dataframe
ndf = 20          # number of dataframes / keys

for df_name in ['name_' + str(i) for i in range(ndf)]:
    arr = np.random.randint(19, 99, size=(a0, a1))
    df = pd.DataFrame(data=arr, columns=["col_1", "col_2"])
    df.to_hdf('SO_72178611.h5', key=df_name, mode='a')
Code to load the HDF5 data back into dataframes (note that you don't have to know the names (keys) a priori; comment out the print statements if you run with large numbers):
with pd.HDFStore('SO_72178611.h5') as store:
    for df_name in store.keys():
        df = store.get(df_name)
        print(df_name)
        print(df)
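As an aside, here is a minimal sketch of the numpy/h5py approach mentioned above, assuming the data are plain numeric arrays (the filename is illustrative):

import h5py
import numpy as np
import pandas as pd

# Write each array as a named dataset in a single HDF5 file.
with h5py.File('SO_72178611_h5py.h5', 'w') as f:
    for i in range(20):
        arr = np.random.randint(19, 99, size=(100, 2))
        f.create_dataset('name_' + str(i), data=arr)

# Read one array back by name and rebuild the dataframe.
with h5py.File('SO_72178611_h5py.h5', 'r') as f:
    arr = f['name_0'][()]
df = pd.DataFrame(arr, columns=['col_1', 'col_2'])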
Solution 2:[2]
You could try reading in all the CSVs as one big dataframe in pandas and then converting that dataframe to Parquet format using pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# df is the single concatenated pandas dataframe described below.
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output_path/example.parquet', compression='snappy')
Here df is the pandas dataframe into which you've concatenated all the CSVs. This may not work if the CSVs have different shapes. You may also want to experiment with creating a Parquet dataset to increase read speeds, and with different compression types (e.g. gzip) to get a smaller file at the cost of read speed. More information can be found here: https://arrow.apache.org/docs/python/parquet.html
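If you need to pull individual dataframes back out by name, one option (a sketch, assuming you added a 'name' column to the concatenated dataframe before writing) is to read with a filter:

import pyarrow.parquet as pq

# Assumes the concatenated dataframe had a 'name' column identifying
# which original dataframe each row belongs to.
table = pq.read_table('output_path/example.parquet',
                      filters=[('name', '==', 'name_0')])
df_name_0 = table.to_pandas()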
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | kcw78 |
| Solution 2 | djo |