Python: Obtain number of rows for ParquetDataset?
How do I obtain the number of rows of a ParquetDataset that is structured as a folder containing multiple Parquet files?
I tried
from pyarrow.parquet import ParquetDataset
a = ParquetDataset(path)
a.metadata
a.schema
a.common_metadata
I want to figure out the total number of rows without reading the dataset, as it can be quite large.
What's the best way to do that?
Solution 1:[1]
You still have to touch each individual file, but luckily Parquet stores the total row count of each file in its footer. Thus you only need to read the metadata of each file to get its row count. The following code computes the number of rows in the ParquetDataset:
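from pyarrow.parquet import ParquetDataset

nrows = 0
dataset = ParquetDataset(..)
for piece in dataset.pieces:
    # get_metadata() reads only the file footer, not the column data
    nrows += piece.get_metadata().num_rows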
Solution 2:[2]
For pyarrow >= 5.0.0:
from pyarrow.parquet import ParquetDataset
dataset = ParquetDataset(path, use_legacy_dataset=False)
nrows = sum(p.count_rows() for p in dataset.fragments)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | quant_dev |
| Solution 2 | fsenart |