AzureML: TabularDataset.to_pandas_dataframe() hangs when a parquet file is empty

I have created a Tabular Dataset using the Azure ML Python API. The data in question is a collection of parquet files (~10K files, each about 330 KB) residing in Azure Data Lake Storage Gen2, spread across multiple partitions. When I try to load the dataset with TabularDataset.to_pandas_dataframe(), the call hangs (runs forever) if the dataset includes empty parquet files. If the tabular dataset excludes those empty parquet files, to_pandas_dataframe() completes within a few minutes.

By an empty parquet file, I mean one that, when read individually with pandas (pd.read_parquet()), results in an empty DataFrame (df.empty == True).
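For context, here is a minimal check (the file path below is hypothetical) showing what such a file looks like when read directly with pandas:

    import pandas as pd

    # Hypothetical path to one of the partition files copied locally.
    path = "part-00042.parquet"

    # An "empty" parquet file still has a schema (column names and types)
    # but contains zero rows, so pandas returns an empty DataFrame.
    df = pd.read_parquet(path)
    print(df.empty)   # True
    print(df.shape)   # (0, <number of columns>)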

I discovered the root cause while working on another issue mentioned [here][1].

My question is: how can I make TabularDataset.to_pandas_dataframe() work even when there are empty parquet files?

Update: The issue has been fixed in the following versions (a quick version check is sketched after the list):

  • azureml-dataprep: 3.0.1
  • azureml-core: 1.40.0
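If you are unsure which versions are installed in your environment, one way to check (a sketch using only the standard library) is:

    from importlib.metadata import version

    # The fix requires at least azureml-dataprep 3.0.1 and azureml-core 1.40.0.
    print(version("azureml-core"))      # expect >= 1.40.0
    print(version("azureml-dataprep"))  # expect >= 3.0.1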


Solution 1:[1]

Thanks for reporting it. This is a bug in the handling of parquet files that have columns but an empty row set. It has already been fixed and will be included in the next release.

I could not reproduce the hang on multiple files, though, so it would be helpful if you could provide more information about that.

Solution 2:[2]

You can use the on_error='null' parameter so that values which fail to convert are replaced with nulls instead of raising an error.

Your statement will look like this:

TabularDataset.to_pandas_dataframe(on_error='null', out_of_range_datetime='null')
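Note that to_pandas_dataframe() is called on a TabularDataset instance, so a fuller sketch looks like this (the workspace configuration and dataset name below are assumptions for illustration):

    from azureml.core import Workspace, Dataset

    # Assumes a config.json for the workspace is available locally.
    ws = Workspace.from_config()

    # Hypothetical dataset name registered in the workspace.
    dataset = Dataset.get_by_name(ws, name="my-parquet-dataset")

    # Replace values that fail to convert with nulls instead of raising.
    df = dataset.to_pandas_dataframe(on_error="null", out_of_range_datetime="null")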

Alternatively, you can check the size of each file before passing it to the to_pandas_dataframe method. If the file size is 0, either write some sample data into it (for example, with Python's built-in open() function) or skip the file, depending on your requirements.
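As a sketch of the "skip the file" route, assuming the parquet files have been copied to a local folder (the folder and glob pattern below are hypothetical), you could drop empty files and concatenate the rest with pandas:

    import glob
    import pandas as pd

    # Hypothetical local copy of the partitioned parquet files.
    files = glob.glob("data/partitions/**/*.parquet", recursive=True)

    frames = []
    for path in files:
        df = pd.read_parquet(path)
        if df.empty:       # file has a schema but zero rows; skip it
            continue
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)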

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Andrei Liakhovich
Solution 2: UtkarshPal-MT