Azure Databricks command stuck when processing large files in pure Python (2.5 GB+ file size)

I am converting txt files into XML format using pure Python. I have a list of files ranging from 1 KB to 2.5 GB in txt format, and the converted output grows to about 5x the input size.

The issue is that when processing the larger 2.5 GB files, the first file converts fine but subsequent processing hangs and gets stuck at "Running command...". Smaller files work with no issue.

  • I've edited the code to make sure it uses generators and does not keep large lists in memory.

  • I'm processing from DBFS, so the connection should not be an issue.

  • Memory checks show the process consistently uses only ~200 MB, and usage does not grow (see the sketch after this list for one way to check this).

  • Large files take about 10 minutes to process.

  • No GC warnings or other errors in the logs.

  • Azure Databricks, pure Python.

  • The cluster is large enough, and the job uses only Python, so cluster size shouldn't be the issue.

  • Restarting the cluster is the only thing that gets things working again.

  • The stuck command also prevents other notebooks on the cluster from running.
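
The question doesn't show how the memory checks were done; as a reference point, here is a minimal sketch of one way to spot-check driver memory from the notebook, assuming psutil is available on the cluster (the stdlib resource module is an alternative if it isn't).

# Sketch only: print the driver process's resident memory, e.g. once per write batch.
import os
import psutil  # assumed to be available on the cluster

def rss_mb() -> float:
    """Resident set size of the current (driver) process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print(f'driver RSS: {rss_mb():.0f} MB')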

Basic code outline, redacted for simplicity:

# list of files to convert that are in Azure Blob Storage
text_files = ['file1.txt','file2.txt','file3.txt']

# loop over files and convert them to xml
for file in text_files:
    
    xml_filename = file.replace('.txt','.xml')
    # copy files from blob storage to dbfs
    dbutils.fs.cp(f'dbfs:/mnt/storage_account/projects/xml_converter/input/{file}',f'dbfs:/tmp/temporary/{file}')
    
    # open files and convert to xml
    with open(f'/dbfs/tmp/temporary/{file}','r') as infile, open(f'/dbfs/tmp/temporary/{xml_filename}','a', encoding="utf-8") as outfile:

        # list of strings to join at write time
        to_write = []

        for line in infile:
            # convert to xml
            # code redacted for simplicity

            to_write.append(new_xml)

            # batch the write operations to avoid huge lists
            if len(to_write) > 10_000:

                outfile.write(''.join(to_write))
                to_write = [] # reset the batch

        # do a final write of anything that is in the list
        outfile.write(''.join(to_write))
    
    # move completed files from dbfs to blob storage
    dbutils.fs.cp(f'dbfs:/tmp/temporary/{xml_filename}', f'dbfs:/mnt/storage_account/projects/xml_converter/output/{xml_filename}')

Azure Cluster Info

[cluster configuration screenshot omitted]

I would expect this code to run with no issues. Memory doesn't seem to be the problem, the data is in DBFS so it's not a blob-storage issue, and the code streams line by line so very little is held in memory. I'm at a loss. Any suggestions would be appreciated. Thanks for looking!



Solution 1:[1]

Have you tried copying the files from Azure Storage to the local Databricks /tmp/ folder instead of going through DBFS? I had a similar issue when unpacking large .zip files, and that fixed the problem. Have a look here: https://docs.databricks.com/data/databricks-file-system.html
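
For illustration, a minimal sketch of that change applied to the question's loop. convert_line_to_xml() is a hypothetical stand-in for the redacted conversion logic, and the file:/ scheme tells dbutils.fs.cp to read and write the driver's local filesystem:

# Sketch only: copy input to the driver's local /tmp, convert there, copy the result back.
import os

for file in text_files:
    xml_filename = file.replace('.txt', '.xml')

    # copy from the blob-storage mount to the driver's local disk
    dbutils.fs.cp(f'dbfs:/mnt/storage_account/projects/xml_converter/input/{file}', f'file:/tmp/{file}')

    # convert using plain local-file I/O, bypassing the /dbfs FUSE mount
    with open(f'/tmp/{file}', 'r') as infile, open(f'/tmp/{xml_filename}', 'w', encoding="utf-8") as outfile:
        for line in infile:
            outfile.write(convert_line_to_xml(line))  # hypothetical helper

    # copy the result back to blob storage via the mount
    dbutils.fs.cp(f'file:/tmp/{xml_filename}', f'dbfs:/mnt/storage_account/projects/xml_converter/output/{xml_filename}')

    # remove the local copies so /tmp on the driver doesn't fill up
    os.remove(f'/tmp/{file}')
    os.remove(f'/tmp/{xml_filename}')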

Side note: since you are using pure Python, the workers are not used for processing the files, so you can switch to a single-node setup.

Solution 2:[2]

This comes down to the environment: if the script is pure Python, it runs only on the driver node of the Databricks cluster, which makes it effectively expensive single-node processing. Pure Python will perform better than PySpark on smaller data sets, but you will see the difference when you are dealing with larger data sets.
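
To make the contrast concrete, here is a rough sketch of what a distributed version could look like in PySpark. convert_line_to_xml() is again a hypothetical stand-in for the redacted logic, and spark is the session Databricks provides in every notebook:

# Sketch only: distribute the line-by-line conversion across the workers.
input_path = 'dbfs:/mnt/storage_account/projects/xml_converter/input/file1.txt'
output_dir = 'dbfs:/mnt/storage_account/projects/xml_converter/output/file1_xml'

lines = spark.sparkContext.textFile(input_path)   # lines are split across the workers
xml_fragments = lines.map(convert_line_to_xml)    # conversion runs in parallel on the workers
xml_fragments.saveAsTextFile(output_dir)          # writes a directory of part-* files

Note that saveAsTextFile writes a directory of part files rather than a single well-formed XML document, so the fragments would still need a root element and a merge step; producing one XML file per input is often simpler in plain Python on the driver.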

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: MaDebu
[2] Solution 2: Karthikeyan Rasipalay Durairaj