How to process a 64 GB CSV file in PySpark efficiently?
I have a very large CSV file, nearly 64 GB in size, in blob storage. I need to do some processing on every row and push the data to a DB. What is the best way to do this efficiently?
Solution 1:[1]
A 64 GB file shouldn't worry you, as PySpark is capable of processing even 1 TB of data.
You can use Azure Databricks to write the code in PySpark and run it on a cluster.
A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.
You can refer to this third-party tutorial: Read a CSV file stored in blob container using python in DataBricks.
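For reference, here is a minimal sketch of the read step, assuming a hypothetical storage account `mystorageacct`, container `mycontainer`, file `bigfile.csv`, and column names; replace these with your own values and credentials (account-key access is shown, but a SAS token or service principal configured via `spark.conf` works the same way):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# Give Spark the storage account key so it can read directly from Blob Storage (wasbs://).
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    "<storage-account-key>"
)

# Supplying an explicit schema avoids the extra full pass over the 64 GB file
# that inferSchema=True would trigger.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    # ... remaining columns
])

# The file is read lazily and split into partitions across the cluster,
# so it never has to fit into a single machine's memory.
df = spark.read.csv(
    "wasbs://mycontainer@mystorageacct.blob.core.windows.net/bigfile.csv",
    schema=schema,
    header=True,
)
```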
You shouldn't face any issues while processing the file, but if required you can look at Boost Query Performance with Databricks and Spark.
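As a rough illustration of the per-row processing and the push to a database, the hedged sketch below continues from the DataFrame above; the column name, JDBC URL, and tuning numbers are placeholder assumptions, not values from the answer:

```python
from pyspark.sql import functions as F

# Prefer built-in column expressions over Python UDFs for row-level work:
# they run inside the JVM and avoid Python serialization overhead.
processed = df.withColumn("name_clean", F.trim(F.upper(F.col("name"))))

# Write to the database over JDBC. Each partition opens its own connection,
# so repartitioning controls write parallelism; batchsize controls how many
# rows go into each INSERT batch.
(processed
    .repartition(64)
    .write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>:1433;databaseName=mydb")
    .option("dbtable", "dbo.processed_rows")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("batchsize", 10000)
    .mode("append")
    .save())
```

Tune the partition count and batch size to what your database can absorb; too many parallel connections can overload the target just as easily as too few leave the cluster idle.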
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | UtkarshPal-MT |