Synapse Spark job fails as input folder does not exist
How do I handle exceptions when reading a file?
For example, I have a daily job that runs at 8:00 AM and reads files from Azure Data Lake Storage (Gen2). The path looks like 2022/01/06/data.csv, but the file is not populated in ADLS every day, so whenever it is missing the job fails. I tried using try-catch to handle the exception. Is there another way to handle this?
df1 = spark.read.format('csv').load(fileLocation)
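For context, the try-catch approach mentioned in the question might look roughly like this in PySpark (a minimal sketch; the date-based construction of fileLocation is an assumption about how the path is built, and AnalysisException is what Spark raises when the load path does not exist):

from datetime import date
from pyspark.sql.utils import AnalysisException

# Build the date-partitioned path, e.g. 2022/01/06/data.csv (assumed layout)
today = date.today()
fileLocation = f"{today.year}/{today.month:02d}/{today.day:02d}/data.csv"

try:
    df1 = spark.read.format('csv').load(fileLocation)
except AnalysisException as e:
    # Spark reports "Path does not exist" when the folder is missing
    print(f"No input for {today}, skipping this run: {e}")
    df1 = None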
Solution 1:[1]
To summarize your problem: the Spark job is failing because the folder you are pointing to does not exist.
On Azure Synapse, mssparkutils is perfect for this. This is how you would do it in Scala (you can do something similar in Python, as sketched below). It works in notebooks as well as Spark/PySpark batch jobs.
def exists(f: String): Boolean = {
  try {
    // ls succeeds only if the path exists
    mssparkutils.fs.ls(f)
    true
  } catch {
    // any failure (missing path, no access) is treated as "does not exist"
    case e: Exception => false
  }
}
exists("valid/folder") // returns true
exists("abfss://[email protected]/valid/folder") // returns true
exists("invalid/folder") //returns false
exists("abfss://[email protected]/invalid/folder") // returns false
// for more info, you can also run:
mssparkutils.fs.help()
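A Python equivalent of the helper, as hinted at above, might look like this (a minimal sketch; on Synapse Spark pools mssparkutils can be imported from notebookutils, and the function simply mirrors the Scala version):

from notebookutils import mssparkutils

def exists(path: str) -> bool:
    # ls raises an exception if the path does not exist
    try:
        mssparkutils.fs.ls(path)
        return True
    except Exception:
        return False

exists("valid/folder")    # returns True
exists("abfss://<container>@<account>.dfs.core.windows.net/valid/folder")    # returns True
exists("invalid/folder")    # returns False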
If the storage account is not your primary linked storage account, you need to provide the full URL (abfss path).
I prefer providing the full URL (abfss path) anyway, because a Synapse workspace can have multiple linked storage accounts, so there is no room for error.
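Putting it together with the original job, the check can be used to skip the read cleanly instead of failing (a sketch assuming the Python exists helper above and the fileLocation path from the question):

if exists(fileLocation):
    df1 = spark.read.format('csv').load(fileLocation)
else:
    # Nothing was dropped for this date; exit gracefully instead of failing the job
    print(f"No input found at {fileLocation}; nothing to process today.")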
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
| --- | --- |
| Solution 1 | |