Error while using readStream from Delta on Azure Data Lake Gen2
I get the below error while reading data from Delta Lake. The detailed log on Azure shows it is failing to read a .tmp file from the _delta_log folder. I have tried adding a trigger with a 2 to 5 second interval, but I still face this issue.
Caused by: Operation failed: "The specified path does not exist.", 404, GET, https://xxxxxx.dfs.core.windows.net/xxxxcontainer?upn=false&resource=filesystem&maxResults=500&directory=LandingHome/_delta_log&timeout=90&recursive=false, PathNotFound, "The specified path does not exist. RequestId:d2b580d6-201f-0087-4216-5df359000000 Time:2022-05-01T04:44:55.4973721Z"
import io.delta.tables._
import org.apache.spark.sql._

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.azure.account.auth.type.xx.dfs.core.windows.net", "OAuth")
hadoopConf.set("fs.azure.account.oauth.provider.type.xx.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hadoopConf.set("fs.azure.account.oauth2.client.id.xx.dfs.core.windows.net", "sdfdfsdfdss")
hadoopConf.set("fs.azure.account.oauth2.client.secret.xx.dfs.core.windows.net", "sdfdsfd")
hadoopConf.set("fs.azure.account.oauth2.client.endpoint.xx.dfs.core.windows.net", "https://login.microsoftonline.com/sdfdfsd/oauth2/token")
hadoopConf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
hadoopConf.set("fs.abfs.impl", "org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem")

val df = SparkBean.getSparkSession().readStream.format("delta").options(sparkProps).load(deltaLakeEventPath).where(condition)
df.writeStream.format("CustomSinkHandler").options(streamProps).start()
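For reference, the trigger mentioned above is applied on the write side; a rough sketch of what was tried, using the same writeStream as above (the 5 second value stands in for the 2 to 5 second interval):

import org.apache.spark.sql.streaming.Trigger

// Processing-time trigger; the exact interval is illustrative.
df.writeStream
  .format("CustomSinkHandler")
  .options(streamProps)
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()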
Solution 1:[1]
Per the Delta documentation, you need to set these properties via spark.conf
instead of in the Hadoop configuration (with spark.conf, it works just fine):
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net","<password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
Don't forget to attach the packages org.apache.hadoop:hadoop-azure-datalake and org.apache.hadoop:hadoop-azure.
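These packages are usually attached at submit time with --packages; a minimal sketch of doing it from application code instead, via spark.jars.packages (the 3.3.1 version is an assumption and must match the Hadoop build your Spark distribution targets):

import org.apache.spark.sql.SparkSession

// Attach the Azure Hadoop connectors when the session is created.
// In many deployments this is passed on spark-submit rather than set here.
val spark = SparkSession.builder()
  .appName("delta-adls-stream")
  .config("spark.jars.packages",
    "org.apache.hadoop:hadoop-azure:3.3.1,org.apache.hadoop:hadoop-azure-datalake:3.3.1")
  .getOrCreate()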
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Alex Ott