Azure ADLS Gen2 file created by Azure Databricks doesn't inherit ACL

I have a Databricks notebook that writes a DataFrame to a file in ADLS Gen2 storage.

It creates a temp folder, outputs the file and then copies that file to a permanent folder. For some reason the file doesn't inherit the ACL correctly. The folder it creates has the correct ACL.

The code for the notebook:

# Get data into a DataFrame
df_export = spark.sql(SQL)

# Write the file to the temp directory; coalesce(1) creates a single output data file
(df_export.coalesce(1).write.format("parquet")
    .mode("overwrite")
    .save(TempFolder))

# Get the parquet file name. It's always the last in the folder, as the other files are created starting with _
file = dbutils.fs.ls(TempFolder)[-1][0]

# Create the permanent copy
dbutils.fs.cp(file, FullPath)

The temp folder that is created shows the following for the relevant account:

[Image: Folder Permissions]

Whereas the file shows the following:

[Image: File Permission]

There is also a mask. I'm not really familiar with masks, so I'm not sure how this differs.

The mask permission on the folder shows:

[Image: Mask Folder Permissions]

On the file it shows as:

[Image: Mask File Permission]

Does anyone have any idea why this wouldn't be inheriting the ACL from the parent folder?



Solution 1:[1]

I've had a response from Microsoft support which has resolved this issue for me.

Cause: Files written by Databricks are owned by the service principal and are created with permission -rw-r--r--. This reduces the effective permission of the other batch users in ADLS from rwx (the directory permission) to r--, which in turn causes jobs to fail.
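To illustrate how the mask reduces an ACL entry from rwx to r--, here is a minimal sketch. It assumes POSIX-style mask semantics (effective permission = ACL entry AND mask), with the group bits of -rw-r--r-- acting as the mask on the file. The function names are hypothetical and not part of any Azure or Databricks API.

```python
# Sketch of POSIX-style ACL mask evaluation (assumed semantics, hypothetical names).

def to_bits(perm: str) -> int:
    """Convert a 3-character string such as 'r-x' to a 3-bit integer."""
    bits = 0
    for ch, value in zip(perm, (4, 2, 1)):
        if ch != "-":
            bits |= value
    return bits


def to_str(bits: int) -> str:
    """Convert a 3-bit integer back to a string such as 'r-x'."""
    return "".join(ch if bits & value else "-" for ch, value in zip("rwx", (4, 2, 1)))


def effective(entry: str, mask: str) -> str:
    """Effective permission of a named user or group entry: entry AND mask."""
    return to_str(to_bits(entry) & to_bits(mask))


# Folder: the ACL entry is rwx and the mask is rwx, so nothing is removed.
folder_effective = effective("rwx", "rwx")

# File: the same rwx entry, but the mask is r-- (the group bits of -rw-r--r--).
file_effective = effective("rwx", "r--")

print(folder_effective, file_effective)
```

Under these assumptions the ACL entry itself is unchanged on the file; only the mask differs, which is why the effective permission drops.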

Resolution: To resolve this, change the default umask (022) to a custom umask (000) on the Databricks side. You can set the following in the Spark configuration settings under your cluster configuration: spark.hadoop.fs.permissions.umask-mode 000
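As a rough illustration of why the umask matters, the sketch below applies both umask values to a base file mode. It assumes the conventional base mode of 666 for new files and the usual rule mode = base AND NOT umask; the helper names are hypothetical.

```python
# Sketch of umask arithmetic (assumed base mode 0o666 for files, hypothetical names).

def apply_umask(base_mode: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return base_mode & ~umask & 0o777


def mode_to_str(mode: int) -> str:
    """Render a 9-bit mode as a string such as 'rw-r--r--'."""
    flags = "rwxrwxrwx"
    return "".join(flags[i] if mode & (1 << (8 - i)) else "-" for i in range(9))


FILE_BASE = 0o666  # assumed default for newly created files

default_mode = apply_umask(FILE_BASE, 0o022)  # default umask
custom_mode = apply_umask(FILE_BASE, 0o000)   # umask from the resolution

print(mode_to_str(default_mode))
print(mode_to_str(custom_mode))
```

With umask 022 the group bits come out as r--, matching the -rw-r--r-- permission described in the cause. With umask 000 no bits are cleared, so the group bits, and with them the mask, keep write access.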

Solution 2:[2]

Wow, that's great! I was looking for a solution. Passthrough authentication might be a proper solution now.

I had the feeling it was part of this ancient Hadoop bug: https://issues.apache.org/jira/browse/HDFS-6962 (solved in Hadoop 3, now part of Spark 3+).

Spark tries to set the ACLs after moving the files, but fails. The files are first created elsewhere, in a tmp dir. The tmp dir's rights are inherited through default ADLS behaviour.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bee_Riii
Solution 2 MeneerBij