Get the list of loaded files from Databricks Autoloader

We can use Autoloader to track whether files have been loaded from an S3 bucket. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have already been loaded?

I can easily do this with AWS Glue job bookmarks, but I'm not aware of a way to do it in Databricks Autoloader.



Solution 1:[1]

    from pyspark.sql.functions import input_file_name

    df = (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")  # cloudFiles.format is required; "json" as in Solution 2
        .load("path")
        .withColumn("filePath", input_file_name()))

Then you can, for example, write filePath to your stream sink and later select the distinct values from there, or use foreach / foreachBatch to insert it into a Spark SQL table.
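For example, here is a minimal sketch of the foreachBatch approach; the table name loaded_files and the checkpoint path placeholder are hypothetical, chosen only for illustration:

    from pyspark.sql.functions import input_file_name

    def record_files(batch_df, batch_id):
        # Append the distinct source file paths seen in this micro-batch
        # to a tracking table ("loaded_files" is a hypothetical name).
        (batch_df.select("filePath").distinct()
            .write.mode("append").saveAsTable("loaded_files"))

    (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("path")
        .withColumn("filePath", input_file_name())
        .writeStream
        .foreachBatch(record_files)
        .option("checkpointLocation", "<checkpoint_path>")
        .start())

Afterwards, SELECT DISTINCT filePath FROM loaded_files returns the list of files that have been loaded.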

Solution 2:[2]

You can get notifications of files loaded into S3 using Structured Streaming. For files that have already been loaded, the destination path (s3_output_path) can be checked.

    df = (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.region", "<aws region>")
        .option("cloudFiles.awsAccessKey", "<ACCESS_KEY>")
        .option("cloudFiles.awsSecretKey", "<SECRET_KEY>")
        .option("cloudFiles.useNotifications", "true")
        .load("<s3_path>"))

    (df.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", "<checkpoint_path>")
        .start("<s3_output_path>"))
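If the stream also captures the source path with input_file_name() as in Solution 1, the loaded files can later be listed straight from the Delta destination. A minimal sketch, assuming a filePath column was written to the sink:

    # Read the Delta output and list the distinct source files it contains
    # (assumes the stream added a "filePath" column via input_file_name()).
    loaded = spark.read.format("delta").load("<s3_output_path>")
    loaded.select("filePath").distinct().show(truncate=False)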

Solution 3:[3]

If you are using the checkpointLocation option, you can read all the files that were processed by reading the RocksDB logs. Below is some example code to achieve that; note that you need to point it at the path inside the checkpoint location from which you want to retrieve the list of loaded files.

    from glob import glob
    import codecs

    # Point this at the checkpoint location; each Auto Loader source keeps
    # its RocksDB state under sources/*/rocksdb/logs.
    directory = "<YOUR_PATH_GOES_HERE>/sources/*/rocksdb/logs"
    for file in glob(f"{directory}/*.log"):
        with codecs.open(file, encoding="utf-8", errors="ignore") as f:
            lines = f.readlines()  # read the log; avoid shadowing the file handle
            print(lines)

P.S.: The logs need to be parsed properly in order to extract only the filenames.
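As a rough sketch of that parsing, assuming the log lines embed full S3 URIs (the exact RocksDB log layout is an assumption here and may differ across runtime versions), a regex pass over the same files could look like this:

    import re
    from glob import glob
    import codecs

    directory = "<YOUR_PATH_GOES_HERE>/sources/*/rocksdb/logs"
    loaded_files = set()
    for file in glob(f"{directory}/*.log"):
        with codecs.open(file, encoding="utf-8", errors="ignore") as f:
            for line in f:
                # Assumption: processed files appear as s3:// or s3a:// URIs.
                loaded_files.update(re.findall(r"s3a?://\S+", line))

    print(sorted(loaded_files))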

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  Hubert Dudek
Solution 2
Solution 3  Herivelton Andreassa