VACUUM/OPTIMIZE Effect on Autoloader Checkpoints

I'm using Databricks Autoloader to incrementally stream from a Delta Lake table into a SQL database. If an OPTIMIZE or VACUUM statement is run against the Delta table, files are added or removed.

My question is, will the Autoloader checkpoint ignore these optimized files on the next stream run? Or will my entire Delta table be streamed into SQL because Autoloader doesn't recognize that it has already processed the data?



Solution 1:[1]

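As long as you specify the format of the readStream correctly, the checkpoint will disregard the compacted files created by the OPTIMIZE command. OPTIMIZE only rewrites existing data into larger files and records the commit as not changing data, so the Delta streaming source skips it and the table is not streamed again. In this case, the stream should be started from the Spark session as follows: spark.readStream.format('delta')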
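The approach in this answer can be sketched end to end as below. This is only a minimal sketch, not the asker's actual pipeline: the paths, JDBC URL, and table name are hypothetical placeholders, the helper names `write_batch_to_sql` and `start_delta_to_sql_stream` are made up for illustration, and JDBC credentials and driver options are left out. It assumes an existing SparkSession with Delta Lake available, as on Databricks.

```python
# Hypothetical placeholders: replace with your own locations and connection details.
DELTA_TABLE_PATH = "/mnt/delta/events"
CHECKPOINT_PATH = "/mnt/checkpoints/events_to_sql"
JDBC_URL = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
TARGET_TABLE = "dbo.events"


def write_batch_to_sql(batch_df, batch_id):
    """Append one micro-batch to the SQL table over JDBC.

    Credentials and driver options are omitted here and would need to be added.
    """
    (batch_df.write
        .format("jdbc")
        .option("url", JDBC_URL)
        .option("dbtable", TARGET_TABLE)
        .mode("append")
        .save())


def start_delta_to_sql_stream(spark):
    """Read the Delta table as a stream and push each micro-batch to SQL.

    Progress is tracked in CHECKPOINT_PATH, so a restart resumes from the
    last processed table version instead of re-reading the whole table.
    """
    # Reading with format("delta") uses the Delta transaction log as the source.
    source = spark.readStream.format("delta").load(DELTA_TABLE_PATH)
    return (source.writeStream
        .foreachBatch(write_batch_to_sql)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .start())
```

On Databricks this would be started with `query = start_delta_to_sql_stream(spark)`. Keeping the same checkpoint location across runs is what lets the stream pick up where it left off.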

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
