How do I list all paths in Azure Data Lake Gen 2 filtered by last modified date in Azure Data Factory?

We have an Azure Data Lake Gen 2 account that contains hundreds of thousands of JSON messages arriving on a continuous basis. The files are stored in a folder structure, but not one based on load time. We now have a requirement to use Azure Data Factory to retrieve all new JSON files since the last pipeline run.

Since the Get Metadata activity doesn't allow recursive retrieval of files and folders, I've been looking at other options. I know it's possible to use Azure Functions, but ideally we'd like a low/no-code solution. I'm able to list all paths in a given container using the Azure Storage Services API, via either the Path operation or the List Blobs operation. Unfortunately, neither seems to offer a filter on the last-modified date. Because thousands of new messages arrive every day, we need to limit the API response to only the files that have come in since the previous pipeline run. Any suggestions on how to achieve this without an Azure Function would be greatly appreciated.
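As context for the problem: the List Blobs operation returns an XML listing that includes each blob's `Last-Modified` property, but offers no server-side filter on it, so the filtering has to happen client-side. Below is a minimal sketch of that client-side step in Python (standard library only). The sample XML is illustrative, not a real response; in practice the XML would come from `GET https://<account>.blob.core.windows.net/<container>?restype=container&comp=list`.

```python
# Sketch: client-side filtering of a List Blobs response by Last-Modified.
# The List Blobs REST API has no last-modified filter, so one workaround is
# to page through the listing and keep only blobs newer than the last run.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime  # parses RFC 1123 dates like the Last-Modified header
import xml.etree.ElementTree as ET

def blobs_modified_since(listing_xml: str, since: datetime) -> list:
    """Return names of blobs whose Last-Modified is after `since`."""
    root = ET.fromstring(listing_xml)
    new_blobs = []
    for blob in root.iter("Blob"):
        name = blob.findtext("Name")
        last_modified = parsedate_to_datetime(
            blob.find("Properties").findtext("Last-Modified"))
        if last_modified > since:
            new_blobs.append(name)
    return new_blobs

# Illustrative sample of the List Blobs XML shape (trimmed to the relevant elements).
sample = """<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults>
  <Blobs>
    <Blob><Name>in/2022/a.json</Name>
      <Properties><Last-Modified>Mon, 02 May 2022 10:00:00 GMT</Last-Modified></Properties>
    </Blob>
    <Blob><Name>in/2022/b.json</Name>
      <Properties><Last-Modified>Tue, 03 May 2022 10:00:00 GMT</Last-Modified></Properties>
    </Blob>
  </Blobs>
</EnumerationResults>"""

last_run = datetime(2022, 5, 2, 12, 0, tzinfo=timezone.utc)
print(blobs_modified_since(sample, last_run))  # ['in/2022/b.json']
```

The same comparison can be expressed as an ADF pipeline expression once each file's last-modified value is available, which is what the solution below does.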



Solution 1:[1]

You can achieve recursive retrieval with the Get Metadata activity by pairing it with a ForEach activity:

Use a Get Metadata activity pointing to the folder, with 'Child items' in the field list, to retrieve the names of the files inside the folder.

Use a ForEach activity to iterate over each of the files, and inside it use another Get Metadata activity pointing to a parameterized dataset. In that inner Get Metadata activity, add 'Last modified' to the field list to get the last-modified datetime for each file.
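The control flow above can be sketched in plain Python to make the logic concrete: list the child items, look up each file's last-modified timestamp, and keep only the files newer than the previous run. Here `get_child_items` and `get_last_modified` stand in for the two Get Metadata activity calls; they are hypothetical helpers for illustration, not an Azure SDK API.

```python
# Sketch of the Solution 1 control flow: an outer listing (childItems)
# plus a per-file metadata lookup (lastModified), filtered against the
# previous pipeline run's timestamp.
from datetime import datetime, timezone

def filter_new_files(get_child_items, get_last_modified, last_run: datetime) -> list:
    new_files = []
    for name in get_child_items():              # outer Get Metadata: 'Child items'
        if get_last_modified(name) > last_run:  # inner Get Metadata: 'Last modified'
            new_files.append(name)
    return new_files

# Usage with canned data standing in for the lake:
files = {
    "a.json": datetime(2022, 5, 2, 10, tzinfo=timezone.utc),
    "b.json": datetime(2022, 5, 3, 10, tzinfo=timezone.utc),
}
last_run = datetime(2022, 5, 2, 12, 0, tzinfo=timezone.utc)
print(filter_new_files(files.keys, files.__getitem__, last_run))  # ['b.json']
```

Note the trade-off: this pattern issues one Get Metadata call per file, so with thousands of new files per day the comparison (for example in an If Condition or Filter activity) should prune as early as possible.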

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: AnnuKumari-MSFT