How to check for new files in a folder in Python

I am trying to create a script that will be executed every 10 minutes. On each run it has to check whether there are new files in a specific folder on my computer and, if so, run some functions on each new file to extract some values. These values will be written to an Excel file. The problem is that every time the script runs, the variables that contain the paths to all the files are generated again, so the program goes over all the files instead of only the new ones. How can I handle this problem? Thanks



Solution 1:[1]

Start by initializing variables:

import os

savedSet=set()
mypath=… #YOUR PATH HERE

At the end of each cycle, save a set of tuples, each holding a file name and its creation time (and optionally its size), in savedSet. When retrieving files, do the following:

-Retrieve a set of file paths

nameSet=set()
for file in os.listdir(mypath):
    fullpath=os.path.join(mypath, file)
    if os.path.isfile(fullpath):
        nameSet.add(file)

-Create tuples

retrievedSet=set()
for name in nameSet:
    stat=os.stat(os.path.join(mypath, name))
    time=stat.st_ctime
    #size=stat.st_size If you add this to the tuple, you will be able to detect file size changes as well.
    #Also consider using st_mtime to detect the last modification time
    retrievedSet.add((name,time))

-Compare set with saved set to find new files

newSet=retrievedSet-savedSet

-Compare set with saved set to find removed files

deletedSet=savedSet-retrievedSet

-Run your functions on files with names from newSet

-Update saved set

savedSet=retrievedSet
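
The steps above can be combined into two small functions. This is only a minimal sketch, not the answer's exact code: the names take_snapshot and diff_snapshots are hypothetical, and it assumes that a (name, st_ctime) tuple is enough to identify a file.

```python
import os

def take_snapshot(folder):
    """Return a set of (name, ctime) tuples for the regular files in folder."""
    snapshot = set()
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():  # skip sub-directories
                snapshot.add((entry.name, entry.stat().st_ctime))
    return snapshot

def diff_snapshots(saved, retrieved):
    """Return (new, deleted) as sets of tuples, using set difference."""
    new = retrieved - saved
    deleted = saved - retrieved
    return new, deleted
```

Because the script is started fresh every 10 minutes, the saved snapshot would have to be stored between runs (for example in a file) for the comparison to find anything other than "everything is new".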

Solution 2:[2]

import time

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class MyHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        # Log every event (created, modified, deleted, moved)
        print(event.event_type, event.src_path)

    def on_created(self, event):
        print("on_created", event.src_path)
        print(event.src_path.strip())
        # The backslash is escaped: "\t" in a plain string would be a tab
        if event.src_path.strip() == ".\\test.xml":
            print("Execute your logic here!")

event_handler = MyHandler()
observer = Observer()
observer.schedule(event_handler, path='.', recursive=False)
observer.start()

try:
    while True:
        time.sleep(1)  # sleep instead of busy-waiting
except KeyboardInterrupt:
    observer.stop()
observer.join()

  1. Install the library: pip install watchdog
  2. Create a scheduled task for this script in the Task Scheduler and monitor the folder where the file will be created.
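
The file-name check in on_created depends on exactly how the path string is formatted. One possible alternative is to compare only the file name. The helper below is a sketch under that assumption; is_target_file is a hypothetical name, and it uses nothing but os.path from the standard library.

```python
import os

def is_target_file(src_path, expected_name):
    """Return True if src_path points at a file called expected_name."""
    # normpath collapses a leading "./"; basename drops the directory part
    name = os.path.basename(os.path.normpath(src_path.strip()))
    # normcase lowercases on Windows and leaves the name unchanged on POSIX
    return os.path.normcase(name) == os.path.normcase(expected_name)
```

Inside the handler this could be called as is_target_file(event.src_path, "test.xml").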

Solution 3:[3]

import operator
import os
import time

path = os.getcwd()  # or assign the return value of your function (the
                    # updated path as per your question) which operates
                    # on the file 'new_file' to this variable

def mostRecentFile(path):
    file_times = dict()
    for file in os.listdir(path):
        full_path = os.path.join(path, file)
        # Age in seconds: the smallest value belongs to the newest file
        file_times[file] = time.time() - os.stat(full_path).st_ctime
    return sorted(file_times.items(), key=operator.itemgetter(1))[0][0]

new_file = mostRecentFile(path)

The code returns only one file, which is the newest in the directory (as per your requirement). The variable new_file holds the file name returned by mostRecentFile, that is, the most recently created file in the directory given by the variable path. You can change how the path is supplied: use the current working directory or pass the desired directory. Given your requirement, I think you want the current directory, and that is what the code uses.

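I have considered creation time by using st_ctime. Note that st_ctime is the creation time on Windows; on Unix it is the time of the last metadata change. You can use the modification time by replacing st_ctime with st_mtime.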

You can pass this newly created file new_file to your function, and assign the new path that is generated by this function to the variable path.
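
Since more than one file may arrive between two runs, it may help to collect every file changed after a given moment rather than only the newest one. The sketch below assumes you record the time of the previous run yourself; files_newer_than and its since parameter are hypothetical names, and it uses st_mtime rather than st_ctime.

```python
import os

def files_newer_than(directory, since):
    """Return names of regular files modified after `since`, oldest first.

    `since` is a Unix timestamp in seconds, as returned by time.time().
    """
    recent = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                mtime = entry.stat().st_mtime
                if mtime > since:
                    recent.append((mtime, entry.name))
    # Sort by modification time so the newest file comes last
    return [name for mtime, name in sorted(recent)]
```

With this approach the only state to keep between runs is a single timestamp.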

Solution 4:[4]

First, run this script once in your directory to create a "files" file:

import os
import pandas as pd
list_of_files=os.listdir()
list_of_files.append('files.csv')

pd.DataFrame({'files':list_of_files}).to_csv('files.csv')

then in your main script add this:

import pandas as pd
import os

files=pd.read_csv('files.csv')
list_of_files=os.listdir()

if len(files.files)!=len(list_of_files):
    #do what you want
    #save your excel with the name sample.xlsx
    #append your excel to the list of files and take the set so that sample.xlsx does not appear twice if the script runs again
    list_of_files.append('sample.xlsx')
    list_of_files=list(set(list_of_files))
    #save the current list of files again
    pd.DataFrame({'files':list_of_files}).to_csv('files.csv')
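
Comparing only the lengths of the two lists shows that something changed, but not which files are new. A possible variation compares the names themselves. The sketch below swaps pandas for the standard-library csv module to stay dependency-free; the function names and the csv_path parameter are assumptions made for illustration.

```python
import csv
import os

def load_known(csv_path):
    """Read previously seen file names; empty set if no CSV exists yet."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as f:
        return {row["files"] for row in csv.DictReader(f)}

def save_known(csv_path, names):
    """Write the seen file names under a single 'files' column."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["files"])
        writer.writerows([name] for name in sorted(names))

def check_for_new(directory, csv_path):
    """Return names that appeared since the last call and update the CSV."""
    current = set(os.listdir(directory))
    # Ignore the state file in case it lives in the watched folder
    current.discard(os.path.basename(csv_path))
    new = current - load_known(csv_path)
    save_known(csv_path, current)
    return sorted(new)
```

On the very first call every existing file is reported as new, because no CSV has been written yet.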

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dorijan Cirkveni
Solution 2 user3349907
Solution 3
Solution 4 Billy Bonaros