Using zipfile to archive directory contents while skipping files from list

I'm using zipfile to create an archive of all files in a directory (recursively, while preserving directory structure including empty folders) and want the process to skip the filenames specified in a list.

This is the basic function, which uses os.walk to traverse a directory and adds all the files and directories it contains to an archive.

import os
import zipfile

def zip_dir(path):
    zipname = str(path.rsplit('/')[-1]) + '.zip'
    with zipfile.ZipFile(zipname, 'w', zipfile.ZIP_DEFLATED) as zf:
        if os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                for file_or_dir in files + dirs:
                    # Store each entry relative to the parent of `path`
                    zf.write(os.path.join(root, file_or_dir),
                             os.path.relpath(os.path.join(root, file_or_dir),
                                             os.path.join(path, os.path.pardir)))
        elif os.path.isfile(path):
            # Single file: store it under its base name
            zf.write(path, os.path.basename(path))
        # List the contents while the archive is still open;
        # the with block closes it on exit
        zf.printdir()

The function is also meant to handle single files, but it is mainly the part concerning directories that I am interested in.

Now let's say we have a list of filenames that we want to exclude from being added to the zip archive.

skiplist = ['.DS_Store', 'tempfile.tmp']

What is the best and cleanest way to achieve this?

I tried using zip(), which was somewhat successful but excludes empty folders for some reason (empty folders should be included). I'm not sure why this happens.

skiplist = ['.DS_Store', 'tempfile.tmp']
for root, dirs, files in os.walk(path):
    for (file_or_dir, skipname) in zip(files + dirs, skiplist):
        if skipname not in file_or_dir:
            zf.write(os.path.join(root, file_or_dir),
                    os.path.relpath(os.path.join(root, file_or_dir),
                    os.path.join(path, os.path.pardir)))

It would also be interesting to see if anyone has a clever idea for skipping specific file extensions, perhaps with something like .endswith('.png'), but I'm not sure how to combine it with the existing skiplist.

I would also appreciate any other general comments on the function and on whether it works as expected without surprises, as well as any suggestions for optimizations or improvements.



Solution 1:[1]

You can simply check if the file is not in skiplist:

skiplist = {'.DS_Store', 'tempfile.tmp'}

for root, dirs, files in os.walk(path):
    for file in files + dirs:
        if file not in skiplist:
            zf.write(os.path.join(root, file),
                     os.path.relpath(os.path.join(root, file),
                     os.path.join(path, os.path.pardir)))

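This ensures that entries whose names are in skiplist are not added to the archive. Note that os.walk still descends into a skipped directory, so the files inside it are still added unless the directory is also removed from dirs.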
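
As for why the zip() attempt in the question drops entries: zip() pairs its arguments element by element and stops at the shorter one. With a two-item skiplist, only the first two names in each directory are ever examined, and each is compared against a single skip name. A small sketch with made-up names (`entries` is a hypothetical directory listing, not taken from the question):

```python
skiplist = ['.DS_Store', 'tempfile.tmp']
# Hypothetical result of files + dirs for one directory
entries = ['a.txt', '.DS_Store', 'b.txt', 'empty_dir']

# zip() stops after len(skiplist) pairs, so 'b.txt' and 'empty_dir' are never seen
paired = list(zip(entries, skiplist))

# The question's test: keep the entry if its paired skip name is not a substring
kept_by_zip = [name for name, skipname in paired if skipname not in name]

# A membership test looks at every entry and compares against the whole skiplist
kept_by_membership = [name for name in entries if name not in skiplist]

print(kept_by_zip)
print(kept_by_membership)
```

Because files come before dirs in files + dirs, directories are the entries most likely to fall off the end, which is consistent with the missing empty folders.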

Making skiplist a set is also a small optimization: if it grows very large, membership tests take constant time, O(1) on average, instead of the linear O(N) time of a list.
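
A rough way to see the difference, as a sketch with synthetic names (the sizes and repeat count are arbitrary, and absolute timings depend on the machine):

```python
import timeit

# Synthetic skip names; a real skiplist is usually much shorter
names = [f'file{i}.tmp' for i in range(5000)]
skip_list = names
skip_set = set(names)

# Worst case for the list: the name is absent, so every element is compared
probe = 'not-in-the-list.txt'

list_seconds = timeit.timeit(lambda: probe in skip_list, number=200)
set_seconds = timeit.timeit(lambda: probe in skip_set, number=200)

print(f'list: {list_seconds:.6f}s  set: {set_seconds:.6f}s')
```

For a two-item skiplist the difference is negligible, so the set is mostly a matter of stating intent.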

You can read more on the Python wiki's TimeComplexity page, which lists the time complexities of various operations on Python's built-in data structures.

As for extensions, you can use os.path.splitext() to extract the extension and apply the same logic to the files (directories are left out of this snippet, since there is no extension to check):

from os.path import splitext

extensions = {'.png', '.txt'}

for root, dirs, files in os.walk(path):
    for file in files:
        _, extension = splitext(file)
        if extension not in extensions:
            zf.write(os.path.join(root, file),
                     os.path.relpath(os.path.join(root, file),
                     os.path.join(path, os.path.pardir)))
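
A few splitext() behaviors are worth knowing before relying on it. The filenames below are only illustrative:

```python
from os.path import splitext

# A leading dot is part of the name, not an extension
dotfile_ext = splitext('.DS_Store')[1]        # ''
# Only the last suffix is split off
tarball_ext = splitext('archive.tar.gz')[1]   # '.gz'
# Case is preserved, so normalize before comparing against a lowercase set
upper_ext = splitext('photo.PNG')[1]          # '.PNG'

extensions = {'.png', '.txt'}
skip_upper = upper_ext.lower() in extensions  # True

# str.endswith() also accepts a tuple, which is close to the idea in the question
skip_by_suffix = 'photo.png'.endswith(('.png', '.txt'))  # True
```

So '.DS_Store' has to be matched by name rather than by extension, and a multi-part suffix such as '.tar.gz' is easier to match with endswith() than with splitext().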

If you want to combine the above features, then you can handle the logic for files and directories separately:

from os.path import splitext

extensions = {'.png', '.txt'}
skiplist = {'.DS_Store', 'tempfile.tmp'}

for root, dirs, files in os.walk(path):
    for file in files:
        _, extension = splitext(file)
        if file not in skiplist and extension not in extensions:
            zf.write(os.path.join(root, file),
                     os.path.relpath(os.path.join(root, file),
                     os.path.join(path, os.path.pardir)))

    for directory in dirs:
        if directory not in skiplist:
            zf.write(os.path.join(root, directory),
                     os.path.relpath(os.path.join(root, directory),
                     os.path.join(path, os.path.pardir))) 

Note: The above code snippets won't work by themselves; you will need to weave them into your current code to use these ideas.
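
One way the pieces might fit together, as a sketch rather than a drop-in replacement. The parameter names `zipname`, `skip_names` and `skip_extensions` are hypothetical and not part of the original function. Unlike the snippets above, this version writes each visited directory itself (so the top-level folder also gets an entry and empty folders are kept), and it prunes dirs in place so that os.walk does not descend into skipped directories:

```python
import os
import zipfile

def zip_dir(path, zipname=None, skip_names=(), skip_extensions=()):
    """Zip `path` (file or directory), skipping the given names and extensions."""
    skip_names = set(skip_names)
    skip_extensions = {ext.lower() for ext in skip_extensions}
    path = os.path.abspath(path)
    if zipname is None:
        zipname = os.path.basename(path) + '.zip'
    # Archive names are stored relative to the parent of `path`
    parent = os.path.dirname(path)

    with zipfile.ZipFile(zipname, 'w', zipfile.ZIP_DEFLATED) as zf:
        if os.path.isfile(path):
            zf.write(path, os.path.basename(path))
            return zipname
        for root, dirs, files in os.walk(path):
            # Prune in place so os.walk does not descend into skipped directories
            dirs[:] = [d for d in dirs if d not in skip_names]
            # Write the directory itself so that empty folders are preserved
            zf.write(root, os.path.relpath(root, parent))
            for name in files:
                ext = os.path.splitext(name)[1].lower()
                if name in skip_names or ext in skip_extensions:
                    continue
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, parent))
    return zipname
```

A call might look like zip_dir('project', skip_names={'.DS_Store', 'tempfile.tmp'}, skip_extensions={'.png'}). If the archive is created inside the directory being zipped, it may end up containing itself, so passing a zipname outside that directory is the safer choice.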

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
