Why Does a Strange File Show Up in a Directory When Using os.walk()?

The project is written in PyCharm on Windows 10.

I wrote a program that grabs .docx files from a directory and searches for information. At the end of the list of file names I get this file: "~$640188.docx"

I get this error when it hits this file:

raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

This error happens when I pass the file '~$640188.docx' to the docx2txt.process function:

text = docx2txt.process(r'C:\path\to\folder\~$640188.docx')

From what I can see, this file does not exist in the directory I'm searching nor anywhere on my computer. The other strange part is that yesterday I wasn't getting this error.

I know directories sometimes contain "hidden" files, and I have run into those before on my Mac (specifically '.DS_Store'), but this is a .docx file.

I currently have an ugly solution, which says "don't run the code if you run into '~$640188.docx'". My concern is that this will become more of a problem when I dump 11000 files into the directory.

Where does this file come from?

Below is the code for reference

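import docx2txt
import os

# Collect the name of every file found under the folder
check_files = []
for dirpath, dirnames, filenames in os.walk(r'C:\path\to\folder'):
    for file in filenames:
        check_files.append(file)

# Extract the text of each file; this fails on '~$640188.docx'
for file in check_files:
    print("file: {0}".format(file))
    text = docx2txt.process(r'C:\path\to\folder\{0}'.format(file))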


Solution 1:[1]

Hidden .docx files starting with ~$ are temporary files that Word creates while a document is open and being edited. The first two characters of the parent file's name are replaced with ~$. They are usually deleted once you save and close the document, but sometimes they stick around after you quit Word. Since they are designed to be temporary complements to a proper .docx file, they do not necessarily have the correct zip package structure at all times, which is why docx2txt raises BadZipfile when it tries to open one.

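You would do well to skip them. Checking whether the file name starts with '~' should be good enough. Just add the following filter:

# Drop Word's temporary files, whose names start with '~'
check_files2 = [fl for fl in check_files if not fl.startswith('~')]
for file in check_files2:
    text = docx2txt.process(r'C:\path\to\folder\{0}'.format(file))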
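
For a folder with thousands of files, it may be more convenient to do the filtering while walking the tree instead of in a second pass. The following is a minimal sketch under a few assumptions: the helper name iter_docx_paths is hypothetical, the stricter '~$' prefix is used instead of '~', and an extra zipfile.is_zipfile check (standard library) is added as a guard against any other .docx file that is not a valid zip archive.

```python
import os
import zipfile


def iter_docx_paths(root):
    """Yield full paths of usable .docx files under root (hypothetical helper)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith('.docx'):
                continue  # not a Word document
            if name.startswith('~$'):
                continue  # temporary file Word keeps while a document is open
            path = os.path.join(dirpath, name)
            if not zipfile.is_zipfile(path):
                continue  # extra guard: a valid .docx is a zip archive
            yield path
```

Each yielded path can then be passed straight to docx2txt.process(path). Because the full path is built with os.path.join from the directory that os.walk reports, files in subfolders are handled as well.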

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Next-Door Tech