'Why Does a Strange File Shows Up in Directory When Using os.walk()?
The project is written in Pycharm on Windows 10.
I wrote a program that grabs .docx files from a directory and searches for information. At the end of the list of file names I get this file: "~$640188.docx"
I get this error when it hits this file:
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
This error happens when I try to put file '~$640188.docx' into the docx2text method process
text = docx2txt.process(r'C:\path\to\folder\~$640188.docx')
From what I can see, this file does not exist in the directory I'm searching nor anywhere on my computer. The other strange part is that yesterday I wasn't getting this error.
I know there are sometimes "hidden" files in directories and I ran into those before on my mac (specifically '.DS_Store') but this is a .docx file.
I currently have an ugly solution, which says "don't run the code if you run into '~$640188.docx'". My concern is that this will become more of a problem when I dump 11000 files into the directory.
Where does this file come from?
Below is the code for reference
import docx2txt
import os
check_files = []
for dir, subdir, files in os.walk(r'C:\path\to\folder'):
for file in files:
check_files.append(file)
for file in check_files:
print "file: {0}".format(file)
text = docx2txt.process(r'C:\path\to\folder\{0}'.format(file))
Solution 1:[1]
Hidden .docx
files starting with ~$
are simply temporary files created by Word while a file is actively open and being edited – the first two characters of the respective parent file's name are replaced with the ~$
. They are usually deleted once you save and close a document, but sometimes they manage to stick around after you quit anyway. Since they are designed to be temporary compliments to a proper .docx
file, they do not necessary have the correct zip package structure at all times.
You will do well to skip those. Checking if the file name starts with '~'
should be good enough. Just add the following filtering:
check_files2 = [fl for fl in check_files if fl[0] != '~']
for file in check_files2:
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Next-Door Tech |