Unzipping a .docx file using zipfile library
I am trying to write an application that gets information from a table in a Word docx file so that I can analyze it by turning it into a pandas DataFrame. The first step is properly reading the docx file, and to do this, I am following Virantha Ekanayake's guide for Reading and writing Microsoft Word docx files with Python.
I'm at the first step, where the guide says to use the ZipFile class of the zipfile library to unzip the docx file into XML files. I adapted the function definitions in the guide into my code (included below), but when I run it, I get an error saying that the docx file is "not a zip file".
The guide says, "At its heart, a docx file is just a zip file (try running unzip on it!)…" I have tried renaming the docx file to a zip file, and it unzips successfully using WinZip. In my program, however, I want to be able to unzip the docx file without having to rename it to a .zip file manually. Can I somehow unzip the docx file without renaming it? Or, if I have to rename it in order to use ZipFile, how do I do this in my Python code?
import zipfile
from lxml import etree
import pandas as pd

FILE_PATH = 'C:/Users/user/Documents/Python Project'

class Application():
    def __init__(self):
        #debug print('Initialized!')
        xml_content = self.get_word_xml(f'{FILE_PATH}/DocxFile.docx')
        xml_tree = self.get_xml_tree(xml_content)

    def get_word_xml(self, docx_filename):
        with open(docx_filename) as f:
            zip = zipfile.ZipFile(f)
            xml_content = zip.read('word/document.xml')
        return xml_content

    def get_xml_tree(self, xml_string):
        return (etree.fromstring(xml_string))

a = Application()
a.mainloop()
Error:
Traceback (most recent call last):
  File "C:\Users\user\Documents\New_Tool.py", line 39, in <module>
    a = Application()
  File "C:\Users\user\Documents\New_Tool.py", line 27, in __init__
    xml_content = self.get_word_xml(f'{FILE_PATH}/DocxFile.docx')
  File "C:\Users\user\Documents\New_Tool.py", line 32, in get_word_xml
    zip = zipfile.ZipFile(f)
  File "C:\Progra~1\Anaconda3\lib\zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "C:\Progra~1\Anaconda3\lib\zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Solution 1:[1]
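Open the file in binary mode. A zip archive is binary data, and zipfile.ZipFile cannot read it from a file object opened in the default text mode, which is why it reports "File is not a zip file". The .docx extension does not need to change: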
with open(docx_filename, 'rb') as f:
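For reference, here is a minimal sketch of how the reading function from the question might look with that fix applied. It shows two variants: one passes the path straight to zipfile.ZipFile, which opens the file in binary mode itself, and the other passes an explicitly opened binary file object as in the line above. The name get_word_xml comes from the question; get_word_xml_from_handle is a hypothetical name used only for illustration.

```python
import zipfile


def get_word_xml(docx_filename):
    """Return the raw bytes of word/document.xml from a .docx file."""
    # ZipFile accepts a path and opens it in binary mode itself,
    # so the file does not have to be renamed to .zip first.
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read('word/document.xml')


def get_word_xml_from_handle(docx_filename):
    """Same result, using an explicitly opened file object.

    Hypothetical helper name, not taken from the guide or the question.
    """
    # 'rb' is the important part: ZipFile needs a binary file object.
    with open(docx_filename, 'rb') as f:
        with zipfile.ZipFile(f) as archive:
            return archive.read('word/document.xml')
```

Both functions return bytes, which can be passed to etree.fromstring as in the question's get_xml_tree method.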
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | AKX |