Well-formed XML throws XMLSyntaxError when read from a zipfile
I have a well-formed XML file embedded in a zip file:
<?xml version="1.0" encoding="utf-8"?>
<board>
<columns>
<c name="Work" position="1">
<tasks>
<t id="9b860ebd-a18f-4944-bc0c-e6846c03a5a2" />
</tasks>
</c>
<c name="Home" position="2">
<tasks>
<t id="6d6c6b90-5f06-49fe-90ea-50227c90bd8c" />
</tasks>
</c>
<c name="Fun" position="3">
<tasks>
<t id="bd5f7e33-1011-4c96-8022-900dad135145" />
</tasks>
</c>
<c name="Empty column" position="4">
<tasks>
</tasks>
</c>
</columns>
</board>
When that file is parsed outside the archive, lxml raises no parse/syntax error (this also "works" with the standard Python ElementTree). I thought this had to do with compression, but it does not.
Works (not in archive):
import lxml.etree as etree
# Yes, I could parse the file directly but wanted to check xml type
with open("board.xml", "r") as bo:
    xml = bytes(bo.read(), "utf8")
    e = etree.fromstring(xml)
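Since the goal is to hand the parser a bytes object, here is a minimal sketch of reading the file in binary mode instead. It uses the standard-library ElementTree (which behaves the same as lxml here, as noted above) so that it runs without extra packages, and the helper name `parse_xml_file` is hypothetical. Opening with "rb" avoids the decode/re-encode round trip and the newline translation that text mode performs, so the parser sees exactly the bytes on disk, which is the same kind of object `ZipFile.read` returns.

```python
import xml.etree.ElementTree as ET


def parse_xml_file(path):
    """Parse an XML file from its raw bytes (hypothetical helper)."""
    # "rb" returns the bytes untouched: no decoding, no newline translation.
    with open(path, "rb") as fh:
        raw = fh.read()
    # fromstring accepts bytes and honours the encoding in the XML declaration.
    return ET.fromstring(raw)
```

Calling `parse_xml_file("board.xml")` would return the `<board>` root element, assuming the file on disk is the one shown above.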
Doesn't work (in archive):
import zipfile
import lxml.etree as etree
# import xml.etree.ElementTree as etree
# Setting compression argument as zipfile.ZIP_DEFLATED because the archive's
# files were compressed that way changed nothing.
with zipfile.ZipFile("board.zip", "r") as boardzip:
    manifest_xml = boardzip.read("board.xml") or False
    # As seen above, lxml's fromstring requires a bytes object. However, in
    # the case of the embedded file, ZipFile.read already returns a bytes
    # object. Also, manifest_xml has well-formed content.
    if manifest_xml:
        mxml = etree.fromstring(manifest_xml)
The last snippet outputs:
# With LXML
lxml.etree.XMLSyntaxError: expected '>', line 7, column 10
# With standard python
xml.etree.ElementTree.ParseError: mismatched tag: line 7, column 8
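The line and column in these messages refer to the bytes the parser actually received, so it can help to look at that exact line as raw bytes. The following is a minimal sketch under that assumption, using the standard-library parser; the helper name `offending_line` is hypothetical, and it relies on the `position` attribute of `ParseError`, a `(line, column)` tuple.

```python
import zipfile
import xml.etree.ElementTree as ET


def offending_line(zip_path, member):
    """Return (message, raw_line) for a parse failure, or None if parsing succeeds."""
    with zipfile.ZipFile(zip_path) as archive:
        data = archive.read(member)  # decompressed bytes of the member
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        lineno, _column = exc.position  # the line number is 1-based
        lines = data.splitlines()
        # Guard against a line number outside what splitlines() produced.
        raw = lines[lineno - 1] if 1 <= lineno <= len(lines) else b""
        return str(exc), raw
    return None
```

Printing the result of `offending_line("board.zip", "board.xml")` shows the line as a bytes repr, so any non-printable bytes on that line become visible.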
As the mismatched tag is </tasks>, maybe it was <t> that was not parsed correctly. But transforming <t id="9b860ebd-a18f-4944-bc0c-e6846c03a5a2" /> into <t id="9b860ebd-a18f-4944-bc0c-e6846c03a5a2"></t> also changed nothing.
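Because the same XML parses when read from disk, one thing worth ruling out is that the copy stored in the archive is not byte-for-byte the file tested outside it. The sketch below is only a diagnostic and does not claim to identify the cause. The helper name `compare_member_with_file` and the keys of the returned dictionary are hypothetical; `testzip()`, `getinfo()` and `read()` are standard `zipfile` calls.

```python
import zipfile


def compare_member_with_file(zip_path, member, disk_path):
    """Compare an archive member with a file on disk (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() re-reads every member and returns the first name with a
        # bad CRC or header, or None when everything checks out.
        corrupt = archive.testzip()
        info = archive.getinfo(member)
        data = archive.read(member)
    with open(disk_path, "rb") as fh:
        on_disk = fh.read()
    # Index of the first differing byte, or None when the contents match.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(data, on_disk)) if a != b),
        None,
    )
    if first_diff is None and len(data) != len(on_disk):
        # One copy is a prefix of the other; they diverge where it ends.
        first_diff = min(len(data), len(on_disk))
    return {
        "corrupt_member": corrupt,
        "compress_type": info.compress_type,
        "archive_size": info.file_size,
        "disk_size": len(on_disk),
        "first_diff": first_diff,
    }
```

If `first_diff` is not `None`, the archive holds different content from the file on disk, and slicing both byte strings around that index shows where they diverge. The `compress_type` field is included because, per the `zipfile` documentation, the `compression` argument of `ZipFile` applies when writing; when reading, the method is taken from each member's header.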
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow