Well-formed XML throws XMLSyntaxError when read from a zipfile

I have a well-formed XML file embedded in a zip file:

<?xml version="1.0" encoding="utf-8"?>
<board>
  <columns>
    <c name="Work" position="1">
      <tasks>
        <t id="9b860ebd-a18f-4944-bc0c-e6846c03a5a2" />
      </tasks>
    </c>
    <c name="Home" position="2">
      <tasks>
        <t id="6d6c6b90-5f06-49fe-90ea-50227c90bd8c" />
      </tasks>
    </c>
    <c name="Fun" position="3">
      <tasks>
        <t id="bd5f7e33-1011-4c96-8022-900dad135145" />
      </tasks>
    </c>
    <c name="Empty column" position="4">
      <tasks>
      </tasks>
    </c>
  </columns>
</board>

When that file is parsed outside the archive, lxml raises no parse or syntax error (this also "works" with the standard Python ElementTree). I thought this had to do with compression, but it does not.

Works (not in archive):

import lxml.etree as etree

# Yes, I could parse the file directly but wanted to check xml type
with open("board.xml", "r") as bo:
    xml = bytes(bo.read(), "utf8")

e = etree.fromstring(xml)

Doesn't work (in archive):

import zipfile
import lxml.etree as etree
# import xml.etree.ElementTree as etree

# Setting the compression argument to zipfile.ZIP_DEFLATED (the archive's
# files were compressed that way) changed nothing.

with zipfile.ZipFile("board.zip", "r") as boardzip:
    manifest_xml = boardzip.read("board.xml") or False

    # As seen above, etree.fromstring requires a bytes object. However, in
    # the case of the embedded file, ZipFile.read already returns a bytes
    # object. Also, manifest_xml has well-formed content.

    if manifest_xml:
        mxml = etree.fromstring(manifest_xml)

The last snippet outputs:

# With LXML
lxml.etree.XMLSyntaxError: expected '>', line 7, column 10
# With standard python
xml.etree.ElementTree.ParseError: mismatched tag: line 7, column 8

Since the mismatched tag is </tasks>, maybe it was <t> that was not parsed correctly. But changing <t id="9b860ebd-a18f-4944-bc0c-e6846c03a5a2" /> into <t id="9b860ebd-a18f-4944-bc0c-e6846c03a5a2"></t> changed nothing either.
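
Since the parser complains about line 7 while line 7 of the file on disk looks fine, one thing worth checking is whether the bytes stored in the archive are really the same as the file on disk. The following is only a diagnostic sketch, not a fix: `inspect_member` is a hypothetical helper name, and it relies only on standard `zipfile` calls (`infolist`, `read`, `testzip`). It reports how many entries share the member name, whether the data starts with a UTF-8 BOM, whether any member fails its CRC check, and the raw bytes of the line the parser points at.

```python
import zipfile


def inspect_member(archive, member, around_line=7):
    """Collect diagnostic facts about one archive member (hypothetical helper)."""
    with zipfile.ZipFile(archive, "r") as zf:
        # An archive may hold several entries with the same name;
        # read() returns only one of them.
        same_name = [i for i in zf.infolist() if i.filename == member]
        data = zf.read(member)
        # testzip() returns the name of the first member with a bad CRC, or None.
        corrupt = zf.testzip()
    lines = data.splitlines()
    return {
        "entries_with_name": len(same_name),
        "size": len(data),
        "starts_with_bom": data.startswith(b"\xef\xbb\xbf"),
        "first_corrupt_member": corrupt,
        # Raw bytes of the line named in the error message, if it exists.
        "reported_line": lines[around_line - 1] if len(lines) >= around_line else None,
    }
```

Calling `inspect_member("board.zip", "board.xml")` and comparing `"size"` and `"reported_line"` with the same values taken from `open("board.xml", "rb").read()` would show whether the archived copy differs from the one that parses correctly.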



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
