XML: delete unwanted tags but keep text content
I am trying to tidy up a corpus that has far too many tags. To do this, I want to remove the unneeded tags but keep their text content. I'm quite new to working with XML, and neither of the two approaches I tried works. The corpus looks something like this:
<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2= "dd"> How are you </sentence>
<sentence tag1="ff" tag2= "e"> today </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2= "bbb"> Great </sentence>
<sentence tag1="f" tag2= "dd"> How about you </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2= "dd"> me too </sentence>
</dialogue>
</corpus>
And the ideal outcome should be:
<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2= "dd"> How are you today </sentence>
</dialogue>
<dialogue speaker="A">
Great How about you
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2= "dd"> me too </sentence>
</dialogue>
</corpus>
The first code I tried is this, but it kept giving me an error for strip_tags():
f = ET.parse("file.xml")
root = f.getroot()
def filter_by(f, tag_list):
    for elem in root.iter('dialogue'):
        for start in elem.iter('sentence'):
            print(sentence.attrib)
            if tag_list in root.findall('.//sentence[@tag1]'):
                pass
            else:
                etree.strip_tags(f, 'sentence')
    return f
filter_by(f, ["a"])
f.write("output.xml")
Since there is more than one tag I need to keep, the other option I tried was this one, but it still gave me an error in the if statement:
f = ET.parse("file.xml")
root = f.getroot()
tags_want = ["a", "cc"]
for child in root.iter('sentence'):
    attrib = child.get("tag1")
    if attrib not in tags_want:
        etree.strip_tags(f,'sentence')
f.write("output.xml")
Can someone help me?
Solution 1:[1]
I would do it in one of these two ways. First, using ElementTree, like you did, and xpath:
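import xml.etree.ElementTree as ET
root = ET.parse('file.xml').getroot()
for dia in root.findall('.//dialogue'):
    sentences = dia.findall('./sentence')
    if len(sentences) > 1:
        # merge the stripped text of every sentence into the first one
        sentences[0].text = " ".join((s.text or "").strip() for s in sentences)
        # remove the other sentences; clear() would leave empty elements behind
        for to_delete in sentences[1:]:
            dia.remove(to_delete)
print(ET.tostring(root).decode())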
Second, although it may not make a big difference for your sample XML, I would use lxml instead of ElementTree, because lxml has better XPath support:
from lxml import etree
root = etree.parse('file.xml')
for dia in root.xpath('//dialogue'):
    if dia.xpath('count(./sentence)') > 1:
        # strip each text node and join with single spaces
        new_text = " ".join(t.strip() for t in dia.xpath('.//sentence//text()'))
        dia.xpath('.//sentence')[0].text = new_text
        # drop every sentence after the first one
        for to_delete in dia.xpath('.//sentence[position()>1]'):
            to_delete.getparent().remove(to_delete)
print(etree.tostring(root).decode())
In either case, the output should be
<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2="dd">How are you today</sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2="bbb">Great How about you</sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2="dd"> me too </sentence>
</dialogue>
</corpus>
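Note that both variants merge sentences per dialogue and never look at tag1, while the question asks to keep only sentences whose tag1 is in a list. Also, strip_tags() is a function of lxml.etree; the standard library's xml.etree.ElementTree has no such function, and the lxml version strips every element with the given tag name, so it cannot filter by attribute on its own. Below is a minimal standard-library sketch of the attribute-based filtering. The function name filter_sentences, the parameter wanted, and the merge rule (text of a removed sentence goes into the previous kept sentence, or directly under the dialogue if there is none) are my own assumptions based on the "ideal outcome" in the question. It also assumes that sentence elements contain only text and no child elements.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (attribute spacing normalized).
SAMPLE = """<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2="dd"> How are you </sentence>
<sentence tag1="ff" tag2="e"> today </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2="bbb"> Great </sentence>
<sentence tag1="f" tag2="dd"> How about you </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2="dd"> me too </sentence>
</dialogue>
</corpus>"""


def filter_sentences(root, wanted):
    """Hypothetical helper: drop <sentence> elements whose tag1 is not in
    `wanted`, but keep their text (assumes sentences hold text only)."""
    # findall() returns a list, so removing children below is safe.
    for dia in root.findall(".//dialogue"):
        last_kept = None  # most recent sentence that was kept
        loose = []        # text of removed sentences with no kept one before them
        for sent in dia.findall("sentence"):
            text = (sent.text or "").strip()
            if sent.get("tag1") in wanted:
                last_kept = sent
                continue
            if last_kept is not None:
                # append the text to the previous kept sentence
                merged = f"{(last_kept.text or '').strip()} {text}"
                last_kept.text = merged.strip()
            else:
                loose.append(text)
            dia.remove(sent)
        if loose:
            # no kept sentence came before: put the text directly under <dialogue>
            dia.text = " ".join(loose)
    return root


root = ET.fromstring(SAMPLE)
filter_sentences(root, {"a", "cc"})
print(ET.tostring(root, encoding="unicode"))
```

With wanted set to {"a", "cc"}, the second dialogue keeps a single sentence containing "How are you today", and the third dialogue keeps only the plain text "Great How about you", which is what the question lists as the ideal outcome. If a removed sentence's text should instead stay exactly where it was in the document, it would need to be appended to the previous sibling's tail rather than to its text.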
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jack Fleeting |