XML: delete unwanted tags but keep text content

I am trying to tidy up a corpus that has far too many tags. To do this, I want to remove the unwanted tags but keep their text content. I'm quite new to working with XML, and neither of the snippets I tried works. The corpus looks something like this:

<corpus>
  <dialogue speaker="A">
    <sentence tag1="a" tag2="b"> Hello </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="cc" tag2= "dd"> How are you </sentence>
    <sentence tag1="ff" tag2= "e"> today </sentence>
  </dialogue>
  <dialogue speaker="A">
    <sentence tag1="d" tag2= "bbb"> Great </sentence>
    <sentence tag1="f" tag2= "dd"> How about you </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="a" tag2= "dd"> me too </sentence>
  </dialogue>
</corpus>

And the ideal outcome should be:

<corpus>
  <dialogue speaker="A">
    <sentence tag1="a" tag2="b"> Hello </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="cc" tag2= "dd"> How are you today </sentence>
  </dialogue>
  <dialogue speaker="A">
    Great How about you
  </dialogue>
  <dialogue speaker="B">
     <sentence tag1="a" tag2= "dd"> me too </sentence>
  </dialogue>
</corpus>

The first code I tried is this, but it kept giving me an error for strip_tags():

f = ET.parse("file.xml")
root = f.getroot()

def filter_by(f, tag_list):
    for elem in root.iter('dialogue'):
        for start in elem.iter('sentence'):
            print(sentence.attrib)
            if tag_list in root.findall('.//sentence[@tag1]'):
                pass
            else:
                etree.strip_tags(f, 'sentence')
    return f

filter_by(f, ["a"])
f.write("output.xml")

Since there is more than one tag I need to keep, the other option I tried was this one, but it still gave me an error in the if-statement:

f = ET.parse("file.xml")
root = f.getroot()
tags_want = ["a", "cc"]

for child in root.iter('sentence'):
    attrib = child.get("tag1")
    if attrib not in tags_want: 
        etree.strip_tags(f,'sentence')
        f.write("output.xml")

Can someone help me?



Solution 1:[1]

I would do it in one of these two ways. First, using ElementTree, as you did, and its XPath support:

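import xml.etree.ElementTree as ET

root = ET.parse("file.xml").getroot()
for dia in root.findall('.//dialogue'):
    sentences = dia.findall('./sentence')
    if len(sentences) > 1:
        # merge the text of all sentences into the first one
        sentences[0].text = "".join(t.text or "" for t in sentences).strip()
        # remove() drops the element; clear() would leave an empty <sentence />
        for to_delete in sentences[1:]:
            dia.remove(to_delete)
print(ET.tostring(root).decode())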

Second, while in the case of your sample XML it might not make a big difference, I would use lxml instead of ElementTree, because of the former's better XPath support:

from lxml import etree

tree = etree.parse('file.xml')
for dia in tree.xpath('//dialogue'):
    if dia.xpath('count(./sentence)') > 1:
        # collect the text of every sentence and store it in the first one
        new_text = "".join(dia.xpath('.//sentence//text()')).strip()
        dia.xpath('.//sentence')[0].text = new_text
        # drop the remaining, now redundant, sentences
        for to_delete in dia.xpath('.//sentence[position()>1]'):
            to_delete.getparent().remove(to_delete)
print(etree.tostring(tree).decode())

In either case, the output should be

<corpus>
  <dialogue speaker="A">
    <sentence tag1="a" tag2="b"> Hello </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="cc" tag2="dd">How are you  today</sentence>
    </dialogue>
  <dialogue speaker="A">
    <sentence tag1="d" tag2="bbb">Great  How about you</sentence>
    </dialogue>
  <dialogue speaker="B">
    <sentence tag1="a" tag2="dd"> me too </sentence>
  </dialogue>
</corpus>
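
Both snippets above merge the sentences of a dialogue; neither filters on the value of tag1, which the question also asks about. A minimal sketch of that filtering with the standard-library ElementTree could look like the following. The helper name unwrap_sentences and the shortened SAMPLE string are made up for illustration, and the code assumes that every sentence is a direct child of a dialogue and contains only text.

```python
import xml.etree.ElementTree as ET

# Shortened version of the corpus from the question (illustration only).
SAMPLE = """<corpus>
  <dialogue speaker="A">
    <sentence tag1="a" tag2="b"> Hello </sentence>
  </dialogue>
  <dialogue speaker="B">
    <sentence tag1="cc" tag2="dd"> How are you </sentence>
    <sentence tag1="ff" tag2="e"> today </sentence>
  </dialogue>
  <dialogue speaker="A">
    <sentence tag1="d" tag2="bbb"> Great </sentence>
    <sentence tag1="f" tag2="dd"> How about you </sentence>
  </dialogue>
</corpus>"""


def unwrap_sentences(root, keep_values):
    """Remove <sentence> tags whose tag1 is not in keep_values, keeping their text.

    Hypothetical helper: assumes sentences are direct children of <dialogue>
    and hold only text (no nested elements).
    """
    for dia in root.iter("dialogue"):
        previous = None  # last child that stays in the tree
        for child in list(dia):  # iterate over a copy, dia is modified below
            if child.tag != "sentence" or child.get("tag1") in keep_values:
                previous = child
                continue
            # Text that must survive: the element's content plus its tail.
            kept_text = (child.text or "") + (child.tail or "")
            if previous is None:
                # No kept sibling before it: the text joins the parent's text.
                dia.text = (dia.text or "") + kept_text
            else:
                # Otherwise it joins the tail of the previous kept sibling.
                previous.tail = (previous.tail or "") + kept_text
            dia.remove(child)
    return root


root = unwrap_sentences(ET.fromstring(SAMPLE), {"a", "cc"})
print(ET.tostring(root, encoding="unicode"))
```

With this approach the text of a removed sentence stays where it was in the document, so "today" ends up as plain text after the kept sentence rather than inside it. If it should be folded into the preceding sentence, as in the question's ideal outcome, it would have to be appended to previous.text instead of previous.tail.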

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jack Fleeting