How can I iterate child text nodes (not descendants) in ElementTree?

Given an element like this

<A>
    hello

    <annotation> NOT part of text </annotation>

    world
</A>

how can I get just the child text nodes (like XPath text()), using ElementTree?

Both iter() and itertext() are tree walkers, which include all descendant nodes. There is no immediate-child iterator that I'm aware of. Plus, iter() only finds elements anyway (it is, after all, ElementTree), so it can't be used to collect text nodes as such.

I understand that there's a library called lxml which provides better XPath support, but I'm asking here before adding another dependency. (Plus I'm very new to Python so I might be missing something obvious.)



Solution 1:[1]

Somewhat counter-intuitively, the text in your example is spread across three attributes:

  • A.text for "hello"
  • annotation.text for "NOT part of text"
  • annotation.tail for "world"

(whitespace omitted). This is somewhat cumbersome, but the direct child text of an element is its own .text plus the .tail of each of its children, so something along these lines should help:

 import xml.etree.ElementTree as et

 xml = """
 <A>
     hello

     <annotation> NOT part of text </annotation>

     world
 </A>"""


 doc = et.fromstring(xml)


 def all_texts(root):
     if root.text is not None:
         yield root.text
     for child in root:
         if child.tail is not None:
             yield child.tail


 print(list(all_texts(doc)))
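
As a follow-up sketch, the same idea can strip the surrounding whitespace and skip empty pieces, which is probably closer to what you want when coming from XPath text(). The name child_texts is made up for this illustration; only .text, .tail, and iterating over an element come from ElementTree itself.

```python
import xml.etree.ElementTree as ET

xml = """
<A>
    hello

    <annotation> NOT part of text </annotation>

    world
</A>"""


def child_texts(element):
    """Yield the stripped, non-empty direct text pieces of element.

    Hypothetical helper: element.text covers the text before the first
    child, and each child's .tail covers the text right after that child.
    """
    pieces = [element.text] + [child.tail for child in element]
    for piece in pieces:
        if piece and piece.strip():
            yield piece.strip()


doc = ET.fromstring(xml)
print(list(child_texts(doc)))  # ['hello', 'world']
```

The text inside annotation is skipped because only the child's tail is read, never its own .text.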

Solution 2:[2]

Inspired by the accepted answer (from deets), I wrote a similar function that I found helpful. It concatenates all text within a node, descendants included:

import xml.etree.ElementTree as ET


def get_text(node: ET.Element) -> str:
    '''Gets text out of an XML node, including all descendants'''

    # Get initial text
    text = node.text if node.text else ""
    # Get all text from child nodes recursively
    for child_node in node:
        text += get_text(child_node)
    # Get text that occurs after this node's closing tag (its tail)
    text += node.tail if node.tail else ""
    return text
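
For comparison, ElementTree's own itertext() walks the same descendant text in document order, so joining its output gives a similar result without hand-written recursion. This is a sketch rather than a drop-in replacement: it leaves out the tail of the starting node itself, which get_text above includes. The name all_descendant_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def all_descendant_text(node):
    """Concatenate node.text and all text inside its descendants.

    Hypothetical helper built on Element.itertext(); the tail of `node`
    itself is not included.
    """
    return "".join(node.itertext())


doc = ET.fromstring("<A>hello <annotation>NOT part of text</annotation> world</A>")
print(all_descendant_text(doc))  # hello NOT part of text world
```

Note that neither version answers the original question about direct child text only; both collect text from every descendant.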

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    deets
Solution 2    Justin Furuness