How can I iterate child text nodes (not descendants) in ElementTree?
Given an element like this
<A>
hello
<annotation> NOT part of text </annotation>
world
</A>
how can I get just the child text nodes (like XPath text()), using ElementTree?
Both iter() and itertext() are tree walkers, which include all descendant nodes. There is no immediate-child iterator that I'm aware of. Plus, iter() only finds elements anyway (it is, after all, ElementTree), so it can't be used to collect text nodes as such.
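To make the problem concrete, here is a small sketch of what the two walkers return for the element above. The whitespace is left out of the sample string for readability, which is my own simplification.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<A>hello<annotation>NOT part of text</annotation>world</A>")

# itertext() walks the whole subtree, so the annotation's text is included.
all_text = list(root.itertext())
print(all_text)  # ['hello', 'NOT part of text', 'world']

# iter() yields elements only (the root and its descendants), never text.
tags = [el.tag for el in root.iter()]
print(tags)  # ['A', 'annotation']
```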
I understand that there's a library called lxml which provides better XPath support, but I'm asking here before adding another dependency. (Plus I'm very new to Python, so I might be missing something obvious.)
Solution 1:[1]
The text in your example is stored, somewhat counter-intuitively, in three attributes:
- A.text for "hello"
- annotation.text for "NOT part of text"
- annotation.tail for "world"
(whitespace omitted). This is somewhat cumbersome. However, something along these lines should help:
import xml.etree.ElementTree as et

xml = """
<A>
hello
<annotation> NOT part of text </annotation>
world
</A>"""

doc = et.fromstring(xml)

def all_texts(root):
    # Text before the first child lives in root.text.
    if root.text is not None:
        yield root.text
    # Text after each child lives in that child's tail.
    for child in root:
        if child.tail is not None:
            yield child.tail

print(list(all_texts(doc)))
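If the surrounding whitespace is unwanted, the same idea can be sketched with the chunks stripped and empty ones dropped. The helper name `child_texts` is hypothetical, and stripping is an assumption about what the caller wants rather than part of the approach above.

```python
import xml.etree.ElementTree as ET

def child_texts(element):
    """Yield the stripped, non-empty direct text chunks of element."""
    # Direct text is element.text plus the tail of each immediate child.
    chunks = [element.text] + [child.tail for child in element]
    for chunk in chunks:
        if chunk and chunk.strip():
            yield chunk.strip()

doc = ET.fromstring("""
<A>
hello
<annotation> NOT part of text </annotation>
world
</A>""")

result = list(child_texts(doc))
print(result)  # ['hello', 'world']
```

Only `text` and the children's `tail` attributes are read, so text inside `annotation` (or anything nested deeper) is never visited.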
Solution 2:[2]
Inspired by the accepted answer (from deets), I wrote a similar function that I found helpful. It concatenates all text within a node, including the text of its descendants:
import xml.etree.ElementTree as ET

def get_text(node: ET.Element) -> str:
    '''Gets text out of an XML Node'''
    # Get initial text
    text = node.text if node.text else ""
    # Get all text from child nodes recursively
    for child_node in node:
        text += get_text(child_node)
    # Get text that occurs after this node (its tail)
    text += node.tail if node.tail else ""
    return text
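For comparison, the standard library's itertext() covers much the same ground as the recursive concatenation above. A minimal sketch, with a sample string of my own; note that itertext() does not include the tail of the node it is called on, and that either way descendant text is included, so this answers a slightly different question than the one asked.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<A>hello <annotation>NOT part of text</annotation> world</A>")

# Concatenate every text chunk in the subtree, descendants included.
full_text = "".join(root.itertext())
print(full_text)  # hello NOT part of text world
```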
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | deets |
Solution 2 | Justin Furuness |