Parse Grobid .tei.xml output with Beautiful Soup

I am trying to use Beautiful Soup to extract elements from a .tei.xml file that was generated using Grobid.

I can get title(s) using:

titles = soup.findAll('title')

What is the correct syntax to access the lower-level elements (author, affiliation, etc.)?

This is a portion of the tei.xml file that is the Grobid output:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI xmlns="http://www.tei-c.org/ns/1.0" 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://www.tei-c.org/ns/1.0 /data/grobid-0.5.1/grobid-home/schemas/xsd/Grobid.xsd"
  xmlns:xlink="http://www.w3.org/1999/xlink">
     <teiHeader xml:lang="en">
         <encodingDesc>
             <appInfo>
                 <application version="0.5.1-SNAPSHOT" ident="GROBID" when="2018-08-15T14:51+0000">
                     <ref target="https://github.com/kermitt2/grobid">GROBID - A machine learning software for extracting information from scholarly documents</ref>
                 </application>
             </appInfo>
         </encodingDesc>
         <fileDesc>
             <titleStmt>
                 <title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
             </titleStmt>
             <publicationStmt>
                 <publisher/>
                 <availability status="unknown"><licence/></availability>
             </publicationStmt>
             <sourceDesc>
                 <biblStruct>
                     <analytic>
                         <author>
                             <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Mark</forename><surname>Harman</surname></persName>
                             <affiliation key="aff0">
                                 <orgName type="department">CREST Centre</orgName>
                                 <orgName type="institution">University College London</orgName>
                                 <address>
                                     <addrLine>Malet Place</addrLine>
                                     <postCode>WC1E 6BT</postCode>
                                     <settlement>London</settlement>
                                     <country key="GB">UK</country>
                                 </address>
                             </affiliation>
                         </author>
                         <title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
                     </analytic>
                     <monogr>
                         <imprint>
                             <date/>
                         </imprint>
                     </monogr>
                 </biblStruct>
             </sourceDesc>
         </fileDesc>

Thanks.



Solution 1:[1]

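Beautiful Soup lowercases tag names when the file is parsed with an HTML parser, so teiHeader becomes teiheader and persName becomes persname. Here are some examples:

# Presumably soup was built with the lxml HTML parser, which wraps the document in <html><body>
title = soup.html.body.teiheader.filedesc.analytic.title.string

for author in soup.html.body.teiheader.filedesc.sourcedesc.find_all('author'):
    tag_or_none = author.persname.forename  # a Tag, or None if the author has no <forename>
    first_affiliation = author.affiliation  # first <affiliation> under this author, or None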

See also the Beautiful Soup documentation, which covers navigating the tree in detail.
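For a fuller picture, here is a minimal sketch that pulls the name and affiliation fields out of each author element. It assumes Python's built-in "html.parser" backend, which also lowercases tag names but does not add an html/body wrapper. The names TEI_SAMPLE and extract_authors are hypothetical and only for illustration, and the sample is a trimmed copy of the header from the question with the XML declaration left out.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the Grobid header from the question (illustrative input only).
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xml:lang="en">
    <fileDesc>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Mark</forename><surname>Harman</surname></persName>
              <affiliation key="aff0">
                <orgName type="department">CREST Centre</orgName>
                <orgName type="institution">University College London</orgName>
                <address>
                  <settlement>London</settlement>
                  <country key="GB">UK</country>
                </address>
              </affiliation>
            </author>
            <title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_authors(tei_text):
    """Return one dict per <author> element (hypothetical helper)."""
    # "html.parser" lowercases tag names, so <orgName> is looked up as "orgname".
    soup = BeautifulSoup(tei_text, "html.parser")
    authors = []
    for author in soup.find_all("author"):
        affiliation = author.find("affiliation")
        orgs = {}
        country = None
        if affiliation is not None:
            # Key each organisation name by its "type" attribute.
            for org in affiliation.find_all("orgname"):
                orgs[org.get("type")] = org.get_text(strip=True)
            country = text_or_none(affiliation.find("country"))
        authors.append({
            "forename": text_or_none(author.find("forename")),
            "surname": text_or_none(author.find("surname")),
            "orgs": orgs,
            "country": country,
        })
    return authors


if __name__ == "__main__":
    for entry in extract_authors(TEI_SAMPLE):
        print(entry)
```

Using find() with explicit None checks avoids the AttributeError that chained attribute access raises when an intermediate element such as persname is missing, which can happen when Grobid does not recognise part of a header.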

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 nkconnor