XML data extraction where not all parent nodes contain the child node

I have an XML data file in which each user has opened an account, and in some cases the account has been terminated. The data omits the termination node when an account has not been terminated, which makes it difficult to extract the information.

Here is a reproducible example (where only users 1 and 3 have had their accounts terminated):

library(XML)
my_xml <- xmlParse('<accounts>
                    <user>
                      <id>1</id>
                      <start>2015-01-01</start>
                      <termination>2015-01-21</termination>
                    </user>
                    <user>
                      <id>2</id>
                      <start>2015-01-01</start>
                    </user>
                    <user>
                      <id>3</id>
                      <start>2015-02-01</start>
                      <termination>2015-04-21</termination>
                    </user>
                    <user>
                      <id>4</id>
                      <start>2015-03-01</start>
                    </user>
                    <user>
                      <id>5</id>
                      <start>2015-04-01</start>
                    </user>
                    </accounts>')

To create a data.frame I tried sapply. However, each XPath query returns only the nodes that exist, so no NA is produced for users without a termination value, and the code fails with the error: arguments imply differing number of rows: 5, 2

accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
                       start=sapply(my_xml["//user//start"], xmlValue),
                       termination=sapply(my_xml["//user//termination"], xmlValue)
                       )
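The mismatch can be confirmed by counting how many nodes each query matches. A minimal sketch on a shortened, hypothetical two-user document (the names doc, n_id and n_term are only for illustration):

```r
library(XML)

# Shortened, hypothetical document: user 2 has no <termination> child.
doc <- xmlParse("<accounts>
  <user><id>1</id><termination>2015-01-21</termination></user>
  <user><id>2</id></user>
</accounts>", asText = TRUE)

# Each query returns only the nodes that exist, so the lengths differ.
n_id   <- length(getNodeSet(doc, "//user//id"))
n_term <- length(getNodeSet(doc, "//user//termination"))
c(id = n_id, termination = n_term)
```

The termination query simply has fewer matches than there are users, and nothing records which user each match belongs to.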

Any suggestions on how to solve this problem?



Solution 1:[1]

I prefer the xml2 package over the XML package because I find the syntax easier to use.
This is a straightforward problem: find all of the user nodes, then parse out the id and termination nodes from each one. With xml2, xml_find_first applied to a set of nodes returns one result per node, using a missing node where nothing is found, and xml_text turns that missing node into NA.

library(xml2)
my_xml <- read_xml('<accounts>
                   <user>
                   <id>1</id>
                   <start>2015-01-01</start>
                   <termination>2015-01-21</termination>
                   </user>
                   <user>
                   <id>2</id>
                   <start>2015-01-01</start>
                   </user>
                   <user>
                   <id>3</id>
                   <start>2015-02-01</start>
                   <termination>2015-04-21</termination>
                   </user>
                   <user>
                   <id>4</id>
                   <start>2015-03-01</start>
                   </user>
                   <user>
                   <id>5</id>
                   <start>2015-04-01</start>
                   </user>
                   </accounts>')

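# Select every <user> node, then look up each child relative to its own user.
usernodes <- xml_find_all(my_xml, ".//user")
ids <- xml_text(xml_find_first(usernodes, ".//id"))
# Users without a <termination> child get NA here instead of being skipped.
terms <- xml_text(xml_find_first(usernodes, ".//termination"))

answer <- data.frame(ids, terms)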
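The same pattern extends to the start column that the question asked for. The following is a minimal sketch, not part of the original answer: it uses a shortened two-user document, and child_text is a hypothetical helper name introduced here. It also assumes the dates are always in ISO format (YYYY-MM-DD), so as.Date can parse them directly.

```r
library(xml2)

# Shortened, hypothetical document: user 2 has no <termination> child.
doc <- read_xml("<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
</accounts>")

users <- xml_find_all(doc, ".//user")

# Illustrative helper: text of the first matching child of each user,
# NA where the child is absent.
child_text <- function(nodes, name) {
  xml_text(xml_find_first(nodes, paste0("./", name)))
}

accounts <- data.frame(
  id          = as.integer(child_text(users, "id")),
  start       = as.Date(child_text(users, "start")),
  termination = as.Date(child_text(users, "termination"))
)
accounts
```

Because every column is derived from the same usernodes-style node set, all columns have one entry per user and the data.frame call no longer fails.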

Solution 2:[2]

I managed to find a solution in the question "XPath in R: return NA if node is missing":

accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
                       start=sapply(my_xml["//user//start"], xmlValue),
                       # Visit each user node and return NA when it has no termination child
                       termination=sapply(xpathApply(my_xml, "//user",
                                                     function(x){
                                                       if("termination" %in% names(x))
                                                         xmlValue(x[["termination"]])
                                                       else NA}),
                                          function(x) x))
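The per-user check above can be generalized so that any optional child is handled the same way. This is a sketch under assumptions rather than the original answer's code: child_or_na and fields are hypothetical names, and the document is shortened to two users.

```r
library(XML)

# Shortened, hypothetical document: user 2 has no <termination> child.
doc <- xmlParse("<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
</accounts>", asText = TRUE)

# Illustrative helper: value of a named child, or NA if the child is absent.
child_or_na <- function(node, field) {
  if (field %in% names(node)) xmlValue(node[[field]]) else NA_character_
}

users  <- getNodeSet(doc, "//user")
fields <- c("id", "start", "termination")

# One character column per field, each with exactly one entry per user.
cols <- lapply(setNames(fields, fields), function(f) {
  vapply(users, child_or_na, character(1), field = f)
})
accounts <- as.data.frame(cols, stringsAsFactors = FALSE)
accounts
```

Iterating over the user nodes, rather than over each field's matches, is what keeps the columns aligned.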

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 camnesia