XML data extraction where not all parent nodes contain the child node
I have an XML data file in which each user has opened an account and, in some cases, the account has been terminated. The data omits the termination node when an account has not been terminated, which makes it very difficult to extract the information.
Here is a reproducible example (where only users 1 and 3 have had their accounts terminated):
library(XML)
my_xml <- xmlParse('<accounts>
<user>
<id>1</id>
<start>2015-01-01</start>
<termination>2015-01-21</termination>
</user>
<user>
<id>2</id>
<start>2015-01-01</start>
</user>
<user>
<id>3</id>
<start>2015-02-01</start>
<termination>2015-04-21</termination>
</user>
<user>
<id>4</id>
<start>2015-03-01</start>
</user>
<user>
<id>5</id>
<start>2015-04-01</start>
</user>
</accounts>')
To create a data.frame I've tried using sapply. However, because no NA is returned when a user does not have a termination value, the termination column ends up with 2 values instead of 5 and the code produces an error: arguments imply differing number of rows: 5, 2
accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
start=sapply(my_xml["//user//start"], xmlValue),
termination=sapply(my_xml["//user//termination"], xmlValue)
)
Any suggestions on how to solve this problem?
Solution 1:[1]
I prefer to use the xml2 package over the XML package; I find the syntax easier to use.
This is a straightforward problem: find all of the user nodes, then parse out the id and termination nodes from each one. With xml2, xml_find_first applied to a set of nodes returns one result per node, and xml_text gives NA wherever the child node was not found.
library(xml2)
my_xml <- read_xml('<accounts>
<user>
<id>1</id>
<start>2015-01-01</start>
<termination>2015-01-21</termination>
</user>
<user>
<id>2</id>
<start>2015-01-01</start>
</user>
<user>
<id>3</id>
<start>2015-02-01</start>
<termination>2015-04-21</termination>
</user>
<user>
<id>4</id>
<start>2015-03-01</start>
</user>
<user>
<id>5</id>
<start>2015-04-01</start>
</user>
</accounts>')
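# One node per user; children are then looked up relative to each user
usernodes <- xml_find_all(my_xml, ".//user")
# xml_find_first keeps one result per user, so a missing child becomes NA in xml_text
ids <- xml_text(xml_find_first(usernodes, ".//id"))
terms <- xml_text(xml_find_first(usernodes, ".//termination"))
answer <- data.frame(ids, terms)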
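The question also asks for the start column, which the snippet above leaves out. A minimal sketch of how the same approach might extend to all three columns, assuming a shortened two-user document; `doc`, `users` and the helper `child_text` are hypothetical names introduced only for this illustration:

```r
library(xml2)

# Shortened, hypothetical input: only user 1 has a <termination> child.
doc <- read_xml('<accounts>
<user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
<user><id>2</id><start>2015-01-01</start></user>
</accounts>')

users <- xml_find_all(doc, ".//user")

# Hypothetical helper: text of the first child called `name` for each user,
# or NA when that user has no such child.
child_text <- function(nodes, name) {
  xml_text(xml_find_first(nodes, paste0("./", name)))
}

accounts <- data.frame(
  id = child_text(users, "id"),
  start = child_text(users, "start"),
  termination = child_text(users, "termination"),
  stringsAsFactors = FALSE
)
```

Every column is built from the same set of user nodes, so all columns have one entry per user and data.frame no longer complains about differing row counts. The values are character strings here; wrapping the date columns in as.Date should keep the NA entries as NA.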
Solution 2:[2]
I managed to find a solution in the question "XPath in R: return NA if node is missing":
accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
                       start=sapply(my_xml["//user//start"], xmlValue),
                       termination=sapply(xpathApply(my_xml, "//user",
                         function(x){
                           if("termination" %in% names(x))
                             xmlValue(x[["termination"]])
                           else NA}), function(x) x))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | camnesia |