Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - How to properly convert a portion of an XML file into a data frame?

I am trying to extract information from an XML file from ClinicalTrials.gov. The file is organized in the following way:

<clinical_study>
  ...
  <brief_title>
  ...
  <location>
    <facility>
      <name>
      <address>
        <city>
        <state>
        <zip>
        <country>
    </facility>
    <status>
    <contact>
      <last_name>
      <phone>
      <email>
    </contact>
  </location>
  <location>
    ...
  </location>
  ...
</clinical_study>

I can use the R XML package from CRAN in the following code to extract all location nodes from the XML file:

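library(XML)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
# Parse the study record into an internal (C-level) XML document
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
# Build one data frame row per <location> node
locations <- xmlToDataFrame(getNodeSet(xmlDoc, "//location"))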

This mostly works. However, if you look at the data frame, you will notice that the xmlToDataFrame function lumped everything under <facility> into a single concatenated string. One solution would be to write code that generates the data frame column by column; for example, you could generate ...
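
As a rough illustration of the column-by-column idea (not the asker's own code, which is cut off here), the sketch below pulls each field with a relative XPath query per <location> node. The sample document, its values, and the helper first_value are all hypothetical and exist only to keep the example runnable without network access.

```r
library(XML)

# Hypothetical sample with the same layout as the <location> nodes above;
# all names and values are invented for illustration.
sample_xml <- '<clinical_study>
  <location>
    <facility>
      <name>Clinic A</name>
      <address><city>Springfield</city><state>Illinois</state></address>
    </facility>
    <status>Recruiting</status>
  </location>
  <location>
    <facility>
      <name>Clinic B</name>
      <address><city>Shelbyville</city></address>
    </facility>
    <status>Withdrawn</status>
  </location>
</clinical_study>'
sample_doc <- xmlParse(sample_xml, asText = TRUE)
location_nodes <- getNodeSet(sample_doc, "//location")

# Hypothetical helper: first text value matching a relative XPath under
# `node`, or NA when the element is absent.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One column per field, one row per <location>.
locations_by_column <- data.frame(
  name   = sapply(location_nodes, first_value, path = "./facility/name"),
  city   = sapply(location_nodes, first_value, path = "./facility/address/city"),
  state  = sapply(location_nodes, first_value, path = "./facility/address/state"),
  status = sapply(location_nodes, first_value, path = "./status"),
  stringsAsFactors = FALSE
)
print(locations_by_column)
```

Querying each node separately, rather than running one document-wide query per column, keeps the rows aligned when a location lacks a field. The drawback is that every column has to be listed by hand.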

1 Reply

You could flatten the XML first, so that every leaf element under a <location> node becomes its own column.

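# Recursively flatten node x into a named list with one entry per leaf value
flatten_xml <- function(x) {
  if (length(xmlChildren(x)) == 0) {
    # Leaf (text) node: name its value after the enclosing element
    structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
  } else Reduce(append, lapply(xmlChildren(x), flatten_xml))
}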

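# Flatten each <location> node into a one-row data frame
dfs <- lapply(getNodeSet(xmlDoc, "//location"), function(x) data.frame(flatten_xml(x), stringsAsFactors = FALSE))
# Collect every column name that occurs in any location
allnames <- unique(c(lapply(dfs, colnames), recursive = TRUE))
# Add the missing columns as NA so that the one-row frames can be stacked with rbind
df <- do.call(rbind, lapply(dfs, function(df) { df[, setdiff(allnames, colnames(df))] <- NA; df }))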
head(df)

 #          city      state   zip       country     status          last_name        phone                    email               last_name.1
 # 1  Birmingham    Alabama 35294 United States Recruiting Louis B Nabors, MD 205-934-1813          [email protected]        Louis B Nabors, MD
 # 2      Mobile    Alabama 36604 United States Recruiting Melanie Alford, RN 251-445-9649     [email protected]    Pamela Francisco, CCRP
 # 3     Phoenix    Arizona 85013 United States Recruiting     Lynn Ashby, MD 602-406-6262           [email protected]            Lynn Ashby, MD
 # 4      Tucson    Arizona 85724 United States Recruiting         Jamie Holt 520-626-6800 [email protected] Baldassarre Stea, MD, PhD
 # 5 Little Rock   Arkansas 72205 United States Recruiting   Wilma Brooks, RN 501-686-8530       [email protected]       Amanda Eubanks, APN
 # 6    Berkeley California 94704 United States  Withdrawn               <NA>         <NA>                     <NA>                      <NA>
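
To try the approach without fetching the ClinicalTrials.gov record, here is a minimal sketch that runs the same flattening on a small made-up document. The sample XML and the names demo_xml, rows, padded and demo_df are assumptions for illustration. The second location deliberately has no <contact>, so its contact columns should be filled with NA.

```r
library(XML)

# Same recursive flattener as in the answer above.
flatten_xml <- function(x) {
  if (length(xmlChildren(x)) == 0) {
    structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
  } else {
    Reduce(append, lapply(xmlChildren(x), flatten_xml))
  }
}

# Hypothetical sample data; every value here is invented.
demo_xml <- '<clinical_study>
  <location>
    <facility>
      <name>Clinic A</name>
      <address><city>Springfield</city></address>
    </facility>
    <status>Recruiting</status>
    <contact>
      <last_name>Jane Doe</last_name>
      <phone>555-0100</phone>
    </contact>
  </location>
  <location>
    <facility>
      <name>Clinic B</name>
      <address><city>Shelbyville</city></address>
    </facility>
    <status>Withdrawn</status>
  </location>
</clinical_study>'
demo_doc <- xmlParse(demo_xml, asText = TRUE)

# One single-row data frame per <location>.
rows <- lapply(getNodeSet(demo_doc, "//location"), function(node) {
  data.frame(flatten_xml(node), stringsAsFactors = FALSE)
})
all_names <- unique(unlist(lapply(rows, colnames)))

# Pad each row with NA columns so that rbind sees identical column names.
padded <- lapply(rows, function(row) {
  for (nm in setdiff(all_names, colnames(row))) row[[nm]] <- NA
  row
})
demo_df <- do.call(rbind, padded)
print(demo_df)
```

When a leaf name repeats within one location, data.frame makes the column names unique, which is where a column such as last_name.1 in the output above comes from.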
