I'm trying to scrape text from an HTML document using htmlParse (package: XML) in R. For the HTML below, I would like to know how to return NA when a tag (e.g., <p class="neg">) is missing:
<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="pos">positive</p>
</div>
<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="neg">negative</p>
</div>
I want the result to look like this:
"positive" "negative"
"positive" NA
"positive" "negative"
NA "negative"
Thanks!
Majesus
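One possible way to get NA for a missing tag, as a minimal sketch rather than a definitive answer: loop over each review <div> and query each class separately, so that an empty match can be turned into NA. The helper name get_p is hypothetical (it is not part of the XML package), and the sketch assumes each review holds at most one <p> per class.

```r
library(XML)

# The four sample reviews from the question, as a single HTML string
html <- '<div class="review"><p class="pos">positive</p><p class="neg">negative</p></div>
<div class="review"><p class="pos">positive</p></div>
<div class="review"><p class="pos">positive</p><p class="neg">negative</p></div>
<div class="review"><p class="neg">negative</p></div>'

doc <- htmlParse(html, asText = TRUE)
reviews <- getNodeSet(doc, "//div[@class='review']")

# Hypothetical helper: text of the first <p> with the given class
# inside one review node, or NA when no such <p> exists
get_p <- function(node, cls) {
  val <- xpathSApply(node, sprintf("./p[@class='%s']", cls), xmlValue)
  if (length(val) == 0) NA_character_ else val[[1]]
}

# One row per review, one column per class
result <- do.call(rbind, lapply(reviews, function(r) {
  c(pos = get_p(r, "pos"), neg = get_p(r, "neg"))
}))
result
```

Querying per class keeps the column order fixed regardless of which tags are present, at the cost of one XPath query per column.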
::::::::::::::::::::::::::::::::::::::::
Chris,
I have included a new record (hotel_name):
<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="pos">positive</p>
</div>
<div class="review">
<p class="pos">positive</p><p class="neg">negative</p>
</div>
<div class="review">
<p class="neg">negative</p>
</div>
<div class="hotel">
<h3 class="hotel_name">Hotel Bla</h3>
</div>
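library(XML)   # htmlParse, getNodeSet, xpathSApply, xmlValue, xmlGetAttr
library(plyr)  # ldply
# Select only the review <div>s, so the hotel <div> is not treated as a review
y <- getNodeSet(doc, "//div[@class='review']")
# For each review, collect the <p> texts and name them by their class
y <- lapply(y, function(x) {
  vals <- xpathSApply(x, ".//p[@class]", xmlValue)
  names(vals) <- xpathSApply(x, ".//p[@class]", xmlGetAttr, "class")
  vals
})
# Bind the per-review vectors into one table with a column per class
ldply(y, "rbind")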
# Same pattern for the hotel name (note: the name t masks base::t in this session)
t <- getNodeSet(doc, "//div[@class='hotel']")
t <- lapply(t, function(x) {
  vals <- xpathSApply(x, ".//h3[@class='hotel_name']", xmlValue)
  names(vals) <- xpathSApply(x, ".//h3[@class='hotel_name']", xmlGetAttr, "class")
  vals
})
ldply(t, "rbind")
How can I combine both records (y and t) into a single table (a CSV?) that I can open in Excel? "pos", "neg" and "t" must be columns in the same table. Importantly, each "pos" and each "neg" may contain a different number of line breaks. I tried combining cbind and write.table, but the resulting table is garbled.
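A hedged sketch of one way to build such a table, not necessarily the best one: put the columns into a data frame, collapse embedded line breaks, and write it with write.csv. It assumes that the single hotel name applies to every review on the page and that replacing line breaks with spaces is acceptable. The helper first_or_na and the variables tab and out_file are hypothetical names introduced for illustration.

```r
library(XML)

html <- '<div class="review"><p class="pos">positive</p><p class="neg">negative</p></div>
<div class="review"><p class="pos">positive</p></div>
<div class="review"><p class="pos">positive</p><p class="neg">negative</p></div>
<div class="review"><p class="neg">negative</p></div>
<div class="hotel"><h3 class="hotel_name">Hotel Bla</h3></div>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: text of the first node matching the XPath, or NA
first_or_na <- function(node, path) {
  val <- xpathSApply(node, path, xmlValue)
  if (length(val) == 0) NA_character_ else val[[1]]
}

reviews <- getNodeSet(doc, "//div[@class='review']")
hotel <- first_or_na(doc, "//div[@class='hotel']/h3[@class='hotel_name']")

# Assumption: the one hotel name is repeated on every review row
tab <- data.frame(
  hotel_name = hotel,
  pos = vapply(reviews, first_or_na, character(1), path = "./p[@class='pos']"),
  neg = vapply(reviews, first_or_na, character(1), path = "./p[@class='neg']"),
  stringsAsFactors = FALSE
)

# Collapse line breaks inside each cell so every record stays on one CSV row
tab[] <- lapply(tab, function(col) gsub("[\r\n]+", " ", col))

# Written to a temporary file here; missing values become empty cells
out_file <- tempfile(fileext = ".csv")
write.csv(tab, out_file, row.names = FALSE, na = "")
```

Building a data frame first, rather than calling cbind on lists of unequal length, keeps every column the same length, which is likely what went wrong in the garbled output.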