r - How do you scrape items together so you don't lose the index?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am doing some basic web scraping with rvest and am getting results back, but the data isn't lining up. I get the items, but they come back out of step with the order on the site, so the two data elements I am scraping can't be joined in a data.frame.

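library(rvest)
library(tidyverse)

base_url <- "https://www.uchealth.com/providers"

# every location on the page, selected independently of the departments
loc <- read_html(base_url) %>%
  html_nodes('[class=locations]') %>%
  html_text()

# every department on the page (a class value containing a space must be quoted)
dept <- read_html(base_url) %>%
  html_nodes('[class="department last"]') %>%
  html_text()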

I was expecting to be able to create a data frame with these columns:

Location  Department

Any suggestions? I was wondering if there is an index that would keep these items together, but I didn't see anything.

EDIT: I tried this also and did not have any luck. It seems the location is getting an erroneous starting value:

scraping <- function(base_url = "https://www.uchealth.com/providers") {
  loc <- read_html(base_url) %>%
    html_nodes('[class=locations]') %>%
    html_text()

  dept <- read_html(base_url) %>%
    html_nodes('[class=specialties]') %>%
    html_text()

  # ifelse() with a length-1 test returns only the first element of loc/dept
  data.frame(
    loc = ifelse(length(loc) == 0, NA, loc),
    dept = ifelse(length(dept) == 0, NA, dept),
    stringsAsFactors = FALSE
  )
}

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:39:38+0000

The problem you are facing is that not every parent node contains every child node, so selecting each field across the whole page produces vectors of different lengths. The best way to handle this is to collect all the parent nodes in a list/vector and then extract the desired information from each parent with the html_node function. When applied to a set of parents, html_node always returns exactly one result per parent, even if that result is NA.

library(rvest)

# read the page just once
base_url <- "https://www.uchealth.com/providers"
page <- read_html(base_url)

# collect one parent node per provider
providers <- page %>% html_nodes('ul[id=providerlist]') %>% html_children()

# extract the requested information from each parent
dept <- providers %>% html_node("[class ^= 'department']") %>% html_text()
location <- providers %>% html_node('[class=locations]') %>% html_text()

The length of providers, dept and location should all be equal.
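To see why the per-parent approach keeps rows aligned, here is a minimal offline sketch. The HTML string is made up for illustration (it is not the real uchealth.com markup, and the place and department names are invented); it only assumes the same `providerlist`, `locations` and `department` names used above, with one provider deliberately missing a department.

```r
library(rvest)

# Hypothetical markup: the second provider has no department element.
html <- paste0(
  '<ul id="providerlist">',
  '<li><span class="locations">Denver</span>',
  '<span class="department last">Cardiology</span></li>',
  '<li><span class="locations">Aurora</span></li>',
  '<li><span class="locations">Boulder</span>',
  '<span class="department last">Oncology</span></li>',
  '</ul>'
)
page <- read_html(html)

# Page-wide selection skips the missing item, so the vectors differ in length.
loc_all  <- page %>% html_nodes("[class=locations]") %>% html_text()
dept_all <- page %>% html_nodes("[class^='department']") %>% html_text()
length(loc_all)   # 3
length(dept_all)  # 2

# Per-parent selection: html_node() yields one result per <li>, NA if absent.
providers <- page %>% html_nodes("ul[id=providerlist]") %>% html_children()

result <- data.frame(
  Location   = providers %>% html_node("[class=locations]") %>% html_text(),
  Department = providers %>% html_node("[class^='department']") %>% html_text(),
  stringsAsFactors = FALSE
)
result
```

In this sketch the Aurora row gets an NA department instead of shifting Oncology up a row, which is the misalignment described in the question.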
