Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Web Scraping on multiple pages with RSelenium and select emails with regular expression

I would like to collect email addresses by clicking each name on this website: https://ki.se/en/research/professors-at-ki. I created the following loop, but for some reason some emails are not collected, and the code is very slow. Do you have a better idea? Thank you very much in advance.

library(RSelenium)
library(stringr)

# use RSelenium to download the emails
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://ki.se/en/research/professors-at-ki")

# count the professor entries once and pre-allocate the results table
n <- length(remDr$findElements(using = "xpath", "//strong"))
database <- data.frame(matrix(NA, nrow = n, ncol = 3))

for (i in 1:n) {
  # go back to the listing page; the old element references go stale after navigating away
  remDr$navigate("https://ki.se/en/research/professors-at-ki")
  elems <- remDr$findElements(using = "xpath", "//strong")  # all elements to be selected
  elem <- elems[[i]]  # click on each one in turn
  people <- elem$getElementText()
  elem$clickElement()
  page <- remDr$getPageSource()
  # split the page source into lines and keep those containing "@"
  p <- str_split(as.character(page), "\n")
  a <- grep("@", p[[1]])

  if (length(a) > 0) {
    email <- gsub(" ", "", p[[1]][a[2]])
    database[i, 1] <- people
    database[i, 2] <- email
    database[i, 3] <- "Karolinska Institute"
  }
}


question from: https://stackoverflow.com/questions/65642259/web-scraping-on-multiple-pages-with-rselenium-and-select-emails-with-regular-exp


1 Reply


RSelenium is usually not the fastest approach, as it requires a browser to load each page. There are cases where RSelenium is the only option, but here you can achieve what you need with the rvest library, which should be faster. As for the errors you receive: there are two professors whose profile links do not seem to work, and those are what cause them.

library(rvest)
library(tidyverse)

# get the links to the professors' microsites from the KI main page
r <- read_html("https://ki.se/en/research/professors-at-ki")

people_links <- r %>%
  html_nodes("a") %>%
  html_attr("href") %>%     # pull only the href attribute of each anchor
  na.omit() %>%             # drop anchors that have no href
  str_subset("https://staff.ki.se/people/")

# visit each microsite and read the address out of its "mailto:" anchor
df <- tibble(people_links) %>%
  # these two profiles do not seem to be accessible, so skip them
  filter( !(people_links %in% c("https://staff.ki.se/people/gungra", "https://staff.ki.se/people/evryla")) ) %>%
  rowwise() %>%
  mutate(
    mail = read_html(people_links) %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      na.omit() %>%
      str_subset("mailto:") %>%
      str_remove("mailto:")
  )
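If a microsite ever lacks a `mailto:` anchor, the address can also be pulled straight out of the page text with a regular expression, as the question's title suggests. A minimal sketch, assuming a pragmatic (not fully RFC-compliant) email pattern and an illustrative input string:

```r
library(stringr)

# a pragmatic email pattern: local part, "@", domain, dot, TLD of 2+ letters
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# e.g. the result of html_text() on a profile page (made-up example)
page_text <- "Contact: jane.doe@ki.se, room B4"

# str_extract() returns the first match, or NA if there is none
str_extract(page_text, email_pattern)
#> "jane.doe@ki.se"
```

`str_extract_all()` would return every match instead of just the first, which is useful if a page lists several addresses.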
