Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Web Scraping on multiple pages with RSelenium and select emails with regular expression

I would like to collect email addresses by clicking each name on this website: https://ki.se/en/research/professors-at-ki. I created the following loop, but for some reason some emails are not collected, and the code is very slow. Do you have a better idea? Thank you very much in advance.

library(RSelenium)
library(stringr)

# use RSelenium to download the emails
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://ki.se/en/research/professors-at-ki")

# count the professor entries once and pre-allocate the results table
n <- length(remDr$findElements(using = "xpath", "//strong"))
database <- data.frame(matrix(NA, nrow = n, ncol = 3))

for (i in 1:n) {
  # go back to the listing page; the old element references go stale after navigating away
  remDr$navigate("https://ki.se/en/research/professors-at-ki")
  elems <- remDr$findElements(using = "xpath", "//strong")  # all elements to be selected
  elem <- elems[[i]]  # click on each one in turn
  people <- elem$getElementText()
  elem$clickElement()
  page <- remDr$getPageSource()
  # split the page source into lines and keep those containing "@"
  p <- str_split(as.character(page), "\n")
  a <- grep("@", p[[1]])

  if (length(a) > 0) {
    email <- gsub(" ", "", p[[1]][a[2]])
    database[i, 1] <- people
    database[i, 2] <- email
    database[i, 3] <- "Karolinska Institute"
  }
}


question from: https://stackoverflow.com/questions/65642259/web-scraping-on-multiple-pages-with-rselenium-and-select-emails-with-regular-exp


1 Reply


RSelenium is usually not the fastest approach, as it requires a browser to load each page. There are cases where RSelenium is the only option, but here you can achieve what you need with the rvest library, which should be faster. As for the errors you receive: there are two professors whose profile links do not seem to work, and those are what cause them.

library(rvest)
library(tidyverse)

# get the links to the professors' microsites from the KI main page
r <- read_html("https://ki.se/en/research/professors-at-ki")

people_links <- r %>%
  html_nodes("a") %>%
  html_attr("href") %>%     # pull only the href attribute of each anchor
  na.omit() %>%             # drop anchors that have no href
  str_subset("https://staff.ki.se/people/")

# visit each microsite and read the address out of its "mailto:" anchor
df <- tibble(people_links) %>%
  # these two profiles do not seem to be accessible, so skip them
  filter( !(people_links %in% c("https://staff.ki.se/people/gungra", "https://staff.ki.se/people/evryla")) ) %>%
  rowwise() %>%
  mutate(
    mail = read_html(people_links) %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      na.omit() %>%
      str_subset("mailto:") %>%
      str_remove("mailto:")
  )
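If a microsite ever lacks a `mailto:` anchor, the address can also be pulled straight out of the page text with a regular expression, as the question's title suggests. A minimal sketch, assuming a pragmatic (not fully RFC-compliant) email pattern and an illustrative input string:

```r
library(stringr)

# a pragmatic email pattern: local part, "@", domain, dot, TLD of 2+ letters
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# e.g. the result of html_text() on a profile page (made-up example)
page_text <- "Contact: jane.doe@ki.se, room B4"

# str_extract() returns the first match, or NA if there is none
str_extract(page_text, email_pattern)
#> "jane.doe@ki.se"
```

`str_extract_all()` would return every match instead of just the first, which is useful if a page lists several addresses.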
