Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

R - Using purrr::map to loop through web pages for scraping with RSelenium

I have cobbled together a basic R script using RSelenium that logs me into a website. Once authenticated, the script goes to the first page of interest and pulls 3 pieces of text from the page.

Luckily for me, the URL is structured so that I can pass in a vector of numbers to reach each subsequent page of interest, hence the use of map().

On each page I want to scrape the same 3 elements and store them in a master data frame for later analysis.

I wish to use the map family of functions so that I can become more familiar with them, but I am really struggling to get them to work. Could anyone kindly tell me where I am going wrong?

Here is the main part of my code (go to the website and log in):

library(RSelenium)
# https://stackoverflow.com/questions/55201226/session-not-created-this-version-of-chromedriver-only-supports-chrome-version-7/56173984
rd <- rsDriver(browser = "chrome",
               chromever = "88.0.4324.27",
               port = netstat::free_port())

remdr <- rd[["client"]]

# url of the site's login page
url <- "https://www.myWebsite.com/"

# Navigating to the page
remdr$navigate(url)

# Wait 5 secs for the page to load
Sys.sleep(5)

# Find the initial login button to bring up the username and password fields
loginbutton <- remdr$findElement(using = 'css selector','.plain')

# Click the initial login button to bring up the username and password fields
loginbutton$clickElement()

# Find the username box
username <- remdr$findElement(using = 'css selector','#username')

# Find the password box
password <- remdr$findElement(using = 'css selector','#password')

# Find the final login button
login <- remdr$findElement(using = 'css selector','#btnLoginSubmit1')

# Input the username
username$sendKeysToElement(list("myUsername"))

# Input the password
password$sendKeysToElement(list("myPassword"))

# Click login
login$clickElement()

And hey presto, we're in!

Now my code takes me to the initial page of interest (index = 1).

As mentioned above, I want to step through each page, which I can do by substituting an integer into the URL at the rcId parameter:

#remdr$navigate("https://myWebsite.com/rc_redesign/#/layout/jcard/drugCard?accountId=XXXXXX&rcId=1&searchType=R&reimbCode=&searchTerm=&searchTexts=*") # Navigating to the page
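Because sprintf() is vectorised over its arguments, the full set of page URLs can be built in one call. A minimal sketch, assuming the placeholder URL above; `base_url` and `page_urls` are names introduced here purely for illustration:

```r
# Placeholder URL template from the question; %d marks where rcId goes
base_url <- "https://myWebsite.com/rc_redesign/#/layout/jcard/drugCard?accountId=XXXXXX&rcId=%d&searchType=R&reimbCode=&searchTerm=&searchTexts=*"

# sprintf() recycles the template across the integer vector,
# returning one URL per rcId
page_urls <- sprintf(base_url, 1:3)

# Inspect the URL generated for rcId = 2
page_urls[2]
```

Each element of `page_urls` could then be passed to remdr$navigate() inside the function given to map().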

For each rcId in 1:9999 I wish to grab the following 3 elements and store them in a data frame:

hcpcs_info <- remdr$findElement(using = 'class','is-jcard-heading')

hcpcs <- hcpcs_info$getElementText()[[1]]

hcpcs_description <- remdr$findElement(using = 'class','is-jcard-desc')

hcpcs_desc <- hcpcs_description$getElementText()[[1]]

tc_info <- remdr$findElement(using = 'css selector','.col-12.ng-star-inserted')

therapeutic_class <- tc_info$getElementText()[[1]]
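The three lookups above repeat the same find-then-read pattern, so it may help to wrap it in a small helper. A sketch only: `get_text` is a hypothetical helper, and `fake_remdr` is a stand-in object that mimics the two methods used here so the example runs without a browser. With a real session, `remdr` would be passed instead.

```r
# Hypothetical helper: find one element and return its text as a string
get_text <- function(driver, using, value) {
  element <- driver$findElement(using = using, value = value)
  element$getElementText()[[1]]
}

# Stand-in for the RSelenium client, for illustration only.
# It returns an "element" whose text simply echoes the selector.
fake_remdr <- list(
  findElement = function(using, value) {
    list(getElementText = function() list(paste("text for", value)))
  }
)

hcpcs <- get_text(fake_remdr, "css selector", ".is-jcard-heading")
hcpcs
```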

I have tried creating a separate function and passing it to map(), but I am not advanced enough to piece this together. Below is what I have tried.

my_function <- function(index) {
  remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*",index)
                 Sys.sleep(5)
                 hcpcs_info[index] <- remdr$findElement(using = 'class','is-jcard-heading')
                 hcpcs[index] <- hcpcs_info$getElementText()[index][[1]])
}

x <- 1:10 %>% 
map(~ my_function(.x))

Any help would be greatly appreciated.

Question from: https://stackoverflow.com/questions/66056336/using-purrrmap-to-loop-through-web-pages-for-scraping-with-rselenium

1 Reply

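In your function the remdr$navigate( call is never closed before Sys.sleep(), so the code does not parse, and assigning to hcpcs[index] inside the function only changes a local copy. Instead, return one row per page and let map_df() bind the rows together. Try the following: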

library(RSelenium)

# Assumes `remdr` is the logged-in client created in the question.
# map_df() calls the function once per rcId and row-binds the tibbles it returns.
result <- purrr::map_df(1:10, ~{
  # Navigate to the page for this rcId (.x)
  remdr$navigate(sprintf("https://rc2.reimbursementcodes.com/rc_redesign/#/layout/jcard/drugCard?accountId=113479&rcId=%d&searchType=R&reimbCode=*&searchTerm=*&searchTexts=*", .x))
  # Wait 5 secs for the page to load
  Sys.sleep(5)
  hcpcs_info <- remdr$findElement(using = 'class', 'is-jcard-heading')
  hcpcs <- hcpcs_info$getElementText()[[1]]
  hcpcs_description <- remdr$findElement(using = 'class', 'is-jcard-desc')
  hcpcs_desc <- hcpcs_description$getElementText()[[1]]
  tc_info <- remdr$findElement(using = 'css selector', '.col-12.ng-star-inserted')
  therapeutic_class <- tc_info$getElementText()[[1]]
  # One-row tibble per page; namespaced because tibble is not attached
  tibble::tibble(hcpcs, hcpcs_desc, therapeutic_class)
})
result
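With rcId running up to 9999, some pages may well be missing one of the elements, and a single error would stop the whole loop. One possible way around this is to wrap the scraping function in purrr::possibly(). The sketch below is an illustration under stated assumptions, not part of the original answer: `scrape_page` is a hypothetical stand-in that fakes the Selenium step (and deliberately fails for rcId 2) so the example runs without a browser. It also uses map() followed by list_rbind(), which supersedes map_df() in purrr 1.0.0 and later.

```r
library(purrr)  # list_rbind() needs purrr >= 1.0.0

# Hypothetical stand-in for the Selenium scraping step.
# It fails for rc_id 2 to mimic a page with a missing element.
scrape_page <- function(rc_id) {
  if (rc_id == 2) stop("no such element")
  data.frame(
    rc_id = rc_id,
    hcpcs = sprintf("J%04d", rc_id),
    stringsAsFactors = FALSE
  )
}

# possibly() returns `otherwise` (here NULL) instead of raising the error
safe_scrape <- possibly(scrape_page, otherwise = NULL)

# Scrape every page, then row-bind the results; NULL entries are ignored
pages <- map(1:3, safe_scrape)
result <- list_rbind(pages)
result
```

Here the failed page is simply skipped, so `result` holds rows for rcId 1 and 3 only. Returning a one-row data frame of NA values as `otherwise` would instead keep a placeholder row for each failed page.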
