
web scraping - How do I scrape / automatically download PDF files from a document search web interface in R?

I am using the R programming language for NLP (natural language processing) analysis - for this, I need to "web scrape" publicly available information from the internet.

Recently, I learned how to "web scrape" a single PDF file from the website I am using:

library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tibble)

#this is an example of a single pdf
url <- "https://www.canlii.org/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words <- article_words %>%
  anti_join(stop_words, by = "word")

#this final command can take some time to run
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)

#Sources: https://stackoverflow.com/questions/66979242/r-error-in-textrank-sentencesdata-article-sentences-terminology-article-w  ,  https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/

The above code works fine if you want to manually access a single page and then "web scrape" it. Now, I want to automatically download 10 such articles at the same time, without manually visiting each page. For instance, suppose I want to download the first 10 PDFs from this website: https://www.canlii.org/en/#search/type=decision&text=dog%20toronto

I found the following website, which discusses how to do something similar (I adapted its code for my example): https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199

library(tidyverse)
library(rvest)
library(stringr)

page <- read_html("https://www.canlii.org/en/#search/type=decision&text=dog%20toronto")

raw_list <- page %>% 
    html_nodes("a") %>%  
    html_attr("href") %>% 
    str_subset("\\.pdf") %>% 
    str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>% 
    map(read_html) %>% 
    map(html_node, "#raw-url") %>% 
    map(html_attr, "href") %>% 
    str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>% 
    walk2(., basename(.), download.file, mode = "wb") 

But this produces the following error:

Error in .f(.x[[1L]], .y[[1L]], ...) : scheme not supported in URL 'NA'

Can someone please show me what I am doing wrong? Is it possible to download the first 10 PDF files that appear on this website and save them individually in R as "pdf1", "pdf2", ..., "pdf9", "pdf10"?

Thanks


1 Reply


I see some people suggesting that you use RSelenium, which simulates browser actions so that the web server renders the page as if a human were visiting the site. In my experience it is almost never necessary to go down that route. The JavaScript part of the website talks to an API, and we can use that API to bypass the JavaScript entirely and get the raw JSON data directly.

In Firefox (and Chrome is similar in this regard), you can right-click on the website, select “Inspect Element (Q)”, go to the “Network” tab and click reload. After a few seconds or less, you’ll see every request the browser makes to the web server listed there. We are interested in the ones whose “Type” is json. When you right-click on an entry you can select “Open in New Tab”. One of the requests that returns JSON has the following URL attached to it: https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1

Opening that URL in Firefox gives you a GUI that lets you explore the JSON data structure, and you’ll see that there is a “results” entry containing the data for the first 25 results of your search. Each one has a “path” entry that leads to the page displaying the embedded PDF. It turns out that if you replace the “.html” part of that path with “.pdf”, the path leads directly to the PDF file. The code below makes use of all of this.

library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools)
#> Using poppler version 20.09.0
library(tidytext)
library(textrank)

base_url <- "https://www.canlii.org"

json_url_search_p1 <-
  "https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"

This downloads the JSON for page 1 (results 1 to 25):

results_p1 <-
  GET(json_url_search_p1, encode = "json") %>%
  content()

For each result, we extract only the path:

result_html_paths_p1 <-
  map_chr(results_p1$results,
          ~ .$path)

We replace “.html” with “.pdf” and combine the base URL with each path to generate the full URLs pointing to the PDFs. Finally, we pipe the result into purrr::map() and pdftools::pdf_text() in order to extract the text of all 25 PDFs.

pdf_texts_p1 <-
  gsub(".html$", ".pdf", result_html_paths_p1) %>%
  paste0(base_url, .) %>%
  map(pdf_text)
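The original question also asked about saving the first 10 PDFs to disk as "pdf1", "pdf2", and so on. Building on the same URL vector, a minimal sketch could look like this (my addition, not part of the original answer; it assumes the working directory is writable and reuses the file names from the question):

# Rebuild the full PDF URLs so they can be reused for saving to disk.
pdf_urls_p1 <-
  gsub(".html$", ".pdf", result_html_paths_p1) %>%
  paste0(base_url, .)

# Download the first 10 PDFs as pdf1.pdf ... pdf10.pdf (assumed file names).
walk2(pdf_urls_p1[1:10],
      paste0("pdf", 1:10, ".pdf"),
      ~ download.file(.x, destfile = .y, mode = "wb"))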

If you want to do this for more than just the first page, you might want to wrap the above code in a function that lets you switch out the “&page=” parameter. You could also make the “&text=” parameter an argument of the function in order to automatically scrape results for other searches; see the sketch below.
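A minimal sketch of such a wrapper, assuming the same ajaxSearch.do endpoint and the packages loaded above (the function name get_pdf_texts() and the URLencode() call are my own additions):

get_pdf_texts <- function(text, page = 1) {
  # Build the ajaxSearch.do URL with the search text and page number filled in.
  json_url <- paste0(
    "https://www.canlii.org/en/search/ajaxSearch.do?type=decision",
    "&text=", utils::URLencode(text, reserved = TRUE),
    "&page=", page
  )
  results <- content(GET(json_url))
  html_paths <- map_chr(results$results, ~ .$path)
  # Swap ".html" for ".pdf" and extract the text of each PDF.
  gsub(".html$", ".pdf", html_paths) %>%
    paste0(base_url, .) %>%
    map(pdf_text)
}

# For example: pdf_texts_p2 <- get_pdf_texts("dogs toronto", page = 2)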

For the remaining part of the task, we can build on the code you already have. We turn it into a function that can be applied to any article and then apply that function to each PDF text, again using purrr::map().

extract_article_summary <-
  function(article) {
    article_sentences <- tibble(text = article) %>%
      unnest_tokens(sentence, text, token = "sentences") %>%
      mutate(sentence_id = row_number()) %>%
      select(sentence_id, sentence)
    
    
    article_words <- article_sentences %>%
      unnest_tokens(word, sentence)
    
    
    article_words <- article_words %>%
      anti_join(stop_words, by = "word")
    
    textrank_sentences(data = article_sentences, terminology = article_words)
  }

This will now take a really long time!

article_summaries_p1 <- 
  map(pdf_texts_p1, extract_article_summary)

Alternatively, you could use furrr::future_map() to utilize all the CPU cores on your machine and speed up the process:

library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <- 
  future_map(pdf_texts_p1, extract_article_summary)

Disclaimer

The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents. The robots.txt explicitly disallows the /search path from being accessed by bots. It is therefore recommended to get in contact with the site owner before downloading large amounts of data. CanLII offers API access on an individual-request basis (see their documentation); that would be the correct and safest way to access their data.
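If you do decide to scrape anything, one way to check what robots.txt permits beforehand is the robotstxt package; this is a sketch I added, not part of the original answer:

library(robotstxt)  # install.packages("robotstxt") if it is not installed yet

# Check whether bots are allowed to request the /search path on canlii.org.
# For this site the result should be FALSE, in line with the disclaimer above.
paths_allowed(paths = "/search", domain = "www.canlii.org")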

