Well, I'll be darned. They don't use some of the more gnarly features of ASP.NET, so this is really straightforward. As I noted in a similar question on this site, there do not appear to be any restrictions on scraping: the site has no robots.txt, nor any terms/conditions that I could find.
library(httr)
library(rvest)
library(docxtractr) # for data frame cleaning helper utilities
library(tidyverse)
Let's get the first page:
httr::GET(
  url = "http://www.domainia.nl/quarantaine/2018/12/15"
) -> res
pg <- httr::content(res)
Now, we'll extract the table:
html_node(pg, xpath = ".//table[contains(., 'Domein')]") %>%
  html_table(fill=TRUE, trim=TRUE) %>%
  select(2:6) %>% # The table is full of junk so we trim it off
  docxtractr::assign_colnames(3) %>% # The column headers are in row 3
  docxtractr::mcga() %>% # Make the column names great again
  tbl_df() -> pg_one
Assign it to a list that we'll be adding to:
pgs <- list(pg01 = pg_one)
Now, go over the remaining tabs. (If there are more than 10, you can do the extra bit required to go past 10 by extracting the pagination row and getting the max/last td.)
Inside the loop, we extract the view state fields, set up the other POST body parameters, and increment the page we're getting. We issue the POST, extract the new table into the list, and lather/rinse/repeat for the remaining pages:
for (pg_num in 2:10) {

  Sys.sleep(5) # be kind since you don't own the server or pay for the bandwidth

  # collect the hidden form fields (view state, etc.) as a named list
  hinputs <- html_nodes(pg, "input[type='hidden']")
  hinputs <- as.list(setNames(html_attr(hinputs, "value"), html_attr(hinputs, "name")))

  # the other POST body parameters, including the page we want
  hinputs$`ctl00$tbSearch` <- ""
  hinputs$`ctl00$ddlState` <- "quarantaine"
  hinputs$`__EVENTTARGET` <- "ctl00$ContentPlaceHolder1$gvDomain"
  hinputs$`__EVENTARGUMENT` <- sprintf("Page$%s", pg_num)

  httr::POST(
    url = "http://www.domainia.nl/quarantaine/2018/12/15",
    encode = "form",
    body = hinputs
  ) -> res

  httr::content(res) %>%
    html_node(xpath = ".//table[contains(., 'Domein')]") %>%
    html_table(fill=TRUE, trim=TRUE) %>%
    select(2:6) %>%
    docxtractr::assign_colnames(3) %>%
    docxtractr::mcga() %>%
    tbl_df() -> pgs[[sprintf("pg%02d", pg_num)]] # assign it to a new named list entry (pg02, pg03, ...)

}
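The "extra bit" for going past 10 pages could look roughly like the sketch below. I have not run it against the live site: the pager markup in `pager_html` is a made-up stand-in for what an ASP.NET grid pager usually emits, and `last_page()` is a hypothetical helper name, so check the real pagination row in your browser first. It pulls the `Page$N` arguments out of the pager links and takes the largest one, which is only the highest page reachable from the current view if the pager shows a window of pages.

```r
library(rvest)

# ASSUMPTION: hypothetical pager markup, not copied from the real site.
pager_html <- '
<table>
  <tr><td>Domein</td></tr>
  <tr><td>
    <table><tr>
      <td><span>1</span></td>
      <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$gvDomain&#39;,&#39;Page$2&#39;)">2</a></td>
      <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$gvDomain&#39;,&#39;Page$3&#39;)">3</a></td>
    </tr></table>
  </td></tr>
</table>'

# Hypothetical helper: largest page number referenced by any pager link,
# or 1 if the document has no pager links at all.
last_page <- function(doc) {
  hrefs <- html_attr(html_nodes(doc, xpath = ".//a[contains(@href, 'Page$')]"), "href")
  nums <- as.integer(sub(".*Page\\$([0-9]+).*", "\\1", hrefs))
  if (length(nums) == 0) 1L else max(nums)
}

last_page(read_html(pager_html))
```

With something like this you could replace the hard-coded `2:10` with `2:last_page(pg)`, re-checking the pager after each response if it only shows ten pages at a time.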
Finally, combine all those rows:
bind_rows(pgs)
## # A tibble: 954 x 5
## domein status archive geregistreerd_op uit_quarantaine
## <chr> <chr> <chr> <chr> <chr>
## 1 0172design.nl quarantaine 0 "" 15-12-2018
## 2 0172designs.nl quarantaine 0 "" 15-12-2018
## 3 0172kleding.nl quarantaine 0 "" 15-12-2018
## 4 0172online.nl quarantaine 0 "" 15-12-2018
## 5 123shows.nl quarantaine 0 "" 15-12-2018
## 6 123story.nl quarantaine 0 "" 15-12-2018
## 7 21018dagen.nl quarantaine 0 "" 15-12-2018
## 8 22academy.nl quarantaine 0 "" 15-12-2018
## 9 22aviationcampus.nl quarantaine 0 "" 15-12-2018
## 10 22campus.nl quarantaine 0 "" 15-12-2018
## # ... with 944 more rows