
Scraping multiple webpages with further nested pages with R

I am a new R user. I have been trying to find a solution to my problem but haven't found one that quite fits, which is surely my fault. I want to scrape this website and put the result in an .xlsx worksheet: "http://www.tbca.net.br/base-dados/composicao_estatistica.php?pagina=1&atuald=1". Basically, I am interested in the six variables shown in the first row of the table (codigo, nome, nome inglés, etc.) for all 53 pages of the dataset. Each of these variables contains a link to a nested page whose variables (componente, unidade, etc.) I need to scrape as well, so that I end up with a table like this:

codigo   nome  nome_inglés  nome_cientifico  grupo  marca  componente   unidade
C105      bla    blabla          blabla195    aq      awa    Energia      11    
C105      bla    blabla          blabla195    aq      awa    carboidrato  45
C105      bla    blabla          blabla195    aq      awa    proteina     22
C106      blu    blublu          blublu196    ar      owo    Energia      22    
C106      blu    blublu          blublu196    ar      owo    carboidrato  33
C106      blu    blublu          blublu196    ar      owo    proteina     44

I have made various attempts, but none of them works.

Here's my code:

library(rvest)
library(dplyr)
library(data.table)
library(tidyverse)
library(stringr)

get_tbca = function(tbca_link) {
  tbca_page = read_html(tbca_link)
  tbca_data = tbca_page %>% html_nodes("tr :nth-child(1)") %>%
    html_text()
  return(tbca_data)
}


tbca_df <- data.frame()

lupin_fun <- function(page_result) {

  print(paste("Page:", page_result))

  link = paste0("http://www.tbca.net.br/base-dados/composicao_estatistica.php?pagina=",
                page_result, "&atuald=1")
  page = read_html(link)

  codigo = page %>% html_nodes("td:nth-child(1) a") %>% html_text()
  codigo_links <- page %>% html_nodes("td:nth-child(1)") %>%
    html_attr("href") %>%
    paste("http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=", ., sep = "")
  nome = page %>% html_nodes("td:nth-child(2) a") %>% html_text()
  nome_ingles = page %>% html_nodes("td:nth-child(3) a") %>% html_text()
  nome_cientifico = page %>% html_nodes("td:nth-child(4) a") %>% html_text()
  grupo = page %>% html_nodes("td:nth-child(5) a") %>% html_text()
  marca = page %>% html_nodes("td:nth-child(6) a") %>% html_text()
  tbca_reference = sapply(codigo_links, FUN = get_tbca, USE.NAMES = FALSE)

  tbca_df <- cbind(tbca_reference, codigo, nome, nome_ingles, nome_cientifico, grupo, marca, stringsAsFactors = FALSE)

  return(tbca_df)
}


lupin_list <- lapply(1:3, lupin_fun)

lupin_result <- do.call(rbind, lupin_list)
Question from: https://stackoverflow.com/questions/65883727/scraping-multiple-webpages-with-further-nested-pages-with-r


1 Reply


I think you were on the right track. Two issues that I could see:

  1. Using html_table() is much easier in this case: you get the table directly as a data frame, instead of collecting the cells/columns and then binding everything together.
  2. One of the issues I found was in codigo_links. You need to get the <a> tags inside each <td> before extracting the href attribute; I fixed this part in my solution (see the short snippet below).
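
For example, here is a minimal sketch of the link-extraction fix on its own (same selectors and base URL as in the full solution below):

library(rvest)

page <- read_html("http://www.tbca.net.br/base-dados/composicao_estatistica.php?pagina=1&atuald=1")

# The original code selected the <td> cells, which carry no href attribute,
# so html_attr("href") returned NA. Select the <a> tags inside them instead:
links <- page %>%
  html_nodes("td:nth-child(1)") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  paste0("http://www.tbca.net.br/base-dados/", .)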

This is the way I did it:

library(rvest)
library(dplyr)

get.table.in.link <- function(url1) {
  # get code of food from link
  cod_produto <- strsplit(url1, 'cod_produto=')[[1]][2]

  # get table in nested link 
  table.2 <- read_html(url1) %>% html_table() %>% .[[1]]

  table.3 <- table.2 %>% 
    # filter only Energia, carboidrato, proteina (if you want all rows you can ignore this)
    dplyr::filter(Componente %in% c('Energia', 'Carboidrato total', 'Proteína')) %>%
    # Also choosing subset of columns (you can also change this)
    dplyr::select(Componente, Unidades, `Valor por 100 g`) %>%
    # add column with product code
    dplyr::mutate(Código=cod_produto) %>%
    # change decimal separator and convert to numeric
    dplyr::mutate(`Valor por 100 g` = as.numeric(gsub(',', '.', gsub('\\.', '', `Valor por 100 g`))))
  
  return(table.3)
}

get.main.table <- function(page.number) {
  print(paste("Page:", page.number))
  
  url.main <- paste0("http://www.tbca.net.br/base-dados/composicao_estatistica.php?pagina=", page.number, "&atuald=1")
 
  page <- read_html(url.main)
  
  # this is simpler to get the main table
  df.table <- page %>% html_table() %>% .[[1]]
  
  # now get list of links in each row (get from first column)
  list.links <- page %>%  html_nodes("td:nth-child(1)") %>% html_nodes('a') %>%
    html_attr("href") %>% paste("http://www.tbca.net.br/base-dados/", ., sep = "")
  
  # get table with details of each product
  # ldply applies function for each element of list.links, then combine results into a data frame
  table.composicao <- plyr::ldply(list.links, get.table.in.link)
  
  # now merge df.table and table.composicao using "Código"   
  df.final <- df.table %>% left_join(table.composicao, by="Código")  
  
  return(df.final)
}

# run get.main.table with arguments = 1, 2, 3 and combine results in a dataframe
df.total <- plyr::ldply(1:3, get.main.table)

The result (even with only 3 pages loaded) is a big table, so I could not check all of it, but it looked correct.
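
To cover all 53 pages mentioned in the question and write the result to an .xlsx worksheet, a minimal sketch (assuming the writexl package is installed; the file name is just an example):

# run for all 53 pages of the dataset instead of only the first three
df.total <- plyr::ldply(1:53, get.main.table)

# write the combined table to an Excel worksheet
writexl::write_xlsx(df.total, "tbca_composicao.xlsx")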

