I think you were on the right track. Two issues that I could see:
- I think using html_table is much easier in this case. You get the table directly as a data frame, instead of getting the cells/columns and then binding everything together.
- One of the issues I found was in codigo_links. You need to select the <a> nodes inside each <td> before extracting the href attribute, because the href belongs to the <a> tag and not to the <td> itself. I fixed this part in my solution.
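To illustrate the second point, here is a minimal sketch on a made-up HTML snippet. The table contents and the detail.php link pattern are hypothetical placeholders, not the real page; only the rvest calls are the ones used in the solution.

```r
library(rvest)

# Hypothetical two-row table standing in for the real page
page <- minimal_html('
  <table>
    <tr><td><a href="detail.php?cod_produto=X0001">X0001</a></td><td>Food A</td></tr>
    <tr><td><a href="detail.php?cod_produto=X0002">X0002</a></td><td>Food B</td></tr>
  </table>')

first.col <- page %>% html_nodes("td:nth-child(1)")

# Asking the <td> itself for href finds nothing: the attribute lives on <a>
td.href <- first.col %>% html_attr("href")

# Selecting the <a> nodes inside each <td> first gives the actual links
links <- first.col %>% html_nodes("a") %>% html_attr("href")

# The product code can then be split off the link
codes <- sapply(strsplit(links, "cod_produto="), `[`, 2)
```

Here td.href contains only NA values, while links holds the two href strings, which is why the extra html_nodes('a') step is needed.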
This is the way I did it:
library(rvest)
library(dplyr)
get.table.in.link <- function(url1) {
  # get code of food from link
  cod_produto <- strsplit(url1, 'cod_produto=')[[1]][2]
  # get table in nested link
  table.2 <- read_html(url1) %>% html_table() %>% .[[1]]
  table.3 <- table.2 %>%
    # filter only Energia, carboidrato, proteina (if you want all rows you can ignore this)
    dplyr::filter(Componente %in% c('Energia', 'Carboidrato total', 'Proteína')) %>%
    # Also choosing subset of columns (you can also change this)
    dplyr::select(Componente, Unidades, `Valor por 100 g`) %>%
    # add column with product code
    dplyr::mutate(Código = cod_produto) %>%
    # remove thousands separators ('\\.' matches a literal dot), then
    # change the decimal comma to a dot and convert to numeric
    dplyr::mutate(`Valor por 100 g` = as.numeric(gsub(',', '.', gsub('\\.', '', `Valor por 100 g`))))
  return(table.3)
}
get.main.table <- function(page.number) {
  print(paste("Page:", page.number))
  url.main <- paste0("http://www.tbca.net.br/base-dados/composicao_estatistica.php?pagina=", page.number, "&atuald=1")
  page <- read_html(url.main)
  # this is simpler to get the main table
  df.table <- page %>% html_table() %>% .[[1]]
  # now get list of links in each row (get from first column)
  list.links <- page %>% html_nodes("td:nth-child(1)") %>% html_nodes('a') %>%
    html_attr("href") %>% paste("http://www.tbca.net.br/base-dados/", ., sep = "")
  # get table with details of each product
  # ldply applies function for each element of list.links, then combine results into a data frame
  table.composicao <- plyr::ldply(list.links, get.table.in.link)
  # now merge df.table and table.composicao using "Código"
  df.final <- df.table %>% left_join(table.composicao, by = "Código")
  return(df.final)
}
# run get.main.table with arguments = 1, 2, 3 and combine results in a dataframe
df.total <- plyr::ldply(1:3, get.main.table)
The result is a big table, even when loading only 3 pages, so I could not look at all of it and cannot be sure it is entirely correct. The part I did look at seemed OK.
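One part that can be checked in isolation is the number conversion in get.table.in.link. The sketch below uses made-up values in the Brazilian number format (dot for thousands, comma for decimals); the vector raw is a hypothetical example, not data from the site. It uses fixed = TRUE as an alternative to escaping the dot, since an unescaped "." in a regular expression matches every character.

```r
# Hypothetical values formatted like the "Valor por 100 g" column
raw <- c("1.234,5", "12,30", "0,7")

# Step 1: drop the thousands separator (fixed = TRUE treats "." literally)
no.thousands <- gsub(".", "", raw, fixed = TRUE)

# Step 2: turn the decimal comma into a dot and convert to numeric
values <- as.numeric(gsub(",", ".", no.thousands, fixed = TRUE))

# Pitfall: without fixed = TRUE or escaping, "." is a regex wildcard
# and every character of the string gets removed
wiped <- gsub(".", "", "1.234,5")
```

With these inputs, values is numeric and wiped is an empty string, which shows why the dot has to be matched literally before converting.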