Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

r - html_table doesn't work with a long row

I am trying to extract the table from the page below using html_table from rvest. However, the first row of the table is a long block of text, and it apparently conflicts with html_table. Here is my code:

# Libraries
library(rvest)
library(XML)

url <- "http://www.svs.cl/institucional/mercados/consulta.php?mercado=V&Estado=VI&entidad=RVEMI" # page
page <- read_html(url)                      # parse the page
tables <- html_nodes(page, "table")         # select the table nodes
tables <- html_table(tables, fill = TRUE)   # convert to data frames (this call fails)

And the error is:

Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In lapply(ncols, as.integer) : NAs introduced by coercion

Maybe it could be written using html_text, but I need it in table format.

Any help is appreciated

1 Reply

The problem isn't the size of the table but the extremely gnarly nodes in its first two rows, which trip up html_table.

So, just edit out the problem nodes.

xml2 now supports a much wider array of libxml2 operations, including removing nodes with xml_remove:

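library(xml2)      # provides xml_remove(); attach it explicitly in case rvest does not
library(rvest)
library(tidyverse)

pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=V&Estado=VI&entidad=RVEMI")

# xml_remove() edits the document in place, so after the first call the
# former second row has become tr[1] and the second call removes it.
xml_remove(html_nodes(pg, xpath=".//table/tr[1]"))
xml_remove(html_nodes(pg, xpath=".//table/tr[1]"))

# With the two problem rows gone, html_table() parses the rest normally.
html_nodes(pg, xpath=".//table") %>% 
  html_table() %>% 
  .[[1]] %>% 
  as_tibble()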

## # A tibble: 368 × 3
##            X1                                                   X2    X3
##         <chr>                                                <chr> <chr>
## 1  76675290-K                                       AD RETAIL S.A.    VI
## 2  98000000-1  ADMINISTRADORA  DE FONDOS DE PENSIONES CAPITAL S.A.    VI
## 3  98000100-8  ADMINISTRADORA  DE FONDOS DE PENSIONES HABITAT S.A.    VI
## 4  76240079-0    ADMINISTRADORA DE FONDOS DE PENSIONES CUPRUM S.A.    VI
## 5  76762250-3    ADMINISTRADORA DE FONDOS DE PENSIONES MODELO S.A.    VI
## 6  98001200-K ADMINISTRADORA DE FONDOS DE PENSIONES PLANVITAL S.A.    VI
## 7  76265736-8   ADMINISTRADORA DE FONDOS DE PENSIONES PROVIDA S.A.    VI
## 8  94272000-9                                       AES GENER S.A.    VI
## 9  96566940-K                            AGENCIAS UNIVERSALES S.A.    VI
## 10 91253000-0                        AGRICOLA NACIONAL S.A.C. E I.    VI
## # ... with 358 more rows
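
If the page is unreachable, the same idea can be tried on a small inline document. This is only a sketch: the HTML string below is a made-up stand-in for the real page (two non-data rows followed by data rows), and the variable names `html` and `tbl` are just for illustration.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the real page: two junk rows, then two data rows.
html <- paste0(
  "<table>",
  "<tr><td colspan='3'>Long descriptive text that is not data</td></tr>",
  "<tr><td colspan='3'>Another header-like row</td></tr>",
  "<tr><td>76675290-K</td><td>AD RETAIL S.A.</td><td>VI</td></tr>",
  "<tr><td>94272000-9</td><td>AES GENER S.A.</td><td>VI</td></tr>",
  "</table>"
)

pg <- read_html(html)

# Remove the first row twice; the document is modified in place,
# so the second call hits what used to be the second row.
xml_remove(html_nodes(pg, xpath = ".//table/tr[1]"))
xml_remove(html_nodes(pg, xpath = ".//table/tr[1]"))

# Only the data rows are left for html_table() to parse.
tbl <- html_table(html_nodes(pg, xpath = ".//table"))[[1]]
print(tbl)
```

Because the edit happens on the parsed document rather than on the resulting data frame, html_table never sees the offending rows at all.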

Note that you can also do:

xml_remove(html_nodes(pg, xpath=".//table/tr[position() >= 1 and position() <=2]"))

instead of the two remove operations, but it's almost as verbose and there's no real performance gain here.
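
A quick way to convince yourself that the two approaches leave the same rows behind is to run both on the same input and compare. The sketch below uses a made-up inline table instead of the real page, and shortens the predicate to `position() <= 2`, which is equivalent because `position() >= 1` is always true in XPath.

```r
library(xml2)
library(rvest)

# Hypothetical three-row table: two rows to drop, one row to keep.
html <- paste0(
  "<table>",
  "<tr><td>junk row 1</td></tr>",
  "<tr><td>junk row 2</td></tr>",
  "<tr><td>76675290-K</td><td>AD RETAIL S.A.</td><td>VI</td></tr>",
  "</table>"
)

# Approach 1: one XPath that selects the first two rows together.
pg_one <- read_html(html)
xml_remove(html_nodes(pg_one, xpath = ".//table/tr[position() <= 2]"))

# Approach 2: remove the first row twice, as in the answer's main code.
pg_two <- read_html(html)
xml_remove(html_nodes(pg_two, xpath = ".//table/tr[1]"))
xml_remove(html_nodes(pg_two, xpath = ".//table/tr[1]"))

# Compare the text of whatever rows remain in each document.
rows_one <- html_text(html_nodes(pg_one, xpath = ".//table/tr"))
rows_two <- html_text(html_nodes(pg_two, xpath = ".//table/tr"))
same_result <- identical(rows_one, rows_two)
print(same_result)
```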

