Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
728 views
in Technique[技术] by (71.8m points)

r - How to scrape a table with rvest and xpath?

using the following documentation i have been trying to scrape a series of tables from marketwatch.com

here is the one represented by the code bellow:

enter image description here

The link and xpath are already included in the code:

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
  html_table()
valuation <- valuation[[1]]

I get the following error:

Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated") 

Thanks in advance.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

That website doesn't use an html table, so html_table() can't find anything. It actaully uses div classes column and data lastcolumn.

So you can do something like

url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="column"]')
    
valuation_data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="data lastcolumn"]')

Or even

url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="section"]')

To get you most of the way there.

Please also read their terms of use - particularly 3.4.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...