Welcome to OGeek Q&A Community for programmers and developers - Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Save web page content as a data.frame

I want to grab content from the URL below, where the original data comes as simple rows and columns. I tried readHTMLTable and it obviously isn't working. Using web scraping with XPath, how can I get clean data without the '...' truncation and keep it as a data.frame? Is this possible without saving to CSV? Kindly help me improve my code. Thank you.

library(rvest)
library(dplyr)
page <- read_html("http://weather.uwyo.edu/cgi-bin/sounding?region=seasia&TYPE=TEXT%3ALIST&YEAR=2006&MONTH=09&FROM=0100&TO=0100&STNM=48657")

xpath <- '/html/body/pre[1]'
txt <- page %>% html_node(xpath=xpath) %>% html_text()
txt

[1] "
-----------------------------------------------------------------------------
   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV
    hPa     m      C      C      %    g/kg    deg   knot     K      K      K 
-----------------------------------------------------------------------------
 1009.0     16   23.8   22.7     94  17.56    170      2  296.2  346.9  299.3
 1002.0     78   24.6   21.6     83  16.51    252      4  297.6  345.6  300.5
 1000.0     96   24.4   21.3     83  16.23    275      4  297.6  344.8  300.4
  962.0    434   22.9   20.0     84  15.56    235     10  299.4  345.0  302.1
  925.0    777   21.4   18.7     85  14.90    245     11  301.2  345.2  303.9
  887.0   1142   20.3   16.0     76  13.04    255     15  303.7  342.7  306.1
  850.0   1512   19.2   13.2     68  11.34    230     17  306.2  340.6  308.3
  839.0   1624   18.8   11.8     64  10.47    225     17  307.0  338.8  308.9
  828.0   1735   18.0   11.4     65  10.33   ... <truncated>

1 Reply


We can extend your base code and treat the web page as an API endpoint since it takes parameters:

library(httr)
library(rvest)

I use more packages than the two above (referenced via ::) but I don't want to pollute the namespace.

I'd usually end up writing a small, parameterized function (or a small package with a couple of parameterized functions) to encapsulate the logic below.

httr::GET(
  url = "http://weather.uwyo.edu/cgi-bin/sounding",
  query = list(
    region = "seasia",
    TYPE = "TEXT:LIST",
    YEAR = "2006",
    MONTH = "09",
    FROM = "0100",
    TO = "0100",
    STNM = "48657"
  )
) -> res

^^ makes the web page request and gathers the response.
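As an aside, the small parameterized function I mentioned might start like this. make_sounding_url() is a hypothetical name of my own, and it only assembles the query URL; you would still hand the result to httr::GET() or xml2::read_html():

```r
# Hypothetical helper: builds the sounding request URL for a station/date.
# Parameter names mirror the query string used above.
make_sounding_url <- function(stnm, year, month, from, to,
                              region = "seasia", type = "TEXT:LIST") {
  base <- "http://weather.uwyo.edu/cgi-bin/sounding"
  q <- c(region = region, TYPE = type, YEAR = year, MONTH = month,
         FROM = from, TO = to, STNM = stnm)
  enc <- vapply(q, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(q), enc, sep = "=", collapse = "&"))
}

make_sounding_url("48657", "2006", "09", "0100", "0100")
```

This keeps the station/date parameters in one place instead of scattering them through hard-coded URLs.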

httr::content(res, as="parsed") %>%
  html_nodes("pre") -> wx_dat

^^ turns the response into an html_document and selects all the <pre> nodes.

Now, we extract the readings:

html_text(wx_dat[[1]]) %>%           # turn the first <pre> node into text
  strsplit("\n") %>%                 # split it into lines
  unlist() %>%                       # turn it back into a character vector
  { col_names <<- .[3]; . } %>%      # pull out the column names (we'll use them later)
  .[-(1:5)] %>%                      # strip off the header
  paste0(collapse="\n") -> readings  # turn it back into a big text blob

^^ cleaned up the table, and we'll use readr::read_table() to parse it. We'll also turn the extracted column names into the actual column names:

readr::read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names), " +"))))
## # A tibble: 106 x 11
##     pres  hght  temp  dwpt  relh  mixr  drct  sknt  thta  thte  thtv
##    <dbl> <int> <dbl> <dbl> <int> <dbl> <int> <int> <dbl> <dbl> <dbl>
##  1  1009    16  23.8  22.7    94 17.6    170     2  296.  347.  299.
##  2  1002    78  24.6  21.6    83 16.5    252     4  298.  346.  300.
##  3  1000    96  24.4  21.3    83 16.2    275     4  298.  345.  300.
##  4   962   434  22.9  20      84 15.6    235    10  299.  345   302.
##  5   925   777  21.4  18.7    85 14.9    245    11  301.  345.  304.
##  6   887  1142  20.3  16      76 13.0    255    15  304.  343.  306.
##  7   850  1512  19.2  13.2    68 11.3    230    17  306.  341.  308.
##  8   839  1624  18.8  11.8    64 10.5    225    17  307   339.  309.
##  9   828  1735  18    11.4    65 10.3    220    17  307.  339.  309.
## 10   789  2142  15.1  10      72  9.84   205    16  308.  339.  310.
## # ... with 96 more rows
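For what it's worth, the col_names expression above is just header-line wrangling: trim the edges, split on runs of spaces, lowercase. On a made-up fragment of the header line it does this:

```r
hdr <- "   PRES   HGHT   TEMP   DWPT"   # made-up slice of the header row
tolower(unlist(strsplit(trimws(hdr), " +")))
## [1] "pres" "hght" "temp" "dwpt"
```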

You didn't say you wanted the station metadata, but we can get that too (it's in the second <pre>):

html_text(wx_dat[[2]]) %>%
  strsplit("\n") %>%
  unlist() %>%
  trimws() %>%       # get rid of whitespace
  .[-1] %>%          # blank line removal
  strsplit(": ") %>% # separate field and value
  lapply(function(x) setNames(as.list(x), c("measure", "value"))) %>% # make each pair a named list
  dplyr::bind_rows() -> metadata # turn it into a data frame

metadata
## # A tibble: 30 x 2
##    measure                                 value      
##    <chr>                                   <chr>      
##  1 Station identifier                      WMKD       
##  2 Station number                          48657      
##  3 Observation time                        060901/0000
##  4 Station latitude                        3.78       
##  5 Station longitude                       103.21     
##  6 Station elevation                       16.0       
##  7 Showalter index                         0.34       
##  8 Lifted index                            -1.40      
##  9 LIFT computed using virtual temperature -1.63      
## 10 SWEAT index                             195.39     
## # ... with 20 more rows
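If you'd rather avoid the pipe for that field/value split, the same pairing works in base R; the two sample lines below are made up to match the metadata format:

```r
# Sample lines in the same "field: value" shape as the second <pre>
meta_lines <- c("Station identifier: WMKD", "Station number: 48657")
pairs <- strsplit(meta_lines, ": ", fixed = TRUE)   # split each line on ": "
metadata_df <- do.call(rbind, lapply(pairs, function(x)
  data.frame(measure = x[1], value = x[2])))        # one row per field
metadata_df$value
## [1] "WMKD"  "48657"
```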
