Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

html - Querying data from a website using R and GET function

I am very new to web scraping and I need to download data that appears a couple of clicks after making a query. That is, I need to fill in two fields on the first page, click on a result shown in bold, then find a table of data in upper case and download it.

I started by using the GET function and passing the required names as a list to the "query" argument. However, although I am a long-time R user, I cannot even decipher the error I got.

GET("http://apps.kew.org/wcsp/advsearch.do;jsessionid=15925570A99B794122939889DE7DCDBC",path = "search", query =list(Genus="Imperata",Species="cylindrica"))


Response [http://apps.kew.org/search;jsessionid=15925570A99B794122939889DE7DCDBC?Genus=Imperata&Species=cylindrica]

Date: 2016-04-18 18:29
Status: 404
Content-Type: text/html; charset=iso-8859-1
Size: 445 B


404 Not Found

Not Found

The requested URL /search;jsessionid=15925570A99B794122939889DE7DCDBC was not found ...

Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.

Apache/2.2.3 (Red Hat) Server at apps.kew.org Port 80


1 Reply

It is probably not working because the search form submits a POST request rather than a GET request (my curlconverter package can help with these "hidden" APIs, by the way):

library(httr)
library(rvest)

res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do", 
           body = list(page = "advancedSearch", 
                       AttachmentExist = "", 
                       family = "", 
                       placeOfPub = "", 
                       genus = "Imperata", 
                       yearPublished = "", 
                       species = "cylindrica", 
                       author = "", 
                       infraRank = "", 
                       infraEpithet = "", 
                       selectedLevel = "cont"), 
           encode = "form") 


# Parse the response body as HTML
pg <- content(res, as="parsed")

# Each search hit is a link with the class "onwardnav"
html_text(html_nodes(pg, "a.onwardnav"))

##  [1] "Imperata cylindrica (L.) P.Beauv., Ess. Agrostogr.: 165 (1812)."                                                
##  [2] "Imperata cylindrica var. africana (Andersson) C.E.Hubb., Joint Publ. Imp. Agric. Bur. 7: 10 (1944)."            
##  [3] "Imperata cylindrica var. condensata (Steud.) Hack., Anales Mus. Nac. Hist. Nat. Buenos Aires 21: 9 (1911)."     
##  [4] "Imperata cylindrica var. europaea (Andersson) Asch. & Graebn., Syn. Mitteleur. Fl. 2(1): 37 (1898)."            
##  [5] "Imperata cylindrica subsp. koenigii (Retz.) Masamura & Yanagih., Trans. Nat. Hist. Soc. Formosa 31: 326 (1941)."
##  [6] "Imperata cylindrica subvar. koenigii (Retz.) T.Durand & Schinz, Consp. Fl. Afric. 5: 694 (1894)."               
##  [7] "Imperata cylindrica var. koenigii (Retz.) Pilg., Fragm. Fl. Philipp. 1: 137 (1904)."                            
##  [8] "Imperata cylindrica var. latifolia (Hook.f.) C.E.Hubb., Joint Publ. Imp. Agric. Bur. 7: 14 (1944)."             
##  [9] "Imperata cylindrica var. major (Nees) C.E.Hubb., Grasses Mauritius: 96 (1940)."                                 
## [10] "Imperata cylindrica var. mexicana (Rupr. ex Galeotti) D.B.Ward, Novon 14: 368 (2004)."                          
## [11] "Imperata cylindrica f. pallida Honda, J. Fac. Sci. Univ. Tokyo, Sect. 3, Bot. 3: 374 (1930)."                   
## [12] "Imperata cylindrica var. parviflora Batt. & Trab., Bull. Soc. Bot. France 53: 32 (1906)."                       
## [13] "Imperata cylindrica var. pedicellata (Steud.) Debeaux, Actes Soc. Linn. Bordeaux 32: 52 (1878)."                
## [14] "Imperata cylindrica var. thunbergii (Retz.) T.Durand & Schinz, Consp. Fl. Afric. 5: 693 (1894), nom. superfl."  

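# href of each search hit, in the same order as the names above
lnks <- html_attr(html_nodes(pg, "a.onwardnav"), "href")

# Fetch and parse the detail page for the first hit
res2 <- GET(sprintf("http://apps.kew.org%s", lnks[1]))
pg2 <- content(res2, as="parsed")

# Take the cell that follows each header cell and collapse runs of whitespace
trimws(gsub("[[:space:]]+", " ", html_text(html_nodes(pg2, "th + td"))))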

## [1] "Medit. to Africa and Afghanistan 12 BAL COR FRA POR SAR SPA 13 ALB BUL GRC ITA KRI SIC TUE YUG 20 ALG EGY LBY MOR TUN 21 CNY CVI MDR 22 BEN BKN GAM GHA GNB GUI IVO LBR MLI NGA NGR SEN SIE TOG 23 BUR CAF CMN CON EQG GAB GGI RWA ZAI 24 CHA ETH SOC SUD 25 KEN TAN UGA 26 ANG MLW MOZ ZAM ZIM 27 BOT CPP LES NAM NAT OFS SWZ TVL 29 COM MAU MDG (32) kaz kgz tkm tzk uzb (33) ncs tcs 34 AFG CYP EAI IRN IRQ LBS PAL SIN TUR 35 KUW OMA? SAU YEM (36) chc chh chi chm chn chs cht chx (38) jap kor nns oga tai (40) ass ban ehm ind nep pak srl whm (41) and cbd lao mya ncb scs tha vie (42) bor cki jaw lsi mly mol phi sul sum xms (43) bis nwg sol (50) nfk (51) nzn (60) fij nwc sam ton van wal (62) mrn (73) ore (77) tex (78) ala fla geo lou msi sca vrg (79) mxs mxt"
## [2] "Hemicr. or rhizome geophyte"    
## [3] "Poaceae"                                     
## [4] "W.D.Clayton, R.Govaerts, K.T.Harman, H.Williamson & M.Vorontsova"
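
Since you only want the entries in upper case, one possible follow-up is to pull the three-letter upper-case codes out of the first element. This is only a sketch in base R: `distribution` is a shortened copy of the string above that I typed in for illustration, `upper_codes` is a name I made up, and I am assuming the codes you want are always exactly three upper-case letters.

```r
# Shortened copy of element [1] above, for illustration only.
# In practice this would be the first element of the trimws(...) result.
distribution <- "Medit. to Africa and Afghanistan 12 BAL COR FRA POR SAR SPA 13 ALB BUL GRC (32) kaz kgz 35 KUW OMA? SAU YEM"

# Match whole words made of exactly three upper-case letters.
# perl = TRUE gives predictable handling of \b and [[:upper:]].
matches <- gregexpr("\\b[[:upper:]]{3}\\b", distribution, perl = TRUE)

# regmatches() returns a list with one element per input string.
upper_codes <- regmatches(distribution, matches)[[1]]
upper_codes
```

The lower-case codes and the numbers are skipped, and a trailing "?" (as in "OMA?") is dropped because it is not part of the word. If you need to keep that marker, the pattern would have to be adjusted.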
