
r - Use RCurl to bypass a "disclaimer page", then do the web scraping

I have a link like this one from which I would like to extract data using RCurl. There is a disclaimer page in front of it, and I need to click through it in my browser before I can access the data. Previously I used the script below, which is from here, to "bypass" the disclaimer page and access the data using RCurl:

 library(RCurl)
 library(XML)

 # Naming a cookie file (it need not exist) turns on cURL's cookie engine;
 # followlocation = TRUE follows any redirect the site issues.
 pagesource <- getURL(url, .opts = curlOptions(followlocation = TRUE, cookiefile = "nosuchfile"))
 doc <- htmlParse(pagesource)

It worked before, but in the last few days it no longer works. Actually, I don't have much idea of what the code is doing. I wonder if I have to change something in the curlOptions, or rewrite the whole piece of code?

Thanks.


1 Reply


As I mentioned in my comment, the solution to your problem will depend entirely on the implementation of the "disclaimer page." It looks like the previous solution used cURL options described in more detail here. Basically, it instructs cURL to provide a fake cookie file (named "nosuchfile") and then follow the header redirect given by the site you were trying to access. Apparently that site was set up in such a way that if a visitor claimed not to have the proper cookies, it would immediately redirect the visitor past the disclaimer page.

You didn't happen to create a file named "nosuchfile" in your working directory, did you? If not, it sounds like the target site has changed the way its disclaimer page operates. If that's the case, there's really no help we can provide unless we have the actual page you're trying to access so we can diagnose it.
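
One check you can run yourself is to repeat the request with verbose = TRUE, which makes cURL print the headers it sends and receives, so you can see whether the site still issues the redirect past the disclaimer. A minimal sketch, assuming url holds the page you're targeting:

 library(RCurl)

 # verbose = TRUE prints request/response headers to the console, so you
 # can see whether the cookie-less request is still redirected past the
 # disclaimer page or is now served the disclaimer HTML itself.
 pagesource <- getURL(url,
                      .opts = curlOptions(followlocation = TRUE,
                                          cookiefile = "nosuchfile",
                                          verbose = TRUE))

If the header output shows the disclaimer page coming back directly instead of a redirect, the old trick no longer applies to that site.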

In the example you reference in your question, they're using JavaScript to move past the disclaimer, which could be tricky to get past.

For the example you mention, however...

  1. Open it in Chrome (or Firefox with Firebug)
  2. Right click on some blank space in the page and select "Inspect Element"
  3. Click the Network tab
  4. If there's content there, click the "Clear" button at the bottom to empty out the log.
  5. Accept the license agreement
  6. Watch all of the traffic that comes across the network. In my case, the top result was the interesting one. If you click it, you can preview it to verify that it is, indeed, an HTML document. If you click on the "Headers" tab under that item, it will show you the "Request URL". In my case, that was: http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?app=eINVCFundPriceDividend&pri_fund_code=U42360&data_selection=0&keyword=U42360&start_day=30&start_month=03&start_year=2012&end_day=18&end_month=04&end_year=2012&data_selection2=0

You can access that URL directly without having to accept any license agreement, either by hand or from cURL.
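
In R, that could look like the following sketch. Here fundurl is just a name I'm giving the "Request URL" captured above, and whether readHTMLTable finds usable tables depends on what the endpoint actually returns:

 library(RCurl)
 library(XML)

 # Fetch the Request URL captured from the Network tab directly, with no
 # disclaimer cookie at all, and parse any HTML tables it returns.
 fundurl <- paste0("http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif",
                   "?app=eINVCFundPriceDividend&pri_fund_code=U42360",
                   "&data_selection=0&keyword=U42360",
                   "&start_day=30&start_month=03&start_year=2012",
                   "&end_day=18&end_month=04&end_year=2012&data_selection2=0")
 pagesource <- getURL(fundurl)
 tables <- readHTMLTable(htmlParse(pagesource))   # a list of data frames

Presumably you can then vary the query-string parameters (the fund code and the start/end dates) to pull other date ranges the same way.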

Note that if you've already accepted the agreement, this site stores a cookie saying so, which you will need to delete in order to get back to the license agreement page. You can do this by clicking the "Resources" tab, then going to "Cookies" and deleting each one, then refreshing the URL you posted above.

