python urllib2 - wait for page to finish loading/redirecting before scraping?

Question

Welcome To Ask or Share your Answers For Others

python urllib2 - wait for page to finish loading/redirecting before scraping?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python urllib2 - wait for page to finish loading/redirecting before scraping?

I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the html using urllib2. However, I'm running into a problem where, using the code below, the html I get back is not correct as the page seems to take a second to redirect (you can verify this by visiting the url) - instead I get the code from the page that initially briefly appears.

Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?

import urllib2
from bs4 import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()

Edit: The answer is thorough, however, in the end what solved my problem was this: https://stackoverflow.com/a/3210737/1157283

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:55:58+0000

Inreresting the problem isn't a redirect is that page modifies the content using javascript, but urllib2 doesn't have a JS engine it just GETS data, if you disabled javascript on your browser you will note it loads basically the same content as what urllib2 returns

import urllib2
from BeautifulSoup import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
open('test.html', 'w').write(soup.read())

test.html and disabling JS in your browser, easiest in firefox content -> uncheck enable javascript, generates identical result sets.

So what can we do well, first we should check if the site offers an API, scrapping tends to be frown up http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available

Travel/Hotel API's? it looks they might, though with some restrictions.

But if we still need to scrape it, with JS, then we can use selenium http://seleniumhq.org/ its mainly used for testing, but its easy and has fairly good docs.

I also found this Scraping websites with Javascript enabled? and this http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

hope that helps.

As a side note:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> 
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)

Categories

python urllib2 - wait for page to finish loading/redirecting before scraping?

python urllib2 - wait for page to finish loading/redirecting before scraping?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags