Interesting. The problem isn't a redirect; it is that the page modifies its content using JavaScript. urllib2 doesn't have a JS engine, it just GETs the data. If you disable JavaScript in your browser, you will see that it loads basically the same content as what urllib2 returns:
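import urllib2
from BeautifulSoup import BeautifulSoup

# urllib2 only fetches the raw HTML; no JavaScript is executed.
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
html = bostonPage.read()  # read the response once and reuse the string
soup = BeautifulSoup(html)
# Save the raw response so it can be compared with what the browser shows.
open('test.html', 'w').write(html)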
Opening test.html and loading the page with JS disabled in your browser (easiest in Firefox: Content -> uncheck "Enable JavaScript") produce identical result sets.
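If you would rather check this from code than by eye, here is a minimal sketch, assuming Python 3 and only the standard library. It strips out `<script>` and `<style>` contents and tests whether a given string appears in the text the server actually sent. The names `VisibleText` and `in_static_html` and the sample HTML are made up for illustration; they are not from TripAdvisor or any library.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def in_static_html(html, marker):
    """Return True if `marker` appears in the text of the raw HTML (no JS run)."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return marker in " ".join(parser.chunks)


# Hypothetical sample: the hotel name exists only inside a script, so a
# browser would display it but the static text does not contain it.
sample = """<html><body>
<div id="results">Loading...</div>
<script>document.getElementById("results").innerHTML = "Grand Hotel";</script>
</body></html>"""

print(in_static_html(sample, "Loading..."))
print(in_static_html(sample, "Grand Hotel"))
```

If a string you can see in the browser is missing from the static text, it is most likely being filled in by JavaScript after the page loads.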
So what can we do? Well, first we should check whether the site offers an API, since scraping tends to be frowned upon:
http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available
Travel/Hotel API's?
It looks like they might, though with some restrictions.
But if we still need to scrape it with JS enabled, we can use Selenium:
http://seleniumhq.org/ It's mainly used for testing, but it's easy to use and has fairly good docs.
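A rough sketch of the Selenium route, assuming the `selenium` Python package and a Firefox driver are installed. `webdriver.Firefox()`, `driver.get()`, `driver.page_source` and `driver.quit()` are part of Selenium's Python bindings; the function name `fetch_rendered_html` and its `driver_factory` and `settle_seconds` parameters are my own, added so the function can be tried without a real browser. The fixed pause is a crude assumption; a proper wait condition would be more robust.

```python
import time


def fetch_rendered_html(url, driver_factory=None, settle_seconds=0.0):
    """Load `url` in a real browser and return the HTML after JS has run."""
    if driver_factory is None:
        # Imported lazily so the function can be defined without selenium.
        from selenium import webdriver
        driver_factory = webdriver.Firefox

    driver = driver_factory()
    try:
        driver.get(url)
        # Crude pause to give the page's scripts time to fill in the content.
        time.sleep(settle_seconds)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

Calling `fetch_rendered_html(url, settle_seconds=5)` with the search URL should then return markup you can hand to BeautifulSoup in the usual way, this time with the JavaScript-generated content included.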
I also found this Scraping websites with Javascript enabled? and this http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Hope that helps.
As a side note:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)
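A likely reason for reading into `value` first (my reading of the snippet above): the object returned by `urlopen` is a file-like stream, so once it has been read there is nothing left for a second read. A small offline sketch of that behaviour, using `io.BytesIO` as a stand-in for the real response; `read_twice` is a hypothetical helper.

```python
import io


def read_twice(response):
    """Read a file-like response two times and return both results."""
    first = response.read()
    second = response.read()  # the stream is already exhausted here
    return first, second


# io.BytesIO stands in for the object urlopen() returns (an assumption made
# so the example runs offline); both are file-like streams.
fake_response = io.BytesIO(b"<html><body>hotels</body></html>")
first, second = read_twice(fake_response)
print(len(first), len(second))
```

Keeping the content in a variable lets you both parse it and write it to disk from the same data.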