I'm scraping TripAdvisor. My problem right now is scraping the hotel stars (not the average user rating [bubbles], but the hotel class rating) of a given hotel, and later I'll run into the problem of reviews being hidden behind "read more".
https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html
Fortunately, I know where to find both pieces of data. They are in the page within this tag:
<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[....
....
</script>
Search https://pastebin.com/Ww3ugxFR for "The view was fantastic!!" (an example of hidden text) or for '"star":' to find the hotel stars.
I want to learn how to access this tag.
Here is my example of how it doesn't work. I need to learn how to tell a CSS selector (or another tool) how to address this specific tag and how to extract the data from it. In this example I would just load the response and do a simple pattern search. I guess one could also load it with JSON and extract from there, but I'm not too firm with JSON yet:
hotel_CONTEXT = response.css("script text=window.__WEB_CONTEXT ::attr(pageManifest)).extract()
pattern_hotelstar = re.compile(r'star":["d')
matches_hotelstar = pattern_hotelstar.findall(hotel_CONTEXT)
Hotel_stars = str(matches_hotelstar).split('"')[2].split("'")[0]
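A minimal sketch of the "simple pattern search" idea, using only the standard library. The html_text string is a made-up stand-in for the page source (in a Scrapy callback that would be response.text), and the assumption that the value after "star": is a number, quoted or not, is a guess that should be checked against the real page:

```python
import re

# Hypothetical, shortened stand-in for the real page source.
html_text = (
    '<script>window.__WEB_CONTEXT__='
    '{pageManifest:{"assets":[]},"star":"4.5"};</script>'
)

# Capture the number after "star": whether or not it is wrapped in quotes.
pattern_hotelstar = re.compile(r'"star":"?([\d.]+)')

match = pattern_hotelstar.search(html_text)
hotel_stars = match.group(1) if match else None  # None if the key is absent

print(hotel_stars)
```

Using search() with a capturing group avoids the split() gymnastics on the string form of a list, and the None fallback keeps the spider from crashing on pages without a hotel class.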
Apparently what I want to achieve is possible with BeautifulSoup (Scraping a website with data hidden under "read more" ... however, I got errors with json when trying to replicate it), but generally I'd prefer a solution with Scrapy.
Andrej Kesely provided an excellent solution to my problem! His code works so well that I want to fully understand it! Here is what I think I understand from the code, and where I just don't understand his sorcery ;) :
data = re.search(r'window.__WEB_CONTEXT__=(.*?});', html_text).group(1)
Andrej searches the whole html_text for a pattern that starts with "window.__WEB...", extends over any character (.), any number of times (*), in a non-greedy way (?), and ends with a ";". I don't understand why there is a capturing group with } in it, and why the } was not just put at the end, given that the script ends with }; (how did Andrej find this out? Is that a general pattern for these, or did he print the whole page and look it up?). I also don't understand why it had to be non-greedy. group(1) selects everything within the first parentheses, leaving window.__WEB_CONTEXT__= out. I guess this has something to do with loading the result with json. The same goes for
data = data.replace('pageManifest', '"pageManifest"')
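The two lines above can be tried out on a toy string. This is only a sketch: html_text below is invented, much shorter than the real page, and assumes the object sits on a single line. It shows that keeping the } inside the group makes the captured text a complete object, that the non-greedy ? stops at the first }; instead of the last one, and that the replace() is what turns the unquoted pageManifest key into valid JSON:

```python
import json
import re

# Hypothetical page source: the object is followed by more script code
# that also happens to contain "};".
html_text = (
    '<script>window.__WEB_CONTEXT__='
    '{pageManifest:{"assets":["a.js"],"star":"4.0"}};'
    '(function(){var x={};})();</script>'
)

# Non-greedy: stop at the FIRST "};". The "}" is inside the group, so the
# captured text still ends with the closing brace of the object.
data = re.search(r'window.__WEB_CONTEXT__=(.*?});', html_text).group(1)

# Greedy: runs on to the LAST "};" and swallows unrelated JavaScript.
greedy = re.search(r'window.__WEB_CONTEXT__=(.*});', html_text).group(1)

# JSON requires quoted keys, so the bare pageManifest key must be quoted.
try:
    json.loads(data)
    raw_is_valid_json = True
except json.JSONDecodeError:
    raw_is_valid_json = False

parsed = json.loads(data.replace('pageManifest', '"pageManifest"'))

print(data)
print(greedy)
print(raw_is_valid_json)
print(parsed['pageManifest']['star'])
```

If the } were placed outside the parentheses, group(1) would lose its final brace and json.loads would fail on the truncated text.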
Then Andrej creates a function called traverse that will later be fed the output from data. In the if statement Andrej checks whether the input is a dictionary. In the next step he loops through the keys (k) and values (v) of the dictionary. If k == "reviews" he yields the value. If not, "yield from the function"?? I'm also lost with the elif and the check whether val is a list... In general, what is the output v of the function? How would I change the function to include more dictionaries to scroll over, since else is already occupied by this yield from?
def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)
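To see what the function produces, here is a small sketch on invented data. The sample dictionary and the traverse_keys variant are hypothetical and not part of Andrej's answer. traverse is a generator: it walks into every nested dictionary and list, and each time it meets a key named 'reviews' it hands back the value stored there. To look for several keys, the k == 'reviews' test can become a membership test, so the else branch stays free for the recursion:

```python
def traverse(val):
    """Yield every value stored under the key 'reviews', at any depth."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v                    # found it: hand the value out
            else:
                yield from traverse(v)     # keep searching inside this value
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)         # search every element of the list
    # Strings, numbers, None: nothing to search, so nothing is yielded.


def traverse_keys(val, wanted):
    """Hypothetical variant: yield (key, value) for every key in `wanted`."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k in wanted:
                yield k, v
            else:
                yield from traverse_keys(v, wanted)
    elif isinstance(val, list):
        for v in val:
            yield from traverse_keys(v, wanted)


# Invented data with the same kind of nesting as the parsed page.
sample = {
    "pageManifest": {"assets": ["a.js"]},
    "hotel": {
        "star": "4.0",
        "blocks": [{"reviews": [{"title": "Great", "rating": 5}]}],
    },
}

found_reviews = list(traverse(sample))
found_pairs = list(traverse_keys(sample, {"reviews", "star"}))

print(found_reviews)
print(found_pairs)
```

Yielding the key together with the value lets the caller tell apart which of the wanted keys produced each result.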
Here Andrej loops over traverse(data) (data is a dictionary, right?), since we've got multiple reviews on this page.
In the nested loop Andrej gives each dictionary (a single review) the name r, and with dictionary_name["key"] he retrieves the value stored under that key. Am I right?
for reviews in traverse(data):
    for r in reviews:
        print(r['userProfile']['displayName'])
        print(r['title'])
        print(r['text'])
        print('Rating:', r['rating'])
        print('-' * 80)
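The shape of the data in this loop can be sketched without the real page. Everything below is made up: reviews_lists stands in for what traverse(data) yields (each item is a list of review dictionaries), and the second review deliberately lacks a userProfile to show a defensive way of reading keys that may be missing. Collecting dictionaries instead of printing is one possible way to prepare items for a Scrapy callback:

```python
# Hypothetical stand-in for the values yielded by traverse(data).
reviews_lists = [
    [
        {
            "userProfile": {"displayName": "Anna"},
            "title": "Great stay",
            "text": "The view was fantastic!!",
            "rating": 5,
        },
        # A review without a userProfile entry.
        {"title": "Okay", "text": "Fine for one night.", "rating": 3},
    ],
]

rows = []
for reviews in reviews_lists:          # each item is a list of reviews
    for r in reviews:                  # each r is one review dictionary
        profile = r.get("userProfile") or {}
        rows.append({
            "author": profile.get("displayName"),  # None if missing
            "title": r.get("title"),
            "text": r.get("text"),
            "rating": r.get("rating"),
        })

for row in rows:
    print(row)
```

r['key'] raises a KeyError when the key is absent, while r.get('key') returns None, which is often easier to handle when some reviews are incomplete.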
Sorry for all these rookie questions.