Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Scraping hidden data [ window.__WEB_CONTEXT__= ] ... preferably with Scrapy

I'm scraping TripAdvisor. My current problem is scraping the hotel stars (the hotel class rating, not the average user rating shown as bubbles) of a given hotel, and later I'll run into the problem of reviews being hidden behind "read more". https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html Fortunately, I know where to find both pieces of data. They are in the page within this tag:

<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[....
....
</script>

Search the page source at https://pastebin.com/Ww3ugxFR for "The view was fantastic!!" (an example of hidden review text) or for '"star":' (the hotel stars).

I want to learn how to access this tag.

Here is my example of how it doesn't work. I need to learn how to tell a CSS selector (or another tool) to address this specific tag, and how to extract the data from it. In this example I would just load the response and do a simple pattern search. I guess one could also load it with json and extract from there, but I'm not too firm with JSON yet:

hotel_CONTEXT = response.css("script text=window.__WEB_CONTEXT ::attr(pageManifest)).extract()

pattern_hotelstar = re.compile(r'star":["d')
matches_hotelstar = pattern_hotelstar.findall(hotel_CONTEXT)
Hotel_stars = str(matches_hotelstar).split('"')[2].split("'")[0]

Apparently what I want to achieve is possible with BeautifulSoup (see the question Scraping a website with data hidden under "read more"; however, I got errors with json when trying to replicate it), but in general I'd prefer a solution with Scrapy.
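As a rough sketch of how the tag could be reached from Scrapy: the helper below takes the text of the page's script tags and returns the parsed object. It is written against plain strings so it runs without Scrapy. The assumption is that in a spider callback the input would come from something like response.xpath('//script/text()').getall(). The function name parse_web_context and the sample scripts (including the "star" value) are made up for illustration.

```python
import json
import re

# Same idea as the pattern used later in this thread, with the dot escaped.
WEB_CONTEXT_RE = re.compile(r'window\.__WEB_CONTEXT__=(.*?});', re.DOTALL)


def parse_web_context(script_texts):
    """Return the parsed window.__WEB_CONTEXT__ object, or None if absent.

    script_texts is any iterable of <script> bodies (plain strings).
    """
    for text in script_texts:
        match = WEB_CONTEXT_RE.search(text)
        if match:
            # The outer key is bare JavaScript, so quote it before JSON parsing.
            # Only the first occurrence is replaced here (an assumption that
            # the bare key appears first in the payload).
            raw = match.group(1).replace('pageManifest', '"pageManifest"', 1)
            return json.loads(raw)
    return None


# Hypothetical, heavily shortened stand-ins for the real script bodies.
sample_scripts = [
    'var unrelated = 1;',
    'window.__WEB_CONTEXT__={pageManifest:{"star":"4.0"}};',
]
context = parse_web_context(sample_scripts)
print(context['pageManifest']['star'])
```

Keeping the parsing separate from the selector means the same helper can be fed from Scrapy, requests, or a saved HTML file.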


Andrej Kesely provided an excellent solution to my problem! His code works so well that I want to fully understand it. Here is what I think I understand from the code, and where I just don't understand his sorcery ;) :

data = re.search(r'window.__WEB_CONTEXT__=(.*?});', html_text).group(1)

Andrej searches the whole html_text for a pattern that starts with "window.__WEB_CONTEXT__=", then matches any character (.), any number of times (*), in a non-greedy way (?), and ends with a ";". I don't understand why the capturing group has the } in it, rather than the } being placed after the group, given that the script ends with }; (how did Andrej find this out? Is that a general pattern for these pages, or did he print the whole page and look it up?). I also don't understand why it had to be non-greedy. group(1) selects everything within the first parentheses, leaving window.__WEB_CONTEXT__= out. I guess this has something to do with loading the result with json. The same goes for

data = data.replace('pageManifest', '"pageManifest"')   
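The effect of each choice can be checked on a miniature string. The sketch below uses a made-up, two-statement script (not the real page) and compares the original pattern with two variants: one with the } moved outside the group and one that is greedy. It then shows why the bare pageManifest key has to be quoted before json.loads accepts the text.

```python
import json
import re

# Hypothetical miniature of the page; the real script is far longer.
html_text = (
    '<script>window.__WEB_CONTEXT__={pageManifest:{"a":{"b":1}}};'
    'var other={"c":2};</script>'
)

# Original pattern: "}" is inside the group, so the final brace is captured.
lazy = re.search(r'window.__WEB_CONTEXT__=(.*?});', html_text).group(1)

# "}" outside the group: the final brace is dropped from the captured text.
brace_outside = re.search(r'window.__WEB_CONTEXT__=(.*?)};', html_text).group(1)

# Greedy ".*" runs to the LAST "};" and swallows the next statement as well.
greedy = re.search(r'window.__WEB_CONTEXT__=(.*});', html_text).group(1)


def is_valid_json(text):
    """Return True if json.loads accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# JSON requires double-quoted property names, so the bare key must be quoted.
fixed = lazy.replace('pageManifest', '"pageManifest"')

print(brace_outside)         # unbalanced braces
print(greedy)                # contains the unrelated "var other" statement
print(is_valid_json(lazy))   # bare key is rejected
print(is_valid_json(fixed))  # quoted key is accepted
print(json.loads(fixed))
```

Note that the non-greedy match stops at the first "};" it meets, so it relies on that sequence not appearing earlier inside the data itself.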

Then Andrej creates a function called traverse that will later be fed the parsed data. In the if statement Andrej checks whether the input is a dictionary. In the next step he loops through the keys (k) and values (v) of the dictionary. If k == "reviews" he yields the value. If not, "yield from the function"?? I'm also lost with the elif and the check whether val is a list... In general, what is the output v of the function? How would I change the function to include more dictionaries to loop over, since the else is already occupied by this yield from?

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)
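To see what the function hands back, it helps to run it on a small hand-made structure. The data below is hypothetical and only loosely shaped like the parsed page. The point it illustrates is that each yielded value is whatever was stored under a "reviews" key (here, a list of review dictionaries), and that yield from simply passes on everything the recursive call finds.

```python
def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v                 # hand back the value stored under "reviews"
            else:
                yield from traverse(v)  # search deeper, pass on whatever is found
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)      # lists have no keys, so visit each item
    # Strings, numbers and None cannot contain keys, so nothing is yielded.


# Hypothetical nested structure with two "reviews" lists at different depths.
data = {
    'pageManifest': {'assets': ['a.js']},
    'responses': [
        {'hotel': {'reviews': [{'title': 'Just WOW!!', 'rating': 5}]}},
        {'other': {'reviews': [{'title': 'Amazing staff', 'rating': 5}]}},
    ],
}

found = list(traverse(data))
print(len(found))
print(found[0])
```

The elif branch matters because JSON nests dictionaries inside lists; without it the search would stop at the first list it met.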
 

Here Andrej loops over traverse(data) (data is a dictionary, right?), since we've got multiple reviews on this page. In the nested loop Andrej gives each dictionary holding a single review the name r, and with dictionary_name["key"] he retrieves the value stored under that key. Am I right?

for reviews in traverse(data):
  for r in reviews:
    print(r['userProfile']['displayName'])
    print(r['title'])
    print(r['text'])
    print('Rating:', r['rating'])
    print('-' * 80)
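On the question of searching for more than one key: one possible variation, sketched here under assumptions, is to pass the wanted keys in as a set and yield the key together with its value. The name find_values and the sample data are hypothetical; the "star" key is taken from the pastebin search hint above, and its real position and value type in the page data are not confirmed.

```python
def find_values(val, wanted):
    """Yield (key, value) pairs for every dictionary key found in `wanted`."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k in wanted:
                yield k, v
            else:
                yield from find_values(v, wanted)
    elif isinstance(val, list):
        for item in val:
            yield from find_values(item, wanted)


# Hypothetical sample; the real parsed data is much larger.
data = {
    'hotel': {'star': '4.0', 'name': 'El Rio y Mar Resort'},
    'sections': [{'reviews': [{'title': 'Just WOW!!', 'rating': 5}]}],
}

results = list(find_values(data, {'reviews', 'star'}))

for key, value in results:
    if key == 'star':
        print('Hotel class:', value)
    else:
        for r in value:  # value is a list of review dictionaries
            print(r['title'], '| Rating:', r['rating'])
```

Yielding the key alongside the value lets the calling loop tell a star rating apart from a list of reviews.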

Sorry for all these rookie questions.



1 Reply


This script prints all reviews and review ratings found on the page:

import re
import json
import requests


url = 'https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html'
html_text = requests.get(url).text

data = re.search(r'window.__WEB_CONTEXT__=(.*?});', html_text).group(1)
data = data.replace('pageManifest', '"pageManifest"')
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)

for reviews in traverse(data):
    for r in reviews:
        print(r['userProfile']['displayName'])
        print(r['title'])
        print(r['text'])
        print('Rating:', r['rating'])
        print('-' * 80)

Prints:

BBDoll619
Just WOW!!
Okay, I didn't know this resort would be mainly couples and honeymooners as I went with 2 friends. We weren't uncomfortable though and met lots of nice people from across the globe and 1 couple from the US. This resort can only be reached by boat, so it is very secluded. We stayed in bungalow #2. It was rustic, but beautiful and right on the beach. Everyone who worked in the resort was friendly and very accommodating. We ate most meals at the resort which was pretty good. We had happy hour at the pier bar every day which was from 4-7pm. They had half off certain drinks and food specials. It was very nice relaxing, enjoying a great drink and watching the sunset. You can snorkel right in front of the resort which was so cool! We snorkeled for 2 hours!! The best is right by the floating bungalows where they did massages. Speaking of massages....OMG! It was heaven!! Very affordable and different. When you lie face down, you look into a cut out in the floor, so you can view the water and fish swimming by. I loved it!! We did an island hopping tour and it was not an issue coming from this resort. When we got into Coron town and passed by all the hotels in that area, we were so glad and thankful we chose El Rio Y Mar. Coron Town is very dirty, dusty, full of young backpackers and the hotels look subpar. It's fine if you're on a budget. I get it, but us girls/mom/friends wanted to treat ourselves. That we did! One day we went on a guided hike to the top of a closeby mountain. The view was fantastic!! I highly recommend this resort and would definitely return.
Rating: 5
--------------------------------------------------------------------------------
MaricrisAndPiotr
Amazing staff
The best customer experience we ever had! the school of fishes within the resort are amazing, very quite, very clean and well maintained rooms and outdoor surroundings. Our island trip organized by them is one of the best experience we had in our Coron trip. 
Kudos to El Rio highly recommended
Rating: 5
--------------------------------------------------------------------------------

...and so on.
