Welcome to OGeek Q&A Community for programmers and developers - Open, Learning and Share
0 votes
961 views
in Technique by (71.8m points)

python - BeautifulSoup web scraping for a webpage where information is obtained after clicking a button

So, I am trying to get the "Amenities and More" portion of the Yelp page for a few restaurants. The issue is that from a restaurant's Yelp page I can only reach the amenities that are displayed first; there is an "n more" button that, when clicked, reveals the rest. Using BeautifulSoup and Selenium with the page URL, and using BeautifulSoup with requests, give exactly the same results, and I am stuck on how to expand the whole Amenities section before grabbing it in my code. Two screenshots below show the section before and after the button is clicked.

[Screenshot 1: amenities shown before clicking the button]

[Screenshot 2: amenities shown after clicking the button]

  1. "Before clicking '5 More Attributes': The first pic shows 4 "div" within which lies "span" that I can get to using any of the above methods.
  2. "After clicking '5 More Attributes': The second pic shows 9 "div" within which lies "span" that I am trying to get to.

Here is the code using Selenium/BeautifulSoup:

import selenium
from selenium import webdriver
from bs4 import BeautifulSoup

URL ='https://www.yelp.com/biz/ziggis-coffee-longmont'

driver = webdriver.Chrome(r"C:\Users\Fariha\AppData\Local\Programs\chromedriver_win32\chromedriver.exe")
driver.get(URL)
yelp_page_source_page1 = driver.page_source



soup = BeautifulSoup(yelp_page_source_page1,'html.parser')
spans = soup.find_all('span')

Result: there are 990 elements in "spans". I am only showing what is relevant to my question:

[Screenshot: the relevant span elements among the 990 results]
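As a reference for what find_all('span') is doing here, the span extraction can be reproduced on any saved HTML with only the standard-library parser (no bs4); a minimal sketch, with invented markup standing in for Yelp's amenities section:

```python
from html.parser import HTMLParser

class SpanCollector(HTMLParser):
    """Collect the text content of every <span> in a document."""
    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <span> tags we are currently inside
        self.spans = []   # collected span texts

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self._depth += 1
            self.spans.append('')

    def handle_endtag(self, tag):
        if tag == 'span' and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.spans[-1] += data

# Hypothetical markup, not Yelp's actual structure
html = '<div><span>Offers Delivery</span></div><div><span>Drive-Thru</span></div>'
parser = SpanCollector()
parser.feed(html)
print(parser.spans)  # ['Offers Delivery', 'Drive-Thru']
```

Of course, this only sees spans present in the HTML you feed it, which is exactly the problem: the hidden amenities are not in the initial page source.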

Question from: https://stackoverflow.com/questions/65838411/beautifulsoup-web-scraping-for-a-webpage-where-information-is-obtained-after-cli


1 Reply

0 votes
by (71.8m points)

An alternative approach would be to extract the data directly from the site's JSON API. This can be done without the overhead of Selenium, as follows:

from bs4 import BeautifulSoup
import requests
import json

session = requests.Session()
r = session.get('https://www.yelp.com/biz/ziggis-coffee-longmont')
#r = session.get('https://www.yelp.com/biz/menchies-frozen-yogurt-lafayette')

soup = BeautifulSoup(r.content, 'lxml')

# Locate the business ID to use (from JSON inside one of the script entries)
for script in soup.find_all('script', attrs={"type" : "application/json"}):
    # The JSON is wrapped in an HTML comment; strip('<!-->') trims those marker characters
    gaConfig = json.loads(script.text.strip('<!-->'))

    try:
        biz_id = gaConfig['gaConfig']['dimensions']['www']['business_id'][1]
        break
    except KeyError:
        pass

# Build a suitable JSON request for the required information
json_post = [
    {
        "operationName": "GetBusinessAttributes",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "35e0950cee1029aa00eef5180adb55af33a0217c64f379d778083eb4d1c805e7"
        }
    },
    {
        "operationName": "GetBizPageProperties",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "f06d155f02e55e7aadb01d6469e34d4bad301f14b6e0eba92a31e635694ebc21"
        }
    },
]

r = session.post('https://www.yelp.com/gql/batch', json=json_post)
j = r.json()

business = j[0]['data']['business']
print(business['name'], '\n')

for prop in j[1]['data']['business']['organizedProperties'][0]['properties']:
    print(f'{"Yes" if prop["isActive"] else "No":5} {prop["displayText"]}')

This would give you the following entries:

Ziggi's Coffee 

Yes   Offers Delivery
Yes   Offers Takeout
Yes   Accepts Credit Cards
Yes   Private Lot Parking
Yes   Bike Parking
Yes   Drive-Thru
No    No Outdoor Seating
No    No Wi-Fi
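The nested lookups above (j[0]['data']['business'] and the organizedProperties path) assume that response shape; here is a minimal sketch against a mocked response, so the navigation can be followed without hitting the network (the mock mirrors what the code relies on, it is not Yelp's documented schema):

```python
# Mocked /gql/batch response: one element per operation in the batch request
j = [
    {'data': {'business': {'name': "Ziggi's Coffee"}}},
    {'data': {'business': {'organizedProperties': [
        {'properties': [
            {'isActive': True,  'displayText': 'Offers Delivery'},
            {'isActive': False, 'displayText': 'No Wi-Fi'},
        ]}
    ]}}},
]

# Same navigation as the real script, applied to the mock
print(j[0]['data']['business']['name'])
for prop in j[1]['data']['business']['organizedProperties'][0]['properties']:
    print(f'{"Yes" if prop["isActive"] else "No":5} {prop["displayText"]}')
```

If Yelp ever changes the response shape, these lookups will raise KeyError or IndexError, which is a useful early warning that the scraper needs updating.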

How was this solved?

Your best friend here is your browser's network dev tools. With these you can watch the requests made to obtain the information. The normal flow is: the initial HTML page is downloaded, its JavaScript runs, and further requests are made to fill in the rest of the page.

The trick is first to locate where the data you want lives (it is often returned as JSON), then to work out the parameters needed to recreate the request for it.
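As a concrete illustration of that first step: page data often ships inside a script tag of type application/json, sometimes wrapped in an HTML comment as Yelp does. A sketch of digging it out with only the standard library (the page snippet and payload here are invented):

```python
import json
import re

# Invented page snippet: JSON embedded in a script tag, wrapped in an HTML comment
page = '''<html><body>
<script type="application/json"><!-- {"gaConfig": {"business_id": ["ignored", "abc123"]}} --></script>
</body></html>'''

# Pull out each application/json script body, then unwrap the comment markers.
# str.strip('<!-->') trims any leading/trailing run of the characters < ! - >
payload = None
for match in re.finditer(r'<script type="application/json">(.*?)</script>', page, re.S):
    payload = json.loads(match.group(1).strip('<!-->'))

print(payload['gaConfig']['business_id'][1])  # abc123
```

The same unwrap-then-parse step is what the answer's loop over soup.find_all('script', ...) performs, just with BeautifulSoup doing the tag location instead of a regex.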

To understand this code further, use print(). Print everything; it will show you how each part builds on the next. That is how the script was written, one bit at a time.

Approaches using Selenium allow the JavaScript to run, but most of the time this is not needed, since the JavaScript is just making requests and formatting the data for display.

