Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


selenium - How to click a link by text with No Text in Python

I am trying to scrape wine data from vivino.com, using Selenium to automate it and scrape as much data as possible. My code looks like this:

import time 
from selenium import webdriver

browser = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

browser.get('https://www.vivino.com/explore?e=eJwFwbEOQDAUBdC_uaNoMN7NZhQLEXmqmiZaUk3x987xkVXRwLtAVcLLy7qE_tiN0Bz6FhcV7M4s0ZkkB86VUZIL9l4kmyjW4ORmbo0nTTPVDxlkGvg%3D&cart_item_source=nav-explore') # Vivino Website with 5 wines for now (simple example). Plan to scrape around 10,000 wines 

# Scroll to the bottom of the page and return the current page height.
scroll_script = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"
lenOfPage = browser.execute_script(scroll_script)

# Keep scrolling until the page height stops growing (no more wines are loaded).
match = False
while not match:
    lastCount = lenOfPage
    time.sleep(7)  # give the page time to load more results
    lenOfPage = browser.execute_script(scroll_script)
    if lastCount == lenOfPage:
        match = True

That opens a page with 5 wines and scrolls down. Now I want to click the hyperlink of each wine, one by one, to scrape information about its price, grape variety, etc. So, basically, my script will scroll down to get as many wines as possible displayed on the page, then click the first hyperlink, get the additional information, and go back. Then the process repeats. I don't think that's an efficient strategy, but that's what I have come up with so far.

The problem I have is with the hyperlinks on the Vivino website. There is no text next to the href link that would allow me to use the find_element_by_link_text function:

<a class="anchor__anchor--2QZvA" href="/weingut-r-a-pfaffl-austrian-cherry-zweigelt/w/1261542?year=2018&amp;price_id=23409078&amp;cart_item_source=direct-explore" target="_blank">

Could you please suggest a way to click a wine link with Selenium when the hyperlink has no text? I haven't found a proper answer in my web search. Thanks in advance.



1 Reply


You're doing way more work than you have to - with Selenium I mean. When visiting the page, I logged my network traffic with Google Chrome's dev tools, and I saw that my browser made an HTTP GET request to a REST API, the response of which is JSON and contains all the wine/price information you could ever want. So, you don't need to do any scraping. Just imitate that GET request with the desired query-string parameters and the correct headers. It seems the REST API just cares about the user-agent header, which is trivial.


  1. First, visit the URL in Google Chrome.
  2. Press the F12 key to open the Google Chrome dev tools menu. Click on the Network tab.
  3. Click on the round Record button. It should turn red, and it will start logging all network traffic in the log below. Click on the Filter button next to it, and then click on XHR. This will only show XHR (XMLHttpRequest) requests in the log. We are interested in these requests specifically, because it is via XHR that requests to APIs are typically made.

  4. With the Chrome dev tools menu still open, right-click (not left-click) on the page refresh button to reveal a drop-down menu. Then, click on Empty Cache and Hard Reload.

This will empty your browser's cache for this page, and force your browser to refresh the page. As the page is being refreshed, you should start to see some entries appearing in the traffic log.

There should now be some XHR request entries in the log. We don't know which of these is the one we're actually interested in, so we just look at all of them until we find one that looks like the right one, that is, one whose response contains the information we're looking for, such as data for individual wines. I happen to know that we are interested in the one that starts with explore?..., so let's click on that.

  5. Once you click on the entry, a panel will open on the right of the log. Click on the Headers tab.

This tab contains all the information regarding how this request was made. Under the General area, you can see the Request URL, which is the URL of the REST API endpoint that we made a request to. This URL might be quite long, because it will typically also contain the query-string parameters. Those are the key-value pairs that come after explore?, like country_code=DE or currency_code=EUR, and they are separated by &. The query-string parameters are important, because they contain the filters that we want to apply to our query. In my code example, I've removed them from the REST API endpoint URL and moved them into the params dictionary instead. This step isn't required - you could also just leave them in the URL - but I find that it is easier to read and modify this way. The query-string parameters are also important because some APIs expect certain parameters to be present in the request, or expect them to have certain values. In other words, some APIs are very picky about their query-string parameters, and if you remove them or tamper with them in a way the API doesn't expect, the API will reject your request as malformed.
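To make the split between endpoint and query-string parameters concrete, here is a small sketch that uses only Python's standard library. The shortened URL is an illustrative assumption built from parameter names mentioned in this answer; the real Request URL copied from the dev tools will carry more parameters. The helper name split_request_url is made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def split_request_url(request_url):
    """Split a copied Request URL into (endpoint, params dictionary)."""
    parts = urlsplit(request_url)
    # The endpoint is scheme + host + path, without the query string.
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qsl decodes percent-escapes and returns (key, value) pairs.
    # Note: converting to a dict keeps only the last value of a repeated key.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return endpoint, params


# Shortened, illustrative URL (assumption: not the full URL from the browser).
example_url = (
    "https://www.vivino.com/api/explore/explore"
    "?country_code=DE&currency_code=EUR&page=1&wine_type_ids%5B%5D=1"
)

endpoint, params = split_request_url(example_url)
print(endpoint)
print(params)

# urlencode rebuilds a query string from the dictionary.
print(urlencode(params))
```

A dictionary obtained this way can be edited key by key, which is usually less error-prone than editing the long URL by hand.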

In the General area, you can also see the Request Method, which in our case is GET. This tells us that our browser made an HTTP GET request. Not all API endpoints work the same way; some expect HTTP POST, for example.

Status Code tells us what status code the server sent back. 200 means everything went OK.

Let's take a look at the Response Headers area. This area contains all the response headers that the server sent back after the request was made. These can be useful for a browser for things like setting cookies or knowing how to interpret the data the server has sent back.

The Request Headers area contains all the headers that your browser sent to the server when it made the request. Usually, it's a good idea to copy all of these key-value pairs and turn them into a Python dictionary headers, because that way you can be sure that your Python script will make the exact same request that your browser made. However, usually, I like to trim this down as much as I can. I know that many APIs desperately care about the user-agent field, so usually I'll keep that one, but sometimes they also care about the referer. As you work with different APIs, you'll have to just kind of figure out which request headers the API cares about through trial-and-error. This API happens to only care about the user-agent.
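The trimming described above can be sketched as a small helper. This is only an illustration: the header values below are hypothetical stand-ins for whatever your browser actually sent, and trim_headers is a name made up for this example. Header names are compared case-insensitively, since HTTP header names are not case-sensitive.

```python
def trim_headers(browser_headers, keep=("user-agent",)):
    """Return only the headers whose names are listed in `keep` (case-insensitive)."""
    wanted = {name.lower() for name in keep}
    return {
        name: value
        for name, value in browser_headers.items()
        if name.lower() in wanted
    }


# Hypothetical headers, as they might be copied from the Request Headers area.
browser_headers = {
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.vivino.com/explore",
    "User-Agent": "Mozilla/5.0",
}

# Start with the smallest set and see whether the API still answers.
headers = trim_headers(browser_headers)
print(headers)

# If an API turned out to need the referer as well, widen `keep`.
headers_with_referer = trim_headers(browser_headers, keep=("user-agent", "referer"))
print(headers_with_referer)
```

Keeping the full copied dictionary around and widening keep one header at a time makes the trial-and-error a little more systematic.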

The last area Query String Parameters is just a cute way of showing the query-string parameters from the Request URL in a human-friendly list of key-value pairs. Sometimes it's helpful to copy them from here, rather than from the URL.

  6. Now, click on the Preview tab, next to the Headers tab.

The Preview tab contains a pretty-printed preview of the actual data that was sent back as a result of the browser's request. In our case, this contains the JSON data sent back by the server. You can click on the little gray triangles to expand or collapse certain parts of the JSON structure, to reveal different data.

Looking at this, I can tell that the JSON response is one big dictionary, which has a key explore_vintage, whose value is another dictionary, which has a key records whose value is a list of dictionaries, where each dictionary in this list represents one wine object. Expanding the first record (the 0th one) reveals all information regarding the first wine in the list. You can explore these structures as much as you like to see what kinds of information are available to you.


def main():

    import requests

    url = "https://www.vivino.com/api/explore/explore"

    params = {
        "country_code": "DE",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "3.5",
        "order_by": "ratings_average",
        "order": "desc",
        "page": "1",
        "price_range_max": "30",
        "price_range_min": "7",
        "wine_type_ids[]": "1"
    }

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    records = response.json()["explore_vintage"]["records"]

    for record in records:
        name = record["vintage"]["name"]
        price = record["price"]["amount"]
        currency = record["price"]["currency"]["code"]
        print(f'"{name}" - Price: {price} {currency}')

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

"Varvaglione Cosimo Varvaglione Collezione Privata Primitivo di Manduria 2015" - Price: 21.9 EUR
"Masseria Borgo dei Trulli Mirea Primitivo di Manduria 2019" - Price: 19.9 EUR
"Vigneti del Salento Vigne Vecchie Primitivo di Manduria 2016" - Price: 22.95 EUR
"Vigneti del Salento Vigne Vecchie Leggenda Primitivo di Manduria 2016" - Price: 17.87 EUR
"Varvaglione Papale Linea Oro Primitivo di Manduria 2016" - Price: 18.85 EUR
"Caballo Loco Grand Cru Apalta 2014" - Price: 27.9 EUR
"Luccarelli Il Bacca Old Vine Primitivo di Manduria 2016" - Price: 20.9 EUR
"Mottura Stilio Primitivo di Manduria 2018" - Price: 12.89 EUR
"Caballo Loco Grand Cru Maipo 2015" - Price: 24.81 EUR
"Lorusso Michele Solone Primitivo 2017" - Price: 21.39 EUR
"Chateau Purcari Negru de Purcari 2017" - Price: 29.8 EUR
"San Marzano 60 Sessantanni Limited Edition Old Vines Primitivo di Manduria 2016" - Price: 22.85 EUR
"San Marzano 60 Sessantanni Old Vines Primitivo di Manduria 2016" - Price: 20.9 EUR
"San Marzano 60 Sessantanni Old Vines Primitivo di Manduria 2017" - Price: 17.775 EUR
"Lenotti Amarone della Valpolicella Classico 2015" - Price: 27.95 EUR
"Zeni Cruino Rosso Veronese 2015" - Price: 22.9 EUR
"Masseria Pietrosa Palmenti Primitivo di Manduria Vigne Vecchie 2016" - Price: 25 EUR
"Ravazzi Prezioso 2016" - Price: 29.95 EUR
"Nino Negri Sfursat Carlo Negri 2017" - Price: 23.89 EUR
"Quinta do Paral Reserva Tinto 2017" - Price: 29.24 EUR
"Wildekrans Barrel Select Reserve Pinotage 2016" - Price: 29.9 EUR
"Caballo Loco Grand Cru Limarí 2016" - Price: 27.9 EUR
"San Marzano F Negroamaro 2018" - Price: 16.9 EUR
"Atlan & Artisan 8 Vents Mallorca 2018" - Price: 19 EUR
"Schneider Rooi Olifant Red 2017" - Price: 19.5 EUR

It just seems to grab twenty-five records/wines per page, but changing the page key-value pair in the params query-string parameter dictionary will yield the records from whatever page you desire. I'm currently located in Germany, that's why my country_code and currency_code are "DE" and "EUR", but you should be able to change those to suit your needs.
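Walking through several pages could look roughly like the sketch below. Two things here are assumptions on my part, not something I have verified: that the API returns an empty records list once you go past the last page, and that the same parameters as in the script above keep working for higher page numbers. The max_pages cap stops the loop regardless. The names iter_records and fetch_vivino_page are made up for this example.

```python
def iter_records(fetch_page, max_pages=5):
    """Yield records page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        records = fetch_page(page)
        if not records:  # assumption: an empty list means there are no more pages
            break
        yield from records


def fetch_vivino_page(page):
    """Fetch one page of records, mirroring the request in the script above."""
    import requests  # third-party; imported here so iter_records has no dependency

    url = "https://www.vivino.com/api/explore/explore"
    params = {
        "country_code": "DE",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "3.5",
        "order_by": "ratings_average",
        "order": "desc",
        "page": str(page),  # the only value that changes between requests
        "price_range_max": "30",
        "price_range_min": "7",
        "wine_type_ids[]": "1",
    }
    headers = {"user-agent": "Mozilla/5.0"}
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    return response.json()["explore_vintage"]["records"]


# Offline demonstration with a stand-in fetcher and made-up data.
fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
demo = list(iter_records(lambda page: fake_pages.get(page, []), max_pages=10))
print(demo)
```

To use it against the real API, pass fetch_vivino_page instead of the stand-in, for example iter_records(fetch_vivino_page, max_pages=3), and consider pausing between requests so you don't hammer the server.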


EDIT - here are some more key-value pairs you may be interested in, though I would recommend you get familiar with how your browser's dev tools work so that you can discover these fields in the JSON yourself:

record["vintage"]["year"]
record["vintage"]["wine"]["region"]["name"]
record["vintage"]["wine"]["region"]["country"]["name"]
record["vintage"]["wine"]["taste"]["structure"]
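If you chain lookups like these, it may be worth guarding against records that lack one of the nested keys. I am assuming here that some records can be incomplete; I have not checked that against the API. The sketch below uses a hypothetical record shaped like the paths above, with made-up values, and a helper named dig that exists only in this example.

```python
def dig(data, *keys, default=None):
    """Follow `keys` through nested dictionaries; return `default` if any is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical record; the values are invented and "taste" is left out on purpose.
sample_record = {
    "vintage": {
        "year": 2016,
        "wine": {
            "region": {
                "name": "Example Region",
                "country": {"name": "Example Country"},
            }
        },
    }
}

year = dig(sample_record, "vintage", "year")
region = dig(sample_record, "vintage", "wine", "region", "name")
country = dig(sample_record, "vintage", "wine", "region", "country", "name")
structure = dig(sample_record, "vintage", "wine", "taste", "structure", default={})

print(year, region, country, structure)
```

With plain indexing, the missing "taste" key would raise a KeyError and stop the whole loop; here it just falls back to the default.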
