You're doing way more work than you have to with Selenium. When I visited the page, I logged my network traffic with Google Chrome's dev tools and saw that my browser made an HTTP GET request to a REST API. The response is JSON and contains all the wine and price information you could ever want, so you don't need to do any scraping. Just imitate that GET request with the desired query-string parameters and the correct headers. The REST API only seems to care about the `user-agent` header, which is trivial to supply.
- First, visit the URL in Google Chrome.
- Press the F12 key to open the Google Chrome dev tools menu. Click on the Network tab.
- Click on the round Record button. It should turn red, and it will start logging all network traffic in the log below. Click on the Filter button next to it, and then click on XHR. This will show only XHR (XMLHttpRequest) requests in the log. We are interested in these requests specifically, because it is via XHR that requests to APIs are typically made.
- With the Chrome dev tools menu still open, right-click (not left-click) on the page refresh button to reveal a drop-down menu. Then, click on Empty Cache and Hard Reload. This will empty your browser's cache for this page and force your browser to refresh the page. As the page is being refreshed, you should start to see some entries appearing in the traffic log.
There should now be some XHR request entries in the log. We don't know in advance which of these is the one we're interested in, so we look through them until we find one whose response contains the information we're looking for, such as data for individual wines. I happen to know that we are interested in the one that starts with `explore?...`, so let's click on that.
- Once you click on the entry, a panel will open on the right of the log. Click on the Headers tab.
This tab contains all the information regarding how this request was made. Under the General area, you can see the Request URL, which is the URL of the REST API endpoint that we made a request to. This URL might be quite long, because it will also typically contain the query-string parameters. Those are the key-value pairs that come after `explore?`, like `country_code=DE` or `currency_code=EUR`, and they are separated by `&`. The query-string parameters are important because they carry the filters that we want to apply to our query. In my code example, I've removed them from the REST API endpoint URL and moved them into the `params` dictionary instead. This step isn't required - you could also just leave them in the URL - but I find it easier to read and modify this way. The query-string parameters are also important because some APIs expect certain parameters to be present in the request, or expect them to have certain values. In other words, some APIs are very picky about their query-string parameters, and if you remove them or tamper with them in a way the API doesn't expect, it will reject your request as malformed.
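If you'd rather not retype the parameters by hand, something like the following sketch can split a copied Request URL into the bare endpoint and a `params` dictionary. It only uses the standard library. The `request_url` here is a shortened stand-in I made up from the parameters mentioned in this answer, not the full URL your browser will show.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Shortened stand-in for a Request URL copied from the Headers tab.
request_url = (
    "https://www.vivino.com/api/explore/explore"
    "?country_code=DE&currency_code=EUR&min_rating=3.5&page=1"
)

parts = urlsplit(request_url)

# Bare endpoint: same scheme, host and path, with the query string dropped.
endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Query string as a dict of key-value pairs (all values are strings).
params = dict(parse_qsl(parts.query))

# Change one filter, then rebuild the full URL to see the effect.
params["page"] = "2"
rebuilt_url = f"{endpoint}?{urlencode(params)}"

print(endpoint)
print(params)
print(rebuilt_url)
```

One caveat: `dict(...)` keeps only the last value when a key is repeated in the query string, so if a URL repeats a key such as `wine_type_ids[]`, keep the list of pairs from `parse_qsl` instead.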
In the General area, you can also see the Request Method, which in our case is `GET`. This tells us that our browser made an HTTP GET request. Not all API endpoints work the same way; some expect HTTP POST, for example. Status Code tells us what status code the server sent back. `200` means everything went OK. You can learn more about HTTP status codes here.
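If you want a quick way to look up what a status code means from Python, the standard library's `http.HTTPStatus` knows the standard reason phrases. The `describe_status` helper below is just a name I made up for this sketch; the grouping into classes follows the usual 1xx to 5xx ranges.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '200 OK (success)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a status code the standard library knows about.
        phrase = "Unknown"

    if 100 <= code < 200:
        kind = "informational"
    elif 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "non-standard"

    return f"{code} {phrase} ({kind})"


for code in (200, 404, 429, 500):
    print(describe_status(code))
```

A 4xx code usually means the request itself needs fixing (for example the headers or the query-string parameters), whereas a 5xx code points at a problem on the server's side.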
Let's take a look at the Response Headers area. This area contains all the response headers that the server sent back after the request was made. These can be useful for a browser for things like setting cookies or knowing how to interpret the data the server has sent back.
The Request Headers area contains all the headers that your browser sent to the server when it made the request. Usually, it's a good idea to copy all of these key-value pairs into a Python dictionary, `headers`, because that way you can be sure that your Python script makes the exact same request that your browser made. In practice, though, I like to trim this down as much as I can. Many APIs care about the `user-agent` field, so I usually keep that one, and some also care about the `referer`. As you work with different APIs, you'll have to figure out through trial and error which request headers each API cares about. This API happens to care only about the `user-agent`.
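The trial-and-error part can be automated, roughly like this. This is only a sketch: `minimal_headers` and `fake_api_accepts` are names I invented, the header values are placeholders rather than what your browser actually sends, and the fake check stands in for a real request so that the example runs offline.

```python
from typing import Callable, Dict


def minimal_headers(
    headers: Dict[str, str],
    is_accepted: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Drop headers one at a time, keeping only those the API insists on."""
    needed = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if is_accepted(trial):
            # The request still works without this header, so leave it out.
            needed = trial
    return needed


def fake_api_accepts(headers: Dict[str, str]) -> bool:
    """Offline stand-in for an API that only checks for a user-agent."""
    return "user-agent" in headers


# Placeholder values; copy the real ones from the Request Headers area.
browser_headers = {
    "accept": "application/json",
    "accept-language": "de-DE,de;q=0.9",
    "referer": "https://www.vivino.com/",
    "user-agent": "Mozilla/5.0",
}

print(minimal_headers(browser_headers, fake_api_accepts))
```

Against a real API, `is_accepted` could be a small function that sends the request with the trial headers and returns `response.ok`. Be gentle with that, though, since it sends one request per header.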
The last area, Query String Parameters, is just a cute way of showing the query-string parameters from the Request URL as a human-friendly list of key-value pairs. Sometimes it's helpful to copy them from here, rather than from the URL.
- Now, click on the Preview tab, next to the Headers tab.
The Preview tab contains a pretty-printed preview of the actual data that was sent back as a result of the browser's request. In our case, this contains the JSON data sent back by the server. You can click on the little gray triangles to expand or collapse certain parts of the JSON structure, to reveal different data.
Looking at this, I can tell that the JSON response is one big dictionary, which has a key `explore_vintage`, whose value is another dictionary, which has a key `records`, whose value is a list of dictionaries, where each dictionary in this list represents one wine object. Expanding the first record (the 0th one) reveals all information regarding the first wine in the list. You can explore these structures as much as you like to see what kinds of information are available to you.
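To make that nesting concrete, here is a tiny hand-written stand-in with the same shape. It is not the real response, which has many more keys per record; I've only filled in the few fields this answer uses, with two wines borrowed from the output further down.

```python
# Hand-written stand-in with the same nesting as the response described above.
data = {
    "explore_vintage": {
        "records": [
            {
                "vintage": {"name": "Ravazzi Prezioso 2016"},
                "price": {"amount": 29.95, "currency": {"code": "EUR"}},
            },
            {
                "vintage": {"name": "Zeni Cruino Rosso Veronese 2015"},
                "price": {"amount": 22.9, "currency": {"code": "EUR"}},
            },
        ]
    }
}

# Dictionary -> dictionary -> list of dictionaries, one per wine.
records = data["explore_vintage"]["records"]
first = records[0]

print(len(records))              # number of wines in the list
print(sorted(first))             # top-level keys of one record
print(first["vintage"]["name"])  # drill down to a single field
```

Each click on a gray triangle in the Preview tab corresponds to one more `[...]` lookup in Python.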

    import requests


    def main():
        # REST API endpoint, with the query string moved into `params` below.
        url = "https://www.vivino.com/api/explore/explore"

        params = {
            "country_code": "DE",
            "currency_code": "EUR",
            "grape_filter": "varietal",
            "min_rating": "3.5",
            "order_by": "ratings_average",
            "order": "desc",
            "page": "1",
            "price_range_max": "30",
            "price_range_min": "7",
            "wine_type_ids[]": "1"
        }

        # This API only seems to care about the user-agent header.
        headers = {
            "user-agent": "Mozilla/5.0"
        }

        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()

        # One dictionary per wine.
        records = response.json()["explore_vintage"]["records"]

        for record in records:
            name = record["vintage"]["name"]
            price = record["price"]["amount"]
            currency = record["price"]["currency"]["code"]
            print(f'"{name}" - Price: {price} {currency}')

        return 0


    if __name__ == "__main__":
        import sys
        sys.exit(main())

Output:
"Varvaglione Cosimo Varvaglione Collezione Privata Primitivo di Manduria 2015" - Price: 21.9 EUR
"Masseria Borgo dei Trulli Mirea Primitivo di Manduria 2019" - Price: 19.9 EUR
"Vigneti del Salento Vigne Vecchie Primitivo di Manduria 2016" - Price: 22.95 EUR
"Vigneti del Salento Vigne Vecchie Leggenda Primitivo di Manduria 2016" - Price: 17.87 EUR
"Varvaglione Papale Linea Oro Primitivo di Manduria 2016" - Price: 18.85 EUR
"Caballo Loco Grand Cru Apalta 2014" - Price: 27.9 EUR
"Luccarelli Il Bacca Old Vine Primitivo di Manduria 2016" - Price: 20.9 EUR
"Mottura Stilio Primitivo di Manduria 2018" - Price: 12.89 EUR
"Caballo Loco Grand Cru Maipo 2015" - Price: 24.81 EUR
"Lorusso Michele Solone Primitivo 2017" - Price: 21.39 EUR
"Chateau Purcari Negru de Purcari 2017" - Price: 29.8 EUR
"San Marzano 60 Sessantanni Limited Edition Old Vines Primitivo di Manduria 2016" - Price: 22.85 EUR
"San Marzano 60 Sessantanni Old Vines Primitivo di Manduria 2016" - Price: 20.9 EUR
"San Marzano 60 Sessantanni Old Vines Primitivo di Manduria 2017" - Price: 17.775 EUR
"Lenotti Amarone della Valpolicella Classico 2015" - Price: 27.95 EUR
"Zeni Cruino Rosso Veronese 2015" - Price: 22.9 EUR
"Masseria Pietrosa Palmenti Primitivo di Manduria Vigne Vecchie 2016" - Price: 25 EUR
"Ravazzi Prezioso 2016" - Price: 29.95 EUR
"Nino Negri Sfursat Carlo Negri 2017" - Price: 23.89 EUR
"Quinta do Paral Reserva Tinto 2017" - Price: 29.24 EUR
"Wildekrans Barrel Select Reserve Pinotage 2016" - Price: 29.9 EUR
"Caballo Loco Grand Cru Limarí 2016" - Price: 27.9 EUR
"San Marzano F Negroamaro 2018" - Price: 16.9 EUR
"Atlan & Artisan 8 Vents Mallorca 2018" - Price: 19 EUR
"Schneider Rooi Olifant Red 2017" - Price: 19.5 EUR
It seems to return twenty-five records (wines) per page, but changing the `page` key-value pair in the `params` dictionary will yield the records from whatever page you want. I'm currently located in Germany, which is why my `country_code` and `currency_code` are `"DE"` and `"EUR"`, but you should be able to change those to suit your needs.
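If you want more than one page, a loop over the `page` parameter could look roughly like this. It's a sketch under a couple of assumptions: `iter_records` and `fake_fetch_page` are names I made up, and I'm assuming that a page past the end comes back with an empty `records` list, which I haven't verified for this API. The fake fetcher is there so the example runs without touching the network.

```python
from typing import Callable, Dict, Iterator, List


def iter_records(
    fetch_page: Callable[[Dict[str, str]], List[dict]],
    params: Dict[str, str],
    max_pages: int = 3,
) -> Iterator[dict]:
    """Yield records page by page, stopping early on an empty page."""
    for page in range(1, max_pages + 1):
        # Copy the params so the caller's dictionary is left untouched.
        page_params = {**params, "page": str(page)}
        records = fetch_page(page_params)
        if not records:
            # Assumption: an empty list means there are no more pages.
            break
        yield from records


def fake_fetch_page(page_params: Dict[str, str]) -> List[dict]:
    """Offline stand-in that pretends the API has two pages of results."""
    fake_pages = {"1": [{"id": 1}, {"id": 2}], "2": [{"id": 3}]}
    return fake_pages.get(page_params["page"], [])


base_params = {"country_code": "DE", "currency_code": "EUR", "page": "1"}
all_records = list(iter_records(fake_fetch_page, base_params, max_pages=5))
print(all_records)
```

A real `fetch_page` would wrap the `requests.get` call from my code example and return `response.json()["explore_vintage"]["records"]`. Keep `max_pages` small and consider sleeping between requests so that you don't hammer the server.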
EDIT - here are some more key-value pairs you may be interested in, though I would recommend you get familiar with how your browser's dev tools work so that you can discover these fields in the JSON yourself:
record["vintage"]["year"]
record["vintage"]["wine"]["region"]["name"]
record["vintage"]["wine"]["region"]["country"]["name"]
record["vintage"]["wine"]["taste"]["structure"]
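I don't know whether every record carries every one of these fields, so in case some are missing, a small helper that walks the keys and falls back to a default can save you from a `KeyError`. `dig` is a name I made up, and the sample record below is a placeholder with invented values, not real API data.

```python
def dig(record, *keys, default=None):
    """Follow a chain of dictionary keys, returning `default` if any is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Placeholder record: same nesting as the fields above, invented values,
# and deliberately without a "taste" entry.
record = {
    "vintage": {
        "year": 2016,
        "wine": {
            "region": {
                "name": "Example Region",
                "country": {"name": "Example Country"},
            }
        },
    }
}

print(dig(record, "vintage", "year"))
print(dig(record, "vintage", "wine", "region", "country", "name"))
print(dig(record, "vintage", "wine", "taste", "structure", default="n/a"))
```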