python - Navigate through all the search results pages with BeautifulSoup

深蓝 · Answer 1 · 2022-01-31T07:23:36+0000

Beautiful Soup only gives you the tools for parsing a page. How to navigate from page to page is something you need to work out yourself, in a flow-diagram sense.

Taking the page you mentioned and clicking through a few of the result pages, it seems that when we are on page 1, no page number is shown in the URL.

htt...ru/moskva/transport

and we see in the source of the page:

<div class="pagination-pages clearfix">
   <span class="pagination-page pagination-page_current">1</span>
   <a class="pagination-page" href="/moskva/transport?p=2">2</a>

Let's check what happens when we go to page 2:

ht...ru/moskva/transport?p=2

<div class="pagination-pages clearfix">
  <a class="pagination-page" href="/moskva/transport">1</a>
  <span class="pagination-page pagination-page_current">2</span>
  <a class="pagination-page" href="/moskva/transport?p=3">3</a>
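
As a rough sketch of how that layout can be read with Beautiful Soup, the snippet below parses the page 2 markup shown above and pulls out the current page number and the links to the other pages. The closing `</div>` is added here so the fragment is complete, and the helper names `current_page` and `page_links` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Pagination markup as seen on page 2 (closing </div> added for completeness).
PAGE_2_HTML = """
<div class="pagination-pages clearfix">
  <a class="pagination-page" href="/moskva/transport">1</a>
  <span class="pagination-page pagination-page_current">2</span>
  <a class="pagination-page" href="/moskva/transport?p=3">3</a>
</div>
"""


def current_page(html):
    """Return the currently selected page number, or None if no marker is found."""
    soup = BeautifulSoup(html, "html.parser")
    marker = soup.select_one(".pagination-page_current")
    return int(marker.get_text(strip=True)) if marker else None


def page_links(html):
    """Return the href of every clickable (non-current) pagination entry."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.pagination-page")]


print(current_page(PAGE_2_HTML))
print(page_links(PAGE_2_HTML))
```

Note that the current page is a `span` rather than an `a`, which is why the `a.pagination-page` selector skips it.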

Perfect, now we have the layout. One more thing to know before we make our soup: what happens when we go to a page past the last available page? At the time of this writing the last page was 40161.

ht...ru/moskva/transport?p=40161
we change this to:
ht...ru/moskva/transport?p=40162

The page seems to go back to page 1 automatically. Great!

So now we have everything we need to make our soup loop.

Instead of clicking "next" each time, just build the URL yourself. You know the elements required:

url = ht...ru/moskva/$searchterm?p=$pagenum

I'm assuming transport is the search term? I don't know, I can't read Russian. But you get the idea: construct the URL, then do a requests call.

import bs4
import requests

response = requests.get(url)
mysoup = bs4.BeautifulSoup(response.text, "html.parser")  # name the parser explicitly
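
For the URL construction step, a minimal sketch might look like the following. `BASE_URL` is a placeholder because the real host is abbreviated above, and the `build_url` helper and its `city` parameter are assumptions for illustration. Page 1 is built without the `p` parameter, matching what the site itself shows.

```python
from urllib.parse import urlencode

# Placeholder host; substitute the real site root (it is abbreviated in this answer).
BASE_URL = "https://example.com"


def build_url(search_term, page_num, city="moskva"):
    """Build the listing URL for a given page; page 1 carries no ?p= parameter."""
    path = f"{BASE_URL}/{city}/{search_term}"
    if page_num <= 1:
        return path
    return f"{path}?{urlencode({'p': page_num})}"


print(build_url("transport", 1))
print(build_url("transport", 2))
```

The string this returns is what would be passed to `requests.get`.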

Now you can wrap that whole thing in a while loop, and on every request except the first, check:

mysoup.select('.pagination-page_current')[0].text.strip() == '1'

This says: each time we get the page, find the currently selected page using the class pagination-page_current. select() is a method that returns a list, so we take the first element with [0], get its text with .text, and see if it equals '1'. The text is a string, so compare it against '1' rather than the integer 1.

This should only be true in two cases: the first page you request, and the request just past the last page, which wraps back to page 1. So you can use it to start and stop the script, or however you want.
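
Putting the stop condition into a loop could look roughly like this. It is a sketch, not a definitive implementation: `crawl`, `fake_fetch`, and the `max_pages` safety cap are invented names, and `fake_fetch` stands in for the real site by pretending there are three pages and wrapping back to page 1 after that. Treating a missing page marker as a stop signal is also an assumption.

```python
from bs4 import BeautifulSoup


def current_page(html):
    """Return the currently selected page number, or None if no marker is found."""
    marker = BeautifulSoup(html, "html.parser").select_one(".pagination-page_current")
    return int(marker.get_text(strip=True)) if marker else None


def crawl(fetch, max_pages=100000):
    """Yield (page_num, html) until the site wraps back around to page 1.

    fetch is any callable that maps a page number to the page's HTML text.
    """
    page_num = 1
    while page_num <= max_pages:
        html = fetch(page_num)
        # After the first request, seeing page 1 again (or no marker at all)
        # means we have gone past the last page.
        if page_num > 1 and current_page(html) in (None, 1):
            break
        yield page_num, html
        page_num += 1


def fake_fetch(page_num, last_page=3):
    """Stand-in for the real site: 3 pages, anything beyond shows page 1."""
    shown = page_num if page_num <= last_page else 1
    return (
        '<div class="pagination-pages clearfix">'
        f'<span class="pagination-page pagination-page_current">{shown}</span>'
        "</div>"
    )


pages = [n for n, _ in crawl(fake_fetch)]
print(pages)  # [1, 2, 3]
```

In real use, `fetch` would wrap the `requests.get` call on the constructed URL and return the response text. Adding a short delay between requests is probably a good idea with tens of thousands of pages.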

this should be everything you need to do this properly. :)
