Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Navigate through all the search results pages with BeautifulSoup

I can't seem to grasp how to make BeautifulSoup parse every page by following the "Next page" link until the last page, and stop parsing when no "Next page" link is found, on a site like this:

enter link description here

I tried looking up the Next button's element name and I can locate it with 'find', but I don't know how to repeat that in a loop until all pages are scraped.

Thank you


Fight dragons too long and you become a dragon yourself; gaze into the abyss too long and the abyss gazes back…

1 Reply


Beautiful Soup only gives you the tools; how to navigate between pages is something you need to work out yourself, in a flow-diagram sense.

Taking the page you mentioned and clicking through a few of its pages, it seems that when we are on page 1, no page number is shown in the URL:

htt...ru/moskva/transport

and we see in the source of the page:

<div class="pagination-pages clearfix">
   <span class="pagination-page pagination-page_current">1</span>
   <a class="pagination-page" href="/moskva/transport?p=2">2</a>

Let's check what happens when we go to page 2:

ht...ru/moskva/transport?p=2

<div class="pagination-pages clearfix">
  <a class="pagination-page" href="/moskva/transport">1</a>
  <span class="pagination-page pagination-page_current">2</span>
  <a class="pagination-page" href="/moskva/transport?p=3">3</a>
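Since the question asked about finding the "Next" link, here is a rough sketch of reading that markup with BeautifulSoup. The function name read_pagination is made up for illustration, the closing </div> is added so the snippet is complete, and the idea that the last page has no link after the current-page marker is an assumption, not something verified against the site.

```python
from bs4 import BeautifulSoup

# Pagination markup for page 2, copied from the snippet above
# (closing </div> added so the fragment is complete).
PAGE_2_HTML = """
<div class="pagination-pages clearfix">
  <a class="pagination-page" href="/moskva/transport">1</a>
  <span class="pagination-page pagination-page_current">2</span>
  <a class="pagination-page" href="/moskva/transport?p=3">3</a>
</div>
"""


def read_pagination(html):
    """Return (current page text, href of the link right after it, or None)."""
    soup = BeautifulSoup(html, "html.parser")
    current = soup.select_one(".pagination-page_current")
    if current is None:
        # No pagination marker found at all.
        return None, None
    # The next <a class="pagination-page"> sibling, if any, points at the following page.
    next_link = current.find_next_sibling("a", class_="pagination-page")
    next_href = next_link["href"] if next_link is not None else None
    return current.get_text(strip=True), next_href


current_page, next_href = read_pagination(PAGE_2_HTML)
print(current_page, next_href)
```

If next_href comes back as None, that would be one possible signal to stop, assuming the site really omits the following link on its last page.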

Perfect, now we have the layout. One more thing to know before we make our beautiful soup: what happens when we go to a page past the last available page, which at the time of this writing was 40161?

ht...ru/moskva/transport?p=40161
We change this to:
ht...ru/moskva/transport?p=40162

The page seems to go back to page 1 automatically. Great!

So now we have everything we need to make our soup loop.

Instead of clicking Next each time, just build the URL yourself. You know the elements required:

url = ht...ru/moskva/$searchterm?p=$pagenum

I'm assuming transport is the search term; I can't read Russian, so I'm not sure. But you get the idea: construct the URL, then make a requests call:

import requests
import bs4
response = requests.get(url)
mysoup = bs4.BeautifulSoup(response.text, "html.parser")  # name the parser explicitly

And now you can wrap that whole thing in a while loop, and each time except the first time check:

mysoup.select('.pagination-page_current')[0].text.strip() == '1'

This says: each time we get a page, find the currently selected page by using the class pagination-page_current. select() returns a list, so we take the first element with [0], get its text with .text, and see if it equals '1'. The text is a string, so compare it against '1' rather than the number 1.

This should only be true in two cases: the first page you request, and a request past the last page, since the site falls back to page 1. So you can use this to start and stop the script, or however you want.

This should be everything you need to do this properly. :)
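Putting the pieces above together, a minimal sketch of the loop could look like this. The host is elided in this post, so BASE_URL is only a placeholder. The names build_url, fetch_html, current_page and iter_pages are made up for illustration, and the max_pages cap and the "stop if no marker is found" rule are extra safety nets I am assuming are reasonable, not something the site requires.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # placeholder: the real host is elided in the post


def build_url(search_term, page_num):
    """Page 1 has no query string; later pages use ?p=N, as seen in the markup."""
    url = f"{BASE_URL}/moskva/{search_term}"
    return url if page_num == 1 else f"{url}?p={page_num}"


def fetch_html(url):
    """Download one page and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def current_page(soup):
    """Text of the highlighted page number, or None if there is no marker."""
    marker = soup.select_one(".pagination-page_current")
    return marker.get_text(strip=True) if marker is not None else None


def iter_pages(search_term, fetch=fetch_html, max_pages=100000):
    """Yield (page_num, soup) until the site falls back to page 1."""
    page_num = 1
    while page_num <= max_pages:
        soup = BeautifulSoup(fetch(build_url(search_term, page_num)), "html.parser")
        # Past the last page the site showed page 1 again, so that is the stop signal.
        # A missing marker is also treated as "stop" (an assumption).
        if page_num > 1 and current_page(soup) in (None, "1"):
            break
        yield page_num, soup
        page_num += 1
```

Usage would be something like `for page_num, soup in iter_pages("transport"): ...`, with your own parsing of each soup inside the loop. The fetch argument is there so the loop can be tried against canned HTML before pointing it at the real site.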

