Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Scraping contents of multiple web pages of a website using BeautifulSoup and Selenium

The website I want to scrape is:

http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061

I want to get the last page number shown in the pagination of the above link so that I can proceed from there. It was 499 when I took the screenshot.

Screenshot showing the last page number, which is what I get as my output as of now.

My code :

   from bs4 import BeautifulSoup 
   from urllib.request import urlopen as uReq
   from selenium import webdriver
   import time
   from selenium.webdriver.common.by import By
   from selenium.webdriver.support.ui import WebDriverWait
   from selenium.webdriver.support import expected_conditions as EC
   from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

   firefox_capabilities = DesiredCapabilities.FIREFOX
   firefox_capabilities['marionette'] = True
   firefox_capabilities['binary'] = '/etc/firefox'

   driver = webdriver.Firefox(capabilities=firefox_capabilities)
   url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"

   driver.get(url)
   wait = WebDriverWait(driver, 10)
   soup=BeautifulSoup(driver.page_source,"lxml")
   containers = soup.findAll("ul",{"class":"pages table"})
   containers[0] = soup.findAll("li")
   li_len = len(containers[0])
   for item in soup.find("ul",{"class":"pages table"}) : 
       li_text = item.select("li")[li_len].text
       print("li_text : {}\n".format(li_text))
   driver.quit()

I need help figuring out the error in my code for getting the last page number. I would also be grateful if someone could give an alternative solution and suggest other ways to achieve this.


1 Reply


To get the last page number from the above link, which is 499, you can use either Selenium or BeautifulSoup, as follows:


Selenium :

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
driver.get(url)
# Scroll the pagination block into view before reading the page links.
element = driver.find_element(By.XPATH, "//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]")
driver.execute_script("return arguments[0].scrollIntoView(true);", element)
# The last <li> of the pager holds the final page number.
print(driver.find_element(By.XPATH, "//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML"))
driver.quit()

Console Output :

499
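
Since the goal is to proceed through the remaining pages, the number can then be turned into a list of page URLs. The following is only a sketch: the helper name page_urls is made up for illustration, and the "-page-N" suffix is an assumption about how the site addresses later pages, so check it against a real link on page 2 before relying on it. Note that get_attribute("innerHTML") returns a string, so it has to be converted with int() first.

```python
def page_urls(base_url, last_page):
    """Yield one URL per review page, from page 1 to last_page inclusive."""
    for page in range(1, last_page + 1):
        # Assumption: page 1 is the base URL and later pages append "-page-N".
        yield base_url if page == 1 else "{}-page-{}".format(base_url, page)


base = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
last_page = int("499")  # the string read from the pager, converted to an integer
urls = list(page_urls(base, last_page))
print(len(urls))  # 499
print(urls[1])    # the base URL followed by "-page-2"
```

Each URL can then be passed to driver.get() in a loop, ideally with a short pause between requests.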

BeautifulSoup :

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
# Download the raw HTML of the first review page.
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
# The pager is the <ul> whose class attribute is "pages table".
container = page_soup.find("ul", {"class": "pages table"})
all_li = container.find_all("li") if container else []
# The last <li> carries the final page number.
if all_li:
    print(all_li[-1].get_text())

Console Output :

499
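
As for the error in the code from the question, three things appear to go wrong. First, containers[0] = soup.findAll("li") replaces the pager with every li on the whole page. Second, indexing a list with its own length (li_len) is always one past the end, so it raises IndexError; the last item is at index -1. Third, iterating directly over a tag yields its direct children, which can include text nodes as well as tags. The sketch below shows the corrected indexing on a small inline sample. The sample markup is hypothetical and only mimics the class names used above, and the function name last_page_number is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used above;
# the real page's pagination HTML may differ.
SAMPLE_HTML = """
<ul class="pagination table">
  <li><ul class="pages table">
    <li><a href="#">1</a></li>
    <li><a href="#">2</a></li>
    <li><a href="#">499</a></li>
  </ul></li>
</ul>
"""


def last_page_number(html):
    """Return the number in the last <li> of the pager, or None if it is missing."""
    page_soup = BeautifulSoup(html, "html.parser")
    container = page_soup.find("ul", {"class": "pages table"})
    if container is None:
        return None
    items = container.find_all("li")
    if not items:
        return None
    # items[len(items)] would raise IndexError; items[-1] is the last entry.
    text = items[-1].get_text(strip=True)
    return int(text) if text.isdigit() else None


print(last_page_number(SAMPLE_HTML))  # 499
```

With Selenium, the same function could be called on driver.page_source instead of the sample string.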


...