python - selenium with scrapy for dynamic page

Question

Welcome To Ask or Share your Answers For Others

python - selenium with scrapy for dynamic page

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - selenium with scrapy for dynamic page

I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:

starts with a product_list page with 10 products
a click on "next" button loads the next 10 products (url doesn't change between the two pages)
i use LinkExtractor to follow each product link into the product page, and get all the information I need

I tried to replicate the next-button-ajax-call but can't get working, so I'm giving selenium a try. I can run selenium's webdriver in a separate script, but I don't know how to integrate with scrapy. Where shall I put the selenium part in my scrapy spider?

My spider is pretty standard, like the following:

class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'), callback='parse_product'),
        ]

    def parse_product(self, response):
        self.log("parsing product %s" %response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows

Any idea is appreciated. Thank you!

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T21:20:09+0000

It really depends on how do you need to scrape the site and how and what data do you want to get.

Here's an example how you can follow pagination on ebay using Scrapy+Selenium:

import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')

            try:
                next.click()

                # get the data and write it to scrapy items
            except:
                break

        self.driver.close()

Here are some examples of "selenium spiders":

There is also an alternative to having to use Selenium with Scrapy. In some cases, using ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage:

Scraping dynamic content using python-Scrapy

Categories

python - selenium with scrapy for dynamic page

python - selenium with scrapy for dynamic page

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags