python - Unable to make my script stop when some urls are scraped

Posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I've created a script in Scrapy to parse the titles of the different sites listed in start_urls. The script does its job flawlessly.

What I want now is for the script to stop after two of the URLs have been parsed, no matter how many URLs are listed.

This is what I've tried so far:

import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]

    def parse(self, response):
        yield {'title':response.css('title::text').get()}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 
    })
    c.crawl(TitleSpider)
    c.start()

How can I make my script stop once two of the listed URLs have been scraped?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:57:36+0000

As Gallaecio proposed, you can add a counter. The difference here is that the item is yielded after the if check rather than before it, so the spider will almost always end up exporting exactly 2 items.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider


class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    item_limit = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.counter = 0

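    def parse(self, response):
        # Count every response that reaches this callback.
        self.counter += 1
        # Once the limit is exceeded, ask Scrapy to shut the spider down.
        if self.counter > self.item_limit:
            raise CloseSpider

        # Only reached while the counter is within the limit.
        yield {'title': response.css('title::text').get()}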
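
To see what the counter does without touching the network, the same check can be run against stand-in objects. This is only a sketch: FakeResponse, TitleCounter, run and the local CloseSpider class are hypothetical placeholders (not Scrapy's own classes), and the loop feeds responses strictly one at a time, which is an assumption about ordering rather than a model of how Scrapy schedules callbacks.

```python
class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider (hypothetical, illustration only)."""


class FakeResponse:
    """Hypothetical placeholder for a Scrapy response; it only carries a title."""

    def __init__(self, title):
        self.title = title


class TitleCounter:
    """Mirrors the counter logic of TitleSpider.parse, without Scrapy."""

    item_limit = 2

    def __init__(self):
        self.counter = 0

    def parse(self, response):
        self.counter += 1
        if self.counter > self.item_limit:
            raise CloseSpider
        yield {"title": response.title}


def run(titles):
    """Feed responses one at a time; stop collecting when CloseSpider is raised."""
    spider = TitleCounter()
    items = []
    for title in titles:
        try:
            items.extend(spider.parse(FakeResponse(title)))
        except CloseSpider:
            break
    return items


exported = run(["Google", "Yahoo", "Bing"])
print(exported)  # [{'title': 'Google'}, {'title': 'Yahoo'}]
```

Because the item is yielded only after the check passes, the third response raises before anything is exported for it.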

Why "almost always", you may ask? It has to do with a possible race condition in the parse method.

Imagine that self.counter is currently equal to 1, which means that one more item is expected to be exported. Now suppose Scrapy receives two responses at the same moment and invokes the parse method for both of them. If both invocations increment the counter before either one reaches the if check, they will both see self.counter equal to 3 and will both raise the CloseSpider exception.

In this case (which is very unlikely, but can still happen), the spider will export only one item.
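
The scenario described above can be replayed deterministically by splitting parse into its two steps (increment, then check) and choosing the order by hand. This is a sketch of the hypothetical interleaving only: the names Counter, increment, check_and_export and replay are invented for illustration, CloseSpider is a local stand-in, and the code does not show whether Scrapy would ever schedule callbacks this way.

```python
ITEM_LIMIT = 2


class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider (hypothetical, illustration only)."""


class Counter:
    def __init__(self, start):
        self.value = start


def increment(counter):
    # First half of parse: bump the shared counter.
    counter.value += 1


def check_and_export(counter, title, exported):
    # Second half of parse: compare against the limit, then export.
    if counter.value > ITEM_LIMIT:
        raise CloseSpider
    exported.append({"title": title})


def replay(order):
    """Run the given (step, title) sequence, starting with one item already exported."""
    counter = Counter(1)
    exported = [{"title": "first"}]
    closed = 0
    for step, title in order:
        if step == "inc":
            increment(counter)
        else:
            try:
                check_and_export(counter, title, exported)
            except CloseSpider:
                closed += 1
    return exported, closed


# Each response finishes both steps before the next one starts.
sequential = replay([("inc", "A"), ("check", "A"), ("inc", "B"), ("check", "B")])
# Both responses increment before either one checks.
interleaved = replay([("inc", "A"), ("inc", "B"), ("check", "A"), ("check", "B")])

print(len(sequential[0]), sequential[1])    # 2 1
print(len(interleaved[0]), interleaved[1])  # 1 2
```

In the sequential order two items are exported and CloseSpider is raised once; in the interleaved order both checks see 3, both raise, and only the first item survives.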
