python - Easiest way to run scrapy crawler so it doesn't block the script

Question

Welcome To Ask or Share your Answers For Others

python - Easiest way to run scrapy crawler so it doesn't block the script

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Easiest way to run scrapy crawler so it doesn't block the script

The official docs give many ways for running scrapy crawlers from code:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

But all of them block script until crawling is finished. What's the easiest way in python to run the crawler in a non-blocking, async manner?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:40:14+0000

I tried every solution I could find, and the only working for me was this. But in order to make it work with scrapy 1.1rc1 I had to tweak it a little bit:

from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from billiard import Process

class CrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(spider.__class__, settings)
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        reactor.run()

def crawl_async():
    spider = MySpider()
    crawler = CrawlerScript(spider)
    crawler.start()
    crawler.join()

So now when I call crawl_async, it starts crawling and doesn't block my current thread. I'm absolutely new to scrapy, so may be this isn't a very good solution but it worked for me.

I used these versions of the libraries:

cffi==1.5.0
Scrapy==1.1rc1
Twisted==15.5.0
billiard==3.3.0.22

Categories

python - Easiest way to run scrapy crawler so it doesn't block the script

python - Easiest way to run scrapy crawler so it doesn't block the script

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags