Getting Scrapy to run on a schedule is driving me around the Twist(ed).
I thought the test code below would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time:
import time

import schedule
from scrapy.crawler import CrawlerProcess

from quotesbot.spiders.quotes import QuotesSpider


def run_spider_script():
    process.crawl(QuotesSpider)
    process.start()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})

schedule.every(5).seconds.do(run_spider_script)

while True:
    schedule.run_pending()
    time.sleep(1)
My guess is that CrawlerProcess tries to start the Twisted reactor again on the second run, which is not allowed once the reactor has already been run and stopped, so the program crashes. Is there any way I can control this?
Also, at this stage, if there's an alternative way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either:
import time

import schedule
from scrapy import cmdline
from scrapy.crawler import CrawlerProcess

from quotesbot.spiders.quotes import QuotesSpider


def run_spider_cmd():
    print("Running spider")
    cmdline.execute("scrapy crawl quotes".split())


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})

schedule.every(5).seconds.do(run_spider_cmd)

while True:
    schedule.run_pending()
    time.sleep(1)
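
One alternative that sidesteps the reactor problem is to launch each crawl in a fresh child process, so every run gets its own interpreter and its own reactor. It would also avoid the fact that cmdline.execute() finishes by calling sys.exit(), which is likely why it cannot loop inside a single interpreter. The following is only a minimal sketch using the standard library: the helper names run_in_fresh_process and run_repeatedly and the QUOTES_CMD constant are made up for illustration, and it assumes the script is started from the Scrapy project directory with a spider named "quotes".

```python
import subprocess
import sys
import time

# Assumption: run from the Scrapy project directory, where a spider
# named "quotes" exists. "python -m scrapy" mirrors the "scrapy" command.
QUOTES_CMD = [sys.executable, "-m", "scrapy", "crawl", "quotes"]


def run_in_fresh_process(cmd):
    """Run cmd as a child process and return its exit code."""
    # The child has its own interpreter, so a reactor that was stopped
    # in a previous run never has to be restarted.
    return subprocess.run(cmd, check=False).returncode


def run_repeatedly(cmd, interval_seconds, runs):
    """Run cmd `runs` times, sleeping between runs; return the exit codes."""
    exit_codes = []
    for i in range(runs):
        if i:
            time.sleep(interval_seconds)
        exit_codes.append(run_in_fresh_process(cmd))
    return exit_codes


# Hypothetical usage: three crawls, five seconds apart.
# run_repeatedly(QUOTES_CMD, interval_seconds=5, runs=3)
```

The trade-off is the start-up cost of a new Python process per crawl, which should be negligible for a job that runs once a day.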
EDIT
Adding code that uses Twisted's task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way if the goal is a spider that runs at the same time each day?
import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from twisted.internet import task


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()
            print(author, text)


def run_crawl():
    runner = CrawlerRunner()
    runner.crawl(QuotesSpider)


l = task.LoopingCall(run_crawl)
l.start(3)

reactor.run()
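
For the "same time each day" part, a fixed interval alone does not pin the run to a clock time; the first run would need a delay computed from the current time. Below is a minimal sketch of that calculation using only the standard library. The function name seconds_until is hypothetical, and it uses naive local datetimes, so daylight-saving changes are not handled.

```python
from datetime import datetime, timedelta


def seconds_until(hour, minute, now=None):
    """Return the seconds from `now` until the next hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed (or is right now), use tomorrow's.
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()
```

The result could presumably be handed to reactor.callLater() as the delay before the first crawl, with each later run scheduled the same way, though I have not verified that arrangement end to end.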