Before you link me to other answers related to this, note that I've read them and am still a bit confused. Alrighty, here we go.
So I am building a web app in Django and importing the latest Scrapy library to crawl a website. I am not using Celery (I know very little about it, but I have seen it come up in other questions on this topic).
One of the URLs of our website, /crawl/, is meant to start the crawler running. It's the only URL in our site that requires Scrapy. Here is the view function that is called when the URL is visited:
from django.shortcuts import render
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

def crawl(request):
    # ReviewSpider is our Scrapy spider, imported from our spiders module
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(ReviewSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
    return render(request, 'index.html')
You'll notice that this is an adaptation of the example in the Scrapy documentation. The first time this URL is visited after the server starts, everything works as intended. On the second and subsequent visits, a ReactorNotRestartable exception is thrown. I understand that this exception happens when a reactor that has already been stopped is told to start again, which is not possible.
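Just to show what I mean, the same thing seems to happen with plain Twisted and no Scrapy involved at all (a minimal sketch based on my understanding):

# A Twisted reactor cannot be started again once it has been stopped.
from twisted.internet import reactor
from twisted.internet.error import ReactorNotRestartable

reactor.callLater(0, reactor.stop)  # schedule an immediate stop
reactor.run()                       # first run: starts, then stops cleanly

try:
    reactor.run()                   # second run: the reactor refuses to restart
except ReactorNotRestartable:
    print("same exception I get on the second visit to /crawl/")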
Looking at the sample code, I assumed the line "runner = CrawlerRunner()" would give me a ~new~ reactor to use each time this URL is visited. But perhaps my understanding of Twisted reactors is not completely clear.
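If I've understood the other answers correctly, the reactor is actually a process-wide singleton that lives in the twisted.internet.reactor module, and CrawlerRunner just uses whatever reactor is already installed rather than creating a new one. A quick sketch of what I mean (assuming my reading is right):

# Sketch: checking whether CrawlerRunner gives me a fresh reactor (it doesn't seem to).
from twisted.internet import reactor as reactor_before
from scrapy.crawler import CrawlerRunner

runner_one = CrawlerRunner()
runner_two = CrawlerRunner()

from twisted.internet import reactor as reactor_after

# Both imports refer to the same process-wide reactor object,
# no matter how many CrawlerRunner instances are created.
print(reactor_before is reactor_after)  # True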
How would I go about getting and running a NEW reactor each time this URL is visited?
Thank you so much