We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines that process the items passed by all of them.
One of the pipeline components queries Google's servers to geocode addresses.
Google imposes a limit of 2500 requests per day per IP address, and threatens to ban an IP address if it keeps querying after Google has responded with the warning message 'OVER_QUERY_LIMIT'.
Hence I want to know about any mechanism I can invoke from within the pipeline that will completely and immediately stop all further crawling/processing by all spiders and also the main engine.
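To make the scenario concrete, here is a minimal sketch of such a pipeline stage. It is not the actual project code: `GeocodingPipeline`, the injected `geocode` callable, and the `quota_exhausted` flag are hypothetical names, and the response is assumed to be a dict with 'status' and 'results' keys. It only shows where the warning is detected and how the pipeline itself can avoid sending further requests; it does not stop the spiders or the engine, which is the part I am asking about.

```python
class GeocodingPipeline(object):
    """Illustrative pipeline stage; all names here are hypothetical."""

    def __init__(self, geocode):
        # Assumption: `geocode` is a callable taking an address string and
        # returning a dict with 'status' and 'results' keys.
        self.geocode = geocode
        self.quota_exhausted = False

    def process_item(self, item, spider):
        if self.quota_exhausted:
            # Once the warning has been seen, never contact Google again,
            # even if the engine keeps feeding items for a while.
            return item

        response = self.geocode(item['address'])
        if response.get('status') == 'OVER_QUERY_LIMIT':
            self.quota_exhausted = True
            # This is the point where the whole crawl should be shut down.
            return item

        item['location'] = response.get('results')
        return item
```

With a flag like this, items that arrive while the shutdown is still in progress pass through without being geocoded, so a slow shutdown no longer translates into extra requests to Google.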
I have checked other similar questions and their answers have not worked:
from scrapy.project import crawler
crawler._signal_shutdown(9,0) #Run this if the cnxn fails.
This does not work because the spider takes time to stop executing, so many more requests are made to Google in the meantime (which could get my IP address banned).
import sys
sys.exit("SHUT DOWN EVERYTHING!")
This one doesn't work at all; items keep getting generated and passed to the pipeline, although the log vomits sys.exit() -> exceptions.SystemExit raised (to no effect).
crawler.engine.close_spider(self, 'log message')
This one has the same problem as the first case mentioned above.
I also tried:
scrapy.project.crawler.engine.stop()
To no avail.
EDIT:
If I do this in the pipeline:
from scrapy.contrib.closespider import CloseSpider
What should I pass as the 'crawler' argument to CloseSpider's __init__() from the scope of my pipeline?