Okay, so I ended up following the suggestion from @rickgh12hs and wrote my own downloader middleware:
    from scrapy.exceptions import IgnoreRequest
    import tldextract


    class clearQueueDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # Reduce the URL to its registered domain, e.g. "example.com".
            domain_obj = tldextract.extract(request.url)
            just_domain = domain_obj.registered_domain
            if just_domain in spider.blocked:
                print("Blocked domain: %s (url: %s)" % (just_domain, request.url))
                # IgnoreRequest tells Scrapy to drop the request before it is downloaded.
                raise IgnoreRequest("URL blocked: %s" % request.url)
spider.blocked is a class-level list on the spider that holds the blocked domains. Once a domain is in that list, the middleware drops every further request to it. Seems to work great, kudos to @rickgh12hs!
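In case it helps, here is a rough sketch of how the block list side could look. None of these names are from my actual project: `BlockedDomainsMixin`, `block_domain`, `DemoSpider` and the `myproject.middlewares` path are all made up for illustration, and the demo class is a plain Python class standing in for a real spider so the snippet runs without Scrapy installed.

```python
# Hypothetical helper, not part of Scrapy: keeps the block list in one place.
class BlockedDomainsMixin(object):
    # Class-level list: all instances (and subclasses that do not redefine it)
    # share this single object, which is what the middleware reads through
    # `spider.blocked`.
    blocked = []

    @classmethod
    def block_domain(cls, domain):
        """Add a registered domain (e.g. "example.com") to the list once."""
        if domain not in cls.blocked:
            cls.blocked.append(domain)


# In a real project the mixin would be combined with a Scrapy spider, e.g.
#   class MySpider(BlockedDomainsMixin, scrapy.Spider): ...
# and the middleware enabled in settings.py (module path is an assumption):
#   DOWNLOADER_MIDDLEWARES = {
#       "myproject.middlewares.clearQueueDownloaderMiddleware": 543,
#   }

class DemoSpider(BlockedDomainsMixin):
    # Stand-in for a real spider, used only to show the shared list.
    name = "demo"


DemoSpider.block_domain("example.com")
DemoSpider.block_domain("example.com")  # second call is a no-op
spider = DemoSpider()
print(spider.blocked)
```

Because the list lives on the class, a domain blocked from anywhere (a callback, an errback, a pipeline) is visible to the middleware on the very next request. The downside is that the list is mutable shared state, so two spider classes inheriting the same mixin would also share one block list unless each defines its own `blocked`.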