python - Is it possible to remove requests from Scrapy's scheduler queue?

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:29:45+0000

Okay, so I ended up following the suggestion from @rickgh12hs and wrote my own downloader middleware:

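import tldextract
from scrapy.exceptions import IgnoreRequest


class clearQueueDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Reduce the URL to its registered domain (e.g. "example.co.uk").
        domain_obj = tldextract.extract(request.url)
        just_domain = domain_obj.registered_domain
        if just_domain in spider.blocked:
            spider.logger.info("Blocked domain: %s (url: %s)", just_domain, request.url)
            # Raising IgnoreRequest makes Scrapy drop the request before it is downloaded.
            raise IgnoreRequest("URL blocked: %s" % request.url)
        # Falling through returns None, so the request continues through the chain as usual.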
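
The domain test itself can be tried out without Scrapy. The following is only a rough sketch using the standard library: is_blocked is a hypothetical helper, not part of the middleware, and it matches by hostname suffix instead of consulting the public suffix list the way tldextract does, so it is a stand-in for the check and not a replacement for it.

```python
from urllib.parse import urlsplit


def is_blocked(url, blocked):
    """Return True if the URL's host is a blocked domain or a subdomain of one.

    Hypothetical helper: plain suffix matching, no public suffix list.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


# Example data, made up for illustration.
blocked = {"example.com"}
```

With this, is_blocked("http://www.example.com/page", blocked) is true, while a look-alike host such as notexample.com is not matched because the comparison requires a leading dot before the blocked domain.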

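spider.blocked is a class-level list on the spider that holds the blocked domains. Once a domain is in that list, the middleware drops any further requests to it. Note that the requests are not actually removed from the scheduler queue; they are discarded when they leave the queue and reach the downloader. Seems to work great, kudos to @rickgh12hs!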
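
For the middleware to run, it has to be enabled in the project settings. A minimal sketch, assuming the class lives in a module named myproject.middlewares (that path is hypothetical, and 543 is just an arbitrary order value):

```python
# settings.py (sketch; "myproject.middlewares" is a hypothetical module path)
DOWNLOADER_MIDDLEWARES = {
    # The number is the middleware's order value in the downloader chain;
    # 543 is an arbitrary choice for this example.
    "myproject.middlewares.clearQueueDownloaderMiddleware": 543,
}
```

The spider also needs a blocked attribute (for example blocked = [] at class level), since the middleware reads spider.blocked on every request.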
