
python - Scrapy - how to identify already scraped URLs

I'm using Scrapy to crawl a news website on a daily basis. How do I stop Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or examples on SgmlLinkExtractor?



1 Reply


You can do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/
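In case the snipplr link is no longer reachable, here is a rough sketch of what such an "ignore already visited" spider middleware could look like. This is not the original snippet: the JSON state file, the use of plain URLs as visit_id values, the dict-like item assumption, and the signal wiring are all illustrative choices.

    import json
    import os

    from scrapy import Request, signals


    class IgnoreVisitedItems:
        """Spider middleware that skips URLs already scraped on a previous run."""

        # Assumption: visited URLs are persisted to a JSON file between runs.
        STATE_FILE = 'visited_urls.json'

        def __init__(self):
            self.visited = set()
            if os.path.exists(self.STATE_FILE):
                with open(self.STATE_FILE) as f:
                    self.visited = set(json.load(f))

        @classmethod
        def from_crawler(cls, crawler):
            mw = cls()
            # Persist the visited set when the spider closes.
            crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
            return mw

        def process_spider_output(self, response, result, spider):
            # Note: only requests yielded from callbacks pass through here;
            # the initial start requests are not filtered by this sketch.
            for entry in result:
                if isinstance(entry, Request):
                    # Skip follow-up requests for URLs scraped on an earlier run.
                    if entry.url not in self.visited:
                        yield entry
                else:
                    # Tag the item (assumes dict-like items with the fields below)
                    # and remember its source URL for future runs.
                    entry['visit_id'] = response.url
                    entry['visit_status'] = 'new'
                    self.visited.add(response.url)
                    yield entry

        def spider_closed(self, spider):
            with open(self.STATE_FILE, 'w') as f:
                json.dump(sorted(self.visited), f)

Within a single run, Scrapy's built-in duplicate request filter already avoids re-requesting the same URL; the point of a middleware like this is to persist that knowledge across the daily runs.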

To use it, copy the code from the link into a file in your Scrapy project (e.g. project/middlewares/ignore.py, to match the path below), then register the middleware in your settings.py:

    SPIDER_MIDDLEWARES = {
        'project.middlewares.ignore.IgnoreVisitedItems': 560,
    }

Why you pick that particular order number is explained in the Scrapy middleware documentation: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html (the ordering rules for spider middlewares work the same way: the number decides where your middleware sits relative to the built-in ones).
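For context, in recent Scrapy releases the built-in spider middlewares are enabled with roughly the following default orders (older releases use scrapy.contrib.spidermiddleware.* paths), so 560 places the custom middleware between OffsiteMiddleware and RefererMiddleware:

    # Built-in spider middleware orders (SPIDER_MIDDLEWARES_BASE) in recent
    # Scrapy releases; module paths differ in older versions.
    SPIDER_MIDDLEWARES_BASE = {
        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
        'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
        'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    }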

Finally, you'll need to modify your items.py so that each item class declares the following fields:

    visit_id = Field()
    visit_status = Field()
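For example, a hypothetical item class with those fields added (NewsItem and the url/title fields are placeholders for whatever your spider already collects):

    # items.py -- only visit_id and visit_status are required by the middleware;
    # the rest of this class is a placeholder.
    from scrapy import Field, Item


    class NewsItem(Item):
        url = Field()
        title = Field()
        visit_id = Field()
        visit_status = Field()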

And that should be it. The next time you run your spider, it should automatically skip the URLs it has already scraped.

Good luck!



...