You can actually do this quite easily with the scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/
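To use it, copy the code from the link into a file in your Scrapy project (the setting below assumes project/middlewares/ignore.py).
To enable it, add a line to your settings.py that references it, adjusting the dotted path to match your project name and wherever you saved the file: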
SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }
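The number sets where the middleware sits in the middleware chain. How to pick it is explained here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html (that page covers downloader middlewares, but the ordering numbers for spider middlewares work the same way).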
Finally, you'll need to modify your items.py so that each item class has the following fields:
visit_id = Field()
visit_status = Field()
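If it helps to see what those two fields are for, here is a rough sketch of the idea in plain Python. This is not the snippet's code and it doesn't use Scrapy at all: requests and items are plain dicts, and filter_visited, "kind" and "id" are names I made up for illustration. As far as I understand it, the middleware remembers an id for every page that produced an item, records that id on the item, and drops later requests that carry the same id.

```python
# Illustrative stand-in only: NOT the code from the snippet, no Scrapy involved.
# Requests and items are modelled as plain dicts; all names are hypothetical.

def filter_visited(results, visited_ids):
    """Yield results, dropping requests whose id was seen before
    and tagging items with visit_id / visit_status."""
    for result in results:
        if result["kind"] == "request":
            if result["id"] in visited_ids:
                continue  # already visited: skip this request
            yield result
        else:
            # An item was scraped: remember its id and record it on the item.
            visited_ids.add(result["id"])
            yield dict(result, visit_id=result["id"], visit_status="new")


visited = set()

# First pass: an item is scraped from page-1, so its id is remembered.
first_run = list(filter_visited(
    [{"kind": "item", "id": "page-1"}],
    visited,
))

# Second pass: the request for page-1 is dropped, page-2 goes through.
second_run = list(filter_visited(
    [{"kind": "request", "id": "page-1"}, {"kind": "request", "id": "page-2"}],
    visited,
))
```

The real middleware has to be able to write visit_id and visit_status onto your items, which is why every item class needs both fields declared.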
And I think that's it. The next time you run your spider, it should automatically skip the items it has already visited.
Good luck!