I am currently working on a scraper project where it is very important that EVERY request gets properly handled, i.e. either an error is logged or a successful result is saved. I've already implemented the basic spider and I can now process 99% of the requests successfully, but I can still run into errors like captchas, 50x or 30x responses, or results with too few fields (in which case I'll try another website to find the missing fields).
At first I thought it would be more "logical" to raise exceptions in the parsing callback and process them all in an errback, since that would make the code more readable. But I tried it only to find out that an errback can only trap errors coming from the downloader module, such as non-200 response statuses. If I raise a self-defined ParseError in the callback, the spider just raises it and stops.
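For reference, this is roughly how the requests are wired up at the moment (the spider name, URL and selector below are placeholders, not my real ones); the errback does fire for download-level failures, but a ParseError raised inside the callback never reaches it:

import scrapy

class ParseError(Exception):
    """The page downloaded fine, but the fields I need are missing."""

class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request("http://example.com/round1",   # placeholder URL
                             callback=self.parseRound1,
                             errback=self.errHandler)

    def parseRound1(self, response):
        if not response.css("div.needed-field"):             # placeholder check
            raise ParseError("not enough fields")             # never reaches errHandler
        yield {"field": response.css("div.needed-field::text").get()}

    def errHandler(self, failure):
        # only ever sees downloader-level failures
        self.logger.error(repr(failure))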
Even if I have to handle parsing failures directly in the callback, I don't know how to retry the request immediately from the callback in a clean fashion. I may have to send the new request through a different proxy, or modify some request headers.
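The closest thing I can come up with for the "retry from inside the callback" part is re-yielding a modified copy of the original request, roughly like the sketch below — the looks_like_captcha helper, the proxy URL and the my_retry_count meta key are all made up for illustration:

def parseRound1(self, response):
    if looks_like_captcha(response):                      # hypothetical helper
        retries = response.meta.get("my_retry_count", 0)
        if retries < 3:
            new_meta = dict(response.meta,
                            my_retry_count=retries + 1,
                            proxy="http://another-proxy.example:8080")  # made-up proxy
            # dont_filter=True so the dupefilter doesn't drop the repeated URL
            yield response.request.replace(meta=new_meta, dont_filter=True)
        else:
            self.logger.error("giving up on %s", response.url)
        return
    # ... normal parsing of a good response goes on here ...

This works for a single proxy swap, but the retry logic ends up duplicated in every callback, which is why I'm hoping there is a cleaner, more central place for it.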
I admit I'm relatively new to Scrapy, but I've tried back and forth for days and still can't get this working. I've checked every related question on SO and none of them matches. Thanks in advance for the help.
UPDATE: I realize this could be a very complex question, so I'll try to illustrate the scenario in the following pseudo code; I hope this helps:
from scraper.myexceptions import *

def parseRound1(self, response):
    # ... some parsing routines ...
    if something went wrong:
        # this causes the spider to raise a SpiderException and stop
        raise CaptchaError
    ...
    if not enough fields scraped:
        raise ParseError(task, "not enough fields")
    else:
        return items

def parseRound2(self, response):
    # ... some other parsing routines ...

def errHandler(self, failure):
    # how to trap all the exceptions?
    r = failure.trap()
    # cannot trap ParseError here
    if r == CaptchaError:
        # how to enqueue the original request here?
        retry
    elif r == ParseError:
        if raised from parseRound1:
            new request for Round2
        else:
            some other retry mechanism
    elif r == HTTPError:
        ignore or retry
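
To be clear about what I'm imagining for errHandler: I know a Twisted failure has check()/trap() and that Scrapy attaches the failing request to the failure, so for downloader-level errors I picture something roughly like the sketch below (I'm not even certain re-yielding a request from an errback is supported) — but CaptchaError/ParseError raised inside a callback never gets here, which is exactly my problem:

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import TimeoutError, TCPTimedOutError

def errHandler(self, failure):
    if failure.check(HttpError):
        # non-200 response handed over by the HttpError spider middleware
        response = failure.value.response
        self.logger.error("got HTTP %s on %s", response.status, response.url)
    elif failure.check(TimeoutError, TCPTimedOutError):
        # re-enqueue the original request untouched
        yield failure.request.replace(dont_filter=True)
    else:
        self.logger.error(repr(failure))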