I'm using scrapy to crawl my sitemap, to check for 404, 302 and 200 pages. But i can't seem to be able to get the response code. This is my code so far:
from scrapy.contrib.spiders import SitemapSpider
class TothegoSitemapHomesSpider(SitemapSpider):
name ='tothego_homes_spider'
## robe che ci servono per tothego ##
sitemap_urls = []
ok_log_file = '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
bad_log_file = '/opt/Workspace/myapp/crawler/bad_homes'
fourohfour = '/opt/Workspace/myapp/crawler/404/404_homes'
def __init__(self, **kwargs):
SitemapSpider.__init__(self)
if len(kwargs) > 1:
if 'domain' in kwargs:
self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]
if 'country' in kwargs:
self.ok_log_file += "_%s.txt" % kwargs['country']
self.bad_log_file += "_%s.txt" % kwargs['country']
self.fourohfour += "_%s.txt" % kwargs['country']
else:
print "USAGE: scrapy [crawler_name] -a country=[country] -a domain=[domain]
With [crawler_name]:
- tothego_homes_spider
- tothego_cars_spider
- tothego_jobs_spider
"
exit(1)
def parse(self, response):
try:
if response.status == 404:
## 404 tracciate anche separatamente
self.append(self.bad_log_file, response.url)
self.append(self.fourohfour, response.url)
elif response.status == 200:
## printa su ok_log_file
self.append(self.ok_log_file, response.url)
else:
self.append(self.bad_log_file, response.url)
except Exception, e:
self.log('[eccezione] : %s' % e)
pass
def append(self, file, string):
file = open(file, 'a')
file.write(string+"
")
file.close()
From scrapy's docs, they said that response.status parameter is an integer corresponding to the status code of the response. So far, it logs only the 200 status urls, while the 302 aren't written on the output file (but i can see the redirects in crawl.log). So, what do i have to do to "trap" the 302 requests and save those urls?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…