Yes and no[1]. If you fetch a PDF it will be stored in memory, but as long as the PDFs are not big enough to fill up your available memory, that is fine.
You could save the pdf in the spider callback:
from scrapy.http import Request

def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
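get_path() is not shown here; a minimal sketch, assuming you just want the last segment of the URL as the file name and a fixed downloads/ directory (both of those are my assumptions, adapt as needed):

import os
from urllib.parse import urlparse

def get_path(self, url):
    # take the last path segment of the url as the file name
    name = os.path.basename(urlparse(url).path) or 'unnamed.pdf'
    # 'downloads' is just a placeholder target directory
    os.makedirs('downloads', exist_ok=True)
    return os.path.join('downloads', name)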
If you choose to do it in a pipeline:
# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove the body and add the path as a reference
    del item['body']
    item['path'] = path
    # let the item be processed by other pipelines, e.g. a db store
    return item
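The pipeline also has to be enabled in settings.py. Assuming process_item() above lives in a class called SavePdfPipeline in myproject/pipelines.py (the class and module names are placeholders for wherever you put it), something like:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.SavePdfPipeline': 300,
}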
[1] Another approach could be to store only the PDFs' urls and use another process to fetch the documents without buffering them into memory (e.g. wget).
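A minimal sketch of that idea (file names are assumptions): have the spider only record the urls, then let wget stream the documents straight to disk in a separate step:

# in the spider: don't download the pdfs, just record their urls
def parse_listing(self, response):
    # pdf_urls as in the snippets above
    with open('pdf_urls.txt', 'a') as f:
        for url in pdf_urls:
            f.write(url + '\n')

# later, in a separate process:
#   wget -i pdf_urls.txt -P downloads/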