
python - Scrapy export with headers if empty

As far as I can see nobody has asked this question, and I'm completely stuck trying to solve it. I have a spider that sometimes returns no results (either because none exist or because the site could not be scraped, e.g. it is blocked by robots.txt), and in that case the export is an empty, headerless CSV file. A robot picks up this file downstream; when the file is empty the robot doesn't realise the crawl has finished, and without headers it can't parse the file anyway.

What I want is to write the CSV file with headers every time, even if there are no results. I've tried exporting JSON instead, but it has the same issue: if there is an error or there are no results, the file is empty.

I'd be quite happy to call something when the spider closes (for whatever reason, even an error during initialisation due to, say, a bad URL) and write something to the file.
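
To make that concrete, something along the lines of the sketch below is what I have in mind, using the spider's closed() hook. The spider itself, the results.csv path and the email field are just placeholders, and I don't know how this would interact with the feed exporter writing the same file:

import csv
import os

import scrapy


class GenericSpider(scrapy.Spider):
    # Placeholder spider -- the name, URL and "email" field stand in for
    # my real spider.
    name = "generic"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Stand-in parse: extract whatever would normally become items.
        for href in response.css("a[href^='mailto:']::attr(href)").getall():
            yield {"email": href.replace("mailto:", "", 1)}

    def closed(self, reason):
        # Runs whenever the spider closes, whatever the reason
        # ("finished", "shutdown", an unhandled error, ...).
        # If the output file is missing or empty, write just the header row
        # so the downstream robot always has something to parse.
        # (Not sure how this interacts with the feed exporter writing the
        # same file -- that's essentially what I'm asking.)
        path = "results.csv"
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            with open(path, "w", newline="") as f:
                csv.writer(f).writerow(["email"])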

Thanks.

Question from: https://stackoverflow.com/questions/65917261/scrapy-export-with-headers-if-empty

1 Reply

I solved this by amending my item pipeline and not using the feed exporter from the command line. This let me use close_spider to write the header when there are no results.

I'd obviously welcome any improvements if I've missed something.

Pipeline code:

from scrapy.exceptions import DropItem
from scrapy.exporters import CsvItemExporter
from Generic.items import GenericItem


class GenericPipeline:

    def __init__(self):
        # Track emails already exported so duplicates can be dropped.
        self.emails_seen = set()

    def open_spider(self, spider):
        # Open the output file ourselves instead of relying on the
        # command-line feed exporter.
        self.file = open("results.csv", "wb")
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # If nothing was scraped, export a placeholder item so the CSV
        # still gets its header row (plus a "None Found" marker row).
        if not self.emails_seen:
            header = GenericItem()
            header["email"] = "None Found"
            self.exporter.export_item(header)
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        # Drop duplicates; export everything else as it arrives.
        if item["email"] in self.emails_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        self.emails_seen.add(item["email"])
        self.exporter.export_item(item)
        return item
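
For completeness, the pipeline still needs to be enabled in the project settings, and the item only needs the one field the pipeline touches. Roughly (the module path assumes the project is called Generic, as in the import above; adjust to your own project):

# settings.py -- enable the pipeline so Scrapy runs it for every item
ITEM_PIPELINES = {
    "Generic.pipelines.GenericPipeline": 300,
}

# items.py -- a minimal GenericItem; only "email" is used by the pipeline
import scrapy

class GenericItem(scrapy.Item):
    email = scrapy.Field()

With the pipeline doing the writing, the -o/-O option can be dropped from the crawl command entirely.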
