I am trying to extract the business name and address from each listing and export it to a -csv, but I am having problems with the output csv. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problems.
yp_spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from yp.items import Biz
class MySpider(BaseSpider):
name = "ypages"
allowed_domains = ["yellowpages.com"]
start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
bizs = hxs.select("//div[@class='listing_content']")
items = []
for biz in bizs:
item = Biz()
item['name'] = biz.select("//h3/a/text()").extract()
item['address'] = biz.select("//span[@class='street-address']/text()").extract()
print item
items.append(item)
items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html
from scrapy.item import Item, Field
class Biz(Item):
name = Field()
address = Field()
def __str__(self):
return "Website: name=%s address=%s" % (self.get('name'), self.get('address'))
The output from 'scrapy crawl ypages -o list.csv -t csv' is a long list of business names then locations and it repeats the same data several times.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…