First, you are using `results = hxs.select('//*[@id="content"]/div[1]')`, so

    results = hxs.select('//*[@id="content"]/div[1]')
    for result in results:
        ...

will loop on one `div` only, the first child `div` of `<div id="content" class="clear">`.

What you need is to loop on every `<dl class="clear">...</dl>` within this `//*[@id="content"]/div[1]` (it would probably be easier to maintain with `//*[@id="content"]/div[@class="content"]`):
    results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
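A quick way to sanity-check both expressions is the Scrapy shell (this sketch assumes an old Scrapy version like yours, where the shell provides an `hxs` selector automatically):

    $ scrapy shell "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    >>> # the original expression matches only the first child div (one node)
    >>> len(hxs.select('//*[@id="content"]/div[1]'))
    >>> # the new expression should match one node per <dl class="clear"> block
    >>> len(hxs.select('//*[@id="content"]/div[@class="content"]/div/dl'))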
Second, in each loop iteration, you are using absolute XPath expressions (`//div...`):

    result.select('//div/dl/dt[contains(text(), "...")]/following-sibling::dd[1]/text()')

This will select all `dd` following a `dt` matching the text content, starting from the document root node, not from the current `result`. Look at this section in the Scrapy docs for details.

You need to use relative XPath expressions -- relative within each `result` scope representing each `dl`, like `dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text()` or `./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text()`.
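To make the difference concrete, here is a minimal sketch (variable names invented for illustration): inside the loop, the absolute expression returns the same document-wide matches on every iteration, while the relative one only looks inside the current `<dl>`:

    for result in results:
        # absolute: starts at the document root, so every iteration
        # gets the dd elements of *all* dl blocks in the page
        everything = result.select('//dl/dt/following-sibling::dd[1]/text()').extract()

        # relative: starts at the current dl, so each iteration
        # only gets the dd elements of *this* block
        this_block = result.select('dt/following-sibling::dd[1]/text()').extract()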
The "practice" field however can still use an absolute XPath expression //h1/text()
, but you could also have a variable practice
set once, and use it in each WebhealthItem1()
instance
...
practice = hxs.select('//h1/text()').extract()
for result in results:
item = WebhealthItem1()
...
item['practice'] = practice
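Note that `.extract()` always returns a list of unicode strings, so `practice` here is a list, not a single string. If you would rather store one cleaned-up string (assuming the page has exactly one `<h1>`), a small variation would be:

    practice_titles = hxs.select('//h1/text()').extract()
    # take the first match, stripped; fall back to None if nothing matched
    practice = practice_titles[0].strip() if practice_titles else None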
Here's what your spider would look like with these changes:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    from webhealth.items1 import WebhealthItem1


    class WebhealthSpider(BaseSpider):

        name = "webhealth_content1"
        download_delay = 5
        allowed_domains = ["webhealth.co.nz"]
        start_urls = [
            "http://auckland.webhealth.co.nz/provider/service/view/914136/"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            practice = hxs.select('//h1/text()').extract()
            items1 = []
            results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
            for result in results:
                item = WebhealthItem1()
                #item['url'] = result.select('//dl/a/@href').extract()
                item['practice'] = practice
                item['hours'] = map(unicode.strip,
                    result.select('dt[contains(., "Contact hours")]/following-sibling::dd[1]/text()').extract())
                item['more_hours'] = map(unicode.strip,
                    result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
                item['physical_address'] = map(unicode.strip,
                    result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
                item['postal_address'] = map(unicode.strip,
                    result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
                item['postcode'] = map(unicode.strip,
                    result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
                item['district_town'] = map(unicode.strip,
                    result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
                item['region'] = map(unicode.strip,
                    result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
                item['phone'] = map(unicode.strip,
                    result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
                item['website'] = map(unicode.strip,
                    result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
                item['email'] = map(unicode.strip,
                    result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
                items1.append(item)
            return items1
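Assuming `WebhealthItem1` declares these fields in `webhealth/items1.py`, you should then be able to run the spider and export the scraped items, for example:

    $ scrapy crawl webhealth_content1 -o items1.json -t json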
I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960