
xpath - Scrapy: Parsing list items onto separate lines

I tried to adapt the answer to this question to my issue, but without success.

Here's some example HTML code:

<div id="provider-region-addresses">
<h3>Contact details</h3>
<h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>North Shore Hospital</dd><dt>Physical address</dt>
                <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt>
                <dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt>
                <dd>0740</dd><dt>District/town</dt>

                <dd>
                North Shore, Takapuna</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 486 8996</dd><dt>Fax</dt>
                <dd>(09) 486 8342</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Helensville</dd><dt>Postal address</dt>
                <dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt>
                <dd>0840</dd><dt>District/town</dt>

                <dd>
                Rodney, Helensville</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 420 9450</dd><dt>Fax</dt>
                <dd>(09) 420 7050</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Warkworth</dd><dt>Postal address</dt>
                <dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt>
                <dd>0941</dd><dt>District/town</dt>

                <dd>
                Rodney, Warkworth</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 422 2700</dd><dt>Fax</dt>
                <dd>(09) 422 2709</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Waitakere Hospital</dd><dt>Physical address</dt>
                <dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt>
                <dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt>
                <dd>0650</dd><dt>District/town</dt>

                <dd>
                Waitakere, Henderson</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 839 0000</dd><dt>Fax</dt>
                <dd>(09) 837 6634</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt>
                <dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt>
                <dd>0932</dd><dt>District/town</dt>

                <dd>
                Rodney, Red Beach</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 427 0300</dd><dt>Fax</dt>
                <dd>(09) 427 0391</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    </div>


And here's my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):

name = "webhealth_content1"

download_delay = 5

allowed_domains = ["webhealth.co.nz"]
start_urls = [
    "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//*[@id="content"]/div[1]')
    items1 = []
    for result in results:
        item = WebhealthItem1()
        item['url'] = result.select('//dl/a/@href').extract()
        item['practice'] = result.select('//h1/text()').extract()
        item['hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text()').extract())
        item['more_hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"More information")]/following-sibling::dd[1]/text()').extract())
        item['physical_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Physical address")]/following-sibling::dd[1]/text()').extract())
        item['postal_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postal address")]/following-sibling::dd[1]/text()').extract())
        item['postcode'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postcode")]/following-sibling::dd[1]/text()').extract())
        item['district_town'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"District/town")]/following-sibling::dd[1]/text()').extract())
        item['region'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Region")]/following-sibling::dd[1]/text()').extract())
        item['phone'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Phone")]/following-sibling::dd[1]/text()').extract())
        item['website'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Website")]/following-sibling::dd[1]/a/@href').extract())
        item['email'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Email")]/following-sibling::dd[1]/a/text()').extract())
        items1.append(item)
    return items1

From here, how do I parse the list items onto separate lines, with the corresponding //h1/text() value in the name field? Currently I'm getting a list of each XPath item all in one cell. Is this to do with the way that I am declaring the XPaths?

Thanks


1 Reply


First, you are using results = hxs.select('//*[@id="content"]/div[1]'), so

    results = hxs.select('//*[@id="content"]/div[1]')
    for result in results:
        ...

will loop over only one div, the first child div of <div id="content" class="clear">.

What you need is to loop over every <dl class="clear">...</dl> within this //*[@id="content"]/div[1] (it would probably be easier to maintain as //*[@id="content"]/div[@class="content"]):

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
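
As a quick sanity check (a hypothetical scrapy shell session, not part of the original answer), this selector should return one node per <dl> block; there are five in the sample HTML above, although the count on the live page may differ:

    $ scrapy shell "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    >>> # in older Scrapy versions the shell provides `hxs` as an HtmlXPathSelector
    >>> len(hxs.select('//*[@id="content"]/div[@class="content"]/div/dl'))
    5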

Second, in each loop iteration, you are using absolute XPath expressions (//div...)

result.select('//div/dl/dt[contains(text(), "...")]/following-sibling::dd[1]/text()')

this will select all dd elements following a dt with matching text content, starting from the document root node, not just from within the current result.

Look at this section in the Scrapy docs for details.

You need to use relative XPath expressions, relative within each result scope representing each dl, such as dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text() or ./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text().
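
To make the difference concrete, here is a minimal sketch (not part of the original answer) of what each style returns inside the loop, using the "Phone" field as an example:

    for result in results:
        # absolute expression: starts from the document root, so every
        # iteration returns the phone numbers of *all* <dl> blocks on the page
        all_phones = result.select('//div/dl/dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract()

        # relative expression: starts from the current <dl>, so it returns
        # at most the single phone number belonging to this address block
        this_phone = result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract()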

The "practice" field however can still use an absolute XPath expression //h1/text(), but you could also have a variable practice set once, and use it in each WebhealthItem1() instance

        ...
        practice = hxs.select('//h1/text()').extract()
        for result in results:
            item = WebhealthItem1()
            ...
            item['practice'] = practice

Here's what your spider would look like with these changes:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):

    name = "webhealth_content1"

    download_delay = 5

    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        practice = hxs.select('//h1/text()').extract()
        items1 = []

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
        for result in results:
            item = WebhealthItem1()
            #item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = practice
            item['hours'] = map(unicode.strip,
                result.select('dt[contains(., "Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip,
                result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip,
                result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip,
                result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip,
                result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip,
                result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip,
                result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip,
                result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip,
                result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip,
                result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1

I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960
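
Since each <dl> now yields its own item, exporting the results should put each address block on a separate line as well. For example (a sketch; addresses.csv is just an example filename), the built-in feed exporter in Scrapy versions of that era can be invoked like this:

    scrapy crawl webhealth_content1 -o addresses.csv -t csv

Each WebhealthItem1 then becomes one CSV row, with the practice value repeated on every row.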

