
python - Crawl a full domain and load all h1 into an item

I am relatively new to Python and Scrapy. What I want to achieve is to crawl a number of websites, mainly company websites: crawl the full domain, extract all the h1, h2 and h3 headings, and create one record per domain containing the domain name and a string with all the h1/h2/h3 text from that domain. Basically, a Domain item with one large string containing all the headers.

I would like the output to be DOMAIN, STRING(h1,h2,h3) - from all the URLs on this domain.

The problem I have is that each URL goes into a separate item. I know I haven't gotten very far, but a hint in the right direction would be much appreciated. Basically: how do I create an outer loop so that the yield statement keeps going until the next domain is up?
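(For reference, the spider below loads into a WebsiteItem that is not shown in the question. A minimal sketch of what items.py could contain: the h1/h2/h3 fields are inferred from the loader calls, and the domain field is an assumption, added so results can later be grouped per domain.)

# items.py - hypothetical sketch, not the asker's actual definition
from scrapy.item import Item, Field

class WebsiteItem(Item):
    domain = Field()  # assumed field: the domain the headings came from
    h1 = Field()
    h2 = Field()
    h3 = Field()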

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from Autotask_Prospecting.items import WebsiteItem


class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = [ l.strip() for l in open('Domains.txt').readlines() ]
    start_urls = [ l.strip() for l in open('start_urls.txt').readlines() ]


    rules = (
        # Extract every link and parse each page with parse_item.
        # follow=True is needed here: when a Rule has a callback,
        # CrawlSpider stops following links from matched pages by default.
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
        )

    def parse_item(self, response):
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        # Record which domain this page belongs to, so items can be
        # grouped per domain later (see the pipeline in the answer below).
        loader.add_value('domain', urlparse(response.url).netloc)
        loader.add_xpath('h1', "//h1/text()")
        loader.add_xpath('h2', "//h2/text()")
        loader.add_xpath('h3', "//h3/text()")
        yield loader.load_item()


1 Reply


"yield statement keeps going until the next domain is up"

This cannot be done: Scrapy processes requests in parallel, and there is no way to make it crawl the domains serially.

What you can do is write a pipeline that accumulates the headings per domain and emits the merged structure when the spider closes, something like:

# this assumes your item looks like the following
from scrapy.item import Item, Field

class MyItem(Item):
    domain = Field()
    hs = Field()


import collections
from scrapy.exceptions import DropItem

class DomainPipeline(object):

    def __init__(self):
        # one set of heading strings per domain
        self.accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        self.accumulator[item['domain']].update(item['hs'])
        # swallow the per-page item; merged items are built in close_spider
        raise DropItem('accumulated into per-domain item')

    def close_spider(self, spider):
        # Note: Scrapy does not collect items returned from close_spider,
        # so in practice you would export the merged items here yourself
        # (e.g. write them to a file).
        for domain, hs in self.accumulator.items():
            yield MyItem(domain=domain, hs=hs)
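To enable the pipeline, register it in the project's settings.py. (The module path below is an assumption; adjust it to wherever DomainPipeline actually lives in your project.)

# settings.py - hypothetical module path for the pipeline above
ITEM_PIPELINES = {
    'Autotask_Prospecting.pipelines.DomainPipeline': 300,
}
# note: very old Scrapy versions expect ITEM_PIPELINES to be a plain list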

Usage - a quick demonstration of the accumulator logic in an interactive session:

>>> from scrapy.item import Item, Field
>>> class MyItem(Item):
...     domain = Field()
...     hs = Field()
... 
>>> from collections import defaultdict
>>> accumulator = defaultdict(set)
>>> items = []
>>> for i in range(10):
...     items.append(MyItem(domain='google.com', hs=[str(i)]))
... 
>>> items
[{'domain': 'google.com', 'hs': ['0']}, {'domain': 'google.com', 'hs': ['1']}, {'domain': 'google.com', 'hs': ['2']}, {'domain': 'google.com', 'hs': ['3']}, {'domain': 'google.com', 'hs': ['4']}, {'domain': 'google.com', 'hs': ['5']}, {'domain': 'google.com', 'hs': ['6']}, {'domain': 'google.com', 'hs': ['7']}, {'domain': 'google.com', 'hs': ['8']}, {'domain': 'google.com', 'hs': ['9']}]
>>> for item in items:
...     accumulator[item['domain']].update(item['hs'])
... 
>>> accumulator
defaultdict(<type 'set'>, {'google.com': set(['1', '0', '3', '2', '5', '4', '7', '6', '9', '8'])})
>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=hs)
... 
{'domain': 'google.com',
 'hs': set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])}
>>> 
