python - Creating Scrapy array of items with multiple parse

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am scraping listings with Scrapy. My script first parses the listing URLs using parse_node, then parses each listing using parse_listing, and for each listing parses its agents using parse_agent. I would like to create an array that builds up as Scrapy parses a listing and its agents, and resets for each new listing.

Here is my parsing script:

    def parse_node(self, response, node):
        yield Request('LISTING LINK', callback=self.parse_listing)

    def parse_listing(self, response):
        yield response.xpath('//node[@id="ListingId"]/text()').extract_first()
        yield response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
        for agent in string.split(response.xpath('//node[@id="Agents"]/text()').extract_first() or "", '^'):
            yield Request('AGENT LINK', callback=self.parse_agent)

    def parse_agent(self, response):
        yield response.xpath('//node[@id="AgentName"]/text()').extract_first()
        yield response.xpath('//node[@id="AgentEmail"]/text()').extract_first()

I would like parse_listing to result in:

{
 'id':123,
 'title':'Amazing Listing'
}

then parse_agent to add to the listing array:

{
 'id':123,
 'title':'Amazing Listing',
 'agent':[
  {
   'name':'jon doe',
   'email':'[email protected]'
  },
  {
   'name':'jane doe',
   'email':'[email protected]'
  }
 ]
}

How do I get the results from each level and build up an array?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:59:15+0000

This is a somewhat complicated issue: you need to form a single item from multiple different URLs.

Scrapy allows you to carry data over in a request's meta attribute, so you can do something like:

from collections import defaultdict
from scrapy import Request

def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    item = defaultdict(list)
    item['id'] = response.xpath('//node[@id="ListingId"]/text()').extract_first()
    item['title'] = response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    # split the '^'-separated Agents value (assumed here to hold agent urls), skipping empty pieces
    agents = response.xpath('//node[@id="Agents"]/text()').extract_first() or ""
    agent_urls = [url for url in agents.split('^') if url]
    if not agent_urls:  # no agents: the item is already complete
        yield item
        return
    # we want to go through agent urls one-by-one and update the single item with agent data
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})

def parse_agent(self, response):
    item = response.meta['item']  # retrieve the item built in the previous request
    agent = dict()
    agent['name'] = response.xpath('//node[@id="AgentName"]/text()').extract_first()
    agent['email'] = response.xpath('//node[@id="AgentEmail"]/text()').extract_first()
    item['agents'].append(agent)
    # check if we have any more agent urls left
    agent_urls = response.meta['agent_urls']
    if not agent_urls:  # we crawled all of the agents!
        yield item  # this callback is a generator, so `return item` would discard the item
        return
    # if we do - crawl the next agent and carry over our current item
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})
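To see the one-by-one chaining in isolation, here is a minimal sketch that runs without Scrapy. Everything in it is made up for illustration: FakeRequest, FakeResponse, the crawl driver, the PAGES data and its URLs are hypothetical stand-ins, not Scrapy's API. It only shows how the item and the remaining agent urls travel through meta until the last agent callback yields the finished item.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for a request: a url, a callback, and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for a response: canned data plus the request's meta."""
    def __init__(self, request, data):
        self.url = request.url
        self.meta = request.meta
        self.data = data


# Canned "pages" keyed by made-up urls (assumed data, not from a real site).
PAGES = {
    "listing/123": {"id": "123", "title": "Amazing Listing", "agents": "agent/1^agent/2"},
    "listing/456": {"id": "456", "title": "Lonely Listing", "agents": ""},
    "agent/1": {"name": "jon doe"},
    "agent/2": {"name": "jane doe"},
}


def parse_listing(response):
    item = {"id": response.data["id"], "title": response.data["title"], "agents": []}
    # Split the '^'-separated value and drop empty pieces.
    agent_urls = [url for url in response.data["agents"].split("^") if url]
    if not agent_urls:
        yield item  # nothing to chain, the item is complete
        return
    yield FakeRequest(agent_urls.pop(0), parse_agent,
                      meta={"item": item, "agent_urls": agent_urls})


def parse_agent(response):
    item = response.meta["item"]  # the same dict that parse_listing created
    item["agents"].append({"name": response.data["name"]})
    agent_urls = response.meta["agent_urls"]
    if not agent_urls:
        yield item  # last agent done, emit the finished item
        return
    yield FakeRequest(agent_urls.pop(0), parse_agent,
                      meta={"item": item, "agent_urls": agent_urls})


def crawl(start_url, callback):
    """Tiny driver: follow yielded requests, collect everything else as items."""
    queue = deque([FakeRequest(start_url, callback)])
    items = []
    while queue:
        request = queue.popleft()
        response = FakeResponse(request, PAGES[request.url])
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("listing/123", parse_listing)
```

Because each listing gets its own item dict and its own chain of agent requests, the "array" resets naturally for every new listing. One trade-off of chaining: in a real crawl, if one agent request fails, the item is never yielded unless you handle the failure, for example with the request's errback argument. Newer Scrapy versions (1.7 and later) also offer cb_kwargs as an alternative to meta for passing data to callbacks.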
