Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Is it possible to run another spider from a Scrapy spider?

I currently have two spiders. What I would like to do is:

  1. Spider 1 goes to url1 and, if url2 appears, calls spider 2 with url2. It also saves the content of url1 using a pipeline.
  2. Spider 2 goes to url2 and does something.

Because both spiders are complex, I would like to keep them separate.

What I have tried while using scrapy crawl:

def parse(self, response):
    p = multiprocessing.Process(
        target=self.testfunc())
    p.join()
    p.start()

def testfunc(self):
    settings = get_project_settings()
    crawler = CrawlerRunner(settings)
    crawler.crawl(<spidername>, <arguments>)

It does load the settings but doesn't crawl:

2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

The documentation has an example about launching spiders from a script, but what I'm trying to do is launch another spider while using the scrapy crawl command.

edit: Full code

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os


def info(title):
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())


class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()


class MyClass(object):

    def start_work(self):
        info('start_work')
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):

        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return

class TestSpider2(scrapy.Spider):

    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return

What I am hoping for is something like:

  1. scrapy crawl test1 (for example, when response.status is 200)
  2. in test1, call scrapy crawl test2

1 Reply

I won't go into depth since this question is really old, but I'll go ahead and drop this snippet from the official Scrapy docs. You are very close!

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

https://doc.scrapy.org/en/latest/topics/practices.html
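
Note that the snippet above runs both spiders from a script rather than from inside scrapy crawl. If spider 2 really has to be kicked off from spider 1, one option is to launch a second scrapy crawl as a child process and hand it url2 with the -a flag. This is only a rough sketch, not something taken from the docs: build_crawl_command, launch_spider and the start_url argument name are made up for illustration, and it assumes the scrapy executable is on PATH and that the process is started from the project directory.

```python
import subprocess


def build_crawl_command(spider_name, **spider_args):
    """Build the argv for `scrapy crawl <spider_name> -a key=value ...`."""
    command = ["scrapy", "crawl", spider_name]
    for key, value in spider_args.items():
        # Each -a NAME=VALUE pair is handed to the spider as an argument.
        command += ["-a", f"{key}={value}"]
    return command


def launch_spider(spider_name, **spider_args):
    """Start the crawl as a child process without waiting for it to finish."""
    return subprocess.Popen(build_crawl_command(spider_name, **spider_args))


# Hypothetical use inside spider 1's parse(), once url2 has been found:
#     launch_spider("test2", start_url=url2)
# Spider 2 could then read the value as self.start_url.
```

Popen returns immediately, so spider 1 keeps crawling. Waiting for the child inside parse would block spider 1 until spider 2 is done.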

Using callbacks, you can then pass items between your spiders to implement whatever logic you're talking about.
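
A side note on the first snippet in the question: it has two multiprocessing problems that are independent of Scrapy. target=self.testfunc() calls the function right away in the parent process and passes its return value as the target, and p.join() is called before p.start(). The sketch below reproduces both with a stand-in work function (all names here are placeholders of mine). It only covers the process handling, not the Scrapy side.

```python
import os
from multiprocessing import Process


def work():
    # Stand-in for the function that would start the second crawl.
    pass


def target_called_in_parent():
    """Show that target=func() runs func immediately, in the parent process."""
    ran_in = []

    def record_pid():
        ran_in.append(os.getpid())

    # Note the parentheses: record_pid() has already run by the time
    # Process() is created, and the process gets target=None.
    Process(target=record_pid())
    return ran_in == [os.getpid()]


def join_before_start():
    """Reproduce the ordering from the question: join() before start()."""
    p = Process(target=work)
    try:
        p.join()  # the process was never started
    except AssertionError as exc:
        return str(exc)
    return None


def start_then_join():
    """Pass the function itself, start the process, then wait for it."""
    p = Process(target=work)
    p.start()
    p.join()
    return p.exitcode


if __name__ == "__main__":
    print(target_called_in_parent())
    print(join_before_start())
    print(start_then_join())
```

The full code in the question already uses the correct form (target=self.do_work, then start() followed by join()).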
