For now I have two spiders. What I would like to do is:
- Spider 1 goes to url1, and if url2 appears, calls spider 2 with url2. It also saves the content of url1 by using a pipeline.
- Spider 2 goes to url2 and does something.
Due to the complexity of both spiders I would like to keep them separated.
What I have tried while running under scrapy crawl:
    def parse(self, response):
        p = multiprocessing.Process(
            target=self.testfunc)
        p.start()
        p.join()

    def testfunc(self):
        settings = get_project_settings()
        crawler = CrawlerRunner(settings)
        crawler.crawl(<spidername>, <arguments>)
It does load the settings but doesn't crawl:
2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The documentation has an example of launching a crawl from a script, but what I'm trying to do is launch another spider while already running under the scrapy crawl command.
Edit: full code:
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os


def info(title):
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())


class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()


class MyClass(object):

    def start_work(self):
        info('start_work')
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):
        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return


class TestSpider2(scrapy.Spider):
    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return
What I hope for is something like:
- scrapy crawl test1
- in test1, when some condition holds (for example, when response.status is 200), call scrapy crawl test2
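The decision of when to hand off can be sketched independently of Scrapy. A minimal sketch, assuming test1 extracts a candidate url2 from the page (decide_next is a hypothetical helper, and Scrapy exposes the HTTP code as response.status):

```python
def decide_next(status, url2):
    """Return the (spider_name, url) to launch next, or None.

    Mirrors the hoped-for flow: when test1 gets a 200 response and
    has found url2 on the page, test2 should be started with url2.
    """
    if status == 200 and url2:
        return ('test2', url2)
    return None

# In test1's parse callback this would be driven by response.status
# and whatever selector extracts url2 from the page.
print(decide_next(200, 'http://example.com/page2'))
print(decide_next(404, 'http://example.com/page2'))
```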