Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

web scraping - Scrapy Shell and Scrapy Splash

We've been using the scrapy-splash middleware to pass scraped HTML through the Splash JavaScript engine, which runs inside a Docker container.

To use Splash in a spider, we configure several required project settings and yield a Request with the appropriate meta arguments:

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,  # requires `import scrapy_splash`
    }
})
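
For reference, the "required project settings" mentioned above usually look something like the sketch below. It follows the scrapy-splash README as far as I know; the SPLASH_URL value is an assumption (a Splash container listening on localhost:8050), so adjust it to your own setup.

```python
# settings.py (sketch; assumes the scrapy_splash package and a local Splash container)

# Address of the Splash HTTP API. Assumption: Docker publishes it on port 8050.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Dupe filter that takes the Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```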

This works as documented, but how can we use scrapy-splash inside the Scrapy shell?

1 Reply

Just wrap the URL you want to open in the shell in a Splash HTTP API request.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

where:
localhost:port is where your Splash service is running
url is the URL you want to crawl; don't forget to URL-quote it!
render.html is one of the available HTTP API endpoints; in this case it returns the rendered HTML page
timeout is the request timeout, in seconds
wait is the time, in seconds, to wait for JavaScript to execute before the HTML is read/saved
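
If the target URL has its own query string, quoting by hand gets error-prone. Here is a minimal sketch that builds the render.html URL with the standard library. The helper name splash_render_url and the sample target URL are made up for illustration, and the default base address assumes Splash is listening on localhost:8050.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      timeout=10, wait=0.5):
    """Build a Splash render.html URL for target_url (hypothetical helper).

    urlencode percent-encodes the target URL, so any '?' or '&' inside it
    is not mistaken for one of Splash's own query parameters.
    """
    query = urlencode({"url": target_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    # Sample target with its own query string (illustrative only).
    print(splash_render_url(
        "http://domain.com/page-with-javascript.html?page=2&sort=asc"))
```

Paste the printed URL, wrapped in single quotes, after scrapy shell, just like the command above.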

