Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
python - How to make Selenium scripts work faster?

I use Python, Selenium, and Scrapy to crawl a website, but my script is very slow:

Crawled 1 pages (at 1 pages/min)

I already use CSS selectors instead of XPath to save time, and I changed the downloader middleware:

'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,

Is Selenium simply too slow, or should I change something in my settings?

My code:

import time
from pyvirtualdisplay import Display
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def start_requests(self):
    # start_urls is a list, so yield one Request per URL
    for url in self.start_urls:
        yield Request(url, callback=self.parse)
def parse(self, response):
    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Firefox()
    driver.get("http://www.example.com")
    inputElement = driver.find_element_by_name("OneLineCustomerAddress")
    inputElement.send_keys("75018")
    inputElement.submit()
    catNums = driver.find_elements_by_css_selector("html body div#page div#main.content div#sContener div#menuV div#mvNav nav div.mvNav.bcU div.mvNavLk form.jsExpSCCategories ul.mvSrcLk li")
    #INIT
    driver.find_element_by_css_selector(".mvSrcLk>li:nth-child(1)>label.mvNavSel.mvNavLvl1").click()
    for catNumber in xrange(1,len(catNums)+1):
        print "\nIN catnumber\n"
        driver.find_element_by_css_selector("ul#catMenu.mvSrcLk> li:nth-child(%s)> label.mvNavLvl1" % catNumber).click()
        time.sleep(5)
        self.parse_articles(driver)
        pages = driver.find_elements_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

        if(pages):
            page = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

            checkText = (page.text).strip()
            if(len(checkText) > 0):
                pageNums = int(page.text)
                pageNums = pageNums  - 1
                for pageNumbers in range (pageNums):
                    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "waitingOverlay")))
                    driver.find_element_by_css_selector('.jsNxtPage.pgNext').click()
                    self.parse_articles(driver)
                    time.sleep(5)

def parse_articles(self,driver) :
    test = driver.find_elements_by_css_selector('html body div#page div#main.content div#sContener div#sContent div#lpContent.jsTab ul#lpBloc li div.prdtBloc p.prdtBDesc strong.prdtBCat')

def between(self, value, a, b):
    pos_a = value.find(a)
    if pos_a == -1: return ""
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]
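
For reference, the between helper above behaves like this (a standalone copy for clarity; the sample string is made up):

```python
def between(value, a, b):
    """Return the substring of `value` strictly between the
    first occurrence of `a` and the last occurrence of `b`."""
    pos_a = value.find(a)
    if pos_a == -1:
        return ""
    pos_b = value.rfind(b)
    if pos_b == -1:
        return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b:
        return ""
    return value[adjusted_pos_a:pos_b]

print(between("Category: Food (75018)", ": ", " ("))  # → Food
```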

1 Reply

Your code has a few flaws here.

  1. You use selenium to parse the page contents, when Scrapy Selectors are faster and more efficient.
  2. You start a new webdriver for every response.

This can be resolved very elegantly with Scrapy's downloader middlewares! You want a custom downloader middleware that downloads requests using selenium rather than the Scrapy downloader.

For example, I use this:

# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumDownloader(object):
    def create_driver(self):
        """only start the driver if the middleware is ever called"""
        if not getattr(self, 'driver', None):
            self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # this is called for every request, but we don't want to render
        # every request in selenium, so use a meta key for those we do want
        if not request.meta.get('selenium', False):
            return None  # fall through to the default scrapy downloader
        self.create_driver()
        self.driver.get(request.url)
        return HtmlResponse(request.url, body=self.driver.page_source, encoding='utf-8')

Activate your middleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumDownloader': 13,
}

Then in your spider you can specify which URLs to download via the selenium driver by adding a meta argument.

# you can start with selenium
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={'selenium': True})

def parse(self, response):
    # this response is rendered by selenium!
    # you can also skip selenium for other responses if you wish
    url = response.xpath("//a/@href").extract_first()
    yield scrapy.Request(response.urljoin(url))

The advantage of this approach is that your driver is started only once and used only to download the page source; the rest is left to Scrapy's proper asynchronous tools.
The disadvantage is that you cannot click buttons and the like, since you are not exposed to the driver. Most of the time you can reverse-engineer what the buttons do via the network inspector, so you should never need to do any clicking with the driver itself.

