Threading for selenium speed up
Consider the following functions to exemplify how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the HTML title from a page opened by selenium, using BeautifulSoup. The list of pages is links.
import time
import threading

from bs4 import BeautifulSoup
from selenium import webdriver


def create_driver():
    """returns a new chrome webdriver"""
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")  # not visible; comment out if you like seeing the opened browsers
    return webdriver.Chrome(options=chromeOptions)


def get_title(url, webdriver=None):
    """print the url's HTML title using BeautifulSoup

    if webdriver is None, use a new chrome driver and quit() it after,
    otherwise use the driver provided and don't quit() it after"""
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if webdriver:
        print_title(webdriver)
    else:
        webdriver = create_driver()
        print_title(webdriver)
        webdriver.quit()


links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/",
         "https://www.facebook.com/", "https://www.wikipedia.org/", "https://us.yahoo.com/?p=us",
         "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]
Now calling get_title on the links above.
Sequential approach
A single chrome driver, passing all links sequentially. Takes 22.3 s on my machine (note: Windows).
start_time = time.time()
driver = create_driver()

for link in links:  # each link could be a new 'click'
    get_title(link, driver)

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")
Multiple threads approach
Using one thread per link. Takes 10.5 s, more than 2x faster.
start_time = time.time()
threads = []

for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)

for th in threads:
    th.join()  # main thread waits for the threads to finish

print("multiple threads took ", (time.time() - start_time), " seconds")
This one here and this better one are some other working examples. The second uses a fixed number of threads in a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting it every time.
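Below is a minimal sketch of that per-thread-driver idea, reusing create_driver and get_title from above (the helper names thread_local, get_thread_driver and get_title_pooled are mine, not from the linked answers):

import time
import threading
from multiprocessing.pool import ThreadPool

thread_local = threading.local()  # each worker thread stores its own driver here

def get_thread_driver():
    """return this thread's chrome driver, creating it on first use"""
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = create_driver()  # create_driver() from the snippet above
        thread_local.driver = driver
    return driver

def get_title_pooled(url):
    get_title(url, get_thread_driver())  # a driver is passed, so get_title() won't quit() it

start_time = time.time()
pool = ThreadPool(4)  # fixed number of worker threads, reused across links
pool.map(get_title_pooled, links)
pool.close()
pool.join()
print("thread pool took ", (time.time() - start_time), " seconds")
# note: the per-thread drivers are never quit() here; cleanup is left out of this sketch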
Still, I was not sure this was the optimal approach for selenium to get considerable speed-ups, since threading on non-I/O-bound code ends up being executed sequentially (one thread after another): due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (utilize multiple CPU cores).
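As a quick illustration of that point (a toy benchmark of mine, not part of the selenium tests): a purely CPU-bound task gets no speed-up from threads, because the GIL lets only one thread execute Python bytecode at a time.

import time
import threading

def count_down(n=10_000_000):
    """purely CPU-bound work, no I/O to release the GIL"""
    while n > 0:
        n -= 1

start_time = time.time()
count_down()
count_down()
print("sequential took ", (time.time() - start_time), " seconds")

start_time = time.time()
t1 = threading.Thread(target=count_down)
t2 = threading.Thread(target=count_down)
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads took ", (time.time() - start_time), " seconds")  # roughly the same, or slower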
Processes for selenium speed up
To try to overcome the Python GIL limitation, I wrote the following code using the multiprocessing package and its Process class, and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. Additional code is here.
import multiprocessing

if __name__ == "__main__":  # needed on Windows, where child processes are spawned, not forked
    start_time = time.time()
    processes = []
    for link in links:  # each process a new 'click'
        ps = multiprocessing.Process(target=get_title, args=(link,))
        ps.start()  # could sleep 1 between 'clicks' with `time.sleep(1)`
        processes.append(ps)
    for ps in processes:
        ps.join()  # main process waits for the child processes to finish
    print("multiple processes took ", (time.time() - start_time), " seconds")
Contrary to what I would expect, Python multiprocessing.Process-based parallelism for selenium was on average around 8% slower than threading.Thread. But obviously both were, on average, more than twice as fast as the sequential approach. I just found out that selenium chrome-driver commands use HTTP requests (like POST, GET), so they are I/O bound and therefore release the Python GIL, which indeed makes threads run in parallel.
Threading: a good start for selenium speed up
This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, where multiprocessing has many limitations: each new Process is not a fork like on Linux, which means, among other downsides, that a lot of memory is wasted.
Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than the heavier approach of processes (especially for Windows users).