I'm scraping Bet365, probably one of the trickiest websites I've encountered, with Selenium and Chrome.
The issue with this site is that, even though my scraper sleeps between actions so that it never runs faster than a human could, it sometimes blocks my IP for a seemingly random amount of time (between half an hour and 2 hours).
So I'm looking into proxies to change my IP and resume scraping, and this is where I'm stuck trying to decide how to approach it.
I've tried 2 different free proxy providers, as follows.
https://gimmeproxy.com
I wasn't able to make this one work at first, and I'm emailing their support, but this is what I have, which should work:
    import requests

    api = "MY_API_KEY"  # with the free plan I can ask for an IP 240 times a day
    adder = "&post=true&supportsHttps=true&maxCheckPeriod=3600"
    url = "https://gimmeproxy.com/api/getProxy?"
    r = requests.get(url=url, params=adder)  # first attempt: the API key is missing

Edit: the request above never sent the API key. The corrected version is:

    apik = "api_key={}".format(api)
    r = requests.get(url=url, params=apik + adder)

Originally I got no answer, just a 404 Not Found error. With the API key included it now works; my mistake.
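A missing key like this is easier to spot when the query string is built from a dictionary instead of being concatenated by hand. Here is a minimal sketch using only the standard library. `build_proxy_url` is a hypothetical helper name, and the parameter names are simply the ones from the request above.

```python
from urllib.parse import urlencode

BASE_URL = "https://gimmeproxy.com/api/getProxy"  # endpoint taken from the post


def build_proxy_url(api_key):
    """Return the request URL with the API key and the filters used above."""
    params = {
        "api_key": api_key,
        "post": "true",
        "supportsHttps": "true",
        "maxCheckPeriod": 3600,
    }
    # urlencode joins the pairs with '&' and escapes the values
    return BASE_URL + "?" + urlencode(params)


print(build_proxy_url("MY_API_KEY"))
```

The resulting URL can be passed straight to `requests.get`, or the same dictionary can be given to `requests.get(url, params=...)`.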
My second approach is through another site, sslproxy.
With this one, you scrape the page and get a list of 100 IPs, theoretically checked and working. So I've set up a loop that tries a random IP from that list and, if it doesn't work, deletes it from the list and tries again. This approach works when trying to open Bet365.
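    import random
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    # `proxies` (list of dicts with 'ip' and 'port') and `path` (chromedriver path) are defined elsewhere
    url = "https://www.bet365.es"
    for n in range(1, 100):
        # pick a random proxy from the scraped list
        proxy_index = random.randint(0, len(proxies) - 1)
        proxi = proxies[proxy_index]
        PROXY = proxi['ip'] + ':' + proxi['port']
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--proxy-server={}'.format(PROXY))
        browser = None
        try:
            browser = webdriver.Chrome(path, options=chrome_options)
            browser.get(url)
            WebDriverWait(browser,10)..... #no need to post the whole condition
            break
        except Exception:
            # this proxy failed: drop it, and close the browser only if it actually started
            del proxies[proxy_index]
            if browser is not None:
                browser.quit()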
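The same "pick a random proxy, drop it if it fails" idea can be separated from Selenium so it is easier to test. The following is only a sketch: `pick_working_proxy` and its `check` callback are hypothetical names, and `check` stands in for whatever opens the page through the proxy and waits for it to load.

```python
import random


def pick_working_proxy(proxies, check, max_tries=100):
    """Try random proxies until check(proxy) returns True.

    Failed proxies are removed from `proxies` in place.
    Returns the working proxy dict, or None if none is found.
    """
    for _ in range(max_tries):
        if not proxies:
            return None  # list exhausted, nothing left to try
        index = random.randrange(len(proxies))
        proxy = proxies[index]
        try:
            ok = check(proxy)
        except Exception:
            ok = False  # treat any error as a dead proxy
        if ok:
            return proxy
        del proxies[index]
    return None


# Usage with a stand-in check; a real one would drive the browser.
pool = [{'ip': '1.1.1.1', 'port': '80'}, {'ip': '2.2.2.2', 'port': '8080'}]
print(pick_working_proxy(pool, lambda p: p['port'] == '8080'))
```

Unlike the loop above, this version stops cleanly when the list runs out instead of calling the random picker on an empty list.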
Well, with this one I succeeded in opening Bet365. I'm still checking, but I think this webdriver is going to be much slower than my original one with no proxy.
So my question is: is scraping through a proxy expected to be much slower, or does it depend on the proxy used? If it does, can anyone recommend a different (or surely better) approach?
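One way I could check whether the slowdown depends on the proxy is to time a plain request through each one and compare. This is a rough sketch under assumptions: `fetch_through_proxy` and `rank_by_latency` are hypothetical helpers, proxies are given as `'ip:port'` strings, and a bare HTTP fetch is only an approximation of a full Selenium page load.

```python
import time
import urllib.request


def fetch_through_proxy(proxy, url="https://www.bet365.es", timeout=10):
    """Fetch url through an HTTP proxy given as 'ip:port' (makes a real request)."""
    handler = urllib.request.ProxyHandler(
        {"http": "http://" + proxy, "https": "http://" + proxy}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        response.read()


def rank_by_latency(proxies, fetch=fetch_through_proxy):
    """Return (seconds, proxy) pairs for proxies that responded, fastest first."""
    timings = []
    for proxy in proxies:
        start = time.perf_counter()
        try:
            fetch(proxy)
        except Exception:
            continue  # dead or blocked proxy: leave it out of the ranking
        timings.append((time.perf_counter() - start, proxy))
    return sorted(timings)
```

Calling `rank_by_latency([p['ip'] + ':' + p['port'] for p in proxies])` would then show how much the candidates differ, and the fastest few could be fed to the webdriver loop.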