I am using Python to scrape pages, and until now I haven't had any complicated issues.
The site I'm now trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.
Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (a new IP, not used before, from a different C block). I have tried spoofing headers and randomizing the time between requests, but the result is the same.
I have also tried Selenium, and with it I got much better results: about 600-650 pages before getting banned. There I also randomized the delay between requests (3-5 seconds) and made a time.sleep(300)
call on every 300th request. Despite that, I'm still getting banned.
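The Selenium pacing I described looks roughly like this (the driver choice is just an assumption; the 3-5 second delay and the 300-second pause every 300th request are the numbers from above):

```python
import random
import time

def pick_delay(low=3.0, high=5.0):
    """Randomized per-request delay (3-5 seconds, as described above)."""
    return random.uniform(low, high)

def is_long_pause(request_count, every=300):
    """True on every 300th request, where I add the long sleep."""
    return request_count > 0 and request_count % every == 0

def scrape_with_selenium(urls):
    # Import kept local so the pacing helpers above work without a browser
    # installed; Firefox here is just an example driver.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        for count, url in enumerate(urls, start=1):
            driver.get(url)
            # ... extract data from driver.page_source here ...
            time.sleep(pick_delay())
            if is_long_pause(count):
                time.sleep(300)  # long pause on every 300th request
    finally:
        driver.quit()
```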
From this I can conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.
Based on your experience what else should I try?
Would closing and reopening the browser in Selenium help (for example, after every 100th request)? I was also thinking about trying proxies, but there are about a million pages, so it would be very expensive.
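The restart idea I'm asking about would look roughly like this (the every-100th-request threshold and the Firefox driver are just assumptions for the sketch):

```python
def needs_restart(request_count, every=100):
    """True when the browser should be closed and reopened."""
    return request_count > 0 and request_count % every == 0

def scrape_with_restarts(urls):
    # Import kept local so needs_restart() works without a browser installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        for count, url in enumerate(urls, start=1):
            driver.get(url)
            # ... extract data from driver.page_source here ...
            if needs_restart(count):
                driver.quit()                 # drop the old session (cookies, cache)
                driver = webdriver.Firefox()  # start a fresh browser session
    finally:
        driver.quit()
```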