I am using Python to scrape pages, and until now I haven't had any complicated issues.
The site I'm now trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.
Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (a new IP, not used before, from a different C block). I have tried spoofing headers and randomizing the time between requests, but the result is the same.
I have also tried Selenium, and with it I got much better results: about 600-650 pages before getting banned. There I also randomized the delay between requests (3-5 seconds) and made a time.sleep(300)
call on every 300th request. Despite that, I'm still getting banned.
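The Selenium pacing I described looks roughly like this (the driver choice is just an assumption; the 3-5 second delay and the 300-second pause every 300th request are the numbers from above):

```python
import random
import time

def pick_delay(low=3.0, high=5.0):
    """Randomized per-request delay (3-5 seconds, as described above)."""
    return random.uniform(low, high)

def is_long_pause(request_count, every=300):
    """True on every 300th request, where I add the long sleep."""
    return request_count > 0 and request_count % every == 0

def scrape_with_selenium(urls):
    # Import kept local so the pacing helpers above work without a browser
    # installed; Firefox here is just an example driver.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        for count, url in enumerate(urls, start=1):
            driver.get(url)
            # ... extract data from driver.page_source here ...
            time.sleep(pick_delay())
            if is_long_pause(count):
                time.sleep(300)  # long pause on every 300th request
    finally:
        driver.quit()
```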
From this I can conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.
Based on your experience what else should I try?
Would closing and reopening the browser in Selenium help (for example, after every 100th request)? I was also thinking about trying proxies, but there are about a million pages, so it would be very expensive.
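The restart idea I'm asking about would look roughly like this (the every-100th-request threshold and the Firefox driver are just assumptions for the sketch):

```python
def needs_restart(request_count, every=100):
    """True when the browser should be closed and reopened."""
    return request_count > 0 and request_count % every == 0

def scrape_with_restarts(urls):
    # Import kept local so needs_restart() works without a browser installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        for count, url in enumerate(urls, start=1):
            driver.get(url)
            # ... extract data from driver.page_source here ...
            if needs_restart(count):
                driver.quit()                 # drop the old session (cookies, cache)
                driver = webdriver.Firefox()  # start a fresh browser session
    finally:
        driver.quit()
```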