I am dealing with Scrapy, Privoxy and Tor. I have all installed and properly working. But Tor connects with the same IP everytime, so I can easily be banned. Is it possible to tell Tor to reconnect each X seconds or connections?
Thanks!
EDIT about the configuration:
For the user agent pool i did this: http://tangww.com/2013/06/UsingRandomAgent/ (I had to put a _ init _.py file as it is said in the comments), and for the Privoxy and Tor I followed http://www.andrewwatters.com/privoxy/ (I had to create the private user and private group manually with the terminal). It worked :)
My spider is this:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
class YourCrawler(CrawlSpider):
name = "spider_name"
start_urls = [
'https://example.com/listviews/titles.php',
]
allowed_domains = ["example.com"]
def parse(self, response):
# go to the urls in the list
s = Selector(response)
page_list_urls = s.xpath('///*[@id="tab7"]/article/header/h2/a/@href').extract()
for url in page_list_urls:
yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)
# Return back and go to bext page in div#paginat ul li.next a::attr(href) and begin again
next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
if next_page is not None:
next_page = response.urljoin(next_page)
yield Request(next_page, callback=self.parse)
# For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
def parse_following_urls(self, response):
#Parsing rules go here
for each_book in response.css('main#main'):
yield {
'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
}
In settings.py I have an user agent rotation and privoxy:
DOWNLOADER_MIDDLEWARES = {
#user agent
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
'spider_name.comm.rotate_useragent.RotateUserAgentMiddleware' :400,
#privoxy
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
'spider_name.middlewares.ProxyMiddleware': 100
}
In middlewares.py I added:
class ProxyMiddleware(object):
def process_request(self, request, spider):
request.meta['proxy'] = 'http://127.0.0.1:8118'
spider.log('Proxy : %s' % request.meta['proxy'])
And I think that's all…
EDIT II ---
Ok, I changed my middlewares.py file as in the blog @Tomá? Linhart said from:
class ProxyMiddleware(object):
def process_request(self, request, spider):
request.meta['proxy'] = 'http://127.0.0.1:8118'
spider.log('Proxy : %s' % request.meta['proxy'])
To
from stem import Signal
from stem.control import Controller
class ProxyMiddleware(object):
def process_request(self, request, spider):
request.meta['proxy'] = 'http://127.0.0.1:8118'
spider.log('Proxy : %s' % request.meta['proxy'])
def set_new_ip():
with Controller.from_port(port=9051) as controller:
controller.authenticate(password='tor_password')
controller.signal(Signal.NEWNYM)
But now is really slow, and doesn't appear to change the ip… I did it ok or is something wrong?
See Question&Answers more detail:
os