I am trying to build multi threaded crawler that uses tor proxies:
I am using following to establish tor connection:
from stem import Signal
from stem.control import Controller
controller = Controller.from_port(port=9151)
def connectTor():
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
socket.socket = socks.socksocket
def renew_tor():
global request_headers
request_headers = {
"Accept-Language": "en-US,en;q=0.5",
"User-Agent": random.choice(BROWSERS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Referer": "http://thewebsite2.com",
"Connection": "close"
}
controller.authenticate()
controller.signal(Signal.NEWNYM)
Here is url fetcher:
def get_soup(url):
while True:
try:
connectTor()
r = requests.Session()
response = r.get(url, headers=request_headers)
the_page = response.content.decode('utf-8',errors='ignore')
the_soup = BeautifulSoup(the_page, 'html.parser')
if "captcha" in the_page.lower():
print("flag condition matched while url: ", url)
#print(the_page)
renew_tor()
else:
return the_soup
break
except Exception as e:
print ("Error while URL :", url, str(e))
I am then creating multithreaded fetch job:
with futures.ThreadPoolExecutor(200) as executor:
for url in zurls:
future = executor.submit(fetchjob,url)
then I am getting following error, which I am not seeing when I use multiprocessing:
Socket connection failed (Socket error: 0x01: General SOCKS server failure)
I would appreciate Any advise to avoid socks error and improving the performance of crawling method to make it multi threaded.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…