python - multithreaded crawler while using tor proxy

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to build a multithreaded crawler that uses Tor proxies. I am using the following to establish the Tor connection:

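import random
import socket

import requests
import socks
from bs4 import BeautifulSoup
from stem import Signal
from stem.control import Controller

# 9151 is the control port, 9150 (below) the SOCKS port
controller = Controller.from_port(port=9151)
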
def connectTor():
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
    socket.socket = socks.socksocket


def renew_tor():
    global request_headers
    request_headers = {
        "Accept-Language": "en-US,en;q=0.5",
        "User-Agent": random.choice(BROWSERS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "http://thewebsite2.com",
        "Connection": "close"
    }

    controller.authenticate()
    controller.signal(Signal.NEWNYM)

Here is the URL fetcher:

def get_soup(url):
    while True:
        try:
            connectTor()
            r = requests.Session()
            response = r.get(url, headers=request_headers)
            the_page = response.content.decode('utf-8',errors='ignore')
            the_soup = BeautifulSoup(the_page, 'html.parser')
            if "captcha" in the_page.lower():
                print("flag condition matched while url: ", url)
                #print(the_page)
                renew_tor()
            else:
                return the_soup
        except Exception as e:
            print("Error while URL :", url, str(e))

I then create a multithreaded fetch job:

with futures.ThreadPoolExecutor(200) as executor:
    for url in zurls:
        future = executor.submit(fetchjob, url)

Then I get the following error, which I do not see when I use multiprocessing:

 Socket connection failed (Socket error: 0x01: General SOCKS server failure)

I would appreciate any advice on avoiding the SOCKS error and on improving the performance of the crawling method so that it works multithreaded.

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:39:32+0000

This is a perfect example of why monkey patching socket.socket is bad.

Assigning socket.socket = socks.socksocket replaces the socket class used by every connection made in the process (which is almost everything), not just your HTTP requests.

When you later connect to the Tor controller, that connection also attempts to speak the SOCKS protocol instead of establishing a direct connection.
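The global effect of the patch can be seen without Tor at all. The following is a minimal sketch using only the standard library; FakeSocksSocket, patched_socket and socket_class_seen_by_libraries are hypothetical names made up for this illustration, with FakeSocksSocket standing in for socks.socksocket:

```python
import socket
from contextlib import contextmanager


class FakeSocksSocket(socket.socket):
    """Hypothetical stand-in for socks.socksocket; it only marks the class."""


@contextmanager
def patched_socket(replacement):
    """Temporarily replace socket.socket, as the monkey patch does permanently."""
    original = socket.socket
    socket.socket = replacement
    try:
        yield
    finally:
        socket.socket = original


def socket_class_seen_by_libraries():
    # Library code typically looks up socket.socket at call time,
    # so it picks up whatever class is installed at that moment.
    return socket.socket.__name__


before = socket_class_seen_by_libraries()
with patched_socket(FakeSocksSocket):
    during = socket_class_seen_by_libraries()
after = socket_class_seen_by_libraries()

print(before, during, after)  # prints: socket FakeSocksSocket socket
```

Any code that creates a socket while the patch is active, including a controller client, gets the replacement class whether it wants a proxy or not.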

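Since you're already using requests, I'd suggest getting rid of SocksiPy and the socket.socket = socks.socksocket code and using the SOCKS proxy functionality built into requests (available when requests is installed with the requests[socks] extra). Set the port to the one your Tor instance listens on; the code in the question uses 9150: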

proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}

response = r.get(url, headers=request_headers, proxies=proxies)
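Putting this together for the threaded case, here is a rough sketch rather than a definitive implementation. The helper names (tor_proxies, get_session, fetch, crawl), the factory parameter, the 30 second timeout and the worker count of 20 are all assumptions for illustration. It keeps one session per worker thread and passes the proxies per request, so nothing global is patched:

```python
import threading
from concurrent import futures


def tor_proxies(host="127.0.0.1", port=9050):
    """Build a requests-style proxies mapping for a Tor SOCKS port.

    socks5h (rather than socks5) asks the proxy to resolve hostnames,
    so DNS lookups also go through Tor.
    """
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


_local = threading.local()


def default_session_factory():
    # Imported lazily so the helpers can be exercised with a fake session.
    import requests  # SOCKS support needs: pip install "requests[socks]"
    return requests.Session()


def get_session(factory=default_session_factory):
    """Return this thread's session, creating it on first use."""
    if not hasattr(_local, "session"):
        _local.session = factory()
    return _local.session


def fetch(url, headers=None, proxies=None, factory=default_session_factory):
    """Fetch one URL through the given proxies and return the decoded page."""
    session = get_session(factory)
    response = session.get(url, headers=headers, proxies=proxies, timeout=30)
    return response.content.decode("utf-8", errors="ignore")


def crawl(urls, max_workers=20, **fetch_kwargs):
    """Fetch all URLs on a thread pool; returns {url: page text or exception}."""
    results = {}
    with futures.ThreadPoolExecutor(max_workers) as executor:
        future_to_url = {
            executor.submit(fetch, url, **fetch_kwargs): url for url in urls
        }
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; report the failure per URL
                results[url] = exc
    return results
```

A call might look like crawl(zurls, headers=request_headers, proxies=tor_proxies(port=9150)). The pool is deliberately smaller than the 200 threads in the question, since that many simultaneous connections through a single Tor instance may contribute to failures; the right number is something to measure.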
