python - How to use multiprocessing to loop through a big list of URL?

Question

Welcome To Ask or Share your Answers For Others

python - How to use multiprocessing to loop through a big list of URL?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - How to use multiprocessing to loop through a big list of URL?

Problem: Check a listing of over 1000 urls and get the url return code (status_code).

The script I have works but very slow.

I am thinking there has to be a better, pythonic (more beutifull) way of doing this, where I can spawn 10 or 20 threads to check the urls and collect resonses. (i.e:

200 -> www.yahoo.com
404 -> www.badurl.com
...

Input file:Url10.txt

www.example.com
www.yahoo.com
www.testsite.com

....

import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

print(urls)
for url in urls:
    url =  'http://'+url   #Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)

Challenges: Improve speed with multiprocessing.

With multiprocessing

But is it not working. I get the following error: (note: I am not sure if I have even implemented this correctly)

AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>

--

import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()
 
def checkurlconnection(url):
    
    for url in urls:
        url =  'http://'+url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)
        
if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:52:35+0000

In this case your task is I/O bound and not processor bound - it takes longer for a website to reply than it does for your CPU to loop once through your script (not including the TCP request). What this means is that you wont get any speedup from doing this task in parallel (which is what multiprocessing does). What you want is multi-threading. The way this is achieved is by using the little documented, perhaps poorly named, multiprocessing.dummy:

import requests
from multiprocessing.dummy import Pool as ThreadPool 

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)  # Make the Pool of workers
    results = pool.map(get_status, urls) #Open the urls in their own threads
    pool.close() #close the pool and wait for the work to finish 
    pool.join()

See here for examples of multiprocessing vs multithreading in Python.

Categories

python - How to use multiprocessing to loop through a big list of URL?

python - How to use multiprocessing to loop through a big list of URL?

Input file:Url10.txt

With multiprocessing

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags