Speeding up crawling is basically Eventlet's main use case. It's very fast -- we have an application that has to fetch 2,000,000 URLs in a few minutes. It makes use of the fastest event interface on your system (epoll, generally), and uses greenthreads (built on top of coroutines, so very inexpensive) to keep the code easy to write.
Here's an example from the docs:
import eventlet
from eventlet.green import urllib2  # cooperative drop-in for the stdlib urllib2

urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

def fetch(url):
    body = urllib2.urlopen(url).read()
    return url, body

pool = eventlet.GreenPool()
for url, body in pool.imap(fetch, urls):
    print "got body from", url, "of length", len(body)
This is a pretty good starting point for developing a more fully-featured crawler. Feel free to pop in to #eventlet on Freenode to ask for help.
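For illustration, here's a rough sketch of one way you might grow that example toward a crawler: cap the number of simultaneous connections and collect errors instead of letting one bad URL kill the run. The pool size, timeout, and error-handling scheme below are my own assumptions, not anything Eventlet prescribes.

import eventlet
from eventlet.green import urllib2

def fetch(url):
    # Each fetch runs in its own greenthread; the green urllib2 yields
    # to the hub while waiting on the socket, so fetches overlap.
    try:
        body = urllib2.urlopen(url, timeout=10).read()
        return url, body, None
    except Exception as e:
        # Return the error rather than raising, so imap keeps going.
        return url, None, e

# Hypothetical cap of 200 concurrent connections; tune it for your
# bandwidth and for how politely you want to treat the target sites.
pool = eventlet.GreenPool(200)

seed_urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
             "https://wiki.secondlife.com/w/images/secondlife.jpg"]

for url, body, err in pool.imap(fetch, seed_urls):
    if err is not None:
        print "failed", url, err
    else:
        print "got", len(body), "bytes from", url

From there it's mostly a matter of extracting links from each body and feeding them back into the pool, which is what the recursive example in the docs does.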
[update: I added a more-complex recursive web crawler example to the docs. I swear it was in the works before this question was asked, but the question did finally inspire me to finish it. :)]