
Downloading a LOT of files using Python

Is there a good way to download a lot of files en masse using Python? This code is speedy enough for downloading about 100 files, but I need to download 300,000. Obviously they are all very small files (or I wouldn't be downloading 300,000 of them :) ), so the real bottleneck seems to be this loop. Does anyone have any thoughts? Maybe using MPI or threading?

Do I just have to live with the bottleneck? Or is there a faster way, maybe not even using Python?

(I included the full beginning of the code just for completeness' sake.)

from __future__ import division
import pandas as pd
import numpy as np
import urllib2
import os
import linecache 

#we start with a huge file of urls

data = pd.read_csv("edgar.csv")
datatemp2 = data[data['form'].str.contains("14A")]
datatemp3 = data[data['form'].str.contains("14C")]

#data2 is the cut-down file

data2 = datatemp2.append(datatemp3)
flist = np.array(data2['filename'])
print len(flist)
print flist

###below we have a script to download all of the files in the data2 database
###here you will need to create a new directory named edgar14A14C in your CWD

original = os.getcwd()
os.chdir(os.path.join(original, 'edgar14A14C'))


for i in xrange(len(flist)):
    url = "ftp://ftp.sec.gov/"+str(flist[i])
    file_name = str(url.split('/')[-1])
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()
    print i

1 Reply


The usual pattern with multiprocessing is to create a job() function that takes arguments and performs the potentially CPU-bound work for a single input.

Example (based on your code):

from multiprocessing import Pool
import urllib2

def job(url):
    # download one file and save it under its basename in the current directory
    file_name = str(url.split('/')[-1])
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

pool = Pool()
urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
pool.map(job, urls)

This will do a number of things:

  • Create a multiprocessing pool with as many workers as you have CPUs or CPU cores.
  • Create a list of inputs for the job() function.
  • Map the list of input urls to job() and wait for all jobs to complete.

Python's multiprocessing.Pool.map takes care of splitting your input across the number of workers in the pool.
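Since the work here is mostly waiting on the network rather than using the CPU, a thread pool is also worth trying (the question mentions threading); multiprocessing.dummy exposes the same Pool API backed by threads, so it is a drop-in swap. A minimal sketch, reusing job() and urls from the example above; the pool size of 20 is just an illustrative guess, not a tuned value:

from multiprocessing.dummy import Pool as ThreadPool  # same API as Pool, but uses threads

pool = ThreadPool(20)   # 20 worker threads -- an arbitrary starting point, tune as needed
pool.map(job, urls)     # job() and urls as defined in the example above
pool.close()
pool.join()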

Another useful little thing I've done for this kind of work is to add a progress bar from the progress package, like this:

from multiprocessing import Pool
from progress.bar import Bar


def job(item):
    # do some work on item
    pass


pool = Pool()
inputs = range(100)
bar = Bar('Processing', max=len(inputs))
for i in pool.imap(job, inputs):
    bar.next()
bar.finish()

This gives you a nice progress bar on your console as your jobs run, so you have some idea of overall progress, ETA, etc.
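Putting the two together for this download case is straightforward. A sketch only, reusing the job() and urls defined in the first example:

from multiprocessing import Pool
from progress.bar import Bar

pool = Pool()
bar = Bar('Downloading', max=len(urls))
for _ in pool.imap_unordered(job, urls):   # completion order doesn't matter for a progress bar
    bar.next()
bar.finish()
pool.close()
pool.join()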

I also find the requests library very useful here; it offers a much nicer API for dealing with web resources and downloading content.
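For example, a requests-based version of job() might look like the sketch below. Note that this only applies if the files are also reachable over HTTP/HTTPS, since requests does not handle ftp:// URLs, so treat the URL scheme as an assumption here:

import requests

def job(url):
    # hypothetical HTTP/HTTPS variant of the download job
    file_name = url.split('/')[-1]
    r = requests.get(url, timeout=30)   # timeout is a reasonable safeguard, adjust as needed
    r.raise_for_status()                # raise on HTTP errors instead of saving an error page
    with open(file_name, 'wb') as f:
        f.write(r.content)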

