My usual approach (if you can live with extra memory copies) is to do all IO in one process and then send things out to a pool of worker processes. To load a slice of a memmapped array into memory, just do x = np.array(data[yourslice]). (data[yourslice].copy() doesn't actually do this, since .copy() on a memmap still returns an np.memmap instance, which can lead to some confusion; see the quick check after the test data below.)
First off, let's generate some test data:
import numpy as np
np.random.random(10000).tofile('data.dat')
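As a quick check of the .copy() caveat above (just a sketch, assuming the data.dat file generated here):

import numpy as np

data = np.memmap('data.dat', dtype=np.float64, mode='r')
print(type(data[:100].copy()))     # still numpy.memmap, not a plain in-memory array
print(type(np.array(data[:100])))  # numpy.ndarray, fully loaded into memory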
You can reproduce your errors with something like this:
import numpy as np
import multiprocessing

def main():
    # Read-only memory-map of the file on disk.
    data = np.memmap('data.dat', dtype=np.float64, mode='r')
    pool = multiprocessing.Pool()
    # Each chunk handed to the pool here is still a memmap view,
    # which is exactly what triggers the problem.
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float64)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = list(range(0, data.size, chunksize)) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield data[start:stop]

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()
And if you just switch to yielding np.array(data[start:stop]) instead, you'll fix the problem:
import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float64, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float64)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = list(range(0, data.size, chunksize)) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        # np.array() loads the slice into memory as a plain ndarray,
        # so the workers never see the memmap itself.
        yield np.array(data[start:stop])

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()
Of course, this does make an extra in-memory copy of each chunk.
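If that copy turns out to matter, one alternative (not part of the answer above, just a sketch) is to send only (start, stop) index pairs to the pool and let each worker open its own read-only memmap; the init_worker helper and the module-level _data name below are hypothetical:

import numpy as np
import multiprocessing

def init_worker(filename):
    # Each worker process opens its own read-only memmap once, so only
    # small (start, stop) tuples travel through the pool's pipes.
    global _data
    _data = np.memmap(filename, dtype=np.float64, mode='r')

def calculation(bounds):
    """Dummy calculation on a slice of the worker's own memmap."""
    start, stop = bounds
    chunk = _data[start:stop]
    return chunk.mean() - chunk.std()

def main():
    data = np.memmap('data.dat', dtype=np.float64, mode='r')
    intervals = list(range(0, data.size, 100)) + [None]
    bounds = list(zip(intervals[:-1], intervals[1:]))
    pool = multiprocessing.Pool(initializer=init_worker, initargs=('data.dat',))
    results = np.fromiter(pool.imap(calculation, bounds), dtype=np.float64)

if __name__ == '__main__':
    main()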
In the long run, you'll probably find that it's easier to switch away from memmapped files and move to something like HDF. This is especially true if your data is multidimensional. (I'd recommend h5py, but pyTables is nice if your data is "table-like".)
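For what it's worth, a minimal h5py version of the same workflow might look something like this (the data.h5 file name and the 'data' dataset name are just placeholders):

import h5py
import numpy as np

# One-time conversion of the flat binary file into an HDF5 dataset.
data = np.memmap('data.dat', dtype=np.float64, mode='r')
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=True)

# Later: slicing an h5py dataset returns a plain in-memory ndarray, so the
# chunks pickle and ship to worker processes without any memmap surprises.
with h5py.File('data.h5', 'r') as f:
    chunk = f['data'][0:100]
    print(chunk.mean() - chunk.std())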
Good luck, at any rate!