I've implemented a function for applying np.dot
to blocks that are explicitly read into core memory from the memory-mapped array:
import numpy as np
def _block_slices(dim_size, block_size):
"""Generator that yields slice objects for indexing into
sequential blocks of an array along a particular axis
"""
count = 0
while True:
yield slice(count, count + block_size, 1)
count += block_size
if count > dim_size:
raise StopIteration
def blockwise_dot(A, B, max_elements=int(2**27), out=None):
"""
Computes the dot product of two matrices in a block-wise fashion.
Only blocks of `A` with a maximum size of `max_elements` will be
processed simultaneously.
"""
m, n = A.shape
n1, o = B.shape
if n1 != n:
raise ValueError('matrices are not aligned')
if A.flags.f_contiguous:
# prioritize processing as many columns of A as possible
max_cols = max(1, max_elements / m)
max_rows = max_elements / max_cols
else:
# prioritize processing as many rows of A as possible
max_rows = max(1, max_elements / n)
max_cols = max_elements / max_rows
if out is None:
out = np.empty((m, o), dtype=np.result_type(A, B))
elif out.shape != (m, o):
raise ValueError('output array has incorrect dimensions')
for mm in _block_slices(m, max_rows):
out[mm, :] = 0
for nn in _block_slices(n, max_cols):
A_block = A[mm, nn].copy() # copy to force a read
out[mm, :] += np.dot(A_block, B[nn, :])
del A_block
return out
I then did some benchmarking to compare my blockwise_dot
function to the normal np.dot
function applied directly to a memory-mapped array (see below for the benchmarking script). I'm using numpy 1.9.0.dev-205598b linked against OpenBLAS v0.2.9.rc1 (compiled from source). The machine is a quad-core laptop running Ubuntu 13.10, with 8GB RAM and an SSD, and I've disabled the swap file.
Results
As @Bi Rico predicted, the time taken to compute the dot product is beautifully O(n) with respect to the dimensions of A
. Operating on cached blocks of A
gives a huge performance improvement over just calling the normal np.dot
function on the whole memory-mapped array:
It's surprisingly insensitive to the size of the blocks being processed - there's very little difference between the time taken to process the array in blocks of 1GB, 2GB or 4GB. I conclude that whatever caching np.memmap
arrays natively implement, it seems to be very suboptimal for computing dot products.
Further questions
It's still a bit of a pain to have to manually implement this caching strategy, since my code will probably have to run on machines with different amounts of physical memory, and potentially different operating systems. For that reason I'm still interested in whether there are ways to control the caching behaviour of memory-mapped arrays in order to improve the performance of np.dot
.
I noticed some odd memory handling behaviour as I was running the benchmarks - when I called np.dot
on the whole of A
I never saw the resident set size of my Python process exceed about 3.8GB, even though I have about 7.5GB of RAM free. This leads me to suspect that there is some limit imposed on the amount of physical memory an np.memmap
array is allowed to occupy - I had previously assumed that it would use whatever RAM the OS allows it to grab. In my case it might be very beneficial to be able to increase this limit.
Does anyone have any further insight into the caching behaviour of np.memmap
arrays that would help to explain this?
Benchmarking script
def generate_random_mmarray(shape, fp, max_elements):
A = np.memmap(fp, dtype=np.float32, mode='w+', shape=shape)
max_rows = max(1, max_elements / shape[1])
max_cols = max_elements / max_rows
for rr in _block_slices(shape[0], max_rows):
for cc in _block_slices(shape[1], max_cols):
A[rr, cc] = np.random.randn(*A[rr, cc].shape)
return A
def run_bench(n_gigabytes=np.array([16]), max_block_gigabytes=6, reps=3,
fpath='temp_array'):
"""
time C = A * B, where A is a big (n, n) memory-mapped array, and B and C are
(n, o) arrays resident in core memory
"""
standard_times = []
blockwise_times = []
differences = []
nbytes = n_gigabytes * 2 ** 30
o = 64
# float32 elements
max_elements = int((max_block_gigabytes * 2 ** 30) / 4)
for nb in nbytes:
# float32 elements
n = int(np.sqrt(nb / 4))
with open(fpath, 'w+') as f:
A = generate_random_mmarray((n, n), f, (max_elements / 2))
B = np.random.randn(n, o).astype(np.float32)
print "
" + "-"*60
print "A: %s(%i bytes)" %(A.shape, A.nbytes)
print "B: %s(%i bytes)" %(B.shape, B.nbytes)
best = np.inf
for _ in xrange(reps):
tic = time.time()
res1 = np.dot(A, B)
t = time.time() - tic
best = min(best, t)
print "Normal dot:%imin %.2fsec" %divmod(best, 60)
standard_times.append(best)
best = np.inf
for _ in xrange(reps):
tic = time.time()
res2 = blockwise_dot(A, B, max_elements=max_elements)
t = time.time() - tic
best = min(best, t)
print "Block-wise dot:%imin %.2fsec" %divmod(best, 60)
blockwise_times.append(best)
diff = np.linalg.norm(res1 - res2)
print "L2 norm of difference:%g" %diff
differences.append(diff)
del A, B
del res1, res2
os.remove(fpath)
return (np.array(standard_times), np.array(blockwise_times),
np.array(differences))
if __name__ == '__main__':
n = np.logspace(2,5,4,base=2)
standard_times, blockwise_times, differences = run_bench(
n_gigabytes=n,
max_block_gigabytes=4)
np.savez('bench_results', standard_times=standard_times,
blockwise_times=blockwise_times, differences=differences)