Using numpy.concatenate apparently loads the arrays into memory. To avoid this, you can create a third memmap array in a new file and copy into it the values from the arrays you wish to concatenate. More efficiently, you can append new arrays to a file that already exists on disk.
In either case you must choose the right order for the array (row-major or column-major).
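The first approach (a third memmap in a new file) could look roughly like the sketch below. The helper name concat_to_new_memmap, the chunk_rows parameter, and the small temporary files are assumptions made for illustration; it handles only 2-D row-major arrays joined along axis 0.

```python
import os
import tempfile

import numpy as np


def concat_to_new_memmap(paths_and_shapes, out_path, dtype="float64", chunk_rows=1024):
    """Concatenate 2-D row-major memmap files along axis 0 into a new file.

    paths_and_shapes is a list of (path, shape) pairs (hypothetical helper);
    every shape must have the same number of columns.
    """
    n_cols = paths_and_shapes[0][1][1]
    total_rows = sum(shape[0] for _, shape in paths_and_shapes)
    out = np.memmap(out_path, dtype=dtype, mode="w+", shape=(total_rows, n_cols))
    row = 0
    for path, shape in paths_and_shapes:
        src = np.memmap(path, dtype=dtype, mode="r", shape=shape)
        # Copy a bounded number of rows at a time to keep memory use small.
        for start in range(0, shape[0], chunk_rows):
            stop = min(start + chunk_rows, shape[0])
            out[row + start:row + stop] = src[start:stop]
        row += shape[0]
        del src
    out.flush()
    return out


# Small demonstration files in a temporary directory.
tmp = tempfile.mkdtemp()
a_path = os.path.join(tmp, "a.array")
b_path = os.path.join(tmp, "b.array")

a = np.memmap(a_path, dtype="float64", mode="w+", shape=(5, 4))
a[:] = np.arange(20).reshape(5, 4)
a.flush()
b = np.memmap(b_path, dtype="float64", mode="w+", shape=(3, 4))
b[:] = 100 + np.arange(12).reshape(3, 4)
b.flush()

c = concat_to_new_memmap(
    [(a_path, (5, 4)), (b_path, (3, 4))],
    os.path.join(tmp, "c.array"),
    chunk_rows=2,
)
```

This leaves both input files untouched, at the cost of writing every value once more; the in-place variants only write the appended array.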
The following examples illustrate how to concatenate along axis 0 and axis 1.
1) concatenate along axis=0
import numpy as np
a = np.memmap('a.array', dtype='float64', mode='w+', shape=( 5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222
You can define a third array that reads the same file as the first array to be concatenated (here a) in mode r+ (open an existing file for reading and writing; NumPy extends the file when the requested shape needs more bytes than it currently holds), but with the shape of the final array you want after concatenation:
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b
Concatenating along axis=0 does not require passing order='C', because that is already the default order.
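To check that the in-place append along axis 0 gives the same result as numpy.concatenate, here is a small sketch with tiny arrays. The temporary path and the shapes are assumptions chosen so the check runs quickly; b is kept as an ordinary in-memory array for brevity.

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
a_path = os.path.join(tmp, "a.array")

a = np.memmap(a_path, dtype="float64", mode="w+", shape=(5, 4))
a[:] = np.arange(20).reshape(5, 4)
expected_top = np.array(a)  # in-memory copy, used only for the check
a.flush()
del a  # drop the old mapping before reopening the file with a larger shape

b = np.arange(100, 112, dtype="float64").reshape(3, 4)

# Reopen the same file with the final shape; the first 5 rows are a's data.
c = np.memmap(a_path, dtype="float64", mode="r+", shape=(8, 4))
c[5:, :] = b
c.flush()

expected = np.concatenate([expected_top, b], axis=0)
file_size = os.path.getsize(a_path)
```

Because the file is row-major, the new rows are simply the bytes that follow the existing ones, so nothing that was already on disk has to move.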
2) concatenate along axis=1
a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000), order='F') # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222
The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up in the first row of c. You can avoid this by using order='F' (column-major), so that the new columns land after the existing data in the file. The file must already hold column-major data for its existing values to keep their positions, which is why a is created with order='F' as well:
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(5000,4000), order='F')
c[:, 3000:] = b
Now 'a.array' is updated and holds the concatenation result. You can repeat this process to concatenate more arrays, two at a time.
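The axis=1 case can be checked the same way with non-constant values, which makes any ordering mistake visible. This is a minimal sketch: the temporary path, the shapes, and the variable names are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
a_path = os.path.join(tmp, "a.array")

a_values = np.arange(12, dtype="float64").reshape(4, 3)
b_values = np.arange(100, 108, dtype="float64").reshape(4, 2)

# Write a in column-major order so each column is contiguous on disk.
a = np.memmap(a_path, dtype="float64", mode="w+", shape=(4, 3), order="F")
a[:] = a_values
a.flush()
del a

# Reopen with two extra columns; in column-major order they follow a's bytes.
c = np.memmap(a_path, dtype="float64", mode="r+", shape=(4, 5), order="F")
c[:, 3:] = b_values
c.flush()

expected = np.concatenate([a_values, b_values], axis=1)

# Reading the same file with the default row-major order scrambles the layout.
misread = np.memmap(a_path, dtype="float64", mode="r", shape=(4, 5))
```

Anyone who later opens the file has to pass order='F' as well, since the file itself does not record its layout.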