I have 3 numpy arrays and need to form the cartesian product between them. Dimensions of the arrays are not fixed, so they can take different values, one example could be A=(10000, 50), B=(40, 50), C=(10000,50).
Then, I perform some processing (like a+b-c) Below is the function that I am using for the product.
def cartesian_2d(arrays, out=None):
arrays = [np.asarray(x) for x in arrays]
dtype = arrays[0].dtype
n = np.prod([x.shape[0] for x in arrays])
if out is None:
out = np.empty([n, len(arrays), arrays[0].shape[1]], dtype=dtype)
m = n // arrays[0].shape[0]
out[:, 0] = np.repeat(arrays[0], m, axis=0)
if arrays[1:]:
cartesian_2d(arrays[1:], out=out[0:m, 1:, :])
for j in range(1, arrays[0].shape[0]):
out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
return out
a = [[ 0, -0.02], [1, -0.15]]
b = [[0, 0.03]]
result = cartesian_2d([a,b,a])
// array([[[ 0. , -0.02],
[ 0. , 0.03],
[ 0. , -0.02]],
[[ 0. , -0.02],
[ 0. , 0.03],
[ 1. , -0.15]],
[[ 1. , -0.15],
[ 0. , 0.03],
[ 0. , -0.02]],
[[ 1. , -0.15],
[ 0. , 0.03],
[ 1. , -0.15]]])
The output is the same as with itertools.product
. However, I am using my custom function to take advantage of numpy vectorized operations, which is working fine compared to itertools.product in my case.
After this, I do
result[:, 0, :] + result[:, 1, :] - result[:, 2, :]
//array([[ 0. , 0.03],
[-1. , 0.16],
[ 1. , -0.1 ],
[ 0. , 0.03]])
So this is the final expected result.
The function works as expected as long as my array fits in memory. But my usecase requires me to work with huge data and I get a MemoryError at the line np.empty()
since it is unable to allocate the memory required.
I am working with circa 20GB data at the moment and this might increase in future.
These arrays represent vectors and will have to be stored in float
, so I cannot use int
. Also, they are dense arrays, so using sparse
is not an option.
I will be using these arrays for further processing and ideally I would not like to store them in files at this stage. So memmap
/ h5py
format may not help, although I am not sure of this.
If there are other ways to form this product, that would be okay too.
As I am sure there are applications with way larger datasets than this, I hope someone has encountered such issues before and would like to know how to handle this issue. Please help.
See Question&Answers more detail:
os