Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to merge two large numpy arrays if slicing doesn't resolve memory error?

I have two numpy arrays container1 and container2 where container1.shape = (900,4000) and container2.shape = (5000,4000). Merging them using vstack results in a MemoryError. After searching through the old questions posted here, I tried to merge them using slicing like this:

mergedContainer = numpy.vstack((container1, container2[:1000]))
mergedContainer = numpy.vstack((mergedContainer, container2[1000:2500]))
mergedContainer = numpy.vstack((mergedContainer, container2[2500:3000]))

but after this, even if I do:

mergedContainer = numpy.vstack((mergedContainer, container2[3000:3100]))

it results in a MemoryError.

I am using Python 3.4.3 (32-bit) and would like to resolve this without switching to 64-bit.


1 Reply


Every time you call np.vstack, NumPy has to allocate space for a brand-new array. So if we say that one row requires 1 unit of memory, then

np.vstack([container1, container2])

requires an additional 900+5000 units of memory. Moreover, before the assignment occurs, Python needs to hold space for the old mergedContainer (if it exists) as well as for the new mergedContainer. So building mergedContainer iteratively with slices actually requires more memory than building it with a single call to np.vstack.

Building it iteratively:

| total | mergedContainer | container1 | container2 | temp | statement                                                             |
|-------+-----------------+------------+------------+------+-----------------------------------------------------------------------|
|  7800 |            1900 |        900 |       5000 |    0 | mergedContainer = np.vstack((container1, container2[:1000]))          |
| 11200 |            3400 |        900 |       5000 | 1900 | mergedContainer = np.vstack((mergedContainer, container2[1000:2500])) |
| 13200 |            3900 |        900 |       5000 | 3400 | mergedContainer = np.vstack((mergedContainer, container2[2500:3000])) |

Building it from a single call to np.vstack:

| total | mergedContainer | container1 | container2 | temp | statement                                             |
|-------+-----------------+------------+------------+------+-------------------------------------------------------|
| 11800 |            5900 |        900 |       5000 |    0 | mergedContainer = np.vstack((container1, container2)) |
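The totals in these tables can be reproduced with a little arithmetic. The following is only a sketch using the same "1 row = 1 unit" convention; the helper names peak_units_iterative and peak_units_single are made up for this illustration and are not part of NumPy.

```python
def peak_units_iterative(rows1, rows2, cut_points):
    """Peak memory (in row units) when stacking container2 slice by slice.

    During each np.vstack call the old merged array (temp) and the new
    merged array coexist alongside container1 and container2.
    """
    merged = 0
    peak = 0
    for stop in cut_points:
        temp = merged           # old result, still alive during the call
        merged = rows1 + stop   # new result: container1 + container2[:stop]
        total = merged + temp + rows1 + rows2
        peak = max(peak, total)
    return peak


def peak_units_single(rows1, rows2):
    """Peak memory (in row units) for one np.vstack call on both arrays."""
    return (rows1 + rows2) + rows1 + rows2


print(peak_units_iterative(900, 5000, [1000, 2500, 3000]))  # 13200
print(peak_units_single(900, 5000))                         # 11800
```

The iterative peak keeps growing with every further slice, because each step must hold both the previous and the next version of mergedContainer.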

We can do even better, however. Instead of calling np.vstack repeatedly, allocate all the space that is needed once from the very beginning and write the contents of both container1 and container2 into it. In other words, avoid allocating two disparate arrays container1 and container2 if you know eventually you want to merge them.

container = np.empty((5900, 4000))
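To put a rough number on that single allocation, here is a small sketch. It assumes the default dtype of np.empty (float64, 8 bytes per element); the question does not say which dtype the real data uses, so the actual figure may differ.

```python
import numpy as np

rows, cols = 5900, 4000
itemsize = np.dtype(np.float64).itemsize  # np.empty defaults to float64
print(rows * cols * itemsize)             # 188800000 bytes, about 189 MB

# The same figure derived from a small real array, scaled up 100x in rows:
small = np.empty((59, cols))
print(small.nbytes * 100)                 # 188800000
```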

Note that basic slices such as container[:900] always return views, and views require essentially no additional memory. So you could define container1 and container2 like this:

container1 = container[:900]   
container2 = container[900:]   

and then assign values in place. This modifies container:

container1[:] = ...              
container2[:] = ...

Thus your memory requirement would stay around 5900 units.
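One way to convince yourself that the slices are views rather than copies is to inspect them with np.shares_memory and the .base attribute. This is a minimal sketch with a smaller shape than in the question, chosen only to keep the demonstration cheap.

```python
import numpy as np

container = np.zeros((59, 40))
container1 = container[:9]
container2 = container[9:]

# Basic slices are views: they share the parent's buffer.
print(container1.base is container)             # True
print(np.shares_memory(container2, container))  # True

# In-place assignment writes through to the parent array.
container1[:] = 1.0
container2[:] = 2.0
print(container[0, 0], container[-1, 0])        # 1.0 2.0

# Rebinding the name (no [:]) creates a separate array instead.
container1 = np.ones((9, 40))
print(np.shares_memory(container1, container))  # False
```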


For example,

import numpy as np
np.random.seed(2015)

container = np.empty((5, 4), dtype='int')
container1 = container[:2]   
container2 = container[2:]   
container1[:] = np.random.randint(10, size=(2,4))
container2[:] = np.random.randint(1000, size=(3,4))
print(container)

yields

[[  2   2   9   6]
 [  8   5   7   8]
 [112  70 487 124]
 [859   8 275 936]
 [317 134 393 909]]

while only requiring space for one array of shape (5, 4), plus temporarily used space for the random arrays.

Thus, you wouldn't have to change very much in your code to save memory. Just set it up with

container = np.empty((5900, 4000))
container1 = container[:900]   
container2 = container[900:]   

and then use

container1[:] = ...

instead of

container1 = ...

to assign values in-place. (Or, of course, you could just write directly into container.)
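If the rows arrive in several pieces (for instance, one file at a time), the same idea extends to filling the preallocated array chunk by chunk. The sketch below is only an illustration: iter_chunks is a hypothetical stand-in for however the data is actually produced, and the tiny shapes are chosen for readability.

```python
import numpy as np


def iter_chunks():
    """Hypothetical data source: yields 2-D blocks with 4 columns each."""
    yield np.full((2, 4), 1)
    yield np.full((3, 4), 2)
    yield np.full((1, 4), 3)


total_rows, n_cols = 6, 4  # must be known (or counted) up front
container = np.empty((total_rows, n_cols), dtype=int)

row = 0
for chunk in iter_chunks():
    n = chunk.shape[0]
    container[row:row + n] = chunk  # copy into place; chunk can then be freed
    row += n

assert row == total_rows            # every row of container was filled
print(container[:, 0])              # [1 1 2 2 2 3]
```

At any moment only the full container and one chunk are alive, so the peak stays close to the size of the final array.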

