Your program is probably failing because it tries to load the entire dataset into RAM. 4 bytes per float32 × 1,000,000 rows × 1,000 columns is about 3.7 GiB, which is a problem on a machine with only 4 GiB of RAM. To check that this is really the problem, try allocating an array of that size on its own:
>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)
If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
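For reference, the 3.7 GiB figure is easy to reproduce in the interpreter (just a sanity check on the arithmetic, not part of the fix):
>>> import numpy as np
>>> round(np.dtype(np.float32).itemsize * 1000000 * 1000 / 2.0**30, 1)  # bytes per element × number of elements, in GiB
3.7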
With h5py datasets we simply have to avoid passing the entire dataset to our methods, and pass slices of it instead, one at a time.
Since I don't have your data, let me start by creating a random dataset of the same size:
import h5py
import numpy as np

h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)

# Fill the dataset 1000 rows at a time so we never hold more than
# one small block of random numbers in memory.
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)

h5.close()
It creates a nice 3.8 GiB file.
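The reason this approach scales is that indexing an h5py dataset reads only the requested slice from disk. A quick check on the file we just created (an illustrative aside, assuming the same file name as above):
>>> import h5py
>>> h5 = h5py.File('rand-1Mx1K.h5', 'r')
>>> chunk = h5['data'][:1000]    # only the first 1000 rows are read into RAM
>>> chunk.shape, chunk.nbytes    # roughly 4 MB rather than ~4 GB
((1000, 1000), 4000000)
>>> h5.close()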
Now, if we are on Linux, we can limit how much memory is available to the program:
$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152
Now if we try to run your code, we'll get a MemoryError. (Press Ctrl-D to quit the new bash session and lift the limit later.)
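If you prefer to set the cap from inside Python rather than the shell, the standard resource module can do something similar. This is a sketch, Linux/Unix only, and it uses RLIMIT_AS (the address-space limit) rather than the -m limit above:
import resource

# Keep the current hard limit, but lower the soft limit to 2 GiB of address
# space; allocations beyond that will fail with MemoryError in this process.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))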
Let's try to solve the problem. We'll create an IncrementalPCA object, and will call its .partial_fit() method many times, providing a different slice of the dataset each time.
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']   # it's OK, the dataset is not fetched into memory yet

n = data.shape[0]   # how many rows we have in the dataset
chunk_size = 1000   # how many rows we feed to IPCA at a time; should divide n evenly

ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
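Once the fit is done, the same chunking idea works for the projection step. Continuing from the loop above, a sketch that assumes the reduced output (1,000,000 × 10 values, a few tens of MB) fits comfortably in RAM:
# Project the data chunk by chunk and stack the much smaller results.
reduced = np.vstack([
    ipca.transform(data[i*chunk_size : (i+1)*chunk_size])
    for i in range(0, n // chunk_size)
])
# reduced.shape == (1000000, 10)
h5.close()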