I have a huge dataset on which I wish to run PCA. I am limited by RAM and by the computational efficiency of PCA.
Therefore, I shifted to using Iterative PCA.
Dataset size: (140000, 3504)
The documentation states that "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."
This is really good, but I am unsure how to take advantage of it.
I tried loading a single memmap, hoping the algorithm would access it in chunks, but my RAM blew up.
My code below ends up using a lot of RAM:
import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)
When I say "my RAM blew", the Traceback I see is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError
How can I improve this without compromising on accuracy by reducing the batch size?
My ideas to diagnose:
I looked at the sklearn source code, and in the fit() function I can see the following. This makes sense to me, but I am still unsure about what is wrong in my case.
for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self
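Just to check my understanding of what fit() iterates over: as far as I can tell, gen_batches only yields slice objects, so each X[batch] should be a small window of the array.

from sklearn.utils import gen_batches

# gen_batches yields slices covering 0..n_samples in steps of batch_size
print(list(gen_batches(10, 4)))
# [slice(0, 4, None), slice(4, 8, None), slice(8, 10, None)]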
Edit:
Worst case scenario, I will have to write my own code for incremental PCA which batch-processes by reading and closing .npy files. But that would defeat the purpose of taking advantage of the mechanism that is already there. A rough sketch of that fallback is below.
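This is only a sketch of what I mean, assuming I open the memmap read-only and drive partial_fit myself; the batch size is a placeholder and would need tuning.

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 140000, 3504
batch_size = 1000  # placeholder

# Open the existing file read-only instead of mode='w+'.
X = np.memmap('my_array.mmap', dtype=np.float16, mode='r',
              shape=(n_samples, n_features))

ipca = IncrementalPCA()
for start in range(0, n_samples, batch_size):
    # Pull one chunk into RAM, upcast it, and feed it to partial_fit;
    # only this chunk should be resident at any time.
    chunk = np.asarray(X[start:start + batch_size], dtype=np.float64)
    ipca.partial_fit(chunk)

# transform() could then also be applied chunk by chunk.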
Edit2:
If I could somehow delete a batch of the memmap file once it has been processed, that would make much more sense. A rough sketch of what I am picturing is below.
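What I am picturing, as a sketch only (chunk size is a placeholder; the offset arithmetic assumes the raw, header-less C-ordered float16 file written by np.memmap above):

import numpy as np

n_samples, n_features = 140000, 3504
dtype = np.dtype(np.float16)
chunk_rows = 1000  # placeholder
row_bytes = n_features * dtype.itemsize

for start in range(0, n_samples, chunk_rows):
    rows = min(chunk_rows, n_samples - start)
    # Map only the rows of this chunk; the rest of the file is never touched.
    chunk = np.memmap('my_array.mmap', dtype=dtype, mode='r',
                      offset=start * row_bytes, shape=(rows, n_features))
    # ... process the chunk, e.g. ipca.partial_fit(np.asarray(chunk, np.float64)) ...
    del chunk  # release this mapping so the OS can reclaim its pages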
Edit3:
Ideally, if IncrementalPCA.fit() is only ever working on batches, it should not crash my RAM. I am posting the whole code, just to make sure I am not making a mistake in flushing the memmap to disk beforehand.
import numpy as np
import pywt
from sklearn.decomposition import IncrementalPCA

temp_train_data = X_train[1000:]
temp_labels = y[1000:]
# out must be a memmap (not np.empty) for flush() to exist; the filename is a placeholder
out = np.memmap('out.mmap', dtype=np.int64, mode='w+', shape=(200001, 3504))
for index, row in enumerate(temp_train_data):
    actual_index = index + 1000
    data = X_train[actual_index - 1000:actual_index + 1].ravel()
    __, cd_i = pywt.dwt(data, 'haar')
    out[index] = cd_i
out.flush()
pca_obj = IncrementalPCA()
clf = pca_obj.fit(out)
Surprisingly, I note that out.flush() does not free my memory. Is there a way to use del out to free my memory completely, and then pass just a pointer to the file to IncrementalPCA.fit()?
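As a sketch of what I mean, continuing the code above (whether fit() would then still materialise the whole file in RAM is exactly what I am unsure about):

out.flush()
del out  # drop the writable mapping from this process entirely

# Re-open the same file read-only and hand that object to fit().
out_ro = np.memmap('out.mmap', dtype=np.int64, mode='r', shape=(200001, 3504))
clf = IncrementalPCA().fit(out_ro)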