I got LinearSVC working on my training set and test set, which I load with the load_files method. Now I am trying to get it working in a multiprocessor environment.
How can I get multiprocessing to work with LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with the data types of scikit-learn yet.
I am also thinking about splitting the samples into multiple arrays, but I am not familiar with NumPy arrays and scikit-learn data structures.
Splitting would make it easier to feed the data into multiprocessing.Pool(): split the samples into chunks, train on each chunk, and combine the trained results later. Would that work?
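To make the predict() half of this concrete, here is a rough sketch of what I have in mind. It only covers prediction, because I do not know of a built-in way to merge separately trained LinearSVC models. The helper names (split_rows, predict_chunk, parallel_predict) are my own, and the synthetic data from make_classification is just a stand-in for my real samples.

```python
import numpy as np
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC


def split_rows(n_rows, n_chunks):
    """Return index arrays splitting range(n_rows) into n_chunks nearly equal parts."""
    return np.array_split(np.arange(n_rows), n_chunks)


def predict_chunk(args):
    """Worker: predict labels for one chunk of rows with an already fitted model."""
    clf, X_chunk = args
    return clf.predict(X_chunk)


def parallel_predict(clf, X, n_workers=4):
    """Split X row-wise, predict each part in a worker, and rejoin in the original order."""
    chunks = [X[idx] for idx in split_rows(X.shape[0], n_workers)]
    with Pool(n_workers) as pool:
        # pool.map keeps the input order, so the parts can simply be concatenated
        parts = pool.map(predict_chunk, [(clf, chunk) for chunk in chunks])
    return np.concatenate(parts)


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): 1500 rows for training, 500 for testing
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    clf = LinearSVC().fit(X[:1500], y[:1500])

    y_parallel = parallel_predict(clf, X[1500:], n_workers=4)
    # Compare against the ordinary single-process prediction
    print(np.array_equal(y_parallel, clf.predict(X[1500:])))
```

Each row is predicted independently of the others, so splitting the test set should not change the labels. The fitted model is pickled and sent to every worker, which is cheap for a linear model.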
EDIT:
Here is my scenario:
Let's say we have 1 million files in the training sample set. To distribute the TfidfVectorizer processing across several processors, we have to split those samples (in my case there are only two categories, so let's say 500,000 training samples each). My server has 24 cores and 48 GB of RAM, so I want to split each topic into chunks, 1,000,000 / 24 samples per core, and run TfidfVectorizer on them. I would do the same for the test sample set, as well as for SVC.fit() and decide(). Does that make sense?
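Here is a rough sketch of the chunked vectorizing step. One thing I am unsure about: a separate TfidfVectorizer per chunk would learn its own vocabulary and IDF weights, so the chunks would not line up. In this sketch I therefore swapped in HashingVectorizer (which is stateless) inside the workers, and apply TfidfTransformer once on the stacked result. The helper names (chunk_list, hash_chunk, parallel_tfidf), the n_features value, and the tiny example corpus are my own assumptions.

```python
from multiprocessing import Pool

from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer


def chunk_list(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def hash_chunk(texts):
    """Worker: turn one chunk of raw texts into a sparse term-count matrix."""
    # Stateless hashing: every worker maps the same word to the same column
    vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
    return vectorizer.transform(texts)


def parallel_tfidf(texts, n_workers=4):
    """Hash chunks in parallel, then apply TF-IDF weighting once on the whole matrix."""
    chunks = [c for c in chunk_list(texts, n_workers) if c]  # skip empty chunks
    with Pool(n_workers) as pool:
        counts = vstack(pool.map(hash_chunk, chunks))
    # IDF depends on the whole corpus, so it is fitted once on the stacked counts
    tfidf = TfidfTransformer()
    return tfidf.fit_transform(counts), tfidf


if __name__ == "__main__":
    # Tiny stand-in corpus (assumption); my real input would be the file contents
    docs = ["the cat sat", "the dog barked", "a cat and a dog", "birds sing"] * 25
    X_train, tfidf = parallel_tfidf(docs, n_workers=4)
    print(X_train.shape)
```

For the test set, the same hash_chunk workers could be reused, followed by tfidf.transform() with the transformer fitted on the training data, so both sets share the same IDF weights.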
Thanks.
PS: Please do not close this.