If you want to stick with SVC as much as possible and train on the full dataset, you can use ensembles of SVCs that are trained on subsets of the data to reduce the number of records per classifier (which apparently has quadratic influence on complexity). Scikit supports that with the BaggingClassifier
wrapper. That should give you similar (if not better) accuracy compared to a single classifier, with much less training time. The training of the individual classifiers can also be set to run in parallel using the n_jobs
parameter.
Alternatively, I would also consider using a Random Forest classifier - it supports multi-class classification natively, it is fast and gives pretty good probability estimates when min_samples_leaf
is set appropriately.
I did a quick tests on the iris dataset blown up 100 times with an ensemble of 10 SVCs, each one trained on 10% of the data. It is more than 10 times faster than a single classifier. These are the numbers I got on my laptop:
Single SVC: 45s
Ensemble SVC: 3s
Random Forest Classifier: 0.5s
See below the code that I used to produce the numbers:
import time
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
iris = datasets.load_iris()
X, y = iris.data, iris.target
X = np.repeat(X, 100, axis=0)
y = np.repeat(y, 100, axis=0)
start = time.time()
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
clf.fit(X, y)
end = time.time()
print "Single SVC", end - start, clf.score(X,y)
proba = clf.predict_proba(X)
n_estimators = 10
start = time.time()
clf = OneVsRestClassifier(BaggingClassifier(SVC(kernel='linear', probability=True, class_weight='auto'), max_samples=1.0 / n_estimators, n_estimators=n_estimators))
clf.fit(X, y)
end = time.time()
print "Bagging SVC", end - start, clf.score(X,y)
proba = clf.predict_proba(X)
start = time.time()
clf = RandomForestClassifier(min_samples_leaf=20)
clf.fit(X, y)
end = time.time()
print "Random Forest", end - start, clf.score(X,y)
proba = clf.predict_proba(X)
If you want to make sure that each record is used only once for training in the BaggingClassifier
, you can set the bootstrap
parameter to False.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…