I am running k-means clustering on a set of 30 samples with 2 clusters (I already know there are two classes). I split my data into training and test sets and try to compute the accuracy score on the test set. There are two problems. First, I don't know whether computing an accuracy score on a test set is even valid for k-means clustering. Second, if it is valid, I don't know whether my implementation is correct. Here is what I've tried:
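import pandas as pd
from sklearn import cluster, metrics
from sklearn.model_selection import train_test_split

# Load the data and separate the known class labels from the features
df_hist = pd.read_csv('video_data.csv')
y = df_hist['label'].values
del df_hist['label']
df_hist.to_csv('video_data1.csv')
X = df_hist.values.astype(float)
# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=70)
# Fit k-means with 2 clusters on the training set only
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X_train)
print(k_means.labels_[:])
print(y_train[:])
# Compare the predicted cluster ids for the test set against the true labels
score = metrics.accuracy_score(y_test, k_means.predict(X_test))
print('Accuracy:{0:f}'.format(score))
k_means.predict(X_test)
print(k_means.labels_[:])
print(y_test[:])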
But when I print the k-means labels for the test set (k_means.predict(X_test) followed by print(k_means.labels_[:])) and the y_test labels (print(y_test[:])) in the last three lines, I get the same labels as when I fitted on X_train, rather than the labels produced for X_test. Any idea what I might be doing wrong here? Is this a valid way to evaluate the performance of k-means at all?
Thank you!
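Two points may help frame the question. In scikit-learn, `labels_` stores the cluster assignments of the data passed to `fit`, while `predict` returns a new array and leaves `labels_` unchanged, so the test-set assignments have to be taken from the return value of `predict`. Also, cluster ids are arbitrary (cluster 0 need not correspond to class 0), so a raw accuracy score can be misleading unless cluster ids are first mapped to class labels. The following is a minimal pure-Python sketch of one such mapping, a majority vote learned on the training set. The helper names `learn_cluster_to_class` and `mapped_accuracy` and the toy values are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def learn_cluster_to_class(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    members = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid][label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}


def mapped_accuracy(mapping, cluster_ids, true_labels):
    """Fraction of samples whose mapped cluster id equals the true label."""
    hits = sum(mapping.get(cid) == label
               for cid, label in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# Hypothetical toy values standing in for k_means.labels_ and y_train
train_clusters = [1, 1, 0, 0, 1, 0]
train_labels = [0, 0, 1, 1, 0, 1]
# Hypothetical toy values standing in for k_means.predict(X_test) and y_test
test_clusters = [1, 0, 0, 1]
test_labels = [0, 1, 0, 0]

# Learn the mapping on the training set only, then score the test set
mapping = learn_cluster_to_class(train_clusters, train_labels)
raw_accuracy = sum(c == t for c, t in zip(test_clusters, test_labels)) / len(test_labels)
accuracy = mapped_accuracy(mapping, test_clusters, test_labels)
print(mapping, raw_accuracy, accuracy)
```

In this toy case the cluster ids happen to be swapped relative to the class labels, so the raw score and the mapped score differ sharply. With the code from the question, the mapping would be learned from `k_means.labels_` and `y_train`, and then applied to the array returned by `k_means.predict(X_test)`.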