sklearn: calculating accuracy score of k-means on the test data set
I am doing k-means clustering on a set of 30 samples with 2 clusters (I already know there are two classes). I split my data into training and test sets and try to calculate the accuracy score on the test set. But there are two problems: first, I don't know whether computing an accuracy score on a test set even makes sense for k-means clustering. Second, if it does, whether my implementation is right or wrong. Here is what I've tried:
import numpy as np
import pandas as pd
from sklearn import cluster, cross_validation, metrics

# Load the data, keep the labels separately, and drop them from the features.
df_hist = pd.read_csv('video_data.csv')
y = df_hist['label'].values
del df_hist['label']
df_hist.to_csv('video_data1.csv')
X = df_hist.values.astype(np.float)

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.20, random_state=70)

k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X_train)
print(k_means.labels_[:])
print(y_train[:])

score = metrics.accuracy_score(y_test, k_means.predict(X_test))
print('Accuracy:{0:f}'.format(score))

k_means.predict(X_test)
print(k_means.labels_[:])
print(y_test[:])
But when I print the k-means labels for the test set (k_means.predict(X_test) followed by print(k_means.labels_[:])) and the y_test labels (print(y_test[:])) in the last three lines, I get the same labels as when I was fitting X_train, rather than the labels produced for X_test. Any idea what I might be doing wrong here? Is this a valid way to evaluate the performance of k-means at all? Thank you!
Solution 1:[1]
In terms of evaluating accuracy: you should remember that k-means is not a classification tool, so analyzing accuracy is not a very good idea. You can do it, but it is not what k-means is for. K-means is supposed to find a grouping of the data that maximizes between-cluster distances; it does not use your labels to train. Consequently, methods like k-means are usually evaluated with the Rand index and other clustering metrics. If you want to maximize accuracy, you should fit an actual classifier, like kNN, logistic regression, an SVM, etc.
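For illustration, here is a minimal sketch of both options, reusing the X_train/X_test/y_train/y_test split from the question (those variable names are assumed from there; the choice of kNN with n_neighbors=3 is arbitrary):

from sklearn import cluster, metrics
from sklearn.neighbors import KNeighborsClassifier

# Clustering view: score the cluster assignments with a clustering metric
# that does not care how the cluster ids are numbered.
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(X_train)
print(metrics.adjusted_rand_score(y_test, k_means.predict(X_test)))

# Classification view: if accuracy is what you want, fit a real classifier.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, knn.predict(X_test)))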
In terms of the code itself: k_means.predict(X_test) returns the labeling, it does not update the internal labels_ field, so you should do

print(k_means.predict(X_test))

Furthermore, in Python you do not have to (and should not) use [:] to print an array, just do

print(k_means.labels_)
print(y_test)
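Put together, the end of the question's script could then look like this (a sketch, keeping the question's variable names):

# Store the test-set cluster assignments; labels_ only ever holds the
# assignments for the training data passed to fit().
test_pred = k_means.predict(X_test)
print(test_pred)
print(y_test)
print('Accuracy:{0:f}'.format(metrics.accuracy_score(y_test, test_pred)))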
Solution 2:[2]
In unsupervised learning, the cluster ids assigned by an algorithm like k-means may or may not match the labels we assigned ourselves. For example: the data has two classes, spam and not spam; we label spam as 0 and not spam as 1, but after running the clustering algorithm spam ends up in cluster 1 and not spam in cluster 0. In that case the code below will not work: it will report low accuracy even though the algorithm is actually doing well.
score = metrics.accuracy_score(y_test, k_means.predict(X_test))
So instead, we keep track of how many predictions of 0 and of 1 there are for true class 0, do the same for true class 1, and pick the majority cluster for each true class. For example, if for true class 1 there are 90 predictions of cluster 0 and 10 of cluster 1, it means the clustering algorithm is treating true class 1 as cluster 0.
import numpy as np

true_classes = np.asarray(y_test)
pred_classes = k_means.predict(X_test)  # cluster ids predicted for the test set
k = 2  # number of classes / clusters
no_correct = 0

# di[true_class][predicted_cluster] collects one entry per sample
di = {}
for i in range(k):
    di[i] = {}
    for j in range(k):
        di[i][j] = []

for i in range(true_classes.shape[0]):
    di[true_classes[i]][pred_classes[i]].append(1)

# For each true class, take the cluster that received the most of its samples.
for i in range(len(di)):
    temp = -1
    for j in range(len(di[i])):
        temp = max(temp, len(di[i][j]))
        if temp == len(di[i][j]):
            cluster_class = j
    print("class {} named as class {} in clustering algo".format(list(di.keys())[i], cluster_class))
    no_correct = no_correct + temp

print(no_correct / true_classes.shape[0])
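With only two clusters, the same idea can be expressed more compactly: whichever of the two possible cluster-to-class assignments scores higher is the one the clustering found. A small sketch, assuming y_test contains 0/1 labels and k_means is the model fitted in the question:

pred = k_means.predict(X_test)
acc = metrics.accuracy_score(y_test, pred)
# If the cluster ids came out flipped, the complementary labeling matches,
# and its accuracy is simply 1 - acc for binary labels.
print(max(acc, 1 - acc))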
Solution 3:[3]
The metric that you need is the adjusted Rand index, but evaluate the k-means clustering on the whole dataset. It returns a value close to 0 for random labelings and 1.0 for a perfect match. Check the link below:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
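A minimal sketch of that evaluation, assuming X and y are the full feature matrix and label vector built at the top of the question:

from sklearn import cluster, metrics

cluster_labels = cluster.KMeans(n_clusters=2).fit_predict(X)
print(metrics.adjusted_rand_score(y, cluster_labels))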
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | lejlot
Solution 2 | Lokesh Borawar
Solution 3 |