K-means clustering labels in Python

I have a dataset with 7 labels in the target variable.

X = data.drop('target', axis=1)
Y = data['target']
Y.unique()

array(['Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Insufficient_Weight', 'Obesity_Type_II', 'Obesity_Type_III'], dtype=object)

import numpy as np
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7, init="k-means++", random_state=300)
km.fit_predict(X)
np.unique(km.labels_)

array([0, 1, 2, 3, 4, 5, 6])

After running the K-means clustering algorithm with 7 clusters, the resulting clusters are labeled 0, 1, 2, 3, 4, 5, 6. But how do I know which real label corresponds to which predicted label?

In other words, I want to know how to assign the original label names to the predicted cluster labels, so that the two can be compared, for example to count how many values were clustered correctly (accuracy).



Solution 1:[1]

Since we don't know how you chose the initial clusters ('Normal_Weight', 'Overweight_Level_I', etc.), knowing which predicted cluster corresponds to which initial one requires some qualitative assessment and domain knowledge. You would need to "explain" each new cluster in order to associate it with an initial one. One way to do so is the "one-vs-other" approach: check how the observations within a cluster look compared to the observations outside that cluster. Example: if new cluster 1 contains only observations with a low weight, that is a good indication that cluster 1 may represent your initial cluster "Insufficient_Weight".
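The "one-vs-other" comparison can be sketched roughly as follows. This is only an illustration under assumptions: the toy data, the column names "Weight" and "Height", the use of 3 clusters instead of 7, and the helper one_vs_rest_profile are all made up for the example and are not taken from the question. The cross-tabulation at the end is an extra step that is only possible because the original labels happen to be available.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for the asker's X and Y
# (three well-separated weight groups instead of the real seven classes).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Weight": np.concatenate([
        rng.normal(50, 2, 30),
        rng.normal(80, 2, 30),
        rng.normal(120, 2, 30),
    ]),
    "Height": rng.normal(1.70, 0.05, 90),
})
Y = pd.Series(
    np.repeat(["Insufficient_Weight", "Normal_Weight", "Obesity_Type_I"], 30),
    name="target",
)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=300)
clusters = km.fit_predict(X)


def one_vs_rest_profile(features, cluster_ids):
    """For each cluster: feature means inside the cluster minus means outside it."""
    rows = {}
    for c in np.unique(cluster_ids):
        inside = features[cluster_ids == c].mean()
        outside = features[cluster_ids != c].mean()
        rows[c] = inside - outside
    return pd.DataFrame(rows).T.rename_axis("cluster")


# Large negative "Weight" values hint at a low-weight cluster, and so on.
profile = one_vs_rest_profile(X, clusters)
print(profile)

# Optional: count how the original labels are distributed over the clusters.
table = pd.crosstab(pd.Series(clusters, name="cluster"), Y)
print(table)
```

A row of profile with a strongly negative Weight difference would suggest a low-weight cluster, which is the kind of evidence needed to associate it with "Insufficient_Weight". On real data the clusters will rarely line up this cleanly, so the table should be read as a description, not as proof of a one-to-one match.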

Solution 2:[2]

Clustering is an unsupervised machine learning method. It is not meant to be used to calculate accuracy against given labels. Clustering is mainly used for data exploration, to find patterns in the data that are not immediately apparent to us humans (also due to the size of some datasets).

With the data you have, you should train a classification algorithm to classify your observations; then you can calculate accuracy.
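A minimal sketch of that supervised route, under assumptions: the toy data and column names below are invented for the example, and RandomForestClassifier is just one possible choice of classifier, not something prescribed by the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the asker's X and Y.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Weight": np.concatenate([
        rng.normal(50, 2, 30),
        rng.normal(80, 2, 30),
        rng.normal(120, 2, 30),
    ]),
    "Height": rng.normal(1.70, 0.05, 90),
})
Y = pd.Series(
    np.repeat(["Insufficient_Weight", "Normal_Weight", "Obesity_Type_I"], 30),
    name="target",
)

# Hold out part of the data so accuracy is measured on unseen observations.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=300
)

clf = RandomForestClassifier(n_estimators=100, random_state=300)
clf.fit(X_train, y_train)

# Predictions are already the original label names, so no mapping is needed.
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.2f}")
```

Because the classifier is trained on the original labels, its predictions use the same label names, and accuracy can be computed directly.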

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 savoga
Solution 2 Ethan Van den Bleeken