'Visualization of K-Means Clustering of multiple columns
Dataset file : google drive link
Hello Community , I need help regarding how to apply KNN clustering on this use case.
I have a dataset consisting (27884 ROWS, 8933 Columns)
Here's a little preview of a dataset
user_iD | b1 | b2 | b3 | b4 | b5 | b6 | b7 | b8 | b9 | b10 | b11 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 7 | 2 | 3 | 8 | 0 | 4 | 0 | 6 | 0 | 5 |
2 | 7 | 8 | 1 | 2 | 4 | 6 | 5 | 9 | 10 | 3 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 5 | 2 | 3 | 4 | 0 | 6 |
4 | 1 | 7 | 2 | 3 | 8 | 0 | 5 | 0 | 6 | 0 | 4 |
5 | 0 | 4 | 7 | 0 | 6 | 1 | 5 | 3 | 0 | 0 | 2 |
6 | 1 | 0 | 2 | 3 | 0 | 5 | 4 | 0 | 0 | 6 | 7 |
Here the column userid represents: STUDENTS and columns b1-b11: They represent Book Chapters and the sequence of each student that which chapter he/she studied first then second then third and so on. the 0 entry tells that the student did not study that particular chapter.
This is just a small preview of a big dataset. There are a total of 27884 users and 8932 Chapters stated as (b1--b8932)
Here's the complete dataset shape information
I'm Applying KMEANS CLUSTERING. How do I visualize all the clusters using all the columns
As I stated there are 27844 users & 8932 other columns I have achieved by just using user_iD & b1 column only. How do I take all the columns at once?
What I have tried so far
#Build and train the model
from sklearn.cluster import KMeans
model = KMeans(n_clusters=5)
model.fit(df3)
#See the predictions
model.labels_
model.cluster_centers_
#PLot the predictions against the original data set
fig = plt.figure(figsize=(6, 6))
#ax = fig.add_subplot(111)
plt.scatter(df3['user_iD'], df3['b1'],cmap='rainbow',
linewidths=1, alpha=.7,
edgecolor='k'
)
plt.show()
This gives me clustering visualization based on a single column.
Solution 1:[1]
Well, you cannot do it directly if you have more than 3 columns. However, you can apply a Principal Component Analysis to reduce the space in 2 columns and visualize this instead.
pca_num_components = 2
reduced_data = PCA(n_components=pca_num_components).fit_transform(df3.iloc[:,1:12])
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=df3['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | SAL |