Number of distinct clusters in KMeans is less than n_clusters?
I have some food images stored in a single folder. The images are unlabeled and are not sorted into separate folders such as "pasta" or "meat". My current goal is to cluster the images into a number of categories so that I can later assess whether the foods depicted in images of the same cluster taste similar.
To do that, I load the images and process them into a format that can be fed into VGG16 for feature extraction, and then pass the features to KMeans to cluster the images. The code I am using is:
import os
import glob
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from sklearn.cluster import KMeans

path = r'C:\Users\Hi\Documents\folder'
train_dir = os.path.join(path)

# Convolutional base only; predict() returns feature maps, not class scores
model = VGG16(weights='imagenet', include_top=False)

vgg16_feature_list = []
files = glob.glob(r'C:\Users\Hi\Documents\folder\*.jpg')
for img_path in files:  # iterate over the paths themselves, not enumerate(files)
    img = image.load_img(img_path, target_size=(224, 224))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)

    vgg16_feature = model.predict(img_data)
    vgg16_feature_np = np.array(vgg16_feature)
    vgg16_feature_list.append(vgg16_feature_np.flatten())

vgg16_feature_list_np = np.array(vgg16_feature_list)
print(vgg16_feature_list_np.shape)
print(vgg16_feature_np.shape)

kmeans = KMeans(n_clusters=3, random_state=0).fit(vgg16_feature_list_np)
print(kmeans.labels_)
The issue is that I get the following warning:
ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X.
How can I fix that?
Solution 1:
This is one of those situations where, although your code is fine from a programming point of view, it does not produce satisfactory results due to an ML-related issue (the data, the model, or both). That makes it rather difficult to "debug" (in quotes, because this is not the typical debugging procedure: the code itself runs fine).
At first instance, the situation seems to imply that there is not enough diversity in your features to justify 3 different clusters. And, provided that we remain in a K-means context, there is not much you can do beyond the few options below (refer to the scikit-learn documentation for details of the respective parameters):

- Increase the number of iterations, `max_iter` (default 300)
- Increase the number of different centroid initializations, `n_init` (default 10)
- Change the `init` argument to `random` (the default is `k-means++`) or, even better, provide a 3-element array with one sample from each of your targeted clusters (if you already have an idea which these clusters may actually be in your data)
- Run the model with different `random_state` values
- Combine the above (a sketch combining these options follows below)
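For illustration, here is a minimal sketch that combines these adjustments in the KMeans call from the question; the specific values are illustrative assumptions, not tuned recommendations:

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,
    init='random',    # the default is 'k-means++'
    n_init=50,        # more centroid initializations (default 10)
    max_iter=1000,    # more iterations per run (default 300)
    random_state=42,  # try several different seeds
).fit(vgg16_feature_list_np)
print(kmeans.labels_)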
If none of the above works, it would very likely mean that K-means is simply not applicable here, and you may have to look for alternative approaches (which are out of the scope of this thread). Truth is, as was correctly pointed out in the comments, K-means does not usually work that well with data of such high dimensionality.
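As an illustration of that last point (going beyond the scope of the original answer): a common way to mitigate high dimensionality before K-means is to project the features down first, for example with PCA. A minimal sketch, assuming the vgg16_feature_list_np array from the question:

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Reduce the flattened VGG16 features before clustering.
# n_components=50 is an illustrative assumption; it must not
# exceed the number of images (rows) or features (columns).
pca = PCA(n_components=50, random_state=0)
features_reduced = pca.fit_transform(vgg16_feature_list_np)

kmeans = KMeans(n_clusters=3, random_state=0).fit(features_reduced)
print(kmeans.labels_)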
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow