Avoiding local minima when using KMeans
I'm using the following code for clustering with KMeans from sklearn.cluster:
from sklearn.cluster import KMeans

num_clusters = 60
# n_init=100 restarts k-means from 100 different seedings and keeps the best run
km = KMeans(n_clusters=num_clusters, init="k-means++", n_init=100)
%time km.fit(newvec)  # newvec is the data matrix being clustered
clusters = km.labels_.tolist()
To avoid local minima I use n_init = 100. What else can be done to avoid falling into a local minimum? I appreciate your help.
Solution 1:[1]
I'm afraid that with the k-means algorithm you cannot entirely avoid local minima; you can only try to minimize the chance of getting stuck in one.
This Cross Validated post has a good discussion on why you cannot escape local minima.
A common hack, and the one you are already applying by setting n_init = 100, is to run k-means multiple times and choose the run that gives the lowest error. The only way to actually guarantee the global minimum would be to check all k^n possible assignments of your n points to k clusters and keep the best one, but that is far too time-consuming to be practical.
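For illustration, setting n_init = 100 is essentially the same as this manual restart loop (a sketch reusing newvec and the 60 clusters from the question; the random_state values are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

best_km, best_inertia = None, np.inf
for seed in range(100):
    # Each run starts from one fresh k-means++ seeding (n_init=1 per run)
    km = KMeans(n_clusters=60, init="k-means++", n_init=1, random_state=seed)
    km.fit(newvec)
    if km.inertia_ < best_inertia:  # inertia_ = within-cluster sum of squares
        best_km, best_inertia = km, km.inertia_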
Solution 2:[2]
Why do you think you need the global minimum?
Any clustering is only a heuristic. There is no such thing as the optimal clustering, because clustering is an exploratory technique.
Also, your data probably is not "exact": you usually have only a sample, measured with limited precision, etc., so "optimality" is overrated.
Compare the best result you got from 10 initializations to the best you got from 100 initializations. Does it make a difference? How much longer would you wait to get the same improvement again?
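A quick way to run that comparison yourself (a sketch; n_clusters=60 and newvec come from the question, and the timing code is just one option):

import time
from sklearn.cluster import KMeans

for n_init in (10, 100):
    start = time.perf_counter()
    km = KMeans(n_clusters=60, init="k-means++", n_init=n_init, random_state=0)
    km.fit(newvec)
    elapsed = time.perf_counter() - start
    # Lower inertia_ is better; weigh the gain against the extra runtime
    print(f"n_init={n_init}: inertia={km.inertia_:.2f}, time={elapsed:.1f}s")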
Solution 3:[3]
Another "hack" - I've had success by randomizing the cluster assignments during iterations, for example (in R) by multiplying the Euclidian distance by
(1 + runif(1,0,1.0)/iteration)
so in the first iterations, the distance may be up to twice as large as it really is, and as iterations go up, the true distance is used. In my examples this has helped me out of local minima!
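Here is a minimal NumPy sketch of the same idea in Python (my translation, not the answer's original R code; noisy_kmeans is an illustrative name, and it draws one noise factor per point-center distance rather than a single runif value):

import numpy as np

def noisy_kmeans(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for it in range(1, n_iters + 1):
        # Euclidean distance from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Inflate distances by up to a factor of 2 early on; the noise fades as it grows
        noise = 1 + rng.uniform(0, 1, size=d.shape) / it
        labels = np.argmin(d * noise, axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers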
Solution 4:[4]
There is no way to completely avoid the chance of getting stuck at a local minimum, but there are two things I can think of that lower the chances of ending up with a far-from-optimal solution:
- Run k-means multiple times and combine the results with some output rule (e.g., choosing the minimum-error solution, or aggregating the cluster solutions). There are many things you can do.
- Use a better initialization algorithm. A simple and quite effective strategy is the one used by k-means++: it initializes each centroid to one of the data points, choosing each subsequent point with probability proportional to its squared distance from the nearest previously chosen centroid. This makes it likely that the initial centroids are spread out and near the optimal ones, and it also significantly decreases the number of iterations k-means needs to converge. A sketch of this seeding follows the list.
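A minimal sketch of that seeding step (kmeans_pp_init is an illustrative name; scikit-learn already does this internally when you pass init="k-means++", as in the question's code):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)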
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | Has QUIT--Anony-Mousse
Solution 3 | Stephen Rauch
Solution 4 | mguthrie45