How to find cut-off height in agglomerative clustering with a predefined number of clusters in sklearn?

I'm running sklearn's hierarchical clustering algorithm with the following code:

AgglomerativeClustering(compute_distances=True, n_clusters=15, linkage='complete', affinity='cosine').fit(X_scaled)

How can I extract the exact height at which the dendrogram has been cut off to create the 15 clusters?



Solution 1:[1]

Try this code with your feature data set X to plot cut height vs. the number of clusters:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, cut_tree

heights = np.arange(0, 20)
n_clusters = np.zeros(len(heights))
linked = linkage(X, metric="euclidean", method="average")

# Count how many clusters remain when the dendrogram is cut at each height
for i, d in enumerate(heights):
    t = cut_tree(linked, height=d)
    n_clusters[i] = len(np.unique(t))

plt.plot(n_clusters, heights, '-o')
plt.grid()
plt.xlabel('k')
plt.ylabel('height')
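To get the exact cut-off rather than reading it off the plot, you can take it straight from the linkage matrix: column 2 of `linked` holds the merge distances in the order the merges happen, and with n samples, exactly n - k merges leave k clusters, so any cut height between the (n-k)-th and (n-k+1)-th merge distance yields k clusters. A minimal sketch (the random `X` below is only a stand-in for your own feature matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Stand-in data; replace with your own feature matrix X
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))

k = 15
n = X.shape[0]
linked = linkage(X, metric="euclidean", method="average")

# Row i of `linked` records the i-th merge; column 2 is its distance.
# After n - k merges exactly k clusters remain, so any height between
# these two consecutive merge distances cuts the tree into k clusters.
low = linked[n - k - 1, 2]   # last merge that still leaves k clusters
high = linked[n - k, 2]      # merge that drops the count to k - 1
cut = (low + high) / 2

labels = cut_tree(linked, height=cut)
assert len(np.unique(labels)) == k
```

Average linkage is monotone, so the merge distances in `linked` are non-decreasing and the midpoint is a valid cut; with tied merge distances (`low == high`) no single height separates k from k-1 clusters.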

Or, try this code with your feature data set X to plot distance threshold vs. the number of clusters:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

distance = np.arange(0, 20, 0.1)
n_clusters = np.zeros(len(distance))

# Fit with a distance threshold instead of a fixed cluster count and
# record how many clusters each threshold produces
for i, d in enumerate(distance):
    # note: recent scikit-learn versions use metric= instead of affinity=
    cluster = AgglomerativeClustering(distance_threshold=d, n_clusters=None, affinity='euclidean', linkage='ward')
    cluster.fit(X)
    n_clusters[i] = cluster.n_clusters_

plt.plot(n_clusters, distance, '-o')
plt.grid()
plt.xlabel('k')
plt.ylabel('distance')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
