'Node indexing in hierarchical clustering dendrograms
I'm looking to annotate a hierarchical clustering dendrogram, but I have some trouble associating the node indices produced by scipy.cluster.hierarchy.dendrogram
when plotting, to the node indices in the original linkage matrix (e.g. produced with scipy.cluster.hierarchy.linkage
).
For instance, say we have the following example (adapted from this SO question),
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
%matplotlib inline
# generate two clusters: a with 10 points, b with 5:
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')
# make distances between pairs of children uniform
# (re-scales the horizontal (distance) axis when plotting)
Z[:,2] = np.arange(Z.shape[0])+1
def plot_dendrogram(linkage_matrix, **kwargs):
ddata = dendrogram(linkage_matrix, **kwargs)
idx = 0
for i, d, c in zip(ddata['icoord'], ddata['dcoord'],
ddata['color_list']):
x = 0.5 * sum(i[1:3])
y = d[1]
plt.plot(y, x, 'o', c=c)
plt.annotate("%.3g" % idx, (y, x), xytext=(15, 5),
textcoords='offset points',
va='top', ha='center')
idx += 1
plot_dendrogram(Z, labels=np.arange(X.shape[0]),
truncate_mode='level', show_leaf_counts=False,
orientation='left')
which produces the following dendrogram:
The original X
matrix has (X.shape[0] == 15
) samples, and the tick labels on the vertical axis corresponds to the sample id for each tree leaf. The number at each node is the id of that node as returned by the dendrogram
function. Now if we look at the original linkage matrix, the 1st two columns give the children of each tree node,
print(Z[:,:2].astype('int'))
[[ 0 3]
[ 1 8]
[ 6 16]
[ 2 5]
...
[22 24]
[23 25]
[26 27]]
For instance, the node 0
in the linkage matrix has for children leaves [0, 3]
, but on the dendrogram above it is labeled as number 9
. Similarly the node 1
, is labeled as number 4
, etc.
I was wondering what would be the simplest way of finding the correspondence between these 2 indices? I looked at the dendrogram
function but didn't see any simple way of doing that (particularly if we truncate the dendrogram to some level (with e.g. truncate_mode='level', p=2
)...
Note: I'm actually using a linkage matrix given by sklearn.cluster.AgglomerativeClustering
but that doesn't really matter for this question (as illustrated in this github issue).
Note2: alternatively if there is a way to compute the list of leaves for every dendrogram node that would also solve my problem.
Solution 1:[1]
What I can deduce from the posted material are things I expect you've already noticed:
- The linkage matrix nodes are numbered in order of the clustering. The original nodes are 0-14. [0 3] becomes node 15, [1 8] is node 16, [6 [1 8]] is node 17, etc.
- The dendogram nodes are numbered from the root, depth-first, upper (right) branch before left (lower), with labels N-1 down to 0.
- Right and left are determined by the columns in the linkage matrix.
The linkage matrix has 2N+1 rows: N+1 original nodes and N clusterings. To reconstruct the dendogram labels, start at the last row of the matrix, [26 27]. This gets label N-1, or 13. Recur on the right node (column 1), then the left (column 0). Decrement the label value each time you assign it.
Is that clear enough to follow?
Recursion depth is a problem for you? The tree depth shouldn't be much more than log2(N) for N nodes, or about 22-25 for a million nodes. Your run-time stack can't handle that? Mine defaults to one thousand.
In any case, it's not that hard to convert to iteration. Create a stack of unresolved nodes, starting with the root node.
stack = [root]
while stack is not empty:
node = stack.pop()
stack.push (node.right)
stack.push (node.left)
process node # assign node number, etc.
This makes it even easier to maintain the node counter, as it's now a local variable. Also note that this is a basic graph-search algorithm: "While we have nodes, grab the next one on the list, push its neighbors onto the list (but check to see whether each one is already handled), and process the current node."
For depth-first, use a stack; for breadth-first, use a queue.
Solution 2:[2]
Another solution I have found is to compute all the children for a given node from the dendogram figure (code available in the _DendrogramChildren
class here¹) using the geometric coordinates on the figure. Then because each node has a unique list of children, we can associate the node index on the figure with that of the linkage matrix.
However this solution is not satisfactory (uses coordinates of the nodes on the figure), does not scale well and would not work with a truncated dendogram. The solution proposed by @Prune above is more appropriate.
¹ not re-posting it here as it's too long
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | joanis |