I have been confused by the scipy.cluster.hierarchy module for a while, and some of that confusion remains.
For example, we have the following dendrogram:
My question is: how can I extract the coloured subtrees (each one represents a cluster) in a convenient format, say SIF?
Now the code to get the plot above is:
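import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
X = np.random.randn(100, 2)  # 100 random 2-D observations
d = pdist(X)  # condensed pairwise distance vector
Z = sch.linkage(d, method='complete')
P = sch.dendrogram(Z)
plt.savefig('plot_dendrogram.png')
T = sch.fcluster(Z, 0.5*d.max(), 'distance')  # flat cluster label per observation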
#array([4, 5, 3, 2, 2, 3, 5, 2, 2, 5, 2, 2, 2, 3, 2, 3, 2, 5, 4, 5, 2, 5, 2,
# 3, 3, 3, 1, 3, 4, 2, 2, 4, 2, 4, 3, 3, 2, 5, 5, 5, 3, 2, 2, 2, 5, 4,
# 2, 4, 2, 2, 5, 5, 1, 2, 3, 2, 2, 5, 4, 2, 5, 4, 3, 5, 4, 4, 2, 2, 2,
# 4, 2, 5, 2, 2, 3, 3, 2, 4, 5, 3, 4, 4, 2, 1, 5, 4, 2, 2, 5, 5, 2, 2,
# 5, 5, 5, 4, 3, 3, 2, 4], dtype=int32)
sch.leaders(Z,T)
# (array([190, 191, 182, 193, 194], dtype=int32),
# array([2, 3, 1, 4,5],dtype=int32))
So now, the output of fcluster() gives the cluster assignment of each observation (indexed by id), and leaders(), described in the documentation, is supposed to return two arrays:
the first contains the leader nodes of the clusters; here we can see there are 5 clusters, matching the plot
the second contains the ids of those clusters
So if leaders() returns L and M respectively, then L[2]=182 and M[2]=1, so cluster 1 is led by node id 182, which does not exist in the observation set X. The documentation says "... then it corresponds to a non-singleton cluster", but I don't understand what that means.
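To make the node ids concrete, here is a minimal sketch of how I understand the numbering: with n observations, ids 0 to n-1 are the observations themselves, and id n+i is the cluster formed by row i of Z. So with n=100, a leader id such as 182 would be the internal node created at row 82. The fixed seed and the names nodes and members are my own choices for illustration, not part of the code above.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)          # fixed seed, an assumption for repeatability
X = rng.standard_normal((100, 2))
d = pdist(X)
Z = sch.linkage(d, method='complete')
T = sch.fcluster(Z, 0.5 * d.max(), 'distance')
L, M = sch.leaders(Z, T)                # leader node ids and their cluster ids

n = X.shape[0]
# rd=True also returns a list in which nodes[i] is the tree node with id i
root, nodes = sch.to_tree(Z, rd=True)

members = {}                            # cluster id -> sorted observation ids
for leader_id, cluster_id in zip(L, M):
    leaves = nodes[leader_id].pre_order()   # observation ids under this leader
    members[int(cluster_id)] = sorted(leaves)
    if leader_id < n:
        kind = "a single observation"
    else:
        kind = "internal node built by row %d of Z" % (leader_id - n)
    print("cluster", cluster_id, "leader", leader_id, "is", kind,
          "with", len(leaves), "observations")
```

If this reading is right, the observations collected under each leader should be exactly those that fcluster() labelled with that cluster id.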
Also, I converted Z to a tree with sch.to_tree(Z), which returns an easy-to-use tree object that I want to visualize. Which graphical tool can take this kind of tree object as input?
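For what it is worth, this is a rough sketch of the kind of export I have in mind: walk each leader's subtree and write one parent/child edge per line in SIF style ("source relation target"), which tools such as Cytoscape can import. The helper subtree_to_sif, the "n" prefix on node names, and the relation label "pc" are hypothetical names of my own, and I am assuming a node name alone on a line is acceptable for a singleton cluster.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

def subtree_to_sif(node, relation="pc"):
    """Return SIF-style lines for the subtree rooted at a ClusterNode.

    'relation' is an arbitrary edge label chosen for this sketch.
    """
    if node.is_leaf():
        # singleton cluster: no edges, so emit the node name on its own line
        return ["n%d" % node.get_id()]
    lines = []
    stack = [node]
    while stack:                         # iterative walk, avoids deep recursion
        cur = stack.pop()
        if cur.is_leaf():
            continue
        for child in (cur.get_left(), cur.get_right()):
            lines.append("n%d\t%s\tn%d" % (cur.get_id(), relation, child.get_id()))
            stack.append(child)
    return lines

rng = np.random.default_rng(0)           # fixed seed, an assumption for repeatability
X = rng.standard_normal((100, 2))
d = pdist(X)
Z = sch.linkage(d, method='complete')
T = sch.fcluster(Z, 0.5 * d.max(), 'distance')
L, M = sch.leaders(Z, T)
root, nodes = sch.to_tree(Z, rd=True)    # nodes[i] is the node with id i

# one list of SIF lines per flat cluster, keyed by cluster id
sif_by_cluster = {int(c): subtree_to_sif(nodes[l]) for l, c in zip(L, M)}
for cluster_id, lines in sif_by_cluster.items():
    with open("cluster_%d.sif" % cluster_id, "w") as fh:
        fh.write("\n".join(lines) + "\n")
```

Each file then holds one coloured subtree as an edge list, so the clusters can be loaded and laid out separately.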