How does hierarchical clustering work, and how do you decide the number of clusters from a dendrogram?
Agglomerative hierarchical clustering starts with each point as its own cluster and repeatedly merges the two closest clusters using a linkage rule (single, complete, average, or Ward) until one cluster remains, producing a dendrogram. You choose the number of clusters by cutting the dendrogram at a height where merges jump sharply, which indicates joining dissimilar groups. Unlike k-means it needs no preset k but is computationally expensive.
How to think about it
The crisp answer
Agglomerative hierarchical clustering is bottom-up: every point starts as its own cluster, and at each step the two closest clusters merge, until everything is one cluster. The full merge history is a dendrogram, a tree whose merge heights show how dissimilar the joined clusters were. You pick the number of clusters by cutting the tree at a chosen height.
The role of linkage
“Closest” depends on the linkage rule, which is the key choice:
- Single linkage: distance between the nearest pair — finds elongated chains, sensitive to noise.
- Complete linkage: distance between the farthest pair — compact clusters.
- Average linkage: mean pairwise distance — a balance.
- Ward: merges that minimize the increase in within-cluster variance — tends to give even, spherical clusters.
The Devinterview cluster-analysis questions treat linkage choice as a core interview point.
Reading the dendrogram
Cut horizontally where vertical merge distances jump sharply — a tall vertical segment means you’re about to merge two genuinely dissimilar groups, so stop just below it. The number of vertical lines the cut crosses is your cluster count.
The common trap
Hierarchical clustering is O(n²) or worse in memory and time, so it doesn’t scale to large datasets the way k-means does. It’s also greedy — a wrong early merge can’t be undone — and sensitive to noise and outliers (especially single linkage). Always scale features. Follow-up: “vs k-means?” — hierarchical needs no preset k and yields a full hierarchy, but is far more expensive; k-means is fast but needs k and assumes round clusters.