How does hierarchical clustering work, and how do you decide the number of clusters from a dendrogram?

Agglomerative hierarchical clustering starts with each point as its own cluster and repeatedly merges the two closest clusters under a linkage rule (single, complete, average, or Ward) until one cluster remains, producing a dendrogram. You choose the number of clusters by cutting the dendrogram just below a sharp jump in merge height, since a large jump means the next merge would join dissimilar groups. Unlike k-means, it needs no preset k, but it is computationally expensive: a straightforward implementation works from all O(n²) pairwise distances.
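The "cut where the merges jump" rule can be sketched in a few lines. This is a minimal illustration, not a standard library routine: the function name `clusters_from_largest_gap` and the example heights are made up for this sketch, and it assumes the merge heights are given in merge order and never decrease (true for single, complete, and average linkage).

```python
def clusters_from_largest_gap(merge_heights):
    """Suggest a cluster count by cutting at the largest jump in merge height.

    merge_heights: the n-1 merge distances of a dendrogram over n points,
    listed in merge order (assumed non-decreasing).
    """
    n_points = len(merge_heights) + 1
    if n_points < 3:
        return 1  # zero or one merge: there is no gap to compare
    # Gap i lies between merge i and merge i+1.
    gaps = [b - a for a, b in zip(merge_heights, merge_heights[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    merges_kept = i + 1  # keep every merge below the largest gap
    return n_points - merges_kept  # each merge reduces the count by one


# Hypothetical merge heights for 6 points: three cheap merges, then a jump.
heights = [0.5, 0.6, 0.7, 4.0, 4.2]
k = clusters_from_largest_gap(heights)
print(k)
```

Here the jump from 0.7 to 4.0 is the largest, so the cut keeps the first three merges and leaves 3 clusters. The largest gap is only a heuristic; on real data it is worth looking at the dendrogram as well.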

How do hierarchical clustering and DBSCAN differ from k-means?

Hierarchical clustering builds a tree of nested merges or splits and does not require specifying k upfront, but general-purpose implementations cost at least O(n²) time and memory (commonly quoted as O(n² log n) time), and it cannot revise early merge decisions. DBSCAN finds arbitrarily shaped clusters by density reachability, naturally marks outliers as noise, and also needs no k, but its results are sensitive to the eps and min_samples hyperparameters.
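To make "density reachability" concrete, here is a small, unoptimised DBSCAN sketch in plain Python. It is an illustration under stated assumptions rather than a reference implementation: the function name `dbscan`, the toy points, and the parameter values are invented for the example, and a point counts itself among its own neighbours when compared with min_samples.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            nb = neighbours(j)
            if len(nb) >= min_samples:  # j is also core: keep expanding
                queue.extend(nb)
        cluster_id += 1
    return labels


# Two tight squares plus one far-away outlier (toy data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_samples=3)
print(labels)
```

With these settings the two squares become clusters 0 and 1 and the lone point is labelled -1 (noise). Shrinking eps below 1 would turn every point into noise, which is the hyperparameter sensitivity mentioned above.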

How does k-means clustering work?

K-means partitions n points into k clusters by alternating between two steps: assigning each point to its nearest centroid, then recomputing each centroid as the mean of its assigned points. It repeats until assignments stop changing, which guarantees convergence but not a globally optimal solution.
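The two alternating steps can be sketched as follows. This is a minimal version, assuming squared Euclidean distance and a random choice of k data points as starting centroids; the names `kmeans` and `sq_dist` and the toy data are illustrative only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start: k distinct data points
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing: converged
        assignments = new_assignments
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(data, k=2)
print(centroids, assignments)
```

On this well-separated toy data the loop settles on the two obvious groups. On harder data, different seeds can end in different local optima, which is why practical implementations usually rerun from several starts.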

What's the difference between k-means and k-nearest neighbors? People confuse them.

K-means is an unsupervised clustering algorithm that partitions unlabeled data into k groups by iteratively updating centroids. KNN is a supervised algorithm that classifies or predicts a new point using the labels of its k closest training points. They share the letter k and the use of distances but solve completely different problems.
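For contrast with k-means, a KNN classifier has no training loop at all: it just looks up labelled neighbours at prediction time. The sketch below assumes Euclidean distance and a simple majority vote; `knn_predict` and the toy data are hypothetical names and values chosen for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    labelled training points."""
    # Indices of training points, nearest first.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]  # labels are required: supervised
print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

Note what the k means in each case: here it is the number of neighbours consulted, whereas in k-means it is the number of clusters produced.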

Hierarchical Clustering & Linkage — GATE DA

What you'll learn

Agglomerative clustering is bottom-up (merge); divisive is top-down (split)

Single linkage = minimum pairwise distance; complete linkage = maximum pairwise distance

The merge history is drawn as a dendrogram

Finding the first merge: compute pairwise distances and join the closest pair

Last lesson left k-means demanding a number — k, fixed before you have seen a single grouping — and gambling on a random start. Hierarchical clustering asks for neither. It builds a whole tree of nested clusters, so you choose how many you want after the fact, by slicing the tree wherever you like. The usual flavour is agglomerative: start with every point as its own tiny cluster, then repeatedly fuse the two closest clusters until only one remains. That nested tree is why biologists reach for it on gene-expression and phylogeny heatmaps, and why it is the natural default when you do not know how many clusters you want in advance.

Bottom-up vs top-down

There are two directions to build the tree:

Agglomerative (bottom-up) — begin with n singleton clusters and repeatedly merge the two nearest. This is the common one.
Divisive (top-down) — begin with one giant cluster holding everything and repeatedly split it.

The full merge history is recorded as a dendrogram — a tree whose join heights show how far apart the clusters were when they fused.
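The bottom-up procedure and its merge history can be sketched directly. This is a deliberately naive version that rescans every pair of clusters at each step, assuming Euclidean distance between points; the name `agglomerate` and its `linkage` parameter are choices made for this sketch, with `min` giving single linkage and `max` giving complete linkage.

```python
import math
from itertools import combinations


def agglomerate(points, linkage=min):
    """Naive agglomerative clustering.

    Returns the merge history as a list of (cluster_a, cluster_b, height),
    where clusters are tuples of point indices and height is the
    cluster-to-cluster distance at which they were merged.
    """
    def cluster_dist(pair):
        a, b = pair
        # Apply the linkage rule to all cross-cluster point distances.
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    clusters = [(i,) for i in range(len(points))]  # n singleton clusters
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=cluster_dist)
        history.append((a, b, cluster_dist((a, b))))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


pts = [(0,), (1,), (5,), (7,)]  # four points on a line (toy data)
for a, b, height in agglomerate(pts, linkage=min):
    print(a, b, height)
```

The heights in the returned history are exactly the join heights a dendrogram would draw. For these four points single linkage merges at heights 1, 2, and 4, while passing `linkage=max` gives 1, 2, and 7.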

A dendrogram: the lowest join is the closest (first) merge; higher joins fuse clusters that are farther apart.

Linkage — what does “distance between two clusters” mean?

Two points have one obvious distance. But two clusters, each a bag of several points, do not — so we need a rule, called the linkage:

Single linkage = the MINIMUM distance between any one point in cluster A and any one point in cluster B (the two nearest neighbours across the gap).
Complete linkage = the MAXIMUM such distance (the two farthest points).

Single linkage measures clusters by their nearest points (can produce long, chained clusters); complete linkage by their farthest points (produces compact, tight clusters).

Because single linkage needs only one close pair to merge, it can chain points into long straggly clusters; complete linkage demands the whole far end be close too, so it builds compact clusters. The rule you choose genuinely changes the answer.
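The two definitions differ only in whether you take the minimum or the maximum over all cross-cluster pairs. A minimal sketch, assuming Euclidean distance; the function names and the two example clusters are illustrative.

```python
import math


def single_linkage(A, B):
    """Smallest distance between a point of A and a point of B."""
    return min(math.dist(a, b) for a in A for b in B)


def complete_linkage(A, B):
    """Largest distance between a point of A and a point of B."""
    return max(math.dist(a, b) for a in A for b in B)


A = [(0, 0), (1, 0)]
B = [(4, 0), (6, 0)]
print(single_linkage(A, B))    # nearest pair: (1, 0) and (4, 0)
print(complete_linkage(A, B))  # farthest pair: (0, 0) and (6, 0)
```

For these clusters the single-linkage distance is 3 and the complete-linkage distance is 6, so the same two clusters look twice as far apart under complete linkage.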

How GATE asks this

Two recurring shapes. First, a NAT: given a few points, compute the pairwise distances and report which pair merges first (the smallest distance), or that smallest distance itself. Second, an MSQ testing the linkage definitions — GATE DA 2025 asked you to identify that single linkage uses the minimum and complete linkage the maximum pairwise distance. GATE DA 2026 asked the first-merge question.

Worked example — GATE DA 2026 (Manhattan distance)

Points P1 = (2, 3, −1), P2 = (3, 1, 1), P3 = (5, −2, 3), P4 = (3, 3, 3). Using Manhattan (L1) distance, which pair merges first in agglomerative clustering?

With every point its own cluster, the first merge is just the closest pair of points. Manhattan distance sums the absolute coordinate differences, so compute it for all six pairs:

d(P2, P4) = |3−3| + |1−3| + |1−3| = 0 + 2 + 2 = 4
d(P1, P2) = |2−3| + |3−1| + |−1−1| = 1 + 2 + 2 = 5
d(P1, P4) = |2−3| + |3−3| + |−1−3| = 1 + 0 + 4 = 5
d(P2, P3) = |3−5| + |1−(−2)| + |1−3| = 2 + 3 + 2 = 7
d(P3, P4) = |5−3| + |−2−3| + |3−3| = 2 + 5 + 0 = 7
d(P1, P3) = |2−5| + |3−(−2)| + |−1−3| = 3 + 5 + 4 = 12

The smallest distance is d(P2, P4) = 4, so P2 and P4 merge first — a real GATE DA 2026 answer. And recall the GATE DA 2025 fact alongside it: single linkage = min, complete linkage = max.
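The hand calculation can be checked with a few lines of code. This sketch uses the four points from the question; the helper name `manhattan` and the dictionary layout are choices made here for illustration.

```python
from itertools import combinations

points = {
    "P1": (2, 3, -1),
    "P2": (3, 1, 1),
    "P3": (5, -2, 3),
    "P4": (3, 3, 3),
}


def manhattan(p, q):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Distance for every unordered pair of points.
distances = {
    (a, b): manhattan(points[a], points[b]) for a, b in combinations(points, 2)
}
first_merge = min(distances, key=distances.get)

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, d)
print("first merge:", first_merge)
```

The closest pair comes out as P2 and P4 at distance 4, matching the worked answer. The same loop works for any number of points or dimensions, which is handy for checking NAT answers during practice.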

In one breath

Hierarchical clustering needs no k: the agglomerative (bottom-up) version starts every point in its own cluster and repeatedly merges the two nearest, recording the history as a dendrogram you cut at any height — while divisive goes top-down by splitting; the distance between two clusters is set by the linkage rule, single = minimum cross-cluster pair (chains into straggly clusters) and complete = maximum cross-cluster pair (compact clusters), and the first merge is always simply the closest pair of points.

Practice

Quick check

Q1 (Recall): Which statements about hierarchical clustering are TRUE? (select all that apply)

Q2 (Recall): Which statements about linkage are correct? (select all that apply)

Q3 (Recall): In single-linkage agglomerative clustering on a set of singleton points, which pair is merged in the very first step?

Q4 (Trace): Points P2 = (3, 1, 1) and P4 = (3, 3, 3). What is their Manhattan (L1) distance? (the 2026 PYQ first-merge value; numerical answer)

Q5 (Trace): Cluster A = {(0,0), (0,2)} and cluster B = {(5,0), (5,4)}, using Euclidean distance. What is the SINGLE-linkage distance between A and B? (numerical answer)

Q6 (Apply): For the same clusters A = {(0,0), (0,2)} and B = {(5,0), (5,4)}, what is the COMPLETE-linkage distance? (numerical answer, rounded to 2 decimals)

A question to carry forward

Both clustering methods — k-means and this tree — rest entirely on one operation: measuring the distance between points. That works beautifully in two or three dimensions, where you can even see the clusters. But real data has fifty features, or five hundred, and in those vast spaces distances grow strange and unhelpful, every point drifting roughly equidistant from every other, and no plot can show you what is going on.
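You can watch distances lose their contrast with a small experiment. The sketch below is an illustration under assumptions, not a proof: it draws points uniformly at random from the unit cube, measures Euclidean distances from one point to all the others, and reports how much the farthest exceeds the nearest relative to the nearest. The name `relative_spread` and the choice of 100 points are arbitrary.

```python
import math
import random


def relative_spread(dim, n_points=100, seed=0):
    """(farthest - nearest) / nearest distance from one random point to the
    rest, for uniform random data in `dim` dimensions."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = pts[0]
    dists = [math.dist(query, p) for p in pts[1:]]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 50, 500):
    print(dim, round(relative_spread(dim), 3))
```

In two dimensions the farthest point is many times farther away than the nearest; as the dimension grows the ratio shrinks towards zero, so "closest cluster" carries less and less information.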

So the other half of unsupervised learning turns the problem sideways. Instead of grouping the rows, it shrinks the columns — replacing fifty tangled features with two or three new axes that still capture most of the data’s spread. Here is the thread onward, and the last step of this chapter: how do you find those few directions of greatest variation, squeeze fifty features down to two with the least possible loss, and finally meet face to face the method LDA was forever measured against — PCA?
