datarekha
Machine Learning · Medium · Asked at Amazon, Airbnb, Uber

How do you choose the number of clusters k in k-means?

The short answer

The elbow method plots inertia against k and looks for the bend where adding another cluster gives diminishing returns. The silhouette score measures how similar each point is to its own cluster versus its nearest rival, with values closer to 1 indicating tighter, better-separated clusters. Both should be used together, not in isolation.

How to think about it

Choosing k is a model-selection problem with no ground-truth label, so you triangulate with at least two signals rather than trusting a single metric.

Elbow method

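Run k-means for k = 2, 3, …, 10 and record the inertia (within-cluster sum of squared errors, SSE) for each run. Plot inertia against k. Inertia generally keeps falling as k grows (it reaches zero when every point is its own cluster), so you look for the bend rather than the minimum: the “elbow” is the point past which extra clusters buy little additional tightness. It’s a heuristic: real data often produces a smooth curve with no sharp bend, forcing you to pick a range rather than a single point.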
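When the bend is hard to judge by eye, one rough way to locate it numerically is to find the point that lies furthest below the straight line joining the two ends of the inertia curve. The sketch below assumes that heuristic; `find_elbow` is a hypothetical helper (not part of scikit-learn), and the synthetic blobs from `make_blobs` stand in for real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def find_elbow(ks, inertias):
    """Return the k whose inertia lies furthest below the chord joining the endpoints.

    Hypothetical helper: a simple heuristic, not a definitive elbow detector.
    """
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Straight line from the first to the last point of the curve.
    chord = np.interp(ks, [ks[0], ks[-1]], [inertias[0], inertias[-1]])
    # Vertical gap between the chord and the curve; largest at the sharpest bend.
    gap = chord - inertias
    return int(ks[np.argmax(gap)])


# Synthetic data for illustration only (assumption: 4 Gaussian blobs).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(2, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

elbow_k = find_elbow(ks, inertias)
print("elbow candidate:", elbow_k)
```

Treat the result as a candidate to inspect on the plot, not as a final answer. On a smooth curve the largest gap can be shallow, and neighbouring values of k may be equally defensible.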

Silhouette score

For each point i, compute:

  • a(i): mean distance to other points in the same cluster (cohesion).
  • b(i): mean distance to the points in the nearest other cluster, i.e. the other cluster with the smallest such mean (separation).

Silhouette for point i = (b(i) - a(i)) / max(a(i), b(i)).

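The mean over all points gives the overall silhouette score. Values range from -1 (the point is probably in the wrong cluster) through 0 (the point sits on the boundary between two clusters) to +1 (tight, well-separated). Pick the k that maximises the mean score. The score needs at least two clusters, so it cannot evaluate k = 1.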
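To make the formula concrete, here is a minimal from-scratch sketch on four points. It assumes Euclidean distance and assigns a silhouette of 0 to singleton clusters; `silhouette_samples_naive` is a hypothetical name chosen for illustration. In practice you would more likely call `silhouette_samples` or `silhouette_score` from `sklearn.metrics`.

```python
import numpy as np


def silhouette_samples_naive(X, labels):
    """Per-point silhouette with Euclidean distance; O(n^2), for illustration only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # Assumed convention: a singleton cluster gets silhouette 0.
            continue
        # a(i): mean distance to the other members of i's own cluster.
        # dists[i, i] is 0, so dividing by (n_same - 1) excludes i itself.
        a = dists[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            dists[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


# Two tight, well-separated groups on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

scores = silhouette_samples_naive(X, labels)
print(scores)         # one value per point
print(scores.mean())  # overall silhouette score
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so its silhouette is 9.5 / 10.5 ≈ 0.905. All four points score close to 1, as expected for compact clusters that sit far apart.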

Other signals

| Method | When to use |
|---|---|
| Gap statistic | Compares inertia to a null reference distribution; more principled but slow |
| Domain knowledge | Often the strongest signal, e.g. “we have 4 product tiers” |
| Downstream metric | If clusters feed a model, optimise that model’s performance directly |

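
In code, the elbow and silhouette sweeps fit in a single loop:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# X: feature matrix of shape (n_samples, n_features), assumed to be defined
# and scaled already; k-means is distance-based, so feature scale matters.
inertias, sil_scores = [], []
ks = range(2, 11)  # silhouette needs at least 2 clusters

for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster SSE, for the elbow plot
    sil_scores.append(silhouette_score(X, km.labels_))  # mean silhouette over all points

# k with the highest mean silhouette; cross-check it against the elbow in `inertias`.
best_k = ks[int(np.argmax(sil_scores))]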
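
The gap statistic listed under other signals can be sketched in the same style. The version below is a simplified illustration, not a reference implementation: it assumes the null reference is uniform over the bounding box of the data, uses `inertia_` as the within-cluster dispersion, and picks the smallest k whose gap is within one adjusted standard deviation of the next gap. `gap_statistic` and `pick_k` are hypothetical helper names, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def gap_statistic(X, ks, n_refs=5, seed=0):
    """Return (gaps, sks), one entry per k. Simplified sketch of the gap statistic."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the data
    gaps, sks = [], []
    for k in ks:
        # Log dispersion on the real data.
        km = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(X)
        log_w = np.log(km.inertia_)
        # Log dispersion on reference sets with no cluster structure.
        ref_log_w = []
        for _ in range(n_refs):
            ref = rng.uniform(lo, hi, size=X.shape)
            ref_km = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(ref)
            ref_log_w.append(np.log(ref_km.inertia_))
        ref_log_w = np.array(ref_log_w)
        gaps.append(ref_log_w.mean() - log_w)
        # Standard deviation, inflated to account for simulation error.
        sks.append(ref_log_w.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(sks)


def pick_k(ks, gaps, sks):
    """Smallest k with gap(k) >= gap(k+1) - s(k+1); falls back to the largest k."""
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return ks[i]
    return ks[-1]


# Synthetic data for illustration only (assumption: 3 Gaussian blobs).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=1)

ks_demo = list(range(1, 8))  # unlike silhouette, the gap statistic can evaluate k = 1
gaps, sks = gap_statistic(X_demo, ks_demo)
gap_k = pick_k(ks_demo, gaps, sks)
print("gap-statistic candidate:", gap_k)
```

Each k requires fitting k-means on every reference set as well as on the real data, which is where the "slow" in the table comes from. Raising `n_refs` gives a steadier estimate at a proportional cost.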
