Machine Learning | Medium | Asked at Amazon, Airbnb, Uber

How do you choose the number of clusters k in k-means?

For: Data Scientist, ML Engineer, AI / LLM Engineer, Data Analyst

The short answer

The elbow method plots inertia against k and looks for the bend where adding another cluster gives diminishing returns. The silhouette score measures how similar each point is to its own cluster versus its nearest rival, with values closer to 1 indicating tighter, better-separated clusters. Both should be used together, not in isolation.

How to think about it

Choosing k is a model-selection problem with no ground-truth label, so you triangulate with at least two signals rather than trusting a single metric.

Elbow method

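Run k-means for k = 2, 3, …, 10 and record the inertia (within-cluster sum of squared errors, SSE) of each run. Plot inertia against k. Inertia keeps decreasing as k grows (it reaches zero when every point is its own cluster), so you cannot simply minimise it. Instead, look for the “elbow,” the point where the curve bends and further gains in tightness flatten out. The method is a heuristic: real data often produces a smooth curve with no sharp bend, forcing you to pick a range rather than a single point.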
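Reading the bend off a plot is subjective, so it can help to locate it numerically. The sketch below uses one common heuristic, not the only one: rescale the curve to the unit square and pick the k whose point lies farthest from the straight line joining the first and last points. The function name `elbow_by_chord_distance` and the inertia values are made up for illustration.

```python
import math

def elbow_by_chord_distance(ks, inertias):
    """Return the k whose (k, inertia) point is farthest from the chord
    joining the first and last points of the curve (a heuristic only)."""
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    k_span = ks[-1] - ks[0]
    inertia_span = inertias[0] - inertias[-1]
    if k_span == 0 or inertia_span == 0:
        raise ValueError("curve is flat or ks are not distinct")
    # Rescale both axes to [0, 1] so the answer does not depend on units.
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - inertias[-1]) / inertia_span for v in inertias]
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. x + y = 1.
    # The distance from (x, y) to that line is |x + y - 1| / sqrt(2).
    distances = [abs(x + y - 1) / math.sqrt(2) for x, y in zip(xs, ys)]
    return ks[distances.index(max(distances))]

# Hypothetical inertia values for k = 2..10, for illustration only.
ks = list(range(2, 11))
inertias = [1000, 400, 150, 120, 100, 85, 75, 68, 62]
elbow_k = elbow_by_chord_distance(ks, inertias)
print(elbow_k)
```

On a smooth curve this still returns a single k, so treat the result as the centre of a candidate range rather than a final answer.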

Silhouette score

For each point i, compute:

a(i): mean distance to other points in the same cluster (cohesion).
b(i): mean distance to points in the nearest other cluster (separation).

Silhouette for point i = (b(i) - a(i)) / max(a(i), b(i)).

The mean over all points gives the overall silhouette score. Values range from -1 (the point probably belongs in a neighbouring cluster) to +1 (tight, well-separated), and values near 0 indicate overlapping clusters. Pick the k that maximises the mean score.
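To make the definition concrete, here is a minimal from-scratch sketch of the per-point silhouette using Euclidean distance and only the standard library. The function name `silhouette_values`, the toy points, and the choice to score singleton clusters and degenerate cases as 0 are assumptions for illustration; in practice a library implementation is the better choice.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the *other* members of the same cluster.
        # The point's distance to itself is 0, so sum over all, divide by n - 1.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Toy data: two tight pairs of points far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_values(points, [0, 1, 0, 1])   # grouping that splits the pairs
print(sum(good) / len(good), sum(bad) / len(bad))
```

The sensible grouping scores close to 1 and the split grouping scores below 0, which is the contrast the metric is meant to expose.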

Other signals

Method	When to use
Gap statistic	Compares inertia to a null reference distribution; more principled but slow
Domain knowledge	Often the strongest signal — “we have 4 product tiers”
Downstream metric	If clusters feed a model, optimise that model’s performance directly

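from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# X is assumed to be a numeric feature matrix of shape (n_samples, n_features)
# that has already been loaded; it is not defined in this snippet.
inertias, sil_scores = [], []
ks = range(2, 11)  # silhouette needs at least 2 clusters

for k in ks:
    # n_init=10 reruns k-means from 10 initialisations and keeps the best fit
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                        # elbow signal: plot against ks
    sil_scores.append(silhouette_score(X, km.labels_))  # mean silhouette for this k

# k with the highest mean silhouette; cross-check it against the inertia curve
best_k = ks[int(np.argmax(sil_scores))]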
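The gap statistic from the table can be sketched in the same style. This is a simplified version under stated assumptions: reference data is drawn uniformly from the bounding box of X, inertia stands in for the within-cluster dispersion, and k is chosen as the smallest value whose gap is at least the next gap minus its standard error. The names `gap_statistic`, `pick_k`, and `X_demo` are hypothetical, and the synthetic blobs exist only to make the snippet self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, ks, n_refs=10, random_state=0):
    """Return (gaps, errs): gap value and adjusted standard error for each k.
    Assumes every k is smaller than the number of distinct points in X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in ks:
        def log_inertia(data):
            km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
            return np.log(km.fit(data).inertia_)
        observed = log_inertia(X)
        # Null reference: points spread uniformly over the bounding box of X.
        ref_logs = np.array([
            log_inertia(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_refs)
        ])
        gaps.append(ref_logs.mean() - observed)
        errs.append(ref_logs.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(errs)

def pick_k(ks, gaps, errs):
    """Smallest k with gap(k) >= gap(k+1) - err(k+1); falls back to the last k."""
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return ks[i]
    return ks[-1]

# Synthetic demo data: three Gaussian blobs (illustration only).
demo_rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X_demo = np.vstack([c + demo_rng.normal(scale=1.0, size=(50, 2)) for c in centers])

demo_ks = list(range(1, 7))
gaps, errs = gap_statistic(X_demo, demo_ks, n_refs=5)
chosen_k = pick_k(demo_ks, gaps, errs)
print(chosen_k)
```

Each k requires n_refs extra k-means fits, which is where the "slow" in the table comes from. Unlike the silhouette score, the gap statistic is defined for k = 1, so it can also suggest that the data has no cluster structure at all.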
