How do you choose k in k-means, and when does k-means fail?
Choose k using the elbow method on within-cluster sum of squares, the silhouette score, or domain knowledge, validating stability across runs. K-means fails when clusters are non-spherical, have very different sizes or densities, when outliers are present, or when k is genuinely unknown. It also only finds a local optimum, so initialization (k-means++) matters.
How to think about it
The crisp answer
There’s no single correct k, so you triangulate: the elbow method (plot within-cluster sum of squares vs k and look for the bend), the silhouette score (measures how well each point fits its cluster vs the next-nearest, higher is better), and domain knowledge. K-means fails whenever its core assumptions — convex, isotropic, similarly sized clusters — are violated.
Why it fails
K-means minimizes within-cluster squared distance to centroids, which implicitly assumes round, equally sized blobs. As the Analytics Vidhya K-means question set puts it, k-means struggles when the density spread differs across the space or when clusters follow a non-convex shape. Concretely it breaks on:
- Non-spherical clusters (two crescents, concentric rings).
- Unequal sizes or densities — large clusters get split, small ones absorbed.
- Outliers, which drag centroids since the mean isn’t robust.
- Wrong k, which it can’t detect on its own.
A concrete example
On two interleaving half-moons, k-means draws a straight boundary and cuts both moons in half; DBSCAN or spectral clustering recovers the true shapes.
The common trap
Forgetting that k-means finds only a local optimum and is sensitive to initialization — always use k-means++ seeding and multiple restarts (n_init). Also forgetting to scale features, since it’s distance-based. Expected follow-up: “What would you use instead?” — DBSCAN for arbitrary shapes and noise, Gaussian mixtures for elliptical clusters and soft assignments, hierarchical clustering when you don’t want to fix k upfront.