datarekha

How do you choose k in k-means, and when does k-means fail?

The short answer

Choose k using the elbow method on within-cluster sum of squares, the silhouette score, or domain knowledge, validating stability across runs. K-means fails when clusters are non-spherical, have very different sizes or densities, when outliers are present, or when k is genuinely unknown. It also only finds a local optimum, so initialization (k-means++) matters.

How to think about it

The crisp answer

There’s no single correct k, so you triangulate: the elbow method (plot within-cluster sum of squares vs k and look for the bend), the silhouette score (measures how well each point fits its cluster vs the next-nearest, higher is better), and domain knowledge. K-means fails whenever its core assumptions — convex, isotropic, similarly sized clusters — are violated.

Why it fails

K-means minimizes within-cluster squared distance to centroids, which implicitly assumes round, equally sized blobs. As the Analytics Vidhya K-means question set puts it, k-means struggles when the density spread differs across the space or when clusters follow a non-convex shape. Concretely it breaks on:

  • Non-spherical clusters (two crescents, concentric rings).
  • Unequal sizes or densities — large clusters get split, small ones absorbed.
  • Outliers, which drag centroids since the mean isn’t robust.
  • Wrong k, which it can’t detect on its own.

A concrete example

On two interleaving half-moons, k-means draws a straight boundary and cuts both moons in half; DBSCAN or spectral clustering recovers the true shapes.

The common trap

Forgetting that k-means finds only a local optimum and is sensitive to initialization — always use k-means++ seeding and multiple restarts (n_init). Also forgetting to scale features, since it’s distance-based. Expected follow-up: “What would you use instead?” — DBSCAN for arbitrary shapes and noise, Gaussian mixtures for elliptical clusters and soft assignments, hierarchical clustering when you don’t want to fix k upfront.

Learn it properly K-means clustering

Keep practising

All Machine Learning questions

Explore further

Skip to content