What are the main limitations of k-means clustering?
K-means requires specifying k upfront, assumes clusters are convex and roughly equal in size and density, is sensitive to outliers and feature scale, and can converge to local minima. It struggles with non-globular shapes such as rings or crescents, and it assigns every point to exactly one cluster with no notion of uncertainty.
How to think about it
K-means is fast and interpretable, but it bakes in assumptions that many real datasets violate. Knowing the failure modes tells you when to reach for DBSCAN, GMM, or hierarchical clustering instead.
Shape and geometry
K-means partitions space into Voronoi cells with straight boundaries: the boundary between two centroids is always the perpendicular bisector of the segment joining them, so every cluster region is convex. This means it cannot recover:
- Non-convex clusters (rings, crescents, interleaved spirals)
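- Clusters of very different densities: a wide, sparse cluster tends to get split, because splitting it reduces the within-cluster sum of squares more than separating dense, compact clusters does
- Clusters of very different sizes: the boundary sits midway between centroids regardless of how far each cluster extends, so small tight clusters get merged into larger ones or pick up points from a large neighbour's fringe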
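To make the geometric limitation concrete, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The two-crescent dataset from make_moons and every parameter value are illustrative choices, not part of the original answer. It checks that each label is simply the nearest centroid (hence the straight boundary) and scores the result against the true crescents.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Illustrative data (an assumption): two interleaved crescents.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)

# Each point is labelled by its nearest centroid, so the boundary between
# the two clusters is the perpendicular bisector of the two centroids.
dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print("labels match nearest centroid:", bool((labels == nearest).all()))

# Adjusted Rand index: 1.0 would mean the crescents were recovered exactly.
ari = adjusted_rand_score(y_true, labels)
print(f"ARI of k-means on two moons: {ari:.2f}")
```

A straight cut cannot follow either crescent, so the score should land well below 1 on data like this.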
Sensitivity to scale and outliers
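K-means is distance-based, so a feature measured in thousands dominates one measured in fractions. Standardise features beforehand. Because centroids are means, outliers pull them away from the bulk of the data, which can distort the entire partition. Remove obvious outliers first, or use k-medoids (PAM), which uses actual data points as centres and is less affected by extreme values.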
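The effect of feature scale can be sketched as follows, assuming scikit-learn and NumPy. The synthetic data is made up for illustration: one small-scale feature that separates two groups, and one pure-noise feature measured in thousands. The same k-means model is run on the raw features and inside a pipeline with StandardScaler.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200  # points per group (illustrative)

# Feature 0 separates the two groups but lives on a small scale.
informative = np.concatenate([rng.normal(0.0, 0.1, n), rng.normal(1.0, 0.1, n)])
# Feature 1 is pure noise measured in thousands.
noise = rng.normal(0.0, 1000.0, 2 * n)
X = np.column_stack([informative, noise])
y_true = np.repeat([0, 1], n)

raw = KMeans(n_clusters=2, n_init=10, random_state=0)
scaled = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0)
)

ari_raw = adjusted_rand_score(y_true, raw.fit_predict(X))
ari_scaled = adjusted_rand_score(y_true, scaled.fit_predict(X))
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with StandardScaler: {ari_scaled:.2f}")
```

Putting the scaler in a pipeline keeps the scaling and the clustering together, so new data passes through the same transformation.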
Requires k in advance
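You must decide k before you see the result. For exploratory analysis this is circular. DBSCAN infers the number of clusters from density (though it needs eps and min_samples instead), and hierarchical clustering lets you pick the cut after inspecting the dendrogram. Gaussian Mixture Models still need a component count, but you can fit several candidates and compare them with BIC.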
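One way the BIC comparison might look, as a sketch assuming scikit-learn's GaussianMixture. The three-blob dataset, the candidate range 1 to 6, and n_init=5 are all illustrative assumptions. This does not avoid fitting a model per candidate; it only gives a criterion for comparing them.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data (an assumption): three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.7,
    random_state=0,
)

# Fit one mixture per candidate k and record its BIC (lower is better).
bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)
for k, bic in bics.items():
    print(f"k={k}: BIC={bic:.1f}")
print("k with lowest BIC:", best_k)
```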
Local minima
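The objective (inertia, the within-cluster sum of squared distances) is non-convex, and the standard algorithm only converges to a local minimum. Different random starts can therefore yield different partitions. Mitigate with k-means++ initialisation and multiple restarts (for example, n_init=10 in scikit-learn's KMeans, which keeps the run with the lowest inertia).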
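A small sketch of the restart strategy, assuming scikit-learn's KMeans. The six-blob dataset and the seed range are illustrative assumptions. It first runs single fits from purely random starts to show how inertia can vary, then uses k-means++ with ten restarts.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data (an assumption): six blobs, so poor starts can go wrong.
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.0, random_state=0)

# Single runs from purely random starts: inertia may differ between seeds.
inertias = []
for seed in range(10):
    km = KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)
print("single-run inertias:", [round(v, 1) for v in inertias])

# k-means++ seeding plus 10 restarts; the lowest-inertia run is kept.
best = KMeans(n_clusters=6, init="k-means++", n_init=10, random_state=0).fit(X)
print("best of 10 k-means++ restarts:", round(best.inertia_, 1))
```

If the single-run inertias differ, the higher ones are runs that stopped in a worse local minimum.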
Hard, exclusive assignment
Every point belongs to exactly one cluster. There is no soft probability of membership. Use Gaussian Mixture Models if you need probabilistic cluster assignments.
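For comparison, soft assignments from a Gaussian Mixture Model could look like the sketch below, assuming scikit-learn's GaussianMixture. The overlapping two-blob dataset and the 0.9 threshold for calling a point "uncertain" are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data (an assumption): two overlapping blobs.
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [3, 0]], cluster_std=1.0, random_state=0
)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard = gmm.predict(X)         # one label per point, as k-means would give
probs = gmm.predict_proba(X)  # one membership probability per component

# Flag points whose largest membership probability is below 0.9
# (the threshold is an arbitrary choice for this sketch).
uncertain = probs.max(axis=1) < 0.9
print("rows sum to 1:", bool(np.allclose(probs.sum(axis=1), 1.0)))
print("uncertain points:", int(uncertain.sum()), "of", len(X))
```

The hard labels are just the most probable component, so the probabilities add information that k-means cannot provide.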
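# When k-means fails on non-convex data, DBSCAN is one possible fix
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Illustrative data (an assumption): two crescents; eps must suit the data's scale
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)  # -1 marks noise/outliers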