datarekha

K-means clustering

The first unsupervised algorithm — group unlabeled data into k clusters by alternating assign and update. How Lloyd's algorithm works, why init matters, and how to choose k with the elbow and silhouette.

7 min read Beginner Machine Learning Lesson 26 of 33

What you'll learn

  • How k-means alternates assign and update to minimize inertia
  • Why initialization matters (and what k-means++ fixes)
  • Choosing k with the elbow method and the silhouette score

Before you start

Everything so far has been supervised — you had labels. But often you just have data and a question: are there natural groups in here? Customer segments, document topics, image colors. That’s unsupervised learning, and k-means is where it starts. It’s beautifully simple — two steps repeated — and it’s a near-guaranteed interview topic.

Two steps, repeated

You tell k-means how many clusters to find (k). It then loops Lloyd’s algorithm:

  1. Assign — put each point in the cluster of its nearest centroid.
  2. Update — move each centroid to the mean of the points assigned to it.

Repeat until nothing moves. Each round lowers inertia — the total squared distance from points to their assigned centroid — until it settles into a local minimum. Step through it:

That’s the whole algorithm. No labels, no gradient descent — just alternating assignment and averaging.

The two things that bite

  • Initialization matters. A bad random start can leave centroids stuck in a poor configuration. The fix is k-means++, which spreads the initial centroids out, plus running several restarts and keeping the best. scikit-learn does both by default (init="k-means++", n_init="auto").
  • You have to choose k. k-means can’t discover the number of clusters. Two standard methods:
    • Elbow — plot inertia vs k. It always decreases, but bends sharply at the “right” k (watch the elbow plot in the widget bend at 3).
    • Silhouette score — measures how well-separated the clusters are (−1 to 1); pick the k that maximizes it. More reliable than the elbow when it’s ambiguous.

Quick check

Quick check

0/3
Q1What are the two repeated steps of k-means (Lloyd's algorithm)?
Q2How do you choose the number of clusters k?
Q3Your data forms two interleaved crescent shapes. Will k-means cluster them correctly?

Next

K-means fails on non-blob shapes — DBSCAN & hierarchical clustering handle those. And to see high-dimensional clusters at all, you first reduce dimensions with PCA.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Practice this in an interview

All questions
How does k-means clustering work?

K-means partitions n points into k clusters by alternating between two steps: assigning each point to its nearest centroid, then recomputing each centroid as the mean of its assigned points. It repeats until assignments stop changing, which guarantees convergence but not a globally optimal solution.

How do you choose k in k-means, and when does k-means fail?

Choose k using the elbow method on within-cluster sum of squares, the silhouette score, or domain knowledge, validating stability across runs. K-means fails when clusters are non-spherical, have very different sizes or densities, when outliers are present, or when k is genuinely unknown. It also only finds a local optimum, so initialization (k-means++) matters.

How do you choose the number of clusters k in k-means?

The elbow method plots inertia against k and looks for the bend where adding another cluster gives diminishing returns. The silhouette score measures how similar each point is to its own cluster versus its nearest rival, with values closer to 1 indicating tighter, better-separated clusters. Both should be used together, not in isolation.

What are the main limitations of k-means clustering?

K-means requires specifying k upfront, assumes clusters are convex and roughly equal in size and density, is sensitive to outliers and feature scale, and can converge to local minima. It struggles with non-globular shapes such as rings or crescents, and it assigns every point to exactly one cluster with no notion of uncertainty.

Related lessons

Explore further

Skip to content