k-Nearest Neighbours
A lazy learner with no training step: classify a point by the majority vote of its k closest neighbours, where k trades bias against variance.
What you'll learn
- kNN is a lazy learner — it does no training, it just stores the data
- Classify by finding the k nearest neighbours (Euclidean / Manhattan) and taking a majority vote
- Small k is flexible (low bias, high variance); large k is smooth (high bias, low variance)
- Feature scaling matters because distances do
Before you start
k-Nearest Neighbours is the most intuitive classifier there is: to label a new
point, look at the points nearest to it and copy the majority. If your k = 5
nearest neighbours are 3 cats and 2 dogs, you call it a cat. That is the entire
algorithm.
What makes it unusual is that there is no training. A linear model fits weights; a tree builds splits. kNN does nothing up front — it simply stores the training data. All the work happens at prediction time, when it measures distances to find the neighbours. That is why kNN is called a lazy learner (and why predictions can be slow on big datasets).
Find the neighbours, take a vote
Two ingredients: a distance to decide who is “near,” and a vote to combine their labels.
Distance. The usual choices are Euclidean (straight-line,
√Σ(aᵢ − bᵢ)²) and Manhattan (sum of absolute coordinate differences,
Σ|aᵢ − bᵢ|). For two 2-D points a and b:
Euclidean: d = √[ (a₁−b₁)² + (a₂−b₂)² ]
Manhattan: d = |a₁−b₁| + |a₂−b₂|
Vote. For classification, take the majority label among the k neighbours.
For regression, average their values instead.
k controls the bias–variance trade-off
The only real knob is k, and it directly sets how wiggly the decision boundary is:
- Small
k(e.g.k = 1): the prediction follows the nearest single point, so the boundary is jagged and chases noise — low bias, high variance. - Large
k: the vote averages over many points, smoothing the boundary — higher bias, lower variance. Pushkto the size of the dataset and every query just returns the overall majority class.
How GATE asks this
Typically an MCQ or MSQ on the properties: identify kNN as a lazy
(instance-based, non-parametric — it learns no fixed set of weights, it just keeps
the data) learner with no training phase; state the effect of increasing k
(smoother boundary, more bias, less variance); or choose the right distance metric. NAT versions hand you a tiny labelled set and a
query point and ask you to compute a distance or name the predicted class.
Worked example — classify by the 3 nearest neighbours
Training points:
A = (1, 2)is class +,B = (2, 3)is class +,C = (5, 5)is class −,D = (6, 4)is class −. Classify the queryq = (3, 3)usingk = 3and Euclidean distance.
Compute each distance from q = (3, 3) (we can compare squared distances — the
ordering is the same, and it avoids square roots):
d(q,A)² = (3−1)² + (3−2)² = 4 + 1 = 5 → d ≈ 2.24
d(q,B)² = (3−2)² + (3−3)² = 1 + 0 = 1 → d = 1.00
d(q,C)² = (3−5)² + (3−5)² = 4 + 4 = 8 → d ≈ 2.83
d(q,D)² = (3−6)² + (3−4)² = 9 + 1 = 10 → d ≈ 3.16
Sort by distance: B (1.00) < A (2.24) < C (2.83) < D (3.16). The 3 nearest are
B(+), A(+), C(−). Vote: 2 (+) vs 1 (−) → predict +.
(Sanity check: with k = 1 we’d use only B, also +. With k = 5 we’d be
forced to look beyond the four points we have — k must not exceed the dataset
size.)
Quick check
Quick check
Practice this in an interview
All questionsKNN stores the entire training set and defers all computation to prediction time: for a new point it finds the k closest training examples by distance, then returns the majority class (classification) or mean value (regression). It is called lazy because there is no training phase — the model is the data itself.
K-means partitions n points into k clusters by alternating between two steps: assigning each point to its nearest centroid, then recomputing each centroid as the mean of its assigned points. It repeats until assignments stop changing, which guarantees convergence but not a globally optimal solution.
In high-dimensional spaces all pairwise distances concentrate around the same value, so the concept of a 'nearest' neighbour breaks down — the k-th nearest neighbour is almost as far as every other point. KNN's accuracy degrades sharply as dimensionality increases unless the data has much lower intrinsic dimensionality.
K-means++ initialises centroids by probabilistically spacing them apart: the first centroid is chosen uniformly at random, and each subsequent centroid is chosen with probability proportional to its squared distance from the nearest already-chosen centroid. This reduces the chance of bad starts, cuts the number of iterations to convergence, and provides an O(log k) approximation guarantee on the final inertia.