In KNN, how do you choose k, and how does the curse of dimensionality affect it?

Choose k by cross-validation, balancing a small k (low bias, high variance, noisy) against a large k (smoother, higher bias); an odd k avoids ties in binary classification. The curse of dimensionality hurts KNN because in high dimensions all points become nearly equidistant, so 'nearest' loses meaning and accuracy degrades. Dimensionality reduction or feature selection mitigates this; feature scaling is a separate requirement that keeps any single feature from dominating the distance.
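Distance concentration is easy to see numerically. The sketch below assumes uniform random points in the unit hypercube and measures how far the farthest point is from a query relative to the nearest one; `relative_contrast` is a hypothetical helper name chosen for illustration, not a library function.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min from one random query to n_points random points."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, n_dims))   # assumption: uniform data in [0, 1)^d
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

for d in (2, 10, 100, 1000):
    print(f"dims={d:5d}  relative contrast={relative_contrast(1000, d):.3f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, meaning the nearest and farthest points sit at almost the same distance. Real data with correlated features usually concentrates less severely than this uniform case.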

Why is KNN called a lazy learner, and what are the practical tradeoffs at inference time?

KNN is lazy because it does no real training; it just stores the training data and defers all computation to prediction time, when it searches for the nearest neighbors. The tradeoff is near-zero training cost but slow, memory-heavy inference: a brute-force query compares the new point against every stored example, so cost grows with both dataset size and dimensionality. Approximate nearest-neighbor indexes and dimensionality reduction make it practical at scale.
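One way to reduce query cost without leaving scikit-learn is to swap the brute-force scan for a spatial index. The sketch below uses the `algorithm` parameter of `KNeighborsClassifier`; note that `kd_tree` is an exact index, not an approximate one (approximate indexes come from separate libraries), and the synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 3))                        # low-dimensional synthetic data
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)   # illustrative labels
X_query = rng.normal(size=(200, 3))

# "brute" scans every stored point per query; "kd_tree" builds an index at fit time.
brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X_train, y_train)
tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X_train, y_train)

pred_brute = brute.predict(X_query)
pred_tree = tree.predict(X_query)
same = bool(np.array_equal(pred_brute, pred_tree))
print("same predictions:", same)
```

Both settings return the same neighbors because both are exact; the index only changes how the search is carried out. Tree indexes tend to help most in low dimensions and lose their advantage as dimensionality grows.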

How does k-nearest neighbours work, and why is it called a lazy learner?

KNN stores the entire training set and defers all computation to prediction time: for a new point it finds the k closest training examples by distance, then returns the majority class (classification) or mean value (regression). It is called lazy because there is no training phase — the model is the data itself.

What's the difference between k-means and k-nearest neighbors? People confuse them.

K-means is an unsupervised clustering algorithm that partitions unlabeled data into k groups by iteratively updating centroids. KNN is a supervised algorithm that classifies or predicts a new point using the labels of its k closest training points. They share the letter k and the use of distances but solve completely different problems.
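The difference shows up directly in how the two are fitted. In this sketch, built on two synthetic blobs that are an assumption for illustration, `KMeans` is fitted without labels and invents its own cluster ids, while `KNeighborsClassifier` needs the labels and returns one of them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(-3.0, 0.0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(3.0, 0.0), scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])
y = np.array(["A"] * 50 + ["B"] * 50)

# Unsupervised: k is the number of clusters, and y is never seen.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_            # arbitrary ids 0/1, not "A"/"B"

# Supervised: k is the number of neighbors that vote, and y is required.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
prediction = knn.predict([[2.5, 0.2]])[0]

print("k-means cluster ids:", sorted(set(cluster_ids.tolist())))
print("k-NN prediction:", prediction)
```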

K-nearest neighbors — Machine Learning

Most models learn a function. k-nearest neighbors learns nothing: it just memorizes the training set and, when a new point arrives, looks at its k closest neighbors and takes a majority vote. It’s one of the simplest classifiers in ML, and a perfect vehicle for seeing decision boundaries and the bias-variance tradeoff in action.

Store, then vote

There is no training phase. To predict a new point: compute its distance to every training point, take the k nearest, and return their majority class (for regression, their average). That’s it.

The picture below is one prediction: a new point (the ?) and the dashed circle holding its 5 nearest neighbors. Three belong to class A, two to class B — so the vote, and the prediction, is A.

k-NN at prediction time: take the k closest points and let them vote. Here k=5 gives 3 A vs 2 B → class A.
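The store-then-vote procedure fits in a few lines of NumPy. This is a minimal sketch, assuming Euclidean distance and a plain majority vote with no special tie handling; `knn_predict` and the toy coordinates are made up for illustration and arranged so that the 5 nearest neighbors split 3 A to 2 B.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=5):
    """Return (majority label, vote counts) among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0], dict(votes)

# Toy data (hypothetical): three A and two B points near the origin, three B far away.
X_train = [[0.5, 0.0], [0.0, 0.6], [-0.7, 0.0],
           [0.8, 0.8], [-0.9, -0.9],
           [5.0, 5.0], [-5.0, 4.0], [6.0, -5.0]]
y_train = ["A", "A", "A", "B", "B", "B", "B", "B"]

label, votes = knn_predict(X_train, y_train, query=[0.0, 0.0], k=5)
print(label, votes)
```

Class B has more training points overall, but only the five closest get a vote, so the prediction is A.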

k is the bias-variance dial

The choice of k is the whole game, and it maps directly onto bias and variance:

Small k (k=1): the boundary wraps every single point. Training error is zero (each point is its own nearest neighbor), but the model memorizes noise: high variance, overfit, jagged.
Large k: each prediction averages over many points, smoothing the boundary. Too large and far-away points outvote the locals: high bias, underfit.

You tune k by cross-validation, and an odd k avoids ties in binary classification.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=800, n_features=10, n_informative=5, random_state=0)

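# Sweep k and record mean 5-fold CV accuracy. Scaling sits inside the pipeline,
# so each fold is standardized using only its own training split.
results = {}
print(f"{'k':>3} {'CV accuracy':>12}")
for k in [1, 3, 5, 11, 25, 101]:
    clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{k:3d} {results[k]:12.3f}")

# Report the best k from the scores instead of asserting the outcome in advance.
best_k = max(results, key=results.get)
print(f"\nBest k by CV accuracy: {best_k}")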
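To see the k=1 overfitting directly, compare training accuracy with held-out accuracy. This sketch reuses the same synthetic dataset settings as the sweep above; the 75/25 split and the comparison value k=15 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=10, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for k in (1, 15):
    clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    clf.fit(X_tr, y_tr)
    # (training accuracy, held-out accuracy) for this k
    scores[k] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"k={k:2d}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbor, so training accuracy is perfect by construction; the gap to held-out accuracy is the variance the lesson describes.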

In one breath

k-NN is a lazy learner: no training — it stores the data and, at prediction time, finds the k nearest points and takes their majority vote (average for regression).
k is the bias-variance dial: small k (k=1) wraps every point → overfit / high variance; large k averages too widely → underfit / high bias; tune by cross-validation (an odd k avoids ties).
It’s pure distance, so you must standardize: an unscaled feature with a large range dominates the distance, and the rest contribute almost nothing.
It breaks in high dimensions (distance concentration — the curse), where “nearest” stops being meaningful.
A great baseline for small, low-dimensional, locally-irregular problems; but brute-force prediction is slow (it scans all points) and memory-heavy, so tree-based models are usually the more practical choice at scale.
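The standardization point in the summary can be checked with three points. In this sketch the features (income in dollars, age in years) and their values are hypothetical; the query is close in both income and age to the second point, yet the raw distance picks the first one because the income column swamps the age column.

```python
import numpy as np

# Hypothetical features: [income in dollars, age in years]
X = np.array([[50_000.0, 25.0],
              [50_500.0, 60.0],
              [58_000.0, 26.0]])
query = np.array([50_200.0, 61.0])

# Raw Euclidean distance: income differences are orders of magnitude larger than age.
raw = np.linalg.norm(X - query, axis=1)

# Standardize with the training mean and std, then apply the same transform to the query.
mean, std = X.mean(axis=0), X.std(axis=0)
scaled = np.linalg.norm((X - mean) / std - (query - mean) / std, axis=1)

print("nearest by raw distance:   ", int(raw.argmin()))
print("nearest by scaled distance:", int(scaled.argmin()))
```

After scaling, a 35-year age gap counts about as much as a few thousand dollars of income, so the nearest neighbor switches from the 25-year-old to the 60-year-old.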

Quick check

Q1. What happens during k-NN 'training'?

Q2. You set k=1 and get perfect training accuracy but poor test accuracy. Why?

Q3. Why must you scale features before using k-NN?

k-NN votes by distance; Naive Bayes votes by probability — a fast, surprisingly strong probabilistic baseline, especially for text.
