The Bias-Variance Trade-off
Total error splits into bias, variance, and irreducible noise. Reduce one and you usually raise the other — the conceptual backbone of the ML section.
What you'll learn
- Bias = error from over-simplifying (underfitting); variance = sensitivity to the training set (overfitting)
- Total expected error ≈ bias² + variance + irreducible noise
- The trade-off: lowering one term often raises the other
- What lowers variance (more data, regularization, simpler models, bagging) vs what lowers bias
- Reading the U-shaped error-vs-complexity curve
Before you start
Every supervised model makes errors for three different reasons, and separating them is the single most useful lens in machine learning. Bias is the error from a model that is too simple to capture the truth — it underfits. Variance is the error from a model so flexible that it chases the quirks of this particular training set — it overfits. The third piece, irreducible noise, is randomness in the data itself that no model can remove.
The catch is that bias and variance pull in opposite directions: make a model more flexible and its bias drops but its variance climbs. You cannot drive both to zero; you find the sweet spot.
The decomposition and the U-curve
For squared-error loss, expected prediction error decomposes into three additive pieces:
Expected error ≈ bias² + variance + irreducible noise
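For readers who want the formula behind the three terms, here is a sketch of the standard decomposition under squared-error loss. The notation is an assumption of this sketch and does not come from the lesson: the data follow y = f(x) + ε with zero-mean noise of variance σ², and \hat f_D denotes the model fitted on a random training set D.

```latex
\begin{aligned}
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
  &= \underbrace{\big(f(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{irreducible noise}}
\end{aligned}
```

The cross terms drop out because ε has zero mean and is independent of the training set, which is why the three pieces simply add.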
As model complexity grows, bias falls (the model can express more) while variance rises (it reacts more to the training sample). Their sum is U-shaped, and the lowest point of the U marks the best-generalising model.
The handles you can pull, and which term they move:
- Lowers variance: more training data, regularization (e.g. ridge’s λ), simpler models, bagging / averaging ensembles.
- Lowers bias: more expressive models (higher-degree polynomials, deeper trees), adding informative features.
Almost every lever that cuts one term raises the other; more training data is the main exception, since it lowers variance without adding bias. That tension is the trade-off.
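To see one of these levers in action, the sketch below simulates ridge regression on a toy one-feature problem and measures the bias² and variance of the fitted slope for several values of λ. Everything in it is an assumption made for illustration: the true slope of 2.0, the noise level, the fixed inputs, and the helper names `ridge_slope` and `bias_variance_by_lambda`. It measures the slope estimate, not the full prediction error.

```python
import random
import statistics


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a one-feature model y ~ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def bias_variance_by_lambda(lambdas, w_true=2.0, noise_sd=1.0, trials=2000, seed=0):
    """Return {lambda: (bias^2, variance)} of the slope estimate over many training sets."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(-10, 11)]  # fixed inputs; only the noise is redrawn
    estimates = {lam: [] for lam in lambdas}
    for _ in range(trials):
        ys = [w_true * x + rng.gauss(0.0, noise_sd) for x in xs]
        for lam in lambdas:
            estimates[lam].append(ridge_slope(xs, ys, lam))
    results = {}
    for lam, ws in estimates.items():
        bias_sq = (statistics.fmean(ws) - w_true) ** 2
        variance = statistics.pvariance(ws)
        results[lam] = (bias_sq, variance)
    return results


results = bias_variance_by_lambda([0.0, 1.0, 10.0])
for lam, (bias_sq, variance) in results.items():
    print(f"lambda={lam:5.1f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In this setup a larger λ shrinks the slope toward zero, so the bias² column should grow while the variance column shrinks.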
How GATE asks this
Reliably an MCQ or MSQ on the direction of an effect: “increasing model complexity does what to bias and variance?”, or “which of the following reduce variance?” The answer pattern is fixed: more complexity means lower bias and higher variance; regularization means lower variance and higher bias; more data means lower variance with bias essentially unchanged. The same idea also appears inside ridge questions (larger λ → more bias, less variance) and in overfitting/underfitting questions in nearly every paper.
Worked example
Compare two extreme models on the same data:
- A high-degree polynomial can bend through almost every training point. It has low bias (flexible enough to capture the true shape) but high variance — shift the training set slightly and the wiggly curve changes drastically. This is the overfitting corner (right side of the U-curve).
- A constant predictor (always outputs the mean) ignores the inputs entirely. It has high bias (it cannot represent any real structure) but low variance — it barely changes when the training set changes. This is the underfitting corner (left side).
Neither extreme generalises well. The best model sits in between, where the rising variance and falling bias sum to the smallest total error.
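The comparison above can be checked with a small simulation. The sketch below fits polynomials of degree 0 (the constant predictor), 3, and 9 to many noisy training sets and estimates bias² and variance for each. The sine target, the noise level, the sample size, and the function names are all assumptions chosen for illustration; the lesson does not specify a data set.

```python
import numpy as np


def true_fn(x):
    """Assumed ground-truth function for this illustration."""
    return np.sin(2 * np.pi * x)


def bias_variance_by_degree(degrees, n_train=15, noise_sd=0.3, trials=500, seed=0):
    """Return {degree: (bias^2, variance)}, averaged over a test grid on [0, 1]."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed inputs; only the noise is redrawn
    x_test = np.linspace(0.0, 1.0, 101)
    preds = {d: np.empty((trials, x_test.size)) for d in degrees}
    for t in range(trials):
        y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, n_train)
        for d in degrees:
            poly = np.polynomial.Polynomial.fit(x_train, y_train, d)
            preds[d][t] = poly(x_test)
    out = {}
    for d in degrees:
        mean_pred = preds[d].mean(axis=0)          # average fit across training sets
        bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
        variance = np.mean(preds[d].var(axis=0))   # spread of fits across training sets
        out[d] = (float(bias_sq), float(variance))
    return out


for degree, (bias_sq, variance) in bias_variance_by_degree([0, 3, 9]).items():
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Under these assumptions the constant model should show the largest bias² and the smallest variance, the degree-9 model the reverse, and the middle degree the smallest sum.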
Now put numbers on the decomposition. Suppose a tuned model has bias = 0.2, variance = 0.05, and irreducible noise = 0.01. Then
total error = bias² + variance + noise
            = 0.2² + 0.05 + 0.01
            = 0.04 + 0.05 + 0.01
            = 0.10
so the expected error is 0.10. (Note it is bias², not bias, that enters the sum.)
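The same arithmetic as a tiny function, which makes the squared-bias step hard to miss. The function name `total_expected_error` is a hypothetical helper introduced here, not something from the lesson.

```python
def total_expected_error(bias, variance, noise):
    """Squared bias plus variance plus irreducible noise."""
    return bias ** 2 + variance + noise


total = total_expected_error(bias=0.2, variance=0.05, noise=0.01)
print(round(total, 4))  # 0.1

# The common mistake: adding bias without squaring it.
wrong = 0.2 + 0.05 + 0.01
print(round(wrong, 4))  # 0.26
```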
Quick check
Practice this in an interview
A model's expected test error splits into bias (error from over-simplified assumptions, causing underfitting), variance (sensitivity to the particular training sample, causing overfitting), and irreducible noise. Adding complexity lowers bias but raises variance, so the best model minimises their sum on unseen data, not the training error.
Bagging trains many independent models on bootstrap samples in parallel and averages their predictions, primarily reducing variance. Boosting trains models sequentially, each correcting the errors of its predecessor, primarily reducing bias.
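A minimal sketch of the variance-reduction claim for bagging, using only the standard library. It takes a deliberately high-variance base learner (a 1-nearest-neighbour regressor), predicts at a single query point, and compares how much that prediction varies across training sets with and without bagging. The target function, noise level, query point, and all helper names are assumptions of this sketch.

```python
import math
import random
import statistics


def nearest_neighbor_predict(points, x0):
    """1-NN regressor: return the y of the training point closest to x0."""
    return min(points, key=lambda p: abs(p[0] - x0))[1]


def bagged_predict(points, x0, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        boot = rng.choices(points, k=len(points))  # sample with replacement
        preds.append(nearest_neighbor_predict(boot, x0))
    return statistics.fmean(preds)


def compare_variance(trials=300, n_train=30, n_models=25, noise_sd=0.5, x0=0.5, seed=0):
    """Variance of the prediction at x0 across training sets: (single, bagged)."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(trials):
        points = []
        for _ in range(n_train):
            x = rng.random()
            y = math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_sd)
            points.append((x, y))
        single.append(nearest_neighbor_predict(points, x0))
        bagged.append(bagged_predict(points, x0, n_models, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)


var_single, var_bagged = compare_variance()
print(f"single model variance: {var_single:.4f}")
print(f"bagged model variance: {var_bagged:.4f}")
```

Averaging over bootstrap samples spreads the prediction across several nearby training points instead of one, which is where the drop in variance comes from in this toy case.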
Multicollinearity occurs when two or more predictors are highly linearly correlated, inflating the variance of coefficient estimates and making them numerically unstable and uninterpretable. The Variance Inflation Factor (VIF) quantifies how much each coefficient's variance is inflated relative to an orthogonal design.
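As a rough illustration, VIF can be computed from the correlation matrix of the predictors: the j-th diagonal entry of its inverse equals 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the others. The synthetic data and the helper name `variance_inflation_factors` below are assumptions made for this sketch.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                 # unrelated to x1
x3 = x1 + 0.1 * rng.normal(size=n)      # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(X)
for name, value in zip(["x1", "x2", "x3"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

Here x1 and x3 should show large VIFs because each is almost a linear function of the other, while x2 should stay close to 1.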
λ is a bias-variance trade-off knob: too low leaves the model overfit (high variance); too high over-regularizes and underfits (high bias). The standard approach is k-fold cross-validation over a logarithmic grid of λ values, minimizing held-out loss.
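The tuning procedure can be sketched in plain Python for a one-feature ridge model. The data set, the fold assignment (every k-th point), the grid from 10⁻³ to 10³, and the helper names are all assumptions for illustration; a real project would more likely use a library routine for this.

```python
import random


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a one-feature model y ~ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def kfold_cv_loss(xs, ys, lam, k=5):
    """Mean squared error on held-out folds for a given lambda."""
    n = len(xs)
    total_sq_err = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point belongs to this fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = ridge_slope(train_x, train_y, lam)
        total_sq_err += sum((ys[i] - w * xs[i]) ** 2 for i in held_out)
    return total_sq_err / n


def select_lambda(xs, ys, grid, k=5):
    """Return the lambda with the lowest cross-validated loss, plus all losses."""
    losses = {lam: kfold_cv_loss(xs, ys, lam, k) for lam in grid}
    best = min(losses, key=losses.get)
    return best, losses


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
grid = [10.0 ** e for e in range(-3, 4)]  # logarithmic grid: 0.001 ... 1000

best_lam, losses = select_lambda(xs, ys, grid)
for lam in grid:
    print(f"lambda={lam:8.3f}  cv_mse={losses[lam]:.4f}")
print("selected lambda:", best_lam)
```

The held-out loss, not the training loss, drives the choice, so a λ that over-regularizes or under-regularizes is penalized where it matters.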