datarekha
Machine Learning · Medium · Asked at Google, DeepMind, Amazon

What is generalization in machine learning, and what factors determine how well a model generalizes?

The short answer

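Generalization is a model's ability to perform well on unseen data. In the standard setting, that data is drawn from the same distribution as the training set. How well a model generalizes depends on the interplay of model capacity, dataset size, regularization, inductive bias, and data quality; in deployment it also depends on how far the production distribution has shifted from the training distribution.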

How to think about it

A model that achieves low training loss but high test loss has failed to generalize: it has memorized training particulars rather than learning the underlying mapping. The difference between the two losses is the generalization gap, and measuring it requires a held-out test set the model has never seen during training or tuning.
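To make the train/test gap concrete, here is a minimal sketch using only the standard library. The synthetic task (a threshold at 0.5 with 20% label noise), the 1-nearest-neighbour "memorizer", and all function names are assumptions chosen for illustration, not part of the original answer.

```python
import random

def make_data(n, rng, noise=0.2):
    """Sample (x, y) pairs: y = 1 if x > 0.5, then flip each label with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # label noise that a memorizing model will faithfully store
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x (pure memorization)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def error_rate(train, data):
    """Fraction of `data` that the 1-NN model built on `train` misclassifies."""
    wrong = sum(predict_1nn(train, x) != y for x, y in data)
    return wrong / len(data)

rng = random.Random(0)           # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(200, rng)       # held out: never used to build the model

train_err = error_rate(train, train)
test_err = error_rate(train, test)
print(f"train error = {train_err:.3f}, test error = {test_err:.3f}")
print(f"generalization gap = {test_err - train_err:.3f}")
```

Because 1-NN looks up each training point's own label, its training error is zero even though a fifth of the labels are noise; only the held-out error reveals how much of what it stored was signal.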

VC generalization bound (informal, PAC-style): for a hypothesis class of VC dimension h and n > h i.i.d. training examples, with probability at least 1 - δ, every hypothesis in the class satisfies:

generalization_error ≤ training_error + O(sqrt((h * log(n/h) + log(1/δ)) / n))

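The bound tightens as n grows and loosens as h (model complexity) grows. Regularization does not change the VC dimension of the class itself, but it restricts the hypotheses the learner effectively considers, which acts like a smaller h.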
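The behaviour of the complexity term in the bound can be checked numerically. The sketch below evaluates sqrt((h * log(n/h) + log(1/δ)) / n) with the constant hidden by O(·) taken to be 1, which is an assumption made purely for illustration; the function name `vc_complexity_term` is likewise hypothetical.

```python
import math

def vc_complexity_term(n, h, delta=0.05):
    """Complexity term of the informal VC bound, with the O() constant assumed to be 1."""
    if not (n > h > 0):
        raise ValueError("expected n > h > 0 so that log(n/h) is positive")
    if not (0 < delta < 1):
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt((h * math.log(n / h) + math.log(1 / delta)) / n)

# More data with fixed complexity: the term shrinks.
for n in (1_000, 10_000, 100_000):
    print(f"h=10,  n={n:>7}: {vc_complexity_term(n, h=10):.4f}")

# More complexity with fixed data: the term grows.
for h in (10, 100, 500):
    print(f"n=1000, h={h:>4}: {vc_complexity_term(1_000, h):.4f}")
```

The absolute values are not meaningful because the constant is dropped; only the trends in n and h are.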

Key factors governing generalization:

1. Model capacity vs. dataset size: More parameters relative to training examples means a higher risk of memorization. A 175B-parameter LLM can generalize when trained on hundreds of billions to trillions of tokens; the same model trained from scratch on 10K examples would overfit catastrophically.

2. Regularization: L1/L2 weight penalties, dropout, batch normalization, data augmentation, and early stopping all constrain effective capacity and improve generalization.

3. Distribution shift: Standard generalization guarantees assume test data is drawn i.i.d. from the training distribution. Covariate shift (P(x) changes while P(y|x) stays fixed), label shift (P(y) changes while P(x|y) stays fixed), and concept drift (P(y|x) changes) all break this assumption and require domain adaptation or retraining strategies.

4. Inductive bias: Architectural choices that match problem structure improve generalization by shrinking the hypothesis space the learner must search. CNNs typically generalize better on images than fully connected networks of equal parameter count because convolutional weight sharing encodes translation equivariance, and pooling adds approximate translation invariance.

5. Data quality: Label noise, feature noise, and non-representative sampling all lower the signal-to-noise ratio of the training data, hurting generalization independently of capacity.
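Of the regularizers listed under factor 2, early stopping is the simplest to show in isolation. The sketch below picks the stopping epoch from a sequence of validation losses; the function name and the `patience` and `min_delta` parameters are illustrative assumptions, and the loss values are made up for the example rather than taken from a real training run.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, best_loss), stopping once the validation loss has
    failed to improve by more than `min_delta` for `patience` epochs in a row."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; later epochs are never evaluated
    return best_epoch, best_loss

# Made-up validation curve: improves, then starts to rise as the model overfits.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.58, 0.60, 0.65, 0.50]
best_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)
print(f"restore weights from epoch {best_epoch} (val loss {best_loss})")
```

In practice the weights saved at `best_epoch` are the ones kept. The final 0.50 in the list is never reached, which illustrates the trade-off that `patience` controls.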

Diagnosing generalization failure:

  • Train error low, val error high → overfit; increase regularization or data
  • Train error high, val error high → underfit; increase capacity, train longer, or reduce regularization
  • Val error low but prod error high → distribution shift; audit data pipeline
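
The three diagnostic rules can be written as a small decision function. This is only a sketch: the thresholds `target_err` and `gap_tol` are arbitrary assumptions that would need to be set per problem, and the function name is hypothetical.

```python
def diagnose(train_err, val_err, prod_err=None, target_err=0.10, gap_tol=0.05):
    """Map error rates to a coarse diagnosis, following the rules above.

    target_err: training error above this counts as "high" (assumed threshold).
    gap_tol:    largest gap between two error rates still treated as noise (assumed).
    """
    if train_err > target_err:
        return "underfit"              # the model cannot even fit the training set
    if val_err - train_err > gap_tol:
        return "overfit"               # fits training data but not held-out data
    if prod_err is not None and prod_err - val_err > gap_tol:
        return "distribution shift"    # held-out data looks fine, production does not
    return "ok"

print(diagnose(train_err=0.02, val_err=0.25))                 # overfit
print(diagnose(train_err=0.30, val_err=0.32))                 # underfit
print(diagnose(train_err=0.03, val_err=0.05, prod_err=0.30))  # distribution shift
```

The checks run in order, so a model that underfits is reported as such even if its validation error is also high.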