What is L2 regularisation (weight decay), and how does it reduce overfitting?
L2 regularisation adds a penalty equal to the sum of squared weights multiplied by a coefficient λ to the loss function, which encourages the optimiser to keep weights small. This penalises large, specialised weights and pushes the model toward simpler solutions that generalise better. In SGD it is equivalent to shrinking weights by a constant factor each step (hence weight decay), though in Adam the two diverge — requiring AdamW for correct decoupled decay.
How to think about it
L2 regularisation is the most common explicit regulariser in deep learning. It directly penalises weight magnitude, discouraging the network from memorising training-set idiosyncrasies.
The augmented loss
L_reg = L_task + (λ/2) · ||θ||²
Taking the gradient with respect to weights θ:
∂L_reg/∂θ = ∂L_task/∂θ + λ·θ
The extra term λ·θ pushes gradients to pull weights toward zero on every update.
SGD update: weight decay is literal
For SGD, adding the L2 term is identical to multiplying each weight by (1 − α·λ) before the gradient step — hence the name “weight decay”:
θ_t = θ_{t-1}(1 − α·λ) − α·∇L_task
Adam: why L2 ≠ weight decay
In Adam, gradients are scaled by the adaptive term 1/√v̂. The L2 gradient λ·θ is also scaled by this term, so frequently-updated parameters receive a different effective decay than sparse ones. AdamW fixes this by decoupling the decay and applying it directly to the parameter:
import torch
model = torch.nn.Linear(256, 10)
# Adam + L2 via loss (NOT correct decoupled decay)
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# AdamW: decoupled weight decay (preferred)
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
Choosing λ
Typical range: 1e-4 to 1e-2. Too large and the model underfits; too small and there is no regularisation benefit. Tune with a logarithmic sweep on validation loss.
L2 vs L1
L1 (Lasso) adds λ·|θ|, which produces sparse weights (many exactly zero). L2 distributes weights evenly and is generally preferred for neural networks where sparsity is not the goal.