Explain how gradient boosting fits residuals. What role does the learning rate play?
Gradient boosting builds an additive model by fitting each new tree to the negative gradient of the loss with respect to the current ensemble's predictions — effectively the residuals for squared error. The learning rate shrinks each tree's contribution, keeping the ensemble from over-correcting and acting as a regulariser.
How to think about it
The core algorithm
Start with a constant prediction F₀ (e.g., the mean for regression). At each iteration m:
- Compute the pseudo-residuals — the negative gradient of the loss L with respect to the current predictions:
r_i = -∂L(y_i, F_{m-1}(x_i)) / ∂F_{m-1}(x_i)
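For squared error written as L = ½(y − F)², this is exactly r_i = y_i - F_{m-1}(x_i), the familiar residual. Without the ½ factor the gradient is the same residual scaled by 2, which does not change which tree is fitted.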
- Fit a shallow tree h_m to the pseudo-residuals.
- Update the ensemble:
F_m(x) = F_{m-1}(x) + η · h_m(x)
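The three steps above can be sketched from scratch for the squared-error case. This is a minimal illustration, not how any particular library implements boosting: the function names `fit_gradient_boosting` and `predict_gradient_boosting` and the toy sine data are assumptions made for the example, and scikit-learn's `DecisionTreeRegressor` stands in for the weak learner h_m.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Fit a squared-error boosting ensemble; return (F0, list of trees)."""
    f0 = float(np.mean(y))                   # F_0: constant initial prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                 # pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)               # h_m is fitted to the residuals, not to y
        pred = pred + eta * tree.predict(X)  # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees


def predict_gradient_boosting(X, f0, trees, eta=0.1):
    """Sum the initial constant and the shrunken output of every tree."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred


# Toy data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

f0_demo, trees_demo = fit_gradient_boosting(X_demo, y_demo)
fitted = predict_gradient_boosting(X_demo, f0_demo, trees_demo)
mse_start = np.mean((y_demo - f0_demo) ** 2)
mse_end = np.mean((y_demo - fitted) ** 2)
print(f"training MSE: {mse_start:.3f} -> {mse_end:.3f}")
```

Note that the same eta must be used at prediction time as at fit time, since the shrinkage is part of the model rather than a property of the individual trees.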
The process works for any differentiable loss (log-loss for classification, MAE, Huber, etc.) because gradient descent is applied in function space, not parameter space.
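To make the "any differentiable loss" point concrete, here is a small sketch of the pseudo-residuals for a few common losses. The function names and the sample arrays are illustrative assumptions; only the loss changes, while the fit-a-tree-and-update loop stays the same.

```python
import numpy as np


def neg_grad_squared_error(y, f):
    # L = 0.5 * (y - f)^2  ->  -dL/df = y - f
    return y - f


def neg_grad_absolute_error(y, f):
    # L = |y - f|  ->  -dL/df = sign(y - f)
    return np.sign(y - f)


def neg_grad_huber(y, f, delta=1.0):
    # Quadratic near zero, linear in the tails:
    # the negative gradient is the residual clipped to [-delta, delta].
    return np.clip(y - f, -delta, delta)


def neg_grad_log_loss(y, f):
    # y in {0, 1}; f is the raw log-odds score and p = sigmoid(f).
    # L = -[y log p + (1 - y) log(1 - p)]  ->  -dL/df = y - p
    p = 1.0 / (1.0 + np.exp(-f))
    return y - p


y_reg = np.array([3.0, -1.0, 0.5])
f_reg = np.array([1.0, 1.0, 0.0])
print(neg_grad_squared_error(y_reg, f_reg))   # residuals
print(neg_grad_absolute_error(y_reg, f_reg))  # only the sign of each residual
print(neg_grad_huber(y_reg, f_reg))           # residuals, capped at +/- delta

y_clf = np.array([1.0, 0.0, 1.0])
f_clf = np.zeros(3)                           # log-odds 0 means p = 0.5
print(neg_grad_log_loss(y_clf, f_clf))
```

For log-loss the trees are fitted in log-odds space, so the pseudo-residual y - p is a "probability residual" rather than a residual on the labels themselves.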
Learning rate η (shrinkage)
η ∈ (0, 1] scales down each tree’s contribution. Smaller η means:
- The ensemble moves more conservatively toward the minimum
- More trees are needed to achieve the same training loss
- Generalisation usually improves, because no single tree can push the fit far toward noise in the training data
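The interaction: n_estimators and the learning rate (learning_rate in scikit-learn, eta in XGBoost) must be tuned jointly. As a rule of thumb, halving the learning rate roughly doubles the number of trees needed to reach the same training loss.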
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,  # small lr → need more trees
    max_depth=3,         # shallow trees keep variance low
    subsample=0.8,       # stochastic gradient boosting
    random_state=42,
)
# X_train, y_train: your training features and targets
gbr.fit(X_train, y_train)
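The joint tuning of the learning rate and n_estimators can be checked empirically. The sketch below is one way to do it, under assumptions: the data is synthetic (`make_regression`), the helper `training_mse_curve` is a name made up for this example, and `staged_predict` is used to read off the training loss after every tree. The exact numbers depend on the data, so treat the "halve one, double the other" pattern as approximate.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data, for illustration only.
X_lr, y_lr = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)


def training_mse_curve(learning_rate, n_estimators):
    """Return the training MSE after each boosting stage."""
    model = GradientBoostingRegressor(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=3,
        random_state=0,
    )
    model.fit(X_lr, y_lr)
    # staged_predict yields the ensemble's predictions after 1, 2, ..., M trees.
    return [mean_squared_error(y_lr, pred) for pred in model.staged_predict(X_lr)]


curve_fast = training_mse_curve(learning_rate=0.1, n_estimators=100)
curve_slow = training_mse_curve(learning_rate=0.05, n_estimators=200)

print(f"eta=0.10, 100 trees: {curve_fast[-1]:.1f}")
print(f"eta=0.05, 100 trees: {curve_slow[99]:.1f}")   # lags behind at equal tree count
print(f"eta=0.05, 200 trees: {curve_slow[-1]:.1f}")   # catches up with twice the trees
```

Training loss is used here only because that is what the rule of thumb refers to; for choosing η in practice, the same curves would be computed on held-out data.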
Stochastic gradient boosting: setting subsample < 1 fits each tree on a random fraction of the training rows, drawn without replacement (similar in spirit to bagging). This typically reduces variance, and with it overfitting, and often improves accuracy.
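A side effect of row subsampling is that every tree has held-out rows it never saw. In scikit-learn these are exposed through the `oob_improvement_` attribute, which is only available when subsample < 1. The sketch below uses it to get a rough estimate of a useful number of trees; the synthetic data and variable names are assumptions, and the out-of-bag estimate tends to be pessimistic, so it is a cheap heuristic rather than a substitute for validation data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, for illustration only.
X_sub, y_sub = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

sgb = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,   # each tree is fitted on a random 80% of the rows
    random_state=0,
)
sgb.fit(X_sub, y_sub)

# oob_improvement_[m] is the change in loss on the left-out rows at stage m.
cumulative_oob = np.cumsum(sgb.oob_improvement_)
best_stage = int(np.argmax(cumulative_oob)) + 1
print(f"OOB estimate of a useful number of trees: {best_stage}")
```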