Machine Learning · Medium · Asked at Google, Amazon, Meta, Uber, Lyft, Stripe

Explain how gradient boosting fits residuals. What role does the learning rate play?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Gradient boosting builds an additive model by fitting each new tree to the negative gradient of the loss with respect to the current ensemble's predictions — effectively the residuals for squared error. The learning rate shrinks each tree's contribution, keeping the ensemble from over-correcting and acting as a regulariser.

How to think about it

The core algorithm

Start with a constant prediction F₀ (e.g., the mean for regression). At each iteration m:

1. Compute the pseudo-residuals, the negative gradient of the loss L with respect to the current predictions:

   r_i = -∂L(y_i, F_{m-1}(x_i)) / ∂F_{m-1}(x_i)

   For squared-error loss written as L = ½(y_i - F(x_i))², this simplifies exactly to r_i = y_i - F_{m-1}(x_i), the familiar residual. Without the ½ factor the gradient picks up a constant 2, which only rescales the step.

2. Fit a shallow tree h_m to the pseudo-residuals.

3. Update the ensemble:

   F_m(x) = F_{m-1}(x) + η · h_m(x)
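The three steps above can be sketched from scratch for squared-error loss. This is a minimal illustration, not how any particular library implements boosting; the function names `fit_gradient_boosting` and `predict_gradient_boosting` and the synthetic sine data are assumptions made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Fit a squared-error boosting ensemble; returns (F_0, list of trees)."""
    f0 = float(np.mean(y))               # F_0: constant initial prediction
    pred = np.full(len(y), f0)           # F_{m-1}(x_i) on the training rows
    trees = []
    for _ in range(n_trees):
        residuals = y - pred             # pseudo-residuals for L = 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)           # h_m approximates the negative gradient
        pred = pred + eta * tree.predict(X)   # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees


def predict_gradient_boosting(X, f0, trees, eta=0.1):
    """Sum the shrunken tree outputs on top of the constant F_0."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred


# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

f0, trees = fit_gradient_boosting(X_demo, y_demo, n_trees=50, eta=0.1)
mse_start = np.mean((y_demo - f0) ** 2)
mse_end = np.mean(
    (y_demo - predict_gradient_boosting(X_demo, f0, trees, eta=0.1)) ** 2
)
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

The same `eta` has to be used at fit and predict time, since the ensemble is defined as the sum of shrunken trees.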

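The same procedure works for any differentiable loss (log-loss for classification, Huber, etc.), and for MAE via a subgradient, because gradient descent is applied in function space, not parameter space: each tree is a step direction fitted to approximate the negative gradient at the training points.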
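To make the "any loss" point concrete, here is a small sketch of the pseudo-residuals for a few common losses. The helper name `pseudo_residuals` and its `loss` and `delta` arguments are hypothetical, chosen for this example; for log-loss the sketch assumes labels in {0, 1} and that `f` holds raw log-odds.

```python
import numpy as np


def pseudo_residuals(y, f, loss="squared", delta=1.0):
    """Negative gradient of the loss with respect to the current prediction f."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    r = y - f
    if loss == "squared":      # L = 0.5 * (y - f)^2
        return r
    if loss == "absolute":     # L = |y - f|; sign(0) = 0 is a valid subgradient
        return np.sign(r)
    if loss == "huber":        # quadratic near zero, linear in the tails
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
    if loss == "log":          # L = -[y log p + (1 - y) log(1 - p)], p = sigmoid(f)
        p = 1.0 / (1.0 + np.exp(-f))
        return y - p
    raise ValueError(f"unknown loss: {loss}")


print(pseudo_residuals([1.0, 2.0], [0.5, 2.5], loss="squared"))
print(pseudo_residuals([1.0, 0.0], [0.0, 0.0], loss="log"))
```

Only this function changes between losses; the fit-a-tree and update steps stay the same.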

Learning rate η (shrinkage)

η ∈ (0, 1] scales down each tree’s contribution. Smaller η means:

The ensemble moves more conservatively toward the minimum
More trees are needed to achieve the same training loss
Better generalisation in most cases, because no single tree can pull the ensemble far toward noise in the training data
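The first two points can be checked empirically. The sketch below, on synthetic Friedman data chosen only for illustration, records the training loss after each added tree with `staged_predict` for two learning rates; the helper `training_curve` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data, purely for illustration
X_lr, y_lr = make_friedman1(n_samples=500, noise=1.0, random_state=0)


def training_curve(learning_rate, n_estimators=200):
    """Training MSE after each boosting stage for the given learning rate."""
    model = GradientBoostingRegressor(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=3,
        random_state=0,
    )
    model.fit(X_lr, y_lr)
    return np.array(
        [mean_squared_error(y_lr, pred) for pred in model.staged_predict(X_lr)]
    )


curve_fast = training_curve(0.1)
curve_slow = training_curve(0.05)

print("train MSE after 100 trees, eta=0.10:", curve_fast[99])
print("train MSE after 100 trees, eta=0.05:", curve_slow[99])
print("train MSE after 200 trees, eta=0.05:", curve_slow[199])
```

At a fixed number of trees the smaller learning rate should show the higher training loss; whether it generalises better has to be judged on held-out data, not from these curves.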

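The interaction: the number of trees and the learning rate (n_estimators and learning_rate in scikit-learn; the learning rate is called eta in XGBoost) must be tuned jointly. As a rule of thumb, halving the learning rate requires roughly doubling n_estimators to reach the same training loss.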
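One common way to tune the two together is to fix a generous tree budget per learning rate and read the best tree count off a validation curve. The sketch below assumes synthetic data and a single train/validation split; the variable names and the candidate learning rates are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data and a single hold-out split, purely for illustration
X_all, y_all = make_friedman1(n_samples=600, noise=1.0, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.3, random_state=1
)

results = {}
for lr in (0.2, 0.1, 0.05):
    model = GradientBoostingRegressor(
        n_estimators=400, learning_rate=lr, max_depth=3, random_state=1
    ).fit(X_tr, y_tr)
    # Validation MSE after each stage; argmin gives the best tree count
    val_mse = [mean_squared_error(y_val, p) for p in model.staged_predict(X_val)]
    best_n = int(np.argmin(val_mse)) + 1
    results[lr] = (best_n, val_mse[best_n - 1])

for lr, (best_n, mse) in results.items():
    print(f"learning_rate={lr}: best n_estimators={best_n}, val MSE={mse:.3f}")
```

In practice the split would usually be replaced by cross-validation or a library's built-in early stopping, but the idea of selecting the tree count per learning rate is the same.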

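from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Placeholder data for illustration; substitute your own X and y
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbr = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,   # small lr → need more trees
    max_depth=3,          # shallow trees keep each learner weak and low-variance
    subsample=0.8,        # stochastic gradient boosting: 80% of rows per tree
    random_state=42,
)
gbr.fit(X_train, y_train)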

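Stochastic gradient boosting: setting subsample < 1 fits each tree on a random fraction of the training rows, drawn without replacement (similar in spirit to bagging). The added randomness reduces variance, which further limits overfitting and often improves accuracy.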
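The row-sampling step can be sketched by hand as follows. The helper `sample_rows`, the synthetic data, and the `_sgb` variable names are assumptions for this illustration; the sketch uses squared-error residuals and is not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sample_rows(n_rows, subsample, rng):
    """Pick a random fraction of row indices without replacement."""
    n_keep = max(1, int(subsample * n_rows))
    return rng.choice(n_rows, size=n_keep, replace=False)


# Synthetic data, purely for illustration
rng_sgb = np.random.default_rng(42)
X_sgb = rng_sgb.normal(size=(200, 3))
y_sgb = X_sgb[:, 0] ** 2 + rng_sgb.normal(scale=0.1, size=200)

eta_sgb = 0.1
pred_sgb = np.full(200, y_sgb.mean())        # F_0
for _ in range(20):
    rows = sample_rows(200, 0.8, rng_sgb)    # fresh subsample for every tree
    residuals = y_sgb - pred_sgb             # squared-error pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_sgb[rows], residuals[rows])   # fit on the sampled rows only
    pred_sgb += eta_sgb * tree.predict(X_sgb)  # update predictions for all rows

print("rows used by the last tree:", len(rows))
```

Each tree sees a different 80% of the data, but the ensemble prediction is still updated on every row.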
