datarekha
Machine Learning · Medium · Asked at Google, Amazon, Meta, Stripe

How does early stopping work in gradient boosting, and why is it necessary?

The short answer

Early stopping monitors a held-out validation metric after each tree is added and stops training when the metric has not improved for a given number of rounds. It is necessary because nothing in the training objective tells boosting when to stop: training loss keeps falling as trees are added, while test loss typically bottoms out and then starts to rise.

How to think about it

Why boosting needs early stopping

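Each tree in gradient boosting is fitted to the errors left by the current ensemble, so with a suitable learning rate it reduces training loss by construction. Unlike bagging (e.g. random forests), where trees are built independently and averaged, so adding more of them rarely hurts generalisation, gradient boosting keeps fitting whatever remains, including noise. Given enough trees, the model can effectively memorise the training set.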
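To make the "reduces training loss by construction" claim concrete, here is a minimal pure-Python sketch of squared-error boosting with one-split regression stumps. It is not how XGBoost or LightGBM are implemented; the synthetic data and all names (fit_stump, predict_stump, make_data) are illustrative assumptions.

```python
import math
import random


def fit_stump(xs, residuals):
    """Least-squares stump on 1-D inputs: returns (threshold, left_mean, right_mean)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    # Fallback if no split is possible: predict the overall mean everywhere.
    best_gain, best_stump = -1.0, (float("-inf"), 0.0, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += residuals[order[k - 1]]
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximising this is equivalent to minimising the split's squared error.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain = gain
            best_stump = (lo, left_sum / k, right_sum / (n - k))
    return best_stump


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def make_data(n):
    """Hypothetical noisy 1-D regression problem: y = sin(6x) + noise."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(6.0 * x) + random.gauss(0.0, 0.5) for x in xs]
    return xs, ys


random.seed(0)
x_tr, y_tr = make_data(40)     # small training set, easy to overfit
x_val, y_val = make_data(200)  # held-out validation set

learning_rate = 0.3
pred_tr = [0.0] * len(x_tr)
pred_val = [0.0] * len(x_val)
train_curve, val_curve = [], []

for m in range(200):
    # Each new stump is fitted to the residuals of the current ensemble.
    residuals = [y - p for y, p in zip(y_tr, pred_tr)]
    stump = fit_stump(x_tr, residuals)
    pred_tr = [p + learning_rate * predict_stump(stump, x) for p, x in zip(pred_tr, x_tr)]
    pred_val = [p + learning_rate * predict_stump(stump, x) for p, x in zip(pred_val, x_val)]
    train_curve.append(mse(y_tr, pred_tr))
    val_curve.append(mse(y_val, pred_val))

best_round = min(range(len(val_curve)), key=val_curve.__getitem__)
print(f"train MSE: first={train_curve[0]:.3f}, last={train_curve[-1]:.3f}")
print(f"validation MSE: best={val_curve[best_round]:.3f} at round {best_round}, "
      f"last={val_curve[-1]:.3f}")
```

For squared error with a learning rate between 0 and 1, train_curve can never go up, because each stump is a least-squares fit to the current residuals. The validation curve has no such guarantee, which is the gap early stopping is meant to cover.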

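Early stopping acts as an adaptive regulariser: training ends close to the best point on the validation curve automatically, without needing to re-tune n_estimators for every dataset or learning rate change. The stopping point is only as reliable as the validation set, so a small or unrepresentative hold-out gives a noisy estimate.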

Mechanics

At each iteration m:

  1. Add tree h_m to the ensemble.
  2. Evaluate the validation metric (e.g., log-loss, RMSE, AUC) on the held-out set.
  3. If the metric has not improved on its best value for early_stopping_rounds consecutive iterations, stop.
  4. After stopping, use the model as of the best iteration rather than the last one.
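
The steps above can be sketched as a library-independent loop. This is a minimal illustration, not any library's actual implementation: add_tree_and_score is a hypothetical callback that adds one tree and returns the validation loss (lower is better), and min_delta is an assumed optional tolerance.

```python
from typing import Callable, List, Tuple


def train_with_early_stopping(
    add_tree_and_score: Callable[[int], float],
    max_rounds: int,
    patience: int,
    min_delta: float = 0.0,
) -> Tuple[int, float]:
    """Run boosting rounds until the validation loss stalls.

    Returns (best_round, best_loss), with rounds counted from 0.
    """
    best_round, best_loss = -1, float("inf")
    for m in range(max_rounds):
        loss = add_tree_and_score(m)        # steps 1 and 2: add tree, evaluate
        if loss < best_loss - min_delta:    # an improvement resets the patience window
            best_round, best_loss = m, loss
        elif m - best_round >= patience:    # step 3: no improvement for `patience` rounds
            break
    return best_round, best_loss            # step 4: caller keeps trees 0..best_round


# Simulated validation curve: improves, bottoms out at round 3, then rises.
curve = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52, 0.60]
calls: List[int] = []


def fake_round(m: int) -> float:
    calls.append(m)
    return curve[m]


best_round, best_loss = train_with_early_stopping(
    fake_round, max_rounds=len(curve), patience=3
)
print(best_round, best_loss, len(calls))
```

With patience=3 the loop runs three rounds past the best one before giving up, so those extra trees are built and then ignored. That is the cost of not stopping on the first small uptick.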
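
With XGBoost's scikit-learn wrapper, early stopping is configured as follows:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data so the snippet runs; substitute your own X, y.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Hold out 15% as the validation set that early stopping monitors.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=2000,          # set high; early stopping will cut it
    learning_rate=0.02,
    max_depth=5,
    early_stopping_rounds=50,   # stop if no improvement for 50 rounds (constructor argument in XGBoost >= 1.6)
    eval_metric="logloss",
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],  # the last eval_set entry drives early stopping
    verbose=100,                # log the metric every 100 rounds
)
print(f"Best iteration: {model.best_iteration}")
```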

For LightGBM the scikit-learn API is nearly identical: pass callbacks=[lgb.early_stopping(50)] to fit() together with eval_set, and read the result from best_iteration_.

Learning rate interaction

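Smaller learning rates generally find a better optimum but push the best iteration further out, so n_estimators must be set high enough that early stopping, not the cap, ends training. A common production pattern: tune learning_rate with early stopping, then fix n_estimators at the best number of trees plus a buffer of roughly 10% and retrain on all the data (training plus validation) without early stopping. The buffer reflects the assumption that a somewhat larger dataset can support slightly more trees.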
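
The buffer arithmetic is easy to get off by one. A minimal sketch, assuming the library reports a 0-based best iteration (as XGBoost's best_iteration does); final_n_estimators and buffer_pct are hypothetical names used only for illustration.

```python
def final_n_estimators(best_iteration: int, buffer_pct: int = 10) -> int:
    """Tree count for the final refit, given a 0-based best iteration."""
    n_trees_at_best = best_iteration + 1   # iteration index 0 means one tree
    # Integer ceiling of the buffer avoids floating-point rounding surprises.
    extra = (n_trees_at_best * buffer_pct + 99) // 100
    return n_trees_at_best + extra


# Example: early stopping reported best_iteration = 849, i.e. 850 trees.
n_final = final_n_estimators(best_iteration=849)
print(n_final)
```

The result would then be passed as n_estimators to a fresh model fitted on the combined data, with early stopping disabled because no validation set is left to monitor.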
