How does early stopping work in gradient boosting, and why is it necessary?
Early stopping monitors a held-out validation metric after each tree is added and stops training when the metric has not improved for a given number of rounds. It is necessary because adding trees does not regularise a gradient-boosted model: training loss keeps falling with every tree, but validation loss eventually starts to rise as the model overfits.
How to think about it
Why boosting needs early stopping
Each tree in gradient boosting is fit to reduce the training loss, so training loss keeps falling as trees are added. In bagging (e.g. random forests), extra trees only average out variance and do not by themselves cause overfitting. In boosting, each new tree keeps fitting whatever residual remains, including noise. Given enough trees, the model effectively memorises the training set.
Early stopping acts as an adaptive regulariser: training ends near the best point on the validation curve automatically, so n_estimators does not have to be re-tuned for every dataset or learning-rate change.
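To make the diverging curves concrete, here is a toy sketch in pure Python: squared-error boosting with one-split stumps on a noisy sine wave. Everything in it (the data, `fit_stump`, `boost_curves`, the learning rate of 0.3) is an illustrative assumption, not how XGBoost is implemented. It only shows that training error cannot go up from round to round, while the validation error has its own minimum somewhere along the way.

```python
import math
import random

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) of the best one-split stump."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n, total = len(xs), sum(residuals)
    best = None  # (gain, threshold, left_mean, right_mean)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += residuals[order[k - 1]]
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximising this gain is equivalent to minimising squared error.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or gain > best[0]:
            best = (gain, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best[1:]

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost_curves(x_tr, y_tr, x_val, y_val, n_rounds=200, learning_rate=0.3):
    """Boost stumps on the training residuals; record both error curves."""
    pred_tr = [0.0] * len(x_tr)
    pred_val = [0.0] * len(x_val)
    train_curve, val_curve = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(y_tr, pred_tr)]
        thr, left, right = fit_stump(x_tr, residuals)
        pred_tr = [p + learning_rate * (left if x <= thr else right)
                   for x, p in zip(x_tr, pred_tr)]
        pred_val = [p + learning_rate * (left if x <= thr else right)
                    for x, p in zip(x_val, pred_val)]
        train_curve.append(mse(y_tr, pred_tr))
        val_curve.append(mse(y_val, pred_val))
    return train_curve, val_curve

def make_data(n, rng):
    """Noisy sine wave: the noise is what late trees end up fitting."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
x_tr, y_tr = make_data(40, rng)
x_val, y_val = make_data(40, rng)
train_curve, val_curve = boost_curves(x_tr, y_tr, x_val, y_val)
best_round = min(range(len(val_curve)), key=val_curve.__getitem__)
print(f"train MSE: first={train_curve[0]:.3f} last={train_curve[-1]:.3f}")
print(f"val MSE:   best={val_curve[best_round]:.3f} at round {best_round}, "
      f"last={val_curve[-1]:.3f}")
```

The gap between the best and the last validation error is what early stopping recovers. How large it is depends on the noise level, the learning rate, and the number of rounds.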
Mechanics
At each iteration m:
- Add tree h_m to the ensemble.
- Evaluate the validation metric (e.g., log-loss, RMSE, AUC) on the held-out set.
- If the metric has not improved in the last early_stopping_rounds iterations, stop.
- Restore the model to the best iteration.
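The stopping rule in the steps above can be sketched as a small loop over a stream of validation losses. The function name `early_stopping_index` and the loss values are made up for illustration; real libraries add details such as a minimum improvement threshold and support for metrics where higher is better.

```python
def early_stopping_index(val_losses, patience):
    """Return (best_round, stop_round) for validation losses where lower is better."""
    best_loss = float("inf")
    best_round = -1
    for m, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience window.
            best_loss, best_round = loss, m
        elif m - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, m
    # Ran out of rounds before patience was exhausted.
    return best_round, len(val_losses) - 1

# Hypothetical validation log-loss after each added tree.
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.54, 0.58, 0.59, 0.60, 0.61]
best_round, stop_round = early_stopping_index(losses, patience=3)
print(best_round, stop_round)  # 5 8
```

The improvement at round 5 arrives before three bad rounds accumulate, so the counter resets. Training stops at round 8 and the model is rolled back to round 5.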
import xgboost as xgb
from sklearn.model_selection import train_test_split

# X, y: feature matrix and binary labels, assumed to be loaded already
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=2000,          # set high; early stopping will cut it
    learning_rate=0.02,
    max_depth=5,
    early_stopping_rounds=50,   # stop if no improvement for 50 rounds
    eval_metric="logloss",
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],  # the metric is tracked on this held-out set
    verbose=100,                # log the metric every 100 rounds
)
print(f"Best iteration: {model.best_iteration}")
For LightGBM the API is nearly identical: pass callbacks=[lgb.early_stopping(50)] to fit() together with eval_set.
Learning rate interaction
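Smaller learning rates generally reach a better validation score but push the best iteration further out, so n_estimators must be high enough that early stopping, not the cap, ends training. A common production pattern: tune learning_rate with early stopping, then set n_estimators to the best tree count (best_iteration + 1, since XGBoost's best_iteration is zero-based) plus a 10% buffer, and retrain on the full training set, validation rows included, without early stopping.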
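The arithmetic for the final tree count is small enough to write down. The helper below is a hypothetical sketch: `final_n_estimators` is not a library function, and it assumes a zero-based `best_iteration` as reported by XGBoost.

```python
def final_n_estimators(best_iteration, buffer=0.10):
    """Tree count for the final refit: trees at the best round plus a buffer."""
    n_trees_at_best = best_iteration + 1  # zero-based index -> number of trees
    return round(n_trees_at_best * (1 + buffer))

# Example: early stopping reported best_iteration = 499, i.e. 500 trees.
n_final = final_n_estimators(499)
print(n_final)  # 550
```

The result would then be passed as `n_estimators` to a fresh model that is created without `early_stopping_rounds` and fit on the training and validation rows combined. The buffer reflects the usual reasoning that a slightly larger training set can support slightly more trees. The 10% figure is a rule of thumb rather than a derived value.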