Machine Learning · Hard · Asked at Google, Meta, Uber, Lyft, Booking.com

What regularisation mechanisms does XGBoost add on top of standard gradient boosting?

The short answer

XGBoost adds L1 (alpha) and L2 (lambda) penalties on leaf weights, written directly into the objective function; a tree complexity penalty (gamma) that requires a minimum gain before a split is accepted; a minimum child weight that blocks splits whose children carry too little Hessian mass, which in practice means small or sparse sub-groups; and row and column subsampling, analogous to stochastic gradient boosting and random forests.

How to think about it

XGBoost objective

Standard gradient boosting minimises a loss over the data. XGBoost adds an explicit regularisation term on the tree structure:

Obj = Σ l(y_i, ŷ_i) + Σ_t Ω(f_t)

Ω(f) = γ · T + (λ/2) · Σ_j w_j² + α · Σ_j |w_j|

where f_t is the t-th tree, T is the number of leaves in a tree, w_j is the weight of leaf j, γ is the per-leaf complexity penalty, λ is the L2 coefficient, and α is the L1 coefficient.
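To make the role of each term concrete, the sketch below computes the optimal weight of a single leaf and the regularised gain of a split from gradient sums (G) and Hessian sums (H). It uses the usual second-order approximation of the objective, with the L1 term handled by soft-thresholding. The function names and argument names are illustrative assumptions, not part of the XGBoost API, and the code is a simplified model of the idea rather than the library's implementation.

```python
def soft_threshold(g, alpha):
    """Shrink a gradient sum toward zero by alpha (the L1 effect)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Weight w that minimises G*w + 0.5*(H + lambda)*w**2 + alpha*|w|."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Twice the objective reduction achieved by one optimally weighted leaf."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Regularised gain of replacing one leaf with a left and a right child."""
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, reg_alpha)
    children = (leaf_score(GL, HL, reg_lambda, reg_alpha)
                + leaf_score(GR, HR, reg_lambda, reg_alpha))
    # The extra leaf created by the split costs gamma.
    return 0.5 * (children - parent) - gamma
```

For example, with G_L = -4, H_L = 2, G_R = 4, H_R = 2 and λ = 1, the gain is 16/3. Raising γ above that value makes the gain negative, so the split would be rejected; raising λ to 6 cuts the gain to 2; and an α of at least 4 zeroes both child weights.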

Each regulariser’s effect

Parameter	Effect
`reg_lambda` (L2, default 1)	Shrinks leaf weights toward zero; reduces the magnitude of each tree’s correction
`reg_alpha` (L1, default 0)	Promotes sparse leaf weights; can zero out leaves with weak signal
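`gamma` (min_split_loss, default 0)	A split is kept only if it reduces the regularised loss by at least gamma; acts as a minimum-gain threshold
`min_child_weight` (default 1)	Minimum sum of instance Hessians in a child (equal to the weighted sample count only for squared-error loss); prevents overfitting on small sub-groups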
`subsample`	Row sampling per tree (like stochastic GBM)
`colsample_bytree` / `colsample_bylevel`	Column sampling per tree or per level (like random forest)
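Because `min_child_weight` is a threshold on the Hessian sum rather than on the number of rows, its effect depends on the loss. The sketch below illustrates this for binary log loss, where each row contributes p · (1 − p), which is at most 0.25. The helper names are hypothetical, and the check is a simplified model of the constraint, not XGBoost's code.

```python
import math


def logistic_hessian(margin):
    """Second derivative of log loss w.r.t. the raw score: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)


def child_passes(margins, min_child_weight, sample_weights=None):
    """True if the child's weighted Hessian sum reaches min_child_weight."""
    if sample_weights is None:
        sample_weights = [1.0] * len(margins)
    hess_sum = sum(w * logistic_hessian(m)
                   for m, w in zip(margins, sample_weights))
    return hess_sum >= min_child_weight


# Rows the model is unsure about (margin 0, p = 0.5) contribute 0.25 each,
# so min_child_weight=5 needs at least 20 of them in a child.
print(child_passes([0.0] * 19, min_child_weight=5))
print(child_passes([0.0] * 20, min_child_weight=5))

# Confidently predicted rows (margin 4, p close to 0.98) contribute far less,
# so even 100 of them fall short of the same threshold.
print(child_passes([4.0] * 100, min_child_weight=5))
```

Under these assumptions the same setting is much stricter late in training, when most rows are predicted confidently, than in the first few trees.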

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
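    reg_alpha=0.1,          # L1 penalty on leaf weights
    reg_lambda=1.5,         # L2 penalty on leaf weights
    gamma=0.1,              # min loss reduction required to split
    min_child_weight=5,     # min sum of Hessians per child (not a raw sample count)
    subsample=0.8,          # fraction of rows sampled per tree
    colsample_bytree=0.8,   # fraction of columns sampled per tree
    early_stopping_rounds=30,  # constructor argument in xgboost >= 1.6
    eval_metric="logloss",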
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

Tuning priority

Most practitioners start with max_depth and min_child_weight, then subsample/colsample_bytree, and finally gamma and the L1/L2 terms (reg_alpha, reg_lambda). The regularisation parameters interact: a large reg_lambda shrinks leaf weights and tends to shrink split gains, so gamma becomes less necessary, and a high min_child_weight partially substitutes for reg_alpha/reg_lambda on sparse data.
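The tuning order above can be sketched as a staged grid search that tunes one group of parameters at a time and keeps earlier winners fixed. This is a minimal illustration, not a recommended procedure: `staged_search`, `score_fn`, and the candidate values in `STAGES` are all hypothetical, and `score_fn` stands in for whatever validation metric is used (for instance, a cross-validated score of a model trained with the given parameters).

```python
from itertools import product


def staged_search(score_fn, base_params, stages):
    """Tune one group of parameters at a time, keeping earlier winners fixed.

    score_fn(params) -> float, where higher is better.
    stages is a list of dicts mapping parameter name -> candidate values.
    """
    best = dict(base_params)
    best_score = score_fn(best)
    for grid in stages:
        names = list(grid)
        for values in product(*(grid[n] for n in names)):
            candidate = {**best, **dict(zip(names, values))}
            score = score_fn(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Stage order follows the priority described above; the candidate values
# are illustrative placeholders, not recommendations.
STAGES = [
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 5, 10]},
    {"subsample": [0.6, 0.8, 1.0], "colsample_bytree": [0.6, 0.8, 1.0]},
    {"gamma": [0.0, 0.1, 1.0]},
    {"reg_alpha": [0.0, 0.1, 1.0], "reg_lambda": [1.0, 1.5, 5.0]},
]
```

Because the parameters interact, a staged search like this can miss combinations that a joint search would find; it trades that risk for far fewer model fits.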
