What regularisation mechanisms does XGBoost add on top of standard gradient boosting?
XGBoost adds L1 (alpha) and L2 (lambda) penalties on leaf weights directly to the objective function, a tree complexity penalty (gamma) that requires a minimum gain before a split is accepted, a minimum child weight that blocks splits whose children carry too little data, and row and column subsampling, analogous to stochastic gradient boosting and random forests respectively.
How to think about it
XGBoost objective
Standard gradient boosting minimises a loss over the data. XGBoost adds an explicit regularisation term on the tree structure:
Obj = Σ_i l(y_i, ŷ_i) + Σ_t Ω(f_t)
Ω(f) = γ · T + (λ/2) · Σ_j w_j² + α · Σ_j |w_j|
where l is the per-example loss, f_t is the t-th tree, T is the number of leaves in a tree, w_j are its leaf weights, γ is the complexity penalty, λ is the L2 coefficient, and α is the L1 coefficient.
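
To make the penalty concrete, the following minimal sketch evaluates Ω(f) for a single tree from its leaf weights. The function name `tree_penalty` and the example weights are illustrative assumptions, not part of the XGBoost API.

```python
def tree_penalty(leaf_weights, gamma=0.0, reg_lambda=1.0, reg_alpha=0.0):
    """Omega(f) = gamma * T + (lambda / 2) * sum(w_j^2) + alpha * sum(|w_j|)."""
    n_leaves = len(leaf_weights)                                # T
    l2_term = 0.5 * reg_lambda * sum(w * w for w in leaf_weights)
    l1_term = reg_alpha * sum(abs(w) for w in leaf_weights)
    return gamma * n_leaves + l2_term + l1_term


# Hypothetical tree with three leaves.
weights = [0.5, -0.25, 0.0]
penalty = tree_penalty(weights, gamma=0.1, reg_lambda=1.5, reg_alpha=0.1)
```

Every extra leaf costs γ regardless of its weight, while the λ and α terms only charge for leaves whose weights are non-zero.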
Each regulariser’s effect
| Parameter | Effect |
|---|---|
| reg_lambda (L2, default 1) | Shrinks leaf weights toward zero; reduces the magnitude of each tree’s correction |
| reg_alpha (L1, default 0) | Promotes sparse leaf weights; can zero out leaves with weak signal |
| gamma (min_split_loss, default 0) | A split is only made if gain > gamma; acts as a minimum information-gain threshold |
| min_child_weight (default 1) | Minimum sum of instance weights (hessians) required in a child; prevents overfitting on small sub-groups |
| subsample | Row sampling per tree (like stochastic GBM) |
| colsample_bytree / colsample_bylevel | Column sampling per tree or per level (like random forest) |
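
How the first four parameters act on a single split can be sketched with the second-order formulas from the XGBoost paper, where G and H are the sums of gradients and hessians in a node. This is a simplified illustration with hypothetical helper names, not the library's internal code, and the library's own scaling of the gain may differ by a constant factor.

```python
def soft_threshold(g, alpha):
    """L1 effect: shrink |g| by alpha and clip at zero."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: alpha can zero it out, lambda shrinks it."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Loss reduction achievable by a leaf holding gradient/hessian sums G, H."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, reg_alpha=0.0):
    """Gain of splitting a parent node into left and right children."""
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, reg_alpha)
    children = (leaf_score(GL, HL, reg_lambda, reg_alpha)
                + leaf_score(GR, HR, reg_lambda, reg_alpha))
    return 0.5 * (children - parent)


def accept_split(GL, HL, GR, HR, gamma=0.0, min_child_weight=1.0,
                 reg_lambda=1.0, reg_alpha=0.0):
    """Reject splits with a too-light child or with gain not above gamma."""
    if HL < min_child_weight or HR < min_child_weight:
        return False
    return split_gain(GL, HL, GR, HR, reg_lambda, reg_alpha) > gamma


# Hypothetical node statistics for illustration.
w_small_lambda = leaf_weight(-4.0, 3.0, reg_lambda=1.0)
w_large_lambda = leaf_weight(-4.0, 3.0, reg_lambda=5.0)
w_large_alpha = leaf_weight(-4.0, 3.0, reg_alpha=5.0)
gain = split_gain(-6.0, 4.0, 2.0, 4.0, reg_lambda=1.0)
```

In this example, raising `reg_lambda` halves the leaf weight, an `reg_alpha` larger than |G| sets it to zero, and the same split can be accepted or rejected depending only on `gamma` and `min_child_weight`.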
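```python
import xgboost as xgb

# X_train, y_train, X_val, y_val are assumed to be an existing train/validation split.
model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
    reg_alpha=0.1,         # L1 penalty on leaf weights
    reg_lambda=1.5,        # L2 penalty on leaf weights
    gamma=0.1,             # minimum gain required to split
    min_child_weight=5,    # minimum sum of hessians in a child, not a raw sample count
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of columns sampled per tree
    early_stopping_rounds=30,  # accepted by the constructor in xgboost >= 1.6
    eval_metric="logloss",
)
# eval_set supplies the validation data that early stopping monitors.
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```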
Tuning priority
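A common tuning order is to start with max_depth and min_child_weight, then subsample/colsample_bytree, and finally gamma and the L1/L2 terms. The regularisation parameters interact: a larger L2 term tends to shrink the gain of candidate splits, so fewer of them clear the gamma threshold and γ becomes less necessary; a high min_child_weight partially substitutes for alpha/lambda on sparse data because it prevents the small leaves whose weights would otherwise need shrinking from being created.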
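
The staged order described above can be sketched as a small grid search that tunes one group of parameters at a time and carries the best values forward. The grids, the `staged_search` helper and the `score_fn` callback are all hypothetical; `score_fn` is assumed to take a parameter dict and return a score where higher is better.

```python
from itertools import product

# Illustrative grids, in the tuning order described above.
STAGES = [
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 5, 10]},
    {"subsample": [0.6, 0.8, 1.0], "colsample_bytree": [0.6, 0.8, 1.0]},
    {"gamma": [0.0, 0.1, 1.0], "reg_alpha": [0.0, 0.1, 1.0],
     "reg_lambda": [0.5, 1.0, 5.0]},
]


def staged_search(score_fn, base_params, stages=STAGES):
    """Tune each stage in turn, keeping the best setting found so far."""
    best = dict(base_params)
    best_score = score_fn(best)
    for grid in stages:
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            # Override only this stage's parameters; earlier stages stay fixed.
            candidate = {**best, **dict(zip(names, values))}
            score = score_fn(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
```

With xgboost and scikit-learn installed, `score_fn` could for example wrap `cross_val_score(xgb.XGBClassifier(**params), X, y, scoring="neg_log_loss").mean()`. Because the parameters interact, a staged search is a cheaper approximation of a joint search and can miss combinations that only work well together.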