What are the key algorithmic differences between XGBoost and LightGBM?
The short answer
By default, XGBoost grows trees level-by-level (breadth-first), offers exact, approximate, and histogram-based split finding, and adds L1/L2 regularisation on leaf weights. LightGBM grows leaf-wise (best-first), always uses histogram-based split finding, and offers Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) for speed on large datasets.
How to think about it
Tree growth strategy
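- XGBoost (level-wise, the default): splits all leaves at the same depth before going deeper. This produces balanced trees and is less prone to overfitting on small datasets. With the histogram tree method, `grow_policy="lossguide"` switches XGBoost to leaf-wise growth as well.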
- LightGBM (leaf-wise / best-first): always splits the leaf with the highest loss reduction, regardless of depth. This finds better fits in fewer iterations but can overfit if `num_leaves` or `min_data_in_leaf` are not constrained.
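The difference in growth order can be sketched with a toy tree. This is an illustration only, not either library's implementation: nodes are numbered heap-style (root 1, children `2n` and `2n + 1`), and `toy_gain`, `leaf_wise_order`, and `level_wise_order` are hypothetical names holding made-up loss reductions.

```python
import heapq

# Hypothetical loss reduction for splitting each node; nodes that are
# missing from the dict cannot be split any further.
toy_gain = {1: 10.0, 2: 8.0, 3: 1.0, 4: 6.0, 5: 0.5,
            6: 0.4, 7: 0.3, 8: 5.0, 9: 0.2}


def leaf_wise_order(gain, num_leaves):
    """Best-first growth: always split the open leaf with the highest gain."""
    frontier = [(-gain[1], 1)]  # max-heap emulated with negated gains
    order, leaves = [], 1
    while frontier and leaves < num_leaves:
        _, node = heapq.heappop(frontier)
        order.append(node)
        leaves += 1  # one leaf is replaced by two children
        for child in (2 * node, 2 * node + 1):
            if child in gain:
                heapq.heappush(frontier, (-gain[child], child))
    return order


def level_wise_order(gain, max_depth):
    """Depth-wise growth: split every splittable node on a level, then descend."""
    order, level = [], [1]
    for _ in range(max_depth):
        order.extend(level)
        level = [c for n in level for c in (2 * n, 2 * n + 1) if c in gain]
    return order


print(leaf_wise_order(toy_gain, num_leaves=5))   # [1, 2, 4, 8]
print(level_wise_order(toy_gain, max_depth=2))   # [1, 2, 3]
```

With these gains, leaf-wise growth follows the high-gain path down to node 8 and produces a deep, unbalanced tree, while level-wise growth also spends a split on the low-gain node 3 to keep the tree balanced.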
Split finding
- XGBoost offers an exact greedy algorithm (tries all sorted thresholds), an approximate variant using quantile sketches, and a histogram-based method (`tree_method="hist"`), which is the default since XGBoost 2.0.
- LightGBM always uses histogram binning: continuous features are discretised into at most 255 bins by default (`max_bin=255`), and split candidates are evaluated over the histogram rather than raw values. This dramatically reduces memory and computation on datasets with millions of rows.
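A minimal sketch of histogram-based split finding for a single feature follows. It makes several simplifying assumptions: equal-width bins (real implementations derive bin boundaries from the data distribution), an L2 penalty only, and the usual second-order gain expression without the constant factor or minimum-gain term. `best_histogram_split` is a hypothetical helper, not an API of either library.

```python
def best_histogram_split(values, grads, hess, n_bins=4, reg_lambda=1.0):
    """Return (threshold, gain) of the best split found over bin edges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    # Pass 1: accumulate gradient and Hessian sums per bin.
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h

    # Pass 2: scan the bin edges instead of every raw value.
    g_total, h_total = sum(g_hist), sum(h_hist)
    parent_score = g_total ** 2 / (h_total + reg_lambda)
    best_threshold, best_gain = None, 0.0
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda)
                - parent_score)
        if gain > best_gain:
            # Rows with value < threshold go to the left child.
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


values = [0.0, 1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 8.0]
grads = [-1.0] * 4 + [1.0] * 4   # toy gradients that change sign between 3 and 5
hess = [1.0] * 8
threshold, gain = best_histogram_split(values, grads, hess)
print(threshold, round(gain, 2))  # 4.0 6.4
```

The scan visits `n_bins - 1` candidate thresholds regardless of how many rows there are, which is where the savings over the exact greedy search come from.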
Additional LightGBM techniques
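- GOSS (Gradient-based One-Side Sampling): keeps all large-gradient samples (hard examples), randomly subsamples small-gradient ones, and up-weights the sampled small-gradient rows so the gradient statistics stay approximately unbiased. Preserves information while processing fewer rows per tree. GOSS is not the default sampling strategy and must be enabled explicitly.
- EFB (Exclusive Feature Bundling): bundles mutually exclusive sparse features (features that are rarely non-zero on the same row, such as one-hot columns) into single features, reducing effective feature count.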
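Both techniques can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not LightGBM's code: `goss_sample` and `bundle_exclusive` are hypothetical helpers, the `top_rate` and `other_rate` arguments borrow the names of the corresponding LightGBM parameters, and the bundling sketch assumes strict exclusivity, whereas the real algorithm tolerates a small rate of conflicts.

```python
import random


def goss_sample(grads, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row indices, row weights) selected in the spirit of GOSS."""
    n = len(grads)
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    n_top, n_other = round(top_rate * n), round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)
    # Up-weight the sampled small-gradient rows to compensate for the rows dropped.
    amplify = (1.0 - top_rate) / other_rate
    return top + sampled, [1.0] * n_top + [amplify] * n_other


def bundle_exclusive(feature_a, feature_b, offset):
    """Merge two binned features that are never non-zero on the same row.

    offset must be at least the largest bin index in feature_a so that the
    two value ranges do not overlap inside the bundled column.
    """
    bundled = []
    for a, b in zip(feature_a, feature_b):
        if a != 0 and b != 0:
            raise ValueError("features are not mutually exclusive")
        bundled.append(a if a != 0 else (b + offset if b != 0 else 0))
    return bundled


grads = [0.9, -0.05, 0.02, -0.8, 0.01, 0.03, -0.04, 0.06, 0.07, -0.02]
rows, weights = goss_sample(grads)
print(sorted(rows[:2]))  # [0, 3]: the two largest-gradient rows are always kept
print(bundle_exclusive([1, 0, 2, 0], [0, 3, 0, 0], offset=2))  # [1, 5, 2, 0]
```

Here only 3 of the 10 rows reach the tree learner, and the single sampled small-gradient row carries a weight of about 8 to stand in for the 8 rows it was drawn from.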
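```python
import xgboost as xgb
import lightgbm as lgb

# XGBoost: depth-wise growth, capped by max_depth
xgb_model = xgb.XGBClassifier(
    n_estimators=500, learning_rate=0.05,
    max_depth=6, reg_alpha=0.1, reg_lambda=1.0,
    tree_method="hist", device="cuda",  # GPU hist mode; device= needs XGBoost >= 2.0 and a CUDA GPU
    eval_metric="logloss",
    early_stopping_rounds=20,  # only takes effect when fit() receives an eval_set
)

# LightGBM: leaf-wise growth, capped by num_leaves
lgb_model = lgb.LGBMClassifier(
    n_estimators=500, learning_rate=0.05,
    num_leaves=63,
    min_child_samples=20,  # scikit-learn API alias of min_data_in_leaf
    reg_alpha=0.1, reg_lambda=1.0,
)
```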
When to pick which
| Scenario | Prefer |
|---|---|
| Tabular data, speed priority, large dataset | LightGBM |
| Smaller dataset, want balanced trees | XGBoost |
| Need GPU training | Both (XGBoost: `tree_method="hist"` with `device="cuda"`; LightGBM: `device_type="gpu"`, which may require a GPU-enabled build) |