Bagging, boosting & stacking
Why a committee of models beats any single one. The unifying theory behind random forests and XGBoost — bagging cuts variance, boosting cuts bias, and stacking blends diverse models to win.
What you'll learn
- Why ensembles win — diverse models with uncorrelated errors cancel out
- Bagging (parallel, cuts variance) vs boosting (sequential, cuts bias)
- Stacking and voting — blending different model families
Before you start
If you’ve wondered why random forests and XGBoost dominate tabular ML, here’s the unifying answer: ensembles. A committee of models, combined well, usually beats any single member, and ensembles are a staple of winning Kaggle solutions. This lesson is the theory that ties the tree methods together.
Why a committee wins
The intuition is the wisdom of crowds: if you average many models that each make different, uncorrelated errors, the errors cancel and the consensus is more accurate than any individual. The crucial word is uncorrelated — ten copies of the same model add nothing. Diversity is the whole game. Ensembles work precisely to the degree their members are wrong in different ways.
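To put a number on "uncorrelated", here is a minimal sketch under simplifying assumptions that are not from the lesson itself: every model's error has variance 1, and every pair of models has the same error correlation rho. Under those assumptions the variance of the averaged error is rho + (1 - rho) / n. The function names are illustrative only.

```python
import random
import statistics


def theoretical_variance(n_models, rho):
    """Variance of the mean of n_models unit-variance errors with pairwise correlation rho."""
    return rho + (1 - rho) / n_models


def simulated_variance(n_models, rho, n_trials=20000, seed=0):
    """Estimate the same quantity by simulation (assumes Gaussian errors)."""
    rng = random.Random(seed)
    averaged_errors = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # part of the error every model has in common
        errors = [
            rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            for _ in range(n_models)
        ]
        averaged_errors.append(sum(errors) / n_models)
    return statistics.pvariance(averaged_errors)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, theoretical_variance(10, rho), round(simulated_variance(10, rho), 3))
```

With rho = 0, ten models cut the error variance to a tenth. With rho = 1 (ten copies of the same model) the variance stays at 1, so nothing is gained.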
Bagging — parallel, cuts variance
Bagging (bootstrap aggregating) trains many models in parallel, each on a different bootstrap sample (a random resample of the data, with replacement), then averages them. Because each model sees a slightly different dataset, they make different errors, and averaging cancels much of the noise, sharply reducing variance.
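A rough sketch of the resampling step and of bagging itself, using only the standard library. The 1-nearest-neighbour base learner is an assumption chosen because it is easy to write and has high variance; it is not something the lesson prescribes, and all function names are hypothetical. For large datasets, one bootstrap sample contains roughly 63% of the distinct original rows (1 - 1/e), which is why each model sees a slightly different dataset.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def unique_fraction(n_rows, rng):
    """Fraction of distinct original rows that appear in one bootstrap sample."""
    sample = bootstrap_sample(list(range(n_rows)), rng)
    return len(set(sample)) / n_rows


def fit_1nn(rows):
    """High-variance base learner: 1-nearest-neighbour regression on (x, y) pairs."""
    def predict(x):
        return min(rows, key=lambda r: abs(r[0] - x))[1]
    return predict


def fit_bagged(rows, n_models, rng):
    """Train one base learner per bootstrap sample and average their predictions."""
    models = [fit_1nn(bootstrap_sample(rows, rng)) for _ in range(n_models)]

    def predict(x):
        return sum(m(x) for m in models) / len(models)
    return predict


if __name__ == "__main__":
    rng = random.Random(0)
    print("unique fraction:", unique_fraction(1000, rng))
```

Each call to `bootstrap_sample` returns a different dataset of the same size, so the base learners disagree in small ways, and `fit_bagged` averages those disagreements away.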
A random forest is exactly this: bag decision trees, and also randomize the features each split considers, which decorrelates the trees even more.
Boosting — sequential, cuts bias
Boosting flips the idea: train models one after another, each new one focused on the examples the ensemble got wrong so far. Instead of averaging independent models, it builds an additive sequence that keeps correcting its own mistakes — which reduces bias and produces the extremely accurate models you saw in XGBoost.
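The sequential correction can be sketched in a few lines. This is a toy illustration, not how XGBoost is implemented: it assumes squared-error loss on one-dimensional inputs, uses decision stumps as the weak learners, and "focusing on mistakes" here means fitting each new stump to the current residuals (the gradient boosting variant). The names `fit_stump`, `fit_boosted`, and `training_mse` are made up for this example.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on 1-D inputs under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Additive model: start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what the ensemble still gets wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x, n_stumps=None):
        used = stumps if n_stumps is None else stumps[:n_stumps]
        return base + learning_rate * sum(s(x) for s in used)
    return predict


def training_mse(predict, xs, ys, n_stumps=None):
    return sum((predict(x, n_stumps) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Calling `training_mse` with an increasing `n_stumps` shows the training error falling round by round. That is the bias reduction at work, and it is also why boosting needs a learning rate and early stopping to avoid fitting noise.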
Stacking & voting — blend different families
The third family combines different model types. Voting just averages (or majority-votes) their predictions. Stacking goes further: it trains a small meta-model on the base models’ predictions, learning how to weight them. Because a tree, a linear model, and a k-NN make very different errors, blending them often beats any one — which is why multi-level stacking routinely wins Kaggle competitions.
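As a rough illustration of both ideas, the sketch below implements majority voting and a deliberately tiny form of stacking. The assumptions are mine, not the lesson's: two base regressors (a least-squares line and 1-nearest-neighbour) on one-dimensional data, and a meta-model that is just a single blend weight chosen by grid search on out-of-fold predictions. Real stacking setups usually use a richer meta-model; all names here are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Hard voting: return the most common predicted label."""
    return Counter(labels).most_common(1)[0][0]


def fit_linear(rows):
    """Least-squares line through (x, y) pairs."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def fit_1nn(rows):
    """1-nearest-neighbour regression on (x, y) pairs."""
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]


def out_of_fold_predictions(rows, fitters, n_folds=5):
    """For every row, predict with base models trained on the other folds only."""
    oof = [None] * len(rows)
    for fold in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        models = [fit(train) for fit in fitters]
        for i, (x, _) in enumerate(rows):
            if i % n_folds == fold:
                oof[i] = [m(x) for m in models]
    return oof


def fit_stack(rows, fitters, n_folds=5):
    """Stack exactly two base models with a one-parameter blend as the meta-model."""
    oof = out_of_fold_predictions(rows, fitters, n_folds)
    ys = [y for _, y in rows]

    def blend_mse(w):
        return sum((w * a + (1 - w) * b - y) ** 2 for (a, b), y in zip(oof, ys)) / len(ys)

    weight = min((i / 20 for i in range(21)), key=blend_mse)
    models = [fit(rows) for fit in fitters]  # refit base models on all the data

    def predict(x):
        a, b = (m(x) for m in models)
        return weight * a + (1 - weight) * b
    return predict, weight
```

The meta-model is fitted on out-of-fold predictions rather than in-sample ones. Otherwise a base model that memorises the training data (1-NN here) would look perfect and take all the weight.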
Next
That completes the supervised core. Next, evaluation done rigorously — feature selection and model selection with nested CV.
Practice this in an interview
Bagging trains many independent models in parallel on bootstrap samples and averages them, which mainly reduces variance; boosting trains models sequentially so each corrects its predecessor's errors, which mainly reduces bias. Use bagging (e.g. random forests) when your base learner is high-variance and overfits; use boosting (e.g. gradient boosting) when you need to squeeze out bias and maximize accuracy, accepting more tuning and overfitting risk.
Bagging trains many independent models on bootstrap samples in parallel and averages their predictions, primarily reducing variance. Boosting trains models sequentially, each correcting the errors of its predecessor, primarily reducing bias.
Random forests are faster to train, easier to tune, robust to noisy features, and hard to overfit with more trees — making them a strong default baseline. Gradient boosting typically achieves higher accuracy on structured/tabular data, but requires careful tuning of learning rate, tree depth, and early stopping to avoid overfitting.
Random forest builds deep trees independently in parallel and averages them, making it robust, low-tuning, and resistant to overfitting; gradient boosting builds shallow trees sequentially to correct residual errors, usually achieving higher accuracy when carefully tuned. Choose random forest for a fast, stable baseline on noisy data, and gradient boosting when squeezing out maximum accuracy on tabular data is worth the tuning effort.