datarekha
Machine Learning | Medium | Asked at Google, Amazon, Meta, Netflix, Airbnb

How does a random forest work, and why does feature sampling at each split help more than row sampling alone?

The short answer

A random forest grows many deep decision trees, each on a bootstrap sample of the rows, and restricts every split to a random subset of the features. Feature sampling decorrelates the trees so their errors partly cancel when averaged, which is the key source of variance reduction beyond what row sampling achieves.

How to think about it

The two sources of randomness

  1. Row sampling (bootstrap): each tree is trained on n rows drawn with replacement from the n training rows, so it sees about 63.2% of the distinct rows on average. This alone creates some diversity but not enough: if one feature is strongly predictive, nearly every tree will still split on it first, leaving the trees highly correlated.

  2. Feature sampling at each split: at each node, only a random subset of m features is considered as split candidates. When the dominant feature is missing from that subset, the tree has to split on a different signal, which lowers the correlation between trees.
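
The 63.2% figure for bootstrap sampling can be checked with a short simulation. The sketch below uses only the standard library; the helper name `unique_fraction`, the row count, and the number of repetitions are illustrative choices, not part of any library.

```python
import math
import random


def unique_fraction(n_rows: int, rng: random.Random) -> float:
    """Fraction of distinct rows that appear in one bootstrap sample of size n_rows."""
    sample = [rng.randrange(n_rows) for _ in range(n_rows)]  # draw with replacement
    return len(set(sample)) / n_rows


rng = random.Random(0)
n_rows = 10_000  # illustrative dataset size
fractions = [unique_fraction(n_rows, rng) for _ in range(20)]
mean_fraction = sum(fractions) / len(fractions)

# A given row is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e,
# so the expected share of distinct rows tends to 1 - 1/e.
expected = 1 - math.exp(-1)
print(f"simulated: {mean_fraction:.3f}, theoretical limit: {expected:.3f}")
```

The remaining share of roughly 1/e of the rows is what each tree never sees.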

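The conventional choice is m = sqrt(d) for classification and m = d/3 for regression (where d is the total number of features). Library defaults vary: recent scikit-learn versions use max_features="sqrt" in RandomForestClassifier but consider all d features by default in RandomForestRegressor, so feature sampling for regression has to be set explicitly there.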
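
One way to see the decorrelation effect is to measure how similar the individual trees' predictions are with and without feature sampling. The following is a rough sketch on synthetic data with one deliberately dominant feature; the data-generating formula, the helper `mean_tree_correlation`, and all sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def mean_tree_correlation(forest, X):
    """Average pairwise Pearson correlation between the individual trees' predictions."""
    preds = np.array([tree.predict(X) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    upper = corr[np.triu_indices_from(corr, k=1)]  # each pair of trees counted once
    return float(upper.mean())


# Synthetic regression problem (assumed for illustration): feature 0 dominates the target
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 16))
y = 5.0 * X[:, 0] + X[:, 1:6].sum(axis=1) + rng.normal(scale=1.0, size=1500)
X_train, y_train, X_eval = X[:1000], y[:1000], X[1000:]

settings = {"rows only (all features)": None, "rows + features (sqrt)": "sqrt"}
correlations = {}
for label, max_features in settings.items():
    forest = RandomForestRegressor(
        n_estimators=50,
        max_features=max_features,  # None means every feature is a candidate at every split
        bootstrap=True,
        random_state=0,
    ).fit(X_train, y_train)
    correlations[label] = mean_tree_correlation(forest, X_eval)
    print(f"{label}: mean pairwise tree correlation = {correlations[label]:.3f}")
```

On data like this, the forest that only bootstraps rows should show the higher correlation between trees, because almost every tree leans on feature 0 near the root.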

Why decorrelation matters

For n trees, each with variance σ² and pairwise correlation ρ, the variance of their average is:

Var(average) = ρ·σ² + (1-ρ)/n · σ²

As n → ∞, the second term vanishes and ρ·σ² remains as a floor that adding more trees cannot lower. Reducing ρ (via feature sampling) drives this floor down. Row sampling alone leaves ρ high when a dominant feature exists.
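
The variance formula can be checked numerically. The sketch below models each tree's error as a Gaussian with variance σ² and a shared component that produces pairwise correlation ρ; this is an idealised stand-in for real trees, and the function names and settings are illustrative assumptions.

```python
import math
import random
import statistics


def simulate_average_variance(n_trees, rho, sigma, n_trials, rng):
    """Empirical variance of the mean of n_trees equicorrelated Gaussian 'tree errors'."""
    averages = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # component common to every tree
        errors = [
            sigma * (math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0))
            for _ in range(n_trees)
        ]
        averages.append(sum(errors) / n_trees)
    return statistics.pvariance(averages)


def formula_variance(n_trees, rho, sigma):
    """Var(average) = rho * sigma^2 + (1 - rho) / n * sigma^2."""
    return rho * sigma**2 + (1 - rho) / n_trees * sigma**2


rng = random.Random(0)
for rho in (0.8, 0.2):
    empirical = simulate_average_variance(n_trees=20, rho=rho, sigma=1.0, n_trials=20_000, rng=rng)
    print(f"rho={rho}: simulated {empirical:.3f}, formula {formula_variance(20, rho, 1.0):.3f}")
```

With highly correlated trees the variance of the average stays close to ρ·σ² however many trees are averaged, while a low ρ lets averaging remove most of it.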

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",   # feature sampling per split
    bootstrap=True,        # row sampling
    oob_score=True,        # free validation estimate
    n_jobs=-1,
    random_state=42
)
rf.fit(X_train, y_train)   # X_train, y_train: training features and labels, defined elsewhere

print(f"OOB accuracy: {rf.oob_score_:.4f}")

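Out-of-bag (OOB) estimate: the ~36.8% of rows left out of a tree's bootstrap sample act as a validation set for that tree. Each training row is predicted using only the trees that did not see it, and scoring those predictions gives a generalisation estimate that is usually close to a hold-out estimate (if anything slightly pessimistic), without needing a separate hold-out split.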
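
To get a feel for how well the OOB estimate tracks a real hold-out score, the two can be compared directly. This is a minimal sketch on synthetic data; the dataset from `make_classification` and the split sizes are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,   # OOB scoring requires bootstrap sampling
    oob_score=True,
    n_jobs=-1,
    random_state=0,
)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_                  # computed from training rows only
test_accuracy = rf.score(X_test, y_test)      # computed from rows the forest never saw
print(f"OOB accuracy:      {oob_accuracy:.4f}")
print(f"Hold-out accuracy: {test_accuracy:.4f}")
```

The two numbers should land close together, which is why the OOB score is often used in place of a validation split when data is scarce.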

Hyperparameters that matter most

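| Parameter | What it controls |
| --- | --- |
| n_estimators | Number of trees. More trees do not cause overfitting, but returns diminish after ~200-500 and training cost keeps growing |
| max_features | Correlation vs. bias trade-off: smaller values decorrelate the trees but weaken each individual tree |
| max_depth / min_samples_leaf | Individual tree variance |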
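
The diminishing returns from n_estimators can be observed by scoring a forest built from only the first k trees of a fitted model. The sketch below does this by summing per-tree class probabilities; the synthetic dataset and the chosen values of k are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

# Per-tree class probabilities, shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test) for tree in rf.estimators_])
running_sum = np.cumsum(tree_probas, axis=0)

accuracy_by_size = {}
for k in (1, 10, 50, 100, 300):
    # Prediction of a forest made of the first k trees: argmax of the summed probabilities
    predicted = rf.classes_[running_sum[k - 1].argmax(axis=1)]
    accuracy_by_size[k] = float((predicted == y_test).mean())
    print(f"{k:>3} trees: test accuracy {accuracy_by_size[k]:.3f}")
```

Accuracy typically climbs quickly over the first few dozen trees and then flattens, so past that point extra trees mostly buy stability rather than accuracy.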