How does a random forest work, and why does feature sampling at each split help more than row sampling alone?
A random forest grows many deep decision trees, each on a bootstrap sample of the rows, and additionally restricts each split to a random subset of features. Feature sampling decorrelates the trees so that their errors partially cancel when averaged, which is the key source of variance reduction beyond what row sampling achieves.
How to think about it
The two sources of randomness
- Row sampling (bootstrap): each tree is trained on a sample drawn with replacement, the same size as the training set, which contains about 63.2% of the distinct training rows on average. This alone creates diversity, but not enough: if one feature is strongly predictive, nearly every tree will still split on it first, making the trees highly correlated.
- Feature sampling at each split: at each node, only a random subset of m features is considered as split candidates. This forces trees to rely on different signals, which lowers the correlation between them.
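The commonly recommended values are m = sqrt(d) for classification and m = d/3 for regression (where d is the total number of features). Library defaults vary: recent scikit-learn versions use max_features="sqrt" for RandomForestClassifier but consider all features (max_features=1.0) for RandomForestRegressor.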
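The two sampling steps above can be sketched with NumPy. This is only an illustration of the idea, not how any library implements it: the row count, feature count, and seed are arbitrary assumptions, and m uses the sqrt(d) choice for classification.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
n_rows, n_features = 10_000, 16  # illustrative sizes (assumption)

# Row sampling: one tree's bootstrap sample is n_rows indices drawn with replacement.
boot_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(boot_idx).size / n_rows

# Feature sampling: at a single split, draw m candidate features without replacement.
m = int(np.sqrt(n_features))
candidates = rng.choice(n_features, size=m, replace=False)

print(f"distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"candidate features at this split: {sorted(candidates.tolist())}")
```

The distinct-row fraction should land near 1 - 1/e ≈ 0.632, and a fresh set of candidate features would be drawn at every node of every tree.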
Why decorrelation matters
For n trees each with variance σ², pairwise correlation ρ:
Var(average) = ρ·σ² + (1-ρ)/n · σ²
As n → ∞, the irreducible term is ρ·σ². Reducing ρ (via feature sampling) drives this floor down. Row sampling alone leaves ρ high when a dominant feature exists.
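To make the floor concrete, the formula can be evaluated for a few settings and checked against a small simulation. The values of ρ, the tree counts, and the Gaussian "predictions" below are illustrative assumptions, not measurements from a real forest.

```python
import numpy as np

def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Variance of the average of n_trees predictors with equal pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2

# Closed form: adding trees shrinks only the second term; the floor stays at rho * sigma2.
for rho in (0.8, 0.3):
    for n_trees in (10, 100, 1000):
        print(f"rho={rho}  n={n_trees:4d}  Var(average)={ensemble_variance(rho, 1.0, n_trees):.4f}")

# Monte Carlo check: unit-variance predictors sharing one common component,
# which gives every pair of predictors correlation rho.
rng = np.random.default_rng(0)
rho, n_trees, n_trials = 0.3, 50, 20_000
shared = rng.standard_normal((n_trials, 1))
own = rng.standard_normal((n_trials, n_trees))
preds = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
simulated = preds.mean(axis=1).var()

print(f"simulated={simulated:.4f}  formula={ensemble_variance(rho, 1.0, n_trees):.4f}")
```

With ρ = 0.8 the variance barely moves between 10 and 1000 trees, while at ρ = 0.3 the same number of trees reaches a much lower floor.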
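```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data so the snippet runs; replace with your own X_train, y_train
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",  # feature sampling per split
    bootstrap=True,       # row sampling
    oob_score=True,       # free validation estimate
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.4f}")
```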
Out-of-bag (OOB) estimate: the ~36.8% of rows left out of a tree's bootstrap sample act as a validation set for that tree. Predicting each row using only the trees that never saw it, then scoring those predictions, gives a nearly unbiased generalisation estimate without a separate hold-out split.
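One way to sanity-check the OOB estimate is to compare it with accuracy on a hold-out split that the forest never saw. The sketch below assumes a synthetic dataset from make_classification; the sizes and seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 2000 rows, 20 features, 5 of them informative.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_              # computed from training rows only
test_acc = rf.score(X_test, y_test)  # computed from the held-out rows

print(f"OOB accuracy:      {oob_acc:.4f}")
print(f"Hold-out accuracy: {test_acc:.4f}")
```

The two numbers usually land close together. The OOB figure tends to be slightly pessimistic, because each row is scored by only the fraction of trees that did not train on it.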
Hyperparameters that matter most
| Parameter | What it controls |
|---|---|
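| n_estimators | Number of trees; more trees lower variance without causing overfitting, but returns diminish after ~200-500 while training cost keeps growing |
| max_features | Trade-off between tree correlation and individual-tree bias (a smaller value lowers correlation but raises bias) |
| max_depth / min_samples_leaf | Complexity, and hence variance, of each individual tree |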
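Since the OOB score comes for free, it can be used to compare a few max_features settings without a separate validation split. This is a rough sketch on assumed synthetic data; the candidate values and dataset shape are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 25 features, only 5 of them informative.
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, random_state=0)

oob_by_setting = {}
for max_features in ("sqrt", 0.5, None):  # None means every feature is a candidate at each split
    rf = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,
        random_state=0,
    )
    rf.fit(X, y)
    oob_by_setting[max_features] = rf.oob_score_

for setting, score in oob_by_setting.items():
    print(f"max_features={setting!s:>5}  OOB accuracy={score:.4f}")
```

Which setting wins depends on the data, so treat the printed scores as a way to compare options on your own problem rather than as a general ranking.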