What is pruning in decision trees and when would you use pre-pruning versus post-pruning?
Pruning removes splits that do not improve generalisation. Pre-pruning stops growth early via hyperparameters like max_depth or min_samples_leaf. Post-pruning (cost-complexity pruning) grows the full tree then collapses nodes whose removal does not hurt held-out accuracy enough.
How to think about it
An unpruned tree will grow until every leaf is pure, perfectly fitting training data but failing on new samples. Pruning is the primary regularisation tool for single decision trees.
Pre-pruning (early stopping)
Controlled entirely by hyperparameters evaluated before or during growth:
| Parameter | Effect |
|---|---|
max_depth | Hard cap on tree height |
min_samples_split | Node must have at least N samples to split |
min_samples_leaf | Each resulting leaf must have at least N samples |
min_impurity_decrease | Split must improve impurity by at least this amount |
Pre-pruning is fast but can stop too early — a split with zero immediate gain might still unlock a high-gain split one level deeper.
Post-pruning (cost-complexity / reduced-error pruning)
sklearn implements minimal cost-complexity pruning, which adds a penalty proportional to the number of leaves:
R_alpha(T) = R(T) + alpha * |leaves(T)|
As alpha increases, more leaves are collapsed. Cross-validate over alpha to find the right balance:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas
scores = [
cross_val_score(
DecisionTreeClassifier(ccp_alpha=a, random_state=0),
X_train, y_train, cv=5
).mean()
for a in alphas
]
best_alpha = alphas[np.argmax(scores)]
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha).fit(X_train, y_train)
Which to use? Pre-pruning is cheaper and sufficient for most ensemble contexts (trees in random forests are deliberately shallow). Post-pruning is better when you want a single interpretable tree that is as small as possible without sacrificing accuracy.