How do you handle class imbalance in a machine-learning model?
Class imbalance can be handled at the data level (oversampling with SMOTE, undersampling), the algorithm level (class weights, balanced bagging), and the decision level (threshold tuning). The right approach depends on how severe the imbalance is, how much data you have, and whether the minority class is dense enough for synthetic samples to be meaningful. Always choose your evaluation metric first, because accuracy is misleading on imbalanced data.
How to think about it
Structure: metric choice first, then data-level, algorithm-level, and threshold-level remedies, with a SMOTE data-leakage trap at the end.
Step 0: fix the metric first
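Accuracy is misleading when one class dominates: with 1% positives, a model that always predicts the majority class scores 99% accuracy while detecting nothing. Switch to PR-AUC, F1 (per class), or a cost-weighted metric before touching the data or model. This also sets the target that will guide threshold tuning.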
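To make the point about accuracy concrete, here is a minimal sketch in plain Python. The 990/10 class split and the always-negative predictor are hypothetical, chosen only for illustration; in practice a library such as scikit-learn would compute these metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical data: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_majority = [0] * 1000

acc = accuracy(y_true, y_majority)
prec, rec, f1 = precision_recall_f1(y_true, y_majority)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Accuracy looks excellent, yet minority-class recall and F1 are zero, which is why the metric has to be chosen before anything else.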
Data-level approaches
Random oversampling duplicates minority samples verbatim. Simple, but the model can overfit those exact copies.
SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority examples by interpolating between a sample and its k-nearest minority neighbours. It diversifies the minority region without exact duplication. Variants like ADASYN oversample harder-to-learn regions more aggressively.
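The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the function name smote_sketch, the brute-force neighbour search, and the toy points are all assumptions made for the example, and it only handles continuous features.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force nearest neighbours among the other minority samples.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because every synthetic point lies between two real minority samples, the method only helps when those neighbours are genuinely close; in a sparse minority region the segments can cut through majority territory.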
Random undersampling removes majority samples. Fast and avoids synthetic data, but discards potentially useful information. Works well when the majority class is enormous and redundant.
Combined approaches — e.g., SMOTE + Tomek Links — oversample the minority and then remove ambiguous borderline majority samples, cleaning the decision boundary.
Algorithm-level approaches
Class weights tell the loss function to penalise errors on the minority class more heavily. In scikit-learn: class_weight='balanced' sets weights inversely proportional to class frequency. This is often the cleanest first fix because it requires no data transformation.
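The "balanced" heuristic is documented in scikit-learn as n_samples / (n_classes * count_per_class). A small sketch of that formula in plain Python, with a hypothetical 900/100 label split, shows what the weights look like; the helper name balanced_class_weights is made up for this example.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical labels: 900 negatives, 100 positives.
y = [0] * 900 + [1] * 100
weights = balanced_class_weights(y)
print(weights)  # the minority class gets the larger weight

# Each class now contributes the same total weight to the loss.
total_per_class = {label: weights[label] * y.count(label) for label in weights}
print(total_per_class)
```

With these weights a single minority error costs nine times as much as a majority error in this example, so both classes carry equal total weight in the loss without any rows being added or removed.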
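Balanced bagging (e.g., BalancedBaggingClassifier or BalancedRandomForestClassifier in imbalanced-learn) undersamples each bootstrap sample so that every base learner trains on balanced classes, then ensembles the results. Effective and less likely to overfit than naive oversampling.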
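The per-learner resampling step can be illustrated with a short sketch. This is a simplified assumption about how a balanced bootstrap might be drawn, not the exact procedure used by imbalanced-learn; the function name balanced_bootstrap and the 950/50 split are hypothetical.

```python
import random
from collections import defaultdict


def balanced_bootstrap(y, seed=0):
    """Return row indices for one bootstrap sample in which every class is
    drawn (with replacement) down to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    n_per_class = min(len(idxs) for idxs in by_class.values())
    sample = []
    for idxs in by_class.values():
        sample.extend(rng.choices(idxs, k=n_per_class))
    rng.shuffle(sample)
    return sample


# Hypothetical labels: 950 negatives, 50 positives.
y = [0] * 950 + [1] * 50
# One balanced sample per base learner; a different seed gives a different bag.
bags = [balanced_bootstrap(y, seed=s) for s in range(3)]
for bag in bags:
    print(len(bag), sum(y[i] for i in bag))
```

Each bag sees only a small slice of the majority class, but different bags see different slices, so the ensemble as a whole still uses most of the majority data.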
Gradient-boosting libraries expose similar weighting (e.g., XGBoost scale_pos_weight), which up-weights positive-class errors during tree construction. A common starting value is the ratio of negative to positive training examples.
Decision-level: threshold tuning
Most classifiers output probabilities or scores; the default threshold (0.5) was not chosen for your class distribution. After training, plot the PR curve on a held-out validation set and choose the threshold that satisfies your operational constraint, e.g., “precision >= 0.7 at the highest feasible recall.” Threshold tuning requires no retraining and is worth trying before resampling.
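A constraint such as "precision >= 0.7 at the highest feasible recall" can be turned into a small search over candidate thresholds. The sketch below is plain Python under stated assumptions: the helper name pick_threshold and the validation labels and scores are hypothetical, and a real workflow would typically take the candidates from a library's precision-recall curve.

```python
def pick_threshold(y_true, scores, min_precision=0.7):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision meets min_precision, or None if none does."""
    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        fn = sum(1 for t, s in zip(y_true, scores) if s < thr and t == 1)
        if tp + fp == 0:
            continue  # nothing predicted positive, precision undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision < min_precision:
            continue
        # Prefer higher recall; break ties in favour of higher precision.
        if best is None or (recall, precision) > (best[2], best[1]):
            best = (thr, precision, recall)
    return best


# Hypothetical held-out validation labels and predicted scores.
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

result = pick_threshold(y_val, scores, min_precision=0.7)
print(result)
```

On this toy data the default cut-off of 0.5 would give a precision of 2/3 and miss the constraint, while the selected threshold meets it. The chosen threshold should then be confirmed on a separate test set, since it was fitted to the validation scores.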