What is the F1 score, why use the harmonic mean, and when is it the wrong metric?
F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R). The harmonic mean penalises imbalance between the two: a model with precision 1.0 and recall 0.01 gets F1 ≈ 0.02, not the 0.505 an arithmetic mean would give. F1 is the wrong metric when the classes are severely imbalanced, when true negatives matter, or when the costs of false positives and false negatives differ sharply; in those cases F-beta, PR-AUC, MCC, or a cost-weighted metric is more appropriate.
How to think about it
Cover the formula, the harmonic-mean motivation, the generalisation to F-beta, and the cases where F1 misleads.
The formula
F1 = 2 * Precision * Recall / (Precision + Recall)
Equivalently: F1 = 2TP / (2TP + FP + FN).
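As a quick check that the two forms agree, here is a minimal sketch in plain Python. The counts are hypothetical, chosen only for illustration, and returning 0.0 when the denominator is zero is an assumed convention rather than part of the definition.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero, by assumption)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 computed directly from counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)  # 80 / 100
recall = tp / (tp + fn)     # 80 / 120

f1_a = f1_from_pr(precision, recall)
f1_b = f1_from_counts(tp, fp, fn)
print(f1_a, f1_b)  # both forms give the same value up to float rounding
```

Note that true negatives appear in neither form, which is why F1 says nothing about how well the negative class is handled.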
Why harmonic mean, not arithmetic mean?
The arithmetic mean of precision and recall, (P + R) / 2, rewards a model that maximises one metric while neglecting the other. A model with precision = 1.0 and recall = 0.01 has an arithmetic mean of 0.505, which suggests it is decent. The harmonic mean gives about 0.02, correctly signalling that the model is nearly useless.
The harmonic mean is dominated by the smaller of the two values: it never falls below min(P, R) and never exceeds 2 × min(P, R). That is the intent, since both precision and recall must be high at the same time for F1 to be high.
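The comparison can be reproduced with Python's standard `statistics` module. This is only a small sketch using the precision and recall values from the example above.

```python
from statistics import harmonic_mean, mean

precision, recall = 1.0, 0.01

arithmetic = mean([precision, recall])
harmonic = harmonic_mean([precision, recall])  # same value as F1 for these inputs

print(round(arithmetic, 3))  # 0.505
print(round(harmonic, 4))    # 0.0198

# The harmonic mean of two non-negative numbers stays within a factor of two
# of the smaller one, so the weaker metric caps the score.
smaller = min(precision, recall)
print(smaller <= harmonic <= 2 * smaller)  # True
```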
F-beta: weighting one direction
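When the costs of false positives and false negatives are asymmetric, use F-beta, which treats recall as beta times as important as precision: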
F-beta = (1 + beta²) * P * R / (beta² * P + R)
- beta = 1 → standard F1, equal weight.
- beta = 2 → recall counts twice as much (good for medical screening, fraud detection).
- beta = 0.5 → precision counts twice as much (good for spam filtering, precision-first retrieval).
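To see how beta shifts the score, here is a minimal sketch of the F-beta formula above. The function name `fbeta` and the precision and recall values are illustrative assumptions, and the zero-denominator guard is an assumed convention.

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); 0.0 if undefined (assumed)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Hypothetical model: high precision, mediocre recall.
precision, recall = 0.9, 0.5

scores = {beta: fbeta(precision, recall, beta) for beta in (0.5, 1.0, 2.0)}
for beta, score in scores.items():
    print(beta, round(score, 3))
# 0.5 0.776  (precision-weighted: the strong precision is rewarded)
# 1.0 0.643  (plain F1)
# 2.0 0.549  (recall-weighted: the weak recall is penalised)
```

The same model ranks differently depending on beta, so beta should be fixed from the cost structure of the problem before models are compared.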
When F1 is the wrong metric
| Situation | Better metric |
|---|---|
| Severe class imbalance (less than 1% positives) | PR-AUC, or per-class F1 with macro averaging |
| Need threshold-free comparison | PR-AUC or ROC-AUC |
| Costs of FP and FN differ a lot | F-beta, or an explicit cost matrix |
| Multi-class with unequal class sizes | Macro or weighted F1, not micro |
| True negatives matter (e.g. content moderation, where errors in both directions count) | Matthews Correlation Coefficient (MCC) |
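The last row can be made concrete with a small sketch: two hypothetical confusion matrices share the same TP, FP, and FN but differ in TN. F1 cannot tell them apart, while MCC can. The counts and the helper names `f1` and `mcc` are illustrative assumptions.

```python
from math import sqrt


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; true negatives do not appear anywhere."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient; 0.0 if the denominator is zero (assumed)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Same TP/FP/FN, very different handling of the negative class.
no_true_negatives = dict(tp=90, fp=10, fn=10, tn=0)
many_true_negatives = dict(tp=90, fp=10, fn=10, tn=890)

for name, c in [("tn=0", no_true_negatives), ("tn=890", many_true_negatives)]:
    print(name, round(f1(c["tp"], c["fp"], c["fn"]), 3), round(mcc(**c), 3))
# tn=0 0.9 -0.1
# tn=890 0.9 0.889
```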
Macro vs. micro F1
In multi-class problems:
- Micro F1 pools TP, FP, and FN across all classes before computing, so it is dominated by the largest class. In single-label multi-class problems it equals accuracy.
- Macro F1 computes F1 per class and then averages, treating every class equally regardless of size.
- Weighted F1 weights each per-class F1 by its support, a compromise for imbalanced multi-class problems.
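The three averages can be compared on a small hypothetical confusion matrix with one large class and two small ones. This is a sketch in plain Python under those assumed counts, not a replacement for a library implementation.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; 0.0 if the denominator is zero (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical confusion matrix: rows = true class, columns = predicted class.
confusion = [
    [95, 3, 2],  # large class, support 100
    [6, 3, 1],   # small class, support 10
    [5, 1, 4],   # small class, support 10
]
n = len(confusion)

tp = [confusion[i][i] for i in range(n)]
fn = [sum(confusion[i]) - tp[i] for i in range(n)]                       # row total minus diagonal
fp = [sum(confusion[r][i] for r in range(n)) - tp[i] for i in range(n)]  # column total minus diagonal
support = [sum(row) for row in confusion]

per_class = [f1(tp[i], fp[i], fn[i]) for i in range(n)]

micro = f1(sum(tp), sum(fp), sum(fn))  # pool counts, then compute once
macro = sum(per_class) / n             # unweighted mean of per-class F1
weighted = sum(s * f for s, f in zip(support, per_class)) / sum(support)

print(round(micro, 3), round(macro, 3), round(weighted, 3))
# 0.85 0.582 0.837
```

Micro F1 here equals accuracy (102 correct out of 120) and hides the weak performance on the two small classes, which macro F1 exposes.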