Confusion Matrix, Precision, Recall, ROC
Accuracy hides failure on imbalanced data. The confusion matrix splits errors into the two kinds that matter — and precision, recall, F1, and AUC read straight off it.
What you'll learn
- The confusion matrix: TP, FP, FN, TN — the four cells every metric is built from
- accuracy = (TP+TN)/total; precision = TP/(TP+FP); recall = TP/(TP+FN)
- F1 as the harmonic mean of precision and recall, and why accuracy lies on imbalanced data
- ROC curve (TPR vs FPR) and AUC as a threshold-free ranking score
Before you start
A spam filter that calls everything “not spam” is 99% accurate when 99% of mail is real — and completely useless. A single accuracy number cannot tell you which kind of mistake the model makes. The confusion matrix can: it splits every prediction into four cells, and precision, recall, F1, and AUC all read directly off them.
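The spam-filter example can be checked in a few lines. This is a minimal sketch on a hypothetical mailbox of 1000 messages with 1% spam; the labels and the always-"not spam" predictor are made up for illustration.

```python
# Hypothetical mailbox: 1000 messages, 1% spam (label 1), 99% real mail (label 0).
y_true = [1] * 10 + [0] * 990

# A "filter" that calls every message not spam.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the spam class: fraction of actual spam that was flagged.
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = spam_caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

Accuracy comes out at 0.99 while recall on spam is 0: every error the filter makes is a miss, which the single accuracy number cannot show.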
The confusion matrix
Cross “what the model predicted” against “what was actually true.” Each of the four cells counts one outcome:
- TP — predicted positive, actually positive (a correct hit).
- FP — predicted positive, actually negative (a false alarm).
- FN — predicted negative, actually positive (a miss).
- TN — predicted negative, actually negative (a correct rejection).
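The four definitions above translate directly into a counting loop. The sketch below assumes binary labels with 1 as the positive class; the function name `confusion_counts` and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # correct hit
        elif predicted == positive:
            fp += 1   # false alarm
        elif actual == positive:
            fn += 1   # miss
        else:
            tn += 1   # correct rejection
    return tp, fp, fn, tn


# Hypothetical toy data: 4 actual positives, 4 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

The four counts always sum to the number of samples, which is a useful sanity check before computing any metric.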
The four metrics
From those four counts:
- Accuracy = (TP + TN) / total: fraction of all predictions that were correct.
- Precision = TP / (TP + FP): of everything flagged positive, how many really were. High precision means few false alarms.
- Recall (sensitivity, TPR) = TP / (TP + FN): of everything actually positive, how many you caught. High recall means few misses.
- F1 = 2 · P · R / (P + R): the harmonic mean of precision and recall; it stays low unless both are high, so it summarises the trade-off in one number.
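The four formulas can be wrapped in one helper. This is a sketch, not a reference implementation: the name `classification_metrics` is hypothetical, the counts are made up, and returning 0.0 when a denominator is zero is an assumed convention (some libraries warn or return NaN instead).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts on 100 samples.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name:<9} = {value:.3f}")
```

The zero-denominator guard matters for degenerate models: a classifier that never predicts positive has TP + FP = 0, so its precision is undefined rather than merely low.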
Moving the decision threshold changes the four cells, and therefore every metric.
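The effect of the threshold can be sketched without a slider. The scores and labels below are hypothetical, and the rule "predict positive when score >= threshold" is an assumed convention.

```python
# Hypothetical classifier scores with their true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def cells_at(threshold):
    """Return (TP, FP, FN, TN) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Sweep from strict to lenient.
for threshold in (0.8, 0.5, 0.25):
    tp, fp, fn, tn = cells_at(threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"t={threshold:.2f}  TP={tp} FP={fp} FN={fn} TN={tn}  "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, lowering the threshold raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.57: fewer misses, paid for with more false alarms.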
ROC and AUC
A classifier outputs a score; you pick a threshold to turn it into a yes/no. Sweep
that threshold from strict to lenient and plot TPR (recall, TP/(TP+FN)) on the
y-axis against FPR (FP/(FP+TN)) on the x-axis — that traced curve is the ROC
curve. AUC (area under it) is the probability that the model ranks a random
positive above a random negative: 0.5 is random guessing, 1.0 is perfect. Because
AUC is threshold-free, it is a clean single-number comparison of two models’ ranking
ability.
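Both readings of AUC, area under the traced curve and probability of ranking a positive above a negative, can be computed side by side. The sketch below uses hypothetical scores and labels; the function names are illustrative, and ties between a positive and a negative score are counted as half a win, which is a common convention assumed here.

```python
# Hypothetical classifier scores with their true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def roc_points(scores, labels):
    """(FPR, TPR) points as the threshold sweeps from strict to lenient."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(f"AUC (area under curve) = {auc_trapezoid(roc_points(scores, labels)):.4f}")
print(f"AUC (pairwise ranking) = {auc_pairwise(scores, labels):.4f}")
```

Both functions give 0.8125 on this data: 13 of the 16 positive-negative pairs are ranked correctly, and no threshold had to be chosen to get that number.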
How GATE asks this
The most dependable question is a NAT (numerical answer type) that hands you a populated confusion matrix and asks for one metric (precision, recall, F1, or accuracy) to two or three decimals. The recipe never varies: read off TP, FP, FN, TN, then plug into the ratio. The MSQ (multiple select question) variant builds the same counts from a word problem and asks which statements hold: GATE DA 2026 (Q47) gave one class 20 items and the other 10, with a handful misclassified each way, then asked you to compare the two classes’ accuracy, precision, and recall. Either way you must have the precision and recall denominators straight, and know that accuracy lies under class imbalance.
Worked example
A binary classifier on 30 samples produces TP = 8, FP = 2, FN = 6, TN = 14. Compute accuracy, precision, recall, and F1.
Check the total first: 8 + 2 + 6 + 14 = 30. Then:
accuracy = (TP + TN) / total = (8 + 14) / 30 = 22/30 ≈ 0.733
precision = TP / (TP + FP) = 8 / (8 + 2) = 8/10 = 0.800
recall = TP / (TP + FN) = 8 / (8 + 6) = 8/14 ≈ 0.571
F1 = 2·P·R / (P + R) = 2·0.8·0.571 / (0.8 + 0.571)
= 0.914 / 1.371 ≈ 0.667
So accuracy ≈ 0.733, precision = 0.80, recall ≈ 0.571, F1 ≈ 0.667. Note precision (0.80) beats recall (0.571): the model is cautious — when it says positive it is usually right, but it still misses 6 of the 14 actual positives.
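The arithmetic above can be double-checked with exact fractions, which avoids any rounding in the intermediate steps. This is a small verification sketch using Python's standard `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the worked example.
tp, fp, fn, tn = 8, 2, 6, 14
total = tp + fp + fn + tn

accuracy = Fraction(tp + tn, total)
precision = Fraction(tp, tp + fp)
recall = Fraction(tp, tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form of F1 written directly in terms of the counts.
f1_from_counts = Fraction(2 * tp, 2 * tp + fp + fn)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("F1", f1)]:
    print(f"{name:<9} = {value} ≈ {float(value):.3f}")
```

The exact values are 11/15, 4/5, 4/7 and 2/3, matching the rounded figures in the worked example. The count-based form 2·TP / (2·TP + FP + FN) gives the same F1 and skips the intermediate rounding of precision and recall, which helps when an answer is checked to three decimals.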
Quick check
Practice this in an interview
A confusion matrix tallies predictions against ground truth in a 2x2 table: true positives, true negatives, false positives, and false negatives. From those four cells every classification metric (accuracy, precision, recall, F1, specificity) can be derived. It exposes *which kind* of error a model makes, not just how often it errs.
The PR curve plots precision against recall as the decision threshold varies. On imbalanced datasets it is more informative than ROC because it ignores the large pool of true negatives that inflate ROC-AUC — a model that looks good on ROC can still have dismal precision, which PR-AUC immediately exposes. PR-AUC is the better metric whenever the positive class is rare and getting predictions right matters more than ranking.
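A quick calculation shows why the large pool of true negatives matters. The counts below are hypothetical: 10 positives among 1000 samples, with the model at one fixed threshold catching 8 positives and raising 40 false alarms.

```python
# Hypothetical imbalanced problem at one threshold: 10 positives, 990 negatives.
tp, fn = 8, 2
fp, tn = 40, 950

tpr = tp / (tp + fn)        # recall: y-axis of both ROC and PR curves' recall axis
fpr = fp / (fp + tn)        # x-axis of the ROC curve; denominator includes TN
precision = tp / (tp + fp)  # y-axis of the PR curve; TN never appears

print(f"TPR (recall) = {tpr:.3f}")
print(f"FPR          = {fpr:.3f}")
print(f"precision    = {precision:.3f}")
```

This operating point looks strong in ROC space (TPR 0.8 at an FPR of about 0.04) because the 950 true negatives keep FPR small, yet only about one in six flagged items is actually positive.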
The ROC curve plots True Positive Rate (recall) against False Positive Rate at every decision threshold. AUC — the area under that curve — equals the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one. A random classifier scores 0.5; a perfect classifier scores 1.0.
MAE, RMSE, MAPE, and R² each measure a different aspect of regression quality and each has a regime where it misleads. RMSE is dominated by outliers; MAE is robust but hides large-error tails; MAPE is undefined when the true value is zero and penalises over-prediction more heavily than under-prediction; R² can appear high even when absolute errors are large, and can be negative, yet is still commonly misread as a percentage-correct. Choosing the right metric requires knowing the cost structure of the prediction task.