Evaluating recommenders (precision@k, NDCG)
Why RMSE is the wrong target for recommender systems, and how precision@k, recall@k, MAP, and NDCG measure what users actually experience.
What you'll learn
- Why rating-prediction error (RMSE/MAE) is the wrong optimization target — recommendation is a top-k ranking problem
- How precision@k, recall@k, MAP, and NDCG@k each capture a different aspect of ranking quality
- How to set up a sound evaluation: temporal or leave-one-out splits, offline metrics, and the gap to online A/B testing
Why RMSE is the wrong target
The most natural instinct when building a recommender is to frame it as a regression problem: predict the rating a user would give each item, then sort by predicted rating. Minimize RMSE (root-mean-square error) or MAE (mean-absolute error) on held-out ratings, and call it a day.
This is a seductive idea, and it is mostly wrong.
Here is the core problem. Suppose your model predicts a 4.5 for Item A (true rating: 4.0) and a 2.0 for Item B (true rating: 1.8). Your RMSE looks great — both predictions are close. But from the user’s perspective, neither item is going to be recommended: they are both buried far down the ranked list behind dozens of items predicted above 4.5. The error on low-rated items contributes to RMSE just as much as error on top-rated items, even though the user will never see those items.
The recommendation problem is fundamentally a ranking problem. What matters is whether the items the model pushes to the top of the list are relevant — not how well it estimates ratings you will never show.
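To make the mismatch concrete, here is a minimal sketch with made-up ratings for a single user. The item ids, both sets of predictions, and the helper names `rmse` and `top_k` are illustrative assumptions, not data from a real system.

```python
import math

# Illustrative data only: one user's true ratings for five items.
true_ratings = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0}

# Model 1 is close on average but scrambles the order at the top.
model_1 = {"A": 3.9, "B": 3.8, "C": 4.0, "D": 2.0, "E": 1.0}
# Model 2 compresses every rating toward the middle but preserves the order.
model_2 = {"A": 3.6, "B": 3.4, "C": 3.0, "D": 2.9, "E": 2.8}


def rmse(predicted, actual):
    """Root-mean-square error over the items in `actual`."""
    squared = [(predicted[item] - actual[item]) ** 2 for item in actual]
    return math.sqrt(sum(squared) / len(squared))


def top_k(predicted, k):
    """Item ids sorted by predicted rating, highest first, cut at k."""
    return sorted(predicted, key=predicted.get, reverse=True)[:k]


rmse_1, rmse_2 = rmse(model_1, true_ratings), rmse(model_2, true_ratings)
top_1, top_2 = top_k(model_1, 2), top_k(model_2, 2)

print(f"model 1: RMSE={rmse_1:.3f}, top-2={top_1}")
print(f"model 2: RMSE={rmse_2:.3f}, top-2={top_2}")
```

In this toy case Model 1 wins on RMSE yet puts the 3-star item C in the first slot, while Model 2 loses on RMSE but shows the user's two favourite items in the correct order.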
The top-k frame
Fix a small integer k — the number of recommendations you will actually show. For a given user, define a set of relevant items: the items they would genuinely like, measured by their held-out interactions (purchases, clicks, high ratings).
Now the question becomes: of all the items you could have shown, did you put the right ones in those k slots?
Precision@k and Recall@k
Precision@k answers: of the k items you showed, how many were relevant?
precision@k = (number of relevant items in top k) / k
It is a number between 0 and 1. A precision@10 of 0.3 means 3 of the 10 shown items were relevant.
Recall@k answers: of all the relevant items that exist for this user, how many made it into the top k?
recall@k = (number of relevant items in top k) / (total relevant items for this user)
Precision and recall trade off against each other. Showing more items (larger k) tends to increase recall — you are casting a wider net — but can hurt precision if you are padding the list with irrelevant items. Neither metric alone tells the full story.
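The trade-off is easy to see by evaluating one ranked list at several values of k. This is a minimal sketch: the ranked list, the relevant set, and the choice to return 0.0 for a user with no relevant items are assumptions made for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the first k ranked items."""
    if not relevant:
        return 0.0  # convention chosen for this sketch
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


ranked = list("abcdefghij")      # hypothetical ranking of ten items
relevant = {"a", "c", "f", "z"}  # "z" is relevant but never ranked

for k in (1, 3, 5, 10):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    print(f"k={k:2d}  precision={p:.2f}  recall={r:.2f}")
```

As k grows from 1 to 10, precision falls from 1.00 to 0.30 while recall rises from 0.25 to 0.75. Recall cannot reach 1.0 here because one relevant item never appears in the list.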
MAP — Mean Average Precision
Average precision (AP) for a single user rewards systems that put relevant items not just somewhere in the top k, but near the top. For each relevant item in the ranked list, compute the precision at the rank where that item appears, then average those precision values over all relevant items.
MAP (mean average precision) is simply the mean of AP across all users. It is more informative than precision@k because it is sensitive to where in the list relevant items appear. Its limits are that it only handles binary relevance, and that its position weighting is a side effect of the precision values rather than an explicit model of how attention decays as users scan a list.
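A minimal sketch of AP and MAP as described above. It divides by the total number of relevant items, so a relevant item that is never ranked contributes zero. Some implementations normalize differently (for example by the number of hits, or by min(k, relevant count) for AP@k), so treat that choice, the function names, and the toy data as assumptions.

```python
def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank that holds a relevant item."""
    if not relevant:
        return 0.0  # convention chosen for this sketch
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevants):
    """Mean of AP over every user in `rankings`."""
    scores = [average_precision(rankings[u], relevants.get(u, set())) for u in rankings]
    return sum(scores) / len(scores)


# Hypothetical rankings and relevant sets for two users.
rankings = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y", "z"]}
relevants = {"u1": {"a", "c"}, "u2": {"z"}}

print(average_precision(rankings["u1"], relevants["u1"]))  # hits at ranks 1 and 3
print(average_precision(rankings["u2"], relevants["u2"]))  # single hit at rank 3
print(mean_average_precision(rankings, relevants))
```

For u1 the hits at ranks 1 and 3 give precisions of 1/1 and 2/3, so AP is their mean, 5/6. For u2 the only hit is at rank 3, so AP is 1/3.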
NDCG — Normalized Discounted Cumulative Gain
NDCG@k is one of the most widely used ranking metrics in recommender evaluation, because it directly encodes the observation that position 1 is far more valuable than position 10.
Build it up in three steps.
Step 1 — DCG (Discounted Cumulative Gain). Assign each position in the list a positional discount. Position 1 gets a discount of 1 (no penalty). Position i gets a discount of 1 / log2(i + 1). Sum the relevance of each item, weighted by its positional discount:
DCG@k = sum over i from 1 to k of: relevance(i) / log2(i + 1)
For binary relevance (relevant = 1, not relevant = 0), this becomes a sum of 1 / log2(i + 1) for each position i that contains a relevant item.
Step 2 — IDCG (Ideal DCG). Compute the DCG you would get if all relevant items were placed in the top positions — the best possible ranked list. This is the upper bound.
Step 3 — NDCG. Normalize by the ideal:
NDCG@k = DCG@k / IDCG@k
The result is in [0, 1]. An NDCG@10 of 1.0 means every relevant item was ranked ahead of every non-relevant item within the top 10. An NDCG@10 of 0.0 means no relevant item appeared in the top 10 at all.
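The logarithmic discount is the key design choice. Dropping a relevant item from position 1 to position 2 hurts your score more than dropping it from position 9 to position 10 (a loss of about 0.37 versus about 0.01 in discounted gain), because the discount falls steeply over the first few positions and flattens further down, mirroring how quickly user attention fades below the top of a list.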
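The three steps can be sketched as follows, here with graded relevance so that the relevance(i) term in the DCG formula does real work. The gain values and function names are illustrative assumptions, and `all_gains` is assumed to hold the gain of every relevant item for the user, whether or not it was ranked.

```python
import math


def dcg_at_k(gains, k):
    """Sum of gain / log2(position + 1) over the first k positions."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, all_gains, k):
    """DCG@k of the actual ranking divided by DCG@k of the ideal ordering."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items: convention chosen for this sketch
    return dcg_at_k(ranked_gains, k) / ideal


# Gains in the order the system ranked the items (0 = not relevant).
ranked_gains = [3, 0, 2, 1]
# Gains of all relevant items for this user, in any order.
all_gains = [3, 2, 1]

score = ndcg_at_k(ranked_gains, all_gains, k=4)
print(f"NDCG@4 = {score:.4f}")
```

The list loses a little score because the gain-2 item sits at rank 3 instead of rank 2. Reordering the gains to [3, 2, 1, 0] would give an NDCG@4 of 1.0.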
Beyond accuracy: coverage, diversity, novelty, serendipity
A system that only optimizes NDCG@10 can still fail its users in subtler ways.
Coverage is the fraction of the item catalog that ever appears in any recommendation list. A system with low coverage ignores the long tail — it keeps recommending the same popular items to everyone, which is wasteful and unfair to item producers.
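Catalog coverage is straightforward to compute once you have every user's recommendation list. A minimal sketch, with a hypothetical ten-item catalog and three users:

```python
def catalog_coverage(recommendation_lists, catalog):
    """Fraction of catalog items that appear in at least one user's list."""
    recommended = set()
    for items in recommendation_lists.values():
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)


catalog = [f"item_{i}" for i in range(10)]
recommendation_lists = {
    "u1": ["item_0", "item_1", "item_2"],
    "u2": ["item_0", "item_1", "item_3"],
    "u3": ["item_0", "item_2", "item_3"],
}

coverage = catalog_coverage(recommendation_lists, catalog)
print(f"coverage = {coverage:.2f}")
```

Only four distinct items are ever recommended, so coverage is 0.40 even though each individual list might score well on accuracy.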
Diversity measures how different the items within a single user’s recommendation list are from each other. If all 10 items in a carousel are action movies from the same decade, even a high-NDCG list may feel stale.
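One common way to quantify this is intra-list diversity: the average pairwise dissimilarity between items in a single list. The sketch below assumes each item has a set of genre tags and uses Jaccard distance as the dissimilarity. Both are illustrative choices, and embedding-based distances are an equally reasonable alternative.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of tags."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def intra_list_diversity(items, features):
    """Mean pairwise Jaccard distance between the items of one list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(features[x], features[y]) for x, y in pairs) / len(pairs)


# Hypothetical genre tags.
genres = {
    "m1": {"action"},
    "m2": {"action"},
    "m3": {"action", "thriller"},
    "m4": {"romance"},
    "m5": {"documentary"},
}

print(intra_list_diversity(["m1", "m2", "m3"], genres))  # mostly action
print(intra_list_diversity(["m1", "m4", "m5"], genres))  # three unrelated genres
```

The all-action list scores about 0.33, while the list of three unrelated genres scores 1.0.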
Novelty measures how unexpected the recommendations are. Recommending the top-10 globally most popular items is trivially safe but useless — the user has almost certainly seen them already. A novel recommendation surfaces something the user would not have found on their own.
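Novelty has no single agreed definition. One formulation, assumed here for illustration, scores each recommended item by its self-information, -log2(popularity), where popularity is the fraction of users who have already interacted with the item. The popularity values below are made up.

```python
import math


def mean_self_information(recommended, popularity):
    """Average of -log2(popularity) over a list; rarer items score higher."""
    return sum(-math.log2(popularity[item]) for item in recommended) / len(recommended)


# Hypothetical popularity: fraction of users who already interacted with each item.
popularity = {"blockbuster": 0.5, "cult_classic": 0.125}

print(mean_self_information(["blockbuster"], popularity))
print(mean_self_information(["cult_classic"], popularity))
print(mean_self_information(["blockbuster", "cult_classic"], popularity))
```

An item half of all users have seen carries 1 bit of novelty. An item only one user in eight has seen carries 3 bits.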
Serendipity combines novelty with relevance: a serendipitous recommendation is both unexpected and genuinely liked. It is the hardest quality to measure and the most memorable when it works.
None of these replace NDCG — they complement it. Production systems typically track a dashboard of metrics and accept small accuracy trade-offs to improve diversity or novelty.
The right evaluation setup
Leave-one-out is simple and widely used in academic benchmarks: hold out each user’s single most recent interaction and rank all items against it. Compute NDCG@k or hit rate@k across users. The weakness is that a single held-out item is a noisy signal.
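A minimal sketch of a leave-one-out split, assuming interactions arrive as (user, item, timestamp) tuples. Keeping users with a single interaction entirely in the training set is an assumption of this sketch, made so that every evaluated user has some history to learn from.

```python
def leave_one_out_split(interactions):
    """Hold out each user's most recent interaction; return (train, held_out)."""
    latest_idx = {}
    counts = {}
    for idx, (user, _item, ts) in enumerate(interactions):
        counts[user] = counts.get(user, 0) + 1
        if user not in latest_idx or ts > interactions[latest_idx[user]][2]:
            latest_idx[user] = idx
    # Only hold out an interaction if the user keeps at least one for training.
    held_idx = {idx for user, idx in latest_idx.items() if counts[user] >= 2}
    train = [row for idx, row in enumerate(interactions) if idx not in held_idx]
    held_out = {interactions[idx][0]: interactions[idx][1] for idx in held_idx}
    return train, held_out


# Hypothetical interaction log with integer timestamps.
interactions = [
    ("u1", "a", 1), ("u1", "b", 5), ("u1", "c", 3),
    ("u2", "a", 2), ("u2", "d", 4),
    ("u3", "e", 1),
]

train, held_out = leave_one_out_split(interactions)
print(held_out)
print(len(train))
```

Here u1's latest item is b and u2's is d, so those two rows are held out. u3 has only one interaction and stays in training.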
Temporal split is closer to production reality. Every interaction before a chosen cutoff date goes into training; everything after goes into evaluation. This tests the system’s ability to generalize to a future time period, which is exactly what deployment requires.
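A temporal split can be sketched in a few lines. The (user, item, date) tuple layout, the cutoff date, and the decision to place interactions that fall exactly on the cutoff into the evaluation set are assumptions of this example.

```python
from datetime import date


def temporal_split(interactions, cutoff):
    """Rows before `cutoff` go to training; rows on or after it go to evaluation."""
    train = [row for row in interactions if row[2] < cutoff]
    eval_rows = [row for row in interactions if row[2] >= cutoff]
    return train, eval_rows


# Hypothetical interaction log.
interactions = [
    ("u1", "a", date(2023, 11, 3)),
    ("u1", "b", date(2024, 1, 15)),
    ("u2", "a", date(2023, 12, 20)),
    ("u2", "c", date(2024, 2, 1)),
    ("u3", "d", date(2024, 1, 9)),
]

train, eval_rows = temporal_split(interactions, cutoff=date(2024, 1, 1))

# Users who only appear after the cutoff have no training history.
train_users = {user for user, _, _ in train}
cold_start_users = {user for user, _, _ in eval_rows} - train_users
print(len(train), len(eval_rows), cold_start_users)
```

Note that u3 appears only after the cutoff. Whether to evaluate such cold-start users separately or exclude them is a decision to make explicitly, since it can move the reported metrics.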
Offline metrics vs. online A/B testing
Offline evaluation — running metrics on historical data — is fast, reproducible, and cheap. It is how you iterate during development. But it has a fundamental limitation: the offline-online gap.
Historical data only tells you about items that were actually shown to users. You do not know what would have happened if you had shown them something different. An item buried at rank 50 in your historical logs might have been loved, but you have no signal because users never saw it. This is the exposure bias problem, and it is closely tied to popularity bias: items that were shown often accumulate more logged interactions, so offline metrics tend to favour them.
Online A/B testing is the gold standard. Show system A to half your users and system B to the other half, then measure real outcomes: click-through rate (CTR), conversion rate, revenue, session length, long-term retention. These metrics capture actual user behavior without the bias introduced by historical recommendation decisions.
The gap between offline and online metrics is real and sometimes large. Systems that look identical offline can behave very differently in production. Treat offline NDCG as a filter for ruling out bad models, and online A/B tests as the final arbiter.
Code — precision@k, recall@k, and NDCG@k
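A minimal sketch of the three metrics for binary relevance, run on one hypothetical user. The names item_Z, relevant_items, and the use of k = 5 follow the observations listed after the code. The other item names, the rank positions, and the function names are illustrative assumptions.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items found in the first k ranked items."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: each hit at rank i adds 1 / log2(i + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    # Ideal list: as many relevant items as fit, packed into the top ranks.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hits at ranks 1, 3, 4 and 7; item_Z is relevant but never ranked.
ranked_list = ["item_A", "item_B", "item_C", "item_D",
               "item_E", "item_F", "item_G", "item_H"]
relevant_items = {"item_A", "item_C", "item_D", "item_G", "item_Z"}

k = 5
print(f"precision@{k} = {precision_at_k(ranked_list, relevant_items, k):.3f}")
print(f"recall@{k}    = {recall_at_k(ranked_list, relevant_items, k):.3f}")
print(f"NDCG@{k}      = {ndcg_at_k(ranked_list, relevant_items, k):.3f}")
```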
Observe as you run:
- precision@5 counts only hits in the first 5 slots. Items beyond rank 5 do not count, even if they are relevant.
- recall@5 is limited by the total number of relevant items. With 5 relevant items and only a few in the top 5, recall will be noticeably less than 1.
- NDCG@k is highest when hits cluster near rank 1. Try mentally swapping a hit at rank 4 with a miss at rank 2 and notice how much the score would change: that is the positional discount doing its job.
- item_Z is in relevant_items but not in the ranked list at all. It counts against recall and pulls NDCG below its ideal value.
Summary
| Metric | What it measures | Position-sensitive? |
|---|---|---|
| precision@k | Fraction of shown items that are relevant | No |
| recall@k | Fraction of relevant items that were shown | No |
| MAP | Mean precision at each hit position, averaged over users | Partially |
| NDCG@k | Discounted gain: rewards earlier hits more | Yes |
Alongside accuracy metrics, track coverage (catalog breadth), diversity (within-list variety), and novelty (how surprising the recommendations are).
Use a temporal split or leave-one-out setup — never random cross-validation on sequential data. Treat offline NDCG as a development compass and online A/B test results (CTR, conversion, retention) as the production truth.
Practice this in an interview
MAP (Mean Average Precision) is the mean across queries of average precision, which approximates the area under the precision-recall curve by averaging precision at the positions where relevant items appear. NDCG (Normalized Discounted Cumulative Gain) accounts for graded relevance and position discount: a relevant item at rank 1 is worth more than one at rank 10. Use MAP when relevance is binary and every relevant result matters equally; use NDCG when items have graded relevance or when top-of-list quality is more important than tail coverage.
Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.
MAE, RMSE, MAPE, and R² each measure a different aspect of regression quality and each has a regime where it misleads. RMSE is dominated by outliers; MAE is robust but hides large-error tails; MAPE is undefined at zero and asymmetrically penalises under-prediction; R² can appear high even when absolute errors are large, and can be negative, yet is still commonly misread as a percentage-correct. Choosing the right metric requires knowing the cost structure of the prediction task.
Optimize precision when a false positive is costly — spam filters, ad targeting, legal evidence — because you'd rather miss some positives than act on wrong ones. Optimize recall when a false negative is costly — cancer screening, fraud detection, safety systems — because missing a true positive can be catastrophic. The business cost of each error type should drive the choice, not the metric itself.