What are MAP and NDCG, and when would you use each for evaluating a ranking system?

MAP (Mean Average Precision) is the mean, across queries, of average precision: the precision measured at each rank where a relevant item appears, averaged over the relevant items, which approximates the area under the precision-recall curve. NDCG (Normalized Discounted Cumulative Gain) accounts for graded relevance and applies a position discount, so a relevant item at rank 1 is worth more than one at rank 10. Use MAP when relevance is binary and every relevant result matters equally; use NDCG when items have graded relevance or when top-of-list quality matters more than tail coverage.

How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. A change should be trusted only when all three tiers agree.

What are the key regression metrics — MAE, RMSE, MAPE, R² — and what are their failure modes?

MAE, RMSE, MAPE, and R² each measure a different aspect of regression quality, and each has a regime where it misleads. RMSE is dominated by outliers; MAE is robust but hides large-error tails; MAPE is undefined when the true value is zero and penalises over-prediction more heavily than under-prediction (for non-negative forecasts an under-prediction costs at most 100%, while an over-prediction has no cap); R² can appear high even when absolute errors are large and can be negative, yet is still commonly misread as a percentage correct. Choosing the right metric requires knowing the cost structure of the prediction task.
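A few of these failure modes are easy to see on toy numbers. The sketch below is plain Python with made-up data; the helper names (`mae`, `rmse`, `mape`, `r2`) are illustrative and do not come from any particular library.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: both prediction sets have the same MAE.
y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
spread = [11.0, 11.0, 12.0, 12.0, 13.0]    # every prediction off by 1
one_big = [10.0, 12.0, 11.0, 13.0, 17.0]   # four exact, one off by 5

for name, y_pred in [("spread", spread), ("one_big", one_big)]:
    print(f"{name:8s} MAE={mae(y_true, y_pred):.3f}  "
          f"RMSE={rmse(y_true, y_pred):.3f}  R2={r2(y_true, y_pred):.3f}")
```

MAE is 1.0 for both prediction sets, so it cannot tell them apart. RMSE rises from 1.0 to about 2.24 once the error is concentrated in a single point, and R² goes negative for the same predictions. Calling `mape` with a zero among the true values raises the `ValueError` rather than dividing by zero.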

When should you optimize precision and when should you optimize recall?

Optimize precision when a false positive is costly — spam filters, ad targeting, legal evidence — because you'd rather miss some positives than act on wrong ones. Optimize recall when a false negative is costly — cancer screening, fraud detection, safety systems — because missing a true positive can be catastrophic. The business cost of each error type should drive the choice, not the metric itself.
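One way to let error costs drive the choice is to pick the decision threshold that minimises total cost on a labelled validation set. The following is a minimal sketch under that assumption; the scores, labels, candidate thresholds, and cost values are all invented for illustration.

```python
def confusion_at(scores, labels, threshold):
    """Count true positives, false positives, false negatives at a threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn

def best_threshold(scores, labels, fp_cost, fn_cost, candidates):
    """Return the candidate threshold with the lowest total error cost."""
    def total_cost(threshold):
        _, fp, fn = confusion_at(scores, labels, threshold)
        return fp_cost * fp + fn_cost * fn
    return min(candidates, key=total_cost)

# Hypothetical classifier scores and binary ground truth.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0]
candidates = [0.25, 0.50, 0.75]

# False positives ten times as costly: a precision-oriented threshold wins.
print(best_threshold(scores, labels, fp_cost=10, fn_cost=1, candidates=candidates))  # 0.75
# False negatives ten times as costly: a recall-oriented threshold wins.
print(best_threshold(scores, labels, fp_cost=1, fn_cost=10, candidates=candidates))  # 0.25
```

At 0.75 the classifier flags only two items, both correct (precision 1.0, recall 0.5); at 0.25 it catches all four positives at the price of two false alarms (recall 1.0, precision about 0.67).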

Evaluating recommenders (precision@k, NDCG) — Recommender Systems

The last lesson left us holding a recommender built entirely from clicks — and an uncomfortable question. How would we even know if it is any good? With star ratings we could at least compare our predicted stars against held-out real ones. With implicit data there are no stars to check against, and the user only ever sees the top few items we surface, never the rest of the list. So squared error on rating predictions, we suspected, is measuring the wrong thing entirely.

It is. This lesson makes that precise and then builds the metrics that measure the right thing — the quality of a short ranked list, where getting position 1 right matters far more than position 50.

Why RMSE is the wrong target

The natural first instinct is to treat recommendation as regression: predict the rating a user would give every item, sort by that prediction, and minimize RMSE (root-mean-square error) on held-out ratings. It feels rigorous. It is mostly wrong.

Here is why. Suppose your model predicts 4.5 for Item A (true 4.0) and 2.0 for Item B (true 1.8). Your RMSE looks wonderful — both are close. But neither item changes the user’s life: both are buried far down the list, behind dozens of items predicted above 4.5. The error you so carefully minimized on Item B counts exactly as much toward RMSE as error on a top item — even though the user will never lay eyes on Item B.

That is the mismatch in one sentence. Recommendation is a ranking problem, not a rating-prediction problem. What matters is whether the items you push to the top are relevant — not how faithfully you estimate ratings on items no one will see.
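A toy comparison makes the mismatch tangible. In this sketch the ratings, the two sets of predictions, and the "relevant means a true rating of at least 4.0" rule are all made up for illustration.

```python
import math

# Hypothetical held-out true ratings for one user.
true_ratings = {"A": 4.2, "B": 4.0, "C": 3.8, "D": 3.6, "E": 1.0, "F": 1.0}
relevant = {item for item, r in true_ratings.items() if r >= 4.0}  # {"A", "B"}

# Model 1: every prediction is close, but the top of the list is misordered.
close_but_misordered = {"A": 3.9, "B": 3.8, "C": 4.1, "D": 4.0, "E": 1.0, "F": 1.0}
# Model 2: predictions are far off in value, but the order is right.
far_but_well_ordered = {"A": 5.0, "B": 4.9, "C": 3.0, "D": 2.9, "E": 2.5, "F": 2.4}

def rmse(pred, truth):
    return math.sqrt(sum((pred[i] - truth[i]) ** 2 for i in truth) / len(truth))

def precision_at_k(pred, relevant, k):
    # Rank items by predicted rating, highest first, and keep the top k.
    top_k = sorted(pred, key=pred.get, reverse=True)[:k]
    return sum(1 for item in top_k if item in relevant) / k

for name, pred in [("close_but_misordered", close_but_misordered),
                   ("far_but_well_ordered", far_but_well_ordered)]:
    print(f"{name:22s} RMSE={rmse(pred, true_ratings):.3f}  "
          f"precision@2={precision_at_k(pred, relevant, 2):.1f}")
```

The first model has the far lower RMSE (about 0.25 against about 1.06) yet puts neither relevant item in its top two, while the second fills both slots correctly.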

The top-k frame

So change the frame. Fix a small integer k — the number of slots you will actually show, the size of the carousel or the result page. For each user, define a set of relevant items: the ones they genuinely liked, read off from their held-out interactions (purchases, clicks, high ratings).

Now the only question worth asking is: of everything you could have shown, did you get the right items into those k slots? Every metric below is a different way of scoring that one question.

Precision@k and Recall@k

Precision@k asks: of the k items you showed, how many were relevant?

precision@k = (number of relevant items in top k) / k

It runs from 0 to 1. A precision@10 of 0.3 means 3 of the 10 you showed landed.

Recall@k asks the complementary question: of all the relevant items that exist for this user, how many did you manage to surface in the top k?

recall@k = (number of relevant items in top k) / (total relevant items for this user)

These two pull against each other. Show more items — a bigger k — and recall tends to rise, because you are casting a wider net; but precision can fall, because you are padding the list with weaker picks. Neither number alone tells the whole story, which is exactly why we will watch both move together in the code.

MAP — Mean Average Precision

Average precision (AP) for one user rewards putting relevant items not merely somewhere in the top k, but high. For each relevant item in the ranked list, you take the precision measured at the rank where it appears, then average those across the user’s relevant items. MAP is then just the mean of AP over all users.
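As a sketch, AP and MAP can be written in a few lines. One assumption to flag: this version divides by the total number of relevant items for the user, so relevant items that never appear in the list pull AP down; some implementations divide by min(number of relevant items, k) instead. The function names and the example users are illustrative.

```python
def average_precision(ranked, relevant, k=None):
    """AP for one user: mean of precision@rank over the ranks that hold a hit."""
    if not relevant:
        return 0.0
    top = ranked if k is None else ranked[:k]
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(top, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision measured at this hit's rank
    # Assumption: normalise by all relevant items, found or not.
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevants, k=None):
    """MAP: the mean of AP over users."""
    aps = [average_precision(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0

# User 1: hits at ranks 2, 4, 5, 6; one relevant item (item_Z) never ranked.
user1_ranked = ["item_C", "item_A", "item_X", "item_B", "item_D",
                "item_F", "item_G", "item_E", "item_H", "item_I"]
user1_relevant = {"item_A", "item_B", "item_D", "item_F", "item_Z"}
# User 2: the single relevant item is ranked first.
user2_ranked = ["item_P", "item_Q", "item_R"]
user2_relevant = {"item_P"}

ap1 = average_precision(user1_ranked, user1_relevant)   # (1/2 + 2/4 + 3/5 + 4/6) / 5
ap2 = average_precision(user2_ranked, user2_relevant)   # 1.0
print(f"AP user 1 = {ap1:.3f}, AP user 2 = {ap2:.3f}")
print(f"MAP = {mean_average_precision([user1_ranked, user2_ranked], [user1_relevant, user2_relevant]):.3f}")
```

User 1's AP works out to about 0.453: each hit contributes the precision at its own rank, so the same four hits packed into ranks 1 to 4 would score higher.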

MAP is richer than plain precision@k because it notices where the hits fall. Its limitations are that it only handles binary relevance and that its sensitivity to position is implicit: it has no explicit model of how a real user scans a list, where attention drops off a cliff after the first few rows. That is exactly what the next metric adds.

NDCG — Normalized Discounted Cumulative Gain

NDCG@k is the workhorse of production ranking evaluation, because it bakes in the truth that position 1 is worth far more than position 10. Build it in three steps.

Step 1 — DCG (Discounted Cumulative Gain). Give each position a shrinking discount: position 1 gets 1 (no penalty), and position i gets 1 / log2(i + 1). Sum each item’s relevance, scaled by its position’s discount:

DCG@k = sum over i from 1 to k of: relevance(i) / log2(i + 1)

For binary relevance (relevant = 1, else 0), that is simply adding up 1 / log2(i + 1) at every position holding a hit.

Step 2 — IDCG (Ideal DCG). Compute the DCG of the best possible ordering — all relevant items packed into the top positions. This is the ceiling.

Step 3 — NDCG. Divide the real by the ideal:

NDCG@k = DCG@k / IDCG@k

The result lands in [0, 1]. An NDCG@10 of 1.0 means every relevant item sat ahead of every irrelevant one in the top 10; an NDCG@10 of 0.0 means not a single relevant item made the top 10. The logarithmic discount is the whole point: moving a hit from rank 1 to rank 2 costs you more than moving it from rank 9 to rank 10, because users pay steeply less attention as they scroll.

A ranked top-10 list with three hits (green border, ✓) and seven misses. The accent bar shrinks with each rank — the logarithmic positional discount. A hit at rank 1 contributes far more to DCG than the same hit at rank 6.
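The three steps above carry over directly to graded relevance. Here is a minimal sketch that uses the same gain form as the DCG formula above, relevance(i) / log2(i + 1); the grades (3 = loved, 1 = mildly liked) and item names are invented, and some libraries use an exponential gain of 2^relevance - 1 instead.

```python
import math

def dcg(gains):
    """Sum of gain / log2(position + 1), with positions starting at 1."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg_at_k_graded(ranked, grades, k):
    """NDCG@k where grades maps item -> relevance grade (missing items count as 0)."""
    gains = [grades.get(item, 0) for item in ranked[:k]]
    # Ideal ordering: the highest grades first, truncated to k slots.
    ideal_gains = sorted(grades.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for one user.
grades = {"a": 3, "b": 2, "c": 1}

best_order = ["a", "b", "c", "d"]    # grades already sorted high to low
worst_first = ["c", "d", "b", "a"]   # the most-loved item is buried at rank 4

print(f"best_order  NDCG@4 = {ndcg_at_k_graded(best_order, grades, 4):.3f}")
print(f"worst_first NDCG@4 = {ndcg_at_k_graded(worst_first, grades, 4):.3f}")
```

Both lists contain the same three relevant items, so precision@4 and recall@4 are identical; only NDCG separates them, scoring the first list 1.0 and the second roughly 0.69.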

Watch the metrics move

Now the payoff: one ranked list of ten items, a set of five truly relevant items, and every metric computed at both k=5 and k=10. Read the output as a story about the trade-offs, not just a table of numbers.

import numpy as np

# A ranked list of item ids (position 0 = rank 1, the top recommendation)
ranked_items = ["item_C", "item_A", "item_X", "item_B", "item_D",
                "item_F", "item_G", "item_E", "item_H", "item_I"]

# Ground truth: items the user actually liked (held-out interactions)
relevant_items = {"item_A", "item_B", "item_D", "item_F", "item_Z"}

def precision_at_k(ranked, relevant, k):
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def dcg_at_k(ranked, relevant, k):
    return sum(1.0 / np.log2(i + 1)
               for i, item in enumerate(ranked[:k], start=1) if item in relevant)

def ndcg_at_k(ranked, relevant, k):
    actual = dcg_at_k(ranked, relevant, k)
    n_rel = min(len(relevant), k)                       # ideal: relevant items first
    ideal = sum(1.0 / np.log2(i + 1) for i in range(1, n_rel + 1))
    return actual / ideal if ideal > 0 else 0.0

for k in [5, 10]:
    p = precision_at_k(ranked_items, relevant_items, k)
    r = recall_at_k(ranked_items, relevant_items, k)
    n = ndcg_at_k(ranked_items, relevant_items, k)
    print(f"k={k:2d}  precision@k={p:.3f}  recall@k={r:.3f}  NDCG@k={n:.3f}")

print("\nHits in ranked list:")
for i, item in enumerate(ranked_items, start=1):
    marker = "<-- HIT" if item in relevant_items else ""
    print(f"  rank {i:2d}: {item:8s}  discount={1.0/np.log2(i+1):.3f}  {marker}")

k= 5  precision@k=0.600  recall@k=0.600  NDCG@k=0.491
k=10  precision@k=0.400  recall@k=0.800  NDCG@k=0.612

Hits in ranked list:
  rank  1: item_C    discount=1.000
  rank  2: item_A    discount=0.631  <-- HIT
  rank  3: item_X    discount=0.500
  rank  4: item_B    discount=0.431  <-- HIT
  rank  5: item_D    discount=0.387  <-- HIT
  rank  6: item_F    discount=0.356  <-- HIT
  rank  7: item_G    discount=0.333
  rank  8: item_E    discount=0.315
  rank  9: item_H    discount=0.301
  rank 10: item_I    discount=0.289

Three lessons hide in those six numbers. First, the precision/recall tug-of-war: going from k=5 to k=10, precision falls (0.600 → 0.400, because slots 6–10 added only one hit while the denominator doubled) yet recall rises (0.600 → 0.800, a wider net catches more). Second, the ceiling on recall: it stops at 0.800 and can never reach 1.0, because item_Z is genuinely relevant but absent from the list entirely, and no value of k can surface an item that was never ranked. Third, the position penalty: NDCG@5 is only 0.491 even though three of five slots are hits, partly because those hits sit at ranks 2, 4, and 5 rather than 1, 2, 3, and partly because the ideal list fills all five slots with relevant items. Push the same three hits up to ranks 1, 2, 3 and NDCG@5 would climb to about 0.723, which is the logarithmic discount doing its job; it could reach 1.0 only if all five slots held hits.

Beyond accuracy: coverage, diversity, novelty, serendipity

A system that maxes out NDCG@10 can still quietly fail its users.

Coverage is the fraction of the catalog that ever shows up in any recommendation. Low coverage means the long tail is ignored — the same popular items recommended to everyone, wasteful and unfair to item producers.

Diversity measures how different the items within one user’s list are from each other. Ten action movies from the same decade can score a high NDCG and still feel monotonous.

Novelty measures how unexpected a recommendation is. Serving the globally top-10 most popular titles is safe and useless — the user has already seen them. Novelty surfaces things they would not have stumbled on alone.

Serendipity is novelty plus relevance: unexpected and genuinely loved. It is the hardest to measure and the most magical when it lands — and, you may remember, exactly the quality content-based filtering could never deliver.

None of these replace NDCG; they guard its blind spots. Real systems track a dashboard and willingly trade a sliver of accuracy for more diversity or novelty.
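There is no single agreed formula for these dashboard metrics, so treat the following as one reasonable sketch rather than a standard. It assumes coverage is measured over a fixed catalog, diversity is the mean pairwise Jaccard distance between hypothetical tag sets, and novelty is the mean self-information, -log2(popularity), where popularity is the share of users who have interacted with an item. Serendipity is left out because it needs a definition of "expected" that depends on the system.

```python
import math
from itertools import combinations

def catalog_coverage(recommendations, catalog):
    """Fraction of the catalog that appears in at least one user's list."""
    shown = set().union(*recommendations.values())
    return len(shown & catalog) / len(catalog)

def intra_list_diversity(items, tags):
    """Mean pairwise Jaccard distance between the items' tag sets."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    def jaccard_distance(a, b):
        union = tags[a] | tags[b]
        return 1.0 - len(tags[a] & tags[b]) / len(union) if union else 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def novelty(items, popularity):
    """Mean self-information of the list; rarer items score higher."""
    if not items:
        return 0.0
    return sum(-math.log2(popularity[i]) for i in items) / len(items)

# Hypothetical catalog, recommendation lists, tags, and popularity shares.
catalog = {"m1", "m2", "m3", "m4", "m5"}
recommendations = {"u1": ["m1", "m2"], "u2": ["m1", "m3"]}
tags = {"m1": {"action"}, "m2": {"action"}, "m3": {"drama", "romance"}}
popularity = {"m1": 0.5, "m2": 0.25, "m3": 0.125}

print("coverage:", catalog_coverage(recommendations, catalog))              # 0.6
for user, items in recommendations.items():
    print(user, "diversity:", intra_list_diversity(items, tags),
          "novelty:", novelty(items, popularity))
```

In this toy data u1's list is two action titles (diversity 0.0, novelty 1.5 bits), while u2's mixes genres and includes a rarer item (diversity 1.0, novelty 2.0 bits).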

Setting up the evaluation honestly

Don't do this

Two traps that quietly invalidate your numbers.

Trap 1 — Optimizing RMSE instead of ranking quality. You can drive RMSE to near zero while your ranked lists stay terrible, because error on low-rated items counts as much as error on the items you would actually show. Always report at least one ranking metric (NDCG@k, precision@k) alongside or instead of RMSE.

Trap 2 — Random cross-validation on sequential behavior. Splitting interactions at random leaks the future into training. If a user watched A, B, C, D, E in order and your split tests on B and D while training on A, C, E, the model has literally seen the future. Use a temporal split (everything before time T trains, everything after tests) or leave-one-out (each user’s most recent interaction is the test item).

Leave-one-out is simple and common in academic benchmarks: hold out each user’s single most recent interaction, rank all items against it, and average NDCG@k or hit-rate@k across users. Its weakness is that one held-out item is a noisy target.

Temporal split is closer to production reality: everything before a cutoff date trains, everything after is evaluated. It tests generalization to a future period — exactly what deployment demands.
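Both splits are short to write down. This sketch assumes interactions arrive as (user, item, timestamp) tuples with comparable timestamps; that layout, the function names, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def temporal_split(interactions, cutoff):
    """Train on everything strictly before the cutoff, test on the rest."""
    train = [x for x in interactions if x[2] < cutoff]
    test = [x for x in interactions if x[2] >= cutoff]
    return train, test

def leave_one_out_split(interactions):
    """Hold out each user's most recent interaction as that user's test item."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, test = [], []
    for user, events in by_user.items():
        events.sort()                      # oldest first
        *history, last = events            # a user with one event has no history
        train.extend((user, item, ts) for ts, item in history)
        test.append((user, last[1], last[0]))
    return train, test

# Hypothetical interaction log: (user, item, timestamp).
interactions = [
    ("u1", "A", 1), ("u1", "B", 2), ("u1", "C", 3),
    ("u2", "A", 2), ("u2", "D", 5),
]

train, test = temporal_split(interactions, cutoff=3)
print("temporal test:", test)        # [('u1', 'C', 3), ('u2', 'D', 5)]

loo_train, loo_test = leave_one_out_split(interactions)
print("leave-one-out test:", loo_test)   # [('u1', 'C', 3), ('u2', 'D', 5)]
```

The two splits happen to agree on this tiny log. On real data they differ: a temporal split can leave some users with no test items at all, while leave-one-out gives every user exactly one, which is part of why its per-user signal is noisy.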

Offline metrics vs. online A/B testing

Offline evaluation — metrics on historical logs — is fast, cheap, and reproducible, and it is how you iterate during development. But it carries one deep limitation: the offline-online gap.

Your logs only record items that were actually shown. You cannot know what would have happened had you shown something else. An item buried at rank 50 in the history might have been adored, but no one saw it, so there is no signal. This is exposure bias: the data you evaluate on was itself shaped by past recommendation decisions. It often compounds with popularity bias, because items that were already popular were shown more and so collected more feedback.

Online A/B testing is the gold standard. Show system A to half your users and B to the other half, then measure real outcomes — click-through rate, conversion, revenue, session length, long-term retention. Those capture genuine behavior, free of the historical bias baked into offline logs. The gap between offline and online is real and sometimes large: systems that look identical offline can diverge sharply in production. Treat offline NDCG as a filter for ruling bad models out, and online A/B tests as the final word.
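The mechanics of the split itself can be sketched briefly. A common approach, assumed here rather than taken from this lesson, is to hash the user id so each user lands in the same arm on every visit, then compare an outcome such as click-through rate per arm. The experiment name, the event format, and the function names are hypothetical, and a real test would also need a significance check before declaring a winner.

```python
import hashlib

def assign_variant(user_id, experiment="ranker_test"):
    """Deterministically assign a user to arm A or B (roughly 50/50)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def ctr_by_variant(events, experiment="ranker_test"):
    """events: iterable of (user_id, was_clicked) impressions. Returns CTR per arm."""
    shown = {"A": 0, "B": 0}
    clicked = {"A": 0, "B": 0}
    for user_id, was_clicked in events:
        arm = assign_variant(user_id, experiment)
        shown[arm] += 1
        clicked[arm] += int(was_clicked)
    return {arm: (clicked[arm] / shown[arm] if shown[arm] else 0.0) for arm in shown}

# Hypothetical impression log: every third impression is clicked.
events = [(f"user_{i}", i % 3 == 0) for i in range(1000)]
print(ctr_by_variant(events))
```

Salting the hash with the experiment name keeps assignments independent across experiments, so a user who was in arm A of one test is not systematically in arm A of the next.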

The four metrics side by side

Metric	What it measures	Position-sensitive?
`precision@k`	Fraction of shown items that are relevant	No
`recall@k`	Fraction of relevant items that were shown	No
MAP	Mean precision at each hit position, averaged over users	Partially
`NDCG@k`	Discounted gain — rewards earlier hits more	Yes

Alongside these accuracy metrics, track coverage, diversity, and novelty; split your data temporally or leave-one-out, never randomly; and let online A/B results have the last word over offline NDCG.

In one breath

Recommendation is a top-k ranking problem, not a rating-prediction problem, so RMSE is the wrong target: precision@k scores how many of the shown items were relevant and recall@k how many of the relevant items were shown (the two trade off as k grows), MAP and especially NDCG@k reward putting hits near the top (NDCG through an explicit logarithmic position discount), and all of them must be computed on a temporal or leave-one-out split and ultimately deferred to online A/B tests because offline logs carry exposure bias.

Practice

Before the quiz, work the trade-off by hand. In the output, precision dropped from 0.600 at k=5 to 0.400 at k=10 while recall rose from 0.600 to 0.800. Explain in one sentence why those two moved in opposite directions. Then the harder one: NDCG@5 was only 0.491 despite three hits in five slots — if you could move just one of those hits to rank 1, would NDCG rise a little or a lot, and why?

Quick check

Q1. A recommender system achieves a very low RMSE on a held-out rating set, but its NDCG@10 is mediocre. The most likely explanation is:

Q2. A team splits user interaction data randomly into 80% train and 20% test, then reports strong NDCG@10. Why might this overestimate real-world performance?

Q3. A video platform launches two recommender variants in an A/B test. Variant A has higher NDCG@10 in offline evaluation. Variant B has higher 7-day retention online. Which variant should they ship, and what does this illustrate?

A question to carry forward

Look back at everything we just measured. Precision, recall, NDCG, the temporal split, the leave-one-out hold-out — every one of them quietly assumed the same thing: that the user has a history to hold out, and the items have interactions to rank. The whole apparatus runs on a past.

But the most important moments in a recommender’s life are the ones with no past. A visitor signs up this second and has rated nothing. A product is added to the catalog this morning and no one has touched it. Collaborative filtering has no neighbors to find, matrix factorization has no rows to factor, and every metric in this lesson has nothing to compute. So the question to carry forward is the one that decides whether a new user ever comes back, or a new item ever gets discovered: how do you recommend when you have no history at all? That is the cold-start problem, and it is the next lesson.
