How would you design a metric to evaluate the relevance of a content recommendation feed?
Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the primary online metric; an explicit user-satisfaction signal (thumbs-up/down, survey) for periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. A result is trusted only when all three tiers move in the same direction.
How to think about it
The relevance-measurement challenge
Unlike search (query + result pair, explicit relevance judgment possible), a recommendation feed has no explicit query. Relevance must be inferred. Two failure modes: (1) optimising on shallow engagement signals (clickbait maximises clicks but destroys satisfaction); (2) optimising solely on satisfaction surveys (too sparse, too slow for A/B testing).
Tier 1 — Online behavioural metrics (primary A/B signal)
Behavioural signals are available for every impression and update in near-real-time.
| Signal | Definition | What it adds beyond a raw click |
|---|---|---|
| Long dwell rate | % of impressions where dwell time exceeds 10 s | Filters accidental taps |
| Save / bookmark rate | User explicitly saved the item | Deliberate positive signal |
| Share rate | User shared to a friend | High-intent positive signal |
| Scroll-past rate at rank 1 | User skipped the top item | Mild negative signal |
| Explicit negative feedback rate | “Not interested” clicks | Strong negative signal |
Primary composite: a weighted sum, e.g., (long-dwell + 3 × save + 5 × share - 2 × not-interested), normalised by total impressions. Weights are calibrated to correlate with user-satisfaction survey scores from Tier 2.
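As a minimal sketch of the composite, assuming per-variant counts are already aggregated, the weighted sum could be computed as follows. The names `FeedCounts`, `WEIGHTS`, and `composite_score` are hypothetical, and the weights are only the illustrative values from the example formula, not calibrated ones.

```python
from dataclasses import dataclass

# Illustrative weights copied from the example formula in the text.
# In practice they would be calibrated against Tier 2 survey scores.
WEIGHTS = {"long_dwell": 1.0, "save": 3.0, "share": 5.0, "not_interested": -2.0}


@dataclass
class FeedCounts:
    """Aggregated event counts for one experiment variant (hypothetical schema)."""
    impressions: int
    long_dwell: int       # impressions with dwell time above 10 s
    save: int
    share: int
    not_interested: int


def composite_score(c: FeedCounts) -> float:
    """Weighted sum of signal counts, normalised by total impressions."""
    if c.impressions == 0:
        return 0.0
    weighted = (
        WEIGHTS["long_dwell"] * c.long_dwell
        + WEIGHTS["save"] * c.save
        + WEIGHTS["share"] * c.share
        + WEIGHTS["not_interested"] * c.not_interested
    )
    return weighted / c.impressions


if __name__ == "__main__":
    control = FeedCounts(impressions=1000, long_dwell=300, save=40, share=10, not_interested=20)
    print(composite_score(control))
```

Normalising by impressions keeps the score comparable between variants that receive different amounts of traffic.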
Tier 2 — Explicit satisfaction signal (periodic validation)
Send a brief in-app survey to a random 1 % of users: “Were the recommendations relevant today?” (5-point scale). Treat this as the closest available proxy for ground truth, and use it to check that changes in the Tier 1 composite are directionally consistent with changes in reported satisfaction.
Tier 3 — Offline ranking metric (model development)
Use historical data: items a user eventually engaged with (long dwell, save, share) within a session are treated as relevant. Compute NDCG@10 against this pseudo-label on a held-out week. This allows rapid model iteration without running a full A/B test for every candidate.
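A minimal sketch of NDCG@10 with binary pseudo-labels follows. It assumes each ranked list contains unique item IDs and that the ideal ranking places every relevant item first; the function names and item IDs are hypothetical.

```python
from math import log2


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k positions (rank 1 has discount log2(2))."""
    return sum(g / log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """NDCG@k with binary relevance: 1 if the item is in the pseudo-label set, else 0."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    # Ideal ranking: all relevant items (up to k of them) at the top.
    ideal = dcg_at_k([1.0] * min(len(relevant_items), k), k)
    if ideal == 0:
        return 0.0  # no relevant items in this session
    return dcg_at_k(gains, k) / ideal


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]   # model's ranking for one session
    engaged = {"b", "d"}            # items with long dwell, save, or share
    print(ndcg_at_k(ranked, engaged, k=10))
```

The per-session values would then be averaged over the held-out week to give the single offline number used to compare models.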
Worked example. New ranking model improves NDCG@10 from 0.61 to 0.67 offline. A/B test (10 % of users, 2 weeks): long-dwell rate +5.2 %, not-interested rate -2.1 %. Satisfaction survey validation: average score 3.8 vs 3.5 baseline. All three tiers agree — ship.
Guardrails to include: a diversity metric (fraction of feed items from sources the user has not seen before) and a recency metric (median age of recommended items), so that the model does not collapse into a narrow filter bubble or drift toward stale content.
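The two guardrails can be sketched as simple per-feed computations. This assumes each feed item carries a source ID and a publication timestamp; the function names and the choice of hours as the age unit are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def novel_source_fraction(feed_sources, seen_sources):
    """Fraction of feed items whose source the user has not seen before."""
    if not feed_sources:
        return 0.0
    novel = sum(1 for s in feed_sources if s not in seen_sources)
    return novel / len(feed_sources)


def median_item_age_hours(published_at, now):
    """Median age, in hours, of the recommended items at serving time."""
    if not published_at:
        raise ValueError("feed is empty")
    ages = [(now - t).total_seconds() / 3600 for t in published_at]
    return median(ages)


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0)
    feed_sources = ["news_a", "blog_b", "news_a", "video_c"]
    seen = {"news_a"}
    published = [now - timedelta(hours=h) for h in (1, 3, 10, 48)]
    print(novel_source_fraction(feed_sources, seen))
    print(median_item_age_hours(published, now))
```

In an A/B test these would typically be tracked alongside the primary composite, with a launch blocked if either moves past an agreed threshold.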