datarekha
Case & Behavioral · Hard · Asked at Meta, TikTok, Netflix, LinkedIn, Pinterest

How would you design a metric to evaluate the relevance of a content recommendation feed?

The short answer

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.

How to think about it

The relevance-measurement challenge

Unlike search (query + result pair, explicit relevance judgment possible), a recommendation feed has no explicit query. Relevance must be inferred. Two failure modes: (1) optimising on shallow engagement signals (clickbait maximises clicks but destroys satisfaction); (2) optimising solely on satisfaction surveys (too sparse, too slow for A/B testing).

Tier 1 — Online behavioural metrics (primary A/B signal)

Behavioural signals are available for every impression and update in near-real-time.

| Signal | Definition | Why it’s better than raw click |
| --- | --- | --- |
| Long dwell rate | % of impressions where dwell time exceeds 10 s | Filters accidental taps |
| Save / bookmark rate | User explicitly saved the item | Deliberate positive signal |
| Share rate | User shared to a friend | High-intent positive signal |
| Scroll-past rate at rank 1 | User skipped the top item | Mild negative signal |
| Explicit negative feedback rate | “Not interested” clicks | Strong negative signal |

Primary composite: a weighted sum, e.g., (long-dwell + 3*save + 5*share - 2*not-interested), normalised by total impressions. Weights are calibrated to correlate with user-satisfaction survey scores from Tier 2.
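
A minimal sketch of that composite in Python may make the normalisation concrete. The weights (1, 3, 5, -2) are the illustrative ones from the formula above; the function name, the signal keys, and the example counts are hypothetical and not taken from any real logging schema.

```python
# Illustrative weights from the example formula; in practice they would be
# calibrated against Tier 2 survey scores rather than fixed by hand.
WEIGHTS = {
    "long_dwell": 1.0,
    "save": 3.0,
    "share": 5.0,
    "not_interested": -2.0,
}


def composite_relevance(counts: dict[str, int], impressions: int) -> float:
    """Weighted sum of signal counts, normalised by total impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    weighted = sum(w * counts.get(name, 0) for name, w in WEIGHTS.items())
    return weighted / impressions


# Hypothetical counts for one experiment arm over 1,000 impressions.
arm_counts = {"long_dwell": 300, "save": 40, "share": 10, "not_interested": 20}
score = composite_relevance(arm_counts, impressions=1000)
# (300 + 3*40 + 5*10 - 2*20) / 1000 = 0.43
```

Because the score is per impression, arms with different traffic volumes remain comparable.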

Tier 2 — Explicit satisfaction signal (periodic validation)

Send a brief in-app survey to a random 1 % of users: “Were the recommendations relevant today?” (5-point scale). This is the closest available proxy for ground truth; use it to validate that Tier 1 composite changes are directionally consistent, i.e. that the survey score moves the same way as the composite.
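
One way to check directional consistency is to compare the mean survey score between arms and confirm its sign matches the Tier 1 change. The sketch below assumes a normal approximation for the interval, which is only reasonable with a large number of responses; the function names and the sample scores are hypothetical.

```python
import math
from statistics import mean, stdev


def survey_delta(control_scores, treatment_scores, z=1.96):
    """Difference in mean survey score with an approximate 95% interval.

    Uses a normal approximation with unpooled variances (an assumption;
    small samples would call for a t-based interval instead).
    """
    diff = mean(treatment_scores) - mean(control_scores)
    se = math.sqrt(
        stdev(control_scores) ** 2 / len(control_scores)
        + stdev(treatment_scores) ** 2 / len(treatment_scores)
    )
    return diff, diff - z * se, diff + z * se


def directionally_consistent(tier1_delta: float, survey_diff: float) -> bool:
    """True when the Tier 1 composite and the survey moved the same way."""
    return tier1_delta * survey_diff > 0


# Hypothetical 5-point responses; a real survey would have far more.
control = [3, 4, 3, 4, 3, 4]
treatment = [4, 5, 4, 5, 4, 5]
diff, low, high = survey_delta(control, treatment)
consistent = directionally_consistent(tier1_delta=0.02, survey_diff=diff)
```

If the interval straddles zero, the survey neither confirms nor contradicts the Tier 1 movement, and a longer collection window is probably needed.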

Tier 3 — Offline ranking metric (model development)

Use historical data: items a user eventually engaged with (long dwell, save, share) within a session are treated as relevant. Compute NDCG@10 against this pseudo-label on a held-out week. Used for rapid model iteration without running a full A/B test.
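
As a rough illustration, NDCG@10 with binary pseudo-labels can be computed as below. This is a sketch that assumes each item is either relevant (the user eventually engaged with it) or not; the function names and the item IDs are made up for the example.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_item_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance taken from engagement pseudo-labels."""
    gains = [1.0 if item in relevant_ids else 0.0 for item in ranked_item_ids]
    # Ideal ranking puts every relevant item (up to k of them) at the top.
    ideal = dcg_at_k([1.0] * min(len(relevant_ids), k), k)
    if ideal == 0:
        return 0.0  # no relevant items for this user/session
    return dcg_at_k(gains, k) / ideal


# Hypothetical session: the model ranked "b" first, but only "a" was engaged with.
perfect = ndcg_at_k(["a", "b", "c"], {"a"})
one_slot_late = ndcg_at_k(["b", "a", "c"], {"a"})
```

The per-session values would then be averaged over the held-out week to give the single offline number used for model comparison.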

Worked example. New ranking model improves NDCG@10 from 0.61 to 0.67 offline. A/B test (10 % of users, 2 weeks): long-dwell rate +5.2 %, not-interested rate -2.1 %. Satisfaction survey validation: average score 3.8 vs 3.5 baseline. All three tiers agree — ship.
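
The arithmetic behind the worked example can be sketched as follows. The agreement check here looks only at the sign of each change, which is a simplifying assumption; a real ship decision would also require statistical significance. Function names are hypothetical.

```python
def relative_lift(baseline: float, treatment: float) -> float:
    """Relative change of treatment over baseline."""
    return (treatment - baseline) / baseline


def tiers_agree(offline_delta, long_dwell_delta, not_interested_delta, survey_delta):
    """Sign-only agreement: positives up, negative feedback down."""
    return (
        offline_delta > 0
        and long_dwell_delta > 0
        and not_interested_delta < 0
        and survey_delta > 0
    )


# Numbers from the worked example.
offline_lift = relative_lift(0.61, 0.67)  # roughly +9.8% relative NDCG@10
survey_lift = relative_lift(3.5, 3.8)     # roughly +8.6% relative survey score
ship = tiers_agree(
    offline_delta=0.67 - 0.61,
    long_dwell_delta=5.2,        # percent change in long-dwell rate
    not_interested_delta=-2.1,   # percent change in not-interested rate
    survey_delta=3.8 - 3.5,
)
```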

Guardrails to include: a diversity metric (fraction of feed items from sources the user has not seen before) and a recency metric (median age of recommended items), so that the model does not collapse into a narrow filter bubble.
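
Both guardrails are simple to compute per feed. The sketch below assumes each feed item carries a source identifier and a publish timestamp; the function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def novel_source_fraction(feed_sources, seen_sources):
    """Diversity: share of feed items from sources the user has not seen before."""
    if not feed_sources:
        return 0.0
    novel = sum(1 for source in feed_sources if source not in seen_sources)
    return novel / len(feed_sources)


def median_item_age_hours(published_times, now):
    """Recency: median age, in hours, of the recommended items."""
    return median((now - t).total_seconds() / 3600 for t in published_times)


# Hypothetical feed of four items and three publish times.
diversity = novel_source_fraction(["a", "b", "c", "d"], seen_sources={"a", "b"})
now = datetime(2024, 1, 2)
recency = median_item_age_hours(
    [now - timedelta(hours=h) for h in (1, 5, 9)], now
)
```

In an A/B test these would be tracked alongside the primary composite, with a launch blocked if either degrades past an agreed threshold.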
