How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the primary online metric; an explicit user-satisfaction signal (thumbs-up/down, survey) for periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The system can be trusted only when the three tiers move in the same direction; if they disagree, investigate before acting on any one of them.

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.
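The offline relevance signal can be made concrete with a small calculation. The sketch below computes NDCG@k for one query, assuming graded human labels from 0 to 3 and the linear-gain form of DCG (some teams use an exponential gain instead); the labels are invented for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded labels."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(position + 2).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the ranking as shown, divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels for one query, in the order the results were shown.
shown = [3, 0, 2, 1]
print(round(ndcg_at_k(shown, k=4), 3))
```

A perfectly ordered list scores 1.0, and a list with no relevant results scores 0.0, which makes the number comparable across queries.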

How do you choose between batch and real-time inference for a model?

Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.
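The hybrid pattern can be sketched in a few lines. This is an illustration only: the `precomputed_scores` table stands in for the output of a scheduled batch job, and the function name, boost value, and item names are hypothetical.

```python
# Offline (batch) step: heavy scores are computed on a schedule and stored.
# The values are made up; in practice they would come from a nightly job.
precomputed_scores = {
    "user_1": {"itemA": 0.91, "itemB": 0.75, "itemC": 0.40},
}

def rerank_online(user_id, in_stock, boost_items, boost=0.2):
    """Online step: cheap filtering and boosting on top of stored scores."""
    candidates = precomputed_scores.get(user_id, {})
    adjusted = {
        item: score + (boost if item in boost_items else 0.0)
        for item, score in candidates.items()
        if item in in_stock  # drop items that became unavailable since the batch run
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)

print(rerank_online("user_1", in_stock={"itemB", "itemC"}, boost_items={"itemC"}))
```

The expensive model never runs in the request path; the online step only reads stored scores and applies fresh context, which keeps latency low.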

What is the difference between retrieval and reranking in a RAG pipeline?

Retrieval cheaply searches a large corpus and returns a candidate set, prioritizing recall. Reranking applies a more expensive query-document model to that small set and improves precision and ordering at the top. A reranker cannot recover relevant documents absent from the retrieved candidates, so evaluate first-stage recall separately.
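A toy two-stage pipeline shows why first-stage recall needs its own measurement. Both scoring functions below are stand-ins chosen for brevity (term overlap for retrieval, Jaccard similarity in place of a real reranking model), and the corpus and relevance labels are invented.

```python
corpus = {
    "d1": "reset password account login",
    "d2": "password strength tips",
    "d3": "recover account access without email",
    "d4": "shipping rates table",
}

def retrieve(query_terms, k):
    """Cheap first stage: rank by number of shared terms, keep the top k."""
    scored = [(len(query_terms & set(text.split())), doc_id)
              for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in scored[:k] if overlap > 0]

def rerank(query_terms, candidates):
    """Stand-in for an expensive model: Jaccard similarity of term sets."""
    def jaccard(doc_id):
        terms = set(corpus[doc_id].split())
        return len(query_terms & terms) / len(query_terms | terms)
    return sorted(candidates, key=jaccard, reverse=True)

def recall(retrieved, relevant):
    """Fraction of relevant documents present in the retrieved set."""
    return len(set(retrieved) & relevant) / len(relevant)

query = {"reset", "account", "password"}
relevant = {"d1", "d3"}            # hypothetical human labels
candidates = retrieve(query, k=2)
reranked = rerank(query, candidates)
print(candidates, reranked, recall(candidates, relevant))
```

Here `d3` is relevant but falls outside the top-2 candidates, so no reranker can bring it back; only a larger `k` or a better retriever can.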

Why recommenders matter — Recommender Systems

Time-series forecasting predicted the next value of one sequence over time. This section flips the problem sideways: not “what comes next,” but “what would this specific person love, out of millions of items they’ve never seen?” It’s a different shape of prediction — matching people to items by taste — and it powers a huge share of the revenue at Netflix, Spotify, and Amazon.

The problem: too much, too fast

Modern platforms carry catalogs that no individual could ever browse. Netflix hosts tens of thousands of titles. Spotify has over 100 million tracks. Amazon lists hundreds of millions of products. A first-time visitor faces a wall of noise.

This is the information overload problem: the catalog is so large that unguided browsing fails. Users leave without finding something they would have loved — and the platform loses engagement, retention, and revenue all at once.

A recommender system (also called a recommendation engine) is the layer that translates raw catalog size into personal relevance. Its job is to predict which unseen items a specific user would value, then surface them in a ranked list.

The long tail: why personalization is an economic force

In any large catalog, a small number of blockbuster items attract most of the attention. Call these the head. Below them sits a very long list of niche items — obscure albums, indie films, specialist tools — each with tiny individual demand. Collectively, though, the long tail can dwarf the head.

Popularity typically follows a heavy-tailed, roughly power-law curve. Blockbusters dominate individually, but the long tail contains most of the catalog and a large share of the latent demand.
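To get a feel for the head-versus-tail split, the sketch below assumes demand at popularity rank r is proportional to 1 / r**s (a Zipf-like curve). The exponent, catalog size, and head size are all illustrative assumptions, not measurements from any platform.

```python
def head_share(n_items, head_size, s=1.0):
    """Share of total demand captured by the head_size most popular items."""
    weights = [1 / rank**s for rank in range(1, n_items + 1)]
    return sum(weights[:head_size]) / sum(weights)

share = head_share(n_items=100_000, head_size=100)
print(f"head: {share:.1%}, tail: {1 - share:.1%}")
```

Under these assumptions the 100 biggest items take less than half of total demand. A steeper curve (larger `s`) shifts demand toward the head, so whether the tail really outweighs the head depends on the catalog.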

Without personalization, platforms default to promoting the head — the same blockbusters everyone already knows. The tail stays dark. With personalization, a user who loves post-punk jazz fusion gets surfaced exactly that, generating a stream or a sale that a popularity-only system would have missed entirely.

This is why recommendation systems are not a nice-to-have feature. A large share of Netflix viewing and a large share of Amazon purchases are driven by recommendations. Those shares include items that many users would not have discovered on their own, and revenue that would not exist without them.

The core task: predict, then rank

Formally, a recommender system solves one problem:

Given a user and a large set of items they have not yet interacted with, estimate a preference score for each item, then return a ranked list of the top-k items.

The input signals vary: explicit ratings (stars, thumbs up/down), implicit feedback (clicks, playtime, purchases, skips), item metadata (genre, cast, price), and contextual signals (time of day, device, location).

Users and items feed into the recommender, which scores every unseen item for that user and returns the top-k as a personalized ranked list.
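The "predict, then rank" step can be sketched as follows. The scores are placeholders for whatever model produces them, and `recommend_top_k` is a hypothetical helper name.

```python
import heapq

def recommend_top_k(scores, seen, k):
    """Return the k highest-scoring items the user has not interacted with."""
    unseen = ((score, item) for item, score in scores.items() if item not in seen)
    return [item for score, item in heapq.nlargest(k, unseen)]

# Hypothetical predicted preference scores for one user.
scores = {"itemA": 0.9, "itemB": 0.2, "itemC": 0.7, "itemD": 0.8}
print(recommend_top_k(scores, seen={"itemA"}, k=2))
```

`heapq.nlargest` avoids sorting the whole catalog when only a handful of slots are shown, which matters once the item set is large.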

Three families of recommenders

The field has converged on three main approaches. You will learn each in depth; here is the map.

Content-based filtering

Idea: recommend items similar to ones the user already liked, using item features.

If you watched three sci-fi films set in space, a content-based system recommends more films with the same genre, director style, or thematic tags. It needs no information about other users — only the items themselves and the target user’s history.

Strength: works for a user with only a small history, and for new items that have features but no interactions yet; highly interpretable.
Weakness: tends to recommend more of the same, limiting discovery.
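A minimal content-based sketch, assuming each film is described by three hand-made binary features. The feature set, film names, and the choice of cosine similarity against an averaged user profile are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical features per film: [sci-fi, space setting, romance]
item_features = {
    "film1": [1, 1, 0],
    "film2": [1, 1, 0],
    "film3": [1, 0, 0],
    "film4": [0, 0, 1],
    "film5": [1, 1, 1],
}
liked = ["film1", "film2"]

# User profile: the average feature vector of the films the user liked.
profile = [sum(col) / len(liked) for col in zip(*(item_features[i] for i in liked))]

# Rank every film the user has not liked yet by similarity to the profile.
ranked = sorted(
    (item for item in item_features if item not in liked),
    key=lambda item: cosine(profile, item_features[item]),
    reverse=True,
)
print(ranked)
```

No other user appears anywhere in the computation, which is both why this works with little data and why it keeps recommending close variations of what the user already watched.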

Collaborative filtering

Idea: find users with similar taste patterns and recommend what they liked.

If users A and B both rated the same ten niche albums highly, and B recently loved an eleventh that A has not heard, the system recommends that eleventh album to A. The system never looks at item features — it works entirely from the pattern of interactions across users.

Strength: surfaces genuine surprises across genre boundaries; needs no item metadata.
Weakness: struggles with new users and new items that have no interaction history yet (the cold-start problem).
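The A-and-B example can be sketched as user-based collaborative filtering. This is a simplified illustration: the ratings are invented, and similarity is plain cosine over rating vectors with unrated items treated as 0, which real systems usually refine (for example by mean-centering).

```python
import math

# Hypothetical ratings; a missing entry means the user has not rated that album.
ratings = {
    "A": {"album1": 5, "album2": 4, "album3": 5},
    "B": {"album1": 5, "album2": 5, "album3": 4, "album4": 5},
    "C": {"album1": 1, "album2": 2, "album5": 5},
}

def similarity(u, v):
    """Cosine similarity between two users, treating unrated items as 0."""
    shared = set(ratings[u]) & set(ratings[v])
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in ratings[u].values()))
    norm_v = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (norm_u * norm_v)

def recommend(target):
    """Score each unseen item by similarity-weighted ratings from other users."""
    scores = {}
    for other in ratings:
        if other == target:
            continue
        sim = similarity(target, other)
        for item, rating in ratings[other].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("A"))
```

Album features never enter the computation. A brand-new user would have an empty rating dictionary and therefore no meaningful neighbours, which is the cold-start weakness in miniature.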

Hybrid systems

Real production systems (Netflix, Spotify, YouTube) blend both approaches. A hybrid might use content signals to handle cold start and collaborative patterns to improve long-tail discovery. A well-tuned hybrid often outperforms either approach alone.
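One simple way to blend the two signals is a weight that shifts from content to collaborative as an item accumulates interactions. This is only one of many possible hybrid designs; the function name and the `ramp` threshold are hypothetical.

```python
def hybrid_score(content_score, collab_score, n_interactions, ramp=20):
    """Blend two scores, leaning on content until enough interactions exist."""
    # weight = 0 means pure content; weight = 1 means pure collaborative.
    weight = min(n_interactions / ramp, 1.0)
    return (1 - weight) * content_score + weight * collab_score

print(hybrid_score(content_score=0.8, collab_score=0.4, n_interactions=10))
```

A brand-new item is scored from its features alone, so it can be recommended before anyone has interacted with it.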

Baseline: what does a naive recommender look like?

Before building anything complex, it is standard practice to check a popularity baseline: simply recommend the most-interacted-with items to everyone. It is fast, easy to implement, and surprisingly hard to beat on aggregate metrics.

import pandas as pd

interactions = pd.DataFrame({
    "user": ["alice","alice","bob","bob","carol","carol","carol","dave"],
    "item":  ["itemA","itemB","itemB","itemC","itemA","itemC","itemD","itemD"],
    "rating": [5, 4, 5, 3, 4, 5, 2, 4],
})

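# Count interactions per item and average their ratings.
popularity = (
    interactions
    .groupby("item")["rating"]
    .agg(count="count", mean_rating="mean")
    # Every item here has the same count, so break ties by mean rating
    # to keep the ordering deterministic.
    .sort_values(["count", "mean_rating"], ascending=False)
)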

print("Popularity baseline (most interactions first):")
print(popularity)
print()
print("Top recommendation for every user: itemA or itemB")
print("(same list regardless of individual taste)")

Popularity baseline (most interactions first):
       count  mean_rating
item                     
itemA      2          4.5
itemB      2          4.5
itemC      2          4.0
itemD      2          3.0

Top recommendation for every user: itemA or itemB
(same list regardless of individual taste)

The output shows the fundamental flaw: every user gets the same list, ranked only by aggregate statistics. In this toy data every item has exactly two interactions, so raw count alone cannot even separate them. itemD, which dave rated 4, sits at the bottom and may never surface for alice, even though she might enjoy it. The baseline knows what is popular; it knows nothing about who alice is.

In one breath

Recommender systems exist because catalogs are too large to browse, and personalization unlocks the enormous latent demand sitting in the long tail — the niche items a popularity-only system leaves dark. The core task is always the same: given a user and millions of unseen items, predict a preference score for each, then rank and return the top-k (and ranking matters as much as the score — the user only sees the top slots). Three families do this: content-based (recommend items similar to ones you liked, using item features — handles new users, but echoes your past), collaborative filtering (find users with similar taste and borrow their likes — surprising cross-genre discovery, but cold-starts badly), and hybrid (blend both — what production systems actually ship). The popularity baseline — recommend the most-interacted items to everyone — is your first benchmark, and a humbling one to beat.

Practice

Quick check

Q1. A streaming platform recommends the same top-20 most-watched shows to every new user. Which problem does this most directly fail to address?

Q2. Alice has rated 50 films. A recommender finds other users whose ratings match Alice's closely and suggests films those users loved that Alice has not seen. Which family does this belong to?

Q3. A music app launches in a country where it has no historical listening data yet. Which approach is most likely to degrade the least in this situation?

A question to carry forward

Look again at the data the popularity baseline chewed on: a little table of (user, item, rating) rows. Every recommender in this section — content-based, collaborative, matrix factorization, neural — ultimately reads from that same raw material. But a flat list of interactions isn’t quite the right shape to reason about “who has rated what.” Reshape it into a grid — users down the rows, items across the columns, ratings in the cells — and the whole problem suddenly becomes visual: most of the grid is empty, and recommending is literally filling in the blanks.

So the question to carry forward is: what does that grid look like, why is it almost entirely missing, and why is its emptiness both the central challenge and the central opportunity of recsys? The next lesson is the utility matrix — the one data structure that every recommender, from the simplest to the most neural, is secretly trying to complete.
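As a small preview, the reshaping described above is a single pivot in pandas. The sketch reuses the same toy interactions as the popularity baseline; the variable names `utility` and `empty_fraction` are illustrative.

```python
import pandas as pd

interactions = pd.DataFrame({
    "user": ["alice", "alice", "bob", "bob", "carol", "carol", "carol", "dave"],
    "item": ["itemA", "itemB", "itemB", "itemC", "itemA", "itemC", "itemD", "itemD"],
    "rating": [5, 4, 5, 3, 4, 5, 2, 4],
})

# Users down the rows, items across the columns, ratings in the cells.
# Cells with no interaction become NaN.
utility = interactions.pivot(index="user", columns="item", values="rating")

# Fraction of the grid that is empty.
empty_fraction = utility.isna().to_numpy().mean()

print(utility)
print(f"empty cells: {empty_fraction:.0%}")
```

Even in this tiny example half of the grid is missing; with millions of users and items the empty share is far higher.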
