How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.
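
The offline tier can be made concrete with a small example. What follows is a minimal sketch of NDCG@k, assuming graded relevance labels derived from historical engagement. The grading scheme (2, 1, 0) and the linear-gain form of DCG are illustrative assumptions, not something the answer above prescribes.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the ranking as served, divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades, in the order the feed showed the items
# (2 = saved or shared, 1 = long dwell, 0 = skipped).
served = [1, 0, 2, 0, 1]
print(f"NDCG@5 = {ndcg_at_k(served, 5):.3f}")
```

A ranking that already places the highest-graded items first scores 1.0; anything below that measures how far the served order is from the ideal one.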

What is hybrid search and why is it often better than pure vector search?

Hybrid search combines dense vector similarity with sparse keyword search such as BM25, then fuses the rankings. Dense retrieval captures semantic meaning while keyword search nails exact terms, identifiers, and rare tokens, so combining them improves recall and precision over either alone.

What is hybrid search and when should you use semantic vs keyword retrieval?

Keyword search (BM25) excels at exact term matching: product codes, proper nouns, rare abbreviations. Semantic search (dense embeddings) captures meaning and handles paraphrases. Hybrid search runs both in parallel and merges the two ranked lists, commonly with Reciprocal Rank Fusion, which combines rank positions rather than raw scores. That gives the best of both worlds for most production RAG systems.
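
To show how the fusion step might look, here is a minimal sketch of Reciprocal Rank Fusion. It assumes each retriever returns an ordered list of document ids; the ids are made up, and k = 60 is a commonly used default rather than a value taken from the text.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the two retrievers for one query.
bm25_ranking = ["doc_sku42", "doc_a", "doc_b"]
dense_ranking = ["doc_a", "doc_c", "doc_sku42"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because only rank positions enter the formula, the BM25 and embedding scores never need to be put on a common scale.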

How does RLHF work and what problem does it solve?

RLHF (Reinforcement Learning from Human Feedback) aligns a language model's outputs to human preferences by training a reward model on ranked human comparisons, then using that reward signal to fine-tune the policy with reinforcement learning. It solves the gap between a model that is good at next-token prediction and a model that is genuinely helpful, harmless, and honest.
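
The reward-model step can be illustrated with the pairwise loss that is a common choice for learning from comparisons. This is a sketch under the assumption of a Bradley-Terry style objective; the answer above does not say which loss is used, and the reward values below are invented.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# Hypothetical reward-model outputs for two responses to the same prompt.
print(pairwise_preference_loss(2.0, 0.0))  # preferred response scored higher: small loss
print(pairwise_preference_loss(0.0, 2.0))  # preferred response scored lower: large loss
```

Minimising this loss pushes the reward model to score the human-preferred response above the rejected one; the resulting scalar reward is what the reinforcement-learning step then optimises.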

Hybrid & neural recommenders — Recommender Systems

The last lesson ended with a discomfort. Every cure for cold start was a patch bolted onto collaborative filtering from the outside — swap in content when CF goes blind, fall back to popularity, explore with a bandit — a committee of single-purpose models held together with routing rules. We asked whether we could do better: could a single model absorb every signal at once — collaborative history, item content, user context, sequence — and learn its own internal blend, so that cold and warm start become two ends of one continuum rather than two code paths?

The answer is yes, and it is what powers recommendation at the largest scale on Earth. YouTube serves over 800 million videos across billions of sessions a day; Netflix personalizes for 200-plus million subscribers. No single classical CF model survives that scale and that cold-start pressure. This final lesson of the section is how real systems get there — by hybridizing deliberately, and by going neural.

Why one model is never enough

Start by naming the structural weakness in each technique we have built.

Collaborative filtering knows nothing about item content. It cannot touch a brand-new item until users rate it (the item cold-start we just met), and it is blind to genre, cast, or description.
Content-based filtering treats every user as an island. It can only echo what you already liked, and it throws away the enormous signal that millions of other users carry.
Matrix factorization is a powerful CF variant, but its latent factors still need enough observed ratings to stabilize, and they cannot fold in side information — demographics, metadata, context — without surgery.

Each one’s blind spot is another one’s strength. That observation points to two moves: hybridize, combining methods so each covers the others’ gaps, and go neural, replacing hand-crafted similarity with learned representations that can swallow every signal at once.

Hybrid recommenders

A hybrid recommender combines two or more models or signal types. Four strategies dominate.

1. Weighted blend. The simplest: score with several models and average them.

score(u, i) = α · score_CF(u, i)  +  (1 − α) · score_CB(u, i)

The weight α can be a fixed, tuned hyperparameter — or itself a learned function of context (time of day, session length, device). It works well when both components are well-calibrated.

2. Switching by context. Use CF when you have data, content-based when you do not. A user who just signed up gets content recommendations; once they cross fifty interactions, switch to CF. The switch can be a threshold or a small classifier on session features. (This is exactly the cold-start routing from the last lesson, named.)

3. Feature combination. Inject content features directly into the CF model. In matrix-factorization terms, augment the user or item embedding with hand-crafted features before training. The SVD++ variant from the Netflix Prize was an early example: it adds an implicit-feedback term for the set of items a user rated, regardless of the rating value.

4. Stacking / cascade. Use one model to draw up a shortlist, then a second, more expensive model to re-rank it. This is more than a hybrid trick — it is the dominant production pattern, and we will give it its own section below.
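
The first two strategies fit in a few lines. The sketch below assumes both component scores are already calibrated to a common scale, and treats "crossing fifty interactions" as "at least fifty". The ramped variant is an added illustration of a context-dependent α, not something the lesson specifies.

```python
def blended_score(score_cf, score_cb, alpha):
    """Weighted blend: alpha weights CF, (1 - alpha) weights content-based."""
    return alpha * score_cf + (1 - alpha) * score_cb

def switching_alpha(n_interactions, threshold=50):
    """Hard switch: pure content-based below the threshold, pure CF from it onward."""
    return 1.0 if n_interactions >= threshold else 0.0

def ramped_alpha(n_interactions, threshold=50):
    """Soft alternative (an assumption): shift linearly from content-based to CF."""
    return min(1.0, n_interactions / threshold)

# Hypothetical calibrated scores for one user-item pair.
score_cf, score_cb = 0.9, 0.4
for n in (0, 10, 25, 50):
    hard = blended_score(score_cf, score_cb, switching_alpha(n))
    soft = blended_score(score_cf, score_cb, ramped_alpha(n))
    print(f"{n:>3} interactions  switch={hard:.2f}  ramp={soft:.2f}")
```

Seen this way, switching is the special case of blending in which α jumps between 0 and 1.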

Going neural

Around 2016 the field pivoted from hand-designed similarity functions to end-to-end learned models. The core move: replace explicit dot products over hand-crafted vectors with embeddings — dense, low-dimensional representations of users and items learned directly from interaction data (and, optionally, from content).

Embeddings. An embedding is a vector of real numbers — typically 32 to 512 dimensions — learned jointly with the rest of the model. Two users with similar taste land close together in embedding space. Unlike plain matrix-factorization factors, neural embeddings can be conditioned on context, updated incrementally, and composed with other learned modules.

Neural Collaborative Filtering (NCF). Classic MF predicts with a dot product:

score(u, i) = embedding_user(u) · embedding_item(i)

NCF (He et al., 2017) swaps that dot product for a small multi-layer perceptron (MLP):

score(u, i) = MLP( [embedding_user(u) ; embedding_item(i)] )

The bracket means concatenation. An MLP can learn interaction patterns a bare dot product cannot — crossed features, non-linear taste. The NeuMF variant runs both paths, a dot-product path and an MLP path, and merges them before a final sigmoid.
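
The difference between the two scoring functions is easiest to see side by side. The following is a forward-pass sketch only: the parameters are randomly initialised stand-ins for trained ones, and the layer sizes, ReLU activation, and single hidden layer are illustrative assumptions rather than the configuration from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim, hidden = 4, 5, 8, 16

# Random parameters stand in for trained ones (assumption).
user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))
W1 = rng.normal(size=(2 * dim, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden) * 0.1
b2 = 0.0

def mf_score(u, i):
    """Classic MF: a plain dot product of the two embeddings."""
    return float(user_emb[u] @ item_emb[i])

def ncf_score(u, i):
    """NCF-style: concatenate the embeddings, pass them through an MLP, apply a sigmoid."""
    x = np.concatenate([user_emb[u], item_emb[i]])
    h = np.maximum(0.0, x @ W1 + b1)      # hidden layer with ReLU
    logit = h @ w2 + b2
    return float(1.0 / (1.0 + np.exp(-logit)))

print("MF score: ", mf_score(0, 1))
print("NCF score:", ncf_score(0, 1))
```

Note that the MLP needs the user and item embeddings together as one input, which is exactly the property that makes this kind of scorer costly to apply to every item in a large catalog.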

The two-tower architecture

The most influential neural design for large-scale retrieval is the two-tower model (or dual-encoder). The name is the architecture: a user tower and an item tower that independently encode their inputs into one shared embedding space, with the relevance score being the dot product of the two outputs.

Two-tower architecture. The user tower and item tower are trained jointly. At serving time the item tower runs offline to pre-compute all item embeddings; only the user tower runs online. Retrieval becomes a nearest-neighbour search in embedding space.

The engineering payoff hides in one word: independent. Because neither tower depends on the other during encoding, you can run the item tower offline once and pre-compute an embedding for every item in the catalog, stored in a vector index. At query time you run only the user tower (one fast forward pass) and then find the nearest item embeddings with approximate nearest neighbor (ANN) search, for example an HNSW graph index or a library such as FAISS. That buys millisecond-scale retrieval over hundreds of millions of items, which is out of reach for a model that needs each user-item pair as joint input, since it would have to run one forward pass per item per request. The model is trained with a contrastive loss: pull the user embedding toward items they engaged with, push it away from randomly sampled negatives.
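
A toy version of the offline/online split and of the training objective may help. Everything here is an assumption made for illustration: each tower is a single random linear layer with L2 normalisation, the features are random, exact brute-force search stands in for a real ANN index, and the loss is one common contrastive form (softmax over one positive and sampled negatives).

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, item_feat_dim, user_feat_dim, emb_dim = 1000, 12, 6, 16

# Stand-in tower weights; a real system would learn these jointly.
W_item = rng.normal(size=(item_feat_dim, emb_dim))
W_user = rng.normal(size=(user_feat_dim, emb_dim))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def item_tower(item_features):
    return l2_normalize(item_features @ W_item)

def user_tower(user_features):
    return l2_normalize(user_features @ W_user)

# Offline: encode the whole catalog once and keep the result as the "index".
item_features = rng.normal(size=(n_items, item_feat_dim))
item_index = item_tower(item_features)            # shape (n_items, emb_dim)

def retrieve(user_features, k=5):
    """Online: one user-tower pass, then top-k by dot product (exact, not ANN)."""
    query = user_tower(user_features)
    scores = item_index @ query
    top_k = np.argpartition(-scores, k - 1)[:k]   # unordered top k
    top_k = top_k[np.argsort(-scores[top_k])]     # order them by score
    return top_k, scores[top_k]

def contrastive_loss(user_vec, pos_item_vec, neg_item_vecs):
    """Negative log-softmax of the positive item against sampled negatives."""
    logits = np.concatenate([[user_vec @ pos_item_vec], neg_item_vecs @ user_vec])
    logits = logits - logits.max()                # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))

user_features = rng.normal(size=user_feat_dim)
top_ids, top_scores = retrieve(user_features, k=5)
print(top_ids, np.round(top_scores, 3))
```

The item tower is never called inside `retrieve`; that is the independence property at work. In a production setting the brute-force matrix product would be replaced by a lookup in an ANN index.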

Sequence and session models

Taste shifts within a session. Twenty minutes into cooking videos, your next recommendation should reflect that moment, not your six-month average. Sequential recommenders model the ordered list of recent interactions as a sequence, and the dominant architecture is a transformer over recent item embeddings — attention lets the model lean on the most contextually relevant past items when predicting the next. BERT4Rec and SASRec are the canonical examples; Amazon and Pinterest run production variants of the idea.
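
The attention idea can be sketched with a single attention step over a session. This is a deliberately stripped-down illustration, not SASRec or BERT4Rec: it uses one head, random stand-in weights, and no positional encodings, feed-forward layers, or normalisation, all of which the real architectures include.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 50, 8

# Random stand-ins for learned item embeddings and projection matrices.
item_emb = rng.normal(size=(n_items, dim))
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) * 0.3 for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_item_scores(session_item_ids):
    """Attend from the most recent item over the whole session, then score the catalog."""
    seq = item_emb[session_item_ids]               # (T, dim), oldest first
    query = seq[-1] @ W_q                          # the latest interaction asks the question
    keys, values = seq @ W_k, seq @ W_v
    weights = softmax(keys @ query / np.sqrt(dim)) # one attention weight per past item
    session_vec = weights @ values                 # weighted summary of the session
    return item_emb @ session_vec, weights

# Hypothetical session: ids of the last four items the user interacted with.
scores, weights = next_item_scores([3, 17, 17, 42])
print("attention over session:", np.round(weights, 3))
print("top next-item candidates:", np.argsort(-scores)[:5])
```

The attention weights are what let the session summary lean on some past items more than others, instead of averaging the whole history equally.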

The industrial two-stage funnel

Large systems almost always split recommendation into at least two stages.

Stage 1 — Candidate generation (retrieval). A cheap model narrows millions of items down to a few hundred or thousand. Speed is everything; quality is secondary. The two-tower model with ANN search is the canonical choice; matrix factorization with pre-computed item vectors also works and is simpler to maintain.

Stage 2 — Ranking. A far more expensive model — a deep network with hundreds of features, including cross-user-item interactions, freshness, diversity penalties, and business rules — scores only the retrieved candidates and produces the final order.

This funnel appears in YouTube’s recommender (Covington, Adams, Sargin, 2016), Pinterest’s Pinnability, Spotify’s session personalization, and effectively every large deployment. As systems grow it often stretches to three or four stages (retrieval → pre-ranking → ranking → re-ranking and business rules).
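
The funnel itself is simple to express in code. The sketch below assumes pre-computed item vectors, a hypothetical freshness feature, and a toy ranking formula that merely stands in for a heavy ranking model; none of these specifics come from the systems named above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 16

item_vecs = rng.normal(size=(n_items, dim))   # pre-computed item embeddings
item_freshness = rng.uniform(size=n_items)    # hypothetical ranking-only feature
user_vec = rng.normal(size=dim)

def retrieve(user_vec, k):
    """Stage 1: one cheap dot product per item, keep the k best."""
    scores = item_vecs @ user_vec
    return np.argsort(-scores)[:k]

def rank(user_vec, candidate_ids):
    """Stage 2: a costlier scorer (toy stand-in) that only sees the candidates."""
    cand = item_vecs[candidate_ids]
    base = cand @ user_vec
    cross = np.tanh(cand * user_vec).sum(axis=1)   # toy user-item cross feature
    final = base + 0.5 * cross + 0.3 * item_freshness[candidate_ids]
    return candidate_ids[np.argsort(-final)]

candidates = retrieve(user_vec, k=200)        # 10,000 items down to 200
final_order = rank(user_vec, candidates)      # reorder only those 200
print(final_order[:10])
```

The ranking function touches 200 rows instead of 10,000, which is the whole point: its per-item cost can grow a great deal before it matters.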

A score in embedding space

Underneath all the architecture, the scoring step at retrieval is the same dot-product-in-embedding-space idea we built in matrix factorization — just with learned towers producing the vectors. Here it is in miniature: one user embedding against three item embeddings.

import numpy as np

# Simulated 6-dim embeddings (in production, 128-512 dims, learned by the towers)
user_embedding = np.array([0.8, -0.3,  0.5,  0.1,  0.9, -0.2])
item_a = np.array([ 0.7, -0.2,  0.6,  0.0,  0.8, -0.1])  # very similar taste
item_b = np.array([-0.5,  0.8, -0.4,  0.3, -0.6,  0.9])  # very different taste
item_c = np.array([ 0.2,  0.1,  0.1,  0.5,  0.3,  0.0])  # weakly related

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {
    "item_a (similar)":    cosine(user_embedding, item_a),
    "item_b (different)":  cosine(user_embedding, item_b),
    "item_c (weak match)": cosine(user_embedding, item_c),
}

print("Relevance scores (cosine similarity in embedding space):\n")
for name, score in sorted(scores.items(), key=lambda x: -x[1]):
    bar = "█" * int(max(0, score) * 30)
    print(f"  {name:<28} {score:+.3f}  {bar}")

print("\nHigher score → retrieved first in ANN search.")

Relevance scores (cosine similarity in embedding space):

  item_a (similar)             +0.986  █████████████████████████████
  item_c (weak match)          +0.583  █████████████████
  item_b (different)           -0.742

Higher score → retrieved first in ANN search.

The aligned item (item_a, +0.986) is what an ANN index would return first; the weak match trails at +0.583; and the genuinely different item lands at −0.742 with no bar at all, pushed to the far end of the list. Swap these toy six-dimensional vectors for 256-dimensional embeddings emitted by two trained towers, and this same operation (one dot product per candidate, which equals cosine similarity once the vectors are normalized to unit length) is what serves billions of retrievals a day.
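
The listing above computes cosine similarity, while the surrounding text speaks of dot products. The two coincide when embeddings are scaled to unit length, which is one common way retrieval systems store them. A small check, reusing the same toy vectors (the helper name `l2_normalize` is introduced here for illustration):

```python
import numpy as np

user = np.array([0.8, -0.3, 0.5, 0.1, 0.9, -0.2])
items = np.array([
    [ 0.7, -0.2,  0.6, 0.0,  0.8, -0.1],   # item_a
    [-0.5,  0.8, -0.4, 0.3, -0.6,  0.9],   # item_b
    [ 0.2,  0.1,  0.1, 0.5,  0.3,  0.0],   # item_c
])

def l2_normalize(x):
    """Scale each vector to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity computed directly...
cosine_scores = items @ user / (np.linalg.norm(items, axis=1) * np.linalg.norm(user))
# ...and a plain dot product on vectors normalised ahead of time.
dot_on_unit = l2_normalize(items) @ l2_normalize(user)

print(np.round(cosine_scores, 3))
print(np.round(dot_on_unit, 3))
```

Both lines print the same three values as the listing, so normalising once at indexing time lets a dot-product index return cosine rankings at no extra query cost.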

A note on LLM-era recommenders

Large language models have added two patterns. First, LLMs as feature encoders: pass an item’s description through a pretrained language model to get a rich text embedding, and feed that into the item tower instead of a learned ID embedding. This directly eases cold start for new items that have metadata but no interactions. Second, LLMs as rankers or explainers: a language model re-ranks a short list of candidates by reasoning over a user’s history in natural language, or generates a why-you’re-seeing-this explanation. This second pattern is experimental and expensive, but it points toward more transparent, context-aware systems. Neither pattern repeals the two-stage funnel, because the retrieval bottleneck is a compute constraint, not a modeling one.

In one breath

No single technique covers every case, so production recommenders hybridize (blend, switch, combine features, or cascade) and go neural (learned embeddings, an MLP in place of the dot product, and above all the two-tower model whose independent towers let item embeddings be precomputed for millisecond-scale ANN retrieval). All of it is wired into a two-stage funnel where cheap retrieval feeds expensive ranking, because no expensive model can score millions of items in time.

Practice

Before the quiz, reason about the two-tower trick. The architecture’s whole scaling advantage comes from the towers being independent — no cross-tower interaction during encoding. Explain why that independence is exactly what lets you precompute every item embedding offline, and what you would lose if you instead fed each user-item pair jointly into one network for a richer score. Then connect it to the funnel: why is that richer joint model welcome in Stage 2 but forbidden in Stage 1?

Quick check


Q1. Why does the two-tower architecture enable fast retrieval over millions of items when a joint user-item model cannot?

Q2. A startup has 5 000 users and 2 000 products with sparse ratings. They want to improve recommendations. Which step is most likely to help first?

Q3. A content platform adds 10 000 new videos every day. Which hybridization strategy best addresses the item cold-start problem for these new videos?

A question to carry forward

Look at what this section has actually given you. From a raw matrix of who-liked-what, you can now build a content-based recommender, a collaborative one, a matrix-factorization model, a confidence-weighted implicit system, and a two-tower neural retriever feeding a two-stage funnel. You can measure them with NDCG and survive a cold start. You can, in short, build the model.

But every architecture in this lesson quietly assumed something we have never once discussed: that the model is running. That an item tower is recomputing embeddings on a schedule. That the user tower answers in under a hundred milliseconds, every time, for billions of requests. That when the world shifts and yesterday’s embeddings go stale, something notices and retrains. None of that is modeling — it is operations, and it is precisely where most real ML projects succeed or quietly die. So the question to carry forward, out of recommenders and into the next section, is the one a notebook never has to answer: how do you take a model that works on your machine and turn it into a system that keeps working in production — served, monitored, versioned, and retrained? That is MLOps, and it begins with the real machine-learning lifecycle.
