How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.

How does RLHF work and what problem does it solve?

RLHF (Reinforcement Learning from Human Feedback) aligns a language model's outputs to human preferences by training a reward model on ranked human comparisons, then using that reward signal to fine-tune the policy with reinforcement learning. It solves the gap between a model that is good at next-token prediction and a model that is genuinely helpful, harmless, and honest.

How do you explain a technical result or model to non-technical stakeholders?

The best communicators translate outputs into decisions, not equations. Lead with the business implication, use an analogy for the mechanism, and reserve technical detail for an appendix or follow-up. Calibrate depth to the audience in the room, not to what you find interesting.

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.

Implicit vs Explicit Feedback — Recommender Systems

The last lesson ended on a sharp question. Our matrix-factorization objective minimized squared error against star ratings — numbers a user typed on purpose — yet the production examples, ALS and the implicit library, were all built for clicks and plays. We slid from one to the other as if they were the same data. They are not. A five-star rating and a one-star rating are both honest signal; a click only says “looked,” and a non-click says almost nothing at all. So we asked: when your only signal is what people clicked, how do you train a model on data that has no honest “no”?

This lesson answers it. And the answer is not a patch on the rating model — it is a different way of seeing the data, because implicit feedback has a different shape.

Two very different kinds of signal

Every recommender runs on feedback: some measure of how much a user liked an item. But the two kinds of feedback could hardly be more different.

Explicit feedback is what you ask for directly: star ratings, a thumbs up or down, a written review, a “save to favorites.” The signal is clean and unambiguous: five stars is a plain statement of love. The trouble is that almost nobody gives it. Across streaming, e-commerce, and news, typically only a small fraction of users ever rate anything (figures around 1% are often quoted), and the ones who do are a self-selected sliver who may look nothing like your real audience.

Implicit feedback is everything you observe without asking — clicks, streams, purchases, dwell time, search queries, scroll depth, add-to-cart events. It pours in automatically, at enormous scale, for every user. It is abundant where explicit data is scarce.

But abundance comes with a catch, and the catch is the whole lesson: implicit data is far noisier, and noisy in a very particular, asymmetric way.

The asymmetry: you only ever see “yes”

With explicit ratings you get the full range — a user can tell you they loved a thing or hated it. With implicit data you observe only the interactions that happened. You know exactly when someone clicked. You have no idea why they did not click on everything else.

And “did not click” is desperately ambiguous. Picture three users who all left item X untouched:

One saw it and genuinely disliked it.
One never saw it — the interface never surfaced it.
One saw it but was in the wrong moment — wrong device, wrong time of day, distracted.

Only the first is a true negative. The other two simply mean the item is undiscovered, not unwanted. This is the missing-data problem, and it is the cardinal trap of implicit feedback.

The reframing: preference, weighted by confidence

So how do you use the zeros without lying about them? The answer behind many production systems, including the widely cited implicit ALS of Hu, Koren, and Volinsky (2008), is to stop pretending the count is a rating and split it into two separate ideas.

Preference is binary, and it answers one question: did the user interact with this item at all? A single click makes preference 1. No interaction makes it 0. That is your best guess at whether they like it.

Confidence is a weight that answers a second question: how much do you trust that guess? One click is flimsy evidence. Twenty replays of the same song is overwhelming evidence. The standard formula ties confidence to the raw count:

confidence(u, i) = 1 + alpha * count(u, i)

where count(u, i) is how many times user u touched item i, and alpha is a scaling hyperparameter (the original paper used 40). The crucial + 1 means even an item with zero interactions keeps a small baseline confidence: its preference is 0, but you hold that guess only weakly, because silence is thin evidence of dislike. Items touched many times earn high confidence, so the model works hardest to get those predictions right.

That single move lets you keep every entry of the matrix, zeros included, without ever claiming a certainty you do not have.
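
For reference, the same split can be written compactly. The notation below follows the usual convention from Hu, Koren, and Volinsky rather than anything defined in this lesson: r_ui stands for count(u, i), x_u and y_i are the latent factor vectors for user u and item i, and lambda is an assumed L2 regularization strength. Read it as a sketch of how the confidence weights enter the loss, not as a full derivation.

```latex
% Preference and confidence derived from the raw count r_{ui} = count(u, i)
p_{ui} =
  \begin{cases}
    1 & \text{if } r_{ui} > 0 \\
    0 & \text{if } r_{ui} = 0
  \end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}

% Confidence-weighted least squares over every (user, item) cell, zeros included
\min_{x_{*},\, y_{*}} \;
  \sum_{u,i} c_{ui} \bigl( p_{ui} - x_u^{\top} y_i \bigr)^{2}
  + \lambda \Bigl( \sum_{u} \lVert x_u \rVert^{2} + \sum_{i} \lVert y_i \rVert^{2} \Bigr)
```

Every cell appears in the sum, but a zero-count cell contributes with weight 1 while a cell with count 8 and alpha = 40 contributes with weight 321.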

From counts to two matrices

Here is the split, drawn out before we run it.

Figure: A count matrix splits into a binary preference matrix and a confidence matrix. High counts produce high confidence; zeros yield the baseline confidence of 1.0.

Read user A’s three plays of item2: preference becomes 1, confidence becomes 1 + 40×3 = 121. User B’s eight plays of item3: preference 1, confidence 1 + 40×8 = 321. And every untouched cell stays preference 0 with the baseline confidence of 1.0 — present in the training, but barely weighted, never a hard “no.”

Building the two matrices

import numpy as np

# Interaction count matrix: rows = users, cols = items
counts = np.array([
    [0, 3, 0, 1],   # user A
    [1, 0, 8, 0],   # user B
    [0, 0, 2, 5],   # user C
    [4, 1, 0, 0],   # user D
])

alpha = 40  # confidence scaling factor (a tuned hyperparameter)

# Preference: binary — did the user interact at all?
preference = (counts > 0).astype(np.float32)

# Confidence: 1 + alpha * count.  A zero count gives baseline 1.0, not 0.
confidence = 1 + alpha * counts

print("=== Preference Matrix (0 or 1) ===")
print(preference)
print("\n=== Confidence Matrix (1 + alpha * count) ===")
print(confidence)
print("\n--- Interpretation ---")
print(f"User A, item2: count={counts[0,1]}, preference={preference[0,1]}, confidence={confidence[0,1]}")
print(f"User B, item3: count={counts[1,2]}, preference={preference[1,2]}, confidence={confidence[1,2]}")
print(f"User A, item1: count={counts[0,0]}, preference={preference[0,0]}, confidence={confidence[0,0]}")

=== Preference Matrix (0 or 1) ===
[[0. 1. 0. 1.]
 [1. 0. 1. 0.]
 [0. 0. 1. 1.]
 [1. 1. 0. 0.]]

=== Confidence Matrix (1 + alpha * count) ===
[[  1 121   1  41]
 [ 41   1 321   1]
 [  1   1  81 201]
 [161  41   1   1]]

--- Interpretation ---
User A, item2: count=3, preference=1.0, confidence=121
User B, item3: count=8, preference=1.0, confidence=321
User A, item1: count=0, preference=0.0, confidence=1

The preference matrix flattens every count to a 0 or a 1 — it remembers whether, not how much. The confidence matrix remembers how sure: user B’s eight plays of item3 tower at 321, while every zero sits quietly at the baseline 1. The model that learns from this pair — implicit ALS — minimizes a confidence-weighted least-squares objective: a cell’s pull on the latent factors is scaled by its confidence, so the 321s shout and the 1s barely whisper. That is how it learns from the whole matrix without ever mistaking silence for rejection.
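
To make the confidence-weighted objective concrete, here is a minimal dense sketch of the alternating least-squares updates on the same toy count matrix. It is an illustration under stated assumptions, not how a production library does it: the function names (solve_side, weighted_loss, implicit_als) and the values for factors, reg, iterations, and seed are hypothetical choices for this example, and the per-row loop over a dense matrix is only sensible at toy scale.

```python
import numpy as np

# Same toy interaction counts as above: rows = users A-D, cols = items 1-4
counts = np.array([
    [0, 3, 0, 1],
    [1, 0, 8, 0],
    [0, 0, 2, 5],
    [4, 1, 0, 0],
], dtype=np.float64)


def solve_side(fixed, conf, pref, reg):
    """Solve one side of ALS while the other side's factors stay fixed.

    For each row r: factors_r = (F^T C_r F + reg * I)^-1 F^T C_r p_r,
    where F holds the fixed factors and C_r is diag(conf[r]).
    """
    n_rows, n_factors = conf.shape[0], fixed.shape[1]
    ridge = reg * np.eye(n_factors)
    out = np.empty((n_rows, n_factors))
    for r in range(n_rows):
        weighted = fixed * conf[r][:, None]   # each fixed row scaled by its confidence
        A = fixed.T @ weighted + ridge        # F^T C_r F + reg * I
        b = weighted.T @ pref[r]              # F^T C_r p_r
        out[r] = np.linalg.solve(A, b)
    return out


def weighted_loss(counts, X, Y, alpha=40.0, reg=0.1):
    """Confidence-weighted squared error plus L2 penalty, over every cell."""
    pref = (counts > 0).astype(np.float64)
    conf = 1.0 + alpha * counts
    err = pref - X @ Y.T
    return float((conf * err ** 2).sum() + reg * ((X ** 2).sum() + (Y ** 2).sum()))


def implicit_als(counts, alpha=40.0, factors=2, reg=0.1, iterations=15, seed=0):
    """Toy implicit ALS: alternate exact solves for user and item factors."""
    rng = np.random.default_rng(seed)
    n_users, n_items = counts.shape
    pref = (counts > 0).astype(np.float64)    # binary preference
    conf = 1.0 + alpha * counts               # confidence weights
    X = np.zeros((n_users, factors))          # user factors (solved first)
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    for _ in range(iterations):
        X = solve_side(Y, conf, pref, reg)        # update users, items fixed
        Y = solve_side(X, conf.T, pref.T, reg)    # update items, users fixed
    return X, Y


X, Y = implicit_als(counts)
print("Predicted preference scores:")
print(np.round(X @ Y.T, 2))
print("Weighted loss:", round(weighted_loss(counts, X, Y), 3))
```

Each half-step is an exact minimizer of the objective for one side, so the weighted loss cannot go up from one sweep to the next. Real implementations avoid the dense per-row product by exploiting the fact that most cells share the baseline confidence of 1.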

A different fix: negative sampling

Confidence weighting is one way out of the missing-data trap. A second family of models — pairwise rankers like BPR (Bayesian Personalized Ranking) — takes a different route called negative sampling.

The idea: for each observed positive interaction, randomly draw a few items the user has not touched and treat those as negatives for that one training step. You are not claiming they are truly disliked — you are just handing the model a little contrast so it learns to rank the real positive above some random alternative. Because the negatives are resampled at random each time, the model never hardens into believing any specific item is hated. This trick is everywhere in neural recommenders — two-tower models, sequence models — wherever you need cheap contrast without the full weighted-least-squares machinery.
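
The sampling step can be sketched in a few lines. The example below is a minimal BPR-style trainer on a toy binary preference matrix, written with plain NumPy and stochastic gradient steps. The helper names (sample_triple, train_bpr, pairwise_accuracy) and the hyperparameters (factors, lr, reg, steps, seed) are hypothetical choices for illustration, it draws one negative per positive for simplicity, and it assumes every user has at least one untouched item to sample from.

```python
import numpy as np

# Toy binary preference matrix: 1 = user interacted with the item
pref = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])


def sample_triple(pref, rng):
    """Draw (user, observed positive item, randomly sampled untouched item)."""
    users, items = np.nonzero(pref)
    k = rng.integers(len(users))
    u, i = users[k], items[k]
    untouched = np.flatnonzero(pref[u] == 0)   # candidates, not known dislikes
    j = rng.choice(untouched)
    return u, i, j


def train_bpr(pref, factors=2, lr=0.05, reg=0.01, steps=5000, seed=0):
    """SGD on the BPR objective: maximize log sigmoid(score(u,i) - score(u,j))."""
    rng = np.random.default_rng(seed)
    n_users, n_items = pref.shape
    X = rng.normal(scale=0.1, size=(n_users, factors))   # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))   # item factors
    for _ in range(steps):
        u, i, j = sample_triple(pref, rng)               # fresh negative every step
        xu, yi, yj = X[u].copy(), Y[i].copy(), Y[j].copy()
        diff = xu @ (yi - yj)                            # score(u,i) - score(u,j)
        g = 1.0 / (1.0 + np.exp(diff))                   # sigmoid(-diff)
        X[u] += lr * (g * (yi - yj) - reg * xu)
        Y[i] += lr * (g * xu - reg * yi)
        Y[j] += lr * (-g * xu - reg * yj)
    return X, Y


def pairwise_accuracy(pref, X, Y):
    """Fraction of (positive, untouched) pairs per user that are ranked correctly."""
    scores = X @ Y.T
    correct, total = 0, 0
    for u in range(pref.shape[0]):
        for i in np.flatnonzero(pref[u] == 1):
            for j in np.flatnonzero(pref[u] == 0):
                correct += int(scores[u, i] > scores[u, j])
                total += 1
    return correct / total


X, Y = train_bpr(pref)
print("Pairwise ranking accuracy:", pairwise_accuracy(pref, X, Y))
```

Note that the loss only ever compares two scores for the same user. No cell is pushed toward an absolute target of 0, which is why a sampled item is never treated as a confirmed dislike.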

Choosing between the two

In practice the product usually chooses for you.

If you have star ratings or thumbs that are dense enough to be useful, treat them as explicit feedback. Clean signal is precious; do not throw it away.
If you are building on clicks, streams, purchases, or dwell time — which is nearly every consumer product at scale — you are in implicit territory. Reach for preference + confidence, or pairwise ranking with negative sampling.

Many systems blend both: explicit ratings anchor the factors for the handful of users who bother to rate, while implicit signals fill in the silent majority.

The idea to carry past this lesson is that implicit data is not a degraded version of explicit data. It is a different kind of signal with its own structure — and modeled correctly, it is often more powerful than ratings, simply because there is so much more of it.

In one breath

Explicit feedback (ratings) is clean but vanishingly rare, while implicit feedback (clicks, plays) is abundant but one-sided — you see every “yes” and can never tell a true “no” from an undiscovered item — so production systems reframe the raw counts as a binary preference (did they interact?) weighted by a confidence of 1 + alpha * count, which lets the model learn from the zeros without ever treating silence as rejection.

Practice

Before the quiz, reason about the boundary case. Suppose you bumped alpha from 40 to 400. Look back at user B’s eight plays of item3: how does its confidence change, and what happens to the relative weight of a single-click item next to it? Then answer the deeper one: why does raising alpha make the model lean harder on your heaviest users — and when might that be exactly the wrong thing to do?

Quick check

Q1. A user has streamed a song 12 times. With alpha=40, what is their confidence score for that song?

Q2. A user has never clicked on item X. In an implicit ALS model, how should item X be treated?

Q3. You are building a recommender for a recipe app. Users rarely rate recipes, but you log every recipe page they view, every ingredient list they expand, and every recipe they save. Which approach fits best?

A question to carry forward

We have now built a recommender from nothing but clicks. But here is the uncomfortable thing: how would we even know if it is any good? With explicit ratings you could at least check your predicted stars against held-out real stars. With implicit data there are no stars to check against — and worse, the user only ever sees the top few items you surface, never the rest of the ranked list.

So the question to carry forward is about measurement, not modeling. If a recommender’s whole job is to put the right handful of items in the top few slots, then squared error on rating predictions is measuring the wrong thing entirely. What should we measure — and how do we score a ranked list in a way that rewards getting position 1 right far more than position 50? That is the subject of the next lesson, evaluating recommenders: precision@k, recall@k, and NDCG.
