How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.

What is a confusion matrix and what four quantities does it report?

A confusion matrix tallies predictions against ground truth in a 2x2 table: true positives, true negatives, false positives, and false negatives. From those four cells every classification metric — accuracy, precision, recall, F1, specificity — can be derived. It exposes *which kind* of error a model makes, not just how often it errs.
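The derivations mentioned above can be sketched in a few lines. This is a minimal illustration, assuming binary labels encoded as 1 (positive) and 0 (negative); the function names and the sample labels are hypothetical, not taken from any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Tally the four cells for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def derived_metrics(tp, tn, fp, fn):
    """Derive the usual metrics from the four cells, guarding empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
m = derived_metrics(*counts)
print(counts)  # (tp, tn, fp, fn)
print(m)
```

Here precision (0.6) and recall (0.75) differ, which is the point of keeping the four cells separate: the model makes more false-positive errors than false-negative ones, something a single accuracy figure would hide.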

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.
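To make the offline signal concrete, here is a minimal sketch of NDCG at a cutoff k. It assumes graded relevance labels listed in the order the results were shown, uses the linear-gain form of DCG (an exponential-gain variant is also common), and computes the ideal ordering from the same labelled results rather than from every labelled document for the query. The function names and sample grades are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG of the shown order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical human grades (0 = irrelevant, 3 = perfect) for one query,
# in the order the search engine returned the results.
shown_order = [3, 2, 3, 0, 1]
print(round(ndcg_at_k(shown_order, k=5), 3))
```

A score of 1.0 means the most relevant results were already ranked first; anything lower quantifies how much relevance was pushed down the page.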

You are asked to 'use ML to improve the user experience on our platform.' How do you approach this completely open-ended problem?

Open-ended ML problems require scoping before modelling: translate the vague ask into a measurable business objective, identify which user interaction has the highest improvement potential, formulate it as a concrete ML task with a defined label and evaluation metric, then propose the simplest viable model first. Jumping to model architecture before this scoping is the most common interview failure mode.

The utility matrix — Recommender Systems

The last lesson’s popularity baseline chewed on a flat list of (user, item, rating) rows and ended on a hint: reshape that list into a grid and the whole problem turns visual. This lesson is that grid — the one data structure every recommender, from the simplest to the most neural, is secretly trying to complete.

Framing the problem as a table

Imagine you are building a movie recommender. You have four users and six films. Some users have rated some films; most have not. Write all of that down in a table — users along the rows, films along the columns, each cell holding the rating that user gave that film (or left blank if they never rated it).

That table is the utility matrix (also called the user-item matrix or ratings matrix). It is the canonical data structure for collaborative filtering and sits at the heart of almost every recommender system you will encounter.

A 4-user × 6-item utility matrix. Shaded cells are observed ratings. Dashed cells are unknown — predicting them is the recommendation problem.

The recommendation problem is now crisp: fill in the dashed cells with the ratings a user would most plausibly give, then surface the items with the highest predicted scores.
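The two steps, predict the blanks and then surface the top scores, can be sketched as follows. This is only an illustration: the predictor is a deliberately naive stand-in (each item's mean observed rating), not a method this lesson proposes, and the `recommend` helper is a hypothetical name. The ratings are the same toy data the lesson's code section uses.

```python
from collections import defaultdict

ratings = [
    ("Anika", "Inception", 5), ("Anika", "Dune", 4),
    ("Bruno", "Parasite", 3), ("Bruno", "Coco", 5),
    ("Cleo", "Dune", 2), ("Cleo", "Arrival", 4),
    ("Dev", "Inception", 4), ("Dev", "Arrival", 5),
]
items = ["Inception", "Parasite", "Dune", "Coco", "Arrival", "Oppenheimer"]

# Observed cells of the utility matrix, keyed by (user, item).
observed = {(user, item): rating for user, item, rating in ratings}

# Stand-in predictor (an assumption): an item's mean observed rating.
ratings_by_item = defaultdict(list)
for _, item, rating in ratings:
    ratings_by_item[item].append(rating)
item_mean = {item: sum(rs) / len(rs) for item, rs in ratings_by_item.items()}


def recommend(user, k=2):
    """Score every blank cell in the user's row and return the top-k items."""
    candidates = [
        (item_mean[item], item)
        for item in items
        if (user, item) not in observed and item in item_mean
    ]
    # Highest predicted score first; ties broken alphabetically.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in candidates[:k]]


print(recommend("Anika"))  # ['Coco', 'Arrival']
```

Note that Oppenheimer never appears in any list: it has no ratings, so this predictor has nothing to score it with.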

Sparsity — the central challenge

In the toy grid above, about half the cells are filled. In practice the situation is far worse. Netflix has hundreds of millions of users and tens of thousands of titles. Even a prolific reviewer who rates 500 films touches fewer than 2% of the catalogue. The average user rates far fewer. A real utility matrix is typically more than 99% empty.
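The arithmetic behind these percentages is short enough to check directly. The sketch below assumes a catalogue of 30,000 titles (one value within "tens of thousands") and a hypothetical typical user with 20 ratings; both numbers are illustrative assumptions, not platform statistics.

```python
def sparsity(n_items, ratings_per_user):
    """Fraction of empty cells when every user rates `ratings_per_user` items.

    The user count cancels out: each row has n_items cells and
    ratings_per_user of them are filled, whatever the number of rows.
    """
    return 1 - ratings_per_user / n_items


CATALOGUE = 30_000  # assumed catalogue size

print(f"Prolific reviewer (500 ratings): {sparsity(CATALOGUE, 500):.2%}")
print(f"Typical user (20 ratings, assumed): {sparsity(CATALOGUE, 20):.2%}")
```

Even if every user were as prolific as the 500-film reviewer, the matrix would still be over 98% empty; with more realistic activity it passes 99%.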

This extreme emptiness is called sparsity, and it creates three compounding problems:

Cold cells. For most user-item pairs there is no signal whatsoever. You cannot directly look up whether Anika would like Coco; you have to infer it from structure elsewhere in the matrix.
Few overlapping observations. Collaborative filtering works by comparing users who rated the same items. With high sparsity, two users may share ratings on only one or two items — too thin a basis for confident similarity estimates.
Cold-start users and items. A brand-new user has no row entries at all. A brand-new item has no column entries. Both are nearly invisible to algorithms that depend on the matrix.
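The second and third problems can be measured directly. As a rough sketch, using the same toy ratings that the lesson's code section uses, the snippet below counts how many items each pair of users has both rated and lists the items with no ratings at all. The helper name `shared_items` is hypothetical.

```python
from itertools import combinations

ratings = [
    ("Anika", "Inception", 5), ("Anika", "Dune", 4),
    ("Bruno", "Parasite", 3), ("Bruno", "Coco", 5),
    ("Cleo", "Dune", 2), ("Cleo", "Arrival", 4),
    ("Dev", "Inception", 4), ("Dev", "Arrival", 5),
]
users = ["Anika", "Bruno", "Cleo", "Dev"]
items = ["Inception", "Parasite", "Dune", "Coco", "Arrival", "Oppenheimer"]

# The set of items each user has rated (one row of the matrix).
rated = {user: set() for user in users}
for user, item, _ in ratings:
    rated[user].add(item)


def shared_items(a, b):
    """Items rated by both users: the only basis for comparing them."""
    return rated[a] & rated[b]


for a, b in combinations(users, 2):
    common = shared_items(a, b)
    print(f"{a}-{b}: {len(common)} shared {sorted(common)}")

# Cold-start items: columns with no observed ratings.
rated_anywhere = set().union(*rated.values())
cold_items = [item for item in items if item not in rated_anywhere]
print("Cold-start items:", cold_items)
```

No pair of users shares more than one item, and Bruno shares none with anyone, so any similarity score computed from this matrix rests on a single data point at best.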

Every technique in this course — content-based filtering, matrix factorization, neural collaborative filtering — exists primarily to deal with sparsity in one way or another.

Explicit feedback vs implicit feedback

Not all signals are created equal. There are two fundamentally different kinds of data that can fill (or implicitly populate) a utility matrix.

Explicit feedback is a deliberate rating. A five-star review on Amazon, a thumbs-up on YouTube, a heart on Spotify — the user consciously expressed a preference. The signal is clean and unambiguous. The problem: most users never rate anything. Explicit feedback is sparse by construction.

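Implicit feedback is derived from behavior: a click, a view, a purchase, time spent on a page, a song played to completion. It is abundant, since systems collect it passively at scale, and it is the dominant input for most production recommenders today. The tradeoff: it is noisy (a click does not guarantee the user liked the item) and it is one-sided. You observe only what the user did, never what they rejected, so a missing entry means the item is undiscovered, not disliked.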

The practical upshot: when you have explicit ratings, predicting the exact score is a regression problem. When you have implicit interactions, the task shifts to ranking — predicting which unobserved items the user would most likely engage with, given everything you know about their history.
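The regression-versus-ranking split shows up most clearly in how each setting is scored. The sketch below contrasts two commonly used metrics: RMSE for predicted ratings and precision at k for a ranked list. The choice of these two metrics, the function names, and all sample values are assumptions made for illustration.

```python
import math


def rmse(predicted, actual):
    """Explicit feedback: how far are predicted scores from the true ratings?"""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_at_k(ranked_items, engaged, k):
    """Implicit feedback: what share of the top-k did the user engage with?"""
    return sum(item in engaged for item in ranked_items[:k]) / k


# Explicit setting: hypothetical held-out ratings and model predictions.
actual = [5, 2, 4]
predicted = [4.5, 3.0, 4.0]
print(round(rmse(predicted, actual), 3))

# Implicit setting: a hypothetical ranking of items the user had not yet
# interacted with, and the items they went on to play afterwards.
ranked = ["Arrival", "Coco", "Parasite"]
played_later = {"Arrival"}
print(precision_at_k(ranked, played_later, k=2))
```

RMSE cares about the exact value in each cell; precision at k ignores the scores entirely and asks only whether the right items reached the top of the list.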

Build the matrix in code

import numpy as np
import pandas as pd

# Raw interaction log: (user, item, rating)
interactions = [
    ("Anika",  "Inception",      5),
    ("Anika",  "Dune",           4),
    ("Bruno",  "Parasite",       3),
    ("Bruno",  "Coco",           5),
    ("Cleo",   "Dune",           2),
    ("Cleo",   "Arrival",        4),
    ("Dev",    "Inception",      4),
    ("Dev",    "Arrival",        5),
]

users = ["Anika", "Bruno", "Cleo", "Dev"]
items = ["Inception", "Parasite", "Dune", "Coco", "Arrival", "Oppenheimer"]

# Build the utility matrix: NaN = unobserved
utility = pd.DataFrame(np.nan, index=users, columns=items)
for user, item, rating in interactions:
    utility.loc[user, item] = rating

print("Utility matrix:")
print(utility.to_string())

# Sparsity = fraction of missing entries
total_cells = utility.size
observed    = utility.notna().sum().sum()
sparsity    = 1 - observed / total_cells

print(f"\nTotal cells : {total_cells}")
print(f"Observed    : {observed}")
print(f"Missing     : {total_cells - observed}")
print(f"Sparsity    : {sparsity:.1%}")

Utility matrix:
       Inception  Parasite  Dune  Coco  Arrival  Oppenheimer
Anika        5.0       NaN   4.0   NaN      NaN          NaN
Bruno        NaN       3.0   NaN   5.0      NaN          NaN
Cleo         NaN       NaN   2.0   NaN      4.0          NaN
Dev          4.0       NaN   NaN   NaN      5.0          NaN

Total cells : 24
Observed    : 8
Missing     : 16
Sparsity    : 66.7%

The matrix prints NaN wherever there is no rating — 16 of 24 cells, 66.7% sparse, and note Oppenheimer’s column is entirely empty (a brand-new item nobody has rated). That two-thirds emptiness is gentle: on any real platform the same number would be above 99%. That gap between a toy dataset and production is what makes the problem genuinely hard.
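One practical consequence of that gap is storage: a dense grid spends memory on every empty cell. A minimal sketch of the alternative, assuming a plain dictionary keyed by user and then item, keeps only the observed cells. The `lookup` helper and the scale figures at the end are hypothetical; production systems often use a dedicated sparse-matrix format (for example those in scipy.sparse) instead.

```python
interactions = [
    ("Anika", "Inception", 5), ("Anika", "Dune", 4),
    ("Bruno", "Parasite", 3), ("Bruno", "Coco", 5),
    ("Cleo", "Dune", 2), ("Cleo", "Arrival", 4),
    ("Dev", "Inception", 4), ("Dev", "Arrival", 5),
]

# Store only observed cells: {user: {item: rating}}.
by_user = {}
for user, item, rating in interactions:
    by_user.setdefault(user, {})[item] = rating


def lookup(user, item):
    """Return the rating, or None when the cell is unobserved."""
    return by_user.get(user, {}).get(item)


stored = sum(len(row) for row in by_user.values())
print(f"Stored entries: {stored} (a dense 4 x 6 grid holds 24 cells)")
print(lookup("Anika", "Dune"), lookup("Anika", "Coco"))

# Assumed scale for illustration: 1M users x 10,000 items at 99% sparsity.
dense_cells = 1_000_000 * 10_000
observed_cells = dense_cells // 100
print(f"Dense: {dense_cells:,} cells, observed: {observed_cells:,}")
```

Returning `None` rather than 0 for an unobserved cell keeps "unknown" distinct from a low rating.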

What the matrix does not capture

The utility matrix is a convenient abstraction, but it flattens a lot:

Temporal dynamics. A rating given five years ago matters less than one given last week; the flat matrix treats them identically unless you explicitly incorporate timestamps.
Context. A user’s mood, device, or time of day affects what they want. The matrix collapses all of that into a single number per user-item pair.
Side information. The matrix knows nothing about item genres, user demographics, or social relationships unless you bolt those on separately.
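As one example of adding a dimension back in, timestamps can be folded into an average by weighting recent ratings more heavily. The sketch below assumes exponential decay with a 180-day half-life; the decay form, the half-life value, and the function names are all illustrative assumptions rather than anything this lesson prescribes.

```python
def decay_weight(age_days, half_life_days=180.0):
    """Weight that halves every `half_life_days` (assumed value)."""
    return 0.5 ** (age_days / half_life_days)


def time_weighted_mean(rated):
    """Average ratings, weighting each by recency.

    `rated` is a list of (rating, age_in_days) pairs.
    """
    weights = [decay_weight(age) for _, age in rated]
    weighted_sum = sum(r * w for (r, _), w in zip(rated, weights))
    return weighted_sum / sum(weights)


# Hypothetical history: a 2-star rating five years ago, a 5-star last week.
history = [(2, 5 * 365), (5, 7)]

plain_mean = sum(r for r, _ in history) / len(history)
print(f"Plain mean        : {plain_mean:.2f}")
print(f"Time-weighted mean: {time_weighted_mean(history):.2f}")
```

The plain mean treats both ratings identically, as the flat matrix does; the weighted version lets last week's rating dominate.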

Later lessons build on the utility matrix and add these dimensions back in. For now, the sparse table of known ratings — and the challenge of predicting what belongs in the blanks — is the conceptual foundation everything else rests on.

In one breath

The utility matrix (user-item / ratings matrix) lays interactions in a grid — users down the rows, items across the columns, ratings in the cells — and recommending is literally filling in the blanks and surfacing the highest predicted scores. Its defining property is sparsity: real matrices are >99% empty, which breeds cold cells, too few overlapping ratings to compare users, and cold-start rows (new user) and columns (new item). The cells come from two feedback types: explicit (deliberate ratings — clean but rare) and implicit (clicks, plays, purchases — abundant but noisy and one-sided). The cardinal trap: a missing implicit entry means undiscovered, not disliked — never treat absence as a negative. Explicit ratings make prediction a regression; implicit data makes it a ranking problem.

Practice

Quick check

Q1. A utility matrix for a streaming service has 50 million users and 80,000 titles. Even if every user rated 200 titles, what is the approximate sparsity?

Q2. A user has watched 40 films on a platform but never rated any of them. Which type of feedback is available, and what is the main risk if the system treats all unobserved items as dislikes?

Q3. A music app wants to recommend songs to a user who signed up yesterday and has played exactly one track. Which problem does this illustrate, and what does the utility matrix look like for this user?

A question to carry forward

Look at the worst cell in that matrix: Oppenheimer’s column, completely empty. No user has rated it, so any method that learns from the pattern of ratings — comparing users, factorizing the matrix — has literally nothing to grab onto. Yet a brand-new release is exactly the item a platform most wants to push. This is the item cold-start problem, and it’s a wall for every collaborative technique.

So the question to carry forward is: how can you recommend an item that nobody has rated yet? The trick is to stop staring at the empty column and look at the item itself — its genre, its description, its tags. The next lesson, content-based filtering, builds a feature vector for every item from its metadata and matches it to what each user has already loved, so a film can be recommended the minute it’s added — zero ratings required.
