Compare filter, wrapper, and embedded feature-selection methods. When would you use each?

Filter methods score each feature by its statistical relevance to the target, independently of any model, so they are fast but ignore feature interactions; use them as a cheap first pass on high-dimensional data. Wrapper methods (such as recursive feature elimination) search feature subsets by training a model and evaluating its performance, which is accurate but computationally expensive; use them when the feature count is small enough to afford repeated training. Embedded methods select features as part of model training (such as lasso or tree importances), giving a good balance of accuracy and efficiency; use them when the model you plan to train supports them.

How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. A change should be trusted only when all three tiers move in the same direction.

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.
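As a rough illustration of the offline relevance signal, here is a minimal NDCG sketch. The function names `dcg` and `ndcg_at_k` and the sample labels are hypothetical, and two simplifying assumptions are made: the gain is the raw label (some formulations use 2^label − 1 instead), and the ideal ordering is built only from the labels of the results that were returned.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for graded labels listed in ranked order."""
    # Rank 1 is divided by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the served order divided by DCG of the best possible order."""
    actual = dcg(ranked_relevances[:k])
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical human labels (0 = irrelevant ... 3 = perfect match) for one query,
# listed in the order the search engine returned the results.
labels_in_served_order = [3, 0, 2, 1]
score = ndcg_at_k(labels_in_served_order, k=4)
print(f"NDCG@4 = {score:.3f}")
```

Averaging this value over a labelled query set gives one number per model version, which is what makes it useful for fast iteration before an online test.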

What is hybrid search and why is it often better than pure vector search?

Hybrid search combines dense vector similarity with sparse keyword search such as BM25, then fuses the rankings. Dense retrieval captures semantic meaning while keyword search nails exact terms, identifiers, and rare tokens, so combining them improves recall and precision over either alone.
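The answer above only says the rankings are fused; one common way to do that is reciprocal rank fusion (RRF), sketched below as an assumption rather than the required method. The function name, document ids, and the constant k = 60 (a frequently used default) are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in
    (ranks start at 1); a document missing from a list earns nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists for one query.
dense_hits = ["doc_a", "doc_b", "doc_c"]    # from vector similarity
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # from BM25
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that cosine scores and BM25 scores live on different scales.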

Item-based collaborative filtering — Recommender Systems

We left the last lesson staring at two walls, sparsity and scale, and noticing that both stand on the same axis: the user axis. There are too many users, their tastes shift overnight, and any two of them share almost no co-rated items. Then we asked: what if we turned the matrix on its side and compared items to each other instead?

That sideways glance turns out to be one of the most consequential ideas in the history of recommendation. In 1998 a small team at Amazon had exactly this thought, and the algorithm they built from it still quietly powers “customers who bought this also bought” on what became the largest store on Earth. This lesson is that flip.

The axis flip: from user-user to item-item

Picture the utility matrix one more time — users down the rows, items across the columns, ratings in the cells.

User-based CF reads across the rows. To recommend for you, it scans every other row looking for users whose pattern of ratings resembles yours.

Item-based CF reads down the columns instead. Two items are similar, it says, when the same users tend to rate them the same way — both get high marks from the same crowd, or both get low marks from the same crowd. The column for “The Dark Knight” and the column for “Inception” rise and fall together across users, so those two items are neighbors.

And here is the move that changes everything. Once you know which items resemble which, you no longer need to hunt for similar users at all. You look at what the target user has already rated, pull up the items most similar to those, and recommend them. The user’s own ratings stop being something to match and become the weights in the prediction.

Why this scales and stays stable

Four things fall into place the moment you flip the axis.

You can precompute it offline. Item-item similarity depends only on the rating matrix — never on who is asking right now. So you compute the entire similarity table once, overnight, store it, and at request time do a lookup plus a short weighted sum. A user-based system has no such luxury: each arriving user needs fresh similarities computed on the spot.

There are usually fewer items than users. Many platforms have millions of users but only thousands, or low tens of thousands, of items. The table you must compute and store then grows with the square of the item count, not the square of the user count, which makes it orders of magnitude smaller and cheaper.

Item relationships hold still. Whether one thriller resembles another thriller barely moves as new ratings trickle in. Whether two particular users are alike can flip overnight as their tastes wander apart. Item-item similarities converge sooner and stay trustworthy longer.

Co-ratings are denser. For two items to be compared, some users must have rated both — and popular items have been rated by thousands, giving a rich, low-variance estimate. For two users to be compared, they must have rated the same items, which only gets sparser as the catalog grows.

How a recommendation is scored

Take a target user u and a candidate item j they have not yet rated. The predicted score is a weighted average of u’s own ratings on the items most similar to j:

score(u, j) = Σ_i [ sim(i, j) × rating(u, i) ]  /  Σ_i |sim(i, j)|

where the sum runs over the items i that u has rated (in practice, often only the k of them most similar to j). In plain words: look at everything u already rated, weight each of those ratings by how similar that item is to j, and average. Items close to j count heavily; items unlike j barely register.

The item-item similarity sim(i, j) — the number, between 0 and 1 (or −1 and 1), that says how alike two items are in rating space — is almost always cosine similarity computed from the two rating columns. We will keep leaning on cosine here, and at the very end of the lesson that casual reliance is going to become a question worth asking out loud.
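The scoring formula can be sketched in a few lines. This is a minimal version under stated assumptions: `predict_score`, `R_toy`, and `sim_toy` are hypothetical names, zeros mean "not rated", the sum runs over every item the user has rated rather than a top-k subset, and a score of 0.0 is returned when no rated item has any similarity to the candidate.

```python
import numpy as np

def predict_score(R, sim, u, j):
    """Weighted average of user u's ratings, weighted by similarity to item j.

    R   : ratings matrix (users x items), 0 meaning "not rated"
    sim : item-item similarity matrix
    """
    rated = R[u] > 0              # items i that u has rated
    rated[j] = False              # never use j as evidence for itself
    weights = sim[rated, j]       # sim(i, j) for each rated item i
    denom = np.abs(weights).sum()
    if denom == 0:
        return 0.0                # nothing similar was rated: no prediction
    return float(np.dot(weights, R[u, rated]) / denom)

# Hypothetical toy data: one user who rated items 0 and 1 but not item 2.
R_toy = np.array([[5.0, 1.0, 0.0]])
sim_toy = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])
print(round(predict_score(R_toy, sim_toy, u=0, j=2), 2))
```

Item 2 is much closer to item 0 (rated 5) than to item 1 (rated 1), so the prediction lands near the top of the scale.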

A picture of the lookup

Item A is the target user’s liked item. Dashed edges show precomputed cosine similarities. The top-k neighbours become the recommendation list.

The same problem, the other way round

Lay the two methods side by side and the trade is clear: item-based CF buys scale and stability by moving the expensive work offline, and pays for it with a different cold-start problem.

Axis	User-based CF	Item-based CF
Similarity computed between	Users	Items
Computed when	At request time (or expensively cached)	Offline, in batch
Scales to millions of users	Poorly	Well
Signal stability	Noisy; user tastes shift	Stable; item relationships persist
Cold start problem	New user with no ratings	New item with no ratings
Classic example	Social recommendations (“your friends liked”)	“Customers who bought this also bought”

Watch the similarities form — and watch one go wrong

Here is the whole computation on a tiny five-user, six-item matrix. A zero means “not rated,” and we deliberately ignore zeros so unrated entries never contribute to a similarity. The plan: build the item-item similarity table, then recommend for user 0, whose top rating is item 0.

import numpy as np

# Utility matrix: rows = users, columns = items
# 0 means "not rated" (we'll ignore zeros in similarity)
R = np.array([
    [5, 3, 4, 0, 1, 0],
    [4, 0, 5, 0, 2, 0],
    [0, 1, 0, 2, 4, 3],
    [5, 0, 4, 0, 0, 1],
    [0, 2, 0, 3, 5, 4],
], dtype=float)

def item_cosine_sim(R):
    n_items = R.shape[1]
    sim = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                sim[i, j] = 1.0
                continue
            # only users who rated BOTH items
            mask = (R[:, i] > 0) & (R[:, j] > 0)
            if mask.sum() == 0:
                continue
            a, b = R[mask, i], R[mask, j]
            denom = (np.linalg.norm(a) * np.linalg.norm(b))
            if denom > 0:
                sim[i, j] = np.dot(a, b) / denom
    return sim

sim = item_cosine_sim(R)

print("Item-item cosine similarity matrix (rounded):")
print(np.round(sim, 2))

# Recommend items similar to item 0 for user 0
target_user = 0
liked_item  = 0   # user rated this 5 — we find its neighbours

sim_scores = sim[liked_item].copy()
sim_scores[liked_item] = -1          # exclude self
sim_scores[R[target_user] > 0] = -1  # exclude already-rated items

top2 = np.argsort(sim_scores)[::-1][:2]
print(f"\nTop-2 recommendations for user {target_user} based on item {liked_item}:")
for item in top2:
    print(f"  item {item}  similarity={sim[liked_item, item]:.2f}")

Item-item cosine similarity matrix (rounded):
[[1.   1.   0.98 0.   0.91 1.  ]
 [1.   1.   1.   0.99 0.7  0.98]
 [0.98 1.   1.   0.   0.98 1.  ]
 [0.   0.99 0.   1.   1.   1.  ]
 [0.91 0.7  0.98 1.   1.   1.  ]
 [1.   0.98 1.   1.   1.   1.  ]]

Top-2 recommendations for user 0 based on item 0:
  item 5  similarity=1.00
  item 3  similarity=0.00

Now look closely, because this small run is honest in a way a tidied-up example would hide. Items 0 and 2, both rated highly by users 0, 1, and 3, come out at 0.98: a genuine, well-evidenced match. That is the signal working exactly as advertised.

But the recommendation for user 0 is item 5 at a perfect 1.00, with item 2 nowhere in sight. What happened? Two things. First, user 0 has already rated items 0, 1, 2, and 4, so all of them are excluded as candidates; only items 3 and 5 are left to recommend, which is also why item 3 fills the second slot despite a similarity of 0.00 (it shares no raters with item 0). Second, that flawless 1.00 between item 0 and item 5 rests on a single shared rater: only user 3 rated both (item 0 a 5, item 5 a 1). The cosine of two one-number vectors with positive entries is always 1.0, no matter the numbers, so the score is not strong agreement; it is no evidence at all wearing the mask of perfect agreement. This is the exact sparsity trap we met with users in the last lesson, reappearing on the item axis. Flipping the matrix bought us scale; it did not buy us immunity from thin data.
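To see how much evidence sits behind each similarity, it can help to count the shared raters for every item pair. The sketch below repeats the utility matrix so it runs on its own; the variable names `rated`, `overlap`, and `cos` are illustrative.

```python
import numpy as np

# Same utility matrix as in the example above (0 = not rated).
R = np.array([
    [5, 3, 4, 0, 1, 0],
    [4, 0, 5, 0, 2, 0],
    [0, 1, 0, 2, 4, 3],
    [5, 0, 4, 0, 0, 1],
    [0, 2, 0, 3, 5, 4],
], dtype=float)

rated = (R > 0).astype(int)   # 1 where a rating exists
overlap = rated.T @ rated     # overlap[i, j] = number of users who rated both i and j

print("Co-rater counts per item pair:")
print(overlap)
print(f"items 0 and 2 share {overlap[0, 2]} raters")
print(f"items 0 and 5 share {overlap[0, 5]} raters")

# Cosine on a single shared rater: user 3 gave item 0 a 5 and item 5 a 1.
a, b = np.array([5.0]), np.array([1.0])
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine of [5] and [1] = {cos:.2f}")
```

Reading the similarity matrix next to this count matrix shows which high scores are backed by several raters and which rest on one.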

What “Amazon also-bought” actually does

The paper “Amazon.com Recommendations: Item-to-Item Collaborative Filtering” (Linden, Smith, York — IEEE Internet Computing, 2003) describes this pipeline almost exactly:

Offline nightly job: scan all purchase histories, compute item-item co-occurrence and similarity.
Store a similarity table: for each item, keep the top-N most similar items.
At request time: look up the items in the current user’s history, union their top-N neighbour lists, score by similarity weighted by the user’s own purchase frequency, return ranked list.

The entire online step is a handful of table lookups and a sort — trivially fast even under enormous traffic. That is why item-based CF replaced memory-based user-CF as the default approach at scale.
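The offline and online split described above can be sketched as follows. This is not the paper's implementation: `build_neighbour_table` and `recommend` are hypothetical names, the similarity values are made up, and candidates are scored by a plain sum of similarities (each item in the user's history counts once) rather than by purchase frequency.

```python
import numpy as np

def build_neighbour_table(sim, top_n):
    """Offline step: for each item, keep its top_n most similar other items."""
    table = {}
    for i in range(sim.shape[0]):
        scores = sim[i].copy()
        scores[i] = -np.inf                    # an item is not its own neighbour
        best = np.argsort(scores)[::-1][:top_n]
        table[i] = [(int(j), float(sim[i, j])) for j in best]
    return table

def recommend(history, table, k):
    """Online step: union the neighbour lists of the user's items and rank them."""
    totals = {}
    for item in history:
        for neighbour, s in table[item]:
            if neighbour in history:
                continue                       # skip what the user already has
            totals[neighbour] = totals.get(neighbour, 0.0) + s
    ranked = sorted(totals, key=lambda j: (-totals[j], j))
    return ranked[:k]

# Hypothetical precomputed similarities for five items.
sim = np.array([
    [1.0, 0.8, 0.1, 0.3, 0.5],
    [0.8, 1.0, 0.2, 0.6, 0.1],
    [0.1, 0.2, 1.0, 0.7, 0.4],
    [0.3, 0.6, 0.7, 1.0, 0.2],
    [0.5, 0.1, 0.4, 0.2, 1.0],
])
table = build_neighbour_table(sim, top_n=2)    # run once, offline
print(recommend(history={0, 1}, table=table, k=2))
```

The online function never touches the similarity matrix or the ratings; it only reads the small precomputed table, which is what keeps request-time cost low.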

In one breath

Item-based collaborative filtering flips the utility matrix on its side, measuring how similar items are from the columns of ratings rather than how similar users are from the rows — which lets you precompute a small, stable similarity table offline and serve a recommendation as a quick weighted sum of the user’s own ratings, at the cost of an item cold-start problem and a pull toward already-popular items.

Practice

Before the quiz, sit with the surprising output above. The honest 0.98 between items 0 and 2 came from three shared raters; the misleading 1.00 between items 0 and 5 came from one. In your own words, why does cosine similarity hand back a flawless 1.0 when two items share exactly one rater — and what would you change about the code to refuse a similarity score computed from too little overlap?

Quick check

Q1. Why can item-item similarity be precomputed offline, but user-user similarity typically cannot?

Q2. In item-based CF, the predicted score for item j is a weighted average of the target user's ratings on items similar to j. What are the weights?

Q3. A startup launches a new product on its platform. The item-based CF system recommends it to almost no one for the first week. What is the most likely cause, and what is a practical mitigation?

A question to carry forward

Notice what we did three times in this lesson without ever pausing to justify it: we reached for cosine similarity as if it were the only tool in the drawer. And the one place it embarrassed us — handing back a perfect 1.00 for two items that shared a single rater — was not really cosine’s fault so much as a clue. Cosine measures the angle between two vectors; it is blind to how much evidence those vectors rest on, and blind to whether we should have centered the ratings first.

So the question to carry forward is the one we have been quietly dodging: is cosine the right ruler at all? When ratings are on a 1-to-5 scale, when some users are generous and others stingy, when overlap is thin, when the signal is a click rather than a star — does the choice of similarity measure change who gets recommended what? The next lesson, similarity metrics, opens that drawer fully: cosine, Pearson, Jaccard, Euclidean, and the question of which one fits which kind of data.
