How does PCA work, and how do you choose the number of components?

PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Standardize features first whenever they are measured on different scales: PCA is variance-driven, so otherwise the largest-scale features dominate the components.
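The cumulative-variance rule can be sketched in a few lines. This is a minimal illustration, not a recipe: the synthetic data (three hidden signals mixed into ten features) and the 95% threshold are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 samples, 10 features
# driven by 3 underlying signals plus a little noise.
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = signals @ mixing + 0.05 * rng.normal(size=(200, 10))

# Put every feature on the same scale before measuring variance.
X_std = StandardScaler().fit_transform(X)

# Fit with all components kept, then accumulate the explained variance ratios.
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, np.round(cumulative[:5], 3))
```

scikit-learn can also make this selection directly: passing a float, as in `PCA(n_components=0.95, svd_solver="full")`, keeps just enough components to explain that fraction of the variance.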

What are t-SNE and UMAP, how do they differ from PCA, and what are their limitations for ML workflows?

t-SNE and UMAP are nonlinear dimensionality reduction algorithms designed primarily for 2D/3D visualization of high-dimensional data. Unlike PCA, they preserve local neighborhood structure rather than global variance, producing cleaner cluster separations in plots. Be cautious about using either as a preprocessing step for a supervised model: standard t-SNE is transductive (it has no transform for unseen points), both are stochastic so their output changes across runs unless seeded, and distances between far-apart points in the embedding are not reliable. UMAP can embed new points, but the other caveats still apply.

What is the kernel trick in SVM, and why does it work?

The kernel trick lets an SVM find a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends only on dot products between points, and a kernel function computes that dot product directly in the high-dimensional space. Common kernels are linear, polynomial, and RBF.
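A small worked check makes the "dot product without the mapping" claim concrete. The sketch below assumes the homogeneous degree-2 polynomial kernel on 2-D inputs, for which the explicit feature map is known; the names `phi` and `poly_kernel` are illustrative, not part of any library.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (3-D output)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)       # dot product computed in the 3-D feature space
via_kernel = poly_kernel(x, z)   # same value, computed in the original 2-D space
print(explicit, via_kernel)
```

Both numbers agree, but the kernel never builds the 3-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the shortcut is the only practical route.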

Why shouldn't you use t-SNE output as features for a downstream model, and what would you use instead?

t-SNE is a visualization method that optimizes a non-parametric 2D/3D embedding preserving local neighborhoods; it has no stable transform, distorts global structure and distances, and is stochastic, so its coordinates aren't reliable predictive features. It also can't naturally project new (test) points. For feature compression use PCA, autoencoders, or supervised embeddings, which provide a consistent, reusable mapping.
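What "a consistent, reusable mapping" buys you can be sketched with a scikit-learn pipeline. The dataset (`load_digits`), the 20 components, and the logistic-regression classifier are assumptions picked for illustration; the point is only that PCA is fitted on the training split and the same projection is then applied to unseen points.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)   # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling and PCA are fitted on the training split only; the classifier
# then sees the compressed 20-dimensional features.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=2000),
)
clf.fit(X_train, y_train)

# The fitted projection is reused, unchanged, on points it has never seen.
X_test_compressed = clf[:-1].transform(X_test)
accuracy = clf.score(X_test, y_test)
print(X_test_compressed.shape, round(accuracy, 3))
```

t-SNE offers no equivalent of that `transform` step, which is the practical reason its coordinates do not slot into a train/test workflow.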

Matrix factorization (SVD, ALS) — Recommender Systems

We ended the last lesson on a promise. Every similarity ruler we built — cosine, Pearson, Jaccard — compared two rows of the raw matrix directly, cell against cell. That is the whole neighborhood family, and we had just hit its ceiling: it can only compare what is literally written down, so it stays shallow and noisy on a matrix that is 99% empty. We asked what would happen if, instead, we learned a small set of hidden taste dimensions and placed every user and every item as a short vector in that learned space.

That leap is matrix factorization, and it is not a minor refinement. When Netflix offered a million dollars for a 10% improvement on its recommender, the winning team’s central weapon was exactly this idea: stop averaging neighbors, and start modeling latent taste. This lesson builds it.

Why neighborhoods fall short

When you predict a rating by averaging the k most-similar users, you are quietly making two bets — and on a big, sparse matrix, both go bad.

The first bet is that closeness in raw rating space means shared taste. But two people can love the same obscure film for opposite reasons, and a shared fondness for “action” can hide a deep split between people who want explosions and people who want heist plots. The raw numbers blur all of that together.

The second bet is that a handful of neighbors holds everything predictive about you. Yet most users have rated a tiny sliver of the catalog, so the neighbor set itself is drawn from a thin, noisy signal — you are averaging a few near-strangers and hoping.

Matrix factorization refuses both bets. Instead of comparing rows after the fact, it learns a compressed, dense representation of every user and item at once, fitting the entire observed pattern of ratings in a single model.

The core idea: a low-rank approximation

Let R be the utility matrix, shaped (users × items), with most entries missing. The plan is to find two thin matrices whose product reconstructs the ratings we do have:

U of shape (users × k) — one row per user, each a latent vector (an embedding) of length k.
V of shape (items × k) — one row per item, each a latent vector of length k.

so that the product U Vᵀ matches the observed entries of R as closely as possible. The predicted rating of user u on item i is just the dot product of their two vectors:

r̂(u, i) = U[u] · V[i]   (dot product of two length-k vectors)

The utility matrix R (users × items) is approximated as the product of a user-factor matrix U and an item-factor matrix Vᵀ. Each row of U and V is a length-k embedding in latent taste space.
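To make the shapes concrete, here is a minimal sketch with random (untrained) factors. The sizes are arbitrary assumptions; nothing is learned here, the example only shows how one dot product relates to the full product U Vᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2

U = rng.normal(size=(n_users, k))   # one length-k row per user
V = rng.normal(size=(n_items, k))   # one length-k row per item

R_hat = U @ V.T                     # every prediction at once, shape (users, items)
single = U[1] @ V[3]                # prediction for user 1 on item 3

print(R_hat.shape)                          # (4, 5)
print(np.isclose(R_hat[1, 3], single))      # True
print(np.linalg.matrix_rank(R_hat))         # at most k
```

The 20 cells of R_hat are generated by only 18 numbers (4×2 + 5×2), and the saving grows quickly as the matrix gets larger. That compression is what "low-rank" means in practice.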

What are latent factors?

Latent factors are dimensions of taste the model learns — never ones you hand-label. After training you might peer at factor 0 and notice it loads heavily on action blockbusters and lightly on literary dramas; factor 1 might split mainstream from indie. But the model invents these axes purely from rating patterns, and they rarely line up with tidy human categories. The number of them, k, is a hyperparameter: a larger k buys richer representations but risks overfitting on sparse data.

Training: fit the observed entries only

The training objective for basic matrix factorization is:

minimize  Σ_(u,i) observed  [ (r(u,i) − U[u]·V[i])²  +  λ(‖U[u]‖² + ‖V[i]‖²) ]

The λ term is L2 regularization — it penalizes large-magnitude vectors to prevent overfitting. Two standard solvers minimize it:

SGD (stochastic gradient descent): nudge U[u] and V[i] after each observed (u, i, r) triple. Simple to implement and flexible, it suits explicit ratings and online updates.
ALS (alternating least squares): freeze V and solve for U in closed form, then freeze U and solve for V, and alternate. Each half is an ordinary least-squares system. ALS parallelizes naturally and underpins Spark ALS and the implicit library.
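The alternating solve can be sketched in plain NumPy. This is an illustrative toy, not how Spark or the implicit library implement it. It assumes the penalty is counted once per observed rating, as the indexed U[u] and V[i] in the objective suggest, so each vector's λ is scaled by its number of ratings. The matrix, k, and λ are arbitrary choices, and the helper names are hypothetical.

```python
import numpy as np

# Toy ratings matrix; 0 marks a missing rating (5 users x 6 items).
R = np.array([
    [5, 3, 0, 1, 0, 4],
    [4, 0, 4, 1, 0, 0],
    [0, 3, 0, 5, 4, 0],
    [1, 0, 0, 4, 5, 1],
    [0, 4, 5, 0, 0, 3],
], dtype=float)
mask = R > 0                      # True where a rating was observed
n_users, n_items = R.shape
k, lam, n_iters = 2, 0.1, 20

rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (n_users, k))
V = rng.normal(0, 0.1, (n_items, k))

def objective(U, V):
    """Squared error on observed entries plus the per-rating L2 penalty."""
    err = (R - U @ V.T)[mask]
    n_u = mask.sum(axis=1)        # ratings per user
    n_i = mask.sum(axis=0)        # ratings per item
    penalty = (n_u * (U ** 2).sum(axis=1)).sum() + (n_i * (V ** 2).sum(axis=1)).sum()
    return (err ** 2).sum() + lam * penalty

def solve_rows(fixed, ratings, observed):
    """Ridge-regression solve for each row while the other factor matrix is frozen."""
    out = np.zeros((ratings.shape[0], k))
    for row in range(ratings.shape[0]):
        F = fixed[observed[row]]            # factor vectors of the rated entries
        y = ratings[row, observed[row]]     # the matching observed ratings
        A = F.T @ F + lam * len(y) * np.eye(k)
        out[row] = np.linalg.solve(A, F.T @ y)
    return out

history = [objective(U, V)]
for _ in range(n_iters):
    U = solve_rows(V, R, mask)        # freeze V, solve for every user
    V = solve_rows(U, R.T, mask.T)    # freeze U, solve for every item
    history.append(objective(U, V))

print(np.round(history[:4], 3), round(history[-1], 3))
```

Because each half-step is an exact minimizer for its block of variables, the objective can never go up from one half-step to the next. Each row's solve is also independent of every other row's, which is why the method parallelizes so easily.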

Bias terms

A raw dot product tries to explain everything with taste alignment — but some of a rating is not taste at all. A blockbuster is rated high by nearly everyone; a generous user rates everything high. The full model peels those off explicitly:

r̂(u, i) = μ + b_u + b_i + U[u]·V[i]

Here μ is the global mean, b_u is a user bias (does this person rate high overall?), and b_i is an item bias (is this item generally well-liked?). With those absorbed, the latent factors are freed to model only the residual interaction — the genuinely personal part of “you, specifically, and this item, specifically.”
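A rough feel for the bias terms comes from estimating them with plain averages. This is a simplified sketch: in the full model the biases are learned jointly with the factors, whereas here they are just mean deviations computed from a small illustrative matrix (0 marks a missing rating).

```python
import numpy as np

R = np.array([
    [5, 3, 0, 1, 0, 4],
    [4, 0, 4, 1, 0, 0],
    [0, 3, 0, 5, 4, 0],
    [1, 0, 0, 4, 5, 1],
    [0, 4, 5, 0, 0, 3],
], dtype=float)
mask = R > 0

# Global mean over observed ratings only.
mu = R[mask].mean()

# User bias: each user's average deviation from the global mean.
resid_u = np.where(mask, R - mu, 0.0)
b_u = resid_u.sum(axis=1) / mask.sum(axis=1)

# Item bias: average deviation left over after removing mu and the user bias.
resid_i = np.where(mask, R - mu - b_u[:, None], 0.0)
b_i = resid_i.sum(axis=0) / mask.sum(axis=0)

# Baseline prediction before any latent factors are involved.
baseline = mu + b_u[:, None] + b_i[None, :]
print(round(mu, 3))
print(np.round(baseline, 2))
```

Whatever the baseline gets wrong on the observed entries is the residual that the dot product U[u]·V[i] is left to explain.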

Watch it fill in the blanks

Let us train two factorizations on the same tiny 5-user, 6-item matrix and compare. The first, NMF from scikit-learn, factorizes the dense matrix as-is — which means it reads every 0 as a literal zero rating. The second is a hand-written SGD loop that trains on the observed entries only, with bias terms, exactly as described above.

import numpy as np
from sklearn.decomposition import NMF

# Small ratings matrix (0 = unobserved / missing); 5 users x 6 items
R = np.array([
    [5, 3, 0, 1, 0, 4],
    [4, 0, 4, 1, 0, 0],
    [0, 3, 0, 5, 4, 0],
    [1, 0, 0, 4, 5, 1],
    [0, 4, 5, 0, 0, 3],
], dtype=float)

# --- Approach 1: NMF (treats 0 as 0, not missing) ---
model = NMF(n_components=3, max_iter=500, random_state=42)
U = model.fit_transform(R)   # (users, k)
V = model.components_         # (k, items)
print("=== NMF reconstruction (all entries) ===")
print(np.round(U @ V, 2))

# --- Approach 2: SGD MF on observed entries only, with biases ---
np.random.seed(7)
n_users, n_items = R.shape
k, lr, lam, n_epochs = 3, 0.01, 0.1, 300

P  = np.random.normal(0, 0.1, (n_users, k))
Q  = np.random.normal(0, 0.1, (n_items, k))
bu = np.zeros(n_users)
bi = np.zeros(n_items)
mu = R[R > 0].mean()
observed = [(u, i, R[u, i]) for u in range(n_users)
            for i in range(n_items) if R[u, i] > 0]

for epoch in range(n_epochs):
    np.random.shuffle(observed)
    for u, i, r in observed:
        err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
        P[u]  += lr * (err * Q[i] - lam * P[u])
        Q[i]  += lr * (err * P[u] - lam * Q[i])
        bu[u] += lr * (err - lam * bu[u])
        bi[i] += lr * (err - lam * bi[i])

R_hat = np.array([[mu + bu[u] + bi[i] + P[u] @ Q[i]
                   for i in range(n_items)] for u in range(n_users)])
print("\n=== SGD-MF reconstruction (observed entries only) ===")
print(np.round(R_hat, 2))

print("\nSGD-MF predictions for originally-missing entries:")
for u in range(n_users):
    for i in range(n_items):
        if R[u, i] == 0:
            print(f"  User {u}, Item {i} => predicted {R_hat[u, i]:.2f}")

=== NMF reconstruction (all entries) ===
[[5.61 1.92 1.11 1.17 0.16 3.08]
 [2.75 1.84 1.91 0.52 0.   2.08]
 [0.09 1.78 0.64 4.66 4.67 0.45]
 [0.87 1.38 0.   4.37 4.26 0.54]
 [0.3  3.72 5.43 0.17 0.   2.41]]

=== SGD-MF reconstruction (observed entries only) ===
[[4.86 3.08 4.31 1.16 3.32 3.93]
 [4.   2.74 3.94 1.18 3.18 3.18]
 [1.59 3.07 3.97 4.59 4.42 1.6 ]
 [1.22 2.98 3.88 4.22 4.48 1.18]
 [3.45 3.86 4.89 3.87 4.88 3.07]]

SGD-MF predictions for originally-missing entries:
  User 0, Item 2 => predicted 4.31
  User 0, Item 4 => predicted 3.32
  User 1, Item 1 => predicted 2.74
  User 1, Item 4 => predicted 3.18
  User 1, Item 5 => predicted 3.18
  User 2, Item 0 => predicted 1.59
  User 2, Item 2 => predicted 3.97
  User 2, Item 5 => predicted 1.60
  User 3, Item 1 => predicted 2.98
  User 3, Item 2 => predicted 3.88
  User 4, Item 0 => predicted 3.45
  User 4, Item 3 => predicted 3.87
  User 4, Item 4 => predicted 4.88

Two things to read here. First, the masked SGD model reconstructs the observed ratings faithfully — user 0’s true 5 on item 0 comes back as 4.86, user 2’s 5 on item 3 as 4.59, user 3’s 5 on item 4 as 4.48. It learned the real structure rather than memorizing zeros. Second, the prediction the callout asked about: User 2, Item 0 => 1.59 — low, exactly as the geometry demanded. The model placed user 2 in a latent region far from the item-0-lovers, so it predicts they would dislike it. Compare that to the NMF reconstruction above it, where the unmasked 0s have dragged whole rows toward zero (look at the 0.09 and 0.00 entries) — the very bias the warning predicted.

Production-scale ALS and implicit feedback

That gradient loop is wonderful for understanding and hopeless for a hundred million users. In production, two patterns dominate, and notice that the emphasis has quietly shifted from star ratings toward implicit signals like plays and clicks.

ALS with the implicit library (Python, CPU/GPU, built for implicit feedback):

import implicit
import scipy.sparse as sp

# Build a sparse user-item matrix (counts or confidence weights).
# plays_matrix is a placeholder: a (users x items) array of play counts defined elsewhere.
user_item = sp.csr_matrix(plays_matrix)

model = implicit.als.AlternatingLeastSquares(factors=64, regularization=0.1, iterations=20)
model.fit(user_item)

# Get recommendations for user 0 (implicit >= 0.5 returns arrays of item ids and scores)
ids, scores = model.recommend(0, user_item[0], N=10)

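Spark ALS (distributed, for warehouse-scale data; as configured here it fits explicit ratings, and implicitPrefs=True switches it to implicit feedback):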

from pyspark.ml.recommendation import ALS

als = ALS(rank=50, maxIter=10, regParam=0.1,
          userCol="userId", itemCol="itemId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(train_df)
predictions = model.transform(test_df)

Key concepts recap

Term	Meaning
Latent factor / embedding	A learned dense vector representing a user or item in k-dimensional taste space
k (rank)	Number of latent dimensions; controls model capacity
Observed mask	Only train on entries where a rating exists — never impute missing as 0 for explicit ratings
Regularization (λ)	L2 penalty on vector magnitudes; prevents overfitting on sparse data
Bias terms (μ, b_u, b_i)	Absorb global and per-entity rating offsets; keep factors focused on interaction
ALS	Closed-form alternating solver; parallelizable; standard for implicit feedback at scale
SGD	Stochastic gradient solver; flexible; good for explicit ratings and online learning

In one breath

Matrix factorization stops comparing raw rows and instead learns a short dense vector for every user and every item, so a predicted rating is just the dot product of two embeddings (plus global, user, and item bias terms) — trained by SGD or ALS on the observed entries only, which is how it sees through a 99%-empty matrix where neighborhood methods drown in noise.

Practice

Before the quiz, look once more at the two reconstructions. In the NMF output, several originally-missing entries came back near 0.00; in the masked SGD output, the same positions hold sensible mid-range predictions. Explain, in terms of the training objective, why the unmasked model is pulled toward zero on missing entries — and what one line of the SGD code (the observed list) is doing to avoid it.

Quick check

Q1. In matrix factorization, what does the dot product U[u] · V[i] represent?

Q2. Why is it wrong to treat missing entries as 0 when training a matrix factorization model on explicit ratings (e.g., 1–5 stars)?

Q3. A music streaming service has billions of play-count events but no explicit star ratings. They want to train a matrix factorization model at scale. Which approach fits best?

A question to carry forward

Notice the quiet switch that happened in this lesson. The objective we derived minimized squared error against star ratings — explicit numbers a user typed on purpose. But the production examples, ALS and the implicit library, were all built for play counts and clicks. We slid from one to the other as if they were the same kind of data. They are not.

A 5-star rating and a 1-star rating are both genuine signal — one says love, the other says hate. But a click only ever says “looked,” and a non-click says almost nothing: maybe you disliked it, maybe you never saw it, maybe you were asleep. There are no real negatives in implicit data, only presence and silence. So the question to carry forward is sharp: when your only signal is what people clicked, how do you train a model on data that has no honest “no”? That asymmetry between explicit and implicit feedback — and what it forces you to change — is the next lesson.
