datarekha

Matrix factorization (SVD, ALS)

Learn how the Netflix Prize winners modeled taste as latent factors by approximating the sparse utility matrix as a product of two thin matrices — the idea behind modern recommendation engines.

10 min read Advanced Recommender Systems Lesson 7 of 11

What you'll learn

  • Latent factor models: represent every user and item as a short dense vector of learned taste dimensions
  • Training objective: minimize squared error on observed entries with L2 regularization, using SGD or ALS
  • Bias terms: absorb global mean, per-user, and per-item offsets so factors capture only interaction structure

Before you start

Why neighborhoods fall short

When you predict a rating by averaging the k most-similar users, you are making two implicit bets:

  1. Similarity in raw rating space is a good proxy for shared taste.
  2. A handful of neighbors captures everything predictive about a user.

Both bets fail on large, sparse matrices. Two users can love the same obscure film for completely different reasons, or share a genre preference that cancels out across different sub-genres. And because most users have rated only a tiny fraction of items, the neighbor set is chosen from a noisy, incomplete signal.

Matrix factorization sidesteps both problems by learning a compressed, dense representation of every user and every item simultaneously, fitting the full pattern of observed ratings at once.


The core idea: low-rank approximation

Let R be the utility matrix with shape (users × items). Most entries are missing (unobserved). The goal is to find two thin matrices:

  • U of shape (users × k) — one row per user, each row is a latent vector (also called an embedding) of length k.
  • V of shape (items × k) — one row per item, each row is a latent vector of length k.

such that the product U V^T approximates the observed entries of R as closely as possible.

The predicted rating for user u on item i is:

r̂(u, i) = U[u] · V[i]   (dot product of two length-k vectors)
Rusers × itemsUusers × k×Vᵀk × itemsk latent factors → learned taste dimensions

The utility matrix R (users × items) is approximated as the product of a user-factor matrix U and an item-factor matrix Vᵀ. Each row of U and V is a length-k embedding in latent taste space.

What are latent factors?

Latent factors are learned dimensions of taste — never hand-labeled. After training you might inspect factor 0 and notice it loads heavily on action blockbusters and lightly on literary dramas; factor 1 might separate mainstream from indie. But the model discovers these dimensions purely from rating patterns, and they rarely map onto clean human-interpretable categories. The dimension k is a hyperparameter: larger k allows richer representations but risks overfitting on sparse data.


Training: minimize error on observed entries only

The training objective for basic matrix factorization is:

minimize  Σ_(u,i) observed  (r(u,i) − U[u]·V[i])²  +  λ(‖U[u]‖² + ‖V[i]‖²)

The λ term is L2 regularization — it penalizes large-magnitude vectors to prevent overfitting. Two standard solvers exist:

  • SGD (stochastic gradient descent): update U[u] and V[i] after each observed (u, i, r) triple. Simple to implement, works well with implicit feedback.
  • ALS (alternating least squares): hold V fixed and solve for U in closed form, then hold U fixed and solve for V, and alternate. Each sub-problem is a standard least-squares system. ALS parallelizes naturally and is the basis of Spark ALS and the implicit library.

Bias terms

Raw dot products explain interaction patterns, but they conflate general popularity and user leniency with genuine taste affinity. The full model adds:

r̂(u, i) = μ + b_u + b_i + U[u]·V[i]

where μ is the global mean, b_u is a user bias (does this user rate high overall?), and b_i is an item bias (is this item generally well-liked?). The latent factors then model only the residual interaction structure — which is exactly what you want them to learn.


Hands-on: matrix factorization in the browser

The playground below demonstrates two approaches on a small synthetic ratings matrix:

  1. NMF (Non-negative Matrix Factorization) from scikit-learn — fits the full dense matrix (fine for tiny examples; always mask in production).
  2. Gradient-descent MF in numpy — fits only the observed entries, showing the training objective directly.

Production-scale ALS and implicit feedback

The gradient-descent loop above is instructive but slow for millions of users and items. In production, two patterns dominate:

ALS with the implicit library (Python, CPU/GPU, optimized for implicit feedback — clicks, plays, purchases):

import implicit
import scipy.sparse as sp

# Build a sparse user-item matrix (counts or confidence weights)
user_item = sp.csr_matrix(plays_matrix)

model = implicit.als.AlternatingLeastSquares(factors=64, regularization=0.1, iterations=20)
model.fit(user_item)

# Get recommendations for user 0
ids, scores = model.recommend(0, user_item[0], N=10)

Spark ALS (distributed, for warehouse-scale data):

from pyspark.ml.recommendation import ALS

als = ALS(rank=50, maxIter=10, regParam=0.1,
          userCol="userId", itemCol="itemId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(train_df)
predictions = model.transform(test_df)

Key concepts recap

TermMeaning
Latent factor / embeddingA learned dense vector representing a user or item in k-dimensional taste space
k (rank)Number of latent dimensions; controls model capacity
Observed maskOnly train on entries where a rating exists — never impute missing as 0 for explicit ratings
Regularization (λ)L2 penalty on vector magnitudes; prevents overfitting on sparse data
Bias terms (μ, b_u, b_i)Absorb global and per-entity rating offsets; keep factors focused on interaction
ALSClosed-form alternating solver; parallelizable; standard for implicit feedback at scale
SGDStochastic gradient solver; flexible; good for explicit ratings and online learning

Quick check

0/3
Q1In matrix factorization, what does the dot product U[u] · V[i] represent?
Q2Why is it wrong to treat missing entries as 0 when training a matrix factorization model on explicit ratings (e.g., 1–5 stars)?
Q3A music streaming service has billions of play-count events but no explicit star ratings. They want to train a matrix factorization model at scale. Which approach fits best?

Practice this in an interview

All questions
What are t-SNE and UMAP, how do they differ from PCA, and what are their limitations for ML workflows?

t-SNE and UMAP are nonlinear dimensionality reduction algorithms designed primarily for 2D/3D visualization of high-dimensional data. Unlike PCA, they preserve local neighborhood structure rather than global variance, producing cleaner cluster separations in plots. Neither should be used as a preprocessing step for training a supervised model because they are transductive and their output is not stable across runs.

What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.

What is the curse of dimensionality, and how does it affect machine learning models?

As the number of features grows, the volume of the feature space increases exponentially, so training data becomes exponentially sparse. Distance-based algorithms degrade because points become approximately equidistant; density estimation requires data that grows exponentially; and overfitting risk rises for any fixed training set size.

How does an SVM work, and what is the kernel trick?

An SVM finds the hyperplane that maximises the margin between the two nearest points of each class (the support vectors). When data is not linearly separable, the kernel trick implicitly maps inputs to a high-dimensional feature space — computing inner products there without ever materialising the transformation — enabling non-linear decision boundaries at the cost of linear-space computation.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content