What's the difference between full retraining, incremental (warm-start) training, and continual online learning?

Full retraining trains a fresh model from scratch on the latest data window, giving the cleanest result but at the highest cost and slowest cadence. Incremental or warm-start training continues from existing weights on new data, which is cheaper and faster but can accumulate drift and forgetting. Continual online learning updates the model continuously from a live stream for maximum freshness, at the cost of stability, harder evaluation, and vulnerability to bad or poisoned data.

You are asked to 'use ML to improve the user experience on our platform.' How do you approach this completely open-ended problem?

Open-ended ML problems require scoping before modelling: translate the vague ask into a measurable business objective, identify which user interaction has the highest improvement potential, formulate it as a concrete ML task with a defined label and evaluation metric, then propose the simplest viable model first. Jumping to model architecture before this scoping is the most common interview failure mode.

The cold-start problem — Recommender Systems

The last lesson left us with a quiet but devastating observation. Every metric we built to judge a recommender — precision, recall, NDCG, the temporal split — assumed the user had a history to hold out and the items had interactions to rank. The whole apparatus runs on a past. But the most consequential moments in a recommender’s life are the ones with no past: the visitor who signed up ten seconds ago, the product added to the catalog this morning. So we asked the question that decides whether either one is ever served well: how do you recommend when you have no history at all?

That is the cold-start problem, and every production recommender has to solve it before it can do anything else. A system that is brilliant for established users and items but useless for new ones will bleed both — new users churn before they are hooked, new items die before they are discovered.

The root cause: collaborative filtering needs overlap

Recall how collaborative filtering works, user-based or item-based. It hunts for overlap — users who rated the same items, or items rated by the same users. Given enough overlap, it uncovers taste patterns no hand-crafted feature could capture. That was its whole magic.

But overlap is built from history. Strip the history away and there is nothing to compute: the similarity that powers every CF prediction is undefined. The algorithm does not throw an error — it just quietly returns nothing, or the same popular items for everyone, or noise. That silent emptiness is the cold-start problem, and it shows up in three distinct shapes.

Three distinct cold-start cases

The single phrase “cold start” actually covers three separate situations, each breaking CF in its own way and each needing its own remedy.

The three cold-start cases and the primary remedies for each.

Case 1 — New user

Someone just registered. No clicks, no ratings, no purchases. CF cannot place them anywhere in taste space because there is no signal to anchor on. With zero interactions, the new user's interaction vector is all zeros, so its cosine similarity to every other user is 0/0. The similarity computation has no answer to give.
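To make the failure concrete, here is a minimal sketch on a made-up user-item matrix. The numbers are illustrative, and `cosine_or_none` is a hypothetical helper. Cosine similarity between the new user's all-zero row and any other row divides by a zero norm. The helper returns `None` in that case instead of quietly producing a number.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = items); values are illustrative only.
ratings = np.array([
    [5, 3, 0, 1],   # user 0
    [4, 0, 0, 1],   # user 1
    [1, 1, 0, 5],   # user 2
    [0, 0, 0, 0],   # user 3: just signed up, no history at all
], dtype=float)

def cosine_or_none(u, v):
    """Cosine similarity, or None when either vector has zero norm (0/0 is undefined)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return None
    return float(u @ v / denom)

new_user = ratings[3]
sims = [cosine_or_none(new_user, ratings[j]) for j in range(3)]
print(sims)  # [None, None, None]: no neighbours, so user-based CF has nothing to predict from
```

Plenty of implementations return 0 here instead of `None`. That is the "quietly returns nothing" behaviour from the previous section: a new user then looks equally unrelated to everyone.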

The remedies all share one goal: manufacture a little signal, fast.

Preference elicitation. Ask, directly. A short onboarding screen — “Pick a few genres you like,” “Rate these three titles” — turns the void into a starting point. Even four explicit signals shrink the cold zone dramatically. Spotify, Netflix, and YouTube all do a version of this.
Demographic and context priors. You usually know something even at signup: region, device, referral source, time of day. Look up what similar contextual cohorts tend to enjoy and use it as a prior.
Popularity and trending fallbacks. With no individual signal at all, show what is broadly popular right now. It is not personalized — but it beats random, and it is honest about what it is.
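The three remedies above can be stacked into one ranking for a user with no history. This is a minimal sketch under stated assumptions. The catalog, play counts, and regional cohort data are invented, and `recommend_new_user` is a hypothetical function. Items are ordered by elicited genre match first, then by a regional cohort prior, then by global popularity as the final fallback.

```python
from collections import Counter

# Hypothetical catalog (item -> genre) and global play counts; all numbers are made up.
CATALOG = {
    "Space Saga": "sci-fi", "Star Short": "sci-fi", "Love Story": "romance",
    "Night Chase": "thriller", "Ocean Doc": "doc",
}
GLOBAL_PLAYS = Counter({"Space Saga": 900, "Love Story": 700, "Night Chase": 650,
                        "Star Short": 400, "Ocean Doc": 300})
# Hypothetical contextual prior: what users in the same region played most.
REGION_PLAYS = {"UK": Counter({"Love Story": 500, "Ocean Doc": 450})}

def recommend_new_user(picked_genres=(), region=None, k=3):
    """Rank items for a user with zero interactions.

    Sort key, most important first:
      1. does the item match a genre the user picked at onboarding? (elicitation)
      2. how popular is it in the user's regional cohort?          (context prior)
      3. how popular is it overall?                                 (popularity fallback)
    """
    cohort = REGION_PLAYS.get(region, Counter())   # unknown region -> empty prior
    def key(item):
        return (CATALOG[item] in picked_genres, cohort[item], GLOBAL_PLAYS[item])
    return sorted(CATALOG, key=key, reverse=True)[:k]

print(recommend_new_user())                          # ['Space Saga', 'Love Story', 'Night Chase']
print(recommend_new_user(picked_genres={"sci-fi"}))  # ['Space Saga', 'Star Short', 'Love Story']
print(recommend_new_user(region="UK"))               # ['Love Story', 'Ocean Doc', 'Space Saga']
```

A lexicographic sort key is the bluntest way to combine the signals, because any genre match outranks any amount of popularity. A weighted score is the usual alternative when the signals should trade off against each other.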

Case 2 — New item

A fresh product, article, or song lands in the catalog with no ratings, no clicks, no history. CF cannot recommend it to anyone, because it appears in no user’s vector — item-based CF finds items that co-occur in histories, and a brand-new item has co-occurred with nothing.
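Here is the same failure from the item side, as a minimal sketch on a made-up binary interaction matrix. A common item-based CF building block counts, for each pair of items, how many users interacted with both. That count is the co-occurrence matrix X^T X. A new item's all-zero column becomes an all-zero row and column in that matrix.

```python
import numpy as np

# Binary user-item matrix (1 = user interacted with item); illustrative data.
# Column 3 is a brand-new item that nobody has touched yet.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
])

# Item-item co-occurrence: entry (i, j) counts users who interacted with both i and j.
co = X.T @ X
np.fill_diagonal(co, 0)   # drop each item's trivial co-occurrence with itself

print(co[3])  # [0 0 0 0]: the new item co-occurs with nothing, so item-based CF never surfaces it
```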

Content-based filtering on metadata. Here is where the lesson two chapters back pays off. If you know the item’s features — genre, author, price range, ingredient list, audio tempo — you can match it to users whose profiles fit those features, no ratings required. This is the exact problem content-based filtering was built for.
Deliberate exploration. Show the item to a small slice of users and watch what happens. That is the exploration half of a trade-off we will make precise in a moment.
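Content-based matching can also run in the item-to-user direction: given only a new item's metadata, find the users most likely to want it. The minimal sketch below assumes each user already has a taste profile over the same features, for example the average feature vector of items they liked. The profiles and `audience_for` are hypothetical. The users it returns are also a natural candidate slice for the deliberate exploration just described.

```python
import numpy as np

FEATURES = ["sci_fi", "romance", "thriller", "doc"]

# Hypothetical user taste profiles over FEATURES (e.g. averaged features of liked items).
user_profiles = {
    "u1": np.array([0.9, 0.0, 0.1, 0.0]),
    "u2": np.array([0.0, 0.8, 0.2, 0.0]),
    "u3": np.array([0.5, 0.0, 0.0, 0.5]),
}

# A brand-new item: zero interactions, but its metadata is known (a sci-fi documentary).
new_item = np.array([1.0, 0.0, 0.0, 1.0])

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def audience_for(item_vec, profiles, top_n=2):
    """Return the users whose content profile best matches the item; no ratings needed."""
    scored = {uid: cosine(item_vec, p) for uid, p in profiles.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

print(audience_for(new_item, user_profiles))  # ['u3', 'u1']
```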

Case 3 — New system

A brand-new platform: no users, no rated items, no interactions anywhere. The hardest case, because every other remedy assumed at least one side of the matrix had data.

Bootstrap with content and editorial curation. Start with hand-picked recommendations from human experts — editors, curators — combined with content-based matching on metadata.
Import or synthesize prior signal. Seed the system with aggregate popularity from public sources (chart positions, review aggregators) until real interactions accumulate.
Accept the ramp-up. A new system is simply less accurate than a mature one. Design the UX to set expectations honestly and to harvest feedback aggressively, so the cold period is as short as possible.
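One way to make "seed with prior signal until real interactions accumulate" concrete is additive smoothing with pseudo-counts. That is our assumption, not a technique this lesson prescribes. The imported prior counts as `prior_strength` virtual impressions, so it dominates on day one and fades as real impressions accumulate. Here `prior_ctr` is a hypothetical click-through guess derived from external signal such as chart position.

```python
def smoothed_ctr(clicks, impressions, prior_ctr, prior_strength=50):
    """Blend an imported prior with the observed click-through rate.

    prior_strength acts as 'virtual impressions' at rate prior_ctr:
    with no real data the estimate equals the prior; as real impressions
    grow, the observed rate takes over.
    """
    return (clicks + prior_strength * prior_ctr) / (impressions + prior_strength)

prior = 0.10   # hypothetical: an item that charts well externally is assumed to get ~10% CTR

print(smoothed_ctr(0, 0, prior))       # ~0.1   (day one: pure prior)
print(smoothed_ctr(2, 50, prior))      # ~0.07  (halfway between prior 10% and observed 4%)
print(smoothed_ctr(40, 1000, prior))   # ~0.043 (mostly the observed 4% now)
```

The single knob `prior_strength` sets how long the cold period leans on imported signal. That makes it easy to shorten the ramp-up as the platform gathers its own interactions.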

The hybrid answer: lean on content early, shift to CF over time

Step back and the pattern is clear. Cold entities have no collaborative signal but often do have content; warm entities have rich collaborative signal. So the production answer is a hybrid that routes by how much history is available:

Zero to a few interactions: content-based filtering and popularity fallbacks only.
Tens of interactions: blend content and CF, weighting CF lightly.
Rich history: trust CF heavily; content becomes a gentle regularizer rather than the driver.

Concretely, suppose cf_score and cb_score are both normalized to [0, 1]. A weight alpha interpolates between them, hybrid_score = alpha * cf_score + (1 - alpha) * cb_score, and alpha itself rises with the number of interactions accumulated. At alpha = 0 the system is pure content-and-popularity (the content side can itself fold in a popularity term); as alpha climbs toward 1 it becomes pure CF. The schedule for alpha is a tunable knob.
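A minimal sketch of one possible schedule follows. The form alpha = n / (n + k) is our illustrative choice, not a standard the lesson fixes, and the value of `k` is hypothetical. With `k = 50` it roughly tracks the routing tiers above. A handful of interactions keeps alpha near 0, tens of interactions weight CF lightly, and a rich history pushes alpha toward 1.

```python
def alpha_for(n_interactions, k=50):
    """Hypothetical schedule alpha = n / (n + k).

    n = 0 -> alpha = 0 (pure content/popularity); n = k -> 0.5;
    alpha approaches 1 (pure CF) as history grows. k is the tunable knob.
    """
    return n_interactions / (n_interactions + k)

def hybrid_score(cf_score, cb_score, n_interactions, k=50):
    """Interpolate two scores that are already normalised to [0, 1]."""
    a = alpha_for(n_interactions, k)
    return a * cf_score + (1 - a) * cb_score

for n in (0, 3, 30, 500):
    print(n, round(alpha_for(n), 3))   # alpha: 0.0, 0.057, 0.375, 0.909
```

A smooth schedule avoids the abrupt handoff of hard tier boundaries. Raising `k` makes the system trust CF more slowly; lowering it hands over sooner.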

Exploration vs. exploitation: bandits

The “show a new item to a few users” idea hides a genuine dilemma. Every slot you give to an unknown item is a slot you did not give to something you are already confident about. Do you exploit what you know, or explore to learn?

This is the multi-armed bandit framing. Picture each item as a lever (“arm”) on a slot machine. Pulling an arm means recommending the item and watching the reward — a click, a purchase, a watch-through. The goal is to maximize total reward over time while still gathering information on the arms you have not tried.

Epsilon-greedy is the simplest strategy and a fine place to start:

With probability epsilon, pick a random item — possibly a new one you know nothing about (explore).
With probability 1 - epsilon, pick the item with the best current estimated reward (exploit).
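The two rules above fit in a few lines. This minimal sketch uses simulated feedback. The item names and their "true" click rates are made up, and in production the reward would come from real clicks. The estimates are running means of observed reward per item.

```python
import random

def choose_item(estimates, epsilon, rng=random):
    """Epsilon-greedy: explore a random item with probability epsilon, else exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))       # explore: may pick a brand-new item
    return max(estimates, key=estimates.get)     # exploit: current best estimate

def record_reward(estimates, counts, item, reward):
    """Incrementally update the running mean reward (click = 1, no click = 0) for one item."""
    counts[item] += 1
    estimates[item] += (reward - estimates[item]) / counts[item]

# Simulation with hypothetical ground-truth click rates; "new_short" starts with no data.
rng = random.Random(0)
true_ctr = {"epic": 0.10, "romcom": 0.05, "new_short": 0.12}
estimates = {item: 0.0 for item in true_ctr}
counts = {item: 0 for item in true_ctr}

for _ in range(5000):
    item = choose_item(estimates, epsilon=0.1, rng=rng)
    reward = 1.0 if rng.random() < true_ctr[item] else 0.0
    record_reward(estimates, counts, item, reward)

print(counts)      # how often each item was shown
print(estimates)   # observed click rate per item
```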

A new item gradually accrues enough observations to earn a reliable estimate, after which it competes on its own merits. epsilon is usually annealed — reduced over time — as the catalog of true unknowns shrinks. The fancier UCB (Upper Confidence Bound) and Thompson Sampling balance the trade-off more cleverly, but epsilon-greedy is easy to debug and reason about, which is often worth more than theoretical optimality early on.
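Below is a minimal sketch of the two refinements just mentioned. The exponential decay with a floor is one common annealing schedule, and its constants are hypothetical. The UCB function is the textbook UCB1 rule: mean reward plus sqrt(2 ln t / n). That rule suits cold start well, because an item with zero pulls scores +inf and is tried before anything else.

```python
import math

def annealed_epsilon(step, eps_start=0.3, eps_min=0.02, decay=0.999):
    """Exponentially decay epsilon toward a floor (one common schedule, not the only one)."""
    return max(eps_min, eps_start * decay ** step)

def ucb1_pick(estimates, counts, total_pulls):
    """UCB1: pick the item maximising mean + sqrt(2 ln t / n).

    Untried items (n = 0) score +inf, so a brand-new item is shown first.
    """
    def score(item):
        if counts[item] == 0:
            return math.inf
        return estimates[item] + math.sqrt(2 * math.log(total_pulls) / counts[item])
    return max(estimates, key=score)

print(annealed_epsilon(0))        # 0.3  (explore heavily at first)
print(annealed_epsilon(10_000))   # 0.02 (floor: never stop exploring entirely)
print(ucb1_pick({"epic": 0.10, "new": 0.0}, {"epic": 100, "new": 0}, total_pulls=100))  # new
```

The floor keeps a trickle of exploration alive for items that launch later. Dropping epsilon to zero would starve them.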

Watch the hybrid score rescue a new item

Let us put two remedies together — a popularity fallback for the new user, a content score for the new items — and watch a cold-start ranking come out sensible. Our new user told us one thing at signup: they like sci-fi. Two of the five items (a sci-fi short with only 5 interactions, a documentary with 0) are essentially new.

import numpy as np
import pandas as pd

items = pd.DataFrame({
    "item_id": ["A", "B", "C", "D", "E"],
    "title":   ["Sci-fi epic", "Rom-com", "Thriller", "Sci-fi short", "Documentary"],
    "sci_fi":  [1, 0, 0, 1, 0],
    "romance": [0, 1, 0, 0, 0],
    "thriller":[0, 0, 1, 0, 0],
    "doc":     [0, 0, 0, 0, 1],
    "interactions": [820, 450, 310, 5, 0],   # D and E are essentially new
})

feature_cols = ["sci_fi", "romance", "thriller", "doc"]
user_prefs = np.array([1, 0, 0, 0], dtype=float)        # "I like sci-fi"
item_features = items[feature_cols].to_numpy(dtype=float)

def cosine_sim(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

items["cb_score"]  = [cosine_sim(user_prefs, row) for row in item_features]
items["pop_score"] = items["interactions"] / items["interactions"].max()

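# Brand-new user => alpha = 0 (no CF signal yet); blend content + popularity.
# The full hybrid would be alpha * cf_score + (1 - alpha) * (content/popularity blend);
# the CF term is omitted because alpha = 0 and no cf_score exists for this user yet.
alpha, cb_weight, pop_weight = 0.0, 0.6, 0.4
items["hybrid_score"] = (1 - alpha) * (cb_weight * items["cb_score"]
                                       + pop_weight * items["pop_score"])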

ranked = (items[["title", "interactions", "cb_score", "pop_score", "hybrid_score"]]
          .sort_values("hybrid_score", ascending=False))
print("Cold-start recommendations for a new sci-fi fan:")
print(ranked.to_string(index=False))

Cold-start recommendations for a new sci-fi fan:
       title  interactions  cb_score  pop_score  hybrid_score
 Sci-fi epic           820       1.0   1.000000      1.000000
Sci-fi short             5       1.0   0.006098      0.602439
     Rom-com           450       0.0   0.548780      0.219512
    Thriller           310       0.0   0.378049      0.151220
 Documentary             0       0.0   0.000000      0.000000

Read the ranking and the cold-start strategy is visible in the numbers. “Sci-fi epic” tops the list — it is both a perfect content match (cb 1.0) and wildly popular (pop 1.0), exactly what you want to show a new sci-fi fan. But the quietly important result is second place: “Sci-fi short,” with a mere 5 interactions, leaps over the far more popular Rom-com and Thriller. Pure CF would have buried it for lack of history; the content score (cb 1.0) rescues it, and the hybrid lands it at 0.602 against Rom-com’s 0.219. Meanwhile the Documentary, with no content match and no popularity, correctly settles to the bottom. The new item got a fair hearing — which is the whole point.

In one breath

Cold start is the failure of collaborative filtering when there is no interaction history to compute overlap from, and it comes in three flavors — new user, new item, new system — each met by leaning on content and popularity until collaborative signal accrues, blended through a history-weighted hybrid and supplemented by deliberate bandit-style exploration that trades a little short-term relevance for the signal new items need.

Practice

Before the quiz, reason about the alpha schedule. For a brand-new user we set alpha = 0 — pure content and popularity. As they rack up interactions, alpha should climb toward 1. In your own words: what goes wrong if you leave alpha at 0 forever, and what goes wrong if you jump it to 1 the moment a user has just three interactions? Then connect it back: which of the three cold-start cases does the epsilon-greedy bandit most directly help, and why?

Quick check


Q1. Why does pure collaborative filtering fail for a brand-new item that was added to the catalog five minutes ago?

Q2. A streaming platform onboards a new user with a 'Pick 3 genres you enjoy' screen. Which cold-start technique is this?

Q3. An e-commerce site uses epsilon-greedy with epsilon=0.1 to handle new products. A product team proposes raising epsilon to 0.4 for the holiday season when many new items launch. What is the most likely trade-off?

A question to carry forward

Notice what we have really been doing in this lesson. Every cure for cold start was a patch bolted onto collaborative filtering from the outside: when CF goes blind, swap in content; when content is not enough, fall back to popularity; when even that fails, explore with a bandit. We kept a CF core and surrounded it with rescuers, switching between separate models by hand.

That works — but it is a little uneasy, a committee of single-purpose models held together with routing rules. So the question to carry forward is whether we can do better than bolting models together. What if a single model could absorb all the signals at once — collaborative history, item content, user context, sequence — and learn its own internal blend, so cold start and warm start are just two ends of one continuum rather than two code paths? That is the promise of hybrid and neural recommenders, the architecture behind YouTube and Netflix at billions-of-sessions scale, and it is the final lesson of this section.
