LLM recommenders: Netflix, Spotify, and the post-collaborative-filtering era

The recommendation industry has spent two years watching everyone outside it confidently predict its disruption. “LLMs will replace recommenders.” “The For You feed will be generated by GPT.” “Collaborative filtering is dead.” Two years on, the dominant pattern at Netflix, Spotify, YouTube, and TikTok is something different and much more nuanced — a hybrid stack where LLMs do specific load-bearing work at specific edges, and the classical collaborative-filtering plus content-embedding workhorse handles the core. The pure-LLM-as-recommender swing has not survived contact with the constraints these systems actually run under.

This post covers what’s actually shipping: the architectures, the latency and freshness numbers that made pure-LLM fail, and the surprising edges where LLMs have genuinely changed the game.

The constraint nobody outside RecSys appreciates

Recommendation systems at scale operate under three constraints simultaneously, and most LLM-only proposals fail on at least two of them:

Figure: the three constraints that pure-LLM-as-recommender architectures run into. The latency budget is too tight, content freshness moves too fast, and per-user personalisation scale is too large to encode in any single set of LLM weights.

Latency. TikTok’s For You feed runs at sub-150ms p99; YouTube’s homepage is in the same ballpark. An LLM call alone, even with aggressive caching and a small model, typically runs 500-800ms. That is roughly three to five times the entire latency budget of the multi-stage retrieval-plus-ranking pipeline, spent before you’ve done any retrieval. Pure LLM-as-ranker fails arithmetically.
Freshness. A breaking news video on TikTok needs to surface to users within seconds. A new podcast episode on Spotify within minutes. The classical embedding pipeline updates item representations on streaming Kafka topics — new items appear in the candidate set in seconds. LLM retraining cycles are days to weeks. Pure LLM recommenders are structurally late.
Personalisation scale. Netflix serves 300M+ users; Spotify serves 500M+. Both maintain dense user embeddings that update on every interaction. There is no single LLM context window that fits “what this user has watched in the last six months” for 300M users simultaneously. Even with retrieval-augmented prompting, the per-user cost-per-query of an LLM is orders of magnitude higher than the embedding-lookup cost of a classical recommender.

These three constraints, taken together, are why the production answer is hybrid. The classical pipeline owns the latency-and-freshness-critical core; LLMs do specific work where the latency budget and the personalisation shape make them viable.
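The latency arithmetic is easy to check by hand. Here is a minimal sketch of it, assuming a hypothetical split of the budget across four classical stages; only the 150ms p99 target and the 500-800ms LLM range come from the numbers above, and every per-stage figure is invented for illustration.

```python
# Illustrative latency budget check. Stage names and per-stage milliseconds
# are assumptions for this sketch; only the 150 ms p99 target and the
# 500-800 ms LLM call range come from the text.

P99_BUDGET_MS = 150

# Hypothetical split of the budget across classical pipeline stages.
classical_stages_ms = {
    "candidate_generation": 30,
    "feature_fetch": 20,
    "deep_ranking": 60,
    "rerank_and_policy": 20,
}

LLM_CALL_MS = (500, 800)  # range quoted in the text


def fits_budget(stage_latencies_ms, budget_ms=P99_BUDGET_MS):
    """Return (total_ms, fits) for stages that run one after another."""
    total = sum(stage_latencies_ms.values())
    return total, total <= budget_ms


classical_total, classical_ok = fits_budget(classical_stages_ms)

# Best case for the LLM: it replaces the whole pipeline and hits its lower bound.
llm_only_total, llm_only_ok = fits_budget({"llm_call": LLM_CALL_MS[0]})

# How many whole budgets the best-case LLM call consumes.
overshoot = LLM_CALL_MS[0] / P99_BUDGET_MS

print(f"classical: {classical_total} ms, fits={classical_ok}")
print(f"llm-only (best case): {llm_only_total} ms, fits={llm_only_ok}")
print(f"best-case LLM call is {overshoot:.1f}x the whole budget")
```

Even at the optimistic end of the range, the single LLM call overshoots the whole budget several times over, which is the sense in which the approach fails arithmetically rather than for lack of tuning.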

What Netflix actually shipped

Netflix has been the loudest about its recommendation system evolution. Their late-2024 foundation model post and Eugene Yan’s PRS talk recap are the cleanest public sketches. The shape:

The foundation model is a transformer trained on user interaction sequences as tokens. Each interaction event (a watch, a like, a bounce, a session) is tokenised as a multi-facet token, and the transformer learns unified user representations by predicting future interactions. It’s autoregressive in the same next-token-prediction sense as GPT, and Netflix’s experiments indicate that scaling laws hold for recommendation foundation models much as they do for language.
The foundation model doesn’t replace the existing recommender stack. It sits above it, producing user and item representations that downstream specialised models consume. The downstream models — homepage ranker, “Continue Watching” ranker, “Because You Watched” ranker — are still the things actually generating the user-facing recommendations, but they now share a unified embedding space rather than maintaining separate per-team representations.
LLMs appear in three specific edges. First, cold-start metadata enrichment — an LLM generates structured tags (“dark comedy with unreliable narrator, set in 1970s NYC”) for newly added titles, which enrich the content-based features the classical model uses. Second, generated explanations — the “because you watched X” copy is LLM-generated at request time. Third, conversational interfaces — the experimental Netflix search, where you can describe a vibe and get titles back, uses an LLM as the user-facing parser-and-ranker.

Notice what’s not in that list: the homepage ranking model. The thing that actually decides what shows up in your rows of tiles is a deep neural ranker trained on collaborative-filtering signals, content embeddings (which now incorporate LLM-generated tags as features), and context features. The LLM is upstream and downstream of it, but not in it.
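To make the “interaction events as tokens” framing concrete, here is a rough sketch of how a watch history could become autoregressive training pairs. The facets (`item_id`, `action`, `duration_bucket`), the bucketing thresholds, and the event dictionary layout are all assumptions for illustration; the actual Netflix token schema is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class InteractionToken:
    """One interaction event as a multi-facet token (facet choice is assumed)."""
    item_id: str
    action: str           # e.g. "watch", "like", "bounce"
    duration_bucket: int  # coarse bucket of engagement time


def bucket_duration(seconds: int) -> int:
    """Hypothetical bucketing: 0 = under a minute, 1 = under ten, 2 = longer."""
    if seconds < 60:
        return 0
    if seconds < 600:
        return 1
    return 2


def tokenise(events: List[Dict]) -> List[InteractionToken]:
    """Order raw events by timestamp and map each to a multi-facet token."""
    ordered = sorted(events, key=lambda e: e["ts"])
    return [
        InteractionToken(e["item_id"], e["action"], bucket_duration(e["seconds"]))
        for e in ordered
    ]


def next_event_pairs(
    tokens: List[InteractionToken], context_len: int
) -> List[Tuple[Tuple[InteractionToken, ...], InteractionToken]]:
    """Autoregressive pairs: (up to context_len previous tokens) -> next token."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tuple(tokens[max(0, i - context_len):i])
        pairs.append((context, tokens[i]))
    return pairs


events = [
    {"ts": 3, "item_id": "title_c", "action": "like", "seconds": 0},
    {"ts": 1, "item_id": "title_a", "action": "watch", "seconds": 2400},
    {"ts": 2, "item_id": "title_b", "action": "bounce", "seconds": 20},
]

tokens = tokenise(events)
pairs = next_event_pairs(tokens, context_len=2)

for context, target in pairs:
    print([t.item_id for t in context], "->", target.item_id)
```

The training objective is the same shape as language modelling: given the tokens so far, predict the next one. The difference is that each “word” is an event with several facets rather than a piece of text.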

Spotify’s pattern is the same shape, different edges

Spotify’s recommendation system is the same hybrid architecture with different specifics. The 2024-2025 research output — GLIDE for podcast discovery, contextualised narratives via LLMs, the AI DJ commentary product — all fit the same pattern:

The core music recommender stays classical-CF-plus-content. The homepage, Discover Weekly, Daily Mixes — all driven by the same embedding-based stack Spotify has run for years, refined but not fundamentally re-architected.
LLMs do four specific jobs. Podcast topic understanding (“this episode is about cold-case forensics” generated from the transcript), conversational playlist generation (“tell me about space exploration” produces a queue), AI DJ commentary (the spoken transitions between tracks), and the GLIDE generative retrieval system, which uses an LLM as a discovery-objective-conditioned ranker.
Semantic IDs are the integration trick. Spotify’s GLIDE represents items not as opaque IDs but as semantically meaningful tokens that the LLM can generate directly. This is the architectural innovation that makes LLM-as-generative-retriever viable at all — without semantic IDs, the LLM would have to predict from an unbounded item vocabulary, which doesn’t scale.
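One common way to build semantic IDs is residual quantisation of an item’s content embedding, which is the approach the TIGER paper takes with a learned RQ-VAE. The sketch below uses tiny hand-written codebooks instead of learned ones and is not a description of GLIDE’s actual tokeniser; the item names, vectors, and codebook values are all made up for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: List[Vector], vec: Vector) -> int:
    """Index of the codeword closest to vec by squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vec))


def semantic_id(vec: Vector, codebooks: List[List[Vector]]) -> Tuple[int, ...]:
    """Residual quantisation: one code per level, each level encoding what
    the previous levels left unexplained."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Hand-written codebooks (assumed). In practice these would be learned.
CODEBOOKS = [
    [(1.0, 0.0), (0.0, 1.0)],                            # level 0: coarse topic
    [(0.2, 0.0), (0.0, 0.2), (-0.2, 0.0), (0.0, -0.2)],  # level 1: refinement
]

# Hypothetical 2-d content embeddings for three podcast episodes.
item_vectors = {
    "true_crime_ep1": (0.85, 0.10),
    "true_crime_ep2": (1.10, 0.15),
    "cooking_ep1": (0.05, 0.80),
}

ids = {name: semantic_id(vec, CODEBOOKS) for name, vec in item_vectors.items()}
for name, code in ids.items():
    print(name, code)
```

The two true-crime episodes share their first code and differ only in the second, so a model generating codes left to right commits to a coarse topic first and refines afterwards. The output vocabulary is the sum of the codebook sizes (six tokens here), not one token per item, which is what keeps generation tractable as the catalog grows.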

The Spotify pattern is a useful counterpoint to Netflix because the LLM does more user-facing work — playlist generation and AI DJ are products users interact with directly, where Netflix’s LLM work is mostly behind-the-scenes feature enrichment. The shared lesson is that the classical recommender carries the core ranking weight in both cases.

Where pure-LLM-as-recommender does work

The pattern isn’t universal: there are domains where pure LLM recommendation has shipped successfully. The common feature: the set of items the LLM actually has to rank is small and the personalisation signal is sparse.

The most cited example is conversational shopping assistants, e.g. Shopify’s Sidekick or Amazon’s Rufus. The catalogs are large but the relevant subset per user query is small (a handful of products at any given turn), the user explicitly states intent in natural language, and the personalisation signal is mostly the current session. In this regime, retrieval narrows a catalog of tens of thousands of items down to a short list, and an LLM doing retrieval-augmented ranking over that short list is genuinely viable.

The other working case is cold start, wherever it occurs: new e-commerce marketplaces, internal enterprise search, niche professional tools, or anywhere else the classical CF signal is too sparse to produce reasonable rankings. An LLM with good content embeddings can outperform a CF model that has barely any data to learn from.

Both of these have the same structural feature: they are not the high-QPS personalised-discovery feeds where pure-LLM keeps failing.
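The retrieval-augmented ranking shape for a shopping assistant can be sketched in a few lines. Everything here is a stand-in: the catalog, the word-overlap retriever (a real system would more likely use embedding search), and the prompt wording are all assumptions, and the LLM call itself is deliberately left out.

```python
from typing import Dict, List, Set


def words(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a deliberately crude tokenizer."""
    return set(text.lower().split())


def retrieve(query: str, catalog: Dict[str, str], k: int) -> List[str]:
    """Cheap lexical retrieval: rank products by word overlap with the query."""
    q = words(query)
    scored = [(len(q & words(desc)), pid) for pid, desc in catalog.items()]
    scored = [(s, pid) for s, pid in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best overlap first
    return [pid for _, pid in scored[:k]]


def build_ranking_prompt(
    query: str, session_notes: str, candidates: List[str], catalog: Dict[str, str]
) -> str:
    """Assemble the text an LLM ranker would see: intent, session, short list."""
    lines = [
        "Rank the candidate products for this shopper, best first.",
        f"Shopper request: {query}",
        f"Session so far: {session_notes}",
        "Candidates:",
    ]
    lines += [f"- {pid}: {catalog[pid]}" for pid in candidates]
    return "\n".join(lines)


# Hypothetical four-item catalog standing in for a much larger one.
catalog = {
    "p1": "waterproof hiking boots leather",
    "p2": "lightweight waterproof trail shoes",
    "p3": "waterproof rain jacket lightweight",
    "p4": "cast iron skillet",
}

query = "lightweight waterproof shoes for hiking"
candidates = retrieve(query, catalog, k=2)
prompt = build_ranking_prompt(query, "viewed two tents", candidates, catalog)
print(prompt)
```

The point of the shape is that the LLM never sees the catalog, only the short list. That is why it works when the user states intent explicitly, and why it does not transfer to a feed where there is no query to retrieve against.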

The hybrid stack, drawn

Figure: the shape that’s converged in production. The classical recommender carries candidate generation and deep ranking. LLMs sit at the edges: enriching item metadata for cold start, producing the user-facing copy, and (in the foundation-model layer) generating the unified embeddings consumed downstream.

The structure is consistent across the big shops:

Top: the foundation model layer. A transformer trained on interaction sequences produces unified embeddings. This is the layer that’s “LLM-shaped” but isn’t an LLM in the chat-API sense; it’s a recommender-domain foundation model.
Middle: classical CF + content embedding candidate generation, plus deep ranking. This is the latency-critical workhorse. Sub-150ms p99 lives or dies here.
Edges: LLMs do the long-tail jobs. Cold-start metadata enrichment, explanation generation, conversational interface, exploratory search. None of these sit inside the ranking hot path: metadata enrichment is pre-computed offline, explanation copy is written for titles the ranker has already chosen, and conversational search is only triggered when the user explicitly opts in.
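The offline-enrichment edge is the easiest one to picture in code. In the sketch below, a dictionary stands in for the output of a batch LLM tagging job, and Jaccard overlap on tags stands in for the learned content embeddings a real stack would use; the titles, tags, and threshold are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# Stand-in for the output of an offline LLM tagging job (assumed data).
# Nothing in this file calls an LLM; serving only reads these tags.
ITEM_TAGS: Dict[str, Set[str]] = {
    "old_title_1": {"dark comedy", "1970s", "nyc"},
    "old_title_2": {"space opera", "ensemble cast"},
    "new_title_1": {"dark comedy", "unreliable narrator", "nyc"},  # no plays yet
    "new_title_2": {"cooking competition"},                        # no plays yet
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Tag overlap in [0, 1]; a crude proxy for content similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cold_start_candidates(
    watched: List[str], new_items: List[str], tags: Dict[str, Set[str]], min_sim: float
) -> List[Tuple[str, float]]:
    """Score each new item by its best tag overlap with anything watched."""
    scored = []
    for item in new_items:
        best = max((jaccard(tags[item], tags[w]) for w in watched), default=0.0)
        if best >= min_sim:
            scored.append((best, item))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(item, round(score, 2)) for score, item in scored]


result = cold_start_candidates(
    watched=["old_title_1"],
    new_items=["new_title_1", "new_title_2"],
    tags=ITEM_TAGS,
    min_sim=0.2,
)
print(result)
```

A title with zero interaction history still reaches the candidate set because the tags give the classical side something to match on. The LLM’s cost is paid once per title at ingestion time, not once per request.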

The interesting recent work in production is the foundation model layer. Both Netflix’s recommendation foundation model and Spotify’s semantic IDs work are converging on the same idea: train a transformer on user-item sequences, use it as a representation generator feeding the downstream classical pipeline. The training is LLM-shaped; the serving is not.

What the all-LLM proposals get wrong

The proposals that keep failing share a small set of mistakes:

They treat the recommender as a single LLM call. A real recommender pipeline at scale runs candidate generation, scoring, diversity reranking, and policy enforcement as separate stages. Collapsing all of these into one LLM call is what creates the latency problem in the first place.
They underestimate the freshness gap. A pre-trained LLM has a knowledge cutoff. Even with fine-tuning, the cycle from “new item exists” to “model reflects new item” is days. Embedding pipelines close that gap in minutes.
They underestimate the personalisation cost. Per-user personalisation in a classical recommender is an embedding lookup. Per-user personalisation in an LLM is a retrieval-augmented prompt, with the user’s interaction history as context tokens. That’s two to three orders of magnitude more cost per query.
They overestimate the natural-language interface. Most users don’t want to type “I’m in the mood for a thoughtful sci-fi with female leads.” Most users want to tap on a thumbnail. The conversational interface is a useful additional surface, not a replacement for the existing one.
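The first mistake is easier to see with the stages written out separately. This is a toy version of the four stages named above, in the order the text lists them; the items, embeddings, genre cap, and policy flag are all hypothetical, and a dot product stands in for the deep ranker.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float]

# item_id -> (embedding, genre, allowed_by_policy). All values are made up.
ITEMS: Dict[str, Tuple[Vec, str, bool]] = {
    "a": ((0.9, 0.1), "thriller", True),
    "b": ((0.8, 0.2), "thriller", True),
    "c": ((0.7, 0.3), "thriller", True),
    "d": ((0.2, 0.9), "comedy", True),
    "e": ((0.95, 0.0), "thriller", False),  # e.g. not licensed in this region
}


def dot(u: Vec, v: Vec) -> float:
    return sum(x * y for x, y in zip(u, v))


def generate_candidates(user_vec: Vec, n: int) -> List[str]:
    """Stage 1: cheap top-n by embedding similarity."""
    return sorted(ITEMS, key=lambda i: -dot(user_vec, ITEMS[i][0]))[:n]


def score(user_vec: Vec, candidates: List[str]) -> List[Tuple[float, str]]:
    """Stage 2: scoring. A dot product stands in for the deep ranker."""
    return sorted(((dot(user_vec, ITEMS[i][0]), i) for i in candidates), reverse=True)


def diversity_rerank(scored: List[Tuple[float, str]], max_per_genre: int) -> List[str]:
    """Stage 3: walk the ranked list, capping how many items a genre may hold."""
    kept: List[str] = []
    counts: Dict[str, int] = {}
    for _, item in scored:
        genre = ITEMS[item][1]
        if counts.get(genre, 0) < max_per_genre:
            kept.append(item)
            counts[genre] = counts.get(genre, 0) + 1
    return kept


def enforce_policy(items: List[str]) -> List[str]:
    """Stage 4: drop anything the policy flag forbids."""
    return [i for i in items if ITEMS[i][2]]


def recommend(user_vec: Vec, n: int = 5, max_per_genre: int = 2) -> List[str]:
    candidates = generate_candidates(user_vec, n)
    scored = score(user_vec, candidates)
    diverse = diversity_rerank(scored, max_per_genre)
    return enforce_policy(diverse)


print(recommend((1.0, 0.0)))
```

Each stage has its own cost profile and can be budgeted, cached, and replaced independently, which is exactly what a single LLM call gives up. The example also shows why stage order matters: the blocked item takes a thriller slot before the policy stage removes it, so a real system might apply hard filters earlier.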

The thing the all-LLM proposals get right is that pure CF has limits — it can’t reason about explanations, it can’t handle cold start well, it can’t expose a conversational interface. The hybrid pattern that won is the one that takes those wins seriously without giving up the core.

What’s likely next

Two trends through the rest of 2026:

Recommendation foundation models become a standard layer. Netflix shipped first; Spotify shipped semantic-ID variants; Pinterest and TikTok have published similar architectures. The “transformer trained on interaction sequences as the representation backbone” is becoming table stakes for new recommender stacks.
Generative retrieval gets more serious. Spotify’s GLIDE, Google’s TIGER (the original semantic-ID paper), and a wave of follow-on research are making “LLM directly generates item IDs” a viable alternative to retrieve-then-rank in specific domains. Watch for production deployments to widen through 2026.

The high-order story is unchanged: the recommender industry absorbed LLMs without being absorbed by them. The classical pipeline is still the workhorse. The LLMs do specific jobs at the edges, and the most impactful new architecture is the recommendation-domain foundation model that sits above the existing stack rather than replacing it.

The teams who predicted “GPT will replace the recommender” had the right intuition that something was going to change. They had the specifics exactly wrong.

One closing thought worth dwelling on: the most economically important change of the last two years in recommendations was not LLMs at all. It was the foundation-model layer — a transformer trained on user-item sequences, scaled like an LLM, producing unified representations that collapse what used to be dozens of per-team feature stores into one shared embedding space. That’s the architectural shift that’s actually reshaped the stacks at Netflix, TikTok, Pinterest, and (in different shape) Spotify. The user-facing LLM features get the headlines; the foundation-model layer is the part that compounds into a durable advantage. If you’re rebuilding a recommender stack in 2026 and you’re spending all your time thinking about which LLM to call from the homepage, you’re optimising the wrong layer.

Further reading: Netflix’s foundation model post on recommendations, Eugene Yan’s PRS 2024 talk on LLMs for recommendations, Spotify’s GLIDE generative podcast retrieval paper, the Spotify contextualised narratives post, and the LLMs for Recommender Systems survey for the academic landscape.