datarekha
Patterns · April 17, 2026

Fine-tuning vs RAG: the settled debate of 2026

Two years of Twitter brawling about whether to fine-tune or retrieve have ended in a boring, useful answer: RAG for knowledge, fine-tuning for behaviour, and both when you need both. Here's the decision table production teams actually use.

12 min read · by datarekha · fine-tuning, rag, lora, decision-framework

For about two years — call it mid-2023 to early 2025 — the LLM internet re-litigated the same fight every six weeks. Just fine-tune it. No, just use RAG. Fine-tuning is dead. RAG is a hack. Every conference had a panel on it. Every consultant had a deck. The only thing the discourse managed to do reliably was confuse the people actually trying to ship something.

By 2026 the debate is, mercifully, over. Not because someone won, but because the question turned out to be wrong. Fine-tuning and RAG are not substitutes; they solve different problems. The teams shipping useful LLM-backed products (Cursor, Cohere’s enterprise customers, Anthropic’s own product surface) have largely converged on the same playbook. This post is that playbook, with the receipts.

The two-line rule

If you take nothing else away, take this: use RAG when the model needs facts it doesn’t have. Fine-tune when the model needs a behaviour it can’t reliably produce from prompting alone.

Everything that follows is elaboration on those two lines.

RAG is a knowledge problem. The model’s parameters are frozen at training time; the world changes; users ask about things that happened yesterday or that live in your private corpus. You retrieve those facts at inference time and put them in the prompt. The model is then generally smart enough to ground on the retrieved text.
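
To make "retrieve at inference time and put it in the prompt" concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the three-document corpus, the bag-of-words similarity (a real system would use an embedding model and a vector index), and the prompt wording.

```python
import math
from collections import Counter

# Hypothetical corpus standing in for your private documents.
DOCS = {
    "pricing.md": "The Pro plan costs 40 dollars per seat per month.",
    "sso.md": "Single sign-on is available on the Enterprise plan only.",
    "limits.md": "The API rate limit is 600 requests per minute per key.",
}

def _vec(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding."""
    cleaned = text.lower().replace(".", " ").replace("?", " ")
    return Counter(cleaned.split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k (name, text) pairs most similar to the query."""
    q = _vec(query)
    ranked = sorted(DOCS.items(), key=lambda kv: _cosine(q, _vec(kv[1])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, k: int = 2) -> str:
    """Put the retrieved facts in the prompt; the model weights never change."""
    context = "\n".join(f"[{name}] {text}" for name, text in retrieve(query, k))
    return (
        "Answer using only the sources below and cite them by name.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
```

Calling `build_prompt("What is the API rate limit?")` yields a prompt whose first source is `limits.md`. Editing `DOCS` changes the next answer immediately, which is the whole point.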

Fine-tuning is a behaviour problem. The model knows English, but you need it to always emit a specific JSON schema. The model knows medicine, but it won’t use the radiology shorthand your customers expect. The model can reason about code, but it can’t predict the next diff in the format your IDE wants. None of these are knowledge gaps you can paper over with retrieval — they’re production behaviours you have to bake into the weights.
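
A behaviour gap is measurable, which helps when deciding whether prompting alone is enough. The sketch below scores a batch of model outputs for schema adherence. The two-field schema and the function names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema: the exact keys and types every reply must have.
REQUIRED = {"intent": str, "confidence": float}

def follows_schema(raw: str) -> bool:
    """True only if raw is a JSON object with exactly the required keys and types."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED):
        return False
    return all(isinstance(obj[key], typ) for key, typ in REQUIRED.items())

def adherence_rate(outputs: list[str]) -> float:
    """Fraction of outputs that follow the schema; 0.0 for an empty batch."""
    if not outputs:
        return 0.0
    return sum(follows_schema(o) for o in outputs) / len(outputs)
```

If the rate stays stubbornly below what production needs after a serious prompting effort, that is a reasonable signal the behaviour belongs in the weights.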

WHAT EACH TECHNIQUE FIXES

RAG: knowledge gaps

  • private documents
  • post-cutoff news
  • per-user state
  • citations and provenance
  • freshness on the order of seconds

FINE-TUNING: behaviour gaps

  • a JSON schema it always follows
  • a brand voice, a tone, a refusal style
  • domain jargon (medical, legal, code-diff)
  • a tool-calling format the base model dislikes
  • shrinking a 70B behaviour into a 7B model

The two boxes are almost orthogonal. The hard part is recognising which side of the line your problem actually lives on.

The reason teams used to argue endlessly is that some problems sit on the boundary. “The model doesn’t know my product” can be knowledge (docs the model has never seen) or behaviour (it knows the docs but won’t follow the SQL dialect your API expects). The honest answer is usually both — RAG to retrieve the docs, fine-tuning to make the model emit your SQL dialect.

What PEFT actually changed

In 2022 fine-tuning meant updating every weight in the network. For a 7B-parameter model that was 100-120 GB of VRAM at fp16 and a multi-day run on at least one H100. For 70B it was a small cluster. Most teams couldn’t afford to try fine-tuning, let alone iterate on it. RAG won the discourse partly by default — it was the only option for shops without an ML platform team.
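
The 100-120 GB figure falls out of simple arithmetic. The sketch below assumes a common mixed-precision Adam setup at roughly 16 bytes per parameter and ignores activations. The per-item byte counts are that assumption, not a measurement of any particular training stack.

```python
# Assumed memory per parameter for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}

def full_finetune_vram_gb(n_params: float) -> float:
    """Weights + gradients + optimizer state, in GB (activations excluded)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

print(full_finetune_vram_gb(7e9))   # 112.0
print(full_finetune_vram_gb(70e9))  # 1120.0
```

At 112 GB before activations, a 7B model already overflows a single 80 GB card, and 70B lands in small-cluster territory.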

LoRA (2021) and then QLoRA (2023) inverted the economics. LoRA freezes the base weights and learns a low-rank update — typically 0.1-1% of the parameter count — applied to specific layers. QLoRA goes further by 4-bit-quantising the frozen base model, dropping VRAM requirements another 4x. The result, widely reported: you can QLoRA-tune a 7B model on a $1,500 RTX 4090, and a 70B model on a single A100 or H100.
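
A small sketch of the mechanics, in dependency-free Python. It shows the low-rank update added to a frozen weight matrix, the trainable fraction for one adapted matrix, and the base-weight memory at 16-bit versus 4-bit. The 4096 width, rank 8, and function names are illustrative assumptions. The overall trainable fraction depends on which layers you adapt.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha: float, rank: int) -> list[float]:
    """y = W x + (alpha / rank) * B (A x). W stays frozen; only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # A: rank x d_in, B: d_out x rank
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the frozen matrix they adapt."""
    return rank * (d_in + d_out) / (d_in * d_out)

def base_weight_gb(n_params: float, bits: int) -> float:
    """Memory for the frozen base weights alone at a given precision."""
    return n_params * bits / 8 / 1e9

print(lora_param_fraction(4096, 4096, 8))  # 0.00390625, about 0.4%
print(base_weight_gb(7e9, 16))             # 14.0
print(base_weight_gb(7e9, 4))              # 3.5
```

With B initialised to zeros, as is conventional for LoRA, the adapted layer starts out identical to the base layer, so training begins from the base model's behaviour rather than from noise.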

The quality cost is small. Public benchmarks consistently put LoRA within 2-5% of full fine-tuning on downstream tasks, and QLoRA within ~5-10%. For behaviour tuning — JSON adherence, tone, schema following — the gap is often noise. The frontier labs use full fine-tuning when they have the compute (and Anthropic’s customer-tuned Sonnet uses it). Everyone else should default to LoRA.

The strategic consequence is the part the discourse missed: PEFT didn’t kill RAG. It killed the excuse for never trying fine-tuning. A team that in 2022 would have said “we don’t have an ML platform” now has the option of spending an afternoon on a Modal or Together AI fine-tune for $50 of compute. That changes the cost-benefit calculation in every RAG-vs-tune meeting.

What “fine-tuning for behaviour” looks like in production

The receipts. Three real examples, all shipped, all measurable.

Cursor’s Apply Model. Cursor’s edit-application pipeline, the thing that takes a model’s proposed diff and writes it into your file in milliseconds, is powered by a custom fine-tune. They describe the work in detail on their blog: a Llama-3-70B base, fine-tuned on “fast-apply” trajectories augmented with GPT-4-generated supervision, then served with a speculative-decoding variant they call “speculative edits” for a reported ~9x speedup. The fine-tune is not about teaching the model new code; Llama-3-70B already knows code. It’s about teaching the model the format: the diff structure Cursor’s IDE expects, the predictability needed to deterministically apply edits, the streaming behaviour that lets the diff render as it generates. Knowledge unchanged. Behaviour transformed.

Cohere’s embedding fine-tunes. Cohere’s enterprise pipeline lets customers fine-tune Embed and Command models on their own data, available via Amazon Bedrock since late 2023. The pitch: generic embeddings are good, but a customer-tuned embedding model on your ticket corpus can outperform a frontier model thousands of times its size for your retrieval task. The Cohere case studies report enterprises matching GPT-4-era retrieval quality with a tuned Embed-Light at a small fraction of the inference cost. This is fine-tuning as a cost-shifting exercise — not a quality unlock, a cost unlock.

Anthropic’s tool-use post-training. When Claude 3 launched in early 2024 it was already strong at tool use; by Claude 3.5 it was category-best, with industry-leading scores on tau-bench and SWE-bench. The public posts from Anthropic credit a heavy investment in tool-use-specific RLHF and supervised fine-tuning during post-training. They didn’t make the model smarter about tools; they made it more reliable in the tool-calling format. Knowledge unchanged. Behaviour transformed.

The pattern across all three: fine-tuning is for the parts of the system that need to be predictable and cheap, not for the parts that need to be smart.

What RAG does, and what it doesn’t fix

RAG’s most common failure is routinely misdiagnosed. Symptom: the retrieved passages contain the right facts, but the model formats the answer wrong, or refuses, or hedges. People conclude “RAG isn’t working” and go off to fine-tune. There are two possible causes: the generation prompt isn’t strong enough, or the model’s default behaviour is at odds with what you want it to do with the retrieved context. The first is a prompting problem. The second is a fine-tuning problem. Neither is a RAG problem.

What RAG is good at:

  • Freshness. Updated within seconds of a new document landing. Faster than any fine-tune loop.
  • Per-user knowledge. Multi-tenant systems can’t fine-tune one model per tenant; they can absolutely retrieve from a per-tenant index.
  • Provenance. Every claim in the answer can point back to a chunk in a document. Auditable in a way that a fine-tuned model never is.
  • Long-tail coverage. A retrieval system covers the long tail at almost-zero marginal cost; baking the long tail into model weights takes far more parameters and training data for each rarely used fact.
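
The per-tenant, provenance, and freshness points above can be sketched together. This is a toy in-memory index with keyword-overlap scoring; the `Chunk` and `TenantIndex` names are hypothetical, and a production system would swap in a vector store with tenant filtering.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    tenant: str
    doc_id: str   # kept so every answer can cite its source
    text: str

class TenantIndex:
    """One logical index per tenant; a search never crosses tenants."""

    def __init__(self) -> None:
        self._chunks: dict[str, list[Chunk]] = {}

    def add(self, chunk: Chunk) -> None:
        # Searchable as soon as it is added: no training loop involved.
        self._chunks.setdefault(chunk.tenant, []).append(chunk)

    def search(self, tenant: str, query: str, k: int = 3) -> list[Chunk]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(c.text.lower().split())), c)
            for c in self._chunks.get(tenant, [])
        ]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in hits[:k]]
```

Each returned `Chunk` carries its `doc_id`, so the generation step can cite sources without any extra bookkeeping.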

What RAG is not good at:

  • Reasoning over many retrieved chunks at once. Context windows have grown, but model attention is not uniform; the middle of a long context is reliably worse than the edges. Long-context RAG is a research frontier, not a default.
  • Behavioural consistency. Different retrieved contexts produce different stylistic outputs. You can’t RAG your way to a brand voice.
  • Tasks that have no “retrievable” answer. Code completion, free-form generation, classification with subtle distinctions — these are fundamentally model-skill problems.

The decision table

A working version, used in real production reviews:

DECISION TABLE

  #  PROBLEM                           RIGHT TOOL             WHY
  1  model doesn’t know our docs       RAG                    freshness, provenance
  2  output must always be valid JSON  fine-tune (LoRA)       behavioural reliability
  3  brand voice in every reply        fine-tune (LoRA)       style is a behaviour
  4  customer support QA on tickets    RAG + fine-tune both   facts + format
  5  SQL in a specific dialect         fine-tune (LoRA)       dialect is a behaviour
  6  cite documents in every answer    RAG (mandatory)        citation = retrieval
  7  cheaper 7B that acts like 70B     fine-tune (distill)    distillation onto small
  8  domain jargon (medical/legal)     fine-tune, then RAG    jargon first, docs second
  9  multi-tenant private knowledge    RAG (per-tenant)       no per-tenant fine-tune

A working decision table. Most production systems land in row 4 — both, because behaviour and knowledge are both problems.

The thing the table makes obvious: rows 4 and 7 are increasingly common. The most-shipped production setup in 2026 is a fine-tuned model (LoRA on a base like Llama-3.1-70B, Qwen-2.5, or a Sonnet-class commercial model where the platform allows it) wrapped around a RAG retrieval layer. Not either-or. Both.
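
For teams that prefer a checklist to a table, the same logic compresses into a few lines. This is a rough sketch: the four flags are my own grouping of the table's rows, and the fall-back to plain prompting is an assumption drawn from the two-line rule, not a row in the table.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    needs_unseen_facts: bool = False     # private docs, post-cutoff news, per-tenant data
    needs_citations: bool = False        # answers must point back to documents
    needs_fixed_behaviour: bool = False  # JSON schema, brand voice, SQL dialect, jargon
    needs_smaller_model: bool = False    # a 7B that acts like a 70B

def right_tools(problem: Problem) -> list[str]:
    """Knowledge gaps map to RAG, behaviour gaps to fine-tuning, and both can apply."""
    tools = []
    if problem.needs_unseen_facts or problem.needs_citations:
        tools.append("RAG")
    if problem.needs_fixed_behaviour or problem.needs_smaller_model:
        tools.append("fine-tune")
    return tools or ["prompting"]  # assumed default when neither gap exists
```

A support bot that must answer from tickets in a fixed reply format sets two flags and gets both tools back, which is the "facts + format" case in the table.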

The cost numbers, since people always ask

Approximate 2026 cost of a one-time fine-tune of an open Llama-class 70B model on a hosted serverless platform (Together AI, Fireworks, Modal, Anyscale):

  • LoRA on 10K examples: $30-150 of compute, 2-4 hours wall-clock.
  • QLoRA on 10K examples: $15-50, slightly slower wall-clock, slightly worse quality.
  • Full fine-tune on 10K examples: $1,000-4,000, 8-24 hours on a multi-GPU node.

The OpenAI / Anthropic hosted fine-tuning APIs sit somewhere in between on cost (per-token-trained pricing) but with the advantage of being on the exact base model that powers their API endpoints, with no inference setup. Read OpenAI’s fine-tuning guide and Anthropic’s fine-tuning docs for current pricing — the rates shift quarterly. The pricing for Cohere fine-tunes on Bedrock is similar.

For comparison, building a competent RAG stack from scratch — vector database, embedding model, chunking pipeline, evaluation harness — is $2,000-20,000 of engineering time depending on whether you assemble open source pieces or buy a platform. The compute itself is rounding-error cheap. The cost of RAG is engineering; the cost of fine-tuning is compute and data.

The data prep, the part nobody warns you about

The dirty secret of fine-tuning in 2026 is that the compute is cheap and the data is hard. The Cursor team has written about this: their Apply Model worked because they had millions of real edit trajectories — proposed-change → applied-result pairs — that nobody else had. The model architecture mattered less than the dataset.

The data prep workflow that actually ships:

  • Mine production traffic. Real user queries and the responses your team accepted. This is the highest-signal training data you can possibly have, because it’s literally the distribution you’re serving.
  • Synthetic augmentation. Use a frontier model (GPT-4 or Claude) to generate variations, harder edge cases, and labelled examples your production data is thin on. Cursor explicitly says they did this with GPT-4 for their Apply Model.
  • A curated eval set, held out from training. This is the part teams skip and pay for later. The eval set should mirror your worst cases, not your average ones. A model that’s 5% worse on the average and 30% worse on the long tail is unacceptable; a model that’s 1% worse on the average and 10% better on the long tail is the win.
  • Iterate. A LoRA fine-tune takes a few hours; you should be running ten of them before deciding you’re done.

The hard part is not the LoRA hyperparameters. It’s the dataset, the labels, and the eval set. Plan for at least 60% of the project budget going to those three.
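
Two pieces of that workflow are easy to get wrong and cheap to sketch: a held-out split that stays stable across runs, and a comparison that reports the long tail separately from the average. The hash-based split, the slice names, and the function names below are assumptions for illustration.

```python
import hashlib

def is_held_out(example_id: str, eval_fraction: float = 0.1) -> bool:
    """Deterministic split: the same example lands in the same set on every run."""
    first_byte = hashlib.sha256(example_id.encode("utf-8")).digest()[0]
    return first_byte / 256 < eval_fraction

def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

def compare(baseline: dict[str, list[bool]], candidate: dict[str, list[bool]]) -> dict[str, float]:
    """Per-slice accuracy delta (candidate minus baseline)."""
    return {name: accuracy(candidate[name]) - accuracy(baseline[name]) for name in baseline}

# Hypothetical pass/fail results on two eval slices.
baseline = {"average": [True, True, True, False], "long_tail": [True, False, False, False]}
candidate = {"average": [True, True, True, False], "long_tail": [True, True, False, False]}
print(compare(baseline, candidate))  # {'average': 0.0, 'long_tail': 0.25}
```

Hashing the example ID rather than shuffling means new production data can be appended without leaking old eval examples into the next training run.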

What the next 12 months will probably change

A few things on the horizon that may shift the table:

  • Inference-time learning. Anthropic’s Memory features and the related research on test-time adaptation hint at a world where the model “remembers” facts across conversations without a separate retrieval step. If this generalises, it eats some of RAG’s freshness advantage.
  • Cheaper RFT (reinforcement fine-tuning). OpenAI’s RFT and Anthropic’s Constitutional-AI-style methods are making behaviour tuning cheaper still, with smaller data requirements. Expect the “fine-tune for behaviour” column to broaden.
  • Long-context retrieval becoming reliable. Gemini-2.5, Claude 4.5, and GPT-5 accept contexts from hundreds of thousands up to around a million tokens, and recall across that window keeps improving. For smaller corpora, RAG over a million-token document collection may merge with “just put it all in the prompt”.

None of these eliminate either tool. They shift the boundary, slowly.

Takeaway

Three lines for the wall:

  • RAG for knowledge. Fine-tuning for behaviour. Both, when you need both — which is most of the time.
  • PEFT (LoRA, QLoRA) made fine-tuning a “try it on a Friday afternoon” experiment, not a quarter-long project. Use it.
  • The teams shipping reliable LLM products in 2026 stopped picking a side and started picking the right tool for each part of the system. So should you.

The discourse will eventually find a new fight (probably about inference-time learning vs. fine-tuning, or about which open-weight fine-tune base is best). The next time it does, remember that the fine-tune-vs-RAG argument seemed urgent for two years, and turned out to be a category error. The right answer was both. It usually is.


Further reading: the original LoRA paper and the QLoRA paper still reward a careful read. OpenAI fine-tuning docs, Anthropic fine-tuning docs, and Cohere’s enterprise fine-tuning suite blog post cover the current state of hosted offerings.
