Compare RAG and fine-tuning. When would you use each?

RAG injects external knowledge at inference time and is best when information changes often, must be cited, or is too large to bake into weights. Fine-tuning changes model behavior, style, or format and is best for teaching new skills or domain tone; the two are complementary and often combined.

When should you use RAG vs fine-tuning vs a long-context model?

RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is correct when you need to change the model's behavior, format, or domain-specific reasoning style — not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and latency is acceptable.

What is LoRA and how does it make fine-tuning parameter-efficient?

LoRA freezes the pretrained weights and injects small trainable low-rank matrices into selected layers, learning the weight update as their low-rank product. This trains a tiny fraction of parameters, slashing memory and storage while approximating full fine-tuning, and the adapters can be merged back at inference.
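To make the freeze-and-merge idea concrete, here is a minimal sketch in plain Python with no ML framework. The matrices, the rank of 1, and the `alpha / r` scaling values are illustrative assumptions, not taken from any real model; in practice the factors are learned during training rather than hard-coded.

```python
# Toy LoRA layer in plain Python, for illustration only.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in X]

# Frozen pretrained weight W (d_out x d_in); 3 x 3 here. It is never updated.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]

# Trainable low-rank factors with rank r = 1: B is d_out x r, A is r x d_in.
# Values are arbitrary stand-ins for what training would learn.
B = [[0.5], [0.0], [-1.0]]
A = [[1.0, 2.0, 0.0]]

alpha, r = 2.0, 1            # assumed scaling hyperparameters
scaling = alpha / r

def lora_forward(x):
    """Compute W x + scaling * B (A x) without modifying W."""
    base = matmul(W, x)
    update = scale(matmul(B, matmul(A, x)), scaling)
    return add(base, update)

# Merging: fold the low-rank update into one matrix for inference.
W_merged = add(W, scale(matmul(B, A), scaling))

x = [[1.0], [2.0], [3.0]]    # column-vector input
y_adapter = lora_forward(x)
y_merged = matmul(W_merged, x)
print(y_adapter == y_merged)  # True
```

Because the merged matrix gives the same output as the adapter path, a merged model needs no extra computation at inference time.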

How does LoRA work and why is it preferred over full fine-tuning for large models?

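LoRA (Low-Rank Adaptation) freezes the original model weights and injects trainable low-rank decomposition matrices into selected layers, most commonly the attention projections. This cuts the number of trainable parameters by roughly 100x to 1000x, depending on the rank and on which layers are adapted, while matching or approaching full fine-tuning quality. The smaller gradient and optimizer state is what makes it practical on a single GPU.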
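The size of the reduction follows from simple arithmetic. The sketch below counts trainable parameters for one weight matrix; the 4096 x 4096 shape and rank 8 are assumed example values, not figures from a specific model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    full: every entry of the d_out x d_in matrix is trainable.
    lora: only B (d_out x rank) and A (rank x d_in) are trainable.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

# Assumed example: one 4096 x 4096 projection adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
reduction = full / lora
print(full, lora, reduction)  # 16777216 65536 256.0
```

This is the ratio for a single adapted matrix. Across a whole model the reduction is usually larger, because many layers receive no adapter at all.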

Fine-tune vs RAG: the decision — Generative AI

“The model doesn’t know our product, so we need to fine-tune it.” That sentence is wrong more often than it’s right, and getting it wrong wastes weeks and a GPU budget. The single most useful idea in this whole topic fits on one line:

RAG adds knowledge. Fine-tuning changes behavior.

They solve different problems. If the model lacks facts, retrieval is the answer. If the model has the facts but won’t act the way you need — wrong tone, wrong format, a missing skill — that’s fine-tuning. Confusing the two is the classic mistake.

Answer three questions, get the right strategy

The most common adaptation mistake is reaching for fine-tuning when the real fix is retrieval or a better prompt. RAG adds knowledge; fine-tuning changes behavior. These are different problems, and three questions separate them:

Does it need fresh or frequently changing facts? (Docs, prices, tickets, anything that updates.)

Do you need to change how it behaves? (Tone, format, a domain skill, strict structure.)

Is it a high-volume, narrow task on a tight budget? (Classifying or extracting millions of times, on-device, private.)

The decision, in order of effort

Reach for the cheapest tool that solves your problem, escalating only when it doesn’t:

Prompt / few-shot — try this first, always. A clear prompt and a few examples solve a shocking fraction of “we need fine-tuning” requests.
RAG — the task needs facts the model doesn’t have, or facts that change. Put them in retrieval, not the weights.
Fine-tune (LoRA/QLoRA) — you need to change behavior: a consistent format, a domain tone, a specialized skill, or reliability on a narrow task.
Distill — high volume, narrow task, tight budget: compress a big model’s behavior into a small one you can run cheaply.

Picture it as a ladder: each rung costs more than the one below, so you climb only when the cheaper rung genuinely isn't enough.
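The escalation order can be written down as a small function. This is one reading of the ladder, not the logic of the lesson's interactive widget; the function name, argument names, and returned labels are all hypothetical.

```python
def recommend(needs_fresh_facts, needs_behavior_change, high_volume_narrow,
              prompt_already_works=False):
    """Return the adaptation steps to try, cheapest first.

    Assumed mapping: prompting always comes first; each remaining rung is
    added only if its question was answered "yes". Rungs can combine.
    """
    steps = ["prompt / few-shot"]
    if prompt_already_works:
        return steps                      # stop at the first rung that works
    if needs_fresh_facts:
        steps.append("RAG")
    if needs_behavior_change:
        steps.append("fine-tune (LoRA/QLoRA)")
    if high_volume_narrow:
        steps.append("distill")
    return steps

# Docs that change weekly, no behavior change needed:
print(recommend(True, False, False))
# ['prompt / few-shot', 'RAG']

# Brand voice plus live facts: the two approaches combine.
print(recommend(True, True, False))
# ['prompt / few-shot', 'RAG', 'fine-tune (LoRA/QLoRA)']
```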

Why “just fine-tune it” usually fails for knowledge

Fine-tuning teaches patterns of behavior, not a reliable fact store. Train a model on your docs and it learns to sound like your docs — but it will still hallucinate specifics, and the moment a document changes you’d have to re-train. RAG, by contrast, retrieves the current document at query time and grounds the answer in it. For anything that updates, RAG wins decisively.
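A toy sketch shows why retrieval copes with change. The documents, the plan name, and the prices are invented, and keyword overlap stands in for the embedding search a real system would more likely use.

```python
import re

# Hypothetical document store; in practice this would be a search index.
docs = {
    "pricing": "The Pro plan costs 20 dollars per month.",
    "support": "Support is available by email on weekdays.",
}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, store):
    """Return the document sharing the most words with the query."""
    q = tokenize(query)
    best = max(store, key=lambda name: len(q & tokenize(store[name])))
    return store[best]

def build_prompt(query, store):
    """Ground the question in whatever the store contains right now."""
    context = retrieve(query, store)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

question = "How much does the Pro plan cost?"
before = build_prompt(question, docs)

# The document changes. No retraining: the next query just sees the new text.
docs["pricing"] = "The Pro plan costs 25 dollars per month."
after = build_prompt(question, docs)

print(after)
```

The model weights never change here. Only the retrieved context does, which is why an updated document shows up in the very next answer.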

What LoRA/QLoRA actually change

When you do fine-tune, you almost never touch all the weights. LoRA freezes the base model and trains tiny low-rank adapter matrices — a fraction of a percent of the parameters. QLoRA goes further: it 4-bit-quantizes the frozen base so the whole thing fits in far less memory, letting you fine-tune large models on a single GPU. (Both are covered hands-on in the LoRA & QLoRA lesson.)

The key mental model: this adjusts behavior and format, cheaply. It does not reliably inject new factual knowledge — for that, you still want RAG.
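A rough calculation shows why quantizing the frozen base matters. The 7-billion-parameter size is an assumed example, and the figures cover weight storage only: activations, adapter gradients, optimizer state, and quantization metadata all add to the real footprint.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for storing the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # assumed model size, for illustration

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit frozen base, as in plain LoRA
int4_gb = weight_memory_gb(n_params, 4)   # 4-bit frozen base, as in QLoRA

print(fp16_gb, int4_gb)  # 14.0 3.5
```

Under these assumptions, the base weights shrink to a quarter of their 16-bit size, which is the headroom that lets larger models fit on one GPU.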

A quick gut-check — match the symptom to the tool:

Symptom	The fix
Answers go stale when our docs change	RAG
It doesn’t know facts from our internal wiki	RAG
Output format is inconsistent (we need strict JSON)	Fine-tune / constrained decoding
Tone is too generic; we want our brand voice	Fine-tune (LoRA)
It works, but a frontier model is too slow/expensive	Distill to a small model
It mostly works with a clearer instruction	Prompt / few-shot

The pattern across the table is the one-liner again: a knowledge problem points to RAG, a behavior problem points to fine-tuning, and you try prompting before either.

In one breath

One line decides most of it: RAG adds knowledge; fine-tuning changes behavior.
Climb the ladder of effort: prompt/few-shot → RAG → fine-tune → distill, stopping at the first rung that works.
Most “we need to fine-tune” requests are really a retrieval or prompt problem — full fine-tuning of a base model is rarely the right first move.
LoRA trains tiny adapters on a frozen base, and QLoRA does the same on a 4-bit-quantized frozen base. Both give cheap behavior and format changes, not a reliable new fact store.
They aren’t rivals: production systems often fine-tune for format/tone and use RAG for live facts at the same time.

Quick check

Q1. Your chatbot needs to answer from company docs that update weekly. RAG or fine-tuning?

Q2. What does fine-tuning (e.g. LoRA) primarily change?

Q3. What's the recommended order when adapting an LLM to a task?

If you concluded you genuinely need to fine-tune, the next lesson opens it up: Fine-tuning — LoRA, QLoRA & PEFT shows how the adapters actually work and why QLoRA fits big models on one GPU. If your real need was reliable structure, see constrained decoding instead. And whichever path you take, LLM evals are how you prove the change actually helped.
