datarekha

Fine-tune vs RAG: the decision

RAG adds knowledge; fine-tuning changes behavior. The decision framework every team gets wrong, what LoRA/QLoRA actually change, and why 'we need to fine-tune' is usually solved by better retrieval.

8 min read · Intermediate · Generative AI · Lesson 17 of 33

What you'll learn

  • The real difference between what RAG and fine-tuning each fix
  • The decision order — prompt, then RAG, then fine-tune, then distill
  • What LoRA/QLoRA change (behavior and format, not fresh knowledge)

“The model doesn’t know our product, so we need to fine-tune it.” That sentence is wrong more often than it’s right, and getting it wrong wastes weeks and a GPU budget. The single most useful idea in this whole topic fits on one line:

RAG adds knowledge. Fine-tuning changes behavior.

They solve different problems. If the model lacks facts, retrieval is the answer. If the model has the facts but won’t act the way you need — wrong tone, wrong format, a missing skill — that’s fine-tuning. Confusing the two is the classic mistake.

The decision, in order of effort

Reach for the cheapest tool that solves your problem, escalating only when it doesn’t:

  1. Prompt / few-shot — try this first, always. A clear prompt and a few examples solve a shocking fraction of “we need fine-tuning” requests.
  2. RAG — the task needs facts the model doesn’t have, or facts that change. Put them in retrieval, not the weights.
  3. Fine-tune (LoRA/QLoRA) — you need to change behavior: a consistent format, a domain tone, a specialized skill, or reliability on a narrow task.
  4. Distill — high volume, narrow task, tight budget: compress a big model’s behavior into a small one you can run cheaply.
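
The escalation order above can be sketched as a small decision helper. This is only an illustration: the function name, its yes/no parameters, and the returned labels are hypothetical and simply restate the list, so treat it as a checklist in code rather than a definitive rule.

```python
from typing import List

def choose_adaptation(prompt_solves_it: bool,
                      needs_missing_or_changing_facts: bool,
                      needs_behavior_change: bool,
                      high_volume_narrow_task: bool = False) -> List[str]:
    """Return the tools to reach for, cheapest first (hypothetical helper)."""
    # Step 1: if a clear prompt with a few examples already works, stop there.
    if prompt_solves_it:
        return ["prompt / few-shot"]

    steps = []
    # Step 2: missing or changing facts belong in retrieval, not the weights.
    if needs_missing_or_changing_facts:
        steps.append("RAG")
    # Step 3: tone, format, or skill problems call for fine-tuning.
    if needs_behavior_change:
        steps.append("fine-tune (LoRA/QLoRA)")
    # Step 4: distillation only pays off at high volume on a narrow task.
    if high_volume_narrow_task:
        steps.append("distill")

    # Nothing else indicated: keep iterating on the prompt.
    return steps or ["prompt / few-shot"]

# Docs that update weekly, no behavior problem: retrieval alone.
print(choose_adaptation(False, True, False))
# Facts are missing and the output format is unreliable: the two combine.
print(choose_adaptation(False, True, True))
```

Note that the second call returns both RAG and fine-tuning: the tools address different problems, so needing one does not rule out the other.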

Why “just fine-tune it” usually fails for knowledge

Fine-tuning teaches patterns of behavior, not a reliable fact store. Train a model on your docs and it learns to sound like your docs — but it will still hallucinate specifics, and the moment a document changes you’d have to re-train. RAG, by contrast, retrieves the current document at query time and grounds the answer in it. For anything that updates, RAG wins decisively.
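
To make the "retrieves the current document at query time" point concrete, here is a toy sketch. It assumes a made-up two-document store and uses naive word overlap in place of a real embedding search; the names `retrieve` and `build_prompt` and the sample documents are hypothetical. The point is only that editing the document changes what the model is shown, with no re-training step.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, docs):
    """Return the id of the doc sharing the most words with the query.

    A stand-in for vector search: real systems would use embeddings.
    """
    q = Counter(tokenize(query))

    def overlap(doc_id):
        d = Counter(tokenize(docs[doc_id]))
        return sum(min(q[w], d[w]) for w in q)

    return max(docs, key=overlap)

def build_prompt(query, docs):
    """Ground the question in whatever the store contains right now."""
    doc_id = retrieve(query, docs)
    return (f"Answer using only this context:\n[{doc_id}] {docs[doc_id]}\n\n"
            f"Question: {query}")

# Hypothetical document store.
docs = {
    "pricing": "The Pro plan costs 20 dollars per month.",
    "support": "Support is available on weekdays by email.",
}
query = "How much does the Pro plan cost?"
before = build_prompt(query, docs)

# The document changes; the next query sees the new text immediately.
docs["pricing"] = "The Pro plan costs 25 dollars per month."
after = build_prompt(query, docs)

print(before)
print(after)
```

A fine-tuned model would need another training run to reflect the same price change, and even then it might not reproduce the figure reliably.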

What LoRA/QLoRA actually change

When you do fine-tune, you almost never touch all the weights. LoRA freezes the base model and trains tiny low-rank adapter matrices — a fraction of a percent of the parameters. QLoRA goes further: it 4-bit-quantizes the frozen base so the whole thing fits in far less memory, letting you fine-tune large models on a single GPU. (Both are covered hands-on in the LoRA & QLoRA lesson.)
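
A rough back-of-the-envelope sketch of both claims follows. The numbers are illustrative assumptions, not figures from this lesson: a single 4096 x 4096 weight matrix, LoRA rank 8, and a 7-billion-parameter base model. The memory estimate counts only the stored base weights and ignores activations, optimizer state, the adapters themselves, and quantization overhead.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable adapter params as a fraction of one d_out x d_in weight matrix.

    LoRA learns the update as B @ A, with B of shape (d_out, rank)
    and A of shape (rank, d_in), so it trains rank * (d_out + d_in) values.
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return adapter / full

def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumed example: one 4096 x 4096 projection adapted at rank 8.
frac = lora_trainable_fraction(4096, 4096, rank=8)
print(f"{frac:.2%}")  # 0.39%

# Assumed example: a 7B-parameter base stored at 16-bit vs 4-bit.
print(weight_memory_gb(7e9, 16))  # 14.0
print(weight_memory_gb(7e9, 4))   # 3.5
```

The fraction is per adapted matrix. Because adapters are usually attached to only some layers, the share of the whole model that gets trained is typically smaller still.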

The key mental model: this adjusts behavior and format, cheaply. It does not reliably inject new factual knowledge — for that, you still want RAG.

Quick check

Q1. Your chatbot needs to answer from company docs that update weekly. RAG or fine-tuning?
Q2. What does fine-tuning (e.g. LoRA) primarily change?
Q3. What's the recommended order when adapting an LLM to a task?

Next

If your real need was reliable structure, see constrained decoding. And whichever path you take, LLM evals are how you prove the change actually helped.

Practice this in an interview

Compare RAG and fine-tuning. When would you use each?

RAG injects external knowledge at inference time and is best when information changes often, must be cited, or is too large to bake into weights. Fine-tuning changes model behavior, style, or format and is best for teaching new skills or domain tone; the two are complementary and often combined.

When should you use RAG vs fine-tuning vs a long-context model?

RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is correct when you need to change the model's behavior, format, or domain-specific reasoning style — not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and latency is acceptable.

What is LoRA and how does it make fine-tuning parameter-efficient?

LoRA freezes the pretrained weights and injects small trainable low-rank matrices into selected layers, learning the weight update as their low-rank product. This trains a tiny fraction of parameters, slashing memory and storage while approximating full fine-tuning, and the adapters can be merged into the base weights before deployment, so inference adds no extra latency.
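
The low-rank update and the merge step can be checked on a tiny example. This is a minimal sketch in plain Python with made-up numbers: `W` stands for one frozen weight matrix, `B` and `A` for a rank-1 adapter, and `scale` for the usual alpha / rank factor. The helper names are hypothetical, and real adapters are trained rather than hand-written.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wij + scale * dij for wij, dij in zip(wr, dr)]
            for wr, dr in zip(w, delta)]

# Frozen base weight W (3 x 2) and rank-1 adapter factors B (3 x 1), A (1 x 2).
W = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
B = [[0.5], [-1.0], [0.0]]
A = [[2.0, 4.0]]
scale = 0.5  # stands in for alpha / rank

x = [[3.0], [1.0]]  # one input as a 2 x 1 column vector

# Adapter kept separate: y = W x + scale * B (A x)
separate = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged into the weights: W' = W + scale * B A, then y = W' x
W_merged = add_scaled(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(separate)  # [[5.5], [-4.0], [5.0]]
print(merged)    # [[5.5], [-4.0], [5.0]]
```

Both paths give the same output, which is why a merged model needs no extra computation at inference time, while keeping the adapter separate lets one base model serve several adapters.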

How does LoRA work and why is it preferred over full fine-tuning for large models?

LoRA (Low-Rank Adaptation) freezes the original model weights and injects trainable low-rank decomposition matrices into selected layers, most commonly the attention projections. This cuts the number of trainable parameters by roughly 100x to 1000x or more, depending on the rank and on which layers are adapted, while matching or approaching full fine-tuning quality, making it practical on a single GPU.
