Fine-tune vs RAG: the decision
RAG adds knowledge; fine-tuning changes behavior. The decision framework every team gets wrong, what LoRA/QLoRA actually change, and why 'we need to fine-tune' is usually solved by better retrieval.
What you'll learn
- The real difference between what RAG and fine-tuning each fix
- The decision order — prompt, then RAG, then fine-tune, then distill
- What LoRA/QLoRA change (behavior and format, not fresh knowledge)
“The model doesn’t know our product, so we need to fine-tune it.” That sentence is wrong more often than it’s right, and getting it wrong wastes weeks and a GPU budget. The single most useful idea in this whole topic fits on one line:
RAG adds knowledge. Fine-tuning changes behavior.
They solve different problems. If the model lacks facts, retrieval is the answer. If the model has the facts but won’t act the way you need — wrong tone, wrong format, a missing skill — that’s fine-tuning. Confusing the two is the classic mistake.
The decision, in order of effort
Reach for the cheapest tool that solves your problem, escalating only when it doesn’t:
- Prompt / few-shot — try this first, always. A clear prompt and a few examples solve a shocking fraction of “we need fine-tuning” requests.
- RAG — the task needs facts the model doesn’t have, or facts that change. Put them in retrieval, not the weights.
- Fine-tune (LoRA/QLoRA) — you need to change behavior: a consistent format, a domain tone, a specialized skill, or reliability on a narrow task.
- Distill — high volume, narrow task, tight budget: compress a big model’s behavior into a small one you can run cheaply.
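The escalation order above can be sketched as a small helper. This is only an illustration: the `Need` fields and the `recommend` function are hypothetical names invented here, and a real decision involves judgment that four booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Need:
    """Hypothetical description of a request; field names are illustrative."""
    prompt_already_works: bool       # a clear prompt plus few-shot examples solves it
    missing_or_changing_facts: bool  # the task depends on facts the model lacks
    needs_behavior_change: bool      # format, tone, skill, or narrow-task reliability
    high_volume_narrow_task: bool    # cost pressure on a narrow, repeated task


def recommend(need: Need) -> list[str]:
    """Return the tools to try, cheapest first."""
    steps = ["prompt"]  # always the first thing to try
    if need.prompt_already_works:
        return steps
    if need.missing_or_changing_facts:
        steps.append("rag")
    if need.needs_behavior_change:
        steps.append("fine-tune")
    if need.high_volume_narrow_task:
        steps.append("distill")
    return steps


# "The model doesn't know our product": a knowledge gap, not a behavior gap.
product_question = Need(
    prompt_already_works=False,
    missing_or_changing_facts=True,
    needs_behavior_change=False,
    high_volume_narrow_task=False,
)
plan = recommend(product_question)
print(plan)
```

For the product-knowledge case the sketch stops at retrieval and never reaches fine-tuning, which is the point of ordering the tools by effort.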
Why “just fine-tune it” usually fails for knowledge
Fine-tuning teaches patterns of behavior, not a reliable fact store. Train a model on your docs and it learns to sound like your docs — but it will still hallucinate specifics, and the moment a document changes you’d have to re-train. RAG, by contrast, retrieves the current document at query time and grounds the answer in it. For anything that updates, RAG wins decisively.
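To make "retrieves the current document at query time" concrete, here is a toy sketch. It assumes a naive word-overlap retriever and an in-memory dict of documents; the helper names (`tokenize`, `retrieve`, `build_prompt`) and the sample documents are made up for illustration, and a real system would typically use embeddings and a vector index instead.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: dict[str, str]) -> str:
    """Return the id of the document sharing the most words with the query."""
    q = tokenize(query)
    return max(docs, key=lambda doc_id: len(q & tokenize(docs[doc_id])))


def build_prompt(query: str, docs: dict[str, str]) -> str:
    """Ground the question in whatever the document says right now."""
    doc_id = retrieve(query, docs)
    return (
        "Answer using only this context:\n"
        f"[{doc_id}] {docs[doc_id]}\n\n"
        f"Question: {query}"
    )


# Illustrative documents, not real product data.
docs = {
    "pricing": "The Pro plan costs 20 dollars per month.",
    "support": "Support is available by email on weekdays.",
}
query = "How much does the Pro plan cost?"
before = build_prompt(query, docs)

# The document changes. No retraining: the next query simply sees the new text.
docs["pricing"] = "The Pro plan costs 25 dollars per month."
after = build_prompt(query, docs)
print(after)
```

Updating the fact is a one-line change to the document store. With the fact baked into weights, the same update would mean another training run.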
What LoRA/QLoRA actually change
When you do fine-tune, you almost never touch all the weights. LoRA freezes the base model and trains small low-rank adapter matrices, typically well under one percent of the parameters. QLoRA goes further: it quantizes the frozen base to 4 bits so the base weights take far less memory, letting you fine-tune large models on a single GPU. (Both are covered hands-on in the LoRA & QLoRA lesson.)
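A back-of-the-envelope sketch of both savings. The numbers are assumptions chosen for illustration: a single 4096 x 4096 weight matrix, rank 8, and a 7-billion-parameter base model. The memory figure counts base weights only and ignores quantization constants, activations, and optimizer state, so treat it as a rough lower bound rather than a measurement.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two adapter matrices: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# LoRA: one illustrative weight matrix.
d = 4096
rank = 8
full = d * d                                  # parameters in the frozen matrix
lora = lora_trainable_params(d, d, rank)      # parameters actually trained
fraction = lora / full

# QLoRA: an illustrative 7B-parameter base stored at 16 bits vs 4 bits.
base_16bit_gb = weight_memory_gb(7e9, 16)
base_4bit_gb = weight_memory_gb(7e9, 4)

print(f"trainable fraction for this matrix: {fraction:.2%}")
print(f"base weights: {base_16bit_gb:.1f} GB at 16-bit, {base_4bit_gb:.1f} GB at 4-bit")
```

Under these assumptions the adapter trains about 0.39% of that matrix's parameters, and the base weights shrink to a quarter of their 16-bit size.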
The key mental model: this adjusts behavior and format, cheaply. It does not reliably inject new factual knowledge — for that, you still want RAG.
Next
If your real need was reliable structure, see constrained decoding. And whichever path you take, LLM evals are how you prove the change actually helped.
Practice this in an interview
RAG injects external knowledge at inference time and is best when information changes often, must be cited, or is too large to bake into weights. Fine-tuning changes model behavior, style, or format and is best for teaching new skills or domain tone; the two are complementary and often combined.
RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is correct when you need to change the model's behavior, format, or domain-specific reasoning style — not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and latency is acceptable.
LoRA freezes the pretrained weights and injects small trainable low-rank matrices into selected layers, learning the weight update as their low-rank product. This trains a tiny fraction of parameters, slashing memory and storage while approximating full fine-tuning, and the adapters can be merged back at inference.
LoRA (Low-Rank Adaptation) freezes the original model weights and injects trainable low-rank decomposition matrices, typically into the attention projection layers. This cuts the number of trainable parameters by orders of magnitude (the exact ratio depends on the rank and on which layers are adapted) while matching or approaching full fine-tuning quality, making it practical on a single GPU.
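The low-rank update and the merge step described in the LoRA answers above can be checked on a tiny example. This is a pure-Python sketch with made-up 3 x 3 numbers and rank 1; real implementations work on tensors and usually apply a scaling factor to the adapter term, which is omitted here for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Add two matrices of the same shape element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen pretrained weight (3 x 3). Illustrative values.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [2.0, 0.0, 1.0]]

# Trainable rank-1 adapter: B is 3 x 1, A is 1 x 3.
B = [[1.0], [0.0], [2.0]]
A = [[0.5, -1.0, 2.0]]

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# During training: keep W frozen and add the adapter path, W x + B (A x).
adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)))

# At inference: fold the update into the weights once, (W + B A) x.
W_merged = add(W, matmul(B, A))
merged_out = matmul(W_merged, x)

print(adapter_out)
print(merged_out)
```

Both paths give the same output, which is why a merged adapter adds no extra matrix multiplications at inference time.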