When should you use RAG vs fine-tuning vs a long-context model?
RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is correct when you need to change the model's behavior, format, or domain-specific reasoning style — not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and latency is acceptable.
How to think about it
This is one of the most commonly misapplied decisions in LLM engineering. The three approaches solve different problems.
Decision matrix
| Criterion | RAG | Fine-tuning | Long-context |
|---|---|---|---|
| Knowledge changes frequently | Best | Poor (retraining needed) | Good if window fits |
| Knowledge is confidential / not in training data | Best | Good (baked in) | Good |
| Need to change output format / tone | Poor | Best | Poor |
| Need domain-specific reasoning style | Poor | Best | Poor |
| Knowledge base fits in context window | Overkill | Overkill | Best |
| Latency budget tight | Good | Best | Poor (long prompts) |
| Data labeling budget limited | Best | Poor | Best |
RAG
Use when: the knowledge base is large, changes often (daily/weekly), contains sensitive data you do not want to embed in weights, or you need citations. RAG is also the lowest-risk starting point — it does not require labeled training data.
Fine-tuning
Use when: the model needs to speak a domain dialect (medical, legal, code in an obscure language), adopt a specific output schema consistently, or learn reasoning patterns not covered by prompting alone. Fine-tuning on knowledge alone is a common mistake — a fine-tuned model cannot tell you about events after its fine-tuning cutoff.
Fine-tune on: 10 000 examples of "here is a support ticket → here is the correct JSON triage output"
Do NOT fine-tune to memorize: "our product launched on March 3rd 2025" — use RAG for that
Long-context (1M+ token) models
Use when: the entire corpus fits in one context window (e.g., a single large codebase or a small handbook), you want zero indexing infrastructure, and latency is not critical. Gemini 1.5 Pro and Claude 3.7 Sonnet support windows above 200K tokens. The practical limit is cost — a 200K token prompt is expensive per call.
Can you combine them?
Yes. RAG + fine-tuning is often the best production architecture: fine-tune for format and style, then plug in a RAG layer for fresh knowledge. This is called Retrieval-Augmented Fine-Tuning (RAFT) — the model is fine-tuned specifically to synthesize retrieved documents, making it more faithful and less likely to hallucinate when context is provided.