datarekha
NLP & LLMs · Hard · Asked at OpenAI, Anthropic, Google, Databricks, Cohere

When should you use RAG vs fine-tuning vs a long-context model?

The short answer

RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is the right choice when you need to change the model's behavior, output format, or domain-specific reasoning style, not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and the added latency is acceptable.

How to think about it

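This is one of the most commonly misapplied decisions in LLM engineering. The three approaches solve different problems: RAG supplies knowledge at query time, fine-tuning changes how the model behaves, and long-context models skip retrieval entirely when the corpus is small enough to send with every request.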

Decision matrix

| Criterion | RAG | Fine-tuning | Long-context |
| --- | --- | --- | --- |
| Knowledge changes frequently | Best | Poor (retraining needed) | Good if window fits |
| Knowledge is confidential / not in training data | Best | Good (baked in) | Good |
| Need to change output format / tone | Poor | Best | Poor |
| Need domain-specific reasoning style | Poor | Best | Poor |
| Knowledge base fits in context window | Overkill | Overkill | Best |
| Latency budget tight | Good | Best | Poor (long prompts) |
| Data labeling budget limited | Best | Poor | Best |
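
As a rough illustration, the matrix can be collapsed into a few rules of thumb. The sketch below is a deliberate simplification, not a definitive selection procedure: the `Requirements` fields and the `recommend` function are hypothetical names introduced here, and a real decision would also weigh cost, labeling budget, and confidentiality.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of a project's constraints (all fields are illustrative)."""
    knowledge_changes_often: bool
    needs_behavior_change: bool   # output format, tone, or reasoning style
    corpus_tokens: int            # rough size of the whole knowledge base
    context_window_tokens: int    # window of the model under consideration
    latency_tight: bool


def recommend(req: Requirements) -> list[str]:
    """Return the approaches the matrix points to, in the order they were triggered."""
    picks: list[str] = []
    fits_in_window = req.corpus_tokens <= req.context_window_tokens

    # Format, tone, and reasoning style are the rows where fine-tuning is "Best".
    if req.needs_behavior_change:
        picks.append("fine-tuning")

    # A corpus that fits in the window makes retrieval overkill, unless latency is tight.
    if fits_in_window and not req.latency_tight:
        picks.append("long-context")
    elif req.knowledge_changes_often or not fits_in_window:
        picks.append("RAG")

    # RAG is the lowest-risk default when nothing else applies.
    return picks or ["RAG"]
```

For example, a large, fast-changing corpus with no behavior change maps to RAG alone, while a project that also needs a strict output schema gets both fine-tuning and RAG.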

RAG

Use when: the knowledge base is large, changes often (daily/weekly), contains sensitive data you do not want to embed in weights, or you need citations. RAG is also the lowest-risk starting point — it does not require labeled training data.
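
To make the retrieve-then-prompt flow and the citation point concrete, here is a minimal sketch. It assumes a toy in-memory corpus and ranks documents by simple word overlap as a stand-in for embedding similarity; production systems typically use an embedding model and a vector index instead. The document ids, texts, and function names are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Rank document ids by word overlap with the query (a stand-in for embedding similarity)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda doc_id: len(q & tokenize(docs[doc_id])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: dict[str, str], k: int = 2) -> str:
    """Put the top-k documents into the prompt, tagged with ids the model can cite."""
    ids = retrieve(query, docs, k)
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in ids)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base; updating it is a data change, not a retraining job.
docs = {
    "launch-note": "Our product launched on March 3rd 2025.",
    "refund-policy": "Refunds are accepted within 30 days of purchase.",
    "office-hours": "Support is available on weekdays from 9 to 5.",
}
prompt = build_prompt("When did the product launch?", docs, k=1)
```

Because the knowledge lives in `docs` rather than in model weights, editing or deleting an entry takes effect on the next query.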

Fine-tuning

Use when: the model needs to speak a domain dialect (medical, legal, code in an obscure language), adopt a specific output schema consistently, or learn reasoning patterns not covered by prompting alone. Fine-tuning on knowledge alone is a common mistake — a fine-tuned model cannot tell you about events after its fine-tuning cutoff.

Fine-tune on: 10 000 examples of "here is a support ticket → here is the correct JSON triage output"
Do NOT fine-tune to memorize: "our product launched on March 3rd 2025" — use RAG for that
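
A sketch of what such ticket-to-JSON training data might look like on disk follows. The triage schema (`category`, `priority`), the sample tickets, and the prompt/completion record layout are assumptions made for illustration; the exact file format depends on the fine-tuning service or library you use.

```python
import json

# Hypothetical triage schema and tickets; the field names are illustrative only.
examples = [
    {
        "ticket": "I was charged twice for my subscription this month.",
        "triage": {"category": "billing", "priority": "high"},
    },
    {
        "ticket": "How do I change my profile picture?",
        "triage": {"category": "account", "priority": "low"},
    },
]


def to_record(example: dict) -> dict:
    """Turn one ticket/triage pair into a prompt/completion training record."""
    return {
        "prompt": f"Support ticket:\n{example['ticket']}\n\nTriage JSON:",
        # The target is the exact string the model should learn to emit.
        "completion": json.dumps(example["triage"], sort_keys=True),
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize the records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(to_record(r)) for r in records)


jsonl = to_jsonl(examples)
```

Every record teaches the same input-to-schema mapping, which is behavior; none of them asks the model to memorize a fact.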

Long-context (1M+ token) models

Use when: the entire corpus fits in one context window (e.g., a single large codebase or a small handbook), you want zero indexing infrastructure, and latency is not critical. Gemini 1.5 Pro supports windows of 1M tokens or more, while Claude 3.7 Sonnet supports 200K tokens. The practical limit is cost: a 200K-token prompt is billed in full on every call.
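
The cost argument is simple arithmetic, sketched below. The price and traffic figures are assumed placeholders, not any provider's actual rates, and the 4,000-token RAG prompt is likewise an illustrative guess at "a few retrieved chunks".

```python
def prompt_cost(prompt_tokens: int, price_per_million_tokens: float) -> float:
    """Input-token cost of one call, in the same currency as the price."""
    return prompt_tokens / 1_000_000 * price_per_million_tokens


# Assumed, illustrative numbers; check your provider's current price list.
PRICE_PER_MILLION = 3.00   # hypothetical input price per 1M tokens
CALLS_PER_DAY = 10_000     # hypothetical traffic

long_context_call = prompt_cost(200_000, PRICE_PER_MILLION)  # whole corpus in the prompt
rag_call = prompt_cost(4_000, PRICE_PER_MILLION)             # a few retrieved chunks

daily_long_context = long_context_call * CALLS_PER_DAY
daily_rag = rag_call * CALLS_PER_DAY
```

Whatever the real price is, the ratio between the two approaches is just the ratio of prompt sizes, here 200,000 to 4,000 tokens, or 50 to 1.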

Can you combine them?

Yes. RAG + fine-tuning is often the best production architecture: fine-tune for format and style, then plug in a RAG layer for fresh knowledge. A closely related technique is Retrieval-Augmented Fine-Tuning (RAFT), in which the model is fine-tuned specifically to synthesize retrieved documents, making it more faithful and less likely to hallucinate when context is provided.
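
As a hedged sketch of what "fine-tuned to synthesize retrieved documents" can mean in practice: in the RAFT recipe, training contexts mix the relevant document with irrelevant distractors, so the model learns to answer from the right source and ignore the rest. The helper below builds one such example; the function name, prompt wording, and sample texts are hypothetical, and the published recipe includes details (such as reasoning-style answers) that are omitted here.

```python
import random


def build_raft_example(
    question: str,
    answer: str,
    oracle_doc: str,
    distractor_docs: list[str],
    rng: random.Random,
) -> dict:
    """Assemble one training example whose context mixes the relevant document with distractors."""
    context = [oracle_doc] + list(distractor_docs)
    rng.shuffle(context)  # the model should not learn to rely on document position
    numbered = "\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(context))
    return {
        "prompt": f"{numbered}\n\nQuestion: {question}\nAnswer using the documents above.",
        "completion": answer,
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
example = build_raft_example(
    question="When did the product launch?",
    answer="The product launched on March 3rd 2025.",
    oracle_doc="Our product launched on March 3rd 2025.",
    distractor_docs=[
        "Refunds are accepted within 30 days of purchase.",
        "Support is available on weekdays from 9 to 5.",
    ],
    rng=rng,
)
```

At inference time the same prompt layout is filled by the live retriever, so the fine-tuned model sees the kind of noisy context it was trained on.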
