NLP & LLMs Hard Asked at OpenAIAsked at AnthropicAsked at GoogleAsked at DatabricksAsked at Cohere

When should you use RAG vs fine-tuning vs a long-context model?

For AI / LLM Engineer ML Engineer Data Scientist

The short answer

RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is correct when you need to change the model's behavior, format, or domain-specific reasoning style — not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and latency is acceptable.

How to think about it

This is one of the most commonly misapplied decisions in LLM engineering. The three approaches solve different problems.

Decision matrix

Criterion	RAG	Fine-tuning	Long-context
Knowledge changes frequently	Best	Poor (retraining needed)	Good if window fits
Knowledge is confidential / not in training data	Best	Good (baked in)	Good
Need to change output format / tone	Poor	Best	Poor
Need domain-specific reasoning style	Poor	Best	Poor
Knowledge base fits in context window	Overkill	Overkill	Best
Latency budget tight	Good	Best	Poor (long prompts)
Data labeling budget limited	Best	Poor	Best

RAG

Use when: the knowledge base is large, changes often (daily/weekly), contains sensitive data you do not want to embed in weights, or you need citations. RAG is also the lowest-risk starting point — it does not require labeled training data.

Fine-tuning

Use when: the model needs to speak a domain dialect (medical, legal, code in an obscure language), adopt a specific output schema consistently, or learn reasoning patterns not covered by prompting alone. Fine-tuning on knowledge alone is a common mistake — a fine-tuned model cannot tell you about events after its fine-tuning cutoff.

Fine-tune on: 10 000 examples of "here is a support ticket → here is the correct JSON triage output"
Do NOT fine-tune to memorize: "our product launched on March 3rd 2025" — use RAG for that

Long-context (1M+ token) models

Use when: the entire corpus fits in one context window (e.g., a single large codebase or a small handbook), you want zero indexing infrastructure, and latency is not critical. Gemini 1.5 Pro and Claude 3.7 Sonnet support windows above 200K tokens. The practical limit is cost — a 200K token prompt is expensive per call.

Can you combine them?

Yes. RAG + fine-tuning is often the best production architecture: fine-tune for format and style, then plug in a RAG layer for fresh knowledge. This is called Retrieval-Augmented Fine-Tuning (RAFT) — the model is fine-tuned specifically to synthesize retrieved documents, making it more faithful and less likely to hallucinate when context is provided.

When should you use RAG vs fine-tuning vs a long-context model?

Keep practising

Explore further