When not to use RAG
RAG is the default answer to 'how do I give my LLM custom knowledge?' — and for a wide class of problems, it's the wrong answer. Long-context prompt caching, fine-tuning, and in-context learning each beat RAG in regimes where the costs and assumptions actually pencil out. Here's the decision tree.
There’s a default Reddit answer to “I want to give my LLM knowledge about my company”: use RAG. The default is right often enough that it’s become unexamined, and it’s quietly wrong for a surprising number of cases. In 2026, the alternatives — prompt caching over a long context window, fine-tuning, structured in-context learning — have all gotten dramatically better, while RAG has only gotten incrementally better. The right tool isn’t always the popular one.
This post is the counter-piece. When does RAG actually beat the alternatives? When do you pay more for worse answers because you defaulted to RAG without doing the math?
The four regimes
A simplified mental model. For “give an LLM access to some custom knowledge,” there are roughly four mature approaches:
- Prompt caching with long context. Put the whole corpus in the system prompt; cache it; pay 10% of input price on cache hits.
- Fine-tuning (full or LoRA). Bake the knowledge or style into the model weights.
- RAG. Index, retrieve top-K, stuff into prompt.
- In-context learning with few-shot examples. A small number of curated examples in the prompt.
Each one has a regime where it dominates the others on the quality/cost/latency triple. RAG dominates one regime — but it is not the regime most teams think it is.
Regime 1: Tiny corpus → just put it in the prompt
If your “knowledge base” is under ~10K tokens — a single product description, an internal style guide, a list of 50 FAQs — there is no reason to build a RAG system. You’re paying for vector storage, an embedding API, a retrieval step, and engineering complexity, and the model would happily ingest the whole corpus in the system prompt for roughly free.
A surprising number of “AI chatbot” features in shipped products are in this regime and don’t know it. The product manager said “we’ll need a vector database” because the tutorial they read said so. They didn’t notice their corpus is 4,000 tokens and fits in every model above Haiku with room to spare.
Rule of thumb: if your entire corpus fits in 10K tokens, do not use RAG. Use a system prompt. Iterate on the prompt; you’ll ship in a week instead of a month.
Regime 2: Medium corpus → prompt caching is the new sweet spot
This is the regime that has shifted the most since 2024 and that the most teams haven’t recalibrated for.
Modern frontier models offer 1M-token context windows (Gemini 2.5, Claude Sonnet 4.x extended-context, GPT-5) with aggressive prefix caching. Anthropic’s cache reduces input cost to roughly 10% of the base rate on cache hits, with cache TTL configurable up to one hour (or longer on the extended-cache tier). Gemini’s implicit caching caches input prefixes automatically at a 75% discount. OpenAI’s prompt caching applies automatically to repeated prefixes above 1K tokens.
For a corpus of, say, 200K tokens (an entire technical product manual, or a year of legal contracts, or the full Bible), the math now looks like this for a typical 1000-token query:
| Approach | Per-query input cost | Latency | Quality |
|---|---|---|---|
| RAG (top-20 chunks, ~8K tokens) | ~$0.024 (Sonnet 4) | ~200ms retrieval + generation | OK — depends on retrieval recall |
| Cached 200K-token prompt | ~$0.060 first call, ~$0.006 cached | ~400ms TTFT (first), ~150ms (cached) | Excellent — model sees everything |
The cached number is wildly cheaper per query after the first call. The quality is higher because the model sees the whole corpus and can reason across passages — no retrieval recall ceiling. The first-call latency penalty is gone after the first hit and the cache survives for the configured TTL.
The break-even is roughly: if your corpus is under 500K tokens, your QPS is above ~1 per minute (so the cache stays warm), and your queries benefit from cross-document reasoning, prompt caching beats RAG on every axis except worst-case cold latency.
The teams I’ve seen quietly cut over from RAG to long-context prompt caching include:
- Customer support over a single product’s documentation — the docs are 50-300K tokens, the cache stays warm during business hours, and answer quality on multi-section questions jumps noticeably.
- Code copilots over a single repository — the whole repo (under 1M tokens) fits in cache; the model can answer “where is X used” and “how does Y depend on Z” without a retrieval step that would miss cross-file references.
- Legal research over a single contract or filing — even a 500-page contract is under 300K tokens. RAG loses badly here because the queries are inherently cross-section.
The Anthropic prompt caching announcement gives the original cost math; Gemini’s context caching guide makes the case explicit for the 1M-token context. Neither company is shy about saying caching can substitute for RAG for many use cases.
Regime 3: Large, churning corpus → RAG actually wins
The regime where RAG dominates is the one it was designed for:
- Corpus size above ~1-5M tokens (too large to cache cost-effectively).
- High churn — documents added or updated daily.
- Multi-tenant — each tenant has their own corpus, can’t share a cache.
- Queries are narrow — a single question is answered from a small subset of the corpus, not by cross-document reasoning.
Enterprise knowledge bases, public search engines, multi-tenant SaaS “chat with your docs” features — these are RAG’s home turf. The math that makes prompt caching attractive for medium corpora breaks at the 5M+ token range because cache costs scale with corpus size and the cache hit rate falls if many tenants share infrastructure.
This is where the techniques in the other posts in this series — hybrid search, contextual retrieval, self-correcting loops — actually pay off. If you’re in this regime, the investment is worth it. If you’re not, it isn’t.
Regime 4: Behavior, not knowledge → fine-tune
The most-misused alternative to RAG is fine-tuning, because it’s not actually an alternative — it solves a different problem.
Fine-tuning is the right answer when you want to change the model’s behavior: output format, refusal style, vocabulary, persona, domain-specific phrasing. It is the wrong answer when you want the model to know new facts. The reason is that fine-tuning blends facts into the weights so thoroughly that you can’t update them later without retraining, and the model will confidently generate “facts” that were close to the training data but not exactly in it. RAG keeps facts external and inspectable; fine-tuning blurs them into the parameter space.
The 2024-2025 wave of LoRA fine-tuning tools (Unsloth, Together AI’s fine-tuning, OpenAI’s hosted fine-tuning, Anthropic’s custom model service) made fine-tuning cheap enough — typically $10-100 per training run for a LoRA adapter on a base of a few thousand examples — that it’s worth doing for behavior shaping. Combine fine-tuning (for style/format) with RAG or caching (for facts) and you cover both axes.
A concrete example: an internal AI assistant that responds in a specific corporate voice, with bullet-list-first formatting, citing sources in a specific format. Fine-tune the base model on a few hundred well-formatted example responses. Use RAG or caching to give it the corpus. Don’t try to make RAG do the formatting — it can’t, and prompting it endlessly is brittle.
A concrete cost worked example
Let’s price out a hypothetical “AI assistant over our product documentation” — a real use case I’ve seen teams default-to-RAG on. The corpus is 250K tokens. Expected query volume is 5,000 queries per day.
Option A — RAG with hybrid search + rerank:
- Indexing cost (one-time): embeddings + storage ≈ $50.
- Per-query cost: top-20 chunks ≈ 8K tokens input + reranker call ≈ $0.030 input + $0.015 generation = $0.045.
- Daily cost: 5,000 × $0.045 = $225/day, ~$6,750/month.
- Latency: 150-250ms.
- Quality: bounded by retrieval recall. Cross-section questions are hit-or-miss.
Option B — Prompt caching with 250K-token cached prompt:
- One-time cache write: 250K × full input rate ≈ $0.75 per cache write. Refresh once per hour during business hours = ~$10/day.
- Per-query cost: 250K cached tokens at 10% rate ≈ $0.075 + 1K query tokens + 500 output tokens ≈ $0.078 per query.
- Daily cost: 5,000 × $0.078 + cache refresh ≈ $400/day, ~$12,000/month.
- Latency: 200-400ms for the first call after a cache miss, 100-200ms on warm cache.
- Quality: model sees everything. Cross-section reasoning is native.
The RAG option is cheaper per query in this example — but the quality gap is real, and the operational burden (vector DB, reranker subscription, retrieval evals, hybrid tuning) is substantial. If quality matters more than $5K/month, caching wins. If the team is small and quality is “good enough” with RAG, RAG wins.
The interesting case is Option C: start with caching, switch to RAG when the cost crosses your threshold. That’s the trajectory I’ve seen most often. You ship in two weeks with caching, get user feedback, and only invest in RAG infrastructure when the unit economics force it.
When the conventional wisdom is right
To be clear: RAG remains the right answer for a wide class of problems. If any of these describe you, build RAG:
- Multi-tenant SaaS with one corpus per tenant, where tenants pay individually.
- Knowledge bases over 5M+ tokens, especially with daily churn.
- Public search products (Perplexity, Glean, Notion AI Connect).
- Compliance-regulated domains where you need to log exactly which document was retrieved for each query.
The mistake isn’t building RAG. The mistake is reaching for RAG before checking whether the corpus is small enough that you don’t need it.
What to take away
Three lines, with footnotes:
- Measure your corpus size first. Under 10K tokens, use a system prompt. Under 500K tokens with a stable corpus and any meaningful query volume, evaluate prompt caching seriously — the cost has dropped 10× since 2024.
- Fine-tune for behavior, not facts. A LoRA adapter changes how the model writes; it does not reliably teach the model what’s true. Combine fine-tuning with retrieval or caching, never as a substitute.
- RAG is a tool, not a default. It’s the right tool for large, churning, multi-tenant corpora. It’s the wrong tool for a 50K-token product manual that ships once a quarter. The most common production mistake is using RAG in the second regime because the team built it for what they assumed was the first.
The deeper point is that 2026-era LLM infrastructure offers four techniques for getting custom knowledge into a model, and the fashionable one isn’t always the right one. The teams that ship quickly are the teams that pick the technique that matches the regime — not the technique that was on the front page of Hacker News when they started the project.
Further reading: Anthropic’s prompt caching announcement, Gemini’s context caching docs, OpenAI’s prompt caching guide. For the fine-tuning side, see the LoRA paper and the Unsloth blog for current best practices. The LongBench evaluation is the canonical reference for long-context vs retrieval-based question answering.