RAG in one diagram: retrieve, augment, generate

A fintech company I know spent four months fine-tuning a model on their internal compliance docs. Cost: north of $80,000 in compute and engineering time. Result: the model still hallucinated policy numbers it had definitely seen during training, because the policies had been updated three weeks after the training data cutoff.

They rebuilt the whole thing with RAG in eleven days.

That story is not an advertisement for RAG. It is an advertisement for understanding what problem each tool actually solves. RAG — retrieval-augmented generation — exists because LLMs have a fundamental limitation that no amount of training can fully fix: they are frozen in time the moment training ends, and they cannot memorize a corpus reliably even within that window.

The core insight

A language model is not a database. It has read an enormous amount of text and compressed that reading into billions of floating-point weights. When you ask it a question, it does not look anything up; it generates an answer shaped by what its weights learned to produce. This is a beautiful, generative thing. It is also a terrible thing if you need the model to quote your API documentation from last Tuesday.

The fix is almost embarrassingly simple: tell the model the answer before you ask the question.

More precisely: when a user asks a question, you find the most relevant passages from your own corpus, paste them into the prompt alongside the question, and then ask the model to answer using that material. The model still does the language work — synthesizing, reasoning, reformulating — but it is anchored to real text you provided, not floating on learned associations.

That is RAG. The rest is engineering.

The three-step loop

Every RAG system, from a weekend prototype to a production system handling 50,000 daily queries, follows the same three steps. The diagram below captures the whole pipeline.

The RAG pipeline. Left to right: embed the question, retrieve the nearest chunks by cosine similarity, stuff them into the prompt alongside the question, let the LLM generate an answer grounded in that context. Two failure modes are marked in red.

Step 1: Retrieve. Your question is converted to a vector — a list of numbers that encodes its semantic meaning — by an embedding model (a smaller, specialized model trained to turn text into geometry). That vector is compared, via cosine similarity, against a pre-built index of every chunk of every document in your corpus. The top-k most similar chunks come back. This whole round trip, on a well-run vector store, takes under 100 milliseconds.

Step 2: Augment. The retrieved chunks are injected into the prompt, usually between a system instruction and the user’s question. The model now sees the relevant passages as live context. It does not have to remember them; they are right there on the page.

Step 3: Generate. The LLM writes an answer. It can synthesize across multiple chunks, identify contradictions, reformulate ideas in plain language — everything language models are actually good at. But it is doing that reasoning on grounded text, not on statistical associations baked in months ago.

Why this beats fine-tuning for fresh and private knowledge

Fine-tuning updates the weights of a model by re-running gradient descent on your data. The result is a model that has, in some sense, internalized your corpus. This sounds appealing. It is the wrong tool for three specific problems.

First, freshness. The moment you fine-tune, the knowledge is frozen again. Your compliance docs update quarterly. Your API reference updates on every deploy. Fine-tuning cannot keep up without a continuous training pipeline that costs real money to operate.

Second, attribution. A fine-tuned model cannot tell you which document it learned a fact from, because knowledge is spread diffusely across its weights. RAG can. The retrieved chunks are explicit; you can surface them as citations. In regulated industries — finance, healthcare, legal — this is not a nice-to-have.

Third, deletion. Privacy laws increasingly require you to be able to “forget” a piece of data. You cannot surgically remove a document from a model’s weights. You can remove it from your vector index in seconds.

Fine-tuning remains the right choice when you need to change how the model behaves: a JSON schema it always follows, a tone it never breaks, a vocabulary specific to a domain that the base model stumbles on. Behavior is in the weights. Knowledge is in the retrieval layer.

The two failure modes worth understanding

Every RAG system in production eventually breaks in one of two ways. Knowing them in advance is the difference between debugging for a week and fixing in an afternoon.

Failure mode 1: the retrieval is wrong. The vector store returns chunks that are semantically plausible but contextually wrong. This happens when queries are ambiguous, when documents are poorly chunked (a passage split mid-sentence loses the referent of “it”), or when your embedding model is weak on your domain. The symptom is confident, fluent answers that contradict your actual documents. It looks like hallucination. It is not — it is retrieval failure. The fix is in the indexing pipeline, not the model.

Failure mode 2: the model ignores the context. The right chunks surface, but the LLM generates an answer from its training data anyway. This is rarer with modern models but still occurs when the prompt is poorly structured, when the context window is so full that the relevant chunks land in a zone the model attends to weakly (a real phenomenon, called “lost in the middle”), or when the system prompt does not clearly instruct the model to constrain itself to the provided context. The fix is in the prompt design, not the retrieval.

These two modes look identical to the end user — a wrong answer — and require completely different interventions. This is why RAG evaluation is harder than it looks, and why logging retrieved chunks separately from the final answer is non-negotiable from day one.

The embedding layer is doing more work than you think

Most explanations of RAG treat the embedding step as a black box and move on. It deserves a moment.

When an embedding model encodes “What is our refund window for perishables?” it does not just hash the words. It produces a vector that captures the intent: a policy question about time limits applied to a specific product category. A semantically similar chunk — even one that says “Customers may return fresh goods within 48 hours of purchase” without using the word “refund” — lands near that vector in the high-dimensional space.

This is what makes RAG qualitatively different from keyword search. A keyword index would miss that chunk unless the user typed “return” or “perishables.” The embedding captures the meaning, not the surface form. The practical consequence is that users do not need to know your document vocabulary to find relevant information.

The flip side is that embedding quality varies dramatically by domain. A general-purpose embedding model trained on web text may perform poorly on medical abbreviations, legal citations, or internal product codenames. If your corpus is specialized, this is the first thing to test and possibly the first thing to replace.

What RAG does not fix

RAG is not a reasoning upgrade. If the retrieved chunks contain a contradiction, the model will either pick one arbitrarily or hedge in ways that frustrate users. If the answer requires synthesizing information across ten documents in a nuanced causal chain, a single generation pass over three chunks will miss it. If your documents are scanned PDFs with broken OCR, the garbage goes straight into your index and straight into your answers.

RAG also does not fix the model’s fundamental inability to count, to do arithmetic reliably, or to track references across very long contexts. These are weight problems, and no retrieval layer patches them.

The honest framing is this: RAG gives the model the right page to work from. What the model does with that page is still up to the model.

The intuition that makes everything else click

Here is the mental model worth keeping.

An LLM without RAG is a very well-read person who studied everything up to a certain date and then walked into an isolation chamber. Their reasoning is sharp. Their knowledge is stale and imperfect. If you ask them a question, they will answer from memory, and the answer may be wrong in ways neither of you can easily detect.

An LLM with RAG is the same well-read person, except you are allowed to slide documents under the door before they answer. The documents are from your company, updated yesterday, selected specifically for this question. The person still does the reasoning. But they are reasoning from current, relevant, explicit evidence — evidence you chose, evidence you can inspect, evidence you can update whenever the policy changes.

That handoff — from remembered knowledge to supplied context — is the whole idea. The three steps, the vector index, the similarity search: those are all machinery in service of one simple act of epistemic honesty. The model did not know. You told it. It answered.

That is retrieval-augmented generation. Everything else — hybrid search, late interaction, agentic RAG, rerankers, contextual retrieval — is refinement on top of that core loop. Learn the loop first, then worry about the refinements.