Agentic RAG: when the agent decides whether to retrieve

The default architecture of an early RAG system looks like a vending machine: insert a query, the retriever fires, the top-K chunks pop out, the model writes an answer, done. Every query pays the same toll — the embedding lookup, the rerank, the larger prompt — whether the query needed any of it or not.

The architecture that has quietly replaced it in production over the last eighteen months is what the LangChain team and others have started calling agentic RAG: a system where retrieval is a tool the model decides to call, the same way it decides to call a calculator or a SQL query. “How are you today?” gets no retrieval. “What did our Q3 incident report say about the database failover?” gets a retrieval. “How did our incident-response process change between Q3 and Q4?” gets two retrievals, then a synthesis.

This post is about why that small architectural shift turns out to matter so much — and what production teams have figured out about making it actually work.

The naive pipeline’s three failures

The vending-machine pipeline fails in three specific, expensive ways:

It retrieves when it shouldn’t. Greeting messages, clarifying questions, casual chitchat, math problems, and code completion all eat a retrieval round trip. On a customer-facing chatbot, this is maybe 30-40% of inbound turns, and each one pays 200-500ms of retrieval latency plus the longer-prompt token cost — for nothing.
It retrieves once when it should retrieve twice. Complex questions decompose into multiple sub-queries, but the retriever sees only the original. (“Compare our Q3 and Q4 incident response metrics” pulls Q3 chunks or Q4 chunks if your retriever is lucky, but rarely good coverage of both.)
It can’t ask a clarifying question. The retriever has no capacity to say “wait, by ‘the report’ do you mean the security audit or the financial filing?” — the question goes to retrieval anyway, with whichever interpretation the embedding favours.
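To put rough numbers on the first failure, here is a back-of-envelope sketch using only the ranges quoted above (30-40% of turns, 200-500ms per retrieval). The 1,000-turn traffic volume is an assumption picked to make the arithmetic concrete, and token cost is left out because no figure is given for it.

```python
def wasted_retrieval_seconds(turns: int, unneeded_share: float, latency_ms: float) -> float:
    """Total retrieval latency spent on turns that did not need retrieval."""
    return turns * unneeded_share * latency_ms / 1000.0


TURNS = 1_000  # assumed traffic volume, for illustration only

low = wasted_retrieval_seconds(TURNS, 0.30, 200)   # low end of both quoted ranges
high = wasted_retrieval_seconds(TURNS, 0.40, 500)  # high end of both quoted ranges

print(f"Wasted retrieval latency per {TURNS} turns: {low:.0f}s to {high:.0f}s")
```

Under those assumptions, that is somewhere between 60 and 200 seconds of aggregate user-facing wait per thousand turns, spent on lookups whose results were never needed.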

Agentic RAG solves all three by giving the model a different shape of control: it can decide whether to retrieve, what to retrieve, how many times to retrieve, and whether to ask first. The retrieval is a tool the agent reasons about, not a pipeline stage the system always executes.

In agentic RAG, the model is the dispatcher. It can answer directly, retrieve once, retrieve repeatedly with refined queries, or ask the user a clarifying question before doing anything.

The three pillars of an agentic RAG loop

Every production agentic RAG system has the same three concerns. The frameworks differ in surface API; the loop shape is essentially the same.

1. Query classification or routing

The agent’s first job is to decide what kind of question this is. A small upstream classifier or a structured-output call to the agent itself partitions inbound queries into roughly four buckets:

Casual — greetings, meta questions, small talk. No retrieval.
Direct-answer — questions the model already knows from training and doesn’t need evidence for. (“What’s the capital of Brazil?”) No retrieval.
Single-hop retrieval — a fact lookup against the corpus. One retrieval, generate, return.
Multi-hop / synthesis — needs comparison or reasoning across multiple corpus regions. Retrieve, reason, retrieve again, then synthesize.

The LangGraph agentic RAG tutorial implements this as a conditional edge: the LLM emits a structured route field, the graph dispatches accordingly. The proportion that hits the “no retrieval” branch on a real chatbot is regularly 30-50% — which means agentic RAG is, in production, often faster than naive RAG on average, not slower.
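The routing step can be sketched as a conditional dispatch over the four buckets. This is a framework-free illustration, not the LangGraph tutorial's code: `classify` is a hypothetical keyword stub standing in for the structured-output LLM call, and the `Route` names and `plan` helper are invented for the example.

```python
from enum import Enum


class Route(str, Enum):
    CASUAL = "casual"
    DIRECT = "direct_answer"
    SINGLE_HOP = "single_hop"
    MULTI_HOP = "multi_hop"


def classify(query: str) -> Route:
    """Hypothetical stand-in for a structured-output LLM call that emits a route."""
    q = query.lower()
    if q.rstrip("?!. ") in {"hi", "hello", "how are you today"}:
        return Route.CASUAL
    if any(marker in q for marker in ("compare", "change between", "versus")):
        return Route.MULTI_HOP
    if any(marker in q for marker in ("our ", "incident", "report")):
        return Route.SINGLE_HOP
    return Route.DIRECT


def plan(query: str) -> tuple[Route, bool]:
    """Return the route and whether the retriever should be touched at all."""
    route = classify(query)
    return route, route in {Route.SINGLE_HOP, Route.MULTI_HOP}


for q in (
    "How are you today?",
    "What did our Q3 incident report say about the database failover?",
    "How did our incident-response process change between Q3 and Q4?",
    "What's the capital of Brazil?",
):
    route, needs_retrieval = plan(q)
    print(f"{route.value:<14} retrieve={needs_retrieval}  {q}")
```

The keyword rules are deliberately crude. The point is the shape: one cheap decision up front, and two of the four branches never pay for retrieval.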

2. Retrieve-as-tool, with loops

When the agent does retrieve, the retriever is a tool call with a structured signature, not a fixed pipeline stage. The agent reads the results, decides whether they’re sufficient, and either generates an answer or calls the retriever again with a refined query.

This is where Anthropic’s MCP (Model Context Protocol) and OpenAI’s Agents SDK both shine. Both ship a tools=[retriever] pattern where the model decides when and how to invoke retrieval, with streaming tool results and structured returns. Anthropic’s recent code execution with MCP post lays out the reasoning: tool definitions are themselves expensive (a five-server MCP setup with 58 tools consumed ~55K tokens of context before any user turn), so retrieval and other tools live behind a discoverable surface the model dynamically loads.

The retrieve-refine loop is what handles the multi-hop case. The agent issues retrieve("Q3 incident metrics"), sees the chunks, notices the user actually asked for a comparison, and issues retrieve("Q4 incident metrics") as a second call. A naive pipeline would have done one retrieval and merged the wrong things.
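A minimal sketch of that loop, under loud assumptions: `CORPUS` is a toy exact-match store with placeholder text, and `next_query` is a hypothetical stand-in for the model's own judgment about whether the evidence so far covers the question. None of these names come from a real framework.

```python
from typing import Optional

# Toy corpus with placeholder text; a real system would hit a vector index.
CORPUS = {
    "Q3 incident metrics": ["(placeholder chunk about Q3 incidents)"],
    "Q4 incident metrics": ["(placeholder chunk about Q4 incidents)"],
}
CALLS: list[str] = []  # records every retrieval the agent issues


def retrieve(query: str) -> list[str]:
    """Toy retriever: exact-match lookup standing in for a real search."""
    CALLS.append(query)
    return CORPUS.get(query, [])


def next_query(needed: list[str], evidence: dict[str, list[str]]) -> Optional[str]:
    """Stand-in for the model's sufficiency check.

    Returns a refined query for the first period with no evidence yet,
    or None once every period the user asked about is covered.
    """
    for period in needed:
        if not evidence.get(period):
            return f"{period} incident metrics"
    return None


def gather(needed: list[str]) -> dict[str, list[str]]:
    evidence: dict[str, list[str]] = {}
    while (query := next_query(needed, evidence)) is not None:
        chunks = retrieve(query)
        if not chunks:
            break  # nothing found; stop rather than re-issue the same query
        evidence[query.split()[0]] = chunks
    return evidence


evidence = gather(["Q3", "Q4"])
print(CALLS)  # ['Q3 incident metrics', 'Q4 incident metrics']
```

The second call only happens because something inspects the first result and notices Q4 is still missing. That inspection step is exactly what a fixed pipeline has no place for.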

3. Clarification before retrieval

The most underrated agentic RAG capability is the agent asking before it retrieves. The naive pipeline can’t, because there’s no agent to ask. The agentic one can, and the wins are concrete:

A production support agent at a large SaaS company (a deployment I saw discussed at a 2026 community meetup) measured that ~12% of inbound questions were ambiguous in a way that retrieval couldn’t resolve. Adding a single clarification turn (“do you mean the admin console or the developer API?”) before retrieval cut their “answer didn’t help” rate by 18 percentage points and reduced average retrieval calls per session by nearly 30%, because the agent no longer fired retrieval against a malformed query and then fired it again after the user corrected the query.

This is the Anthropic-style augmented LLM pattern applied to RAG: the model decides whether to retrieve, whether to answer, or whether to ask. The retrieval tool is just one of three choices, not a default action.
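One way to make the three choices explicit in a harness is to model them as a small tagged union. The sketch below is an assumption about how such a harness could look, not any vendor's API: `decide` is a hypothetical rule-based stub where a real system would parse the model's structured output.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Answer:
    text: str


@dataclass
class Retrieve:
    query: str


@dataclass
class Clarify:
    question: str


Action = Union[Answer, Retrieve, Clarify]


def decide(user_message: str) -> Action:
    """Hypothetical policy stub; a real system would get this from the model."""
    msg = user_message.lower()
    if "the report" in msg:
        return Clarify("Do you mean the security audit or the financial filing?")
    if "incident" in msg or "failover" in msg:
        return Retrieve(user_message)
    return Answer("(answered directly, no retrieval)")


def handle(action: Action) -> str:
    """What the harness does next for each of the three choices."""
    if isinstance(action, Clarify):
        return f"ask user: {action.question}"
    if isinstance(action, Retrieve):
        return f"call retriever: {action.query}"
    return f"reply: {action.text}"


print(handle(decide("What does the report say about access reviews?")))
```

Because `Clarify` is a first-class outcome rather than an error path, the harness can count how often it fires, which is the number you need to reproduce a measurement like the one above.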

What the frameworks actually give you

The agentic RAG pattern has converged on three implementation stacks in 2026, with surface differences but the same loop underneath:

LangGraph — the state-machine library is a near-perfect fit for the agentic RAG loop. Nodes are LLM calls or tool invocations, conditional edges handle the routing. The LangChain agentic RAG docs walk through the canonical implementation: classify → route → retrieve (maybe) → generate. State persists between turns via the graph’s checkpoint mechanism. Best for teams that want explicit, inspectable workflow control.

OpenAI Agents SDK — the agent loop is the primitive, retrieval is a tool with a typed schema. The model decides; the SDK handles the tool-call dispatch. Cleaner code for simple loops; harder to add sophisticated routing without writing your own dispatcher.

Anthropic Claude + MCP — retrieval is exposed as an MCP server, the model loads tool definitions on demand via Anthropic’s tool search, and the loop is whatever your harness implements. The win here is that the same retriever-as-MCP-server is usable from any MCP client (Claude Code, Claude Desktop, your own app), which makes it easier to share retrieval infrastructure across surfaces.

In all three, the actual logic is small. Most of the engineering effort goes into the eval set (how do you measure routing accuracy?) and the observability (can you replay the agent’s decision path on a failure?). Not the framework.
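For the routing-accuracy question, a starting point can be as plain as a labeled list and a confusion count. The sketch below is illustrative only: the four-item `EVAL_SET` and its two labels are made up for the example, and `always_retrieve` models the naive pipeline as a baseline router.

```python
from collections import Counter
from typing import Callable, Iterable


def routing_report(
    labeled: Iterable[tuple[str, str]],
    predict: Callable[[str], str],
) -> tuple[float, Counter]:
    """Return (accuracy, confusions); confusions counts (expected, predicted) misses."""
    total = correct = 0
    confusions: Counter = Counter()
    for query, expected in labeled:
        predicted = predict(query)
        total += 1
        if predicted == expected:
            correct += 1
        else:
            confusions[(expected, predicted)] += 1
    return (correct / total if total else 0.0), confusions


# Tiny hand-labeled set (illustrative; a real eval set would be far larger).
EVAL_SET = [
    ("How are you today?", "no_retrieval"),
    ("What did our Q3 incident report say about the database failover?", "retrieve"),
    ("How did our incident-response process change between Q3 and Q4?", "retrieve"),
    ("What's the capital of Brazil?", "no_retrieval"),
]


def always_retrieve(query: str) -> str:
    """Baseline router: the naive pipeline, which retrieves on every query."""
    return "retrieve"


accuracy, confusions = routing_report(EVAL_SET, always_retrieve)
print(accuracy)                  # 0.5
print(confusions.most_common())  # [(('no_retrieval', 'retrieve'), 2)]
```

The confusion pairs matter more than the headline number: retrieving when you didn't need to costs latency, while skipping retrieval when you did need it costs correctness, and those two errors deserve separate thresholds.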

Production patterns that survived

A few patterns that have survived contact with shipped products:

Pattern 1 — clarify-before-retrieve, always for ambiguous queries

The support agent example above generalizes. If your domain has common ambiguities (Stripe Connect vs Stripe Atlas; admin console vs end-user API; the EU subsidiary vs the US parent), train the model explicitly to ask before retrieving when the query is ambiguous on those axes. The cost is one extra turn; the saving is multiple malformed retrievals.
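If “train” here means instructing the model through its system prompt (an assumption; fine-tuning is the other reading), the ambiguity axes can live in a plain list that gets rendered into the prompt. The helper name and the prompt wording below are hypothetical.

```python
# Known ambiguity axes for the domain, taken from the examples in the text.
AMBIGUITY_AXES = [
    ("Stripe Connect", "Stripe Atlas"),
    ("the admin console", "the end-user API"),
    ("the EU subsidiary", "the US parent"),
]


def clarification_instructions(axes: list[tuple[str, str]]) -> str:
    """Render the ambiguity axes as a system-prompt fragment."""
    lines = [
        "Before calling any retrieval tool, check the question for these ambiguities.",
        "If it could mean either option and the conversation does not settle it, "
        "ask one short clarifying question instead of retrieving:",
    ]
    lines.extend(f"- {a} vs {b}" for a, b in axes)
    return "\n".join(lines)


print(clarification_instructions(AMBIGUITY_AXES))
```

Keeping the axes as data rather than prose buried in a prompt means support or product teams can add a new ambiguity without touching the agent code.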

Pattern 2 — retrieval as a typed tool, never a free-form call

Retrieval tools that look like retrieve(query: str) -> list[Chunk] are too coarse. Production deployments split into something like:

search_docs(query: str, source: Literal["public_docs", "internal_wiki",
            "tickets"], top_k: int = 5) -> list[Chunk]
search_changelog(version_from: str, version_to: str) -> list[Change]
lookup_account(account_id: str) -> Account

The agent picks the right retriever, not just any retriever. This is the single biggest hit-rate improvement most teams see going from naive to agentic RAG. The LlamaIndex agents docs describe this as “RAG over multiple data sources,” and it is also the most common production deployment shape.
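To make the first of those signatures concrete, here is a runnable sketch of `search_docs` with argument validation and a matching tool description. The stub body, the `Chunk` fields, and `SEARCH_DOCS_SCHEMA` are assumptions for illustration; the schema uses a generic JSON Schema shape, and the exact envelope a provider expects around it differs by SDK.

```python
from dataclasses import dataclass
from typing import Literal, get_args

Source = Literal["public_docs", "internal_wiki", "tickets"]


@dataclass
class Chunk:
    source: str
    text: str


def search_docs(query: str, source: Source, top_k: int = 5) -> list[Chunk]:
    """Stub body: validates arguments, then returns a placeholder chunk."""
    if source not in get_args(Source):
        raise ValueError(f"unknown source {source!r}; expected one of {get_args(Source)}")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # A real implementation would query the index for `source` here.
    return [Chunk(source=source, text=f"placeholder result for {query!r}")][:top_k]


# Tool description handed to the model; the enum is derived from the same
# Literal the function validates against, so the two cannot drift apart.
SEARCH_DOCS_SCHEMA = {
    "name": "search_docs",
    "description": "Search one named documentation source.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "source": {"type": "string", "enum": list(get_args(Source))},
            "top_k": {"type": "integer", "minimum": 1, "default": 5},
        },
        "required": ["query", "source"],
    },
}

print(search_docs("database failover", "tickets"))
```

Making `source` a required enum is the design choice that matters: the model cannot call the tool without committing to a corpus, so a wrong choice shows up in the trace as a visible argument rather than as a quietly bad result set.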

Pattern 3 — bounded retrieval rounds, with a final answer step

Letting the agent loop unbounded is how you burn $50 on a single query. Production deployments cap retrieval at 3-5 rounds with a “you must answer with whatever you have” final step. The teams I’ve seen ship this find that 90% of queries answer in 1 retrieval, 8% need 2, and the remaining 2% genuinely need more. You still have to bound the loop, or the long tail of weird queries eats your compute.

Figure: the production-shaped agentic RAG loop, with bounded loop iteration, typed retrieval tools, and an explicit “force answer” fallback. The dashed arrow back is what makes this agentic; the bounded counter is what makes it shippable.
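The bounded loop with its forced final step fits in a few lines. In this sketch the three callables are hypothetical stand-ins for the model and the retriever, and `MAX_ROUNDS = 3` is simply the low end of the 3-5 range mentioned above.

```python
from typing import Callable, Optional

MAX_ROUNDS = 3  # low end of the 3-5 cap suggested above


def run_bounded(
    question: str,
    propose_query: Callable[[str, list[str]], Optional[str]],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[str, list[str], bool], str],
    max_rounds: int = MAX_ROUNDS,
) -> str:
    evidence: list[str] = []
    for _ in range(max_rounds):
        query = propose_query(question, evidence)
        if query is None:  # the model judges the evidence sufficient
            return answer(question, evidence, False)
        evidence.extend(retrieve(query))
    # Cap reached: force an answer from whatever was gathered.
    return answer(question, evidence, True)


# Worst-case stubs: a "model" that is never satisfied with its evidence.
calls: list[str] = []


def never_satisfied(question: str, evidence: list[str]) -> Optional[str]:
    return f"{question} (attempt {len(evidence) + 1})"


def fake_retrieve(query: str) -> list[str]:
    calls.append(query)
    return [f"chunk for {query}"]


def fake_answer(question: str, evidence: list[str], forced: bool) -> str:
    kind = "forced" if forced else "normal"
    return f"{kind} answer from {len(evidence)} chunks"


result = run_bounded("weird query", never_satisfied, fake_retrieve, fake_answer)
print(result)  # forced answer from 3 chunks
```

Passing the `forced` flag through to the answer step lets the final prompt say so explicitly, and gives observability a clean signal for how often the cap is actually hit.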

Where this leaves naive RAG

It’s tempting to declare naive RAG dead. It isn’t. There’s still a class of workloads — single-purpose, single-corpus, narrow-vocabulary Q&A — where naive RAG is the right amount of complexity. Docs search over a 500-page manual. FAQ bot over a fixed knowledge base. Anything where every query genuinely needs retrieval and the retrieval is unambiguous.

But the moment your system grows multiple data sources, ambiguous queries, or any need for the model to decide something about the query before answering, the agentic RAG loop is the right shape. Production teams have largely stopped treating “RAG” and “agentic RAG” as separate categories — agentic RAG is just RAG that has grown up enough to know when not to retrieve.

What to take away

Three lines, distilled from the last eighteen months of agentic RAG shipping:

Retrieval is a tool, not a pipeline stage. Letting the model decide whether to retrieve catches 30-50% of queries that didn’t need retrieval at all and saves the latency and token cost on every one.
Typed retrievers beat universal ones. search_docs(source=...), lookup_account(account_id=...), search_changelog(version_from, version_to). Production agentic RAG looks more like an API surface than a vector DB call.
Bound the loop, but allow it. Cap retrieval rounds at 3-5, but do allow re-querying — the multi-hop case is where agentic RAG earns its keep over naive RAG, not the FAQ case.

The agentic RAG shift is one of those architectural changes that looks small in a diagram and reshapes the whole product. The vending machine is dead; the dispatcher is the new default.

Further reading: the LangGraph agentic RAG tutorial, Anthropic’s Building Effective Agents, the MCP code execution post, the LlamaIndex agents docs, and Qdrant’s agentic RAG with LangGraph guide.