datarekha
RAG April 23, 2026

Multi-hop RAG without the hallucinations

Single-hop RAG works until your users start asking questions that require synthesis. Naive multi-hop hallucinates intermediate facts. Here's the working architecture — explicit decomposition, iterative retrieval, self-verification — with the production examples from Harvey and Hebbia that actually ship it.

13 min read · by datarekha · tags: rag, multi-hop, iterative-retrieval, hallucination

The first time a RAG system has to answer “what changed in our refund policy between 2023 and 2025?” something quietly breaks. The single-hop retriever fetches the most semantically similar paragraphs to the query — probably one chunk from 2023, maybe one from 2025, maybe nothing useful at all — and the LLM happily synthesises a coherent-sounding answer. Sometimes correct. Often half-correct. Occasionally entirely fabricated, citing chunks that don’t say what the model claims they say.

The standard reaction is to “make it multi-hop”: let the LLM decide to search again if its first answer feels incomplete. That makes the problem worse. Naive multi-hop loops hallucinate intermediate “sub-conclusions” with the same confidence they hallucinate final answers, and then condition further retrieval on those hallucinated facts. By hop three, the system is searching for evidence to confirm things it made up itself.

The production teams shipping reliable multi-hop RAG — Harvey for legal, Hebbia for finance, a growing number of internal enterprise search systems — have converged on a small set of techniques that contain the failure mode. None of them eliminate hallucination, but they corner it well enough that the citations a human can verify catch the remaining problems. This post is the working architecture.

What naive multi-hop actually breaks

[Figure: Three ways naive multi-hop hallucinates more, not less. (1) Intermediate hallucination: the LLM generates a fake fact to bridge the gap, then retrieves to “confirm” the made-up fact. Fix: explicit decomposition before any LLM reasoning. (2) Confirmation cascade: retrieved chunks reinforce an early wrong direction; the model anchors on hop 1 and refuses to backtrack. Fix: CRAG-style retrieval quality scoring. (3) Loop without stopping: the LLM never declares “enough evidence”; cost and latency spiral with no clean termination. Fix: Self-RAG reflection tokens plus a hard hop budget.]
The three ways naive “let the LLM decide to retrieve again” loops fail. Each fix corresponds to a published technique that production systems have adopted in some form.

The failures are mechanical:

  1. Intermediate hallucination. When the LLM is asked “do you need more information?” it sometimes generates a plausible-sounding intermediate conclusion to fill the gap, then retrieves chunks to “confirm” the fabrication. The system ends up with citations that don’t say what the model claims they say.
  2. Confirmation cascade. Once the model commits to a direction at hop 1, subsequent retrievals tend to reinforce that direction. RAG pipelines without explicit retrieval-quality scoring rarely backtrack.
  3. Loop without stopping. The LLM doesn’t have a clean signal for “I have enough now.” Without an external budget or a self-assessment signal, multi-hop loops either terminate prematurely or run away.

The architecture that survives all three is layered, and each layer addresses a specific failure mode.

Layer 1 — Explicit decomposition before any retrieval

The first thing production multi-hop RAG systems do differently from the naive version is decompose the user query into structured sub-questions before any retrieval or answer-level reasoning happens. The model that decomposes is prompted with one job: turn the user’s question into a DAG of atomic sub-questions whose answers, combined, answer the original.
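To make the “DAG of atomic sub-questions” concrete, here is a minimal sketch of the data structure a decomposer might emit. The `SubQuestion` class and `execution_order` helper are hypothetical names for illustration, not taken from any of the systems discussed; only the standard-library `graphlib` is used.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SubQuestion:
    id: str
    text: str
    depends_on: list[str] = field(default_factory=list)  # ids that must be answered first


def execution_order(plan: list[SubQuestion]) -> list[str]:
    """Return sub-question ids with every dependency before its dependents.

    Raises graphlib.CycleError if the plan is not a DAG.
    """
    graph = {sq.id: set(sq.depends_on) for sq in plan}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical plan for the refund-policy question from the introduction.
plan = [
    SubQuestion("q1", "What was the refund policy in 2023?"),
    SubQuestion("q2", "What is the refund policy in 2025?"),
    SubQuestion("q3", "Which clauses differ between the two?", ["q1", "q2"]),
]
order = execution_order(plan)
```

Sub-questions with no dependency between them (`q1` and `q2` here) can be retrieved in parallel; the comparison step `q3` only runs once both are answered.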

A common companion technique from the literature is HyDE (Hypothetical Document Embeddings), which targets the retrieval step rather than the decomposition itself: for each sub-question, the LLM generates a hypothetical answer document, and the embedding of that document is used as the retrieval query instead of the raw question. The pseudo-document usually sits closer in embedding space to the real answer chunks than the question does, even when its specific facts are wrong.
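A rough sketch of the HyDE idea, under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `fake_llm` is a canned stand-in for the LLM call. Neither reflects a real library API; the point is only that the query vector comes from the hypothetical answer, not from the question.

```python
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag of lowercase words. Swap in a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_retrieve(question: str, generate: Callable[[str], str],
                  corpus: list[str], k: int = 1) -> list[str]:
    """Embed a hypothetical answer instead of the raw question, then rank the corpus."""
    pseudo_doc = generate(question)  # may contain wrong facts; used only as a query
    query_vec = toy_embed(pseudo_doc)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, toy_embed(doc)), reverse=True)
    return ranked[:k]


corpus = [
    "Refunds are issued within 30 days of purchase for unused items.",
    "Shipping takes five business days within the country.",
]
question = "How long do customers have to return things?"


def fake_llm(q: str) -> str:
    # Hypothetical canned output standing in for a real LLM call.
    return "Customers can get refunds within 30 days of purchase."


top = hyde_retrieve(question, fake_llm, corpus)
```

In this toy setup the raw question shares no words with either document, so a direct lookup has nothing to rank on, while the pseudo-document lands on the refund chunk.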

Production usage is pragmatic. Harvey’s agentic legal search system explicitly decomposes a legal research question into the relevant knowledge sources (case law, statutes, internal vault, third-party databases) and then issues parallel retrievals against each. Hebbia’s Matrix uses Iterative Source Decomposition (ISD) — described in their public material as the engine that breaks a financial analyst’s question into a structured plan of document sources to consult, in order, before any answer is attempted.

The key property: the decomposition happens before the LLM has seen any retrieved content. The sub-questions are generated from the original query and an understanding of available sources, not from a hop-by-hop “let me think about what I need next.” This removes the intermediate-hallucination failure from the planning stage: the model can’t condition on a fabricated fact because it hasn’t done any fact-shaped reasoning yet.

Layer 2 — Iterative retrieval with reasoning interleaved

For questions that genuinely need iterative retrieval (the next hop depends on what hop N returned), the production winners are variants of FLARE and IRCoT.

FLARE (Forward-Looking Active REtrieval) generates a draft answer one sentence at a time. It first drafts a tentative next sentence; if that draft contains tokens whose probability falls below a threshold, the system retrieves using the draft as the query and regenerates the sentence with the retrieved chunks in context. This turns retrieval from a one-shot pre-condition into a streaming process that follows the reasoning trajectory.
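The FLARE control flow can be sketched in a few lines. This is a simplified illustration, not the paper’s implementation: `draft_sentence` and `retrieve` are hypothetical callables, the 0.6 threshold is an arbitrary assumption, and the fakes below stand in for a model that exposes per-token probabilities.

```python
from typing import Callable, Optional

# (sentence, per-token probabilities), or None when the answer is finished
Draft = Optional[tuple[str, list[float]]]


def flare_generate(
    question: str,
    draft_sentence: Callable[[str, list[str], list[str]], Draft],
    retrieve: Callable[[str], list[str]],
    threshold: float = 0.6,   # assumed value, tune per model
    max_sentences: int = 8,
) -> tuple[str, int]:
    answer: list[str] = []
    evidence: list[str] = []
    retrievals = 0
    for _ in range(max_sentences):
        draft = draft_sentence(question, answer, evidence)
        if draft is None:
            break
        sentence, probs = draft
        if probs and min(probs) < threshold:
            # Low confidence: search with the tentative sentence, then redraft it.
            evidence.extend(retrieve(sentence))
            retrievals += 1
            redraft = draft_sentence(question, answer, evidence)
            if redraft is not None:
                sentence = redraft[0]
        answer.append(sentence)
    return " ".join(answer), retrievals


def fake_draft(question, answer, evidence):
    if len(answer) == 0:
        return "The 2023 policy allowed refunds.", [0.9, 0.9]
    if len(answer) == 1:
        if evidence:
            return "The 2025 policy allows 14 days.", [0.95]
        return "The 2025 policy allows 30 days.", [0.9, 0.3]  # unsure about "30"
    return None


def fake_retrieve(query):
    return ["2025 policy: refunds within 14 days."]


text, retrievals = flare_generate("What changed?", fake_draft, fake_retrieve)
```

The confident first sentence passes through untouched; the shaky second sentence triggers exactly one retrieval and is replaced by the evidence-backed redraft.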

IRCoT (Interleaving Retrieval with Chain-of-Thought) is the more explicit version: the LLM produces a chain-of-thought reasoning trace, and after each step in the trace, the system retrieves new chunks based on the new partial conclusion. The next reasoning step consumes the new chunks. The trace is the retrieval plan.
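A minimal sketch of the interleaving loop described above, assuming hypothetical `reason_step` and `retrieve` callables and a two-entry toy knowledge base. The stop phrase “so the answer is” is an assumption chosen for the example.

```python
from typing import Callable


def ircot(
    question: str,
    reason_step: Callable[[str, list[str], list[str]], str],
    retrieve: Callable[[str], list[str]],
    max_steps: int = 4,
) -> tuple[list[str], list[str]]:
    trace: list[str] = []
    chunks = list(retrieve(question))  # first retrieval uses the question itself
    for _ in range(max_steps):
        step = reason_step(question, chunks, trace)
        trace.append(step)
        if "so the answer is" in step.lower():
            break
        # The newest reasoning step is the next retrieval query.
        for chunk in retrieve(step):
            if chunk not in chunks:
                chunks.append(chunk)
    return trace, chunks


KB = {"widget": "The Widget is made by Acme.", "acme": "Acme is led by Dana."}


def toy_retrieve(query: str) -> list[str]:
    return [text for key, text in KB.items() if key in query.lower()]


def toy_reason(question: str, chunks: list[str], trace: list[str]) -> str:
    # Stand-in for one chain-of-thought step from an LLM.
    if any("Dana" in c for c in chunks):
        return "So the answer is Dana."
    if any("Acme" in c for c in chunks):
        return "The Widget is made by Acme."
    return "I need to find who makes the Widget."


trace, chunks = ircot("Who leads the maker of the Widget?", toy_reason, toy_retrieve)
```

The question alone only surfaces the first fact; the second fact becomes reachable because the first reasoning step mentions “Acme” and is used as the next query.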

Both techniques share the same insight: the model’s own intermediate reasoning state is the best signal for what to retrieve next, much better than a generic “do I need more?” question. The 2025 work on multi-hop iterative retrieval reports IRCoT-style architectures consistently beating single-hop RAG by significant margins on HotpotQA, MuSiQue, and other multi-hop QA benchmarks.

The production gotchas:

  • Hard hop budget. Always set a maximum number of retrieval rounds. 3-5 is typical for chat-shaped queries; 8-12 for deep research. Without it, edge-case queries run away.
  • Retrieval deduplication. Track which chunks have already been retrieved and exclude them from later retrievals. Otherwise the same chunks dominate every hop and the system spins in place.
  • Reasoning trace stored as state. The chain-of-thought between hops needs to be persisted so the final answer-generation step can see the full trajectory, not just the last hop.
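The three gotchas above fit naturally into one small state object. The sketch below is an assumption about how one might structure it; `HopState` and its field names are hypothetical, and chunks are modelled as `(chunk_id, text)` pairs.

```python
from dataclasses import dataclass, field


@dataclass
class HopState:
    max_hops: int = 4                                  # hard hop budget
    hops_used: int = 0
    seen_ids: set[str] = field(default_factory=set)    # for deduplication
    trace: list[str] = field(default_factory=list)     # reasoning between hops

    def can_retrieve(self) -> bool:
        return self.hops_used < self.max_hops

    def record(self, thought: str, chunks: list[tuple[str, str]]) -> list[tuple[str, str]]:
        """Count one hop, store the thought, and return only chunks not seen before."""
        self.hops_used += 1
        self.trace.append(thought)
        fresh = []
        for chunk_id, text in chunks:
            if chunk_id not in self.seen_ids:
                self.seen_ids.add(chunk_id)
                fresh.append((chunk_id, text))
        return fresh


state = HopState(max_hops=2)
first = state.record("look up the 2023 policy", [("c1", "2023 text"), ("c2", "FAQ text")])
second = state.record("look up the 2025 policy", [("c2", "FAQ text"), ("c3", "2025 text")])
```

Filtering after the fact is the simplest option. If the vector store supports an exclusion filter, passing `seen_ids` into the query itself avoids spending top-k slots on chunks the loop has already read.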

Layer 3 — Self-verification and corrective retrieval

The third layer is where most production systems diverge from the academic baseline. Even with explicit decomposition and iterative retrieval, retrieved chunks can be irrelevant, contradictory, or straight-up wrong (especially in heterogeneous enterprise corpora). The fix is an explicit verification step.

Self-RAG (Self-Reflective Retrieval-Augmented Generation) trains the model to emit special reflection tokens, namely IsREL (“is this chunk relevant”), IsSUP (“is the generated statement supported by this chunk”) and IsUSE (“is the final answer useful for the query”), that the system inspects before committing to an answer. A 2025 MDPI Electronics study of 12 RAG architectures measured Self-RAG at a 5.8% hallucination rate, the lowest in the benchmark, versus 12-14% for standard agentic pipelines.

CRAG (Corrective Retrieval-Augmented Generation) takes a complementary angle: a lightweight retrieval evaluator scores each retrieved chunk on relevance to the query, producing a confidence degree. If confidence is high, the chunks are used directly. If confidence is low or ambiguous, the system falls back to a broader search (often web search) or refines the retrieval. The same MDPI benchmark measured CRAG at a 10.5% hallucination rate with a Precision@5 of 0.69, worse than Self-RAG but better than plain agentic RAG.
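The CRAG decision can be sketched as a three-way gate on evaluator scores. The thresholds (0.7 and 0.3), the `Action` names, and the `score` callable are all assumptions for illustration; in a real system the scorer would be a trained evaluator or an LLM grader.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    USE_CHUNKS = "use_chunks"                        # confident: answer from retrieved chunks
    COMBINE_WITH_FALLBACK = "combine_with_fallback"  # ambiguous: keep chunks, also search wider
    FALLBACK_ONLY = "fallback_only"                  # nothing relevant: discard and search wider


def grade(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    upper: float = 0.7,   # assumed threshold
    lower: float = 0.3,   # assumed threshold
) -> tuple[Action, list[str]]:
    scores = [score(query, chunk) for chunk in chunks]
    kept = [chunk for chunk, s in zip(chunks, scores) if s >= lower]
    if any(s >= upper for s in scores):
        return Action.USE_CHUNKS, kept
    if not kept:
        return Action.FALLBACK_ONLY, []
    return Action.COMBINE_WITH_FALLBACK, kept


def make_scorer(table: dict[str, float]) -> Callable[[str, str], float]:
    # Canned scores standing in for a real relevance evaluator.
    return lambda query, chunk: table[chunk]


confident = grade("q", ["a", "b"], make_scorer({"a": 0.9, "b": 0.1}))
ambiguous = grade("q", ["a", "b"], make_scorer({"a": 0.5, "b": 0.1}))
hopeless = grade("q", ["a", "b"], make_scorer({"a": 0.1, "b": 0.1}))
```

Returning the action alongside the surviving chunks keeps the fallback policy (web search, a different index, or giving up) out of the grader and in the caller.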

In production the two get composed. Self-RAG handles “should we retrieve at all and what should we trust” — CRAG handles “is what we retrieved actually good enough.” Harvey’s published architecture explicitly scores retrieved chunks on relevance before passing them to the synthesis step; Hebbia’s citation-first design (where “an un-sourced fact is worse than no fact at all,” per their public engineering material) implicitly requires chunk-level verification before any claim makes it to the user.

The full architecture, end to end

[Figure: Multi-hop RAG, end-to-end, production shape. User question: “What changed in our refund policy?” 1. Decompose (HyDE / sub-questions): “what was the 2023 policy?” + “what is the 2025 policy?” + “what diffs?” 2. Iterative retrieval (FLARE / IRCoT): per sub-question; reasoning state preserved between hops; budget capped. 3. Verify retrievals (Self-RAG / CRAG): relevance score per chunk; reject low-confidence; trigger re-retrieval if needed. 4. Synthesise with citations linked to verified chunks. Corrective path: re-decompose on verification fail.]
The four-stage architecture that works in production. The dashed line is the corrective path — when verification fails, the system re-decomposes and retries rather than committing to a half-confident answer.

The full pipeline is four stages with one corrective loop:

  1. Decompose. Break the user query into structured sub-questions (and optionally HyDE-expand each to a pseudo-document for embedding).
  2. Iteratively retrieve. For each sub-question, run a small loop of retrieve → partial reason → maybe retrieve again, with a hard hop budget.
  3. Verify retrievals. Score each chunk for relevance and support; reject low-confidence chunks; trigger re-retrieval (or fallback to web search, CRAG-style) when the verification score is ambiguous.
  4. Synthesise with grounded citations. The final answer is produced only from verified chunks, with every claim explicitly linked to the chunk it came from.

The corrective loop is what most teams skip and regret. When verification fails, the right behaviour isn’t to “do best effort” — it’s to re-decompose the sub-question or escalate (to a different source, a broader search, or “I don’t know”). The “I don’t know” branch is particularly important: the system that occasionally says “I couldn’t verify this with high confidence” is the system humans learn to trust.
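The four stages and the corrective loop can be wired together roughly as follows. This is a sketch under stated assumptions: every callable (`decompose`, `retrieve`, `verify`, `synthesise`) is a hypothetical placeholder for the corresponding stage, `retrieve` stands in for the whole stage-2 loop, and the fakes exist only to exercise the control flow.

```python
from typing import Callable, Optional

UNVERIFIED = "I couldn't verify this with high confidence."


def answer_question(
    question: str,
    decompose: Callable[[str, Optional[list[str]]], list[str]],
    retrieve: Callable[[str], list[str]],
    verify: Callable[[str, str], bool],
    synthesise: Callable[[str, dict[str, list[str]]], str],
    max_attempts: int = 2,
) -> str:
    failed: Optional[list[str]] = None
    for _ in range(max_attempts):
        # Stage 1: decompose, passing along sub-questions that failed last time.
        sub_questions = decompose(question, failed)
        verified: dict[str, list[str]] = {}
        failed = []
        for sq in sub_questions:
            # Stages 2 and 3: retrieve, then keep only chunks that pass verification.
            chunks = [c for c in retrieve(sq) if verify(sq, c)]
            if chunks:
                verified[sq] = chunks
            else:
                failed.append(sq)
        if sub_questions and not failed:
            # Stage 4: synthesise only from verified chunks.
            return synthesise(question, verified)
    return UNVERIFIED  # the "I don't know" branch


DOCS = {
    "refund policy 2023": ["2023: refunds within 30 days"],
    "refund policy 2025": ["2025: refunds within 14 days"],
}


def fake_decompose(question, failed):
    if failed is None:
        return ["refund policy 2023", "returns 2025"]    # second one retrieves nothing
    return ["refund policy 2023", "refund policy 2025"]  # rephrased after the failure


def fake_synthesise(question, verified):
    return " | ".join(f"{c} [{sq}]" for sq, chunks in verified.items() for c in chunks)


result = answer_question("What changed in our refund policy?", fake_decompose,
                         lambda sq: DOCS.get(sq, []), lambda sq, c: True, fake_synthesise)
gave_up = answer_question("What is the refund policy on Mars?", lambda q, f: ["mars policy"],
                          lambda sq: DOCS.get(sq, []), lambda sq, c: True, fake_synthesise)
```

The first call recovers on its second attempt because the failed sub-question is rephrased; the second call exhausts its attempts and returns the explicit “couldn’t verify” message instead of a best-effort guess.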

What Harvey and Hebbia actually report

The evidence both companies have published, self-reported as it is, points in the same direction:

  • Harvey reports that their agentic search system improved tool selection precision from “near zero to 0.8-0.9” and enabled complex legal queries to scale from “single tool calls to 3-10 retrieval operations as needed,” with the system explicitly evaluating completeness and synthesising citation-backed responses. They report 91% preference over ChatGPT in tax-law applications.
  • Hebbia reports that their Matrix architecture (multi-agent + iterative source decomposition) achieves 92% accuracy on rigorous financial-and-legal benchmarks, versus 68% for “out-of-the-box RAG.” Their public material is explicit that citation-first design is load-bearing — they treat un-sourced facts as worse than no fact at all.

Neither result requires you to take their numbers at face value to see the pattern: both are explicitly multi-stage, both decompose before retrieving, both score chunks before committing, and both treat the citation chain as a first-class output.

What’s still hard

The architecture works. But three problems remain unsolved and worth naming honestly:

  • The verifier is itself an LLM, and LLMs can be wrong about relevance. Self-RAG and CRAG both improve hallucination rates substantially but don’t eliminate them. Going by the benchmark figures above, a residual 5-10% hallucination rate at the chunk level looks like the floor for current techniques.
  • Heterogeneous corpora are still painful. Mixing structured databases, PDFs, web content, and internal vault documents in a single retrieval pipeline produces relevance-scoring artefacts that no single embedding model handles well. Harvey’s system has separate retrieval paths per source type for this exact reason.
  • Latency scales with hops. A 5-hop iterative retrieval pipeline takes 5-10 seconds end-to-end. For interactive products this is borderline; for deep-research tools (Harvey, Hebbia, Perplexity Pro) it’s acceptable but a tax. Aggressive caching of decompositions and retrievals helps; sub-second multi-hop is still not where the industry is.

What to take away

Three production rules:

  • Never let the LLM decide on multi-hop without explicit decomposition first. Naive “do I need to search again?” loops are how you get intermediate hallucinations confirmed by retrieved chunks.
  • Verify retrievals before synthesis. Self-RAG and CRAG variants are load-bearing in every production system that actually trusts its multi-hop outputs.
  • Citations are not a UI feature — they’re the verification surface. A claim without a citation linked to a verified chunk is a hallucination with extra steps. Teach your users to check the citations and the remaining hallucinations become catchable.
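The third rule can be enforced mechanically before an answer is shown. The sketch below assumes a hypothetical citation format of bracketed chunk ids such as `[c1]` placed before the sentence’s final punctuation, plus naive sentence splitting; both are illustrative choices, not a standard.

```python
import re

CITATION = re.compile(r"\[(\w+)\]")


def unsupported_sentences(answer: str, verified_ids: set[str]) -> list[str]:
    """Return sentences that have no citation or cite a chunk that was not verified."""
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= verified_ids:
            problems.append(sentence)
    return problems


draft = ("Refunds were allowed for 30 days in 2023 [c1]. "
         "Refunds are allowed for 14 days in 2025 [c9]. "
         "The window was halved.")
flagged = unsupported_sentences(draft, verified_ids={"c1", "c2"})
```

Here the second sentence is flagged because `c9` never passed verification, and the third because it carries no citation at all. A pipeline could drop flagged sentences, send them back for re-retrieval, or surface them to the user as unverified.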

The hallucination doesn’t go away — the techniques in this post don’t push it to zero. But they corner it. Production multi-hop RAG in 2026 isn’t a guarantee of correctness; it’s a system that makes incorrectness visible, verifiable, and rare enough that humans can trust the workflow. That’s what shipping looks like.


Further reading: the HyDE paper, the FLARE paper, the IRCoT paper, the Self-RAG paper, the CRAG paper, and Harvey’s agentic search blog post for the production architecture story.
