Hallucination & grounding
Why models hallucinate, and how to measure faithfulness for real — decompose an answer into atomic claims and check each against the source. The metric, and the limit current evaluators hit.
What you'll learn
- Why LLMs hallucinate and what 'grounded' actually means
- How to measure faithfulness by decomposing answers into atomic claims
- The blind spot — telling grounded facts from plausible inferences
Before you start
A hallucination isn’t the model “lying” — it’s the model doing exactly what it was built to do: predict plausible next tokens. When it doesn’t know, it generates something that sounds right, fluently and confidently. That’s why hallucination is so dangerous: the wrong answer looks identical to the right one. The defense is grounding — tying answers to a source — and the skill is measuring whether that grounding actually held.
What “grounded” means
An answer is grounded when its claims are supported by a source you trust (the retrieved context in RAG, a tool result, a document). RAG is grounding by construction — but grounding is a goal, not a guarantee. A model can still ignore the context, over-read it, or blend it with memorized (and wrong) priors. So you have to check.
Measure faithfulness, don’t eyeball it
The reliable method: decompose the answer into atomic claims, then check each one against the source and label it:
- Grounded — directly supported by the source.
- Inferred — plausible, but not actually in the source. The sneaky one.
- Contradicted — conflicts with the source. The dangerous one.
Faithfulness is the fraction of claims that are grounded. Scored this way, a fluent, confident answer can still turn out to be mostly air.
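As a minimal sketch of that scoring, assume the claims have already been extracted and labeled by hand. The source sentence, the two answers, their labels, and the `faithfulness` helper are all hypothetical illustrations, not output from a real evaluator.

```python
from collections import Counter

LABELS = {"grounded", "inferred", "contradicted"}

def faithfulness(labeled_claims):
    """Return the fraction of claims labeled 'grounded' (0.0 if there are no claims)."""
    labels = [label for _, label in labeled_claims]
    unknown = set(labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if not labels:
        return 0.0
    return Counter(labels)["grounded"] / len(labels)

# Hypothetical source: "The battery lasts 10 hours and charges over USB-C."
terse_answer = [
    ("The battery lasts 10 hours.", "grounded"),
    ("It charges over USB-C.", "grounded"),
]
fluent_answer = [
    ("The battery lasts 10 hours.", "grounded"),
    ("That makes it ideal for long flights.", "inferred"),
    ("It supports fast charging.", "inferred"),
    ("It charges over a proprietary connector.", "contradicted"),
]

print(faithfulness(terse_answer))   # 1.0
print(faithfulness(fluent_answer))  # 0.25
```

The longer answer reads as more helpful, yet only one of its four claims is supported by the source.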
That decomposition is exactly what faithfulness metrics in RAG eval frameworks (RAGAS, TruLens, DeepEval) automate — usually with an LLM judge that extracts claims and verifies each against the retrieved context.
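The shape of such a pipeline can be sketched without committing to any framework's API. Everything below is an assumption for illustration: `split_into_claims` is a naive sentence splitter standing in for LLM claim extraction, and `toy_judge` is a word-overlap heuristic standing in for an LLM judge. Neither reflects how RAGAS, TruLens, or DeepEval are actually implemented.

```python
import re
from typing import Callable

def split_into_claims(answer: str) -> list[str]:
    """Naive stand-in for claim extraction: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]

def toy_judge(claim: str, context: str) -> bool:
    """Stand-in for an LLM judge: every word of the claim must appear in the context."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return bool(claim_words) and claim_words <= context_words

def score_answer(answer: str, context: str,
                 judge: Callable[[str, str], bool] = toy_judge):
    """Extract claims, verify each against the context, return (score, verdicts)."""
    claims = split_into_claims(answer)
    verdicts = [(claim, judge(claim, context)) for claim in claims]
    supported = sum(ok for _, ok in verdicts)
    score = supported / len(claims) if claims else 0.0
    return score, verdicts

# Hypothetical retrieved context and model answer.
context = "The battery lasts 10 hours. It charges over USB-C."
answer = "The battery lasts 10 hours. It supports fast charging."

score, verdicts = score_answer(answer, context)
print(score)  # 0.5
for claim, ok in verdicts:
    print("supported  " if ok else "unsupported", claim)
```

Keeping the judge as a swappable function means the word-overlap heuristic can later be replaced by a call to a real model without touching the scoring logic.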
Reducing hallucination
- Ground it. Retrieval + “answer only from the context; if it’s not there, say you don’t know” is the single biggest lever.
- Ask for citations. Make the model cite which chunk supports each claim. Uncited sentences are your hallucination suspects, and cited ones can be checked directly against the chunk they name.
- Constrain when you can. For structured facts, constrained decoding and tool calls beat free-form prose.
- Calibrate uncertainty. Encourage “I don’t know” over a confident guess; measure how often the model abstains when it should.
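Three of these levers can be sketched in a few lines. The prompt wording, the `[n]` citation format (placed before each sentence's final punctuation), and the "don't know" abstention marker are assumptions chosen for this example, not a standard. The helper names are hypothetical too.

```python
import re

def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer only from the context below. "
        "End each sentence with the chunk that supports it, like [1], before the period. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )

CITATION = re.compile(r"\[(\d+)\]")

def uncited_sentences(answer: str, num_chunks: int) -> list[str]:
    """Return sentences with no citation, or with citations only to nonexistent chunks."""
    suspects = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = [int(n) for n in CITATION.findall(sentence)]
        if not any(1 <= n <= num_chunks for n in cited):
            suspects.append(sentence)
    return suspects

def abstention_rate(answers_to_unanswerable: list[str], marker: str = "don't know") -> float:
    """Share of unanswerable questions whose answer contains the abstention marker."""
    if not answers_to_unanswerable:
        return 0.0
    hits = sum(marker in a.lower() for a in answers_to_unanswerable)
    return hits / len(answers_to_unanswerable)

# Hypothetical model output for a prompt built from two chunks.
answer = "The battery lasts 10 hours [1]. It supports fast charging. It ships in blue [7]."
print(uncited_sentences(answer, num_chunks=2))
# ['It supports fast charging.', 'It ships in blue [7].']

print(abstention_rate(["I don't know.", "It lasts 5 hours."]))  # 0.5
```

A citation check like this only shows that a sentence names a chunk, not that the chunk supports it, so it complements a faithfulness check and does not replace one.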
Next
Grounding metrics are a specialized branch of LLM evals, and they pair directly with RAG evaluations. To cut cost while keeping quality high, see model routing.
Practice this in an interview
Hallucinations are fluent but unsupported or false outputs, arising because LLMs predict likely text rather than retrieve verified facts and have no built-in grounding. Mitigations include retrieval-augmented grounding with citations, constraining the model to answer only from provided context, lower temperature, verification or self-check steps, and faithfulness-focused evaluation.
Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts. It has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses in retrieved sources, lowering temperature, and calibrated refusal training.
Check that the retrieved chunk actually contains the answer and survived the context window without truncation, then inspect prompt construction and instructions telling the model to answer only from context. Add grounding and citation requirements, lower temperature, and use a faithfulness metric or judge to verify the answer is entailed by the retrieved text, also checking for conflicting context or parametric-knowledge override.
Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.
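For the retrieval half, a simplified set-based version of precision and recall is enough to show the idea. This is a sketch under stated assumptions: the chunk IDs and relevance labels are hypothetical, and framework definitions of context precision and recall can differ from this (for example by taking rank into account).

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall over chunk IDs (ignores ranking)."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 chunks retrieved, 3 chunks labeled relevant by a human.
precision, recall = retrieval_precision_recall(
    retrieved_ids=["c1", "c4", "c7", "c9"],
    relevant_ids=["c1", "c2", "c4"],
)
print(precision)         # 0.5
print(round(recall, 3))  # 0.667
```

Low recall means the answer was never in front of the model, so a faithfulness problem downstream may really be a retrieval problem.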