What causes hallucinations in LLMs and how do you mitigate them?

Hallucinations are fluent but unsupported or false outputs, arising because LLMs predict likely text rather than retrieve verified facts and have no built-in grounding. Mitigations include retrieval-augmented grounding with citations, constraining the model to answer only from provided context, lower temperature, verification or self-check steps, and faithfulness-focused evaluation.

Your RAG system is hallucinating even though the correct context was retrieved. How do you debug it?

Check that the retrieved chunk actually contains the answer and survived the context window without truncation, then inspect prompt construction and instructions telling the model to answer only from context. Add grounding and citation requirements, lower temperature, and use a faithfulness metric or judge to verify the answer is entailed by the retrieved text, also checking for conflicting context or parametric-knowledge override.
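The first check above (did the answer-bearing text survive truncation?) can be sketched roughly as follows. This is a minimal illustration, not a real pipeline: `build_context`, `answer_survived`, and the character budget are hypothetical names and assumptions, and a real system would budget in tokens rather than characters.

```python
def build_context(chunks, max_chars):
    """Concatenate chunks in order, cutting off at a hypothetical character budget."""
    parts, used = [], 0
    for chunk in chunks:
        remaining = max_chars - used
        if remaining <= 0:
            break
        piece = chunk[:remaining]  # a chunk may be cut mid-sentence
        parts.append(piece)
        used += len(piece)
    return "\n".join(parts)


def answer_survived(context, answer_span):
    """Case-insensitive check that the expected answer text is still in the prompt context."""
    return answer_span.lower() in context.lower()


chunks = [
    "Shipping takes 3 to 5 business days.",
    "Enterprise refunds are available within 30 days of invoice.",
]
answer_span = "within 30 days of invoice"

full = build_context(chunks, max_chars=500)  # generous budget: both chunks fit
cut = build_context(chunks, max_chars=60)    # tight budget: second chunk is truncated

print("answer in full context:", answer_survived(full, answer_span))
print("answer in truncated context:", answer_survived(cut, answer_span))
```

If the second check fails, the model never saw the answer, and the fix is in chunk ordering or budget rather than in the prompt instructions.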

What causes LLM hallucinations and how can they be reduced?

Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts — it has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses to retrieved sources, lowering temperature, and calibrated refusal training.
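To make "lowering temperature" concrete, here is a small sketch of temperature-scaled softmax over made-up logits. The logits and the function name `softmax_with_temperature` are illustrative assumptions. Note what it does and does not show: a lower temperature concentrates probability on the top-ranked token, which reduces sampling of unlikely continuations, but it cannot help when the top-ranked token is itself wrong.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temperature in (1.0, 0.5):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")
```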

How do you evaluate the quality of an LLM or RAG system?

Evaluation splits into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them correctly?). Key metrics are context precision/recall for retrieval and faithfulness plus answer relevance for generation. Frameworks like RAGAS automate LLM-as-judge scoring; human annotation anchors the ground truth.
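As a rough sketch of the retrieval half, the snippet below computes simple set-based context precision and recall from human-labelled relevant chunk IDs. The chunk IDs and function names are hypothetical, and these are the plain set-based definitions, not necessarily the exact formulas any particular framework uses.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    found = sum(1 for chunk_id in relevant if chunk_id in retrieved)
    return found / len(relevant)


retrieved = ["c1", "c4", "c7", "c9"]  # what the retriever returned
relevant = {"c1", "c2", "c7"}         # what annotators marked as needed

print(f"context precision = {context_precision(retrieved, relevant):.2f}")
print(f"context recall    = {context_recall(retrieved, relevant):.2f}")
```

Low precision means the prompt is padded with noise; low recall means the model was never given what it needed, so generation metrics for that query say little about the model.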

Hallucination & grounding — Generative AI

A hallucination isn’t the model “lying” — it’s the model doing exactly what it was built to do: predict plausible next tokens. When it doesn’t know, it generates something that sounds right, fluently and confidently. That’s why hallucination is so dangerous: the wrong answer looks identical to the right one. The defense is grounding — tying answers to a source — and the skill is measuring whether that grounding actually held.

What “grounded” means

An answer is grounded when its claims are supported by a source you trust (the retrieved context in RAG, a tool result, a document). RAG is grounding by construction — but grounding is a goal, not a guarantee. A model can still ignore the context, over-read it, or blend it with memorized (and wrong) priors. So you have to check.

Measure faithfulness, don’t eyeball it

The reliable method: decompose the answer into atomic claims, then check each one against the source and label it:

Grounded — directly supported by the source.
Inferred — plausible, but not actually in the source. The sneaky one.
Contradicted — conflicts with the source. The dangerous one.

Faithfulness is the fraction of claims that are grounded. Decompose one fluent, confident answer this way and much of it can turn out to be air: in the refund-policy example worked through in this lesson, two claims are supported, one is quietly invented, and one is outright wrong.

That decomposition is exactly what faithfulness metrics in RAG eval frameworks (RAGAS, TruLens, DeepEval) automate — usually with an LLM judge that extracts claims and verifies each against the retrieved context.
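The shape of that automated pipeline might look something like the sketch below. It does not reproduce the API of any framework named above: `score_answer` and `toy_judge` are hypothetical, the claims are assumed to be already extracted, and the toy judge is a substring match standing in for an LLM call (so it can never return "contradicted").

```python
from typing import Callable, List, Tuple

Label = str  # one of "grounded", "inferred", "contradicted"


def score_answer(
    claims: List[str],
    source: str,
    judge: Callable[[str, str], Label],
) -> Tuple[float, List[Tuple[str, Label]]]:
    """Label every claim with the judge, then return the grounded fraction and the labels."""
    labelled = [(claim, judge(claim, source)) for claim in claims]
    grounded = sum(1 for _, label in labelled if label == "grounded")
    score = grounded / len(labelled) if labelled else 0.0
    return score, labelled


def toy_judge(claim: str, source: str) -> Label:
    """Stand-in for an LLM judge: exact substring match only, no contradiction detection."""
    return "grounded" if claim.lower() in source.lower() else "inferred"


source = "Refunds are available within 30 days of invoice."
claims = ["within 30 days of invoice", "processed within 2 hours"]

score, labelled = score_answer(claims, source, toy_judge)
print(f"faithfulness = {score:.0%}")
for claim, label in labelled:
    print(f"  {label:<9} {claim}")
```

Keeping the judge as a plain callable makes it easy to swap the stub for a real model call and to spot-check the judge itself against human labels.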

Reducing hallucination

Ground it. Retrieval + “answer only from the context; if it’s not there, say you don’t know” is the single biggest lever.
Ask for citations. Make the model cite which chunk supports each claim — uncited sentences are your hallucination suspects, and it makes QA trivial.
Constrain when you can. For structured facts, constrained decoding and tool calls beat free-form prose.
Calibrate uncertainty. Encourage “I don’t know” over a confident guess; measure how often the model abstains when it should.
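A minimal sketch of three of the levers above: a grounded prompt with numbered chunks, a check for uncited sentences, and an abstention measurement. All names here are hypothetical, and the code assumes citations look like [1] and appear before the sentence's final punctuation. No model is called; the answer string is hand-written for illustration.

```python
import re


def build_grounded_prompt(question, chunks):
    """Number the chunks so the model can cite them as [n]."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer only from the context below. Cite the supporting chunk as [n] "
        "in every sentence. If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )


def uncited_sentences(answer):
    """Return sentences that carry no [n] citation: the hallucination suspects."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[\d+\]", s)]


def abstention_rate(records):
    """Of the questions where the model should abstain, how often did it?"""
    outcomes = [did_abstain for should_abstain, did_abstain in records if should_abstain]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Enterprise refunds are available within 30 days of invoice."],
)
answer = "Refunds are available within 30 days of invoice [1]. They are processed within 2 hours."
records = [(True, True), (True, False), (False, False)]  # (should_abstain, did_abstain)

print(uncited_sentences(answer))
print(f"abstention rate when it should abstain = {abstention_rate(records):.0%}")
```

A citation only shows that the model pointed at a chunk; whether the chunk actually supports the sentence still needs a faithfulness check.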

# Faithfulness = grounded claims / total claims.
answer_claims = [
    ("Enterprise refunds within 30 days of invoice", "grounded"),
    ("Refunds prorated after 30 days",               "grounded"),
    ("Refunds processed within 2 hours",             "inferred"),     # not in source
    ("There's a 60-day extended window",             "contradicted"), # conflicts!
]

total = len(answer_claims)
grounded = sum(1 for _, lab in answer_claims if lab == "grounded")
contradicted = sum(1 for _, lab in answer_claims if lab == "contradicted")

print(f"faithfulness (strict) = {grounded}/{total} = {grounded/total:.0%}")
print(f"contradictions = {contradicted}  <- any contradiction is a red flag")
print("\nThe 'inferred' claim reads fine but isn't supported — exactly where")
print("a fluent hallucination hides. Look at per-claim labels, not just the score.")

faithfulness (strict) = 2/4 = 50%
contradictions = 1  <- any contradiction is a red flag

The 'inferred' claim reads fine but isn't supported — exactly where
a fluent hallucination hides. Look at per-claim labels, not just the score.

The answer reads as four confident sentences, but only two are actually in the source. A holistic “looks good” glance scores it near-perfect; claim decomposition scores it 50% and surfaces the one contradicted claim as a red flag. The number isn’t the point; the per-claim labels are.

In one breath

A hallucination isn’t lying — it’s the model predicting plausible tokens when it doesn’t know, so the wrong answer looks exactly like the right one.
An answer is grounded when its claims trace to a trusted source; RAG grounds by construction but a model can still ignore or over-read the context, so you must check.
Measure faithfulness by decomposing into atomic claims and labelling each grounded / inferred / contradicted — faithfulness is the grounded fraction.
Cut hallucination by grounding (“say you don’t know”), demanding citations, constraining structured outputs, and rewarding abstention.
The 2026 blind spot: scalar scores blur inferred (plausible-but-unsupported) into the number — for high-stakes outputs read the per-claim labels, not the percentage.

Quick check

Q1. Why do LLMs hallucinate?

Q2. How is faithfulness measured by claim decomposition?

Q3. What's the 2026 limitation of scalar faithfulness scores?

Grounding metrics are a specialized branch of LLM evals, and they pair directly with RAG evaluations. To cut cost while keeping quality high, see model routing.
