Rag Scenarios And Solutions
Hallucination in Responses
LLMs generate confident but factually incorrect information not present in retrieved context, leading to wrong answers despite having correct source material.
TL;DR
LLMs generate confident but factually incorrect information not present in retrieved context, leading to wrong answers despite having correct source material.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
LLMs generate confident but factually incorrect information not present in retrieved context, leading to wrong answers despite having correct source material.
Symptoms
- ❌ AI invents facts not in retrieved documents
- ❌ Confident responses with fabricated details
- ❌ Mixing real and made-up information
- ❌ Cannot cite source for invented claims
- ❌ Plausible-sounding but wrong answers
Real-World Example
Retrieved context: "API rate limit is 1000 requests per hour"
User query: "What happens if I exceed the rate limit?"
AI response: "If you exceed the rate limit of 1000 requests per hour,
your account will be temporarily suspended for 15 minutes and you'll
receive a 429 error. After three violations, your API key will be
permanently revoked."
Problem: Context only mentions the limit
→ "15 minutes suspension" - INVENTED
→ "three violations" policy - INVENTED
→ "permanent revocation" - INVENTED
Only "1000 requests/hour" and "429 error" might be accurate
Deep Technical Analysis
Retrieval-Generation Gap
LLM operates beyond retrieved context:
Context Window Usage:
Retrieved chunks (3000 tokens):
→ Mentions: Rate limit exists
→ States: 1000 per hour
→ Missing: Enforcement behavior
LLM trained on internet data:
→ "Knows" typical API behaviors
→ Generalizes from training
→ Fills gaps with plausible patterns
Result: Blends context + training knowledge
→ Cannot distinguish what's in context vs learned
→ Generates "typical" API behavior
Pattern Completion Bias:
LLM sees pattern: "rate limit → enforcement"
Training data common pattern:
→ Rate limits typically have retry delays
→ Often 60-second cooldowns
→ Some APIs ban after violations
Generates archetypal response:
→ Even if specific API doesn't work that way
→ "Sounds right" but isn't factually correct
Instruction Following vs Grounding
Tension between creativity and accuracy:
System Prompt Dilemma:
Instruction: "Answer based only on provided context"
But LLM also trained to:
→ Be helpful and complete
→ Answer user's actual question
→ Provide useful information
User asks: "What happens if I exceed limit?"
Context: Silent on this
LLM choice:
A) Say "I don't know" (honest but unhelpful)
B) Infer typical behavior (helpful but risky)
Often chooses B → hallucination
The Helpful Assistant Problem:
LLM fine-tuned to be helpful:
→ Avoid saying "I don't know"
→ Provide complete answers
→ Be conversational
Conflicts with RAG goal:
→ Only cite sources
→ Admit gaps in knowledge
→ Stay strictly grounded
Training objectives misaligned with RAG requirements
Confidence Calibration Failure
LLMs don't know what they don't know:
Equal Confidence for All Outputs:
Response based on context: "The limit is 1000/hour"
Response invented: "Suspension lasts 15 minutes"
Both generated with same confidence:
→ No internal uncertainty signal
→ User cannot distinguish
→ Both sound equally authoritative
LLM lacks epistemic awareness
→ Doesn't track source of information
→ Cannot flag invented content
The Plausibility Trap:
Hallucinated content is often plausible:
→ Follows typical API patterns
→ Uses correct terminology
→ Grammatically coherent
→ Logically consistent
Harder to detect than random nonsense:
→ "15-minute suspension" sounds reasonable
→ "Three strikes policy" is common pattern
→ User may accept as fact
Context Length Limitations
Retrieved context may be insufficient:
Incomplete Information:
Query: "Complete setup procedure"
Retrieved (partial):
→ Step 1: Install package
→ Step 3: Configure settings
Missing: Step 2
LLM fills gap:
→ Invents "Step 2: Initialize database"
→ Sounds logical
→ May be wrong
Should instead: Acknowledge missing steps
Contradictory Sources:
Chunk A: "Premium plan includes 10 users"
Chunk B: "Premium supports up to 5 users"
LLM must reconcile:
→ May choose one arbitrarily
→ May average: "Premium offers 5-10 users"
→ May invent: "Premium basic has 5, Premium plus has 10"
Adding details not in either source
Mitigation Strategies
1. Aggressive System Prompting:
Prompt: "ONLY answer using the provided context. If the
context doesn't contain the answer, respond: 'I don't
have that information in my knowledge base.' Never make
up information."
Helps but doesn't eliminate hallucination
→ LLM still fills gaps unconsciously
→ Instruction adherence ~85%, not 100%
2. Citation Requirements:
Prompt: "For each claim, cite the source chunk. Format:
[Source: chunk_id]"
Forces grounding:
→ Cannot cite invented facts
→ Makes hallucination visible
→ User can verify claims
Example: "Rate limit is 1000/hour [Source: api_docs_chunk_5]"
3. Confidence Scoring:
Ask LLM: "On scale 1-5, how confident are you this
information is in the provided context?"
Response: "Based on context, rate limit is 1000/hour
(confidence: 5). Enforcement behavior: likely temporary
suspension (confidence: 2 - inferred, not stated)."
Makes uncertainty explicit
4. Two-Stage Generation:
Stage 1: Extract relevant facts from context
→ LLM outputs: "Rate limit: 1000/hour. Enforcement: not specified"
Stage 2: Answer using only extracted facts
→ "The rate limit is 1000 requests per hour. The enforcement
mechanism is not documented in the provided materials."
Structured pipeline reduces hallucination
How to Solve
Use strict system prompts requiring citation + implement confidence scoring + apply two-stage extraction-then-answer pipeline + penalize hallucination in prompt engineering + use models fine-tuned for RAG (instruction-following models). See Hallucination Prevention.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/llm/hallucination-deep.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


