RAG Scenarios and Solutions
Context Window Overflow
Retrieved chunks exceed LLM context window capacity, forcing truncation and losing critical information needed for accurate answers.
The Problem
When the retrieved chunks plus prompt overhead exceed the model's context window, the pipeline has to truncate, and the text that gets dropped may be exactly what an accurate answer depends on.
Symptoms
- ❌ "Context length exceeded" errors
- ❌ Later chunks cut off mid-sentence
- ❌ Inconsistent answers (depends on what fits)
- ❌ Cannot use all relevant retrieved docs
- ❌ Quality degrades with more context
Real-World Example
LLM context window: 8,000 tokens
System prompt: 500 tokens
User query: 50 tokens
Response generation buffer: 1,000 tokens
Available for retrieval: 6,450 tokens
Retrieved top-10 chunks (1,000 tokens each):
→ Total: 10,000 tokens
→ Exceeds available 6,450 tokens
→ Chunk 7 cut off after 450 tokens; chunks 8-10 dropped entirely
Most relevant chunk was #8 (dropped)
→ The model never sees it
→ Gives an incomplete answer
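The arithmetic in this example can be replayed with a short simulation. This is a minimal sketch, assuming a hard cutoff applied to chunks concatenated in retrieval order; `naive_truncate` is a hypothetical helper name, not part of any library.

```python
def naive_truncate(chunk_sizes, available):
    """Walk chunks in retrieval order and report what survives a hard token cutoff."""
    report = []
    remaining = available
    for rank, size in enumerate(chunk_sizes, start=1):
        kept = min(size, max(remaining, 0))  # tokens of this chunk still visible
        if kept == size:
            status = "kept"
        elif kept > 0:
            status = "cut"
        else:
            status = "dropped"
        report.append((rank, kept, status))
        remaining -= size
    return report


# Numbers from the example: 8,000 window, 500 system, 50 query, 1,000 reserved.
available = 8_000 - 500 - 50 - 1_000
report = naive_truncate([1_000] * 10, available)
for rank, kept, status in report:
    print(f"chunk {rank}: {status} ({kept} tokens visible)")
```

Under these assumptions chunks 1-6 survive whole, chunk 7 is cut after 450 tokens, and chunks 8-10 never reach the model.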
Deep Technical Analysis
Context Window Constraints
Models have fixed input limits:
Common Limits:
GPT-3.5-turbo: 4K tokens (16K variant available)
GPT-4: 8K tokens (32K variant; GPT-4 Turbo: 128K)
Claude 2: 100K tokens
Claude 3: 200K tokens
But practical limits are lower:
→ Need space for the response
→ System prompts consume tokens
→ Effective capacity is often ~70-80% of the nominal max
The Token Budget:
8K model breakdown:
- System prompt: 300 tokens
- Conversation history: 500 tokens (multi-turn)
- Current query: 100 tokens
- Response generation: 1,000 tokens (reserve)
- Available for context: 6,100 tokens
If chunks average 800 tokens:
→ Can fit ~7 chunks maximum
→ Must be selective
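The budget above can be computed rather than tracked by hand. The following is a sketch only: `TokenBudget` and its field names are illustrative assumptions, and the figures are the ones from the 8K breakdown.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical container for the fixed costs that eat into the context window."""
    context_window: int
    system_prompt: int
    history: int
    query: int
    response_reserve: int

    def available_for_context(self) -> int:
        used = self.system_prompt + self.history + self.query + self.response_reserve
        return max(self.context_window - used, 0)

    def max_chunks(self, avg_chunk_tokens: int) -> int:
        if avg_chunk_tokens <= 0:
            raise ValueError("avg_chunk_tokens must be positive")
        # Floor division: a partially fitting chunk does not count.
        return self.available_for_context() // avg_chunk_tokens


budget = TokenBudget(
    context_window=8_000, system_prompt=300, history=500, query=100, response_reserve=1_000
)
print(budget.available_for_context())  # 6100
print(budget.max_chunks(800))          # 7
```

Recomputing the budget per request matters in multi-turn use, because conversation history grows while the window does not.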
Truncation Strategies
When context exceeds limit:
Rank-Order Truncation (last-ranked chunks are dropped first):
Include chunks in relevance order until the budget is full:
→ Chunk 1 (highest relevance): Included
→ Chunk 2: Included
→ ...
→ Chunk 8: Included (reaches limit)
→ Chunks 9-10: Truncated
Problem: Lower-ranked chunks may have unique info
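A minimal sketch of this strategy, assuming chunks arrive sorted by relevance and that a whitespace split is an acceptable stand-in for a real tokenizer (it usually undercounts). `pack_by_rank` is a hypothetical name.

```python
def pack_by_rank(chunks, budget, count_tokens=lambda text: len(text.split())):
    """Keep whole chunks in rank order; stop at the first one that does not fit."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # everything ranked below this point is dropped
        kept.append(chunk)
        used += cost
    return kept, used


ranked = ["a b c", "d e", "f g h i", "j"]
kept, used = pack_by_rank(ranked, budget=6)
print(kept, used)
```

Note that the one-token chunk "j" would still have fit, but it is dropped because it ranks below the first chunk that overflowed. That is the weakness described above: lower-ranked chunks are lost regardless of what they contain.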
Sliding Window (head-and-tail truncation):
For very long context:
→ Use first N tokens
→ Use last M tokens
→ Skip middle
Preserves beginning and end:
→ Good for narratives
→ Bad for scattered information
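The first-N / last-M rule can be sketched on an already tokenized sequence. `head_tail_truncate` is a hypothetical helper, and the token list can come from whichever tokenizer the pipeline uses.

```python
def head_tail_truncate(tokens, head, tail):
    """Keep the first `head` and last `tail` tokens, skipping the middle."""
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    if len(tokens) <= head + tail:
        return list(tokens)  # already short enough, nothing to skip
    # Guard tail == 0: tokens[-0:] would return the whole sequence.
    ending = list(tokens[-tail:]) if tail else []
    return list(tokens[:head]) + ending


tokens = list(range(10))
print(head_tail_truncate(tokens, head=3, tail=2))  # [0, 1, 2, 8, 9]
```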
Smart Truncation:
Analyze chunks:
→ Compute diversity score
→ Prioritize unique information
→ Include redundant content last
Maximizes information density within limit
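One way to approximate this is sketched below. It assumes word-set Jaccard overlap as the redundancy measure, which is a crude stand-in for embedding similarity, and the 0.6 threshold is an arbitrary illustrative value. `diverse_select` is a hypothetical name.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diverse_select(chunks, budget, redundancy_threshold=0.6):
    """Chunks arrive in relevance order. Novel chunks are packed first;
    near-duplicates of already selected chunks are considered last."""
    selected, deferred, seen, used = [], [], [], 0
    for chunk in chunks:
        words = set(chunk.lower().split())
        cost = len(chunk.split())  # whitespace token count (assumption)
        if any(jaccard(words, s) >= redundancy_threshold for s in seen):
            deferred.append((chunk, cost))
            continue
        if used + cost <= budget:
            selected.append(chunk)
            seen.append(words)
            used += cost
    for chunk, cost in deferred:  # redundant content fills leftover space only
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected


docs = ["the cat sat on the mat", "the cat sat on a mat", "dogs bark loudly at night"]
print(diverse_select(docs, budget=12))
```

With a budget of 12 the near-duplicate second chunk is skipped in favor of the third, which carries new information.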
Chunk Size vs Retrieval K Trade-off
Competing objectives:
Large Chunks:
Chunk size: 2,000 tokens
Capacity: 6,000 tokens
Chunks that fit: 3
Pros:
+ More context per chunk
+ Preserves coherence
Cons:
- Fewer chunks (less diversity)
- Higher chance of irrelevant content
Small Chunks:
Chunk size: 500 tokens
Capacity: 6,000 tokens
Chunks that fit: 12
Pros:
+ More information sources
+ Higher precision
Cons:
- Less context per chunk
- May lose coherence
Dynamic Adjustment:
Measure query complexity:
→ Simple factual: Small chunks, high K
→ Complex explanatory: Large chunks, low K
Adaptive strategy based on query type
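A rough sketch of such an adaptive rule follows. The cue words, the 12-word length threshold, and the function name `retrieval_params` are all assumptions made for illustration; a production system would more likely use a trained query classifier.

```python
EXPLANATORY_CUES = ("why", "how", "explain", "compare", "difference", "trade-off")


def retrieval_params(query, context_budget=6_000):
    """Pick chunk size and K from a simple guess at query complexity."""
    words = [w.strip("?.,!:;") for w in query.lower().split()]
    is_complex = len(words) > 12 or any(cue in words for cue in EXPLANATORY_CUES)
    chunk_size = 2_000 if is_complex else 500
    # K is whatever number of chunks of that size fits the budget.
    return {"chunk_size": chunk_size, "top_k": context_budget // chunk_size}


print(retrieval_params("What is the capital of France?"))
print(retrieval_params("Explain how vector search works"))
```

The two calls reproduce the small-chunk (500 tokens, K=12) and large-chunk (2,000 tokens, K=3) configurations from the comparison above.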
Lossy Compression Techniques
Summarize context to fit:
Extractive Summarization:
For each chunk:
1. Extract key sentences (top 30%)
2. Discard filler content
3. Preserve critical facts
Original: 1,000 tokens → Compressed: 300 tokens
→ Roughly 3x as many chunks fit
→ But: Some nuance lost
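The extraction step can be sketched with a simple frequency heuristic. This is an assumption, not the method the page prescribes: sentences are scored by the average document frequency of their words, which is a crude proxy for importance, and the regex sentence splitter is deliberately naive.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9']+")


def extractive_summary(text, keep_ratio=0.3):
    """Keep roughly the top `keep_ratio` of sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return ""
    freq = Counter(WORD.findall(text.lower()))

    def score(sentence):
        words = WORD.findall(sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    n_keep = max(1, round(len(sentences) * keep_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Re-sort the winners by position so the summary reads in document order.
    return " ".join(sentences[i] for i in sorted(ranked[:n_keep]))


sample = " ".join(f"Sentence number {i} talks about tokens." for i in range(10))
summary = extractive_summary(sample)
print(summary)
```

Because only existing sentences are kept, this approach cannot introduce facts that were not in the chunk, at the price of losing whatever the discarded sentences said.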
Abstractive Summarization:
Use LLM to summarize chunks:
→ "Summarize in 100 words"
→ Condenses information
→ Rephrases for brevity
Trade-off:
+ Fit more content
- Risk: Summary LLM may hallucinate
- Extra API call (cost, latency)
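Because each summary costs an extra call, one option is to summarize only when the budget is actually exceeded, and only the lower-ranked chunks. The sketch below assumes the summarizer is passed in as a callable; `fake_summarize` is a stand-in that merely keeps the first five words so the example runs offline, and a real pipeline would replace it with an LLM call.

```python
def compress_if_needed(chunks, budget, summarize, keep_verbatim=2,
                       count_tokens=lambda text: len(text.split())):
    """Leave chunks untouched when they fit; otherwise summarize the lower-ranked ones."""
    if sum(count_tokens(c) for c in chunks) <= budget:
        return list(chunks)  # no extra calls, no hallucination risk
    head = list(chunks[:keep_verbatim])                   # top-ranked chunks stay verbatim
    tail = [summarize(c) for c in chunks[keep_verbatim:]]  # one call per remaining chunk
    return head + tail


def fake_summarize(text):
    # Stand-in only: NOT abstractive. Swap in a real model call here.
    return " ".join(text.split()[:5])


chunks = [" ".join([w] * 10) for w in ("alpha", "beta", "gamma")]
result = compress_if_needed(chunks, budget=25, summarize=fake_summarize)
print(result[2])
```

This sketch does not verify that the compressed result fits the budget; a caller would still need a final check, for example truncating or dropping the lowest-ranked summaries.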
Hierarchical Context Assembly
Multi-level retrieval:
Coarse-to-Fine:
Stage 1: Broad retrieval
→ Get top-20 chunks
→ Scan for relevant sections
Stage 2: Focused retrieval
→ Identify most relevant 3 sections
→ Retrieve full content for those
→ Discard others
Progressive refinement within token budget
Section-Level Granularity:
Store documents at multiple levels:
→ Document summary (200 tokens)
→ Section summaries (500 tokens each)
→ Full sections (2,000 tokens each)
Retrieval:
1. Match at summary level
2. Fetch full sections only for top matches
3. Efficient token usage
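The summary-first lookup can be sketched as follows. The `Section` structure, the keyword-overlap scorer, and `hierarchical_retrieve` are illustrative assumptions; a real system would match summaries with a vector index rather than word overlap.

```python
from dataclasses import dataclass


@dataclass
class Section:
    title: str
    summary: str    # short text used for matching
    full_text: str  # fetched only for top matches


def overlap_score(query, text):
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hierarchical_retrieve(query, sections, top_n=3, budget=6_000,
                          count_tokens=lambda text: len(text.split())):
    """Rank by summary, then pull full text for the best matches within budget."""
    ranked = sorted(sections, key=lambda s: overlap_score(query, s.summary), reverse=True)
    context, used = [], 0
    for section in ranked[:top_n]:
        cost = count_tokens(section.full_text)
        if used + cost > budget:
            continue  # skip a section that would overflow the budget
        context.append(section.full_text)
        used += cost
    return context


sections = [
    Section("Billing", "how refunds and billing work", "Refunds are issued within five days."),
    Section("Auth", "login and password reset", "Reset your password from the login page."),
    Section("API", "rate limits for the api", "The API allows 100 requests per minute."),
]
print(hierarchical_retrieve("password reset login", sections, top_n=1))
```

Only the matched section's full text is spent against the budget; the other sections cost nothing beyond their short summaries at indexing time.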
Long-Context Model Considerations
Models with larger windows:
Claude 200K Benefits:
Can include entire documents:
→ Often no truncation needed (when the documents fit)
→ Full context available
But:
→ More expensive
→ Slower inference
→ Diminishing returns (recall degradation)
The Lost-in-the-Middle Problem:
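Research shows (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts"):
→ LLMs use information at the beginning and end of the context most reliably
→ Content in the middle is more likely to be missed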
Even with 100K context:
→ Relevant info in middle may be missed
→ Positioning matters
Optimal Positioning:
Place most relevant chunks:
→ At beginning (primacy)
→ At end (recency)
Less relevant:
→ In middle
Despite large window, order still matters
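The placement rule can be sketched as a reordering step applied just before the prompt is assembled. It assumes the input list is sorted from most to least relevant; `order_for_position` is a hypothetical name.

```python
def order_for_position(chunks_by_relevance):
    """Alternate chunks between the front and the back of the context so the
    strongest land at the edges and the weakest end up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(order_for_position(ranked))         # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here the top chunk opens the context, the second-best closes it, and the least relevant chunk sits in the middle where it is most likely to be overlooked anyway.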
How to Solve
Combine the following, depending on workload:
- Implement dynamic chunk sizing based on query type
- Use extractive summarization for less critical chunks
- Prioritize top-K chunks and truncate lower-ranked ones
- Consider long-context models (Claude 200K) for comprehensive docs
- Apply smart truncation that preserves key information
See Context Management.
Last updated January 26, 2026


