Context Window Overflow

The Problem

Retrieved chunks exceed the LLM's context window, forcing truncation that discards information needed for an accurate answer.

Symptoms

❌ "Context length exceeded" errors
❌ Later chunks cut off mid-sentence
❌ Inconsistent answers (depends on what fits)
❌ Cannot use all relevant retrieved docs
❌ Quality degrades with more context

Real-World Example

LLM context window: 8,000 tokens
System prompt: 500 tokens
User query: 50 tokens
Response generation buffer: 1,000 tokens
Available for retrieval: 6,450 tokens

Retrieved top-10 chunks (1,000 tokens each):
→ Total: 10,000 tokens
→ Exceeds available 6,450 tokens
→ Only 6 chunks fit in full; chunk 7 is cut off after 450 tokens and chunks 8-10 are dropped

The chunk that actually held the answer was ranked #8 (dropped)
→ AI cannot see it
→ Gives incomplete answer
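
The arithmetic above can be checked with a short simulation. This is a minimal sketch that assumes chunks are concatenated in rank order and the prompt is cut hard at the budget; the function name `visible_tokens_per_chunk` is illustrative, not part of any library.

```python
def visible_tokens_per_chunk(chunk_sizes, budget):
    """Return how many tokens of each chunk survive a hard cut at `budget`."""
    visible = []
    remaining = budget
    for size in chunk_sizes:
        kept = min(size, remaining)  # whatever still fits, possibly zero
        visible.append(kept)
        remaining -= kept
    return visible


context_window = 8_000
# Subtract system prompt, user query, and response buffer.
budget = context_window - 500 - 50 - 1_000
visible = visible_tokens_per_chunk([1_000] * 10, budget)

for rank, kept in enumerate(visible, start=1):
    status = "full" if kept == 1_000 else "partial" if kept > 0 else "dropped"
    print(f"chunk {rank}: {kept} tokens ({status})")
```

Chunk 8 ends up with zero visible tokens, which is why the model cannot use it.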

Deep Technical Analysis

Context Window Constraints

Models have fixed input limits:

Common Limits:

GPT-3.5-turbo: 4K tokens (16K variant available)
GPT-4: 8K tokens (32K/128K variants)
Claude 2: 100K tokens
Claude 3: 200K tokens

But practical limits are lower:
→ Need space for response
→ System prompts consume tokens
→ Effective capacity is often only ~70-80% of max

The Token Budget:

8K model breakdown:
- System prompt: 300 tokens
- Conversation history: 500 tokens (multi-turn)
- Current query: 100 tokens
- Response generation: 1,000 tokens (reserve)
- Available for context: 6,100 tokens

If chunks average 800 tokens:
→ Can fit ~7 chunks maximum
→ Must be selective
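
The breakdown above can be kept in one place so the budget is recomputed whenever a component changes. A minimal sketch follows; the `TokenBudget` class and its field names are assumptions made for illustration, and the token counts are taken from the example rather than measured with a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    context_window: int
    system_prompt: int
    history: int
    query: int
    response_reserve: int

    @property
    def available_for_context(self) -> int:
        """Tokens left for retrieved chunks after fixed costs are subtracted."""
        used = self.system_prompt + self.history + self.query + self.response_reserve
        return max(self.context_window - used, 0)

    def max_chunks(self, avg_chunk_tokens: int) -> int:
        """Number of whole chunks of the given average size that fit."""
        return self.available_for_context // avg_chunk_tokens


budget = TokenBudget(
    context_window=8_000,
    system_prompt=300,
    history=500,
    query=100,
    response_reserve=1_000,
)
print(budget.available_for_context)  # 6100
print(budget.max_chunks(800))        # 7
```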

Truncation Strategies

When context exceeds limit:

Last-In-First-Out (LIFO), a rank-order cutoff:

Include chunks in relevance order until the budget is full:
→ Chunk 1 (highest relevance): Included
→ Chunk 2: Included
→ ...
→ Chunk 8: Included (reaches limit)
→ Chunks 9-10: Truncated

Problem: Lower-ranked chunks may have unique info
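
A minimal sketch of this rank-order cutoff, assuming chunks arrive as `(chunk_id, token_count)` pairs sorted best first. The function name `pack_by_rank` and the 750-token chunk size in the example are assumptions chosen so that eight chunks fit, as in the listing above.

```python
def pack_by_rank(chunks, budget):
    """Include chunks in rank order and stop at the first one that does not fit."""
    included, dropped = [], []
    used = 0
    for chunk_id, tokens in chunks:
        # Once any chunk has been dropped, everything ranked below it is dropped too.
        if not dropped and used + tokens <= budget:
            included.append(chunk_id)
            used += tokens
        else:
            dropped.append(chunk_id)
    return included, dropped


ranked = [(rank, 750) for rank in range(1, 11)]
included, dropped = pack_by_rank(ranked, budget=6_100)
print(included)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(dropped)   # [9, 10]
```

Stopping at the first miss keeps the behaviour predictable. A variant that skips an oversized chunk and keeps trying smaller ones would use the budget more fully, at the cost of a less obvious inclusion order.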

Sliding Window (head-and-tail truncation):

For a very long context:
→ Keep the first N tokens
→ Keep the last M tokens
→ Skip the middle

Preserves beginning and end:
→ Good for narratives
→ Bad for scattered information
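
The head-and-tail cut can be sketched over a list of tokens. This example assumes the text is already tokenized (the usage below uses placeholder strings) and inserts a hypothetical `[...]` marker so the model can see that content was removed.

```python
def head_tail_truncate(tokens, head, tail, marker="[...]"):
    """Keep the first `head` and last `tail` tokens, replacing the middle with a marker."""
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    if len(tokens) <= head + tail:
        return list(tokens)  # nothing to cut
    tail_part = list(tokens[-tail:]) if tail > 0 else []
    return list(tokens[:head]) + [marker] + tail_part


tokens = [f"t{i}" for i in range(10)]
print(head_tail_truncate(tokens, head=3, tail=2))
# ['t0', 't1', 't2', '[...]', 't8', 't9']
```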

Smart Truncation:

Analyze chunks:
→ Compute diversity score
→ Prioritize unique information
→ Include redundant content last

Maximizes information density within limit
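
One way to approximate this is a two-pass greedy selection: chunks that add new wording are packed first, and near-duplicates only use whatever budget is left. The sketch below is an assumption-heavy illustration, not a standard algorithm from a library. It measures redundancy with Jaccard similarity over word sets, where a production system would more likely compare embeddings, and the 0.6 threshold is arbitrary.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_diverse(chunks, budget, redundancy_threshold=0.6):
    """chunks: list of (text, token_count), best-ranked first."""
    selected, deferred = [], []
    seen_word_sets = []
    used = 0
    # Pass 1: take chunks that are not near-duplicates of something already selected.
    for text, tokens in chunks:
        words = set(text.lower().split())
        redundancy = max((jaccard(words, s) for s in seen_word_sets), default=0.0)
        if redundancy >= redundancy_threshold:
            deferred.append((text, tokens))
            continue
        if used + tokens <= budget:
            selected.append(text)
            seen_word_sets.append(words)
            used += tokens
    # Pass 2: redundant chunks go in last, only if budget remains.
    for text, tokens in deferred:
        if used + tokens <= budget:
            selected.append(text)
            used += tokens
    return selected


chunks = [
    ("the refund window is 30 days", 10),
    ("the refund window is 30 days total", 10),
    ("shipping takes five business days", 10),
]
print(select_diverse(chunks, budget=20))
# ['the refund window is 30 days', 'shipping takes five business days']
```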

Chunk Size vs Retrieval K Trade-off

Competing objectives:

Large Chunks:

Chunk size: 2,000 tokens
Capacity: 6,000 tokens
Chunks that fit: 3

Pros:
+ More context per chunk
+ Preserves coherence

Cons:
- Fewer chunks (less diversity)
- Higher chance of irrelevant content

Small Chunks:

Chunk size: 500 tokens
Capacity: 6,000 tokens  
Chunks that fit: 12

Pros:
+ More information sources
+ Higher precision

Cons:
- Less context per chunk
- May lose coherence

Dynamic Adjustment:

Measure query complexity:
→ Simple factual: Small chunks, high K
→ Complex explanatory: Large chunks, low K

Adaptive strategy based on query type
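
A minimal sketch of such an adaptive strategy, using the chunk sizes from the two cases above. The keyword heuristic, the 12-word threshold, and the name `retrieval_plan` are all assumptions; a real system might classify queries with a model instead.

```python
EXPLANATORY_CUES = ("why", "how", "explain", "compare", "describe")


def retrieval_plan(query, context_budget):
    """Pick a chunk size and K from a rough guess at query complexity."""
    words = [w.strip("?,.!") for w in query.lower().split()]
    is_explanatory = len(words) > 12 or any(w in EXPLANATORY_CUES for w in words)
    chunk_size = 2_000 if is_explanatory else 500
    return {
        "query_type": "explanatory" if is_explanatory else "factual",
        "chunk_size": chunk_size,
        "k": context_budget // chunk_size,  # as many chunks as the budget allows
    }


print(retrieval_plan("What is the refund window?", 6_000))
# {'query_type': 'factual', 'chunk_size': 500, 'k': 12}
print(retrieval_plan("Explain how the billing pipeline handles retries", 6_000))
# {'query_type': 'explanatory', 'chunk_size': 2000, 'k': 3}
```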

Lossy Compression Techniques

Summarize context to fit:

Extractive Summarization:

For each chunk:
1. Extract key sentences (top 30%)
2. Discard filler content
3. Preserve critical facts

Original: 1,000 tokens → Compressed: 300 tokens
→ About 3x as many chunks fit
→ But: Some nuance lost
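
The extraction step can be sketched with a simple frequency-based scorer: sentences whose words occur often in the chunk are treated as key sentences. This is one common baseline, assumed here for illustration, and the regex sentence splitter is deliberately naive.

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")


def extractive_summary(text, keep_ratio=0.3):
    """Keep the top-scoring fraction of sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    freq = Counter(WORD_RE.findall(text.lower()))

    def score(sentence):
        words = WORD_RE.findall(sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))


text = "Alpha beta. Gamma delta. Alpha alpha beta. Zeta."
print(extractive_summary(text, keep_ratio=0.5))
# Alpha beta. Alpha alpha beta.
```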

Abstractive Summarization:

Use LLM to summarize chunks:
→ "Summarize in 100 words"
→ Condenses information
→ Rephrases for brevity

Trade-off:
+ Fit more content
- Risk: Summary LLM may hallucinate
- Extra API call (cost, latency)
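
A sketch of the abstractive path, with the model call injected as a plain function so no particular client library is assumed. `call_llm` is a hypothetical callable that takes a prompt string and returns text; `fake_llm` below only stands in for it so the example runs.

```python
def summarize_chunks(chunks, call_llm, max_words=100):
    """Summarize long chunks with an LLM; pass short chunks through unchanged."""
    summaries = []
    for chunk in chunks:
        if len(chunk.split()) <= max_words:
            summaries.append(chunk)  # already short: skip the extra API call
            continue
        prompt = (
            f"Summarize in {max_words} words. "
            "Use only facts stated in the text.\n\n" + chunk
        )
        summaries.append(call_llm(prompt).strip())
    return summaries


def fake_llm(prompt):
    # Stand-in for a real model call, used only to make the sketch runnable.
    return " Refunds are accepted within 30 days. "


chunks = ["short chunk", "word " * 150]
print(summarize_chunks(chunks, fake_llm))
# ['short chunk', 'Refunds are accepted within 30 days.']
```

Skipping chunks that are already short limits the added cost and latency, and the instruction to use only stated facts is a modest guard against hallucinated summaries, not a guarantee.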

Hierarchical Context Assembly

Multi-level retrieval:

Coarse-to-Fine:

Stage 1: Broad retrieval
→ Get top-20 chunks
→ Scan for relevant sections

Stage 2: Focused retrieval
→ Identify most relevant 3 sections
→ Retrieve full content for those
→ Discard others

Progressive refinement within token budget

Section-Level Granularity:

Store documents at multiple levels:
→ Document summary (200 tokens)
→ Section summaries (500 tokens each)
→ Full sections (2,000 tokens each)

Retrieval:
1. Match at summary level
2. Fetch full sections only for top matches
3. Efficient token usage
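
The retrieval steps above can be sketched as follows. Matching is done with plain word overlap purely as a stand-in for embedding similarity, and the section dictionary layout (`summary`, `full_text`, `tokens`) is an assumption made for this example.

```python
def overlap_score(query, text):
    """Count words shared between the query and the text (toy relevance score)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_sections(query, sections, budget, top_n=3):
    """Match on summaries, then fetch full text only for top matches that fit."""
    ranked = sorted(
        sections,
        key=lambda s: overlap_score(query, s["summary"]),
        reverse=True,
    )
    results, used = [], 0
    for section in ranked[:top_n]:
        if used + section["tokens"] <= budget:
            results.append(section["full_text"])
            used += section["tokens"]
    return results


sections = [
    {"summary": "shipping times and carriers", "full_text": "FULL: shipping", "tokens": 2_000},
    {"summary": "refund policy and deadlines", "full_text": "FULL: refunds", "tokens": 2_000},
    {"summary": "account security settings", "full_text": "FULL: security", "tokens": 2_000},
]
print(retrieve_sections("refund policy deadlines", sections, budget=2_000))
# ['FULL: refunds']
```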

Long-Context Model Considerations

Models with larger windows:

Claude 200K Benefits:

Can include entire documents:
→ No truncation needed
→ Full context available

But:
→ More expensive
→ Slower inference
→ Diminishing returns (recall degradation)

The Lost-in-the-Middle Problem:

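Research on long-context models (Liu et al., 2023, "Lost in the Middle") shows:
→ LLMs use information at the beginning and end of the context most reliably
→ Content in the middle is often overlooked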

Even with 100K context:
→ Relevant info in middle may be missed
→ Positioning matters

Optimal Positioning:

Place most relevant chunks:
→ At beginning (primacy)
→ At end (recency)

Less relevant:
→ In middle

Despite large window, order still matters
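
A minimal sketch of this positioning rule: ranked chunks are dealt alternately to the front and the back of the prompt, so the weakest ones land in the middle. The function name `order_for_position` is illustrative.

```python
def order_for_position(ranked_chunks):
    """Reorder best-first chunks so the most relevant sit at the start and end."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        # Even positions (rank 1, 3, 5, ...) go to the front, odd ones to the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(order_for_position(["rank1", "rank2", "rank3", "rank4", "rank5"]))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```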

How to Solve

Combine several techniques:
→ Implement dynamic chunk sizing based on query type
→ Use extractive summarization for less critical chunks
→ Prioritize top-K chunks and truncate lower-ranked ones
→ Consider long-context models (Claude 200K) for comprehensive docs
→ Apply smart truncation that preserves key info

See Context Management.
