RAG Scenarios and Solutions
Context Window Utilization
Cannot monitor how efficiently the LLM context window is used, leading to wasted tokens, truncation, or suboptimal retrieval configurations.
The Problem
Without monitoring, there is no way to tell how efficiently the LLM context window is used. The result is wasted tokens, truncated context, or a poorly tuned retrieval configuration.
Symptoms
- ❌ Unknown percentage of the context window in use
- ❌ Frequent, unexplained context overflows
- ❌ Wasteful token usage
- ❌ No basis for tuning the K parameter
- ❌ No visibility into the token budget
Real-World Example
Configuration:
→ LLM: GPT-4 (8K context)
→ Retrieval: K=10 chunks
→ Chunk size: ~500 tokens each
Observed:
→ Context overflow errors: 15% of queries
Investigation:
→ System prompt: 300 tokens
→ User query: 100 tokens average
→ Retrieved context: 10 × 500 = 5,000 tokens
→ Response budget: 1,000 tokens
→ Total: 6,400 tokens (fits in 8K)
Why does it overflow?
→ Actual token usage is not monitored
→ Some chunks are larger than 500 tokens (outliers)
→ Some queries are longer than average (max: 800 tokens)
→ The total occasionally exceeds 8K
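The arithmetic above can be checked with a short sketch. It assumes token counts are already available as integers (for example from a tokenizer). The 900-token outlier chunk size and the number of outliers are hypothetical values chosen only to show how the total can cross the limit; they are not measurements from this page.

```python
def total_tokens(system: int, query: int, chunks: list[int], response: int) -> int:
    """Sum every component that must fit in the context window."""
    return system + query + sum(chunks) + response


MAX_CONTEXT = 8_000  # the 8K figure used on this page

# Nominal case from the investigation: 10 chunks of ~500 tokens each.
nominal = total_tokens(300, 100, [500] * 10, 1_000)

# Hypothetical worst case: the longest observed query (800 tokens) plus
# three outlier chunks. The 900-token outlier size is an assumption.
worst = total_tokens(300, 800, [500] * 7 + [900] * 3, 1_000)

print(nominal, nominal <= MAX_CONTEXT)
print(worst, worst <= MAX_CONTEXT)
```

Planning with averages hides the tail: the nominal total fits, while a long query combined with a few oversized chunks does not.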
Deep Technical Analysis
Token Accounting
Component Breakdown:
Total context (8,000 tokens):
1. System prompt: 300 tokens (fixed)
2. Conversation history: 0-2,000 tokens (variable)
3. Retrieved context: 2,000-6,000 tokens (variable)
4. Current query: 50-500 tokens (variable)
5. Response generation: 500-2,000 tokens (reserved)
Monitor each component:
→ Which consumes most?
→ Where to optimize?
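One possible way to record the breakdown per request is a small data class, sketched below. The class name `TokenUsage`, its field names, and the sample values are illustrative assumptions; the sample values were picked from within the ranges listed above.

```python
from dataclasses import dataclass, asdict


@dataclass
class TokenUsage:
    """Token counts for one request, split by component (hypothetical schema)."""
    system_prompt: int
    history: int
    retrieved_context: int
    query: int
    response_reserved: int

    def total(self) -> int:
        return sum(asdict(self).values())

    def largest_component(self) -> tuple[str, int]:
        """Return the name and size of the component that consumes the most tokens."""
        parts = asdict(self)
        name = max(parts, key=parts.get)
        return name, parts[name]


# Sample values are assumptions, not measurements.
usage = TokenUsage(
    system_prompt=300,
    history=1_200,
    retrieved_context=4_500,
    query=150,
    response_reserved=1_000,
)
print(usage.total(), usage.largest_component())
```

Logging one such record per query answers both questions above: which component dominates, and where a reduction would free the most tokens.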
Utilization Percentage:
Metric: Context utilization
= (used_tokens / max_tokens) × 100%
Examples:
→ Query A: 6,400 / 8,000 = 80% (good)
→ Query B: 8,500 / 8,000 = 106% (overflow!)
→ Query C: 3,200 / 8,000 = 40% (underutilized)
Target: 70-85% utilization
→ Below 70%: Retrieving too few chunks
→ Above 85%: Risk of overflow
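The metric and the target band can be expressed as a minimal sketch. The function names and the label strings are assumptions made for illustration; the 70% and 85% thresholds and the three sample queries come from the text above.

```python
def utilization_pct(used_tokens: int, max_tokens: int) -> float:
    """Context utilization as a percentage of the window."""
    return used_tokens * 100 / max_tokens


def classify(pct: float, low: float = 70.0, high: float = 85.0) -> str:
    """Map a utilization percentage onto the target band."""
    if pct > 100:
        return "overflow"
    if pct > high:
        return "overflow risk"
    if pct < low:
        return "underutilized"
    return "on target"


for name, used in [("A", 6_400), ("B", 8_500), ("C", 3_200)]:
    pct = utilization_pct(used, 8_000)
    print(f"Query {name}: {pct:.0f}% ({classify(pct)})")
```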
Dynamic K Adjustment
Token-Based K Selection:
Instead of fixed K=10:
→ Retrieve chunks until token budget nearly full
Algorithm:
1. Calculate the available budget: 8,000 - system - query - response_buffer = 5,500 tokens in this example
2. Retrieve chunks sequentially:
- Chunk 1: 450 tokens (total: 450)
- Chunk 2: 520 tokens (total: 970)
- ...
- Chunk 11: 480 tokens (total: 5,520 → exceeds 5,500)
3. Stop before Chunk 11 and keep Chunks 1-10 (K=10)
K adapts to the token budget
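The steps above amount to a greedy selection over chunks, which might look like the sketch below. The function names are hypothetical. The split of the budget (300 system, 200 query, 2,000 response buffer) is one assumed combination that yields 5,500 tokens, and the sizes of chunks 3 to 10 are invented to match the running totals in the example.

```python
def context_budget(max_tokens: int, system: int, query: int, response_buffer: int) -> int:
    """Tokens left for retrieved context after the other components are reserved."""
    return max_tokens - system - query - response_buffer


def select_chunks(chunk_tokens: list[int], budget: int) -> list[int]:
    """Return indices of chunks that fit, stopping at the first one that would overflow."""
    selected, used = [], 0
    for index, size in enumerate(chunk_tokens):
        if used + size > budget:
            break
        selected.append(index)
        used += size
    return selected


# Assumed split that produces the 5,500-token budget from the example.
budget = context_budget(8_000, system=300, query=200, response_buffer=2_000)

# Chunk sizes: only chunks 1, 2, and 11 come from the example; the rest are made up.
sizes = [450, 520, 510, 500, 490, 530, 505, 515, 500, 520, 480]
selected = select_chunks(sizes, budget)
print(budget, len(selected), sum(sizes[i] for i in selected))
```

Stopping at the first chunk that does not fit preserves retrieval order. Skipping it and trying smaller, lower-ranked chunks is an alternative, at the cost of placing less relevant text in the prompt.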
Query Complexity Adaptation:
Simple query: "What is X?"
→ Short query (30 tokens)
→ More budget for context
→ K=15 possible
Complex query: "Explain how X, Y, and Z interact..."
→ Long query (200 tokens)
→ Less budget for context
→ K=8
Dynamic based on query length
Truncation Strategy Monitoring
Where Truncation Happens:
Monitor truncation point:
→ Truncated after chunk #8 of 10
→ Lost chunks 9-10
Were important chunks lost?
→ Chunk #9 score: 0.72 (relevant)
→ Chunk #10 score: 0.68 (marginal)
Impact: Moderate (lost one relevant chunk)
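A sketch of how the impact of a truncation could be estimated from retrieval scores follows. The 0.70 relevance threshold is an assumption chosen to separate the "relevant" 0.72 from the "marginal" 0.68 in the example, and the scores of chunks 1 to 8 are invented.

```python
def truncation_impact(scores: list[float], kept: int, relevance_threshold: float = 0.70) -> dict:
    """Count the chunks dropped after position `kept` and how many of them were relevant."""
    lost = scores[kept:]
    relevant_lost = [s for s in lost if s >= relevance_threshold]
    return {"lost_chunks": len(lost), "relevant_lost": len(relevant_lost)}


# Scores for chunks 9 and 10 are from the example; the first eight are hypothetical.
scores = [0.91, 0.88, 0.85, 0.83, 0.80, 0.78, 0.76, 0.74, 0.72, 0.68]
impact = truncation_impact(scores, kept=8)
print(impact)
```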
Sliding Window Stats:
Track truncation patterns:
→ 5% of queries: Truncate at chunk 5-7
→ 10% of queries: Truncate at chunk 8-10
→ 85% of queries: No truncation
Optimize:
→ Reduce K for longer queries
→ Increase response buffer
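The pattern statistics can be derived from a per-query log of truncation positions, as in this sketch. The log format (`None` for no truncation, otherwise the chunk number where the cut happened) and the bucket boundaries for K=10 are assumptions, and the sample log is synthetic, built to reproduce the 5% / 10% / 85% split.

```python
from collections import Counter
from typing import Optional


def bucket(truncated_at: Optional[int]) -> str:
    """Group a truncation position into the ranges used above (assumes K=10)."""
    if truncated_at is None:
        return "no truncation"
    if truncated_at <= 4:
        return "chunk 1-4"
    if truncated_at <= 7:
        return "chunk 5-7"
    return "chunk 8-10"


def truncation_distribution(log: list[Optional[int]]) -> dict[str, float]:
    """Percentage of queries per truncation bucket."""
    counts = Counter(bucket(position) for position in log)
    return {name: count * 100 / len(log) for name, count in counts.items()}


# Synthetic log of 20 queries: one cut at chunk 6, two at chunks 8-9, the rest untruncated.
log = [6, 8, 9] + [None] * 17
distribution = truncation_distribution(log)
print(distribution)
```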
Optimization Opportunities
Token Waste Detection:
Query uses 40% of context:
→ Only 3,200 / 8,000 tokens
→ Underutilized
Could increase K:
→ From K=5 to K=8
→ More context for better answers
Compression Opportunities:
System prompt: 300 tokens
→ Can compress to 200 tokens?
→ Saves 100 tokens
→ More for context
Conversation history: 1,500 tokens
→ Summarize to 500 tokens?
→ Saves 1,000 tokens
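To put the savings in terms of retrieval, the freed tokens can be converted into extra chunks. This is a minimal sketch: the function name is hypothetical, and the 500-token average chunk size is taken from the real-world example earlier on this page.

```python
def extra_chunks(savings: dict[str, int], avg_chunk_tokens: int) -> tuple[int, int]:
    """Return (tokens freed, whole chunks that fit in the freed space)."""
    freed = sum(savings.values())
    return freed, freed // avg_chunk_tokens


savings = {
    "system_prompt": 300 - 200,   # compress the prompt from 300 to 200 tokens
    "history": 1_500 - 500,       # summarize history from 1,500 to 500 tokens
}
freed, extra = extra_chunks(savings, avg_chunk_tokens=500)
print(freed, extra)
```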
How to Solve
- Log token usage per component (system prompt, query, context, response).
- Calculate utilization % (used / max).
- Monitor truncation frequency and position.
- Implement dynamic K based on the available token budget.
- Alert on high utilization (>85%) or overflow.
- Track token distribution across queries.
- Optimize underutilized queries (increase K).
- Compress the system prompt or conversation history if needed.
See Token Utilization.
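The alerting and distribution-tracking steps might be wired together as in the sketch below. The logger name, function names, and report keys are hypothetical. The 85% alert threshold comes from this page, and the sample totals are illustrative.

```python
import logging
from typing import Optional

logger = logging.getLogger("context_monitor")  # hypothetical logger name


def alert_level(used_tokens: int, max_tokens: int = 8_000, alert_pct: float = 85.0) -> Optional[str]:
    """Log and return an alert level for one query, or None if usage is acceptable."""
    pct = used_tokens * 100 / max_tokens
    if pct > 100:
        logger.error("context overflow: %.1f%% of window", pct)
        return "overflow"
    if pct > alert_pct:
        logger.warning("high context utilization: %.1f%%", pct)
        return "high"
    return None


def summarize(totals: list[int], max_tokens: int = 8_000) -> dict:
    """Aggregate per-query totals into overflow and high-utilization rates."""
    levels = [alert_level(total, max_tokens) for total in totals]
    return {
        "queries": len(totals),
        "overflow_rate_pct": levels.count("overflow") * 100 / len(totals),
        "high_rate_pct": levels.count("high") * 100 / len(totals),
        "max_utilization_pct": max(totals) * 100 / max_tokens,
    }


# Sample per-query totals (illustrative).
report = summarize([6_400, 8_500, 3_200, 7_000])
print(report)
```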
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/monitoring/context-utilization.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
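As a sketch, the request URL can be assembled with the standard library so that the question is URL-encoded correctly. The host `https://docs.example.com` is a placeholder, since this page does not state the base URL, and `build_ask_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Path taken from the GET line above.
PAGE_PATH = "/dev/rag-scenarios-and-solutions/monitoring/context-utilization.md"


def build_ask_url(base_url: str, question: str) -> str:
    """Append the `ask` query parameter, URL-encoding the question text."""
    query_string = urlencode({"ask": question})
    return base_url.rstrip("/") + PAGE_PATH + "?" + query_string


# The host below is a placeholder, not the real documentation host.
url = build_ask_url("https://docs.example.com", "How is dynamic K computed?")
print(url)
```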
Last updated January 26, 2026


