Rag Scenarios And Solutions
Token Limit Exceeded
Generated responses hit model token limits mid-answer, cutting off responses or preventing generation entirely.
TL;DR
Generated responses hit model token limits mid-answer, cutting off responses or preventing generation entirely.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
Generated responses hit model token limits mid-answer, cutting off responses or preventing generation entirely.
Symptoms
- ❌ Responses end abruptly mid-sentence
- ❌ "Maximum tokens reached" errors
- ❌ Incomplete lists or code examples
- ❌ Must request "continue" from user
- ❌ Cannot generate long-form content
Real-World Example
User asks: "List all API endpoints"
AI starts response:
"Here are the API endpoints:
1. POST /auth/login - User authentication
2. GET /users - Retrieve user list
3. POST /users - Create new user
4. GET /users/{id} - Get user details
5. PUT /users/{id} - Update user
..."
[Token limit reached at 1000 tokens]
Response cuts off at endpoint #15 of 50 total
User sees incomplete list
Deep Technical Analysis
Output Token Limits
Separate from input context limits:
Max Tokens Parameter:
API calls specify max_tokens:
→ GPT-4: max_tokens=4096
→ Controls response length
→ Prevents runaway generation
Trade-offs:
→ Too low: Truncated responses
→ Too high: Longer latency, higher cost
→ Must balance
Automatic Truncation:
LLM generates token-by-token until:
1. Reaches natural stop (EOS token)
2. Hits max_tokens limit
3. Encounters stop sequence
If hits #2 mid-generation:
→ Cuts off wherever it stopped
→ No graceful ending
→ Incomplete output
Estimating Response Length
Predicting token needs:
Query Type Heuristics:
Factual query: "What is X?"
→ Expected: 50-200 tokens
→ Set max_tokens: 300 (buffer)
List query: "List all..."
→ Unknown length
→ Set high limit or paginate
Explanatory: "How does X work?"
→ Expected: 300-800 tokens
→ Set max_tokens: 1000
Dynamic Allocation:
Analyze query:
→ Count items in retrieved context
→ "50 API endpoints found"
→ Estimate: 50 × 30 tokens/item = 1500 tokens
→ Set max_tokens: 2000
Adaptive based on content
Pagination Strategies
Breaking responses into chunks:
Explicit Pagination:
System prompt: "If response exceeds 800 tokens, end with
[Continued in next message] and stop."
User experience:
→ AI sends first part
→ User clicks "Continue"
→ AI resumes with context
Preserves continuity across messages
Automatic Chunking:
Backend splits long responses:
1. Generate full response (internally)
2. Split at natural boundaries (paragraphs)
3. Send as multiple messages
4. Stream to user sequentially
Transparent to user
Summarization vs Detail
Adjusting verbosity:
Conciseness Prompting:
Add to system prompt:
"Be concise. Provide direct answers without unnecessary
elaboration."
Reduces token usage:
→ Same information
→ Fewer words
→ Fits in token budget
Detail Level Control:
User specifies preference:
→ "Give brief overview" (200 tokens)
→ "Explain in detail" (1000 tokens)
Adjust max_tokens accordingly
Token Accounting
Tracking usage:
Input + Output Budget:
Total model capacity: 8K tokens
Input (6K tokens):
→ System prompt: 300
→ Context: 5,500
→ Query: 200
Remaining: 2,000 tokens
→ Maximum possible response length
→ Set max_tokens ≤ 2,000
Conversation History:
Multi-turn chat accumulates:
→ Turn 1: 500 tokens (in + out)
→ Turn 2: 600 tokens
→ Turn 3: 700 tokens
→ Total: 1,800 tokens in history
Context window filling up:
→ Less space for future responses
→ Must prune old turns
Response Compression
Fitting more in less space:
Structured Formats:
Instead of prose:
"The API rate limit is 1000 requests per hour. If you
exceed this limit, you will receive a 429 error..."
Use structured:
{
"rate_limit": "1000/hour",
"error_code": 429,
"retry_after": "60 seconds"
}
Same info, fewer tokens
Tables Over Lists:
Verbose list (200 tokens):
"Endpoint 1: POST /auth/login - Used for authentication...
Endpoint 2: GET /users - Retrieves user list..."
Table (120 tokens):
| Method | Path | Description |
|--------|------|-------------|
| POST | /auth/login | Authentication |
| GET | /users | User list |
How to Solve
Set max_tokens dynamically based on query type + implement response pagination for long outputs + use conciseness prompts to reduce verbosity + employ structured formats (tables, JSON) over prose + track token usage and adjust context accordingly. See Token Management.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/llm/token-limit.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Related Pages
Comparisons
Last updated January 26, 2026


