Token Limit Exceeded

The Problem

Generated responses hit model token limits mid-answer, cutting off responses or preventing generation entirely.

Symptoms

❌ Responses end abruptly mid-sentence
❌ "Maximum tokens reached" errors
❌ Incomplete lists or code examples
❌ Must request "continue" from user
❌ Cannot generate long-form content

Real-World Example

User asks: "List all API endpoints"
AI starts response:
"Here are the API endpoints:
1. POST /auth/login - User authentication
2. GET /users - Retrieve user list
3. POST /users - Create new user
4. GET /users/{id} - Get user details
5. PUT /users/{id} - Update user
..." 

[Token limit reached at 1000 tokens]

Response cuts off at endpoint #15 of 50 total
User sees incomplete list

Deep Technical Analysis

Output Token Limits

Separate from input context limits:

Max Tokens Parameter:

API calls specify max_tokens:
→ GPT-4: max_tokens=4096
→ Controls response length
→ Prevents runaway generation

Trade-offs:
→ Too low: Truncated responses
→ Too high: Longer latency, higher cost
→ Must balance

Automatic Truncation:

LLM generates token-by-token until:
1. Reaches natural stop (EOS token)
2. Hits max_tokens limit
3. Encounters stop sequence

If hits #2 mid-generation:
→ Cuts off wherever it stopped
→ No graceful ending
→ Incomplete output

Estimating Response Length

Predicting token needs:

Query Type Heuristics:

Factual query: "What is X?"
→ Expected: 50-200 tokens
→ Set max_tokens: 300 (buffer)

List query: "List all..."
→ Unknown length
→ Set high limit or paginate

Explanatory: "How does X work?"
→ Expected: 300-800 tokens
→ Set max_tokens: 1000

Dynamic Allocation:

Analyze query:
→ Count items in retrieved context
→ "50 API endpoints found"
→ Estimate: 50 × 30 tokens/item = 1500 tokens
→ Set max_tokens: 2000

Adaptive based on content

Pagination Strategies

Breaking responses into chunks:

Explicit Pagination:

System prompt: "If response exceeds 800 tokens, end with
[Continued in next message] and stop."

User experience:
→ AI sends first part
→ User clicks "Continue"
→ AI resumes with context

Preserves continuity across messages

Automatic Chunking:

Backend splits long responses:
1. Generate full response (internally)
2. Split at natural boundaries (paragraphs)
3. Send as multiple messages
4. Stream to user sequentially

Transparent to user

Summarization vs Detail

Adjusting verbosity:

Conciseness Prompting:

Add to system prompt:
"Be concise. Provide direct answers without unnecessary
elaboration."

Reduces token usage:
→ Same information
→ Fewer words
→ Fits in token budget

Detail Level Control:

User specifies preference:
→ "Give brief overview" (200 tokens)
→ "Explain in detail" (1000 tokens)

Adjust max_tokens accordingly

Token Accounting

Tracking usage:

Input + Output Budget:

Total model capacity: 8K tokens

Input (6K tokens):
→ System prompt: 300
→ Context: 5,500
→ Query: 200

Remaining: 2,000 tokens
→ Maximum possible response length
→ Set max_tokens ≤ 2,000

Conversation History:

Multi-turn chat accumulates:
→ Turn 1: 500 tokens (in + out)
→ Turn 2: 600 tokens
→ Turn 3: 700 tokens
→ Total: 1,800 tokens in history

Context window filling up:
→ Less space for future responses
→ Must prune old turns

Response Compression

Fitting more in less space:

Structured Formats:

Instead of prose:
"The API rate limit is 1000 requests per hour. If you 
exceed this limit, you will receive a 429 error..."

Use structured:
{
  "rate_limit": "1000/hour",
  "error_code": 429,
  "retry_after": "60 seconds"
}

Same info, fewer tokens

Tables Over Lists:

Verbose list (200 tokens):
"Endpoint 1: POST /auth/login - Used for authentication...
Endpoint 2: GET /users - Retrieves user list..."

Table (120 tokens):
| Method | Path | Description |
|--------|------|-------------|
| POST | /auth/login | Authentication |
| GET | /users | User list |

How to Solve

Set max_tokens dynamically based on query type + implement response pagination for long outputs + use conciseness prompts to reduce verbosity + employ structured formats (tables, JSON) over prose + track token usage and adjust context accordingly. See Token Management.

Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/rag-scenarios-and-solutions/llm/token-limit.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.

Token Limit Exceeded

Key Takeaways

The Problem

Symptoms

Real-World Example

Deep Technical Analysis

Output Token Limits

Estimating Response Length

Summarization vs Detail

Token Accounting

Response Compression

How to Solve

Agent Instructions: Querying This Documentation

Related Pages

Comparisons

Compliance

Investors

Industry