Performance Tuning
Optimize your AI agents for speed, accuracy, and cost-effectiveness
Key Takeaways
- Performance Metrics
- Optimization Tradeoffs
- Speed Optimization
- Accuracy Optimization
- Cost Optimization
- Balanced Optimization
Performance Metrics
Track these metrics in the Analytics dashboard:
| Metric | Target | How It Is Measured |
|---|---|---|
| p50 Latency | <2s | 50th percentile response time |
| p95 Latency | <4s | 95th percentile (worst-case normal) |
| p99 Latency | <6s | 99th percentile (outliers) |
| Accuracy Rate | >85% | % responses marked accurate |
| Cost/Query | <$0.01 | Total API costs / query count |
| Cache Hit Rate | >40% | Cached / total queries |
Location: Dashboard → Analytics → Performance tab
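To make the definitions above concrete, here is a minimal sketch of how these metrics could be derived from raw request logs. The record fields (`latency_s`, `cost_usd`, `cached`, `accurate`) and the nearest-rank percentile method are assumptions for illustration; the dashboard may compute its figures differently.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """One logged request. Field names are hypothetical."""
    latency_s: float
    cost_usd: float
    cached: bool
    accurate: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Compute the six metrics from the table for a non-empty list of records."""
    latencies = [r.latency_s for r in records]
    total = len(records)
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "accuracy_rate": sum(r.accurate for r in records) / total,
        "cost_per_query": sum(r.cost_usd for r in records) / total,
        "cache_hit_rate": sum(r.cached for r in records) / total,
    }
```

Percentiles are reported instead of a plain average because a handful of slow requests can hide behind a healthy mean.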
Optimization Tradeoffs
You cannot maximize all three simultaneously:
Speed ⟷ Accuracy ⟷ Cost
Pick 2:
- Speed + Cost → Redwood + GPT-3.5-turbo (accuracy: ~72%)
- Speed + Accuracy → Cedar + GPT-4o + cache (cost: medium)
- Accuracy + Cost → Cypress + GPT-4o-mini (speed: ~3-5s)
Speed Optimization
1. Choose Faster RAG Strategy
| Strategy | Avg Latency | Best For |
|---|---|---|
| Redwood | ~1.2s | Maximum speed |
| Cedar | ~2.0s | Balanced |
| Cypress | ~3.5s | Maximum accuracy |
Switch to Redwood when:
- Questions are clear and direct
- Speed is critical
- High query volume
2. Reduce topK
// Higher topK = slower
topK: 20 → Response time: 2.5s
topK: 10 → Response time: 1.8s
topK: 5 → Response time: 1.2s
Recommendation: Start with 5-7, increase only if accuracy suffers.
3. Use Faster Model
| Model | Speed | Quality | Cost |
|---|---|---|---|
| GPT-3.5-turbo | Fast | Good | Low |
| GPT-4o-mini | Fast | Better | Low |
| GPT-4o | Medium | Excellent | High |
| GPT-4 | Slow | Excellent | High |
For speed: Use GPT-3.5-turbo for simple queries, GPT-4o for complex.
4. Enable Caching
{
  "cache": {
    "enabled": true,
    "ttl": 300, // 5 minutes
    "keyBy": ["prompt", "agentId"]
  }
}
Impact: 50-100ms for cached responses vs 1-3s for uncached.
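The behavior this setting describes can be pictured as a small time-to-live cache keyed by prompt and agent ID. The sketch below is only an illustration of the idea, not the platform's implementation; `TTLCache`, `answer`, and the `generate` callback are hypothetical names.

```python
import time


class TTLCache:
    """In-memory cache keyed by (prompt, agent_id) with a fixed time-to-live."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get(self, prompt, agent_id):
        key = (prompt, agent_id)
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, prompt, agent_id, value):
        self._store[(prompt, agent_id)] = (value, self.clock() + self.ttl)


def answer(prompt, agent_id, cache, generate):
    """Return a cached response if one exists; otherwise call generate() and cache the result."""
    cached = cache.get(prompt, agent_id)
    if cached is not None:
        return cached
    response = generate(prompt, agent_id)
    cache.put(prompt, agent_id, response)
    return response
```

Keying on both values means two agents never share an answer to the same prompt, which matters when they draw on different data sources.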
5. Optimize Context
// Reduce context size
{
  "topK": 5, // Fewer documents
  "maxContextTokens": 2000, // Limit context size
  "chunkSize": 300 // Smaller chunks
}
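One way to picture how these three settings interact: retrieval returns ranked chunks, `topK` caps how many are considered, and `maxContextTokens` caps how much of that text reaches the model. The sketch below assumes chunks arrive ordered by relevance and uses a crude word count in place of a real tokenizer; both function names are hypothetical.

```python
def estimate_tokens(text):
    """Crude estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def select_context(chunks, top_k=5, max_context_tokens=2000):
    """Keep the highest-ranked chunks that fit within the token budget.

    `chunks` is assumed to be ordered by relevance, best first.
    Returns the selected chunks and the number of tokens they use.
    """
    selected = []
    used = 0
    for chunk in chunks[:top_k]:
        tokens = estimate_tokens(chunk)
        if used + tokens > max_context_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += tokens
    return selected, used
```

With 300-token chunks and `topK` of 5, the context uses about 1,500 tokens, which stays inside the 2,000-token limit.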
6. Use Streaming
// User sees response immediately
stream: true
// First token: ~500ms
// Complete response: ~2s
// Perceived latency: Much faster
Accuracy Optimization
1. Choose Better RAG Strategy
Cypress > Cedar > Redwood for accuracy.
2. Increase topK
topK: 5 → Accuracy: 85%
topK: 10 → Accuracy: 89%
topK: 20 → Accuracy: 91%
Diminishing returns after topK ~15.
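The diminishing returns are easier to see as accuracy gained per additional retrieved document. This short calculation uses only the three figures listed above; the function name is illustrative.

```python
def marginal_gains(measurements):
    """Accuracy gained (in percentage points) per extra retrieved document."""
    points = sorted(measurements.items())
    gains = []
    for (k_low, acc_low), (k_high, acc_high) in zip(points, points[1:]):
        gains.append(((k_low, k_high), (acc_high - acc_low) / (k_high - k_low)))
    return gains


# Figures from the list above: topK -> accuracy (%).
accuracy_by_top_k = {5: 85, 10: 89, 20: 91}
for (low, high), gain in marginal_gains(accuracy_by_top_k):
    print(f"topK {low} -> {high}: {gain:.1f} points per extra document")
# prints:
# topK 5 -> 10: 0.8 points per extra document
# topK 10 -> 20: 0.2 points per extra document
```

Each extra document between 10 and 20 buys a quarter of the accuracy that each extra document between 5 and 10 does, while latency keeps rising.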
3. Use Better Model
Use GPT-4o or GPT-4 for the highest quality.
4. Improve Instructions
// Detailed system prompt
instructions: `
You are an expert [domain] assistant.
When answering:
1. Always cite specific sources
2. Provide step-by-step explanations
3. Include code examples when relevant
4. Verify facts against documentation
5. Admit uncertainty when appropriate
`
5. Add High-Quality Data Sources
✅ Official documentation
✅ Verified knowledge base
✅ Recent, updated content
❌ Low-quality, outdated content
6. Enable Reranking (Cypress)
Reranking improves precision by 20-30%.
7. Use Private Data Only
configAIUseOnlyPrivateData: true
Restricts answers to your private data sources, which reduces hallucinations drawn from the model's general knowledge.
Cost Optimization
1. Choose Cost-Effective Model
| Model | Cost per 1M Input Tokens (USD) |
|---|---|
| GPT-3.5-turbo | $0.50 |
| GPT-4o-mini | $0.15 |
| GPT-4o | $5.00 |
| GPT-4 | $30.00 |
Recommendation: GPT-4o-mini for most use cases.
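To compare these prices against the <$0.01 cost-per-query target, a rough estimate can be computed from the average tokens per query. The sketch below applies a single flat rate to every token, which is a simplification (it ignores any difference between input and output pricing); the function names are illustrative.

```python
# Prices per 1M tokens in USD, copied from the table above.
PRICE_PER_MILLION = {
    "gpt-3.5-turbo": 0.50,
    "gpt-4o-mini": 0.15,
    "gpt-4o": 5.00,
    "gpt-4": 30.00,
}


def cost_per_query(model, tokens):
    """Estimated USD cost of one query at a flat per-token rate."""
    return tokens * PRICE_PER_MILLION[model] / 1_000_000


def monthly_cost(model, tokens_per_query, queries_per_day, days=30):
    """Estimated USD cost of a month of traffic at a constant daily volume."""
    return cost_per_query(model, tokens_per_query) * queries_per_day * days
```

At 1,500 tokens per query, this estimate gives about $0.0075 for GPT-4o and $0.045 for GPT-4, so of the four models only GPT-4 exceeds the target.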
2. Reduce Token Usage
{
  "maxTokens": 300, // Limit response length
  "topK": 5, // Fewer documents
  "memoryTurns": 3, // Less conversation history
  "temperature": 0.3 // More focused, less rambling output
}
3. Aggressive Caching
{
  "cache": {
    "enabled": true,
    "ttl": 3600, // 1 hour (longer cache)
    "fuzzyMatching": true // Match similar queries
  }
}
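How `fuzzyMatching` decides that two queries are similar is not described here. As one simple possibility, a cache can normalize queries before using them as keys so that differences in case, punctuation, and spacing still produce a hit. The function below is a hypothetical sketch of that approach, not the platform's algorithm.

```python
import re
import string

# Translation table that deletes all ASCII punctuation.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_query(query):
    """Map superficially different phrasings of a query to one cache key."""
    lowered = query.lower().translate(_PUNCTUATION)
    return re.sub(r"\s+", " ", lowered).strip()
```

Under this scheme, "How do I reset my password?" and "how do I  reset my password" share a cache entry. Matching paraphrases with different wording would need something stronger, such as embedding similarity.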
4. Use Redwood Strategy
Redwood is cheapest (single LLM call, no reranking).
5. Batch Operations
Process multiple queries together to reduce overhead.
6. Smart Routing
def route_query(query):
    if is_simple(query):
        return "REDWOOD"  # Cheapest
    elif is_complex(query):
        return "CYPRESS"  # Worth the cost
    else:
        return "CEDAR"  # Balanced
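The routing function depends on `is_simple` and `is_complex`, which are left undefined. A runnable version with placeholder heuristics might look like the sketch below. The word-count limits and the hint keywords are assumptions chosen for illustration; a production router would more likely use a classifier or historical accuracy data.

```python
# Hypothetical keywords that suggest a query needs multi-step reasoning.
COMPLEX_HINTS = ("compare", "why", "explain", "difference", "step by step")


def is_complex(query):
    """Hypothetical heuristic: long queries, or ones that ask for reasoning."""
    lowered = query.lower()
    return len(query.split()) > 25 or any(hint in lowered for hint in COMPLEX_HINTS)


def is_simple(query):
    """Hypothetical heuristic: short, single-question queries with no reasoning hints."""
    return len(query.split()) <= 8 and query.count("?") <= 1 and not is_complex(query)


def route_query(query):
    """Pick the cheapest strategy that is likely to answer the query well."""
    if is_simple(query):
        return "REDWOOD"  # Cheapest
    if is_complex(query):
        return "CYPRESS"  # Worth the cost
    return "CEDAR"  # Balanced
```

Checking `is_simple` first keeps the cheapest path as the default for short questions, so only queries that clearly need it pay for Cypress.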
Balanced Optimization
The Performance Triangle
        Speed
         /\
        /  \
       /    \
      /      \
     /________\
  Cost      Accuracy
You can optimize 2 of 3:
- Speed + Cost: Use Redwood, GPT-3.5-turbo
- Speed + Accuracy: Use Cedar, GPT-4o, caching
- Cost + Accuracy: Use Cypress, efficient models, batch
Recommended Configurations
High-Volume Support Bot:
Strategy: REDWOOD
Model: gpt-3.5-turbo
topK: 5
Cache: enabled (1 hour)
Goal: Handle 10k+ queries/day cheaply
Technical Documentation:
Strategy: CEDAR
Model: gpt-4o
topK: 10
Cache: enabled (30 min)
Goal: Balance speed and accuracy
Compliance Assistant:
Strategy: CYPRESS
Model: gpt-4o
topK: 10
Reranking: enabled
Goal: Maximum accuracy, cost secondary
Performance Monitoring
Key Metrics Dashboard
Agent: Customer Support
├─ Avg Response Time: 1.8s (target: <2s) ✅
├─ P95 Response Time: 2.4s (target: <3s) ✅
├─ P99 Response Time: 3.1s ⚠️
├─ Cache Hit Rate: 42%
├─ Avg Tokens: 1,523
├─ Cost per Query: $0.0045
└─ Accuracy Score: 89%
Set Performance Targets
{
  "targets": {
    "responseTime": {
      "p50": 1.5,
      "p95": 2.5,
      "p99": 4.0
    },
    "accuracy": 0.85,
    "costPerQuery": 0.01,
    "cacheHitRate": 0.40
  }
}
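Targets like these are most useful when something checks them automatically. The sketch below compares a metrics snapshot against the same target values; the shape of the `metrics` dictionary mirrors the targets and is an assumption, as is the function name.

```python
TARGETS = {
    "responseTime": {"p50": 1.5, "p95": 2.5, "p99": 4.0},
    "accuracy": 0.85,
    "costPerQuery": 0.01,
    "cacheHitRate": 0.40,
}


def check_targets(metrics, targets=TARGETS):
    """Return a list of human-readable target violations (empty if all pass)."""
    failures = []
    for pct, limit in targets["responseTime"].items():
        observed = metrics["responseTime"][pct]
        if observed > limit:
            failures.append(f"{pct} latency {observed}s exceeds target {limit}s")
    # Accuracy and cache hit rate are floors; cost per query is a ceiling.
    if metrics["accuracy"] < targets["accuracy"]:
        failures.append("accuracy below target")
    if metrics["costPerQuery"] > targets["costPerQuery"]:
        failures.append("cost per query above target")
    if metrics["cacheHitRate"] < targets["cacheHitRate"]:
        failures.append("cache hit rate below target")
    return failures
```

Running a check like this after each deployment makes regressions visible before users notice them.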
Alerting
{
  "alerts": {
    "responseTimeSlow": {
      "threshold": 3.0,
      "duration": "5m",
      "notify": "team@company.com"
    },
    "accuracyDrop": {
      "threshold": 0.80,
      "compare": "baseline",
      "notify": "team@company.com"
    }
  }
}
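The `duration` field suggests that an alert fires only when the condition persists, not on a single slow request. The sketch below shows one plausible reading of `responseTimeSlow`: fire when every sample in the last five minutes exceeds the threshold. This interpretation, the function name, and the sample format are assumptions.

```python
def should_alert(samples, now, threshold=3.0, duration_s=300):
    """Fire when every sample in the last `duration_s` seconds exceeds `threshold`.

    `samples` is a list of (timestamp_seconds, response_time_seconds) tuples.
    An empty window never fires, so missing data does not page anyone.
    """
    window = [value for ts, value in samples if now - duration_s <= ts <= now]
    return bool(window) and all(value > threshold for value in window)
```

Requiring the whole window to be slow trades a few minutes of detection delay for far fewer false alarms from isolated spikes.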
A/B Testing
Compare configurations to find optimal settings:
// Test A: Baseline
const configA = {
strategyCode: 'CEDAR',
topK: 10,
temperature: 0.7
};
// Test B: Optimized for speed
const configB = {
strategyCode: 'REDWOOD',
topK: 5,
temperature: 0.7
};
// Route 50% traffic to each
// Measure: speed, accuracy, cost
// Deploy winner after 1 week
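The 50/50 split is typically done deterministically so that a given user sees the same configuration for the whole test. A minimal sketch, assuming each request carries a stable user ID, hashes that ID to pick a variant. The experiment name and function names are hypothetical.

```python
import hashlib

CONFIG_A = {"strategyCode": "CEDAR", "topK": 10, "temperature": 0.7}
CONFIG_B = {"strategyCode": "REDWOOD", "topK": 5, "temperature": 0.7}


def assign_variant(user_id, experiment="perf-tuning-v1"):
    """Deterministically place a user in variant A or B (roughly 50/50)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def config_for(user_id):
    """Return the configuration for the variant this user belongs to."""
    return CONFIG_A if assign_variant(user_id) == "A" else CONFIG_B
```

Including the experiment name in the hash input reshuffles users between experiments, so the same people are not always in the baseline group.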
Continuous Optimization
Weekly Review
- Check performance metrics
- Identify bottlenecks
- Test optimizations
- Deploy improvements
- Measure impact
Monthly Audit
- Review all configurations
- Benchmark against baselines
- Update targets
- Plan next optimizations
Tools & Techniques
Performance Profiling
const startTime = Date.now();
const response = await twig.chat.create({
  prompt,
  agentId,
  profile: true // Enable profiling
});

console.log('Breakdown:', {
  embedding: response.profile.embeddingTime,
  retrieval: response.profile.retrievalTime,
  llm: response.profile.llmTime,
  total: Date.now() - startTime
});
Load Testing
# Using Apache Bench
ab -n 1000 -c 10 -H "Authorization: Bearer KEY" \
  -T "application/json" -p query.json https://api.twig.so/api/chat
# Results show:
# - Requests per second
# - Average latency
# - P50, P95, P99
Cache Analysis
const cacheStats = await twig.cache.stats();
console.log('Hit rate:', cacheStats.hitRate);
console.log('Avg savings:', cacheStats.avgTimeSaved);
console.log('Most cached:', cacheStats.topQueries);
Next Steps
- Cost Optimization - Reduce expenses
- Analytics Dashboard - Monitor metrics
- RAG Strategies - Choose optimal strategy
- Evaluation Framework - Measure quality
Last updated January 26, 2026


