Performance Tuning

Optimize your AI agents for speed, accuracy, and cost-effectiveness.

Performance Metrics

Track in Analytics dashboard:

Metric	Target	Measured How
p50 Latency	<2s	50th percentile response time
p95 Latency	<4s	95th percentile (worst-case normal)
p99 Latency	<6s	99th percentile (outliers)
Accuracy Rate	>85%	% responses marked accurate
Cost/Query	<$0.01	Total API costs / query count
Cache Hit Rate	>40%	Cached / total queries

Location: Dashboard → Analytics → Performance tab

Optimization Tradeoffs

Cannot maximize all three simultaneously:

Speed ⟷ Accuracy ⟷ Cost

Pick 2:

Speed + Cost → Redwood + GPT-3.5-turbo (accuracy: ~72%)
Speed + Accuracy → Cedar + GPT-4 + cache (cost: medium)
Accuracy + Cost → Cypress + GPT-4o-mini (speed: ~3-5s)

Speed Optimization

1. Choose Faster RAG Strategy

Strategy	Avg Latency	Best For
Redwood	~1.2s	Maximum speed
Cedar	~2.0s	Balanced
Cypress	~3.5s	Maximum accuracy

Switch to Redwood when:

Questions are clear and direct
Speed is critical
High query volume

2. Reduce topK

// Higher topK = slower
topK: 20  → Response time: 2.5s
topK: 10  → Response time: 1.8s
topK: 5   → Response time: 1.2s

Recommendation: Start with 5-7, increase only if accuracy suffers.

3. Use Faster Model

Model	Speed	Quality	Cost
GPT-3.5-turbo	Fast	Good	Low
GPT-4o-mini	Fast	Better	Low
GPT-4o	Medium	Excellent	High
GPT-4	Slow	Excellent	High

For speed: Use GPT-3.5-turbo for simple queries, GPT-4o for complex.

4. Enable Caching

{
  "cache": {
    "enabled": true,
    "ttl": 300,              // 5 minutes
    "keyBy": ["prompt", "agentId"]
  }
}

Impact: 50-100ms for cached responses vs 1-3s for uncached.

5. Optimize Context

// Reduce context size
{
  "topK": 5,               // Fewer documents
  "maxContextTokens": 2000, // Limit context size
  "chunkSize": 300         // Smaller chunks
}

6. Use Streaming

// User sees response immediately
stream: true

// First token: ~500ms
// Complete response: ~2s
// Perceived latency: Much faster

Accuracy Optimization

1. Choose Better RAG Strategy

Cypress > Cedar > Redwood for accuracy.

2. Increase topK

topK: 5   → Accuracy: 85%
topK: 10  → Accuracy: 89%
topK: 20  → Accuracy: 91%

Diminishing returns after topK ~15.

3. Use Better Model

GPT-4o or GPT-4 for highest quality.

4. Improve Instructions

// Detailed system prompt
instructions: `
You are an expert [domain] assistant.

When answering:
1. Always cite specific sources
2. Provide step-by-step explanations
3. Include code examples when relevant
4. Verify facts against documentation
5. Admit uncertainty when appropriate
`

5. Add High-Quality Data Sources

✅ Official documentation ✅ Verified knowledge base ✅ Recent, updated content ❌ Low-quality, outdated content

6. Enable Reranking (Cypress)

Reranking improves precision by 20-30%.

7. Use Private Data Only

configAIUseOnlyPrivateData: true

Prevents hallucination from general knowledge.

Cost Optimization

1. Choose Cost-Effective Model

Model	Cost per 1M Tokens
GPT-3.5-turbo	$0.50
GPT-4o-mini	$0.15
GPT-4o	$5.00
GPT-4	$30.00

Recommendation: GPT-4o-mini for most use cases.

2. Reduce Token Usage

{
  "maxTokens": 300,        // Limit response length
  "topK": 5,               // Fewer documents
  "memoryTurns": 3,        // Less conversation history
  "temperature": 0.3       // More focused (fewer tokens)
}

3. Aggressive Caching

{
  "cache": {
    "enabled": true,
    "ttl": 3600,           // 1 hour (longer cache)
    "fuzzyMatching": true   // Match similar queries
  }
}

4. Use Redwood Strategy

Redwood is cheapest (single LLM call, no reranking).

5. Batch Operations

Process multiple queries together to reduce overhead.

6. Smart Routing

def route_query(query):
    if is_simple(query):
        return "REDWOOD"      # Cheapest
    elif is_complex(query):
        return "CYPRESS"      # Worth the cost
    else:
        return "CEDAR"        # Balanced

Balanced Optimization

The Performance Triangle

        Speed
         /\
        /  \
       /    \
      /      \
     /________\
  Cost      Accuracy

You can optimize 2 of 3:

Speed + Cost: Use Redwood, GPT-3.5-turbo
Speed + Accuracy: Use Cedar, GPT-4o, caching
Cost + Accuracy: Use Cypress, efficient models, batch

Recommended Configurations

High-Volume Support Bot:

Strategy: REDWOOD
Model: gpt-3.5-turbo
topK: 5
Cache: enabled (1 hour)
Goal: Handle 10k+ queries/day cheaply

Technical Documentation:

Strategy: CEDAR
Model: gpt-4o
topK: 10
Cache: enabled (30 min)
Goal: Balance speed and accuracy

Compliance Assistant:

Strategy: CYPRESS
Model: gpt-4o
topK: 10
Reranking: enabled
Goal: Maximum accuracy, cost secondary

Performance Monitoring

Key Metrics Dashboard

Agent: Customer Support
├─ Avg Response Time: 1.8s (target: <2s) ✅
├─ P95 Response Time: 2.4s (target: <3s) ✅
├─ P99 Response Time: 3.1s ⚠️
├─ Cache Hit Rate: 42%
├─ Avg Tokens: 1,523
├─ Cost per Query: $0.0045
└─ Accuracy Score: 89%

Set Performance Targets

{
  "targets": {
    "responseTime": {
      "p50": 1.5,
      "p95": 2.5,
      "p99": 4.0
    },
    "accuracy": 0.85,
    "costPerQuery": 0.01,
    "cacheHitRate": 0.40
  }
}

Alerting

{
  "alerts": {
    "responseTimeSlow": {
      "threshold": 3.0,
      "duration": "5m",
      "notify": "team@company.com"
    },
    "accuracyDrop": {
      "threshold": 0.80,
      "compare": "baseline",
      "notify": "team@company.com"
    }
  }
}

A/B Testing

Compare configurations to find optimal settings:

// Test A: Baseline
const configA = {
  strategyCode: 'CEDAR',
  topK: 10,
  temperature: 0.7
};

// Test B: Optimized for speed
const configB = {
  strategyCode: 'REDWOOD',
  topK: 5,
  temperature: 0.7
};

// Route 50% traffic to each
// Measure: speed, accuracy, cost
// Deploy winner after 1 week

Continuous Optimization

Weekly Review

Check performance metrics
Identify bottlenecks
Test optimizations
Deploy improvements
Measure impact

Monthly Audit

Review all configurations
Benchmark against baselines
Update targets
Plan next optimizations

Tools & Techniques

Performance Profiling

const startTime = Date.now();

const response = await twig.chat.create({
  prompt,
  agentId,
  profile: true  // Enable profiling
});

console.log('Breakdown:', {
  embedding: response.profile.embeddingTime,
  retrieval: response.profile.retrievalTime,
  llm: response.profile.llmTime,
  total: Date.now() - startTime
});

Load Testing

# Using Apache Bench
ab -n 1000 -c 10 -H "Authorization: Bearer KEY" \
  -p query.json https://api.twig.so/api/chat

# Results show:
# - Requests per second
# - Average latency
# - P50, P95, P99

Cache Analysis

const cacheStats = await twig.cache.stats();

console.log('Hit rate:', cacheStats.hitRate);
console.log('Avg savings:', cacheStats.avgTimeSaved);
console.log('Most cached:', cacheStats.topQueries);

Next Steps

Cost Optimization - Reduce expenses
Analytics Dashboard - Monitor metrics
RAG Strategies - Choose optimal strategy
Evaluation Framework - Measure quality

Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/product/monitoring/performance-tuning.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.

Key Takeaways