Performance Tuning

Optimize your AI agents for speed, accuracy, and cost-effectiveness

Key Takeaways

  • Performance Metrics
  • Optimization Tradeoffs
  • Speed Optimization
  • Accuracy Optimization
  • Cost Optimization
  • Balanced Optimization

Performance Metrics

Track these metrics in the Analytics dashboard:

| Metric | Target | How It Is Measured |
|---|---|---|
| p50 Latency | <2s | 50th percentile response time |
| p95 Latency | <4s | 95th percentile (worst-case normal) |
| p99 Latency | <6s | 99th percentile (outliers) |
| Accuracy Rate | >85% | % of responses marked accurate |
| Cost/Query | <$0.01 | Total API costs / query count |
| Cache Hit Rate | >40% | Cached / total queries |

Location: Dashboard → Analytics → Performance tab
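
To make the "How It Is Measured" column concrete, here is a minimal Python sketch that derives the same metrics from raw query logs. It assumes the nearest-rank percentile method; `percentile` and `summarize` are hypothetical helpers, and the Analytics dashboard may compute its percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Multiply before dividing so whole-number percentiles stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies, total_cost_usd, cached_count):
    """Summarize one reporting window: one latency sample per query."""
    n = len(latencies)
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "cost_per_query": total_cost_usd / n,  # total API costs / query count
        "cache_hit_rate": cached_count / n,    # cached / total queries
    }
```

Comparing the returned values against the targets in the table shows which metric is out of range for a given window.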

Optimization Tradeoffs

You cannot maximize all three at once:

Speed · Accuracy · Cost

Pick two:

  • Speed + Cost → Redwood + GPT-3.5-turbo (accuracy: ~72%)
  • Speed + Accuracy → Cedar + GPT-4o + cache (cost: medium)
  • Accuracy + Cost → Cypress + GPT-4o-mini (speed: ~3-5s)

Speed Optimization

1. Choose Faster RAG Strategy

| Strategy | Avg Latency | Best For |
|---|---|---|
| Redwood | ~1.2s | Maximum speed |
| Cedar | ~2.0s | Balanced |
| Cypress | ~3.5s | Maximum accuracy |

Switch to Redwood when:

  • Questions are clear and direct
  • Speed is critical
  • High query volume

2. Reduce topK

// Higher topK = slower
topK: 20  → Response time: 2.5s
topK: 10  → Response time: 1.8s
topK: 5   → Response time: 1.2s

Recommendation: Start with 5-7, increase only if accuracy suffers.
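
The recommendation above amounts to a small search: try the smallest topK first and step up only while accuracy is below target. A minimal sketch in Python, assuming a caller-supplied `measure_accuracy` function (hypothetical; for example, one that runs an evaluation set at a given topK):

```python
def smallest_sufficient_top_k(measure_accuracy, target, candidates=(5, 7, 10, 15, 20)):
    """Return the smallest candidate topK whose measured accuracy meets the target."""
    for top_k in sorted(candidates):
        if measure_accuracy(top_k) >= target:
            return top_k
    # No candidate reached the target; fall back to the largest one tried.
    return max(candidates)


# Illustrative measurements only; replace with results from your own evaluation.
measured = {5: 0.85, 10: 0.89, 20: 0.91}
print(smallest_sufficient_top_k(measured.get, target=0.88, candidates=(5, 10, 20)))
```

Because lower topK values are tried first, the function never pays the latency cost of a larger topK unless the smaller one falls short.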

3. Use Faster Model

| Model | Speed | Quality | Cost |
|---|---|---|---|
| GPT-3.5-turbo | Fast | Good | Low |
| GPT-4o-mini | Fast | Better | Low |
| GPT-4o | Medium | Excellent | High |
| GPT-4 | Slow | Excellent | High |

For speed: Use GPT-3.5-turbo for simple queries, GPT-4o for complex.

4. Enable Caching

{
  "cache": {
    "enabled": true,
    "ttl": 300,              // 5 minutes
    "keyBy": ["prompt", "agentId"]
  }
}

Impact: 50-100ms for cached responses vs 1-3s for uncached.
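
To illustrate what a cache with `ttl` and `keyBy: ["prompt", "agentId"]` does conceptually, here is a minimal in-memory sketch in Python. `TTLCache` is a hypothetical class written for this example; it is not the platform's implementation, which is not described on this page.

```python
import time


class TTLCache:
    """In-memory cache keyed by (prompt, agent_id) whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def get(self, prompt, agent_id):
        key = (prompt, agent_id)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, prompt, agent_id, value):
        self._entries[(prompt, agent_id)] = (self.clock(), value)
```

A hit skips retrieval and the LLM call entirely, which is where the latency difference between cached and uncached responses comes from.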

5. Optimize Context

// Reduce context size
{
  "topK": 5,               // Fewer documents
  "maxContextTokens": 2000, // Limit context size
  "chunkSize": 300         // Smaller chunks
}

6. Use Streaming

// User sees response immediately
stream: true

// First token: ~500ms
// Complete response: ~2s
// Perceived latency: Much faster

Accuracy Optimization

1. Choose Better RAG Strategy

Cypress > Cedar > Redwood for accuracy.

2. Increase topK

topK: 5   → Accuracy: 85%
topK: 10  → Accuracy: 89%
topK: 20  → Accuracy: 91%

Diminishing returns after topK ~15.
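
The diminishing returns are visible in the figures above when expressed as accuracy gained per additional retrieved document. A short worked calculation in Python (`marginal_gains` is a hypothetical helper):

```python
def marginal_gains(points):
    """Accuracy points gained per extra document between consecutive (topK, accuracy_pct) pairs."""
    ordered = sorted(points)
    return [
        (k_prev, k_next, (acc_next - acc_prev) / (k_next - k_prev))
        for (k_prev, acc_prev), (k_next, acc_next) in zip(ordered, ordered[1:])
    ]


for k_prev, k_next, gain in marginal_gains([(5, 85), (10, 89), (20, 91)]):
    print(f"topK {k_prev} -> {k_next}: {gain:.1f} points per extra document")
# topK 5 -> 10: 0.8 points per extra document
# topK 10 -> 20: 0.2 points per extra document
```

Each extra document beyond topK 10 yields a quarter of the gain of the first five added, while still adding latency and tokens.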

3. Use Better Model

GPT-4o or GPT-4 for highest quality.

4. Improve Instructions

// Detailed system prompt
instructions: `
You are an expert [domain] assistant.

When answering:
1. Always cite specific sources
2. Provide step-by-step explanations
3. Include code examples when relevant
4. Verify facts against documentation
5. Admit uncertainty when appropriate
`

5. Add High-Quality Data Sources

✅ Official documentation
✅ Verified knowledge base
✅ Recent, updated content
❌ Low-quality, outdated content

6. Enable Reranking (Cypress)

Reranking improves precision by 20-30%.

7. Use Private Data Only

configAIUseOnlyPrivateData: true

Restricts answers to your own data sources, which prevents hallucinations drawn from the model's general knowledge.

Cost Optimization

1. Choose Cost-Effective Model

| Model | Cost per 1M Tokens |
|---|---|
| GPT-3.5-turbo | $0.50 |
| GPT-4o-mini | $0.15 |
| GPT-4o | $5.00 |
| GPT-4 | $30.00 |

Recommendation: GPT-4o-mini for most use cases.
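
To see how the per-token prices translate into the Cost/Query metric, here is a small Python calculation. It assumes a single flat rate per token taken from the table above (no separate input and output pricing) and an illustrative 1,500 tokens per query; both are assumptions for the example.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICE_PER_MILLION_TOKENS = {
    "gpt-3.5-turbo": 0.50,
    "gpt-4o-mini": 0.15,
    "gpt-4o": 5.00,
    "gpt-4": 30.00,
}


def cost_per_query(model, tokens_per_query):
    """Estimated USD cost of one query at a flat per-token rate."""
    return PRICE_PER_MILLION_TOKENS[model] * tokens_per_query / 1_000_000


def monthly_cost(model, tokens_per_query, queries_per_day, days=30):
    """Estimated USD cost over a period at a constant daily volume."""
    return cost_per_query(model, tokens_per_query) * queries_per_day * days


for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${cost_per_query(model, 1500):.6f} per query")
```

Under these assumptions GPT-4o comes to about $0.0075 per query and GPT-4 to about $0.045, so of the four models only GPT-4 exceeds a $0.01 per-query target.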

2. Reduce Token Usage

{
  "maxTokens": 300,        // Limit response length
  "topK": 5,               // Fewer documents
  "memoryTurns": 3,        // Less conversation history
  "temperature": 0.3       // More focused output (tends to be more concise)
}

3. Aggressive Caching

{
  "cache": {
    "enabled": true,
    "ttl": 3600,           // 1 hour (longer cache)
    "fuzzyMatching": true   // Match similar queries
  }
}
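
This page does not describe how `fuzzyMatching` decides that two queries are similar. As one hypothetical illustration of the idea, the Python sketch below normalizes case, punctuation, and whitespace so that trivially different phrasings share a cache key; real fuzzy matching may use embeddings or other similarity measures.

```python
import re
import string

# Translation table that deletes ASCII punctuation.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalized_key(prompt, agent_id):
    """Build a cache key that ignores case, punctuation, and extra whitespace."""
    text = prompt.lower().translate(_PUNCTUATION)
    text = re.sub(r"\s+", " ", text).strip()
    return (text, agent_id)


a = normalized_key("How do I reset my password?", "agent-1")
b = normalized_key("  how do i reset my   password ", "agent-1")
print(a == b)  # True
```

Looser matching raises the hit rate but also the risk of serving an answer to a question that was only superficially similar, so it is worth checking cached answers against the accuracy target.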

4. Use Redwood Strategy

Redwood is cheapest (single LLM call, no reranking).

5. Batch Operations

Process multiple queries together to reduce overhead.

6. Smart Routing

def route_query(query):
    if is_simple(query):
        return "REDWOOD"      # Cheapest
    elif is_complex(query):
        return "CYPRESS"      # Worth the cost
    else:
        return "CEDAR"        # Balanced
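
The router above leaves `is_simple` and `is_complex` undefined. A self-contained sketch with deliberately naive heuristics follows; the word-count thresholds and the keyword list are assumptions chosen for illustration, and a production router would more likely use a classifier or historical accuracy data.

```python
# Hypothetical keywords that suggest multi-step reasoning is needed.
COMPLEX_HINTS = ("compare", "why", "explain", "difference", "troubleshoot")


def _has_complex_hint(query):
    lowered = query.lower()
    return any(hint in lowered for hint in COMPLEX_HINTS)


def is_simple(query):
    """Short, direct questions with no reasoning keywords."""
    return len(query.split()) <= 8 and not _has_complex_hint(query)


def is_complex(query):
    """Long questions or ones that contain a reasoning keyword."""
    return len(query.split()) > 25 or _has_complex_hint(query)


def route_query(query):
    if is_simple(query):
        return "REDWOOD"  # Cheapest
    elif is_complex(query):
        return "CYPRESS"  # Worth the cost
    else:
        return "CEDAR"    # Balanced


print(route_query("What is the refund policy?"))
print(route_query("Explain the difference between Cedar and Cypress"))
```
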

Balanced Optimization

The Performance Triangle

        Speed
         /\
        /  \
       /    \
      /      \
     /________\
  Cost      Accuracy

You can optimize 2 of 3:

  • Speed + Cost: Use Redwood, GPT-3.5-turbo
  • Speed + Accuracy: Use Cedar, GPT-4o, caching
  • Cost + Accuracy: Use Cypress, efficient models, batch

Example configurations:

High-Volume Support Bot:

Strategy: REDWOOD
Model: gpt-3.5-turbo
topK: 5
Cache: enabled (1 hour)
Goal: Handle 10k+ queries/day cheaply

Technical Documentation:

Strategy: CEDAR
Model: gpt-4o
topK: 10
Cache: enabled (30 min)
Goal: Balance speed and accuracy

Compliance Assistant:

Strategy: CYPRESS
Model: gpt-4o
topK: 10
Reranking: enabled
Goal: Maximum accuracy, cost secondary

Performance Monitoring

Key Metrics Dashboard

Agent: Customer Support
├─ Avg Response Time: 1.8s (target: <2s) ✅
├─ P95 Response Time: 2.4s (target: <3s) ✅
├─ P99 Response Time: 3.1s ⚠️
├─ Cache Hit Rate: 42%
├─ Avg Tokens: 1,523
├─ Cost per Query: $0.0045
└─ Accuracy Score: 89%

Set Performance Targets

{
  "targets": {
    "responseTime": {
      "p50": 1.5,
      "p95": 2.5,
      "p99": 4.0
    },
    "accuracy": 0.85,
    "costPerQuery": 0.01,
    "cacheHitRate": 0.40
  }
}
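
A target is only useful if something compares it with measured values. The Python sketch below does that comparison for the targets above; `missed_targets` and the shape of the `measured` dictionary are hypothetical and simply mirror the target structure.

```python
TARGETS = {
    "responseTime": {"p50": 1.5, "p95": 2.5, "p99": 4.0},
    "accuracy": 0.85,
    "costPerQuery": 0.01,
    "cacheHitRate": 0.40,
}


def missed_targets(measured, targets=TARGETS):
    """Return the names of metrics that fall outside their targets."""
    missed = []
    # Latency and cost are upper bounds; accuracy and cache hit rate are lower bounds.
    for pct, limit in targets["responseTime"].items():
        if measured["responseTime"][pct] > limit:
            missed.append(f"responseTime.{pct}")
    if measured["accuracy"] < targets["accuracy"]:
        missed.append("accuracy")
    if measured["costPerQuery"] > targets["costPerQuery"]:
        missed.append("costPerQuery")
    if measured["cacheHitRate"] < targets["cacheHitRate"]:
        missed.append("cacheHitRate")
    return missed


# Illustrative measurements, not real data.
measured = {
    "responseTime": {"p50": 1.8, "p95": 2.4, "p99": 3.1},
    "accuracy": 0.89,
    "costPerQuery": 0.0045,
    "cacheHitRate": 0.42,
}
print(missed_targets(measured))  # ['responseTime.p50']
```
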

Alerting

{
  "alerts": {
    "responseTimeSlow": {
      "threshold": 3.0,
      "duration": "5m",
      "notify": "team@company.com"
    },
    "accuracyDrop": {
      "threshold": 0.80,
      "compare": "baseline",
      "notify": "team@company.com"
    }
  }
}
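
The exact semantics of `duration` are not spelled out here. One plausible reading is that the alert fires only when the threshold is breached continuously for the whole window, which avoids paging on a single slow request. A Python sketch of that interpretation, with hypothetical helper names:

```python
def parse_duration(text):
    """Convert a duration such as '5m' into seconds (supports s, m, h)."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def should_alert(samples, threshold, duration, now):
    """samples is a list of (timestamp_seconds, value) pairs.

    Fires only if there is at least one sample in the window ending at `now`
    and every sample in that window exceeds the threshold.
    """
    window_start = now - parse_duration(duration)
    in_window = [value for ts, value in samples if window_start <= ts <= now]
    return bool(in_window) and all(value > threshold for value in in_window)


# One response-time sample per minute, all above the 3.0s threshold.
samples = [(0, 3.5), (60, 3.2), (120, 3.4), (180, 3.1), (240, 3.6), (300, 3.3)]
print(should_alert(samples, threshold=3.0, duration="5m", now=300))  # True
```
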

A/B Testing

Compare configurations to find optimal settings:

// Test A: Baseline
const configA = {
  strategyCode: 'CEDAR',
  topK: 10,
  temperature: 0.7
};

// Test B: Optimized for speed
const configB = {
  strategyCode: 'REDWOOD',
  topK: 5,
  temperature: 0.7
};

// Route 50% traffic to each
// Measure: speed, accuracy, cost
// Deploy winner after 1 week
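
The 50% split can be implemented in several ways. A common one, sketched here in Python as an assumption rather than a description of any built-in feature, is to hash a stable user identifier so each user always sees the same variant. The experiment name and `assign_variant` are hypothetical.

```python
import hashlib

CONFIGS = {
    "A": {"strategyCode": "CEDAR", "topK": 10, "temperature": 0.7},   # Baseline
    "B": {"strategyCode": "REDWOOD", "topK": 5, "temperature": 0.7},  # Optimized for speed
}


def assign_variant(user_id, experiment="perf-tuning-v1", split=0.5):
    """Deterministically map a user to 'A' or 'B'; `split` is the share sent to A."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a fraction in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < split else "B"


variant = assign_variant("user-123")
config = CONFIGS[variant]
```

Including the experiment name in the hash keeps assignments independent across experiments, so users placed in variant B of one test are not systematically placed in variant B of the next.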

Continuous Optimization

Weekly Review

  1. Check performance metrics
  2. Identify bottlenecks
  3. Test optimizations
  4. Deploy improvements
  5. Measure impact

Monthly Audit

  1. Review all configurations
  2. Benchmark against baselines
  3. Update targets
  4. Plan next optimizations

Tools & Techniques

Performance Profiling

const startTime = Date.now();

const response = await twig.chat.create({
  prompt,
  agentId,
  profile: true  // Enable profiling
});

console.log('Breakdown:', {
  embedding: response.profile.embeddingTime,
  retrieval: response.profile.retrievalTime,
  llm: response.profile.llmTime,
  total: Date.now() - startTime
});

Load Testing

# Using Apache Bench (-p posts the file as the request body; -T sets its Content-Type)
ab -n 1000 -c 10 -H "Authorization: Bearer KEY" \
  -T application/json -p query.json https://api.twig.so/api/chat

# Results show:
# - Requests per second
# - Average latency
# - P50, P95, P99

Cache Analysis

const cacheStats = await twig.cache.stats();

console.log('Hit rate:', cacheStats.hitRate);
console.log('Avg savings:', cacheStats.avgTimeSaved);
console.log('Most cached:', cacheStats.topQueries);

Last updated January 26, 2026