Cost Optimization

Reduce AI operational costs while maintaining quality through smart configuration and usage patterns.

Cost Components

Per-query cost breakdown:

Component	Cost Factor	Per 1K Queries	Optimization
LLM API	Tokens (in+out)	$0.20-$1.50	Model, max_tokens, temperature
Embeddings	Query embedding	$0.01	Cache queries
Vector Search	Pinecone queries	$0.05	Cache results, lower top_k
Reranking	Cross-encoder (Cypress)	$0.06	Use Cedar/Redwood
Ingestion	Doc processing (one-time)	Variable	Incremental syncs only

Typical cost range: $0.26-$0.90 per 1,000 queries (depending on strategy and model)

Cost Breakdown Example

Per 1,000 Queries:

Redwood (Cheapest)

Embeddings:        $0.01
Vector Search:     $0.05
LLM (GPT-4o-mini): $0.20
─────────────────────────
Total:             $0.26

Cedar (Medium)

Embeddings:        $0.01
Rewriting:         $0.10
Vector Search:     $0.05
LLM (GPT-4o-mini): $0.25
─────────────────────────
Total:             $0.41 (+58%)

Cypress (Highest)

Embeddings:        $0.01
Query Expansion:   $0.15
Vector Search:     $0.08 (higher volume)
Reranking:         $0.06
Rewriting:         $0.10
LLM (GPT-4o):      $0.50
─────────────────────────
Total:             $0.90 (+246%)
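
The totals and percentage increases above follow directly from the component costs. The sketch below recomputes them from the per-1,000-query figures in the breakdowns; the object layout and the function names `totalPer1k` and `percentOver` are illustrative and not part of any platform API.

```javascript
// Per-1,000-query component costs, copied from the breakdowns above.
const strategies = {
  REDWOOD: { embeddings: 0.01, vectorSearch: 0.05, llm: 0.20 },
  CEDAR: { embeddings: 0.01, rewriting: 0.10, vectorSearch: 0.05, llm: 0.25 },
  CYPRESS: {
    embeddings: 0.01, queryExpansion: 0.15, vectorSearch: 0.08,
    reranking: 0.06, rewriting: 0.10, llm: 0.50,
  },
};

// Sum the components in integer cents to avoid floating-point drift.
function totalPer1k(components) {
  const cents = Object.values(components)
    .reduce((sum, cost) => sum + Math.round(cost * 100), 0);
  return cents / 100;
}

// Percentage increase of `cost` over `baseline`, rounded to a whole percent.
function percentOver(baseline, cost) {
  return Math.round(((cost - baseline) / baseline) * 100);
}

const redwood = totalPer1k(strategies.REDWOOD);
for (const [name, components] of Object.entries(strategies)) {
  const total = totalPer1k(components);
  console.log(`${name}: $${total.toFixed(2)} (+${percentOver(redwood, total)}%)`);
}
```

Keeping the components in one structure makes it easy to see which line item (here, the LLM call) dominates each strategy's total.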

Optimization Strategies

1. Model Selection

Cost-Quality Matrix:

GPT-4:           $$$$  ⭐⭐⭐⭐⭐
GPT-4o:          $$$   ⭐⭐⭐⭐⭐
GPT-4o-mini:     $     ⭐⭐⭐⭐
GPT-3.5-turbo:   $     ⭐⭐⭐

Recommendation:

High-volume, simple: GPT-3.5-turbo or GPT-4o-mini
Complex, critical: GPT-4o
Research/analysis: GPT-4

2. Strategy Selection

Annual Cost for 100k queries:

Redwood + GPT-3.5-turbo:    $26
Cedar + GPT-4o-mini:        $41
Cypress + GPT-4o:           $90

Choose based on accuracy requirements.

3. Aggressive Caching

{
  "cache": {
    "enabled": true,
    "ttl": 3600,             // 1 hour (longer = more savings)
    "fuzzyMatching": true,   // Match similar queries
    "minSimilarity": 0.95    // 95% similarity threshold
  }
}
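
This page does not describe how fuzzy matching works internally. One plausible reading is that a cached answer is reused when a new query's embedding is close enough to the embedding of a cached query. The sketch below assumes that reading: `FuzzyCache`, its methods, and the choice of cosine similarity are all illustrative assumptions, and embeddings are passed in as plain number arrays.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory cache mirroring the ttl / minSimilarity settings.
class FuzzyCache {
  constructor({ ttl = 3600, minSimilarity = 0.95 } = {}) {
    this.ttlMs = ttl * 1000; // the config ttl is given in seconds
    this.minSimilarity = minSimilarity;
    this.entries = []; // each entry: { embedding, response, storedAt }
  }

  set(embedding, response, now = Date.now()) {
    this.entries.push({ embedding, response, storedAt: now });
  }

  // Returns the best-matching unexpired response, or null on a miss.
  get(embedding, now = Date.now()) {
    // Drop expired entries before searching.
    this.entries = this.entries.filter((e) => now - e.storedAt < this.ttlMs);
    let best = null;
    let bestScore = this.minSimilarity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.response : null;
  }
}
```

A lower `minSimilarity` raises the hit rate but also the chance of returning an answer to a subtly different question, so the threshold is a quality knob as much as a cost knob.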

Impact:

Without cache: 100k queries × $0.41 per 1K = $41.00
With 40% hit rate: 60k queries × $0.41 per 1K = $24.60
Savings: $16.40/year (40% reduction)
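
The same arithmetic as a function, treating $0.41 as the Cedar cost per 1,000 queries from the breakdown earlier on this page. `costWithCache` is an illustrative name, and the model treats cache hits as free, which ignores the (usually small) cost of running the cache itself.

```javascript
// Cost of serving `queries` requests when a fraction `hitRate` is answered
// from cache. Cache hits are treated as free, which is a simplification.
function costWithCache(queries, costPer1k, hitRate) {
  const uncached = queries * (1 - hitRate);
  return (uncached / 1000) * costPer1k;
}

const baseline = costWithCache(100_000, 0.41, 0);
const cached = costWithCache(100_000, 0.41, 0.4);
const savedFraction = (baseline - cached) / baseline;

console.log(`Without cache: $${baseline.toFixed(2)}`);
console.log(`With 40% hit rate: $${cached.toFixed(2)}`);
console.log(`Reduction: ${(savedFraction * 100).toFixed(0)}%`);
```

Under this model the relative saving always equals the hit rate, so raising the hit rate is the only lever that matters for cache-related savings.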

4. Reduce Token Usage

Limit Response Length:

maxTokens: 200  // Was 500

Reduce Context:

topK: 5         // Was 10

Shorter Memory:

memoryTurns: 3  // Was 10

Impact:

Before: 2,500 tokens/query × $0.002 per 1K tokens = $0.005/query
After:  1,200 tokens/query × $0.002 per 1K tokens = $0.0024/query
Savings: 52% per query
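
A small sketch of the calculation above. It assumes the $0.002 figure is a price per 1,000 tokens (the only reading under which the numbers work out) and that input and output tokens are billed at the same rate; the function names are illustrative.

```javascript
// Token cost of one query, given a price per 1,000 tokens.
function tokenCostPerQuery(tokensPerQuery, pricePer1kTokens) {
  return (tokensPerQuery / 1000) * pricePer1kTokens;
}

// Fractional saving when moving from `before` to `after` tokens per query.
function tokenSavings(before, after) {
  return (before - after) / before;
}

console.log(tokenCostPerQuery(2500, 0.002).toFixed(4)); // cost before
console.log(tokenCostPerQuery(1200, 0.002).toFixed(4)); // cost after
console.log(`${(tokenSavings(2500, 1200) * 100).toFixed(0)}% saved`);
```

Because the price per token is constant here, the saving depends only on the token counts, not on the model's price.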

5. Smart Routing

Route queries to appropriate strategy:

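def is_simple(query):
    # Placeholder heuristic (an assumption): treat short queries as simple.
    # Replace with your own classifier.
    return len(query.split()) <= 8


def route_query(query, history):
    """Pick the cheapest strategy likely to answer the query well."""
    # Simple, standalone queries → Cheap strategy
    if is_simple(query) and not history:
        return "REDWOOD"

    # Conversational (has history) → Medium cost
    if history:
        return "CEDAR"

    # Complex, standalone → Worth the cost
    return "CYPRESS"

# Result: Optimize cost-quality trade-off per query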

6. Batch Processing

// Instead of real-time for non-urgent
const queue = createQueue({
  batchSize: 100,
  batchTimeout: 60000  // 1 minute
});

// Process in batches
// Reduce API overhead
// Lower overall cost
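
The `createQueue` call above is shown without its implementation. As a rough illustration of the idea, the sketch below buffers items and hands them to a handler when either the batch is full or the timeout elapses. `createBatchQueue`, the `handler` option, and the `pending` property are hypothetical names for this example and are not the platform's API.

```javascript
// Hypothetical batching queue: buffers items and passes them to `handler`
// when batchSize is reached or batchTimeout milliseconds have elapsed.
function createBatchQueue({ batchSize, batchTimeout, handler }) {
  let buffer = [];
  let timer = null;

  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    handler(batch);
  }

  function add(item) {
    buffer.push(item);
    if (buffer.length >= batchSize) {
      flush();
    } else if (timer === null) {
      // Start the timeout when the first item of a new batch arrives.
      timer = setTimeout(flush, batchTimeout);
    }
  }

  return { add, flush, get pending() { return buffer.length; } };
}
```

The timeout bounds how long a non-urgent request can wait, so `batchTimeout` should be set from the longest delay users of that request type will tolerate.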

Cost Monitoring

Usage Dashboard

Current Month:
├─ Queries: 45,230
├─ Tokens: 68M
├─ Cost: $1,245
├─ Avg Cost/Query: $0.0275
├─ Projected (Month): $1,750
└─ Budget: $2,000 ✅

Set Budgets

{
  "budget": {
    "monthly": 2000,         // $2,000/month
    "alerts": {
      "80percent": true,     // Alert at $1,600
      "90percent": true,     // Alert at $1,800
      "exceeded": true       // Alert if over
    },
    "hardLimit": true,       // Stop at budget
    "limitBehavior": "QUEUE" // Queue requests if limit hit
  }
}
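
To make the thresholds concrete, the sketch below evaluates a spend figure against a config shaped like the one above. It is only an illustration of the intended behavior: `checkBudget`, the returned `{ alerts, action }` shape, and the `"ALLOW"` value are assumptions, not documented platform behavior.

```javascript
// Evaluate current spend against a budget config shaped like the example.
function checkBudget(spend, budget) {
  const ratio = spend / budget.monthly;
  const alerts = [];
  if (budget.alerts["80percent"] && ratio >= 0.8) alerts.push("80percent");
  if (budget.alerts["90percent"] && ratio >= 0.9) alerts.push("90percent");
  if (budget.alerts.exceeded && ratio > 1) alerts.push("exceeded");

  // With hardLimit on, apply limitBehavior once the budget is reached.
  // "ALLOW" is a placeholder meaning "process the request normally".
  const action = budget.hardLimit && ratio >= 1 ? budget.limitBehavior : "ALLOW";
  return { alerts, action };
}

const budget = {
  monthly: 2000,
  alerts: { "80percent": true, "90percent": true, exceeded: true },
  hardLimit: true,
  limitBehavior: "QUEUE",
};

console.log(checkBudget(1245, budget)); // spend from the usage dashboard
console.log(checkBudget(1850, budget)); // past both percentage thresholds
```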

Cost by Component

January 2024 Breakdown:
├─ LLM Calls:      $1,200 (60%)
├─ Embeddings:     $50 (2.5%)
├─ Vector Search:  $150 (7.5%)
├─ Reranking:      $100 (5%)
├─ Data Processing: $500 (25%)
└─ Total:          $2,000

Cost-Saving Tactics

Tactic 1: Tiered Response Quality

// Pick a strategy and model tier for a query.
// isCritical and isStandard are application-defined classifiers.
function selectTier(query) {
  // Critical queries: High quality
  if (isCritical(query)) {
    return { strategyCode: 'CYPRESS', model: 'gpt-4o' };
  }

  // Standard queries: Balanced
  if (isStandard(query)) {
    return { strategyCode: 'CEDAR', model: 'gpt-4o-mini' };
  }

  // Simple queries: Fast & cheap
  return { strategyCode: 'REDWOOD', model: 'gpt-3.5-turbo' };
}

Impact: 40-60% cost reduction vs using highest quality for all.
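
The size of the reduction depends on how traffic splits across tiers. The sketch below computes a blended cost from the per-1,000-query strategy totals given earlier on this page. The 60/30/10 mix is an assumed example, `blendedCostPer1k` is an illustrative name, and differences in model choice within a strategy are ignored.

```javascript
// Per-1,000-query totals from the cost breakdown earlier on this page.
const COST_PER_1K = { REDWOOD: 0.26, CEDAR: 0.41, CYPRESS: 0.90 };

// Weighted average cost per 1,000 queries for a traffic mix whose
// shares sum to 1.
function blendedCostPer1k(mix) {
  return Object.entries(mix)
    .reduce((sum, [strategy, share]) => sum + share * COST_PER_1K[strategy], 0);
}

// Assumed traffic split: mostly simple queries, few complex ones.
const mix = { REDWOOD: 0.6, CEDAR: 0.3, CYPRESS: 0.1 };
const blended = blendedCostPer1k(mix);
const reduction = 1 - blended / COST_PER_1K.CYPRESS;

console.log(`Blended: $${blended.toFixed(3)} per 1K queries`);
console.log(`Reduction vs all-Cypress: ${(reduction * 100).toFixed(0)}%`);
```

With this mix the blended cost is about $0.369 per 1,000 queries, roughly 59% below sending everything through Cypress, which sits at the upper end of the range quoted above.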

Tactic 2: Query Deduplication

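// Detect duplicate/similar queries.
// hashQuery, seenRecently, and getCachedResponse are application-defined helpers.
const DEDUP_WINDOW_MS = 30 * 60 * 1000; // 30 minutes

function answerIfDuplicate(prompt) {
  const queryHash = hashQuery(prompt);

  if (seenRecently(queryHash, DEDUP_WINDOW_MS)) {
    return getCachedResponse(queryHash);
  }
  return null; // Not a recent duplicate; run the full pipeline
}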
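
One way the "seen recently" check could be backed is an in-memory map keyed by a normalized form of the prompt. The sketch below is an assumption about a reasonable implementation, not the platform's own: `normalizeQuery` and `RecentQueries` are hypothetical names, and it only catches exact matches after normalization, not semantically similar queries.

```javascript
// Normalize a prompt so trivially different spellings share one key.
function normalizeQuery(prompt) {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// Hypothetical in-memory store of recent answers, keyed by normalized prompt.
class RecentQueries {
  constructor(windowMs = 30 * 60 * 1000) {
    this.windowMs = windowMs;
    this.seen = new Map(); // key -> { response, at }
  }

  remember(prompt, response, now = Date.now()) {
    this.seen.set(normalizeQuery(prompt), { response, at: now });
  }

  // Returns the stored response if the prompt was seen within the window.
  lookup(prompt, now = Date.now()) {
    const key = normalizeQuery(prompt);
    const hit = this.seen.get(key);
    if (!hit) return undefined;
    if (now - hit.at > this.windowMs) {
      this.seen.delete(key); // expired, so evict it
      return undefined;
    }
    return hit.response;
  }
}
```

A single-process map like this is lost on restart and is not shared between instances; a shared store would be needed for a multi-instance deployment.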

Tactic 3: Peak/Off-Peak Pricing

// Use cheaper strategy during high volume (09:00-17:59 local time)
const currentHour = new Date().getHours();
const isPeakHours = currentHour >= 9 && currentHour <= 17;

const strategy = isPeakHours ? 'REDWOOD' : 'CEDAR';

Tactic 4: Lazy Loading

// Load context progressively
{
  "topK": 3,              // Start with 3 docs
  "expandIfNeeded": true,  // Load more if confidence low
  "confidenceThreshold": 0.85
}
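
The settings above imply a retrieve-then-expand loop. The sketch below shows one way such a loop could work. It is an assumption about the behavior, not the platform's implementation: `retrieve` and `scoreConfidence` are caller-supplied placeholders, `maxTopK` and the doubling step are invented for the example, and the functions are synchronous only to keep the sketch short (a real retriever would normally be asynchronous).

```javascript
// Retrieve a few documents first and widen the search only while the
// confidence score stays below the threshold.
function retrieveProgressively(query, options) {
  const {
    topK, expandIfNeeded, confidenceThreshold,
    maxTopK = 12,              // assumed upper bound on expansion
    retrieve, scoreConfidence, // caller-supplied placeholder functions
  } = options;

  let k = topK;
  let docs = retrieve(query, k);
  let confidence = scoreConfidence(query, docs);

  while (expandIfNeeded && confidence < confidenceThreshold && k < maxTopK) {
    k = Math.min(k * 2, maxTopK); // double the context, capped at maxTopK
    docs = retrieve(query, k);
    confidence = scoreConfidence(query, docs);
  }
  return { docs, confidence, topKUsed: k };
}
```

Most queries stop at the first, cheapest retrieval, so the extra vector-search calls are paid only for the queries that need more context.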

Real-World Examples

Case Study: SaaS Company

Before Optimization:

Queries: 100k/month
Strategy: Cypress for all
Model: GPT-4
Cost: $9,000/month

After Optimization:

Strategy Mix:
- 60% Redwood (simple queries)
- 30% Cedar (conversational)
- 10% Cypress (complex)

Model Mix:
- 70% GPT-3.5-turbo
- 30% GPT-4o

Cache Hit Rate: 45%

New Cost: $2,700/month
Savings: $6,300/month (70% reduction)
Accuracy: Maintained at 87%

Case Study: E-commerce

Optimization:

Cached product queries (60% hit rate)
Redwood for FAQ (70% of queries)
Cedar for complex questions (30%)
GPT-3.5-turbo for product info
GPT-4o for technical support

Results:

Cost: $0.008/query (from $0.025)
68% cost reduction
Response time: 1.3s (from 2.1s)
Accuracy: 86% (from 89%, acceptable trade-off)

Cost Analysis Tools

Cost Calculator

function estimateMonthlyCost(params) {
  const {
    queriesPerDay,
    avgTokensPerQuery,
    modelCostPer1MTokens,
    cacheHitRate,
    strategyOverhead
  } = params;
  
  const uncachedQueries = queriesPerDay * (1 - cacheHitRate);
  const tokensPerDay = uncachedQueries * avgTokensPerQuery;
  const costPerDay = (tokensPerDay / 1000000) * modelCostPer1MTokens;
  const monthlyCost = costPerDay * 30 * (1 + strategyOverhead);
  
  return monthlyCost;
}

// Example
estimateMonthlyCost({
  queriesPerDay: 3000,
  avgTokensPerQuery: 2000,
  modelCostPer1MTokens: 0.50,  // GPT-3.5-turbo
  cacheHitRate: 0.40,
  strategyOverhead: 0.0         // Redwood
});
// Result: ~$54/month

ROI Calculator

const savings = {
  customerSupportTime: 120,      // hours saved/month
  hourlyRate: 50,
  totalSaved: 120 * 50           // $6,000
};

const costs = {
  twigPlatform: 500,             // Subscription
  apiUsage: 300,                 // API costs
  totalCost: 800
};

const roi = (savings.totalSaved - costs.totalCost) / costs.totalCost;
// roi = 6.5, i.e. 650% ROI ($5,200 net benefit per month)

Best Practices

1. Start Cheap, Scale Up

✅ Begin with Redwood + GPT-3.5-turbo
✅ Monitor accuracy
✅ Upgrade only if needed
❌ Don't start with most expensive

2. Cache Aggressively

✅ Enable caching
✅ Long TTL for stable content
✅ Fuzzy matching
❌ Don't cache time-sensitive data

3. Monitor Continuously

✅ Track cost trends
✅ Set budget alerts
✅ Review monthly
❌ Don't ignore cost creep

4. Optimize Data Processing

✅ Incremental syncs
✅ Process only changes
✅ Schedule during off-peak
❌ Don't reprocess everything

Troubleshooting

Unexpected High Costs

Investigate:

Check query volume (unexpected spike?)
Review token usage (responses too long?)
Check cache hit rate (caching working?)
Verify strategy mix (using expensive strategies?)
Audit model usage (using GPT-4 too much?)

Budget Exceeded

Immediate Actions:

Enable hard budget limit
Switch to cheaper strategies
Reduce max tokens
Increase cache TTL
Queue non-urgent requests

Next Steps

Performance Tuning - Optimize speed
Analytics Dashboard - Monitor usage
RAG Strategies - Choose cost-effective strategy
Rate Limits - Manage usage

Key Takeaways