Cost Optimization
Reduce AI operational costs while maintaining quality through smart configuration and usage patterns.
Key Takeaways
- Cost Components
- Cost Breakdown Example
- Optimization Strategies
- Cost Monitoring
- Cost-Saving Tactics
- Real-World Examples
Cost Components
Per-query cost breakdown:
| Component | Cost Factor | Per 1K Queries | Optimization |
|---|---|---|---|
| LLM API | Tokens (in+out) | $0.20-$1.50 | Model choice, max_tokens, context size |
| Embeddings | Query embedding | $0.01 | Cache queries |
| Vector Search | Pinecone queries | $0.05 | Cache results, lower top_k |
| Reranking | Cross-encoder (Cypress) | $0.06 | Use Cedar/Redwood |
| Ingestion | Doc processing (one-time) | Variable | Incremental syncs only |
Typical cost range: $0.26-$0.90 per 1,000 queries (depending on strategy and model)
Cost Breakdown Example
Per 1,000 Queries:
Redwood (Cheapest)
Embeddings: $0.01
Vector Search: $0.05
LLM (GPT-4o-mini): $0.20
─────────────────────────
Total: $0.26
Cedar (Medium)
Embeddings: $0.01
Rewriting: $0.10
Vector Search: $0.05
LLM (GPT-4o-mini): $0.25
─────────────────────────
Total: $0.41 (+58%)
Cypress (Highest)
Embeddings: $0.01
Query Expansion: $0.15
Vector Search: $0.08 (higher volume)
Reranking: $0.06
Rewriting: $0.10
LLM (GPT-4o): $0.50
─────────────────────────
Total: $0.90 (+246%)
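The totals and percentages above can be reproduced with a few lines of arithmetic. The sketch below copies the per-1K component figures from this breakdown; the object layout and function names (`STRATEGY_COMPONENTS`, `totalPer1K`, `percentOver`) are illustrative assumptions, not part of any platform API.

```javascript
// Per-1K-query component costs in USD, copied from the breakdown above.
const STRATEGY_COMPONENTS = {
  REDWOOD: { embeddings: 0.01, vectorSearch: 0.05, llm: 0.20 },
  CEDAR: { embeddings: 0.01, rewriting: 0.10, vectorSearch: 0.05, llm: 0.25 },
  CYPRESS: {
    embeddings: 0.01, queryExpansion: 0.15, vectorSearch: 0.08,
    reranking: 0.06, rewriting: 0.10, llm: 0.50,
  },
};

// Sum a strategy's components in integer cents to avoid floating-point drift.
function totalPer1K(components) {
  const cents = Object.values(components)
    .reduce((sum, usd) => sum + Math.round(usd * 100), 0);
  return cents / 100;
}

// Percentage increase of `cost` over `baseline`, rounded to a whole percent.
function percentOver(cost, baseline) {
  return Math.round(((cost - baseline) / baseline) * 100);
}

const redwood = totalPer1K(STRATEGY_COMPONENTS.REDWOOD);
for (const [name, parts] of Object.entries(STRATEGY_COMPONENTS)) {
  const total = totalPer1K(parts);
  console.log(`${name}: $${total.toFixed(2)} per 1K (+${percentOver(total, redwood)}%)`);
}
```

Swapping in your own component prices shows quickly which line item dominates a strategy's cost.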
Optimization Strategies
1. Model Selection
Cost-Quality Matrix:
GPT-4: $$$$ ⭐⭐⭐⭐⭐
GPT-4o: $$$ ⭐⭐⭐⭐⭐
GPT-4o-mini: $ ⭐⭐⭐⭐
GPT-3.5-turbo: $ ⭐⭐⭐
Recommendation:
- High-volume, simple: GPT-3.5-turbo or GPT-4o-mini
- Complex, critical: GPT-4o
- Research/analysis: GPT-4
2. Strategy Selection
Annual Cost for 100k queries:
Redwood + GPT-3.5-turbo: $26
Cedar + GPT-4o-mini: $41
Cypress + GPT-4o: $90
Choose based on accuracy requirements.
3. Aggressive Caching
{
  "cache": {
    "enabled": true,
    "ttl": 3600,            // 1 hour (longer = more savings)
    "fuzzyMatching": true,  // Match similar queries
    "minSimilarity": 0.95   // 95% similarity threshold
  }
}
Impact:
Without cache: 100k queries × $0.41 per 1K = $41.00
With 40% hit rate: 60k queries × $0.41 per 1K = $24.60
Savings: $16.40/year (40% reduction)
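To try other volumes or hit rates, the same estimate can be written as a small function. This is a sketch that uses the per-1,000-query Cedar total from the cost breakdown ($0.41) and assumes a cache hit costs nothing, which ignores the cache's own storage and lookup overhead. The name `cacheSavings` is illustrative.

```javascript
// Estimate spend with and without a response cache.
// costPer1K: USD per 1,000 uncached queries.
// hitRate: fraction of queries answered from the cache (0 to 1).
function cacheSavings({ queries, costPer1K, hitRate }) {
  // Convert a query count to USD, rounded to the nearest cent.
  const toCost = (n) => Math.round((n / 1000) * costPer1K * 100) / 100;
  const withoutCache = toCost(queries);
  const withCache = toCost(queries * (1 - hitRate));
  return {
    withoutCache,
    withCache,
    saved: Math.round((withoutCache - withCache) * 100) / 100,
  };
}

console.log(cacheSavings({ queries: 100000, costPer1K: 0.41, hitRate: 0.4 }));
```

Because only uncached queries are billed in this model, the percentage saved always equals the hit rate.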
4. Reduce Token Usage
Limit Response Length:
maxTokens: 200 // Was 500
Reduce Context:
topK: 5 // Was 10
Shorter Memory:
memoryTurns: 3 // Was 10
Impact:
Before: 2,500 tokens/query × $0.002 per 1K tokens = $0.005
After: 1,200 tokens/query × $0.002 per 1K tokens = $0.0024
Savings: 52% per query
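The calculation above, written out as code. It treats the $0.002 figure as a price per 1,000 tokens, since that is the reading under which the arithmetic works out. The function names are illustrative.

```javascript
// Cost of one query in USD, given its token count and a price per 1K tokens.
function tokenCost(tokensPerQuery, usdPer1KTokens) {
  return (tokensPerQuery / 1000) * usdPer1KTokens;
}

// Whole-percent saving when the per-query cost drops from `before` to `after`.
function percentSaved(before, after) {
  return Math.round((1 - after / before) * 100);
}

const before = tokenCost(2500, 0.002);
const after = tokenCost(1200, 0.002);
console.log(`Before: $${before.toFixed(4)}, after: $${after.toFixed(4)}, ` +
  `saved: ${percentSaved(before, after)}%`);
```

With a single flat token price, the saving depends only on the ratio of token counts, so the same 52% applies whichever model is used.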
5. Smart Routing
Route queries to appropriate strategy:
def route_query(query, history):
    # is_simple() is a placeholder for your own query-complexity check
    # Simple queries → Cheap strategy
    if is_simple(query) and not history:
        return "REDWOOD"
    # Conversational → Medium cost
    elif history:
        return "CEDAR"
    # Complex → Worth the cost
    else:
        return "CYPRESS"

# Result: Optimize cost-quality trade-off per query
6. Batch Processing
// Queue non-urgent requests instead of handling them in real time
const queue = createQueue({
  batchSize: 100,
  batchTimeout: 60000 // 1 minute
});
// Processing in batches reduces API overhead and lowers overall cost
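For readers who want to see what a size-or-timeout batcher does internally, here is a minimal sketch. It is not the implementation behind `createQueue`; `createBatcher` and its `onFlush` callback are hypothetical names chosen for illustration.

```javascript
// Collect items and hand them to `onFlush` either when `batchSize` items
// have accumulated or when `batchTimeout` ms have passed since the first
// pending item, whichever comes first.
function createBatcher({ batchSize, batchTimeout, onFlush }) {
  let pending = [];
  let timer = null;

  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    if (pending.length === 0) return;
    const batch = pending;
    pending = [];
    onFlush(batch);
  }

  function add(item) {
    pending.push(item);
    if (pending.length >= batchSize) {
      flush(); // Size threshold reached: send immediately.
    } else if (timer === null) {
      timer = setTimeout(flush, batchTimeout); // Start the wait window.
    }
  }

  return { add, flush };
}
```

A caller would create one batcher per downstream API, call `add(request)` for each non-urgent request, and call `flush()` on shutdown so nothing is left pending.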
Cost Monitoring
Usage Dashboard
Current Month:
├─ Queries: 45,230
├─ Tokens: 68M
├─ Cost: $1,245
├─ Avg Cost/Query: $0.0275
├─ Projected (Month): $1,750
└─ Budget: $2,000 ✅
Set Budgets
{
  "budget": {
    "monthly": 2000,           // $2,000/month
    "alerts": {
      "80percent": true,       // Alert at $1,600
      "90percent": true,       // Alert at $1,800
      "exceeded": true         // Alert if over
    },
    "hardLimit": true,         // Stop at budget
    "limitBehavior": "QUEUE"   // Queue requests if limit hit
  }
}
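The following sketch shows one way the alert thresholds in this configuration could be evaluated against current spend. The exact semantics are assumptions: alerts fire at 80% and 90% of the monthly budget and when spend goes over it, and the hard limit applies once spend reaches the budget. `evaluateBudget` and the `'ALLOW'` label are hypothetical, not platform API.

```javascript
// Budget config mirroring the example above.
const budget = {
  monthly: 2000,
  alerts: { '80percent': true, '90percent': true, exceeded: true },
  hardLimit: true,
  limitBehavior: 'QUEUE',
};

// Return which enabled alerts fire for `spend` and what happens to new requests.
function evaluateBudget(config, spend) {
  const used = spend / config.monthly;
  const triggered = [];
  if (config.alerts['80percent'] && used >= 0.8) triggered.push('80percent');
  if (config.alerts['90percent'] && used >= 0.9) triggered.push('90percent');
  if (config.alerts.exceeded && used > 1) triggered.push('exceeded');
  // Assumption: the hard limit kicks in once spend reaches the budget.
  // 'ALLOW' is a made-up label meaning no limit is applied.
  const action = config.hardLimit && used >= 1 ? config.limitBehavior : 'ALLOW';
  return { percentUsed: Math.round(used * 100), alerts: triggered, action };
}

console.log(evaluateBudget(budget, 1245)); // current spend from the dashboard
console.log(evaluateBudget(budget, 1750)); // projected month-end spend
```

Running it on the dashboard figures shows that current spend triggers nothing, while the projected month-end spend would cross the 80% threshold.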
Cost by Component
January 2024 Breakdown:
├─ LLM Calls: $1,200 (60%)
├─ Embeddings: $50 (2.5%)
├─ Vector Search: $150 (7.5%)
├─ Reranking: $100 (5%)
├─ Data Processing: $500 (25%)
└─ Total: $2,000
Cost-Saving Tactics
Tactic 1: Tiered Response Quality
// selectTier, isCritical, and isStandard are placeholders for your own logic
function selectTier(query) {
  // Critical queries: High quality
  if (isCritical(query)) {
    return {
      strategyCode: 'CYPRESS',
      model: 'gpt-4o'
    };
  }
  // Standard queries: Balanced
  if (isStandard(query)) {
    return {
      strategyCode: 'CEDAR',
      model: 'gpt-4o-mini'
    };
  }
  // Simple queries: Fast & cheap
  return {
    strategyCode: 'REDWOOD',
    model: 'gpt-3.5-turbo'
  };
}
Impact: 40-60% cost reduction vs using highest quality for all.
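A rough way to sanity-check that range is to compute the blended cost of a traffic mix from the per-1K totals in the cost breakdown. The sketch below borrows the 60/30/10 strategy mix from the SaaS case study later on this page. It ignores per-tier model swaps such as GPT-3.5-turbo for Redwood, and the function name is an assumption.

```javascript
// Per-1K-query totals (USD) from the cost breakdown.
const COST_PER_1K = { REDWOOD: 0.26, CEDAR: 0.41, CYPRESS: 0.90 };

// Weighted average cost per 1K queries for a traffic mix whose shares sum to 1.
function blendedCostPer1K(mix, costs = COST_PER_1K) {
  const totalShare = Object.values(mix).reduce((a, b) => a + b, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`Mix shares must sum to 1, got ${totalShare}`);
  }
  return Object.entries(mix)
    .reduce((sum, [strategy, share]) => sum + share * costs[strategy], 0);
}

const tiered = blendedCostPer1K({ REDWOOD: 0.6, CEDAR: 0.3, CYPRESS: 0.1 });
const reduction = 1 - tiered / COST_PER_1K.CYPRESS;
console.log(`$${tiered.toFixed(3)} per 1K, ` +
  `${Math.round(reduction * 100)}% below all-Cypress`);
```

Under these assumptions the mix comes to about $0.369 per 1K queries, roughly 59% below routing everything through Cypress.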
Tactic 2: Query Deduplication
// Detect duplicate/similar queries seen in the last 30 minutes
const THIRTY_MINUTES_MS = 30 * 60 * 1000;
const queryHash = hashQuery(prompt);
if (seenRecently(queryHash, THIRTY_MINUTES_MS)) {
  return cachedResponse;
}
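The snippet above leaves `hashQuery` and the recency check undefined. One possible shape for them is sketched below, using Node's built-in `crypto` module and an in-memory `Map`. The class name `RecentResponseCache` and the normalization rules are assumptions, and exact hashing only catches near-identical wording, not paraphrases.

```javascript
const crypto = require('crypto');

// Normalize case and whitespace so trivially different prompts share a hash.
function hashQuery(prompt) {
  const normalized = prompt.trim().toLowerCase().replace(/\s+/g, ' ');
  return crypto.createHash('sha256').update(normalized).digest('hex');
}

// Remembers responses for `windowMs` milliseconds, keyed by prompt hash.
class RecentResponseCache {
  constructor(windowMs, now = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now; // Injectable clock, which makes expiry easy to test.
    this.entries = new Map(); // hash -> { response, storedAt }
  }

  // Return the cached response, or undefined if missing or expired.
  get(prompt) {
    const key = hashQuery(prompt);
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.windowMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.response;
  }

  set(prompt, response) {
    this.entries.set(hashQuery(prompt), { response, storedAt: this.now() });
  }
}
```

A request handler would call `get(prompt)` first and only invoke the model, followed by `set(prompt, response)`, on a miss.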
Tactic 3: Peak/Off-Peak Pricing
// Use cheaper strategy during high volume
const isPeakHours = getCurrentHour() >= 9 && getCurrentHour() <= 17;
const strategy = isPeakHours ? 'REDWOOD' : 'CEDAR';
Tactic 4: Lazy Loading
// Load context progressively
{
  "topK": 3,                    // Start with 3 docs
  "expandIfNeeded": true,       // Load more if confidence low
  "confidenceThreshold": 0.85
}
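How the expansion behaves at runtime is not spelled out here, so the sketch below is one plausible reading: retrieve a small set of documents first and widen the search only while confidence stays under the threshold. The `search` callback, the `step` and `maxTopK` options, and the shape of the result object are all hypothetical.

```javascript
// `search(query, k)` is a hypothetical retrieval function that resolves to
// { docs: [...], confidence: number between 0 and 1 }.
async function retrieveProgressively(search, query, opts = {}) {
  const { topK = 3, maxTopK = 10, step = 3, confidenceThreshold = 0.85 } = opts;
  let k = topK;
  let result = await search(query, k);
  // Widen the search only while confidence is low and there is room to grow.
  while (result.confidence < confidenceThreshold && k < maxTopK) {
    k = Math.min(k + step, maxTopK);
    result = await search(query, k);
  }
  return { ...result, topK: k };
}
```

Each expansion costs an extra retrieval call, so this only saves money when most queries are answered confidently on the first pass.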
Real-World Examples
Case Study: SaaS Company
Before Optimization:
Queries: 100k/month
Strategy: Cypress for all
Model: GPT-4
Cost: $9,000/month
After Optimization:
Strategy Mix:
- 60% Redwood (simple queries)
- 30% Cedar (conversational)
- 10% Cypress (complex)
Model Mix:
- 70% GPT-3.5-turbo
- 30% GPT-4o
Cache Hit Rate: 45%
New Cost: $2,700/month
Savings: $6,300/month (70% reduction)
Accuracy: Maintained at 87%
Case Study: E-commerce
Optimization:
- Cached product queries (60% hit rate)
- Redwood for FAQ (70% of queries)
- Cedar for complex questions (30%)
- GPT-3.5-turbo for product info
- GPT-4o for technical support
Results:
- Cost: $0.008/query (from $0.025)
- 68% cost reduction
- Response time: 1.3s (from 2.1s)
- Accuracy: 86% (from 89%, acceptable trade-off)
Cost Analysis Tools
Cost Calculator
function estimateMonthlyCost(params) {
  const {
    queriesPerDay,
    avgTokensPerQuery,
    modelCostPer1MTokens,
    cacheHitRate,
    strategyOverhead
  } = params;

  const uncachedQueries = queriesPerDay * (1 - cacheHitRate);
  const tokensPerDay = uncachedQueries * avgTokensPerQuery;
  const costPerDay = (tokensPerDay / 1000000) * modelCostPer1MTokens;
  const monthlyCost = costPerDay * 30 * (1 + strategyOverhead);
  return monthlyCost;
}

// Example
estimateMonthlyCost({
  queriesPerDay: 3000,
  avgTokensPerQuery: 2000,
  modelCostPer1MTokens: 0.50, // GPT-3.5-turbo
  cacheHitRate: 0.40,
  strategyOverhead: 0.0 // Redwood
});
// Result: ~$54/month
ROI Calculator
const savings = {
  customerSupportTime: 120, // hours saved/month
  hourlyRate: 50,
  totalSaved: 120 * 50 // $6,000
};
const costs = {
  twigPlatform: 500, // Subscription
  apiUsage: 300, // API costs
  totalCost: 800
};
const roi = (savings.totalSaved - costs.totalCost) / costs.totalCost;
// roi = 6.5, i.e. 650% ROI ($5,200 net benefit)
Best Practices
1. Start Cheap, Scale Up
- ✅ Begin with Redwood + GPT-3.5-turbo
- ✅ Monitor accuracy
- ✅ Upgrade only if needed
- ❌ Don't start with the most expensive option
2. Cache Aggressively
- ✅ Enable caching
- ✅ Long TTL for stable content
- ✅ Fuzzy matching
- ❌ Don't cache time-sensitive data
3. Monitor Continuously
- ✅ Track cost trends
- ✅ Set budget alerts
- ✅ Review monthly
- ❌ Don't ignore cost creep
4. Optimize Data Processing
- ✅ Incremental syncs
- ✅ Process only changes
- ✅ Schedule during off-peak
- ❌ Don't reprocess everything
Troubleshooting
Unexpected High Costs
Investigate:
- Check query volume (unexpected spike?)
- Review token usage (responses too long?)
- Check cache hit rate (caching working?)
- Verify strategy mix (using expensive strategies?)
- Audit model usage (using GPT-4 too much?)
Budget Exceeded
Immediate Actions:
- Enable hard budget limit
- Switch to cheaper strategies
- Reduce max tokens
- Increase cache TTL
- Queue non-urgent requests
Next Steps
- Performance Tuning - Optimize speed
- Analytics Dashboard - Monitor usage
- RAG Strategies - Choose cost-effective strategy
- Rate Limits - Manage usage
Last updated January 26, 2026


