Evaluation Framework
Measure and improve AI agent performance with automated evaluation tests and continuous quality monitoring.
Key Takeaways
- What are Evals?
- Why Evals Matter
- Evaluation Metrics
- Creating Eval Sets
- Running Evaluations
- Evaluation Results
What are Evals?
Evals (evaluations) are automated tests that measure how well your AI agents perform. They help you:
- Track accuracy over time
- Compare agent configurations
- Identify knowledge gaps
- Validate changes before deployment
- Ensure consistent quality
Think of evals as unit tests for AI responses.
Why Evals Matter
Traditional Testing vs AI Evals
Traditional Software:
def test_add():
    assert add(2, 3) == 5  # Deterministic
AI Systems:
def test_agent():
    response = agent.query("What is pricing?")
    # Response varies, need nuanced evaluation
    # Multiple correct answers possible
    # Context and tone matter
Challenges:
- Responses are non-deterministic
- Multiple valid answers exist
- Quality is subjective
- Context-dependent correctness
Evals solve this by using multiple evaluation criteria and techniques.
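As a minimal sketch of the idea, a test can accept several valid phrasings instead of one exact string. The `normalize` and `matches_any` helpers are hypothetical and not part of any Twig API; substring matching here only stands in for the richer scoring described in the metrics below.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())

def matches_any(response: str, acceptable: list[str]) -> bool:
    """Return True if any acceptable phrasing appears in the response."""
    normalized = normalize(response)
    return any(normalize(answer) in normalized for answer in acceptable)

# Hypothetical acceptable phrasings for a pricing question
acceptable = ["$99 per month", "$99/month", "99 dollars monthly"]

print(matches_any("The Pro plan costs $99/month.", acceptable))     # True
print(matches_any("Pricing starts at $49 per month.", acceptable))  # False
```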
Evaluation Metrics
1. Relevance
Measures: How relevant is the response to the query?
Scale: 0.0 (irrelevant) to 1.0 (perfectly relevant)
Example:
Query: "What is the refund policy?"
Response: "We offer 30-day money-back guarantee..."
Relevance: 0.95 ✅
Query: "What is the refund policy?"
Response: "Our support hours are 9am-5pm..."
Relevance: 0.20 ❌
How it's measured:
- Semantic similarity between query and response
- Keyword matching
- Topic alignment
- User feedback (optional)
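To make the keyword-matching signal concrete, here is a rough sketch that scores relevance as the cosine similarity of word counts. This is an assumption for illustration, not the product's scoring method; true semantic similarity would normally use embeddings rather than raw word overlap.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Count lowercase alphanumeric words in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = tokens(a), tokens(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "What is the refund policy?"
on_topic = cosine_similarity(query, "Refund policy: we offer a 30-day money-back guarantee.")
off_topic = cosine_similarity(query, "Our support hours are 9am-5pm.")
print(on_topic > off_topic)  # True
```

A purely lexical score like this gives zero to the first example response above ("We offer 30-day money-back guarantee..."), because it shares no words with the query. That gap is why semantic similarity and topic alignment are listed alongside keyword matching.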
2. Factual Accuracy
Measures: Are the facts in the response correct?
Scale: 0.0 (incorrect) to 1.0 (completely accurate)
Example:
Query: "How many users does the Pro plan include?"
Ground Truth: "Up to 50 users"
Response: "The Pro plan includes up to 50 users."
Accuracy: 1.0 ✅
Response: "The Pro plan includes unlimited users."
Accuracy: 0.0 ❌
How it's measured:
- Comparison with ground truth data
- Fact extraction and verification
- Citation checking
- Entity recognition
3. Completeness
Measures: Does the response cover all aspects of the query?
Scale: 0.0 (incomplete) to 1.0 (comprehensive)
Example:
Query: "How do I integrate with Slack?"
Expected Coverage:
- Prerequisites (API key, Slack workspace admin)
- Installation steps
- Configuration
- Testing
Response covering all: 1.0 ✅
Response covering 2 of 4: 0.5 ⚠️
How it's measured:
- Coverage of expected key points
- Presence of required information
- Depth of explanation
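The coverage calculation can be sketched as the fraction of expected key points that appear in the response. The keyword lists per key point are hypothetical, and keyword presence is a crude proxy that does not capture depth of explanation.

```python
def completeness(response: str, key_points: dict[str, list[str]]) -> float:
    """Fraction of key points for which at least one keyword appears."""
    if not key_points:
        return 0.0
    text = response.lower()
    covered = sum(any(kw in text for kw in kws) for kws in key_points.values())
    return covered / len(key_points)

# Hypothetical keywords for the four expected coverage areas of the Slack query
slack_points = {
    "prerequisites": ["api key", "workspace admin"],
    "installation": ["install"],
    "configuration": ["configure", "configuration"],
    "testing": ["test"],
}

response = "You need an API key. Then install the Slack app from the directory."
print(completeness(response, slack_points))  # 0.5 (2 of 4 points covered)
```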
4. Citation Quality
Measures: Are sources properly cited and relevant?
Scale: 0.0 (poor/missing) to 1.0 (excellent)
Example:
Response with citation:
"Our API supports OAuth 2.0 authentication.
[Source: API Documentation > Authentication]"
Citation Quality: 0.95 ✅
Response without citation:
"Our API supports OAuth 2.0 authentication."
Citation Quality: 0.0 ❌
How it's measured:
- Presence of citations
- Relevance of cited sources
- Accuracy of source references
- Citation formatting
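A minimal sketch of the presence and source checks, assuming citations follow the `[Source: ...]` format shown in the example above. The `known_sources` set is hypothetical, and this covers only two of the four listed signals, so it will not reproduce graded scores such as 0.95.

```python
import re

# Matches citations written as [Source: Some > Path]
CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")

def extract_citations(response: str) -> list[str]:
    """Return the source names cited in the response."""
    return [match.strip() for match in CITATION_PATTERN.findall(response)]

def citation_score(response: str, known_sources: set[str]) -> float:
    """0.0 with no citations, else the share of citations naming a known source."""
    cited = extract_citations(response)
    if not cited:
        return 0.0
    return sum(c in known_sources for c in cited) / len(cited)

known_sources = {"API Documentation > Authentication"}
with_citation = (
    "Our API supports OAuth 2.0 authentication.\n"
    "[Source: API Documentation > Authentication]"
)
without_citation = "Our API supports OAuth 2.0 authentication."

print(citation_score(with_citation, known_sources))     # 1.0
print(citation_score(without_citation, known_sources))  # 0.0
```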
5. Response Time
Measures: How fast is the response generated?
Example:
Target: < 2 seconds (p95)
Actual: 1.8 seconds ✅
Target: < 2 seconds
Actual: 3.5 seconds ❌
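Because the target above is stated as a p95, a single slow response can fail it even when most responses are fast. The sketch below uses the nearest-rank percentile definition, which is one common choice and an assumption here; the latency samples are made up for illustration.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical response times in seconds for ten test cases
latencies = [1.2, 1.4, 1.1, 1.8, 1.3, 1.6, 1.5, 1.7, 1.0, 3.5]

p95 = percentile(latencies, 95)
print(p95, p95 < 2.0)  # 3.5 False
```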
6. Tone Appropriateness
Measures: Does the response match the desired tone?
Example:
For Technical Agent:
"Utilize the authentication endpoint..." ✅
"Just hit the auth thingy..." ❌
For Friendly Agent:
"I'd be happy to help!" ✅
"Authenticate using bearer token." ⚠️
Creating Eval Sets
Eval Set Structure
An eval set consists of test cases:
{
  "evalSet": {
    "name": "Product Knowledge - Pricing",
    "description": "Test agent knowledge of pricing",
    "createdAt": "2024-01-15",
    "testCases": [
      {
        "id": "tc-1",
        "query": "What does the Pro plan cost?",
        "expectedAnswer": "The Pro plan costs $99/month for up to 50 users",
        "acceptableAnswers": [
          "$99 per month",
          "$99/mo for 50 users",
          "99 dollars monthly"
        ],
        "requiredEntities": ["price", "user_limit"],
        "requiredCitations": ["pricing_page"],
        "maxResponseTime": 2000
      }
    ]
  }
}
Best Practices for Test Cases
1. Cover Key Scenarios
✅ Common questions (80% of queries)
✅ Edge cases
✅ Complex multi-part queries
✅ Ambiguous questions
✅ Follow-up questions
2. Include Ground Truth
{
  "query": "What's included in Enterprise?",
  "groundTruth": {
    "features": ["SSO", "API access", "Priority support"],
    "userLimit": "Unlimited",
    "price": "$299/month"
  }
}
3. Define Acceptance Criteria
Required Elements:
- Must mention price
- Must mention user limit
- Must cite pricing page
Acceptable Variations:
- "$99" or "$99/month" or "99 dollars"
- "50 users" or "up to 50 users"
Unacceptable:
- Wrong price
- Missing user limit
- No citation
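The acceptance criteria above can be sketched as a check that returns the reasons a response fails. The field names come from the eval set structure shown earlier; the `check_case` function itself is hypothetical, it treats `maxResponseTime` as milliseconds (an assumption), and it skips `requiredEntities` because that would need entity extraction.

```python
def check_case(response: str, response_ms: int, citations: list[str], case: dict) -> list[str]:
    """Return a list of failure reasons; an empty list means the case passes."""
    failures = []
    text = response.lower()
    if not any(answer.lower() in text for answer in case["acceptableAnswers"]):
        failures.append("no acceptable answer found")
    missing = set(case["requiredCitations"]) - set(citations)
    if missing:
        failures.append(f"missing citations: {sorted(missing)}")
    if response_ms > case["maxResponseTime"]:
        failures.append("response too slow")
    return failures

case = {
    "acceptableAnswers": ["$99 per month", "$99/mo for 50 users", "99 dollars monthly"],
    "requiredCitations": ["pricing_page"],
    "maxResponseTime": 2000,
}

good = check_case("The Pro plan costs $99 per month for up to 50 users.", 1800, ["pricing_page"], case)
bad = check_case("The Pro plan is affordable.", 3500, [], case)
print(good)  # []
print(bad)
```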
Creating Eval Sets from Real Data
Method 1: From Inbox
1. Navigate to Inbox
2. Filter: Highly rated interactions
3. Select representative samples
4. Click "Add to Eval Set"
5. Review and confirm
Method 2: From Analytics
1. Analytics → Common Queries
2. Select top queries
3. Create test cases
4. Add ground truth
5. Save as eval set
Method 3: Manual Creation
1. Evaluations → Create New Eval Set
2. Add test cases one by one
3. Define expected answers
4. Set acceptance criteria
5. Save and run
Running Evaluations
Manual Evaluation
Via UI:
1. Navigate to Evaluations
2. Select eval set
3. Click "Run Evaluation"
4. Wait for completion
5. Review results
Via API:
curl -X POST https://api.twig.so/api/evaluations/run \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evalSetId": "eval-123",
    "agentId": "agent-456",
    "parallel": true
  }'
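The same request can be built from Python with only the standard library. This sketch mirrors the curl call above and constructs the request without sending it; the `build_run_request` helper is hypothetical, and the shape of the response is not documented here, so parsing it is left out.

```python
import json
import urllib.request

def build_run_request(api_key: str, eval_set_id: str, agent_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an evaluation run."""
    payload = {"evalSetId": eval_set_id, "agentId": agent_id, "parallel": True}
    return urllib.request.Request(
        "https://api.twig.so/api/evaluations/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_run_request("YOUR_API_KEY", "eval-123", "agent-456")
print(request.get_method(), request.full_url)

# To actually send it (requires a valid API key):
# with urllib.request.urlopen(request, timeout=30) as resp:
#     result = json.load(resp)
```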
Automated Evaluation
Scheduled Evals:
{
  "schedule": {
    "frequency": "daily",
    "time": "03:00",
    "timezone": "UTC"
  },
  "evalSets": ["eval-123", "eval-124"],
  "agents": ["agent-456"],
  "notifyOn": ["degradation", "failure"]
}
CI/CD Integration:
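# .github/workflows/eval.yml
name: Run Evals on PR
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run Evaluations
        id: evals
        run: |
          result=$(curl -sf -X POST https://api.twig.so/api/evaluations/run \
            -H "Authorization: Bearer ${{ secrets.TWIG_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{"evalSetId": "eval-123", "agentId": "agent-456"}')
          # Assumes the response includes overallScore on a 0-100 scale
          echo "accuracy=$(echo "$result" | jq -r '.overallScore | floor')" >> "$GITHUB_OUTPUT"
      - name: Check Results
        run: |
          # Fail if accuracy < 85%
          if [ "${{ steps.evals.outputs.accuracy }}" -lt 85 ]; then
            exit 1
          fi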
Regression Testing
Before Deployment:
1. Run eval set on current agent (baseline)
2. Make configuration changes
3. Run same eval set on modified agent
4. Compare results
5. Deploy only if metrics maintain/improve
Example:
Baseline (Current):
- Relevance: 0.89
- Accuracy: 0.92
- Response Time: 1.8s
After Config Change:
- Relevance: 0.91 ✅ (+2%)
- Accuracy: 0.88 ❌ (-4%)
- Response Time: 1.5s ✅ (+17% faster)
Decision: Don't deploy (accuracy regression)
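The comparison step can be sketched as a function that lists which metrics got worse. The `find_regressions` helper and the 0.01 tolerance are assumptions for illustration; the numbers are the baseline and post-change values from the example above.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    higher_is_better: dict[str, bool],
    tolerance: float = 0.01,
) -> list[str]:
    """Return the metrics where the candidate is worse than baseline by more than tolerance."""
    regressions = []
    for name, base in baseline.items():
        delta = candidate[name] - base
        # For metrics like latency, an increase is the bad direction
        worsened_by = -delta if higher_is_better[name] else delta
        if worsened_by > tolerance:
            regressions.append(name)
    return regressions

baseline = {"relevance": 0.89, "accuracy": 0.92, "response_time_s": 1.8}
candidate = {"relevance": 0.91, "accuracy": 0.88, "response_time_s": 1.5}
direction = {"relevance": True, "accuracy": True, "response_time_s": False}

regressions = find_regressions(baseline, candidate, direction)
print(regressions)  # ['accuracy']
print("deploy" if not regressions else "don't deploy")
```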
Evaluation Results
Results Dashboard
Key Metrics Display:
Overall Score: 87/100
Breakdown:
├─ Relevance: 0.91 (91%) ✅
├─ Accuracy: 0.85 (85%) ⚠️
├─ Completeness: 0.89 (89%) ✅
├─ Citations: 0.82 (82%) ⚠️
└─ Avg Response: 2.1s ⚠️
Pass Rate: 42/50 (84%)
Per-Test-Case Results
{
  "testCase": {
    "id": "tc-1",
    "query": "What does the Pro plan cost?",
    "response": "The Pro plan costs $99/month for up to 50 users.",
    "metrics": {
      "relevance": 0.95,
      "accuracy": 1.0,
      "completeness": 0.90,
      "citationQuality": 0.85,
      "responseTime": 1.8
    },
    "citations": [
      {
        "source": "Pricing Page",
        "relevance": 0.95
      }
    ],
    "status": "PASS"
  }
}
Failure Analysis
Failed Test Example:
{
  "testCase": "tc-15",
  "query": "How do I export data?",
  "expected": "Go to Settings > Data > Export",
  "actual": "You can manage your data in the settings panel.",
  "failureReason": "INCOMPLETE",
  "details": {
    "missingInfo": ["specific location", "export button"],
    "relevance": 0.65,
    "accuracy": 0.70,
    "completeness": 0.40
  },
  "suggestedFix": "Add data export documentation or improve retrieval"
}
Advanced Evaluation Techniques
Human-in-the-Loop Evals
Combine automated + human review:
1. Automated eval runs first (fast, cheap)
2. Flag uncertain cases (confidence < 0.8)
3. Human reviewers assess flagged cases
4. Human labels improve future automation
Example:
Automated: 80% confidence it's correct
→ Flag for human review
Human: Confirms it's actually incorrect
→ Update ground truth
→ Retrain evaluation model
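The flagging step in this workflow can be sketched as a simple split on the confidence threshold. The `triage` function and the sample confidence values are hypothetical; the 0.8 cutoff is the one named in step 2 above.

```python
def triage(results: list[dict], threshold: float = 0.8) -> tuple[list[str], list[str]]:
    """Split results into auto-accepted IDs and IDs flagged for human review."""
    accepted, flagged = [], []
    for result in results:
        target = flagged if result["confidence"] < threshold else accepted
        target.append(result["id"])
    return accepted, flagged

# Hypothetical automated eval outputs
results = [
    {"id": "tc-1", "confidence": 0.95},
    {"id": "tc-15", "confidence": 0.62},
    {"id": "tc-22", "confidence": 0.80},
]

accepted, flagged = triage(results)
print(accepted)  # ['tc-1', 'tc-22']
print(flagged)   # ['tc-15']
```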
A/B Testing with Evals
Compare two agent versions:
Agent A (Redwood):
- Avg Relevance: 0.88
- Avg Speed: 1.2s
- Cost per 1k: $0.36
Agent B (Cedar):
- Avg Relevance: 0.92 (+5%)
- Avg Speed: 2.1s (75% slower)
- Cost per 1k: $0.56 (+56%)
Decision: Deploy Cedar (relevance improvement worth cost)
Continuous Evaluation
Monitor production in real time. In this example, a sampleRate of 0.1 evaluates 10% of queries, and an alert fires when a metric falls below its threshold (relevance < 0.80, accuracy < 0.85):
{
  "continuousEval": {
    "enabled": true,
    "sampleRate": 0.1,
    "metrics": ["relevance", "accuracy"],
    "alertThresholds": {
      "relevance": 0.80,
      "accuracy": 0.85
    },
    "notificationChannel": "slack"
  }
}
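A rough sketch of how sampling and threshold alerts could work together. Hashing the query ID is one way to sample deterministically and is an assumption here, as are the `should_sample` and `breached` helper names; the rate and thresholds match the configuration above.

```python
import hashlib

def should_sample(query_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of queries by hashing the ID."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < sample_rate

def breached(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return metrics whose score is below its alert threshold (missing scores count as 0)."""
    return [name for name, limit in thresholds.items() if scores.get(name, 0.0) < limit]

thresholds = {"relevance": 0.80, "accuracy": 0.85}
alerts = breached({"relevance": 0.83, "accuracy": 0.81}, thresholds)
print(alerts)  # ['accuracy']

# Roughly 10% of 10,000 hypothetical query IDs are selected
sampled = sum(should_sample(f"q-{i}", 0.1) for i in range(10_000))
print(sampled)
```

Hashing rather than calling a random generator means the same query is always either in or out of the sample, which makes alerts reproducible.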
Comparative Evaluation
Compare against competitors or baselines:
Query: "What is OAuth?"
Your Agent:
- Response quality: 0.90
- Response time: 1.8s
ChatGPT:
- Response quality: 0.88
- Response time: 2.3s
Benchmark:
Your agent outperforms ✅
Evaluation Best Practices
1. Representative Test Sets
✅ Cover common query types (80% coverage)
✅ Include edge cases (15%)
✅ Test failure modes (5%)
✅ Update regularly with new patterns
❌ Don't test only happy path
2. Multiple Metrics
✅ Use 3-5 complementary metrics
✅ Weight metrics by importance
✅ Track trends over time
❌ Don't rely on single metric
3. Regular Cadence
✅ Daily automated evals
✅ Weekly human review
✅ Before/after every change
❌ Don't eval only at launch
4. Actionable Results
✅ Identify specific failures
✅ Suggest improvements
✅ Track progress on fixes
❌ Don't just collect scores
5. Version Control
✅ Track eval set versions
✅ Document changes
✅ Maintain baselines
❌ Don't modify tests without tracking
Integration with Development Workflow
Development Cycle
1. Develop agent config
↓
2. Run evals locally
↓
3. Fix issues
↓
4. Create PR
↓
5. Automated evals in CI
↓
6. Review results
↓
7. Deploy to staging
↓
8. Run full eval suite
↓
9. Deploy to production
↓
10. Monitor continuous evals
Pre-Commit Hook
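#!/bin/bash
# .git/hooks/pre-commit
# Run quick eval set
echo "Running evaluations..."
result=$(curl -s -X POST https://api.twig.so/api/evaluations/run \
  -H "Authorization: Bearer $TWIG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"evalSetId": "quick-eval", "agentId": "agent-123"}')
# Round down so a fractional score does not break the integer comparison
score=$(echo "$result" | jq -r '.overallScore | floor')
# Treat a missing score (failed request or unexpected response) as a failure
if [ -z "$score" ] || [ "$score" -lt 85 ]; then
  echo "❌ Eval score too low: $score"
  echo "Fix issues before committing"
  exit 1
fi
echo "✅ Evals passed: $score"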
Metrics Tracking Over Time
Trend Analysis
Week 1: Relevance: 0.85 ━━━━━━━━━━━━━━━━━━━━━━
Week 2: Relevance: 0.87 ━━━━━━━━━━━━━━━━━━━━━━━━
Week 3: Relevance: 0.89 ━━━━━━━━━━━━━━━━━━━━━━━━━━
Week 4: Relevance: 0.91 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Trend: +0.06 over 4 weeks (about +7% relative) ✅
Degradation Alerts
Current Score: 0.82
Baseline: 0.90
Threshold: 0.85
Alert: Score is 0.08 (about 9%) below baseline and under the 0.85 threshold!
Possible causes:
- New data source with poor quality
- Model change
- Configuration drift
- Data source sync failure
Action: Investigate and rollback if needed
Cost-Benefit Analysis
Evaluation Costs
Time Investment:
- Initial setup: 4-8 hours
- Test case creation: 1 hour per 10 cases
- Maintenance: 2 hours/week
Monetary Costs:
- API calls for evals: $0.01 per test case
- 100 test cases × daily = $30/month
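The monthly figure follows from simple arithmetic, sketched here so it can be rerun for other eval set sizes or schedules. The `monthly_eval_cost` function is hypothetical, and the 30-day month and $0.01 per-case price are taken from the numbers above.

```python
def monthly_eval_cost(
    test_cases: int,
    runs_per_day: int,
    cost_per_case: float = 0.01,
    days: int = 30,
) -> float:
    """Estimated monthly API cost in dollars for scheduled eval runs."""
    return round(test_cases * runs_per_day * days * cost_per_case, 2)

print(monthly_eval_cost(100, 1))  # 30.0
print(monthly_eval_cost(250, 2))  # 150.0
```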
ROI:
- Catch issues before users: Priceless
- Prevent quality regressions: High value
- Enable confident deployments: Essential
Example Eval Sets
Basic Product Knowledge
name: "Product Knowledge - Basic"
testCases:
  - query: "What is [Product]?"
    expected_topics: ["description", "key_features"]
  - query: "How much does it cost?"
    required_entities: ["price", "plan_name"]
    required_citation: true
Technical Documentation
name: "Technical - API"
testCases:
  - query: "How do I authenticate?"
    required_elements:
      - "API key"
      - "Authorization header"
      - "Bearer token"
    must_include_code_example: true
Customer Support
name: "Support - Common Issues"
testCases:
  - query: "I can't log in"
    response_must_include:
      - troubleshooting_steps
      - escalation_path
    tone: "empathetic"
Next Steps
- Performance Tuning - Optimize based on eval results
- Analytics Dashboard - Track production metrics
- Inbox & Training - Improve from real interactions
- Agent Configuration - Adjust based on findings
Last updated January 25, 2026


