Measuring Voice AI Agent Quality: CSAT, AHT, Containment, and the New Metrics
Voice AI deployments fail on the wrong metric. Here is the 12-KPI scorecard that actually predicts whether your voice AI is creating value — including the four metrics legacy contact centers don't track.

Key Takeaways
- Containment alone is misleading; CSAT-validated autonomous resolution is the honest version
- Twelve KPIs span four families: resolution, efficiency, quality, and operations
- Four new metrics legacy contact centers don't track: silent containment gap, escalation CSAT, confidence-floor pass rate, and hallucination rate
- Sample 100–200 calls per week for hallucination measurement; the rate drifts upward as the KB ages
- Payback typically lands in months 4–6 for enterprise deployments and months 2–3 for mid-market
- Twig's text-side self-evaluation produces directly comparable confidence-floor metrics for chat and email
Twig is an autonomous AI support platform that triages, self-evaluates, and resolves customer support tickets by integrating with tools like Zendesk, Salesforce, and Intercom. On the text side, every response runs through a confidence scoring loop that produces directly observable quality numbers. On the voice side, the equivalent measurement discipline is harder — and the legacy contact-center KPIs miss it entirely. This post is for the buyer or operator trying to figure out whether their voice AI deployment is actually creating value.
TL;DR: Legacy contact-center KPIs — AHT, ASA, FCR, containment — were designed to measure human-agent productivity, not AI-agent quality. They miss the failure modes that matter for voice AI: silent containment ("the caller gave up"), low-confidence resolutions that look good on paper, and warm-handoff degradation that destroys CSAT for escalated calls. This post lays out the 12-KPI scorecard that actually predicts whether a voice AI deployment is creating value, why containment is the wrong primary metric, and which four new metrics need to be added to the dashboard.
Why legacy KPIs miss the point
The standard contact-center scorecard was built for human-agent performance management:
| Legacy KPI | What it measures | Why it misses for voice AI |
|---|---|---|
| AHT (Average Handle Time) | Length of call | A bad bot can have great AHT by hanging up faster |
| ASA (Average Speed of Answer) | Queue time | No queue for voice AI; metric becomes meaningless |
| FCR (First Call Resolution) | % calls not requiring callback | Hard to attribute when AI handles part, human handles rest |
| CSAT | Customer satisfaction post-call | Right metric, but underused as primary |
| Occupancy | % of agent time on calls | N/A for AI |
| Containment / Deflection | % calls not transferred to human | Conflates "resolved" with "customer gave up" |
The headline failure mode is silent containment: the customer hung up or accepted a wrong answer rather than transferring to a human. Containment counts it as a success. CSAT counts it as a disaster. A scorecard that watches only containment will tell you the deployment is winning while the customer base quietly defects.
The 12-KPI voice AI scorecard
Group the metrics into four families. Watch them all; weight them by what you're actually optimizing for.
Family 1: Resolution
| KPI | Definition | Target |
|---|---|---|
| Autonomous resolution rate | % calls where intended task completed end-to-end without human transfer | 60–75% |
| CSAT-validated containment | Contained calls with post-call CSAT ≥4 of 5 | ≥90% of contained calls |
| Re-contact rate | % callers calling back within 7 days about same issue | <8% |
The third one is the gotcha. If your containment is 70% but 20% of those callers come back within a week, you effectively contained 56% (70% × 0.8), not 70%. Re-contact rate is the lagging indicator that exposes silent containment.
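A minimal sketch of that adjustment, assuming the re-contact rate is measured among contained callers rather than all callers. The function name is illustrative and not taken from any vendor's analytics API.

```python
def effective_containment(containment_rate: float, recontact_rate: float) -> float:
    """Share of all calls that were contained and not followed by a repeat call.

    Both inputs are fractions in [0, 1]. recontact_rate is assumed to be
    measured among contained callers only.
    """
    if not (0.0 <= containment_rate <= 1.0 and 0.0 <= recontact_rate <= 1.0):
        raise ValueError("rates must be fractions between 0 and 1")
    return containment_rate * (1.0 - recontact_rate)


# 70% containment with 20% of contained callers calling back
adjusted = effective_containment(0.70, 0.20)
print(f"{adjusted:.0%}")  # 56%
```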
Family 2: Efficiency
| KPI | Definition | Target |
|---|---|---|
| Average handle time (AHT) | Length of call when AI handles autonomously | 2:30–4:00 typical |
| Time to first token / first audio | Caller end-of-turn → first agent audio | p50 <800ms, p95 <1500ms |
| Cost per autonomous resolution | Total voice AI cost / autonomous resolutions | $0.50–$2.50 typical |
AHT is back on the list, but with a footnote: AHT only matters if autonomous resolution rate is holding. A bot that "improves" AHT by escalating faster has just shifted cost, not removed it.
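The latency and cost rows can be computed from raw per-call data along these lines. This is a sketch under assumptions: the nearest-rank percentile is one of several common definitions, and the sample latencies are made up for illustration.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_autonomous_resolution(total_ai_cost: float, autonomous_resolutions: int) -> float:
    """Total voice AI cost divided by the number of autonomous resolutions."""
    if autonomous_resolutions <= 0:
        raise ValueError("need at least one autonomous resolution")
    return total_ai_cost / autonomous_resolutions


# Hypothetical first-audio latencies in milliseconds
latencies_ms = [620, 700, 710, 740, 760, 790, 820, 900, 1100, 1600]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Targets from the table: p50 < 800 ms and p95 < 1500 ms
meets_latency_target = p50 < 800 and p95 < 1500
```

In this made-up sample the median passes but the tail does not, which is why the table sets a p95 target alongside p50.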
Family 3: Quality
| KPI | Definition | Target |
|---|---|---|
| Confidence-floor pass rate | % responses passing self-evaluation threshold | >85% |
| Hallucination rate | % responses with factual claims unsupported by KB | <1% on grounded queries |
| Escalation CSAT | CSAT for calls handed off to a human | ≥ human-only baseline |
| Handoff payload completeness | % escalations with all required context delivered | 100% |
The escalation CSAT is the under-measured one. A deployment can look healthy in aggregate while quietly destroying CSAT for the 30% of calls that get handed off — which are precisely the moments customers are already frustrated. See our deeper piece on warm handoff payload design.
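Handoff payload completeness can be audited mechanically once the required fields are agreed. The sketch below assumes a hypothetical field list; the real list depends on what your human agents need at pickup.

```python
# Hypothetical required fields; replace with your own handoff contract.
REQUIRED_FIELDS = ("customer_id", "intent", "summary", "actions_attempted")


def is_complete(payload: dict) -> bool:
    """A payload is complete when every required field is present and non-empty."""
    return all(payload.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def payload_completeness(payloads: list) -> float:
    """Fraction of escalations whose handoff payload is complete."""
    if not payloads:
        raise ValueError("no escalations to audit")
    return sum(1 for p in payloads if is_complete(p)) / len(payloads)


escalations = [
    {"customer_id": "c1", "intent": "refund", "summary": "Wants refund", "actions_attempted": "lookup"},
    {"customer_id": "c2", "intent": "billing", "summary": "", "actions_attempted": "lookup"},
]
completeness = payload_completeness(escalations)
```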
Family 4: Operations
| KPI | Definition | Target |
|---|---|---|
| Intent coverage | % call volume mapped to a modeled intent | >85% in 90 days |
| KB freshness | % grounded sources updated in last 90 days | >70% |
Intent coverage is the leading indicator of how much room the deployment has left to grow. KB freshness is the leading indicator of when hallucination rate is about to climb.
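Both operations metrics reduce to simple ratios. The sketch below assumes unmatched calls are logged with an intent of `None` and that each KB source carries a last-updated date; both are assumptions about your logging, not features of any specific product.

```python
from datetime import date, timedelta
from typing import Optional


def intent_coverage(call_intents: list) -> float:
    """Fraction of calls mapped to a modeled intent (None means fallback)."""
    if not call_intents:
        raise ValueError("no calls")
    return sum(1 for intent in call_intents if intent is not None) / len(call_intents)


def kb_freshness(last_updated: list, today: date, window_days: int = 90) -> float:
    """Fraction of grounded sources updated within the last window_days."""
    if not last_updated:
        raise ValueError("no sources")
    cutoff = today - timedelta(days=window_days)
    return sum(1 for d in last_updated if d >= cutoff) / len(last_updated)


intents: list = ["billing", "refund", None, "billing", "reschedule"]
coverage = intent_coverage(intents)

source_dates = [date(2025, 12, 1), date(2025, 11, 1), date(2025, 6, 1), date(2025, 10, 15)]
freshness = kb_freshness(source_dates, today=date(2026, 1, 1))
```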
The four metrics legacy contact centers don't track
These are the additions that turn a generic CX dashboard into a voice-AI-aware one:
1. Silent containment gap
Silent containment gap = Containment rate − CSAT-validated containment
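Express both terms as a share of all calls so the subtraction yields percentage points: a deployment that contains 68% of calls, with 92% of those CSAT-validated, has a gap of 68% − 62.6% ≈ 5.4 points. A widening gap means the bot is winning containment at the expense of customer outcomes. Healthy deployments hold this gap under 10 percentage points.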
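A sketch of the gap calculation, assuming CSAT-validated containment is reported as a share of contained calls (as in the Resolution table) and has to be converted to a share of all calls before subtracting. The function name and the 72% / 71% inputs are illustrative.

```python
def silent_containment_gap(containment_rate: float, validated_share_of_contained: float) -> float:
    """Gap in percentage points between raw and CSAT-validated containment.

    containment_rate: contained calls / all calls.
    validated_share_of_contained: contained calls with CSAT >= 4 / contained calls.
    """
    validated_rate = containment_rate * validated_share_of_contained
    return (containment_rate - validated_rate) * 100


# A deployment with 72% containment where only 71% of contained calls score CSAT >= 4
gap_points = silent_containment_gap(0.72, 0.71)
over_threshold = gap_points > 10  # the 10-point guideline from the text
```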
2. Escalation CSAT
Survey customers whose calls were escalated to a human, separately from customers whose calls the AI resolved. Compare both to the pre-AI human-only baseline. If escalation CSAT is lower than human-only baseline, the handoff is destroying value — usually because the warm-handoff payload isn't being delivered or read.
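Segmenting survey responses by call outcome is a small grouping exercise. This sketch assumes each survey row is an `(outcome, score)` pair and uses a 4.1 human-only baseline purely as an example value.

```python
from statistics import mean


def csat_by_outcome(surveys: list) -> dict:
    """Mean CSAT per call outcome, e.g. 'escalated' vs 'resolved'."""
    buckets: dict = {}
    for outcome, score in surveys:
        buckets.setdefault(outcome, []).append(score)
    return {outcome: mean(scores) for outcome, scores in buckets.items()}


surveys = [
    ("escalated", 4), ("escalated", 3), ("escalated", 5),
    ("resolved", 5), ("resolved", 4),
]
HUMAN_ONLY_BASELINE = 4.1  # example value; use your own pre-AI measurement

scores = csat_by_outcome(surveys)
handoff_losing_value = scores["escalated"] < HUMAN_ONLY_BASELINE
```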
3. Confidence-floor pass rate
The percentage of AI-generated responses that pass self-evaluation on the first try, vs. the percentage that needed re-grounding or triggered escalation. This is a direct quality signal on the model + retrieval stack. A falling pass rate over time means the KB is drifting from the intents the bot is being asked to handle.
Twig produces the equivalent metric on the text side via confidence scoring — the same numerical floor, the same drift detection.
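As a rough illustration, pass rate and a simple drift check might look like the following. The 0.8 floor, the 3-point tolerance, and the weekly figures are placeholders, and this is not a description of how Twig or any voice vendor implements confidence scoring.

```python
CONFIDENCE_FLOOR = 0.8  # placeholder threshold


def pass_rate(confidence_scores: list, floor: float = CONFIDENCE_FLOOR) -> float:
    """Fraction of responses whose first-try confidence meets the floor."""
    if not confidence_scores:
        raise ValueError("no responses")
    return sum(1 for s in confidence_scores if s >= floor) / len(confidence_scores)


def is_drifting(weekly_pass_rates: list, tolerance: float = 0.03) -> bool:
    """True when the latest week sits more than `tolerance` below the first week."""
    return weekly_pass_rates[-1] < weekly_pass_rates[0] - tolerance


this_week = pass_rate([0.92, 0.88, 0.61, 0.95])
drifting = is_drifting([0.91, 0.89, 0.84])
```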
4. Hallucination rate (measured, not assumed)
Sample 100–200 calls per week. Transcribe the AI's responses. Have a human reviewer (or LLM-based grader with human spot-check) score each factual claim against the KB. The hallucination rate is the percentage of claims that are unsupported or false.
Production-grade voice AI runs <1% on grounded queries. Demo-grade systems run 3–8%. The number drifts upward as the KB ages, so the sampling has to be continuous, not one-time at deployment.
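Because the target is below 1%, sample size matters when reading the result. The sketch below computes the rate from graded claims and adds a Wilson score interval; the interval is an addition for illustration, not part of the sampling procedure described above.

```python
import math


def hallucination_rate(unsupported: int, total_claims: int, z: float = 1.96):
    """Return (rate, low, high): point estimate plus a Wilson score interval.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total_claims <= 0 or not 0 <= unsupported <= total_claims:
        raise ValueError("need 0 <= unsupported <= total_claims and total_claims > 0")
    n = total_claims
    p = unsupported / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Example: 6 unsupported claims found among 1,000 graded claims
rate, low, high = hallucination_rate(6, 1000)
```

With these example numbers the point estimate is 0.6%, but the upper bound is above 1%, so a single week's sample may not be enough to confirm the target; pooling several weeks narrows the interval.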
What healthy looks like
A representative scorecard for a healthy mid-market voice AI deployment, month 6:
| KPI | Value |
|---|---|
| Autonomous resolution rate | 68% |
| CSAT-validated containment | 92% of contained |
| Re-contact rate | 6% |
| AHT | 3:20 |
| First-audio latency p50 | 720ms |
| Cost per autonomous resolution | $1.40 |
| Confidence-floor pass rate | 91% |
| Hallucination rate | 0.6% |
| Escalation CSAT | 4.2 / 5 (baseline was 4.1) |
| Handoff payload completeness | 100% |
| Intent coverage | 89% |
| KB freshness | 76% updated last 90 days |
The under-celebrated win is escalation CSAT slightly above human-only baseline. That happens when the AI does the boring CRM-lookup work upfront and hands the human a fully framed problem, so the human can move directly to resolution.
What unhealthy looks like (and how to read it)
The same scorecard, broken:
| KPI | Value | Diagnosis |
|---|---|---|
| Autonomous resolution rate | 72% | Looks good |
| CSAT-validated containment | 71% of contained | Silent containment problem |
| Re-contact rate | 14% | Too high: bot resolves wrong, callers come back |
| AHT | 2:15 | Too fast: bot is hanging up early |
| Confidence-floor pass rate | 67% | Too many low-confidence responses spoken |
| Hallucination rate | 4.2% | KB drift or bad grounding |
| Escalation CSAT | 3.4 / 5 | Handoff broken |
| Intent coverage | 71% | Big chunk of calls hitting fallback |
The fix sequence:
- Raise the confidence floor (causes more escalations short-term, lowers hallucinations)
- Fix the handoff payload delivery (so escalations recover CSAT)
- Audit the KB and re-ground the intents with the highest hallucination rate
- Expand intent coverage from 71% upward to absorb fallback calls
- Re-measure in 30 days
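Reading the scorecard can be automated by comparing each value against the targets given in the four family tables. This is a minimal sketch with hypothetical key names; it covers only the KPIs that have a fixed numeric target.

```python
import operator

# (comparison, target) pairs taken from the target columns earlier in the post
TARGETS = {
    "autonomous_resolution": (operator.ge, 0.60),
    "csat_validated_containment": (operator.ge, 0.90),
    "recontact_rate": (operator.lt, 0.08),
    "confidence_floor_pass_rate": (operator.gt, 0.85),
    "hallucination_rate": (operator.lt, 0.01),
    "intent_coverage": (operator.gt, 0.85),
}


def failing_kpis(scorecard: dict) -> list:
    """Names of KPIs present in the scorecard that miss their target."""
    return [
        name
        for name, (meets, target) in TARGETS.items()
        if name in scorecard and not meets(scorecard[name], target)
    ]


broken = {
    "autonomous_resolution": 0.72,
    "csat_validated_containment": 0.71,
    "recontact_rate": 0.14,
    "confidence_floor_pass_rate": 0.67,
    "hallucination_rate": 0.042,
    "intent_coverage": 0.71,
}
print(failing_kpis(broken))
```

Run against the broken scorecard, everything except autonomous resolution rate is flagged, which matches the diagnosis that the headline number looks good while the rest does not.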
Measurement infrastructure
The dashboard that supports this scorecard requires:
- Per-call event log: every turn timestamped, intent classified, confidence scored, sources cited, actions attempted
- Post-call CSAT survey (SMS, IVR, or app — voice survey at end of call has higher response rate but lower honesty)
- Re-contact tracking keyed on customer ID across channels
- Sampling infrastructure for hallucination grading
- KB version tracking to attribute drift to specific document changes
- Escalation payload audit log to prove handoff completeness
Most of this lives in the voice AI vendor's analytics product (PolyAI Studio, Parloa Analytics, ASAPP MissionControl) plus the CRM (Salesforce, Zendesk, HubSpot) for CSAT and re-contact. Hallucination sampling is usually external — Excel + reviewer at small scale, dedicated tooling at large scale.
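Re-contact tracking across channels is the piece most often assembled by hand from CRM exports. A sketch of the idea, assuming each contact record carries a customer ID, an issue label, a channel, and a timestamp (the `Contact` type is hypothetical):

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Contact:
    customer_id: str
    issue: str
    channel: str  # "voice", "chat", "email", ...
    at: datetime


def recontact_rate(contacts: list, window: timedelta = timedelta(days=7)) -> float:
    """Fraction of contacts followed by another contact from the same customer
    about the same issue, on any channel, within `window`."""
    if not contacts:
        raise ValueError("no contacts")
    by_key = defaultdict(list)
    for c in contacts:
        by_key[(c.customer_id, c.issue)].append(c.at)
    recontacted = 0
    for times in by_key.values():
        times.sort()
        # Count each contact whose next same-issue contact falls inside the window
        recontacted += sum(1 for a, b in zip(times, times[1:]) if b - a <= window)
    return recontacted / len(contacts)


log = [
    Contact("c1", "billing", "voice", datetime(2026, 1, 1)),
    Contact("c1", "billing", "chat", datetime(2026, 1, 4)),
    Contact("c2", "refund", "voice", datetime(2026, 1, 1)),
    Contact("c3", "billing", "voice", datetime(2026, 1, 1)),
    Contact("c3", "billing", "voice", datetime(2026, 1, 20)),
]
rate_7d = recontact_rate(log)
```

Keying on customer ID and issue rather than phone number is what lets a voice call followed by a chat count as a re-contact.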
The cross-channel measurement principle
Voice and text-side AI agents should be measured on the same skeleton. The numbers differ; the structure shouldn't:
- Autonomous resolution rate: voice (call) and text (ticket) measure the same thing
- Confidence-floor pass rate: voice and text run the same self-evaluation architecture
- Hallucination rate: same sampling discipline, different surface
- Escalation CSAT: same metric, different handoff endpoint
A buyer running voice AI on one stack and text-side autonomous resolution on another should expect both to report the same KPI set. Twig produces this set natively for chat, email, and helpdesk channels.
The ROI math, honestly
When a deployment hits the scorecard targets, the typical financial result:
| Quantity | Pre-AI | With voice AI |
|---|---|---|
| Inbound voice calls per month | 100,000 | 100,000 |
| Resolved without human | 0 | 65,000 (autonomous resolution) |
| Human-handled calls | 100,000 | 35,000 |
| Cost per human-handled call | $7.00 | $7.00 |
| Cost per AI-handled call | n/a | $1.40 |
| Monthly cost (humans only) | $700,000 | $245,000 |
| Monthly cost (AI) | $0 | $91,000 |
| Total monthly cost | $700,000 | $336,000 |
| Monthly savings | — | $364,000 |
| CSAT | 4.1 | 4.2 |
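Simple payback on a $400K–$800K voice AI implementation is about 1–2 months at the steady-state savings above; budget 2–4 months at this scale, because savings ramp up as autonomous resolution climbs toward target. Enterprise deployments with custom telephony and CRM integration take longer to break even (4–6 months), but the absolute dollar return is larger.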
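The table's arithmetic can be reproduced in a few lines, which makes it easy to rerun with your own volumes and unit costs. Note the simplifying assumption: this computes payback from steady-state monthly savings and ignores the ramp-up period before the deployment reaches its target resolution rate.

```python
def monthly_cost(total_calls: int, ai_calls: int, human_cost: float, ai_cost: float) -> float:
    """Blended monthly cost: human-handled calls plus AI-handled calls."""
    human_calls = total_calls - ai_calls
    return human_calls * human_cost + ai_calls * ai_cost


def payback_months(implementation_cost: float, monthly_savings: float) -> float:
    """Simple payback: one-time cost divided by steady-state monthly savings."""
    if monthly_savings <= 0:
        raise ValueError("no savings, no payback")
    return implementation_cost / monthly_savings


# Inputs from the table above
pre_ai = monthly_cost(100_000, 0, human_cost=7.00, ai_cost=1.40)
with_ai = monthly_cost(100_000, 65_000, human_cost=7.00, ai_cost=1.40)
savings = pre_ai - with_ai

payback_low = payback_months(400_000, savings)
payback_high = payback_months(800_000, savings)
print(f"savings ${savings:,.0f}/month, payback {payback_low:.1f} to {payback_high:.1f} months")
```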
The takeaway
Voice AI quality is measurable, but not with the metrics most contact centers already have on the dashboard. Replace containment with CSAT-validated autonomous resolution. Add escalation CSAT, silent containment gap, confidence-floor pass rate, and hallucination rate. Sample continuously. Treat the warm handoff as a first-class metric, not a footnote.
The buyers who win at voice AI in 2026 are not the ones with the best demos — they're the ones with the most honest scorecards.