Measuring Voice AI Agent Quality: CSAT, AHT, Containment, and the New Metrics

Voice AI deployments fail on the wrong metric. Here is the 12-KPI scorecard that actually predicts whether your voice AI is creating value — including the four metrics legacy contact centers don't track.

Chandan Maruthi · CEO, Twig AI

CEO of Twig AI. Previously at H2O.ai and Zyme.

May 21, 2026 · Updated June 10, 2026 · 10 min read

Key Takeaways

✓ Containment alone is misleading; CSAT-validated autonomous resolution is the honest version
✓ Twelve KPIs span four families: resolution, efficiency, quality, and operations
✓ Four new metrics legacy contact centers don't track: silent containment gap, escalation CSAT, confidence-floor pass rate, hallucination rate
✓ Sample 100–200 calls per week for hallucination measurement; the rate drifts upward as the KB ages
✓ ROI typically lands in months 4–6 for enterprise, months 2–3 for mid-market
✓ Twig's text-side self-evaluation produces directly comparable confidence-floor metrics for chat and email

Twig is an autonomous AI support platform that triages, self-evaluates, and resolves customer support tickets by integrating with tools like Zendesk, Salesforce, and Intercom. On the text side, every response runs through a confidence scoring loop that produces directly observable quality numbers. On the voice side, the equivalent measurement discipline is harder — and the legacy contact-center KPIs miss it entirely. This post is for the buyer or operator trying to figure out whether their voice AI deployment is actually creating value.

TL;DR: Legacy contact-center KPIs — AHT, ASA, FCR, containment — were designed to measure human-agent productivity, not AI-agent quality. They miss the failure modes that matter for voice AI: silent containment ("the caller gave up"), low-confidence resolutions that look good on paper, and warm-handoff degradation that destroys CSAT for escalated calls. This post lays out the 12-KPI scorecard that actually predicts whether a voice AI deployment is creating value, why containment is the wrong primary metric, and which four new metrics need to be added to the dashboard.

Why legacy KPIs miss the point

The standard contact-center scorecard was built for human-agent performance management:

Legacy KPI	What it measures	Why it misses for voice AI
AHT (Average Handle Time)	Length of call	A bad bot can have great AHT by hanging up faster
ASA (Average Speed of Answer)	Queue time	No queue for voice AI; metric becomes meaningless
FCR (First Call Resolution)	% calls not requiring callback	Hard to attribute when AI handles part, human handles rest
CSAT	Customer satisfaction post-call	Right metric, but underused as primary
Occupancy	% of agent time on calls	N/A for AI
Containment / Deflection	% calls not transferred to human	Conflates "resolved" with "customer gave up"

The headline failure mode is silent containment: the customer hung up or accepted a wrong answer rather than transferring to a human. Containment counts it as a success. CSAT counts it as a disaster. A scorecard that watches only containment will tell you the deployment is winning while the customer base quietly defects.

The 12-KPI voice AI scorecard

Group the metrics into four families. Watch them all; weight them by what you're actually optimizing for.

Family 1: Resolution

KPI	Definition	Target
Autonomous resolution rate	% calls where intended task completed end-to-end without human transfer	60–75%
CSAT-validated containment	Contained calls with post-call CSAT ≥4 of 5	≥90% of contained calls
Re-contact rate	% callers calling back within 7 days about same issue	<8%

The third one is the gotcha. If your containment is 70% but 20% of those callers come back within a week, you effectively contained only 56% (70% × 80%). Re-contact rate is the lagging indicator that exposes silent containment.
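The re-contact adjustment is a single multiplication. A minimal sketch, assuming the re-contact rate is measured among contained callers only; the function name is illustrative and not taken from any vendor tool:

```python
def effective_containment(containment_rate: float, recontact_rate: float) -> float:
    """Share of all calls that were contained and did not come back.

    Both inputs are fractions in [0, 1]. Assumption of this sketch:
    recontact_rate is the fraction of *contained* callers who call back
    about the same issue within the tracking window.
    """
    if not (0.0 <= containment_rate <= 1.0 and 0.0 <= recontact_rate <= 1.0):
        raise ValueError("rates must be fractions between 0 and 1")
    return containment_rate * (1.0 - recontact_rate)


# 70% containment with 20% of contained callers returning
print(f"{effective_containment(0.70, 0.20):.0%}")  # 56%
```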

Family 2: Efficiency

KPI	Definition	Target
Average handle time (AHT)	Length of call when AI handles autonomously	2:30–4:00 typical
Time to first token / first audio	Caller end-of-turn → first agent audio	p50 <800ms, p95 <1500ms
Cost per autonomous resolution	Total voice AI cost / autonomous resolutions	$0.50–$2.50 typical

AHT is back on the list, but with a footnote: AHT only matters if autonomous resolution rate is holding. A bot that "improves" AHT by escalating faster has just shifted cost, not removed it.
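The first-audio latency targets in the table are percentile targets, so they have to be computed from per-turn samples rather than from an average. A rough sketch using the nearest-rank method; the helper names and the sample values are made up for illustration, and a vendor dashboard may use a different percentile definition:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(latencies_ms, p50_max=800, p95_max=1500):
    """Compare first-audio latency against the p50 / p95 targets from the table."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    return {"p50_ms": p50, "p95_ms": p95, "ok": p50 < p50_max and p95 < p95_max}


# Hypothetical per-turn latencies in milliseconds
turns = [600, 650, 700, 720, 750, 780, 900, 1100, 1300, 1900]
print(latency_report(turns))
```

A deployment can pass the p50 target and still fail p95, which is why both are on the scorecard.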

Family 3: Quality

KPI	Definition	Target
Confidence-floor pass rate	% responses passing self-evaluation threshold	>85%
Hallucination rate	% responses with factual claims unsupported by KB	<1% on grounded queries
Escalation CSAT	CSAT for calls handed off to a human	≥ human-only baseline
Handoff payload completeness	% escalations with all required context delivered	100%

The escalation CSAT is the under-measured one. A deployment can look healthy in aggregate while quietly destroying CSAT for the 30% of calls that get handed off — which are precisely the moments customers are already frustrated. See our deeper piece on warm handoff payload design.

Family 4: Operations

KPI	Definition	Target
Intent coverage	% call volume mapped to a modeled intent	>85% in 90 days
KB freshness	% grounded sources updated in last 90 days	>70%

Intent coverage is the leading indicator of how much room the deployment has left to grow. KB freshness is the leading indicator of when hallucination rate is about to climb.
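Both operations KPIs are simple ratios once the inputs exist. A minimal sketch, assuming the KB exposes a last-updated date per source and the call log carries a classified intent per call with a single fallback label; the names `kb_freshness`, `intent_coverage`, and the `"fallback"` label are hypothetical:

```python
from datetime import date, timedelta


def kb_freshness(last_updated, today, window_days=90):
    """Fraction of grounded sources updated within the last `window_days`."""
    if not last_updated:
        raise ValueError("need at least one source")
    cutoff = today - timedelta(days=window_days)
    fresh = sum(1 for updated in last_updated if updated >= cutoff)
    return fresh / len(last_updated)


def intent_coverage(call_intents, fallback_label="fallback"):
    """Fraction of calls mapped to a modeled intent rather than the fallback."""
    if not call_intents:
        raise ValueError("need at least one call")
    modeled = sum(1 for intent in call_intents if intent != fallback_label)
    return modeled / len(call_intents)


sources = [date(2026, 6, 1), date(2026, 4, 15), date(2026, 1, 2), date(2025, 11, 20)]
print(kb_freshness(sources, today=date(2026, 6, 10)))  # 0.5
print(intent_coverage(["billing", "billing", "fallback", "reset_password"]))  # 0.75
```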

The four metrics legacy contact centers don't track

These are the additions that turn a generic CX dashboard into a voice-AI-aware one:

1. Silent containment gap

Silent containment gap = Containment rate − CSAT-validated containment rate (both measured as a percentage of all calls)

A widening gap means the bot is winning containment at the expense of customer outcome. Healthy deployments hold this gap under 10 percentage points.
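The gap only makes sense when both rates share a denominator. A sketch that works from raw call counts, under the simplifying assumption that every contained call has a survey score (in practice survey response is partial and the validated count has to be estimated); the function name is illustrative:

```python
def silent_containment_gap(total_calls, contained_calls, contained_with_good_csat):
    """Gap in percentage points between raw and CSAT-validated containment.

    Both rates are taken over total_calls so they can be subtracted directly.
    """
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    if not (0 <= contained_with_good_csat <= contained_calls <= total_calls):
        raise ValueError("counts must satisfy validated <= contained <= total")
    containment_rate = contained_calls / total_calls
    validated_rate = contained_with_good_csat / total_calls
    return (containment_rate - validated_rate) * 100


# 68% containment, about 92% of contained calls validated
print(round(silent_containment_gap(1000, 680, 626), 1))  # 5.4
# 72% containment, about 71% of contained calls validated
print(round(silent_containment_gap(1000, 720, 511), 1))  # 20.9
```

The first call is inside the 10-point threshold; the second is roughly double it.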

2. Escalation CSAT

Survey customers whose calls were escalated to a human, separately from customers whose calls the AI resolved. Compare both to the pre-AI human-only baseline. If escalation CSAT is lower than human-only baseline, the handoff is destroying value — usually because the warm-handoff payload isn't being delivered or read.
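The comparison can be sketched as a group-by on the survey export. This assumes each survey row is tagged with how the call ended; the path labels `"escalated"` and `"ai_resolved"` and both function names are hypothetical:

```python
from collections import defaultdict
from statistics import mean


def csat_by_path(surveys):
    """Average CSAT per resolution path from (path, score) pairs."""
    scores = defaultdict(list)
    for path, score in surveys:
        scores[path].append(score)
    return {path: mean(values) for path, values in scores.items()}


def handoff_below_baseline(surveys, human_only_baseline):
    """True when escalated-call CSAT is lower than the pre-AI human-only baseline."""
    by_path = csat_by_path(surveys)
    if "escalated" not in by_path:
        raise ValueError("no escalated-call surveys in the sample")
    return by_path["escalated"] < human_only_baseline


surveys = [
    ("escalated", 4), ("escalated", 3), ("escalated", 5),
    ("ai_resolved", 5), ("ai_resolved", 4),
]
print(csat_by_path(surveys))
print(handoff_below_baseline(surveys, human_only_baseline=4.1))  # True
```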

3. Confidence-floor pass rate

The percentage of AI-generated responses that pass self-evaluation on the first try, vs. the percentage that needed re-grounding or triggered escalation. This is a direct quality signal on the model + retrieval stack. A falling pass rate over time means the KB is drifting from the intents the bot is being asked to handle.

Twig produces the equivalent metric on the text side via confidence scoring — the same numerical floor, the same drift detection.
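As a rough illustration of the pass-rate calculation and the "falling over time" check: the 0.80 floor and the three-week window below are placeholder assumptions, since the article does not state specific values, and the function names are illustrative rather than any product's API:

```python
def pass_rate(first_try_scores, floor=0.80):
    """Fraction of responses whose first-attempt confidence meets the floor."""
    if not first_try_scores:
        raise ValueError("need at least one scored response")
    passed = sum(1 for score in first_try_scores if score >= floor)
    return passed / len(first_try_scores)


def is_falling(weekly_rates, weeks=3):
    """True if the pass rate dropped in each of the last `weeks` week-over-week steps."""
    recent = weekly_rates[-(weeks + 1):]
    if len(recent) < weeks + 1:
        return False  # not enough history to call a trend
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


print(pass_rate([0.92, 0.85, 0.61, 0.88, 0.79]))  # 0.6
print(is_falling([0.91, 0.90, 0.88, 0.85]))       # True
```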

4. Hallucination rate (measured, not assumed)

Sample 100–200 calls per week. Transcribe the AI's responses. Have a human reviewer (or LLM-based grader with human spot-check) score each factual claim against the KB. The hallucination rate is the percentage of claims that are unsupported or false.

Production-grade voice AI runs <1% on grounded queries. Demo-grade systems run 3–8%. The number drifts upward as the KB ages, so the sampling has to be continuous, not one-time at deployment.
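The weekly sampling and scoring loop can be sketched in a few lines. This assumes the reviewer (or grader) emits one verdict per factual claim; the verdict labels and function names are hypothetical:

```python
import random


def sample_calls(call_ids, k=150, seed=None):
    """Draw a simple random sample of calls for this week's review."""
    rng = random.Random(seed)
    return rng.sample(call_ids, min(k, len(call_ids)))


def hallucination_rate(claim_verdicts):
    """Fraction of graded factual claims that are unsupported by the KB or false."""
    if not claim_verdicts:
        raise ValueError("need at least one graded claim")
    bad = sum(1 for verdict in claim_verdicts if verdict in {"unsupported", "false"})
    return bad / len(claim_verdicts)


week = sample_calls(list(range(10_000)), k=150, seed=7)
print(len(week))  # 150

verdicts = ["supported"] * 497 + ["unsupported"] * 2 + ["false"]
print(f"{hallucination_rate(verdicts):.1%}")  # 0.6%
```

Fixing the seed per week makes the sample reproducible for audit; dropping it gives a fresh draw each run.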

What healthy looks like

A representative scorecard for a healthy mid-market voice AI deployment, month 6:

KPI	Value
Autonomous resolution rate	68%
CSAT-validated containment	92% of contained
Re-contact rate	6%
AHT	3:20
First-audio latency p50	720ms
Cost per autonomous resolution	$1.40
Confidence-floor pass rate	91%
Hallucination rate	0.6%
Escalation CSAT	4.2 / 5 (baseline was 4.1)
Handoff payload completeness	100%
Intent coverage	89%
KB freshness	76% updated last 90 days

The under-celebrated win is escalation CSAT slightly above human-only baseline. That happens when the AI does the boring CRM-lookup work upfront and hands the human a fully framed problem, so the human can move directly to resolution.

What unhealthy looks like (and how to read it)

The same scorecard, broken:

KPI	Value	Diagnosis
Autonomous resolution rate	72%	Looks good
CSAT-validated containment	71% of contained	Silent containment problem
Re-contact rate	14%	Too high: bot resolves wrong, callers come back
AHT	2:15	Too fast: bot is hanging up early
Confidence-floor pass rate	67%	Too many low-confidence responses spoken
Hallucination rate	4.2%	KB drift or bad grounding
Escalation CSAT	3.4 / 5	Handoff broken
Intent coverage	71%	Big chunk of calls hitting fallback

The fix sequence:

1. Raise the confidence floor (causes more escalations short-term, lowers hallucinations)
2. Fix the handoff payload delivery (so escalations recover CSAT)
3. Audit the KB and re-ground the intents with the highest hallucination rate
4. Expand intent coverage from 71% upward to absorb fallback calls
5. Re-measure in 30 days
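Reading a scorecard against the targets from the four KPI families can be automated. A minimal sketch covering five of the KPIs with fixed thresholds; the dictionary keys and the `failing_kpis` helper are illustrative names, and rates are expressed as fractions:

```python
import operator

# (comparison, target) pairs taken from the KPI tables earlier in this post
TARGETS = {
    "csat_validated_containment": (">=", 0.90),
    "recontact_rate": ("<", 0.08),
    "confidence_floor_pass_rate": (">", 0.85),
    "hallucination_rate": ("<", 0.01),
    "intent_coverage": (">", 0.85),
}
OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt}


def failing_kpis(scorecard):
    """Names of KPIs present in the scorecard that miss their target."""
    return [
        name
        for name, (op, target) in TARGETS.items()
        if name in scorecard and not OPS[op](scorecard[name], target)
    ]


unhealthy = {
    "csat_validated_containment": 0.71,
    "recontact_rate": 0.14,
    "confidence_floor_pass_rate": 0.67,
    "hallucination_rate": 0.042,
    "intent_coverage": 0.71,
}
print(failing_kpis(unhealthy))  # all five KPIs are listed
```

Running the same check 30 days after the fixes shows which items have cleared.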

Measurement infrastructure

The dashboard that supports this scorecard requires:

Per-call event log: every turn timestamped, intent classified, confidence scored, sources cited, actions attempted
Post-call CSAT survey (SMS, IVR, or app — voice survey at end of call has higher response rate but lower honesty)
Re-contact tracking keyed on customer ID across channels
Sampling infrastructure for hallucination grading
KB version tracking to attribute drift to specific document changes
Escalation payload audit log to prove handoff completeness

Most of this lives in the voice AI vendor's analytics product (PolyAI Studio, Parloa Analytics, ASAPP MissionControl) plus the CRM (Salesforce, Zendesk, HubSpot) for CSAT and re-contact. Hallucination sampling is usually external — Excel + reviewer at small scale, dedicated tooling at large scale.
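For teams building the per-call event log themselves, one possible shape for a turn record is sketched below. The field names are assumptions made for illustration and do not reflect any vendor's schema:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TurnEvent:
    """One agent turn in a call, carrying the fields the scorecard needs."""
    call_id: str
    turn_index: int
    timestamp: str            # ISO 8601, e.g. "2026-06-10T14:03:22Z"
    intent: str               # classified intent, or a fallback label
    confidence: float         # self-evaluation score for this response
    sources: List[str] = field(default_factory=list)   # KB documents cited
    actions: List[str] = field(default_factory=list)   # actions attempted


def to_log_line(event: TurnEvent) -> str:
    """Serialize a turn as one JSON line for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


event = TurnEvent(
    call_id="call-0001",
    turn_index=2,
    timestamp="2026-06-10T14:03:22Z",
    intent="billing_dispute",
    confidence=0.91,
    sources=["kb/refund-policy-v7"],
    actions=["crm_lookup"],
)
print(to_log_line(event))
```

Recording the KB document version in `sources` is what later allows drift to be attributed to a specific document change.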

The cross-channel measurement principle

Voice and text-side AI agents should be measured on the same skeleton. The numbers differ; the structure shouldn't:

Autonomous resolution rate: voice (call) and text (ticket) measure the same thing
Confidence-floor pass rate: voice and text run the same self-evaluation architecture
Hallucination rate: same sampling discipline, different surface
Escalation CSAT: same metric, different handoff endpoint

A buyer running voice AI on one stack and text-side autonomous resolution on another should expect both to report the same KPI set. Twig produces this set natively for chat, email, and helpdesk channels.

The ROI math, honestly

When a deployment hits the scorecard targets, the typical financial result:

Quantity	Pre-AI	With voice AI
Inbound voice calls per month	100,000	100,000
Resolved without human	0	65,000 (autonomous resolution)
Human-handled calls	100,000	35,000
Cost per human-handled call	$7.00	$7.00
Cost per AI-handled call	n/a	$1.40
Monthly cost (humans only)	$700,000	$245,000
Monthly cost (AI)	$0	$91,000
Total monthly cost	$700,000	$336,000
Monthly savings	—	$364,000
CSAT	4.1	4.2

Payback period on a $400K–$800K voice AI implementation: 2–4 months at this scale. Enterprise deployments with custom telephony and CRM integration take longer to break even (4–6 months) but the absolute dollar return is larger.
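The table arithmetic can be reproduced in a few lines, which makes it easy to substitute your own volumes and unit costs. This is a steady-state sketch with illustrative function names. It gives roughly 1.1 to 2.2 months of payback on $400K–$800K, so the 2–4 month range quoted above presumably includes time to ramp up to the scorecard targets, which the sketch does not model:

```python
def monthly_costs(calls, autonomous_rate, human_cost, ai_cost):
    """Split monthly call volume between AI and humans and price each side."""
    ai_calls = round(calls * autonomous_rate)
    human_calls = calls - ai_calls
    human_total = human_calls * human_cost
    ai_total = ai_calls * ai_cost
    return {"human": human_total, "ai": ai_total, "total": human_total + ai_total}


def payback_months(implementation_cost, monthly_savings):
    """Months of steady-state savings needed to recover the implementation cost."""
    if monthly_savings <= 0:
        raise ValueError("no savings, no payback")
    return implementation_cost / monthly_savings


before = monthly_costs(100_000, 0.0, human_cost=7.00, ai_cost=1.40)
after = monthly_costs(100_000, 0.65, human_cost=7.00, ai_cost=1.40)
savings = before["total"] - after["total"]
print(round(savings))  # 364000

for cost in (400_000, 800_000):
    print(f"${cost:,}: {payback_months(cost, savings):.1f} months")
```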

The takeaway

Voice AI quality is measurable, but not with the metrics most contact centers already have on the dashboard. Replace containment with CSAT-validated autonomous resolution. Add escalation CSAT, silent containment gap, confidence-floor pass rate, and hallucination rate. Sample continuously. Treat the warm handoff as a first-class metric, not a footnote.

The buyers who win at voice AI in 2026 are not the ones with the best demos — they're the ones with the most honest scorecards.

Frequently Asked Questions

What metrics matter for voice AI?

Twelve KPIs grouped into four families: resolution (autonomous resolution rate, CSAT-validated containment, re-contact rate), efficiency (handle time, first-audio latency, cost per autonomous resolution), quality (confidence-floor pass rate, hallucination rate, escalation CSAT, handoff payload completeness), and operations (intent coverage, KB freshness). Containment alone is misleading; CSAT-validated autonomous resolution is the honest version.

What's a good containment rate for voice AI?

Containment alone is the wrong question. Best-in-class voice AI deployments report 60–75% containment, but only 50–65% CSAT-validated autonomous resolution. The gap is silent containment — calls where the customer hung up or accepted a wrong answer. Measure both and watch the gap; if it widens, the bot is winning containment by frustrating callers into hanging up.

How do I measure voice AI hallucination?

Sample 100–200 calls per week, transcribe agent responses, and have a reviewer score factual claims against the underlying knowledge base. Production-grade voice AI runs <1% hallucination on grounded queries; demo-grade systems run 3–8%. The number drifts upward over time as the KB ages, so the sampling has to be ongoing.

What is CSAT-validated containment?

CSAT-validated containment is the subset of contained calls that also score above a CSAT threshold (typically ≥4 of 5, or ≥7 of 10 on an NPS-style recommend question) in a post-call survey. It removes the calls where the customer technically didn't transfer but was unhappy with the outcome. This is the metric to drive deployments against, not raw containment.

How long until voice AI shows ROI?

Most enterprise deployments show positive ROI in months 4–6, with the inflection point being when intent coverage crosses 80% (the 80% of call mix covered by 10–15 well-modeled intents). Mid-market deployments hit ROI in months 2–3 because the call mix is narrower and intent coverage scales faster.
