Measuring Voice AI Agent Quality: CSAT, AHT, Containment, and the New Metrics

Voice AI deployments fail on the wrong metric. Here is the 12-KPI scorecard that actually predicts whether your voice AI is creating value — including the four metrics legacy contact centers don't track.

Chandan Maruthi · CEO, Twig AI

CEO of Twig AI. Previously at H2O.ai and Zyme.

May 21, 2026 · 10 min read

Key Takeaways

  • Containment alone is misleading; CSAT-validated autonomous resolution is the honest version
  • Twelve KPIs span four families — resolution, efficiency, quality, and operations
  • Four new metrics legacy contact centers don't track — silent containment gap, escalation-CSAT, confidence-floor pass rate, hallucination rate
  • Sample 100–200 calls per week for hallucination measurement; the rate drifts upward as KB ages
  • ROI typically lands in months 4–6 enterprise, 2–3 mid-market
  • Twig's text-side self-evaluation produces directly comparable confidence-floor metrics for chat and email

Twig is an autonomous AI support platform that triages, self-evaluates, and resolves customer support tickets by integrating with tools like Zendesk, Salesforce, and Intercom. On the text side, every response runs through a confidence scoring loop that produces directly observable quality numbers. On the voice side, the equivalent measurement discipline is harder — and the legacy contact-center KPIs miss it entirely. This post is for the buyer or operator trying to figure out whether their voice AI deployment is actually creating value.

TL;DR: Legacy contact-center KPIs — AHT, ASA, FCR, containment — were designed to measure human-agent productivity, not AI-agent quality. They miss the failure modes that matter for voice AI: silent containment ("the caller gave up"), low-confidence resolutions that look good on paper, and warm-handoff degradation that destroys CSAT for escalated calls. This post lays out the 12-KPI scorecard that actually predicts whether a voice AI deployment is creating value, why containment is the wrong primary metric, and which four new metrics need to be added to the dashboard.

Why legacy KPIs miss the point

The standard contact-center scorecard was built for human-agent performance management:

Legacy KPI | What it measures | Why it misses for voice AI
AHT (Average Handle Time) | Length of call | A bad bot can have great AHT by hanging up faster
ASA (Average Speed of Answer) | Queue time | No queue for voice AI; metric becomes meaningless
FCR (First Call Resolution) | % calls not requiring callback | Hard to attribute when AI handles part, human handles rest
CSAT | Customer satisfaction post-call | Right metric, but underused as primary
Occupancy | % of agent time on calls | N/A for AI
Containment / Deflection | % calls not transferred to human | Conflates "resolved" with "customer gave up"

The headline failure mode is silent containment: the customer hung up or accepted a wrong answer rather than transferring to a human. Containment counts it as a success. CSAT counts it as a disaster. A scorecard that watches only containment will tell you the deployment is winning while the customer base quietly defects.

The 12-KPI voice AI scorecard

Group the metrics into four families. Watch them all; weight them by what you're actually optimizing for.

Family 1: Resolution

KPI | Definition | Target
Autonomous resolution rate | % calls where intended task completed end-to-end without human transfer | 60–75%
CSAT-validated containment | Contained calls with post-call CSAT ≥4 of 5 | ≥90% of contained calls
Re-contact rate | % callers calling back within 7 days about same issue | <8%

The third one is the gotcha. If your containment is 70% but 20% of those callers come back within a week, you effectively contained 56% (70% × 0.8), not 70%. Re-contact rate is the lagging indicator that exposes silent containment.
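A minimal sketch of that adjustment, assuming the re-contact rate is measured as a share of contained calls (if it is measured against all calls instead, subtract it from the containment rate directly). The function name is illustrative, not part of any vendor API.

```python
def effective_containment(containment_rate: float, recontact_rate: float) -> float:
    """Share of all calls that were contained and did not come back.

    containment_rate: contained calls / all calls, as a fraction 0..1
    recontact_rate:   contained callers who call back within the window
                      / contained calls, as a fraction 0..1 (assumed denominator)
    """
    for rate in (containment_rate, recontact_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be fractions between 0 and 1")
    return containment_rate * (1.0 - recontact_rate)


# 70% containment with 20% of contained callers returning
print(f"{effective_containment(0.70, 0.20):.0%}")
```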

Family 2: Efficiency

KPI | Definition | Target
Average handle time (AHT) | Length of call when AI handles autonomously | 2:30–4:00 typical
Time to first token / first audio | Caller end-of-turn → first agent audio | p50 <800ms, p95 <1500ms
Cost per autonomous resolution | Total voice AI cost / autonomous resolutions | $0.50–$2.50 typical

AHT is back on the list, but with a footnote: AHT only matters if autonomous resolution rate is holding. A bot that "improves" AHT by escalating faster has just shifted cost, not removed it.
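As a rough illustration, the latency and cost targets in this family could be checked as follows. This is a sketch under assumptions: latencies arrive as a plain list of milliseconds, percentiles use the nearest-rank method, and all function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(latencies_ms, p50_target=800, p95_target=1500):
    """Compare first-audio latency against the p50/p95 targets from the table."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    return {"p50_ms": p50, "p95_ms": p95, "ok": p50 < p50_target and p95 < p95_target}


def cost_per_autonomous_resolution(total_voice_ai_cost, autonomous_resolutions):
    """Total voice AI cost divided by the number of autonomous resolutions."""
    if autonomous_resolutions <= 0:
        raise ValueError("need at least one autonomous resolution")
    return total_voice_ai_cost / autonomous_resolutions


samples = [600, 650, 700, 720, 750, 800, 900, 1000, 1200, 1600]
print(latency_report(samples))
```

With only ten samples the p95 is simply the slowest call, so a production check would want far more data per reporting window.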

Family 3: Quality

KPI | Definition | Target
Confidence-floor pass rate | % responses passing self-evaluation threshold | >85%
Hallucination rate | % responses with factual claims unsupported by KB | <1% on grounded queries
Escalation CSAT | CSAT for calls handed off to a human | ≥ human-only baseline
Handoff payload completeness | % escalations with all required context delivered | 100%

The escalation CSAT is the under-measured one. A deployment can look healthy in aggregate while quietly destroying CSAT for the 30% of calls that get handed off — which are precisely the moments customers are already frustrated. See our deeper piece on warm handoff payload design.

Family 4: Operations

KPI | Definition | Target
Intent coverage | % call volume mapped to a modeled intent | >85% in 90 days
KB freshness | % grounded sources updated in last 90 days | >70%

Intent coverage is the leading indicator of how much room the deployment has left to grow. KB freshness is the leading indicator of when hallucination rate is about to climb.
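Both operations metrics reduce to simple ratios. The sketch below assumes you can export a last-updated date per KB source and a call count per classified intent, with unmodeled calls grouped under a hypothetical "fallback" label.

```python
from datetime import date, timedelta


def kb_freshness(last_updated, today, window_days=90):
    """Fraction of KB sources updated within the last `window_days` days."""
    if not last_updated:
        raise ValueError("no KB sources")
    cutoff = today - timedelta(days=window_days)
    return sum(1 for d in last_updated if d >= cutoff) / len(last_updated)


def intent_coverage(calls_by_intent, fallback_label="fallback"):
    """Fraction of call volume mapped to a modeled intent (everything except fallback)."""
    total = sum(calls_by_intent.values())
    if total == 0:
        raise ValueError("no calls")
    return (total - calls_by_intent.get(fallback_label, 0)) / total


today = date(2026, 5, 21)
sources = [date(2026, 5, 1), date(2026, 4, 15), date(2026, 3, 1), date(2025, 12, 1)]
print(kb_freshness(sources, today))
print(intent_coverage({"billing": 500, "order_status": 350, "fallback": 150}))
```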

The four metrics legacy contact centers don't track

These are the additions that turn a generic CX dashboard into a voice-AI-aware one:

1. Silent containment gap

Silent containment gap = Containment rate − CSAT-validated containment

Express both terms as a share of all calls so the gap comes out in percentage points. A widening gap means the bot is winning containment at the expense of customer outcomes. Healthy deployments hold this gap under 10 percentage points.
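A small sketch of the gap calculation. It assumes CSAT-validated containment is reported as a share of contained calls (as in the Family 1 target) and converts it to a share of all calls before subtracting. The function name is illustrative.

```python
def silent_containment_gap(containment_rate, csat_validated_share):
    """Gap, in percentage points, between raw and CSAT-validated containment.

    containment_rate:     contained calls / all calls (fraction 0..1)
    csat_validated_share: contained calls with CSAT >= 4 / contained calls (fraction 0..1)
    """
    validated_containment = containment_rate * csat_validated_share
    return (containment_rate - validated_containment) * 100


# 68% contained, 92% of those validated by CSAT
print(round(silent_containment_gap(0.68, 0.92), 2))
# 72% contained, only 71% of those validated by CSAT
print(round(silent_containment_gap(0.72, 0.71), 2))
```

The first case stays under the 10-point threshold; the second is roughly twice over it even though its raw containment is higher.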

2. Escalation CSAT

Survey customers whose calls were escalated to a human, separately from customers whose calls the AI resolved. Compare both to the pre-AI human-only baseline. If escalation CSAT is lower than human-only baseline, the handoff is destroying value — usually because the warm-handoff payload isn't being delivered or read.
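One way to sketch the split, assuming each survey response is a pair of (escalated flag, score from 1 to 5). The data shape and function names are assumptions for illustration.

```python
from statistics import mean


def csat_by_path(surveys):
    """Average CSAT separately for escalated calls and AI-resolved calls."""
    escalated = [score for was_escalated, score in surveys if was_escalated]
    resolved = [score for was_escalated, score in surveys if not was_escalated]
    return {
        "escalated": mean(escalated) if escalated else None,
        "ai_resolved": mean(resolved) if resolved else None,
    }


def handoff_destroys_value(escalation_csat, human_only_baseline):
    """True when escalated calls score below the pre-AI human-only baseline."""
    return escalation_csat < human_only_baseline


surveys = [(True, 3), (True, 4), (False, 5), (False, 4)]
print(csat_by_path(surveys))
print(handoff_destroys_value(3.4, 4.1))
```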

3. Confidence-floor pass rate

The percentage of AI-generated responses that pass self-evaluation on the first try, vs. the percentage that needed re-grounding or triggered escalation. This is a direct quality signal on the model + retrieval stack. A falling pass rate over time means the KB is drifting from the intents the bot is being asked to handle.

Twig produces the equivalent metric on the text side via confidence scoring — the same numerical floor, the same drift detection.
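A hedged sketch of the pass-rate and drift check. The 0.8 floor, the three-week window, and the function names are all assumptions chosen for illustration; they are not documented Twig or vendor defaults.

```python
def pass_rate(first_try_scores, floor=0.8):
    """Fraction of responses whose first-try self-evaluation score meets the floor."""
    if not first_try_scores:
        raise ValueError("no scored responses")
    return sum(1 for s in first_try_scores if s >= floor) / len(first_try_scores)


def is_drifting(weekly_pass_rates, weeks=3):
    """True when the pass rate fell in each of the last `weeks` week-over-week comparisons."""
    recent = weekly_pass_rates[-(weeks + 1):]
    if len(recent) < weeks + 1:
        return False  # not enough history to call it a trend
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


print(pass_rate([0.90, 0.85, 0.70, 0.95]))
print(is_drifting([0.91, 0.90, 0.88, 0.85]))
```

A strict consecutive-decline rule is deliberately simple; noisy weekly data may call for a moving average instead.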

4. Hallucination rate (measured, not assumed)

Sample 100–200 calls per week. Transcribe the AI's responses. Have a human reviewer (or LLM-based grader with human spot-check) score each factual claim against the KB. The hallucination rate is the percentage of claims that are unsupported or false.

Production-grade voice AI runs <1% on grounded queries. Demo-grade systems run 3–8%. The number drifts upward as the KB ages, so the sampling has to be continuous, not one-time at deployment.
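Because the weekly sample is small, the measured rate carries real uncertainty. The sketch below computes the rate per graded claim together with a Wilson score interval at roughly 95% confidence. The choice of interval and the function name are assumptions, not something the sampling procedure above prescribes.

```python
import math


def hallucination_rate_with_interval(unsupported_claims, total_claims, z=1.96):
    """Return (rate, low, high): observed rate plus a Wilson score interval."""
    if total_claims <= 0:
        raise ValueError("need at least one graded claim")
    n = total_claims
    p = unsupported_claims / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half_width), min(1.0, center + half_width)


# Example: 3 unsupported claims out of 500 graded claims
rate, low, high = hallucination_rate_with_interval(3, 500)
print(f"rate={rate:.3%} interval=({low:.3%}, {high:.3%})")
```

In this example the observed rate is 0.6%, but the upper bound of the interval still sits above the 1% target, which is one reason to pool several weeks of samples before declaring the target met.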

What healthy looks like

A representative scorecard for a healthy mid-market voice AI deployment, month 6:

KPI | Value
Autonomous resolution rate | 68%
CSAT-validated containment | 92% of contained
Re-contact rate | 6%
AHT | 3:20
First-audio latency p50 | 720ms
Cost per autonomous resolution | $1.40
Confidence-floor pass rate | 91%
Hallucination rate | 0.6%
Escalation CSAT | 4.2 / 5 (baseline was 4.1)
Handoff payload completeness | 100%
Intent coverage | 89%
KB freshness | 76% updated last 90 days

The under-celebrated win is escalation CSAT slightly above human-only baseline. That happens when the AI does the boring CRM-lookup work upfront and hands the human a fully framed problem, so the human can move directly to resolution.

What unhealthy looks like (and how to read it)

The same scorecard, broken:

KPI | Value | Diagnosis
Autonomous resolution rate | 72% | Looks good
CSAT-validated containment | 71% of contained | Silent containment problem
Re-contact rate | 14% | Bot resolves wrong; callers come back
AHT | 2:15 | Too fast; bot hanging up early
Confidence-floor pass rate | 67% | Too many low-confidence responses spoken
Hallucination rate | 4.2% | KB drift or bad grounding
Escalation CSAT | 3.4 / 5 | Handoff broken
Intent coverage | 71% | Big chunk of calls hitting fallback

The fix sequence:

  1. Raise the confidence floor (causes more escalations short-term, lowers hallucinations)
  2. Fix the handoff payload delivery (so escalations recover CSAT)
  3. Audit the KB and re-ground the intents with the highest hallucination rate
  4. Expand intent coverage from 71% upward to absorb fallback calls
  5. Re-measure in 30 days
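The re-measure step can be reduced to a mechanical check of the scorecard against its targets. This is a minimal sketch: the dictionary keys are hypothetical names, and the thresholds are copied from the target columns in the KPI tables above.

```python
import operator

# (comparison, threshold) per KPI, taken from the scorecard targets
TARGETS = {
    "csat_validated_containment": (operator.ge, 0.90),
    "recontact_rate": (operator.lt, 0.08),
    "confidence_floor_pass_rate": (operator.gt, 0.85),
    "hallucination_rate": (operator.lt, 0.01),
    "intent_coverage": (operator.gt, 0.85),
}


def failing_kpis(scorecard):
    """Return the names of KPIs present in the scorecard that miss their target."""
    return [
        name
        for name, (meets, threshold) in TARGETS.items()
        if name in scorecard and not meets(scorecard[name], threshold)
    ]


unhealthy = {
    "csat_validated_containment": 0.71,
    "recontact_rate": 0.14,
    "confidence_floor_pass_rate": 0.67,
    "hallucination_rate": 0.042,
    "intent_coverage": 0.71,
}
print(failing_kpis(unhealthy))
```

Escalation CSAT is left out of the table because its target is relative to each deployment's own human-only baseline rather than a fixed number.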

Measurement infrastructure

The dashboard that supports this scorecard requires:

  • Per-call event log: every turn timestamped, intent classified, confidence scored, sources cited, actions attempted
  • Post-call CSAT survey (SMS, IVR, or app — voice survey at end of call has higher response rate but lower honesty)
  • Re-contact tracking keyed on customer ID across channels
  • Sampling infrastructure for hallucination grading
  • KB version tracking to attribute drift to specific document changes
  • Escalation payload audit log to prove handoff completeness

Most of this lives in the voice AI vendor's analytics product (PolyAI Studio, Parloa Analytics, ASAPP MissionControl) plus the CRM (Salesforce, Zendesk, HubSpot) for CSAT and re-contact. Hallucination sampling is usually external — Excel + reviewer at small scale, dedicated tooling at large scale.
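For teams building any of this themselves, the per-call event log might be shaped roughly like the sketch below. Every field and class name here is an assumption for illustration and does not reflect the export schema of any of the vendors named above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class TurnEvent:
    """One agent turn: when it happened, what was classified, how confident, what was cited."""
    timestamp: datetime
    intent: str
    confidence: float
    sources: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)


@dataclass
class CallLog:
    """One call, keyed on customer ID for re-contact tracking and KB version for drift attribution."""
    call_id: str
    customer_id: str
    kb_version: str
    turns: List[TurnEvent] = field(default_factory=list)
    escalated: bool = False
    csat: Optional[int] = None  # filled in when the post-call survey comes back

    def min_confidence(self) -> Optional[float]:
        """Lowest confidence score spoken on the call, or None if there were no turns."""
        return min((t.confidence for t in self.turns), default=None)


log = CallLog(call_id="c1", customer_id="cust-42", kb_version="kb-v1")
log.turns.append(TurnEvent(datetime(2026, 5, 21, 10, 0, 0), "billing", 0.9, ["kb/billing.md"]))
log.turns.append(TurnEvent(datetime(2026, 5, 21, 10, 0, 20), "refund", 0.7))
print(log.min_confidence())
```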

The cross-channel measurement principle

Voice and text-side AI agents should be measured on the same skeleton. The numbers differ; the structure shouldn't:

  • Autonomous resolution rate: voice (call) and text (ticket) measure the same thing
  • Confidence-floor pass rate: voice and text run the same self-evaluation architecture
  • Hallucination rate: same sampling discipline, different surface
  • Escalation CSAT: same metric, different handoff endpoint

A buyer running voice AI on one stack and text-side autonomous resolution on another should expect both to report the same KPI set. Twig produces this set natively for chat, email, and helpdesk channels.

The ROI math, honestly

When a deployment hits the scorecard targets, the typical financial result:

Quantity | Pre-AI | With voice AI
Inbound voice calls per month | 100,000 | 100,000
Resolved without human | 0 | 65,000 (autonomous resolution)
Human-handled calls | 100,000 | 35,000
Cost per human-handled call | $7.00 | $7.00
Cost per AI-handled call | n/a | $1.40
Monthly cost (humans only) | $700,000 | $245,000
Monthly cost (AI) | $0 | $91,000
Total monthly cost | $700,000 | $336,000
Monthly savings | n/a | $364,000
CSAT | 4.1 | 4.2

Payback period on a $400K–$800K voice AI implementation: 2–4 months at this scale. Enterprise deployments with custom telephony and CRM integration take longer to break even (4–6 months) but the absolute dollar return is larger.
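The arithmetic behind the table can be reproduced with a few lines. This is a steady-state sketch using the table's own figures; the function names are illustrative, and the model deliberately ignores ramp-up, licensing tiers, and any change in human cost per call.

```python
def monthly_cost(total_calls, autonomous_calls, human_cost_per_call, ai_cost_per_call):
    """Total monthly cost when `autonomous_calls` are handled by AI and the rest by humans."""
    human_calls = total_calls - autonomous_calls
    return human_calls * human_cost_per_call + autonomous_calls * ai_cost_per_call


def simple_payback_months(implementation_cost, monthly_savings):
    """Months of run-rate savings needed to recover the implementation cost."""
    if monthly_savings <= 0:
        raise ValueError("no savings, no payback")
    return implementation_cost / monthly_savings


pre_ai = monthly_cost(100_000, 0, 7.00, 1.40)
with_ai = monthly_cost(100_000, 65_000, 7.00, 1.40)
savings = pre_ai - with_ai
print(round(pre_ai), round(with_ai), round(savings))
print(round(simple_payback_months(400_000, savings), 1),
      round(simple_payback_months(800_000, savings), 1))
```

Dividing the quoted $400K–$800K implementation cost by the full run-rate savings gives roughly 1.1 to 2.2 months. The sketch assumes the deployment is already at target from day one, so any ramp-up period would lengthen the real payback.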

The takeaway

Voice AI quality is measurable, but not with the metrics most contact centers already have on the dashboard. Replace containment with CSAT-validated autonomous resolution. Add escalation CSAT, silent containment gap, confidence-floor pass rate, and hallucination rate. Sample continuously. Treat the warm handoff as a first-class metric, not a footnote.

The buyers who win at voice AI in 2026 are not the ones with the best demos — they're the ones with the most honest scorecards.
