Latency Budgets for Voice AI Agents: The 800ms Rule of Natural Conversation

Humans expect a response within 800ms in turn-taking conversation. A voice AI agent that misses that budget feels robotic. Here is how to allocate the budget across ASR, LLM, and TTS.

Chandan Maruthi · CEO, Twig AI

CEO of Twig AI. Previously at H2O.ai and Zyme.

May 21, 2026 · Updated June 10, 2026 · 8 min read

Key Takeaways

✓ Human turn-taking happens with ~200ms median gap; >800ms feels awkward, >1500ms feels broken
✓ The 800ms budget covers endpointing, final ASR, LLM, tools, TTS first chunk, and network jitter
✓ Streaming everywhere (ASR partials, LLM tokens, TTS chunks) is what makes the budget achievable
✓ Model-based endpointing fires in 200–400ms vs. 800ms+ for naive silence thresholds
✓ Tool calls and retrieval should run in parallel with LLM inference, not after
✓ Barge-in requires TTS that can stop instantly and ASR that hears through agent audio

Twig is an autonomous AI support platform that triages, self-evaluates, and resolves customer support tickets by integrating with tools like Zendesk, Salesforce, and Intercom. On the text side, latency is measured in seconds and rarely matters to a customer staring at a "..." indicator. On the voice side, every millisecond is felt. This post is for the engineers building voice AI agents who need to know exactly where the time goes — and where to claw it back.

TL;DR: Human-to-human turn-taking happens with a median gap of about 200ms; gaps over 800ms feel awkward. A voice AI agent has roughly 800ms from the caller's end-of-utterance to first audible response — and that budget has to cover endpoint detection, final ASR, LLM inference, optional tool calls, TTS first chunk, and network jitter. Hitting it requires streaming everywhere, parallel tool calls, speculative endpointing, and brutal honesty about which steps can be amortized while the caller is still talking.

Where the 800ms number comes from

Conversation analysts have studied turn-taking gaps across languages for decades. The findings are remarkably consistent:

Median gap in casual human-to-human conversation: ~200ms
Gap perceived as "natural": 0–500ms
Gap perceived as "thinking" or "thoughtful": 500–800ms
Gap perceived as awkward or robotic: 800–1500ms
Gap perceived as broken / "still there?": >1500ms

Sources: Stivers et al. (2009) on cross-linguistic turn-taking; Levinson & Torreira (2015) on conversational timing; replicated in synthetic voice studies by Microsoft, Google, and Meta voice teams.

The implication for voice AI: aim for <800ms first-audio latency at p50, and <1500ms at p95. Above that, every CSAT survey will tell you the bot "felt slow."
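
As a rough illustration of checking those two targets, the sketch below computes p50 and p95 from a list of per-call first-audio latencies. The sample values are made up for the example, and the function name is hypothetical; only the 800ms and 1500ms thresholds come from the text.

```python
import statistics

P50_TARGET_MS = 800    # from the text: first audio under 800ms at p50
P95_TARGET_MS = 1500   # from the text: first audio under 1500ms at p95


def check_latency_targets(samples_ms):
    """Return (p50, p95, ok) for a list of first-audio latencies in ms."""
    # quantiles(n=100) returns the 99 cut points p1..p99,
    # so index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    ok = p50 < P50_TARGET_MS and p95 < P95_TARGET_MS
    return p50, p95, ok


# Hypothetical per-call measurements, not real data.
samples = [620, 700, 710, 760, 790, 820, 900, 1100, 1300, 1700]
p50, p95, ok = check_latency_targets(samples)
print(f"p50={p50:.0f}ms p95={p95:.0f}ms within_target={ok}")
```

Tracking both percentiles matters because a healthy median can hide a slow tail, and the tail is what callers remember.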

The full latency breakdown

Here is the chain from "caller finishes speaking" to "caller hears first syllable of response":

Stage	What happens	p50 budget	p95 budget
End-of-turn detection	System decides caller is done	200–300ms	400ms
Final ASR transcript	Last partial → final text	50–150ms	250ms
LLM first token	Prompt assembled → first model token	200–400ms	700ms
Tool calls (if needed)	CRM read, KB retrieval	0ms (overlap)	0–300ms
TTS first audio chunk	First model token → audible audio	100–200ms	350ms
Network jitter (PSTN/WebRTC)	Media path latency	30–80ms	150ms
Total to first audio		~600–850ms	~1450ms

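Read the totals as end-to-end targets, not column sums: run back to back, the p50 stage budgets alone can add up to more than 1100ms. The total only fits in 800ms if every stage is streamed and overlaps its neighbors. Sequential, non-streaming pipelines blow the budget by 2–3x and produce the classic "lights are on, nobody is home" voicebot experience.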
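
To make the arithmetic concrete, here is a small sketch that sums the p50 column of the table as if the stages ran strictly one after another. The dictionary keys are names invented for this example; the numbers are the table's.

```python
# (low, high) p50 budgets in ms, copied from the table above.
P50_BUDGET_MS = {
    "end_of_turn": (200, 300),
    "final_asr": (50, 150),
    "llm_first_token": (200, 400),
    "tool_calls": (0, 0),          # assumed fully overlapped with LLM inference
    "tts_first_chunk": (100, 200),
    "network": (30, 80),
}


def sequential_total(budget):
    """Sum the low and high ends as if no stage overlapped another."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = sequential_total(P50_BUDGET_MS)
print(f"back-to-back p50: {low}-{high}ms against an 800ms target")
```

The high end of that range is well past 800ms, which is why the budget depends on overlap rather than on any single stage being fast.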

Stage-by-stage: where the time actually goes

Stage 1: End-of-turn detection (200–400ms)

The naive approach ("if I hear 800ms of silence, the caller is done") burns most of the budget before any other work starts. Worse, it cuts off callers who pause mid-sentence for longer than the threshold, and a caller who resumes speaking after such a pause ends up talking over the start of the agent's response.

The 2025-era approach is model-based endpointing:

Acoustic features: pitch contour, energy fall-off, vowel lengthening
Prosodic cues: terminal intonation patterns
Semantic completion: the partial transcript looks grammatically/semantically complete

A model that fuses all three fires in 200–400ms when confidence is high and falls back to longer silence thresholds only when it isn't. Production voice agents from PolyAI and Parloa publish endpointing latencies in this range.
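
A minimal sketch of that fusion logic follows. It assumes each signal has already been reduced to a 0..1 score by some upstream model; the weights, the confidence threshold, and the 250ms fast path are hypothetical values picked for illustration, not numbers from any vendor.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    acoustic: float   # 0..1, e.g. energy fall-off and pitch drop
    prosodic: float   # 0..1, terminal intonation
    semantic: float   # 0..1, partial transcript looks complete


# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS = (0.3, 0.3, 0.4)
CONFIDENT = 0.75
FAST_WAIT_MS = 250       # inside the 200-400ms range from the text
FALLBACK_WAIT_MS = 800   # the naive silence threshold


def required_silence_ms(s: TurnSignals) -> int:
    """How much silence to wait for before declaring end of turn."""
    score = (WEIGHTS[0] * s.acoustic
             + WEIGHTS[1] * s.prosodic
             + WEIGHTS[2] * s.semantic)
    return FAST_WAIT_MS if score >= CONFIDENT else FALLBACK_WAIT_MS


def should_respond(s: TurnSignals, silence_ms: int) -> bool:
    return silence_ms >= required_silence_ms(s)


finished = TurnSignals(acoustic=0.9, prosodic=0.8, semantic=0.9)
mid_sentence = TurnSignals(acoustic=0.6, prosodic=0.3, semantic=0.2)
print(should_respond(finished, 300), should_respond(mid_sentence, 300))
```

A production endpointer would likely learn the fusion rather than use fixed weights, but the shape is the same: high confidence shortens the wait, low confidence falls back to the longer threshold.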

Stage 2: Final ASR (50–250ms)

If you're using streaming ASR (Whisper-streaming, Deepgram Nova, AssemblyAI Universal-Streaming, or proprietary contact-center ASR), the "final" transcript at end-of-turn is mostly a stabilization pass over partials that already arrived. It should land within 50–150ms.

If you're using batch ASR — even fast batch — you've already lost. Batch ASR processes the whole utterance after end-of-turn and adds 500–1500ms. Acceptable for analytics, not for live dialog.

Stage 3: LLM inference (200–700ms)

This is the biggest variable. The right metric is time-to-first-token (TTFT), not time-to-completion — the TTS can start speaking the first sentence while the model is still generating the third.

Practical tactics to lower TTFT:

Prompt caching: the system prompt, persona, and policy rules don't change per turn. Cache them. (Twig uses prompt caching on the text side for the same reason.)
Smaller models for routing, larger for resolution: a fast classifier picks the intent in 50–100ms; the larger model handles only the intents that need it.
Co-located inference: model in the same region as the media server. Cross-region inference adds 50–150ms one-way.
Streaming output: the TTS pipeline starts on token #1, not token #N.
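
The difference between time-to-first-token and time-to-completion can be seen in a small asyncio sketch. The token stream here is a stand-in with an artificial delay, not a real LLM client, and the function names are hypothetical.

```python
import asyncio
import time


async def fake_llm_stream(tokens, delay_s=0.01):
    """Stand-in for a streaming LLM client (hypothetical)."""
    for tok in tokens:
        await asyncio.sleep(delay_s)
        yield tok


async def speak_streaming(token_stream):
    """Consume tokens as they arrive and record TTFT and total time."""
    start = time.perf_counter()
    ttft = None
    spoken = []
    async for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        # A real pipeline would hand tok to streaming TTS here.
        spoken.append(tok)
    total = time.perf_counter() - start
    return ttft, total, spoken


tokens = ["Your", " order", " shipped", " today", "."]
ttft, total, spoken = asyncio.run(speak_streaming(fake_llm_stream(tokens)))
print(f"TTFT={ttft * 1000:.0f}ms total={total * 1000:.0f}ms")
```

Because the consumer acts on each token as it arrives, the caller-facing latency is governed by TTFT, while the rest of the generation time is hidden behind audio that is already playing.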

Stage 4: Tool calls (ideally overlapped to 0ms)

The temptation is to wait for the LLM to decide which tool to call. The faster pattern is speculative retrieval: as soon as the intent is classified (often from a partial transcript), start the likely CRM read and KB retrieval. By the time the LLM emits the tool call, the result is already cached.

This is also why entity-graph retrieval matters — Twig's text-side architecture uses entity-grounded retrieval (Zendesk ticket history, Salesforce account state, PostgreSQL live data, REST API endpoints) so the relevant context arrives in parallel with intent classification.
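
The speculative retrieval pattern from this stage might look roughly like the sketch below. All three coroutines (`crm_read`, `kb_search`, `llm_decide_tools`) are hypothetical stand-ins with artificial delays; they are not Twig APIs or any vendor's SDK.

```python
import asyncio


async def crm_read(caller_id):            # hypothetical CRM lookup
    await asyncio.sleep(0.05)
    return {"caller_id": caller_id, "plan": "pro"}


async def kb_search(intent):              # hypothetical knowledge-base retrieval
    await asyncio.sleep(0.05)
    return [f"article about {intent}"]


async def llm_decide_tools(transcript):   # hypothetical LLM call that picks tools
    await asyncio.sleep(0.08)
    return ["crm_read", "kb_search"]


async def handle_turn(caller_id, intent, transcript):
    # Start the likely lookups as soon as the intent is known,
    # before the LLM has asked for them.
    speculative = {
        "crm_read": asyncio.create_task(crm_read(caller_id)),
        "kb_search": asyncio.create_task(kb_search(intent)),
    }
    wanted = await llm_decide_tools(transcript)
    results = {}
    for name, task in speculative.items():
        if name in wanted:
            results[name] = await task    # usually finished already
        else:
            task.cancel()                 # speculation missed, discard it
    return results


results = asyncio.run(
    handle_turn("caller-42", "billing", "why was I charged twice")
)
print(results)
```

The cost of a wrong guess is one wasted lookup, so this trade only makes sense for reads that are cheap and free of side effects.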

Stage 5: TTS first audio (100–350ms)

Streaming neural TTS emits the first audio chunk within 100–200ms of receiving the first LLM token. The trap: if your TTS waits for sentence boundaries or paragraph boundaries before synthesizing, you've added 300–800ms of silence the caller perceives as "thinking."
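
One way to avoid waiting on sentence boundaries is to flush a short first chunk after a few words and then flush at clause punctuation. The sketch below assumes tokens arrive as strings with leading spaces marking word starts; `first_chunk_words` is a hypothetical knob, and cutting mid-clause can cost some prosody, so treat it as a trade-off rather than a rule.

```python
def chunk_for_tts(tokens, first_chunk_words=3, boundary=",.;:!?"):
    """Yield text chunks for TTS: a short first chunk, then clause-sized ones."""
    buf = ""
    first_sent = False
    for tok in tokens:
        # Flush the first chunk once enough whole words are buffered.
        # Only do so when the next token starts a new word.
        if (not first_sent and tok[:1].isspace()
                and len(buf.split()) >= first_chunk_words):
            yield buf
            buf = ""
            first_sent = True
        buf += tok
        # After that, flush whenever a clause boundary arrives.
        if buf.rstrip().endswith(tuple(boundary)):
            yield buf
            buf = ""
            first_sent = True
    if buf.strip():
        yield buf


tokens = ["Your", " order", " shipped", " this", " morning", ",",
          " and", " it", " arrives", " Friday", "."]
chunks = list(chunk_for_tts(tokens))
print(chunks)
```

The first chunk goes to synthesis after three words instead of after the whole sentence, which is where most of the avoidable silence comes from.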

Specific TTS choices that hit the budget:

ElevenLabs Flash / Turbo models (first audio in 75–200ms)
Cartesia Sonic (first audio in 90ms)
Deepgram Aura (first audio in 200ms)
PlayHT Play3.0 (first audio in 150ms)

These numbers are vendor-published; real-world latency on a busy SIP path adds 30–80ms of jitter.

Stage 6: Network jitter (30–150ms)

The least controllable stage. PSTN over G.711 codecs adds 20–50ms one-way at the carrier. Jitter buffers (usually 30–60ms) absorb micro-variations. WebRTC over Opus typically beats PSTN by 10–20ms but isn't an option for most inbound phone numbers.

Plan for p95 jitter of 150ms and don't try to claw it back.

Barge-in: the conversational-feel multiplier

Even with the 800ms budget hit perfectly, the agent feels like a recording without barge-in. Three components:

TTS that can stop instantly: a kill signal interrupts streaming TTS in <50ms and flushes any agent audio still queued for playout.
ASR that hears through agent audio: full-duplex echo cancellation lets the ASR transcribe the caller while the TTS is still playing. Without this, the caller cannot be heard until the agent finishes speaking, which defeats the point.
Dialog manager that drops gracefully: when barge-in fires, the manager discards the in-flight response and routes the new caller turn to a fresh LLM call.

Barge-in adds noticeable engineering complexity but is the single highest-impact UX feature after raw latency.
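
The stop-and-drop part of barge-in can be sketched with asyncio task cancellation. The `AgentTurn` class and its timings are hypothetical, and the sketch leaves out echo cancellation and the follow-up LLM call entirely; it only shows an in-flight response being abandoned mid-playback.

```python
import asyncio


class AgentTurn:
    """One agent response being played out chunk by chunk (hypothetical)."""

    def __init__(self):
        self.played = []
        self._task = None

    async def _speak(self, chunks):
        for chunk in chunks:
            await asyncio.sleep(0.02)   # stand-in for playing one audio chunk
            self.played.append(chunk)

    def start(self, chunks):
        self._task = asyncio.create_task(self._speak(chunks))

    async def barge_in(self):
        """Caller started talking: stop playback and drop the rest."""
        if self._task and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass


async def demo():
    turn = AgentTurn()
    all_chunks = ["c1", "c2", "c3", "c4", "c5"]
    turn.start(all_chunks)
    await asyncio.sleep(0.05)           # caller interrupts partway through
    await turn.barge_in()
    return turn.played, all_chunks


played, all_chunks = asyncio.run(demo())
print(f"played {len(played)} of {len(all_chunks)} chunks before barge-in")
```

In a real system the dialog manager would also record which chunks were actually heard, so the next LLM call knows what the caller did and did not receive.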

The diagnostics: when latency goes wrong

When p50 first-audio creeps above 1 second, the failure mode is usually one of these:

Symptom	Most likely cause
"Bot pauses before answering"	Endpointing threshold too high (try model-based)
"Bot answers a turn late"	ASR is batch, not streaming
"Bot pauses mid-sentence"	TTS not streaming, or waiting for full LLM completion
"First few words clipped"	Jitter buffer too small or TTS warm-up cost
"Bot ignores interruptions"	Echo cancellation off, or TTS not killable
"Random 3-second pauses"	Cold-start inference (no prompt cache, or model autoscaling)

Instrument every stage with per-call timing. Without a stage-level latency dashboard, you cannot reason about p95 regressions.
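
A minimal version of per-stage timing is a context manager that records how long each stage took. The `StageTimer` class and the stage names are hypothetical; the `time.sleep` calls stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage latencies in milliseconds."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.samples_ms[name].append(elapsed_ms)

    def slowest(self):
        """Stage with the highest worst-case latency recorded so far."""
        return max(self.samples_ms, key=lambda s: max(self.samples_ms[s]))


timer = StageTimer()
with timer.stage("final_asr"):
    time.sleep(0.005)    # stand-in for the ASR stabilization pass
with timer.stage("llm_first_token"):
    time.sleep(0.05)     # stand-in for waiting on the first model token
print(timer.slowest())
```

Emitting these samples per call, tagged with a call ID, is what lets a dashboard show which stage moved when p95 regresses.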

What this has to do with text-side support

The cross-channel principle is shared, even though the numbers are different by an order of magnitude. On the text side — chat, email, helpdesk — Twig optimizes for time-to-first-token under 1 second on chat (visible "..." indicator) and time-to-complete-response under 5 seconds. The architectural moves are the same: streaming inference, prompt caching, speculative retrieval, parallelized tool calls.

The difference is the tolerance for missing the budget. A chat user might tolerate 3 seconds. A voice caller will not tolerate 1.5 seconds. So the voice channel forces the engineering rigor that, applied to text, just makes everything else faster too.

The takeaway

The 800ms rule is the conversational-feel target. Hit it at p50, stay under 1500ms at p95, and the rest of the voice AI architecture has room to breathe. Miss it, and no amount of model quality, persona tuning, or knowledge grounding will rescue the experience.

The hardest part isn't picking the right ASR or LLM or TTS — vendor benchmarks are close enough that the choice rarely makes or breaks a deployment. The hard part is wiring them together so that every stage streams, tool calls overlap, and the system stops trying to be sequential. The teams that get this right ship voice AI that callers describe as "fast." The teams that get it wrong ship voice AI that callers describe as "AI."

Frequently Asked Questions

Why does voice AI feel robotic?

Almost always one of two reasons: response latency above 800–1000ms creates a perceptible pause that humans read as 'thinking like a machine,' or the TTS is not streamed (so the agent waits for the full response before speaking the first word). Fix the latency, and most 'robotic' complaints disappear without changing the language model.

What is end-of-turn detection?

End-of-turn (or end-of-utterance) detection is the system's decision that the caller has finished speaking and the agent should respond. Naive systems use a fixed silence threshold (e.g., 800ms of quiet). Modern systems use a model-based endpointer that combines acoustic features, prosodic cues, and semantic completion to fire in 200–400ms.

What is barge-in and why does it matter?

Barge-in is the caller's ability to interrupt the agent mid-response. It requires three things working together: TTS that can be stopped instantly, ASR that continues listening through agent audio (via echo cancellation), and a dialog manager willing to drop a half-spoken response and pivot. Without barge-in, voice AI feels like a recorded message.

How fast does ASR need to be for voice AI?

Streaming ASR should emit partial transcripts within 100–200ms of speech and a final transcript within 300–500ms of the caller going quiet (endpointing plus the final stabilization pass). Non-streaming (batch) ASR adds 500–1500ms of latency and should not be used in conversational voice agents, only in post-call analytics.

Can I run voice AI over a regular phone line?

Yes — via SIP trunking from a telephony provider (Twilio, Vonage, AWS Chime, Telnyx) to your media server. The narrowband G.711 codec adds about 20–50ms one-way; jitter on a busy PSTN path adds another 30–80ms. Plan the latency budget accordingly: PSTN gives you less headroom than WebRTC.
