Latency Budgets for Voice AI Agents: The 800ms Rule of Natural Conversation
Humans expect a response within 800ms in turn-taking conversation. A voice AI agent that misses that budget feels robotic. Here is how to allocate the budget across ASR, LLM, and TTS.

Key Takeaways
- Human turn-taking happens with ~200ms median gap; >800ms feels awkward, >1500ms feels broken
- The 800ms budget covers endpointing, final ASR, LLM, tools, TTS first chunk, and network jitter
- Streaming everywhere (ASR partials, LLM tokens, TTS chunks) is what makes the budget achievable
- Model-based endpointing fires in 200–400ms vs. 800ms+ for naive silence thresholds
- Tool calls and retrieval should run in parallel with LLM inference, not after
- Barge-in requires TTS that can stop instantly and ASR that hears through agent audio
Twig is an autonomous AI support platform that triages, self-evaluates, and resolves customer support tickets by integrating with tools like Zendesk, Salesforce, and Intercom. On the text side, latency is measured in seconds and rarely matters to a customer staring at a "..." indicator. On the voice side, every millisecond is felt. This post is for the engineers building voice AI agents who need to know exactly where the time goes — and where to claw it back.
TL;DR: Human-to-human turn-taking happens with a median gap of about 200ms; gaps over 800ms feel awkward. A voice AI agent has roughly 800ms from the caller's end-of-utterance to first audible response, and that budget has to cover endpoint detection, final ASR, LLM inference, optional tool calls, TTS first chunk, and network jitter. Hitting it requires streaming everywhere, parallel tool calls, model-based endpointing, and brutal honesty about which steps can be amortized while the caller is still talking.
Where the 800ms number comes from
Conversation analysts have studied turn-taking gaps across languages for decades. The findings are remarkably consistent:
- Median gap in casual human-to-human conversation: ~200ms
- Gap perceived as "natural": 0–500ms
- Gap perceived as "thinking" or "thoughtful": 500–800ms
- Gap perceived as awkward or robotic: 800–1500ms
- Gap perceived as broken / "still there?": >1500ms
Sources: Stivers et al. (2009) on cross-linguistic turn-taking; Levinson & Torreira (2015) on conversational timing; replicated in synthetic voice studies by Microsoft, Google, and Meta voice teams.
The implication for voice AI: aim for <800ms first-audio latency at p50, and <1500ms at p95. Above that, every CSAT survey will tell you the bot "felt slow."
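As a rough illustration of how these bands and targets could be checked against measured first-audio gaps, here is a minimal sketch. The function names and the choice to treat each upper bound as inclusive are assumptions made for the example, not part of the cited research.

```python
from statistics import quantiles

# Perception bands from the list above: (inclusive upper bound in ms, label).
GAP_BANDS = [
    (500, "natural"),
    (800, "thoughtful"),
    (1500, "awkward"),
]


def classify_gap(gap_ms: float) -> str:
    """Map a first-audio gap to the perception band it falls in."""
    for upper_ms, label in GAP_BANDS:
        if gap_ms <= upper_ms:
            return label
    return "broken"


def meets_targets(gaps_ms, p50_target_ms=800, p95_target_ms=1500) -> bool:
    """Check the p50 < 800ms and p95 < 1500ms targets on measured gaps."""
    # quantiles(..., n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = quantiles(gaps_ms, n=100)
    return cuts[49] < p50_target_ms and cuts[94] < p95_target_ms


if __name__ == "__main__":
    sample = [400] * 90 + [1200] * 10  # made-up measurements in ms
    print(classify_gap(650), meets_targets(sample))
```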
The full latency breakdown
Here is the chain from "caller finishes speaking" to "caller hears first syllable of response":
| Stage | What happens | p50 budget | p95 budget |
|---|---|---|---|
| End-of-turn detection | System decides caller is done | 200–300ms | 400ms |
| Final ASR transcript | Last partial → final text | 50–150ms | 250ms |
| LLM first token | Prompt assembled → first model token | 200–400ms | 700ms |
| Tool calls (if needed) | CRM read, KB retrieval | 0ms (overlap) | 0–300ms |
| TTS first audio chunk | First model token → audible audio | 100–200ms | 350ms |
| Network jitter (PSTN/WebRTC) | Media path latency | 30–80ms | 150ms |
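| Total to first audio | End of caller speech → first audible audio | ~600–850ms | ~1450ms |
The totals are lower than a straight sum of the rows (roughly 580–1130ms at p50 and up to 2150ms at p95) because they assume the stages overlap. The total only fits in 800ms if every stage is streamed and overlapping. Sequential, non-streaming pipelines blow the budget by 2–3x and produce the classic "lights are on, nobody is home" voicebot experience.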
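As a sanity check on the table, the sketch below adds the stage budgets end to end and compares the result with the stated totals. The dictionary keys are illustrative names, and the p95 tool-call entry uses the top of its 0–300ms range. Percentiles of separate stages do not add exactly, so treat the sums as a rough ceiling rather than a prediction.

```python
# Stage budgets in ms, copied from the table: (p50 low, p50 high, p95).
STAGES = {
    "end_of_turn": (200, 300, 400),
    "final_asr": (50, 150, 250),
    "llm_first_token": (200, 400, 700),
    "tool_calls": (0, 0, 300),  # p95 uses the top of the 0-300ms range
    "tts_first_audio": (100, 200, 350),
    "network_jitter": (30, 80, 150),
}

# Totals as stated in the last row of the table.
STATED_TOTAL = (600, 850, 1450)


def sequential_sum(stages):
    """Add every stage end to end, as a non-overlapping pipeline would."""
    return tuple(sum(budget[i] for budget in stages.values()) for i in range(3))


# How much time overlap must hide for the stated totals to hold.
required_overlap = tuple(
    max(0, seq - stated)
    for seq, stated in zip(sequential_sum(STAGES), STATED_TOTAL)
)

if __name__ == "__main__":
    print(sequential_sum(STAGES))  # (580, 1130, 2150)
    print(required_overlap)        # (0, 280, 700)
```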
Stage-by-stage: where the time actually goes
Stage 1: End-of-turn detection (200–400ms)
The naive approach ("if I hear 800ms of silence, the caller is done") burns most of the budget before any other work starts. Worse, a fixed threshold still cuts off callers who pause mid-sentence, and they end up talking over the start of the agent's response.
The 2025-era approach is model-based endpointing:
- Acoustic features: pitch contour, energy fall-off, vowel lengthening
- Prosodic cues: terminal intonation patterns
- Semantic completion: the partial transcript looks grammatically/semantically complete
A model that fuses all three fires in 200–400ms when confidence is high and falls back to longer silence thresholds only when it isn't. Production voice agents from PolyAI and Parloa publish endpointing latencies in this range.
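The fusion-plus-fallback logic can be sketched as follows. This is a toy under stated assumptions: a production endpointer would be a trained model, and the weights, the 0.75 confidence cutoff, and the 250ms and 800ms silence thresholds here are hypothetical values chosen only to make the control flow visible.

```python
from dataclasses import dataclass


@dataclass
class TurnCues:
    acoustic: float  # 0..1, e.g. energy fall-off, vowel lengthening
    prosodic: float  # 0..1, terminal intonation
    semantic: float  # 0..1, partial transcript looks complete


# Hypothetical weights and thresholds, for illustration only.
WEIGHTS = (0.3, 0.3, 0.4)
CONFIDENT = 0.75
FAST_SILENCE_MS = 250
FALLBACK_SILENCE_MS = 800


def end_of_turn_confidence(cues: TurnCues) -> float:
    """Fuse the three cue families into one score with a weighted sum."""
    w_a, w_p, w_s = WEIGHTS
    return w_a * cues.acoustic + w_p * cues.prosodic + w_s * cues.semantic


def required_silence_ms(cues: TurnCues) -> int:
    """Short wait when confident, longer silence threshold otherwise."""
    if end_of_turn_confidence(cues) >= CONFIDENT:
        return FAST_SILENCE_MS
    return FALLBACK_SILENCE_MS


def is_end_of_turn(cues: TurnCues, silence_ms: float) -> bool:
    return silence_ms >= required_silence_ms(cues)
```

With strong cues on all three signals, 300ms of silence is enough to end the turn; with weak cues the same 300ms is treated as a mid-sentence pause.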
Stage 2: Final ASR (50–250ms)
If you're using streaming ASR (Whisper-streaming, Deepgram Nova, AssemblyAI Universal-Streaming, or proprietary contact-center ASR), the "final" transcript at end-of-turn is mostly a stabilization pass over partials that already arrived. It should land within 50–150ms.
If you're using batch ASR — even fast batch — you've already lost. Batch ASR processes the whole utterance after end-of-turn and adds 500–1500ms. Acceptable for analytics, not for live dialog.
Stage 3: LLM inference (200–700ms)
This is the biggest variable. The right metric is time-to-first-token (TTFT), not time-to-completion — the TTS can start speaking the first sentence while the model is still generating the third.
Practical tactics to lower TTFT:
- Prompt caching: the system prompt, persona, and policy rules don't change per turn. Cache them. (Twig uses prompt caching on the text side for the same reason.)
- Smaller models for routing, larger for resolution: a fast classifier picks the intent in 50–100ms; the larger model handles only the intents that need it.
- Co-located inference: model in the same region as the media server. Cross-region inference adds 50–150ms one-way.
- Streaming output: the TTS pipeline starts on token #1, not token #N.
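To make the TTFT-versus-completion distinction concrete, here is a minimal measurement sketch. `fake_llm_stream` is a hypothetical stand-in for whatever streaming client is in use, not a real API; only `measure_stream` is the point of the example.

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_llm_stream(first_token_delay_s: float, n_tokens: int) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM client."""
    await asyncio.sleep(first_token_delay_s)  # simulated prefill time
    for i in range(n_tokens):
        yield f"tok{i} "
        await asyncio.sleep(0.005)  # simulated inter-token gap


async def measure_stream(stream: AsyncIterator[str]) -> dict:
    """Record time-to-first-token and time-to-completion for one stream."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    async for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens += 1
    return {"ttft_s": ttft, "total_s": time.perf_counter() - start, "tokens": tokens}


if __name__ == "__main__":
    print(asyncio.run(measure_stream(fake_llm_stream(0.05, 10))))
```

In a streaming pipeline only `ttft_s` sits on the critical path to first audio; `total_s` matters for how long the agent keeps talking, not for how soon it starts.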
Stage 4: Tool calls (ideally overlapped to 0ms)
The temptation is to wait for the LLM to decide which tool to call. The faster pattern is speculative retrieval: as soon as the intent is classified (often from a partial transcript), start the likely CRM read and KB retrieval. By the time the LLM emits the tool call, the result is already cached.
This is also why entity-graph retrieval matters — Twig's text-side architecture uses entity-grounded retrieval (Zendesk ticket history, Salesforce account state, PostgreSQL live data, REST API endpoints) so the relevant context arrives in parallel with intent classification.
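The speculative retrieval pattern described above can be sketched with `asyncio` tasks. Everything named here (`fetch_crm`, `search_kb`, `llm_decides_tool`, the returned fields) is a hypothetical placeholder with simulated delays; it does not describe Twig's or any vendor's actual implementation.

```python
import asyncio


async def fetch_crm(caller_id: str) -> dict:
    await asyncio.sleep(0.05)  # simulated CRM read
    return {"caller_id": caller_id, "plan": "pro"}


async def search_kb(intent: str) -> list:
    await asyncio.sleep(0.05)  # simulated KB retrieval
    return [f"article about {intent}"]


async def llm_decides_tool(intent: str) -> str:
    await asyncio.sleep(0.08)  # simulated LLM time before it emits a tool call
    return "crm"


async def handle_turn(caller_id: str, intent: str) -> dict:
    # Start the likely lookups as soon as the intent is known,
    # instead of waiting for the LLM to ask for them.
    speculative = {
        "crm": asyncio.create_task(fetch_crm(caller_id)),
        "kb": asyncio.create_task(search_kb(intent)),
    }
    tool = await llm_decides_tool(intent)
    result = await speculative[tool]  # usually finished already
    for name, task in speculative.items():
        if name != tool:
            task.cancel()  # no-op if the task already completed
    return {"tool": tool, "result": result}


if __name__ == "__main__":
    print(asyncio.run(handle_turn("caller-1", "billing")))
```

The trade-off is wasted work: lookups that the LLM never requests still cost a backend call, so speculation is best limited to cheap, read-only operations.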
Stage 5: TTS first audio (100–350ms)
Streaming neural TTS emits the first audio chunk within 100–200ms of receiving the first LLM token. The trap: if your TTS waits for sentence boundaries or paragraph boundaries before synthesizing, you've added 300–800ms of silence the caller perceives as "thinking."
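One way to avoid the sentence-boundary trap is to flush text to the synthesizer at clause boundaries or after a small character count, whichever comes first. The sketch below assumes a TTS engine that accepts incremental text; how real engines take streamed input varies, and the 40-character cap is a hypothetical value.

```python
from typing import Iterable, Iterator

CLAUSE_END = (",", ";", ":", ".", "!", "?")


def tts_chunks(tokens: Iterable[str], max_chars: int = 40) -> Iterator[str]:
    """Group streamed LLM tokens into small chunks for the synthesizer.

    A chunk is flushed at the first clause boundary, or once it reaches
    max_chars, rather than waiting for a full sentence or paragraph.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(CLAUSE_END) or len(buf) >= max_chars:
            chunk = buf.strip()
            if chunk:
                yield chunk
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever is left when the stream ends


if __name__ == "__main__":
    stream = ["Sure", ",", " I", " can", " help", " with", " that", "."]
    print(list(tts_chunks(stream)))  # ['Sure,', 'I can help with that.']
```

Here the synthesizer can begin on "Sure," after two tokens instead of waiting for the full sentence.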
Specific TTS choices that hit the budget:
- ElevenLabs Flash / Turbo models (first audio in 75–200ms)
- Cartesia Sonic (first audio in 90ms)
- Deepgram Aura (first audio in 200ms)
- PlayHT Play3.0 (first audio in 150ms)
These numbers are vendor-published; real-world latency on a busy SIP path adds 30–80ms of jitter.
Stage 6: Network jitter (30–150ms)
The least controllable stage. PSTN over G.711 codecs adds 20–50ms one-way at the carrier. Jitter buffers (usually 30–60ms) absorb micro-variations. WebRTC over Opus typically beats PSTN by 10–20ms but isn't an option for most inbound phone numbers.
Plan for p95 jitter of 150ms and don't try to claw it back.
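To plan around jitter you first have to measure it. A common way is the running interarrival jitter estimate defined for RTP in RFC 3550, sketched below. The packet timestamps in the example are made up, and a real media server would compute this from RTP timestamps in clock units rather than milliseconds.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running interarrival jitter estimate in the style of RFC 3550.

    For consecutive packets, D is the change in transit time:
        D = (recv[i] - recv[i-1]) - (send[i] - send[i-1])
    and the estimate is smoothed as J += (|D| - J) / 16.
    """
    jitter = 0.0
    for i in range(1, len(send_ms)):
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


if __name__ == "__main__":
    # 20ms packets; the third one arrives 10ms late.
    print(interarrival_jitter([0, 20, 40], [50, 70, 100]))  # 0.625
```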
Barge-in: the conversational-feel multiplier
Even with the 800ms budget hit perfectly, the agent feels like a recording without barge-in. Three components:
- TTS that can stop instantly: a kill signal interrupts streaming TTS in <50ms and flushes any audio already queued in the playout or jitter buffer.
- ASR that hears through agent audio: full-duplex echo cancellation lets the ASR transcribe the caller while the TTS is still playing. Without this, the caller has to wait for the agent to finish to interrupt — which defeats the point.
- Dialog manager that drops gracefully: when barge-in fires, the manager discards the in-flight response and routes the new caller turn to a fresh LLM call.
Barge-in adds noticeable engineering complexity but is the single highest-impact UX feature after raw latency.
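The stop-and-discard part of barge-in maps naturally onto task cancellation. The sketch below simulates playback with short sleeps and uses an `asyncio.Event` as the barge-in signal; in a real system that signal would come from the ASR or VAD, and the function names here are hypothetical.

```python
import asyncio


async def play_tts(chunks, played):
    """Simulated playout: each chunk takes 20ms to play."""
    for chunk in chunks:
        await asyncio.sleep(0.02)
        played.append(chunk)


async def run_turn(chunks, barge_in: asyncio.Event):
    """Play a response, but stop as soon as the caller barges in."""
    played = []
    playback = asyncio.create_task(play_tts(chunks, played))
    interrupt = asyncio.create_task(barge_in.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done and not playback.done():
        playback.cancel()  # kill signal: drop the rest of the response
        try:
            await playback
        except asyncio.CancelledError:
            pass
        return played, True
    interrupt.cancel()
    return played, False


async def demo(interrupt_after_s=None):
    barge_in = asyncio.Event()
    if interrupt_after_s is not None:
        asyncio.get_running_loop().call_later(interrupt_after_s, barge_in.set)
    return await run_turn([f"chunk{i}" for i in range(10)], barge_in)


if __name__ == "__main__":
    print(asyncio.run(demo(interrupt_after_s=0.05)))
```

After `run_turn` reports an interruption, the dialog manager would route the caller's new turn to a fresh LLM call instead of resuming the discarded response.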
The diagnostics: when latency goes wrong
When p50 first-audio creeps above 1 second, the failure mode is usually one of these:
| Symptom | Most likely cause |
|---|---|
| "Bot pauses before answering" | Endpointing threshold too high (try model-based) |
| "Bot answers a turn late" | ASR is batch, not streaming |
| "Bot pauses mid-sentence" | TTS not streaming, or waiting for full LLM completion |
| "First few words clipped" | Jitter buffer too small or TTS warm-up cost |
| "Bot ignores interruptions" | Echo cancellation off, or TTS not killable |
| "Random 3-second pauses" | Cold-start inference (no prompt cache, or model autoscaling) |
Instrument every stage with per-call timing. Without a stage-level latency dashboard, you cannot reason about p95 regressions.
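A minimal sketch of per-stage timing, assuming an in-process collector; a production system would more likely export these samples to a metrics backend. `StageTimer` and its method names are made up for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles


class StageTimer:
    """Collects per-stage latency samples and reports percentiles."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        """Time a block of code and record it under the stage name."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples_ms[name].append((time.perf_counter() - start) * 1000)

    def record(self, name: str, ms: float) -> None:
        """Record a latency measured elsewhere, e.g. reported by a vendor SDK."""
        self.samples_ms[name].append(ms)

    def percentile(self, name: str, pct: int) -> float:
        """Return the pct-th percentile (1..99); needs at least two samples."""
        return quantiles(self.samples_ms[name], n=100)[pct - 1]


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("final_asr"):
        time.sleep(0.01)  # stand-in for real work
    for ms in range(1, 101):
        timer.record("llm_first_token", ms)
    print(timer.percentile("llm_first_token", 50), timer.percentile("llm_first_token", 95))
```

Wrapping each of the six stages in `timer.stage(...)` on every call gives the stage-level p50 and p95 series needed to see which stage caused a regression.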
What this has to do with text-side support
The cross-channel principle is shared, even though the numbers are different by an order of magnitude. On the text side — chat, email, helpdesk — Twig optimizes for time-to-first-token under 1 second on chat (visible "..." indicator) and time-to-complete-response under 5 seconds. The architectural moves are the same: streaming inference, prompt caching, speculative retrieval, parallelized tool calls.
The difference is the tolerance for missing the budget. A chat user might tolerate 3 seconds. A voice caller will not tolerate 1.5 seconds. So the voice channel forces the engineering rigor that, applied to text, just makes everything else faster too.
The takeaway
The 800ms rule is the conversational-feel target. Hit it at p50, hold p95 under 1500ms, and the rest of the voice AI architecture has room to breathe. Miss it, and no amount of model quality, persona tuning, or knowledge grounding will rescue the experience.
The hardest part isn't picking the right ASR or LLM or TTS — vendor benchmarks are close enough that the choice rarely makes or breaks a deployment. The hard part is wiring them together so that every stage streams, tool calls overlap, and the system stops trying to be sequential. The teams that get this right ship voice AI that callers describe as "fast." The teams that get it wrong ship voice AI that callers describe as "AI."