The Voice Agent Latency Playbook: Instrument, Diagnose, Fix
In testing, perceived latency was 1.4 seconds at P50 and 3.4 seconds at P95. This article shows how to make that gap visible, identify which stage to fix, and tune the pipeline with real traces.
Transport layer: Real-time voice needs low-level audio streaming. WebRTC solves this, and LiveKit provides an end-to-end framework on top of it so we can focus on the AI components instead of the media infrastructure.
Two architectural approaches exist:
- Speech-to-speech models handle everything in a single model: fast, but limited in control, customization and observability.
- Cascaded pipelines (STT → LLM → TTS) keep the stages separate, which makes each one easier to inspect, tune, and replace.
This article focuses on the cascaded pipeline.

Where does the time go?
Every turn passes through a sequence of stages, each contributing milliseconds before the user hears the first word of the agent’s response.
VAD (Voice Activity Detection) monitors the audio stream and flags when speech stops. Without it, you'd need a push-to-talk button.
STT transcribes the buffered audio into text. This is often the longest stage in the input phase, because the model needs enough audio context to produce an accurate transcript.
EOU (End of Utterance) reads the transcript and decides: is the user done, or just pausing mid-sentence? This prevents the agent from interrupting natural pauses, but it can only run after STT delivers text, so the two delays stack.
LLM generates the response. The metric that matters here is time to first token (TTFT): the agent doesn't need the full response before it starts speaking.
TTS converts that first chunk of tokens into audio. Again, what matters is time to first byte (TTFB): the moment the user hears sound and the silence breaks.
Perceived latency is the sum of the last three stages: eou_delay + llm_ttft + tts_ttfb
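The formula is simple enough to check directly against per-turn span durations. A minimal sketch, using made-up timings rather than the real traces:

```python
from statistics import median

# Per-turn stage timings in milliseconds (illustrative values, not real traces).
turns = [
    {"eou_delay": 554, "llm_ttft": 566, "tts_ttfb": 243},
    {"eou_delay": 600, "llm_ttft": 2246, "tts_ttfb": 296},
    {"eou_delay": 520, "llm_ttft": 480, "tts_ttfb": 230},
]

def perceived_latency(turn):
    # perceived latency = eou_delay + llm_ttft + tts_ttfb
    return turn["eou_delay"] + turn["llm_ttft"] + turn["tts_ttfb"]

latencies = sorted(perceived_latency(t) for t in turns)
print(latencies)          # [1230, 1363, 3142]
print(median(latencies))  # 1363
```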
Tool-call turns: two silences instead of one
Tool-call turns are different: the agent needs external data before it can answer. If the user hears nothing during that time, the interaction feels slow even when the system is working correctly. Two mechanisms fix this:
- Spoken acknowledgment ("Let me check that") before the tool runs.
- Background sound (e.g., typing) during execution.
The silence breaks earlier, perceived latency drops, and the remaining delay feels like progress rather than uncertainty.
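Both mechanisms can be sketched with plain asyncio; here `say`, `play_typing_sound`, and `call_tool` are stand-ins, not LiveKit APIs:

```python
import asyncio

async def play_typing_sound():
    # Stand-in for looping a background clip while the tool runs.
    while True:
        await asyncio.sleep(0.01)

async def call_tool():
    await asyncio.sleep(0.05)  # stand-in for an MCP or network call
    return {"status": "ok"}

async def tool_turn(say):
    say("Let me check that.")  # spoken acknowledgment breaks the silence early
    sound = asyncio.create_task(play_typing_sound())
    try:
        return await call_tool()  # the remaining wait is covered by the sound
    finally:
        sound.cancel()

spoken = []
print(asyncio.run(tool_turn(spoken.append)))  # {'status': 'ok'}
```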

Making Latency Visible with Langfuse
Langfuse records what happened in each turn: when EOU fired, how long the LLM took to emit the first token, how long TTS took to produce the first audio frame, and how long each tool call ran. Every turn becomes a trace with nested spans, one per stage.
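The shape of one such trace can be sketched as nested spans, one per stage; field names here are illustrative, not Langfuse's actual schema:

```python
# One turn as a trace with nested spans, one per stage (illustrative schema).
turn_trace = {
    "name": "agent_turn",
    "spans": [
        {"name": "eou_delay", "duration_ms": 554},
        {"name": "llm_ttft", "duration_ms": 566},
        {"name": "tts_ttfb", "duration_ms": 243},
        {"name": "tool_call", "duration_ms": 208},
    ],
}

def span_duration(trace, name):
    # Pull one stage's duration out of the turn trace.
    return next(s["duration_ms"] for s in trace["spans"] if s["name"] == name)

print(span_duration(turn_trace, "llm_ttft"))  # 566
```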
Stack: Deepgram Nova 3 (STT) + Ministral 3B via Ollama (LLM) + Pocket TTS (TTS) + LiveKit Agents + Langfuse
What the numbers say
Five sessions, 36 turns total: 23 simple (no tool calls) and 13 with tool calls.
Simple turns (n=23):
| Component | P50 | P95 |
|---|---|---|
| EOU delay | 554ms | 858ms |
| LLM TTFT | 566ms | 2,246ms |
| TTS TTFB | 243ms | 296ms |
| Perceived latency | 1,392ms | 3,384ms |
EOU delay is stable, sitting near the min_endpointing_delay default of 0.5s. This confirms turn detection works as configured. TTS TTFB is also consistent: Pocket TTS produces the first audio frame in ~240ms regardless of response content.
The LLM is where variance lives. P50 is 566ms, but P95 jumps to 2.2 seconds. As conversations grow longer, TTFT tends to climb because more history is passed to the model. Two traces hit 2.4s and 6.8s, both deep into multi-turn conversations where the full history was being sent.
Tool-call turns (n=13, across 1–4 tool rounds):
| Component | P50 | P95 |
|---|---|---|
| EOU delay | 550ms | 616ms |
| LLM TTFT (decide to call tool) | 913ms | 4,428ms |
| TTS TTFB (acknowledgment) | 213ms | 244ms |
| Tool execution (first round) | 208ms | 2,359ms |
| LLM TTFT (final answer) | 586ms | 748ms |
| TTS TTFB (final audio) | 250ms | 330ms |
Two things stand out:
The first LLM pass is slower than the second. P50 of 913ms to decide to call a tool, versus 586ms to generate the final answer. The first pass carries the full conversation context plus tool definitions. The second only needs to incorporate the tool result and respond.
Tool execution time varies widely. P50 is 208ms but P95 is 2.4 seconds. Some MCP calls resolve in ~100ms, others take over 2 seconds depending on the network. Background audio and verbal acknowledgments help bridge that wait.
For tool-call turns, the user experiences two silences. The first (before the acknowledgment) follows the same formula: eou_delay + llm_ttft + tts_ttfb, at P50 ~1.7 seconds. The second (after tool execution, before the final answer) depends on the last LLM pass and TTS: at P50 ~1.4 seconds.
Where to look first
- LLM TTFT is the biggest variable. It has the widest range and the highest P95. Reducing context size (sliding window, summarization) helps. So does using a faster or smaller model.
- STT latency is embedded in EOU delay. STT runs while the user speaks, but its transcription time affects when EOU can confirm the turn is complete. A faster STT model does not reduce perceived latency directly, but it tightens EOU delay, which is the first term in the formula.
- TTS adds ~240ms consistently. A larger TTS model improves voice quality but increases TTFB. With Pocket TTS, this stage is not a bottleneck.
- Tool execution is external. Background audio and verbal acknowledgments cover the wait.
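For the context-size lever, a sliding window is only a few lines. A sketch, assuming the common chat-message format (nothing here is LiveKit- or model-specific):

```python
def sliding_window(messages, max_messages=8):
    # Keep the system prompt plus only the most recent messages, so the
    # prompt sent to the LLM stays a fixed size as the conversation grows.
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    return [system] + rest[-max_messages:]

history = [{"role": "system", "content": "You are a voice agent."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(20)]

trimmed = sliding_window(history, max_messages=4)
print(len(trimmed))  # 5
```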
Inspect the raw traces (5 sessions, 36 turns): Session 1 · Session 2 · Session 3 · Session 4 · Session 5
Knobs worth turning
Traces show where time goes; the next step is deciding what to change. Some fixes are mechanical: a parameter, a context limit, a pre-rendered audio file. Others involve model tradeoffs. The goal is to fix the stage that dominates your P95.
Four levers that don't require a faster model
Turn detection tuning. EOU delay is the first term in the perceived latency formula, and it is entirely a configuration choice. Two controls matter:
- Silence threshold: how long to wait after silence before confirming the turn. Too low and the agent talks over natural pauses. Too high and the user waits for a response that could have started 300ms earlier.
- Interruption sensitivity: how much speech is needed before recognizing an interruption. Too low and background noise cuts the agent off. Too high and the user has to repeat themselves.
Preemptive generation. Start generating a response before end-of-turn is fully confirmed, using the transcript as soon as STT delivers it. If the prediction matches what the user meant, the agent responds faster. If not, the response is discarded. Tradeoff: wasted compute on discarded generations.
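The underlying idea can be sketched as a speculative asyncio task; `generate_response` is a stand-in for the streaming LLM call, not a LiveKit API:

```python
import asyncio

async def generate_response(transcript):
    await asyncio.sleep(0.05)  # stand-in for the LLM's time to first token
    return f"reply to: {transcript}"

async def handle_turn(interim_transcript, final_transcript):
    # Start generating from the interim transcript, before EOU confirms the turn.
    speculative = asyncio.create_task(generate_response(interim_transcript))
    # ... EOU fires here and delivers the final transcript ...
    if final_transcript == interim_transcript:
        return await speculative  # prediction held: respond early
    speculative.cancel()          # prediction missed: discard and regenerate
    return await generate_response(final_transcript)

print(asyncio.run(handle_turn("book a table", "book a table")))
# reply to: book a table
```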
Pre-rendered audio. For predictable phrases like greetings or hold messages, preload audio frames at startup and skip TTS entirely for those turns. Only works for static phrases, but it removes ~240ms of TTS TTFB from every turn it applies to.
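The caching idea, sketched with a placeholder `synthesize` standing in for the real TTS call:

```python
# Pre-render audio for static phrases at startup so those turns skip TTS.
# synthesize() is a placeholder for a real TTS call.
def synthesize(text):
    return f"<audio:{text}>".encode()

PRERENDERED = {
    phrase: synthesize(phrase)
    for phrase in ("Hi, how can I help?", "One moment please.")
}

def speak(text):
    cached = PRERENDERED.get(text)
    if cached is not None:
        return cached        # cache hit: no TTS TTFB on this turn
    return synthesize(text)  # dynamic phrase: fall back to live TTS
```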
Background audio. Play a thinking sound while the agent generates or waits on a tool call. The latency stays the same, but the silence disappears. Combined with a spoken acknowledgment before the tool call, this is often the most practical way to handle tool execution delays.
Beyond these, the usual model-level levers apply: quantization reduces TTFT at the cost of reasoning quality; smaller models are faster but less capable; limiting context keeps TTFT flat; and starting TTS before the full LLM response is ready reduces time to first audio if chunking is handled well.
There is no universal best configuration; the point is to replace guessing with measurement.
Reference: LiveKit AgentSession parameters
Turn detection (eou_delay):
- min_endpointing_delay: seconds of silence before the agent considers the turn complete. Lower values (e.g. 0.3s) speed up response but risk interrupting natural pauses.
- max_endpointing_delay: upper bound when the detector is uncertain. Too high and the user waits in silence on misjudged turns.
- min_interruption_duration: how long the user must speak before the agent stops talking. Too low and background noise triggers it.
- min_interruption_words: same threshold by word count. 2–3 prevents single-word noises from interrupting.
Response generation (llm_ttft):
- preemptive_generation: described above.
Speech synthesis (tts_ttfb):
- session.say(text, audio=...): preloaded audio for static phrases, described above.
STT latency does not appear as a separate term in the perceived latency formula, but it is embedded in EOU delay. A faster STT tightens that window.
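Put together, the parameters above might be set like this. A sketch against LiveKit Agents' AgentSession; the values are starting points to tune against your own traces, not recommendations:

```python
from livekit.agents import AgentSession

session = AgentSession(
    min_endpointing_delay=0.5,      # confirm the turn after 0.5s of silence
    max_endpointing_delay=6.0,      # upper bound when the detector is unsure
    min_interruption_duration=0.5,  # speech needed before the agent yields
    min_interruption_words=2,       # ignore single-word noises
    preemptive_generation=True,     # start the LLM before EOU confirms
)
```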
Try the demo, inspect your trace
Talk to the agent, then click the Traces dropdown at the top of the Space. Each turn generates a trace. Click open trace to see the full breakdown in Langfuse: EOU delay, LLM TTFT, TTS TTFB, tool execution, all as nested spans.

