datarekha
Agents May 18, 2026

Voice agents at scale: the Vapi, Retell, Bland.ai engineering

Voice agents are the unsexy success story of the agent era. Underneath the marketing they're all the same five-box pipeline — STT, LLM, TTS, turn-taker, telephony — fighting for the same 500ms latency budget. Here's how the three biggest platforms actually build it, what each one optimizes for, and where the real cost goes.

13 min read · by datarekha · Tags: voice-agents, vapi, retell, telephony

If you ask which agent category shipped most in 2025–2026, the honest answer is not coding assistants and not browser automation. It’s voice. The appointment-confirmation bot at your dentist, the outbound sales caller from a series-B startup you’ve never heard of, the front-line debt-collection agent that sounds suspiciously polite — those are all built on the same small stack of platforms: Vapi, Retell AI, Bland.ai, and a handful of open-source frameworks like Pipecat and LiveKit.

The fascinating thing isn’t that they exist. It’s that all three of the biggest hosted platforms shipped the same architecture, then differentiated on which 50ms of the pipeline they own. This is what the production voice stack looks like in mid-2026, what each platform optimizes for, and where the real engineering sits.

The five-box pipeline (and the 500ms it has to fit in)

Every production voice agent — Vapi, Retell, Bland, Pipecat, LiveKit, the custom rig your competitor built in a weekend — is the same five-box pipeline:

[Figure: Voice agent, end-to-end pipeline. Telephony (Twilio, Telnyx; SIP / WebRTC) → STT (Deepgram, AssemblyAI; ~100–500ms) → LLM (GPT-4o-mini, Gemini Flash; ~350ms–1s) → TTS (Cartesia, ElevenLabs; ~75–200ms) → Telephony (audio out to caller). Dashed box, the cross-cutting layer that runs in parallel with everything above: turn-taker, interruption handler, function-call router.]
The same architecture under every platform. The top row is the data path; the dashed box is where actual engineering differentiation lives.

The latency budget for natural human conversation is brutal. Below 500ms end-to-end, the caller perceives the agent as “real.” Above 800ms, they start trailing off and saying “are you still there?” Above 1,000ms, hang-up rates double. That budget — STT 100–500ms, LLM 350ms–1s, TTS 75–200ms, plus network — has to be hit per turn, which is the load-bearing constraint that explains every architectural choice below.

The trick the entire industry uses to make it fit is streaming everywhere. STT streams partials as the user speaks. LLM streams tokens as it generates. TTS streams audio as it receives tokens. Done naively, the serial latencies add up. Done right, they overlap, and a 1.4-second pipeline collapses to ~600ms perceived. Pipecat’s documentation is the cleanest open-source reference for how this pipelining is actually wired — streaming=True is not an optimization, it’s the only way the math works.
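The arithmetic behind that collapse can be sketched in a few lines. This is a simplified model, not Pipecat's implementation: the `Stage` type and the per-stage numbers are assumptions picked to sit inside the ranges quoted above, and it pretends each stage can start as soon as the previous one emits its first output (in practice TTS needs a short run of tokens, not a single one).

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first partial result
    total_ms: int         # time until the stage has finished all of its work


def serial_latency_ms(stages):
    """Each stage waits for the previous one to finish completely."""
    return sum(s.total_ms for s in stages)


def streamed_latency_ms(stages):
    """Each stage starts as soon as the previous one emits its first output."""
    return sum(s.first_output_ms for s in stages)


# Illustrative numbers only (assumptions, not measurements).
stages = [
    Stage("stt_finalize", first_output_ms=150, total_ms=300),
    Stage("llm", first_output_ms=300, total_ms=900),
    Stage("tts", first_output_ms=100, total_ms=200),
]

print(serial_latency_ms(stages))    # 1400
print(streamed_latency_ms(stages))  # 550
```

With these made-up inputs the serial path lands at 1.4 seconds and the streamed path at 550ms before network overhead, which is the shape of the collapse described above.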

Vapi, Retell, Bland — picking the same architecture and selling different things

If the architecture is identical, what’s the actual product?

Vapi is middleware. The pitch: bring your own LLM, bring your own TTS, bring your own telephony — Vapi handles orchestration, function-call routing, and the turn-taking layer. Their base orchestration fee is $0.05 per minute, and everything else is passed through at provider cost. That’s a developer platform, not a managed service. Devs love it; ops teams hate it because debugging a bad call means reading logs from four different vendors.

Retell AI is the managed version of the same thing. They pre-pick Deepgram for STT and have invested heavily in their own turn-taking and interruption models. Their published latency target is sub-500ms, and in independent comparisons they consistently sit at the low (fastest) end of the latency spread. The trade-off: less flexibility, but you can hand a non-technical operations lead a Retell dashboard and they can ship a working agent in an afternoon.

Bland.ai is the volume play. The pitch is unapologetically about outbound — 100,000 calls a day, contact center replacement, regulated verticals. Their architecture leans more toward speech-to-speech models that collapse the STT–LLM–TTS pipeline at the cost of vendor lock-in, and they trade some of the voice quality of Retell for sheer throughput. Reports of multi-hour outages in 2025 have been one of the persistent complaints, and 2026 has been their year of infrastructure rebuild.

The pricing table tells the story:

Platform | Latency (median) | Base rate           | All-in/min  | Optimized for
Vapi     | 500–800ms        | $0.05 orchestration | $0.30–0.33  | Developers, custom stacks
Retell   | <500ms           | bundled             | ~$0.31      | Voice quality, managed infra
Bland.ai | 600–900ms        | bundled             | ~$0.09–0.20 | High-volume outbound

The numbers come from CloudTalk’s pricing breakdown and Buildberg’s 2026 comparison. The interesting line is the all-in column. On Vapi, only 17% of the cost is platform — the rest is Deepgram, OpenAI, ElevenLabs, and Twilio fees the platform passes through. Retell hides those numbers inside one bundled price; Bland.ai gets it lower by owning more of the stack (their own STT, their own TTS for some voices).
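The 17% figure is just the Vapi row of the table divided through. A quick check, using only the $0.05 orchestration fee and the $0.30–0.33 all-in range quoted above (the function name is illustrative):

```python
def platform_share(platform_fee: float, all_in: float) -> float:
    """Fraction of the all-in per-minute price that goes to the platform."""
    return platform_fee / all_in


VAPI_FEE = 0.05  # orchestration fee per minute, from the table

for all_in in (0.30, 0.33):
    share = platform_share(VAPI_FEE, all_in)
    print(f"all-in ${all_in:.2f}/min -> platform {share:.0%}, pass-through {1 - share:.0%}")
# all-in $0.30/min -> platform 17%, pass-through 83%
# all-in $0.33/min -> platform 15%, pass-through 85%
```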

The hard parts (where the actual engineering lives)

Three problems get all the engineering attention. None of them are about the LLM. All of them are about the conversation around it.

1. End-of-turn detection

The agent has to know when you’ve stopped talking before it starts. Get this wrong by 200ms in either direction and the experience breaks. The naive solution — voice activity detection (VAD), wait for silence — fails because backchannels (“uh-huh”, “yeah”), sighs, coughs, and even ambient noise trip it. In production, this is the single largest source of “the bot interrupted me” complaints.

The state of the art in 2026 is a small dedicated classifier model that takes both the audio stream and the streaming transcript and predicts “is the user done?” LiveKit published their end-of-turn model in early 2026, reporting a 39% reduction in unwanted interruptions vs. VAD alone. Retell’s turn-taker is the most-praised in independent testing; Vapi exposes it as a configurable model choice; Bland.ai bakes it into their speech-to-speech path.
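To make the idea concrete, here is a toy sketch of combining silence with the transcript. It is not LiveKit's or Retell's model: the word list, the scores, and the thresholds are all assumptions, and a keyword heuristic stands in for the learned classifier. The point is only the shape of the decision: the more finished the utterance looks, the less silence the agent waits for.

```python
import re

# Hypothetical list of words that rarely end a finished thought.
INCOMPLETE_ENDINGS = {"and", "but", "so", "because", "um", "uh", "the", "a", "to", "my", "for"}


def completion_score(partial_transcript: str) -> float:
    """Crude stand-in for a learned model: 0.0 = mid-thought, 1.0 = clearly done."""
    words = re.findall(r"[a-z'-]+", partial_transcript.lower())
    if not words:
        return 0.0
    if words[-1] in INCOMPLETE_ENDINGS:
        return 0.1
    if partial_transcript.rstrip().endswith(("?", ".", "!")):
        return 0.9
    return 0.5


def required_silence_ms(score: float, min_ms: int = 200, max_ms: int = 1200) -> int:
    """Interpolate: a complete-looking utterance needs less trailing silence."""
    return round(max_ms - score * (max_ms - min_ms))


def is_end_of_turn(partial_transcript: str, silence_ms: int) -> bool:
    return silence_ms >= required_silence_ms(completion_score(partial_transcript))


print(is_end_of_turn("Can I book for Tuesday?", silence_ms=350))  # True
print(is_end_of_turn("I need to reschedule my", silence_ms=350))  # False
```

A pure VAD with a fixed 350ms threshold would treat both utterances the same way; the transcript signal is what separates them.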

2. Interruption handling (barge-in)

The mirror problem. When the caller speaks while the agent is talking, what happens? Three policies:

  • Hard barge-in: the agent stops mid-syllable, flushes its TTS buffer, and listens. Feels responsive but cuts off the agent if the caller said “uh-huh.”
  • Soft barge-in: the agent pauses, waits 300ms to see if the interruption resolves, then either resumes or restarts.
  • Adaptive interruption: a small model decides which one based on prosody and content. This is what LiveKit’s adaptive interruption shipped in 2026.

Every production agent ships some flavor of this. The detail that actually matters is that audio you’ve already sent to the telephony provider can’t be unsent. Twilio buffers ~200ms ahead. Even with hard barge-in, the caller hears the next 200ms of the agent’s response. The only fix is shorter TTS chunks: generate one sentence at a time, not one paragraph.
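A minimal sketch of two pieces from this section: sentence-level chunking of the LLM output, and the resume-or-yield decision at the end of a soft barge-in pause. Everything here is an assumption for illustration (the backchannel list, the function names, the 300ms grace window taken from the bullet above); no platform's actual policy is implied. The regex splitter is naive and will break on abbreviations like "p.m.".

```python
import re

# Hypothetical backchannel list; real systems use prosody as well as text.
BACKCHANNELS = {"uh-huh", "uh huh", "yeah", "mm-hmm", "okay", "ok", "right"}


def split_into_sentences(text: str) -> list[str]:
    """Split LLM output so TTS can be fed, and cancelled, one sentence at a time."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def soft_barge_in_action(caller_text: str, caller_speech_ms: int, grace_ms: int = 300) -> str:
    """After pausing the agent, decide whether to 'resume' speaking or 'yield' the turn."""
    normalized = caller_text.strip().lower().rstrip(".!?,")
    if normalized in BACKCHANNELS:
        return "resume"  # acknowledgement, not a real interruption
    if caller_speech_ms < grace_ms:
        return "resume"  # noise or a false start that resolved inside the grace window
    return "yield"       # the caller is actually taking the turn


print(split_into_sentences("Sure, I can help. Your appointment is Tuesday at three. Anything else?"))
# ['Sure, I can help.', 'Your appointment is Tuesday at three.', 'Anything else?']
print(soft_barge_in_action("Uh-huh.", caller_speech_ms=500))                      # resume
print(soft_barge_in_action("Wait, that's the wrong date", caller_speech_ms=900))  # yield
```

Feeding TTS one sentence at a time means a barge-in only ever has to discard the current sentence, rather than a whole paragraph of already-synthesized audio.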

3. Function calling for booking / CRM lookup

Voice agents that just talk are demos. Voice agents that book appointments, check order status, transfer to a human, and update CRM records are products. Function calling is how that happens, same as text agents, with one extra constraint: the LLM call must return in under 1.5 seconds even when it includes tool use.

This is where most production teams compromise on model choice. GPT-4o is too slow for the inner loop; GPT-4o-mini and Gemini Flash dominate because they hit the latency budget. The hard cases get routed to a bigger model with a “let me check that for you” filler line generated in parallel — by the time the filler finishes playing, the big model has returned. It’s an audio version of the routing pattern that powers most production AI systems.
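The filler-line pattern is ordinary concurrency: start the slow call, play the filler while it runs, then speak the answer. A minimal asyncio sketch, where `play_filler` and `call_big_model` are hypothetical stand-ins (sleeps, with durations scaled down to milliseconds) rather than any platform's API:

```python
import asyncio


async def play_filler(line: str, duration_s: float) -> str:
    """Hypothetical stand-in for streaming a short TTS clip to the caller."""
    await asyncio.sleep(duration_s)
    return line


async def call_big_model(question: str, latency_s: float) -> str:
    """Hypothetical stand-in for a slower, more capable model or tool call."""
    await asyncio.sleep(latency_s)
    return f"answer to: {question}"


async def answer_hard_turn(question: str) -> list[str]:
    # Start the slow call first so it runs while the filler is playing.
    slow = asyncio.create_task(call_big_model(question, latency_s=0.05))
    spoken = [await play_filler("Let me check that for you.", duration_s=0.02)]
    # By the time the filler has played, most of the slow call's latency is hidden.
    spoken.append(await slow)
    return spoken


print(asyncio.run(answer_hard_turn("Where is order 1182?")))
# ['Let me check that for you.', 'answer to: Where is order 1182?']
```

The ordering matters: creating the task before awaiting the filler is what makes the two overlap. Awaiting them one after the other would add the latencies back together.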

[Figure: Per-turn latency budget for a typical production call, on a 0–1000ms timeline. Tracks: STT partials up to end-of-turn (EOT), LLM (first token at 200ms), TTS first chunk, then audio streaming to the caller. Target: the caller hears the agent under 500ms after end-of-turn. The trick: STT, LLM, and TTS are not serial; they overlap. The first TTS audio chunk leaves at ~500ms, while the LLM is still streaming the rest of its tokens behind it. Perceived latency is the leading edge, not the sum.]
A working-day turn budget on a Vapi or Retell call. Without streaming/pipelining the same work serially would take ~1.6 seconds — well over the disconnect threshold.

What still fails in 2026

For all the polish, voice agents in 2026 still fail in predictable ways, and any honest production deployment treats these as design constraints rather than bugs to fix later.

Long names and uncommon nouns wreck STT. Deepgram and AssemblyAI both ship custom vocabulary upload, but adoption is uneven. A healthcare agent with no custom vocab will mis-hear “Levothyroxine” about 30% of the time. Vapi exposes this via per-agent vocab; Retell auto-builds it from your knowledge base; Bland makes you load it explicitly per campaign. The cost of getting this wrong is the agent confidently writing a wrong prescription name into a CRM.
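For teams that cannot rely on vendor-side vocabulary, one fallback is a post-STT correction pass. To be clear, this is a different technique from the custom vocabulary features named above (those bias the recognizer itself, and their APIs are not shown here). The sketch below fuzzy-matches single words against a term list using Python's standard `difflib`; the function name and the 0.8 cutoff are assumptions, and it does not handle multi-word terms.

```python
import difflib


def correct_terms(transcript: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Replace near-miss words with the closest term from a known vocabulary."""
    lookup = {term.lower(): term for term in vocabulary}
    corrected = []
    for word in transcript.split():
        core = word.strip(".,!?")  # match on the word without trailing punctuation
        match = difflib.get_close_matches(core.lower(), list(lookup), n=1, cutoff=cutoff)
        if core and match:
            corrected.append(word.replace(core, lookup[match[0]]))
        else:
            corrected.append(word)
    return " ".join(corrected)


print(correct_terms("I take levothyroxin daily.", ["Levothyroxine"]))
# I take Levothyroxine daily.
```

A cutoff this high keeps ordinary words from being rewritten, at the cost of missing badly garbled terms. For anything that ends up in a medical record, a confirmation turn ("did you say Levothyroxine?") is the safer design.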

Hold music and ambient noise destabilize VAD. If the caller is at a coffee shop, expect a 2x increase in spurious interruptions. The fix is either model-based EOT (above) or aggressive noise suppression on the inbound audio (Krisp, RNNoise) — both add 30–50ms of latency. In practice production deployments accept that trade.

Tool calls that need a database round-trip blow the latency budget. If your CRM API is on the other side of the world, you’ve already lost. The patch is the filler-line pattern above — generate “let me pull that up for you” while the network round-trip is in flight. Done well, the caller never knows. Done poorly, the agent feels artificial in a way they can’t articulate.

End-of-conversation detection is harder than start-of-conversation detection. Telling a caller who wants to hang up from one who has just paused is hard. Retell and Vapi both expose tunable end-call thresholds. Bland.ai’s outbound campaigns default to aggressive hang-up to maximize calls/hour, which periodically hangs up on customers who were about to convert.

Why this category quietly won

Voice agents are the unsexy success story of the agent era for one structural reason: the surface they automate has clear boundaries. A phone call has a start, an end, a known purpose (“book a service appointment,” “confirm tomorrow’s reservation”), and a measurable outcome (was the appointment booked, was the bill paid). That’s the opposite of the “autonomous agent does anything” framing that made AutoGPT a viral demo and a non-product.

Pricing reflects this clarity. Most voice agent deployments are billed per resolution, sometimes per minute — the same outcomes-based logic Sierra applies to customer service text agents. The buyer doesn’t care which model is in the middle; they care that the call gets handled at 1/8 the cost of a human agent and the appointment ends up in their calendar.

The competitive dynamic that’s going to shape 2027 is which layers stay unbundled. Today the platforms are thin orchestrators on top of Deepgram, OpenAI, and ElevenLabs. The economic pressure is for one player — most likely Bland, given their volume — to vertically integrate and own the STT and TTS layers themselves. Retell has hinted at the same direction with their managed voice models. Vapi’s bet is the opposite: stay neutral, ride the cost curve of whoever wins the underlying models. Both bets are coherent. We’ll know in 18 months which one was right.

What to take away

  • Voice agents are a five-box pipeline with a 500ms budget. The architecture is solved. The differentiation is in turn-taking, interruption handling, and function calling.
  • Vapi, Retell, Bland are the same architecture sold to different buyers. Vapi for devs, Retell for ops teams that need good voice, Bland for high-volume outbound.
  • 70% of the per-minute cost is model fees the platforms pass through. Pricing differentiation is mostly about which providers the platform pre-picked, not platform margin.
  • The hard engineering is end-of-turn detection and barge-in, not the LLM. Production teams that obsess over which LLM to use are usually missing the actual quality bottleneck.
  • Outbound vs inbound are different products. Outbound agents (Bland.ai’s specialty) optimize for throughput and short calls; inbound agents (Retell’s strength) optimize for resolution and brand voice. Most teams discover the difference late, after picking a platform for the wrong reason.
  • The “speech-to-speech” architecture is real but lock-in heavy. Models like GPT-4o’s real-time API and Bland.ai’s native pipeline collapse STT+LLM+TTS into one call. Latency wins, vendor flexibility loses. Pick deliberately.

The voice-agent category is the cleanest demonstration of the Building Effective Agents thesis in production: a scoped, narrow task, with an explicit success metric, run through a simple workflow. Nobody on a phone call wants autonomy. They want the agent to handle the boring middle and connect them to a human when it can’t. That’s the actual product, and it’s shipping at scale.


Further reading: LiveKit’s turn-detection writeup is the most thorough public reference. Pipecat is the cleanest open-source pipeline implementation. Deepgram’s voice agent blog and Cartesia’s engineering posts cover the underlying STT and TTS engineering in detail.
