April 30, 2026

What Full-Duplex Speech Models Actually Change in Voice AI

Key insights after running 300,000 calls in production

Key Takeaways

  • Cascaded voice AI (STT, then LLM, then TTS) loses tone, emotion, and prosody at the transcription step. This is format loss, not a latency problem.
  • Full-duplex speech models process raw audio directly, eliminating the conversion step entirely.
  • The production impact: natural turn-taking, robust barge-in handling, and responses that account for how something was said, not just what.
  • The architecture decisions made today determine what is observable and improvable later.

Introduction

A customer service agent calls a frustrated customer about a 45-day overdue payment. The customer says: "do whatever you want."

The cascaded voice AI listening to that call sees one transcript: "do whatever you want." It logs the call as a successful collection commitment. The forecast updates. The work queue moves on.

Three days later, the payment does not arrive. Nobody knows why the model got it wrong. Nobody can know, because the information that would have explained it was discarded at the speech-to-text step before the model ever saw the conversation.

This is not a prompt engineering problem. It is not a model capability problem. It is an architecture problem. And it is the problem full-duplex speech models actually solve.

The Pipeline You Are Probably Using

The standard architecture for voice AI today looks like this:

User speaks → STT (speech-to-text) → LLM (reasoning) → TTS (text-to-speech) → User hears response

Three separate model calls. Two format conversions. Well-optimized streaming deployments come in at 800ms to 2 seconds end-to-end; without aggressive tuning, 2 to 4 seconds is common.

This is the cascaded architecture, and it is what the overwhelming majority of voice AI in production runs on today. It works. Deepgram or AssemblyAI handles the STT. OpenAI or Anthropic handles the reasoning. ElevenLabs or Cartesia handles TTS. The problem is not that it is slow. The problem is what it does to the conversation on its way through.
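To make the shape of this pipeline concrete, here is a minimal sketch of one turn through a cascaded stack. The stub classes and latency figures are illustrative placeholders, not any vendor's actual SDK; the structure is what matters.

```python
import time

# Hypothetical stand-ins for the three stages. Real deployments would swap in
# Deepgram/AssemblyAI, OpenAI/Anthropic, and ElevenLabs/Cartesia clients;
# the point is the structure, not the vendors.

class SttStub:
    def transcribe(self, audio: bytes) -> str:
        time.sleep(0.25)                        # network + recognition time
        return "do whatever you want"           # the words survive; the tone does not

class LlmStub:
    def respond(self, transcript: str) -> str:
        time.sleep(0.60)                        # reasoning over text only
        return "Understood. I'll note that down."

class TtsStub:
    def synthesize(self, text: str) -> bytes:
        time.sleep(0.30)                        # synthesis; prosody is invented from scratch
        return b"<reply audio>"

def cascaded_turn(audio: bytes) -> bytes:
    """One conversational turn: two format conversions, three sequential model calls."""
    t0 = time.monotonic()
    text_in = SttStub().transcribe(audio)       # conversion 1: audio -> text
    text_out = LlmStub().respond(text_in)       # the LLM never hears the caller
    audio_out = TtsStub().synthesize(text_out)  # conversion 2: text -> audio
    print(f"end-to-end: {time.monotonic() - t0:.2f}s")
    return audio_out

cascaded_turn(b"<caller audio>")
```

Each stage has to finish before the next can start, which is where the 800ms-to-2-second budget comes from, and the only artifact that survives the first stage is text.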

What Gets Lost Between Your Voice and the Model

The STT step converts continuous speech into a text transcript. That transcript is what the LLM receives. This works fine for simple exchanges. It stops working when the conversation carries meaning in things other than words, which is most of the time.

When your speech becomes text, the LLM never sees:

  • Tone: Whether the customer said "do whatever you want" with certainty or with resignation. Both produce the same transcript. The model cannot distinguish them, because the information was never passed in.
  • Hesitation: A pause before answering a direct question about payment is semantically different from a pause between sentences. Text has a period. It does not have the three-second silence that precedes it.
  • Prosodic stress: The emphasis on certain words tells you whether the obstacle is capability or willingness. Transcripts flatten it.
  • Emotional register: An aggressive tone, a confused tone, a defeated tone: these are encoded in acoustic features that are irretrievably lost the moment audio becomes text. The LLM gets a best-case transcript and must reconstruct what actually happened.

The research literature on prosody (the sound properties of vocal expression: pitch, rhythm, timbre, and speech rate) consistently documents that prosodic features carry linguistic and paralinguistic information that text cannot represent (Banse and Scherer, 1996; Larrouy-Maestri et al., 2023). Overlapping speech alone, which cascading systems struggle to handle, accounts for roughly 20 percent of spoken time in natural conversation (Çetin and Shriberg, 2006).
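To make the loss concrete, here is a minimal sketch, assuming librosa is installed and that call_turn.wav is a short recording of the caller's turn, of acoustic features that exist in the waveform but never reach a text-only LLM. The feature choices and thresholds are illustrative.

```python
import numpy as np
import librosa

# Prosodic features present in the audio that a transcript discards.
# "call_turn.wav" is a hypothetical recording of one caller turn.
y, sr = librosa.load("call_turn.wav", sr=16000)

# Pitch contour: rising vs. falling intonation, certainty vs. resignation.
f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
pitch_range_hz = np.nanmax(f0) - np.nanmin(f0)

# Energy: how forcefully the words were delivered.
rms = librosa.feature.rms(y=y)[0]

# Silences inside the turn: the three-second pause a period cannot represent.
spans = librosa.effects.split(y, top_db=30)
gaps = np.diff(spans.reshape(-1))[1::2] / sr if len(spans) > 1 else []

print(f"pitch range: {pitch_range_hz:.0f} Hz, mean energy: {rms.mean():.4f}")
print(f"intra-turn pauses (s): {[round(float(g), 2) for g in gaps]}")
# None of these numbers appear in the transcript the LLM receives.
```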

This is not a prompt engineering problem. It is not a model capability problem. It is a format loss problem. It happens before the model is involved.

What Full-Duplex Speech Models Actually Are

Full-duplex speech models remove the transcription step entirely. Instead of audio to text to model to text to audio, they operate directly on raw audio streams. The model receives your voice. The model generates voice. No intermediate text. One continuous bidirectional channel.
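As a rough sketch of the interface difference (not any particular vendor's API; the session object here is a made-up stand-in), a full-duplex call is two concurrent loops over one connection rather than a chain of request-response calls:

```python
import asyncio

class FakeDuplexSession:
    """Hypothetical stand-in for a full-duplex speech-to-speech connection.
    Real systems speak a streaming protocol such as WebSockets; a queue per
    direction is enough to show the shape."""
    def __init__(self):
        self.up = asyncio.Queue()
        self.down = asyncio.Queue()

    async def send_audio(self, frame: bytes):
        await self.up.put(frame)                # caller audio flows in continuously

    async def receive_audio(self):
        while True:
            frame = await self.down.get()
            if frame is None:                   # end of model speech
                return
            yield frame

async def uplink(session, mic_frames):
    # Keep sending caller audio even while the model is talking: this is what
    # makes barge-in a first-class signal rather than an edge case.
    for frame in mic_frames:
        await session.send_audio(frame)

async def downlink(session):
    async for frame in session.receive_audio():
        print(f"play {len(frame)} bytes")       # would go to the speaker

async def main():
    session = FakeDuplexSession()
    await session.down.put(b"model-hello")      # pretend the model is mid-response
    await session.down.put(None)
    await asyncio.gather(uplink(session, [b"frame1", b"frame2"]), downlink(session))

asyncio.run(main())
```

The caller's audio keeps flowing in while the model's audio flows out, so overlap and interruption are signals the model sees rather than edge cases the orchestration layer has to patch around.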

Architecturally, this requires solving a harder problem: how do you get an LLM to process continuous audio in real time, manage two simultaneous streams (speaker and listener), handle overlapping speech, and produce low-latency output?

The answer varies by implementation:

  • Moshi (Kyutai Labs): Paper released September 2024, weights and code open-sourced under CC-BY 4.0. Uses a neural audio codec called Mimi that compresses speech to 1.1 kbps at 80ms frame latency, then runs a 7B-parameter speech-text foundation model that handles both streams with an "Inner Monologue" approach, maintaining semantic coherence by keeping a text-level representation internally while operating on audio at the interface. Reported latency: 160ms theoretical, 200ms in practice on an L4 GPU.
  • Google Gemini Live: Native multimodal architecture where audio is tokenized and processed through the same model that handles text, images, and code. Gemini 2.5 Flash Native Audio (released September 2025) operates over WebSockets with sub-second response initiation. VAD and barge-in are handled natively. Audio-only sessions are limited to 15 minutes, with session resumption supported for longer interactions.
  • OpenAI's GPT-4o voice mode: Responds as quickly as 232ms with a 320ms average, per OpenAI's published figures. The model processes speech end-to-end without intermediate transcription.
  • ByteDance's Seeduplex (released April 2026): Deployed in production inside the Doubao app. Reports a roughly 50 percent reduction in combined false-response and false-interruption rates, 250ms faster end-of-turn detection, 40 percent fewer talk-over incidents, and an 8.34 percent improvement in overall call satisfaction compared to ByteDance's previous half-duplex Doubao voice model.
  • NVIDIA's PersonaPlex (released January 2026): MIT-licensed code with NVIDIA Open Model License weights. Built on Moshi's architecture with 7B parameters and a Helium LLM backbone. Evaluated on FullDuplexBench using metrics like Takeover Rate (TOR) for smooth turn-taking, user interruption handling, and pause handling. NVIDIA reports state-of-the-art performance on conversational dynamics, response latency, and interruption latency relative to other open-source and commercial systems.

These models are in production. The architecture is no longer theoretical.
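As a back-of-the-envelope check on the Moshi figures above, the reported codec numbers are internally consistent, and the arithmetic shows how compact the per-frame representation actually is:

```python
# Check on the reported Mimi figures: 80ms frames, 1.1 kbps.
frame_ms = 80
frames_per_second = 1000 / frame_ms             # 12.5 frames per second
bitrate_bps = 1100                              # 1.1 kbps
bits_per_frame = bitrate_bps / frames_per_second
print(f"{frames_per_second} frames/s, {bits_per_frame:.0f} bits per frame")
# 12.5 frames/s and 88 bits per frame: the Moshi paper describes this as a
# small stack of discrete codebook tokens per 80ms slice, compact enough for
# a 7B model to consume and generate in real time.
```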

For context on what this means in practice: DinoDial's voice agents run natively on Gemini Live as the underlying speech-to-speech model, which is why our observability stack treats audio as a first-class signal rather than retrofitting transcript-based monitoring onto an audio-native pipeline. The architectural decision to build on full-duplex, rather than bolting it onto a cascading foundation, is what determines whether your evaluation, governance, and improvement loops can actually leverage prosodic information or just discard it the way the cascading pipeline already did.

What It Actually Changes for Call Quality

The architectural difference has three concrete effects on what happens in a call.

Turn-taking becomes semantic, not acoustic

In a cascaded system, turn detection is mostly Voice Activity Detection. The model waits for silence, buffers the audio, sends it to STT, and begins processing. VAD silence thresholds typically sit between 300 and 700ms, depending on configuration. Any pause longer than that is treated as a turn end.

This breaks constantly. Mid-sentence thinking pauses trigger early responses. Background noise causes false activations. The model barges in mid-thought because the user stopped to breathe.
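A simplified sketch of why this happens. The frame flags and threshold below are illustrative (production systems use real VAD such as WebRTC VAD or Silero), but the failure mode is the same: a silence timer cannot tell a thinking pause from a finished thought.

```python
def detect_turn_end(frames, silence_threshold_ms=500, frame_ms=20):
    """Naive VAD-style endpointing: declare the turn over after enough
    consecutive silent frames. `frames` is an iterable of is_speech flags."""
    needed = silence_threshold_ms // frame_ms
    silent_run = 0
    for i, is_speech in enumerate(frames):
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= needed:
            return i * frame_ms                 # turn "ended" here, in ms
    return None                                 # caller never paused long enough

# "I'll pay ... [900ms thinking pause] ... next Tuesday"
speech = [True] * 40                            # "I'll pay" (~800ms)
pause = [False] * 45                            # thinking pause (~900ms)
rest = [True] * 50                              # "next Tuesday" (~1000ms)
print(detect_turn_end(speech + pause + rest))
# -> fires during the pause, so the agent starts replying mid-thought.
```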

Full-duplex systems can detect turn completion semantically, understanding that "I'll pay... [pause]... next Tuesday" is one thought, not two. ByteDance's Seeduplex, deployed in production at scale inside Doubao, uses joint speech-semantic features for endpoint detection. Reported result: roughly 50 percent reduction in combined false-response and false-interruption rates compared to their previous half-duplex system.

Barge-in actually works

Barge-in, where a user interrupts the system mid-response, is one of the hardest problems in cascaded voice AI. By the time the user has started speaking, the STT has buffered, the LLM has generated, and the TTS is streaming. Stopping cleanly requires interrupting multiple components simultaneously.
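A sketch of what clean barge-in demands from cascaded orchestration code: everything in flight has to be cancelled at once. The task names below are hypothetical; the coordination burden is the point.

```python
import asyncio

async def handle_barge_in(llm_task, tts_task, playback_task):
    """Cancel everything the pipeline has in flight the moment the caller speaks.
    In a cascaded stack this coordination lives in your orchestration code;
    in a full-duplex model it is the model's own behavior."""
    for task in (llm_task, tts_task, playback_task):
        task.cancel()                           # stop generation and audio output
    # Wait for cancellations to land so the next turn starts from a clean state.
    await asyncio.gather(llm_task, tts_task, playback_task, return_exceptions=True)
    # Still unresolved: which partial response was already spoken, and whether
    # the caller's interruption referred to it. The transcript alone cannot say.

async def demo():
    async def long_job(name):
        try:
            await asyncio.sleep(10)             # pretend work is in flight
        except asyncio.CancelledError:
            print(f"{name} cancelled")
            raise
    tasks = [asyncio.create_task(long_job(n)) for n in ("llm", "tts", "playback")]
    await asyncio.sleep(0.1)                    # caller starts talking here
    await handle_barge_in(*tasks)

asyncio.run(demo())
```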

In full-duplex systems, the model hears both streams in parallel. It knows you are speaking while it is responding. NVIDIA's PersonaPlex demonstrates strong performance on FullDuplexBench's user-interruption evaluation, where the model is required to stop speaking and respond appropriately when the user begins talking. The system does not just detect that you are speaking. It understands you are interrupting and adapts accordingly.

The model responds to what actually happened

This is the one that matters most in production.

In collections calling, a hesitant "I'll think about it" and a confident "yes" produce nearly identical transcripts. The acoustic signal is completely different: one carries uncertainty, one carries commitment. A cascaded model responds to the words. A full-duplex model, processing raw audio, responds to what was actually communicated.

This is not a marginal improvement. In high-stakes conversations such as collections, sales qualification, insurance intake, and patient triage, the difference between responding to the words and responding to the meaning is the difference between a productive call and a failed one.

Why the Architecture Decisions You Make Now Matter

Voice AI teams building on cascaded architecture are not making a mistake. But it is worth being clear-eyed about what it cannot do, structurally.

Cascading cannot recover context lost at the STT step. It cannot respond to prosody it never received. Optimizing any individual component (faster STT, smarter LLM, lower-latency TTS) does not change the fundamental information loss that happens at format conversion.

Full-duplex speech models solve a different problem than "how do we make cascading faster." They solve "how do we give the model the full conversation, not just a transcript of it."

If you are building a voice AI system today, the practical decision is whether your use case can tolerate the information loss inherent in transcript-based pipelines. If it can, cascading is fine. If it cannot, you are building on a foundation that will not let you fix the problem later, no matter how much engineering you throw at it.

This is why DinoDial was built natively on full-duplex speech-to-speech from day one rather than as a cascading product with full-duplex bolted on. The downstream consequence is that DinoDial Lens, our observability layer, can evaluate calls using both the side-channel transcript and the underlying acoustic signal. Architecture shapes what is observable, what is improvable, and ultimately what is fixable. The decision to build native is not a marketing position. It is the only way to ensure the rest of the stack is not silently optimizing the wrong things.

Request a demo with DinoDial to see how voice agents can significantly improve your call quality.

Request a Demo →

Frequently Asked Questions

What is a full-duplex speech model?

A full-duplex speech model is a voice AI system that processes and generates raw audio directly, without converting speech to text as an intermediate step. Unlike cascaded architectures, full-duplex models receive audio input and produce audio output in a continuous bidirectional stream, preserving tone, emotion, and prosody.

How is full-duplex different from cascaded voice AI?

Cascaded voice AI converts speech to text (STT), processes the text with an LLM, and converts back to speech (TTS). This involves three separate model calls and two format conversions. Full-duplex replaces this with a single model that operates directly on audio, retaining prosody and emotional cues lost at the STT step.

What latency do full-duplex speech models achieve?

Reported figures: Moshi achieves 160ms theoretical latency. GPT-4o responds as quickly as 232ms. Gemini Live achieves sub-second initiation. For comparison, optimized cascaded systems usually run 800ms to 2 seconds end-to-end.

Which full-duplex speech models are available today?

Models include Moshi (Kyutai Labs), Gemini Live (Google), GPT-4o (OpenAI), SyncLLM (UW/Meta), Seeduplex (ByteDance), PersonaPlex (NVIDIA), and SALM-Duplex (Interspeech 2025).

How do full-duplex models handle interruptions?

Full-duplex models process both audio streams in parallel, so they hear you speaking while generating a response. They can adapt immediately, whereas cascaded systems must coordinate the interruption across STT, LLM, and TTS components simultaneously.