May 2, 2026

Voice AI Observability for Speech-to-Speech Models

Key insights after running 300,000 calls in production

TLDR

  • The "speech-to-speech AI is a black box" narrative is a talking point that reflects the architectural commitments of vendors built on cascading (STT, then LLM, then TTS) pipelines. It is not a technical reality.
  • Speech-to-speech models like Gemini Live natively emit side-channel transcripts for both user input and model output.
  • Cascading models create a false sense of observability. You get text, but you lose the prosodic, paralinguistic, and acoustic signal that influences meaning.
  • Real observability in voice AI requires two separate layers: infrastructure health and conversation quality.
  • 100% post-call evaluation coverage is achievable on speech-to-speech pipelines.
  • The practical test: if your observability platform cannot tell you whether a bad call was an infrastructure failure or an agent failure, it is not observability. It is logging.

The State of the Speech-to-Speech Observability Conversation

If you have spent any time evaluating voice AI platforms in 2024 and 2025, you have heard some version of this claim: "Speech-to-speech models are black boxes. They don't give you transcripts. You can't audit what happened. They are not enterprise-ready."

It appears in LinkedIn posts, product comparison pages, and sales calls. When a vendor's foundational architecture differs from an emerging paradigm, its framing of that paradigm naturally reflects its own architectural commitments.

What is worth pushing back on is when that framing becomes the dominant narrative, unchallenged, in engineering conversations and procurement decisions.

This post examines what "black box" actually means, what observability actually requires, and whether speech-to-speech AI genuinely fails those tests, or whether the tests themselves were designed for a different architecture.

What Does "Black Box" Actually Mean in Production AI?

A black box is defined by the absence of observability tooling, not by the architecture of the model itself.

That sentence is the entire argument. Everything else is illustration.

In systems engineering, a black box is a component whose internal workings cannot be observed, validated, or reasoned about from outside. You send input. You get output. What happens in between is opaque.

The claim applied to speech-to-speech AI is specific: because the model processes audio end-to-end with no intermediate text representation, you cannot see what the model "heard," what it "decided," or why it said what it said. You cannot audit for compliance. You cannot debug failures. You cannot build eval suites.

The argument has surface plausibility. It is also, at its core, a category error.

The question is never "is the model transparent by default?" The question is: can you build sufficient observability around it to detect failures, localize problems, measure outcomes, and improve over time?

For speech-to-speech AI, the answer is yes. Demonstrably, architecturally, and in production.

The Cascading Architecture's Hidden Observability Problem

Before examining what speech-to-speech observability looks like, it is worth being precise about what cascading model observability actually provides, and where it silently fails.

The Three-Pass Pipeline

A standard cascading voice AI pipeline makes three sequential model calls per conversational turn:

  1. STT (Speech-to-Text): Raw audio to text transcript
  2. LLM: Text prompt and transcript to text response
  3. TTS (Text-to-Speech): Text response to synthesized audio

Each hop adds latency. Industry benchmarks for production-grade cascading pipelines show end-to-end latency of 1 to 2 seconds in well-optimized deployments, climbing to 2 to 4 seconds without aggressive streaming optimization. That is well above the 200 to 500ms turn-taking gap that characterizes natural human conversation, as documented in cross-linguistic research by Stivers et al. (2009). This is not a theoretical concern. It shows up as a measurable drop-off in conversation completion rates.
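To make the accumulation concrete, here is a minimal sketch of how a cascading turn could be instrumented to attribute latency to each hop. The stt_transcribe, llm_respond, and tts_synthesize callables are hypothetical stand-ins for whichever STT, LLM, and TTS providers a pipeline actually uses.

```python
import time

def timed(fn, *args):
    """Run one pipeline hop and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000

def run_cascading_turn(audio_chunk, stt_transcribe, llm_respond, tts_synthesize):
    """One conversational turn: STT -> LLM -> TTS, with per-hop latency attribution."""
    transcript, stt_ms = timed(stt_transcribe, audio_chunk)
    reply_text, llm_ms = timed(llm_respond, transcript)
    reply_audio, tts_ms = timed(tts_synthesize, reply_text)

    # A per-hop breakdown makes it obvious where the end-to-end latency budget went.
    timings = {
        "stt_ms": stt_ms,
        "llm_ms": llm_ms,
        "tts_ms": tts_ms,
        "total_ms": stt_ms + llm_ms + tts_ms,
    }
    return reply_audio, timings
```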

The bigger issue is not latency. It is what you lose at Step 1 and only discover at Step 3.

The Signal Loss Problem

When STT converts audio to text, it does not preserve:

  • Prosodic features: pitch, rhythm, stress patterns, the acoustic markers that distinguish "yes" (confident agreement) from "yes" (resigned compliance) from "yes?" (uncertain question)
  • Paralinguistic signals: hesitation markers, laughter, sighing, the micro-pauses that indicate cognitive load
  • Acoustic environment data: background noise characteristics that distinguish "user in a quiet office" from "user driving on a highway," context that should change how the agent responds
  • Emotional state signals: tone and voice quality features that carry information the transcript discards

The authors of Moshi, a speech-to-speech research model, note that overlapping speech alone accounts for roughly 20 percent of spoken time (Çetin and Shriberg, 2006), information that turn-based STT systems discard entirely.

Consider a concrete production scenario. In collections calling, a customer who says "I'll pay tomorrow" with a 200ms hesitation, falling pitch, and reduced vocal energy has measurably different default behavior than one who says "I'll pay tomorrow" with confident, even prosody. The transcripts are identical. The cascading model cannot distinguish them.

This signal loss happens at Step 1. Your observability tooling, which reads the transcript, cannot tell you that the model was working from degraded input. The transcript looks complete. The LLM received it. The TTS rendered a response. All systems nominal. The call still failed.

This is the cascading model's actual observability problem: you have text at every step, but text was never the complete picture.

How Speech-to-Speech AI Actually Works

Speech-to-speech AI, as implemented in models like Google's Gemini Live API, processes audio input and generates audio output within a single neural model. There is no STT-to-LLM handoff, no intermediate text representation as a required computational step, no TTS synthesis from a separate system.

The model receives streaming audio, maintains a continuous internal representation of the conversation state, and generates streaming audio response. Production deployments of Gemini Live (specifically Gemini 2.5 Flash Native Audio, released September 2025) achieve sub-second response initiation in typical usage, with reported time-to-first-token measurements around 320ms p50 and 780ms p95.

For comparison, OpenAI's GPT-4o voice mode responds as quickly as 232ms with a 320ms average, per OpenAI's published figures. Both represent a step change from the 2.8-second average of the GPT-3.5 cascading pipeline that preceded them.

What Speech-to-Speech Actually Hears

Because the model operates on raw audio, it retains:

  • Full prosodic context throughout the conversation
  • Acoustic environment awareness
  • Emotional state signals in the audio
  • Cross-turn memory of how the customer has been speaking: are they getting more impatient? More engaged?
  • Interruption and barge-in handling natively

This is not a marginal improvement over cascading. It is a qualitatively different input space.

The Transcript Question, Settled

Here is the specific technical rebuttal to the black box objection: speech-to-speech models like Gemini Live natively emit side-channel transcripts.

The Gemini Live API exposes two configuration parameters, input_audio_transcription and output_audio_transcription, that produce real-time text representations of user speech and model output, respectively. These transcripts are delivered alongside the audio stream as separate fields in the server response, available to your application layer for logging, compliance, eval pipelines, and downstream processing.
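As a minimal sketch, assuming the google-genai Python SDK's Live API: the configuration below enables both transcription channels and logs them as they stream in. The model ID is illustrative, and exact field names (AudioTranscriptionConfig, server_content.input_transcription) may vary across SDK versions.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Side-channel transcripts: what the model heard and what the model said.
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

async def log_transcripts():
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview-09-2025",  # illustrative model ID
        config=config,
    ) as session:
        # ... stream telephony or microphone audio into the session here ...
        async for message in session.receive():
            sc = message.server_content
            if sc and sc.input_transcription:
                print("USER :", sc.input_transcription.text)
            if sc and sc.output_transcription:
                print("AGENT:", sc.output_transcription.text)

asyncio.run(log_transcripts())
```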

This transcript is not the model's internal computation, which is audio-native. It is a complete, synchronized record of the conversational content, generated by the same model that generated the audio.

The "no transcripts" argument was never technically accurate for current-generation speech-to-speech models. It may have been true of early S2S research systems. It is not true of the production-grade models enterprises are evaluating today.

The Two-Layer Observability Architecture

Understanding why speech-to-speech AI can be fully observable requires separating two distinct observability problems that most platforms conflate.

Layer | Question Answered | Failure Types Detected
Infrastructure Observability | Did the system work? | Telephony failures, LLM latency spikes, network issues, VAD malfunctions, tool call failures
Conversation Observability | Did the agent perform well? | Hallucinations, off-script responses, guardrail violations, failed objectives, customer dissatisfaction

Most monitoring platforms, including general-purpose tools like Datadog and Sentry, operate almost entirely at the infrastructure layer. Datadog sees server uptime and latency metrics. Sentry catches exceptions and crashes. Neither can tell you that the agent gave a customer incorrect account information on a technically successful call (no latency spike, no exception thrown, no crash).

Conversely, most AI-specific eval platforms operate at the conversation layer only, scoring agent responses and measuring task completion, without establishing whether the system was even healthy during the call.

The result is a diagnostic blind spot. When a call fails, you do not know whether to route the problem to your infrastructure team or your prompt engineering team. You apply both. You fix neither efficiently.

Layer 1: Infrastructure Observability

Effective infrastructure observability for speech-to-speech AI should provide:

  • End-to-end call tracing per component (telephony ingress, model inference, audio egress)
  • Full-stack health monitoring across telephony, model serving, and network layers simultaneously
  • Automated failure localization: if latency is elevated, flag it as a latency problem, not an agent problem. If a tool call fails, identify which tool caused the failure, not just that "something went wrong"
  • Deterministic, non-AI monitoring for infrastructure signals that do not require LLM reasoning: fast, cheap, reliable

A key operational principle: not everything requires AI to debug. If a call failed because the network dropped a packet, you do not need a language model to tell you that. Infrastructure monitoring should be deterministic where possible; deterministic checks are cheaper, faster, and more reliable than running AI evals on infrastructure events.
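Here is a minimal sketch of what deterministic checks at this layer can look like, assuming a per-call metrics record. The field names and thresholds are hypothetical, not a DinoDial schema.

```python
from dataclasses import dataclass

@dataclass
class CallInfraMetrics:
    p95_turn_latency_ms: float
    packet_loss_pct: float
    tool_call_failures: int
    vad_segments_dropped: int

def infra_issues(m: CallInfraMetrics) -> list[str]:
    """Deterministic, non-AI checks: cheap enough to run on every call."""
    issues = []
    if m.p95_turn_latency_ms > 1500:          # illustrative latency budget
        issues.append("latency: p95 turn latency above budget")
    if m.packet_loss_pct > 2.0:               # illustrative network threshold
        issues.append("network: packet loss above threshold")
    if m.tool_call_failures > 0:
        issues.append(f"tools: {m.tool_call_failures} failed tool call(s)")
    if m.vad_segments_dropped > 0:
        issues.append("vad: dropped speech segments detected")
    return issues
```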

Layer 2: Conversation Observability

Effective conversation observability for speech-to-speech AI should provide:

  • Dual-dimension outcome scoring: task completion (did the agent achieve the call objective?) and customer satisfaction (how did the interaction feel?) measured independently, not collapsed into a single score
  • Misbehavior detection: hallucination identification, off-script responses, guardrail violations, information leakage
  • 100% evaluation coverage: every call evaluated, not a 1 to 2 percent sample

Full coverage matters because failure distributions in voice AI are rarely uniform. A model regression that degrades performance on a specific accent pattern, or a prompt issue that triggers only when customers mention a particular competitor, will be statistically invisible in a 1 to 2 percent sample. It surfaces in 100 percent coverage.
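As an illustration of dual-dimension scoring and full coverage, here is a sketch of a per-call evaluation record that keeps task completion and satisfaction as independent fields. The names and thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ConversationEval:
    call_id: str
    agent_version: str
    # Scored independently; never collapsed into a single "quality" number.
    task_completed: bool          # did the agent achieve the call objective?
    satisfaction_score: float     # 0.0 to 1.0: how the interaction felt
    misbehaviors: list[str]       # e.g. ["hallucination", "off_script"]

def needs_review(e: ConversationEval) -> bool:
    """Run on 100% of calls; flag candidates for human triage instead of sampling 1-2%."""
    return (not e.task_completed) or e.satisfaction_score < 0.6 or bool(e.misbehaviors)
```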

This two-layer split is the architectural decision behind DinoDial Lens, which separates infrastructure health monitoring from conversation quality evaluation. The reason for the split is operational, not theoretical: when a call fails, the team triaging it needs to know within seconds whether to wake up the infrastructure on-call or the prompt engineer. Conflating those signals into a single "call quality score" is what makes most voice AI observability stacks reactive instead of diagnostic.

Problem Localization: The First Principle

There is a sequence that effective observability must follow, and most voice AI observability platforms skip the first step.

  1. Localize the failure. Was this a system problem or a conversation problem?
  2. Characterize the failure. If system: which component? If conversation: which dimension?
  3. Score and improve. Apply eval frameworks, generate training signals, improve the agent or the infrastructure.

Most platforms enter at Step 3. They score every call, apply generic eval rubrics, and generate improvement recommendations, without ever establishing whether the call failed because the audio pipeline dropped half the packets, or because the agent gave a bad response on a perfectly healthy call.

Consider the common failure scenarios that look like agent failures but are not:

  • Noisy audio: A call with high background noise where the agent seems to give confused responses. Is the agent performing poorly, or is the VAD engine mis-segmenting the audio and feeding the model corrupted input? These require different fixes. Prompt engineering will not solve a VAD configuration problem.
  • Elevated latency: A call where the customer complains the agent is slow or unresponsive. Is the agent's reasoning path too long, or is there a network latency spike between telephony ingress and model serving? The first is a prompt problem. The second is an infrastructure problem.
  • Tool call failures: A call where the agent fails to retrieve account information. Is this an agent hallucination, or did the CRM API call time out? Completely different root causes requiring completely different interventions.

Localization is not optional. It is the prerequisite to every improvement action downstream. Without it, teams spend engineering cycles optimizing the wrong layer, a waste that compounds over the lifetime of a production deployment.
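A sketch of the localize-first discipline: deterministic infrastructure signals are consulted before any conversation eval is trusted. The types are hypothetical and only illustrate the routing decision.

```python
from enum import Enum

class FailureLayer(Enum):
    INFRASTRUCTURE = "infrastructure"   # route to the infrastructure on-call
    CONVERSATION = "conversation"       # route to prompt / agent review
    HEALTHY = "healthy"

def localize_failure(infra_issue_list: list[str], conversation_flags: list[str]) -> FailureLayer:
    """Step 1, before any scoring: decide which team owns the failure."""
    if infra_issue_list:
        # A bad call on degraded infrastructure is not evidence of a bad agent.
        return FailureLayer.INFRASTRUCTURE
    if conversation_flags:
        return FailureLayer.CONVERSATION
    return FailureLayer.HEALTHY
```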

Human-in-the-Loop Governance

Full observability in voice AI is not purely an automated pipeline. In production voice AI, particularly in regulated industries, there must be a structured human review layer between failure detection and training signal generation.

The reason is domain specificity. Generic eval rubrics score things like task completion and response relevance. They do not know that in a collections context, an agent asking "are you sure you want to pay the full amount?" is a compliance failure, not a helpfulness optimization. They do not know that an insurance agent who volunteers policy information the customer did not ask for may be creating regulatory exposure. They do not know which response patterns are legally sensitive in which jurisdiction.

This domain knowledge exists in the humans who run these operations, not in generic scoring models.

An effective human-in-the-loop governance layer does the following:

  1. AI-assisted triage: The system flags calls that had problems, including calls that were failing but recovered, surfacing the exact segment of the call where something went wrong, not the full recording.
  2. Human review: A reviewer listens to the flagged segment, not the full call. 30 seconds of attention, not 4 minutes.
  3. Structured annotation: The reviewer annotates what went wrong and what the correct response should have been. Not binary pass or fail, but specific corrective guidance.
  4. Eval suite accumulation: Annotations accumulate against the specific agent version and inform the eval suite. Evals are never auto-generated from annotations, because eval generation is an expensive, structured process that requires deliberate oversight.
  5. Retraining gate: When annotation volume is sufficient and failure patterns are clear, a human makes the retraining decision, weighing performance improvement against retraining cost.

The AI does the scaling work (monitoring 100 percent of calls, surfacing failure candidates). The human does the judgment work (deciding what constitutes a failure, what the correct behavior should be, whether a pattern is worth training on).

The automation handles scale. The human handles meaning.
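The structured annotation in step 3 is the artifact that carries that human judgment downstream. Here is a sketch of what such a record might capture; the fields are illustrative, not DinoDial's schema.

```python
from dataclasses import dataclass

@dataclass
class ReviewAnnotation:
    call_id: str
    agent_version: str
    segment_start_s: float        # reviewer listens to the flagged segment only
    segment_end_s: float
    failure_category: str         # e.g. "compliance", "hallucination", "tone"
    what_went_wrong: str          # free-text judgment from the domain expert
    correct_response: str         # specific corrective guidance, not pass/fail
```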

What Full-Stack Observability Looks Like in Production

Capability | What It Provides
End-to-end call tracing (per component) | Failure localization: system vs. conversation
100% post-call evaluation | Complete failure detection, no sampling blind spots
Dual-dimension scoring (task and satisfaction) | Accurate call quality measurement
Side-channel transcript capture | Compliance record, eval data, downstream processing
Misbehavior detection | Hallucination, guardrail violations, off-script responses
Human annotation layer | Domain-specific quality signal, compliance governance
Eval suite versioned to agent versions | Accurate attribution of improvements to retraining events
Retraining economics tracking | Cost-aware improvement decisions

The Improvement Loop

The pipeline below is how DinoDial implements this in production:

  1. Infrastructure monitoring catches infrastructure failures in real time. Calls that failed for system reasons never reach post-eval as agent failures.
  2. Conversation evaluation runs on 100 percent of calls for task completion, customer satisfaction, and misbehavior, using both the side-channel transcript and acoustic signals.
  3. AI-assisted triage surfaces failure candidates to human reviewers, opening directly to the failure point in the call, not the full recording.
  4. Human annotation adds domain-specific judgment that generic scoring cannot provide.
  5. Annotations build a targeted adversarial test suite: specific edge case coverage, not brute-force spam.
  6. Agent versioning tracks annotation accumulation, and retraining decisions are gated on economic and performance thresholds.

This is the full pipeline. Every step is present. Every handoff is traceable. The model is not a black box. It is a component in a fully instrumented production system.
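Step 6's retraining gate can be expressed as a simple decision function. The sketch below is schematic: the thresholds and cost model are placeholders, and in practice a human makes the final call.

```python
def retraining_recommended(annotation_count: int,
                           distinct_failure_patterns: int,
                           est_improvement_pct: float,
                           est_retraining_cost_usd: float,
                           value_per_improvement_pct_usd: float) -> bool:
    """Surface a recommendation only; the decision itself stays with a human."""
    enough_signal = annotation_count >= 200 and distinct_failure_patterns >= 3  # placeholder thresholds
    worth_it = est_improvement_pct * value_per_improvement_pct_usd > est_retraining_cost_usd
    return enough_signal and worth_it
```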

Conclusion

The "speech-to-speech is a black box" narrative served a purpose. It framed a paradigm transition as a compliance risk, slowing enterprise adoption of a superior architecture. The technical evidence does not support it.

Speech-to-speech AI produces transcripts. It can be evaluated at 100 percent coverage. It can be observed at the infrastructure and conversation layer simultaneously. It can support human-in-the-loop governance. It can be improved through structured annotation and adversarial testing pipelines.

What it requires, and what the platforms pushing the black box narrative often lack, is an observability architecture built for voice-native operation. One that treats audio as a first-class signal, localizes failures before scoring them, and builds improvement loops that account for the actual cost of production AI at scale.

The black box is not the model. The black box is the monitoring stack that cannot tell you where the problem actually is.

Request a demo with DinoDial to see what observability looks like for speech-to-speech models.


Frequently Asked Questions

Do speech-to-speech models produce transcripts?

Yes. Current-generation speech-to-speech models, including Google's Gemini Live API, emit real-time side-channel transcripts of both user speech and model output via the input_audio_transcription and output_audio_transcription configuration parameters. These transcripts are available to the application layer for compliance logging, audit trails, and downstream processing. The "no transcripts" objection applies to early research-stage S2S systems, not to the production-grade models available today.

Can speech-to-speech AI meet compliance requirements in regulated industries?

Compliance in regulated voice AI is a function of observability infrastructure and governance process, not model architecture. Speech-to-speech pipelines support the same compliance requirements as cascading pipelines: call recording, transcript capture, outcome logging, human review workflows, and audit trails.

How does speech-to-speech latency compare to cascading pipelines?

Production cascading pipelines typically deliver end-to-end conversational latency of 800ms to 2 seconds in well-optimized deployments, climbing to 2 to 4 seconds without aggressive streaming optimization. Speech-to-speech models like Gemini Live and GPT-4o achieve sub-second response initiation, with GPT-4o reporting a 232ms minimum and 320ms average per OpenAI's published figures. Natural human turn-taking sits in the 200 to 500ms range (Stivers et al., 2009), which most cascading pipelines exceed consistently.

How do you evaluate call quality on a speech-to-speech pipeline?

Through a two-layer approach. Infrastructure evaluation is deterministic: latency thresholds, tool call success rates, connection stability metrics. Conversation evaluation combines side-channel transcript analysis with acoustic signal analysis (sentiment detection, interruption patterns, engagement signals). The key architectural decision is treating these as separate evaluation pipelines with separate failure semantics, not collapsing both into a single conversation quality score.

How do you tell whether a bad call was an infrastructure failure or an agent failure?

By instrumenting infrastructure observability and conversation observability as separate layers with separate alert pipelines. If a call fails and latency metrics were elevated during that call, the primary failure hypothesis is infrastructure. If a call fails and all infrastructure metrics were healthy, the primary failure hypothesis is agent performance. Conflating these layers (routing infrastructure failures into agent eval pipelines, or missing infrastructure issues because eval platforms do not monitor them) is the root cause of most misdiagnosed voice AI quality problems.

Is speech-to-speech AI enterprise-ready?

Yes, with appropriate platform infrastructure. Speech-to-speech AI models are production-stable for enterprise voice workloads. The readiness question is not about the model. It is about whether the platform surrounding the model provides the observability, eval, governance, and improvement pipeline infrastructure that enterprise deployments require. Platforms built natively for speech-to-speech operation provide this. Platforms retrofitting S2S onto cascading-native infrastructure are still maturing those capabilities.