May 9, 2026 · 4 min read

OpenAI’s new voice models turn audio into agent infrastructure

OpenAI split real-time voice into separate models for reasoning, translation, and transcription, giving developers a more modular way to build spoken AI agents.


OpenAI is trying to make voice agents less like talking chatbots and more like software that can keep working while a conversation is happening.

The company introduced three real-time audio models for its developer platform: GPT-Realtime-2 for live voice reasoning, GPT-Realtime-Translate for speech translation, and GPT-Realtime-Whisper for streaming transcription. OpenAI says the models are available to test in its developer playground, while Reuters reports that early customers include Zillow, Priceline, and Deutsche Telekom.

Voice is being split into parts

The important change is not just that the voice sounds better. It is that OpenAI is separating the job into smaller pieces. One model handles the live conversation and tool use. Another handles translation across more than 70 input languages and 13 output languages. A third handles low-latency speech-to-text.

That structure matters for teams building real products. A customer-support agent, travel assistant, or field-service tool may need to listen, summarize, translate, check a database, schedule an appointment, and keep talking while all of that happens. Bundling every task into one voice model can be expensive and brittle.

VentureBeat framed the move as a shift toward orchestration primitives: voice reasoning, translation, and transcription become separate components in an agent stack. That is a useful way to think about the launch. OpenAI is not only selling a voice interface. It is selling audio as infrastructure.
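To make the orchestration framing concrete, here is a rough sketch of the three primitives composed into one pipeline. The model names come from the launch; every function, type, and routing rule below is a hypothetical illustration, not OpenAI's actual SDK:

```python
# Hypothetical illustration of voice as three separate primitives.
# The stage functions are placeholders, not real OpenAI SDK calls.

from dataclasses import dataclass


@dataclass
class AudioTurn:
    pcm: bytes        # raw audio frames from the caller
    language: str     # detected language tag, e.g. "de"


def transcribe(turn: AudioTurn) -> str:
    """Always-on, low-latency speech-to-text (imagined GPT-Realtime-Whisper)."""
    raise NotImplementedError


def translate(turn: AudioTurn, target: str) -> AudioTurn:
    """Speech translation (imagined GPT-Realtime-Translate)."""
    raise NotImplementedError


def respond(transcript: str) -> bytes:
    """Reasoning, tool use, and the spoken reply (imagined GPT-Realtime-2)."""
    raise NotImplementedError


def handle_turn(turn: AudioTurn, agent_language: str = "en") -> bytes:
    # Translation is its own stage, so it runs only when the caller's
    # language differs from the agent's working language.
    if turn.language != agent_language:
        turn = translate(turn, target=agent_language)
    # Transcription always runs, keeping a text record of every turn.
    transcript = transcribe(turn)
    return respond(transcript)
```

The point of the split is that each stage can be swapped, metered, and monitored on its own, which is exactly what the separate per-minute and per-token prices suggest.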

The developer pitch

GPT-Realtime-2 is the headline model. OpenAI describes it as a real-time voice model with GPT-5-class reasoning, longer context, stronger recovery behavior, support for interruptions, and the ability to call tools while speaking. The company says it can support patterns such as voice-to-action, systems-to-voice, and voice-to-voice interactions.
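The "call tools while speaking" claim is easiest to picture as concurrent tasks. Here is a minimal asyncio sketch, with an invented `lookup_gate` backend call standing in for whatever tool the agent invokes; this is an assumption about the shape of the interaction, not OpenAI's documented API:

```python
import asyncio


async def lookup_gate(flight: str) -> str:
    """Invented stand-in for a slow backend call (database, CRM, etc.)."""
    await asyncio.sleep(2.0)  # simulate backend latency
    return f"Flight {flight} departs from gate B12"


async def speak(text: str) -> None:
    """Placeholder for streaming synthesized audio to the caller."""
    print(f"[agent] {text}")


async def answer_with_tool_call(flight: str) -> None:
    # Kick off the tool call first, then keep talking while it runs.
    gate_task = asyncio.create_task(lookup_gate(flight))

    # Filler speech covers the backend latency instead of dead air.
    await speak("Let me pull up that flight for you...")

    # Resume the substantive answer once the tool result lands.
    result = await gate_task
    await speak(result)


asyncio.run(answer_with_tool_call("LH 441"))
```

Interruption support layers a further requirement on top of this: the speaking task itself must be cancellable the moment the caller starts talking.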

Those interaction patterns point to the business case. A Zillow-style assistant could search homes, apply user constraints, and schedule a tour. A travel app could explain a delayed connection and pull in gate or baggage information. A telecom support agent could translate a live conversation while still following the support workflow.

Pricing also shows how OpenAI expects developers to mix and match models. Reuters reported that GPT-Realtime-2 starts at $32 per million audio input tokens, GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper costs $0.017 per minute. That makes model routing a product decision, not just an engineering detail.
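Putting rough numbers on a single call shows why. The prices below are the reported figures; the audio-tokens-per-minute rate is an illustrative assumption, since the announcement does not specify one:

```python
# Back-of-the-envelope cost for a 10-minute support call, using the
# reported prices. TOKENS_PER_MINUTE is an illustrative guess, not a
# published figure -- real token counts depend on audio encoding.

REASONING_PER_M_TOKENS = 32.0  # $ per 1M audio input tokens (GPT-Realtime-2)
TRANSLATE_PER_MIN = 0.034      # $ per minute (GPT-Realtime-Translate)
WHISPER_PER_MIN = 0.017        # $ per minute (GPT-Realtime-Whisper)
TOKENS_PER_MINUTE = 800        # assumption for illustration only

minutes = 10
translated_minutes = 3         # translation routed only when needed

reasoning = minutes * TOKENS_PER_MINUTE * REASONING_PER_M_TOKENS / 1_000_000
translation = translated_minutes * TRANSLATE_PER_MIN
transcription = minutes * WHISPER_PER_MIN

print(f"reasoning:     ${reasoning:.3f}")      # $0.256
print(f"translation:   ${translation:.3f}")    # $0.102
print(f"transcription: ${transcription:.3f}")  # $0.170
print(f"total:         ${reasoning + translation + transcription:.3f}")
```

Under these assumptions the reasoning model dominates the bill, so routing audio away from it whenever transcription or translation alone will do is where the savings live.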

Why enterprises will care

Voice agents have often failed in production for practical reasons: latency, state management, handoffs, compliance logging, and cost. A model that sounds natural is not enough if it loses context or cannot explain what action it is taking.

OpenAI’s launch addresses that gap by giving developers more control over the voice pipeline. Transcription can be handled continuously. Translation can be routed only when needed. Reasoning and tool use can sit on top of the spoken interaction rather than being bolted on afterward.

The risk is complexity. More specialized models mean more decisions about routing, monitoring, fallback behavior, and data handling. Companies still need to decide when a voice agent should act, when it should ask for confirmation, and how much of the conversation should be stored.
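In practice those decisions tend to get encoded as explicit policy rather than left to the model. A minimal sketch of an act-versus-confirm gate, with invented action names and risk tiers:

```python
# Minimal policy gate: which tool calls a voice agent may execute
# immediately, and which require spoken confirmation first.
# Action names and tiers are invented for illustration.

AUTO_EXECUTE = {"search_listings", "check_flight_status", "fetch_account_summary"}
CONFIRM_FIRST = {"schedule_tour", "rebook_flight", "issue_refund"}


def dispatch(action: str, confirmed: bool) -> str:
    if action in AUTO_EXECUTE:
        return "execute"                 # read-only: safe to act while speaking
    if action in CONFIRM_FIRST:
        return "execute" if confirmed else "ask_confirmation"
    return "refuse"                      # unknown actions never run silently


assert dispatch("check_flight_status", confirmed=False) == "execute"
assert dispatch("issue_refund", confirmed=False) == "ask_confirmation"
assert dispatch("delete_account", confirmed=True) == "refuse"
```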

The practical read is simple: voice is moving from demo layer to workflow layer. If OpenAI’s approach works, the next wave of voice products will be judged less by how human they sound and more by whether they can safely get work done while people keep talking.