OpenAI Challenges Speech-to-Speech Leaders
Realtime API update adds audio models that reason, transcribe, and translate

Graph depicts GPT-Realtime-2's performance across sectors, competing with other speech-to-speech models.

An update of OpenAI’s speech-to-speech model lets developers tune the tradeoff between speed and reasoning.

What’s new: OpenAI introduced three new audio models in its Realtime API. GPT-Realtime-2 is a speech-to-speech model with configurable reasoning effort. GPT-Realtime-Translate translates speech between more than 70 input languages and 13 output languages, and GPT-Realtime-Whisper transcribes speech into text.

  • Input/output: GPT-Realtime-2 text, audio, and images in (up to 128,000 tokens), text and audio out (up to 32,000 tokens; 1.12 seconds to first audio at minimal reasoning, 2.33 seconds at high reasoning); GPT-Realtime-Translate audio in (up to 16,000 tokens), audio out (up to 2,000 tokens); GPT-Realtime-Whisper text and audio in (up to 16,000 tokens), text out (up to 2,000 tokens)
  • Knowledge cutoff: September 30, 2024
  • GPT-Realtime-2 features: Five levels of reasoning effort (minimal, low, medium, high, xhigh), parallel tool calls, narration of tool calls, optional preambles, graceful handling of problematic input, tone (attitude) control, function calling
  • GPT-Realtime-2 performance: Tops Scale AI’s Audio MultiChallenge audio-output leaderboard and Artificial Analysis’s Conversational Dynamics; tied for third on Artificial Analysis’s Big Bench Audio
  • Availability: Via OpenAI Realtime API
  • Prices: GPT-Realtime-2 $32/$0.40/$64 per million input/cached/output audio tokens, $4/$0.40/$24 per million input/cached/output text tokens, $5/$0.50 per million input/cached image tokens; GPT-Realtime-Translate $0.034 per minute; GPT-Realtime-Whisper $0.017 per minute (a worked cost sketch follows this list)
  • Undisclosed: Parameter counts, architectures, training data and methods
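
To make the mixed per-token and per-minute pricing above concrete, here is a small worked sketch that estimates the cost of one GPT-Realtime-2 voice session. The per-million-token rates come from the price list above; the token counts are hypothetical placeholders, not measured usage.

```python
# Sketch: estimating the cost of one hypothetical GPT-Realtime-2 voice session
# from the per-million-token rates listed above. The token counts below are
# illustrative placeholders, not measured usage; image and cached-text rates
# are omitted for brevity.

PRICE_PER_MILLION = {
    "audio_in": 32.00,     # $ per million input audio tokens
    "audio_cached": 0.40,  # $ per million cached input audio tokens
    "audio_out": 64.00,    # $ per million output audio tokens
    "text_in": 4.00,       # $ per million input text tokens
    "text_out": 24.00,     # $ per million output text tokens
}

# Hypothetical usage for one short voice session.
usage = {
    "audio_in": 20_000,
    "audio_cached": 5_000,
    "audio_out": 8_000,
    "text_in": 1_500,
    "text_out": 600,
}

cost = sum(tokens / 1_000_000 * PRICE_PER_MILLION[kind] for kind, tokens in usage.items())
print(f"Estimated session cost: ${cost:.2f}")  # about $1.17 for these placeholder counts
```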

How GPT-Realtime-2 works: GPT-Realtime-2 handles audio in and audio out as an end-to-end process — including reasoning — rather than separate speech-to-text, text-generation, and text-to-speech steps.

  • An API parameter sets the reasoning effort (see the configuration sketch after this list). Low is the default, chosen to minimize latency for live conversation. Higher reasoning effort increases latency and consumption of reasoning tokens.
  • During tool calls, the model can narrate its work in progress using spoken phrases like “checking your calendar” or “looking that up now.” Optional preambles like “let me check that” can precede responses to prompts, so users can track progress while the model reasons.
  • When it can’t complete a request, the model alerts users via phrases like “I’m having trouble with that right now” instead of remaining silent.
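
For a sense of how this looks in practice, here is a minimal sketch of configuring a session over the Realtime API’s WebSocket interface. The session.update event shape mirrors OpenAI’s existing Realtime API; the model identifier gpt-realtime-2 and the reasoning-effort field are assumptions based on this article rather than documented parameter names.

```python
# Minimal sketch: opening a Realtime API session and setting reasoning effort.
# The session.update event shape mirrors OpenAI's existing Realtime API; the
# model id "gpt-realtime-2" and the reasoning-effort field are assumptions
# based on this article, not confirmed parameter names.
import asyncio
import json
import os

import websockets  # pip install websockets


async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"  # assumed model id
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # "minimal" keeps latency low for live turn-taking; switch to "high"
        # or "xhigh" when the interaction can tolerate a slower, more
        # deliberate reply.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": (
                    "Before long tool calls, say a brief preamble such as "
                    "'let me check that' and narrate progress aloud."
                ),
                "reasoning": {"effort": "minimal"},  # hypothetical field name
            },
        }))
        print(json.loads(await ws.recv()))  # server acknowledges with a session event


if __name__ == "__main__":
    asyncio.run(main())
```

In a deployment, the same session setting could presumably be raised to high or xhigh for interactions that can wait, which is the speed-versus-reasoning tradeoff the update is built around.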

GPT-Realtime-2 performance: GPT-Realtime-2 led some independent benchmarks for conversational dynamics and multi-turn instruction following, but it trailed on the Artificial Analysis Speech Reasoning leaderboard. The time to first audio ranged from 1.12 seconds at minimal effort to 2.33 seconds at high effort, the setting that yields the model’s best reasoning scores. Both figures are slow for real-time interaction, which benefits from latency below 500 milliseconds.

  • On Artificial Analysis Big Bench Audio (answering questions drawn from the Big Bench benchmark), GPT-Realtime-2 set to high reasoning tied Google’s Gemini 3.1 Flash Live Preview set to high reasoning (96.6 percent), behind Step-Audio R1.1 Realtime (97.6 percent) and Grok Voice Think Fast 1.0 (97.1 percent). Set to minimal reasoning, GPT-Realtime-2 dropped to 71.8 percent.
  • On Artificial Analysis’s Conversational Dynamics (a weighted average that tests the ability to manage taking turns, pausing, interruptions, and brief interjections such as “uh-huh”), GPT-Realtime-2 set to minimal reasoning led with 96.1 percent. However, set to high reasoning (95.3 percent), it lagged GPT-Realtime-1.5 and GPT Realtime Mini (tied at 95.7 percent).
  • On 𝜏-Voice (agentic performance in three customer-service domains), GPT-Realtime-2 led the airline domain with 63 percent, according to Artificial Analysis. Considering all three domains, however, GPT-Realtime-2 (39.8 percent) trailed Grok Voice Think Fast 1.0 (52.1 percent) but finished ahead of Gemini 3.1 Flash Live Preview set to high reasoning (37.7 percent).
  • On the Scale AI Audio MultiChallenge Audio Output leaderboard, which evaluates four conversational criteria (instruction retention, inference memory, self-coherence, and voice editing) in multi-turn spoken dialogue, GPT-Realtime-2 set to xhigh reasoning placed first with a 48.45 percent average pass rate (the share of conversations in which the model satisfies every criterion), a significant jump from its predecessor GPT-Realtime-1.5 (34.73 percent). However, Scale AI has not yet tested Grok Voice Think Fast 1.0 or Step-Audio R1.1 Realtime.

Yes, but: The two models ahead of GPT-Realtime-2 on the Artificial Analysis Speech Reasoning leaderboard are also faster.

  • Step-Audio R1.1 Realtime takes 1.51 seconds to generate its first audio output and Grok Voice Think Fast 1.0 takes 1.25 seconds, versus 2.33 seconds for GPT-Realtime-2 at high reasoning effort.
  • With reasoning set to xhigh, GPT-Realtime-2’s overall pass rate on the Scale AI Audio MultiChallenge is below 50 percent, which suggests that reliable multi-turn spoken dialogue remains challenging for current models.

Why it matters: Voice agents have generally focused on relatively simple interactions because reasoning often comes at the cost of a snappy response. GPT-Realtime-2 offers not only high performance but also control over that tradeoff (minimal reasoning for faster turn-taking, xhigh for interactions that can wait). This flexibility expands the range of tasks voice agents can handle without resorting to text processing.

We’re thinking: It's exciting to see that GPT-Realtime-2 implements preambles similar to the pre-responses we described here!
