About twenty percent of the time, OpenAI's GPT Realtime 1.5 responds in English when you speak to it in Hindi, Spanish, or Turkish. That's one of the more striking findings from Voice Showdown, a new benchmark from Scale AI that puts voice AI models through real human evaluation for the first time.
Voice AI has been one of the fastest-growing categories in AI tooling, but until now there hasn't been a standardized way to compare how these models perform in real conversation. Synthetic benchmarks test specific capabilities like transcription accuracy or latency, but they don't capture whether a voice response sounds helpful, natural, and correct to a real person.
The Setup
Voice Showdown borrows the arena format that's worked well for text-based model comparisons. Users speak a prompt, then hear (or read) two competing responses from different models. They pick the one they prefer. No labels, no brand names visible during voting. Scale launched with 11 frontier models evaluated across 52 model-voice combinations.
The benchmark runs two separate leaderboards. "Dictate mode" has users speak a prompt and compare written responses from 8 models. "Speech-to-speech mode" (S2S) compares actual audio responses from 6 models, testing not just content quality but voice naturalness, pacing, and tone.
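For readers curious how blind head-to-head votes can become a leaderboard, here is a minimal sketch using a Bradley-Terry fit, a common choice for arena-style comparisons. It is an illustration only: the article does not say which ranking method Scale AI uses, and the function name, the vote format, and the model labels below are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(votes, iterations=200):
    """Estimate relative model strengths from pairwise preference votes.

    votes: list of (winner, loser) tuples, one per blind comparison.
    Returns a dict mapping model name to a strength that sums to 1.
    This is an illustrative fit, not Scale AI's published method.
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        # Normalize so strengths stay comparable across iterations.
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Hypothetical votes: model_a preferred over model_b in 3 of 4 comparisons.
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
scores = bradley_terry(votes)
leaderboard = sorted(scores, key=scores.get, reverse=True)
```

Under this model, the ratio of two strengths estimates how often one model would be preferred over the other, which is why a handful of votes is not enough to separate closely matched models.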
Google Takes the Lead, OpenAI Has a Language Problem
On the dictate leaderboard, Google's Gemini 3 Pro and Gemini 3 Flash are statistically tied for first place. On the S2S leaderboard, Gemini 2.5 Flash Audio ties with OpenAI's GPT-4o Audio at the top.
The more interesting story is in the gaps. GPT Realtime 1.5, OpenAI's newer real-time voice model, has a significant language robustness problem. It defaults to English on non-English prompts about 20% of the time, even for widely spoken languages. For anyone building voice applications that serve multilingual users, that's a serious reliability issue, not a minor edge case.
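Teams serving multilingual users may want to measure the same failure on their own traffic. The sketch below assumes you already have (prompt language, response language) labels for a sample of conversations, for instance from a language-identification step. The function name and the ISO-style codes such as "hi" and "en" are illustrative assumptions, and the numbers in the usage example are made up, not data from Voice Showdown.

```python
import math


def english_fallback_rate(samples, z=1.96):
    """Share of non-English prompts that received an English response.

    samples: list of (prompt_lang, response_lang) codes, e.g. ("hi", "en").
    Returns (rate, low, high), where low/high form a Wilson score interval
    (z=1.96 corresponds to roughly 95% confidence).
    """
    non_english = [(p, r) for p, r in samples if p != "en"]
    n = len(non_english)
    if n == 0:
        raise ValueError("no non-English prompts in the sample")
    k = sum(1 for _, r in non_english if r == "en")
    p_hat = k / n

    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return p_hat, centre - margin, centre + margin


# Made-up sample: 100 Hindi prompts, 20 of them answered in English.
samples = [("hi", "en")] * 20 + [("hi", "hi")] * 80 + [("en", "en")] * 10
rate, low, high = english_fallback_rate(samples)
```

The interval matters as much as the point estimate: with only 100 prompts, an observed 20% rate is compatible with a true rate several points higher or lower, so small samples should not be read as precise.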
What This Means for Voice Tool Users
If you're using voice AI through tools like ElevenLabs, Descript, or any app built on top of these foundation models, benchmark results like these help you understand what's happening under the hood. The model powering your tool's voice features matters, and now there's a public way to compare them.
Google's strong showing here also tracks with their broader push into multimodal AI (models that handle text, images, and audio together). They've been less flashy about voice than OpenAI, but the benchmark data suggests their models are quietly performing at or above the competition.
Scale AI plans to keep the arena open for ongoing community voting, so these rankings will shift as models update and new ones enter. The initial snapshot gives voice AI its first real scoreboard.