Delivering Continuous Quality, Reliability, and Cost-Efficiency for Production Voice Workflows
In today’s fast‑moving AI engineering landscape, developers need voice models that don’t just perform well: they need models that are fast, predictable, production‑ready, cost‑efficient, and easy to integrate into real‑world systems. Whether you’re building real‑time voice agents, powering multilingual transcription pipelines, or generating dynamic speech for customer‑facing applications, reliability and latency are just as important as raw model quality.
To enable practitioners to ship these scenarios with confidence, we’re introducing a series of updated, production‑ready voice models—now generally available via API in Microsoft Foundry. These models deliver state‑of‑the‑art performance across real‑time audio input/output, transcription, and natural‑sounding speech synthesis, while giving developers the tools to optimize for accuracy, latency, and brand voice consistency.
Realtime-mini: Low‑latency speech‑to‑speech for real‑time conversational agents
The new gpt-realtime-mini-2025-12-15 model powers real-time voice agents with strong parity with full-scale models in instruction following and function calling, while delivering improved prosody, better voice fidelity across turns, and reduced latency for seamless dialogue. With audio input and output, it’s ideal for low-latency, high-quality voice experiences. A standout capability is voice cloning and custom voice upload, which lets trusted customers provide short samples for high-fidelity voice replication. This ensures a consistent brand voice across conversations, with strict consent and legal guardrails in place for compliance. We will ship this capability as a fast follow after launch. API-only deployment makes integration into your workflows seamless.
Use Cases:
- Real-time customer support agents
- Interactive voice bots
- Personalized voice assistants
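For the agent scenarios above, steering and function calling are configured through the session settings exchanged over the websocket at connection time. The sketch below follows the shape of the public OpenAI Realtime API’s session.update event; the exact schema for gpt-realtime-mini-2025-12-15 is an assumption, so check the Foundry API reference before relying on it:

```python
def build_session_update(voice, instructions, tools=None):
    """Build a Realtime API session.update event (field names assumed
    from the public Realtime API; verify against the Foundry docs)."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],           # speech in, speech (and text) out
            "voice": voice,
            "instructions": instructions,              # system-style steering prompt
            "turn_detection": {"type": "server_vad"},  # server-side voice activity detection
            "tools": tools or [],                      # function-calling tool definitions
        },
    }

event = build_session_update(
    voice="alloy",
    instructions="You are a concise customer-support agent.",
)
```

After opening the websocket you would send this event serialized as JSON; function-calling tools plug into the `tools` list using the same JSON-schema format as the chat APIs.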
ASR (Automatic Speech Recognition): High‑accuracy transcription optimized for real‑time pipelines
Our latest transcription model, gpt-4o-mini-transcribe-2025-12-15, delivers a major leap in transcription accuracy and robustness for real-time scenarios. Compared with previous generations, it achieves up to 50% lower word error rate (WER) on English benchmarks and significantly improves multilingual performance. It also reduces hallucinations on silence by up to 4×, making it a reliable choice for noisy environments and real-world audio streams. Input is audio, output is text, and deployment is API-only.
Use Cases:
- Multilingual transcription services
- Meeting note automation
- Voice-driven data entry
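The WER figures above are straightforward to check on your own audio: word error rate is the word-level edit distance (substitutions, insertions, deletions) between the model’s transcript and a reference, divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A transcript that swaps one word in a six-word reference scores 1/6 ≈ 0.167; halving WER means halving that edit distance for the same reference.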
TTS (Text-to-Speech): Natural multilingual speech generation with customizable voices
The updated gpt-4o-mini-tts-2025-12-15 model sets a new benchmark for multilingual speech synthesis, delivering more natural, human-like speech with fewer artifacts and improved speaker similarity. It outperforms industry leaders with 35% fewer word errors on multilingual benchmarks and achieves up to 3× lower WER in non-English languages. Like gpt-realtime-mini-2025-12-15, the gpt-4o-mini-tts-2025-12-15 model supports voice cloning and custom voice upload for trusted customers, ensuring brand voice fidelity and reducing “voice drift” across turns. Input is text, output is audio, and deployment is API-only.
Use Cases:
- Dynamic customer service voices
- Expressive narration for creative storytelling
- Accessibility solutions for global audiences
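One practical detail for the narration and accessibility scenarios above: speech endpoints typically cap input length per request, so long-form text needs to be split at sentence boundaries before synthesis. A minimal sketch, where the 4096-character cap is an assumption rather than a documented limit of this model:

```python
import re

def chunk_for_tts(text, max_chars=4096):
    """Split long text at sentence boundaries so each chunk fits in a
    single TTS request. The 4096-character default is an assumption;
    check the model's documented input cap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate synthesis request and the resulting audio segments concatenated; splitting at sentence boundaries keeps prosody natural at the seams.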
Raising the bar for real-world voice applications
These new models represent a significant step forward in enabling developers and enterprises to build more powerful, customizable, and intelligent voice agents. They offer state-of-the-art accuracy, reliability, and flexibility, especially in challenging scenarios involving accents, noisy environments, and varying speech speeds. With features like steerable voices, custom voice uploads, and improved multilingual quality for major world languages, organizations can create tailored experiences that resonate with their customers.
All of these updates come with no changes to current pricing. The latest improvements in accuracy, stability, and performance are included at the same affordable price point, reinforcing our commitment to delivering high‑value, production‑ready voice models without increasing cost.
What’s next: Continuing innovation in real-time audio
We’re committed to ongoing innovation in real-time audio modeling. Updates to the gpt-audio-mini model will ship via API in Microsoft Foundry soon. Looking ahead, we’ll continue to invest in improving the intelligence and accuracy of our audio models, exploring new ways for developers to bring their own custom voices and build even more personalized experiences. Stay tuned for future updates as we push the boundaries of what’s possible in multimodal AI.
Ready to transform your customer experience?
Start using the new voice models in Microsoft Foundry today. Explore the APIs, test latency and accuracy at scale, and integrate them directly into your real‑time applications.
Join our developer community, share your feedback, and help shape the next generation of real‑time multimodal audio models—your insights directly influence what we build next.