Microsoft Foundry Blog

Introducing updated GPT Voice Models in Microsoft Foundry

Dave_Jacobs, Microsoft
Dec 17, 2025

Delivering Continuous Quality, Reliability, and Cost-Efficiency for Production Voice Workflows

In today’s fast‑moving AI engineering landscape, developers need voice models that don’t just perform well: they need models that are fast, predictable, production‑ready, cost‑efficient, and easy to integrate into real‑world systems. Whether you’re building real‑time voice agents, powering multilingual transcription pipelines, or generating dynamic speech for customer‑facing applications, reliability and latency are just as important as raw model quality.

To enable practitioners to ship these scenarios with confidence, we’re introducing a series of updated, production‑ready voice models—now generally available via API in Microsoft Foundry. These models deliver state‑of‑the‑art performance across real‑time audio input/output, transcription, and natural‑sounding speech synthesis, while giving developers the tools to optimize for accuracy, latency, and brand voice consistency.

  1. Realtime-mini: Low-latency speech-to-speech for real-time conversational agents

The new gpt-realtime-mini-2025-12-15 model powers real-time voice agents with strong parity to full-scale models in instruction-following and function-calling, while delivering improved prosody, better voice fidelity across turns, and reduced latency for seamless dialogue. With audio input and output, it’s ideal for low-latency, high-quality voice experiences. A standout capability is voice cloning and custom voice upload, which lets trusted customers provide short samples for high-fidelity voice replication. This ensures consistent brand voice across conversations, with strict consent and legal guardrails in place for compliance; we will ship this capability as a fast follow after launch. The API-only deployment makes integration into your workflows seamless, as shown in the sketch after the use cases below.

Use Cases:

  • Real-time customer support agents
  • Interactive voice bots
  • Personalized voice assistants
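
For illustration, here is a minimal sketch of a text-only realtime turn using the OpenAI Python SDK’s beta realtime client against a Foundry deployment; the endpoint, API version, deployment name, and event names are assumptions and may differ for your resource and API version.

```python
# Minimal sketch: connect to an assumed gpt-realtime-mini deployment in Microsoft
# Foundry via the OpenAI Python SDK's beta realtime client. Endpoint, API version,
# deployment name, and event names are placeholders -- check your resource's docs.
import asyncio
import os

from openai import AsyncAzureOpenAI


async def main() -> None:
    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["FOUNDRY_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
        api_key=os.environ["FOUNDRY_API_KEY"],
        api_version="2025-04-01-preview",               # placeholder API version
    )

    # "model" is the name of your realtime deployment (assumed here).
    async with client.beta.realtime.connect(model="gpt-realtime-mini") as conn:
        # Text-only keeps the example short; the model also accepts and produces audio.
        await conn.session.update(session={"modalities": ["text"]})

        # Send one user turn and request a response.
        await conn.conversation.item.create(
            item={
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Say hello to our caller."}],
            }
        )
        await conn.response.create()

        # Stream the reply; event type names follow the beta realtime surface.
        async for event in conn:
            if event.type == "response.text.delta":
                print(event.delta, end="", flush=True)
            elif event.type == "response.done":
                break


asyncio.run(main())
```

In production you would stream microphone audio in and play synthesized audio out over the same connection; the text-only session above is only meant to show the connection and turn-taking flow.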
  2. ASR (Automatic Speech Recognition): High-accuracy transcription optimized for real-time pipelines

Our latest transcription model, gpt-4o-mini-transcribe-2025-12-15, delivers a major leap in transcription accuracy and robustness for real-time scenarios. Compared to previous generations, it achieves up to 50% lower word error rate (WER) on English benchmarks and significantly improves multilingual performance. It also markedly reduces hallucinations on silence, making it a reliable choice for noisy environments and real-world audio streams. Input is audio, output is text, and deployment is API-only; see the sketch after the use cases below.

Use Cases:

  • Multilingual transcription services
  • Meeting note automation
  • Voice-driven data entry
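
For illustration, a minimal transcription call might look like the sketch below, assuming the OpenAI Python SDK pointed at a Foundry deployment of the model; the endpoint, API version, deployment name, and file name are placeholders.

```python
# Minimal transcription sketch against an assumed gpt-4o-mini-transcribe deployment
# in Microsoft Foundry. Endpoint, API version, and deployment name are placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["FOUNDRY_ENDPOINT"],
    api_key=os.environ["FOUNDRY_API_KEY"],
    api_version="2025-04-01-preview",  # placeholder API version
)

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # your deployment name in Foundry
        file=audio_file,
        response_format="text",          # return plain text
    )

print(transcript)
```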
  3. TTS (Text-to-Speech): Natural multilingual speech generation with customizable voices

The updated gpt-4o-mini-tts-2025-12-15 model sets a new benchmark for multilingual speech synthesis, delivering more natural, human-like speech with fewer artifacts and improved speaker similarity. It outperforms industry leaders with 35% fewer word errors on multilingual benchmarks and achieves up to 3× lower WER in non-English languages. Like gpt-realtime-mini-2025-12-15, the gpt-4o-mini-tts-2025-12-15 model supports voice cloning and custom voice upload for trusted customers, ensuring brand voice fidelity and reducing “voice drift” across turns. Input is text, output is audio, and deployment is API-only; see the sketch after the use cases below.

Use Cases:

  • Dynamic customer service voices
  • Expressive narration for creative storytelling
  • Accessibility solutions for global audiences
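
For illustration, the sketch below streams synthesized speech to a file using the OpenAI Python SDK against an assumed Foundry deployment of the model; the endpoint, API version, deployment name, and voice are placeholders.

```python
# Minimal text-to-speech sketch against an assumed gpt-4o-mini-tts deployment in
# Microsoft Foundry. Endpoint, API version, deployment, and voice are placeholders.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["FOUNDRY_ENDPOINT"],
    api_key=os.environ["FOUNDRY_API_KEY"],
    api_version="2025-04-01-preview",  # placeholder API version
)

# Stream the synthesized audio straight to disk.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",  # your deployment name in Foundry
    voice="alloy",            # built-in voice; custom voice upload ships as a fast follow
    input="Thanks for calling. How can I help you today?",
) as response:
    response.stream_to_file("greeting.mp3")
```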

Raising the bar for real-world voice applications 

These new models represent a significant step forward in enabling developers and enterprises to build more powerful, customizable, and intelligent voice agents. They offer state-of-the-art accuracy, reliability, and flexibility, especially in challenging scenarios involving accents, noisy environments, and varying speech speeds. With features like steerable voices, custom voice uploads, and improved multilingual quality for major world languages, organizations can create tailored experiences that resonate with their customers.

All of these updates come with no changes to current pricing. The latest improvements in accuracy, stability, and performance are included at the same affordable price point, reinforcing our commitment to delivering high‑value, production‑ready voice models without increasing cost.

What’s next: Continuing innovation in real-time audio

We’re committed to ongoing innovation in real-time audio modeling. We will release the gpt-audio-mini model updates via API in Microsoft Foundry very soon. Looking ahead, we’ll continue to invest in improving the intelligence and accuracy of our audio models, exploring new ways to enable developers to bring their own custom voices and build even more personalized experiences. Stay tuned for future updates as we push the boundaries of what’s possible in multimodal AI.

Ready to transform your customer experience?

Start using the new voice models in Microsoft Foundry today. Explore the APIs, test latency and accuracy at scale, and integrate them directly into your real‑time applications.

Join our developer community, share your feedback, and help shape the next generation of real‑time multimodal audio models—your insights directly influence what we build next.
