Delivering Continuous Quality, Reliability, and Cost-Efficiency for Production Voice Workflows
In today’s fast‑moving AI engineering landscape, developers need voice models that don’t just perform well: they need models that are fast, predictable, production‑ready, cost‑efficient, and easy to integrate into real‑world systems. Whether you’re building real‑time voice agents, powering multilingual transcription pipelines, or generating dynamic speech for customer‑facing applications, reliability and latency are just as important as raw model quality.
To enable practitioners to ship these scenarios with confidence, we’re introducing a series of updated, production‑ready voice models—now generally available via API in Microsoft Foundry. These models deliver state‑of‑the‑art performance across real‑time audio input/output, transcription, and natural‑sounding speech synthesis, while giving developers the tools to optimize for accuracy, latency, and brand voice consistency.
Realtime-mini: Low‑latency speech‑to‑speech for real‑time conversational agents
The new gpt-realtime-mini-2025-12-15 model powers real-time voice agents with strong parity with full-scale models in instruction following and function calling, while delivering improved prosody, better voice fidelity across turns, and reduced latency for seamless dialogue. With audio input and output, it’s ideal for low-latency, high-quality voice experiences. A standout capability is voice cloning and custom voice upload, which lets trusted customers provide short samples for high-fidelity voice replication. This ensures a consistent brand voice across conversations, with strict consent and legal guardrails in place for compliance. We will ship this capability as a fast follow after launch. API-only deployment makes integration into your workflows seamless.
Use Cases:
- Real-time customer support agents
- Interactive voice bots
- Personalized voice assistants
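For the agent scenarios above, steering and function calling are configured through the session settings exchanged over the websocket at connection time. The sketch below follows the shape of the public OpenAI Realtime API’s session.update event; the exact schema for gpt-realtime-mini-2025-12-15 is an assumption, so check the Foundry API reference before relying on it:

```python
def build_session_update(voice, instructions, tools=None):
    """Build a Realtime API session.update event (field names assumed
    from the public Realtime API; verify against the Foundry docs)."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],           # speech in, speech (and text) out
            "voice": voice,
            "instructions": instructions,              # system-style steering prompt
            "turn_detection": {"type": "server_vad"},  # server-side voice activity detection
            "tools": tools or [],                      # function-calling tool definitions
        },
    }

event = build_session_update(
    voice="alloy",
    instructions="You are a concise customer-support agent.",
)
```

After opening the websocket you would send this event serialized as JSON; function-calling tools plug into the `tools` list using the same JSON-schema format as the chat APIs.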
ASR (Automatic Speech Recognition): High‑accuracy transcription optimized for real‑time pipelines
Our latest transcription model, gpt-4o-mini-transcribe-2025-12-15, delivers a major leap in transcription accuracy and robustness for real-time scenarios. Compared with previous generations, it achieves up to 50% lower word error rate (WER) on English benchmarks and significantly improves multilingual performance. It also reduces hallucinations on silence by up to 4×, making it a reliable choice for noisy environments and real-world audio streams. Input is audio, output is text, and deployment is API-only.
Use Cases:
- Multilingual transcription services
- Meeting note automation
- Voice-driven data entry
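The WER figures above are straightforward to check on your own audio: word error rate is the word-level edit distance (substitutions, insertions, deletions) between the model’s transcript and a reference, divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A transcript that swaps one word in a six-word reference scores 1/6 ≈ 0.167; halving WER means halving that edit distance for the same reference.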
TTS (Text-to-Speech): Natural multilingual speech generation with customizable voices
The updated gpt-4o-mini-tts-2025-12-15 model sets a new benchmark for multilingual speech synthesis, delivering more natural, human-like speech with fewer artifacts and improved speaker similarity. It outperforms industry leaders with 35% fewer word errors on multilingual benchmarks and achieves up to 3× lower WER in non-English languages. Like gpt-realtime-mini-2025-12-15, the gpt-4o-mini-tts-2025-12-15 model supports voice cloning and custom voice upload for trusted customers, ensuring brand voice fidelity and reducing “voice drift” across turns. Input is text, output is audio, and deployment is API-only.
Use Cases:
- Dynamic customer service voices
- Expressive narration for creative storytelling
- Accessibility solutions for global audiences
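One practical detail for the narration and accessibility scenarios above: speech endpoints typically cap input length per request, so long-form text needs to be split at sentence boundaries before synthesis. A minimal sketch, where the 4096-character cap is an assumption rather than a documented limit of this model:

```python
import re

def chunk_for_tts(text, max_chars=4096):
    """Split long text at sentence boundaries so each chunk fits in a
    single TTS request. The 4096-character default is an assumption;
    check the model's documented input cap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate synthesis request and the resulting audio segments concatenated; splitting at sentence boundaries keeps prosody natural at the seams.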
Raising the bar for real-world voice applications
These new models represent a significant step forward in enabling developers and enterprises to build more powerful, customizable, and intelligent voice agents. They offer state-of-the-art accuracy, reliability, and flexibility, especially in challenging scenarios involving accents, noisy environments, and varying speech speeds. With features like steerable voices, custom voice uploads, and improved multilingual quality for major world languages, organizations can create tailored experiences that resonate with their customers.
All of these updates come with no changes to current pricing. The latest improvements in accuracy, stability, and performance are included at the same affordable price point, reinforcing our commitment to delivering high‑value, production‑ready voice models without increasing cost.
What’s next: Continuing innovation in real-time audio
We’re committed to ongoing innovation in real-time audio modeling. Updates to the gpt-audio-mini model will ship via API in Microsoft Foundry soon. Looking ahead, we’ll continue to invest in improving the intelligence and accuracy of our audio models, exploring new ways for developers to bring their own custom voices and build even more personalized experiences. Stay tuned for future updates as we push the boundaries of what’s possible in multimodal AI.
Ready to transform your customer experience?
Start using the new voice models in Microsoft Foundry today. Explore the APIs, test latency and accuracy at scale, and integrate them directly into your real‑time applications.
Join our developer community, share your feedback, and help shape the next generation of real‑time multimodal audio models—your insights directly influence what we build next.