Now Generally Available via Foundry Models
We are thrilled to announce that we are releasing today the general availability of our latest advancement in speech-to-speech technology: gpt-realtime. This new model represents a significant leap forward in our commitment to providing advanced and reliable speech-to-speech solutions.
gpt-realtime is a new S2S (speech-to-speech) model with improved instruction following, designed to merge all of our speech-to-speech improvements into a single, cohesive model. This model is now available in the Real-time API, offering enhanced voice naturalness, higher audio quality, and improved function calling capabilities.
Key Features
- New, natural, expressive voices: New voice options (Marin and Cedar) that bring a new level of naturalness and clarity to speech synthesis.
- Improved Instruction Following: Enhanced capabilities to follow instructions more accurately and reliably.
- Enhanced Voice Naturalness: More lifelike and expressive voice output.
- Higher Audio Quality: Superior audio quality for a better user experience.
- Improved Function Calling: Enhanced ability to call custom code defined by developers.
- Image Input Support: Add images to context and discuss them via voice—no video required.
Check out the model card here: gpt-realtime
Pricing
Pricing for gpt-realtime is 20% lower compared to the previous gpt-4o-realtime preview:
Pricing is based on usage per 1 million tokens. Below is the breakdown:
Getting Started
gpt-realtime is available on Azure AI Foundry via Azure Models direct from Azure today. We are excited to see how developers and users will leverage these new capabilities to create innovative and impactful solutions. Check out the model on Azure AI Foundry and see detailed documentation in Microsoft Learn docs.