We are seeing customers such as the Indiana Pacers and Coca-Cola transform customer experiences by using Azure AI Speech to power customer interactions. And in the new era of agentic AI, voice is increasingly becoming an important modality for interacting with AI agents in a natural way.
Today, we are excited to announce a number of new capabilities in Azure AI Speech that will further propel our customers into the voice-enabled agentic AI era, as AI agents are rapidly being adopted by enterprise customers across a wide variety of industries. The updates we are announcing today include the new Voice Live API (Public Preview), which simplifies building voice agents that deliver fluent, natural speech-to-speech conversational experiences. To provide a robust conversation experience, the Voice Live API leverages enhanced audio processing and turn detection, and it offers a flexible choice of generative AI models along with customization of TTS voices and TTS avatars. We are also announcing the general availability of Video Translation, which translates video content into a wide range of languages with our new Lip Sync capability, as well as enhancements to Fast Transcription with expanded language coverage and support for multilingual transcription. Try out our latest capabilities in our improved Azure AI Foundry experience.
Voice Live API, Public Preview
We are excited to announce Voice Live API, a new Azure AI Speech feature offering a single, unified API for building voice agents. This new API, available in public preview starting today, supports low-latency, scalable speech-to-speech interactions using foundation models of your choice.
The past year has seen a surge in demand for generative AI voice chatbots across industries like customer service, education, HR, gaming, and public services. Customers are seeking real-time, natural speech interactions that support multiple languages, diverse voices, customization, and integration with avatars for enhanced engagement.
The new Voice Live API empowers users with streaming interactions supported by their chosen generative AI models, offering seamless speech input and output functionality through a single, low-latency API. This public preview introduces a wide range of capabilities that enhance conversational experiences. The API supports over 150 locales for speech input and output, accompanied by a diverse selection of more than 600 realistic voices, including over 30 ultra-natural neural HD voices optimized specifically for conversational scenarios. Users can select built-in foundation models such as GPT-4o Realtime, GPT-4o Mini Realtime, GPT-4o, GPT-4o Mini, and Phi to suit their needs. Customization options allow users to fine-tune speech models for greater accuracy and brand representation, integrating features like custom voices and avatars for tailored experiences.
Additionally, the API provides advanced conversational enhancements, including noise suppression, echo cancellation, and robust interruption detection, ensuring smooth and natural interactions. Visual engagement is further supported through easily configurable avatars that give voice agents a distinct identity. Integration with Azure AI Agent Service and Semantic Kernel is straightforward, enabling developers to add voice input and output functionalities effortlessly to agents built with these tools, all while maintaining a consistent and engaging user experience.
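To give a feel for the programming model, below is a minimal Python sketch that opens a Voice Live session over WebSocket and configures the voice, noise suppression, and turn detection. The endpoint path, API version, header name, and message fields are illustrative assumptions rather than the documented contract, so check the Voice Live API reference before building on them.

```python
# Minimal sketch of a Voice Live API session. The endpoint, api-version, and
# session.update message shape below are illustrative assumptions - consult the
# Voice Live documentation for the exact contract.
import asyncio
import json
import os

import websockets  # pip install websockets

# Hypothetical endpoint and API version, for illustration only.
ENDPOINT = (
    "wss://<your-resource>.cognitiveservices.azure.com/voice-live/realtime"
    "?api-version=2025-05-01-preview&model=gpt-4o"
)
API_KEY = os.environ["AZURE_SPEECH_KEY"]


async def run_session():
    # Older websockets releases name this argument extra_headers instead.
    async with websockets.connect(ENDPOINT, additional_headers={"api-key": API_KEY}) as ws:
        # Configure the session: voice, noise suppression, and semantic turn detection
        # (field names are assumptions for illustration).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard"},
                "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
                "turn_detection": {"type": "azure_semantic_vad"},
            },
        }))
        # A real agent would stream microphone audio up and play back the audio
        # events it receives; here we simply log the event types as they arrive.
        async for message in ws:
            print(json.loads(message).get("type"))


asyncio.run(run_session())
```

The same session configuration is where you would plug in your choice of generative AI model and a custom voice or avatar, as described above.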
Check out more about the feature here and how it works in this demo. Get started in AI Foundry with Voice Live Playground.
Voice Live API in your application
Customer service agent – Commerzbank and the Government of Malta are using the Voice Live API to provide real-time customer service through natural voice conversations with avatars
Voice agent in wearable devices – Anker has integrated the Voice Live API and Speech Translation API into its headphones, as described in its blog How Anker soundcore Uses Azure AI Speech for Seamless Multilingual Communication
Voice agents in call centers – Integration of the Voice Live API with Azure Communication Services (ACS)
Combining the Voice Live API with Azure Communication Services allows customers to build voice-enabled AI agents that integrate with their call-center telephony systems.
For more details, see this blog post by the Azure Communication Services team as well as the sample on GitHub.
To learn more about customer applications including Commerzbank, Gainsight, and the Government of Malta, check out the Microsoft Build session.
Video Translation Service, General Availability
We are excited to announce the general availability of Video Translation Service — a powerful, end-to-end service designed to scale global video content delivery. The full capabilities of the Video Translation service are available now in AI Foundry, while the official API will be ready on June 23rd, enabling developers to integrate these capabilities seamlessly into their solutions. In this latest evolution of the Video Translation service, we have added Lip Sync, which creates realistic translations that match speakers' lip movements, together with emotion-enhanced voices for an immersive experience.
With GenAI-powered contextual translation, advanced multi-speaker detection, and enhanced audio-visual synchronization algorithms that dynamically adjust for language density and speech pace, we’re not just translating words — we are delivering an experience that is true to the source material. With the advanced capabilities of Azure Speech Service and the new Lip Sync feature, we can now preserve the speaker’s emotion and tone while precisely aligning translated audio with mouth movements, making multilingual videos feel more authentic than ever. We have also expanded support for large video and audio files across more than 70 languages and global regions.
Developers can access the Video Translation service through three distinct methods tailored to their needs. They can integrate the service into their business workflows by calling the API, utilize a pre-built agent template to create interactive agent experiences, or explore proof-of-concept projects with AI Foundry's low-code interface.
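For the API route, the request below is a rough sketch of how a translation job might be created. The resource path, API version, and body fields are assumptions for illustration, so refer to the official Video Translation API reference (available June 23rd) for the exact schema.

```python
# Hypothetical sketch of creating a video translation job via REST. The path,
# api-version, and body fields are assumptions for illustration - check the
# official Video Translation API reference before use.
import json
import os
import uuid

import requests

ENDPOINT = "https://<region>.api.cognitive.microsoft.com"  # your Speech resource endpoint
API_KEY = os.environ["AZURE_SPEECH_KEY"]

translation_id = str(uuid.uuid4())
url = f"{ENDPOINT}/videotranslation/translations/{translation_id}?api-version=2024-05-20-preview"

body = {
    "displayName": "Product demo - French",
    "input": {
        "sourceLocale": "en-US",
        "targetLocale": "fr-FR",
        "voiceKind": "PlatformVoice",                     # or a custom/personal voice
        "videoFileUrl": "https://example.com/demo.mp4",   # SAS URL to your source video
        "enableLipSync": True,                            # illustrative flag for the new Lip Sync capability
    },
}

resp = requests.put(
    url,
    json=body,
    headers={
        "Ocp-Apim-Subscription-Key": API_KEY,
        "Operation-Id": str(uuid.uuid4()),
    },
)
resp.raise_for_status()
# Poll the returned translation/operation status until the job completes.
print(json.dumps(resp.json(), indent=2))
```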
Here is a demo:
New Conversational HD voices with Custom Voice Support, General Availability
Azure AI Speech’s LLM (large language model) based DragonHD Neural TTS voices are particularly well-suited for voice agents in conversational scenarios and have been seamlessly integrated with the Voice Live API, as described above.
These DragonHD voices feature the following capabilities to enable natural conversations:
Ultra realistic – Azure AI Speech DragonHD TTS voices incorporate advanced features to identify emotional cues within the input text, resulting in richer, more natural variation and authentic emotional expression.
Context aware – Azure AI Speech DragonHD TTS models are enhanced with LLMs to ensure better context understanding, producing more accurate and contextually appropriate outputs.
Multilingual support – 100+ locales in one voice, with 35 neural DragonHD voices available.
Learn more about how DragonHD voices work.
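For developers who want to try a DragonHD voice directly, here is a minimal sketch using the Azure Speech SDK for Python. The voice name shown is just an example, so check the voice gallery in AI Foundry for the DragonHD voices available in your region.

```python
# Minimal sketch of synthesizing speech with a DragonHD voice using the Azure
# Speech SDK. The voice name is an example - check the voice gallery in AI
# Foundry for the DragonHD voices available to you.
import os

import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"],
)
speech_config.speech_synthesis_voice_name = "en-US-Ava:DragonHDLatestNeural"  # example DragonHD voice

audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async(
    "Hi there! I can keep the tone of a real conversation going, even across languages."
).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis complete.")
else:
    print("Synthesis failed:", result.reason)
```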
Besides these voices, we are also excited to announce that our Custom Voice/Professional Voice fine-tuning feature[1] has been extended to support DragonHD neural voice training. In a single step, you can create a highly natural conversational voice that sounds just like your selected voice actor.
Check out this video for Neural HD voices and learn more about how to create a fine-tuned custom DragonHD voice:
DragonHD voice in your application
At Microsoft, these conversational multi-talker HD voices power the new Audio Overviews feature in M365 Copilot, which customers will soon be able to use in the Copilot Notebook, Word, and OneDrive apps. By enabling conversational high-definition voice narration, this feature enhances the clarity and engagement of AI-generated audio overviews, especially when parsing complex or multi-perspective content, with tone and prosody adjusted based on an understanding of the context. The system leverages advanced HD voice technology to simulate natural dialogue, improving comprehension and retention for users consuming long-form content in any app.
Customers like Gainsight, Anker, and more are actively integrating DragonHD neural voices with the Voice Live API to address their business needs.
Fast Transcription, New Locales and Multilingual Transcription
Since we announced the general availability of Fast Transcription last November, it has been widely adopted by thousands of customers for use cases like meeting transcription, voicemail transcription, call recording transcription, and audio/video editing. We are pleased to share two major updates. First, we have expanded language support to include additional locales such as Danish (Denmark), Finnish (Finland), Hebrew (Israel), Indonesian (Indonesia), Polish (Poland), Portuguese (Portugal), and Swedish (Sweden), with more coming soon; for more information, see speech to text supported languages. Second, Fast Transcription now includes a powerful new multilingual transcription model that enables continuous, accurate transcription of audio files with multilingual content, with no need to predefine locale codes. It starts with support for 15 major locales and will expand to more soon. Learn more about how to use fast transcription with the multilingual model and try it out in AI Foundry.
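As a rough sketch of how a multilingual Fast Transcription request might look, the example below posts an audio file with several candidate locales. The endpoint path, API version, and definition fields shown are assumptions and may differ from the final documentation.

```python
# Rough sketch of a Fast Transcription request with multiple candidate locales.
# The api-version and definition fields are assumptions - verify them against
# the Fast Transcription REST API documentation.
import json
import os

import requests

ENDPOINT = "https://<region>.api.cognitive.microsoft.com"  # your Speech resource endpoint
API_KEY = os.environ["AZURE_SPEECH_KEY"]

url = f"{ENDPOINT}/speechtotext/transcriptions:transcribe?api-version=2024-11-15"

with open("meeting.wav", "rb") as audio_file:
    resp = requests.post(
        url,
        headers={"Ocp-Apim-Subscription-Key": API_KEY},
        files={
            "audio": audio_file,
            # With the multilingual model, the audio can mix languages; listing
            # several candidate locales (or, per the docs, omitting them) lets the
            # service switch languages on the fly without a predefined locale code.
            "definition": (None, json.dumps({"locales": ["en-US", "es-ES", "fr-FR"]})),
        },
    )

resp.raise_for_status()
print(resp.json()["combinedPhrases"][0]["text"])
```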
Custom TTS Avatar Self-Service Portal, General Availability
Earlier this year, we introduced a preview version of the custom text-to-speech avatar self-service portal[1], which is generally available today. The portal has been improved for greater stability and is now fully integrated with Azure AI Foundry. The TTS avatar service is also integrated with the Voice Live API.
Another notable feature is “Voice sync for avatar”, which allows users to efficiently create a custom avatar with a personalized voice in a single step. This method trains the voice model directly on the audio data from the custom avatar video, addressing a customer pain point by removing the duplicated work of building a Custom Avatar and a Custom Voice separately.
Here is a video showing details of Voice sync for avatar and Custom avatar portal in AI Foundry:
Custom Avatar in your application
With the general availability of the custom avatar portal, we can assist more customers and partners worldwide in creating tailored avatar solutions for their businesses. Here are a couple of examples:
Digital Assistant Avatar
ServiceNow introduced "Digital Bre," an interactive custom avatar, as an assistant at the Knowledge 25 event to help 25,000 attendees learn about ServiceNow products, event details, and even local information.
“With Azure’s high fidelity streaming avatar, we've unlocked a new dimension of personalized digital interaction. This could be the answer to bridge between advanced AI capabilities and human-centered experience.” says WanTing Huang, Director of Innovation & Research at ServiceNow.
Education Avatar
Cloudforce is introducing avatar capabilities to millions of students, providing immersive and personalized learning experiences. These custom avatars serve as productivity agents while also functioning as empathetic digital educators.
“Students are already embracing generative AI at a pace and proficiency that far exceeds many professional audiences. With Azure’s AI Avatar technology, educators and institutions can tailor unique GenAI interactions that promote reasoning and learning over simply receiving answers the way they would with common public bots.” says Husein Sharaf, Founder and CEO at Cloudforce.
Foundry Improvements
Introducing Azure AI Foundry, the new home for Azure AI Speech. Keep working with all your favorite features and models – they’re here too and ready to go! Take your app’s AI even further with a broader model catalog, plus the vision, language, and content safety capabilities of Azure AI Services. At Build, we will be transitioning from Speech Studio (https://speech.microsoft.com) to AI Foundry.
Azure AI Foundry is a unified platform designed for enterprise AI operations, model builders, and application development. The platform supports developers in exploring, building, testing, and deploying generative AI applications using advanced AI tools and machine learning models, grounded in responsible AI practices.
Earlier this year, we introduced the following features into Speech in AI Foundry:
Pronunciation Assessment
Evaluate pronunciation and give speakers feedback on the accuracy and fluency of their speech (see the sketch after this list).
Video Translation
Seamlessly translate and generate videos in multiple languages automatically.
Custom Voice (Professional voice fine-tuning)
Use your own audio recordings to create a distinct, one-of-a-kind voice for your text-to-speech apps.
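As a quick illustration of the Pronunciation Assessment capability listed above, here is a minimal Azure Speech SDK sketch that scores a learner reading a known sentence from an audio file; the file name and reference text are placeholders.

```python
# Illustrative sketch of pronunciation assessment with the Azure Speech SDK:
# the learner reads a reference sentence and receives accuracy and fluency scores.
import os

import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"],
)
audio_config = speechsdk.audio.AudioConfig(filename="learner_reading.wav")  # placeholder file
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Score the learner's speech against a known reference sentence.
assessment = speechsdk.PronunciationAssessmentConfig(
    reference_text="The quick brown fox jumps over the lazy dog.",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
)
assessment.apply_to(recognizer)

result = recognizer.recognize_once()
scores = speechsdk.PronunciationAssessmentResult(result)
print("Accuracy:", scores.accuracy_score, "Fluency:", scores.fluency_score)
```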
At Build this year, we introduced:
Multi-Lingual Speech Translation
Azure Speech Multilingual Speech Translation enables seamless and accurate translation of audio content across multiple languages without requiring predefined locale codes.
Voice Live enables fluent voice interactions with AI, designed to enhance dynamic communication and natural conversational experiences.
Engage in natural conversations with an avatar that recognizes users' speech input and responds fluently with realistic AI voice.
Audio content creation (Text to speech)
Craft nuanced speech by adjusting the speaking style, pacing, and pronunciation of your spoken content.
Upcoming Azure AI Speech features in Foundry:
Captioning with speech to text (Speech capabilities by scenario)
Convert the audio content of TV broadcast, webcast, film, video, live events or other productions into text to make your content more accessible to your audience.
Language learning (Speech capabilities by scenario)
Get instant feedback on pronunciation accuracy, fluency, prosody, grammar, and vocabulary within your chat experience.
Call to Action
Check out the Transform your agentic apps with voice in Foundry Build session.
Try out the new capabilities in AI Foundry here:
- Voice Live API - Azure AI Foundry - Voice Live API
- Video Translation - Azure AI Foundry - Video Translation
- Fast Transcription - Azure AI Foundry - Fast Transcription
- Custom TTS Avatar - Azure AI Foundry - TTS Avatar
- New conversational Voices - Azure AI Foundry - Voice Gallery
1: Following our Responsible AI policies, Custom Avatar and Custom Voice are Limited Access features available by registration only, and only for certain use cases.