Authors: Yufei Liu, Lihui Wang, Yao Qian, Yang Zheng, Jiajun Zhang, Bing Liu, Yang Cui, Peter Pan, Yan Deng, Songrui Wu, Gang Wang, Xi Wang, Shaofei Zhang, Sheng Zhao
We are pleased to announce that Azure AI Speech’s Dragon HD neural text to speech voices (language model-based TTS, with a model design similar to that of text LLMs) are now moving to general availability (GA). These voices have been available to users for some time, gaining significant traction across a range of scenarios and generating valuable user feedback. This milestone is a testament to that extensive feedback and the growing popularity of Azure AI Speech’s HD voices. As we continue to enhance the user experience, we remain committed to exploring and experimenting with new voices and advanced models to push the boundaries of TTS technology.
Key Features of Azure AI Speech’s Dragon HD Neural TTS
Azure AI Speech’s Dragon HD neural TTS voices (language model-based TTS) are particularly well suited for voice agents and conversational scenarios, thanks to the following key features:
- Context-Aware and Dynamic Output
Azure AI Speech’s Dragon HD TTS models are enhanced with language models (LMs) for better context understanding, producing more accurate and contextually appropriate output. Each voice incorporates dynamic temperature adjustments that vary the degree of creativity and emotion in the speech, allowing delivery to be tailored to the specific needs of the content.
- Emotion-Enhanced Expressiveness
Azure AI Speech’s Dragon HD TTS voices incorporate advanced emotion detection, leveraging acoustic and linguistic features to identify emotional cues within the input text. The model dynamically adjusts tone, style, intonation, and rhythm to deliver speech with rich, natural variation and authentic emotional expression.
- Improved Multilingual Support
Following last month’s addition of more voice variety and enhanced multilingual capabilities, Azure AI Speech’s Dragon HD TTS voices have gained immense popularity across use cases including conversational agents, podcast creation, and video content production.
- Cutting-Edge Acoustic Models
GA voices use the latest acoustic models, continually updated by Microsoft to ensure optimal performance and superior quality. These models adapt to changing needs, providing users with state-of-the-art speech synthesis.
Update details
19 Azure AI Speech Dragon HD TTS voices are now generally available
| Voice name |
| --- |
| de-DE-Florian:DragonHDLatestNeural |
| de-DE-Seraphina:DragonHDLatestNeural |
| en-US-Adam:DragonHDLatestNeural |
| en-US-Andrew:DragonHDLatestNeural |
| en-US-Andrew2:DragonHDLatestNeural |
| en-US-Ava:DragonHDLatestNeural |
| en-US-Brian:DragonHDLatestNeural |
| en-US-Davis:DragonHDLatestNeural |
| en-US-Emma:DragonHDLatestNeural |
| en-US-Emma2:DragonHDLatestNeural |
| en-US-Steffan:DragonHDLatestNeural |
| es-ES-Tristan:DragonHDLatestNeural |
| es-ES-Ximena:DragonHDLatestNeural |
| fr-FR-Remy:DragonHDLatestNeural |
| fr-FR-Vivienne:DragonHDLatestNeural |
| ja-JP-Masaru:DragonHDLatestNeural |
| ja-JP-Nanami:DragonHDLatestNeural |
| zh-CN-Xiaochen:DragonHDLatestNeural |
| zh-CN-Yunfan:DragonHDLatestNeural |
Demo: a conversation between a human and the Ava HD voice
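To try one of the GA voices above from your own application, here is a minimal sketch using the Speech SDK for Python. The key, region, and sample text are placeholders; any voice name from the table can be substituted.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials: use your own Speech resource key and a region
# where HD voices are available (e.g., East US).
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="eastus")

# Select one of the GA Dragon HD voices listed above.
speech_config.speech_synthesis_voice_name = "en-US-Ava:DragonHDLatestNeural"

# Play the synthesized audio through the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async("Hello! How can I help you today?").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
elif result.reason == speechsdk.ResultReason.Canceled:
    print("Synthesis canceled:", result.cancellation_details.error_details)
```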
Introducing multi-talker voices in preview for podcast scenarios
The first multi-talker speech generation model, `en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural`, is a groundbreaking advancement designed to produce multi-turn, podcast-style conversational speech with two distinct speakers’ voices in a single output. This model captures the natural flow of dialogue between speakers, seamlessly incorporating pauses, interjections, and contextual shifts that result in a highly realistic and engaging conversational experience.
In contrast, single-talker models synthesize each speaker's turn in isolation, without considering the broader context of the conversation. This can lead to mismatched emotions and tones, making the dialogue feel less natural and cohesive.
By maintaining contextual coherence and emotional consistency throughout the conversation, the multi-talker model stands out as the superior choice for applications requiring authentic, engaging, and dynamic dialogues.
Here is the SSML template for role assignment:
```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural">
    <mstts:dialog>
      <mstts:turn speaker="ava">Hello, Andrew! How's your day going?</mstts:turn>
      <mstts:turn speaker="andrew">Hey Ava! It's been great, just exploring some AI advancements in communication.</mstts:turn>
      <mstts:turn speaker="ava">That sounds interesting! What kind of projects are you working on?</mstts:turn>
      <mstts:turn speaker="andrew">Well, we've been experimenting with text to speech applications, including turning emails into podcasts.</mstts:turn>
      <mstts:turn speaker="ava">Wow, that could really improve content accessibility! Are you looking for collaborators?</mstts:turn>
      <mstts:turn speaker="andrew">Absolutely! We're open to testing new ideas and seeing how AI can enhance communication.</mstts:turn>
    </mstts:dialog>
  </voice>
</speak>
```
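To render this dialog from code, a minimal sketch using the Speech SDK for Python is shown below. The key and region are placeholders, and the SSML file is assumed to contain the template above; the voice and both speaker roles are selected inside the SSML itself.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials: use your own key and a region with HD voices.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="eastus")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Assumed file holding the multi-talker dialog template shown above.
with open("multitalker_dialog.xml", encoding="utf-8") as f:
    ssml = f.read()

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.Canceled:
    print("Synthesis canceled:", result.cancellation_details.error_details)
```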
Introducing two more versions of Ava and Andrew in preview, optimized for podcast content
Introducing two new preview versions of Ava and Andrew, optimized specifically for podcast content. While multi-talker models are designed to emulate dynamic exchanges between multiple speakers, single-talker models represent the traditional TTS approach by focusing on crafting each speaker’s contribution independently. This approach enhances linguistic accuracy, tonal control, and consistency, all without the complexity of managing inter-speaker dynamics.
Although single-talker models don’t maintain broader contextual coherence between dialogue turns, their streamlined and versatile design offers unique advantages. They are ideal for applications requiring clear, uninterrupted speech, such as instructional content or speech-to-speech interactions. These models deliver a podcast-like style similar to multi-talker models but with greater flexibility and control, catering to diverse use cases (see the sketch after the table below).
| Voice name | Description |
| --- | --- |
| en-US-Andrew3:DragonHDLatestNeural | Optimized for podcast content |
| en-US-Ava3:DragonHDLatestNeural | Optimized for podcast content |
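For comparison with the multi-talker model, a scripted two-host exchange can be built turn by turn with these single-talker podcast voices, using one standard `<voice>` element per turn. A minimal sketch, assuming the Speech SDK for Python and placeholder credentials:

```python
import azure.cognitiveservices.speech as speechsdk

# Each turn is rendered independently by a single-talker voice, so the two
# voices do not share conversational context (unlike the multi-talker model).
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-Ava3:DragonHDLatestNeural">Welcome back to the show!</voice>
  <voice name="en-US-Andrew3:DragonHDLatestNeural">Today we're talking about neural text to speech.</voice>
</speak>
"""

speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="eastus")  # placeholders
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_ssml_async(ssml).get()
```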
Introducing Azure AI Speech’s Dragon HD Flash models in preview
The Azure AI Speech’s Dragon HD Flash model redefines efficiency and accessibility by offering a lighter-weight solution that retains the exceptional flexibility of HD voices, all while maintaining the same price as standard neural voices.
By delivering high-quality voice synthesis at a reduced computational demand, it also significantly improves latency, enabling faster and more responsive performance. This combination of reduced latency, high-quality output, and affordability positions the HD Flash model as an optimal choice for applications requiring versatile, natural-sounding, and prompt speech generation.
| Voice name | Description |
| --- | --- |
| zh-CN-Xiaochen:DragonHDFlashLatestNeural | |
| zh-CN-Xiaoxiao:DragonHDFlashLatestNeural | |
| zh-CN-Xiaoxiao2:DragonHDFlashLatestNeural | Optimized for free-talking |
| zh-CN-Yunxiao:DragonHDFlashLatestNeural | |
| zh-CN-Yunyi:DragonHDFlashLatestNeural | |
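Since the HD Flash voices target lower latency, a simple wall-clock comparison against a standard HD voice can be made with the same synthesis call. A minimal sketch with placeholder credentials; actual timings will vary with region, network, and text length.

```python
import time
import azure.cognitiveservices.speech as speechsdk

def synthesis_seconds(voice_name: str, text: str) -> float:
    """Synthesize text with the given voice and return elapsed wall-clock time."""
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="eastus")  # placeholders
    speech_config.speech_synthesis_voice_name = voice_name
    # Keep audio in the in-memory result (no speaker) to time synthesis only.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
    start = time.perf_counter()
    synthesizer.speak_text_async(text).get()
    return time.perf_counter() - start

text = "你好，欢迎使用 Azure AI 语音服务。"
print("HD:   ", synthesis_seconds("zh-CN-Xiaochen:DragonHDLatestNeural", text))
print("Flash:", synthesis_seconds("zh-CN-Xiaochen:DragonHDFlashLatestNeural", text))
```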
Availability and Important Notes
- Regions: GA HD voices are available in the East US, West Europe, and Southeast Asia regions; HD Flash voices will also be available in the China North 3 region.
- Voice Status: While many HD voices have reached GA, some newly introduced or experimental voices remain in preview status for experimentation and feedback collection. For detailed information, refer to the Status field in the voice list API (see the sketch after this list).
- Model Updates: GA voices are powered by the latest models, which are subject to continuous improvement and updates once a new default version is available.
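To check each voice’s status programmatically, the voice list REST endpoint can be queried. A minimal sketch in Python, assuming the `requests` package and placeholder key and region:

```python
import requests

# Placeholders: substitute your Speech resource key and region.
region = "eastus"
key = "YOUR_SPEECH_KEY"

url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
resp = requests.get(url, headers={"Ocp-Apim-Subscription-Key": key})
resp.raise_for_status()

# Print the status (e.g., GA or Preview) of every Dragon HD voice.
for voice in resp.json():
    if "DragonHD" in voice["ShortName"]:
        print(f"{voice['ShortName']}: {voice.get('Status', 'GA')}")
```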
Get Started
In our ongoing journey to enhance multilingual capabilities in text to speech (TTS) technology, we strive to deliver the best voices to empower your applications. Our voices are designed to be incredibly adaptive, seamlessly switching between languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications like language learning, travel guidance, and international business communication.
Microsoft offers an extensive portfolio of over 600 neural voices, covering more than 150 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or provide a voice to chatbots, elevating the conversational experience for users. With the Custom Neural Voice capability, businesses can also create unique and distinctive brand voices effortlessly.
With these advancements, we continue to push the boundaries of what’s possible in TTS technology, ensuring that our users have access to the most versatile, high-quality voices for their needs.
For more information
- Try our demo to listen to existing neural voices
- Add Text to speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us at ttsvoicefeedback@microsoft.com
Updated Apr 01, 2025