March 2025: Azure AI Speech’s HD voices are generally available and more

GarfieldHe (Microsoft)
Mar 31, 2025

Authors: Yufei Liu, Lihui Wang, Yao Qian, Yang Zheng, Jiajun Zhang, Bing Liu, Yang Cui, Peter Pan, Yan Deng, Songrui Wu, Gang Wang, Xi Wang, Shaofei Zhang, Sheng Zhao

 

We are pleased to announce that Azure AI Speech’s Dragon HD neural text to speech voices (language model-based TTS, similar in model design to text LLMs) are now generally available (GA). Since their first release, these voices have gained significant traction across a wide range of scenarios, and the extensive user feedback they have drawn is what carried them to this milestone. As we continue to enhance the user experience, we remain committed to exploring new voices and advanced models that push the boundaries of TTS technology.

Key Features of Azure AI Speech’s Dragon HD Neural TTS

Azure AI Speech’s Dragon HD (language model-based) neural TTS voices are particularly well suited for voice agents and conversational scenarios, thanks to the following key features:

  1. Context-Aware and Dynamic Output
    The Azure AI Speech Dragon HD TTS models are enhanced with language models (LMs) for better context understanding, producing more accurate and contextually appropriate output. Each voice incorporates dynamic temperature adjustment to vary the degree of creativity and emotion in the speech, allowing delivery to be tailored to the specific needs of the content.
  2. Emotion-Enhanced Expressiveness
    The Azure AI Speech Dragon HD TTS voices incorporate advanced emotion detection, leveraging acoustic and linguistic features to identify emotional cues in the input text. The model adjusts tone, style, intonation, and rhythm dynamically to deliver speech with rich, natural variation and authentic emotional expression.
  3. Improved Multilingual Support
    Following last month’s addition of more voice variety and enhanced multilingual capabilities, Azure AI Speech Dragon HD TTS voices have gained immense popularity across use cases including conversational agents, podcast creation, and video content production.
  4. Cutting-Edge Acoustic Models
    GA voices utilize the latest acoustic models, continually updated by Microsoft to ensure optimal performance and superior quality. These models adapt to changing needs, providing users with state-of-the-art speech synthesis.

Update details

19 Azure AI Speech Dragon HD TTS voices are now generally available

Voice name
de-DE-Florian:DragonHDLatestNeural
de-DE-Seraphina:DragonHDLatestNeural
en-US-Adam:DragonHDLatestNeural
en-US-Andrew:DragonHDLatestNeural
en-US-Andrew2:DragonHDLatestNeural
en-US-Ava:DragonHDLatestNeural
en-US-Brian:DragonHDLatestNeural
en-US-Davis:DragonHDLatestNeural
en-US-Emma:DragonHDLatestNeural
en-US-Emma2:DragonHDLatestNeural
en-US-Steffan:DragonHDLatestNeural
es-ES-Tristan:DragonHDLatestNeural
es-ES-Ximena:DragonHDLatestNeural
fr-FR-Remy:DragonHDLatestNeural
fr-FR-Vivienne:DragonHDLatestNeural
ja-JP-Masaru:DragonHDLatestNeural
ja-JP-Nanami:DragonHDLatestNeural
zh-CN-Xiaochen:DragonHDLatestNeural
zh-CN-Yunfan:DragonHDLatestNeural
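
As a quick way to try one of these GA voices, here is a minimal sketch using the Speech SDK for Python (the azure-cognitiveservices-speech package). The SPEECH_KEY and SPEECH_REGION environment variable names are placeholders chosen for this example, not names the service requires:

# pip install azure-cognitiveservices-speech
import os
import azure.cognitiveservices.speech as speechsdk

# SPEECH_KEY / SPEECH_REGION are placeholder names for your Speech
# resource key and region (for example, "eastus").
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)

# Select a GA Dragon HD voice by its full name.
speech_config.speech_synthesis_voice_name = "en-US-Ava:DragonHDLatestNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "Hello! Dragon HD voices pick up tone and emotion from context."
).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"Synthesized {len(result.audio_data)} bytes of audio.")
else:
    print(f"Synthesis did not complete: {result.reason}")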

Demo: a conversation between a human and the Ava HD voice

Introducing multi-talker voices in preview for podcast scenarios

The first multi-talker speech generation model, `en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural`, is a groundbreaking advancement designed to produce multi-round, conversational, podcast-style speech with two distinct speakers' voices simultaneously. This model captures the natural flow of dialogue between speakers, seamlessly incorporating the pauses, interjections, and contextual shifts that make for a highly realistic and engaging conversational experience.

In contrast, single-talker models synthesize each speaker's turn in isolation, without considering the broader context of the conversation. This can lead to mismatched emotions and tones, making the dialogue feel less natural and cohesive.

By maintaining contextual coherence and emotional consistency throughout the conversation, the multi-talker model stands out as the superior choice for applications requiring authentic, engaging, and dynamic dialogues.

Here is the SSML template for role assignment:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural">
    <mstts:dialog>
      <mstts:turn speaker="ava">Hello, Andrew! How's your day going?</mstts:turn>
      <mstts:turn speaker="andrew">Hey Ava! It's been great, just exploring some AI advancements in communication.</mstts:turn>
      <mstts:turn speaker="ava">That sounds interesting! What kind of projects are you working on?</mstts:turn>
      <mstts:turn speaker="andrew">Well, we've been experimenting with text to speech applications, including turning emails into podcasts.</mstts:turn>
      <mstts:turn speaker="ava">Wow, that could really improve content accessibility! Are you looking for collaborators?</mstts:turn>
      <mstts:turn speaker="andrew">Absolutely! We're open to testing new ideas and seeing how AI can enhance communication.</mstts:turn>
    </mstts:dialog>
  </voice>
</speak>
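
To hear this dialog, the SSML above can be passed to the Speech SDK's speak_ssml_async call. A minimal sketch in Python, under the same SPEECH_KEY/SPEECH_REGION placeholder assumptions as the earlier example:

import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
# Write the dialog to a file rather than the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="dialog.wav")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

# A shortened version of the dialog template shown above.
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural">
    <mstts:dialog>
      <mstts:turn speaker="ava">Hello, Andrew! How's your day going?</mstts:turn>
      <mstts:turn speaker="andrew">Hey Ava! It's been great.</mstts:turn>
    </mstts:dialog>
  </voice>
</speak>"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Saved multi-talker audio to dialog.wav")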

Introducing two more versions of Ava and Andrew in preview, optimized for podcast content

Introducing two new preview versions of Ava and Andrew, optimized specifically for podcast content. While multi-talker models are designed to emulate dynamic exchanges between multiple speakers, single-talker models represent the traditional TTS approach by focusing on crafting each speaker’s contribution independently. This approach enhances linguistic accuracy, tonal control, and consistency, all without the complexity of managing inter-speaker dynamics.

Although single-talker models don’t maintain broader contextual coherence between dialogue turns, their streamlined and versatile design offers unique advantages. They are ideal for applications requiring clear, uninterrupted speech—such as instructional content or speech-to-speech interactions. These models deliver a podcast-like style similar to multi-talker models but with greater flexibility and control, catering to diverse use cases.

Voice name
en-US-Andrew3:DragonHDLatestNeural (optimized for podcast content)
en-US-Ava3:DragonHDLatestNeural (optimized for podcast content)

Introducing Azure AI Speech’s Dragon HD Flash models in preview

The Azure AI Speech’s Dragon HD Flash model redefines efficiency and accessibility by offering a lighter-weight solution that retains the exceptional flexibility of HD voices, all while maintaining the same price as standard neural voices.

By delivering high-quality voice synthesis at reduced computational demand, it also significantly improves latency, enabling faster, more responsive performance (a latency-measurement sketch follows the voice list below). This combination of low latency, high-quality output, and affordability makes the HD Flash model an optimal choice for applications requiring versatile, natural-sounding, and prompt speech generation.

Voice name
zh-CN-Xiaochen:DragonHDFlashLatestNeural
zh-CN-Xiaoxiao:DragonHDFlashLatestNeural
zh-CN-Xiaoxiao2:DragonHDFlashLatestNeural (optimized for free-talking)
zh-CN-Yunxiao:DragonHDFlashLatestNeural
zh-CN-Yunyi:DragonHDFlashLatestNeural
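
To see the latency improvement concretely, one option is to read the latency metrics the service attaches to each synthesis result; per the Speech SDK documentation, these are exposed through PropertyId values. A sketch under the same setup assumptions as the earlier examples:

import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
speech_config.speech_synthesis_voice_name = "zh-CN-Xiaochen:DragonHDFlashLatestNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("你好，欢迎试用 HD Flash 语音。").get()

# The service reports time to first audio byte and total finish time (ms).
first_byte_ms = result.properties.get_property(
    speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs
)
finish_ms = result.properties.get_property(
    speechsdk.PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs
)
print(f"first byte: {first_byte_ms} ms, finish: {finish_ms} ms")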

Availability and Important Notes

  • Regions: GA HD voices are available in the East US, West Europe, and Southeast Asia regions; HD Flash will also be available in ChinaNorth3.
  • Voice Status: While many HD voices have reached GA, some newly introduced or experimental voices remain in Preview status for experimentation and feedback collection. For details, refer to the Status field in the voice list API (see the sketch after this list).
  • Model Updates: GA voices are powered by the latest models, which receive continuous improvements and are updated whenever a new default version becomes available.
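
To check a voice's Status field programmatically, the voices list REST endpoint can be queried directly. A minimal sketch using the requests library, with the same SPEECH_KEY/SPEECH_REGION placeholders used above:

import os
import requests

region = os.environ["SPEECH_REGION"]  # for example, "eastus"
url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
resp = requests.get(
    url, headers={"Ocp-Apim-Subscription-Key": os.environ["SPEECH_KEY"]}
)
resp.raise_for_status()

# Each voice entry includes a Status field such as "GA" or "Preview".
for voice in resp.json():
    if "DragonHD" in voice["ShortName"]:
        print(voice["ShortName"], "-", voice["Status"])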

Get Started

In our ongoing journey to enhance multilingual capabilities in text to speech (TTS) technology, we strive to deliver the best voices to empower your applications. Our voices are designed to be incredibly adaptive, seamlessly switching between languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications like language learning, travel guidance, and international business communication.

Microsoft offers an extensive portfolio of over 600 neural voices, covering more than 150 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or provide a voice to chatbots, elevating the conversational experience for users. With the Custom Neural Voice capability, businesses can also create unique and distinctive brand voices effortlessly.

With these advancements, we continue to push the boundaries of what’s possible in TTS technology, ensuring that our users have access to the most versatile, high-quality voices for their needs.

For more information
