The new voice type is available in preview in the East US, West Europe, Sweden Central, and Southeast Asia regions.
Dragon HD Omni is Microsoft Azure Speech’s newest text‑to‑speech generation, delivering over 700 high‑quality voices with enhanced expressiveness, multi‑lingual fluency, and multi‑style control — all through a unified model built in Microsoft Foundry. It removes common developer pain points such as unnatural voice prosody, limited language coverage, and heavy SSML tuning effort. The result is a powerful value proposition: faster integration, richer user experiences, and production‑ready voice output with minimal effort.
Azure Speech offers a broad range of unique voices for applications like virtual agents, audiobooks, podcasts, and speech-to-speech tasks.
700+ prebuilt voices
Dragon HD Omni offers a range of prebuilt voices with distinct personas and emotions, supporting diverse use cases from agent-based applications to content creation. These voices unlock endless possibilities, empowering users to enhance end-to-end applications.
Full update for previous generation voices
Dragon HD Omni merges a wide range of prebuilt voices into one model, improving contextual adaptation, prosody, and expression while keeping each voice's unique character. This technology delivers more accurate, flexible, and lifelike speech for a variety of uses. Dragon HD Omni raises the standard for natural AI voices across customer service, accessibility, and creative projects, advancing human-computer interaction.
You can explore some of the voices in the voice list, such as:
- "en-US-Ava:DragonHDOmniLatestNeural"
- "en-US-Andrew:DragonHDOmniLatestNeural"
- "en-US-Dana:DragonHDOmniLatestNeural"
- "en-US-Caleb:DragonHDOmniLatestNeural"
- "zh-CN-Xiaoyue:DragonHDOmniLatestNeural"
- "zh-CN-Yunqi:DragonHDOmniLatestNeural"
- "en-US-Phoebe:DragonHDOmniLatestNeural"
- "en-US-Lewis:DragonHDOmniLatestNeural"
You can try them directly in the Speech Playground in Microsoft Foundry.
Alternatively, append the suffix `:DragonHDOmniLatestNeural` to a voice's base name to call the Omni version of that voice directly through SSML.
For example:
| Previous neural voice | Omni version voice name |
| --- | --- |
| de-DE-ConradNeural | de-DE-Conrad:DragonHDOmniLatestNeural |
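The renaming shown in the table can be sketched as a small helper. This is only an illustration: `to_omni_voice_name` and `OMNI_SUFFIX` are hypothetical names, and the code assumes the plain `<locale>-<Name>Neural` pattern from the example. It does not check whether an Omni version of a given voice actually exists.

```python
OMNI_SUFFIX = ":DragonHDOmniLatestNeural"  # suffix taken from the text above


def to_omni_voice_name(neural_voice: str) -> str:
    """Map e.g. 'de-DE-ConradNeural' to 'de-DE-Conrad:DragonHDOmniLatestNeural'.

    Hypothetical helper: it only rewrites the string and does not verify
    that the service offers an Omni version of the voice.
    """
    if neural_voice.endswith(OMNI_SUFFIX):
        return neural_voice  # already in the Omni format
    if not neural_voice.endswith("Neural"):
        raise ValueError(f"expected a name ending in 'Neural': {neural_voice!r}")
    base = neural_voice[: -len("Neural")]  # drop the trailing 'Neural'
    return base + OMNI_SUFFIX


print(to_omni_voice_name("de-DE-ConradNeural"))
# de-DE-Conrad:DragonHDOmniLatestNeural
```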
AI-Generated Voices
Dragon HD Omni now features nearly 300 brand‑new AI‑generated voices, carefully designed to deliver an unprecedented range of vocal diversity.
These voices aren’t just more of the same — they’re built to give you choice, flexibility, and creative control. With variations across:
- Gender – male, female, and non‑binary options
- Age – youthful, mature, and senior tones
- Pitch & tone – from warm and friendly to authoritative and professional
This expanded library means you can:
- Personalize experiences for different audiences, whether you’re building an educational app, a customer support bot, or a storytelling platform.
- Strengthen brand identity by selecting voices that reflect your company’s personality and values.
- Increase inclusivity with diverse vocal styles that resonate across cultures and communities.
- Unlock creativity by experimenting with unique voice personalities for podcasts, games, or immersive experiences.
| Speaker name | Description |
| --- | --- |
| en-us-graphiterhodium | A bold and dramatic male voice |
| en-us-olivepoivre | An adult female voice that is calm and soothing |
Styles control
Standard Azure voices have limited styles due to extensive tuning requirements. Dragon HD Omni introduces automatic style prediction using natural language descriptions, enabling advanced customization, broader style support, reduced cost, and improved expressiveness. In the initial release, styles will launch for en-US-Ava and en-US-Andrew.
Supported styles
angry, chill surfer, confused, curious, determined, disgusted, embarrassed, emo teenager, empathetic, encouraging, excited, fearful, friendly, grateful, joyful, mad scientist, meditative, narration, neutral, new yorker, news, reflective, regretful, relieved, sad, santa, shy, soft voice, surprised
Note that the resulting style is strongly influenced by the input content.
SSML example
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-us-ava:DragonHDOmniLatestNeural">
    <mstts:express-as style="cheerful">
      Wow! What an amazing day! I feel so full of energy, and everything around me seems brighter.
      My voice is bubbling with excitement, and I can’t stop smiling.
      I’m ready to take on anything that comes my way—let’s celebrate this wonderful moment together!
    </mstts:express-as>
  </voice>
</speak>
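When styles are chosen at runtime, it can help to generate the SSML programmatically. The sketch below is one possible approach, not an official API: `build_style_ssml` and `SUPPORTED_STYLES` are hypothetical names, and checking the style against the list above on the client side is an assumption made here for illustration (the service performs its own validation).

```python
from xml.sax.saxutils import escape, quoteattr

# Style names copied from the "Supported styles" list above.
SUPPORTED_STYLES = {
    "angry", "chill surfer", "confused", "curious", "determined", "disgusted",
    "embarrassed", "emo teenager", "empathetic", "encouraging", "excited",
    "fearful", "friendly", "grateful", "joyful", "mad scientist", "meditative",
    "narration", "neutral", "new yorker", "news", "reflective", "regretful",
    "relieved", "sad", "santa", "shy", "soft voice", "surprised",
}


def build_style_ssml(text, style,
                     voice="en-us-ava:DragonHDOmniLatestNeural", lang="en-US"):
    """Return an SSML document that wraps `text` in an express-as element."""
    if style not in SUPPORTED_STYLES:  # client-side check, an assumption
        raise ValueError(f"style not in the documented list: {style!r}")
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, < and > so the document stays well-formed
        "</mstts:express-as></voice></speak>"
    )


print(build_style_ssml("What an amazing day!", "excited"))
```

The resulting string can be passed to the Speech SDK's `speak_ssml_async` in the same way as a hand-written SSML document.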
Multilingual and Accents
All Dragon HD Omni voices support multiple languages and can automatically predict the language and generate output based on the input text. You can also use the <lang xml:lang> tag to adjust the speaking language and accent, for example fr-FR for French or de-DE for German. For a comprehensive list of supported languages and their associated syntax and attributes, refer to the lang element.
SSML example
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-us-ava:DragonHDOmniLatestNeural">
    <lang xml:lang="fr-FR">
      Bonjour ! Ce matin, j’ai pris un café au jardin du Luxembourg. Il faisait frais, mais très agréable. Ensuite, j’ai acheté une baguette et quelques macarons. Paris est vraiment charmant.
    </lang>
  </voice>
</speak>
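For text that mixes several languages, the `<lang>` wrapping can be generated per segment. The following is a minimal sketch under stated assumptions: `build_multilingual_ssml` is a hypothetical helper, and the `(locale, text)` segment format is invented here for illustration. Segments with a locale of `None` are left untagged so that the voice predicts the language on its own.

```python
from xml.sax.saxutils import escape, quoteattr


def build_multilingual_ssml(segments,
                            voice="en-us-ava:DragonHDOmniLatestNeural",
                            default_lang="en-US"):
    """Build SSML from (locale, text) pairs.

    A locale such as 'fr-FR' wraps the text in <lang xml:lang="...">;
    a locale of None leaves the text untagged.
    """
    parts = []
    for locale, text in segments:
        if locale is None:
            parts.append(escape(text))
        else:
            parts.append(f"<lang xml:lang={quoteattr(locale)}>{escape(text)}</lang>")
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang={quoteattr(default_lang)}>'
        f"<voice name={quoteattr(voice)}>{' '.join(parts)}</voice></speak>"
    )


ssml = build_multilingual_ssml([
    (None, "In French you would say:"),
    ("fr-FR", "Bonjour !"),
    ("de-DE", "Guten Morgen!"),
])
print(ssml)
```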
Word Boundary Event Support
Dragon HD Omni supports the word boundary event, which allows developers to track the precise timing of each word as it is spoken. This feature is essential for applications requiring word-level synchronization, such as karaoke, real-time captioning, or interactive voice experiences.
When the event fires, it provides:
- Text: The word spoken
- AudioOffset: The time offset in the audio stream, reported by the SDK in ticks of 100 nanoseconds (divide by 10,000 to get milliseconds)
- TextOffset: The position of the word in the input text
Example: Python sample using the word boundary event in the Azure Speech SDK
import azure.cognitiveservices.speech as speechsdk

def word_boundary_cb(evt):
    # audio_offset is in 100-nanosecond ticks; dividing by 10000 gives milliseconds.
    print(f"Word: '{evt.text}', AudioOffset: {evt.audio_offset / 10000}ms, TextOffset: {evt.text_offset}")

# Replace the placeholders with your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Connect the callback before starting synthesis so that no events are missed.
synthesizer.synthesis_word_boundary.connect(word_boundary_cb)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-us-ava:DragonHDOmniLatestNeural">
Hello Azure, welcome to Dragon HD Omni!
</voice>
</speak>
"""

# Block until synthesis finishes; the callback fires once per word or punctuation mark.
result = synthesizer.speak_ssml_async(ssml).get()
Sample Output:
Word: 'Hello', AudioOffset: 110.0ms, TextOffset: 182
Word: 'Azure', AudioOffset: 590.0ms, TextOffset: 188
Word: ',', AudioOffset: 1110.0ms, TextOffset: 193
Word: 'welcome', AudioOffset: 1270.0ms, TextOffset: 195
Word: 'to', AudioOffset: 1750.0ms, TextOffset: 203
Word: 'Dragon HD Omni', AudioOffset: 1910.0ms, TextOffset: 206
Word: '!', AudioOffset: 2750.0ms, TextOffset: 216
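For captioning, the per-word events can be grouped into timed cues. The sketch below is one way to do it and runs without the SDK: the `Word` class and `to_cues` function are hypothetical, the rule of closing a cue at each punctuation mark is an arbitrary design choice, and the input values mirror the sample output above converted back to ticks.

```python
from dataclasses import dataclass

TICKS_PER_MS = 10_000  # audio offsets arrive in 100-nanosecond ticks


@dataclass
class Word:
    text: str
    audio_offset_ticks: int


def _is_punctuation(token: str) -> bool:
    return not any(ch.isalnum() for ch in token)


def _join(tokens):
    """Join tokens with spaces, attaching punctuation to the preceding word."""
    out = ""
    for tok in tokens:
        if out and not _is_punctuation(tok):
            out += " "
        out += tok
    return out


def to_cues(words):
    """Group words into (start_ms, text) cues, closing a cue at punctuation."""
    cues, current = [], []
    for word in words:
        current.append(word)
        if _is_punctuation(word.text):
            cues.append((current[0].audio_offset_ticks / TICKS_PER_MS,
                         _join(w.text for w in current)))
            current = []
    if current:  # trailing words without closing punctuation
        cues.append((current[0].audio_offset_ticks / TICKS_PER_MS,
                     _join(w.text for w in current)))
    return cues


# In a real application the callback would collect these, for example:
#   words.append(Word(evt.text, evt.audio_offset))
sample = [
    Word("Hello", 1_100_000), Word("Azure", 5_900_000), Word(",", 11_100_000),
    Word("welcome", 12_700_000), Word("to", 17_500_000),
    Word("Dragon HD Omni", 19_100_000), Word("!", 27_500_000),
]
for start_ms, text in to_cues(sample):
    print(f"{start_ms:>7.1f} ms  {text}")
```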
Parameters
Dragon HD Omni supports advanced parameter tuning to help you customize voice output for different scenarios. This guide explains each parameter in simple terms and provides recommendations for adjusting them based on your goals.
Overview
| Parameter | Default | Range | Purpose |
| --- | --- | --- | --- |
| temperature | 0.7 | 0.3 – 1.0 | Controls creativity vs. stability |
| top_p | 0.7 | 0.3 – 1.0 | Filters output for diversity |
| top_k | 22 | 1 – 50 | Limits number of options considered |
| cfg_scale | 1.4 | 1.0 – 2.0 | Adjusts relevance and speech speed |
Tuning for Expressiveness vs. Stability
- Higher values for temperature, top_p, and top_k result in more expressive, emotionally varied speech.
- Lower values produce more stable and predictable output.
Recommendation:
- To increase expressiveness, raise all three parameters together.
- Keep top_p equal to temperature for best results.
Tuning for Speed and Contextual Relevance
- cfg_scale affects how quickly the voice speaks and how well it aligns with the context.
- Higher values (e.g., 1.8–2.0): faster speech, stronger contextual relevance.
- Lower values (e.g., 1.0–1.2): slower speech, less contextual alignment.
Suggested Tuning Strategies
| Goal | Suggested Adjustment |
| --- | --- |
| More expressive | Increase temperature, top_p, and top_k together |
| More stable | Lower temperature first, then adjust top_p if needed |
| Faster & relevant | Increase cfg_scale |
| Slower & neutral | Decrease cfg_scale |
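The "more expressive" strategy can be sketched as a function that raises temperature, top_p, and top_k together while keeping top_p equal to temperature and staying inside the documented ranges. The function name `more_expressive` and the step sizes (0.1 for temperature, 5 for top_k) are arbitrary assumptions for illustration, not recommended values.

```python
# Ranges and defaults copied from the Overview table above.
RANGES = {"temperature": (0.3, 1.0), "top_p": (0.3, 1.0),
          "top_k": (1, 50), "cfg_scale": (1.0, 2.0)}
DEFAULTS = {"temperature": 0.7, "top_p": 0.7, "top_k": 22, "cfg_scale": 1.4}


def clamp(name, value):
    """Keep a value inside the documented range for that parameter."""
    low, high = RANGES[name]
    return max(low, min(high, value))


def more_expressive(params, steps=1):
    """Raise temperature, top_p and top_k together (hypothetical step sizes)."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    out = dict(params)  # leave the caller's dictionary untouched
    temperature = clamp("temperature", round(params["temperature"] + 0.1 * steps, 2))
    out["temperature"] = temperature
    out["top_p"] = temperature  # keep top_p equal to temperature
    out["top_k"] = clamp("top_k", params["top_k"] + 5 * steps)
    return out


print(more_expressive(DEFAULTS, steps=2))
# {'temperature': 0.9, 'top_p': 0.9, 'top_k': 32, 'cfg_scale': 1.4}
```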
The following SSML examples show how to set the parameters above:
Single parameter:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8">
    Hello Azure!
  </voice>
</speak>
Multiple parameters:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8;top_k=22;temperature=0.7;cfg_scale=1.2">
    Hello Azure! Hello Azure!
  </voice>
</speak>
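The `parameters` attribute in the examples above is a semicolon-separated list of `name=value` pairs. A small helper can build that string and reject values outside the ranges in the Overview table before a request is sent. This is a sketch only: `format_parameters` is a hypothetical name, and the client-side range check is an assumption made for illustration.

```python
# Ranges copied from the Overview table above.
RANGES = {"temperature": (0.3, 1.0), "top_p": (0.3, 1.0),
          "top_k": (1, 50), "cfg_scale": (1.0, 2.0)}


def format_parameters(**params):
    """Return e.g. 'top_p=0.8;top_k=22' after checking names and ranges."""
    parts = []
    for name, value in params.items():
        if name not in RANGES:
            raise ValueError(f"unknown parameter: {name}")
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low} to {high}")
        parts.append(f"{name}={value}")
    return ";".join(parts)  # keyword order is preserved


attr = format_parameters(top_p=0.8, top_k=22, temperature=0.7, cfg_scale=1.2)
print(f'<voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="{attr}">')
# <voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8;top_k=22;temperature=0.7;cfg_scale=1.2">
```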
Get Started
In our ongoing journey to enhance multilingual capabilities in text to speech (TTS) technology, we strive to deliver the best voices to empower your applications. Our voices are designed to be incredibly adaptive, seamlessly switching between languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications like language learning, travel guidance, and international business communication.
Microsoft offers an extensive portfolio of over 600 neural voices, covering more than 150 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or provide a voice to chatbots, elevating the conversational experience for users. With the Custom Neural Voice capability, businesses can also create unique and distinctive brand voices effortlessly.
With these advancements, we continue to push the boundaries of what’s possible in TTS technology, ensuring that our users have access to the most versatile, high-quality voices for their needs.
For more information
- Try our demo to listen to existing neural voices
- Add Text to speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us at ttsvoicefeedback@microsoft.com