Microsoft Foundry Blog

Introducing Dragon HD Omni: Azure Speech New Voice Type Now in Preview via Microsoft Foundry

GarfieldHe (Microsoft)
Jan 07, 2026

The new voice type will be available for preview in the East US, West Europe, Sweden Central, and Southeast Asia regions.

Dragon HD Omni is Microsoft Azure Speech’s newest text‑to‑speech generation, delivering over 700 high‑quality voices with enhanced expressiveness, multi‑lingual fluency, and multi‑style control — all through a unified model built in Microsoft Foundry. It removes common developer pain points such as unnatural voice prosody, limited language coverage, and heavy SSML tuning effort. The result is a powerful value proposition: faster integration, richer user experiences, and production‑ready voice output with minimal effort.

Azure Speech offers a broad range of unique voices for applications like virtual agents, audiobooks, podcasts, and speech-to-speech tasks.

Demo video

700+ prebuilt voices

Dragon HD Omni offers a range of prebuilt voices with distinct personas and emotions, supporting diverse use cases from agent-based applications to content creation. These voices unlock endless possibilities, empowering users to enhance end-to-end applications.

Full update for previous generation voices

Dragon HD Omni merges a wide range of prebuilt voices into one model, improving contextual adaptation, prosody, and expression while keeping each voice's unique character. This technology delivers more accurate, flexible, and lifelike speech for a variety of uses. Dragon HD Omni raises the standard for natural AI voices across customer service, accessibility, and creative projects, advancing human-computer interaction.

You can explore some voices from the voice list, such as:

  • "en-US-Ava:DragonHDOmniLatestNeural"
  • "en-US-Andrew:DragonHDOmniLatestNeural"
  • "en-US-Dana:DragonHDOmniLatestNeural"
  • "en-US-Caleb:DragonHDOmniLatestNeural"
  • "zh-CN-Xiaoyue:DragonHDOmniLatestNeural"
  • "zh-CN-Yunqi:DragonHDOmniLatestNeural"
  • "en-US-Phoebe:DragonHDOmniLatestNeural"
  • "en-US-Lewis:DragonHDOmniLatestNeural"

They will be available to try directly in the Speech Playground in Microsoft Foundry.

Alternatively, take a previous voice name, drop its trailing `Neural`, and append the suffix `:DragonHDOmniLatestNeural` to try the Omni version of that voice through a direct SSML call.

For example:

Previous neural voice | Omni version voice name
de-DE-ConradNeural    | de-DE-Conrad:DragonHDOmniLatestNeural
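
The renaming rule above can be sketched as a small helper. This is a minimal sketch, assuming every previous-generation name ends in `Neural` as in the de-DE-Conrad example; the function name `to_omni_voice_name` is hypothetical and not part of any SDK, and the helper does not check whether an Omni version of a given voice actually exists.

```python
OMNI_SUFFIX = ":DragonHDOmniLatestNeural"


def to_omni_voice_name(voice_name: str) -> str:
    """Map a previous-generation neural voice name to its Omni form.

    Assumes the input looks like 'de-DE-ConradNeural'. Names that already
    carry the Omni suffix are returned unchanged.
    """
    if voice_name.endswith(OMNI_SUFFIX):
        return voice_name
    if not voice_name.endswith("Neural"):
        raise ValueError(f"Unexpected voice name format: {voice_name!r}")
    base = voice_name[: -len("Neural")]  # e.g. 'de-DE-Conrad'
    return base + OMNI_SUFFIX


print(to_omni_voice_name("de-DE-ConradNeural"))
# de-DE-Conrad:DragonHDOmniLatestNeural
```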

AI-Generated Voices

Dragon HD Omni now features nearly 300 brand‑new AI‑generated voices, carefully designed to deliver an unprecedented range of vocal diversity.

These voices are built to give you choice, flexibility, and creative control, with variations across:

  • Gender – male, female, and non‑binary options
  • Age – youthful, mature, and senior tones
  • Pitch & tone – from warm and friendly to authoritative and professional

This expanded library means you can:

  • Personalize experiences for different audiences, whether you’re building an educational app, a customer support bot, or a storytelling platform.
  • Strengthen brand identity by selecting voices that reflect your company’s personality and values.
  • Increase inclusivity with diverse vocal styles that resonate across cultures and communities.
  • Unlock creativity by experimenting with unique voice personalities for podcasts, games, or immersive experiences.

Speaker name          | Description
en-us-graphiterhodium | A bold and dramatic male voice
en-us-olivepoivre     | An adult female voice that is calm and soothing

Styles control

Standard Azure voices support only a limited set of styles because each style requires extensive tuning. Dragon HD Omni introduces automatic style prediction from natural language descriptions, enabling advanced customization, broader style support, reduced cost, and improved expressiveness. In the initial release, styles will launch for en-US-Ava and en-US-Andrew.

Supported styles

angry, chill surfer, confused, curious, determined, disgusted, embarrassed, emo teenager, empathetic, encouraging, excited, fearful, friendly, grateful, joyful, mad scientist, meditative, narration, neutral, new yorker, news, reflective, regretful, relieved, sad, santa, shy, soft voice, surprised

Note that the style result is strongly influenced by the input content.

SSML example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US"> 
<voice name="en-us-ava:DragonHDOmniLatestNeural">
    <mstts:express-as style="excited">
      Wow! What an amazing day! I feel so full of energy, and everything around me seems brighter.
      My voice is bubbling with excitement, and I can’t stop smiling.
      I’m ready to take on anything that comes my way—let’s celebrate this wonderful moment together!
    </mstts:express-as>
 </voice>
</speak>
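
When SSML like the example above is generated from code, it helps to escape the text and check the style name first. The following is a minimal sketch: the helper name `build_style_ssml` is hypothetical, and treating the styles listed in this post as the complete allowed set is an assumption (the service may accept other values).

```python
from xml.sax.saxutils import escape, quoteattr

# Styles listed in this post; treating them as the full allowed set is an assumption.
LISTED_STYLES = {
    "angry", "chill surfer", "confused", "curious", "determined", "disgusted",
    "embarrassed", "emo teenager", "empathetic", "encouraging", "excited",
    "fearful", "friendly", "grateful", "joyful", "mad scientist", "meditative",
    "narration", "neutral", "new yorker", "news", "reflective", "regretful",
    "relieved", "sad", "santa", "shy", "soft voice", "surprised",
}


def build_style_ssml(voice: str, style: str, text: str, lang: str = "en-US") -> str:
    """Return an SSML document that speaks `text` in the given style."""
    if style not in LISTED_STYLES:
        raise ValueError(f"Style {style!r} is not in the listed styles")
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so the document stays well-formed
        "</mstts:express-as></voice></speak>"
    )


ssml = build_style_ssml(
    "en-us-ava:DragonHDOmniLatestNeural", "excited", "What an amazing day!"
)
print(ssml)
```

The returned string can be passed to the Speech SDK's `speak_ssml_async` in the same way as a handwritten SSML document.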
 

Multilingual and Accents

All Dragon HD Omni voices support multiple languages and can automatically predict the language of the input text and generate matching output. You can also use the <lang xml:lang> tag to adjust the speaking language and accent, such as fr-FR for French or de-DE for German. For a comprehensive list of supported languages and the associated syntax and attributes, refer to the lang element documentation.

SSML example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-us-ava:DragonHDOmniLatestNeural">
    <lang xml:lang="fr-FR">
      Bonjour ! Ce matin, j’ai pris un café au jardin du Luxembourg. Il faisait frais, mais très agréable. Ensuite, j’ai acheté une baguette et quelques macarons. Paris est vraiment charmant.
    </lang>
</voice>
</speak>
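
For text that mixes several languages, each segment can be wrapped in its own lang element. The sketch below illustrates that idea under stated assumptions: `build_multilingual_ssml` is a hypothetical helper, and the German segment is invented purely for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def build_multilingual_ssml(voice, segments, default_lang="en-US"):
    """Return SSML where each (locale, text) pair is wrapped in a <lang> element."""
    body = "".join(
        f"<lang xml:lang={quoteattr(locale)}>{escape(text)}</lang>"
        for locale, text in segments
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(default_lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )


ssml = build_multilingual_ssml(
    "en-us-ava:DragonHDOmniLatestNeural",
    [
        ("fr-FR", "Bonjour ! Paris est vraiment charmant."),
        ("de-DE", "Guten Morgen!"),  # illustrative segment, not from the post
    ],
)
print(ssml)
```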
 

Word Boundary Event Support

Dragon HD Omni supports the word boundary event, which allows developers to track the precise timing of each word as it is spoken. This feature is essential for applications requiring word-level synchronization, such as karaoke, real-time captioning, or interactive voice experiences.

When the event fires, it provides:

  • Text: The word spoken
  • AudioOffset: The time offset in the audio stream, reported in ticks of 100 nanoseconds (divide by 10,000 to get milliseconds)
  • TextOffset: The position of the word in the input text

Example: Python sample using the word boundary event in the Azure Speech SDK

import azure.cognitiveservices.speech as speechsdk


def word_boundary_cb(evt):
    # audio_offset is reported in ticks (100 ns); divide by 10,000 for milliseconds.
    print(f"Word: '{evt.text}', AudioOffset: {evt.audio_offset / 10000}ms, TextOffset: {evt.text_offset}")


# Replace the placeholders with your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Subscribe before starting synthesis so that no events are missed.
synthesizer.synthesis_word_boundary.connect(word_boundary_cb)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-us-ava:DragonHDOmniLatestNeural">
    Hello Azure, welcome to Dragon HD Omni!
  </voice>
</speak>
"""

# Blocks until synthesis completes; the callback fires for each word and punctuation mark.
result = synthesizer.speak_ssml_async(ssml).get()

 

Sample Output:

Word: 'Hello', AudioOffset: 110.0ms, TextOffset: 182
Word: 'Azure', AudioOffset: 590.0ms, TextOffset: 188
Word: ',', AudioOffset: 1110.0ms, TextOffset: 193
Word: 'welcome', AudioOffset: 1270.0ms, TextOffset: 195
Word: 'to', AudioOffset: 1750.0ms, TextOffset: 203
Word: 'Dragon HD Omni', AudioOffset: 1910.0ms, TextOffset: 206
Word: '!', AudioOffset: 2750.0ms, TextOffset: 216
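
For captioning or karaoke-style highlighting, the raw offsets usually need to become start and end times. The following is a minimal sketch of that conversion, not an SDK feature: `WordTiming` and `to_timeline` are hypothetical names, and treating each word as lasting until the next event starts is a simplifying assumption that folds pauses into the preceding word.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TICKS_PER_MS = 10_000  # one tick is 100 nanoseconds


@dataclass
class WordTiming:
    text: str
    start_ms: float
    end_ms: Optional[float]  # None for the last event, whose end is unknown here


def to_timeline(events: List[Tuple[str, int]]) -> List[WordTiming]:
    """Convert (text, audio_offset_ticks) pairs, in spoken order, to timings."""
    timeline = []
    for i, (text, ticks) in enumerate(events):
        start_ms = ticks / TICKS_PER_MS
        # Approximation: a word is treated as lasting until the next event starts.
        end_ms = events[i + 1][1] / TICKS_PER_MS if i + 1 < len(events) else None
        timeline.append(WordTiming(text, start_ms, end_ms))
    return timeline


# Offsets from the first rows of the sample output above, converted back to ticks.
events = [("Hello", 1_100_000), ("Azure", 5_900_000), (",", 11_100_000)]
for timing in to_timeline(events):
    print(timing)
```

In a real application, the word boundary callback would append `(evt.text, evt.audio_offset)` to the `events` list as synthesis progresses.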

 

Parameters

Dragon HD Omni supports advanced parameter tuning to help you customize voice output for different scenarios. This section explains each parameter in simple terms and recommends adjustments based on your goals.

Overview

Parameter   | Default | Range     | Purpose
temperature | 0.7     | 0.3 – 1.0 | Controls creativity vs. stability
top_p       | 0.7     | 0.3 – 1.0 | Filters output for diversity
top_k       | 22      | 1 – 50    | Limits number of options considered
cfg_scale   | 1.4     | 1.0 – 2.0 | Adjusts relevance and speech speed

Tuning for Expressiveness vs. Stability

  • Higher values for temperature, top_p, and top_k result in more expressive, emotionally varied speech.
  • Lower values produce more stable and predictable output.

Recommendation:

  • To increase expressiveness, raise all three parameters together.
  • Keep top_p equal to temperature for best results.

Tuning for Speed and Contextual Relevance

  • cfg_scale affects how quickly the voice speaks and how well it aligns with the context.
    • Higher values (e.g., 1.8–2.0): faster speech, stronger contextual relevance.
    • Lower values (e.g., 1.0–1.2): slower speech, less contextual alignment.

Suggested Tuning Strategies

Goal              | Suggested Adjustment
More expressive   | Increase temperature, top_p, and top_k together
More stable       | Lower temperature first, then adjust top_p if needed
Faster & relevant | Increase cfg_scale
Slower & neutral  | Decrease cfg_scale
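
The first two strategies can be sketched as small adjustment functions. This is an illustration only: the ranges and defaults come from the overview table in this post, while the function names and the step sizes (`step=0.1`, `k_step=5`) are arbitrary assumptions rather than recommended values.

```python
# Ranges and defaults as listed in the overview table of this post.
RANGES = {
    "temperature": (0.3, 1.0),
    "top_p": (0.3, 1.0),
    "top_k": (1, 50),
    "cfg_scale": (1.0, 2.0),
}
DEFAULTS = {"temperature": 0.7, "top_p": 0.7, "top_k": 22, "cfg_scale": 1.4}


def clamp(value, low, high):
    """Keep a value inside its documented range."""
    return max(low, min(high, value))


def more_expressive(params, step=0.1, k_step=5):
    """Raise temperature, top_p, and top_k together; step sizes are illustrative."""
    out = dict(params)
    temperature = clamp(round(out["temperature"] + step, 2), *RANGES["temperature"])
    out["temperature"] = temperature
    out["top_p"] = temperature  # keep top_p equal to temperature
    out["top_k"] = clamp(out["top_k"] + k_step, *RANGES["top_k"])
    return out


def more_stable(params, step=0.1):
    """Lower temperature first and leave the other parameters untouched."""
    out = dict(params)
    out["temperature"] = clamp(round(out["temperature"] - step, 2), *RANGES["temperature"])
    return out


print(more_expressive(DEFAULTS))
print(more_stable(DEFAULTS))
```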

The following SSML examples show how to set the parameters above:

Single parameter:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8">
Hello Azure!
</voice>
</speak>

Multiple parameters:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8;top_k=22;temperature=0.7;cfg_scale=1.2">
Hello Azure! Hello Azure!
</voice>
</speak>
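
The value of the `parameters` attribute is a semicolon-separated list of `name=value` pairs, as the examples above show. The sketch below builds that string and rejects values outside the ranges listed in the overview table. The helper name `build_parameters_attr` is hypothetical, and whether the service itself rejects or clamps out-of-range values is not stated in this post.

```python
# Ranges as listed in the overview table of this post.
RANGES = {
    "temperature": (0.3, 1.0),
    "top_p": (0.3, 1.0),
    "top_k": (1, 50),
    "cfg_scale": (1.0, 2.0),
}


def build_parameters_attr(**params) -> str:
    """Build the value of the voice element's `parameters` attribute."""
    parts = []
    for name, value in params.items():
        if name not in RANGES:
            raise ValueError(f"Unknown parameter: {name!r}")
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the range {low} to {high}")
        parts.append(f"{name}={value}")
    return ";".join(parts)


print(build_parameters_attr(top_p=0.8, top_k=22, temperature=0.7, cfg_scale=1.2))
# top_p=0.8;top_k=22;temperature=0.7;cfg_scale=1.2
```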

 

Get Started

In our ongoing journey to enhance multilingual capabilities in text-to-speech (TTS) technology, we strive to deliver the best voices to empower your applications. Our voices are designed to be adaptive, switching between languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, which makes them well suited to applications like language learning, travel guidance, and international business communication.

Microsoft offers an extensive portfolio of over 600 neural voices, covering more than 150 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or provide a voice to chatbots, elevating the conversational experience for users. With the Custom Neural Voice capability, businesses can also create unique and distinctive brand voices effortlessly. 

With these advancements, we continue to push the boundaries of what’s possible in TTS technology, ensuring that our users have access to the most versatile, high-quality voices for their needs. 

 

For more information

Updated Jan 07, 2026
Version 1.0