Introducing Dragon HD Omni: Azure Speech New Voice Type Now in Preview via Microsoft Foundry

Dragon HD Omni is the newest generation of text-to-speech in Microsoft Azure Speech. It delivers over 700 high-quality voices with enhanced expressiveness, multilingual fluency, and multi-style control, all through a unified model built in Microsoft Foundry. It addresses common developer pain points such as unnatural prosody, limited language coverage, and heavy SSML tuning effort. The result is faster integration, richer user experiences, and production-ready voice output with minimal effort. Azure Speech offers a broad range of unique voices for applications like virtual agents, audiobooks, podcasts, and speech-to-speech tasks.

[Demo video]

700+ prebuilt voices

Dragon HD Omni offers a range of prebuilt voices with distinct personas and emotions, supporting use cases from agent-based applications to content creation and helping users enhance end-to-end applications.

Full update for previous generation voices

Dragon HD Omni merges a wide range of prebuilt voices into one model, improving contextual adaptation, prosody, and expression while keeping each voice's unique character. The result is more accurate, flexible, and lifelike speech for customer service, accessibility, and creative projects.

You can explore some of the voices in the voice list, such as:

- "en-US-Ava:DragonHDOmniLatestNeural"
- "en-US-Andrew:DragonHDOmniLatestNeural"
- "en-US-Dana:DragonHDOmniLatestNeural"
- "en-US-Caleb:DragonHDOmniLatestNeural"
- "zh-CN-Xiaoyue:DragonHDOmniLatestNeural"
- "zh-CN-Yunqi:DragonHDOmniLatestNeural"
- "en-US-Phoebe:DragonHDOmniLatestNeural"
- "en-US-Lewis:DragonHDOmniLatestNeural"

They are available to try directly in the Speech Playground in Microsoft Foundry. Alternatively, you can try the Omni version of an existing voice in a direct SSML call by replacing the trailing `Neural` in the voice name with the suffix `:DragonHDOmniLatestNeural`. For example:

    Previous neural voice    Omni version voice name
    de-DE-ConradNeural       de-DE-Conrad:DragonHDOmniLatestNeural

AI-Generated Voices

Dragon HD Omni now features nearly 300 brand-new AI-generated voices, designed to deliver a wide range of vocal diversity. These voices are built to give you choice, flexibility, and creative control, with variations across:

- Gender: male, female, and non-binary options
- Age: youthful, mature, and senior tones
- Pitch and tone: from warm and friendly to authoritative and professional

This expanded library means you can:

- Personalize experiences for different audiences, whether you're building an educational app, a customer support bot, or a storytelling platform.
- Strengthen brand identity by selecting voices that reflect your company's personality and values.
- Increase inclusivity with diverse vocal styles that resonate across cultures and communities.
- Unlock creativity by experimenting with unique voice personalities for podcasts, games, or immersive experiences.

    Speaker name             Description
    en-us-graphiterhodium    A bold and dramatic male voice
    en-us-olivepoivre        An adult female voice that is calm and soothing

Styles control

Standard Azure voices support only a limited set of styles because each style requires extensive tuning.
Dragon HD Omni introduces automatic style prediction using natural language descriptions, enabling advanced customization, broader style support, reduced cost, and improved expressiveness. In the initial release, styles are available for en-US-Ava and en-US-Andrew.

Supported styles

angry, chill surfer, confused, curious, determined, disgusted, embarrassed, emo teenager, empathetic, encouraging, excited, fearful, friendly, grateful, joyful, mad scientist, meditative, narration, neutral, new yorker, news, reflective, regretful, relieved, sad, santa, shy, soft voice, surprised

Note that the resulting style is strongly influenced by the input content.

SSML example

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural">
        <mstts:express-as style="cheerful">
          Wow! What an amazing day! I feel so full of energy, and everything around me seems brighter. My voice is bubbling with excitement, and I can’t stop smiling. I’m ready to take on anything that comes my way. Let’s celebrate this wonderful moment together!
        </mstts:express-as>
      </voice>
    </speak>

Multilingual and Accents

All Dragon HD Omni voices support multiple languages and can automatically detect the language of the input text and generate speech in it. You can also use the `<lang xml:lang>` tag to adjust the speaking language and accent, such as fr-FR for French or de-DE for German. For a comprehensive list of supported languages and the associated syntax and attributes, refer to the lang element.

SSML example

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural">
        <lang xml:lang="fr-FR">
          Bonjour ! Ce matin, j’ai pris un café au jardin du Luxembourg. Il faisait frais, mais très agréable. Ensuite, j’ai acheté une baguette et quelques macarons. Paris est vraiment charmant.
        </lang>
      </voice>
    </speak>

Word Boundary Event Support

Dragon HD Omni supports the word boundary event, which lets developers track the precise timing of each word as it is spoken. This is essential for applications that require word-level synchronization, such as karaoke, real-time captioning, or interactive voice experiences.

When the event fires, it provides:

- Text: the word spoken
- AudioOffset: the time offset in the audio stream. The SDK reports it in ticks (100-nanosecond units), so the sample below divides by 10,000 to get milliseconds.
- TextOffset: the position of the word in the input text

Example: Python sample using the word boundary event in the Azure Speech SDK

    import azure.cognitiveservices.speech as speechsdk

    def word_boundary_cb(evt):
        # audio_offset is in ticks (100 ns); dividing by 10,000 converts it to milliseconds.
        print(f"Word: '{evt.text}', AudioOffset: {evt.audio_offset / 10000}ms, TextOffset: {evt.text_offset}")

    # Replace the placeholders with your own Speech resource key and region.
    speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    # Subscribe before starting synthesis so that no events are missed.
    synthesizer.synthesis_word_boundary.connect(word_boundary_cb)

    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural">
        Hello Azure, welcome to Dragon HD Omni!
      </voice>
    </speak>
    """

    # Block until synthesis finishes; the callback prints each word as it arrives.
    result = synthesizer.speak_ssml_async(ssml).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Synthesis did not complete: {result.reason}")

Sample output:

    Word: 'Hello', AudioOffset: 110.0ms, TextOffset: 182
    Word: 'Azure', AudioOffset: 590.0ms, TextOffset: 188
    Word: ',', AudioOffset: 1110.0ms, TextOffset: 193
    Word: 'welcome', AudioOffset: 1270.0ms, TextOffset: 195
    Word: 'to', AudioOffset: 1750.0ms, TextOffset: 203
    Word: 'Dragon HD Omni', AudioOffset: 1910.0ms, TextOffset: 206
    Word: '!', AudioOffset: 2750.0ms, TextOffset: 216

Parameters

Dragon HD Omni supports advanced parameter tuning to help you customize voice output for different scenarios. This section explains each parameter in simple terms and recommends how to adjust it for your goals.
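
The tuning parameters in this section are passed as a semicolon-separated list of `key=value` pairs in the `parameters` attribute of the `<voice>` element. As a minimal sketch, the helper below checks each value against the ranges listed in this section's overview table and builds that attribute string. `build_parameters` and `PARAMETER_RANGES` are hypothetical names, not part of the Azure Speech SDK, and treating the range bounds as inclusive is an assumption the post does not state.

```python
# Ranges copied from the parameter overview table in this post.
# Treating both bounds as inclusive is an assumption.
PARAMETER_RANGES = {
    "temperature": (0.3, 1.0),
    "top_p": (0.3, 1.0),
    "top_k": (1, 50),
    "cfg_scale": (1.0, 2.0),
}


def build_parameters(**values):
    """Build the 'key=value;key=value' string for the voice element.

    Hypothetical helper for illustration; not part of the Azure Speech SDK.
    Raises ValueError for unknown names or out-of-range values.
    """
    parts = []
    for name, value in values.items():
        if name not in PARAMETER_RANGES:
            raise ValueError(f"Unknown parameter: {name}")
        low, high = PARAMETER_RANGES[name]
        if not low <= value <= high:
            raise ValueError(
                f"{name}={value} is outside the documented range {low} to {high}"
            )
        parts.append(f"{name}={value}")
    # Keyword arguments keep their call order, so the output order is predictable.
    return ";".join(parts)


if __name__ == "__main__":
    attr = build_parameters(top_p=0.8, top_k=22, temperature=0.7, cfg_scale=1.2)
    print(f'<voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="{attr}">')
```

Validating on the client side like this only catches typos and out-of-range values early; the service remains the authority on which values it accepts.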

Overview

    Parameter     Default   Range       Purpose
    temperature   0.7       0.3 – 1.0   Controls creativity vs. stability
    top_p         0.7       0.3 – 1.0   Filters output for diversity
    top_k         22        1 – 50      Limits the number of options considered
    cfg_scale     1.4       1.0 – 2.0   Adjusts relevance and speech speed

Tuning for Expressiveness vs. Stability

Higher values for temperature, top_p, and top_k result in more expressive, emotionally varied speech. Lower values produce more stable and predictable output.

Recommendation: To increase expressiveness, raise all three parameters together. Keep top_p equal to temperature for best results.

Tuning for Speed and Contextual Relevance

cfg_scale affects how quickly the voice speaks and how well it aligns with the context.

- Higher values (e.g., 1.8–2.0): faster speech, stronger contextual relevance.
- Lower values (e.g., 1.0–1.2): slower speech, less contextual alignment.

Suggested Tuning Strategies

    Goal                Suggested adjustment
    More expressive     Increase temperature, top_p, and top_k together
    More stable         Lower temperature first, then adjust top_p if needed
    Faster & relevant   Increase cfg_scale
    Slower & neutral    Decrease cfg_scale

The following examples show how to set these parameters in SSML.

Single parameter:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8">
        Hello Azure!
      </voice>
    </speak>

Multiple parameters:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8;top_k=22;temperature=0.7;cfg_scale=1.2">
        Hello Azure! Hello Azure!
      </voice>
    </speak>

Get Started

In our ongoing journey to enhance multilingual capabilities in text to speech (TTS) technology, we strive to deliver the best voices to empower your applications.
Our voices are designed to be adaptive, switching between languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, which makes them valuable for applications like language learning, travel guidance, and international business communication.

Microsoft offers an extensive portfolio of over 600 neural voices, covering more than 150 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, improving the conversational experience for users. With the Custom Neural Voice capability, businesses can also create unique and distinctive brand voices.

With these advancements, we continue to push the boundaries of what's possible in TTS technology, so that our users have access to versatile, high-quality voices for their needs.

For more information

- Try our demo to listen to existing neural voices
- Add Text to speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us: ttsvoicefeedback@microsoft.com
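
As a starting point, the naming convention and SSML elements described in this post can be combined in a small helper. The sketch below maps a previous-generation voice name to its Omni name and assembles an SSML document with optional `<lang>` and `<mstts:express-as>` wrappers. `to_omni_voice_name` and `build_ssml` are hypothetical names, not SDK functions, and nesting `<lang>` inside `<mstts:express-as>` when both are given is an assumption, since the post only shows them separately.

```python
from xml.sax.saxutils import escape, quoteattr

OMNI_SUFFIX = ":DragonHDOmniLatestNeural"


def to_omni_voice_name(neural_voice_name):
    """Map a name such as 'de-DE-ConradNeural' to its Omni version name.

    Hypothetical helper based on the naming convention described in this post.
    """
    if neural_voice_name.endswith(OMNI_SUFFIX):
        return neural_voice_name  # already an Omni voice name
    if ":" in neural_voice_name or not neural_voice_name.endswith("Neural"):
        raise ValueError(f"Expected a name like 'de-DE-ConradNeural': {neural_voice_name}")
    # Replace the trailing 'Neural' with the Omni suffix.
    return neural_voice_name[: -len("Neural")] + OMNI_SUFFIX


def build_ssml(text, voice_name, style=None, lang=None):
    """Return an SSML document for one voice, escaping the text for XML."""
    body = escape(text)
    if lang is not None:
        body = f"<lang xml:lang={quoteattr(lang)}>{body}</lang>"
    if style is not None:
        # Assumption: when both are set, <lang> is nested inside <mstts:express-as>.
        body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">'
        f"<voice name={quoteattr(voice_name)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    voice = to_omni_voice_name("de-DE-ConradNeural")
    print(voice)  # -> de-DE-Conrad:DragonHDOmniLatestNeural
    print(build_ssml("Bonjour ! Paris est vraiment charmant.", voice, lang="fr-FR"))
```

The resulting string can be passed to `speak_ssml_async`, as in the word boundary sample earlier in this post.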