Azure Speech continues to enhance its Neural HD voice portfolio, offering developers a broader range of choices in quality, expressiveness, performance, and regional coverage. These updates build on the existing Neural HD capabilities, making it easier for developers to select the ideal voice for scenarios ranging from highly expressive narration to low-latency, real-time interactions.
Neural HD 2.5 Update, Now Latest in Production: Enhanced Quality, Styles, and Paralinguistic Tags
Neural HD 2.5 delivers notable enhancements to existing HD voices, with an emphasis on achieving more natural prosody, enhanced expressiveness, and increased consistency—particularly when processing lengthy or complex material.
The update supports a range of speaking styles for English content and enables the integration of paralinguistic elements, contributing to more authentic conversational experiences. Enhanced style and metadata tags streamline the process of evaluating each voice's capabilities, facilitating the selection of the most appropriate options for applications such as virtual agents, narration, or expressive content creation.
In addition to SSML input, Styles and Paralinguistics can now be applied using text input as well. Please refer to the examples below.
Voice Test Results in English (US)
| Service | Female voice (MOS) | Male voice (MOS) |
| --- | --- | --- |
| Microsoft Neural HD | 3.99 | 3.94 |
| Service A | 3.75 | 3.99 |
| Service B | 3.66 | 3.67 |
| Service C | 3.59 | 3.89 |
A mean opinion score (MOS) evaluation was conducted with a panel of human judges across several domains, including Knowledge Sharing, Assistant, Customer Service, and Entertainment. As shown in the preceding table, Microsoft Neural HD received high and balanced scores for both female and male voices. Whereas some alternatives score well for only one gender, Microsoft Neural HD provides a uniform listening experience, making it an appropriate choice for production use across varied real-world applications.
List of supported styles:
`amazed`, `amused`, `angry`, `annoyed`, `anxious`, `appreciative`, `calm`, `cautious`, `concerned`, `confident`, `confused`, `curious`, `defeated`, `defensive`, `defiant`, `determined`, `disappointed`, `disgusted`, `doubtful`, `ecstatic`, `encouraging`, `excited`, `fast`, `fearful`, `frustrated`, `happy`, `hesitant`, `hurt`, `impatient`, `impressed`, `intrigued`, `joking`, `laughing`, `optimistic`, `painful`, `panicked`, `panting`, `pleading`, `proud`, `quiet`, `reassuring`, `reflective`, `relieved`, `remorseful`, `resigned`, `sad`, `sarcastic`, `secretive`, `serious`, `shocked`, `shouting`, `shy`, `skeptical`, `slow`, `struggling`, `surprised`, `suspicious`, `sympathetic`, `terrified`, `upset`, `urgent`, `whispering`
Note: Styles and paralinguistic tags are available on all HDLatestNeural voices except “en-IN-Arjun:DragonHDLatestNeural”, “en-IN-Aarti:DragonHDLatestNeural”, and “en-IN-Meera:DragonHDLatestNeural”.
SSML samples
| Input type | Sample |
| --- | --- |
| SSML input (`express-as` tag) | |
| SSML input (using brackets `[]`) | |
| Text input | |
List of supported paralinguistic tags:
`laughter`, `coughing`, `throat_clearing`, `breathing`, `sighing`, `yawning`
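Styles and paralinguistic tags can be combined when requests are built programmatically. The following Python sketch shows one way to do that; it is an illustration under stated assumptions, not an official client. It assumes that a style is applied with the `mstts:express-as` element, as the sample labels above suggest, and that a paralinguistic tag is written inline as the tag name in square brackets (for example `[laughter]`). The exact bracket syntax is not shown in this post, so verify it against the service documentation. The helper names `paralinguistic` and `build_styled_ssml` are hypothetical, the voice name in the usage lines is an assumed example, and the namespace URIs are copied from the multi-talker sample later in this post.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Subset of the styles listed above; extend with the full list as needed.
SUPPORTED_STYLES = {"amused", "calm", "excited", "sad", "whispering"}

# Paralinguistic tags listed above.
PARALINGUISTIC_TAGS = {
    "laughter", "coughing", "throat_clearing", "breathing", "sighing", "yawning",
}


def paralinguistic(tag: str) -> str:
    """Return an inline marker for a tag. The [tag] form is an assumption."""
    if tag not in PARALINGUISTIC_TAGS:
        raise ValueError(f"unsupported paralinguistic tag: {tag}")
    return f"[{tag}]"


def build_styled_ssml(voice: str, style: str, text: str, lang: str = "en-US") -> str:
    """Wrap text in an express-as element for the given voice and style."""
    if style not in SUPPORTED_STYLES:
        raise ValueError(f"unsupported style: {style}")
    ssml = (
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'version="1.0" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"
        "</mstts:express-as></voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


# Assumed example voice name; substitute any HDLatestNeural voice you use.
example = build_styled_ssml(
    "en-US-Ava:DragonHDLatestNeural",
    "excited",
    f"We shipped it & it works! {paralinguistic('laughter')}",
)
```

Escaping the text before embedding it keeps characters such as `&` from breaking the SSML, and rejecting unknown styles locally surfaces typos before a request is sent.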
Neural HD Omni: Enhanced Quality, Styles, and Paralinguistic tags
We are also updating Neural HD Omni, announced a few weeks ago, with improved overall quality and with style and paralinguistic tag support for all HD Omni voices.
List of supported styles:
`amazed`, `amused`, `angry`, `annoyed`, `anxious`, `appreciative`, `calm`, `cautious`, `concerned`, `confident`, `confused`, `curious`, `defeated`, `defensive`, `defiant`, `determined`, `disappointed`, `disgusted`, `doubtful`, `ecstatic`, `encouraging`, `excited`, `fast`, `fearful`, `frustrated`, `happy`, `hesitant`, `hurt`, `impatient`, `impressed`, `intrigued`, `joking`, `laughing`, `optimistic`, `painful`, `panicked`, `panting`, `pleading`, `proud`, `quiet`, `reassuring`, `reflective`, `relieved`, `remorseful`, `resigned`, `sad`, `sarcastic`, `secretive`, `serious`, `shocked`, `shouting`, `shy`, `skeptical`, `slow`, `struggling`, `surprised`, `suspicious`, `sympathetic`, `terrified`, `upset`, `urgent`
SSML samples
| Input type | Sample |
| --- | --- |
| SSML input | |
| Text input | |
List of supported paralinguistic tags:
`laughter`, `coughing`, `throat_clearing`, `breathing`, `sighing`, `yawning`
Neural HD Multi-Talker Voices: Expanded Language Support and Speakers
Neural HD Multi-Talker voices produce multi-speaker output within a single voice family, which makes it more efficient to create dynamic, immersive audio content without managing many separate voices.
This capability is ideally suited for applications such as dialogue creation, podcast production, role-based narration, and storytelling scenarios where clear speaker distinction and seamless conversational flow are essential. Multi-Talker voices are specifically designed to maintain superior audio quality throughout speaker transitions, effectively minimizing the complexity often associated with coordinating outputs involving multiple voices.
Previously, “en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural” and “en-US-MultiTalker-Ava-Steffan:DragonHDLatestNeural” were available in preview, featuring a fixed set of speakers and limited to en-US language support. The recent update broadens input text language compatibility beyond en-US to include fr-FR, es-ES, de-DE, it-IT, pt-BR, ko-KR, ja-JP, and zh-CN. Additionally, a newly introduced group of speakers is available under the model “en-MultiTalker-1:DragonHDLatestNeural” comprising:
| Gender | Speaker names |
| --- | --- |
| Female | "Ada", "Ava", "Emma", "Jane" |
| Male | "Andrew", "Brian", "Davis", "Steffan" |
Sample SSML
```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" version="1.0" xml:lang="en-US">
  <voice name="en-MultiTalker-1:DragonHDLatestNeural">
    <mstts:dialog>
      <mstts:turn speaker="emma">Andrew, before we get into today’s chat, I have to ask: did you do anything fun over the weekend?</mstts:turn>
      <mstts:turn speaker="andrew">Actually, yes. I spent most of Saturday at a local market, no big plans, just wandering around and trying way too much street food.</mstts:turn>
      <mstts:turn speaker="emma">That already sounds like a perfect weekend. Markets have the best energy. What ended up being your favorite find?</mstts:turn>
      <mstts:turn speaker="andrew">The fresh pastries, without a doubt. Nothing fancy, just warm, simple, and really comforting, one of those moments where you slow down and just enjoy it.</mstts:turn>
    </mstts:dialog>
  </voice>
</speak>
```
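For longer scripts, it can be easier to generate the dialog markup from data than to write it by hand. The following Python sketch builds the same `mstts:dialog` and `mstts:turn` structure as the sample above from a list of (speaker, text) pairs. It is a minimal illustration, not an official helper: the function name `build_dialog_ssml` is hypothetical, the speaker set is copied from the table above, and lowercasing speaker names follows the sample rather than any documented rule.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Speaker names from the table above, lowercased as in the sample SSML.
SPEAKERS = {"ada", "ava", "emma", "jane", "andrew", "brian", "davis", "steffan"}

# Model name as given in the text above.
DEFAULT_VOICE = "en-MultiTalker-1:DragonHDLatestNeural"


def build_dialog_ssml(turns, voice=DEFAULT_VOICE, lang="en-US"):
    """Build multi-talker SSML from an iterable of (speaker, text) pairs."""
    turn_xml = []
    for speaker, text in turns:
        key = speaker.lower()
        if key not in SPEAKERS:
            raise ValueError(f"unknown speaker: {speaker}")
        turn_xml.append(
            f"<mstts:turn speaker={quoteattr(key)}>{escape(text)}</mstts:turn>"
        )
    if not turn_xml:
        raise ValueError("at least one turn is required")
    ssml = (
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'version="1.0" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}><mstts:dialog>"
        + "".join(turn_xml)
        + "</mstts:dialog></voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


dialog = build_dialog_ssml([
    ("Emma", "Did you do anything fun over the weekend?"),
    ("Andrew", "I spent Saturday at a local market."),
])
```

Validating speaker names before the request is sent catches typos early, which matters because the whole dialog is synthesized with a single voice model.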
Neural HD Flash Voices: Low-Latency version of HD
Neural HD Flash introduces a new class of HD voices optimized for speed and responsiveness, particularly beneficial for scenarios where low latency is essential. We are introducing several more voices whose primary locale is US English (en-US); these voices are bilingual, supporting both en-US and zh-CN.
These HD Flash voices are engineered to deliver fast synthesis while maintaining the core Neural HD qualities of clear pronunciation and natural-sounding prosody. They are well-suited for use cases such as voice assistants, call center automation, and real-time speech-to-speech experiences, where responsiveness is crucial for user experience.
With HD Flash, developers can now choose between maximizing expressiveness with Neural HD and Neural HD Omni, or prioritizing faster response times depending on their application's requirements.
HD Flash Voices & Styles List
| Voice name | Supported styles |
| --- | --- |
| zh-CN-Xiaoxiao:DragonHDFlashLatestNeural | angry, chat, cheerful, customer-service, excited, fearful, sad, voice-assistant |
| zh-CN-Xiaoxiao2:DragonHDFlashLatestNeural | affectionate, angry, anxious, cheerful, curious, disappointed, empathetic, encouraging, excited, fearful, guilty, lonely, poetry-reading, sad, sentimental, sorry, story, surprised, tired, whispering |
| zh-CN-Xiaochen:DragonHDFlashLatestNeural | cheerful, debating, empathetic, live-commercial, poetry-reading, sad, sorry |
| zh-CN-Xiaoyi:DragonHDFlashLatestNeural | angry, complaining, cute, gentle, nervous, sad, shy, strict |
| zh-CN-Xiaoyu:DragonHDFlashLatestNeural | angry, debating, cheerful, comforting, sad, sorry |
| zh-CN-Xiaohan:DragonHDFlashLatestNeural | affectionate, angry, cheerful, complaining, fearful, gentle, sad, shy, strict |
| zh-CN-Xiaoshuang:DragonHDFlashLatestNeural | chat |
| zh-CN-Xiaoyou:DragonHDFlashLatestNeural | chat, angry, cheerful, poetry-reading, sad, story, cute |
| zh-CN-Yunxi:DragonHDFlashLatestNeural | angry, chat, cheerful, complaining, depressed, fearful, news, sad, shy, strict, voice-assistant |
| zh-CN-Yunyi:DragonHDFlashLatestNeural | assassin, captain, cavalier, prince, game-narrator, geomancer, poet |
| zh-CN-Yunxiao:DragonHDFlashLatestNeural | |
| zh-CN-Yunhan:DragonHDFlashLatestNeural | angry, cheerful, curious, empathetic, encouraging, excited, guilty, lonely, sad, serious, sorry, whispering, surprised, tired |
| zh-CN-Yunxia:DragonHDFlashLatestNeural | affectionate, angry, cheerful, comforting, encouraging, excited, fearful, sad, surprised |
| zh-CN-Yunye:DragonHDFlashLatestNeural | |
| en-US-Tiana:DragonHDFlashLatestNeural | |
| en-US-Tyler:DragonHDFlashLatestNeural | |
| en-US-Jimmie:DragonHDFlashLatestNeural | |
Note: For the HD Flash model, style support varies by voice.
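Because style support differs from one HD Flash voice to another, an application may want to check a requested style before sending it. The following Python sketch copies three rows of the table above into a dictionary and looks up a style per voice. The names `HD_FLASH_STYLES` and `resolve_style` are hypothetical, and returning `None` (meaning no style is applied) for an unlisted style is a design assumption, not documented service behavior.

```python
from typing import Optional

# Three rows copied from the table above; add the remaining voices as needed.
HD_FLASH_STYLES = {
    "zh-CN-Xiaoxiao:DragonHDFlashLatestNeural": {
        "angry", "chat", "cheerful", "customer-service",
        "excited", "fearful", "sad", "voice-assistant",
    },
    "zh-CN-Xiaoshuang:DragonHDFlashLatestNeural": {"chat"},
    # No styles are listed for this voice in the table.
    "en-US-Tiana:DragonHDFlashLatestNeural": set(),
}


def resolve_style(voice: str, requested: str) -> Optional[str]:
    """Return the requested style if the table lists it for this voice.

    Returns None when the style is not listed, so the caller can fall back
    to the voice's default delivery (an assumed fallback policy).
    """
    if voice not in HD_FLASH_STYLES:
        raise KeyError(f"voice not in local table: {voice}")
    return requested if requested in HD_FLASH_STYLES[voice] else None
```

For example, `resolve_style("zh-CN-Xiaoshuang:DragonHDFlashLatestNeural", "cheerful")` returns `None` because that voice lists only `chat`.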
Neural HD Regions Expansion
Starting in March 2026, Neural HD voices are rolling out to more regions. Previously available in `East US`, `West Europe`, and `Southeast Asia`, these voices are now also available in `West US 2`, `East US 2`, `Central India`, `Canada Central`, `France Central`, and `Sweden Central`. Refer to "Supported Regions for Azure Speech - Foundry Tools" on Microsoft Learn for the latest information.
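When an application targets one of these regions, the region usually appears in the service endpoint. The following Python sketch converts a region display name into the lowercase identifier form and builds a regional text-to-speech REST URL. The URL pattern is an assumption based on the commonly used regional endpoint format and is not stated in this post, so confirm it against the supported regions documentation. The helper names `region_id` and `tts_endpoint` are hypothetical.

```python
# Regions named in the paragraph above.
HD_REGIONS = [
    "East US", "West Europe", "Southeast Asia",
    "West US 2", "East US 2", "Central India",
    "Canada Central", "France Central", "Sweden Central",
]


def region_id(display_name: str) -> str:
    """Convert a display name such as 'West US 2' to 'westus2'."""
    return display_name.replace(" ", "").lower()


HD_REGION_IDS = {region_id(name) for name in HD_REGIONS}


def tts_endpoint(display_name: str) -> str:
    """Build a regional endpoint URL (assumed pattern, verify before use)."""
    rid = region_id(display_name)
    if rid not in HD_REGION_IDS:
        raise ValueError(f"region not in the Neural HD list: {display_name}")
    return f"https://{rid}.tts.speech.microsoft.com/cognitiveservices/v1"
```

Checking the region against the list before building the URL gives a clear local error instead of a failed request to a region where the voices are not yet deployed.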
Neural HD Pricing Update
Starting in March 2026, Neural HD voices are offered at a new rate of $22 per 1 million characters, reduced from the previous price of $30 per 1 million characters. This adjustment provides a more accessible and economical option for users integrating Neural HD into their solutions. Refer to "Pricing - Azure Speech in Foundry Tools" on the Microsoft Azure site for the latest information.
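To see what the new rate means for a given workload, the arithmetic can be sketched in a few lines of Python. This is a rough estimate only: it uses the two rates quoted above and assumes simple linear billing per character, ignoring any free tier, commitment discounts, or character-counting rules that the pricing page may define. The function names are hypothetical.

```python
OLD_RATE_USD = 30.0  # previous price per 1 million characters (from the text above)
NEW_RATE_USD = 22.0  # new price per 1 million characters (from the text above)


def cost_usd(characters: int, rate_per_million: float) -> float:
    """Estimate cost assuming linear per-character billing."""
    return characters / 1_000_000 * rate_per_million


def savings_percent(old_rate: float = OLD_RATE_USD, new_rate: float = NEW_RATE_USD) -> float:
    """Relative price reduction, in percent."""
    return (old_rate - new_rate) / old_rate * 100


# Example workload: 2.5 million characters per month.
monthly_characters = 2_500_000
old_cost = cost_usd(monthly_characters, OLD_RATE_USD)  # 75.0
new_cost = cost_usd(monthly_characters, NEW_RATE_USD)  # 55.0
```

Under these assumptions, the reduction works out to (30 - 22) / 30, or about 26.7 percent, regardless of volume.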
Getting Started with Neural HD Voices
Begin exploring the latest Neural HD voices in Azure Speech to find the right mix of quality, performance, and expressiveness for your applications. As part of our ongoing commitment to advancing multilingual text-to-speech (TTS) technology, we strive to deliver adaptive voices that can seamlessly switch languages based on text input. These voices offer natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication.
Microsoft's extensive portfolio features over 600 neural voices covering more than 150 languages and locales. These TTS voices enable rapid addition of read-aloud features for accessible app design or provide voices for chatbots to enhance conversational experiences. Through the Custom Neural Voice capability, businesses can also develop unique and distinctive brand voices with ease.
With these innovations, we continue to push the boundaries of TTS technology, ensuring users have access to the most flexible and high-quality voices available.
Additional Resources and Next Steps
- Try our demo to listen to existing neural voices
- Add text-to-speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us at ttsvoicefeedback@microsoft.com