New HD voices preview in Azure AI Speech: contextual and realistic output evolved

Microsoft

Sep 30, 2024

Our commitment to improving Azure AI Speech voices is unwavering, as we consistently work towards making them more expressive and engaging. Today, we are thrilled to announce a new and improved HD version of our neural text-to-speech service for selected voices. This new version further enhances overall expressiveness by incorporating emotion detection based on the context of the input.

The innovative technology behind these voices uses acoustic and linguistic features to generate speech filled with rich, natural variations. It can adeptly detect emotional cues in the text and autonomously adjust the voice's tone and style. With this upgrade, you can expect a more human-like speech pattern characterized by improved intonation, rhythm, and emotion.

What is new?

Auto-regressive transformer language models have recently demonstrated remarkable efficacy in modeling tasks across text, vision, and speech. We are now introducing new HD voices powered by a language model-based architecture. These new HD voices are designed to speak in the timbre of the selected platform voice. They also offer the following benefits:

Human-like speech generation: Our model not only interprets the input text accurately but also understands the underlying sentiment, automatically adjusting the speaking tone to match the emotion conveyed. This dynamic adjustment happens in real-time, without the need for manual editing, ensuring that each generated output is contextually appropriate and distinct.
Conversational: The new model excels at replicating natural speech patterns, including spontaneous pauses and emphasis. When given conversational text, it faithfully reproduces common conversational features such as pauses and filler words. Instead of sounding like written text being read aloud, the generated voice feels as if someone is conversing directly with you.
Prosody variations: Human voices naturally exhibit variation; no two sentences spoken by a person sound exactly alike. The new system enhances realism by introducing slight variations in each output, making the speech sound even more natural.

Voice demos

HD voices come with a base model that understands the input text and predicts the speaking pattern accordingly. Check out samples below for a list of HD voices available, based on the ‘DragonHDLatestNeural’ model.

Voice name	Script
de-DE-Seraphina:DragonHDLatestNeural	Willkommen zu unserem Lernmodul über Safari-Ökosysteme. Safaris sind lebendige Ökosysteme, die eine Fülle bemerkenswerter Tiere beheimaten. Von den geschickten Raubtieren wie Löwen und Geparden bis hin zu sanften Riesen wie Elefanten und Giraffen – diese Lebensräume bieten eine beeindruckende Artenvielfalt. Nashörner und Zebras leben hier Seite an Seite mit Gnus und bilden eine einzigartige Gemeinschaft. In diesem Modul erforschen wir ihre faszinierenden Anpassungen und das empfindliche Geflecht des Zusammenlebens, das sie erhält.
en-US-Andrew:DragonHDLatestNeural	Welcome to Tech Talks & Chill, the podcast where we keep it casual while diving into the coolest stuff happening in the tech world. Whether it's the latest in AI, gadgets that are changing the game, or the software shaping the future, we’ve got it covered. Each week, we’ll hang out with experts, geek out over new breakthroughs, and swap stories about the people pushing tech forward. So grab a coffee, kick back, and join us as we chat all things tech—no jargon, no stress, just good conversation with friends.
en-US-Andrew2:DragonHDLatestNeural	...and that scene alone makes the movie worth watching. Oh, and if you’re just tuning in, welcome! We’re breaking down The Midnight Chase today, and I’ve got to say—it’s one of the best thrillers I’ve seen this year. The pacing? Perfect. The lead actor? Absolutely nailed it. There’s this one moment, no spoilers, but the tension is so thick you can almost feel it. And the cinematography? Stunning—especially the way they use lighting to build suspense. If you’re into edge-of-your-seat action with a solid storyline, this is definitely one to check out. Stay with us, I’ll be diving deeper into why this one stands out from other thrillers!
en-US-Aria:DragonHDLatestNeural	As you complete the inspection, take clear and comprehensive notes. Use our standardized checklist as a guide, noting any deviations or areas of concern. If possible, take photographs to visually document any hazards or non-compliance issues. These notes and visuals will serve as evidence of your inspection findings. When you compile your report, include these details along with recommendations for corrective actions.
en-US-Ava:DragonHDLatestNeural	Ladies, it’s time for some self-pampering! Treat yourself to a moment of bliss with our exclusive Winter Spa Package. Indulge in a rejuvenating spa day like never before, and let your worries melt away. We’re excited to offer you a limited-time sale, making self-care more affordable than ever. Elevate your well-being, embrace relaxation, and step into a world of tranquility with us this Winter.
en-US-Davis:DragonHDLatestNeural	Unlock an exclusive golfing paradise at Hole 1 Golf, with our limited-time sale! For a short period, enjoy unbeatable deals on memberships, rounds, and golf gear. Swing into savings, elevate your game, and make the most of this incredible offer. Don’t miss out; tee off with us today and seize the opportunity to elevate your golf experience!
en-US-Emma:DragonHDLatestNeural	Imagine waking up to the sound of gentle waves and the warm Italian sun kissing your skin. At Bella Vista Resort, your dream holiday awaits! Nestled along the stunning Amalfi Coast, our luxurious beachfront resort offers everything you need for the perfect getaway. Indulge in spacious, elegantly designed rooms with breathtaking sea views, relax by our infinity pool, or savor authentic Italian cuisine at our on-site restaurant. Explore picturesque villages, soak up the sun on pristine sandy beaches, or enjoy thrilling water sports—there’s something for everyone! Join us for unforgettable sunsets and memories that will last a lifetime. Book your stay at Bella Vista Resort today and experience the ultimate sunny beach holiday in Italy!
en-US-Emma2:DragonHDLatestNeural	...and that’s when I realized how much living abroad teaches you outside the classroom. Oh, and if you’re just joining us, welcome! We’ve been talking about studying abroad, and I was just sharing this one story—my first week in Spain, I thought I had the language down, but when I tried ordering lunch, I panicked and ended up with callos, which are tripe. Not what I expected! But those little missteps really helped me get more comfortable with the language and culture. Anyway, stick around, because next I’ll be sharing some tips for adjusting to life abroad!
en-US-Jenny:DragonHDLatestNeural	Turning to international news, NASA’s recent successful mission to send a rover to explore Mars has captured the world’s attention. The rover, named ‘Perseverance,’ touched down on the Martian surface earlier this week, marking a historic achievement in space exploration. It’s equipped with cutting-edge technology and instruments to search for signs of past microbial life and gather data about the planet’s geology.
en-US-Steffan:DragonHDLatestNeural	By activating ‘Auto-Tagging,’ your productivity soars as it seamlessly locates and retrieves vital information within seconds, eliminating the need for time-consuming tasks. This intuitive feature not only understands your content but also empowers you to concentrate on what truly matters. To enable ‘Auto-Tagging,’ simply navigate to the settings menu and toggle the feature on for hassle-free organization.
ja-JP-Masaru:DragonHDLatestNeural	今日のテーマは、日本料理の魅力です。今聞いている方も、いらっしゃいませ！まずは天ぷらについて話しましょう。外はサクサク、中はふんわりとした食感が特徴で、旬の野菜や新鮮な魚介類を使うことで、その味が引き立ちます。次にお寿司も忘れてはいけません。新鮮なネタとシャリの絶妙なバランスはシンプルながら奥が深く、各地域の特産品を使ったお寿司も楽しめます。旅をするたびに新しい発見があるのも、日本料理の楽しみの一つです。この後は、各地の郷土料理についてもお話ししますので、ぜひ最後までお付き合いください！
zh-CN-Xiaochen:DragonHDLatestNeural	最近我真的越来越喜欢探索各种美食了！你知道吗，我特别喜欢尝试不同国家的菜肴，每次都有新的惊喜。上周我去了一家意大利餐厅，他们的披萨简直太好吃了，薄脆的饼底搭配上新鲜的番茄酱和浓郁的奶酪，每一口都充满了满足感。尤其是那种在嘴里融化的感觉，真的让人欲罢不能。当然，我也特别喜欢中餐，无论是火锅还是川菜，那种麻辣鲜香的味道总是让我停不下来。尤其是和朋友一起吃火锅，边吃边聊，感觉特别温馨。还有一些更独特的尝试，比如最近吃了印度的咖喱，虽然开始有点不习惯那种浓烈的香料味，但后来慢慢品味，竟然觉得很有层次感，很丰富。每次尝试新菜，我都觉得像是在探索一段新的旅程，不知道下一口会带来什么样的体验。

Note:

The new voices are currently not listed in the Voice List API or documentation. However, you can access and use them by directly calling the SSML template. We will update the API and documentation with more details in the near future.
These HD voices are implemented based on the latest base model: DragonHDLatestNeural. The name before the colon, e.g., en-US-Andrew, is the voice persona name and its original locale. The base model will be tracked by versions in future updates.
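
To illustrate the naming convention described in the note, here is a minimal Python sketch that splits a full HD voice name into its parts. The helper `parse_hd_voice_name` and the `HDVoiceName` type are hypothetical names used only for illustration, and the sketch assumes the locale always has the two-part language-REGION form seen in the voice list above.

```python
from typing import NamedTuple


class HDVoiceName(NamedTuple):
    """Parts of a name such as 'en-US-Andrew:DragonHDLatestNeural'."""
    locale: str      # original locale, e.g. "en-US"
    persona: str     # voice persona, e.g. "Andrew"
    base_model: str  # base model, e.g. "DragonHDLatestNeural"


def parse_hd_voice_name(full_name: str) -> HDVoiceName:
    """Split 'locale-Persona:BaseModel' into locale, persona, and base model."""
    # Everything before the colon is the persona name with its locale;
    # everything after it is the base model.
    persona_name, sep, base_model = full_name.partition(":")
    if not sep or not base_model:
        raise ValueError(f"expected 'voicename:basemodel', got {full_name!r}")

    # Assumption: the locale is exactly two hyphen-separated fields
    # (language-REGION), and the remainder is the persona.
    parts = persona_name.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 'language-REGION-Persona', got {persona_name!r}")
    language, region, persona = parts
    return HDVoiceName(f"{language}-{region}", persona, base_model)


voice = parse_hd_voice_name("en-US-Andrew2:DragonHDLatestNeural")
print(voice.locale, voice.persona, voice.base_model)
```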

Content Creation Demo

In a hectic work environment with many documents to read, converting them into podcasts for on-the-go listening can be beneficial. Here is a demo that uses Azure OpenAI GPT-4o and HD voices to create podcast content from a PDF document. The same idea can be applied to other documents, such as web pages and Word documents.

The main steps of the demo are as follows:

Use Azure OpenAI GPT-4o to summarize a lengthy document.
Create a conversational podcast script.
Convert the script into audio featuring two hosts using Azure HD voices.

Check out sample code on GitHub
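
As a rough illustration of the last step (this is not the published sample), the Python sketch below turns a two-host script into a single SSML document with one voice element per turn. The function name `script_to_ssml`, the host labels, and the host-to-voice mapping are all hypothetical, and the sketch assumes HD voices can be mixed within one SSML document in the same way as other neural voices.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from script speaker labels to HD voices.
HOST_VOICES = {
    "Host A": "en-US-Andrew2:DragonHDLatestNeural",
    "Host B": "en-US-Emma2:DragonHDLatestNeural",
}


def script_to_ssml(turns, host_voices=HOST_VOICES):
    """Build one SSML document from (speaker, line) pairs."""
    voice_blocks = []
    for speaker, line in turns:
        voice = host_voices[speaker]  # raises KeyError for an unknown speaker
        # Escape &, <, > so script text cannot break the XML.
        voice_blocks.append(f"<voice name='{voice}'>{escape(line)}</voice>")
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>\n"
        + "\n".join(voice_blocks)
        + "\n</speak>"
    )


turns = [
    ("Host A", "Welcome back! Today we are summarizing a long report."),
    ("Host B", "Right, and the first section covers R&D budgets."),
]
print(script_to_ssml(turns))
```

In a full pipeline, the `turns` list would come from the script generated in the second step, and the resulting SSML would be sent to the speech service.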

Avatar Chat demo

HD voices can also be utilized with Azure Speech to text and TTS avatars, as well as GPT-4o in real-time full duplex conversations. This technology can enhance the interactive experience for customer service chatbots, among other applications. We have published an avatar demo below:

It can support continuous conversations between the user and bot.
It can support user interruption when the bot is speaking.
It can achieve low end-to-end latency by following best practices with Azure OpenAI GPT-4o and speech services.

Check out sample code on GitHub

How to use

You can start using HD voices with the same speech synthesis SDK and REST APIs as non-HD voices. Follow the quickstart to learn how to synthesize speech with the SDK, or see the REST API documentation.

Voice Locale: The locale in the voice name indicates its original language and region.
Base Models: The current base model is DragonHDv1Neural. The latest version, DragonHDLatest, will be implemented once available. As more versions are introduced, you can specify the desired model (e.g., DragonHDv2Neural) according to the availability of each voice.
SSML Usage: To reference a voice in SSML, use the format voicename:basemodel.
Temperature Parameter: The temperature is a float from 0 to 1 that controls the randomness, and therefore the variation, of the output.
- Lower Temperature: Less randomness, giving more stable and predictable outputs.
- Higher Temperature: More randomness, giving more diverse but less consistent outputs.
- The default temperature is 1.0.
Availability: HD voices are currently in preview and will be accessible in three regions: East US, West Europe, and Southeast Asia.
Pricing: The cost for HD voices is $30 per 1 million characters.
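
For budgeting, the listed rate can be turned into a quick estimate. The Python sketch below is only an approximation: the helper `estimate_hd_cost` is a hypothetical name, and it assumes you already know the number of billable characters, since the exact counting rules are not covered here.

```python
from decimal import Decimal

# USD per 1 million characters, taken from the pricing line above.
HD_PRICE_PER_MILLION_CHARS = Decimal("30")


def estimate_hd_cost(char_count: int) -> Decimal:
    """Estimate the cost in USD of synthesizing char_count characters."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    cost = HD_PRICE_PER_MILLION_CHARS * char_count / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"))  # round to whole cents


# A 50,000-character script at $30 per million characters.
print(estimate_hd_cost(50_000))  # 1.50
```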

Example SSML:

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
<voice name='en-US-Ava:DragonHDLatestNeural' parameters='temperature=0.8'>Here is a test</voice>
</speak>
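
The same SSML can be assembled in code. The Python sketch below builds the voicename:basemodel reference with an optional temperature, then prepares (but does not send) a REST request. The function names are hypothetical. The endpoint and headers follow the general Speech text-to-speech REST API rather than anything specific to HD voices, so please verify them against the current documentation; the region identifier and key are placeholders.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_hd_ssml(voice, base_model, text, temperature=None, lang="en-US"):
    """Return SSML that references an HD voice as 'voicename:basemodel'."""
    attrs = f"name='{voice}:{base_model}'"
    if temperature is not None:
        # The post describes temperature as a float ranging from 0 to 1.
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("temperature must be between 0 and 1")
        attrs += f" parameters='temperature={temperature}'"
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        f"xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='{lang}'>"
        f"<voice {attrs}>{escape(text)}</voice></speak>"
    )


def build_tts_request(region, key, ssml):
    """Prepare a text-to-speech REST request without sending it."""
    # Assumption: the standard regional TTS endpoint is used for HD voices.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "hd-voice-sketch",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


ssml = build_hd_ssml("en-US-Ava", "DragonHDLatestNeural", "Here is a test", 0.8)
request = build_tts_request("eastus", "YOUR_SPEECH_KEY", ssml)
# To synthesize, send the request and save the returned audio bytes:
# with urllib.request.urlopen(request) as response:
#     audio = response.read()
```

Omitting the temperature leaves the parameters attribute out entirely, so the service default applies.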

Get started

In our ongoing quest to enhance multilingual capabilities in text-to-speech (TTS) technology, our goal is to bring the best voices to our product. Our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication.

Microsoft offers over 500 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice.

With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.