Azure AI Speech
Guidebook to reduce latency for Azure Speech-To-Text (STT) and Text-To-Speech (TTS) applications
Are You Tired of Waiting? How to Drastically Reduce Latency in Speech Recognition and Synthesis

In the fast-paced world of technology, every second counts, especially when it comes to speech recognition and synthesis. Latency can be a deal-breaker, turning an otherwise seamless interaction into a frustrating wait. But what if there were proven strategies to not only tackle but significantly reduce this delay, enhancing user experience and application performance?

In our latest blog post, we dive deep into the world of speech technology, uncovering practical solutions to minimize latency across various domains: from general and real-time transcription to file transcription and speech synthesis. Whether you're dealing with network latency, aiming for instant feedback in real-time transcription, or striving for quicker file processing and more responsive speech synthesis, this post has you covered. With actionable tips and code snippets, you'll learn how to streamline your speech technology applications, ensuring they're not just functional, but lightning-fast.
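The guidebook itself walks through the concrete tuning steps; as a quick, standalone illustration of the kind of measurement it is concerned with, here is a minimal sketch (not code from the post) that times how long the Speech SDK takes to return the first synthesized audio chunk. The subscription key, region, and voice name are placeholders to replace with your own resource values.

import time
import azure.cognitiveservices.speech as speechsdk

# Placeholder resource values -- substitute your own key, region, and voice.
speech_config = speechsdk.SpeechConfig(subscription="<SPEECH_KEY>", region="<SPEECH_REGION>")
speech_config.speech_synthesis_voice_name = "en-US-AvaNeural"
# audio_config=None keeps the audio in memory so the measurement reflects service latency.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

start = time.perf_counter()
first_chunk_time = None

def on_synthesizing(evt):
    # The synthesizing event fires for each streamed audio chunk;
    # the first one approximates "time to first byte".
    global first_chunk_time
    if first_chunk_time is None:
        first_chunk_time = time.perf_counter()

synthesizer.synthesizing.connect(on_synthesizing)
result = synthesizer.speak_text_async("Hello! How can I help you today?").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted and first_chunk_time:
    print(f"First audio chunk arrived after {(first_chunk_time - start) * 1000:.0f} ms")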
Creating Intelligent Video Summaries and Avatar Videos with Azure AI Services

Unlock the true value of your organization's video content! In this post, I share how we built an end-to-end AI video analytics platform using Microsoft Azure. Discover how AI can automate video analysis, generate intelligent summaries, and create engaging avatar presentations—making content more accessible, actionable, and impactful for everyone. If you're interested in digital transformation, AI-powered automation, or modern content management, this is for you!
My Journey of Building a Voice Bot from Scratch

My Journey in Building a Voice Bot for Production

The world of artificial intelligence is buzzing with innovations, and one of its most captivating branches is the development of voice bots. These digital entities have the power to transform user interactions, making them more natural and intuitive. In this blog post, I want to take you on a journey through my experience of building a voice bot from scratch using Azure's cutting-edge technologies: OpenAI GPT-4o-Realtime, Azure Text-to-Speech (TTS), and Speech-to-Text (STT).

Key Features for Building an Effective Voice Bot

- Natural Interaction: A voice agent's ability to converse naturally is paramount. The goal is to create interactions that mirror human conversation, avoiding robotic or scripted responses. This naturalism fosters user comfort, leading to a more seamless, engaging experience.
- Context Awareness: True sophistication in a voice agent comes from its ability to understand context and retain information. This capability allows it to provide tailored responses and actions based on user history, preferences, and specific queries.
- Multi-Language Support: One of the significant hurdles in developing a comprehensive voice agent lies in the need for multi-language support. As brands cater to diverse markets, ensuring clear and contextually accurate communication across languages is vital.
- Real-time Processing: The real-time capabilities of voice agents allow for immediate responses, enhancing the customer experience. This feature is crucial for tasks like booking, purchasing, and inquiries where time sensitivity matters.

Furthermore, there are immense opportunities available. When implemented successfully, a robust voice agent can revolutionize customer engagement. Consider a scenario where a business uses an AI-driven voice agent to reach out to potential customers in a marketing campaign. This approach can greatly enhance efficiency, allowing the business to manage high volumes of prospects and providing a vastly improved return on investment compared to traditional methods.

Before diving into the technicalities, it's crucial to have a clear vision of what you want to achieve with your voice bot. For me, the goal was to create a bot that could engage users in seamless conversations, understand their needs, and provide timely responses. I envisioned a bot that could be integrated into various platforms, offering flexibility and adaptability.

Azure provides a robust suite of tools for AI development, and choosing it was an easy decision due to its comprehensive offerings and strong integration capabilities. Here's how I began:

- Text-to-Speech (TTS): This service converts the bot's text responses into human-like speech. Azure TTS offers a range of customizable voices, allowing me to choose one that matched the bot's personality.
- Speech-to-Text (STT): To understand user inputs, the bot needed to convert spoken language into text. Azure STT was instrumental in achieving this, providing real-time transcription with high accuracy.
- Foundational Model: This refers to a large language model (LLM) that powers the bot's understanding of language and generation of text responses. Examples of foundational models include GPT-4, a powerful LLM developed by OpenAI, capable of generating human-quality text, translating languages, writing different kinds of creative content, and answering questions in an informative way.
- Foundation Speech-to-Speech Model: This refers to a model that maps speech directly to speech, for example translating from one language to another, without the need for text as an intermediate step. Such a model could be used for real-time translation or for generating speech in a language different from the input language.

As voice technology continues to evolve, different types of voice bots have emerged to cater to varying user needs. In this analysis, we will explore three prominent types: Voice Bot Duplex, GPT-4o-Realtime, and GPT-4o-Realtime + TTS. This detailed comparison will cover their architecture, strengths, weaknesses, best practices, challenges, and potential opportunities for implementation.

Type 1: Voice Bot Duplex

Duplex Bot is an advanced AI system that conducts phone conversations and completes tasks using Voice Activity Detection (VAD), Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS). Azure's automatic speech recognition (ASR) technology turns spoken language into text. This text is analyzed by an LLM to generate responses, which are then converted back to speech by Azure Text-to-Speech (TTS). Duplex Bot can listen and respond simultaneously, improving interaction fluidity and reducing response time. This integration enables Duplex to autonomously manage tasks like booking appointments with minimal human intervention. (A minimal single-turn sketch of this STT → LLM → TTS loop appears after the Type 2 description below.)

- Strengths:
  - Low operational cost.
  - Suitable for straightforward use cases with basic conversational requirements.
  - Easily customizable on both the STT and TTS side.
- Weaknesses:
  - Complex architecture with multiple processing hops, making it difficult to implement.
  - Higher latency compared to advanced models, limiting real-time capabilities.
  - Limited ability to perform complex actions or maintain context over longer conversations.
  - Does not capture human emotion from the speech.
  - Switching languages during the conversation is difficult; you have to choose the language beforehand for better output.

Type 2: GPT-4o-Realtime

GPT-4o-Realtime-based voice bots are the simplest to implement because they use a foundational speech model: one that takes speech as input and generates speech as output, without the need for text as an intermediate step. The architecture is very simple: the speech byte array goes directly to the foundational speech model, which processes it, reasons over it, and responds with speech as a byte array.

- Strengths:
  - Simplest architecture with no processing hops, making it easier to implement.
  - Low latency and high reliability.
  - Suitable for use cases with complex conversational requirements.
  - Switching between languages during the conversation is very easy.
  - Captures the emotion of the user.
- Weaknesses:
  - High operational cost.
  - You cannot customize the synthesized voice.
  - You cannot add business-specific abbreviations for the model to handle separately.
  - Hallucinates frequently on numeric input; if you say 123456, the model sometimes hears 123435.
  - Support for different languages may be an issue, as there is no official documentation of language-specific support.
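To make the Type 1 flow concrete, here is a minimal, single-turn sketch of the STT → LLM → TTS loop using the Azure Speech SDK and an Azure OpenAI chat deployment. The environment variable names, the gpt-4o deployment name, and the API version are placeholders; a production Duplex-style bot would stream audio and run these stages concurrently rather than strictly turn by turn.

import os
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI

# Placeholder settings -- adjust to your own resources and deployment names.
speech_config = speechsdk.SpeechConfig(subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"])
speech_config.speech_recognition_language = "en-US"
speech_config.speech_synthesis_voice_name = "en-US-AvaNeural"

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)    # default microphone
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)  # default speaker

llm = AzureOpenAI(
    api_key=os.environ["AOAI_KEY"],
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    api_version="2024-06-01",  # assumed API version
)

print("Say something...")
stt_result = recognizer.recognize_once()                 # 1. speech -> text
if stt_result.reason == speechsdk.ResultReason.RecognizedSpeech:
    completion = llm.chat.completions.create(            # 2. text -> LLM response
        model="gpt-4o",                                  # assumed deployment name
        messages=[
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": stt_result.text},
        ],
    )
    reply = completion.choices[0].message.content
    synthesizer.speak_text_async(reply).get()            # 3. response text -> speech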
Type 3: GPT-4o-Realtime + TTS

Like Type 2, this approach is built on GPT-4o-Realtime, a foundational speech model that takes speech as input and generates speech as output without text as an intermediate step. But if you want to customize the speech synthesis, there are no fine-tuning options for the model's built-in voice. Hence, we came up with an option where we plugged GPT-4o-Realtime into Azure TTS, which provides advanced voice options such as built-in neural voices covering a range of Indic languages, and also lets you fine-tune a custom neural voice (CNV).

Custom neural voice (CNV) is a text to speech feature that lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom neural voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as training data. Out of the box, text to speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work well in most text to speech scenarios if a unique voice isn't required. Custom neural voice is based on the neural text to speech technology and the multilingual, multi-speaker, universal model. You can create synthetic voices that are rich in speaking styles, or adaptable across languages. The realistic and natural-sounding voice of custom neural voice can represent brands, personify machines, and allow users to interact with applications conversationally. See the supported languages for custom neural voice.

- Strengths:
  - Simple architecture with only one processing hop, making it easier to implement.
  - Low latency and high reliability.
  - Suitable for use cases with complex conversational requirements and a customized voice.
  - Switching between languages during the conversation is very easy.
  - Captures the emotion of the user.
- Weaknesses:
  - High operational cost, though still lower than GPT-4o-Realtime alone.
  - You cannot add business-specific abbreviations for the model to handle separately.
  - Hallucinates frequently on numeric input; if you say 123456, the model sometimes hears 123435.
  - Does not support custom phrases.
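Here is a minimal sketch of the synthesis leg of this Type 3 setup, assuming the realtime model has been configured to return text for each turn. The environment variables, the optional CNV deployment ID, and the voice names are placeholders; in practice you would stream the model's partial text into synthesis to keep latency low.

import os
import html
import azure.cognitiveservices.speech as speechsdk

# Placeholder values -- the CNV deployment ID and voice names below are illustrative only.
speech_config = speechsdk.SpeechConfig(subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"])

voice_name = "en-US-AvaNeural"                      # prebuilt neural voice by default
cnv_deployment_id = os.environ.get("CNV_DEPLOYMENT_ID")
if cnv_deployment_id:
    speech_config.endpoint_id = cnv_deployment_id   # point the config at your CNV deployment
    voice_name = "MyBrandVoiceNeural"               # hypothetical custom neural voice name
speech_config.speech_synthesis_voice_name = voice_name

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def speak_reply(text: str) -> None:
    # Synthesize one text reply produced by the realtime model.
    ssml = (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>"
        f"<voice name='{voice_name}'>{html.escape(text)}</voice>"
        "</speak>"
    )
    result = synthesizer.speak_ssml_async(ssml).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Synthesis did not complete:", result.reason)

speak_reply("Sure, I can help you track that order.")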
Conclusion

Building a voice bot is an exciting yet challenging journey. As we've seen, leveraging Azure's advanced tools like GPT-4o-Realtime, Text-to-Speech, and Speech-to-Text can provide the foundation for creating a voice bot that understands, engages, and responds with human-like fluency. Throughout this journey, key aspects like natural interaction, context awareness, multi-language support, and real-time processing were vital in ensuring the bot's effectiveness across various scenarios.

While each voice bot model, from Voice Bot Duplex to GPT-4o-Realtime and GPT-4o-Realtime + TTS, offers its strengths and weaknesses, they all highlight the importance of carefully considering the specific needs of the application. Whether aiming for simple conversations or more sophisticated interactions, the choice of model will directly impact the bot's performance, cost, and overall user satisfaction.

Looking ahead, the potential for AI-driven voice bots is immense. With ongoing advancements in AI, voice bots are bound to become even more integrated into our daily lives, transforming the way we interact with technology. As this field continues to evolve, the combination of innovative tools and strategic thinking will be key to developing voice bots that not only meet but exceed user expectations.

My Previous Blog: From Zero to Hero: Building Your First Voice Bot with GPT-4o Real-Time API using Python
Github Link: https://github.com/monuminu/rag-voice-bot

Explore Azure AI Services: Curated list of prebuilt models and demos

Unlock the potential of AI with Azure's comprehensive suite of prebuilt models and demos. Whether you're looking to enhance speech recognition, analyze text, or process images and documents, Azure AI services offer ready-to-use solutions that make implementation effortless. Explore the diverse range of use cases and discover how these powerful tools can seamlessly integrate into your projects. Dive into the full catalogue of demos and start building smarter, AI-driven applications today.

New HD voices preview in Azure AI Speech: contextual and realistic output evolved
Our commitment to improving Azure AI Speech voices is unwavering, as we consistently work towards making them more expressive and engaging. Today, we are thrilled to announce a new and improved HD version of our neural text to speech service for selected voices. This new version further enhances the overall expressiveness, incorporating emotion detection based on the context of the input. Built on innovative technology that uses acoustic and linguistic features, it generates speech filled with rich, natural variations, adeptly detects emotional cues in the text, and autonomously adjusts the voice's tone and style. With this upgrade, you can expect a more human-like speech pattern characterized by improved intonation, rhythm, and emotion.

What is new?

Auto-regressive transformer language models have recently demonstrated remarkable efficacy in modelling tasks including text, vision, and speech. We are now introducing new HD voices powered by a language model-based architecture. These new HD voices are designed to speak in the selected platform voice timbre, and they also provide some extra value:

- Human-like speech generation: Our model not only interprets the input text accurately but also understands the underlying sentiment, automatically adjusting the speaking tone to match the emotion conveyed. This dynamic adjustment happens in real-time, without the need for manual editing, ensuring that each generated output is contextually appropriate and distinct.
- Conversational: The new model excels at replicating natural speech patterns, including spontaneous pauses and emphasis. When given conversational text, it faithfully reproduces common phenomena like pauses and filler words. Instead of sounding like a reading of written text, the generated voice feels as if someone is conversing directly with you.
- Prosody variations: Human voices naturally exhibit variation. Every sentence spoken by a human won't be the same as any previously spoken ones. The new system enhances realism by introducing slight variations in each output, making the speech sound even more natural.

Voice demos

HD voices come with a base model that understands the input text and predicts the speaking pattern accordingly. Check out the sample scripts below for the list of HD voices available, based on the 'DragonHDLatestNeural' model.

- de-DE-Seraphina:DragonHDLatestNeural: Willkommen zu unserem Lernmodul über Safari-Ökosysteme. Safaris sind lebendige Ökosysteme, die eine Fülle bemerkenswerter Tiere beheimaten. Von den geschickten Raubtieren wie Löwen und Geparden bis hin zu sanften Riesen wie Elefanten und Giraffen – diese Lebensräume bieten eine beeindruckende Artenvielfalt. Nashörner und Zebras leben hier Seite an Seite mit Gnus und bilden eine einzigartige Gemeinschaft. In diesem Modul erforschen wir ihre faszinierenden Anpassungen und das empfindliche Geflecht des Zusammenlebens, das sie erhält.
- en-US-Andrew:DragonHDLatestNeural: Welcome to Tech Talks & Chill, the podcast where we keep it casual while diving into the coolest stuff happening in the tech world. Whether it's the latest in AI, gadgets that are changing the game, or the software shaping the future, we've got it covered. Each week, we'll hang out with experts, geek out over new breakthroughs, and swap stories about the people pushing tech forward. So grab a coffee, kick back, and join us as we chat all things tech—no jargon, no stress, just good conversation with friends.
- en-US-Andrew2:DragonHDLatestNeural: ...and that scene alone makes the movie worth watching.
  Oh, and if you're just tuning in, welcome! We're breaking down The Midnight Chase today, and I've got to say—it's one of the best thrillers I've seen this year. The pacing? Perfect. The lead actor? Absolutely nailed it. There's this one moment, no spoilers, but the tension is so thick you can almost feel it. And the cinematography? Stunning—especially the way they use lighting to build suspense. If you're into edge-of-your-seat action with a solid storyline, this is definitely one to check out. Stay with us, I'll be diving deeper into why this one stands out from other thrillers!
- en-US-Aria:DragonHDLatestNeural: As you complete the inspection, take clear and comprehensive notes. Use our standardized checklist as a guide, noting any deviations or areas of concern. If possible, take photographs to visually document any hazards or non-compliance issues. These notes and visuals will serve as evidence of your inspection findings. When you compile your report, include these details along with recommendations for corrective actions.
- en-US-Ava:DragonHDLatestNeural: Ladies, it's time for some self-pampering! Treat yourself to a moment of bliss with our exclusive Winter Spa Package. Indulge in a rejuvenating spa day like never before, and let your worries melt away. We're excited to offer you a limited-time sale, making self-care more affordable than ever. Elevate your well-being, embrace relaxation, and step into a world of tranquility with us this Winter.
- en-US-Davis:DragonHDLatestNeural: Unlock an exclusive golfing paradise at Hole 1 Golf, with our limited-time sale! For a short period, enjoy unbeatable deals on memberships, rounds, and golf gear. Swing into savings, elevate your game, and make the most of this incredible offer. Don't miss out; tee off with us today and seize the opportunity to elevate your golf experience!
- en-US-Emma:DragonHDLatestNeural: Imagine waking up to the sound of gentle waves and the warm Italian sun kissing your skin. At Bella Vista Resort, your dream holiday awaits! Nestled along the stunning Amalfi Coast, our luxurious beachfront resort offers everything you need for the perfect getaway. Indulge in spacious, elegantly designed rooms with breathtaking sea views, relax by our infinity pool, or savor authentic Italian cuisine at our on-site restaurant. Explore picturesque villages, soak up the sun on pristine sandy beaches, or enjoy thrilling water sports—there's something for everyone! Join us for unforgettable sunsets and memories that will last a lifetime. Book your stay at Bella Vista Resort today and experience the ultimate sunny beach holiday in Italy!
- en-US-Emma2:DragonHDLatestNeural: ...and that's when I realized how much living abroad teaches you outside the classroom. Oh, and if you're just joining us, welcome! We've been talking about studying abroad, and I was just sharing this one story—my first week in Spain, I thought I had the language down, but when I tried ordering lunch, I panicked and ended up with callos, which are tripe. Not what I expected! But those little missteps really helped me get more comfortable with the language and culture. Anyway, stick around, because next I'll be sharing some tips for adjusting to life abroad!
- en-US-Jenny:DragonHDLatestNeural: Turning to international news, NASA's recent successful mission to send a rover to explore Mars has captured the world's attention. The rover, named 'Perseverance,' touched down on the Martian surface earlier this week, marking a historic achievement in space exploration.
  It's equipped with cutting-edge technology and instruments to search for signs of past microbial life and gather data about the planet's geology.
- en-US-Steffan:DragonHDLatestNeural: By activating 'Auto-Tagging,' your productivity soars as it seamlessly locates and retrieves vital information within seconds, eliminating the need for time-consuming tasks. This intuitive feature not only understands your content but also empowers you to concentrate on what truly matters. To enable 'Auto-Tagging,' simply navigate to the settings menu and toggle the feature on for hassle-free organization.
- ja-JP-Masaru:DragonHDLatestNeural: 今日のテーマは、日本料理の魅力です。今聞いている方も、いらっしゃいませ！まずは天ぷらについて話しましょう。外はサクサク、中はふんわりとした食感が特徴で、旬の野菜や新鮮な魚介類を使うことで、その味が引き立ちます。 次にお寿司も忘れてはいけません。新鮮なネタとシャリの絶妙なバランスはシンプルながら奥が深く、各地域の特産品を使ったお寿司も楽しめます。旅をするたびに新しい発見があるのも、日本料理の楽しみの一つです。 この後は、各地の郷土料理についてもお話ししますので、ぜひ最後までお付き合いください！
- zh-CN-Xiaochen:DragonHDLatestNeural: 最近我真的越来越喜欢探索各种美食了！你知道吗，我特别喜欢尝试不同国家的菜肴，每次都有新的惊喜。 上周我去了一家意大利餐厅，他们的披萨简直太好吃了，薄脆的饼底搭配上新鲜的番茄酱和浓郁的奶酪，每一口都充满了满足感。尤其是那种在嘴里融化的感觉，真的让人欲罢不能。 当然，我也特别喜欢中餐，无论是火锅还是川菜，那种麻辣鲜香的味道总是让我停不下来。尤其是和朋友一起吃火锅，边吃边聊，感觉特别温馨。 还有一些更独特的尝试，比如最近吃了印度的咖喱，虽然开始有点不习惯那种浓烈的香料味，但后来慢慢品味，竟然觉得很有层次感，很丰富。每次尝试新菜，我都觉得像是在探索一段新的旅程，不知道下一口会带来什么样的体验。

Note: The new voices are currently not listed in the Voice List API or documentation. However, you can access and use them by directly calling the SSML template. We will update the API and documentation with more details in the near future. These HD voices are implemented based on the latest base model: DragonHDLatestNeural. The name before the colon, e.g., en-US-Andrew, is the voice persona name and its original locale. The base model will be tracked by versions in future updates.

Content Creation Demo

In a hectic work environment with many documents to read, converting them into podcasts for on-the-go listening can be beneficial. Here is a demo using Azure OpenAI GPT-4o and HD voices to create podcast content from a PDF document. The same idea can be applied to other documents such as web pages, Word documents, etc. The main steps of the demo are as follows:

- Use Azure OpenAI GPT-4o to summarize a lengthy document.
- Create a conversational podcast script.
- Convert the script into audio featuring two hosts using Azure HD voices.

Check out the sample code on GitHub.

Avatar Chat demo

HD voices can also be utilized with Azure speech to text and TTS avatars, as well as GPT-4o, in real-time full duplex conversations. This technology can enhance the interactive experience for customer service chatbots, among other applications. We have published an avatar demo with the following capabilities:

- It supports continuous conversations between the user and the bot.
- It supports user interruption while the bot is speaking.
- It achieves end-to-end low latency through best practices using Azure OpenAI GPT-4o and speech services.

Check out the sample code on GitHub.

How to use

You can start to use HD voices with the same speech synthesis SDK and REST APIs as non-HD voices. Follow the quick start to learn more about how to synthesize speech with the SDK, or learn to use the REST API here.

- Voice Locale: The locale in the voice name indicates its original language and region.
- Base Models: The current base model is DragonHDv1Neural. The latest version, DragonHDLatest, will be implemented once available. As more versions are introduced, you can specify the desired model (e.g., DragonHDv2Neural) according to the availability of each voice.
- SSML Usage: To reference a voice in SSML, use the format voicename:basemodel.
- Temperature Parameter: The temperature value is a float ranging from 0 to 1, influencing the randomness of the output. You can adjust the temperature parameter to control the variation of outputs: a lower temperature results in less randomness, leading to more predictable outputs, while a higher temperature increases randomness, allowing for more diverse outputs. The default temperature is set at 1.0. Less randomness yields more stable results, while more randomness offers variety but less consistency.
- Availability: HD voices are currently in preview and accessible in three regions: East US, West Europe, and Southeast Asia.
- Pricing: The cost for HD voices is $30 per 1 million characters.

Example SSML:

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='en-US-Ava:DragonHDLatestNeural' parameters='temperature=0.8'>Here is a test</voice>
</speak>

Get started

In our ongoing quest to enhance multilingual capabilities in text to speech (TTS) technology, our goal is to bring the best voices to our product. Our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication.

Microsoft offers over 500 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice. With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information

- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
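To round out the how-to section above, here is a minimal Python sketch that sends the example SSML through the Speech SDK, assuming a Speech resource in one of the preview regions. The subscription key placeholder and output file name are illustrative.

import azure.cognitiveservices.speech as speechsdk

# Placeholder key -- use a Speech resource in an HD-voice preview region (e.g., East US).
speech_config = speechsdk.SpeechConfig(subscription="<SPEECH_KEY>", region="eastus")
audio_config = speechsdk.audio.AudioOutputConfig(filename="hd_voice_sample.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ssml = (
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
    "xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>"
    "<voice name='en-US-Ava:DragonHDLatestNeural' parameters='temperature=0.8'>"
    "Here is a test"
    "</voice></speak>"
)

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Saved hd_voice_sample.wav")
else:
    print("Synthesis did not complete:", result.reason)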
March 2025: Azure AI Speech's HD voices are generally available and more

Authors: Yufei Liu, Lihui Wang, Yao Qian, Yang Zheng, Jiajun Zhang, Bing Liu, Yang Cui, Peter Pan, Yan Deng, Songrui Wu, Gang Wang, Xi Wang, Shaofei Zhang, Sheng Zhao

We are pleased to announce that our Azure AI Speech Dragon HD neural text to speech voices (language model-based TTS, similar in model design to text LLMs), which have been available to users for some time, are now moving to general availability (GA). These voices have gained significant traction across various scenarios and have received valuable feedback from our users. This milestone is a testament to the extensive feedback and growing popularity of Azure AI Speech's HD voices. As we continue to enhance the user experience, we remain committed to exploring and experimenting with new voices and advanced models to push the boundaries of TTS technology.

Key Features of Azure AI Speech's Dragon HD Neural TTS

Azure AI Speech's Dragon HD (language model-based TTS) neural TTS voices are particularly well suited for voice agents and conversational scenarios, thanks to the following key features:

- Context-Aware and Dynamic Output: The Dragon HD TTS models are enhanced with language models (LMs) to ensure better context understanding, producing more accurate and contextually appropriate outputs. Each voice incorporates dynamic temperature adjustments to vary the degree of creativity and emotion in the speech, allowing for tailored delivery based on the specific needs of the content.
- Emotion-Enhanced Expressiveness: The Dragon HD TTS voices incorporate advanced emotion detection, leveraging acoustic and linguistic features to identify emotional cues within the input text. The model adjusts tone, style, intonation, and rhythm dynamically to deliver speech with rich, natural variations and authentic emotional expression.
- Improved Multilingual Support: Following the addition of more voice variety and enhanced multilingual capabilities last month, Dragon HD TTS voices have gained immense popularity across various use cases, including conversational agents, podcast creation, and video content production.
- Cutting-Edge Acoustic Models: GA voices utilize the latest acoustic models, continually updated by Microsoft to ensure optimal performance and superior quality. These models adapt to changing needs, providing users with state-of-the-art speech synthesis.

Update details

19 Azure AI Speech Dragon HD TTS voices are now generally available:

- de-DE-Florian:DragonHDLatestNeural
- de-DE-Seraphina:DragonHDLatestNeural
- en-US-Adam:DragonHDLatestNeural
- en-US-Andrew:DragonHDLatestNeural
- en-US-Andrew2:DragonHDLatestNeural
- en-US-Ava:DragonHDLatestNeural
- en-US-Brian:DragonHDLatestNeural
- en-US-Davis:DragonHDLatestNeural
- en-US-Emma:DragonHDLatestNeural
- en-US-Emma2:DragonHDLatestNeural
- en-US-Steffan:DragonHDLatestNeural
- es-ES-Tristan:DragonHDLatestNeural
- es-ES-Ximena:DragonHDLatestNeural
- fr-FR-Remy:DragonHDLatestNeural
- fr-FR-Vivienne:DragonHDLatestNeural
- ja-JP-Masaru:DragonHDLatestNeural
- ja-JP-Nanami:DragonHDLatestNeural
- zh-CN-Xiaochen:DragonHDLatestNeural
- zh-CN-Yunfan:DragonHDLatestNeural

Demo for a conversation between a human and the Ava HD voice

Introducing multi-talker voices in preview for podcast scenarios

The first multi-talker speech generation model, `en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural`, is a groundbreaking advancement designed to produce multi-round conversational, podcast-style speech with two distinct speakers' voices simultaneously.
This model captures the natural flow of dialogue between speakers, seamlessly incorporating pauses, interjections, and contextual shifts that result in a highly realistic and engaging conversational experience. In contrast, single-talker models synthesize each speaker's turn in isolation, without considering the broader context of the conversation. This can lead to mismatched emotions and tones, making the dialogue feel less natural and cohesive. By maintaining contextual coherence and emotional consistency throughout the conversation, the multi-talker model stands out as the superior choice for applications requiring authentic, engaging, and dynamic dialogues.

Here is the SSML template for assigning roles:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural">
    <mstts:dialog>
      <mstts:turn speaker="ava">Hello, Andrew! How's your day going?</mstts:turn>
      <mstts:turn speaker="andrew">Hey Ava! It's been great, just exploring some AI advancements in communication.</mstts:turn>
      <mstts:turn speaker="ava">That sounds interesting! What kind of projects are you working on?</mstts:turn>
      <mstts:turn speaker="andrew">Well, we've been experimenting with text to speech applications, including turning emails into podcasts.</mstts:turn>
      <mstts:turn speaker="ava">Wow, that could really improve content accessibility! Are you looking for collaborators?</mstts:turn>
      <mstts:turn speaker="andrew">Absolutely! We're open to testing new ideas and seeing how AI can enhance communication.</mstts:turn>
    </mstts:dialog>
  </voice>
</speak>

A short Python sketch that sends SSML like this to the service over the REST API appears below, after the single-talker podcast voices.

Introducing 2 more versions of Ava and Andrew in preview: optimized for podcast content

Introducing two new preview versions of Ava and Andrew, optimized specifically for podcast content. While multi-talker models are designed to emulate dynamic exchanges between multiple speakers, single-talker models represent the traditional TTS approach by focusing on crafting each speaker's contribution independently. This approach enhances linguistic accuracy, tonal control, and consistency, all without the complexity of managing inter-speaker dynamics. Although single-talker models don't maintain broader contextual coherence between dialogue turns, their streamlined and versatile design offers unique advantages. They are ideal for applications requiring clear, uninterrupted speech—such as instructional content or speech-to-speech interactions. These models deliver a podcast-like style similar to multi-talker models but with greater flexibility and control, catering to diverse use cases.

- en-US-Andrew3:DragonHDLatestNeural (optimized for podcast content)
- en-US-Ava3:DragonHDLatestNeural (optimized for podcast content)
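As a quick usage illustration, here is a minimal sketch that posts multi-talker SSML like the template above to the standard text to speech REST endpoint and saves the returned WAV file. The key and region are placeholders, and since the multi-talker voice is in preview, the region should be one that hosts it (East US is assumed here).

import requests

SPEECH_KEY = "<SPEECH_KEY>"   # placeholder subscription key
REGION = "eastus"             # assumed preview region for the multi-talker voice

ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-MultiTalker-Ava-Andrew:DragonHDLatestNeural">
    <mstts:dialog>
      <mstts:turn speaker="ava">Hello, Andrew! How's your day going?</mstts:turn>
      <mstts:turn speaker="andrew">Hey Ava! It's been great, thanks for asking.</mstts:turn>
    </mstts:dialog>
  </voice>
</speak>"""

response = requests.post(
    f"https://{REGION}.tts.speech.microsoft.com/cognitiveservices/v1",
    headers={
        "Ocp-Apim-Subscription-Key": SPEECH_KEY,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "multitalker-sample",
    },
    data=ssml.encode("utf-8"),
)
response.raise_for_status()

with open("podcast_sample.wav", "wb") as f:
    f.write(response.content)  # WAV audio returned by the service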
Introducing Azure AI Speech's Dragon HD Flash models in preview

The Azure AI Speech Dragon HD Flash model redefines efficiency and accessibility by offering a lighter-weight solution that retains the exceptional flexibility of HD voices, all while maintaining the same price as standard neural voices. By delivering high-quality voice synthesis at a reduced computational demand, it also significantly improves latency, enabling faster and more responsive performance. This combination of reduced latency, high-quality output, and affordability positions the HD Flash model as an optimal choice for applications requiring versatile, natural-sounding, and prompt speech generation.

- zh-CN-Xiaochen:DragonHDFlashLatestNeural
- zh-CN-Xiaoxiao:DragonHDFlashLatestNeural
- zh-CN-Xiaoxiao2:DragonHDFlashLatestNeural (optimized for free talking)
- zh-CN-Yunxiao:DragonHDFlashLatestNeural
- zh-CN-Yunyi:DragonHDFlashLatestNeural

Availability and Important Notes

- Regions: GA HD voices are available in the East US, West Europe, and Southeast Asia regions; HD Flash will also be available in China North 3.
- Voice Status: While many HD voices have achieved GA, some newly introduced or experimental voices remain in Preview status for experimentation and feedback collection. For detailed information, please refer to the Status field in the voice list API.
- Model Updates: GA voices are powered by the latest models, which are subject to continuous improvement and updates once a new default version is available.

Get Started

In our ongoing journey to enhance multilingual capabilities in text to speech (TTS) technology, we strive to deliver the best voices to empower your applications. Our voices are designed to be incredibly adaptive, seamlessly switching between languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications like language learning, travel guidance, and international business communication.

Microsoft offers an extensive portfolio of over 600 neural voices, covering more than 150 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or provide a voice to chatbots, elevating the conversational experience for users. With the Custom Neural Voice capability, businesses can also create unique and distinctive brand voices effortlessly. With these advancements, we continue to push the boundaries of what's possible in TTS technology, ensuring that our users have access to the most versatile, high-quality voices for their needs.

For more information

- Try our demo to listen to existing neural voices
- Add Text to speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us: ttsvoicefeedback@microsoft.com

Building custom AI Speech models with Phi-3 and Synthetic data
Introduction

In today's landscape, speech recognition technologies play a critical role across various industries—improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include:

- Speech to Text (STT)
- Text to Speech (TTS)
- Speech Translation
- Custom Neural Voice
- Speaker Recognition

Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains—such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature—off-the-shelf recognition models may fall short. To achieve the best possible performance, you'll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire.

The Data Challenge: When training datasets lack sufficient diversity or volume—especially in niche domains or underrepresented speech patterns—model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions.

Addressing Data Scarcity with Synthetic Data

A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft's Phi-3.5 model and Azure's pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale—no professional recording studio or voice actors needed.

What is Synthetic Data?

Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It's especially beneficial when real data is limited, protected, or expensive to gather. Use cases include:

- Privacy Compliance: Train models without handling personal or sensitive data.
- Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy.
- Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance.
- Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models.

By incorporating synthetic data, you can fine-tune custom STT (Speech to Text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness.

Overview of the Process

This blog post provides a step-by-step guide—supported by code samples—to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model.
We will cover steps 1–4 of the high-level architecture: the end-to-end custom Speech-to-Text model fine-tuning process (Custom Speech with synthetic data).

Hands-on Labs: GitHub Repository

Step 0: Environment Setup

First, configure a .env file based on the provided sample.env template to suit your environment. You'll need to:

- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision Azure AI Speech and an Azure Storage account.

Below is a sample configuration focusing on creating a custom Italian model:

# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it

# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct

# Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural

# Azure Storage Account
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container

Key Settings Explained:

- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: Specify the language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configuration for your Azure Storage account, where training/evaluation data will be stored.

> Voice Gallery

Step 1: Generating Domain-Specific Text Utterances with Phi-3.5

Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""
question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english.
jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)
content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys

Sample Output (Contoso Electronics in Italian):

{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset

Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech's TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core Function:

def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    # Wrap the text in SSML for the given language and voice, then save the audio to a WAV file.
    ssml = f"""<speak version='1.0' xmlns="https://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
                   <voice name='{default_tts_voice}'>
                       {html.escape(text)}
                   </voice>
               </speak>"""
    speech_sythesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_sythesis_result)
    stream.save_to_wav_file(file_path)

Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.

Note: If DELETE_OLD_DATA = True, the training_dataset folder resets each run. If you're mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.
Code snippet (illustrative):

import zipfile
import shutil

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)

print(f"Created zip file: {zip_filename}")

shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = {os.path.join(train_dataset_dir, zip_filename)}
%store train_dataset_path

You'll also similarly create evaluation data using a different TTS voice than the one used for training, to ensure a meaningful evaluation scenario.

Example snippet to create the synthetic evaluation data:

import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if DELETE_OLD_DATA:
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')

for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)

Step 3: Creating and Training a Custom Speech Model

To fine-tune and evaluate your custom model, you'll interact with Azure's Speech-to-Text APIs:

- Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
- Register your dataset as a Custom Speech dataset.
- Create a Custom Speech model using that dataset.
- Create evaluations using that custom model with asynchronous calls until it's completed.

You can also use UI-based approaches to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on we'll use the Azure Speech-to-Text REST APIs to iterate through the entire process.

Key APIs & References:

- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts API calls for convenience.
Example snippet to create the training dataset:

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)

You can monitor training progress using the monitor_training_status function, which polls the model's status and updates you once training completes.

Core Function:

def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Training Completed")

Step 4: Evaluate the Trained Custom Speech Model

After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model's WER.

Key Steps:

- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the evaluation results of the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry fine-tuning section, depending on the resource location you specified in the configuration file.

Example snippet to create the evaluation:

description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"
evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)

You can also compute a simple Word Error Rate (WER) number with the code below, which is used in 4_evaluate_custom_model.ipynb.

Example snippet to create the WER dataframe:

# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)

About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training.

You'll also similarly create a WER results markdown file using the md_table_scoring_result method below.
Core Function:

# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)

Implementation Considerations

The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion

By combining Microsoft's Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:

- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure's AI and speech services, you'll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions—without the overhead of large-scale data collection efforts. 🙂

Reference

- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)