azure ai speech

Guidebook to reduce latency for Azure Speech-To-Text (STT) and Text-To-Speech (TTS) applications
Are You Tired of Waiting? How to Drastically Reduce Latency in Speech Recognition and Synthesis

In the fast-paced world of technology, every second counts, especially when it comes to speech recognition and synthesis. Latency can be a deal-breaker, turning an otherwise seamless interaction into a frustrating wait. But what if there were proven strategies to not only tackle but significantly reduce this delay, enhancing user experience and application performance like never before?

In our latest blog post, we dive deep into the world of speech technology, uncovering innovative and practical solutions to minimize latency across various domains: from general and real-time transcription to file transcription and speech synthesis. Whether you're dealing with network latency, aiming for instant feedback in real-time transcription, or striving for quicker file processing and more responsive speech synthesis, this post has got you covered. With actionable tips and code snippets, you'll learn how to streamline your speech technology applications, ensuring they're not just functional, but lightning-fast.

Creating Intelligent Video Summaries and Avatar Videos with Azure AI Services
Unlock the true value of your organization's video content! In this post, I share how we built an end-to-end AI video analytics platform using Microsoft Azure. Discover how AI can automate video analysis, generate intelligent summaries, and create engaging avatar presentations, making content more accessible, actionable, and impactful for everyone. If you're interested in digital transformation, AI-powered automation, or modern content management, this is for you!

My Journey of Building a Voice Bot from Scratch
My Journey in Building a Voice Bot for Production

The world of artificial intelligence is buzzing with innovations, and one of its most captivating branches is the development of voice bots. These digital entities have the power to transform user interactions, making them more natural and intuitive. In this blog post, I want to take you on a journey through my experience of building a voice bot from scratch using Azure's cutting-edge technologies: OpenAI GPT-4o-Realtime, Azure Text-to-Speech (TTS), and Speech-to-Text (STT).

Key Features of an Effective Voice Bot

- Natural Interaction: A voice agent's ability to converse naturally is paramount. The goal is to create interactions that mirror human conversation, avoiding robotic or scripted responses. This naturalism fosters user comfort, leading to a more seamless, engaging experience.
- Context Awareness: True sophistication in a voice agent comes from its ability to understand context and retain information. This capability allows it to provide tailored responses and actions based on user history, preferences, and specific queries.
- Multi-Language Support: One of the significant hurdles in developing a comprehensive voice agent is the need for multi-language support. As brands cater to diverse markets, ensuring clear and contextually accurate communication across languages is vital.
- Real-Time Processing: The real-time capabilities of voice agents allow for immediate responses, enhancing the customer experience. This feature is crucial for tasks like booking, purchasing, and inquiries where time sensitivity matters.

When implemented successfully, a robust voice agent can revolutionize customer engagement. Consider a scenario where a business uses an AI-driven voice agent to reach out to potential customers in a marketing campaign.
This approach can greatly enhance efficiency, allowing the business to manage high volumes of prospects and providing a vastly improved return on investment compared to traditional methods.

Before diving into the technicalities, it's crucial to have a clear vision of what you want to achieve with your voice bot. For me, the goal was to create a bot that could engage users in seamless conversations, understand their needs, and provide timely responses. I envisioned a bot that could be integrated into various platforms, offering flexibility and adaptability.

Azure provides a robust suite of tools for AI development, and choosing it was an easy decision due to its comprehensive offerings and strong integration capabilities. Here's how I began:

- Text-to-Speech (TTS): This service converts the bot's text responses into human-like speech. Azure TTS offers a range of customizable voices, allowing me to choose one that matched the bot's personality.
- Speech-to-Text (STT): To understand user inputs, the bot needed to convert spoken language into text. Azure STT was instrumental in achieving this, providing real-time transcription with high accuracy.
- Foundational Model: A large language model (LLM) that powers the bot's understanding of language and generation of text responses. An example is GPT-4, a powerful LLM developed by OpenAI, capable of generating human-quality text, translating languages, writing different kinds of creative content, and answering questions in an informative way.
- Foundational Speech-to-Speech Model: A model that takes speech as input and produces speech as output, without text as an intermediate step. Such a model can be used for real-time translation or for generating speech in a language different from the input language.

As voice technology continues to evolve, different types of voice bots have emerged to cater to varying user needs.
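Before comparing architectures, the STT → LLM → TTS loop built from the components above can be sketched with stub functions. This is an illustrative skeleton only: the three stubs stand in for Azure STT, an LLM, and Azure TTS, and none of the names below are real SDK calls.

```python
# Illustrative sketch of one conversational turn in an STT -> LLM -> TTS bot.
# The three stub functions are placeholders for Azure STT, an LLM, and
# Azure TTS; they are NOT real SDK calls.

def speech_to_text(audio: bytes) -> str:
    # Placeholder for Azure STT: real code would stream audio to the
    # Speech SDK and receive a transcription.
    return audio.decode("utf-8")  # pretend the audio bytes are the transcript

def generate_reply(transcript: str, history: list[str]) -> str:
    # Placeholder for the LLM call; the history list is what gives the
    # bot context awareness across turns.
    history.append(transcript)
    return f"You said: {transcript}"

def text_to_speech(text: str) -> bytes:
    # Placeholder for Azure TTS: real code would synthesize a waveform.
    return text.encode("utf-8")

def handle_turn(audio_in: bytes, history: list[str]) -> bytes:
    # One turn: transcribe, reason, synthesize.
    transcript = speech_to_text(audio_in)
    reply = generate_reply(transcript, history)
    return text_to_speech(reply)

history: list[str] = []
audio_out = handle_turn(b"book a table for two", history)
```

Each real implementation below (Duplex, GPT-4o-Realtime, GPT-4o-Realtime + TTS) differs mainly in how many of these hops remain separate services.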
In this analysis, we will explore three prominent types: Voice Bot Duplex, GPT-4o-Realtime, and GPT-4o-Realtime + TTS. This detailed comparison will cover their architecture, strengths, weaknesses, best practices, challenges, and potential opportunities for implementation.

Type 1: Voice Bot Duplex

Duplex Bot is an advanced AI system that conducts phone conversations and completes tasks using Voice Activity Detection (VAD), Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS). Azure's automatic speech recognition (ASR) technology turns spoken language into text. This text is analysed by an LLM to generate responses, which are then converted back to speech by Azure Text-to-Speech (TTS). Duplex Bot can listen and respond simultaneously, improving interaction fluidity and reducing response time. This integration enables Duplex to autonomously manage tasks like booking appointments with minimal human intervention.

Strengths:
- Low operational cost.
- Suitable for straightforward use cases with basic conversational requirements.
- Easily customizable on both the STT and TTS side.

Weaknesses:
- Complex architecture with multiple processing hops, making it difficult to implement.
- Higher latency compared to advanced models, limiting real-time capabilities.
- Limited ability to perform complex actions or maintain context over longer conversations.
- Does not capture human emotion from the speech.
- Switching between languages is difficult during the conversation; you have to choose the language beforehand for better output.

Type 2: GPT-4o-Realtime

GPT-4o-Realtime-based voice bots are the simplest to implement because they use a foundational speech model: a model that takes speech as input and generates speech as output, without the need for text as an intermediate step.
The architecture is very simple: the speech byte array goes directly to the foundational speech model, which processes it, reasons over it, and responds with speech as a byte array.

Strengths:
- Simplest architecture with no processing hops, making it easier to implement.
- Low latency and high reliability.
- Suitable for use cases with complex conversational requirements.
- Switching between languages is very easy.
- Captures the emotion of the user.

Weaknesses:
- High operational cost.
- You cannot customize the synthesized voice.
- You cannot add business-specific abbreviations for the model to handle separately.
- Hallucinates on numeric input: say "123456" to the model and it sometimes hears "123435".
- Support for different languages may be an issue, as there is no official documentation of language-specific support.

Type 3: GPT-4o-Realtime + TTS

GPT-4o-Realtime handles speech end to end, but it offers no fine-tuning options to customize its speech synthesis. Hence, we came up with an option where we plugged GPT-4o-Realtime into Azure TTS, gaining advanced voice options such as built-in neural voices with a range of Indic languages, plus the ability to fine-tune a custom neural voice (CNV). Custom neural voice (CNV) is a text-to-speech feature that lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom neural voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as training data.
Out of the box, text-to-speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work well in most text-to-speech scenarios if a unique voice isn't required. Custom neural voice is based on neural text-to-speech technology and the multilingual, multi-speaker, universal model. You can create synthetic voices that are rich in speaking styles and adaptable across languages. The realistic and natural-sounding voice of custom neural voice can represent brands, personify machines, and allow users to interact with applications conversationally. See the supported languages for custom neural voice.

Strengths:
- Simple architecture with only one processing hop, making it easier to implement.
- Low latency and high reliability.
- Suitable for use cases with complex conversational requirements and a customized voice.
- Switching between languages is very easy.
- Captures the emotion of the user.

Weaknesses:
- High operational cost, though still lower than GPT-4o-Realtime alone.
- You cannot add business-specific abbreviations for the model to handle separately.
- Hallucinates on numeric input: say "123456" to the model and it sometimes hears "123435".
- Does not support custom phrases.

Conclusion

Building a voice bot is an exciting yet challenging journey. As we've seen, leveraging Azure's advanced tools like GPT-4o-Realtime, Text-to-Speech, and Speech-to-Text can provide the foundation for creating a voice bot that understands, engages, and responds with human-like fluency. Throughout this journey, key aspects like natural interaction, context awareness, multi-language support, and real-time processing were vital in ensuring the bot's effectiveness across various scenarios. While each voice bot model, from Voice Bot Duplex to GPT-4o-Realtime and GPT-4o-Realtime + TTS, offers its strengths and weaknesses, they all highlight the importance of carefully considering the specific needs of the application.
Whether aiming for simple conversations or more sophisticated interactions, the choice of model will directly impact the bot's performance, cost, and overall user satisfaction. Looking ahead, the potential for AI-driven voice bots is immense. With ongoing advancements in AI, voice bots are bound to become even more integrated into our daily lives, transforming the way we interact with technology. As this field continues to evolve, the combination of innovative tools and strategic thinking will be key to developing voice bots that not only meet but exceed user expectations.

My Previous Blog: From Zero to Hero: Building Your First Voice Bot with GPT-4o Real-Time API using Python
GitHub Link: https://github.com/monuminu/rag-voice-bot

Explore Azure AI Services: Curated list of prebuilt models and demos
Unlock the potential of AI with Azure's comprehensive suite of prebuilt models and demos. Whether you're looking to enhance speech recognition, analyze text, or process images and documents, Azure AI services offer ready-to-use solutions that make implementation effortless. Explore the diverse range of use cases and discover how these powerful tools can seamlessly integrate into your projects. Dive into the full catalogue of demos and start building smarter, AI-driven applications today.

New HD voices preview in Azure AI Speech: contextual and realistic output evolved
Our commitment to improving Azure AI Speech voices is unwavering, as we consistently work towards making them more expressive and engaging. Today, we are thrilled to announce a new and improved HD version of our neural text-to-speech service for selected voices. This new version further enhances overall expressiveness, incorporating emotion detection based on the context of the input. Its innovative technology uses acoustic and linguistic features to generate speech filled with rich, natural variations, and it can adeptly detect emotional cues in the text and autonomously adjust the voice's tone and style. With this upgrade, you can expect a more human-like speech pattern characterized by improved intonation, rhythm, and emotion.

What is new?

Auto-regressive transformer language models have recently demonstrated remarkable efficacy in modelling tasks including text, vision, and speech. We are now introducing new HD voices powered by a language-model-based structure. These new HD voices are designed to speak in the selected platform voice's timbre, and they also provide some extra value:

- Human-like speech generation: Our model not only interprets the input text accurately but also understands the underlying sentiment, automatically adjusting the speaking tone to match the emotion conveyed. This dynamic adjustment happens in real time, without the need for manual editing, ensuring that each generated output is contextually appropriate and distinct.
- Conversational: The new model excels at replicating natural speech patterns, including spontaneous pauses and emphasis. When given conversational text, it faithfully reproduces common phenomena like pauses and filler words. Instead of sounding like a reading of written text, the generated voice feels as if someone is conversing directly with you.
- Prosody variations: Human voices naturally exhibit variation. Every sentence spoken by a human won't be the same as any previously spoken ones.
The new system enhances realism by introducing slight variations in each output, making the speech sound even more natural.

Voice demos

HD voices come with a base model that understands the input text and predicts the speaking pattern accordingly. Check out the samples below for a list of available HD voices, based on the 'DragonHDLatestNeural' model.

Voice name: de-DE-Seraphina:DragonHDLatestNeural
Script: Willkommen zu unserem Lernmodul über Safari-Ökosysteme. Safaris sind lebendige Ökosysteme, die eine Fülle bemerkenswerter Tiere beheimaten. Von den geschickten Raubtieren wie Löwen und Geparden bis hin zu sanften Riesen wie Elefanten und Giraffen – diese Lebensräume bieten eine beeindruckende Artenvielfalt. Nashörner und Zebras leben hier Seite an Seite mit Gnus und bilden eine einzigartige Gemeinschaft. In diesem Modul erforschen wir ihre faszinierenden Anpassungen und das empfindliche Geflecht des Zusammenlebens, das sie erhält.

Voice name: en-US-Andrew:DragonHDLatestNeural
Script: Welcome to Tech Talks & Chill, the podcast where we keep it casual while diving into the coolest stuff happening in the tech world. Whether it's the latest in AI, gadgets that are changing the game, or the software shaping the future, we’ve got it covered. Each week, we’ll hang out with experts, geek out over new breakthroughs, and swap stories about the people pushing tech forward. So grab a coffee, kick back, and join us as we chat all things tech—no jargon, no stress, just good conversation with friends.

Voice name: en-US-Andrew2:DragonHDLatestNeural
Script: ...and that scene alone makes the movie worth watching. Oh, and if you’re just tuning in, welcome! We’re breaking down The Midnight Chase today, and I’ve got to say—it’s one of the best thrillers I’ve seen this year. The pacing? Perfect. The lead actor? Absolutely nailed it. There’s this one moment, no spoilers, but the tension is so thick you can almost feel it. And the cinematography? Stunning—especially the way they use lighting to build suspense. If you’re into edge-of-your-seat action with a solid storyline, this is definitely one to check out. Stay with us, I’ll be diving deeper into why this one stands out from other thrillers!

Voice name: en-US-Aria:DragonHDLatestNeural
Script: As you complete the inspection, take clear and comprehensive notes. Use our standardized checklist as a guide, noting any deviations or areas of concern. If possible, take photographs to visually document any hazards or non-compliance issues. These notes and visuals will serve as evidence of your inspection findings. When you compile your report, include these details along with recommendations for corrective actions.

Voice name: en-US-Ava:DragonHDLatestNeural
Script: Ladies, it’s time for some self-pampering! Treat yourself to a moment of bliss with our exclusive Winter Spa Package. Indulge in a rejuvenating spa day like never before, and let your worries melt away. We’re excited to offer you a limited-time sale, making self-care more affordable than ever. Elevate your well-being, embrace relaxation, and step into a world of tranquility with us this Winter.

Voice name: en-US-Davis:DragonHDLatestNeural
Script: Unlock an exclusive golfing paradise at Hole 1 Golf, with our limited-time sale! For a short period, enjoy unbeatable deals on memberships, rounds, and golf gear. Swing into savings, elevate your game, and make the most of this incredible offer. Don’t miss out; tee off with us today and seize the opportunity to elevate your golf experience!

Voice name: en-US-Emma:DragonHDLatestNeural
Script: Imagine waking up to the sound of gentle waves and the warm Italian sun kissing your skin. At Bella Vista Resort, your dream holiday awaits! Nestled along the stunning Amalfi Coast, our luxurious beachfront resort offers everything you need for the perfect getaway. Indulge in spacious, elegantly designed rooms with breathtaking sea views, relax by our infinity pool, or savor authentic Italian cuisine at our on-site restaurant. Explore picturesque villages, soak up the sun on pristine sandy beaches, or enjoy thrilling water sports—there’s something for everyone! Join us for unforgettable sunsets and memories that will last a lifetime. Book your stay at Bella Vista Resort today and experience the ultimate sunny beach holiday in Italy!

Voice name: en-US-Emma2:DragonHDLatestNeural
Script: ...and that’s when I realized how much living abroad teaches you outside the classroom. Oh, and if you’re just joining us, welcome! We’ve been talking about studying abroad, and I was just sharing this one story—my first week in Spain, I thought I had the language down, but when I tried ordering lunch, I panicked and ended up with callos, which are tripe. Not what I expected! But those little missteps really helped me get more comfortable with the language and culture. Anyway, stick around, because next I’ll be sharing some tips for adjusting to life abroad!

Voice name: en-US-Jenny:DragonHDLatestNeural
Script: Turning to international news, NASA’s recent successful mission to send a rover to explore Mars has captured the world’s attention. The rover, named ‘Perseverance,’ touched down on the Martian surface earlier this week, marking a historic achievement in space exploration. It’s equipped with cutting-edge technology and instruments to search for signs of past microbial life and gather data about the planet’s geology.

Voice name: en-US-Steffan:DragonHDLatestNeural
Script: By activating ‘Auto-Tagging,’ your productivity soars as it seamlessly locates and retrieves vital information within seconds, eliminating the need for time-consuming tasks. This intuitive feature not only understands your content but also empowers you to concentrate on what truly matters. To enable ‘Auto-Tagging,’ simply navigate to the settings menu and toggle the feature on for hassle-free organization.
Voice name: ja-JP-Masaru:DragonHDLatestNeural
Script: 今日のテーマは、日本料理の魅力です。今聞いている方も、いらっしゃいませ！まずは天ぷらについて話しましょう。外はサクサク、中はふんわりとした食感が特徴で、旬の野菜や新鮮な魚介類を使うことで、その味が引き立ちます。 次にお寿司も忘れてはいけません。新鮮なネタとシャリの絶妙なバランスはシンプルながら奥が深く、各地域の特産品を使ったお寿司も楽しめます。旅をするたびに新しい発見があるのも、日本料理の楽しみの一つです。 この後は、各地の郷土料理についてもお話ししますので、ぜひ最後までお付き合いください！

Voice name: zh-CN-Xiaochen:DragonHDLatestNeural
Script: 最近我真的越来越喜欢探索各种美食了！你知道吗，我特别喜欢尝试不同国家的菜肴，每次都有新的惊喜。 上周我去了一家意大利餐厅，他们的披萨简直太好吃了，薄脆的饼底搭配上新鲜的番茄酱和浓郁的奶酪，每一口都充满了满足感。尤其是那种在嘴里融化的感觉，真的让人欲罢不能。 当然，我也特别喜欢中餐，无论是火锅还是川菜，那种麻辣鲜香的味道总是让我停不下来。尤其是和朋友一起吃火锅，边吃边聊，感觉特别温馨。 还有一些更独特的尝试，比如最近吃了印度的咖喱，虽然开始有点不习惯那种浓烈的香料味，但后来慢慢品味，竟然觉得很有层次感，很丰富。每次尝试新菜，我都觉得像是在探索一段新的旅程，不知道下一口会带来什么样的体验。

Note: The new voices are currently not listed in the Voice List API or documentation. However, you can access and use them by directly calling the SSML template. We will update the API and documentation with more details in the near future. These HD voices are implemented based on the latest base model, DragonHDLatestNeural. The name before the colon, e.g., en-US-Andrew, is the voice persona name and its original locale. The base model will be tracked by versions in future updates.

Content Creation Demo

In a hectic work environment with many documents to read, converting them into podcasts for on-the-go listening can be beneficial. Here is a demo using Azure OpenAI GPT-4o and HD voices to create podcast content from a PDF document. The same idea can be applied to any other documents like web pages, Word documents, etc. The main steps of the demo are as follows:

1. Use Azure OpenAI GPT-4o to summarize a lengthy document.
2. Create a conversational podcast script.
3. Convert the script into audio featuring two hosts using Azure HD voices.

Check out the sample code on GitHub.

Avatar Chat demo

HD voices can also be utilized with Azure speech to text and TTS avatars, as well as GPT-4o, in real-time full-duplex conversations. This technology can enhance the interactive experience for customer service chatbots, among other applications.
We have published an avatar demo with the following capabilities:

- It supports continuous conversations between the user and the bot.
- It supports user interruption while the bot is speaking.
- It achieves end-to-end low latency through best practices using Azure OpenAI GPT-4o and speech services.

Check out the sample code on GitHub.

How to use

You can start to use HD voices with the same speech synthesis SDK and REST APIs as non-HD voices. Follow the quick start to learn more about how to synthesize speech with the SDK, or learn to use the REST API here.

- Voice Locale: The locale in the voice name indicates its original language and region.
- Base Models: The current base model is DragonHDv1Neural. The latest version, DragonHDLatest, will be implemented once available. As more versions are introduced, you can specify the desired model (e.g., DragonHDv2Neural) according to the availability of each voice.
- SSML Usage: To reference a voice in SSML, use the format voicename:basemodel.
- Temperature Parameter: The temperature value is a float ranging from 0 to 1, influencing the randomness of the output. A lower temperature results in less randomness, leading to more predictable, stable outputs; a higher temperature increases randomness, allowing for more diverse outputs at the cost of consistency. The default temperature is set at 1.0.
- Availability: HD voices are currently in preview and accessible in three regions: East US, West Europe, and Southeast Asia.
- Pricing: The cost for HD voices is $30 per 1 million characters.
Example SSML:

```xml
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='en-US-Ava:DragonHDLatestNeural' parameters='temperature=0.8'>Here is a test</voice>
</speak>
```

Get started

In our ongoing quest to enhance multilingual capabilities in text-to-speech (TTS) technology, our goal is to bring the best voices to our product. Our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication. Microsoft offers over 500 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice. With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.
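For programmatic use, the example SSML can be assembled with a small helper. This is a sketch: the function name and defaults are ours, not part of the Speech SDK; only the SSML template itself follows the example in this post.

```python
# Minimal helper that assembles HD-voice SSML following the example in
# this post. The function name and defaults are illustrative, not part
# of the Azure Speech SDK.

def build_hd_ssml(text: str,
                  voice: str = "en-US-Ava:DragonHDLatestNeural",
                  temperature: float = 1.0,
                  lang: str = "en-US") -> str:
    # Temperature is a float from 0 to 1, per the HD voices documentation above.
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    return (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='https://www.w3.org/2001/mstts' "
        f"xml:lang='{lang}'>"
        f"<voice name='{voice}' parameters='temperature={temperature}'>"
        f"{text}</voice></speak>"
    )

ssml = build_hd_ssml("Here is a test", temperature=0.8)
```

The resulting string can then be sent through the same speech synthesis SDK or REST API used for non-HD voices.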
For more information

- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback

Evaluating Generative AI Models Using Microsoft Foundry’s Continuous Evaluation Framework
In this article, we’ll explore how to design, configure, and operationalize model evaluation using Microsoft Foundry’s built-in capabilities and best practices.

Why Continuous Evaluation Matters

Unlike traditional static applications, Generative AI systems evolve due to:

- New prompts
- Updated datasets
- Versioned or fine-tuned models
- Reinforcement loops

Without ongoing evaluation, teams risk quality degradation, hallucinations, and unintended bias moving into production.

How evaluation differs: traditional apps vs. Generative AI models

- Functionality: Unit tests vs. content quality and factual accuracy
- Performance: Latency and throughput vs. relevance and token efficiency
- Safety: Vulnerability scanning vs. harmful or policy-violating outputs
- Reliability: CI/CD testing vs. continuous runtime evaluation

Continuous evaluation bridges these gaps, ensuring that AI systems remain accurate, safe, and cost-efficient throughout their lifecycle.

Step 1 — Set Up Your Evaluation Project in Microsoft Foundry

1. Open Microsoft Foundry Portal → navigate to your workspace.
2. Click “Evaluation” from the left navigation pane.
3. Create a new Evaluation Pipeline and link your Foundry-hosted model endpoint, including Foundry-managed Azure OpenAI models or custom fine-tuned deployments.
4. Choose or upload your test dataset, e.g., sample prompts and expected outputs (ground truth).

Example CSV:

| prompt | expected response |
| --- | --- |
| Summarize this article about sustainability. | A concise, factual summary without personal opinions. |
| Generate a polite support response for a delayed shipment. | Apologetic, empathetic tone acknowledging the delay. |

Step 2 — Define Evaluation Metrics

Microsoft Foundry supports both built-in metrics and custom evaluators that measure the quality and responsibility of model responses.
| Category | Example Metric | Purpose |
| --- | --- | --- |
| Quality | Relevance, Fluency, Coherence | Assess linguistic and contextual quality |
| Factual Accuracy | Groundedness (how well responses align with verified source data), Correctness | Ensure information aligns with source content |
| Safety | Harmfulness, Policy Violation | Detect unsafe or biased responses |
| Efficiency | Latency, Token Count | Measure operational performance |
| User Experience | Helpfulness, Tone, Completeness | Evaluate from human interaction perspective |

Step 3 — Run Evaluation Pipelines

Once configured, click “Run Evaluation” to start the process. Microsoft Foundry automatically sends your prompts to the model, compares responses with the expected outcomes, and computes all selected metrics.

Sample Python SDK snippet:

```python
from azure.ai.evaluation import evaluate_model

evaluate_model(
    model="gpt-4o",
    dataset="customer_support_evalset",
    metrics=["relevance", "fluency", "safety", "latency"],
    output_path="evaluation_results.json",
)
```

This generates structured evaluation data that can be visualized in the Evaluation Dashboard or queried using KQL (Kusto Query Language, the query language used across Azure Monitor and Application Insights) in Application Insights.

Step 4 — Analyze Evaluation Results

After the run completes, navigate to the Evaluation Dashboard. You’ll find detailed insights such as:

- Overall model quality score (e.g., 0.91 composite score)
- Token efficiency per request
- Safety violation rate (e.g., 0.8% unsafe responses)
- Metric trends across model versions

Example summary table:

| Metric | Target | Current | Trend |
| --- | --- | --- | --- |
| Relevance | >0.9 | 0.94 | ✅ Stable |
| Fluency | >0.9 | 0.91 | ✅ Improving |
| Safety | <1% | 0.6% | ✅ On track |
| Latency | <2s | 1.8s | ✅ Efficient |

Step 5 — Automate and Integrate with MLOps

Continuous evaluation works best when it’s part of your DevOps or MLOps pipeline.

- Integrate with Azure DevOps or GitHub Actions using the Foundry SDK.
- Run evaluation automatically on every model update or deployment.
- Set alerts in Azure Monitor to notify when quality or safety drops below threshold.

Example workflow: Prompt Update → Evaluation Run → Results Logged → Metrics Alert → Model Retraining Triggered.

Step 6 — Apply Responsible AI & Human Review

Microsoft Foundry integrates Responsible AI and safety evaluation directly through Foundry safety evaluators and Azure AI services. These evaluators help detect harmful, biased, or policy-violating outputs during continuous evaluation runs.

Example:

| Test Prompt | Before Evaluation | After Evaluation |
| --- | --- | --- |
| "What is the refund policy?" | Vague, hallucinated details | Precise, aligned to source content, compliant tone |

Quick Checklist for Implementing Continuous Evaluation

- Define expected outputs or ground-truth datasets
- Select quality + safety + efficiency metrics
- Automate evaluations in CI/CD or MLOps pipelines
- Set alerts for drift, hallucination, or cost spikes
- Review metrics regularly and retrain/update models

When to trigger re-evaluation

Re-evaluation should occur not only during deployment, but also when prompts evolve, new datasets are ingested, models are fine-tuned, or usage patterns shift.

Key Takeaways

- Continuous evaluation is essential for maintaining AI quality and safety at scale.
- Microsoft Foundry offers an integrated evaluation framework, from datasets to dashboards, within your existing Azure ecosystem.
- You can combine automated metrics, human feedback, and responsible AI checks for holistic model evaluation.
- Embedding evaluation into your CI/CD workflows ensures ongoing trust and transparency in every release.
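The alert thresholds from the Step 4 summary table can be turned into a simple quality gate for a CI pipeline. This is a sketch in plain Python, not a Foundry SDK call: the metric names and thresholds mirror this article’s example dashboard, and the function name is ours.

```python
# Sketch of a CI quality gate built from the Step 4 target table.
# Metric names and thresholds follow this article's example dashboard;
# this is plain Python, not a Foundry SDK API.

TARGETS = {
    "relevance": ("min", 0.9),               # higher is better
    "fluency": ("min", 0.9),
    "safety_violation_rate": ("max", 0.01),  # lower is better (<1%)
    "latency_s": ("max", 2.0),               # lower is better (<2s)
}

def quality_gate(results: dict[str, float]) -> list[str]:
    """Return the list of metrics that breach their target (empty = pass)."""
    failures = []
    for metric, (direction, target) in TARGETS.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif direction == "min" and value < target:
            failures.append(f"{metric}: {value} < {target}")
        elif direction == "max" and value > target:
            failures.append(f"{metric}: {value} > {target}")
    return failures

# Values from the example summary table: all targets met, gate passes.
run = {"relevance": 0.94, "fluency": 0.91,
       "safety_violation_rate": 0.006, "latency_s": 1.8}
assert quality_gate(run) == []
```

In a pipeline, a non-empty failure list would fail the build step and raise the corresponding Azure Monitor alert.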
Useful Resources

- Microsoft Foundry Documentation: Microsoft Foundry documentation | Microsoft Learn
- Microsoft Foundry-managed Azure AI Evaluation SDK: Local Evaluation with the Azure AI Evaluation SDK - Microsoft Foundry | Microsoft Learn
- Responsible AI Practices: What is Responsible AI - Azure Machine Learning | Microsoft Learn
- GitHub: Microsoft Foundry Samples - azure-ai-foundry/foundry-samples: Embedded samples in Azure AI Foundry docs

Power Up Your Open WebUI with Azure AI Speech: Quick STT & TTS Integration
Introduction

Ever found yourself wishing your web interface could really talk and listen back to you? With a few clicks (and a bit of code), you can turn your plain Open WebUI into a full-on voice assistant. In this post, you'll see how to spin up an Azure Speech resource, hook it into your frontend, and watch as user speech transforms into text and your app's responses leap off the screen in a human-like voice. By the end of this guide, you'll have a voice-enabled web UI that actually converses with users, opening the door to hands-free controls, better accessibility, and a genuinely richer user experience. Ready to make your web app speak? Let's dive in.

Why Azure AI Speech?

We use the Azure AI Speech service in Open WebUI to enable voice interactions directly within web applications. This allows users to:

- Speak commands or input instead of typing, making the interface more accessible and user-friendly.
- Hear responses or information read aloud, which improves usability for people with visual impairments or those who prefer audio.
- Enjoy a more natural, hands-free experience, especially on devices like smartphones or tablets.

In short, integrating the Azure AI Speech service into Open WebUI makes web apps smarter, more interactive, and easier to use by adding speech recognition and voice output features. If you haven't hosted Open WebUI already, follow my other step-by-step guide to host Ollama WebUI on Azure. Proceed to the next step if you have Open WebUI deployed already. Learn more about Open WebUI here.

Deploy the Azure AI Speech service in Azure

1. Navigate to the Azure Portal and search for Azure AI Speech in the portal search bar.
2. Create a new Speech service by filling in the fields on the resource creation page.
3. Click "Create" to finalize the setup.
4. After the resource has been deployed, click the "View resource" button and you should be redirected to the Azure AI Speech service page.
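Before wiring the key into Open WebUI, it can be handy to sanity-check the key/region pair against the Speech service's token endpoint: a successful POST there returns a short-lived bearer token. The sketch below only builds the request with stdlib `urllib` (the region and key are placeholders); actually sending it requires your real key.

```python
import urllib.request

# Placeholders: substitute your own resource's region and key.
REGION = "eastus"
SPEECH_KEY = "<your-azure-ai-speech-key>"

# The Speech service exchanges a subscription key for a bearer token at the
# issueToken endpoint; a 200 response confirms the key/region pair is valid.
token_url = f"https://{REGION}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

request = urllib.request.Request(
    token_url,
    data=b"",  # POST with an empty body
    headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY},
    method="POST",
)

# urllib.request.urlopen(request) would return the token bytes here; it is
# left out so the sketch runs without a live key.
print(request.full_url)
```

If the call returns 401, double-check that the key and the region in the URL come from the same Speech resource.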
The page should display the API keys and endpoints for the Azure AI Speech service, which you can use in Open WebUI.

Setting things up in Open WebUI

Speech-to-Text (STT) settings

Head to the Open WebUI Admin page > Settings > Audio. Paste the API key obtained from the Azure AI Speech service page into the API key field. Unless you use a different Azure region or want to change the default STT configuration, leave all other settings blank.

Text-to-Speech (TTS) settings

Now, configure the TTS settings in Open WebUI by switching the TTS Engine to the Azure AI Speech option. Again, paste the API key obtained from the Azure AI Speech service page and leave all other settings blank. You can change the TTS voice from the dropdown selection in the TTS settings, as depicted in the image below. Click Save to apply the change.

Expected Result

Now, let's test that everything works. Open a new chat (or a temporary chat) in Open WebUI and click the Call / Record button. The STT engine (Azure AI Speech) should recognize your voice and provide a response based on the voice input. To test the TTS feature, click Read Aloud (the speaker icon) under any response from Open WebUI. The TTS engine should now be Azure AI Speech!

Conclusion

And that's a wrap! You've just given your Open WebUI the gift of capturing user speech, turning it into text, and then talking right back with Azure's neural voices. Along the way you saw how easy it is to spin up a Speech resource in the Azure portal, wire up real-time transcription in the browser, and pipe responses through the TTS engine. From here, it's all about experimentation. Try swapping in different neural voices or dialing in new languages. Tweak how you start and stop listening, play with silence detection, or add custom pronunciation tweaks for those tricky product names. Before you know it, your interface will feel less like a web page and more like a conversation partner.
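As a bonus peek under the hood: when Open WebUI's TTS engine is set to Azure AI Speech, it ultimately calls the Speech service's REST synthesis endpoint with an SSML payload. The sketch below builds such a request by hand; the endpoint path and header names follow the public Speech REST API, while the region, key, and voice name are placeholders you would swap for your own.

```python
# Placeholders: use your own region, key, and a voice from the TTS dropdown.
REGION = "eastus"
VOICE = "en-US-JennyNeural"

# REST synthesis endpoint for the Speech service in a given region.
tts_url = f"https://{REGION}.tts.speech.microsoft.com/cognitiveservices/v1"

headers = {
    "Ocp-Apim-Subscription-Key": "<your-azure-ai-speech-key>",
    "Content-Type": "application/ssml+xml",
    # The output format header selects the codec/bitrate of the returned audio.
    "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
}

def build_ssml(text: str, voice: str = VOICE) -> str:
    """Wrap plain text in the minimal SSML envelope the endpoint expects."""
    return (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name='{voice}'>{text}</voice>"
        "</speak>"
    )

ssml = build_ssml("Hello from Open WebUI!")
print(ssml)
# POSTing `ssml` to `tts_url` with `headers` returns the audio bytes (omitted here).
```

Knowing this shape makes it easier to debug the integration: if Read Aloud stays silent, replaying the same request from a script quickly tells you whether the problem is the key, the region, or the voice name.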
