Announcing Live Interpreter API - Now in Public Preview
Today, we’re excited to introduce Live Interpreter, a breakthrough new capability in Azure Speech Translation that makes real-time, multilingual communication effortless. Live Interpreter continuously identifies the language being spoken without requiring you to set an input language and delivers low-latency speech-to-speech translation in a natural voice that preserves the speaker’s style and tone.
Power Up Your Open WebUI with Azure AI Speech: Quick STT & TTS Integration
Introduction
Ever found yourself wishing your web interface could really talk and listen back to you? With a few clicks (and a bit of code), you can turn your plain Open WebUI into a full-on voice assistant. In this post, you’ll see how to spin up an Azure Speech resource, hook it into your frontend, and watch as user speech transforms into text and your app’s responses leap off the screen in a human-like voice. By the end of this guide, you’ll have a voice-enabled web UI that actually converses with users, opening the door to hands-free controls, better accessibility, and a genuinely richer user experience. Ready to make your web app speak? Let’s dive in.
Why Azure AI Speech?
We use the Azure AI Speech service in Open WebUI to enable voice interactions directly within web applications. This allows users to:
Speak commands or input instead of typing, making the interface more accessible and user-friendly.
Hear responses or information read aloud, which improves usability for people with visual impairments or those who prefer audio.
Enjoy a more natural, hands-free experience, especially on devices like smartphones or tablets.
In short, integrating the Azure AI Speech service into Open WebUI helps make web apps smarter, more interactive, and easier to use by adding speech recognition and voice output features. If you haven’t hosted Open WebUI already, follow my other step-by-step guide to host Ollama WebUI on Azure. Proceed to the next step if you have Open WebUI deployed already. Learn more about Open WebUI here.
Deploy Azure AI Speech service in Azure
Navigate to the Azure Portal and search for Azure AI Speech in the portal search bar. Create a new Speech service by filling in the fields on the resource creation page. Click “Create” to finalize the setup. After the resource has been deployed, click the “View resource” button and you should be redirected to the Azure AI Speech service page.
The page should display the API keys and endpoints for the Azure AI Speech service, which you can use in Open WebUI.
Setting things up in Open WebUI
Speech to Text settings (STT)
Head to the Open WebUI Admin page > Settings > Audio. Paste the API key obtained from the Azure AI Speech service page into the API key field. Unless you use a different Azure region or want to change the default STT configuration, leave all other settings blank.
Text to Speech settings (TTS)
Now, let’s configure the TTS settings in Open WebUI by switching the TTS Engine to the Azure AI Speech option. Again, paste the API key obtained from the Azure AI Speech service page and leave all other settings blank. You can change the TTS voice from the dropdown in the TTS settings. Click Save to apply the change.
Expected Result
Now, let’s test that everything works. Open a new chat or temporary chat in Open WebUI and click the Call / Record button. The STT engine (Azure AI Speech) should recognize your speech, and the assistant should respond based on the voice input. To test the TTS feature, click Read Aloud (the speaker icon) under any response in Open WebUI. The audio should now be generated by the Azure AI Speech service.
Conclusion
And that’s a wrap! You’ve just given your Open WebUI the gift of capturing user speech, turning it into text, and then talking right back with Azure’s neural voices. Along the way you saw how easy it is to spin up a Speech resource in the Azure portal, wire up real-time transcription in the browser, and pipe responses through the TTS engine. From here, it’s all about experimentation. Try swapping in different neural voices or dialing in new languages. Tweak how you start and stop listening, play with silence detection, or add custom pronunciation tweaks for those tricky product names.
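As a starting point for that kind of experimentation outside Open WebUI, here is a minimal sketch of how a text-to-speech request to the Speech service might be assembled in Python. It only builds the request and does not send it. The region `eastus`, the voice `en-US-JennyNeural`, the output format, and the helper name `build_tts_request` are illustrative assumptions, and the endpoint and header layout reflect my understanding of the Speech REST API, so check them against the current Azure documentation before relying on them.

```python
from xml.sax.saxutils import escape


def build_tts_request(region: str, api_key: str, text: str,
                      voice: str = "en-US-JennyNeural") -> dict:
    """Describe a text-to-speech REST call without sending it.

    Hypothetical helper: region, voice, and output format are assumptions.
    """
    # SSML wraps the text and selects the voice; escape() protects
    # characters such as & and < that would break the XML.
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )
    return {
        "url": f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        "headers": {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "audio-16khz-32kbitrate-mono-mp3",
        },
        "body": ssml.encode("utf-8"),
    }


request = build_tts_request("eastus", "dummy-key", "Hello & welcome")
print(request["url"])
```

Keeping the request as plain data makes it easy to inspect before any audio is requested. To actually call the service, the same dictionary could be passed to an HTTP client of your choice, with a real key in place of `dummy-key`.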
Before you know it, your interface will feel less like a web page and more like a conversation partner.
Announcing gpt-realtime on Azure AI Foundry:
We are thrilled to announce the general availability of gpt-realtime, our latest advancement in speech-to-speech technology. This new model represents a significant leap forward in our commitment to providing advanced and reliable speech-to-speech solutions. gpt-realtime is a new S2S (speech-to-speech) model with improved instruction following, designed to merge all of our speech-to-speech improvements into a single, cohesive model. This model is now available in the Real-time API, offering enhanced voice naturalness, higher audio quality, and improved function calling capabilities.
Key Features
New, natural, expressive voices: New voice options (Marin and Cedar) that bring a new level of naturalness and clarity to speech synthesis.
Improved Instruction Following: Enhanced capabilities to follow instructions more accurately and reliably.
Enhanced Voice Naturalness: More lifelike and expressive voice output.
Higher Audio Quality: Superior audio quality for a better user experience.
Improved Function Calling: Enhanced ability to call custom code defined by developers.
Image Input Support: Add images to context and discuss them via voice, with no video required.
Check out the model card here: gpt-realtime
Pricing
Pricing for gpt-realtime is 20% lower than for the previous gpt-4o-realtime preview. Pricing is based on usage per 1 million tokens. Below is the breakdown:
Getting Started
gpt-realtime is available on Azure AI Foundry via Azure Models direct from Azure today. We are excited to see how developers and users will leverage these new capabilities to create innovative and impactful solutions. Check out the model on Azure AI Foundry and see detailed documentation in Microsoft Learn docs.
Model Mondays S2E11: Exploring Speech AI in Azure AI Foundry
1. Weekly Highlights
This week’s top news in the Azure AI ecosystem included:
Lakuna, a Copilot Studio Agent for Product Teams: A hackathon project built with Copilot Studio and Azure AI Foundry, Lakuna analyzes your requirements and docs to surface hidden assumptions, helping teams reflect, test, and reduce bias in product planning.
Azure ND H200 v5 VMs for AI: Azure Machine Learning introduced ND H200 v5 VMs, featuring NVIDIA H200 GPUs (over 1TB GPU memory per VM!) for massive models, bigger context windows, and ultra-fast throughput.
Agent Factory Blog Series: The next wave of agentic AI is about extensibility: plug your agents into hundreds of APIs and services using the Model Context Protocol (MCP) for portable, reusable tool integrations.
GPT-5 Tool Calling on Azure AI Foundry: GPT-5 models now support free-form tool calling, with no more rigid JSON! Output SQL, Python, configs, and more in your preferred format for natural, flexible workflows.
Microsoft a Leader in 2025 Gartner Magic Quadrant: Azure was again named a leader for Cloud Native Application Platforms, validating its end-to-end runway for AI, microservices, DevOps, and more.
2. Spotlight On: Azure AI Foundry Speech Playground
The main segment featured a live demo of the new Azure AI Speech Playground (now part of Foundry), showing how developers can experiment with and deploy cutting-edge voice, transcription, and avatar capabilities.
Key Features & Demos:
Speech Recognition (Speech-to-Text): Try real-time transcription directly in the playground, recognizing natural speech, pauses, accents, and domain terms. Batch and Fast transcription options are available for large files and blob storage.
Custom Speech: Fine-tune models for your industry, vocabulary, and noise conditions.
Text to Speech (TTS): Instantly convert text into natural, expressive audio in 150+ languages with 600+ neural voices. Demo: Listen to pre-built voices and explore whispering, cheerful, angry, and other styles.
Custom Neural Voice: Clone and train your own professional or personal voice (with strict Responsible AI controls).
Avatars & Video Translation: Bring your apps to life with prebuilt avatars and video translation, which syncs voice-overs to speakers in multilingual videos.
Voice Live API: Voice Live API (Preview) integrates all premium speech capabilities with large language models, enabling real-time, proactive voice agents and chatbots. Demo: Language learning agent with voice, avatars, and proactive engagement. One-click code export for deployment in your IDE.
3. Customer Story: Hilo Health
This week’s customer spotlight featured Hilo Health, a healthcare technology company using Azure AI to boost efficiency for doctors, staff, and patients.
How Hilo Uses Azure AI:
Document Management: Automates fax/document filing, splits multi-page faxes by patient, and reduces staff effort and errors using Azure Computer Vision and Document Intelligence.
Ambient Listening: Ambient clinical note transcription captures doctor-patient conversations and summarizes them for easy EHR documentation.
Genie AI Contact Center: Agentic voice assistants handle patient calls, book appointments, answer billing/refill questions, escalate to humans, and assist human agents, using Azure Communication Services, Azure Functions, FastAPI (community), and Azure OpenAI.
Conversational Campaigns: Outbound reminders, procedure preps, and follow-ups are all handled by voice AI, freeing up human staff.
Impact: Hilo reaches 16,000+ physician practices and 180,000 providers, automates millions of communications, and processes $2B+ in payments annually, demonstrating how multimodal AI transforms patient journeys from first call to post-visit care.
4. Key Takeaways
Here’s what you need to know from S2E11:
Speech AI is Accessible: The Azure AI Foundry Speech Playground makes experimenting with voice recognition, TTS, and avatars easy for everyone.
From Playground to Production: Fine-tune, export code, and deploy speech models in your own apps with Azure Speech Service. Responsible AI Built-In: Custom Neural Voice and avatars require application and approval, ensuring ethical, secure use. Agentic AI Everywhere: Voice Live API brings real-time, multimodal voice agents to any workflow. Healthcare Example: Hilo’s use of Azure AI shows the real-world impact of speech and agentic AI, from patient intake to after-visit care. Join the Community: Keep learning and building—join the Discord and Forum. Sharda's Tips: How I Wrote This Blog I organize key moments from each episode, highlight product demos and customer stories, and use GitHub Copilot for structure. For this recap, I tested the Speech Playground myself, explored the docs, and summarized answers to common developer questions on security, dialects, and deployment. Here’s my favorite Copilot prompt this week: "Generate a technical blog post for Model Mondays S2E11 based on the transcript and episode details. Focus on Azure Speech Playground, TTS, avatars, Voice Live API, and healthcare use cases. Add practical links for developers and students!" Coming Up Next Week Next week: Observability! Learn how to monitor, evaluate, and debug your AI models and workflows using Azure and OpenAI tools. Register For The Livestream – Sep 1, 2025 Register For The AMA – Sep 5, 2025 Ask Questions & View Recaps – Discussion Forum About Model Mondays Model Mondays is your weekly Azure AI learning series: 5-Minute Highlights: Latest AI news and product updates 15-Minute Spotlight: Demos and deep dives with product teams 30-Minute AMA Fridays: Ask anything in Discord or the forum Start building: Register For Livestreams Watch Past Replays Register For AMA Recap Past AMAs Join The Community Don’t build alone! 
The Azure AI Developer Community is here for real-time chats, events, and support: Join the Discord. Explore the Forum.
About Me
I'm Sharda, a Gold Microsoft Learn Student Ambassador focused on cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. In this blog series, I share takeaways from each week’s Model Mondays livestream.
Build recap: new Azure AI Foundry resource, Developer APIs and Tools
At Microsoft Build 2025, we introduced Azure AI Foundry resource, Azure AI Foundry API, and supporting tools to streamline the end-to-end development lifecycle of AI agents and applications. These capabilities are designed to help developers accelerate time-to-market; support production-scale workloads with scale and central governance; and support administrators with a self-serve capability to enable their teams’ experimentation with AI in a controlled environment. The Azure AI Foundry resource type unifies agents, models and tools under a single management grouping, equipped with built-in enterprise-readiness capabilities — such as tracing & monitoring, agent and model-specific evaluation capabilities, and customizable enterprise setup configurations tailored to your organizational policies like using your own virtual networks. This launch represents our commitment to providing organizations with a consistent, efficient and centrally governable environment for building and operating the AI agents and applications of today, and tomorrow. New platform capabilities The new Foundry resource type evolves our vision for Azure AI Foundry as a unified Azure platform-as-a-service offering, enabling developers to focus on building applications rather than managing infrastructure, while taking advantage of native Azure platform capabilities like Azure Data and Microsoft Defender. Previously, Azure AI Foundry portal’s capabilities required the management of multiple Azure resources and SDKs to build an end-to-end application. New capabilities include: Foundry resource type enables administrators with a consistent way of managing security and access to Agents, Models, Projects, and Azure tooling Integration. With this change, Azure Role Based Access Control, Networking and Policies are administered under a single Azure resource provider namespace, for streamlined management. 
‘Azure AI Foundry’ is a renaming of the former ‘Azure AI Services’ resource type, with access to new capabilities. While Azure AI Foundry still supports bring-your-own Azure resources, we now default to a fully Microsoft-managed experience, making it faster and easier to get started. Foundry projects are folders that enable developers to independently create new environments for exploring new ideas and building prototypes, while managing data in isolation. Projects are child resources; they may be assigned their own admin controls but by default share common settings such as networking or connected resource access from their parent resource. This principle aims to take IT admins out of the day-to-day loop once security and governance are established at the resource level, enabling developers to self-serve confidently within their projects. Azure AI Foundry API is designed from the ground up, to build and evaluate API-first agentic applications, and lets you work across model providers agnostically with a consistent contract. Azure AI Foundry SDK wraps the Foundry API making it easy to integrate capabilities into code whether your application is built in Python, C#, JavaScript/TypeScript or Java. Azure AI Foundry for VS Code Extension complements your workflow with capabilities to help you explore models, and develop agents and is now supported with the new Foundry project type. New built-in RBAC roles provide up-to-date role definitions to help admins differentiate access between Administrator, Project Manager and Project users. Foundry RBAC actions follow strict control- and data plane separation, making it easier to implement the principle of least privilege. Why we built these new platform capabilities If you are already building with Azure AI Foundry -- these capabilities are meant to simplify platform management, enhance workflows that span multiple models and tools, and reinforce governance capabilities, as we see AI workloads grow more complex. 
The emergence of generative AI fundamentally changed how customers build AI solutions, requiring capabilities that span multiple traditional domains. We launched Azure AI Foundry to provide a comprehensive toolkit for exploring, building and evaluating this new wave of GenAI solutions. Initially, this experience was backed by two core Azure services -- Azure AI Services for accessing models including those from OpenAI, and Azure Machine Learning’s hub, to access tools for orchestration and customization. With the emergence of AI agents composing models and tools; and production workloads demanding the enforcement of central governance across those, we are investing to bring the management of agents, models and their tooling integration layer together to best serve these workload’s requirements. The Azure AI Foundry resource and Foundry API are purposefully designed to unify and simplify the composition and management of core building blocks of AI applications: Models Agents & their tools Observability, Security, and Trust In this new era of AI, there is no one-size-fits-all approach to building AI agents and applications. That's why we designed the new platform as a comprehensive AI factory with modular, extensible, and interoperable components. Foundry Project vs Hub-Based Project Going forward, new agents and model-centric capabilities will only land on the new Foundry project type. This includes access to Foundry Agent Service in GA and Foundry API. While we are transitioning to Azure AI Foundry as a managed platform service, hub-based project type remains accessible in Azure AI Foundry portal for GenAI capabilities that are not yet supported by the new resource type. Hub-based projects will continue to support use cases for custom model training in Azure Machine Learning Studio, CLI and SDK. For a full overview of capabilities supported by each project type, see this support matrix. 
Azure AI Foundry Agent Service
The Azure AI Foundry Agent Service experience, now generally available, is powered by the new Foundry project. Existing customers exploring the GA experience will need the new AI Foundry resource. All new investments in the Azure AI Foundry Agent Service are focused on the Foundry project experience. Foundry projects act as secure units of isolation and collaboration. Agents within a project share:
File storage
Thread storage (i.e. conversation history)
Search indexes
You can also bring your own Azure resources (e.g., storage, bring-your-own virtual network) to support compliance and control over sensitive data.
Start Building with Foundry
Azure AI Foundry is your foundation for scalable, secure, and production-grade AI development. Whether you're building your first agent or deploying a multi-agent workforce at scale, Azure AI Foundry is ready for what’s next.
The Future of AI: How Lovable.dev and Azure OpenAI Accelerate Apps that Change Lives
Discover how Charles Elwood, a Microsoft AI MVP and TEDx Speaker, leverages Lovable.dev and Azure OpenAI to create impactful AI solutions. From automating expense reports to restoring voices, translating gestures to speech, and visualizing public health data, Charles's innovations are transforming lives and democratizing technology. Follow his journey to learn more about AI for good.
AI Avatars: Redefining Human-Digital Interaction in the Enterprise Era
In today’s AI-driven world, businesses are constantly seeking innovative ways to humanize digital experiences. AI Avatars are emerging as a powerful solution—bridging the gap between intelligent automation and authentic, human-like engagement. With advancements in speech synthesis, large language models, and avatar rendering technologies, organizations can now deploy AI-powered digital assistants that not only understand and respond but also interact with a lifelike presence. The Rise of AI Avatars in Enterprise Applications AI Avatars go beyond traditional chatbots or voice assistants. These virtual beings offer multimodal interaction—combining voice, visual cues, and conversational intelligence into a seamless user experience. Built on enterprise-grade platforms like Azure AI, these avatars can be integrated into customer support portals, digital kiosks, internal knowledge hubs, and more. Their utility spans a range of industries: Retail: Personalized shopping assistants that guide consumers through products. Healthcare: Virtual health concierges that help patients navigate care. Education: Interactive tutors that deliver lessons with empathy and responsiveness. HR and Training: Onboarding avatars that answer employee questions, onboard new hires, or provide compliance updates. One of our key partners, Cloudforce, has integrated AI Avatar technology directly into their flagship platform nebulaONE®. This integration enables enterprises to deploy digital assistants that are deeply embedded in business processes, offering contextualized support and real-time engagement. From training and onboarding to employee self-service, nebulaONE's agentic AI Avatars act as a digital bridge between users and systems—driving efficiency, engagement, and satisfaction. Partner Spotlight: Cloudforce’s Avatar Initiative To operationalize and productize AI Avatars, Microsoft collaborates with a growing ecosystem of partners. Cloudforce is one of the early pioneers in this space. 
Their work in embedding avatars into nebulaONE demonstrates what’s possible when advanced AI meets real-world enterprise needs. With a vision to transform user interaction across industries, Cloudforce built a production-grade AI Avatar module designed to support customer Q&A, knowledge discovery, and live guided walkthroughs. Leveraging Azure OpenAI, Azure AI Speech, and privately-deployed secure cloud infrastructure, they have brought conversational intelligence to life—with both a face and a voice. Looking ahead, Cloudforce’s broader vision is to bring AI Avatar capabilities to millions of students—delivering immersive learning experiences that blend interactivity, personalization, and scale. Their education-focused roadmap enhancements highlight the potential of avatars not just as productivity agents, but as accessible and empathetic digital educators, delivering equitable access to knowledge previously reserved for a fortunate few. This kind of partner innovation illustrates how AI Avatars can be customized and scaled to deliver tangible business value across multiple domains. Partner Contribution "Students are already embracing generative AI at a pace and proficiency that far exceeds many professional audiences. With Azure's AI Avatar technology, educators and institutions can tailor unique GenAI interactions that promote reasoning and learning over simply receiving answers the way they would with common public bots." says Husein Sharaf, Founder and CEO at Cloudforce. "We understand the concerns and hesitation that our education partners are currently grappling with, however we believe they can and should take an active role in shaping how this transformative technology is leveraged across their campuses, or risk being left behind as students choose their own adventure." "Microsoft's enterprise AI capabilities are enabling partners like us to deliver secure, cost-efficient, and responsible AI experiences at scale. 
With the Azure AI Foundry and key innovations like AI Avatars as our building blocks, the nebulaONE platform is poised to serve as the GenAI gateway to tens of thousands of business users, and millions of students at leading educational institutions globally. Our customers are seeking unique differentiators that will enable them to compete and win in the age of AI, and our collaboration with Microsoft is empowering us to deliver just that."
Summary
AI Avatars represent the next frontier in digital interaction. By combining conversational AI, expressive voice synthesis, and realistic visual rendering, these intelligent agents deliver truly human-like experiences at scale. They are not just tools, but digital extensions of your brand. Partners like Cloudforce are leading the way with innovative platforms like nebulaONE, showing how this technology can be embedded into enterprise solutions and educational experiences to drive efficiency with a human touch. While Cloudforce is among the first to productize AI Avatars using Azure AI, they are part of a growing movement, helping to shape the future of AI-powered experiences across industries. As AI continues to evolve, avatars will become a standard interface, transforming the way we learn, work, and engage with digital systems.
Project Maria: Bringing Speech and Avatars Together for Next-Generation Customer Experiences
In an age where digital transformation influences nearly every aspect of business, companies are actively seeking innovative ways to differentiate their customer interactions. Traditional text-based chatbots, while helpful, often leave users wanting a more natural, personalized, and efficient experience. Imagine hosting a virtual brand ambassador, a digital twin of yourself or your organization’s spokesperson, capable of answering customer queries in real time with a lifelike voice and expressive 2D or 3D face. This is where Project Maria comes in.
Project Maria is an internal Microsoft initiative that integrates cutting-edge speech-to-text (STT), text-to-speech (TTS), large language model, and avatar technologies. Using Azure AI Speech and custom neural voice models, it seeks to create immersive, personalized interactions for customers, reducing friction, increasing brand loyalty, and opening new business opportunities in areas such as customer support, product briefings, digital twins, live marketing events, safety briefings, and beyond.
In this blog post, we will dive into:
The Problem and Rationale for evolving beyond basic text-based solutions.
The Speech-to-Text (STT) and Text-to-Speech (TTS) Pipelines and the Azure OpenAI GPT-4o Real-Time API that power natural conversations.
Avatar Models in Azure, including off-the-shelf 2D avatars and fully custom avatars.
Neural Voice Model Creation, from data gathering to training and deployment on Azure.
Security and Compliance considerations for handling sensitive voice assets and data.
Use Cases from customer support to digital brand ambassadors and safety briefings.
Real-World Debut of Project Maria, showcased at the AI Leaders’ Summit in Seattle.
Future Outlook on how custom avatars will reshape business interactions, scale presence, and streamline time-consuming tasks.
If you’re developing or considering custom neural voice and avatar models for your product or enterprise, this post will guide you through both conceptual and technical details to help you get started, and highlight where the field is heading next.
1. The Problem: Limitations of Text-Based Chatbots
1.1 Boredom and Fatigue in Text Interactions
Text-based chatbots have come a long way, especially with the advent of powerful Large Language Models (LLMs) and Small Language Models (SLMs). Despite these innovations, interactions can still become tedious, often requiring users to spend significant personal time crafting the right questions. Many of us have experienced chatbots that respond with excessively verbose or repetitive messages, leading to boredom or even frustration. In industries that demand immediacy, like healthcare, finance, or real-time consumer support, purely text-based exchanges can feel slow and cumbersome. Moreover, text chat requires a user’s full attention to read and type, whether in a busy contact center environment or an internal knowledge base where employees juggle multiple tasks.
1.2 Desire for More Engaging and Efficient Modalities
Today’s users expect something closer to human conversation. Devices ranging from smartphones to smart speakers and in-car infotainment systems have normalized voice-based interfaces. Adding an avatar, whether a 2D or 3D representation, deepens engagement by combining speech with a friendly visual persona. This can elevate brand identity: an avatar that looks, talks, and gestures like your company’s brand ambassador or a well-known subject-matter expert.
1.3 The Need for Scalability
In a busy customer support environment, human representatives simply can’t handle an infinite volume of conversations or offer 24/7 coverage across multiple channels. Automation is essential, yet providing high-quality automated interactions remains challenging.
While a text-based chatbot might handle routine queries, a voice-based, avatar-enabled agent can manage more complex requests with greater dynamism and personality. By giving your digital support assistant both a “face” and a voice aligned with your brand, you can foster deeper emotional connections and provide a more genuine, empathetic experience. This blend of automation and personalization scales your support operations, ensuring higher customer satisfaction while freeing human agents to focus on critical or specialized tasks.
2. The Vision: Project Maria’s Approach
Project Maria addresses these challenges by creating a unified pipeline that supports:
Speech-to-Text (STT) for recognizing user queries quickly and accurately.
Natural Language Understanding (NLU) layers (potentially leveraging Azure OpenAI or other large language models) for comprehensive query interpretation.
Text-to-Speech (TTS) that returns highly natural-sounding responses, possibly in multiple languages, with customized prosody and style.
Avatar Rendering, which can be a 2D animated avatar or a more advanced 3D digital twin, bringing personality and facial expressions to the conversation.
By using Azure AI Services, particularly the Speech and Custom Neural Voice offerings, Project Maria can deliver brand-specific voices. This ensures that each brand or individual user’s avatar can match (or approximate) a signature voice, turning a run-of-the-mill voice assistant into a truly personal digital replica.
3. Technical Foundations
3.1 Speech-to-Text (STT)
At the heart of the system is Azure AI Services for Speech, which provides:
Real-time transcription capabilities with a variety of languages and dialects.
Noise suppression, ensuring robust performance in busy environments.
Streaming APIs, critical for real-time or near-real-time interactions.
When a user speaks, audio data is captured (for example, via a web microphone feed or a phone line) and streamed to the Azure service.
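To make the streaming step concrete, here is a minimal sketch of how captured audio might be cut into fixed-duration frames before being sent to a speech service. The audio format (16 kHz, 16-bit, mono PCM), the 100 ms frame length, and the function names are all assumptions chosen for illustration, not details of Project Maria.

```python
from typing import Iterator

# Assumed capture format: 16 kHz, 16-bit (2 bytes per sample), mono PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1


def frame_size_bytes(frame_ms: int) -> int:
    """Number of bytes that hold frame_ms milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * frame_ms // 1000


def iter_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
frames = list(iter_frames(one_second_of_silence))
print(len(frames), len(frames[0]))
```

Smaller frames lower the delay before the first partial transcript can arrive but increase the number of network writes, so the frame length is a tuning choice rather than a fixed rule.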
The recognized text is returned in segments, which the NLU or conversation manager can interpret.
3.1.1 Audio Pipeline
Capture: The user’s microphone audio is captured by a front-end (e.g., a web app, mobile app, or IoT device).
Pre-processing: Noise reduction or volume normalization might be applied locally or in the cloud, ensuring consistent input.
Azure STT Ingestion: Data is sent to the Speech service endpoint, authenticated via subscription keys or tokens (more on security later).
Result Handling: The recognized text arrives as partial hypotheses (partial transcripts) and final recognized segments. Project Maria (Custom Avatar) processes these results to understand user intent.
3.2 Text-to-Speech (TTS)
Once an intent is identified and a response is formulated, the system needs to deliver speech output.
Standard Neural Voices: Microsoft provides a wide range of prebuilt voices in multiple languages.
Custom Neural Voice: For an even more personalized experience, you can train a voice model that matches a brand spokesperson or a distinct voice identity. This is done using your custom datasets, so that the final system sounds like the recorded persona.
3.2.1 Voice Font Selection and Configuration
In a typical architecture:
The conversation manager (which could be an orchestrator or a custom microservice) provides the text output to the TTS service.
The TTS service uses a configured voice font, such as en-US-JennyNeural or a custom neural voice ID (like Maria Neural Voice) if you have a specialized voice model.
The synthesized audio is returned as an audio stream (e.g., PCM or MP3). You can play this in a webpage directly or in a native app environment.
Azure OpenAI GPT-4o Real-Time API integrates with Azure's Speech Services to enable seamless interactions. First, your speech is transcribed in near real time. GPT-4o then processes this text to generate context-aware responses, which are converted to natural-sounding audio via Azure TTS.
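The transcribe, respond, synthesize sequence described here can be sketched as three pluggable stages. This is an illustrative skeleton only: the class name `VoicePipeline` and the stub stages are hypothetical, and in a real system each callable would wrap the corresponding Azure service call rather than the toy lambdas used here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """One conversational turn: audio in, text reply, audio out."""
    transcribe: Callable[[bytes], str]   # stands in for speech-to-text
    respond: Callable[[str], str]        # stands in for the language model
    synthesize: Callable[[str], bytes]   # stands in for text-to-speech

    def run(self, audio_in: bytes) -> tuple[str, str, bytes]:
        user_text = self.transcribe(audio_in)
        reply_text = self.respond(user_text)
        audio_out = self.synthesize(reply_text)
        return user_text, reply_text, audio_out


# Stub stages so the sketch runs without any cloud service.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)

user_text, reply_text, audio_out = pipeline.run(b"hello")
print(user_text, "|", reply_text)
```

Passing the stages in as callables keeps the turn logic independent of any one provider, which makes it straightforward to swap a stub for a real client or to time each stage separately when tuning latency.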
This audio is synchronized with avatar models to create a lifelike, engaging interface. 3.3 Real-Time Conversational Loop Maria is designed for real-time or text-to-speech conversations. The user’s speech is continuously streamed to Azure STT. The recognized text triggers a real-time inference step for the next best action or response. The response is generated by an Azure OpenAI model (like GPT-4o) or another LLM/SLM. The text is then synthesized to speech, which the user hears with minimal latency. 3.4 Avatars: 2D and Beyond 3.4.1 Prebuilt Azure 2D Avatars Azure AI Speech Services includes an Avatar capability that can be activated to display a talking head or a 2D animated character. Developers can: Choose from prebuilt characters or import basic custom animations. Synchronize lip movements to the TTS output. Overlay brand-specific backgrounds or adopt transparency for embedding in various UIs. 3.4.2 Fully Custom Avatars (Customer Support Agent Like Maria) For organizations wanting a customer support agent, subject-matter expert, or brand ambassador: Capture: Record high-fidelity audio and video of the person you want to replicate. The more data, the better the outcome (though privacy and licensing must be considered). Modeling: Use advanced 3D or specialized 2D animation software (or partner with Microsoft’s custom avatar creation solutions) to generate a rigged model that matches the real person’s facial geometry and expressions. Integration: Once the model is rigged, it can be integrated with the TTS engine. As text is converted to speech, the avatar automatically animates lip shapes and facial expressions in near real time. 3.5 Latency and Bandwidth Considerations When building an interactive system, keep an eye on: Network latency: Real-time STT and TTS require stable, fast connections. Compute resources: If hosting advanced ML or high concurrency, scaling containers (e.g., via Docker and Kubernetes) is critical.
Avatars: Real-time animation might require sending frames or instructions to a client’s browser or device. 4. Building the Model: Neural Voice Model Creation 4.1 Data Gathering To train a custom neural voice, you typically need: High-quality audio clips: Ideally recorded in a professional studio to minimize background noise, with the same microphone setup throughout. Matching transcripts for each clip. Minimum data duration: Microsoft recommends a certain threshold (e.g., 300+ utterances, typically around 30 minutes to a few hours of recorded speech, depending on the complexity of the final voice needed). 4.2 Training Process Data Upload: Use the Azure Speech portal or APIs to upload your curated dataset. Model Training: Azure runs training jobs that often require a few hours (or more). This step includes: Acoustic feature extraction (spectrogram analysis). Language or phoneme modeling for the relevant language and accent. Prosody tuning, ensuring the voice can handle various styles (cheerful, empathetic, urgent, etc.). Quality Checks: After training, you receive an initial voice model. You can generate test phrases to assess clarity, intonation, and overall quality. Iteration: If the voice quality is not satisfactory, you gather more data or refine the existing data (removing noisy segments or inaccurate transcripts). 4.3 Deployment Once satisfied with the custom neural voice: Deploy the model to an Azure endpoint within your subscription. Configure your TTS engine to use the custom endpoint ID instead of a standard voice. 5. Securing Avatar and Voice Models Security is paramount when personal data, brand identity, or intellectual property is on the line. 5.1 API Keys and Endpoints Azure AI Services requires an API key or an OAuth token to access STT/TTS features. Store keys in Azure Key Vault or as secure environment variables. Avoid hard-coding them in the front-end or source control. 
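To make the "secure environment variables" advice in 5.1 concrete, here is a minimal Python sketch of loading Speech credentials at startup rather than hard-coding them. It is an illustration only, not Project Maria's actual code: the variable names SPEECH_KEY and SPEECH_REGION, the SpeechSettings class, and the mask helper are all assumptions introduced for this example. Fetching the values from Azure Key Vault would need the Azure SDK and is not shown.

```python
import os
from dataclasses import dataclass


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def mask(secret: str) -> str:
    """Hide all but the last four characters of a secret for safe logging."""
    if len(secret) <= 4:
        return "*" * len(secret)
    return "*" * (len(secret) - 4) + secret[-4:]


@dataclass(frozen=True, repr=False)
class SpeechSettings:
    # Hypothetical container for the credentials the STT/TTS client needs.
    key: str
    region: str

    def __repr__(self) -> str:
        # Never print the full key, even in debug output or logs.
        return f"SpeechSettings(key='{mask(self.key)}', region='{self.region}')"


def load_speech_settings(env=None) -> SpeechSettings:
    """Read the Speech key and region from environment variables.

    SPEECH_KEY and SPEECH_REGION are assumed names for this sketch.
    Pass a dict as `env` to test without touching the real environment.
    """
    env = os.environ if env is None else env
    required = ("SPEECH_KEY", "SPEECH_REGION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail fast at startup instead of at the first STT/TTS request.
        raise MissingSecretError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return SpeechSettings(key=env["SPEECH_KEY"], region=env["SPEECH_REGION"])


if __name__ == "__main__":
    # Demo values only; a real key must never appear in source control.
    demo_env = {"SPEECH_KEY": "0123456789abcdef", "SPEECH_REGION": "westus2"}
    print(load_speech_settings(demo_env))
```

Failing fast on a missing variable surfaces configuration mistakes at deployment time, and the masked repr keeps the key out of logs. The same loader works unchanged when the container orchestrator injects the values as secrets.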
5.2 Access Control Role-Based Access Control (RBAC) at both Azure subscription level and container (e.g., Docker or Kubernetes) level ensures only authorized personnel can deploy or manage the containers running these services. Network Security: Use private endpoints if you want to limit exposure to the public internet. 5.3 Intellectual Property Concerns Avatar and Voice Imitation: An avatar model or custom neural voice that mimics a specific individual must be authorized by that individual. Azure has a verification process in place to ensure consent. Data Storage: The training audio data and transcripts must be securely stored, often with encryption at rest and in transit. 6. Use Cases: Bringing It All Together 6.1 Customer Support A digital avatar that greets users on a website or mobile app can handle first-level queries: “Where can I find my billing information?” “What is your return policy?” By speaking these answers aloud with a friendly face and voice, the experience is more memorable and can reduce queue times for human agents. If the question is too complex, the avatar can seamlessly hand off to a live agent. Meanwhile, transcripts of the entire conversation are stored (e.g., in Azure Cosmos DB), enabling data analytics and further improvements to the system. 6.2 Safety Briefings and Public Announcements Industries like manufacturing, aviation, or construction must repeatedly deliver consistent safety messages. A personal avatar can recite crucial safety protocols in multiple languages, ensuring nothing is lost in translation. Because the TTS voice is consistent, workers become accustomed to the avatar’s instructions. Over time, you could even create a brand or site-specific “Safety Officer” avatar that fosters familiarity. 6.3 Digital Twins at Live Events Suppose you want your company’s spokesperson to simultaneously appear at multiple events across the globe.
With a digital twin: The spokesperson’s avatar and voice “present” in real time, responding to local audience questions. This can be done in multiple languages, bridging communication barriers instantaneously. Attendees get a sense of personal interaction, while the real spokesperson can focus on core tasks, or appear physically at another event entirely. 6.4 AI Training and Education In e-learning platforms, a digital tutor can guide students through lessons, answer questions in real time, and adapt the tone of voice based on the difficulty of the topic or the student’s performance. By offering a face and voice, the tutor becomes more engaging than a text-only system. 7. Debut: Maria at the AI Leaders Summit in Seattle Project Maria had its first major showcase at the AI Leaders Summit in Seattle last week. We set up a live demonstration: Live Conversations: Attendees approached a large screen that displayed Maria’s 2D avatar. On-the-Fly: Maria recognized queries with STT, generated text responses from an internal knowledge base (powered by GPT-4o or domain-specific models), then spoke them back with a custom Azure neural voice. Interactive: The avatar lip-synced to the output speech, included animated gestures for emphasis, and even displayed text-based subtitles for clarity. The response was overwhelmingly positive. Customers praised the fluid voice quality and the lifelike nature of Maria’s avatar. Many commented that they felt they were interacting with a real brand ambassador, especially because the chosen custom neural voice had just the right inflections and emotional range. 8. Technical Implementation Details Below is a high-level architecture of how Project Maria might be deployed using containers and Azure resources. Front-End Web App: Built with a modern JavaScript framework (React, Vue, Angular, etc.). Captures user audio through the browser’s WebRTC or MediaStream APIs. Connects via WebSockets or RESTful endpoints for STT requests. 
Renders the avatar in a <canvas> element or using a specialized avatar library. Backend: Containerized with Docker. Exposes endpoints for STT streaming (optionally passing data directly to Azure for transcription). Integrates with the TTS service, retrieving synthesized audio buffers. Returns the audio back to the front-end in a continuous stream for immediate playback. Avatar Integration: The back-end or a specialized service handles lip-sync generation (e.g., via phoneme mapping from the TTS output). The front-end renders the 2D or 3D avatar in sync with the audio playback. This can be done by streaming timing markers that indicate which phoneme is currently active. Data and Conversation Storage: Use an Azure Cosmos DB or a similar NoSQL solution to store transcripts, user IDs, timestamps, and optional metadata (e.g., conversation sentiment). This data can later be used to improve the conversation model, evaluate performance, or train advanced analytics solutions. Security: All sensitive environment variables (like Azure API keys) are loaded securely, either through Azure Key Vault or container orchestration secrets. The system enforces user authentication if needed. For instance, an internal HR system might restrict the avatar-based service to employees only. Scaling: Deploy containers in Azure Kubernetes Service (AKS), setting up auto-scaling to handle peak loads. Monitor CPU/memory usage, as well as TTS quota usage. For STT, ensure the service tier can handle simultaneous requests from multiple users. 9. Securing Avatar Models and Voice Data 9.1 Identity Management Each avatar or custom neural voice is tied to a specific subscription. Using Azure Active Directory (Azure AD), you can give fine-grained permissions so that only authorized DevOps or AI specialists can alter or redeploy the voice. 9.2 API Gateways and Firewalls For enterprise contexts, you might place an API Gateway in front of your containerized services. 
This central gateway can: Inspect requests for anomalies. Enforce rate limits. Log traffic to meet compliance or auditing requirements. 9.3 Key Rotation and Secrets Management Rotate keys frequently to minimize the risk of compromised credentials. Tools like Azure Key Vault or GitHub’s secret storage features can automate the rotation process, ensuring minimal downtime. 10. The Path Forward: Scaling Custom Avatar 10.1 Extended Personalization While Project Maria currently focuses on voice and basic facial expressions, future expansions include: Emotion Synthesis: Beyond standard TTS expressions (friendly, sad, excited), we can integrate emotional AI to dynamically adjust the avatar’s tone based on user sentiment. Gesture Libraries: 2D or 3D avatars can incorporate hand gestures, posture changes, or background movements to mimic a real person in conversation. This reduces the “uncanny valley” effect. 10.2 Multilingual, Multimodal As businesses operate globally, multilingual interactions become paramount. We have seen many use cases that need to: Auto-detect language from a user’s speech and respond in kind. Offer real-time translation, bridging non-English speakers to brand content. 10.3 Agent Autonomy Systems like Maria won’t just respond to direct questions; they can also act proactively: Send voice-based notifications or warnings when critical events happen. Manage long-running tasks such as scheduling or triaging user requests, akin to an “executive assistant” for multiple users simultaneously. 10.4 Ethical and Social Considerations With near-perfect replicas of voices, there is a growing concern about identity theft, misinformation, and deepfakes. Companies implementing digital twins must: Secure explicit consent from individuals. Implement watermarking or authentication for voice data. Educate customers and employees on usage boundaries and disclaimers. 11.
Conclusion Project Maria represents a significant leap in how businesses and organizations can scale their presence, offering a humanized, voice-enabled digital experience. By merging speech-to-text, text-to-speech, and avatar technologies, you can: Boost Engagement: A friendly face and familiar voice can reduce user fatigue and build emotional resonance. Extend Brand Reach: Appear in many locations at once via digital twins, creating personalized interactions at scale. Streamline Operations: Automate repetitive queries while maintaining a human touch, freeing up valuable employee time. Ensure Security and Compliance: By using Azure’s robust ecosystem of services and best practices for voice data. As demonstrated at the AI Leaders Summit in Seattle, Maria is already reshaping how businesses think about communication. The synergy of avatars, neural voices, and secure, cloud-based AI is paving the way for the next frontier in customer interaction. Looking ahead, we anticipate that digital twins—like Maria—will become ubiquitous, automating not just chat responses but a wide range of tasks that once demanded human presence. From personalized marketing to advanced training scenarios, the possibilities are vast. In short, the fusion of STT, TTS, and avatar technologies is more than a novel gimmick; it is an evolution in human-computer interaction. By investing in robust pipelines, custom neural voice training, and carefully orchestrated containerized deployments, businesses can unlock extraordinary potential. Project Maria is our blueprint for how to do it right—secure, customizable, and scalable—helping organizations around the world transform user experiences in ways that are both convenient and captivating. If you’re looking to scale your brand, innovate in human-machine dialogues, or harness the power of digital twins, we encourage you to explore Azure AI Services’ STT, TTS, and Avatar solutions. 
Together, these advancements promise a future where your digital self (or brand persona) can meaningfully interact with users anytime, anywhere. Detailed Technical Implementation:- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/what-is-custom-text-to-speech-avatar Text to Speech with Multi-Agent Orchestration Framework:- https://github.com/ganachan/Project_Maria_Accelerator_tts