azure speech
6 Topics

Introducing MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Microsoft Foundry
Another Step Towards a Complete AI Platform

Since inception, our goal with Microsoft Foundry has been to deliver the most complete AI and app agent factory, giving developers access to the latest frontier models, tools, infrastructure, security, and reliability to confidently build and scale their AI solutions. Today, we're taking another step towards that vision by announcing the public preview of three new models from Microsoft AI in Microsoft Foundry:

- MAI-Transcribe-1: Our first-generation speech recognition model, delivering enterprise-grade accuracy across 25 languages at approximately 50% lower GPU cost than leading alternatives.
- MAI-Voice-1: A high-fidelity speech generation model capable of producing 60 seconds of expressive audio in under one second on a single GPU.
- MAI-Image-2: Our highest-capability text-to-image model, which debuted at #3 on the Arena.ai leaderboard for image model families.

These are the same models already powering our own products such as Copilot, Bing, PowerPoint, and Azure Speech, and now they're available exclusively on Foundry for developers to use. We can't wait to see what you create with these new multimedia AI models in public preview. Read on for a deeper look at each model's capabilities and how to start building with them in Foundry!

MAI-Transcribe-1 & Voice-1: End-To-End Voice Experiences

Voice and speech are rapidly becoming the primary interface for the next generation of AI agents, and building great voice experiences requires models that can both speak and listen with precision. With MAI-Voice-1 and MAI-Transcribe-1, Microsoft is delivering exactly that: a comprehensive, first-party audio AI stack purpose-built for developers. MAI-Voice-1 is a lightning-fast speech generation model capable of producing a full minute of audio in under a second on a single GPU, making it one of the most efficient speech systems available today.
On the listening side, MAI-Transcribe-1 supports up to 25 languages and is engineered for enterprise-grade reliability across accents, languages, and real-world audio conditions. But what truly sets it apart is its efficiency: when benchmarked against leading transcription models, MAI-Transcribe-1 delivers competitive accuracy at nearly half the GPU cost, an advantage that translates directly into more predictable, scalable pricing for enterprises [1].

Use cases for MAI-Transcribe-1 and MAI-Voice-1

MAI-Voice-1 and MAI-Transcribe-1 are designed for production use across a broad set of real-world scenarios:

- Conversational AI & Agent Assist: Enable real-time transcription for IVR systems, virtual assistants, and call-center workflows to power voice-driven interfaces, live agent assist, and post-call summarization.
- Live Captioning & Accessibility: Deliver real-time captions for large events, enterprise meetings, and digital communications to improve accessibility and inclusivity across spoken experiences.
- Media, Subtitling & Archiving: Automate video subtitling, dialogue indexing, and transcription to support scalable content production, searchability, and long-term media archiving.
- Education & Training Platforms: Transcribe lectures, learning modules, and certification programs to enhance discoverability, reviewability, and knowledge retention in e-learning environments.
- Customer & Market Insights: Convert spoken interactions across research interviews, focus groups, and support channels into structured data for downstream analytics and business intelligence.

We're also applying these model capabilities inside Microsoft's own products. MAI-Voice-1 powers the expressive voice experiences in Copilot's Audio Expressions and podcast features. MAI-Transcribe-1 drives Copilot's Voice Mode transcriptions and the new dictation feature, connecting natural voice input with the generative power of Copilot's language models.
Both models are available through Azure Speech, where developers can tap into first-party MAI model quality alongside the enterprise-grade reliability, scalability, and 700+ voice gallery of the Azure Speech ecosystem.

Try MAI-Transcribe-1 & Voice-1 Today

MAI-Transcribe-1 and Voice-1 are available now through Azure Speech. Here's how to get started:

- Experiment in MAI Playground: Speak, record, or upload audio to see the models in action at the MAI playground.
- Build in Foundry: Deploy MAI-Transcribe-1 and MAI-Voice-1 in Azure Speech.

MAI-Transcribe-1 starts at $0.36 USD per hour, while MAI-Voice-1 pricing starts at $22 USD per 1M characters. Developers looking to create custom voices using MAI-Voice-1 can do so through the Personal Voice feature in Azure Speech, including the ability to clone a voice from a short 10-second audio sample. Note that custom voice creation requires an approval process consistent with Microsoft's responsible AI policies.

MAI-Image-2: Limitless Creativity For Every Builder

Images are at the center of how developers build compelling AI-powered creative experiences, from marketing tools to content platforms to multimodal agents. MAI-Image-2 is Microsoft's answer to that demand. This model has been developed in close collaboration with photographers, designers, and visual storytellers and debuted in the top-3 text-to-image model families on the Arena.ai leaderboard. It raises the bar across the capabilities that matter most in real creative workflows: more natural, photorealistic image generation, stronger in-image text rendering for infographics and diagrams, and greater precision on complex layouts, detailed scenes, and cinematic visuals.
Use cases for MAI-Image-2

Developers can integrate MAI-Image-2 across a range of high-impact workflows:

- Media & Creative Ideation: Designers, illustrators, and creative teams use text-to-image generation to explore visual directions, styles, and compositions early in the creative process, moving from concept to exploration faster.
- Enterprise Communications & Internal Branding: Organizations create custom visuals for internal campaigns, training materials, and executive communications directly from text, ensuring clarity, polish, and brand alignment without relying on stock imagery.
- UX & Product Concept Visualization: Product teams visualize interfaces, workflows, environments, and conceptual product scenarios from text descriptions, helping teams communicate ideas and align early, before engineering or design resources are engaged.

WPP, one of the world's largest marketing and communications groups, is among the first enterprise partners building with MAI-Image-2 at scale, using it to power creative production workflows that previously required significant manual effort.

"MAI-Image-2 is a genuine game-changer. It's a platform that not only responds to the intricate nuance of creative direction, but deeply respects the sheer craft involved in generating real-world, campaign-ready images. WPP has some of the best creative talent in the world and MAI-Image-2 is making them even better."
- Rob Reilly, Global Chief Creative Officer, WPP

We’re also implementing MAI-Image-2 to power image generation within Microsoft’s own products, including Copilot, Bing Image Creator, and PowerPoint, and now you have access to this powerful, cost-effective model for your own apps.

Try MAI-Image-2 Today

- Experiment in the MAI Playground: Preview MAI-Image-2 at MAI playground and share feedback directly with the team.
- Build in Foundry: Deploy MAI-Image-2 via the API and start building your apps and agents!
MAI-Image-2 starts at $5 USD per 1M tokens for text input and $33 USD per 1M tokens for image output. We look forward to your feedback on these models in Foundry.

References:
[1] 1st on overall WER on the FLEURS benchmark. Out of the top 25 global languages, MAI-Transcribe-1 ranks 1st by FLEURS in 11 core languages. It wins against Whisper-large-v3 on the remaining 14 and Gemini 3.1 Flash on 11 of those 14.

240 Views, 0 likes, 0 Comments

Building Knowledge-Grounded Conversational AI Agents with Azure Speech Photo Avatars
From Chat to Presence: The Next Step in Conversational AI

Chat agents are now embedded across nearly every industry, from customer support on websites to direct integrations inside business applications designed to boost efficiency and productivity. As these agents become more capable and more visible, user expectations are also rising: conversations should feel natural, trustworthy, and engaging. While text-only chat agents work well for many scenarios, voice-enabled agents take a meaningful step forward by introducing a clearer persona and a stronger sense of presence, making interactions feel more human and intuitive (see healow Genie success story).

In domains such as Retail, Healthcare, Education, and Corporate Training, adding a visual dimension through AI avatars further elevates the experience. Pairing voice with a lifelike visual representation improves inclusiveness, reduces interaction friction, and helps users better contextualize conversations, especially in scenarios that rely on trust, guidance, or repeated engagement.

To support these experiences, Microsoft offers two AI avatar options through Azure Speech: Video Avatars, which are generally available and provide full- or partial-body immersive representations, and Photo Avatars, currently in public preview, which deliver a headshot-style visual well suited for web-based agents and digital twin scenarios. Both options support custom avatars, enabling organizations to reflect their brand identity rather than relying solely on generic representations (see W2M custom video avatar).

Choosing between Video Avatars and Photo Avatars is less about preference and more about intent. Video Avatars offer higher visual fidelity and immersion but require more extensive onboarding, such as high-quality recorded video of an avatar talent. Photo Avatars, by contrast, can be created from a single image, enabling a lighter-weight onboarding process while still delivering a human-centered experience.
The right choice depends on the desired interaction style, visual presence, and target deployment scenario.

What this solution demonstrates

In this post, I walk through how to integrate Azure Speech Photo Avatars (powered by Microsoft Research's VASA-1 model) into a knowledge-grounded conversational AI agent built on Azure AI Search. The goal is to show how voice, visuals, and retrieval-augmented generation (RAG) can come together to create a more natural and engaging agent experience.

The solution exposes a web-based interface where users can speak naturally to the AI agent. The agent responds in real time using synthesized speech, while live transcriptions of the conversation are displayed in the UI to improve clarity and accessibility. To help compare different interaction patterns, the sample application supports three modes:

1) Photo Avatar mode, which adds a lifelike visual presence.
2) Video Avatar mode, which provides a more immersive, full-motion experience.
3) Voice-only mode, which focuses purely on speech-to-speech interaction.

Key architectural components

An end-to-end architecture for the solution is shown in the diagram below. The solution is composed of the following core services and building blocks:

- Microsoft Foundry: provides the platform for deploying, managing, and accessing the foundation models used by the application.
- Azure OpenAI: provides the Realtime API for speech-to-speech interaction in the voice-only mode and the Chat Completions API used by backend services for reasoning and conversational responses.
- gpt-4.1: LLM used for reasoning tasks such as deciding when to invoke tool calls and summarizing responses.
- gpt-realtime-mini: LLM used for speech-to-speech interaction in the voice-only mode.
- text-embedding-3-large: embedding model used to generate the vector embeddings for retrieval-augmented generation.
- Azure Speech: delivers the real-time speech-to-text (STT), text-to-speech (TTS), and AI avatar capabilities for both Photo Avatar and Video Avatar experiences.
- Azure Document Intelligence: extracts structured text, layout, and key information from source documents used to build the knowledge base.
- Azure AI Search: provides vector-based retrieval to ground the language model with relevant, context-aware content.
- Azure Container Apps: hosts the web UI frontend, backend services, and MCP server within a managed container runtime.
- Azure Container Apps Environment: defines a secure and isolated boundary for networking, scaling, and observability of the containerized workloads.
- Azure Container Registry: stores and manages Docker images used by the container applications.

How you can try it yourself

The complete sample implementation is available in the LiveChat AI Voice Assistant repository, which includes instructions for deploying the solution into your Azure environment. The repository uses Infrastructure as Code (IaC) deployment via Azure Developer CLI (azd) to orchestrate Azure resource provisioning and application deployment.

Prerequisites: An Azure subscription with sufficient quota for the required services and models.

Getting the solution up and running takes just three steps:

1) Clone the repository and navigate to the project:
    git clone https://github.com/mardianto-msft/azure-speech-ai-avatars.git
    cd azure-speech-ai-avatars
2) Authenticate with Azure:
    azd auth login
3) Initialize and deploy the solution:
    azd up

Once deployed, you can access the sample application by opening the frontend service URL in a web browser. To demonstrate knowledge grounding, the sample includes source documents derived from Microsoft’s 2025 Annual Report and Shareholder Letter. These grounding documents can optionally be replaced with your own data, allowing the same architecture to be reused for domain-specific or enterprise scenarios.
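To make the grounding step more concrete, here is a minimal, self-contained sketch of vector-based retrieval feeding a prompt. It is not the sample repository's code: the three-dimensional vectors are hand-made stand-ins for the embeddings a model such as text-embedding-3-large would produce, the in-memory list stands in for an Azure AI Search index, and the function names (cosine_similarity, retrieve, build_grounded_prompt) and passages are hypothetical.

```python
import math

# Toy "index" of (passage, embedding) pairs. Real embeddings would come from an
# embedding model; these 3-dimensional vectors are hand-made for illustration.
INDEX = [
    ("Passage about financial results.", [0.9, 0.1, 0.0]),
    ("Passage about company priorities.", [0.1, 0.9, 0.1]),
    ("Passage about leadership.", [0.0, 0.2, 0.9]),
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, top_k=2):
    """Return the top_k passages most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def build_grounded_prompt(question, passages):
    """Place the retrieved passages in front of the question for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Pretend embedding of a question about financial results (an assumption).
query_vector = [0.8, 0.2, 0.1]
passages = retrieve(query_vector, INDEX, top_k=2)
print(build_grounded_prompt("How much was net income?", passages))
```

Swapping in your own documents changes only what goes into the index; the retrieve-then-prompt flow stays the same, which is why the architecture can be reused across domains.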
When using the provided sample documents, you can ask questions such as:

- “How much was Microsoft’s net income in 2025?”
- “What are Microsoft’s priorities according to the shareholder letter?”
- “Who is Microsoft’s CEO?”

Bringing Conversational AI Agents to Life

This implementation of Azure Speech Photo Avatars serves as a practical starting point for building more engaging, knowledge-grounded conversational AI agents. By combining voice interaction, visual presence, and retrieval-augmented generation, Photo Avatars offer a lightweight yet powerful way to make AI agents feel more approachable, trustworthy, and human-centered, especially in web-based and enterprise scenarios.

From here, the solution can be extended over time with capabilities such as long-term memory, richer personalization, or more advanced multi-agent orchestration. Whether used as a reference architecture or as the foundation for a production system, this approach demonstrates how Azure Speech Photo Avatars can help bridge the gap between conversational intelligence and meaningful user experience. By emphasizing accessibility, trust, and human-centered design, it reflects Microsoft’s broader mission to empower every person and every organization on the planet to achieve more.

481 Views, 0 likes, 0 Comments

Introducing Dragon HD Omni: Azure Speech New Voice Type Now in Preview via Microsoft Foundry
Dragon HD Omni is Microsoft Azure Speech’s newest generation of text-to-speech voices, delivering over 700 high-quality voices with enhanced expressiveness, multilingual fluency, and multi-style control, all through a unified model built in Microsoft Foundry. It removes common developer pain points such as unnatural voice prosody, limited language coverage, and heavy SSML tuning effort. The result is faster integration, richer user experiences, and production-ready voice output with minimal effort. Azure Speech offers a broad range of unique voices for applications like virtual agents, audiobooks, podcasts, and speech-to-speech tasks.

700+ prebuilt voices

Dragon HD Omni offers a range of prebuilt voices with distinct personas and emotions, supporting diverse use cases from agent-based applications to content creation.

Full update for previous generation voices

Dragon HD Omni merges a wide range of prebuilt voices into one model, improving contextual adaptation, prosody, and expression while keeping each voice's unique character. This technology delivers more accurate, flexible, and lifelike speech for a variety of uses. Dragon HD Omni raises the standard for natural AI voices across customer service, accessibility, and creative projects, advancing human-computer interaction.
You can explore voices from the voice list, such as:

- "en-US-Ava:DragonHDOmniLatestNeural"
- "en-US-Andrew:DragonHDOmniLatestNeural"
- "en-US-Dana:DragonHDOmniLatestNeural"
- "en-US-Caleb:DragonHDOmniLatestNeural"
- "zh-CN-Xiaoyue:DragonHDOmniLatestNeural"
- "zh-CN-Yunqi:DragonHDOmniLatestNeural"
- "en-US-Phoebe:DragonHDOmniLatestNeural"
- "en-US-Lewis:DragonHDOmniLatestNeural"

These voices can be tried directly in the Speech Playground in Microsoft Foundry. Alternatively, you can call the Omni version of a given voice directly through SSML by appending the suffix `:DragonHDOmniLatestNeural` to the voice's base name (the previous name without its trailing Neural). For example:

Previous neural voice    Omni version voice name
de-DE-ConradNeural       de-DE-Conrad:DragonHDOmniLatestNeural

AI-Generated Voices

Dragon HD Omni now features nearly 300 brand-new AI-generated voices, designed to deliver a wide range of vocal diversity. These voices aren’t just more of the same; they’re built to give you choice, flexibility, and creative control, with variations across:

- Gender: male, female, and non-binary options
- Age: youthful, mature, and senior tones
- Pitch & tone: from warm and friendly to authoritative and professional

This expanded library means you can:

- Personalize experiences for different audiences, whether you’re building an educational app, a customer support bot, or a storytelling platform.
- Strengthen brand identity by selecting voices that reflect your company’s personality and values.
- Increase inclusivity with diverse vocal styles that resonate across cultures and communities.
- Unlock creativity by experimenting with unique voice personalities for podcasts, games, or immersive experiences.

Speaker name - Description
en-us-graphiterhodium - A bold and dramatic male voice.
en-us-olivepoivre - An adult female voice that is calm and soothing.

See the full Dragon HD Omni voice list for all available voices.

Styles control

Standard Azure voices have limited styles due to extensive tuning requirements.
Dragon HD Omni introduces automatic style prediction using natural language descriptions, enabling advanced customization, broader style support, reduced cost, and improved expressiveness. In the initial release, styles will launch for en-US-Ava and en-US-Andrew.

Supported styles

angry, chill surfer, confused, curious, determined, disgusted, embarrassed, emo teenager, empathetic, encouraging, excited, fearful, friendly, grateful, joyful, mad scientist, meditative, narration, neutral, new yorker, news, reflective, regretful, relieved, sad, santa, shy, soft voice, surprised

Note that the resulting style is strongly influenced by the input content.

SSML example

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural">
        <mstts:express-as style="cheerful">
          Wow! What an amazing day! I feel so full of energy, and everything around me seems brighter. My voice is bubbling with excitement, and I can’t stop smiling. I’m ready to take on anything that comes my way. Let’s celebrate this wonderful moment together!
        </mstts:express-as>
      </voice>
    </speak>

Multilingual and Accents

All Dragon HD Omni voices support multiple languages and can automatically predict the language of the input text and generate output accordingly. Additionally, you can use the <lang xml:lang> tag to adjust the speaking language and accent, such as fr-FR for French or de-DE for German. For a comprehensive list of supported languages and their associated syntax and attributes, refer to the lang element.

SSML example

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural">
        <lang xml:lang="fr-FR">
          Bonjour ! Ce matin, j’ai pris un café au jardin du Luxembourg. Il faisait frais, mais très agréable. Ensuite, j’ai acheté une baguette et quelques macarons. Paris est vraiment charmant.
        </lang>
      </voice>
    </speak>

Word Boundary Event Support

Dragon HD Omni supports the word boundary event, which allows developers to track the precise timing of each word as it is spoken. This feature is essential for applications requiring word-level synchronization, such as karaoke, real-time captioning, or interactive voice experiences. When the event fires, it provides:

- Text: The word spoken
- AudioOffset: The time offset in the audio stream (the SDK reports it in 100-nanosecond ticks; the sample below converts it to milliseconds)
- TextOffset: The position of the word in the input text

Example: Python sample using the word boundary event in the Azure Speech SDK

    import azure.cognitiveservices.speech as speechsdk

    def word_boundary_cb(evt):
        # audio_offset is in 100-nanosecond ticks; dividing by 10,000 gives milliseconds.
        print(f"Word: '{evt.text}', AudioOffset: {evt.audio_offset / 10000}ms, TextOffset: {evt.text_offset}")

    # Replace the placeholders with your own Speech resource key and region.
    speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    # Subscribe before synthesis starts so that no events are missed.
    synthesizer.synthesis_word_boundary.connect(word_boundary_cb)

    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-us-ava:DragonHDOmniLatestNeural">
    Hello Azure, welcome to Dragon HD Omni!
    </voice>
    </speak>
    """

    # Blocks until synthesis completes; the callback fires once per word or punctuation mark.
    result = synthesizer.speak_ssml_async(ssml).get()

Sample Output:

    Word: 'Hello', AudioOffset: 110.0ms, TextOffset: 182
    Word: 'Azure', AudioOffset: 590.0ms, TextOffset: 188
    Word: ',', AudioOffset: 1110.0ms, TextOffset: 193
    Word: 'welcome', AudioOffset: 1270.0ms, TextOffset: 195
    Word: 'to', AudioOffset: 1750.0ms, TextOffset: 203
    Word: 'Dragon HD Omni', AudioOffset: 1910.0ms, TextOffset: 206
    Word: '!', AudioOffset: 2750.0ms, TextOffset: 216

Parameters

Dragon HD Omni supports advanced parameter tuning to help you customize voice output for different scenarios. This guide explains each parameter in simple terms and provides recommendations for adjusting them based on your goals.
Overview

Parameter     Default   Range        Purpose
temperature   0.7       0.3 – 1.0    Controls creativity vs. stability
top_p         0.7       0.3 – 1.0    Filters output for diversity
top_k         22        1 – 50       Limits number of options considered
cfg_scale     1.4       1.0 – 2.0    Adjusts relevance and speech speed

Tuning for Expressiveness vs. Stability

Higher values for temperature, top_p, and top_k result in more expressive, emotionally varied speech. Lower values produce more stable and predictable output.

Recommendation: To increase expressiveness, raise all three parameters together. Keep top_p equal to temperature for best results.

Tuning for Speed and Contextual Relevance

cfg_scale affects how quickly the voice speaks and how well it aligns with the context.

- Higher values (e.g., 1.8–2.0): faster speech, stronger contextual relevance.
- Lower values (e.g., 1.0–1.2): slower speech, less contextual alignment.

Suggested Tuning Strategies

Goal                Suggested Adjustment
More expressive     Increase temperature, top_p, and top_k together
More stable         Lower temperature first, then adjust top_p if needed
Faster & relevant   Increase cfg_scale
Slower & neutral    Decrease cfg_scale

The following examples show how to set the parameters above in SSML.

Single parameter:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8">
        Hello Azure!
      </voice>
    </speak>

Multiple parameters:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-us-ava:DragonHDOmniLatestNeural" parameters="top_p=0.8;top_k=22;temperature=0.7;cfg_scale=1.2">
        Hello Azure! Hello Azure!
      </voice>
    </speak>

Get Started

In our ongoing journey to enhance multilingual capabilities in text to speech (TTS) technology, we strive to deliver the best voices to empower your applications.
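Returning briefly to the voice naming and tuning parameters described above, the following is a minimal sketch of how an application might assemble the voice element programmatically. The helper names (omni_voice_name, build_parameters, build_ssml) are hypothetical and not part of the Speech SDK; the ranges are copied from the overview table, and the name conversion assumes that previous-generation names end in Neural, as in the de-DE-ConradNeural example.

```python
from xml.sax.saxutils import escape

OMNI_SUFFIX = ":DragonHDOmniLatestNeural"

# (min, max) ranges copied from the overview table in this post.
PARAMETER_RANGES = {
    "temperature": (0.3, 1.0),
    "top_p": (0.3, 1.0),
    "top_k": (1, 50),
    "cfg_scale": (1.0, 2.0),
}


def omni_voice_name(previous_name):
    """Turn 'de-DE-ConradNeural' into 'de-DE-Conrad:DragonHDOmniLatestNeural'."""
    if previous_name.endswith(OMNI_SUFFIX):
        return previous_name  # already an Omni voice name
    if previous_name.endswith("Neural"):
        previous_name = previous_name[: -len("Neural")]
    return previous_name + OMNI_SUFFIX


def build_parameters(**values):
    """Check each value against its documented range and join with ';'."""
    parts = []
    for name, value in values.items():
        if name not in PARAMETER_RANGES:
            raise ValueError(f"Unknown parameter: {name}")
        low, high = PARAMETER_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low} to {high}")
        parts.append(f"{name}={value}")
    return ";".join(parts)


def build_ssml(voice, text, lang="en-US", **params):
    """Build an SSML document for one voice, with optional tuning parameters."""
    attrs = f'name="{omni_voice_name(voice)}"'
    if params:
        attrs += f' parameters="{build_parameters(**params)}"'
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="{lang}">'
        f"<voice {attrs}>{escape(text)}</voice></speak>"
    )


print(build_ssml("de-DE-ConradNeural", "Hello Azure!", lang="de-DE", top_p=0.8, top_k=22))
```

Validating on the client side is a design choice rather than a service requirement: it surfaces an out-of-range value as a local error before any synthesis request is made. The resulting string could then be passed to a call such as speak_ssml_async.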
Our voices are designed to be incredibly adaptive, seamlessly switching between languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications like language learning, travel guidance, and international business communication.

Microsoft offers an extensive portfolio of over 600 neural voices, covering more than 150 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or provide a voice to chatbots, elevating the conversational experience for users. With the Custom Neural Voice capability, businesses can also create unique and distinctive brand voices effortlessly.

With these advancements, we continue to push the boundaries of what’s possible in TTS technology, ensuring that our users have access to the most versatile, high-quality voices for their needs.

For more information

- Try our demo to listen to existing neural voices
- Add Text to speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback
- Contact us: ttsvoicefeedback@microsoft.com

2.4K Views, 0 likes, 0 Comments

Create a Simple Speech REST API with Azure AI Speech Services
Explore the world of speech recognition and speech synthesis with Azure AI Services. In this tutorial, you will learn how to create your own simple Speech REST API using Azure AI Speech Synthesis and either Azure OpenAI services or the OpenAI API. Experience the power of speech synthesis on Azure and explore the possibilities Azure AI Services opens up for building powerful products.

6K Views, 2 likes, 0 Comments

Build a Virtual Assistant with Azure Open AI and Azure Speech Service
This post shows you how to create a powerful virtual assistant with Azure OpenAI and Azure Speech Services for all languages. It is a static web application: no server is required, and everything runs in client-side JavaScript. Azure OpenAI Service provides developers with API calls to make a virtual assistant that uses Azure AI and speech services. Students can use it to get course-related answers. You can try the Live2D Azure OpenAI chatbot by creating an Azure subscription and configuring it.

22K Views, 1 like, 5 Comments

Ombromanie: Creating Hand Shadow stories with Azure Speech and TensorFlow.js Handposes
This is an article for all those big siblings who sneaked flashlights into their bedroom to cast scary shadows onto the wall, and for all the little kids, or the grownups who are still kids at heart, who were thus deliciously entertained.

3K Views, 0 likes, 0 Comments