speech
56 TopicsReal-Time Speech Intelligence for Global Scale: gpt-4o-transcribe-diarize in Azure AI Foundry
Voice is a natural interface for communication. Now, with the general availability of gpt-4o-transcribe-diarize, the new automatic speech recognition (ASR) model in Azure AI Foundry, transforming speech into actionable text is faster, smarter, and more accurate than ever. This launch marks a significant milestone in our mission to empower organizations with AI that delivers speed, accuracy, and enterprise-grade reliability. With gpt-4o-transcribe-diarize seamlessly integrated, businesses can unlock critical insights from conversations, instantly converting audio into text with ultra-low latency and outstanding accuracy across 100+ languages. Whether you're enhancing live event accessibility, analyzing customer interactions, or enabling intelligent voice-driven applications, gpt-4o-transcribe-diarize helps capture spoken word and leverages it for real-time decision-making. Experience how Azure AI’s innovation in speech technology is helping to redefine productivity and global reach, setting a new standard for audio intelligence in the enterprise landscape. Why gpt-4o-transcribe-diarize Matters Businesses today operate in a world where conversations drive decisions. From customer support calls to virtual meetings, audio data holds critical insights. Gpt-4o-transcribe-diarize unlocks these insights, converting speech to text with ultra-low latency and high accuracy across 100+ languages. Whether you’re captioning live events, analyzing call center interactions, or building voice-driven applications, gpt-4o-transcribe-diarize offers the opportunity to help your workflows be powered by real-time intelligence. Key Features Lightning-Fast Transcription: Convert 10 minutes of audio in ~15 seconds with our new Fast Transcription API. Global Language Coverage: Support for 100+ languages and dialects for inclusive, global experiences. Seamless Integration: Available in Azure AI Foundry with managed endpoints for easy deployment and scale. Real-World Impact Imagine a reporter summarizing interviews in real time, a financial institution transcribing calls instantly, or a global retailer powering multilingual voice assistants; all with the speed and security of Azure AI Foundry. gpt-4o-transcribe-diarize can make these scenarios possible today. Pricing and regional availability for gpt-4o-transcribe-diarize Model Deployment Regions Price $/1m tokens gpt-4o-transcribe-diarize Global Standard (Paygo) East US 2, Sweden Central Text input: $2.50 Audio input: $6.00 Output: $10.00 gpt-4o-transcribe-diarize in audio AI innovation context gpt-4o-transcribe-diarize is part of a broader wave of audio AI innovation on Azure, joining new models like OpenAI gpt-realtime and gpt-audio that are purpose-built for expressive, low-latency voice experiences. While gpt-4o-transcribe-diarize delivers ultra-fast transcription with enterprise-grade accuracy, gpt-realtime enables natural, emotionally rich voice interactions with millisecond responsiveness—ideal for live conversations, voice agents, and multimodal applications. Meanwhile, audio models like gpt-4o-transcribe mini, and mini-tts extend the platform’s capabilities with customizable speech synthesis and real-time captioning, making Azure AI a comprehensive solution for building intelligent, production-ready voice systems. gpt-realtime Features OpenAI claims the gpt-realtime model introduces a new standard for voice-first applications, combining expressive audio generation with ultra-low latency and natural conversational flow. It’s designed to power real-time interactions that feel like natural, responsive speech. Key Features: Millisecond Latency: Enables live responsiveness suitable for real-time conversations, kiosks, and voice agents. Emotionally Expressive Voices: Supports nuanced speech delivery with voices like Marin and Cedar, capable of conveying tone, emotion, and intent. Natural Turn-Taking: Built-in mechanisms for detecting pauses and transitions, allowing fluid back-and-forth dialogue. Function Calling Support: Seamlessly integrates with backend systems to trigger actions based on voice input. Multimodal Readiness: Designed to work with text, audio, and visual inputs for rich, interactive experiences. Stable APIs for Production: Enterprise-grade reliability with consistent behavior across sessions and deployments. These features make gpt-realtime a foundational model for building intelligent voice interfaces that go beyond transcription—delivering conversational intelligence in real time. gpt-realtime Use Cases With its expressive audio capabilities and real-time responsiveness, gpt-realtime unlocks new possibilities across industries. Whether enhancing customer engagement or streamlining operations, it brings voice AI into the heart of enterprise workflows. Examples include: Customer Service Agents: Power virtual agents that respond instantly with natural, tones for rich expressiveness, improving customer satisfaction and reducing wait times. Retail Kiosks & Smart Devices: Enable voice-driven product discovery, troubleshooting, and checkout experiences with real-time feedback. Multilingual Voice Assistants: Deliver localized, expressive voice experiences across global markets with support for multiple languages and dialects. Live Captioning & Accessibility: Combine gpt-4o-transcribe-diarize gpt-realtime to provide real-time captions and voice synthesis for inclusive experiences. These use cases demonstrate how gpt-realtime transforms voice into a strategic interface—bridging human communication and intelligent systems with speed and accuracy. Ready to transform voice into value? Learn more and start building with gpt-4o-transcribe-diarize2.6KViews0likes1CommentUsing the Voice Live API in Azure AI Foundry
In this blog post, we’ll explore the Voice Live API from Azure AI Foundry. Officially released for general availability on October 1, 2025, this API unifies speech recognition, generative AI, and text-to-speech capabilities into a single, streamlined interface. It removes the complexity of manually orchestrating multiple components and ensures a consistent developer experience across all models, making it easy to switch and experiment. What sets Voice Live API apart are its advanced conversational enhancements, including: Semantic Voice Activity Detection (VAD) that’s robust against background noise and accurately detects when a user intends to speak. Semantic end-of-turn detection that supports natural pauses in conversation. Server-side audio processing features like noise suppression and echo cancellation, simplifying client-side development. Let’s get started. 1. Getting Started with Voice Live API The Voice Live API ships with an SDKthat lets you open a single realtime WebSocket connection and then do everything—stream microphone audio up, receive synthesized audio/text/function‑call events down— without writing any of the low-level networking plumbing. This is how the connection is opened with the Python SDK. from azure.ai.voicelive.aio import connect from azure.core.credentials import AzureKeyCredential async with connect( endpoint=VOICE_LIVE_ENDPOINT, # https://<your-foundry-resource>.cognitiveservices.azure.com/ credential=AzureKeyCredential(VOICE_LIVE_KEY), model="gpt-4o-realtime", connection_options={ "max_msg_size": 10 * 1024 * 1024, # allow streamed PCM "heartbeat": 20, # keep socket alive "timeout": 20, # network resilience }, ) as connection: Notice that you don't need an underlying deployment nor manage any generative AI models, as the API handles all the underlying infrastructure. Immediately after connecting, declare what kind of conversation you want. This is where you “teach” the session the model instructions, which voice to synthesize, what tool functions it may call, and how to detect speech turns: from azure.ai.voicelive.models import ( RequestSession, Modality, AzureStandardVoice, InputAudioFormat, OutputAudioFormat, AzureSemanticVad, ToolChoiceLiteral, AudioInputTranscriptionOptions ) session_config = RequestSession( modalities=[Modality.TEXT, Modality.AUDIO], instructions="Assist the user with account questions succinctly.", voice=AzureStandardVoice(name="alloy", type="azure-standard"), input_audio_format=InputAudioFormat.PCM16, output_audio_format=OutputAudioFormat.PCM16, turn_detection=AzureSemanticVad( threshold=0.5, prefix_padding_ms=300, silence_duration_ms=500 ), tools=[ # optional { "name": "get_user_information", "description": "Retrieve profile and limits for a user", "input_schema": { "type": "object", "properties": {"user_id": {"type": "string"}}, "required": ["user_id"] } } ], tool_choice=ToolChoiceLiteral.AUTO, input_audio_transcription=AudioInputTranscriptionOptions(model="whisper-1"), ) await connection.session.update(session=session_config) After session setup, it is pure event-driven flow: async for event in connection: if event.type == ServerEventType.RESPONSE_AUDIO_DELTA: playback_queue.put(event.delta) elif event.type == ServerEventType.CONVERSATION_ITEM_CREATED and event.item.type == ItemType.FUNCTION_CALL: handle_function_call(event) That’s the core: one connection, one session config message, then pure event-driven flow. 2. Deep Dive: Tool (Function) Handling in the Voice Live SDK In the Voice Live context, “tools” are model-invocable functions you expose with a JSON schema. The SDK streams a structured function call request (name + incrementally streamed arguments), you execute real code locally, then feed the JSON result back so the model can incorporate it into its next spoken (and/or textual) turn. Let’s unpack the full lifecycle. First, the model emits a CONVERSATION_ITEM_CREATED event whose item.type == FUNCTION_CALL if event.item.type == ItemType.FUNCTION_CALL: await self._handle_function_call_with_improved_pattern(event, connection) Arguments stream (possibly token-by-token) until the SDK signals RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE. Optionally, the SDK may also complete the “response” segment with RESPONSE_DONE before you run the tool. Then we execute the local Python function, and explicitly request a new model response via connection.response.create(), telling the model to incorporate the tool result into a natural-language (and audio) answer. async def _handle_function_call(self, created_evt, connection): call_item = created_evt.item # ResponseFunctionCallItem name = call_item.name call_id = call_item.call_id prev_id = call_item.id # 1. Wait until arguments are fully streamed args_done = await _wait_for_event( connection, {ServerEventType.RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE} ) assert args_done.call_id == call_id arguments = args_done.arguments # JSON string # 2. (Optional) Wait for RESPONSE_DONE to avoid race with model finishing segment await _wait_for_event(connection, {ServerEventType.RESPONSE_DONE}) # 3. Execute func = self.available_functions.get(name) if not func: # Optionally send an error function output return result = await func(arguments) # Implementations are async in this sample # 4. Send output output_item = FunctionCallOutputItem(call_id=call_id, output=json.dumps(result)) await connection.conversation.item.create( previous_item_id=prev_id, item=output_item ) # 5. Trigger follow-up model response await connection.response.create() 3. Sample App: Try the repo with sample app we have created, together with all the infrastructure required already automated. This sample app simulates a friendly real‑time contact‑center rep who can listen continuously, understand you as you speak, instantly look up things like your credit card’s upcoming due date or a product detail via function calls, and then answer back naturally in a Brazilian Portuguese neural voice with almost no lag. Behind the scenes it streams your microphone audio to Azure’s Voice Live (GPT‑4o realtime) model, transcribes and reasons on the fly, selectively triggers lightweight “get user information” or “get product information” lookups to Azure AI Search , and speaks responses right back to you. Happy Coding!387Views0likes0CommentsThe Future of AI: How Lovable.dev and Azure OpenAI Accelerate Apps that Change Lives
Discover how Charles Elwood, a Microsoft AI MVP and TEDx Speaker, leverages Lovable.dev and Azure OpenAI to create impactful AI solutions. From automating expense reports to restoring voices, translating gestures to speech, and visualizing public health data, Charles's innovations are transforming lives and democratizing technology. Follow his journey to learn more about AI for good.1.5KViews2likes0CommentsUpgrade your voice agent with Azure AI Voice Live API
Today, we are excited to announce the general availability of Voice Live API, which enables real-time speech-to-speech conversational experience through a unified API powered by generative AI models. With Voice Live API, developers can easily voice-enable any agent built with the Azure AI Foundry Agent Service. Azure AI Foundry Agent Service, enables the operation of agents that make decisions, invoke tools, and participate in workflows across development, deployment, and production. By eliminating the need to stitch together disparate components, Voice Live API offers a low latency, end-to-end solution for voice-driven experiences. As always, a diverse range of customers provided valuable feedback during the preview period. Along with announcing general availability, we are also taking this opportunity to address that feedback and improve the API. Following are some of the new features designed to assist developers and enterprises in building scalable, production-ready voice agents. More natively integrated GenAI models including GPT-Realtime Voice Live API enables developers to select from a range of advanced AI models designed for conversational applications, such as GPT-Realtime, GPT-5, GPT-4.1, Phi, and others. These models are natively supported and fully managed, eliminating the need for developers to manage model deployment or plan for capacity. These natively supported models may each have a distinct stage in their life cycle (e.g. public preview, generally available) and be subject to varying pricing structures. The table below lists the models supported in each pricing tier. Pricing Tier Generally Available In Public Preview Voice Live Pro GPT-Realtime, GPT-4.1, GPT-4o GPT-5 Voice Live Standard GPT-4o-mini, GPT-4.1-mini GPT-4o-Mini-Realtime, GPT-5-mini Voice Live Lite NA Phi-4-MM-Realtime, GPT-5-Nano, Phi-4-Mini Extended speech languages to 140+ Voice Live API now supports speech input in over 140 languages/locales. View all supported languages by configuration. Automatic multilingual configuration is enabled by default, using the multilingual model. Integrated with Custom Speech Developers need customization to better manage input and output for different use cases. Besides the support for Custom Voice released in May 2025, Voice Live now supports seamless integration with Custom Speech for improved speech recognition results. Developers can also improve speech input accuracy with phrase lists and refine speech synthesis pronunciation using custom lexicons, all without training a model. Learn how to customize speech and voice models for Voice Live API. Natural HD voices upgraded Neural HD voices in Azure AI Speech are contextually aware and engineered to provide a natural, expressive experience, making them ideal for voice agent applications. The latest V2 upgrade enhances lifelike qualities with features such as natural pauses, filler words, and seamless transitions between speaking styles, all available with Voice Live API. Check out the latest demo of Ava Neural HD V2. Improved VAD features for interruption detection Voice Live API now features semantic Voice Activity Detection (VAD), enabling it to intelligently recognize pauses and filler word interruptions in conversations. In the latest en-US evaluation on Multilingual filler words data, Voice Live API achieved ~20% relative improvement from previous VAD models. This leap in performance is powered by integrating semantic VAD into the n-best pipeline, allowing the system to better distinguish meaningful speech from filler noise and enabling more accurate latency tracking and cleaner segmentation, especially in multilingual and noisy environments. 4K avatar support Voice Live API enables efficient integration with streaming avatars. With the latest updates, avatar options offer support for high-fidelity 4K video models. Learn more about the avatar update. Improved function calling and integration with Azure AI Foundry Agent Service Voice Live API enables function calling to assist developers in building robust voice agents with their chosen generative AI models. This release improves asynchronous function calls and enhances integration with Azure AI Foundry Agent Service for agent creation and operation. Learn more about creating a voice live real-time voice agent with Azure AI Foundry Agent Service. More developer resources and availability in more regions Developer resources are available in C# and Python, with more to come. Get started with Voice Live API. Voice Live API is available in more regions now including Australia East, East US, Japan East, and UK South, besides the previously supported regions such as Central India, East US 2, South East Asia, Sweden Central, and West US 2. Check the features supported in each region. Customers adopting Voice Live In healthcare, patient experience is always the top priority. With Voice Live, eClinicalWorks’ healow Genie contact center solution is now taking healthcare modernization a step further. healow is piloting Voice Live API for Genie to inform patients about their upcoming appointments, answer common questions, and return voicemails. Reducing these routine calls saves healthcare staff hours each day and boosts patient satisfaction through timely interactions. “We’re looking forward to using Azure AI Foundry Voice Live API so that when a patient calls, Genie can detect the question and respond in a natural voice in near-real time,” said Sidd Shah, Vice President of Strategy & Business Growth at healow. “The entire roundtrip is all happening in Voice Live API.” If a patient asks about information in their medical chart, Genie can also fetch data from their electronic health record (EHR) and provide answers. Read the full story here. “If we did multiple hops to go across different infrastructures, that would add up to a diminished patient experience. The Azure AI Foundry Voice Live API is integrated into one single, unified solution, delivering speech-to-text and text-to-speech in the same infrastructure.” Bhawna Batra, VP of Engineering at eClinicalWorks Capgemini, a global business and technology transformation partner, is reimagining its global service desk managed operations through its Capgemini Cloud Infrastructure Services (CIS) division. The first phase covers 500,000 users across 45 clients, which is only part of the overall deployment base. The goal is to modernize the service desk to meet changing expectations for speed, personalization, and scale. To drive this transformation, Capgemini launched the “AI-Powered Service Desk” platform powered by Microsoft technologies including Dynamics 365 Contact Center, Copilot Studio, and Azure AI Foundry. A key enhancement was the integration of Voice Live API for real-time voice interactions, enabling intelligent, conversational support across telephony channels. The new platform delivers a more agile, truly conversational, AI-driven service experience, automating routine tasks and enhancing agent productivity. With scalable voice capabilities and deep integration across Microsoft’s ecosystem, Capgemini is positioned to streamline support operations, reduce response times, and elevate customer satisfaction across its enterprise client base. "Integrating Microsoft’s Voice Live API into our platform has been transformative. We’re seeing measurable improvements in user engagement and satisfaction thanks to the API’s low-latency, high-quality voice interactions. As a result, we are able to deliver more natural and responsive experiences, which have been positively received by our customers.” Stephen Hilton, EVP Chief Operating Officer at CIS Capgemini Astra Tech, a fast-growing UAE-based technology group part of G42, is bringing Voice Live API to its flagship platform, botim, a fintech-first and AI-native platform. Eight out of 10 smartphone users in the UAE already rely on the app. The company is now reshaping botim from a communications tool into a fintech-first service, adding features such as digital wallets, international remittances, and micro-loans. To achieve its broader vision, Astra Tech set out to make botim simpler, more intuitive, and more human. “Voice removes a lot of complexity, and it’s the most natural way to interact,” says Frenando Ansari, Lead Product Manager at Astra Tech. “For users with low digital literacy or language barriers, tapping through traditional interfaces can be difficult. Voice personalizes the experience and makes it accessible in their preferred language.” " The Voice Live API acts as a connective tissue for AI-driven conversation across every layer of the app. It gives us a standardized framework so that different product teams can incorporate voice without needing to hire deep AI expertise.” Frenando Ansari, Lead Product Manager at Astra Tech “The most impressive thing about the Voice Live API is the voice activity detection and the noise control algorithm.” Meng Wang, AI Head at Astra Tech Get started Voice Live API is transforming how developers build voice-enabled agent systems by providing an integrated, scalable, and efficient solution. By combining speech recognition, generative AI, and text-to-speech functionalities into a unified interface, it addresses the challenges of traditional implementations, enabling faster development and superior user experiences. From streamlining customer service to enhancing education and public services, the opportunities are endless. The future of voice-first solutions is here—let’s build it together! Voice Live API introduction (video) Try Voice Live in Azure AI Foundry Voice Live API documents Voice Live quickstart Voice Live Agent code sample in GitHub
2.4KViews2likes0CommentsAnnouncing Live Interpreter API - Now in Public Preview
Today, we’re excited to introduce Live Interpreter –a breakthrough new capability in Azure Speech Translation – that makes real-time, multilingual communication effortless. Live Interpreter continuously identifies the language being spoken without requiring you to set an input language and delivers low latency speech-to-speech translation in a natural voice that preserves the speaker’s style and tone.6.2KViews1like0CommentsPower Up Your Open WebUI with Azure AI Speech: Quick STT & TTS Integration
Introduction Ever found yourself wishing your web interface could really talk and listen back to you? With a few clicks (and a bit of code), you can turn your plain Open WebUI into a full-on voice assistant. In this post, you’ll see how to spin up an Azure Speech resource, hook it into your frontend, and watch as user speech transforms into text and your app’s responses leap off the screen in a human-like voice. By the end of this guide, you’ll have a voice-enabled web UI that actually converses with users, opening the door to hands-free controls, better accessibility, and a genuinely richer user experience. Ready to make your web app speak? Let’s dive in. Why Azure AI Speech? We use Azure AI Speech service in Open Web UI to enable voice interactions directly within web applications. This allows users to: Speak commands or input instead of typing, making the interface more accessible and user-friendly. Hear responses or information read aloud, which improves usability for people with visual impairments or those who prefer audio. Provide a more natural and hands-free experience especially on devices like smartphones or tablets. In short, integrating Azure AI Speech service into Open Web UI helps make web apps smarter, more interactive, and easier to use by adding speech recognition and voice output features. If you haven’t hosted Open WebUI already, follow my other step-by-step guide to host Ollama WebUI on Azure. Proceed to the next step if you have Open WebUI deployed already. Learn More about OpenWeb UI here. Deploy Azure AI Speech service in Azure. Navigate to the Azure Portal and search for Azure AI Speech on the Azure portal search bar. Create a new Speech Service by filling up the fields in the resource creation page. Click on “Create” to finalize the setup. After the resource has been deployed, click on “View resource” button and you should be redirected to the Azure AI Speech service page. The page should display the API Keys and Endpoints for Azure AI Speech services, which you can use in Open Web UI. Settings things up in Open Web UI Speech to Text settings (STT) Head to the Open Web UI Admin page > Settings > Audio. Paste the API Key obtained from the Azure AI Speech service page into the API key field below. Unless you use different Azure Region, or want to change the default configurations for the STT settings, leave all settings to blank. Text to Speech settings (TTS) Now, let's proceed with configuring the TTS Settings on OpenWeb UI by toggling the TTS Engine to Azure AI Speech option. Again, paste the API Key obtained from Azure AI Speech service page and leave all settings to blank. You can change the TTS Voice from the dropdown selection in the TTS settings as depicted in the image below: Click Save to reflect the change. Expected Result Now, let’s test if everything works well. Open a new chat / temporary chat on Open Web UI and click on the Call / Record button. The STT Engine (Azure AI Speech) should identify your voice and provide a response based on the voice input. To test the TTS feature, click on the Read Aloud (Speaker Icon) under any response from Open Web UI. The TTS Engine should reflect Azure AI Speech service! Conclusion And that’s a wrap! You’ve just given your Open WebUI the gift of capturing user speech, turning it into text, and then talking right back with Azure’s neural voices. Along the way you saw how easy it is to spin up a Speech resource in the Azure portal, wire up real-time transcription in the browser, and pipe responses through the TTS engine. From here, it’s all about experimentation. Try swapping in different neural voices or dialing in new languages. Tweak how you start and stop listening, play with silence detection, or add custom pronunciation tweaks for those tricky product names. Before you know it, your interface will feel less like a web page and more like a conversation partner.980Views2likes1CommentAnnouncing gpt-realtime on Azure AI Foundry:
We are thrilled to announce that we are releasing today the general availability of our latest advancement in speech-to-speech technology: gpt-realtime. This new model represents a significant leap forward in our commitment to providing advanced and reliable speech-to-speech solutions. gpt-realtime is a new S2S (speech-to-speech) model with improved instruction following, designed to merge all of our speech-to-speech improvements into a single, cohesive model. This model is now available in the Real-time API, offering enhanced voice naturalness, higher audio quality, and improved function calling capabilities. Key Features New, natural, expressive voices: New voice options (Marin and Cedar) that bring a new level of naturalness and clarity to speech synthesis. Improved Instruction Following: Enhanced capabilities to follow instructions more accurately and reliably. Enhanced Voice Naturalness: More lifelike and expressive voice output. Higher Audio Quality: Superior audio quality for a better user experience. Improved Function Calling: Enhanced ability to call custom code defined by developers. Image Input Support: Add images to context and discuss them via voice—no video required. Check out the model card here: gpt-realtime Pricing Pricing for gpt-realtime is 20% lower compared to the previous gpt-4o-realtime preview: Pricing is based on usage per 1 million tokens. Below is the breakdown: Getting Started gpt-realtime is available on Azure AI Foundry via Azure Models direct from Azure today. We are excited to see how developers and users will leverage these new capabilities to create innovative and impactful solutions. Check out the model on Azure AI Foundry and see detailed documentation in Microsoft Learn docs.4.3KViews1like0CommentsModel Mondays S2E11: Exploring Speech AI in Azure AI Foundry
1. Weekly Highlights This week’s top news in the Azure AI ecosystem included: Lakuna — Copilot Studio Agent for Product Teams: A hackathon project built with Copilot Studio and Azure AI Foundry, Lakuna analyzes your requirements and docs to surface hidden assumptions, helping teams reflect, test, and reduce bias in product planning. Azure ND H200 v5 VMs for AI: Azure Machine Learning introduced ND H200 v5 VMs, featuring NVIDIA H200 GPUs (over 1TB GPU memory per VM!) for massive models, bigger context windows, and ultra-fast throughput. Agent Factory Blog Series: The next wave of agentic AI is about extensibility: plug your agents into hundreds of APIs and services using Model Connector Protocol (MCP) for portable, reusable tool integrations. GPT-5 Tool Calling on Azure AI Foundry: GPT-5 models now support free-form tool calling—no more rigid JSON! Output SQL, Python, configs, and more in your preferred format for natural, flexible workflows. Microsoft a Leader in 2025 Gartner Magic Quadrant: Azure was again named a leader for Cloud Native Application Platforms—validating its end-to-end runway for AI, microservices, DevOps, and more. 2. Spotlight On: Azure AI Foundry Speech Playground The main segment featured a live demo of the new Azure AI Speech Playground (now part of Foundry), showing how developers can experiment with and deploy cutting-edge voice, transcription, and avatar capabilities. Key Features & Demos: Speech Recognition (Speech-to-Text): Try real-time transcription directly in the playground—recognizing natural speech, pauses, accents, and domain terms. Batch and Fast transcription options for large files and blob storage. Custom Speech: Fine-tune models for your industry, vocabulary, and noise conditions. Text to Speech (TTS): Instantly convert text into natural, expressive audio in 150+ languages with 600+ neural voices. Demo: Listen to pre-built voices, explore whispering, cheerful, angry, and more styles. Custom Neural Voice: Clone and train your own professional or personal voice (with strict Responsible AI controls). Avatars & Video Translation: Bring your apps to life with prebuilt avatars and video translation, which syncs voice-overs to speakers in multilingual videos. Voice Live API: Voice Live API (Preview) integrates all premium speech capabilities with large language models, enabling real-time, proactive voice agents and chatbots. Demo: Language learning agent with voice, avatars, and proactive engagement. One-click code export for deployment in your IDE. 3. Customer Story: Hilo Health This week’s customer spotlight featured Helo Health—a healthcare technology company using Azure AI to boost efficiency for doctors, staff, and patients. How Hilo Uses Azure AI: Document Management: Automates fax/document filing, splits multi-page faxes by patient, reduces staff effort and errors using Azure Computer Vision and Document Intelligence. Ambient Listening: Ambient clinical note transcription captures doctor-patient conversations and summarizes them for easy EHR documentation. Genie AI Contact Center: Agentic voice assistants handle patient calls, book appointments, answer billing/refill questions, escalate to humans, and assist human agents—using Azure Communication Services, Azure Functions, FastAPI (community), and Azure OpenAI. Conversational Campaigns: Outbound reminders, procedure preps, and follow-ups all handled by voice AI—freeing up human staff. Impact: Hilo reaches 16,000+ physician practices and 180,000 providers, automates millions of communications, and processes $2B+ in payments annually—demonstrating how multimodal AI transforms patient journeys from first call to post-visit care. 4. Key Takeaways Here’s what you need to know from S2E11: Speech AI is Accessible: The Azure AI Foundry Speech Playground makes experimenting with voice recognition, TTS, and avatars easy for everyone. From Playground to Production: Fine-tune, export code, and deploy speech models in your own apps with Azure Speech Service. Responsible AI Built-In: Custom Neural Voice and avatars require application and approval, ensuring ethical, secure use. Agentic AI Everywhere: Voice Live API brings real-time, multimodal voice agents to any workflow. Healthcare Example: Hilo’s use of Azure AI shows the real-world impact of speech and agentic AI, from patient intake to after-visit care. Join the Community: Keep learning and building—join the Discord and Forum. Sharda's Tips: How I Wrote This Blog I organize key moments from each episode, highlight product demos and customer stories, and use GitHub Copilot for structure. For this recap, I tested the Speech Playground myself, explored the docs, and summarized answers to common developer questions on security, dialects, and deployment. Here’s my favorite Copilot prompt this week: "Generate a technical blog post for Model Mondays S2E11 based on the transcript and episode details. Focus on Azure Speech Playground, TTS, avatars, Voice Live API, and healthcare use cases. Add practical links for developers and students!" Coming Up Next Week Next week: Observability! Learn how to monitor, evaluate, and debug your AI models and workflows using Azure and OpenAI tools. Register For The Livestream – Sep 1, 2025 Register For The AMA – Sep 5, 2025 Ask Questions & View Recaps – Discussion Forum About Model Mondays Model Mondays is your weekly Azure AI learning series: 5-Minute Highlights: Latest AI news and product updates 15-Minute Spotlight: Demos and deep dives with product teams 30-Minute AMA Fridays: Ask anything in Discord or the forum Start building: Register For Livestreams Watch Past Replays Register For AMA Recap Past AMAs Join The Community Don’t build alone! The Azure AI Developer Community is here for real-time chats, events, and support: Join the Discord Explore the Forum About Me I'm Sharda, a Gold Microsoft Learn Student Ambassador focused on cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. In this blog series, I share takeaways from each week’s Model Mondays livestream.182Views0likes0CommentsBuild recap: new Azure AI Foundry resource, Developer APIs and Tools
At Microsoft Build 2025, we introduced Azure AI Foundry resource, Azure AI Foundry API, and supporting tools to streamline the end-to-end development lifecycle of AI agents and applications. These capabilities are designed to help developers accelerate time-to-market; support production-scale workloads with scale and central governance; and support administrators with a self-serve capability to enable their teams’ experimentation with AI in a controlled environment. The Azure AI Foundry resource type unifies agents, models and tools under a single management grouping, equipped with built-in enterprise-readiness capabilities — such as tracing & monitoring, agent and model-specific evaluation capabilities, and customizable enterprise setup configurations tailored to your organizational policies like using your own virtual networks. This launch represents our commitment to providing organizations with a consistent, efficient and centrally governable environment for building and operating the AI agents and applications of today, and tomorrow. New platform capabilities The new Foundry resource type evolves our vision for Azure AI Foundry as a unified Azure platform-as-a-service offering, enabling developers to focus on building applications rather than managing infrastructure, while taking advantage of native Azure platform capabilities like Azure Data and Microsoft Defender. Previously, Azure AI Foundry portal’s capabilities required the management of multiple Azure resources and SDKs to build an end-to-end application. New capabilities include: Foundry resource type enables administrators with a consistent way of managing security and access to Agents, Models, Projects, and Azure tooling Integration. With this change, Azure Role Based Access Control, Networking and Policies are administered under a single Azure resource provider namespace, for streamlined management. ‘Azure AI Foundry’ is a renaming of the former ‘Azure AI Services’ resource type, with access to new capabilities. While Azure AI Foundry still supports bring-your-own Azure resources, we now default to a fully Microsoft-managed experience, making it faster and easier to get started. Foundry projects are folders that enable developers to independently create new environments for exploring new ideas and building prototypes, while managing data in isolation. Projects are child resources; they may be assigned their own admin controls but by default share common settings such as networking or connected resource access from their parent resource. This principle aims to take IT admins out of the day-to-day loop once security and governance are established at the resource level, enabling developers to self-serve confidently within their projects. Azure AI Foundry API is designed from the ground up, to build and evaluate API-first agentic applications, and lets you work across model providers agnostically with a consistent contract. Azure AI Foundry SDK wraps the Foundry API making it easy to integrate capabilities into code whether your application is built in Python, C#, JavaScript/TypeScript or Java. Azure AI Foundry for VS Code Extension complements your workflow with capabilities to help you explore models, and develop agents and is now supported with the new Foundry project type. New built-in RBAC roles provide up-to-date role definitions to help admins differentiate access between Administrator, Project Manager and Project users. Foundry RBAC actions follow strict control- and data plane separation, making it easier to implement the principle of least privilege. Why we built these new platform capabilities If you are already building with Azure AI Foundry -- these capabilities are meant to simplify platform management, enhance workflows that span multiple models and tools, and reinforce governance capabilities, as we see AI workloads grow more complex. The emergence of generative AI fundamentally changed how customers build AI solutions, requiring capabilities that span multiple traditional domains. We launched Azure AI Foundry to provide a comprehensive toolkit for exploring, building and evaluating this new wave of GenAI solutions. Initially, this experience was backed by two core Azure services -- Azure AI Services for accessing models including those from OpenAI, and Azure Machine Learning’s hub, to access tools for orchestration and customization. With the emergence of AI agents composing models and tools; and production workloads demanding the enforcement of central governance across those, we are investing to bring the management of agents, models and their tooling integration layer together to best serve these workload’s requirements. The Azure AI Foundry resource and Foundry API are purposefully designed to unify and simplify the composition and management of core building blocks of AI applications: Models Agents & their tools Observability, Security, and Trust In this new era of AI, there is no one-size-fits-all approach to building AI agents and applications. That's why we designed the new platform as a comprehensive AI factory with modular, extensible, and interoperable components. Foundry Project vs Hub-Based Project Going forward, new agents and model-centric capabilities will only land on the new Foundry project type. This includes access to Foundry Agent Service in GA and Foundry API. While we are transitioning to Azure AI Foundry as a managed platform service, hub-based project type remains accessible in Azure AI Foundry portal for GenAI capabilities that are not yet supported by the new resource type. Hub-based projects will continue to support use cases for custom model training in Azure Machine Learning Studio, CLI and SDK. For a full overview of capabilities supported by each project type, see this support matrix. Azure AI Foundry Agent Service The Azure AI Foundry Agent Service experience, now generally available, is powered by the new Foundry project. Existing customers exploring the GA experience will need the new AI Foundry resource. All new investments in the Azure AI Foundry Agent Service are focused on the Foundry project experience. Foundry projects act as secure units of isolation and collaboration — agents within a project share: File storage Thread storage (i.e. conversation history) Search indexes You can also bring your own Azure resources (e.g., storage, bring-your-own virtual network) to support compliance and control over sensitive data. Start Building with Foundry Azure AI Foundry is your foundation for scalable, secure, and production-grade AI development. Whether you're building your first agent or deploying a multi-agent workforce at Scale, Azure AI Foundry is ready for what’s next.4.2KViews2likes0CommentsAI Avatars: Redefining Human-Digital Interaction in the Enterprise Era
In today’s AI-driven world, businesses are constantly seeking innovative ways to humanize digital experiences. AI Avatars are emerging as a powerful solution—bridging the gap between intelligent automation and authentic, human-like engagement. With advancements in speech synthesis, large language models, and avatar rendering technologies, organizations can now deploy AI-powered digital assistants that not only understand and respond but also interact with a lifelike presence. The Rise of AI Avatars in Enterprise Applications AI Avatars go beyond traditional chatbots or voice assistants. These virtual beings offer multimodal interaction—combining voice, visual cues, and conversational intelligence into a seamless user experience. Built on enterprise-grade platforms like Azure AI, these avatars can be integrated into customer support portals, digital kiosks, internal knowledge hubs, and more. Their utility spans a range of industries: Retail: Personalized shopping assistants that guide consumers through products. Healthcare: Virtual health concierges that help patients navigate care. Education: Interactive tutors that deliver lessons with empathy and responsiveness. HR and Training: Onboarding avatars that answer employee questions, onboard new hires, or provide compliance updates. One of our key partners, Cloudforce, has integrated AI Avatar technology directly into their flagship platform nebulaONE®. This integration enables enterprises to deploy digital assistants that are deeply embedded in business processes, offering contextualized support and real-time engagement. From training and onboarding to employee self-service, nebulaONE's agentic AI Avatars act as a digital bridge between users and systems—driving efficiency, engagement, and satisfaction. Partner Spotlight: Cloudforce’s Avatar Initiative To operationalize and productize AI Avatars, Microsoft collaborates with a growing ecosystem of partners. Cloudforce is one of the early pioneers in this space. Their work in embedding avatars into nebulaONE demonstrates what’s possible when advanced AI meets real-world enterprise needs. With a vision to transform user interaction across industries, Cloudforce built a production-grade AI Avatar module designed to support customer Q&A, knowledge discovery, and live guided walkthroughs. Leveraging Azure OpenAI, Azure AI Speech, and privately-deployed secure cloud infrastructure, they have brought conversational intelligence to life—with both a face and a voice. Looking ahead, Cloudforce’s broader vision is to bring AI Avatar capabilities to millions of students—delivering immersive learning experiences that blend interactivity, personalization, and scale. Their education-focused roadmap enhancements highlight the potential of avatars not just as productivity agents, but as accessible and empathetic digital educators, delivering equitable access to knowledge previously reserved for a fortunate few. This kind of partner innovation illustrates how AI Avatars can be customized and scaled to deliver tangible business value across multiple domains. Partner Contribution "Students are already embracing generative AI at a pace and proficiency that far exceeds many professional audiences. With Azure's AI Avatar technology, educators and institutions can tailor unique GenAI interactions that promote reasoning and learning over simply receiving answers the way they would with common public bots." says Husein Sharaf, Founder and CEO at Cloudforce. "We understand the concerns and hesitation that our education partners are currently grappling with, however we believe they can and should take an active role in shaping how this transformative technology is leveraged across their campuses, or risk being left behind as students choose their own adventure." "Microsoft's enterprise AI capabilities are enabling partners like us to deliver secure, cost-efficient, and responsible AI experiences at scale. With the Azure AI Foundry and key innovations like AI Avatars as our building blocks, the nebulaONE platform is poised to serve as the GenAI gateway to tens of thousands of business users, and millions of students at leading educational institutions globally. Our customers are seeking unique differentiators that will enable them to compete and win in the age of AI, and our collaboration with Microsoft is empowering us to deliver just that." Summary AI Avatars represent the next frontier in digital interaction. By combining conversational AI, expressive voice synthesis, and realistic visual rendering, these intelligent agents deliver truly human-like experiences—at scale. They are not just tools, but digital extensions of your brand. Partners like Cloudforce are leading the way with innovative platforms like nebulaONE, showing how this technology can be embedded into enterprise solutions and educational experiences to drive efficiency with a human touch. While Cloudforce is among the first to productize AI Avatars using Azure AI, they are part of a growing movement—helping to shape the future of AI-powered experiences across industries. As AI continues to evolve, avatars will become a standard interface—transforming the way we learn, work, and engage with digital systems.2KViews7likes2Comments