Empowering Developers and Businesses with GenAI-Powered Speech Services in Foundry Tools
Speaking with AI is becoming as natural as talking to a friend. Voice is quickly emerging as the most intuitive way for humans to interact with technology, and advances in conversational AI are helping enterprises deliver consistent customer experiences, slash operating costs, and scale effortlessly. As voice becomes the primary interface between people and AI, we’re excited to unveil the next wave of Azure Speech innovations at Microsoft Ignite 2025—unlocking new possibilities for developers and organizations to elevate their voice, translation, and conversational AI experiences.
What’s new in Azure Speech
Here’s what is new, why it matters, and how you can get hands-on at Ignite.
Voice Live API GA: Voice-Enable Any Agent
Voice Live API is a single, unified API for real-time speech-to-speech conversations, opening the door for developers to build advanced voice assistants, agents, and chatbots with robust speech input and natural speech output. At Ignite 2025, we are announcing the general availability of Voice Live API with more built-in GenAI models and the ability to Bring Your Own Foundry Models. We are also announcing the ability to generate expressive avatars from a single photo through Voice Live API with the new Photo Avatar feature, now in public preview, to support more engaging experiences. With Voice Live API, you can now do the following (a minimal connection sketch follows the list):
- Choose from 10+ natively integrated foundation GenAI models including the latest GPT-Realtime and GPT-Realtime-Mini or bring your own model deployed in Microsoft Foundry
- Add accurate speech input across 140+ locales, and select from hundreds of natural, multilingual voices, including the latest HD V2 voices, across 150+ locales
- Turn on Azure semantic VAD (voice activity detection) to better detect user speech activities for natural conversations
- Customize your speech input model to increase the accuracy for your use case, and create brand identities for your voice agents with custom voice and custom avatar, including our newly released lifelike photo avatars.
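To ground the list above, here is a minimal connection sketch in Python, assuming a WebSocket-based realtime protocol. The endpoint path, api-version value, authentication header, and session field names are assumptions for illustration rather than the authoritative protocol; consult the Voice Live API documentation and samples for exact values.

```python
# Hedged sketch: connect to Voice Live over WebSocket and configure a session.
# Endpoint path, api-version, header, and event field names are assumptions.
import asyncio
import json
import os

import websockets  # pip install websockets

RESOURCE = os.environ["AZURE_SPEECH_RESOURCE"]  # e.g. "my-foundry-resource"
API_KEY = os.environ["AZURE_SPEECH_KEY"]

# Assumed endpoint shape; the model query parameter selects a built-in GenAI
# model (e.g. gpt-realtime) or a model you deployed in Microsoft Foundry.
URL = (
    f"wss://{RESOURCE}.cognitiveservices.azure.com/voice-live/realtime"
    "?api-version=2025-05-01-preview&model=gpt-realtime"
)

async def main() -> None:
    # On older versions of the websockets package, use extra_headers= instead.
    async with websockets.connect(URL, additional_headers={"api-key": API_KEY}) as ws:
        # Configure instructions, voice output, and semantic VAD for the agent.
        # Field names below are illustrative, not authoritative.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a concise, friendly support agent.",
                "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard"},
                "turn_detection": {"type": "azure_semantic_vad"},
            },
        }))
        # From here you would stream microphone audio up and play response audio back.
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))

if __name__ == "__main__":
    asyncio.run(main())
```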
Since its public preview, Voice Live API has been evaluated extensively by thousands of customers across scenarios such as customer service, in-car assistants, public service agents, employee support agents, e-learning agents, conversational chat assistants, and more.
With Voice Live, eClinicalWorks’ healow Genie contact center solution is now taking healthcare modernization a step further. Refer to the customer story here.
“Genie gives providers a way to use AI to converse with patients through their preferred channel. By automating routine calls, Genie will slash administrative burden and costs while delivering fast, accurate answers in a natural voice. Powered by Azure Speech Voice Live API and Azure Communication Services, Genie understands patient questions and responds instantly—making every interaction seamless and human-like.”
- Sidd Shah, Vice President of Strategy & Business Growth, healow
Capgemini, a global business and technology transformation partner, is reimagining its global service desk managed operations through its Capgemini Cloud Infrastructure Services (CIS) division.
“Integrating Microsoft’s Voice Live API into our platform has been transformative. We’re seeing measurable improvements in user engagement and satisfaction thanks to the API’s low-latency, high-quality voice interactions. As a result, we are able to deliver more natural and responsive experiences, which have been positively received by our customers.”
- Stephen Hilton, EVP Chief Operating Officer at CIS Capgemini
Astra Tech, a fast-growing UAE-based technology group and part of G42, is bringing Voice Live API to its flagship platform, botim, a fintech-first and AI-native platform. Refer to the customer story here.
"Voice Live API acts as a connective tissue for AI-driven conversation across every layer of the app. It gives us a standardized framework so different product teams can incorporate voice without needing to hire deep AI expertise.”
- Frenando Ansari, Lead Product Manager, Astra Tech
Live Interpreter General Availability: Real-Time, Human-Like Multilingual Communication
Our speech translation technology already powers Microsoft Teams, Microsoft Translator, and other popular Microsoft products. Every month, hundreds of thousands of Microsoft Teams attendees use the Live Interpreter Agent in their meetings. Today, we are taking another big step with the general availability of the Live Interpreter API, which gives developers the same underlying model and technology that powers Microsoft Teams to deliver real-time speech-to-speech translation with personal voice. Key capabilities of the Live Interpreter API include:
- Ultra-low latency that approaches human parity
- Automatic, continuous language detection across all 76 input languages supported by Azure Speech. Supported output languages include English, German, Spanish, French, Italian, Japanese, Korean, Portuguese, and Chinese (Simplified), and we plan to expand output language coverage in upcoming releases.
- Personal voice that preserves the tone and style of the original speakers, making the conversation experience natural and authentic
With just a few lines of code, developers can now enable seamless, inclusive multilingual communication across business scenarios such as live translated contact center calls, multilingual online meetings, e-commerce or e-sports streaming translation, and more. Customers such as Oppo and Caption Connect are already testing Live Interpreter in business settings.
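As a hedged starting point, the sketch below uses the existing Azure Speech SDK speech-translation pattern (the azure-cognitiveservices-speech package) with automatic source-language detection. The Live Interpreter API's configuration, in particular continuous detection across all 76 input languages and personal-voice output, may differ from this, so treat it as an approximation rather than the Live Interpreter surface itself.

```python
# Sketch using the existing Speech SDK translation APIs; Live Interpreter-specific
# options (continuous language detection, personal voice) may be configured differently.
import os
import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ.get("AZURE_SPEECH_REGION", "eastus"),
)
translation_config.add_target_language("es")  # translate into Spanish
translation_config.add_target_language("ja")  # ...and Japanese

# Let the service detect which of these languages the speaker is using.
auto_detect = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "fr-FR", "de-DE"]
)

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    auto_detect_source_language_config=auto_detect,
    audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print("Recognized:", result.text)
    print("Spanish:   ", result.translations["es"])
    print("Japanese:  ", result.translations["ja"])
```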
LLM Speech Public Preview: LLM-Powered Fast Speech Transcription and Translation
We’re thrilled to announce the public preview of LLM Speech, a new API powered by large language models for advanced audio file transcription and translation. Key capabilities of the LLM Speech API include:
- Accurate transcription and translation with strong contextual understanding
- Multilingual support
- Prompt tuning of the output text
- Rich add-ons such as speaker diarization, word timing, multi-channel audio, and more
- Ultra-fast inference
This API is ideal for scenarios such as meeting notes, call center agent assist, voicemails, pre-generated audio captions and subtitles, and more. Customers, including Anker, have been leveraging LLM Speech in their latest applications and devices.
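For a concrete feel, here is a minimal request sketch in Python. It follows the shape of the existing Azure Speech fast transcription REST API (the /speechtotext/transcriptions:transcribe route); the LLM Speech preview may use a different route, api-version, or options such as a prompt field, so treat the endpoint and payload names as assumptions and check the LLM Speech documentation.

```python
# Hedged sketch: transcribe an audio file with a single REST call, modeled on the
# existing Azure Speech fast transcription API. The LLM Speech preview may expose
# a different route or options (for example a prompt field for output tuning).
import json
import os

import requests  # pip install requests

REGION = os.environ.get("AZURE_SPEECH_REGION", "eastus")
KEY = os.environ["AZURE_SPEECH_KEY"]

url = (
    f"https://{REGION}.api.cognitive.microsoft.com/"
    "speechtotext/transcriptions:transcribe?api-version=2024-11-15"
)

definition = {
    "locales": ["en-US"],  # add more locales for multilingual audio
    "diarization": {"maxSpeakers": 4, "enabled": True},
}

with open("meeting.wav", "rb") as audio:
    response = requests.post(
        url,
        headers={"Ocp-Apim-Subscription-Key": KEY},
        files={
            "audio": audio,
            "definition": (None, json.dumps(definition), "application/json"),
        },
        timeout=120,
    )

response.raise_for_status()
result = response.json()
# combinedPhrases holds the full transcript; phrases carry word timing and speaker info.
print(result["combinedPhrases"][0]["text"])
```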
Photo Avatar Public Preview: Generate Expressive Avatars from a Single Image
At Ignite 2025, we are also unveiling Photo Avatar, powered by Microsoft Research’s VASA-1 model, now available in public preview. Azure Speech Voice Live API has been enabling agent development with video avatars, and now, with the introduction of Photo Avatar, developers can create personalized, visually enhanced voice agents more easily. Photo avatars are head-only, designed to convey expressive and natural facial emotions. Unlike video avatars, which require lengthy video shoots and complex model training, a Photo Avatar is created instantly from a single image.
There are two kinds of Photo Avatars available:
- Standard: We are offering 30 new standard photo avatars, available out of the box.
- Custom: Customers can also create unique customized photo avatars tailored to their brand. These avatars are accessible via Microsoft Foundry or API integration.
Photo avatar is available as an add-on in Voice Live, bringing a more engaging visual experience to conversational AI. It can also be used in scenarios such as video content creation, enabling developers to produce visually compelling talking-head videos with little effort.
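To illustrate where a photo avatar might plug into an agent, here is a hypothetical extension of the Voice Live session configuration from the earlier sketch. Every field name in the avatar block is an illustrative assumption, not the documented schema; refer to the Photo Avatar and Voice Live documentation for the supported options.

```python
# Hypothetical extension of the earlier Voice Live session.update payload, adding
# an avatar section. All field names here are illustrative assumptions; see the
# Photo Avatar / Voice Live documentation for the supported schema.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a friendly onboarding assistant.",
        "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard"},
        # Assumed shape: select one of the standard photo avatars, or a custom
        # avatar created from a single image in Microsoft Foundry.
        "avatar": {
            "type": "photo",
            "character": "standard-photo-avatar-01",  # hypothetical avatar id
            "video": {"resolution": "1080p"},          # hypothetical option
        },
    },
}
```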
Gulf Air, the national carrier of Bahrain, uses photo avatars as virtual assistants in its pilot, cabin crew, and engineer training sessions, enhancing the learning experience across the organization.
"We had the opportunity to test the VASA-1 powered Photo Avatar with Voice Live API, and we were impressed by the ability to convert an image to expressive avatar and instantly enable live speech-to-speech interaction, offering an incredibly smooth and user-friendly experience. We could create a fully functional live avatar assistant ready to use, with minimal setup effort. This technology opens endless possibilities for Gulf Air Group, from internal departments with branded avatars to event-specific designs, all without extensive model training. Photo Avatar with Voice Live is a game-changer for creating personalized digital agents aligned with our brand."
- Ahmed Ebrahim, Manager Digital Innovation and Process Management, Gulf Air
Developer Experience: Simpler, Faster, and More Flexible
At Ignite 2025, we’re excited to announce major enhancements to the Azure Speech developer experience in the Microsoft Foundry portal. Developers can easily access and deploy Azure Speech APIs from the model catalog, including Speech to Text, Text to Speech, Text to Speech Avatar, and Voice Live, streamlining the testing and rollout of cutting-edge speech AI capabilities in their applications.
Additionally, the expanded Tools catalog in Microsoft Foundry introduces the Azure Speech MCP Server to provide speech capabilities as tools to build AI agents.
These capabilities are available today. Get started in Microsoft Foundry by creating a new Foundry resource or by using an existing one.
Get started today
The easiest way to explore is through the Microsoft Foundry portal and the Foundry Tools catalog. From there, you can follow the documentation and Microsoft Learn courses and start building with Azure Speech using the Azure Speech documentation.
If you’re attending Microsoft Ignite 2025, or watching on-demand content later, be sure to check out these sessions: