Apps on Azure Blog
Microsoft’s New In‑House AI Models (MAI‑Transcribe, MAI‑Voice, MAI‑Image)

kumar_rahul
Microsoft
Apr 26, 2026

What Are the New MAI Models?

MAI‑Transcribe‑1 (Speech‑to‑Text)

MAI‑Transcribe‑1 is Microsoft’s first‑generation in‑house speech recognition model. It supports 25 languages and is optimized for real‑world, noisy enterprise audio, such as meetings and call centers.

Key highlights

  • Enterprise‑grade transcription accuracy
  • Designed for multilingual and accented speech
  • Lower GPU cost compared to prior Azure speech offerings

MAI‑Voice‑1 (Text‑to‑Speech)

MAI‑Voice‑1 is a high‑fidelity voice generation model capable of producing natural, expressive speech while preserving speaker identity over long‑form audio. 

Key highlights

  • Generates up to 60 seconds of audio in ~1 second
  • Supports custom voice creation
  • Optimized for voice agents and conversational systems
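
The throughput figure above can be turned into a rough planning number. The sketch below computes the implied real‑time factor and, assuming generation time scales linearly with clip length (an assumption, not something the announcement states), estimates how long a longer clip might take. The function names are illustrative only.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return seconds of audio produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def estimated_generation_time(audio_seconds: float, rtf: float) -> float:
    """Estimate generation time for a clip, ASSUMING linear scaling with length."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# Figures quoted above: up to 60 s of audio in roughly 1 s.
rtf = real_time_factor(60.0, 1.0)

# Hypothetical workload: a 10-minute narration (600 s of audio).
narration_estimate = estimated_generation_time(600.0, rtf)

print(f"real-time factor: {rtf:.0f}x")
print(f"10-minute narration: ~{narration_estimate:.0f} s (if scaling is linear)")
```

Real latency will also depend on network overhead, queueing, and hardware, so this is best treated as an order‑of‑magnitude estimate.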

MAI‑Image‑2 (Text‑to‑Image)

MAI‑Image‑2 is Microsoft’s highest‑capability text‑to‑image model. It already ranks among the top image models and is used in production Copilot experiences.

Key highlights

  • High‑quality photorealistic image generation
  • Accurate in‑image text rendering
  • Production‑ready latency and cost profile

Why This Matters for Azure Developers

For Azure developers, this launch changes three things fundamentally:

  1. First‑party AI stack
    Developers can now build speech, voice, and image workloads without relying on external AI providers.
  2. Enterprise‑ready by default
    These models inherit Azure RBAC, Managed Identity, compliance, and governance through Microsoft Foundry.
  3. Agent‑first design
    MAI models are designed to be embedded inside AI agents, not just called as standalone APIs.
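
To make the agent‑first point concrete, here is a minimal sketch of a voice agent that chains speech‑to‑text, a reasoning step, and text‑to‑speech across multiple turns. Everything in it is an assumption for illustration: the `VoiceAgent` class, the callable signatures, and the stub functions are hypothetical and are not part of any MAI or Foundry SDK. In a real system the stubs would be replaced with actual model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures: each callable stands in for a model call.
Transcriber = Callable[[bytes], str]    # audio -> text (e.g. a speech-to-text model)
Reasoner = Callable[[List[str]], str]   # conversation history -> reply text
Synthesizer = Callable[[str], bytes]    # text -> audio (e.g. a text-to-speech model)


@dataclass
class VoiceAgent:
    """Keeps conversation state and routes each turn through three models."""
    transcribe: Transcriber
    reason: Reasoner
    synthesize: Synthesizer
    history: List[str] = field(default_factory=list)

    def handle_turn(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        self.history.append(f"user: {user_text}")
        reply_text = self.reason(self.history)
        self.history.append(f"agent: {reply_text}")
        return self.synthesize(reply_text)


# Stubs so the sketch runs offline; they do NOT call any real service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reason(history: List[str]) -> str:
    return f"I heard {len(history)} message(s)."


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(fake_transcribe, fake_reason, fake_synthesize)
first_reply = agent.handle_turn(b"hello")
second_reply = agent.handle_turn(b"what is my order status?")
print(first_reply, second_reply, len(agent.history))
```

The design choice worth noting is that the agent, not the caller, owns the conversation history, which is what distinguishes a long‑running workflow from a series of independent single‑request API calls.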

 

Below is a common enterprise architecture using MAI models.

[Figure: enterprise architecture using MAI models; image not reproduced]

Sample code calling MAI‑Transcribe‑1:

[Code sample image not reproduced]
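
As a rough illustration of what a call to MAI‑Transcribe‑1 might look like, the sketch below builds an authenticated HTTP request carrying audio bytes. It is not the official API. The endpoint host, the URL path, the `language` query parameter, and the `audio/wav` content type are all placeholders chosen for this example, and the token is assumed to be a Microsoft Entra ID bearer token obtained elsewhere (for instance through Managed Identity). The request is constructed but deliberately not sent.

```python
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder host and path, not a documented MAI endpoint.
ENDPOINT = "https://example-resource.example.com"
TRANSCRIBE_PATH = "/mai-transcribe-1/transcriptions"


def build_transcription_request(
    audio: bytes, token: str, language: str = "en"
) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST carrying raw audio bytes."""
    if not audio:
        raise ValueError("audio must not be empty")
    # ASSUMPTION: the language is passed as a query parameter.
    query = urllib.parse.urlencode({"language": language})
    url = f"{ENDPOINT}{TRANSCRIBE_PATH}?{query}"
    headers = {
        "Authorization": f"Bearer {token}",  # bearer token, e.g. from Entra ID
        "Content-Type": "audio/wav",         # ASSUMPTION: WAV input
    }
    return urllib.request.Request(url, data=audio, headers=headers, method="POST")


request = build_transcription_request(b"RIFF....WAVE", token="<entra-id-token>")
print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform
# the call once the placeholders are replaced with values from the official docs.
```

Consult the Microsoft Foundry documentation for the actual endpoint, request schema, and SDK before adapting this.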

What Changed with MAI Models: Before vs After (Developer Perspective)

Microsoft’s MAI models are not just new endpoints. They represent a fundamental shift in how Azure developers build multimodal and agent‑based AI solutions.

High‑Level Comparison

| Aspect | Before MAI (Azure & External Models) | After MAI (MAI‑Transcribe, Voice, Image) |
| --- | --- | --- |
| Model Ownership | Heavy dependency on third‑party models (OpenAI, external TTS/STT providers) | First‑party Microsoft‑built models, operated and optimized by Microsoft |
| Enterprise Integration | AI models integrated into Azure | AI models native to Microsoft Foundry |
| Governance & Compliance | Mixed controls depending on model provider | Unified Azure RBAC, Entra ID, Purview, Managed Identity |
| Agent Readiness | Primarily single‑request / single‑response APIs | Designed for agent‑oriented, long‑running workflows |
| Cost Predictability | Token‑based or mixed pricing models | Enterprise‑optimized price‑to‑performance models |
| Operational Consistency | Different SDKs, APIs, quotas | Single Foundry tooling and SDK surface |

 

Published Apr 26, 2026
Version 1.0