
Microsoft Foundry Blog

Introducing VibeVoice ASR: Longform, Structured Speech Recognition At Scale

YanXia001
Microsoft
Mar 12, 2026

Speech recognition technology continues to make enormous strides—but most systems still struggle with real-world audio: long meetings, multi-speaker conversations, domain-specific terminology, and multilingual discussions that don’t fit neatly into short audio chunks.

Today, we’re excited to announce VibeVoice ASR, now available through Foundry Model Catalog and Hugging Face. The model is also featured in Foundry Labs, where developers can explore and experiment with Microsoft’s latest AI innovations.

VibeVoice ASR is a unified speech-to-text model designed to transcribe up to 60 minutes of continuous audio in a single pass, while producing rich, structured output that captures who said what, and when. Built by Microsoft Research and trained to produce structured transcriptions reliably, VibeVoice ASR represents a new approach to speech recognition: one that treats longform audio understanding as a first-class problem rather than a collection of stitched-together steps.

What’s new

VibeVoice ASR moves beyond traditional automatic speech recognition pipelines by unifying transcription, speaker diarization, and timestamping into a single model and a single inference pass. Instead of slicing audio into short segments and reconciling results afterward, the model processes long recordings holistically, preserving global context across the entire conversation.

This release is designed to support modern, production-scale scenarios—like hourlong meetings, interviews, and podcasts—where accuracy, structure, and consistency matter just as much as raw transcription quality.

VibeVoice ASR is also fully integrated into the Hugging Face Transformers ecosystem and discoverable through the Microsoft Foundry model catalog, making it easy for developers to experiment, evaluate, and deploy using familiar tooling.

Key capabilities

  • 60-minute single-pass transcription
    Transcribe up to an hour of continuous audio in a single pass, preserving long‑range context and speaker consistency—without the need for manual chunking or post‑processing to reconcile results.
  • Structured output (who, when, what)
    Joint ASR, speaker diarization, and timestamping produce transcripts that clearly identify who spoke, when they spoke, and what was said—with minimal post-processing.
  • Customized hotwords
    Inject domain-specific vocabulary, names, or technical terms directly into transcription requests to improve accuracy in specialized contexts.
  • Multilingual and code-switching support
    Native support for 50+ languages, including seamless code-switching, without requiring explicit language configuration.
  • Unified, LLM-based architecture
    Combines acoustic and semantic audio tokenizers with a large language model decoder to enable long-context speech understanding.
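To make the "who, when, what" structure concrete, here is a minimal post-processing sketch. The segment schema below (speaker/start/end/text dicts) is our own illustrative assumption, not the model's documented output format; the helper merges consecutive segments from the same speaker into conversational turns:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str

def parse_segments(segments):
    """Convert raw segment dicts into Turn objects, merging
    consecutive segments spoken by the same speaker."""
    turns = []
    for seg in segments:
        t = Turn(seg["speaker"], seg["start"], seg["end"], seg["text"])
        if turns and turns[-1].speaker == t.speaker:
            # Same speaker kept talking: extend the current turn.
            turns[-1].end = t.end
            turns[-1].text += " " + t.text
        else:
            turns.append(t)
    return turns

# Hypothetical structured output: who spoke, when, and what was said.
raw = [
    {"speaker": "S1", "start": 0.0, "end": 4.2, "text": "Welcome, everyone."},
    {"speaker": "S1", "start": 4.2, "end": 7.9, "text": "Let's get started."},
    {"speaker": "S2", "start": 8.1, "end": 12.5, "text": "Thanks for the intro."},
]
for t in parse_segments(raw):
    print(f"[{t.start:.1f}-{t.end:.1f}] {t.speaker}: {t.text}")
```

Because the model emits the full hour in one pass, speaker labels stay consistent across the recording, so a simple merge like this is enough to produce readable turn-by-turn transcripts.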

Getting started

You can start exploring VibeVoice ASR today through Microsoft Foundry, where it’s available in the Foundry model catalog for easy evaluation and deployment alongside other foundation models on the platform.

For teams working directly in the open‑source ecosystem, VibeVoice ASR is also available on Hugging Face with full support in the Transformers library. The model supports structured transcription outputs, customized hotwords, and GPU‑accelerated inference, making it straightforward to integrate into existing speech or agentic pipelines. For a recent deployment walkthrough, see: Deploy VibeVoice ASR on Microsoft Foundry: Long-Form Transcription, +50 Languages Supported & Speaker Diarization
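When supplying customized hotwords with a transcription request, it usually pays to clean the list first. The helper below is a small, model-agnostic sketch; the idea that hotwords are passed as a list of strings is our assumption (check the model card on Hugging Face for the actual parameter). It drops empty entries, deduplicates case-insensitively, and orders longer phrases first so multi-word terms take precedence:

```python
def prepare_hotwords(hotwords, max_items=100):
    """Normalize a hotword list before attaching it to a request.

    - strips surrounding whitespace and drops empty entries
    - deduplicates case-insensitively, keeping the first spelling seen
    - sorts longer phrases first so multi-word terms are preferred
    - caps the list at max_items to keep requests bounded
    """
    seen = set()
    cleaned = []
    for word in hotwords:
        w = word.strip()
        key = w.lower()
        if not w or key in seen:
            continue
        seen.add(key)
        cleaned.append(w)
    cleaned.sort(key=len, reverse=True)
    return cleaned[:max_items]

print(prepare_hotwords(["Kubernetes", " kubernetes", "Foundry Model Catalog", "", "ASR"]))
# -> ['Foundry Model Catalog', 'Kubernetes', 'ASR']
```

A curated, bounded hotword list like this keeps requests predictable when the vocabulary comes from a large glossary or CRM export.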

We’re excited to see what you build with it.

Explore more from Microsoft in Foundry Labs

Do you want to explore more of the latest AI innovations from Microsoft? Head to Foundry Labs to discover and experiment with cutting‑edge AI models and solutions. Explore emerging capabilities and join a community that’s shaping the future of AI.
