Speech recognition technology continues to make enormous strides—but most systems still struggle with real-world audio: long meetings, multi-speaker conversations, domain-specific terminology, and multilingual discussions that don’t fit neatly into short audio chunks.
Today, we’re excited to announce VibeVoice ASR, now available through the Microsoft Foundry model catalog and Hugging Face. The model is also featured in Foundry Labs, where developers can explore and experiment with Microsoft’s latest AI innovations.
VibeVoice ASR is a unified speech-to-text model designed to transcribe up to 60 minutes of continuous audio in a single pass while producing rich, structured output that captures who said what, and when. Built by Microsoft Research and trained to produce structured transcriptions reliably, VibeVoice ASR represents a new approach to speech recognition: one that treats long-form audio understanding as a first-class problem rather than a collection of stitched-together steps.
What’s new
VibeVoice ASR moves beyond traditional automatic speech recognition pipelines by unifying transcription, speaker diarization, and timestamping into a single model and a single inference pass. Instead of slicing audio into short segments and reconciling results afterward, the model processes long recordings holistically, preserving global context across the entire conversation.
This release is designed to support modern, production-scale scenarios—like hourlong meetings, interviews, and podcasts—where accuracy, structure, and consistency matter just as much as raw transcription quality.
VibeVoice ASR is also fully integrated into the Hugging Face Transformers ecosystem and discoverable through the Microsoft Foundry model catalog, making it easy for developers to experiment, evaluate, and deploy using familiar tooling.
Key capabilities
- 60-minute single-pass transcription: Transcribe up to an hour of continuous audio in a single pass, preserving long-range context and speaker consistency, with no manual chunking or post-processing to reconcile results.
- Structured output (who, when, what): Joint ASR, speaker diarization, and timestamping produce transcripts that clearly identify who spoke, when they spoke, and what was said, with minimal post-processing (illustrated in the sketch after this list).
- Customized hotwords: Inject domain-specific vocabulary, names, or technical terms directly into transcription requests to improve accuracy in specialized contexts.
- Multilingual and code-switching support: Native support for 50+ languages, including seamless code-switching, without requiring explicit language configuration.
- Unified, LLM-based architecture: Combines acoustic and semantic audio tokenizers with a large language model decoder to enable long-context speech understanding.
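To make the structured output concrete, here is a minimal sketch of what a speaker-attributed, timestamped transcript can look like. The field names (speaker, start, end, text) are illustrative assumptions, not the model’s published schema; consult the VibeVoice ASR model card for the exact output format.

```python
# Illustrative shape of a structured transcript. Field names are
# hypothetical; see the VibeVoice ASR model card for the real schema.
transcript = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 12.4,
     "text": "Welcome everyone, let's get started."},
    {"speaker": "Speaker 2", "start": 12.4, "end": 27.9,
     "text": "Thanks. First item: the quarterly roadmap."},
]

# A simple "who said what, and when" rendering:
for turn in transcript:
    print(f"[{turn['start']:6.1f}s - {turn['end']:6.1f}s] "
          f"{turn['speaker']}: {turn['text']}")
```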
Getting started
You can start exploring VibeVoice ASR today through Microsoft Foundry, where it’s available in the Foundry model catalog for easy evaluation and deployment alongside other foundation models on the platform.
For teams working directly in the open-source ecosystem, VibeVoice ASR is also available on Hugging Face with full support in the Transformers library. The model supports structured transcription outputs, customized hotwords, and GPU-accelerated inference, making it straightforward to integrate into existing speech or agentic pipelines. For a step-by-step deployment guide, see: Deploy VibeVoice ASR on Microsoft Foundry: Long-Form Transcription, +50 Languages Supported & Speaker Diarization
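If you want to try the model from Transformers directly, the snippet below is a minimal sketch. The model ID microsoft/VibeVoice-ASR is an assumption for illustration, as is compatibility with the generic automatic-speech-recognition pipeline; check the Hugging Face model card for the published checkpoint name and for how structured output and hotwords are passed.

```python
from transformers import pipeline

# Hypothetical model ID for illustration; see the Hugging Face model card
# for the published checkpoint name.
asr = pipeline(
    "automatic-speech-recognition",
    model="microsoft/VibeVoice-ASR",
    device=0,  # GPU-accelerated inference; use device=-1 to run on CPU
)

# Up to 60 minutes of audio goes through in a single pass, so the full
# recording is passed as-is: no manual chunking, no result reconciliation.
result = asr("meeting_recording.wav", return_timestamps=True)
print(result["text"])
```

Speaker-attributed output and hotword injection are model-specific features rather than standard pipeline arguments, so treat the model card as the authoritative reference for those options.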
We’re excited to see what you build with it.
Explore more from Microsoft in Foundry Labs
Do you want to explore more of the latest AI innovations from Microsoft? Head to Foundry Labs to discover and experiment with cutting‑edge AI models and solutions. Explore emerging capabilities and join a community that’s shaping the future of AI.