What’s trending on Hugging Face, March 16, 2026
This week's Model Mondays edition features three models. Two have just arrived in Microsoft Foundry: Microsoft's VibeVoice-ASR, a unified speech-to-text model that handles 60-minute audio files in a single pass with built-in speaker diarisation and timestamps, and MiniMaxAI's MiniMax-M2.5, a frontier agentic model that leads on coding and tool-use benchmarks with performance comparable to the strongest proprietary models at a fraction of their cost. Rounding out the trio is Qwen's Qwen3.5-9B, the largest of the Qwen3.5 Small Series.
All three represent a shift toward long-context, multi-step capability: VibeVoice-ASR processes up to an hour of continuous audio without chunking; MiniMax-M2.5 handles complex, multi-phase agentic tasks more efficiently than its predecessor—completing SWE-Bench Verified 37% faster than M2.1 with 20% fewer tool-use rounds; and Qwen3.5-9B brings multimodal reasoning on consumer hardware that outperforms much larger models.
Models of the week
VibeVoice-ASR
Model Specs
- Parameters / size: ~8.3B
- Primary task: Automatic Speech Recognition with diarisation and timestamps
Why it's interesting
- 60-minute single-pass with full speaker attribution: VibeVoice-ASR processes up to 60 minutes of continuous audio without chunk-based segmentation, yielding structured JSON output with start/end timestamps, speaker IDs, and transcribed content for each segment. This eliminates the speaker-tracking drift and semantic discontinuities that chunk-based pipelines introduce at segment boundaries (a request-and-parse sketch follows this list).
- Joint ASR, diarisation, and timestamps in one model: Rather than running separate systems for transcription, speaker separation, and timing, VibeVoice-ASR produces all three outputs in a single forward pass. Users can also inject customized hot words—proper nouns, technical terms, or domain-specific phrases—to improve recognition accuracy on specialized content without fine-tuning.
- Multilingual with native code-switching: Supports 50+ languages with no explicit language configuration required and handles code-switching within and across utterances natively. This makes it suitable for multilingual meetings and international call center recordings without pre-routing audio by language.
- Benchmarks: On the Open ASR Leaderboard, VibeVoice-ASR achieves an average WER of 7.77% across 8 English datasets (RTFx 51.80), including 2.20% on LibriSpeech Clean and 2.57% on TED-LIUM. On the MLC-Challenge multi-speaker benchmark: DER 4.28%, cpWER 11.48%, tcpWER 13.02%.
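To make the single-pass workflow concrete, here is a minimal sketch of sending one un-chunked recording to a deployed VibeVoice-ASR endpoint and walking the structured segments. The endpoint URL, the audio/hotwords payload fields, and the start/end/speaker/text segment keys are illustrative assumptions; check the model card and your Foundry deployment for the actual request and response schema.

```python
import base64
import requests

# Hypothetical Foundry deployment of VibeVoice-ASR; replace with your endpoint and key.
ENDPOINT = "https://<your-foundry-endpoint>/score"
API_KEY = "<your-api-key>"

# Send up to 60 minutes of audio in one request (no chunking), optionally
# injecting hot words for proper nouns and domain-specific terms.
with open("team_meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "audio": audio_b64,                                   # assumed field name
    "hotwords": ["Foundry", "VibeVoice", "diarisation"],  # assumed field name
}

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,
)
resp.raise_for_status()

# Assumed response shape: a list of segments with timestamps, speaker IDs, and text.
for seg in resp.json().get("segments", []):
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}")
```

Keeping the full recording in a single request is the point: the model handles diarisation and timestamps jointly, so there is no stitching or post-hoc speaker alignment step.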
Try it
| Use case | What to build | Best practices |
| --- | --- | --- |
| Long-form, multi-speaker transcription for meetings + compliance | A transcription service that ingests up to 60 minutes of audio per request and returns structured segments with speaker IDs, start/end timestamps, and transcript text (ready for search, summaries, or compliance review). | Keep audio un-chunked (single-pass) to preserve speaker coherence and avoid stitching drift; rely on the model's joint ASR, diarisation, and timestamping so you don't need separate diarisation/timestamp pipelines or postprocessing. |
| Multilingual + domain-specific transcription (global support, technical reviews) | A global transcription workflow for multilingual meetings or call center recordings that outputs "who/when/what" and supports vocabulary injection for product names, acronyms, and technical terms. | Provide customized hot words (names, technical terms) in the request to improve recognition on specialized content; no explicit language configuration is needed, and because VibeVoice-ASR supports 50+ languages and code-switching natively, you can avoid pre-routing audio by language. |
Read more about the model and try it for yourself in the Microsoft playground on Hugging Face Spaces.
MiniMax-M2.5
Model Specs
- Parameters / size: ~229B (FP8, Mixture of Experts)
- Primary task: Text generation (agentic coding, tool use, search)
Why it's interesting
- Leading coding benchmark performance: Scores 80.2% on SWE-Bench Verified and 51.3% on Multi-SWE-Bench across 10+ programming languages (Go, C, C++, TypeScript, Rust, Python, Java, and others). In evaluations across different agent harnesses, M2.5 scores 79.7% on Droid and 76.1% on OpenCode—both ahead of Claude Opus 4.6 (78.9% and 75.9% respectively). The model was trained across 200,000+ real-world coding environments covering the full development lifecycle: system design, environment setup, feature iteration, code review, and testing.
- Expert-level search and tool use: M2.5 achieves industry-leading performance in BrowseComp, Wide Search, and Real-world Intelligent Search Evaluation (RISE), laying a solid foundation for autonomously handling complex tasks (a minimal tool-calling sketch follows this list).
- Professional office work: Achieves a 59.0% average win rate against other mainstream models in financial modeling, Word, and PowerPoint tasks, evaluated via the GDPval-MM framework with pairwise comparison by senior domain professionals (finance, law, social sciences). M2.5 was co-developed with these professionals to incorporate domain-specific tacit knowledge—rather than general instruction-following—into the model's training.
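The sketch below shows one way the agentic pattern could look in practice: MiniMax-M2.5 driven through an OpenAI-compatible chat completions endpoint with a single illustrative tool and a sequential tool-call loop. The base URL, deployment name, and the run_tests tool are assumptions, not part of the model's documented API; confirm the exact endpoint shape for your Foundry deployment.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint for a Foundry deployment of MiniMax-M2.5.
client = OpenAI(base_url="https://<your-foundry-endpoint>/v1", api_key="<your-api-key>")

# One illustrative tool; the name, schema, and behavior are hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the repository's unit tests and return a pass/fail summary.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test directory"}},
            "required": ["path"],
        },
    },
}]

messages = [
    {"role": "system", "content": "Plan before editing. Keep tool calls sequential."},
    {"role": "user", "content": "Refactor the payments service and verify the tests still pass."},
]

# Minimal agent loop: let the model decide when to call the tool, feed results back.
while True:
    response = client.chat.completions.create(
        model="MiniMax-M2.5",   # assumed deployment name
        messages=messages,
        tools=tools,
    )
    msg = response.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:
        # In a real agent, dispatch to the actual tool here instead of stubbing the result.
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"passed": 42, "failed": 0}',
        })
```

Keeping tool calls sequential and feeding every result back into the conversation mirrors the best practices in the table below: the model plans, acts, observes, and only then continues.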
Try it
| Use case | What to build | Best practices |
| --- | --- | --- |
| Agentic software engineering | Multi-file code refactors, CI-gated patch generation, long-running coding agents working across large repositories | Start prompts with a clear architecture or refactor goal. Let the model plan before editing files, keep tool calls sequential, and break large changes into staged tasks to maintain state and coherence across long workflows. |
| Autonomous productivity agents | Research assistants, web-enabled task agents, document and spreadsheet generation workflows | Be explicit about intent and expected output format. Decompose complex objectives into smaller steps (search → synthesize → generate), and leverage the model's long-context handling for multi-step reasoning and document creation. |
With these use cases and best practices in mind, the next step is translating them into a clear, bounded prompt that gives the model a specific goal and the right tools to act. The example below shows how a product or engineering team might frame an automated code review and implementation task, so the model can reason through the work step by step and return results that map directly back to the original requirement:
“You're building an automated code review and feature implementation system for a backend engineering team. Deploy MiniMax-M2.5 in Microsoft Foundry with access to your repository's file system tools and test runner. Given a GitHub issue describing a new API endpoint requirement, have the model first write a functional specification decomposing the requirement into sub-tasks, then implement the endpoint across the relevant service files, write unit tests with at least 85% coverage, and return a pull request summary explaining each code change and its relationship to the original requirement. Flag any implementation decisions that deviate from the patterns found in the existing codebase.”
Qwen3.5-9B
Model Specs
- Parameters / size: 9B
- Context length: 262,144 tokens natively; extensible to 1,010,000 tokens
- Primary task: Image-text-to-text (multimodal reasoning)
Why it’s interesting
- High intelligence density at small sizes: Qwen3.5 Small models show large reasoning gains relative to parameter count, with the 4B and 9B variants outperforming other sub-10B models on public reasoning benchmarks.
- Long‑context by default: Support for up to 262K tokens enables long‑document analysis, codebase review, and multi‑turn workflows without chunking.
- Native multimodal architecture: Vision is built into the model architecture rather than added via adapters, allowing small models (0.8B, 2B) to handle image-text tasks efficiently (an image-plus-text request sketch follows this list).
- Open and deployable: Apache‑2.0 licensed models designed for local, edge, or cloud deployment scenarios.
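As an illustration of the native multimodal path, the sketch below sends a scanned form plus a text instruction to a Qwen3.5-9B deployment through an OpenAI-compatible chat completions API, following the describe-then-extract pattern recommended in the table below. The base URL and model name are assumptions; adjust them to match your deployment (a local runtime exposing the same API shape would work identically).

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="https://<your-endpoint>/v1", api_key="<your-api-key>")

# Encode a scanned form so it can be passed inline as a data URL.
with open("scanned_form.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Qwen3.5-9B",   # assumed deployment name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Describe what you see, then extract the form fields as JSON."},
        ],
    }],
)
print(response.choices[0].message.content)
```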
Benchmarks
See more at: AI Model & API Providers Analysis | Artificial Analysis
Try it
| Use case | When to use | Best-practice prompt pattern |
| --- | --- | --- |
| Long-context reasoning | Analyzing full PDFs, long research papers, or large code repositories where chunking would lose context | Set a clear goal and scope. Ask the model to summarize key arguments, surface contradictions, or trace decisions across the entire document before producing an output. |
| Lightweight multimodal document understanding | OCR-driven workflows using screenshots, scanned forms, or mixed image-text inputs | Ground the task in the artifact. Instruct the model to first describe what it sees, then extract structured information, then answer follow-up questions. |
With these best practices in mind, Qwen3.5-9B demonstrates how compact, multimodal models can handle complex reasoning tasks without chunking or manual orchestration. The prompt below shows how an operations analyst might use the model to analyze a full report end-to-end:
"You are assisting an operations analyst. Review the attached PDF report and extracted tables. Identify the three largest cost drivers, explain how they changed quarter‑over‑quarter, and flag any anomalies that would require follow‑up. If information is missing, state what data would be needed."
Getting started
You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub: select any supported model and choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. To learn how to discover and deploy models, see the Microsoft Foundry documentation.