Now in Foundry: VibeVoice-ASR, MiniMax M2.5, Qwen3.5-9B
This week's Model Mondays edition features three models that have just arrived in Microsoft Foundry: Microsoft's VibeVoice-ASR, a unified speech-to-text model that handles 60-minute audio files in a single pass with built-in speaker diarisation and timestamps; MiniMaxAI's MiniMax-M2.5, a frontier agentic model that leads on coding and tool-use benchmarks with performance comparable to the strongest proprietary models at a fraction of their cost; and Qwen's Qwen3.5-9B, the largest of the Qwen3.5 Small Series. All three represent a shift toward long-context, multi-step capability: VibeVoice-ASR processes up to an hour of continuous audio without chunking; MiniMax-M2.5 handles complex, multi-phase agentic tasks more efficiently than its predecessor, completing SWE-Bench Verified 37% faster than M2.1 with 20% fewer tool-use rounds; and Qwen3.5-9B brings multimodal reasoning to consumer hardware while outperforming much larger models.

Models of the week

VibeVoice-ASR

Model Specs
- Parameters / size: ~8.3B
- Primary task: Automatic speech recognition with diarisation and timestamps

Why it's interesting
- 60-minute single-pass with full speaker attribution: VibeVoice-ASR processes up to 60 minutes of continuous audio without chunk-based segmentation, yielding structured JSON output with start/end timestamps, speaker IDs, and transcribed content for each segment. This eliminates the speaker-tracking drift and semantic discontinuities that chunk-based pipelines introduce at segment boundaries.
- Joint ASR, diarisation, and timestamps in one model: Rather than running separate systems for transcription, speaker separation, and timing, VibeVoice-ASR produces all three outputs in a single forward pass. Users can also inject customized hot words (proper nouns, technical terms, or domain-specific phrases) to improve recognition accuracy on specialized content without fine-tuning.
- Multilingual with native code-switching: Supports 50+ languages with no explicit language configuration required and handles code-switching within and across utterances natively. This makes it suitable for multilingual meetings and international call center recordings without pre-routing audio by language.
- Benchmarks: On the Open ASR Leaderboard, VibeVoice-ASR achieves an average WER of 7.77% across 8 English datasets (RTFx 51.80), including 2.20% on LibriSpeech Clean and 2.57% on TED-LIUM. On the MLC-Challenge multi-speaker benchmark: DER 4.28%, cpWER 11.48%, tcpWER 13.02%.

Try it

| Use case | What to build | Best practices |
|---|---|---|
| Long-form, multi-speaker transcription for meetings and compliance | A transcription service that ingests up to 60 minutes of audio per request and returns structured segments with speaker IDs, start/end timestamps, and transcript text (ready for search, summaries, or compliance review). | Keep audio un-chunked (single-pass) to preserve speaker coherence and avoid stitching drift; rely on the model's joint ASR, diarisation, and timestamping so you don't need separate diarisation/timestamp pipelines or postprocessing. |
| Multilingual and domain-specific transcription (global support, technical reviews) | A global transcription workflow for multilingual meetings or call center recordings that outputs "who/when/what" and supports vocabulary injection for product names, acronyms, and technical terms. | Provide customized hot words (names, technical terms) in the request to improve recognition on specialized content. No explicit language configuration is needed: VibeVoice-ASR supports 50+ languages and code-switching, so you can avoid pre-routing audio by language. |

Read more about the model and try it for yourself in the Microsoft playground on Hugging Face Spaces.

MiniMax-M2.5

Model Specs
- Parameters / size: ~229B (FP8, Mixture of Experts)
- Primary task: Text generation (agentic coding, tool use, search)

Why it's interesting
- Leading coding benchmark performance: Scores 80.2% on SWE-Bench Verified and 51.3% on Multi-SWE-Bench across 10+ programming languages (Go, C, C++, TypeScript, Rust, Python, Java, and others). In evaluations across different agent harnesses, M2.5 scores 79.7% on Droid and 76.1% on OpenCode, both ahead of Claude Opus 4.6 (78.9% and 75.9% respectively). The model was trained across 200,000+ real-world coding environments covering the full development lifecycle: system design, environment setup, feature iteration, code review, and testing.
- Expert-level search and tool use: M2.5 achieves industry-leading performance on BrowseComp, Wide Search, and Real-world Intelligent Search Evaluation (RISE), laying a solid foundation for autonomously handling complex tasks.
- Professional office work: Achieves a 59.0% average win rate against other mainstream models in financial modeling, Word, and PowerPoint tasks, evaluated via the GDPval-MM framework with pairwise comparison by senior domain professionals (finance, law, social sciences). M2.5 was co-developed with these professionals to incorporate domain-specific tacit knowledge, rather than general instruction-following, into the model's training.

Try it

| Use case | What to build | Best practices |
|---|---|---|
| Agentic software engineering | Multi-file code refactors, CI-gated patch generation, long-running coding agents working across large repositories | Start prompts with a clear architecture or refactor goal. Let the model plan before editing files, keep tool calls sequential, and break large changes into staged tasks to maintain state and coherence across long workflows. |
| Autonomous productivity agents | Research assistants, web-enabled task agents, document and spreadsheet generation workflows | Be explicit about intent and expected output format. Decompose complex objectives into smaller steps (search → synthesize → generate), and leverage the model's long-context handling for multi-step reasoning and document creation. |
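The staged-task best practice above can be made concrete in code. The sketch below shows one way to decompose a refactor into sequential stages while carrying conversation history forward between requests; it uses the common chat-completions message shape, and the model name, roles, and stage wording are illustrative assumptions rather than Foundry-specific requirements:

```python
# Illustrative sketch: stage a large refactor into sequential sub-tasks so an
# agentic model like MiniMax-M2.5 keeps state across a long workflow.
# Message/payload shape follows the common chat-completions convention; the
# stage texts and parameters here are assumptions, not an official schema.

GOAL = "Extract the billing logic from orders.py into a separate billing module."

STAGES = [
    "Plan: list the files to touch and the order of edits.",
    "Edit: apply the changes one file at a time.",
    "Test: run the unit tests and report any failures.",
    "Summarize: describe each change and its relation to the goal.",
]

def build_stage_request(stage_index: int, history: list[dict]) -> dict:
    """Compose one stage's request body, carrying prior turns forward."""
    messages = [{"role": "system", "content": f"Refactor goal: {GOAL}"}]
    messages += history  # earlier stages' turns keep the agent's state coherent
    messages.append({"role": "user", "content": STAGES[stage_index]})
    return {"model": "MiniMax-M2.5", "messages": messages, "temperature": 0.2}

first = build_stage_request(0, history=[])
print(len(first["messages"]))  # → 2 (system prompt + first stage instruction)
```

Each stage's response would be appended to `history` before building the next request, so the model sees its own plan while editing and its own edits while testing.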
With these use cases and best practices in mind, the next step is translating them into a clear, bounded prompt that gives the model a specific goal and the right tools to act. The example below shows how a product or engineering team might frame an automated code review and implementation task so the model can reason through the work step by step and return results that map directly back to the original requirement:

"You're building an automated code review and feature implementation system for a backend engineering team. Deploy MiniMax-M2.5 in Microsoft Foundry with access to your repository's file system tools and test runner. Given a GitHub issue describing a new API endpoint requirement, have the model first write a functional specification decomposing the requirement into sub-tasks, then implement the endpoint across the relevant service files, write unit tests with at least 85% coverage, and return a pull request summary explaining each code change and its relationship to the original requirement. Flag any implementation decisions that deviate from the patterns found in the existing codebase."

Qwen3.5-9B

Model Specs
- Parameters / size: 9B
- Context length: 262,144 tokens natively; extensible to 1,010,000 tokens
- Primary task: Image-text-to-text (multimodal reasoning)

Why it's interesting
- High intelligence density at small sizes: Qwen3.5 Small models show large reasoning gains relative to parameter count, with the 4B and 9B variants outperforming other sub-10B models on public reasoning benchmarks.
- Long-context by default: Support for up to 262K tokens enables long-document analysis, codebase review, and multi-turn workflows without chunking.
- Native multimodal architecture: Vision is built into the model architecture rather than added via adapters, allowing small models (0.8B, 2B) to handle image-text tasks efficiently.
- Open and deployable: Apache-2.0 licensed models designed for local, edge, or cloud deployment scenarios.
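Because long context is the headline feature, a useful habit is a quick pre-flight check that a document actually fits Qwen3.5-9B's native 262K-token window before sending it unchunked. The sketch below uses the rough 4-characters-per-token rule of thumb, which is an assumption for illustration, not the model's real tokenizer:

```python
# Rough pre-flight check that a document fits the model's native context
# window before sending it unchunked. The 4-chars-per-token ratio is a
# common heuristic, not Qwen3.5-9B's actual tokenizer.
NATIVE_CONTEXT_TOKENS = 262_144
CHARS_PER_TOKEN = 4  # heuristic assumption

def fits_in_context(text: str, reserved_for_output: int = 8_192) -> bool:
    """Return True if the estimated prompt tokens plus an output budget fit."""
    estimated_tokens = len(text) // CHARS_PER_TOKEN
    return estimated_tokens + reserved_for_output <= NATIVE_CONTEXT_TOKENS

short_doc = "lorem ipsum " * 1_000    # ~12k chars, roughly 3k tokens
huge_doc = "lorem ipsum " * 200_000   # ~2.4M chars, roughly 600k tokens
print(fits_in_context(short_doc), fits_in_context(huge_doc))  # → True False
```

For documents that fail the check, you would fall back to the extended 1M-token configuration or to a summarize-then-analyze pass rather than silent truncation.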
Benchmarks: AI Model & API Providers Analysis | Artificial Analysis

Try it

| Use case | When to use | Best-practice prompt pattern |
|---|---|---|
| Long-context reasoning | Analyzing full PDFs, long research papers, or large code repositories where chunking would lose context | Set a clear goal and scope. Ask the model to summarize key arguments, surface contradictions, or trace decisions across the entire document before producing an output. |
| Lightweight multimodal document understanding | OCR-driven workflows using screenshots, scanned forms, or mixed image-text inputs | Ground the task in the artifact. Instruct the model to first describe what it sees, then extract structured information, then answer follow-up questions. |

With these best practices in mind, Qwen3.5-9B demonstrates how compact, multimodal models can handle complex reasoning tasks without chunking or manual orchestration. The prompt below shows how an operations analyst might use the model to analyze a full report end-to-end:

"You are assisting an operations analyst. Review the attached PDF report and extracted tables. Identify the three largest cost drivers, explain how they changed quarter-over-quarter, and flag any anomalies that would require follow-up. If information is missing, state what data would be needed."

Getting started

You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub: select any supported model and choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using the Microsoft Foundry documentation.
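Once a model is deployed to a managed endpoint, calling it is a plain HTTPS request. A minimal standard-library sketch follows, assuming an OpenAI-compatible chat-completions route with key-based auth; the URL, header name, and payload shape below are placeholder assumptions, so check your deployment's details page for the real values:

```python
import json
import urllib.request

# Sketch of calling a Foundry managed endpoint after one-click deployment.
# ENDPOINT and API_KEY are placeholders; the route and "api-key" header are
# assumptions based on the common OpenAI-compatible pattern.
ENDPOINT = "https://<your-resource>.services.ai.azure.com/models/chat/completions"
API_KEY = "<your-key>"

def build_request(prompt: str, model: str = "Qwen3.5-9B") -> urllib.request.Request:
    """Build (but do not send) a POST request for one chat turn."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )

req = build_request("Summarize the attached report.")
print(req.get_method())  # → POST
# To actually call the endpoint: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes the payload easy to unit-test before any credentials or network access are involved.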
- Follow along with the Model Mondays series and its GitHub repository to stay up to date on the latest
- Read the Hugging Face on Azure docs
- Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry
- Explore models in Microsoft Foundry

Now in Foundry: Qwen3-Coder-Next, Qwen3-ASR-1.7B, Z-Image
This week's spotlight features three models that demonstrate enterprise-grade AI across the full scope of modalities. From low-latency coding agents to state-of-the-art multilingual speech recognition and foundation-quality image generation, these models showcase the breadth of innovation happening in open-source AI. Each model balances performance with practical deployment considerations, making them viable for production systems while pushing the boundaries of what's possible in their respective domains. This week's Model Mondays edition highlights Qwen3-Coder-Next, an 80B MoE model that activates only 3B parameters while delivering coding agent capabilities with 256K context; Qwen3-ASR-1.7B, which achieves state-of-the-art accuracy across 52 languages and dialects; and Z-Image from Tongyi-MAI, an undistilled text-to-image foundation model with full Classifier-Free Guidance support for professional creative workflows.

Models of the week

Qwen: Qwen3-Coder-Next

Model Specs
- Parameters / size: 80B total (3B activated)
- Context length: 262,144 tokens
- Primary task: Text generation (coding agents, tool use)

Why it's interesting
- Extreme efficiency: Activates only 3B of 80B parameters while delivering performance comparable to models with 10-20x more active parameters, making advanced coding agents viable for local deployment on consumer hardware
- Built for agentic workflows: Excels at long-horizon reasoning, complex tool usage, and recovering from execution failures, critical capabilities for autonomous development that go beyond simple code completion
- Benchmarks: Competitive performance with significantly larger models on SWE-bench and coding benchmarks (Technical Report)

Try it

| Use case | Prompt pattern |
|---|---|
| Code generation with tool use | Provide task context, available tools, and execution environment details |
| Long-context refactoring | Include full codebase context within the 256K window with specific refactoring goals |
| Autonomous debugging | Present error logs, stack traces, and relevant code with failure recovery instructions |
| Multi-file code synthesis | Describe architecture requirements and file structure expectations |

Financial services sample prompt: "You are a coding agent for a fintech platform. Implement a transaction reconciliation service that processes batches of transactions, detects discrepancies between internal records and bank statements, and generates audit reports. Use the provided database connection tool, logging utility, and alert system. Handle edge cases including partial matches, timing differences, and duplicate transactions. Include unit tests with 90%+ coverage."

Qwen: Qwen3-ASR-1.7B

Model Specs
- Parameters / size: 1.7B
- Context length: 256 tokens (default), configurable up to 4096
- Primary task: Automatic speech recognition (multilingual)

Why it's interesting
- All-in-one multilingual capability: A single 1.7B model handles language identification plus speech recognition for 30 languages, 22 Chinese dialects, and English accents from multiple regions, eliminating the need to manage separate models per language
- Specialized audio versatility: Transcribes not just clean speech but singing voice, songs with background music, and extended audio files, expanding use cases beyond traditional ASR to entertainment and media workflows
- State-of-the-art accuracy: Outperforms GPT-4o, Gemini-2.5, and Whisper-large-v3 across multiple benchmarks.
English: TED-LIUM 4.50 WER vs 7.69/6.15/6.84; Chinese: WenetSpeech 4.97/5.88 WER vs 15.30/14.43/9.86 (Technical Paper)
- Language ID included: 97.9% average accuracy across benchmark datasets for automatic language identification, eliminating the need for separate language detection pipelines

Try it

| Use case | Prompt pattern |
|---|---|
| Multilingual transcription | Send audio files via API with automatic language detection |
| Call center analytics | Process customer service recordings to extract transcripts and identify languages |
| Content moderation | Transcribe user-generated audio content across multiple languages |
| Meeting transcription | Convert multilingual meeting recordings to text for documentation |

Customer support sample prompt: "Deploy Qwen3-ASR-1.7B to a Microsoft Foundry endpoint and transcribe multilingual customer service calls. Send audio files via API to automatically detect the language (from 52 supported options including 30 languages and 22 Chinese dialects) and generate accurate transcripts. Process calls from customers speaking English, Spanish, Mandarin, Cantonese, Arabic, French, and other languages without managing separate models per language. Use transcripts for quality assurance, compliance monitoring, and customer sentiment analysis."
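The downstream analytics step in that workflow, once transcripts and detected languages come back from the endpoint, can start as simply as grouping calls by language for routing and QA sampling. The sketch below uses illustrative field names (`language`, `text`), not the endpoint's actual response schema:

```python
# Sketch of the call-center analytics step: given one ASR result per call
# (transcript text plus a detected language code), bucket calls by language.
# The "language"/"text" keys are illustrative, not a documented schema.
from collections import defaultdict

def group_by_language(asr_results: list[dict]) -> dict[str, list[str]]:
    buckets: dict[str, list[str]] = defaultdict(list)
    for result in asr_results:
        buckets[result["language"]].append(result["text"])
    return dict(buckets)

calls = [
    {"language": "en", "text": "I need help with my invoice."},
    {"language": "es", "text": "Quisiera cambiar mi plan."},
    {"language": "en", "text": "The app keeps logging me out."},
]
grouped = group_by_language(calls)
print({lang: len(texts) for lang, texts in grouped.items()})  # → {'en': 2, 'es': 1}
```

From these buckets, per-language QA reviewers or sentiment models can be assigned without any per-language routing before transcription, since language identification happens inside the ASR model itself.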
Tongyi-MAI: Z-Image

Model Specs
- Parameters / size: 6B
- Context length: N/A (text-to-image)
- Primary task: Text-to-image generation

Why it's interesting
- Undistilled foundation model: A full-capacity base without distillation preserves the complete training signal, with Classifier-Free Guidance support (a technique that improves prompt adherence and output quality) enabling complex prompt engineering and negative prompting that distilled models cannot achieve
- High output diversity: Generates distinct character identities in multi-person scenes with varied compositions, facial features, and lighting, which is critical for creative applications requiring visual variety rather than consistency
- Aesthetic versatility: Handles diverse visual styles from hyper-realistic photography to anime and stylized illustrations within a single model, supporting resolutions from 512×512 to 2048×2048 at any aspect ratio with 28-50 inference steps (Technical Paper)

Try it

E-commerce sample prompt: "Professional product photography of a modern ergonomic office chair in a bright Scandinavian-style home office. Natural window lighting from left, clean white desk with laptop and succulent plant, light oak hardwood floor. Chair positioned at 45-degree angle showing design details. Photorealistic, commercial photography, sharp focus, 85mm lens, f/2.8, soft shadows."

Getting started

You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub.
First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using the Microsoft Foundry documentation.

- Follow along with the Model Mondays series and its GitHub repository to stay up to date on the latest
- Read the Hugging Face on Azure docs
- Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry
- Explore models in Microsoft Foundry

Introducing Azure Cognitive Service for Language
Azure Cognitive Services has historically had three distinct NLP services that solve different but related customer problems: Text Analytics, LUIS, and QnA Maker. As these services have matured and customers now depend on them for business-critical workloads, we wanted to take a step back and evaluate the most effective path forward over the next several years for delivering our roadmap of a world-class, state-of-the-art NLP platform-as-a-service.

Each service was initially focused on a set of capabilities supporting distinct customer scenarios: LUIS for custom language models, most often supporting bots; Text Analytics for general-purpose pre-built language services; and QnA Maker for knowledge-based question answering. As AI accuracy has improved, the cost of offering more sophisticated models has decreased, and customers have increased their adoption of NLP for business workloads, we are seeing more and more overlapping scenarios where the lines between the three distinct services are blurred. As such, the most effective path forward is a single unified NLP service in Azure Cognitive Services.

Today we are pleased to announce the availability of Azure Cognitive Service for Language. It unifies the capabilities of Text Analytics, LUIS, and the legacy QnA Maker service into a single service. The key benefits include:

- Easier to discover and adopt features.
- Seamlessness between pre-built and custom-trained NLP.
- Easier to build NLP capabilities once and reuse them across application scenarios.
- Access to multilingual state-of-the-art NLP models.
- Simpler to get started through consistency in APIs, documentation, and samples across all features.
- More billing predictability.

The unified Language Service will not affect any existing applications. All existing capabilities of the three services will continue to be supported until we announce a deprecation timeline for the existing services (which would be no less than one year).
However, new features and innovation will happen only on the unified Language service. For example, question answering and conversational language understanding (CLU) are only available in the unified service (more details on these features later). As such, customers are encouraged to start making plans to leverage the unified service. More details on migration, including links to resources, are provided below. Here is everything we are announcing today in the unified Language service:

Text Analytics is now Language Service: All existing features of Text Analytics are included in the Language Service. Specifically, Sentiment Analysis and Opinion Mining, Named Entity Recognition (NER), Entity Linking, Key Phrase Extraction, Language Detection, Text Analytics for health, and Text Summarization are all part of the Language Service as they exist today. Text Analytics customers don't need to do any migrations or updates to their in-production or in-development apps; the unified service is backward compatible with all existing Text Analytics features. The key difference is that when creating a new resource in the Azure portal UI, you will now see the resource labeled "Language" rather than "Text Analytics".

Introducing conversational language understanding (preview), the next generation of LUIS: Language Understanding (LUIS) has been one of our fastest-growing Cognitive Services, with customers deploying custom language models to production for scenarios ranging from command-and-control IoT devices and chat bots to contact center agent-assist scenarios. The next phase in the evolution of LUIS is conversational language understanding (CLU), which we are announcing today as a preview feature of the new Language Service. CLU introduces multilingual transformer-based models as the underlying model architecture, resulting in significant accuracy improvements over LUIS.
Also new as part of CLU is the ability to create orchestration projects, which allow you to configure a project to route to multiple customizable language services, like question answering knowledge bases, other CLU projects, and even classic LUIS applications. Visit here to learn more. If you are an existing LUIS customer, we are not requiring you to migrate your application to CLU today. However, as CLU represents the evolution of LUIS, we encourage you to start experimenting with CLU in preview and give us feedback on your experience. You can import a LUIS JSON application directly into CLU to get started.

GA of question answering: In May 2021, we launched the preview of custom question answering. Today we are announcing the General Availability (GA) of question answering as part of the new Language Service. If you are just getting started with building knowledge bases that are queryable with natural language, visit here to get started. If you want to know more about migrating legacy QnA Maker knowledge bases to the Language Service, see here. Your existing QnA Maker knowledge bases will continue to work; we are not requiring customers to migrate from QnA Maker at this time. However, question answering represents the evolution of QnA Maker, and new features will only be developed for the unified service. As such, we encourage you to plan a migration from legacy QnA Maker if this applies to you.

Introducing custom named entity recognition (preview): Documents include an abundance of valuable information, and enterprises rely on extracting that information to easily filter and search through those documents. Using the standard Text Analytics NER, they could extract known types like person names, geographical locations, datetimes, and organizations. However, much information of interest is more specific than the standard types. To unlock these scenarios, we're happy to announce custom NER as a preview capability of the new Language Service.
The capability allows you to build your own custom entity extractors by providing labelled examples of text to train models. Securely upload your data in your own storage accounts, label your data in the Language studio, then deploy and query the custom models to obtain entity predictions on new text. Visit here to learn more.

Introducing custom text classification (preview): While many pieces of information can exist in any given document, the whole piece of text can belong to one or more categories, and organizing and categorizing documents is key for data-reliant enterprises. We're excited to announce custom text classification, a preview feature under the Language service, where you can create custom classification models with your defined classes. Securely upload your data in your own storage accounts and label your data in the Language studio. Choose between single-label classification, where you label and predict one class for every document, or multi-label classification, which allows you to assign or predict several classes per document. This enables automated handling of incoming text such as support tickets, customer email complaints, or organizational reports. Visit here to learn more.

Language studio: This is the single destination for experimentation, evaluation, and training of Language AI / NLP in Cognitive Services. With the Language studio you can try any of our capabilities with a few button clicks. For example, you can upload medical documents and instantly get back all the extracted entities and relations, and you can easily integrate the API into your solution using the Language SDK. You can take it further by training your own custom NER model and deploying it through the easy-to-use interface. Try it out now yourself here.

Several customers are already using Azure Cognitive Service for Language to transform their businesses.
Here's what two of them had to say:

"We used Azure Cognitive Services and Bot Service to deliver an instantly responsive, personal expert into our customers' pockets. Providing this constant access to help is key to our customer care strategy." -Paul Jacobs, Group Head of Operations Transformation, Vodafone

"Sellers might have up to 100,000 documents associated with a deal, so the time savings can be absolutely massive. Now that we've added Azure Cognitive Service for Language to our tool, customers can potentially compress weeks of work into days." -Thomas Fredell, Chief Product Officer, Datasite

To learn more directly from customers, see the following customer stories:

- Vodafone transforms its customer care strategy with digital assistant built on Azure Cognitive Services
- Progressive Insurance levels up its chatbot journey and boosts customer experience with Azure AI
- Kepro improves healthcare outcomes with fast and accurate insights from Text Analytics for health

On behalf of the entire Cognitive Services Language team at Microsoft, we can't wait to see how Azure Cognitive Service for Language benefits your business!

Learn about Bot Framework Composer's new authoring experience and deploy your bot to a telephone
The new telephony channel, combined with our Bot Framework developer platform, makes it easy to rapidly build always-available virtual assistants or IVR assistants that provide natural-language, intent-based call handling and can manage advanced conversation flows, such as context switching and responding to follow-up questions, while still meeting the goal of reducing operational costs for enterprises.