model catalog
72 TopicsNVIDIA NIM for NVIDIA Nemotron, Cosmos, & Microsoft Trellis: Now Available in Azure AI Foundry
We’re excited to announce 7 new powerful NVIDIA NIM™ additions to Azure AI Foundry Models now on Managed Compute. The latest wave of models—NVIDIA Nemotron Nano 9B v2, Llama 3.1 Nemotron Nano VL 8B, Llama 3.3 Nemotron Super 49B v1.5 (coming soon), Cosmos Reason1-7B, Cosmos Predict 2.5 (coming soon), Cosmos Transfer 2.5. (coming soon), and Microsoft Trellis—marks a significant leap forward in intelligent application development. Collectively, these models redefine what’s possible in advanced instruction-following, vision-language understanding, and efficient language modeling, empowering developers to build multimodal, visually rich, and context-aware solutions. By combining robust reasoning, flexible input handling, and enterprise-grade deployment options, these additions accelerate innovation across industries—from robotics and autonomous vehicles to immersive retail and digital twins—enabling smarter, safer, and more adaptive experiences at scale. Meet the Models Model Name Size Primary Use Cases NVIDIA Nemotron Nano 9B v2 Available Now 9B parameters Multilingual Reasoning: Multilingual and code-based reasoning tasks Enterprise Agents: AI and productivity agents Math/Science: Scientific reasoning, advanced math Coding: Software engineering and tool calling Llama 3.3 Nemotron Super 49B v1.5 Available Now 49B Enterprise Agents: AI and productivity agents Math/Science: Scientific reasoning, advanced math Coding: Software engineering and tool calling Llama 3.1 Nemotron Nano VL 8B Available Now 8B Multimodal: Multimodal vision-language tasks, document intelligence and understanding Edge Agents: Mobile and edge AI agents Cosmos Reason1-7B Available Now 7B Robotics: Planning and executing tasks with physical constraints. Autonomous Vehicles: Understanding environments and making decisions. Video Analytics Agents: Extracting insights and performing root-cause analysis from video data. Cosmos Predict 2.5 Coming Soon 2B Generalist Model: World state generation and prediction Cosmos Transfer 2.5 Coming Soon 2B Structural Conditioning: Physical AI Microsoft TRELLIS by Microsoft Research Available Now - Digital Twins: Generate accurate 3D assets from simple prompts Immersive Retail experiences: photorealistic product models for AR, virtual try-ons Game and simulation development: Turn creative ideas into production-ready 3D content Meet the NVIDIA Nemotron Family NVIDIA Nemotron Nano 9B v2: Compact power for high-performance reasoning and agentic tasks NVIDIA Nemotron Nano 9B v2 is a high-efficiency large language model built with a hybrid Mamba-Transformer architecture, designed to excel in both reasoning and non-reasoning tasks. Efficient architecture for high-performance reasoning: Combines Mamba-2 and Transformer components to deliver strong reasoning capabilities with higher throughput. Extensive multilingual and code capabilities: Trained on diverse language and programming data, it performs exceptionally well across tasks involving natural language (English, German, French, Italian, Spanish and Japanese), code generation, and complex problem solving. Reasoning Budget Control: Supports runtime “thinking” budget control. During inference, the user can specify how many tokens the model is allowed to "think" for helping balance speed, cost, and accuracy during inference. For example, a user can tell the model to think for “1K tokens or 3K tokens, etc ” for different use cases with far better cost predictability. Fig 1. provided by NVIDIA Nemotron Nano 9B v2 is built from the ground up with training data spanning 15 languages and 43 programming languages, giving it broad multilingual and coding fluency. Its capabilities were sharpened through advanced post-training techniques like GRPO and DPO enabling it to reason deeply, follow instructions precisely, and adapt dynamically to different tasks. -> Explore the model card on Azure AI Foundry Llama 3.3 Nemotron Super 49B v1.5: High-throughput reasoning at scale Llama 3.3 Nemotron Super 49Bv1.5 (coming soon) is a significantly upgraded version of Llama-3.3-Nemotron-Super-49B-v1 and is a large language model which is a derivative of Meta Llama-3.3-70B-Instruct (the reference model) optimized for advanced reasoning, instruction following, and tool use across a wide range of tasks. Excels in applications such as chatbots, AI agents, and retrieval-augmented generation (RAG) systems Balances accuracy and compute efficiency for enterprise-scale workloads Designed to run efficiently on a single NVIDIA H100 GPU, making it practical for real-world applications Llama-3.3-Nemotron-Super-49B-v1.5 was trained through a multi-phase process combining human expertise, synthetic data, and advanced reinforcement learning techniques to refine its reasoning and instruction-following abilities. Its impressive performance across benchmarks like MATH500 (97.4%) and AIME 2024 (87.5%) highlights its strength in tackling complex tasks with precision and depth. Llama 3.1 Nemotron Nano VL 8B: Multimodal intelligence for edge deployments Llama 3.1 Nemotron Nano VL 8B is a compact vision-language model that excels in tasks such as report generation, Q&A, visual understand, and document intelligence. This model delivers low latency and high efficiency, reducing TCO. This model was trained on a diverse mix of human-annotated and synthetic data, enabling robust performance across multimodal tasks such as document understanding and visual question answering. It achieved strong results on evaluation benchmarks including DocVQA (91.2%), ChartQA (86.3%), AI2D (84.8%), and OCRBenchV2 English (60.1%). -> Explore the model card on Azure AI Foundry What Sets Nemotron Apart NVIDIA Nemotron is a family of open models, datasets, recipes, and tools. 1. Open-source AI technologies: Open models, data, and recipes offer transparency, allowing developers to create trustworthy custom AI for their specific needs, from creating new agents to refining existing applications. Open Weights: NVIDIA Open Model License offers enterprises data control and flexible deployment. Open Data: Models are trained with transparent, permissively-licensed NVIDIA data, available on Hugging Face, ensuring confidence in use. Additionally, it allows developers to train their high-accuracy custom models with these open datasets. Open Recipe: NVIDIA shares development techniques, like NAS, hybrid architecture, Minitron, as well as NeMo tools enabling customization or creation of custom models. 2. Highest Accuracy & Efficiency: Engineered for efficiency, Nemotron delivers industry leading accuracy in the least amount of time for reasoning, vision, and agentic tasks. 3. Run Anywhere On Cloud: Packaged as NVIDIA NIM, for secure and reliable deployment of high-performance AI model inferencing across Azure platforms. Meet the Cosmos Family NVIDIA Cosmos™ is a world foundation model (WFM) development platform to advance physical AI. At its core are Cosmos WFMs, openly available pretrained multimodal models that developers can use out-of-the-box for generating world states as videos and physical AI reasoning, or post-train to develop specialized physical AI models. Cosmos Reason1-7B: Physical AI Cosmos Reason1-7B combines chain-of-thought reasoning, flexible input handling for images and video, a compact 7B parameter architecture, and advanced physical world understanding making it ideal for real-time robotics, video analytics, and AI agents that require contextual, step-by-step decision-making in complex environments. This model transforms how AI and robotics interact with the real world giving your systems the power to not just see and describe, but truly understand, reason, and make decisions in complex environments like factories, cities, and autonomous vehicles. With its ability to analyze video, plan robot actions, and verify safety protocols, Cosmos Reason1-7B helps developers build smarter, safer, and more adaptive solutions for real-world challenges. Cosmos Reason1-7B is physical AI for 4 embodiments: Fig.2 Physical AI Model Strengths Physical World Reasoning: Leverages prior knowledge, physics laws, and common sense to understand complex scenarios. Chain-of-Thought (CoT) Reasoning: Delivers contextual, step-by-step analysis for robust decision-making. Flexible Input: Handles images, video (up to 30 seconds, 1080p), and text with a 16k context window. Compact & Deployable: 7B parameters runs efficiently from edge devices to the cloud. Production-Ready: Available via Hugging Face, GitHub, and NVIDIA NIM; integrates with industry-standard APIs. Enterprise Use Cases Cosmos Reason1-7B is more than a model, it’s a catalyst for building intelligent, adaptive solutions that help enterprises shape a safer, more efficient, and truly connected physical world. Fig.3 Use Cases Reimagine safety and efficiency by empowering AI agents to analyze millions of live streams and recorded videos, instantly verifying protocols and detecting risks in factories, cities, and industrial sites. Accelerate robotics innovation with advanced reasoning and planning, enabling robots to understand their environment, make methodical decisions, and perform complex tasks—from autonomous vehicles navigating busy streets to household robots assisting with daily chores. Transform data curation and annotation by automating the selection, labeling, and critiquing of massive, diverse datasets, fueling the next generation of AI with high-quality training data. Unlock smarter video analytics with chain-of-thought reasoning, allowing systems to summarize events, verify actions, and deliver actionable insights for security, compliance, and operational excellence. -> Explore the model card on Azure AI Foundry Also coming soon to Azure AI Foundry are two models of the Cosmos WFM, designed for world generation and data augmentation. Cosmos Predict 2.5 2B Cosmos Predict 2.5 is a next-generation world foundation model that generates realistic, controllable video worlds from text, images, or videos—all through a unified architecture. Trained on 200M+ high-quality clips and enhanced with reinforcement learning, it delivers stronger physics and prompt alignment while cutting compute cost and post-training time for faster Physical AI workflows. Cosmos Transfer 2.5 2B While Predict 2.5 generates worlds, Transfer 2.5 that transforms structured simulation inputs—like segmentation, depth, or LiDAR maps—into photorealistic synthetic data for Physical AI training and development. What Sets Cosmos Apart Built for Physical AI — Purpose-built for robotics, autonomous systems, and embodied agents that understand physics, motion, and spatial environments. Multimodal World Modeling — Combines images, video, depth, segmentation, LiDAR, and trajectories to create physics-aware, controllable world simulations. Scalable Synthetic Data Generation — Generates diverse, photorealistic data at scale using structured simulation inputs for faster Sim2Real training and adaptation. Microsoft Trellis by Microsoft Research: Enterprise-ready 3D Generation Microsoft Trellis by Microsoft Research is a cutting-edge 3D asset generation model developed by Microsoft Research, designed to create high-quality, versatile 3D assets, complete with shapes and textures, from text or image prompts. Seamlessly integrated within the NVIDIA NIM microservice, Trellis accelerates asset generation and empowers creators with flexible, production-ready outputs. Quickly generate high-fidelity 3D models from simple text or image prompts perfect for industries like manufacturing, energy, and smart infrastructure looking to accelerate digital twin creation, predictive maintenance, and immersive training environments. From virtual try-ons in retail to production-ready assets in media, TRELLIS empowers teams to create stunning 3D content at scale, cutting down production time and unlocking new levels of interactivity and personalization. -> Explore the model card on Azure AI Foundry Pricing The pricing breakdown consists of the Azure Compute charges plus a flat fee per GPU for the NVIDIA AI Enterprise license that is required to use the NIM software. Pay-as-you-go (per gpu hour) NIM Surcharge: $1 per gpu hour Azure Compute charges also apply based on deployment configuration Why use Managed Compute? Managed Compute is a deployment option within Azure AI Foundry Models that lets you run large language models (LLMs), SLMs, HuggingFace models and custom models fully hosted on Azure infrastructure. Azure Managed Compute is a powerful deployment option for models not available via standard (pay-go) endpoints. It gives you: Custom model support: Deploy open-source or third-party models Infrastructure flexibility: Choose your own GPU SKUs (NVIDIA A10, A100, H100) Detailed control: Configure inference servers, protocols, and advanced settings Full integration: Works with Azure ML SDK, CLI, Prompt Flow, and REST APIs Enterprise-ready: Supports VNet, private endpoints, quotas, and scaling policies NVIDIA NIM Microservices on Azure These models are available as NVIDIA NIM™ microservices on Azure AI Foundry. NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing. NIM microservices are pre-built, containerized AI endpoints that simplify deployment and scale across environments. They allow developers to run models securely and efficiently in the cloud environment. If you're ready to build smarter, more capable AI agents, start exploring Azure AI Foundry. Build Trustworthy AI Solutions Azure AI Foundry delivers managed compute designed for enterprise-grade security, privacy, and governance. Every deployment of NIM microservices through Azure AI Foundry is backed by Microsoft’s Responsible AI principles and Secure Future Initiative ensuring fairness, reliability, and transparency so organizations can confidently build and scale agentic AI workflows. How to Get Started in Azure AI Foundry Explore Azure AI Foundry: Begin by accessing the Azure AI Foundry portal and then following the steps below. Navigate to ai.azure.com. Select on top left existing project that is (Hub) resource provider. If you do not have a HUB Project, create new Hub Project using “+ Create New” link. Choose AI Hub Resource: Deploy with NIM Microservices: Use NVIDIA’s optimized containers for secure, scalable deployment. Select Model Catalog from the left sidebar menu: In the "Collections" filter, select NVIDIA to see all the NIM microservices that are available on Azure AI Foundry. Select the NIM you want to use. Click Deploy. Choose the deployment name and virtual machine (VM) type that you would like to use for your deployment. VM SKUs that are supported for the selected NIM and also specified within the model card will be preselected. Note that this step requires having sufficient quota available in your Azure subscription for the selected VM type. If needed, follow the instructions to request a service quota increase. Use this NVIDIA NeMo Agent Toolkit: designed to orchestrate, monitor, and optimize collaborative AI agents. Note about the License Users are responsible for compliance with the terms of NVIDIA AI Product Agreement . Learn More How to Deploy NVIDIA NIM Docs Learn More about Accelerating agentic workflows with Azure AI Foundry, NVIDIA NIM, and NVIDIA NeMo Agent Toolkit Register for Microsoft Ignite 20251.2KViews1like0CommentsIntroducing GPT-5.4 in Microsoft Foundry
Today, we’re thrilled to announce that OpenAI’s GPT‑5.4 is now generally available in Microsoft Foundry: a model designed to help organizations move from planning work to reliably completing it in production environments. As AI agents are applied to longer, more complex workflows; consistency and follow‑through become as important as raw intelligence. GPT‑5.4 combines stronger reasoning with built in computer use capabilities to support automation scenarios, and dependable execution across tools, files, and multi‑step workflows at scale. GPT-5.4: Enhanced Reliability in Production AI GPT-5.4 is designed for organizations operating AI in real production environments, where consistency, instruction adherence, and sustained context are critical to success. The model brings together advances in reasoning, coding, and agentic workflows to help AI systems not only plan tasks but complete them with fewer interruptions and reduced manual oversight. Compared with earlier generations, GPT-5.4 emphasizes stability across longer interactions, enabling teams to deploy agentic AI with greater confidence in day-to-day production use. GPT-5.4 introduces advancements that aim for production grade AI: More consistent reasoning over time, helping maintain intent across multi‑turn and multi‑step interactions Enhanced instruction alignment to reduce prompt tuning and oversight Latency improved performance for responsive, real-time workflows Integrated computer use capabilities for structured orchestration of tools, file access, data extraction, guarded code execution, and agent handoffs More dependable tool invocation reducing prompt tuning and human oversight Higher‑quality generated artifacts, including documents, spreadsheets, and presentations with more consistent structure Together, these improvements support AI systems that behave more predictably as tasks grow in length and complexity. From capability to real-world outcomes GPT‑5.4 delivers practical value across a wide range of production scenarios where follow‑through and reliability are essential: Agent‑driven workflows, such as customer support, research assistance, and business process automation Enterprise knowledge work, including document drafting, data analysis, and presentation‑ready outputs Developer workflows, spanning code generation, refactoring, debugging support, and UI scaffolding Extended reasoning tasks, where logical consistency must be preserved across longer interactions Teams benefit from reduced task drift, fewer mid‑workflow failures, and more predictable outcomes when deploying GPT‑5.4 in production. GPT-5.4 Pro: Deeper analysis for complex decision workflows GPT‑5.4 Pro, a premium variant designed for scenarios where analytical depth and completeness are prioritized over latency. Additional capabilities include: Multi‑path reasoning evaluation, allowing alternative approaches to be explored before selecting a final response Greater analytical depth, supporting problems with trade‑offs or multiple valid solutions Improved stability across long reasoning chains, especially in sustained analytical tasks Enhanced decision support, where rigor and thoroughness outweigh speed considerations Organizations typically select GPT‑5.4 Pro when deeper analysis is required such as scientific research and complex problems, while GPT‑5.4 remains the right choice for workloads that prioritize reliable execution and agentic follow‑through. Microsoft Foundry: Enterprise‑Grade Control from Day One GPT‑5.4 and GPT‑5.4 Pro are available through Microsoft Foundry, which provides the operational controls organizations need to deploy AI responsibly in production environments. Foundry supports policy enforcement, monitoring, version management, and auditability, helping teams manage AI systems throughout their lifecycle. By deploying GPT‑5.4 through Microsoft Foundry, organizations can integrate advanced agentic capabilities into existing environments while aligning with security, compliance, and operational requirements from day one. Customer Spotlight Get Started with GPT-5.4 in Microsoft Foundry GPT‑5.4 sets a new bar for production‑ready AI by combining stronger reasoning with dependable execution. Through enterprise‑grade deployment in Microsoft Foundry, organizations can move beyond experimentation and confidently build AI systems that complete complex work at scale. Computer use capabilities will be introduced shortly after launch. GPT‑5.4 in Microsoft Foundry is priced at $2.50 per million input tokens, $0.25 per million cached input tokens, and $15.00 per million output tokens. It is available at launch in Standard Global and Standard Data Zone (US), with additional deployment options coming soon. GPT‑5.4 Pro is priced at $30.00 per million input tokens, and $180.00 per million output tokens, and is available at launch in Standard Global. Build agents for real-world workloads. Start building with GPT‑5.4 in Microsoft Foundry today.11KViews2likes2CommentsIntroducing GPT-5.3 Chat in Microsoft Foundry: A more grounded way to chat at enterprise scale
OpenAI’s GPT‑5.3 Chat marks the next step in the GPT‑5 series, designed to deliver more dependable, context‑aware chat experiences for enterprise workloads. The model emphasizes steadier instruction handling and clearer responses, supporting high‑volume, real‑world conversations with greater consistency. GPT‑5.3 Chat is now available via API in Microsoft Foundry, where teams will be able to deploy production‑ready chat and agent experiences that are standardized, governed, and built to scale across the enterprise. What’s new in GPT‑5.3 Chat GPT‑5.3 Chat centers on predictable behavior, relevance, and response quality, helping teams build chat experiences that operate reliably across end‑to‑end workflows while aligning with enterprise safety and compliance expectations. Fewer dead ends, more resolved conversations Reduces unnecessary refusals by responding more proportionately when safe context is available Supports compliant reformulation to keep interactions moving forward Enables end‑to‑end resolution in support, IT, and policy‑driven workflows Grounded answers you can operationalize Combines built‑in web search with model reasoning to surface relevant, actionable information Prioritizes relevance and context over long lists of loosely related results Keeps responses actionable while maintaining enterprise controls and traceability Consistent outputs at scale Improved tone, explanation quality, and instruction following Easier to template, govern, and monitor across apps Less downstream cleanup as usage scales Built for production in Microsoft Foundry Production‑grade infrastructure Observability, failover, quota management, and performance monitoring Designed for real workloads—not experiments Consistent behavior across regions and use cases without re‑architecting Smarter scaling with quota tiers Automatic quota increases with sustained usage Fewer rate‑limit interruptions as demand grows Flexible tiers from Free through Tier 6 Security and compliance by default Identity, access controls, policy enforcement, and data boundaries built in Meets regulated‑industry requirements out of the box Teams can move fast without compromising trust GPT-5.3 Chat in Microsoft Foundry is priced at $1.75 per million input tokens, $0.175 per million cached input tokens, and $14.00 per million output tokens. Ready to build with GPT‑5.3 Chat in Foundry? Start turning reliable conversations into real applications. Explore GPT-5.3 Chat in Microsoft Foundry and begin building production ready‑ chat and agent experiences today.3.9KViews1like1CommentIntroducing Phi-4-Reasoning-Vision to Microsoft Foundry
Vision reasoning models unlock a critical capability for developers: the ability to move beyond passive perception toward systems that can understand, reason over, and act on visual information. Instead of treating images, diagrams, documents, or UI screens as unstructured inputs, vision reasoning models enable developers to build applications that can interpret visual structure, connect it with textual context, and perform multi-step reasoning to reach actionable conclusions. Today, we are excited to announce Phi-4-Reasoning-Vision-15B is available in Microsoft Foundry and Hugging Face. This model brings high‑fidelity vision to the reasoning‑focused Phi‑4 family, extending small language models (SLMs) beyond perception into structured, multi‑step visual reasoning for agents, analytical tools, and scientific workflows. What’s new? The Phi model family has advanced toward combining efficient visual understanding with strong reasoning in small language models. Earlier Phi‑4 models demonstrated reliable perception and grounding across images and text, while later iterations introduced structured reasoning to improve performance on complex tasks. Phi‑4‑reasoning-vision-15B brings these threads together, pairing high‑resolution visual perception with selective, task‑aware reasoning. As a result, the model can reason deeply when needed while remaining fast and efficient for perception‑focused scenarios—making it well suited for interactive, real‑world applications. Key capabilities Reasoning behavior is explicitly enabled via prompting: Developers can explicitly enable or disable reasoning to balance latency and accuracy at runtime. Optimized for vision reasoning and can be used for: diagram-based math, document, chart, and table understanding, GUI interpretations and grounding for agent scenarios to interpret screens and actions, Computer-use agent scenarios, and General image chat and answering questions Benchmarks The following results summarize Phi-4-reasoning-vision-15B performance across a set of established multimodal reasoning, mathematics, and computer use benchmarks. The following benchmarks are the result of internal evaluations. Benchmark Phi-4-reasoning-vision-15B Phi-4-reasoning-vision-15B – force no think Phi-4-mm-instruct Kimi-VL-A3B-Instruct gemma-3-12b-it Qwen3-VL-8B-Instruct-4K Qwen3-VL-8B-Instruct-32K Qwen3-VL-32B-Instruct-4K Qwen3-VL-32B-Instruct-32K AI2D _TEST 84.8 84.7 68.6 84.6 80.4 82.7 83 84.8 85 ChartQA _TEST 83.3 76.5 23.5 87 39 83.1 83.2 84.3 84 HallusionBench 64.4 63.1 56 65.2 65.3 73.5 74.1 74.4 74.9 MathVerse _MINI 44.9 43.8 32.4 41.7 29.8 54.5 57.4 64.2 64.2 MathVision _MINI 36.2 34.2 20 28.3 31.9 45.7 50 54.3 60.5 MathVista _MINI 75.2 68.7 50.5 67.1 57.4 77.1 76.4 82.5 81.8 MMMU _VAL 54.3 52 42.3 52 50 60.7 64.6 68.6 70.6 MMStar 64.5 63.3 45.9 60 59.4 68.9 69.9 73.7 74.3 OCRBench 76 75.6 62.6 86.5 75.3 89.2 90 88.5 88.5 ScreenSpot _v2 88.2 88.3 28.5 89.8 3.5 91.5 91.5 93.7 93.9 Table 1: Accuracy comparisons relative to popular open-weight, non-thinking models Benchmark Phi-4-reasoning-vision-15B Phi-4-reasoning-vision-15B - force thinking Kimi-VL-A3B-Thinking gemma-3-12b-it Qwen3-VL-8B-Thinking-4K Qwen3-VL-8B-Thinking-40K Qwen3-VL-32B-Thiking-4K Qwen3-VL-32B-Thinking-40K AI2D_TEST 84.8 79.7 81.2 80.4 83.5 83.9 86.9 87.2 ChartQA _TEST 83.3 82.9 73.3 39 78 78.6 78.5 79.1 HallusionBench 64.4 63.9 70.6 65.3 71.6 73 76.4 76.6 MathVerse _MINI 44.9 53.1 61 29.8 67.3 73.3 78.3 78.2 MathVision _MINI 36.2 36.2 50.3 31.9 43.1 50.7 60.9 58.6 MathVista _MINI 75.2 74.1 78.6 57.4 77.7 79.5 83.9 83.8 MMMU _VAL 54.3 55 60.2 50 59.3 65.3 72 72.2 MMStar 64.5 63.9 69.6 59.4 69.3 72.3 75.5 75.7 OCRBench 76 73.7 79.9 75.3 81.2 82 83.7 85 ScreenSpot _v2 88.2 88.1 81.8 3.5 93.3 92.7 83.1 83.1 Table 2: Accuracy comparisons relative to popular open-weight, thinking models All results were obtained using a consistent evaluation setup and prompts across models; numbers are provided for comparison and analysis rather than as leaderboard claims. For more information regarding benchmarks and evaluations, please read the technical paper on the Microsoft Research hub. Suggested use cases and applications Phi‑4‑Reasoning-Vision-15B supports applications that require both high‑fidelity visual perception and structured inference. Two representative scenarios include scientific and mathematical reasoning over visual inputs, and computer‑using agents (CUAs) that operate directly on graphical user interfaces. In both cases, the model provides grounded visual understanding paired with controllable, low‑latency reasoning suitable for interactive systems. Computer use agents in retail scenarios For computer use agents, Phi‑4‑Reasoning-Vision-15B provides the perception and grounding layer required to understand and act within live ecommerce interfaces. For example, in an online shopping experience, the model interprets screen content—products, prices, filters, promotions, buttons, and cart state—and produces grounded observations that agentic models like Fara-7B can use to select actions. Its compact size and low latency inference make it well suited for CUA workflows and agentic applications. Visual reasoning for education Another practical use of visual reasoning models is education. A developer could build a K‑12 tutoring app with Phi‑4‑Reasoning‑Vision‑15B where students upload photos of worksheets, charts, or diagrams to get guided help—not answers. The model can understand the visual content, identify where the student went wrong, and explain the correct steps clearly. Over time, the app can adapt by serving new examples matched to the student’s learning level, turning visual problem‑solving into a personalized learning experience. Microsoft Responsible AI principles At Microsoft, our mission to empower people and organizations remains constant—especially in the age of AI, where the potential for human achievement is greater than ever. We recognize that trust is foundational to AI adoption, and earning that trust requires a commitment to transparency, safety, and accountability. As with other Phi models, Phi-4-Reasoning-Vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. These safety focused training signals help the model recognize and decline requests that fall outside intended or acceptable use. Additional details on the model’s safety considerations, evaluation approach, and known limitations are provided in the accompanying technical blog and model card. Getting started Start using Phi‑4‑Reasoning-Vision-15B in Microsoft Foundry today. Microsoft Foundry provides a unified environment for model discovery, evaluation, and deployment, making it straightforward to move from initial experimentation to production use while applying appropriate safety and governance practices. Deploy the new model on Microsoft Foundry. Learn more about the Phi family on Foundry Labs and in the Phi Cookbook Connect to the Microsoft Developer Community on Discord Read the technical paper on Microsoft Research Read more use cases on the Educators Developer blog768Views0likes0CommentsGrok 4.0 Goes GA in Microsoft Foundry and Grok 4.1 Fast Arrives with Major Enhancements
We first brought Grok 4.0 to Microsoft Foundry in September 2025, marking an important milestone in expanding Foundry’s multi-model ecosystem with frontier reasoning models from xAI. Since then, customer interest and usage have continued to build as developers explored Grok’s strengths in fast reasoning, sense-making, and interpretation of complex, ambiguous information. Today, we’re excited to announce that Grok 4.0 is now generally available (GA) in Microsoft Foundry, giving enterprises a production-ready path to deploy Grok at scale. Building on that momentum, Grok 4.1 Fast (Reasoning and Non-reasoning) are now available in Microsoft Foundry. Grok 4.1 introduces a suite of improvements that enhance conversation quality, creativity, and emotional intelligence while maintaining core reasoning strengths. According to xAI, Grok 4.1 delivers more natural, fluid dialogue compared with earlier versions. Introducing Grok 4.1 Fast (Reasoning and Non-Reasoning) Grok 4.1 Fast is optimized for speed, scale, and agentic execution, giving developers flexibility to choose between reasoning and non-reasoning variants depending on workload requirements. Grok 4.1 Fast (Reasoning): Designed for scenarios that require rapid multi-step reasoning, structured decision-making, and interpretation of complex inputs. This variant is well-suited for agent workflows, analysis pipelines, and applications that need fast responses without sacrificing reasoning depth. Grok 4.1 Fast (Non-Reasoning): Optimized for maximum throughput and low latency, this variant is ideal for tasks such as summarization, classification, content transformation, and tool-driven execution where deterministic speed and efficiency matter more than deep reasoning. Together, these options allow teams to right-size performance and cost by selecting the appropriate Grok 4.1 Fast variant for each stage of an application from high-volume preprocessing and orchestration to targeted reasoning tasks. What’s New with Grok 4.1 Fast? Grok 4.1 brings several enhancements that broaden the model’s applicability and user experience: Improved Conversational Quality: According to xAI, Grok 4.1 Fast offers smoother, more natural interaction patterns, making it more comfortable and intuitive to engage with, especially in multi-turn dialogues. Enhanced Creativity and Emotional Awareness: According to xAI, Grok 4.1 Fast demonstrates stronger creative writing capabilities and greater emotional intelligence, helping it generate more expressive and engaging outputs that better align with human expectations. Reduced Hallucination and Better Reliability: According to xAI, Grok 4.1 Fast produces fewer factual inaccuracies than its predecessor These enhancements make Grok 4.1 Fast a compelling choice for use cases that require engaging conversational interfaces, creative support, and rich natural language interaction. As with all frontier AI models, Grok-4.1 Fast introduces new capabilities alongside new operational considerations. Microsoft’s safety and responsible AI evaluations indicate that Grok-4.1 Fast may demonstrate increased risks in safety testing compared with other models available through Azure. In practice, this means there may be an increased risk of generating explicit or potentially harmful content. To support responsible deployment, customers should implement system-level safety instructions and leverage Azure AI Content Safety (AACS) to help monitor and filter outputs. Because no single safety system can address every possible risk scenario, customers are encouraged to conduct their own evaluations and validation before deploying Grok-4.1 in production systems. To provide enhanced safety and enterprise reliability, Microsoft's deployment of Grok 4.1 features a system-applied safety prompt that cannot be disabled. Customers are expected to operate the model without attempting to bypass or interfere with this feature. Enterprise-Ready Deployment via Microsoft Foundry With Grok 4.0 now GA in Foundry, enterprises gain the ability to incorporate advanced reasoning models into their workflows while enjoying the governance, compliance, and operational tooling that Azure provides. Models hosted in Foundry can be deployed serverless or with provisioned throughput, and customers benefit from centralized billing, identity integration, and access to other Azure services. Foundry’s model catalog also includes other Grok variants such as Grok 4.1 Fast and related non-reasoning SKUs, giving enterprises flexibility to balance performance, latency, and cost depending on their workloads. Pricing Model Deployment Input/1M Tokens Output/1M Tokens Availability Grok 4.1 Fast (Non-Reasoning) Global Standard $0.2 $0.5 Public Preview on 2/27/2026 Grok 4.1 Fast (Reasoning) Global Standard $0.2 $0.5 Public Preview on 3/4/2026 Looking Ahead The combination of Grok’s deep reasoning capabilities with the enterprise readiness of Microsoft Foundry opens new possibilities for production AI applications, from complex analytical agents and research assistants to creative and customer-facing experiences. With Grok 4.1’s conversational refinements further raising the model’s usability and expressiveness, Foundry customers can now experiment with and scale a broader set of AI-driven solutions, all within a trusted, governed environment. As Microsoft continues to expand Foundry’s catalog and partners like xAI continue to innovate, organizations have more options than ever to power next-generation AI applications across industries, use cases, and domains. Try Grok 4.1 Non-Reasoning <AI Model Catalog | Microsoft Foundry Models> Reasoning <AI Model Catalog | Microsoft Foundry Models>607Views0likes0CommentsUnlocking Document Understanding with Mistral Document AI in Microsoft Foundry
Enterprises today face a familiar yet formidable challenge: mountains of documents -contracts, invoices, reports, forms - remain locked in unstructured formats. Traditional OCR (optical character recognition) captures text, but often struggles with context, layout complexity, or multilingual content. The result? Slow workflows, error-prone manual reviews, and missed insights. Enter mistral-document-ai-2512 in Microsoft Foundry. This new model brings together high-end OCR using mistral-ocr-2512 and intelligent document understanding using mistral-small-2506 to turn unstructured documents into actionable data. It doesn’t just “read” pages - it understands them: multi-column layouts, handwritten annotations, tables with merging cells, multilingual content-all processed with enterprise-grade speed and precision. In this blog, we’ll explore what Mistral Document AI 2512 is, why it matters, how it stacks up, and the business impact it promises, especially when paired with solution accelerators like ARGUS. Meet Mistral Document AI Mistral Document AI is an enterprise-grade document understanding model, offered via Microsoft Foundry. It’s built to convert both physical (scans, photos) and digital (PDFs, DOCX) documents into highly structured, machine-readable outputs. Key features include: Top-tier accuracy: According to benchmarks, Mistral’s OCR 2512 stacks display significantly higher accuracy than many alternatives, especially on scanned documents and complex layouts. For example, in comparisons it achieved ~95.9 % “overall” vs ~89-91 % for other platforms Global / multilingual reach: In language-by-language tests (Russian, French, German, Spanish, Chinese, etc), Mistral’s error-rate/fuzzy-match metrics reached 99 %+ in many cases Layout & context awareness: It’s built to not just extract linear text but understand multi-column layouts, tables, charts, images, handwritten input and more Structured output functionality: The model supports structured extraction (JSON), markup (Markdown with interleaved images), preserving document structure for downstream systems Enterprise-ready deployment: With availability via Microsoft Foundry and support for private/secure inference, the model is geared for regulated industries and high-volume workflows Putting it another way: where traditional OCR stops at “here’s the raw text on page 7”, Mistral DocumentAI 2512 can say “here’s the vendor invoice, here are line-items, here’s the total, here’s the signature block, and here’s the part that was handwritten”, ready to plug into downstream systems. Business Impact & Industry examples Mistral Document AI isn’t just another OCR tool; it’s a strategic enabler that turns document-heavy operations into intelligent, automated workflows. The business value comes down to four key advantages: Speed and efficiency: Automating document understanding eliminates manual reviews and retyping. Tasks that took days can be done in minutes, accelerating core business processes Accuracy and consistency: With 99 %+ recognition accuracy and deep layout understanding, Mistral delivers cleaner data and fewer downstream errors - essential in compliance-critical or analytics-driven operations Cost and productivity gains: Reducing manual extraction frees teams for higher-value work, cutting operational costs while increasing output per employee Scalability and adaptability: Cloud-native performance allows organizations to scale document processing instantly during peak loads, across multiple languages and formats, without sacrificing quality Overall, mistral-document-ai-2512 excels where consistency and quality are critical. Industry and Use Cases In regulated industries or big-data scenarios, even a small improvement in accuracy or speed can translate into substantial business gains. Its benchmarks indicate not just incremental progress, but a major step forward - giving enterprises a powerful new engine for their document workflows. Here’s where that impact becomes tangible: Financial services: Banks and insurers handle vast document volumes - loan applications, KYC forms, and claims reports - where data integrity and auditability are non-negotiable. Mistral automates extraction, classification, and clause identification across diverse formats, improving turnaround time and compliance accuracy while reducing manual handling costs Healthcare & life sciences: Clinical records, lab results, and insurance claims often combine handwritten, tabular, and multi-language content. Mistral’s layout awareness and multilingual support ensure clean, structured datasets for downstream analytics and regulatory submissions Manufacturing & logistics: From quality certificates to shipping manifests, Mistral streamlines the flow of operational documents. It can extract production parameters, vendor data, and timestamps at scale - building a unified, queryable data layer that supports supply chain traceability Legal & public sector: Legal teams and agencies depend on consistency and transparency. Mistral helps index, summarise, and validate contracts or permits with full structural fidelity - dramatically cutting review cycles while maintaining evidential quality Retail & consumer goods: Retailers process supplier invoices, product specifications, and marketing briefs from global partners. With Mistral’s multilingual precision and structure preservation, global document flows become searchable and analytics-ready Across these industries, the result is the same: cleaner data, faster throughput, and fewer human errors - the foundation for more reliable decisions and more agile operations. Pricing Argus – A ready-to-implement accelerator to start using Mistral Document AI To spin up a solution faster, one can leverage solution accelerators such as ARGUS (open-source repository available on GitHub). ARGUS serves as a full-pipeline implementation: from document ingestion, OCR/extraction (via Mistral Document AI), to downstream processing and structured output. It shows how to deploy end-to-end, integrate with storage, preprocess documents, handle large-scale batches, output JSON schemas, and integrate into existing business workflows. Mistral Document AI Integration ARGUS now offers flexible OCR provider selection with Mistral Document AI as one of the several options. This enhancement gives you the freedom to choose the best OCR engine for your specific document processing needs. Key Features: Dual Provider Support: Toggle between Azure Document Intelligence (default) and Mistral Document AI Runtime Switching: Change OCR providers on-the-fly through the Settings UI without redeployment Simple Configuration: Set up Mistral via environment variables (OCR_PROVIDER, MISTRAL_DOC_AI_ENDPOINT, MISTRAL_DOC_AI_KEY) or the web interface Seamless Integration: Both providers expose the same interface, ensuring consistent behavior across your document processing pipeline Why This Matters: Different OCR engines excel at processing different document content. Azure Document Intelligence offers enterprise-grade form and table recognition, while Mistral Document AI 2512, in addition, enables extraction to structured JSON with customizable schemas, document classification, and image processing—including text, charts, and signatures. It can convert charts into tables, extract fine print from figures, and even define custom image types for specialized workflows. Now you can select the optimal provider for each use case. In effect, instead of building from scratch, ARGUS gives you the legs to run: pipeline orchestration, ingestion, error-handling, schema-mapping, output integration-all wired to Mistral’s engine. This significantly accelerates time-to-value and reduces risk for enterprise adopters. Getting Started: Navigate to the ARGUS frontend interface (Streamlit app) and click on the Settings tab. In the OCR Provider Configuration section, select your preferred provider. If using Mistral, enter your endpoint URL, API key, and model name. Click Update OCR Provider to apply changes immediately—no restart required. All new document processing will use your selected OCR engine. If your organization is looking to unlock document intelligence, here’s a structured path: Explore Mistral Document AI via Microsoft Foundry: Browse the model card, review endpoint specs, try sample documents to test accuracy and extraction structure Deploy and Pilot with ARGUS: Use the GitHub repo to spin up an end-to-end pipeline on a small workload (e.g., a batch of invoices or contracts) and compare manual vs AI-driven throughput and error-rates Define business value metrics: Track processing time, error rate, manual hours saved, and downstream impact (faster decision cycles, fewer reworks). Scale and govern: Once pilot proves value, expand into multiple document types, languages, geographies - and ensure governance (data handling, compliance, model-monitoring) Embed continuous improvement: As usage grows, feed back learnings, tune schema definitions, refine extraction rules, and extend into QA, insights or analytics layers Conclusion In today’s data-rich but document-heavy environment, the ability to truly understand documents (and not just digitize them) is becoming a strategic imperative. Mistral Document AI represents a next-generation shift: accurate, layout-aware, multilingual, structured. When paired with accelerators like ARGUS, enterprises can move from manual bottlenecks to streamlined, insight-rich document workflows. If you’re thinking about unlocking the value buried in your documents-be it invoices, contracts, forms or reports, now is the time. With mistral-document-ai-2512, what used to be a cost-center is now a potential performance lever. Ready to get started? Explore the model, and let your documents begin talking back.4.3KViews2likes0CommentsMeet FLUX.2 [flex] for Text‑Heavy Design and UI Prototyping Now Available on Microsoft Foundry
Within the FLUX.2 family, FLUX.2 [flex] is the specialist for text and graphic design production, while FLUX.2 [pro] is optimized for photorealistic and cinematic-quality outputs. Use FLUX.2 [flex] when typography, logos, UI copy, or on-brand visual accuracy is the priority. Model Overview FLUX.2 [flex] delivers fine-grained control for precise results. It is purpose-built for typography and text-forward workflows, maintaining small details in complex scenes while enabling rapid iteration at scale. Teams use it for production-grade brand design, product UI prototyping, and any workflow where accurate text rendering is critical. This positioning makes FLUX.2 [flex] ideal for platform builders and creative teams who need to balance speed, scale, and cost while maintaining creative control. FLUX.2 [flex] is now available as gated Public Preview. Fill out this form Black Forest Lab's Flux models Limited Access Applications to get access to these models today. Key Capabilities Best-in-class text rendering for logos, UI copy, product packaging, and social graphics Fine-grained control - Adjust inference steps and guidance scale to dial in precise outputs for production workflows Detail preservation - Maintains fine details and small elements in complex scenes, keeping brand assets sharp at any resolution Multi-reference editing - Supports up to 8 reference images via API for complex multi-image compositing workflows Use Cases FLUX.2 [flex] is specialized for workflows where accurate text rendering and fine-grained visual control are non-negotiable. It excels across brand design, product UI prototyping, social graphics, and any project requiring readable typography, precise layouts, or clean on-brand assets. Brand and Graphic Design Generate precise, on-brand graphic design assets with clean, readable typography across logos, packaging, and marketing collateral Iterate across typefaces, layout styles, and color systems without sacrificing text accuracy or fine detail Produce production-ready social graphics and campaign visuals with correctly rendered copy Creative & Media Production Render multi-word headlines, body copy blocks, and editorial layouts that adapt seamlessly to any style. Modern, classic, handwritten, or dynamic 3D lettering Generate magazine covers, editorial spreads, and text-heavy visual layouts with correctly spaced, readable fonts at production quality Test copywriting concepts and headline variations at 3x the speed of previous-generation models Product UI Prototyping Rapidly prototype product UIs with rich text components, accurate button labels, navigation copy, and on-screen text Build and iterate on app screens, dashboards, and design systems where text legibility and layout precision are critical 💡Tip: Use FLUX.2 [pro] in Microsoft Foundry when your workflow prioritizes photorealistic imagery or cinematic-quality final renders, rather than text-forward or graphic design outputs. Join Black Forest Labs on Model Mondays YouTube Livestream: March 2, 2026 Stephen Batifol, Developer Advocate at Black Forest Labs, joins Model Mondays to show how teams use BFL’s generative image models on Microsoft Foundry to move from experimentation to real‑world creative workflows at scale with fewer touch‑ups, stronger brand fidelity, and state‑of‑the‑art image quality. This hands‑on session features a live demo and practical patterns for building scalable image pipelines, so you can ship complete creative campaigns in days, not weeks. RSVP Here Pricing Foundry Models are fully hosted and managed on Azure. FLUX.2 [flex] is available through pay-as-you-go and on Global Standard deployment type with the following pricing: Model Type Pricing FLUX.2-flex Global Standard $0.05 per megapixel (MP) Important Notes: For pricing, resolution is always rounded up to the next megapixel, separately for each reference image and for the generated image. 1 megapixel is counted as 1024x1024 pixels For multiple reference images, each reference image is counted as 1 megapixel Images exceeding 4 megapixels are resized to 4 megapixels Reference the Foundry Models pricing page for pricing. Build Trustworthy AI Solutions Black Forest Labs models in Foundry Models are delivered under the Microsoft Product Terms, giving you enterprise-grade security and compliance out of the box. Each FLUX endpoint offers Content Safety controls and guardrails. Runtime protections include built-in content-safety filters, role-based access control, virtual-network isolation, and automatic Azure Monitor logging. Governance signals stream directly into Azure Policy, Purview, and Microsoft Sentinel, giving security and compliance teams real-time visibility. Together, Microsoft's capabilities let you create with more confidence, knowing that privacy, security, and safety are woven into every Black Forest Labs deployment from day one. Getting Started with FLUX.2 in Microsoft Foundry If you don’t have an Azure subscription, you can sign up for an Azure account here. Search for the model name in the model catalog in Microsoft Foundry under “Build.” FLUX.2-flex Open the model card in the model catalog. Click on deploy to obtain the inference API and key. View your deployment under Build > Models. You should land on the deployment page that shows you the API and key in less than a minute. You can try out your prompts in the playground. You can use the API and key with various clients. Learn More FLUX.2 [flex] is now available as gated Public Preview. Fill out this form Black Forest Lab's Flux models Limited Access Applications to get access to these models today. ▶️ RSVP for the next Model Monday LIVE on YouTube or On-Demand 👩💻 Explore FLUX.2 Documentation on Microsoft Learn 👋 Continue the conversation on Discord450Views0likes0CommentsNew Azure Open AI models bring fast, expressive, and real‑time AI experiences in Microsoft Foundry
Modern AI applications, whether voice‑first experiences or building large software systems, rarely fit into a single prompt. Real work unfolds over time: maintaining context, following instructions, invoking tools, and adapting as requirements evolve. When these foundations break down through latency spikes, instruction drift, or unreliable tool calls, both user conversations and developer workflows are impacted. OpenAI’s latest models address this shared challenge by prioritizing continuity and reliability across real‑time interaction and long‑running engineering tasks. Starting today, GPT-Realtime-1.5, GPT-Audio-1.5, and GPT-5.3-Codex are rolling out into Microsoft Foundry. Together, these models reflect the growing needs of the modern developer and push the needle from short, stateless interactions toward AI systems that can reason, act, and collaborate over time. GPT-5.3-Codex at a glance GPT‑5.3‑Codex brings together advanced coding capability with broader reasoning and professional problem solving in a single model built for real engineering work. It unifies the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge capabilities of GPT5.2 in one system. This shifts the experience from optimizing isolated outputs to supporting longer running development efforts; where repositories are large, changes span multiple steps, and requirements aren’t always fully specified at the start. What’s improved Model experiences 25% faster execution time, according to Open AI, than its predecessors so developers can accelerate development of new applications. Built for long-running tasks that involve research, tool use, and complex, multi‑step execution while maintaining context. Midtask steerability and frequent updates allow developers to redirect and collaborate with the model as it works without losing context. Stronger computer-use capabilities allow developers to execute across the full spectrum of technical work. Common use cases Developers and teams can apply GPT‑5.3‑Codex across a wide range of scenarios, including: Refactoring and modernizing large or legacy applications Performing multi‑step migrations or upgrades Running agentic developer workflows that span analysis, implementation, testing, and remediation Automating code reviews, test generation, and defect detection Supporting development in security‑sensitive or regulated environments Pricing Model Input Price/1M Tokens Cached Input Price/1M Tokens Output Price/1M Tokens GPT-5.3-Codex $1.75 $0.175 $14.00 GPT-Realtime-1.5 and GPT-Audio-1.5 at a glance The models deliver measurable gains in reasoning and speech understanding for real‑time voice interactions on Microsoft Foundry. In OpenAI’s evaluations, it shows a +5% lift on Big Bench Audio (reasoning), a +10.23% improvement in alphanumeric transcription, and a +7% gain in instruction following, while maintaining low‑latency performance. Key improvements include: What's improved More natural‑sounding speech: Audio output is smoother and more conversational, with improved pacing and prosody. Higher audio quality: Clearer, more consistent audio output across supported voices. Improved instruction following: Better alignment with developer‑provided system and user instructions during live interactions. Function calling support: Enables structured, tool‑driven interactions within real‑time audio flows. Common use cases Developers are using GPT-Realtime-1.5 and GPT-Audio-1.5 for scenarios where low‑latency voice interaction is essential, including: Conversational voice agents for customer support or internal help desks Voice‑enabled assistants embedded in applications or devices Live voice interfaces for kiosks, demos, and interactive experiences Hands‑free workflows where audio input and output replace keyboard interaction Pricing Model Text Audio Image Input Cached Input Output Input Cached Input Output Input Cached Input Output GPT-Realtime-1.5 $4.00 $0.04 $16.0 $32.0 $0.40 $64.00 $4.00 $0.04 $16.0 GPT-Audio-1.5 $2.50 n/a $10.0 $32.00 n/a $64.00 $2.50 n/a $10.0 Getting started in Microsoft Foundry Start building in Microsoft Foundry, evaluate performance, and explore Azure Open AI models today. Foundry brings evaluation, deployment, and governance into a single workflow, helping teams progress from experiments to scalable applications while maintaining security and operational controls.8.5KViews1like0CommentsWhat’s trending on Hugging Face: PubMedBERT Base Embeddings, Paraphrase Multilingual MiniLM, BGE-M3
The embedding model landscape has evolved beyond one-size-fits-all solutions. Today’s developers navigate a set of deliberate trade‑offs: domain specialization to improve accuracy in vertical applications, multilingual capabilities to support global use cases, and retrieval strategies that optimize performance at scale. Once a model demonstrates strong semantic performance, predictable behavior, and broad community support, it often becomes a trusted reference baseline that developers build around and deploy with confidence. This week, we’re not spotlighting models that are new to Microsoft Foundry. Instead, we’re turning our attention to models that have managed to stay relevant in a rapidly expanding sea of options. This week's Model Monday's edition highlights three Hugging Face models including NeuML's PubMedBERT Base Embeddings for domain-specific medical text understanding, Sentence Transformers' Paraphrase Multilingual MiniLM for lightweight cross-lingual semantic similarity, and BAAI's BGE-M3 for multi-functional long-context retrieval across 100+ languages. Models of the week NeuML: PubMedBERT Base Embeddings Model Specs Parameters / size: 109M Context length: 512 tokens Primary task: Embeddings (medical domain) Why it's interesting Domain-specific performance gains: Fine-tuned on PubMed title-abstract pairs, achieving 95.62% average Pearson correlation across medical benchmarks—outperforming general-purpose models like gte-base (95.37%), bge-base-en-v1.5 (93.78%), and all-MiniLM-L6-v2 (93.46%) on medical literature tasks Production-validated for medical RAG: With 141K downloads and deployment in 30+ medical AI applications, this model demonstrates consistent real-world performance for clinical research, drug discovery, and biomedical semantic search pipelines Built on Microsoft's BiomedNLP foundation: Extends BioMed BERT family with sentence-transformers mean pooling, creating 768-dimensional embeddings optimized for medical literature clustering and retrieval Try it Clinical research sample prompt: Industry specific sample prompt: You're building a clinical decision support system for oncology. Deploy PubMedBERT Base Embeddings in Microsoft Foundry to index 50,000 recent cancer research abstracts from PubMed. A physician queries: "What are the cardiotoxicity risks of combining checkpoint inhibitors with anthracycline chemotherapy in elderly patients?" Embed the query, retrieve the top 10 most semantically similar abstracts using cosine similarity, and return citations with PubMed IDs for evidence-based treatment planning. Sentence Transformers: Paraphrase Multilingual MiniLM L12 v2 Model Specs Parameters / size: 117M Context length: 128 tokens Primary task: Embeddings (multilingual, sentence similarity) Why it's interesting Multilingual adoption: Supports 50+ languages including Arabic, Chinese, Hebrew, Hindi, Japanese, Korean, Russian, Thai, and Vietnamese—with 18.4 million downloads last month demonstrating production-scale validation across global deployments Compact architecture for edge deployment: At 117M parameters producing 384-dimensional embeddings, this model balances multilingual coverage with inference efficiency, making it ideal for resource-constrained environments or high-throughput applications Sentence-BERT foundation: Based on the influential Sentence-BERT paper (Reimers & Gurevych, 2019), using siamese BERT networks with mean pooling to create semantically meaningful sentence embeddings for clustering, paraphrase detection, and cross-lingual search Community-proven versatility: With 299 fine-tuned variants and 100+ Spaces implementations, this model serves as a peer reviewed starting point for multilingual semantic similarity tasks, from customer support ticket routing to cross-lingual document retrieval Try it E-commerce sample prompt: You're building a global customer support platform for an e-commerce company operating in 30 countries. Deploy Paraphrase Multilingual MiniLM in Microsoft Foundry to process incoming support tickets in English, Spanish, French, German, Portuguese, Japanese, and Korean. Embed each ticket as a 384-dimensional vector and cluster by semantic similarity to automatically route issues to specialized teams (payment, shipping, returns, technical). Flag duplicate tickets with cosine similarity > 0.85 to prevent redundant responses. BAAI: BGE-M3 Model Specs Parameters / size: ~560M Context length: 8192 tokens Primary task: Embeddings (multi-functional: dense, sparse, multi-vector) Why it's interesting Three retrieval modes in one model: Uniquely supports dense retrieval (1024-dim embeddings), sparse retrieval (lexical matching like BM25), and multi-vector retrieval (ColBERT-style fine-grained matching)—enabling hybrid search pipelines without maintaining separate models or indexes Exceptional long-context capability: 8192-token context window handles full documents, legal contracts, research papers, and lengthy technical content—validated on MLDR (13-language document retrieval) and NarrativeQA (long-form question answering) benchmarks Multilingual dominance: Outperforms OpenAI embeddings on MIRACL multilingual retrieval across 13+ languages and demonstrates strong zero-shot cross-lingual transfer on MKQA. Try it Legal document search sample prompt: You're building a legal document search system for a multinational law firm. Deploy BGE-M3 in Microsoft Foundry to index 5,000 full-length commercial contracts (average 6,000 tokens each) in English, French, German, and Spanish. A lawyer queries: "Find all force majeure clauses that exclude liability for pandemics or global health emergencies." Use hybrid retrieval: (1) dense embeddings for semantic similarity to capture concept variations like "Act of God" or "unforeseen circumstances", (2) sparse retrieval for exact keyword matches on "force majeure", "pandemic", "health emergency". Combine scores with weighted sum (0.6 dense + 0.4 sparse) and return top 15 contract sections with clause numbers and jurisdiction metadata. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation. Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry299Views0likes0CommentsWhat is trending in Hugging Face on Microsoft Foundry? Feb, 2, 2026
Open‑source AI is moving fast, with important breakthroughs in reasoning, agentic systems, multimodality, and efficiency emerging every day. Hugging Face has been a leading platform where researchers, startups, and developers share and discover new models. Microsoft Foundry brings these trending Hugging Face models into a production‑ready experience, where developers can explore, evaluate, and deploy them within their Azure environment. Our weekly Model Monday’s series highlights Hugging Face models available in Foundry, focusing on what matters most to developers: why a model is interesting, where it fits, and how to put it to work quickly. This week’s Model Mondays edition highlights three Hugging Face models, including a powerful Mixture-of-Experts model from Z. AI designed for lightweight deployment, Meta’s unified foundation model for image and video segmentation, and MiniMax’s latest open-source agentic model optimized for complex workflows. Models of the week Z.AI’s GLM-4.7-flash Model Basics Model name: zai-org/GLM-4.7-Flash Parameters / size: 30B total -3B active Default settings: 131,072 max new tokens Primary task: Agentic, Reasoning and Coding Why this model matters Why it’s interesting: It utilizes a Mixture-of-Experts (MoE) architecture (30B total parameters and 3B active parameters) to offer a new option for lightweight deployment. It demonstrates strong performance on logic and reasoning benchmarks, outperforming similar sized models like gpt-oss-20b on AIME 25 and GPQA benchmarks. It supports advanced inference features like "Preserved Thinking" mode for multi-turn agentic tasks. Best‑fit use cases: Lightweight local deployment, multi-turn agentic tasks, and logical reasoning applications. What’s notable: From the Foundry catalog, users can deploy on a A100 instance or unsloth/GLM-4.7-Flash-GGUF on a CPU. ource SOTA scores among models of comparable size. Additionally, compared to similarly sized models, GLM-4.7-Flash demonstrates superior frontend and backend development capabilities. Click to see more: https://docs.z.ai Try it Use case Best‑practice prompt pattern Agentic coding (multi‑step repo work, debugging, refactoring) Treat the model as an autonomous coding agent, not a snippet generator. Explicitly require task decomposition and step‑by‑step execution, then a single consolidated result. Long‑context agent workflows (local or low‑cost autonomous agents) Call out long‑horizon consistency and context preservation. Instruct the model to retain earlier assumptions and decisions across turns. Now that you know GLM‑4.7‑Flash works best when you give it a clear goal and let it reason through a bounded task, here’s an example prompt that a product or engineering team might use to identify risks and propose mitigations: You are a software reliability analyst for a mid‑scale SaaS platform. Review recent incident reports, production logs, and customer issues to uncover edge‑case failures outside normal usage (e.g., rare inputs, boundary conditions, timing/concurrency issues, config drift, or unexpected feature interactions). Prioritize low‑frequency, high‑impact risks that standard testing misses. Recommend minimal, low‑cost fixes (validation, guardrails, fallback logic, or documentation). Deliver a concise executive summary with sections: Observed Edge Cases, Root Causes, User Impact, Recommended Lightweight Fixes, and Validation Steps. Meta's Segment Anything 3 (SAM3) Model Basics Model name: facebook/sam3 Parameters / size: 0.9B Primary task: Mask Generation, Promptable Concept Segmentation (PCS) Why this model matters Why it’s interesting: It handles a vastly larger set of open-vocabulary prompts than SAM 2, and unifies image and video segmentation capabilities. It includes a "SAM 3 Tracker" mode that acts as a drop-in replacement for SAM 2 workflows with improved performance. Best‑fit use cases: Open-vocabulary object detection, video object tracking, and automatic mask generation What’s notable: Introduces Promptable Concept Segmentation (PCS), allowing users to find all matching objects (e.g., "dial") via text prompt rather than just single instances. Try it This model enables users to identify specific objects within video footage and isolate them over extended periods. With just one line of code, it is possible to detect multiple similar objects simultaneously. The accompanying GIF demonstrates how SAM3 efficiently highlights players wearing white on the field as they appear and disappear from view. Additional examples are available at the following repository: https://github.com/facebookresearch/sam3/blob/main/assets/player.gif Use case Best‑practice prompt pattern Agentic coding (multi‑step repo work, debugging, refactoring) Treat SAM 3 as a concept detector, not an interactive click tool. Use short, concrete noun‑phrase concept prompts instead of describing the scene or asking questions. Example prompt: “yellow school bus” or “shipping containers”. Avoid verbs or full sentences. Video segmentation + object tracking Specify the same concept prompt once, then apply it across the video sequence. Do not restate the prompt per frame. Let the model maintain identity continuity. Example: “person wearing a red jersey”. Hard‑to‑name or visually subtle objects Use exemplar‑based prompts (image region or box) when text alone is ambiguous. Optionally combine positive and negative exemplars to refine the concept. Avoid over‑constraining with long descriptions. Using the GIF above as a leading example, here is a prompt that shows how SAM 3 turns raw sports footage into structured, reusable data. By identifying and tracking players based on visual concepts like jersey color so that sports leagues can turn tracked data into interactive experiences where automated player identification can relay stats, fun facts, etc when built into a larger application. Here is a prompt that will allow you to start identifying specific players across video: Act as a sports analytics operator analyzing football match footage. Segment and track all football players wearing blue jerseys across the video. Generate pixel‑accurate segmentation masks for each player and assign persistent instance IDs that remain stable during camera movement, zoom, and player occlusion. Exclude referees, opposing team jerseys, sidelines, and crowd. Output frame‑level masks and tracking metadata suitable for overlays, player statistics, and downstream analytics pipelines. MiniMax AI's MiniMax-M2.1 Model Basics Model name: MiniMaxAI/MiniMax-M2.1 Parameters / size: 229B-10B Active Default settings: 200,000 max new tokens Primary task: Agentic and Coding Why this model matters Why it’s interesting: It is optimized for robustness in coding, tool use, and long-horizon planning, outperforming Claude Sonnet 4.5 in multilingual scenarios. It excels in full-stack application development, capable of architecting apps "from zero to one”. Previous coding models focused on Python optimization, M2.1 brings enhanced capabilities in Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, JavaScript, and other languages. The model delivers exceptional stability across various coding agent frameworks. Best‑fit use cases: Lightweight local deployment, multi-turn agentic tasks, and logical reasoning applications. What’s notable: The release of open-source weights for M2.1 delivers a massive leap over M2 on software engineering leaderboards. https://www.minimax.io/ Try it Use case Best‑practice prompt pattern End‑to‑end agentic coding (multi‑file edits, run‑fix loops) Treat the model as an autonomous coding agent, not a snippet generator. Explicitly require task decomposition and step‑by‑step execution, then a single consolidated result. Long‑horizon tool‑using agents (shell, browser, Python) Explicitly request stepwise planning and sequential tool use. M2.1’s interleaved thinking and improved instruction‑constraint handling are designed for complex, multi‑step analytical tasks that require evidence tracking and coherent synthesis, not conversational back‑and‑forth. Long‑context reasoning & analysis (large documents / logs) Declare the scope and desired output structure up front. MiniMax‑M2.1 performs best when the objective and final artifact are clear, allowing it to manage long context and maintain coherence. Because MiniMax‑M2.1 is designed to act as a long‑horizon analytical agent, it shines when you give it a clear end goal and let it work through large volumes of information—here’s a prompt a risk or compliance team could use in practice: You are a financial risk analysis agent. Analyze the following transaction logs and compliance policy documents to identify potential regulatory violations and systemic risk patterns. Plan your approach before executing. Work through the data step by step, referencing evidence where relevant. Deliver a final report with the following sections: Key Risk Patterns Identified, Supporting Evidence, Potential Regulatory Impact, Recommended Mitigations. Your response should be a complete, executive-ready report, not a conversational draft. Getting started You can deploy open‑source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation. Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry1KViews0likes0Comments