genaiops
20 TopicsGemma 4 now available in Microsoft Foundry
Experimenting with open-source models has become a core part of how innovative AI teams stay competitive: experimenting with the latest architectures and often fine-tuning on proprietary data to achieve lower latencies and cost. Today, we’re happy to announce that the Gemma 4 family, Google DeepMind’s newest model family, is now available in Microsoft Foundry via the Hugging Face collection. Azure customers can now discover, evaluate, and deploy Gemma 4 inside their Azure environment with the same policies they rely on for every other workload. Foundry is the only hyperscaler platform where developers can access OpenAI, Anthropic, Gemma, and over 11,000+ models under a single control plane. Through our close collaboration with Hugging Face, Gemma 4 joining that collection continues Microsoft’s push to bring customers the widest selection of models from any cloud – and fits in line with our enhanced investments in open-source development. Frontier Intelligence, open-source weights Released by Google DeepMind on April 2, 2026, Gemma 4 is built from the same research foundation as Gemini 3 and packaged as open weights under an Apache 2.0 license. Key capabilities across the Gemma 4 family: Native multimodal: Text + image + video inputs across all sizes; analyze video by processing sequences of frames; audio input on edge models (E2B, E4B) Enhanced reasoning & coding capabilities: Multi-step planning, deep logic, and improvements in math and instruction-following enabling autonomous agents Trained for global deployment: Pretrained on 140+ languages with support for 35+ languages out of the box Long context: Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B) allow developers to reason across extensive codebases, lengthy documents, or multi-session histories Why choose Foundry? Foundry is built to give developers breadth -- access to models from major model providers, open and proprietary, under one roof. Stay within Azure to work leading models. When you deploy through Foundry, models run inside your Azure environment and are subject to the same network policies, identity controls, and audit processes your organization already has in place. Managed online endpoints handle serving, scaling, and monitoring without manually setting up and managing the underlying infrastructure. Serverless deployment with Azure Container Apps allows developers to deploy and run containerized applications while reducing infrastructure management and saving costs. Gated model access integrates directly with Hugging Face user tokens, so models that require license acceptance stay compliant can be accessed without manual approvals. Foundry Local lets you run optimized Hugging Face models directly on your own hardware using the same model catalog and SDK patterns as your cloud deployments. Read the documentation here: https://aka.ms/foundrylocal and https://aka.ms/HF/foundrylocal Microsoft’s approach to Responsible AI is grounded in our AI principles of fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy new models responsibly in production environments. What are teams building with Gemma 4 in Foundry Gemma 4’s combination of multimodal input, agentic function calling, and long context offers a wide range of production use cases: Document intelligence: Processing PDFs, charts, invoices, and complex tables using native vision capabilities Multilingual enterprise apps: 140+ natively trained languages — ideal for multinational customer support, content platforms as well as language learning tools for grammar correction and writing practice Long-context analytics: Reasoning across entire codebases, legal documents, or multi-session conversation histories Getting started Try Gemma 4 in Microsoft Foundry today. New models from Hugging Face continue to roll out to Foundry on a regular basis through our ongoing collaboration. If there's a model you want to see added, let us know here. Stay connected to our developer community on Discord and stay up to date on what is new in Foundry through the Model Mondays series.1KViews1like0CommentsGenerally Available: Evaluations, Monitoring, and Tracing in Microsoft Foundry
If you've shipped an AI agent to production, you've likely run into the same uncomfortable realization: the hard part isn't getting the agent to work - it's keeping it working. Models get updated, prompts get tweaked, retrieval pipelines drift, and user traffic surfaces edge cases that never appeared in your eval suite. Quality isn't something you establish once. It's something you have to continuously measure. Today, we're making that continuous measurement a first-class operational capability. Evaluations, Monitoring, and Tracing in Microsoft Foundry are now generally available through Foundry Control Plane. These aren't standalone tools bolted onto the side of the platform - they're deeply integrated with Azure Monitor, which means AI agent observability now lives in the same operational plane as the rest of your infrastructure. The Problem With Point-in-Time Evaluation Most evaluation workflows are designed around a pre-deployment gate. You build a test dataset, run your evals, review the scores, and ship. That approach has real value - but it has a hard ceiling. In production, agent behavior is a function of many things that change independently of your code: Foundation model updates ship continuously and can shift output style, reasoning patterns, and edge case handling in ways that don't always surface on your benchmark set. Prompt changes can have nonlinear effects downstream, especially in multi-step agentic flows. Retrieval pipeline drift changes what context your agent actually sees at inference time. A document index fresh last month may have stale or subtly different content today. Real-world traffic distribution is never exactly what you sampled for your test set. Production surfaces long-tail inputs that feel obvious in hindsight but were invisible during development. The implication is straightforward: evaluation has to be continuous, not episodic. You need quality signals at development time, at every CI/CD commit, and continuously against live production traffic - all using the same evaluator definitions so results are comparable across environments. That's the core design principle behind Foundry Observability. Continuous Evaluation Across the Full AI Lifecycle Built-In Evaluators Foundry's built-in evaluators cover the most critical quality and safety dimensions for production agent systems: Coherence and Relevance measure whether responses are internally consistent and on-topic relative to the input. These are table-stakes signals for any conversational or task-completion agent. Groundedness is particularly important for RAG-based architectures. It measures whether the model's output is actually supported by the retrieved context - as opposed to plausible-sounding content the model generated from its parametric memory. Groundedness failures are a leading indicator of hallucination risk in production, and they're often invisible to human reviewers at scale. Retrieval Quality evaluates the retrieval step independently from generation. Groundedness failures can originate in two places: the model may be ignoring good context, or the retrieval pipeline may not be surfacing relevant context in the first place. Splitting these signals makes it much easier to pinpoint root cause. Safety and Policy Alignment evaluates whether outputs meet your deployment's policy requirements - content safety, topic restrictions, response format compliance, and similar constraints. These evaluators are designed to run at every stage of the AI lifecycle: Local development - run evals inline as you iterate on prompts, retrieval config, or orchestration logic CI/CD pipelines - gate every commit against your quality baselines; catch regressions before they reach production Production traffic monitoring - continuously evaluate sampled live traffic and surface trends over time Because the evaluators are identical across all three contexts, a score in CI means the same thing as a score in production monitoring. See the Practical Guide to Evaluations and the Built-in Evaluators Reference for a deeper walkthrough. Custom Evaluators - Encoding Your Own Definition of Quality Built-in evaluators cover common signals well, but production agents often need to satisfy criteria specific to a domain, regulatory environment, or internal standard. Foundry supports two types of custom evaluators (currently in public preview): LLM-as-a-Judge evaluators let you configure a prompt and grading rubric, then use a language model to apply that rubric to your agent's outputs. This is the right approach for quality dimensions that require reasoning or contextual judgment - whether a response appropriately acknowledges uncertainty, whether a customer-facing message matches your brand tone, or whether a clinical summary meets documentation standards. You write a judge prompt with a scoring scale (e.g., 1–5 with criteria for each level) that evaluates a given {input} / {response} pair. Foundry runs this at scale and aggregates scores into your dashboards alongside built-in results. Code-based evaluators are Python functions that implement any evaluation logic you can express programmatically - regex matching, schema validation, business rule checks, compliance assertions, or calls to external systems. If your organization has documented policies about what a valid agent response looks like, you can encode those policies directly into your evaluation pipeline. Custom and built-in evaluators compose naturally - running against the same traffic, producing results in the same schema, feeding into the same dashboards and alert rules. Monitoring and Alerting - AI Quality as an Operational Signal All observability data produced by Foundry - evaluation results, traces, latency, token usage, and quality metrics - is published directly to Azure Monitor. This is where the integration pays off for teams already on Azure. What this enables that siloed AI monitoring tools can't: Cross-stack correlation. When your groundedness score drops, is it a model update, a retrieval pipeline issue, or an infrastructure problem affecting latency? With AI quality signals and infrastructure telemetry in the same Azure Monitor Application Insights workspace, you can answer that in minutes rather than hours of manual correlation across disconnected systems. Unified alerting. Configure Azure Monitor alert rules on any evaluation metric - trigger a PagerDuty incident when groundedness drops below threshold, send a Teams notification when safety violations spike, or create automated runbook responses when retrieval quality degrades. These are the same alert mechanisms your SRE team already uses. Enterprise governance by default. Azure Monitor's RBAC, retention policies, diagnostic settings, and audit logging apply automatically to all AI observability data. You inherit the governance framework your organization has already built and approved. Grafana and existing dashboards. If your team uses Azure Managed Grafana, evaluation metrics can flow into existing dashboards alongside your other operational metrics - a single pane of glass for application health, infrastructure performance, and AI agent quality. The Agent Monitoring Dashboard in the Foundry portal provides an AI-native view out of the box - evaluation metric trends, safety threshold status, quality score distributions, and latency breakdowns. Everything in that dashboard is backed by Azure Monitor data, so SRE teams can always drill deeper. End-to-End Tracing: From Quality Signal to Root Cause A groundedness score tells you something is wrong. A trace tells you exactly where the failure occurred and what the agent actually did. Foundry provides OpenTelemetry-based distributed tracing that follows each request through your entire agent system: model calls, tool invocations, retrieval steps, orchestration logic, and cross-agent handoffs. Traces capture the full execution path - inputs, outputs, latency at each step, tool call parameters and responses, and token usage. The key design decision: evaluation results are linked directly to traces. When you see a low groundedness score in your monitoring dashboard, you navigate directly to the specific trace that produced it - no manual timestamp correlation, no separate trace ID lookup. The connection is made automatically. Foundry auto-collects traces across the frameworks your agents are likely already built on: Microsoft Agent Framework Semantic Kernel LangChain and LangGraph OpenAI Agents SDK For custom or less common orchestration frameworks, the Azure Monitor OpenTelemetry Distro provides an instrumentation path. Microsoft is also contributing upstream to the OpenTelemetry project - working with Cisco Outshift, we've contributed semantic conventions for multi-agent trace correlation, standardizing how agent identity, task context, and cross-agent handoffs are represented in OTel spans. Note: Tracing is currently in public preview, with GA shipping by end of March. Prompt Optimizer (Public Preview) One persistent friction point in agent development is the iteration loop between writing prompts and measuring their effect. You make a change, run your evals, look at the delta, try to infer what about the change mattered, and repeat. Prompt Optimizer tightens this loop. It analyzes your existing prompt and applies structured prompt engineering techniques - clarifying ambiguous instructions, improving formatting for model comprehension, restructuring few-shot examples, making implicit constraints explicit - with paragraph-level explanations for every change it makes. The transparency is deliberate. Rather than producing a black-box "optimized" prompt, it shows you exactly what it changed and why. You can add constraints, trigger another optimization pass, and iterate until satisfied. When you're done, apply it with one click. The value compounds alongside continuous evaluation: run your eval suite against the current prompt, optimize, run evals again, see the measured improvement. That feedback loop - optimize, measure, optimize - is the closest thing to a systematic approach to prompt engineering that currently exists. What Makes our Approach to Observability Different There are other evaluation and observability tools in the AI ecosystem. The differentiation in Foundry's approach comes down to specific architectural choices: Unified lifecycle coverage, not just pre-deployment testing. Most existing evaluation tools are designed for offline, pre-deployment use. Foundry's evaluators run in the same form at development time, in CI/CD, and against live production traffic. Your quality metrics are actually comparable across the lifecycle - you can tell whether production quality matches what you saw in testing, rather than operating two separate measurement systems that can't be compared. No separate observability silo. Publishing all observability data to Azure Monitor means you don't operate a separate system for AI quality alongside your existing infrastructure monitoring. AI incidents route through your existing on-call rotations. AI quality data is subject to the same retention and compliance controls as the rest of your telemetry. Framework-agnostic tracing. Auto-instrumentation across Semantic Kernel, LangChain, LangGraph, and the OpenAI Agents SDK means you're not locked into a specific orchestration framework. The OpenTelemetry foundation means trace data is portable to any compatible backend, protecting your investment as the tooling landscape evolves. Composable evaluators. Built-in and custom evaluators run in the same pipeline, against the same traffic, producing results in the same schema, feeding into the same dashboards and alert rules. You don't choose between generic coverage and domain-specific precision - you get both. Evaluation linked to traces. Most systems treat evaluation and tracing as separate concerns. Foundry treats them as two views of the same event - closing the loop between detecting a quality problem and diagnosing it. Getting Started If you're building agents on Microsoft Foundry, or using Semantic Kernel, LangChain, LangGraph, or the OpenAI Agents SDK and want to add production observability, the entry point is Foundry Control Plane. Try it You'll need a Foundry project with an agent and an Azure OpenAI deployment. Enable observability by navigating to Foundry Control Plane and connecting your Azure Monitor workspace. Then walk through the Practical Guide to Evaluations, explore the Built-in Evaluators Reference, and set up end-to-end tracing for your agents.3.8KViews1like0CommentsNow in Foundry: VibeVoice-ASR, MiniMax M2.5, Qwen3.5-9B
This week's Model Mondays edition features two models that have just arrived in Microsoft Foundry: Microsoft's VibeVoice-ASR, a unified speech-to-text model that handles 60-minute audio files in a single pass with built-in speaker diarisation and timestamps, and MiniMaxAI's MiniMax-M2.5, a frontier agentic model that leads on coding and tool-use benchmarks with performance comparable to the strongest proprietary models at a fraction of their cost; and Qwen's Qwen3.5-9B, the largest of the Qwen3.5 Small Series. All three represent a shift toward long-context, multi-step capability: VibeVoice-ASR processes up to an hour of continuous audio without chunking; MiniMax-M2.5 handles complex, multi-phase agentic tasks more efficiently than its predecessor—completing SWE-Bench Verified 37% faster than M2.1 with 20% fewer tool-use rounds; and Qwen3.5-9B brings multimodal reasoning on consumer hardware that outperforms much larger models. Models of the week VibeVoice-ASR Model Specs Parameters / size: ~8.3B Primary task: Automatic Speech Recognition with diarisation and timestamps Why it's interesting 60-minute single-pass with full speaker attribution: VibeVoice-ASR processes up to 60 minutes of continuous audio without chunk-based segmentation—yielding structured JSON output with start/end timestamps, speaker IDs, and transcribed content for each segment. This eliminates the speaker-tracking drift and semantic discontinuities that chunk-based pipelines introduce at segment boundaries. Joint ASR, diarisation, and timestamps in one model: Rather than running separate systems for transcription, speaker separation, and timing, VibeVoice-ASR produces all three outputs in a single forward pass. Users can also inject customized hot words—proper nouns, technical terms, or domain-specific phrases—to improve recognition accuracy on specialized content without fine-tuning. Multilingual with native code-switching: Supports 50+ languages with no explicit language configuration required and handles code-switching within and across utterances natively. This makes it suitable for multilingual meetings and international call center recordings without pre-routing audio by language. Benchmarks: On the Open ASR Leaderboard, VibeVoice-ASR achieves an average WER of 7.77% across 8 English datasets (RTFx 51.80), including 2.20% on LibriSpeech Clean and 2.57% on TED-LIUM. On the MLC-Challenge multi-speaker benchmark: DER 4.28%, cpWER 11.48%, tcpWER 13.02%. Try it Use case What to build Best practices Long-form, multi-speaker transcription for meetings + compliance A transcription service that ingests up to 60 minutes of audio per request and returns structured segments with speaker IDs + start/end timestamps + transcript text (ready for search, summaries, or compliance review). Keep audio un-chunked (single-pass) to preserve speaker coherence and avoid stitching drift; rely on the model’s joint ASR, diarisation, and timestamping so you don’t need separate diarisation/timestamp pipelines or postprocessing. Multilingual + domain-specific transcription (global support, technical reviews) A global transcription workflow for multilingual meetings or call center recordings that outputs “who/when/what,” and supports vocabulary injection for product names, acronyms, and technical terms. Provide customized hot words (names / technical terms) in the request to improve recognition on specialized content; don’t require explicit language configuration—VibeVoice-ASR supports 50+ languages and code-switching, so you can avoid pre-routing audio by language. Read more about the model and try out the playground Microsoft for Hugging Face Spaces to try the model for yourself. MiniMax-M2.5 Model Specs Parameters / size: ~229B (FP8, Mixture of Experts) Primary task: Text generation (agentic coding, tool use, search) Why it's interesting? Leading coding benchmark performance: Scores 80.2% on SWE-Bench Verified and 51.3% on Multi-SWE-Bench across 10+ programming languages (Go, C, C++, TypeScript, Rust, Python, Java, and others). In evaluations across different agent harnesses, M2.5 scores 79.7% on Droid and 76.1% on OpenCode—both ahead of Claude Opus 4.6 (78.9% and 75.9% respectively). The model was trained across 200,000+ real-world coding environments covering the full development lifecycle: system design, environment setup, feature iteration, code review, and testing. Expert-level search and tool use: M2.5 achieves industry-leading performance in BrowseComp, Wide Search, and Real-world Intelligent Search Evaluation (RISE), laying a solid foundation for autonomously handling complex tasks. Professional office work: Achieves a 59.0% average win rate against other mainstream models in financial modeling, Word, and PowerPoint tasks, evaluated via the GDPval-MM framework with pairwise comparison by senior domain professionals (finance, law, social sciences). M2.5 was co-developed with these professionals to incorporate domain-specific tacit knowledge—rather than general instruction-following—into the model's training. Try it Use case What to build Best practices Agentic software engineering Multi‑file code refactors, CI‑gated patch generation, long‑running coding agents working across large repositories Start prompts with a clear architecture or refactor goal. Let the model plan before editing files, keep tool calls sequential, and break large changes into staged tasks to maintain state and coherence across long workflows. Autonomous productivity agents Research assistants, web‑enabled task agents, document and spreadsheet generation workflows Be explicit about intent and expected output format. Decompose complex objectives into smaller steps (search → synthesize → generate), and leverage the model’s long‑context handling for multi‑step reasoning and document creation. With these use cases and best practices in mind, the next step is translating them into a clear, bounded prompt that gives the model a specific goal and the right tools to act. The example below shows how a product or engineering team might frame an automated code review and implementation task, so the model can reason through the work step by step and return results that map directly back to the original requirement: “You're building an automated code review and feature implementation system for a backend engineering team. Deploy MiniMax-M2.5 in Microsoft Foundry with access to your repository's file system tools and test runner. Given a GitHub issue describing a new API endpoint requirement, have the model first write a functional specification decomposing the requirement into sub-tasks, then implement the endpoint across the relevant service files, write unit tests with at least 85% coverage, and return a pull request summary explaining each code change and its relationship to the original requirement. Flag any implementation decisions that deviate from the patterns found in the existing codebase.” Qwen3.5-9B Model Specs Parameters / size: 9B Context length: 262,144 tokens natively; extensible to 1,010,000 tokens Primary task: Image-text-to-text (multimodal reasoning) Why it’s interesting High intelligence density at small sizes: Qwen 3.5 Small models show large reasoning gains relative to parameter count, with the 4B and 9B variants outperforming other sub‑10B models on public reasoning benchmarks. Long‑context by default: Support for up to 262K tokens enables long‑document analysis, codebase review, and multi‑turn workflows without chunking. Native multimodal architecture: Vision is built into the model architecture rather than added via adapters, allowing small models (0.8B, 2B) to handle image‑text tasks efficiently. Open and deployable: Apache‑2.0 licensed models designed for local, edge, or cloud deployment scenarios. Benchmarks AI Model & API Providers Analysis | Artificial Analysis Try it Use case When to use Best‑practice prompt pattern Long‑context reasoning Analyzing full PDFs, long research papers, or large code repositories where chunking would lose context Set a clear goal and scope. Ask the model to summarize key arguments, surface contradictions, or trace decisions across the entire document before producing an output. Lightweight multimodal document understanding OCR‑driven workflows using screenshots, scanned forms, or mixed image‑text inputs Ground the task in the artifact. Instruct the model to first describe what it sees, then extract structured information, then answer follow‑up questions. With these best practices in mind, Qwen 3.5-9B demonstrates how compact, multimodal models can handle complex reasoning tasks without chunking or manual orchestration. The prompt below shows how an operations analyst might use the model to analyze a full report end‑to‑end: "You are assisting an operations analyst. Review the attached PDF report and extracted tables. Identify the three largest cost drivers, explain how they changed quarter‑over‑quarter, and flag any anomalies that would require follow‑up. If information is missing, state what data would be needed." Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation. Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry790Views0likes0CommentsMicrosoft Foundry: An End-to-End Platform for Building, Governing, and Scaling AI
Microsoft Foundry: What It Is and How to Get Started As organizations accelerate their adoption of AI, one challenge consistently emerges: how to move from experimentation to production at scale in a secure, responsible, and efficient way. Microsoft Foundry exists to address exactly that challenge. What Is Microsoft Foundry? Microsoft Foundry is an end-to-end platform experience that brings together Microsoft’s AI development, deployment, and governance capabilities into a unified environment. It enables developers, data scientists, and enterprises to build, customize, deploy, and operate AI solutions, including generative AI, using Microsoft and partner models, tools, and services. Rather than being a single product, Foundry is a curated and integrated experience that spans model access, tooling, orchestration, evaluation, and enterprise-grade controls. Why Microsoft Foundry Exists AI innovation is moving fast, but enterprise adoption requires more than just access to models. Customers need: A consistent way to work with multiple models, including Microsoft, OpenAI, and open-source models Built-in security, compliance, and responsible AI capabilities Tooling that supports the full AI lifecycle, not just prototyping Seamless integration with existing data platforms, applications, and cloud operations Microsoft Foundry was created to reduce friction between experimentation and real-world deployment while aligning AI development with enterprise standards, governance, and scale. What’s Included in Microsoft Foundry? Microsoft Foundry brings together several key capabilities: 1. Model Choice and Flexibility Access to leading foundation models, including Azure OpenAI models and selected open-source models Ability to evaluate and select models based on performance, cost, and use case 2. AI Development and Orchestration Tools Prompt engineering, fine-tuning, and grounding with enterprise data Tools for building copilots, chat experiences, and AI-powered applications Orchestration across tools, APIs, and workflows 3. Evaluation, Safety, and Responsible AI Built-in evaluation for quality, latency, and cost Content safety, monitoring, and governance controls Alignment with Microsoft’s Responsible AI principles 4. Enterprise-Grade Platform Integration Native integration with Azure, Microsoft Fabric, Power Platform, and developer tools Identity, security, and compliance through Microsoft Entra and Azure controls Observability and lifecycle management for production workloads How to Get Started Getting started with Microsoft Foundry is straightforward: Start in Azure - Use Azure as the control plane to access Foundry experiences and services. Explore models and tools - Experiment with available models, build prompts, and prototype AI workflows. Ground with your data - Connect enterprise data securely to create more relevant and contextual AI experiences. Evaluate and deploy - Use built-in evaluation and safety tools, then deploy AI solutions into production with confidence. Scale responsibly - Apply governance, monitoring, and cost controls as adoption grows. Final Thoughts Microsoft Foundry represents Microsoft’s vision for enterprise AI done right. It offers flexible model choice, strong development tooling, and built-in trust. By unifying AI development and operations into a single experience, Foundry helps organizations move faster while staying secure, compliant, and future-ready. Whether you are just starting with generative AI or scaling existing solutions, Microsoft Foundry provides a practical foundation to build on.1.3KViews1like0CommentsWhat is trending in Hugging Face on Microsoft Foundry? Feb, 2, 2026
Open‑source AI is moving fast, with important breakthroughs in reasoning, agentic systems, multimodality, and efficiency emerging every day. Hugging Face has been a leading platform where researchers, startups, and developers share and discover new models. Microsoft Foundry brings these trending Hugging Face models into a production‑ready experience, where developers can explore, evaluate, and deploy them within their Azure environment. Our weekly Model Monday’s series highlights Hugging Face models available in Foundry, focusing on what matters most to developers: why a model is interesting, where it fits, and how to put it to work quickly. This week’s Model Mondays edition highlights three Hugging Face models, including a powerful Mixture-of-Experts model from Z. AI designed for lightweight deployment, Meta’s unified foundation model for image and video segmentation, and MiniMax’s latest open-source agentic model optimized for complex workflows. Models of the week Z.AI’s GLM-4.7-flash Model Basics Model name: zai-org/GLM-4.7-Flash Parameters / size: 30B total -3B active Default settings: 131,072 max new tokens Primary task: Agentic, Reasoning and Coding Why this model matters Why it’s interesting: It utilizes a Mixture-of-Experts (MoE) architecture (30B total parameters and 3B active parameters) to offer a new option for lightweight deployment. It demonstrates strong performance on logic and reasoning benchmarks, outperforming similar sized models like gpt-oss-20b on AIME 25 and GPQA benchmarks. It supports advanced inference features like "Preserved Thinking" mode for multi-turn agentic tasks. Best‑fit use cases: Lightweight local deployment, multi-turn agentic tasks, and logical reasoning applications. What’s notable: From the Foundry catalog, users can deploy on a A100 instance or unsloth/GLM-4.7-Flash-GGUF on a CPU. ource SOTA scores among models of comparable size. Additionally, compared to similarly sized models, GLM-4.7-Flash demonstrates superior frontend and backend development capabilities. Click to see more: https://docs.z.ai Try it Use case Best‑practice prompt pattern Agentic coding (multi‑step repo work, debugging, refactoring) Treat the model as an autonomous coding agent, not a snippet generator. Explicitly require task decomposition and step‑by‑step execution, then a single consolidated result. Long‑context agent workflows (local or low‑cost autonomous agents) Call out long‑horizon consistency and context preservation. Instruct the model to retain earlier assumptions and decisions across turns. Now that you know GLM‑4.7‑Flash works best when you give it a clear goal and let it reason through a bounded task, here’s an example prompt that a product or engineering team might use to identify risks and propose mitigations: You are a software reliability analyst for a mid‑scale SaaS platform. Review recent incident reports, production logs, and customer issues to uncover edge‑case failures outside normal usage (e.g., rare inputs, boundary conditions, timing/concurrency issues, config drift, or unexpected feature interactions). Prioritize low‑frequency, high‑impact risks that standard testing misses. Recommend minimal, low‑cost fixes (validation, guardrails, fallback logic, or documentation). Deliver a concise executive summary with sections: Observed Edge Cases, Root Causes, User Impact, Recommended Lightweight Fixes, and Validation Steps. Meta's Segment Anything 3 (SAM3) Model Basics Model name: facebook/sam3 Parameters / size: 0.9B Primary task: Mask Generation, Promptable Concept Segmentation (PCS) Why this model matters Why it’s interesting: It handles a vastly larger set of open-vocabulary prompts than SAM 2, and unifies image and video segmentation capabilities. It includes a "SAM 3 Tracker" mode that acts as a drop-in replacement for SAM 2 workflows with improved performance. Best‑fit use cases: Open-vocabulary object detection, video object tracking, and automatic mask generation What’s notable: Introduces Promptable Concept Segmentation (PCS), allowing users to find all matching objects (e.g., "dial") via text prompt rather than just single instances. Try it This model enables users to identify specific objects within video footage and isolate them over extended periods. With just one line of code, it is possible to detect multiple similar objects simultaneously. The accompanying GIF demonstrates how SAM3 efficiently highlights players wearing white on the field as they appear and disappear from view. Additional examples are available at the following repository: https://github.com/facebookresearch/sam3/blob/main/assets/player.gif Use case Best‑practice prompt pattern Agentic coding (multi‑step repo work, debugging, refactoring) Treat SAM 3 as a concept detector, not an interactive click tool. Use short, concrete noun‑phrase concept prompts instead of describing the scene or asking questions. Example prompt: “yellow school bus” or “shipping containers”. Avoid verbs or full sentences. Video segmentation + object tracking Specify the same concept prompt once, then apply it across the video sequence. Do not restate the prompt per frame. Let the model maintain identity continuity. Example: “person wearing a red jersey”. Hard‑to‑name or visually subtle objects Use exemplar‑based prompts (image region or box) when text alone is ambiguous. Optionally combine positive and negative exemplars to refine the concept. Avoid over‑constraining with long descriptions. Using the GIF above as a leading example, here is a prompt that shows how SAM 3 turns raw sports footage into structured, reusable data. By identifying and tracking players based on visual concepts like jersey color so that sports leagues can turn tracked data into interactive experiences where automated player identification can relay stats, fun facts, etc when built into a larger application. Here is a prompt that will allow you to start identifying specific players across video: Act as a sports analytics operator analyzing football match footage. Segment and track all football players wearing blue jerseys across the video. Generate pixel‑accurate segmentation masks for each player and assign persistent instance IDs that remain stable during camera movement, zoom, and player occlusion. Exclude referees, opposing team jerseys, sidelines, and crowd. Output frame‑level masks and tracking metadata suitable for overlays, player statistics, and downstream analytics pipelines. MiniMax AI's MiniMax-M2.1 Model Basics Model name: MiniMaxAI/MiniMax-M2.1 Parameters / size: 229B-10B Active Default settings: 200,000 max new tokens Primary task: Agentic and Coding Why this model matters Why it’s interesting: It is optimized for robustness in coding, tool use, and long-horizon planning, outperforming Claude Sonnet 4.5 in multilingual scenarios. It excels in full-stack application development, capable of architecting apps "from zero to one”. Previous coding models focused on Python optimization, M2.1 brings enhanced capabilities in Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, JavaScript, and other languages. The model delivers exceptional stability across various coding agent frameworks. Best‑fit use cases: Lightweight local deployment, multi-turn agentic tasks, and logical reasoning applications. What’s notable: The release of open-source weights for M2.1 delivers a massive leap over M2 on software engineering leaderboards. https://www.minimax.io/ Try it Use case Best‑practice prompt pattern End‑to‑end agentic coding (multi‑file edits, run‑fix loops) Treat the model as an autonomous coding agent, not a snippet generator. Explicitly require task decomposition and step‑by‑step execution, then a single consolidated result. Long‑horizon tool‑using agents (shell, browser, Python) Explicitly request stepwise planning and sequential tool use. M2.1’s interleaved thinking and improved instruction‑constraint handling are designed for complex, multi‑step analytical tasks that require evidence tracking and coherent synthesis, not conversational back‑and‑forth. Long‑context reasoning & analysis (large documents / logs) Declare the scope and desired output structure up front. MiniMax‑M2.1 performs best when the objective and final artifact are clear, allowing it to manage long context and maintain coherence. Because MiniMax‑M2.1 is designed to act as a long‑horizon analytical agent, it shines when you give it a clear end goal and let it work through large volumes of information—here’s a prompt a risk or compliance team could use in practice: You are a financial risk analysis agent. Analyze the following transaction logs and compliance policy documents to identify potential regulatory violations and systemic risk patterns. Plan your approach before executing. Work through the data step by step, referencing evidence where relevant. Deliver a final report with the following sections: Key Risk Patterns Identified, Supporting Evidence, Potential Regulatory Impact, Recommended Mitigations. Your response should be a complete, executive-ready report, not a conversational draft. Getting started You can deploy open‑source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation. Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry1.3KViews0likes0CommentsImplementing MCP Remote Servers with Azure Function App and GitHub Copilot Integration
Introduction In the evolving landscape of AI-driven applications, the ability to seamlessly connect large language models (LLMs) with external tools and data sources is becoming a cornerstone of intelligent system design. Model Context Protocol (MCP) — a specification that enables AI agents to discover and invoke tools dynamically, based on context. While MCP is powerful, implementing it from scratch can be daunting !!! That’s where Azure Functions comes in handy. With its event-driven, serverless architecture, Azure Functions now supports a preview extension for MCP, allowing developers to build remote MCP servers that are scalable, secure, and cloud-native. Further, In VS Code, GitHub Copilot Chat in Agent Mode can connect to your deployed Azure Function App acting as an MCP server. This connection allows Copilot to leverage the tools and services exposed by your function app. Why Use Azure Functions for MCP? Serverless Simplicity: Deploy MCP endpoints without managing infrastructure. Secure by Design: Leverage HTTPS, system keys, and OAuth via EasyAuth or API Management. Language Flexibility: Build in .NET, Python, or Node.js using QuickStart templates. AI Integration: Enable GitHub Copilot, VS Code, or other AI agents to invoke your tools via SSE endpoints. Prerequisites Python version 3.11 or higher Azure Functions Core Tools >= 4.0.7030 Azure Developer CLI To use Visual Studio Code to run and debug locally: Visual Studio Code Azure Functions extension An storage emulator is needed when developing azure function app in VScode. you can deploy Azurite extension in VScode to meet this requirement. Press enter or click to view image in full size You can run the Azurite in VS Code as shown below. C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\IDE\Extensions\Microsoft\Azure Storage Emulator> .\azurite.exe Press enter or click to view image in full size alternatively, you can also run Azurite in docker container as shown below. docker run -p 10000:10000 -p 10001:10001 -p 10002:10002 \ mcr.microsoft.com/azure-storage/azurite For more information about setting up Azurite, visit Use Azurite emulator for local Azure Storage development | Microsoft Learn Github Repositories Following Github repos are needed to setup this PoC. Repository for MCP server using Azure Function App https://github.com/mafzal786/mcp-azure-functions-python.git Repository for AI Foundry agent as MCP Client https://github.com/mafzal786/ai-foundry-agent-with-remote-mcp-using-azure-functionapp.git Clone the repository Run the following command to clone the repository to start building your MCP server using Azure function app. git clone https://github.com/mafzal786/mcp-azure-functions-python.git Run the MCP server in VS Code Once cloned. Open the folder in VS Code. Create a virtual environment in VS Code. Change directory to “src” in a new terminal window, install the python dependencies and start the function host locally as shown below. cd src pip install -r requirements.txt func start Note: by default this will use the webhooks route: /runtime/webhooks/mcp/sse. Later we will use this in Azure to set the key on client/host calls: /runtime/webhooks/mcp/sse?code=<system_key> Press enter or click to view image in full size MCP Inspector In a new terminal window, install and run MCP Inspector. npx @modelcontextprotocol/inspector Click to load the MCP inspector. Also provide the generated proxy session token. http://127.0.0.1:6274/#resources In the URL type and click “Connect”: http://localhost:7071/runtime/webhooks/mcp/sse Once connected, click List Tools under Tools and select “hello_mcp” tool and click “Run Tool” for testing as shown below. Press enter or click to view image in full size Select another tool such as get_stockprice and run it as shown below. Press enter or click to view image in full size Deploy Function App to Azure from VS Code For deploying function app to azure from vs code, make sure you have Azure Tools extension enabled in VS Code. To learn more about Azure Tools extension, visit the following Azure Extensions if your VS code environment is not setup for Azure development, follow Configure Visual Studio Code for Azure development with .NET — .NET | Microsoft Learn Once Azure Tools are setup, sign in to Azure account with Azure Tools Press enter or click to view image in full size Once Sign-in is completed, you should be able to see all of your existing resources in the Resources view. These resources can be managed directly in VS Code. Look for Function App in Resource, right click and click “Deploy to Function App”. Press enter or click to view image in full size If you already have it deployed, you will get the following pop-up. Click “Deploy” Press enter or click to view image in full size This will start deploying your function app to Azure. In VS Code, Azure tab will display the following. Press enter or click to view image in full size Once the deployment is completed, you can view the function app and all the tools in Azure portal under function app as shown below. Press enter or click to view image in full size Get the mcp_extension key from Functions → App Keys in Function App. Press enter or click to view image in full size This mcp_extension key would be needed in mcp.json file in VS code, if you would like to test the MCP server using Github Copilot in VS Code. Your entries in mcp.json file will look like as below for example. { "inputs": [ { "type": "promptString", "id": "functions-mcp-extension-system-key", "description": "Azure Functions MCP Extension System Key", "password": true }, { "type": "promptString", "id": "functionapp-name", "description": "Azure Functions App Name" } ], "servers": { "remote-mcp-function": { "type": "sse", "url": "https://${input:functionapp-name}.azurewebsites.net/runtime/webhooks/mcp/sse", "headers": { "x-functions-key": "${input:functions-mcp-extension-system-key}" } }, "local-mcp-function": { "type": "sse", "url": "http://0.0.0.0:7071/runtime/webhooks/mcp/sse" } } } Test Azure Function MCP Server in MCP Inspector Launch MCP Inspector and provide the Azure Function in MCP inspector URL. Provide authentication as shown below. Bearer token is mcp_extension key. Testing an MCP server with GitHub Copilot Testing an MCP server with GitHub Copilot involves configuring and utilizing the server within your development environment to provide enhanced context and capabilities to Copilot Chat. Steps to Test an MCP Server with GitHub Copilot: Ensure Agent Mode is Enabled: Open Copilot Chat in Visual Studio Code and select “Agent” mode. This mode allows Copilot to interact with external tools and services, including MCP servers. Add the MCP Server: Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P) and run the command MCP: Add Server. Press enter or click to view image in full size Follow the prompts to configure the server. You can choose to add it to your workspace settings (creating a .vscode/mcp.json file) . Select HTTP or Server-Sent events Press enter or click to view image in full size Specify the URL and click Enter Press enter or click to view image in full size Provide a name of your choice Press enter or click to view image in full size Select scope as Global or workspace. I selected Workspace Press enter or click to view image in full size This will generate mcp.json file in .vscode or create a new entry if mcp.json already exists as shown below. Click Start to “start” the server. Also make sure your Azure function app is locally running with func start command. Press enter or click to view image in full size Now Type the prompt as shown below. Press enter or click to view image in full size Try another tool as below. Press enter or click to view image in full size VS code terminal output for reference. Press enter or click to view image in full size Testing an MCP server with Claude Desktop Claude Desktop is a standalone AI application that allows users to interact with Claude AI models directly from their desktop, providing a seamless and efficient experience. you can download Claude desktop at Download Claude In this article, I have added another tool to utilize to test your MCP server running in Azure Function app. Modify claude_desktop_config.json with the following. you can find this file in window environment at C:\Users\<username>\AppData\Roaming\Claude { "mcpServers": { "my mcp": { "command": "npx", "args": [ "mcp-remote", "http://localhost:7071/runtime/webhooks/mcp/sse" ] } } } Note: If claude_desktop_config.json does not exists, click on setting in Claude desktop under user and visit developer tab. You will see you MCP server in Claude Desktop as shown below. Press enter or click to view image in full size Type the prompt such as “What is the stock price of Tesla” . After submitting, you will notice that it is invoking the tool “get_stockprice” from the MCP server running locally and configured in the .json earlier. Click Allow once or Allow always as shown below. Following output will be displayed. Press enter or click to view image in full size Now lets try weather related prompt. As you can see, it has invoked “get_weatheralerts” tool from MCP server. Press enter or click to view image in full size Azure AI Foundry agent as MCP Client Use the following Github repo to set up Azure AI Foundry agent as MCP client. git clone https://github.com/mafzal786/ai-foundry-agent-with-remote-mcp-using-azure-functionapp.git Open the code in VS code and follow the instructions mentioned in README.md file at Github repo. Once you execute the code, following output will show up in VS code. Press enter or click to view image in full size In this code, message is hard coded. Change the content to “what is weather advisory for Florida” and rerun the program. It will call get_weatheralerts tool and output will look like as below. Press enter or click to view image in full size Conclusion The integration of Model Context Protocol (MCP) with Azure Functions marks a pivotal step in democratizing AI agent development. By leveraging Azure’s serverless architecture, developers can now build remote MCP servers that scale automatically, integrate seamlessly with other Azure services, and expose modular tools to intelligent agents like GitHub Copilot. This setup not only simplifies the deployment and management of MCP servers but also enhances the developer experience — allowing tools to be invoked contextually by AI agents in environments like VS Code, GitHub Codespaces, or Copilot Studio[2]. Whether you’re building a tool to query logs, calculate metrics, or manage data, Azure Functions provides the flexibility, security, and scalability needed to bring your AI-powered workflows to life. As the MCP spec continues to evolve, and GitHub Copilot expands its agentic capabilities, this architecture positions you to stay ahead — offering a robust foundation for cloud-native AI tooling that’s both powerful and future-proof.1.7KViews1like1CommentThe Future of AI: The paradigm shifts in Generative AI Operations
Dive into the transformative world of Generative AI Operations (GenAIOps) with Microsoft Azure. Discover how businesses are overcoming the challenges of deploying and scaling generative AI applications. Learn about the innovative tools and services Azure AI offers, and how they empower developers to create high-quality, scalable AI solutions. Explore the paradigm shift from MLOps to GenAIOps and see how continuous improvement practices ensure your AI applications remain cutting-edge. Join us on this journey to harness the full potential of generative AI and drive operational excellence.7.7KViews1like1CommentThe Future of AI: Maximize your fine-tuned model performance with the new Azure AI Evaluation SDK
In this article, we will explore how to effectively evaluate fine-tuned AI models using the new Azure AI Evaluation SDK. This comprehensive guide is the fourth part of our series on making large language model distillation easier. We delve into the importance of model evaluation, outline a systematic process for assessing the performance of a distilled student model against a baseline model, and demonstrate the use of advanced metrics provided by Azure's SDK. Join us as we navigate the intricacies of AI evaluation and provide insights for continuous model improvement and operational efficiency.1.8KViews1like0CommentsThe Future Of AI: Deconstructing Contoso Chat - Learning GenAIOps in practice
How can AI engineers build applied knowledge for GenAIOps practices? By deconstructing working samples! In this multi-part series, we deconstruct Contoso Chat (a RAG-based retail copilot sample) and use it to learn the tools and workflows to streamline out end-to-end developer journey using Azure AI Foundry.1.1KViews0likes0Comments