A New Chapter for Realtime AI: Reasoning, Translation, and Real-Time Transcription
Voice can be one of the most direct and productive interfaces for AI — enabling customer support agents that can resolve issues without a single keystroke, live multilingual communication that takes on language barriers as conversations happen, and voice assistants capable of reasoning through complex requests in real time. Developers building these experiences need models that can keep pace with increasingly demanding latency, accuracy, and language-coverage requirements.

Today, OpenAI’s GPT-realtime-translate, GPT‑realtime‑2, and GPT-realtime-whisper begin rolling out in Microsoft Foundry — together representing a significant step forward for the realtime model lineup available to developers on the platform.

GPT-realtime-translate and GPT-realtime-whisper

GPT-realtime-translate and GPT-realtime-whisper together extend the realtime stack for live multilingual audio workflows. GPT-realtime-translate is built for continuous, real-time translation, producing translated output as speech unfolds without relying on segmented pipeline processing, while GPT-realtime-whisper provides low-latency streaming transcription of the original audio in parallel. Used together, they help developers support scenarios such as live events, cross-language customer experiences, captions, monitoring, and archival workflows that require both translated output and visibility into the source speech.

- Continuous stream processing: translates live audio without segmenting or buffering, allowing for more natural interactions.
- New translation and transcription capabilities: translate between languages in real time with faster text-to-speech.
- Available via the Realtime API.

GPT-realtime-2

GPT‑realtime‑2 is a generational upgrade to OpenAI's speech-to-speech model, bringing internal reasoning and an expanded context window to real-time voice applications. Where previous speech-to-speech models responded immediately, GPT‑realtime‑2 can work through a problem before speaking — making it well suited for voice applications that need to handle complex, multi-step queries entirely in the audio layer without routing to a separate text pipeline.

- Native reasoning capability: the model thinks internally before responding.
- Adjustable reasoning effort via `reasoning.effort`: explicitly request the level of reasoning the model uses (minimal, low, medium, or high) to save on cost and latency. A minimal connection sketch follows.
- Audio in, audio out: no intermediary text step, so conversation stays fluid and natural.
- Available via the Realtime API.
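Here is a minimal sketch of connecting to a GPT-realtime-2 deployment over the Realtime API and requesting a specific reasoning effort. The model name, client configuration, and the exact session field carrying the effort level are assumptions based on the article's `reasoning.effort` notation; check the current API reference before relying on them.

```python
# A minimal sketch, not a verbatim API reference: open a realtime session
# and request high reasoning effort for a voice agent turn.
import asyncio
from openai import AsyncOpenAI


async def main():
    client = AsyncOpenAI()  # or an Azure/Foundry-configured client
    # The openai Python SDK exposes a beta realtime interface; the field
    # names inside `session` follow the article's reasoning.effort notation
    # and should be verified against the deployment's docs.
    async with client.beta.realtime.connect(model="gpt-realtime-2") as conn:
        await conn.session.update(session={
            "modalities": ["audio", "text"],
            "reasoning": {"effort": "high"},  # minimal | low | medium | high
        })
        await conn.response.create()
        async for event in conn:  # stream server events until the turn ends
            if event.type == "response.done":
                break


asyncio.run(main())
```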
Use cases

These models work independently, but they're designed to complement each other in real-world pipelines:

- Live multilingual events. GPT-realtime-translate enables real-time translation of live audio, producing translated speech along with a transcript in the target language. GPT‑realtime‑whisper can be used in parallel to capture a transcription of the original speech for captions, monitoring, or archival purposes. Together, they enable multilingual live streaming with both translated experiences and visibility into the source language.
- Global customer support. Route inbound calls through GPT-realtime-translate to translate conversations in real time and provide a translated transcript for agents. Use GPT‑realtime‑whisper alongside it to capture the original conversation as text for compliance, quality review, or analytics. Then pass the interaction to an agent built with GPT‑realtime‑2 using `reasoning.effort: high` for complex issue resolution, all within a continuous audio pipeline.
- International voice assistants. Build once and deploy across languages. GPT-realtime-translate enables multilingual interaction and provides translated output with a target-language transcript, while GPT‑realtime‑whisper can optionally capture the original user input as text. GPT‑realtime‑2 manages reasoning and conversational context, supporting more complex voice interactions.

Pricing (per 1M tokens unless noted)

| Model | Deployment | Modality | Input | Cached Input | Output |
|---|---|---|---|---|---|
| GPT-realtime-translate | Global Standard | Audio | $32.00 | $0.40 | $64.00 |
| | | Text | $4.00 | $0.40 | $24.00 |
| | | Image | $5.00 | $0.50 | -- |
| GPT-realtime-whisper | Global Standard | Audio | -- | -- | $0.034/minute |
| GPT-realtime-2 | Global Standard | Audio | -- | -- | $0.017/minute |

Getting Started

Looking for ways to dive in? GPT-realtime-translate, GPT-realtime-whisper, and GPT‑realtime‑2 are rolling out in Microsoft Foundry today. Explore the model catalog and start building: https://ai.azure.com

Governing Agent Sprawl: A Multi‑Region AI Agent Landing Zone on Azure (Reference Architecture)
It doesn’t take long for AI agents to get out of hand. In most enterprises, the first few agents are celebrated. A chatbot here. A document summarizer there. Then another team ships an agent that calls APIs. Someone else connects one to internal data. Within months, IT is staring at dozens—or hundreds—of autonomous systems running across subscriptions, regions, and tools.

At that point, the questions stop being about model quality and start being uncomfortable operational ones: Who owns this agent? What data can it access? What happens if it misbehaves? Why did it just consume half our monthly token budget in a day?

Developers can build an AI agent in minutes—the difficult part is understanding what agents are doing, how they perform, and whether they comply with organizational policy. Signals scatter across tools, context is lost, and governance becomes reactive. This reference architecture exists to solve that problem. It describes a multi‑region AI agent landing zone on Azure that treats agents as first‑class, governable workloads—provisioned automatically, constrained by policy, and observable from day one.

The architectural principle: separate control from execution

The design starts with a simple but non‑negotiable rule: control plane concerns must be separated from runtime concerns. Azure landing zones already follow this model. Management groups, Azure Policy, and RBAC are global constructs. Workloads run in regions. This architecture applies the same discipline to AI agents. The runtime plane is where agents execute, models infer, and data flows—often in multiple Azure regions. The control plane is where identity, policy, safety, evaluation, and oversight live—independent of region. This separation is what allows teams to scale agents without losing control.

Layer 1: Azure AI Gateway — governing every request

The first control layer sits directly in the request path. The AI gateway in Azure API Management provides a policy‑enforcement and observability layer in front of AI models, agents, and tools. It is not a separate service—it extends Azure API Management. Everything flows through it:

- Microsoft Foundry model deployments
- Azure AI Model Inference API endpoints
- OpenAI‑compatible third‑party models
- Self‑hosted models
- MCP servers and A2A agent APIs (preview)

What the gateway actually enforces

This layer is intentionally narrow and operational:

- Token quotas and rate limits. The llm-token-limit policy (GA) enforces tokens‑per‑minute or quota ceilings per consumer before requests reach the backend. This prevents one application—or one agent—from exhausting shared capacity.
- Content safety at ingress. The llm-content-safety policy (GA) integrates Azure AI Content Safety to moderate prompts automatically. Unsafe requests never reach the model.
- Traffic routing and resiliency. Azure API Management supports multi‑region gateway deployment (Premium tier). If a region fails, traffic routes to the next closest gateway automatically.
- Observability. Token usage, prompts, and completions are logged to Azure Monitor and Application Insights using built‑in policies such as llm-emit-token-metric.

The gateway does not understand agent intent or business context. That is by design. It governs traffic, not behavior. From a consuming application's perspective, the gateway simply looks like the model endpoint; a minimal client sketch follows.
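The sketch below shows what calling a model through the AI gateway might look like from application code. The gateway URL, subscription-key header name (which depends on how APIM is configured), and deployment name are placeholders; the 429 handling reflects the llm-token-limit behavior described above.

```python
# A minimal sketch, assuming an Azure API Management AI gateway fronting
# a Foundry model deployment. All names and URLs are placeholders.
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://contoso-apim.azure-api.net/openai/v1",  # hypothetical gateway URL
    api_key="unused",  # auth is carried by the APIM subscription key below
    default_headers={"Ocp-Apim-Subscription-Key": "<apim-subscription-key>"},
)


def ask(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4.1",  # placeholder deployment name
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except RateLimitError:
            # The llm-token-limit policy returns 429 once the caller's
            # token quota is exhausted; back off and retry.
            time.sleep(2 ** attempt)
    raise RuntimeError("token quota still exhausted after retries")
```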
Layer 2: Azure AI Foundry Control Plane — governing behavior at scale

The second layer governs what agents do, not just how requests flow. Azure AI Foundry Control Plane provides a unified management surface for AI agents, models, and tools across projects and subscriptions. It is designed specifically for agentic systems. Foundry Control Plane is currently in public preview.

What Foundry Control Plane adds

- Fleet‑wide inventory. Every agent, model, and tool appears in a single, searchable view across projects.
- Continuous evaluation on production traffic. Foundry runs evaluations that measure task adherence, groundedness, tool‑call accuracy, sensitive data exposure, and other agent‑specific risk dimensions.
- Centralized guardrails. Policy is enforced across inputs, outputs, and tool interactions—not just prompts. Bulk remediation can be applied across the fleet.
- Security integration. Foundry integrates with Microsoft Entra for agent identity (Entra Agent ID), Microsoft Defender for threat signals, and Microsoft Purview for data protection and compliance visibility.

Foundry Control Plane also requires an AI Gateway to be configured for advanced governance scenarios—reinforcing the layered approach.

Layer 3: Microsoft Agent 365 — enterprise oversight, not just Azure oversight

The third layer exists because Azure governance alone is not enough. Agents don’t just call APIs. They act on behalf of users. They access enterprise data. They operate inside Microsoft 365 workflows. Microsoft Agent 365 is the tenant‑level control plane for AI agents. It brings agents under the same administrative model used for users and applications.

Status: Frontier Preview. General availability: May 1, 2026.

Why this layer matters

Agent 365 introduces controls that Azure alone cannot provide:

- Agent registry. A single inventory of all agents in the tenant—including sanctioned and shadow agents. Unsanctioned agents can be quarantined.
- Identity‑first access control. Every agent is issued an Entra agent ID. Conditional Access policies apply to agents the same way they do to users.
- Human‑in‑the‑loop oversight. Agents surface in Microsoft 365 admin workflows, not just Azure portals.
- Security and compliance. Defender and Purview extend threat detection and data protection policies to agent activity.

Agent 365 does not replace Foundry Control Plane. It complements it—connecting agent operations to enterprise identity, compliance, and productivity systems.

How the pieces work together

Individually, these services are powerful. The architecture works because they are deliberately layered.

External approval → automated provisioning

When a use case is approved in an external governance system, it triggers an Azure DevOps pipeline using the REST API. That pipeline:

- Provisions subscriptions and resource groups
- Deploys Foundry projects
- Configures Azure API Management with AI Gateway policies
- Enables monitoring and logging

Governance is applied before the first request is made. A minimal sketch of the approval-to-pipeline hook appears below.
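This sketch shows one way the external approval step could trigger the provisioning pipeline via the Azure DevOps REST API. The organization, project, pipeline id, PAT handling, and the `templateParameters` keys are placeholders for whatever the provisioning pipeline actually accepts.

```python
# A minimal sketch of the approval-to-provisioning hook: when a use case
# is approved, queue the Azure DevOps pipeline that provisions the
# agent landing zone.
import base64
import requests


def trigger_provisioning(org: str, project: str, pipeline_id: int,
                         pat: str, use_case_id: str) -> dict:
    url = (f"https://dev.azure.com/{org}/{project}"
           f"/_apis/pipelines/{pipeline_id}/runs?api-version=7.1")
    # Azure DevOps PATs are sent as HTTP basic auth with an empty username.
    auth = base64.b64encode(f":{pat}".encode()).decode()
    body = {
        # Parameters the provisioning pipeline is assumed to accept.
        "templateParameters": {"useCaseId": use_case_id},
    }
    resp = requests.post(url, json=body,
                         headers={"Authorization": f"Basic {auth}"})
    resp.raise_for_status()
    return resp.json()  # includes the queued run's id and state
```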
One policy model, many regions

Azure landing zones are region‑agnostic at the governance layer. This architecture follows that guidance.

- Policies and RBAC apply globally
- AI Gateway enforces limits locally in each region
- Runtime services scale region by region

Expanding to a new region does not introduce a new governance model—only new capacity.

A single operational view

Signals flow upward:

- AI Gateway emits traffic and usage metrics
- Foundry Control Plane correlates evaluations, guardrail enforcement, and security alerts
- Agent 365 aggregates tenant‑level identity, compliance, and threat signals

Operations teams no longer hunt across dashboards. They work from one prioritized view, with context intact.

What this architecture deliberately does not promise

This is a reference architecture, not a silver bullet. It does not eliminate the need for:

- Clear agent ownership
- Business‑level approval processes
- Ongoing evaluation of agent usefulness

What it does provide is a foundation—one that lets organizations scale agentic AI without accepting chaos as the cost of innovation.

Closing thoughts

Agent sprawl is not a tooling failure. It’s an architectural one. By separating control from execution, layering governance where it belongs, and aligning AI operations with existing Azure and Microsoft 365 control planes, this architecture gives enterprises a way to move fast without losing sight of what their agents are doing. That’s the difference between experimentation—and production.

Co-contributor: Jorge Pena Alarcon, Sr. Cloud & AI Specialist

References (official Microsoft sources)

- Azure AI Gateway in Azure API Management
- Configure AI Gateway for Foundry
- Foundry Control Plane overview
- Microsoft Agent 365 announcement
- Agent 365 GA announcement
- Azure landing zones and regions
- Azure DevOps pipeline REST API

Now in Foundry: IBM Granite 4.1, NVIDIA Nemotron Nano Omni, and Qwen3.6-35B-A3B
This week Microsoft Foundry adds two major model families alongside a reasoning powerhouse, spanning the full spectrum from specialized speech and vision to general-purpose coding and long-context analysis. IBM's Granite 4.1 is a family of 10: six LLMs across 3B, 8B, and 30B sizes in both full-precision and FP8 variants, plus a safety model, a vision-language model for document extraction, and a multilingual speech recognition model. NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning brings multimodal capability—video, audio, image, and text—to a 31B Mamba2-Transformer Hybrid Mixture-of-Experts (MoE) architecture that activates only 3B parameters per forward pass; three variants are available in Foundry (BF16, FP8, and NVFP4), with the FP8 variant featured here. Qwen3.6-35B-A3B is designed for agentic coding among open models, with thinking preservation across conversation turns and a context window extensible to 1 million tokens.

Models of the week

IBM: granite-4.1-30b

Model Specs

- Parameters / size: 30B (flagship of the Granite 4.1 family)
- Context length: 131,072 tokens
- Primary task: text generation (multilingual instruction following, RAG, tool calling, code, summarization)

Why it's interesting

- A deployment-ready stack. The Granite 4.1 release brings 10 models to Microsoft Foundry. The LLM lineup covers granite-4.1-3b-instruct, granite-4.1-8b-instruct, and granite-4.1-30b-instruct with FP8 variants for each, plus granite-guardian-4.1-8b for safety, granite-vision-4.1-4b for document and chart understanding, and granite-speech-4.1-2b for multilingual speech recognition. Teams can mix and match model sizes and modalities from a single provider.
- Strong instruction following and reasoning at the 30B scale. granite-4.1-30b-instruct scores 80.16 on MMLU, 64.09 on MMLU-Pro, 83.74 on Big-Bench Hard (BBH), 77.80 on AGI Eval, 45.76 on GPQA (Graduate-Level Google-Proof Q&A, a graduate-level science reasoning benchmark), and 89.65 average on IFEval (instruction following). These results reflect SFT and reinforcement learning post-training focused specifically on instruction compliance, tool-calling accuracy, and long-context retrieval. (View benchmarks here)
- Enhanced tool calling and 12-language support. Granite 4.1 models are trained for structured function calling and support 12 languages—Arabic, Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish—with dialog, extraction, and summarization capabilities.
- Safety and multimodal coverage within the same family. The inclusion of granite-guardian-4.1-8b (a safety classifier for detecting harmful content and prompt injections), granite-vision-4.1-4b (a vision-language model optimized for document extraction from PDFs, charts, and tables), and granite-speech-4.1-2b (a 2B multilingual automatic speech recognition model) means teams can address safety, document parsing, and audio ingestion within the same model family—reducing integration complexity across a full pipeline.
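The sketch below shows the structured-extraction pattern used in the sample prompt that follows, assuming granite-4.1-30b-instruct is deployed behind an OpenAI-compatible endpoint in Foundry. The endpoint URL, key, model name, and JSON schema are placeholders, and `response_format` support depends on the deployment.

```python
# A minimal sketch of structured compliance extraction against a
# hypothetical granite-4.1-30b-instruct deployment in Foundry.
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    base_url="https://<foundry-endpoint>/v1",  # placeholder
    api_key="<api-key>",
)

system = (
    "You are a compliance analysis assistant. Extract every regulatory "
    "requirement from the document and return a JSON array of objects "
    "with fields: requirement, applies_to, deadline, penalty."
)

resp = client.chat.completions.create(
    model="granite-4.1-30b-instruct",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": Path("policy_document.txt").read_text()},
    ],
    response_format={"type": "json_object"},  # if the deployment supports it
)
print(resp.choices[0].message.content)
```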
Try it

| Use case | Prompt pattern |
|---|---|
| Multilingual RAG | Submit retrieved document passages in any of 12 supported languages; ask the model to synthesize and cite sources |
| Agentic tool calling | Provide function definitions + user goal; model plans and executes tool calls in structured format |
| Document extraction (granite-vision-4.1-4b) | Submit PDF page image; extract tables, key figures, or form fields as structured JSON |
| Safety classification (granite-guardian-4.1-8b) | Pass user input or model output; receive structured risk assessment before serving response |

Sample prompt for an enterprise document processing deployment:

You are building a multilingual document intelligence pipeline for a global financial institution. Using the granite-4.1-30b-instruct endpoint deployed in Microsoft Foundry, submit each incoming policy or regulatory document with the following system instruction: "You are a compliance analysis assistant. Review the document and extract: (1) all regulatory requirements described, (2) the entities to which each requirement applies, (3) any compliance deadlines mentioned, and (4) any penalties or consequences for non-compliance. Return the output as a structured JSON array with one entry per requirement." For documents that include scanned pages, first route them through granite-vision-4.1-4b to extract text and table content before passing to the 30B model for compliance analysis. Pass all user-facing outputs through granite-guardian-4.1-8b to screen for sensitive information before returning results.

NVIDIA: Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8

Model Specs

- Parameters / size: 31B total, ~3B activated per forward pass (Mamba2-Transformer Hybrid Mixture-of-Experts)
- Context length: 256,000 tokens
- Primary task: video-audio-image-text-to-text (multimodal understanding, reasoning, tool calling)

Why it's interesting

- Multimodal input from a single efficient endpoint. Nemotron-3-Nano-Omni-30B-A3B-Reasoning supports video (up to 2 minutes), audio (up to 1 hour), images (RGB), and text—all from a single model deployed in Microsoft Foundry. Three variants are available in Foundry: full-precision BF16, FP8, and NVFP4. Paper: Nemotron Nano Omni technical report.
- Strong results across vision, document, video, and audio benchmarks. With reasoning mode enabled, the model scores 82.8 on MathVista-MINI (visual math reasoning), 67.04 on OCRBenchV2-EN (document OCR), 63.6 on Charxiv Reasoning (chart understanding), 72.2 on Video MME (video Q&A), 74.52 on Daily Omni (video+audio omnimodal understanding), and 89.39 on VoiceBench (speech instruction following). On OSWorld (a GUI agent benchmark measuring autonomous computer use), it scores 47.4—a notable result for a model at the 3B active parameter scale. (See the model cards above for further benchmark data.)
- Mamba2-Transformer Hybrid MoE for efficient long-context inference. The model's layers alternate between Mamba2 state-space blocks (which process sequences with linear rather than quadratic cost) and standard Transformer attention blocks, combined with Mixture-of-Experts feedforward layers. Only ~3B parameters are activated per token despite the 31B total, making the 256K context window practically usable at lower compute cost than a comparably sized dense model.
- Word-level timestamps, JSON output, and tool calling for structured media workflows. The model produces word-level timestamps from audio, enabling precise transcript-to-timecode alignment for review and indexing workflows. Combined with JSON-structured output, chain-of-thought reasoning, and native tool calling, it can serve as an agentic step that ingests raw media (meeting recordings, M&E assets, training videos) and produces structured outputs for downstream systems without requiring separate transcription or OCR preprocessing stages.
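A sketch of the meeting-intelligence pattern is shown below, assuming the Nemotron FP8 variant sits behind an OpenAI-compatible chat-completions endpoint in Foundry. The audio content part follows the OpenAI `input_audio` convention; the endpoint, key, and exact multimodal payload shape should be verified against the deployment's documentation.

```python
# A minimal sketch: send an audio recording and ask for a timestamped
# transcript plus action items as JSON. All names/URLs are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://<foundry-endpoint>/v1", api_key="<key>")

audio_b64 = base64.b64encode(open("meeting.wav", "rb").read()).decode()

resp = client.chat.completions.create(
    model="Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Transcribe this recording with word-level timestamps, then "
                "list action items. Return JSON: "
                "{\"transcript\": [...], \"actions\": [...]}"
            )},
            # OpenAI-style audio content part; confirm the deployment's
            # accepted multimodal format before use.
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```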
Try it

| Use case | Prompt pattern |
|---|---|
| Meeting intelligence | Submit audio recording (up to 1 hr); extract transcript with word-level timestamps, action items, and decisions as structured JSON |
| Video content analysis | Submit video clip (up to 2 min) + query; retrieve timestamped summary of key events or spoken content |
| Document + audio joint analysis | Submit scanned document image alongside narrated walkthrough audio; extract and reconcile information from both modalities |
| Multimodal tool calling | Provide tool definitions + combined image/audio input; model reasons over content and executes structured tool calls |

Sample prompt for a media and compliance deployment:

You are building a broadcast compliance review system for a media company. Using the Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 endpoint deployed in Microsoft Foundry, submit each recorded segment as video input with the following instruction: "Review this video segment and produce a compliance report as a JSON object with the following fields: transcript (full text with word-level timestamps), flagged_segments (array of objects with start_time, end_time, content, and reason for flagging), speaker_count (estimated number of distinct speakers), and compliance_summary (overall assessment). Flag any content that includes unverified factual claims, restricted product categories, or regulatory disclosures that may be incomplete." Use the word-level timestamps from the compliance report to route flagged segments directly to the editorial review queue with precise timecode references.

Qwen: Qwen3.6-35B-A3B

Model Specs

- Parameters / size: 35B total, 3B activated (Mixture-of-Experts)
- Context length: 262,144 tokens natively, extensible to 1,010,000 tokens
- Primary task: image-text-to-text (agentic coding, reasoning, vision)

Why it's interesting

- Agentic coding improvements over Qwen3.5-35B-A3B. Qwen3.6-35B-A3B scores 73.4 on SWE-bench Verified (vs. 70.0 for Qwen3.5-35B-A3B and 52.0 for Gemma 4 31B), 67.2 on SWE-bench Multilingual (vs. 60.3 and 51.7), and 49.5 on SWE-bench Pro (vs. 44.6 and 35.7). Terminal-Bench 2.0 reaches 51.5 (vs. 40.5 and 42.9). The update specifically targets frontend workflows and repository-level reasoning, areas where earlier Qwen3.5 iterations showed gaps. Blog post: Qwen3.6-35B-A3B.
- Hybrid architecture: Gated DeltaNet and Mixture-of-Experts. The model's 40 layers alternate between Gated DeltaNet blocks (a form of linear attention that avoids the quadratic cost of standard self-attention), Gated Attention blocks (using Grouped Query Attention with 16 query heads and 2 key-value heads), and Mixture-of-Experts (MoE) feedforward layers with 256 experts (8 routed + 1 shared active per token). Only 3B parameters are activated per forward pass, keeping inference cost comparable to a 3B dense model while retaining the capacity of a 35B model for knowledge and specialization.
- Thinking preservation across conversation turns. Qwen3.6 introduces an option to retain reasoning context from previous messages in multi-turn conversations. In prior models, chain-of-thought traces were stripped between turns, requiring the model to re-derive context it had already reasoned through. With thinking preservation enabled, iterative coding workflows—such as debugging across multiple exchanges—benefit from accumulated reasoning without repeating earlier analysis.
- Natively extensible to a 1 million token context. The 262K native context is already among the largest in open models at this size, and the architecture supports extension to 1,010,000 tokens. On GPQA Diamond (science reasoning), Qwen3.6-35B-A3B scores 86.0—above both Gemma 4 31B (84.3) and Qwen3.5-27B (85.5)—while matching Gemma 4 31B on MMLU Pro (85.2) and LiveCodeBench v6 (80.4 vs. 80.0).
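Here is a sketch of the three-turn workflow from the sample prompt below. The endpoint, key, and model name are placeholders, and the flag enabling thinking preservation is an assumption; check the deployment's documentation for the actual parameter name and whether it must be set server-side.

```python
# A minimal sketch of a multi-turn agentic coding session against a
# hypothetical Qwen3.6-35B-A3B deployment in Foundry.
from openai import OpenAI

client = OpenAI(base_url="https://<foundry-endpoint>/v1", api_key="<key>")
history = []  # full message history is resent each turn


def turn(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    resp = client.chat.completions.create(
        model="Qwen3.6-35B-A3B",
        messages=history,
        extra_body={"preserve_thinking": True},  # assumed flag name
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply


print(turn("Here is the repo tree ...; plan the API endpoint change."))
print(turn("Here are the relevant source files ...; produce a unified diff."))
print(turn("Here is the test suite ...; add edge-case unit tests."))
```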
Try it

| Use case | Prompt pattern |
|---|---|
| Repository-level code change | Provide repository structure + task description; model plans file edits and outputs a unified diff |
| Multi-turn iterative debugging | Enable thinking preservation; submit failing test + code across multiple turns; accumulate reasoning context |
| Frontend code generation | Provide design spec or screenshot + existing codebase context; generate component implementation |
| Long-document reasoning | Submit technical specification (up to 262K tokens); ask model to identify ambiguities or implementation gaps |

Sample prompt for a software engineering deployment:

You are building an automated code review and implementation assistant for a platform engineering team. Using the Qwen3.6-35B-A3B endpoint deployed in Microsoft Foundry, enable thinking preservation for multi-turn sessions. In the first turn, submit the repository file tree and a GitHub issue describing a required API endpoint change. Prompt the model: "Review the repository structure and describe your implementation plan, including which files need to change and why." In the second turn, submit the relevant source files and prompt: "Based on your earlier plan, implement the changes and produce a unified diff." In the third turn, submit the test suite and prompt: "Write additional unit tests for the new endpoint, covering edge cases identified in your reasoning." The thinking preservation feature ensures the model carries forward its understanding of the codebase across all three turns without re-explaining context.

Getting started

You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub: select any supported model and choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured.

- Learn how to discover and deploy models in the Microsoft Foundry documentation
- Follow along with the Model Mondays series and its GitHub repo to stay up to date
- Read the Hugging Face on Azure docs
- Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry
- Explore models in Microsoft Foundry
Introducing OpenAI's newest chat model in Microsoft Foundry

OpenAI's GPT-5.5 Instant (Chat-latest in the API) begins rolling out in Microsoft Foundry today as GPT-chat-latest. Built on GPT-5.4 and GPT-5.3-chat, the new model delivers measurable gains in factual accuracy, tool calling, and response efficiency—improvements that translate directly into more reliable production deployments. GPT-chat-latest is designed for the workflows builders are actually shipping: multi-turn assistants, agentic systems that orchestrate tools, and retrieval-grounded applications where precision and grounding matter as much as conversational quality.

Why the name is changing

In Microsoft Foundry, we are introducing GPT-chat-latest as the product name for this release, while the model continues to follow the existing preview lifecycle and standard notice periods. We are also evaluating ways to simplify how customers access continuously updated models over time, but current behavior remains unchanged while that work continues.

Smarter, more factually reliable

GPT-chat-latest closes the factuality gap from prior iterations with significant reductions in hallucinations, especially in domains where accuracy matters most. According to OpenAI, the new model produces 52.5% fewer hallucinations and reduces hallucinated claims by 37.3% on conversations previously flagged for factual errors when compared to GPT-5.3-chat. These gains extend beyond text: GPT-chat-latest shows improvements in visual reasoning, expert multimodal understanding, and STEM tasks, with measurable lifts across standard benchmarks:

| Benchmark | GPT-5.3-chat | GPT-chat-latest |
|---|---|---|
| CharXiv-reasoning (scientific chart reasoning) | 75.0 | 81.6 |
| MMMU-Pro (expert multimodal reasoning) | 69.2 | 76.0 |
| GPQA (PhD-level science questions) | 78.5 | 85.6 |
| AIME 2025 (competition math) | 65.4 | 81.2 |

*Data shown comes from OpenAI's testing.

For builders shipping into regulated workloads such as clinical decision support, legal research, financial advisory, and technical analysis, these improvements raise the bar on the kinds of applications GPT-chat-latest can assist with.

More efficient outputs

GPT-chat-latest produces responses that are more to-the-point without losing substance. The model reduces verbosity and over-formatting, asks fewer follow-up questions, and avoids the cluttered output patterns that often require post-processing in production UIs. For builders, this can translate to two concrete benefits: lower output token costs at scale, and cleaner responses that drop into product surfaces with less downstream cleanup. In comparative testing from OpenAI, GPT-chat-latest produced roughly 25–30% fewer words than GPT-5.3-chat across a range of common prompts while preserving response quality, and in many cases improving it.

Improved intelligence and tool calling

GPT-chat-latest introduces measurable improvements in how the model interacts with tools, including better judgment about when and how to invoke them. The model produces more structured and context-aware tool invocation outputs, which is particularly relevant for workflows that rely on function calling, retrieval-augmented generation, and multi-step reasoning. Equally important, the model is better at deciding whether a tool is needed in the first place, reducing unnecessary tool calls in scenarios where it already has the information to answer directly. A minimal sketch of this pattern follows.
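The sketch below illustrates the tool-calling judgment described above with a standard function-calling request. The endpoint, key, model name, and the order-lookup tool are placeholders.

```python
# A minimal sketch: give the model one tool and let it decide whether
# the tool is actually needed. All names/URLs are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://<foundry-endpoint>/v1", api_key="<key>")

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order's status from the order system.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-chat-latest",
    messages=[{"role": "user", "content": "Where is order 8841?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model decided a tool call is needed
    args = json.loads(msg.tool_calls[0].function.arguments)
    print("model requested lookup_order with", args)
else:  # it answered directly, avoiding an unnecessary call
    print(msg.content)
```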
Improved search and context handling

GPT-chat-latest includes targeted improvements to how the model retrieves, interprets, and synthesizes information when search is involved, with enhancements to query formulation, result ranking, and filtering, plus more grounded synthesis of retrieved content into final responses. These changes improve handling of ambiguous or underspecified queries and reduce noise in answers that depend on retrieved content. The model also makes better use of the context developers pass in, including system prompts, conversation history, retrieved documents, and structured data. Applications that maintain long-running state or stitch together multiple retrieval steps produce more coherent, context-aware outputs without developers having to over-engineer prompt scaffolding.

Use cases: when to choose the chat model

Developers typically choose a chat-optimized model like GPT-5.5-chat when the application needs to sustain multi-turn conversations while reliably following instructions and coordinating external tools. This is a fit for assistants and agentic workflows where the model must interpret user intent over time, decide when to retrieve additional context, and produce structured outputs for downstream systems rather than just generate free-form text.

- Customer support and contact centers: virtual agents that maintain conversational context across a case, retrieve policy or product documentation via search, and hand off to a ticketing or CRM system through tool calls when escalation is needed.
- Retail and e-commerce: shopping and service assistants that clarify preferences over multiple turns, reference catalogs and policies via retrieval, and generate structured actions such as returns, exchanges, and order lookups through integrated tools.
- Manufacturing and field service: technician-facing assistants that combine conversational guidance with retrieval of manuals and work instructions, plus structured task creation in maintenance systems.

| Use GPT-chat-latest | Use GPT-5.5 Reasoning |
|---|---|
| Multi-turn assistants and customer-facing chat experiences | Harder problems that benefit from more deliberate, step-by-step thinking |
| Agentic workflows that coordinate tools (search, retrieval, ticketing, CRM) and benefit from structured tool outputs | Complex analysis, planning, or decision support where correctness matters more than conversational flow |
| Interactive experiences where you want quick back-and-forth clarification and task completion | Tasks involving multi-constraint reasoning (policy interpretation, detailed requirements, long-horizon plans) |
| RAG-based apps where the model must decide when to retrieve and then synthesize grounded answers | Offline or low-tool scenarios where the main value is deeper reasoning over provided context |

Pricing

| Model | Input ($/1M tokens) | Cached input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|---|
| GPT-chat-latest | $5 | $0.50 | $30 |

Responsible AI in Microsoft Foundry

At Microsoft, our mission to empower people and organizations remains constant. In the age of AI, trust is foundational to adoption, and earning that trust requires a commitment to transparency, safety, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy models responsibly in production environments, aligned with Microsoft's Responsible AI principles.

Getting started

GPT-chat-latest is rolling out in Microsoft Foundry today.
Architecture to Resilience: A Decision Guide

Start with the framework, accelerate with the tool. Watch the video walkthrough.

The Application Resilience Framework originated from a practical gap we saw in resilience reviews: teams had architecture diagrams, monitoring data, incident history, and runbooks, but no consistent way to connect them into a measurable resilience model. The framework closes that gap by turning architecture context into a structured lifecycle for risk identification, mitigation validation, health modeling, and governance. It aligns closely with the Reliability pillar of the Azure Well-Architected Framework, especially the guidance around identifying critical flows, performing Failure Mode Analysis, defining reliability targets, and building health models.

The Application Resilience Framework Tool helps teams apply this framework faster by starting with artifacts they already have, such as data flow diagrams or sequence diagrams in Mermaid or image format. From those artifacts, the tool creates the first version of a resilience model by extracting workflows, application components, platform components, dependencies, and initial failure modes, then guides the team through the decisions needed to make resilience measurable: one import step followed by four phases.

Import Artifacts -> Phase 1: Failure Mode Analysis -> Phase 2: Mitigation and Validation -> Phase 3: Health Model Mapping -> Phase 4: Operations and Governance

It is not a replacement for WAF guidance or Resilience Hub style assessments. It is a practical way to operationalize those concepts at the workload and workflow level, producing prioritized risks, mitigation plans, validation paths, health signals, dashboards, reports, and governance ownership.

How to use this guide

This guide follows the same flow as the tool. For each step, it covers:

- The decision: what needs to be decided?
- The options: what paths are available?
- The guidance: when each option fits

Use this with the video walkthrough. The video shows the tool in action. This guide explains the choices behind each step.

Question 1: What artifact should you import first?

The import step creates the starting point for the model. Regardless of the input path, the output is the same: workflows that move into Phase 1: Failure Mode Analysis.

Options

| Import option | Best for | What happens |
|---|---|---|
| Data flow diagram | System, module, data movement, and dependency views | If imported as an image, the tool breaks it into sequence-style flows. Selected flows become workflows. |
| Sequence diagram | Transaction flow and service interaction views | Converted directly into workflows. |
| Mermaid input | Diagrams maintained as code in Mermaid format | Converted directly into workflows. |
| Image input | JPG or PNG diagrams | Azure Foundry Vision models interpret the image and convert it into workflows. |
| Manual entry | Missing or incomplete diagrams | User creates or corrects workflows manually. |

When to pick which

Use data flow diagrams for system and dependency views. Use sequence diagrams for transaction or interaction views. Regardless of import path, the output is the same: workflows, components, dependencies, and initial failure modes ready for Phase 1.
Question 2: Which workflows should be analyzed first?

Phase 1 is Failure Mode Analysis. This is where the tool identifies what can fail and how important each failure is.

Options

- Critical user flows: login, checkout, payment, onboarding, request processing.
- High-risk platform flows: database writes, queue processing, storage access, identity, messaging, external APIs.
- Known issue areas: workflows with recent incidents, recurring alerts, or customer impact.

When to pick which

Start where failure creates the highest customer or business impact. The goal is not to model everything at once. The goal is to model the right thing first.

Deliverables

- Failure Mode Analysis catalog
- RPV risk scores
- Criticality classification

Question 3: How should failure modes be prioritized?

After workflows and components are imported, the tool helps score each failure mode using Risk Priority Value (RPV), which combines four factors: impact, likelihood, detectability, and outage severity. (A toy scoring sketch appears after Question 6.)

Options

- Use generated failure modes and scores: best for a fast first pass.
- Tune the RPV scores with engineering input: best when workload context matters.
- Add custom failure modes: best when known risks come from incidents, reviews, or customer experience.

When to pick which

Use the generated model to accelerate the first pass, then adjust it with real system knowledge. The goal is not to create the longest list of risks. The goal is to identify the risks that deserve attention first.

Deliverables

- Failure Mode Catalog
- RPV risk scores
- Prioritized criticality list

Question 4: Are mitigations defined or validated?

Phase 2 is Mitigation and Validation. This is where each failure mode gets a response plan.

Options

- Detection only: the team can detect the failure, but the response is not defined.
- Defined mitigation: the response is documented, such as retry, fallback, failover, scaling, restore, or rebalance.
- Validated mitigation: the response has been tested through a controlled validation or chaos test.

When to pick which

For low-risk items, documented mitigation may be enough. For critical and high-risk items, validation is key. A mitigation that has not been tested is still an assumption.

Deliverables

- Mitigation playbooks
- Chaos test plans
- Support playbooks

Question 5: Which risks need health signals?

Phase 3 is Health Model Mapping. This is where the tool connects risks to observability. A failure mode should not just sit in a document. It should map to a signal that can show whether the system is healthy, degraded, or unhealthy.

Options

- Map all failure modes: best for small systems or highly critical workloads.
- Map critical and high-risk failure modes first: best for large systems.
- Track unmapped risks as gaps: best when observability coverage is still improving.

When to pick which

Start with the highest RPV items. Every critical failure mode should have at least one signal, such as a metric, log, alert, availability check, or dependency signal.

Deliverables

- Health model
- Signal definitions
- Coverage report
- Bicep templates

Question 6: Should the health model be exported or deployed?

Once the health model is built, the next decision is how to use it.

Options

- Export for review: best when the team needs to validate the model first.
- Generate monitoring templates: best when the team wants repeatable implementation.
- Deploy to Azure: best when the model is ready to become part of operations.
- Use outputs in downstream tools: best when support, SRE, or incident response workflows need structured playbooks.

When to pick which

Export first if the model is still being reviewed. Deploy when component relationships, signals, and coverage are accurate enough for operational use.
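Referenced from Question 3, here is a toy sketch of RPV-style scoring. The multiplicative combination is an assumption in the spirit of classic FMEA risk priority numbers; the tool's actual formula may weight the four factors differently.

```python
# A toy sketch of RPV-style prioritization over the four factors named
# in Question 3. The formula and 1-5 scales are assumptions.
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    impact: int         # 1-5: customer/business impact
    likelihood: int     # 1-5: how often it is expected to occur
    detectability: int  # 1-5: 5 = hard to detect before users do
    outage: int         # 1-5: severity/duration of the outage

    @property
    def rpv(self) -> int:
        return self.impact * self.likelihood * self.detectability * self.outage


modes = [
    FailureMode("DB write timeout on checkout", 5, 3, 4, 4),
    FailureMode("Stale cache on product page", 2, 4, 2, 1),
]

# Highest-RPV failure modes deserve attention first.
for fm in sorted(modes, key=lambda m: m.rpv, reverse=True):
    print(f"{fm.rpv:4d}  {fm.name}")
```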
Question 7: How will governance keep the model current?

Phase 4 is Operations and Governance. This is where the resilience model becomes an ongoing practice.

Options

- One-time assessment: useful for quick discovery but limited long term.
- Recurring review: best for production workloads that change regularly.
- Closed-loop governance: best when incidents, failed validations, and monitoring gaps feed back into the model.

When to pick which

For production systems, use a recurring governance cadence. Assign owners, track gaps, review dashboards, and update the model as the system changes.

Deliverables

- Governance model
- Dashboards
- Reports and exports
- Runbooks

Putting it together: three adoption patterns

Once governance is defined, the tool can be used in different ways depending on the team's maturity and objective. The three common adoption patterns are:

Pattern A: Quick resilience review

- Import one critical workflow
- Generate failure modes
- Review RPV scores
- Identify top risks
- Export findings

Best for fast architecture reviews or early customer conversations.

Pattern B: Full workload assessment

- Import multiple workflows
- Build a full Failure Mode Catalog
- Define mitigations and recovery steps
- Create chaos test plans
- Map risks to signals
- Produce coverage reports

Best for structured resilience assessments.

Pattern C: Operational health model

- Build and tune the health model
- Export or deploy monitoring artifacts
- Track risk and signal coverage
- Review mitigation effectiveness
- Assign governance ownership
- Feed findings back into the model

Best when the goal is continuous operational improvement.

A short checklist before using the tool

- Which workflow should we import first?
- Do we have a data flow diagram, sequence diagram, or Mermaid file?
- What components and dependencies should be included?
- Which failure modes matter most?
- How should RPV be adjusted for this workload?
- Do critical failure modes have mitigations?
- Have those mitigations been validated?
- Are failure modes mapped to health signals?
- What coverage gaps remain?
- Should the health model be exported or deployed?
- Who owns ongoing review?
- How often should the model be updated?

Closing thought

The Application Resilience Framework Tool provides a practical way to move from architecture artifacts to measurable, continuously improving resilience. It starts with data flow or sequence diagrams, builds a structured view of the system, and guides teams through the decisions that matter: what can fail, how severe it is, how it is mitigated, how it is detected, and how it is governed.

Tool repo: Application Resilience Framework Tool
Introducing OpenAI's GPT-image-2 in Microsoft Foundry

Take a small design team running a global social campaign. They have the creative vision to produce localized imagery for every market, but not the resources to reshoot, reformat, or outsource at that scale. Every asset needs to fit a different platform, a different dimension, a different cultural context — and they all need to ship at the same time. This is where flexible image generation comes in.

OpenAI's GPT-image-2 is now generally available and rolling out today in Microsoft Foundry, introducing a step change in image generation. Developers and designers now get more control over image output, so a small team can execute with the reach and flexibility of a much larger one.

What is new in GPT-image-2?

GPT-image-2 brings real-world intelligence, multilingual understanding, improved instruction following, increased resolution support, and an intelligent routing layer, giving developers the tools to scale image generation for production workflows.

Real-world intelligence

GPT-image-2 has a knowledge cutoff of December 2025, meaning it can give you more contextually relevant and accurate outputs. The model also comes with enhanced thinking capabilities that allow it to search the web, check its own outputs, and create multiple images from a single prompt. These enhancements shift image generation models from simple tools into creative sidekicks.

Multilingual understanding

GPT-image-2 includes increased language support across Japanese, Korean, Chinese, Hindi, and Bengali, as well as new thinking capabilities. This means the model can create images and render text that feels localized.

Increased resolution support

GPT-image-2 introduces 4K resolution support, giving developers the ability to generate rich, detailed, and photorealistic images at custom dimensions. Resolution guidelines to keep in mind:

| Constraint | Detail |
|---|---|
| Total pixel budget | The final image cannot exceed 8,294,400 pixels or fall below 655,360 pixels. Requests exceeding this budget are automatically resized to fit. |
| Resolutions | 4K, 1024x1024, 1536x1024, and 1024x1536 |
| Dimension alignment | Each dimension must be a multiple of 16 |

Note: if your requested resolution exceeds the pixel budget, the service will automatically resize it down.

Intelligent routing layer

GPT-image-2 also includes an expanded routing layer with two distinct modes, allowing the service to intelligently select the right generation configuration for a request without requiring an explicitly set size value.

Mode 1 — Legacy size selection

In Mode 1, the routing layer selects one of three legacy size tiers to use for generation:

| Size tier | Description |
|---|---|
| smimage | Small image output |
| image | Standard image output |
| xlimage | Large image output |

This mode is useful for teams already familiar with the legacy size tiers who want to benefit from automatic selection without making any manual changes.

Mode 2 — Token size bucket selection

In Mode 2, the routing layer selects from six token size buckets — 16, 24, 36, 48, 64, 96 — which map roughly to the legacy size tiers:

| Token bucket | Approximate legacy size |
|---|---|
| 16, 24 | smimage |
| 36, 48 | image |
| 64, 96 | xlimage |

This approach allows more flexibility in the number of tokens generated, which in turn helps optimize output quality and efficiency for a given prompt.
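A generation sketch under the resolution constraints above is shown below, assuming a GPT-image-2 deployment exposed through an OpenAI-compatible Images API in Foundry. The endpoint and key are placeholders; the chosen size must keep each dimension a multiple of 16 and the pixel count within the 655,360 to 8,294,400 budget.

```python
# A minimal sketch: generate one image at a supported resolution from a
# hypothetical GPT-image-2 deployment. Names/URLs are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://<foundry-endpoint>/v1", api_key="<key>")

result = client.images.generate(
    model="gpt-image-2",
    prompt=(
        "Interior of an empty subway car, wide-angle view down the aisle, "
        "realistic cool fluorescent lighting, metal poles and vinyl seats."
    ),
    size="1536x1024",  # one of the supported resolutions listed above
)

# Image models of this family typically return base64-encoded bytes.
with open("subway.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```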
See it in action

GPT-image-2 shows improved image fidelity across visual styles, generating more detailed and refined images. But don't just take our word for it: let's see the model in action with a few prompts and edits. Here is the example we used.

Prompt: Interior of an empty subway car (no people). Wide-angle view looking down the aisle. Clean, modern subway car with seats, poles, route map strip, and ad frames above the windows. Realistic lighting with a slight cool fluorescent tone, realistic materials (metal poles, vinyl seats, textured floor).

When using the same base prompt, image quality and realism improved with each model. Now let's add incremental changes to the same image:

Prompt: Populate the ad frames with a cohesive ad campaign for "Zava Flower Delivery" and use an array of flower types.

Our subway is now full of ads for the new Zava flower delivery service. Let's ask for another small change:

Prompt: In all Zava Flower Delivery advertisements, change the flowers shown to roses (red and pink roses).

In three simple prompts, we've created a mockup of a flower delivery ad. From marketing material to website creation to UX design, GPT-image-2 lets developers deliver production-grade assets for real business use cases.

Image generation across industries

These new capabilities open the door to richer, more production-ready image generation workflows across a range of enterprise scenarios:

- Retail & e-commerce: generate product imagery at exact platform-required dimensions, from square thumbnails to wide banners, without post-processing.
- Marketing: produce crisp, richly colored campaign visuals and social assets localized to different markets.
- Media & entertainment: generate storyboard panels and scenes at resolutions suited to production pipelines.
- Education & training: create visual learning aids and course materials formatted to exact display requirements across devices.
- UI/UX design: accelerate mockup and prototype workflows by generating interface assets at the precise dimensions your design system requires.

Trust and safety

At Microsoft, our mission to empower people and organizations remains constant. As part of this commitment, models made available through Foundry undergo internal reviews and are deployed with safeguards designed to support responsible use at scale. Learn more about responsible AI at Microsoft. For GPT-image-2, Microsoft applied an in-depth safety approach that addresses disallowed content and misuse while maintaining human oversight. The deployment combines OpenAI's image generation safety mitigations with Azure AI Content Safety, including filters and classifiers for sensitive content.

Pricing

| Model | Offer type | Pricing - Image | Pricing - Text |
|---|---|---|---|
| GPT-image-2 | Standard Global | Input tokens: $8; Cached input tokens: $2; Output tokens: $30 | Input tokens: $5; Cached input tokens: $1.25 |

Note: all prices are per 1M tokens. There is no billing for text output tokens for the GPT-image-2 model.

Getting started

Whether you're building a personalized retail experience, automating visual content pipelines, or accelerating design workflows, GPT-image-2 gives your team the resolution control and intelligent routing to generate images that fit your exact needs. Try GPT-image-2 in Microsoft Foundry today!

- Deploy the model in Microsoft Foundry
- Experiment with the model in the Image playground
- Read the documentation to learn more
How MS Discovery Is Empowering Scientists to Do More

Research and development has traditionally been a slow, sequential, and largely manual endeavour. Scientists formulate hypotheses, design experiments, run computations in constrained environments, and document results — each stage dependent on the last, each transition requiring human review and intervention. Knowledge is fragmented across systems, insights are bottlenecked by individual capacity, and the gap between hypothesis and actionable outcome can span weeks or months. For organisations tackling complex scientific and operational challenges, from drug discovery to industrial process optimisation, this pace of iteration is simply no longer acceptable.

At Microsoft, we recently introduced Microsoft Discovery, a platform that I believe fundamentally changes this model. Much like Microsoft 365 transformed the way knowledge workers collaborate and create, Microsoft Discovery is designed to simplify and empower the way scientists and researchers work. It provides a unified, end-to-end platform that integrates advanced artificial intelligence, high-performance computing, and knowledge management to support the full scientific reasoning lifecycle: knowledge gathering, hypothesis generation, experiment design, simulation, results analysis, and documentation.

In this article, I want to share how we used Microsoft Discovery to automate a real-world simulation workflow for a mining organisation, and what that experience taught our team about the future of AI-augmented science.

What Is Microsoft Discovery?

Microsoft Discovery is Microsoft's scientific AI platform, a solution designed to accelerate research and experimentation across the full innovation lifecycle. Rather than replacing scientific judgement, Discovery is designed to amplify human expertise, embedding AI assistance at each stage of the R&D process while maintaining governance, traceability, and scientific rigour.

From Traditional R&D to AI-Augmented Science

To appreciate what Discovery enables, it is important to understand where it fits in. In the traditional R&D model, knowledge discovery centres on manual literature reviews and historical data analysis. Researchers individually search, read, and synthesise information — a time-intensive process where discovery is limited by each person's capacity to locate and interpret relevant material. Hypothesis generation and experimental design are expert-led and largely manual. Computational experimentation, where it exists, runs in fixed or constrained environments with limited parallelism. Analysis and iteration follow the same sequential pattern: execute, review, document, repeat.

Microsoft Discovery changes this fundamentally. In the AI-cloud-enabled model it provides:

- Knowledge synthesis at scale — Researchers can explore literature, historical experiments, and organisational knowledge through a single interface, with intelligent indexing surfacing insights faster than manual search could ever achieve.
- AI-assisted hypothesis generation — Collaborative human-and-AI workflows support hypothesis exploration and feasibility assessment, while final decisions remain with the scientist.
- Cloud-scale experimentation — Elastic compute and parallel processing allow simulations and experiments to run at scale, with integrated tracking and reproducibility built in.
- Continuous feedback and human-in-the-loop governance — Results are analysed and compared more rapidly, enabling faster iteration, with AI-generated insights reviewed and validated by researchers before action.
- Governed knowledge assets — Experiment lineage, outcomes, and best practices are captured as reusable, governed assets, supporting long-term organisational learning.

The net effect is a transition from slow, manual, and fragmented research processes to an agile, automated, and data-driven R&D model — one that improves research efficiency, increases the return on innovation investment, and enables faster, higher-impact solutions to complex challenges. At a high level, the research and development loop we discussed, and how Microsoft Discovery enriches it, is shown in the following diagram.

The Real-World Problem: Screening Thousands of Molecules

To bring this to life, let me walk you through a real-world use case we worked on recently. A mining organisation needed to identify the best-performing oxidant compounds for a chemical reaction central to their operations. We will discuss only one workflow here — one that sits squarely in the simulation phase of the scientific loop — and it is a perfect example of the kind of work that Microsoft Discovery can strongly transform.

How Scientists Did It Before

In the traditional process, scientists would begin by selecting candidate molecules from established molecular libraries based on characteristics identified through literature review. These libraries can contain thousands of molecules, each defined in standard molecular file formats (such as XYZ or CIF files) that describe their three-dimensional atomic structures. From there, a researcher would manually work through a multi-step pipeline:

1. Pre-processing and preparation. The selected molecular files are processed and prepared for quantum mechanical (QM) calculations. This involves filtering molecules based on properties like the types of metals present, electron count, and atomic weight — criteria that directly affect both the scientific relevance and the computational cost of the simulations. The output is a set of prepared input files (known as GJF files) ready for simulation.
2. Running quantum mechanical simulations. The prepared input files are submitted to a computational chemistry tool (Gaussian 16) to perform Density Functional Theory (DFT) calculations. These simulations compute the electronic structure and energy states of each molecule across different charge and multiplicity configurations. Crucially, each molecule requires multiple independent simulation runs, and the computational cost scales rapidly with molecular complexity. With thousands of candidate molecules, this step alone can involve thousands of individual simulation jobs.
3. Collecting and post-processing results. Once all simulations complete, the output log files are collected and processed. For each molecule, the lowest-energy charge and multiplicity combination is identified (a toy sketch of this selection step appears below), and a set of quantum mechanical descriptors and classical molecular descriptors are extracted. These descriptors are then fed into a trained machine learning model to predict the redox potential of each compound — a key metric that indicates how effectively a molecule can act as an oxidant in the target reaction.
4. Summarisation and filtering. Finally, the predicted redox potentials and other relevant characteristics are compiled into a summary, enabling researchers to identify the most promising candidates for further investigation and experimental validation.

Every step in this pipeline required manual intervention: writing and adjusting scripts, verifying input and output files, monitoring job queues, handling failures, and stitching results together.
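The toy sketch referenced in step 3: for each molecule, run over candidate (charge, multiplicity) pairs and keep the lowest-energy configuration. The `run_dft` function is an explicit placeholder for the actual Gaussian 16 job submission and log parsing, and the candidate charge and multiplicity ranges are illustrative.

```python
# A toy sketch of the lowest-energy selection step. run_dft is a
# placeholder for real Gaussian 16 submission and log parsing.
from itertools import product


def run_dft(molecule: str, charge: int, multiplicity: int) -> float:
    """Placeholder: submit a Gaussian job and return total energy (hartree)."""
    raise NotImplementedError


def lowest_energy_state(molecule: str,
                        charges=(-1, 0, 1),
                        multiplicities=(1, 2, 3)) -> tuple:
    best = None
    for q, m in product(charges, multiplicities):
        energy = run_dft(molecule, q, m)
        if best is None or energy < best[0]:
            best = (energy, q, m)
    return best  # (energy, charge, multiplicity)
```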
A single researcher could easily spend days or weeks moving through this process — and any error at one stage meant going back and re-running subsequent steps.

How We Automated This with Microsoft Discovery Agents

When we looked at this workflow through the lens of Microsoft Discovery, the opportunity was clear. The scientific reasoning — selecting which molecules to test, interpreting redox potential results, deciding what to investigate next — should remain with the researcher. But the operational overhead of preparing files, submitting simulations, monitoring jobs, collecting results, and assembling summaries? That could be orchestrated by a team of AI agents.

A Team of Agents, Working Together

We designed a multi-agent architecture within Microsoft Discovery to automate this simulation workflow end to end. Here is how the team of agents operates:

- Router Agent: The entry point. When a researcher submits a request — for example, asking to run QM calculations on a set of candidate molecules — the Router Agent interprets the intent and orchestrates the downstream workflow.
- Planner Agent: Once the Router Agent identifies the task, the Planner Agent examines the input files provided by the researcher and formulates a step-by-step execution plan. It determines what needs to happen, in what order, and with what parameters — much like a project manager scoping out a piece of work.
- Gaussian Prep Agent: This agent handles the preparation step. It inspects the current molecular files, applies the necessary filtering criteria, and prepares them for simulation, generating the input files that the computational chemistry tool requires. What previously involved manual scripting and file-by-file verification is now handled autonomously. We used Microsoft Discovery tools for the underlying execution with this agent.
- MPI Gaussian Agent: This is where the power of cloud-scale computing comes in. The Gaussian Agent submits the prepared simulation jobs and manages their execution using an MPI-based master-worker pattern. This approach enables massive parallel execution, scaling out across the cloud to run thousands of simulations concurrently rather than sequentially. Given that the candidate molecule libraries can contain thousands of entries, and each molecule may require multiple simulation runs, this parallel execution capability is transformative. What might have taken days in a constrained local environment can now complete in a fraction of the time.
- Redox Potential Agent: Once the simulations are complete, this agent takes over. It processes the simulation outputs, identifies the optimal charge and multiplicity state for each molecule, extracts the relevant QM and classical descriptors, and runs them through the trained machine learning model to predict redox potentials.
- Summariser Agent: The final agent in the chain. It maps the predicted redox potentials back to the original molecules, applies any additional filtering criteria, and produces a clean, structured summary — a JSON file that the researcher can immediately use to identify the most promising candidates and take them forward into the next phase of their work.
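To make the handoff order concrete, here is a conceptual sketch only — this is not the Microsoft Discovery agent API. Each agent is reduced to a plain function over a shared payload dict, with all values illustrative.

```python
# A conceptual sketch of the six-agent handoff chain described above.
def router(request: dict) -> dict:
    return {**request, "workflow": "qm_redox_screening"}  # classify intent


def planner(payload: dict) -> dict:
    return {**payload, "plan": ["prep", "simulate", "predict", "summarise"]}


def gaussian_prep(payload: dict) -> dict:
    return {**payload, "gjf_files": ["mol_001.gjf"]}  # filtered GJF inputs


def mpi_gaussian(payload: dict) -> dict:
    return {**payload, "logs": ["mol_001.log"]}  # fan-out DFT runs


def redox_predict(payload: dict) -> dict:
    return {**payload, "redox": {"mol_001": 1.23}}  # descriptors -> ML model


def summarise(payload: dict) -> dict:
    return {"candidates": payload["redox"]}  # final filtered JSON


def run_pipeline(request: dict) -> dict:
    return summarise(
        redox_predict(mpi_gaussian(gaussian_prep(planner(router(request)))))
    )


print(run_pipeline({"input_files": ["library.xyz"]}))
```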
What the Researcher Experiences

From the scientist's perspective, the transformation is striking. Instead of spending days writing scripts, babysitting job queues, and manually stitching results together, they provide their input files and describe what they need. The agents take it from there: planning, preparing, executing, processing, and summarising, then delivering a curated output ready for scientific interpretation.

The researcher's time is freed to focus on what matters most: thinking critically about the science. Which molecules look most promising? What does the redox potential distribution tell us? Should we adjust the filtering criteria and run another round? These are the high-value questions that require human expertise, and now scientists can spend their time on exactly that, rather than on operational mechanics.

The Bigger Picture: Accelerating the Entire Scientific Loop

It is important to note that this simulation workflow is just one piece of the broader scientific loop. The full cycle of scientific research, from initial knowledge gathering and literature review through hypothesis generation, experimental design, simulation, results analysis, and documentation, involves many stages, each of which can benefit from the same kind of AI-augmented approach.

Microsoft Discovery is designed to support this entire cycle. In our project, we did not stop at simulation. We also explored how agents can accelerate the knowledge-gathering phase, helping researchers navigate vast bodies of literature and surface relevant prior work more efficiently. We looked at how AI can assist with hypothesis generation and evaluation, helping scientists reason about which directions are most promising before committing to expensive computations. And we examined how agents can support the analysis and reporting phases: comparing results against hypotheses, generating visualisations, and even assisting with drafting research documents.

What excites me most about Microsoft Discovery is not any single capability, but the cumulative effect of embedding AI assistance across every stage of the research process. Each phase that gets faster and more efficient creates a multiplier effect on the phases that follow. When knowledge gathering takes hours instead of weeks, researchers generate better hypotheses sooner. When simulations run at cloud scale in parallel, results arrive faster. When analysis is augmented by AI, iteration cycles tighten. The entire loop accelerates.

Conclusion

The way we approach scientific research is undergoing a fundamental shift. Large language models, and the AI agents built from them, are not replacing scientists; they are empowering them to work at a pace and scale that was previously unimaginable.

Microsoft Discovery represents a new operating model for R&D. By combining advanced AI, high-performance cloud computing, and intelligent workflow orchestration, it enables researchers to offload the repetitive, time-consuming operational work to agents and invest their expertise where it has the greatest impact: in asking better questions, interpreting complex results, and pushing the boundaries of what we know.

In the use case I have shared here, a team of six AI agents automated a simulation pipeline that would have taken a single researcher days of manual work. They prepared molecular input files, scaled out thousands of quantum mechanical simulations in parallel across the cloud, processed the results, predicted redox potentials using machine learning, and delivered a structured summary, all with minimal human intervention.

This is just the beginning. As AI agents become more capable and the tools surrounding them more mature, the potential to accelerate discovery across every scientific domain is immense.
Whether you are in materials science, pharmaceuticals, energy, agriculture, or any field where complex R&D is central to progress, Microsoft Discovery offers a platform to do more, faster, and with greater confidence. The future of science is not about working harder. It is about working smarter, with AI as your partner in discovery.

Transforming Video Content into Structured SOPs Using Graph-based RAG
Introduction

In today's digital-first environments, a large portion of enterprise knowledge lives inside video content: training sessions, onboarding walkthroughs, and recorded operational procedures. While videos are great for learning, they are not ideal for quick reference, compliance, or repeatable processes. Converting that knowledge into structured documentation like Standard Operating Procedures (SOPs) is often manual and time-consuming. What if this process could be automated using AI?

The Problem

Transcripts alone don't solve the problem. When videos are converted into text, the output typically lacks:

Clear structure (sections, headings, hierarchy)
Context (relationships between steps, tools, and roles)
Completeness (definitions and dependencies spread across the content)

This leads to a common challenge: teams spend significant effort manually reading transcripts, interpreting context, and restructuring them into usable documentation. As seen in modern architecture challenges, manual and repetitive configurations don't scale well and increase maintenance effort.

Enter Graph-based RAG (GraphRAG)

GraphRAG extends traditional RAG by building a knowledge graph instead of treating content as disconnected chunks.

What GraphRAG Does

Extracts entities (tools, systems, roles, concepts)
Maps relationships between them
Groups related concepts into logical sections
Preserves context across the entire document

Architecture Overview

Below is the high-level pipeline:

Video → Transcription → Knowledge Graph → LLM Generation → Structured SOP

Implementation Approach (Step-by-Step)

Stage 1: Knowledge Graph Construction

Convert video to transcript
Split transcript into chunks
Feed chunks into GraphRAG

GraphRAG performs:

Text Unit Extraction
Entity Recognition
Relationship Mapping
Community Detection

Result: a structured knowledge graph representation of the transcript.

Stage 2: Structure Extraction

From the knowledge graph:

Sequential Steps: preserve procedural flow from transcript order
Logical Sections: derived using community detection
Key Concepts: identified using graph centrality (importance via connections)

This creates the framework for the SOP. (A minimal code sketch of this stage appears after Stage 3.)

Stage 3: Intelligent Document Generation

Using Azure OpenAI, each SOP section is generated from a specific part of the graph:

Title & Purpose → High-level concepts
Scope → Entity boundaries
Definitions → Entity descriptions
Responsibilities → Role-based entities
Procedures → Sequential steps
References → Linked content

The key advantage: the LLM is grounded in graph structure, not raw text.
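To make Stage 2 concrete, here is a minimal sketch using networkx. It assumes entity extraction has already produced nodes and relationship edges; the community algorithm and centrality measure are reasonable stand-ins for what a GraphRAG pipeline computes, not its exact internals:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def extract_sop_structure(entities, relationships, top_k=10):
    """Derive logical sections (communities) and key concepts (central nodes)
    from an already-built knowledge graph."""
    graph = nx.Graph()
    graph.add_nodes_from(entities)        # e.g. tools, roles, systems
    graph.add_edges_from(relationships)   # entity-pair edges from relationship mapping
    # Logical sections: densely connected clusters of related concepts.
    sections = [sorted(c) for c in greedy_modularity_communities(graph)]
    # Key concepts: entities with the most connections across the transcript.
    centrality = nx.degree_centrality(graph)
    key_concepts = sorted(centrality, key=centrality.get, reverse=True)[:top_k]
    return {"sections": sections, "key_concepts": key_concepts}

# Tiny worked example with hypothetical entities from a training video.
structure = extract_sop_structure(
    entities=["operator", "reactor", "valve", "SOP checklist", "QA role"],
    relationships=[("operator", "reactor"), ("operator", "valve"),
                   ("reactor", "valve"), ("QA role", "SOP checklist")],
)
print(structure["sections"])      # e.g. two clusters: process entities vs. QA entities
print(structure["key_concepts"])  # most connected entities first
```

Each community then becomes a candidate SOP section, and the high-centrality entities seed the definitions and key-concepts content that the LLM expands in Stage 3.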
Key Benefits

Context Preservation: relationships between concepts are maintained across sections.
Comprehensive Coverage: community detection ensures important topics are not missed.
Reduced Hallucination: LLM generation is grounded in structured knowledge.
Scalability: works for 30-minute tutorials, 3-hour training sessions, and enterprise knowledge bases.

Real-World Impact (Example)

In enterprise scenarios like pharmaceutical SOP generation:

Processing time: ~15–20 minutes for a multi-hour video
Output quality: 8–10 structured SOP sections
Consistency: terminology and relationships preserved
Coverage: minimal missing topics

Where This Approach Works Best

Training videos → SOPs
Meeting recordings → action summaries
Technical demos → documentation
Interview recordings → knowledge bases
Tutorials → reference guides

Key Takeaway

This approach represents a shift from text processing → knowledge understanding. By combining knowledge graphs (structure) with LLMs (language generation), we can transform raw, unstructured content into usable, enterprise-grade knowledge assets.

Resources

https://microsoft.github.io/graphrag/index/overview/

Final Thoughts

Have you explored GraphRAG or similar approaches in your projects? What challenges did you face? How did you handle unstructured knowledge? Share your experiences — let's learn together.

Three tiers of Agentic AI - and when to use none of them
Every enterprise has an AI agent. Almost none of them work in production.

Walk into any enterprise technology review right now and you will find the same thing. Pilots running. Demos recorded. Steering committees impressed. And somewhere in the background, a quiet acknowledgment that the thing does not actually work at scale yet. OutSystems surveyed nearly 1,900 global IT leaders and found that 96% of organizations are already running AI agents in some capacity. Yet only one in nine has those agents operating in production at scale. The experiments are everywhere. The production systems are not.

That gap is not a capability problem. The infrastructure has matured. Tool calling is standard across all major models. Frameworks like LangGraph, CrewAI, and Microsoft Agent Framework abstract orchestration logic. Model Context Protocol standardizes how agents access external tools and data sources. Google's Agent-to-Agent protocol, now under Linux Foundation governance with over 50 enterprise technology partners including Salesforce, SAP, ServiceNow, and Workday, standardizes how agents coordinate with each other. The protocols are in place. The frameworks are production ready.

The gap is a selection and governance problem. Teams are building agents on problems that do not need them, choosing the wrong tier for the ones that do, and treating governance as a compliance checkbox to add after launch rather than an architectural input to design in from the start. The same OutSystems research found that 94% of organizations are concerned that AI sprawl is increasing complexity, technical debt, and security risk, and only 12% have a centralized approach to managing it. Teams are deploying agents the way shadow IT spread through enterprises a decade ago: fast, fragmented, and without a shared definition of what production-ready actually means.

I've built agentic systems across enterprise clients in logistics, retail, and B2B services. The failures I keep seeing are not technology failures. They are architecture and judgment failures: problems that existed before the first line of code was written, in the conversation where nobody asked the prior question. This article is the framework I use before any platform conversation starts.

What has genuinely shifted in the agentic landscape

Three changes are shaping how enterprise agent architecture should be designed today, and they are not incremental improvements on what existed before.

The first is the move from single agents to multi-agent systems. Databricks' State of AI Agents report, drawing on data from over 20,000 organizations (including more than 60% of the Fortune 500), found that multi-agent workflows on their platform grew 327% in just four months. This is not experimentation; it is production architecture shifting. A single agent handling everything (routing, retrieval, reasoning, execution) is being replaced by specialized agents coordinating through defined interfaces. A financial organization, for example, might run separate agents for intent classification, document retrieval, and compliance checking, each narrow in scope, each connected to the next through a standardized protocol rather than tightly coupled code.

The second is protocol standardization. MCP handles vertical connectivity: how agents access tools, data sources, and APIs through a typed manifest and standardized invocation pattern. A2A handles horizontal connectivity: how agents discover peer agents, delegate subtasks, and coordinate workflows. Production systems today use both.
The practical consequence is that multi-agent architectures can be composed and governed as a platform rather than managed as a collection of one-off integrations.

The third is governance as the differentiating factor between teams that ship and teams that stall. Databricks found that companies using AI governance tools get over 12 times more AI projects into production compared to those without. The teams running production agents are not running more sophisticated models. They built evaluation pipelines, audit trails, and human oversight gates before scaling, not after the first incident.

Tier 1 - Low-code agents: fast delivery with a defined ceiling

The low-code tier is more capable than it was eighteen months ago. Copilot Studio, Salesforce Agentforce, and equivalent platforms now support richer connector libraries, better generative orchestration, and more flexible topic models. The ceiling is higher than it was. It is still a ceiling.

The core pattern remains: a visual topic model drives a platform-managed LLM that classifies intent and routes to named execution branches. Connectors abstract credential management and API surface. A business team — analyst, citizen developer, IT operations — can build, deploy, and iterate without engineering involvement on every change. For bounded conversational problems, this is the fastest path from requirement to production.

The production reality is documented clearly. Gartner data found that only 5% of Copilot Studio pilots moved to larger-scale deployment. A European telecom with dedicated IT resources and a full Microsoft enterprise agreement spent six months and did not deliver a single production agent. The visual builder works. The path from prototype to production (production-grade integrations, error handling, compliance logging, exception routing) is where most enterprises get stuck, because it requires Power Platform expertise that most business teams do not have.

The platform ceiling shows up predictably at four points. Async processing: anything beyond a synchronous connector call, including approval chains, document pipelines, or batch operations, cannot be handled natively. Full-payload audit logs: platform logs give conversation transcripts and connector summaries, not structured records of every API call and its parameters. Production volume: concurrency limits and message throughput budgets bind faster than planning assumptions suggest. Root-cause analysis in production: you cannot inspect the LLM's confidence score or the alternatives it considered, which makes diagnosing misbehavior significantly harder than it should be.

The correct diagnostic: can this use case be owned end-to-end by a business team, covered by standard connectors, with no latency SLA below three seconds and no payload-level compliance requirement? If yes, low code is the correct tier, not a compromise. If no on any point, continue.

If low-code is the right call for your use case: Copilot Studio quickstart

Tier 2 - Pro-code agents: the architecture the current landscape demands

The defining pattern in production pro-code architecture today is multi-agent: specialized agents per domain, coordinating through MCP for tool access and A2A for peer-to-peer delegation, with a governance layer spanning the entire system. What this looks like in practice: a financial organization handling incoming compliance queries runs separate agents for intent classification, document retrieval, and the compliance check itself. None of these agents tries to do all three jobs.
Each has a narrow responsibility, a defined input/output contract typed against a JSON Schema, and a clear handoff boundary. The 327% growth in multi-agent workflows reflects production teams discovering that the failure modes of monolithic agents (topic collision, context overflow, degraded classification as scope expands) are solved by specialization, not by making a single agent more capable.

The discipline that makes multi-agent systems reliable is identical to what makes single-agent systems reliable, just enforced across more boundaries: the LLM layer reasons and coordinates; deterministic tool functions enforce. In a compliance pipeline, no LLM decides whether a document satisfies a regulatory requirement. That evaluation runs in a deterministic tool with a versioned rule set, testable outputs, and an immutable audit log. The LLM orchestrates the sequence. The tool produces the compliance record. Mixing these (letting an LLM evaluate whether a rule passes) collapses the audit trail and introduces probabilistic outputs on questions that have regulatory answers. A sketch of this separation follows at the end of this section.

MCP is the tool interface standard today. An MCP server exposes a typed manifest that any compliant agent runtime can discover at startup. Tools are versioned, independently deployable, and reusable across agents without bespoke integration code. A2A extends this horizontally: agents advertise capability cards, discover peers, and delegate subtasks through a standardised protocol. The practical consequence is that multi-agent systems built on both protocols can be composed and governed as a platform rather than managed as a collection of one-off integrations.

Observability is the architectural element that separates teams shipping production agents from teams perpetually in pilot. Build evaluation pipelines, distributed traces across all agent boundaries, and human review gates before scaling. The teams that add these after the first production incident spend months retrofitting what should have been designed in.

If pro-code is the right call for your use case: Foundry Agent Service
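To make the reason/enforce split concrete, here is a minimal sketch of a deterministic rule tool that an orchestrating agent might call. The rule IDs, schema, and rule logic are hypothetical; the point is that the evaluation is versioned, unit-testable code that emits an auditable record, while the LLM only decides when to call it:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

RULESET_VERSION = "2025.02-r3"  # hypothetical versioned rule set

# Deterministic rules: plain predicates, unit-testable, no model in the loop.
RULES = {
    "DOC-001": lambda doc: doc["retention_years"] >= 7,
    "DOC-002": lambda doc: doc["classification"] in {"public", "internal"},
}

@dataclass(frozen=True)
class ComplianceResult:
    """Immutable record suitable for an append-only audit log."""
    document_id: str
    ruleset_version: str
    passed: bool
    failures: list[str]
    evaluated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def check_compliance(doc: dict) -> ComplianceResult:
    """Evaluate every rule and return the compliance record.
    The orchestrating agent decides *when* to call this tool;
    it never decides *whether* a rule passed."""
    failures = [rule_id for rule_id, rule in RULES.items() if not rule(doc)]
    return ComplianceResult(
        document_id=doc["id"],
        ruleset_version=RULESET_VERSION,
        passed=not failures,
        failures=failures,
    )
```

Exposed through an MCP server, a function like this becomes a versioned, independently deployable tool: the model's output never appears in the compliance record, only in the decision to invoke it.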
The hybrid pattern: still where production deployments land

The shift to multi-agent architecture does not change the hybrid pattern; it deepens it. Low-code at the conversational surface, pro-code multi-agent systems behind it, with a governance layer spanning both.

On a logistics client engagement, the brief was a sales assistant for account managers: shipment status, account health, and competitive context inside Teams. The business team wanted everything in Copilot Studio. Engineering wanted a custom agent runtime. Both were wrong.

What we built: Copilot Studio handled all high-frequency, low-complexity queries (shipment tracking, account status, open cases) through Power Platform connectors. Zero custom code. That covered roughly 78% of actual interaction volume. Requests requiring multi-source reasoning (competitive positioning on a specific lane, churn risk across an account portfolio, contract renewal analysis) were delegated via authenticated HTTP action to a pro-code multi-agent service on Azure. A retrieval agent pulled deal history and market intelligence through MCP-exposed tools. A synthesis agent composed the recommendation with confidence scoring. Structured JSON came back to the low-code layer, rendered as an adaptive card in Teams.

The HITL gate was non-negotiable and designed before deployment, not added after the first incident. No output reached a customer without a manager approval step. The agent drafts. A human sends.

This boundary (low-code for conversational volume, pro-code for reasoning depth) maps directly to what the research shows separates teams that ship from teams that stall. The organizations running agents in production drew the line correctly between what the platform can own and what engineering needs to own. Then they built governance into both sides before scaling.

The four gates - the prior question that still gets skipped

Run every candidate use case through these four checks before the platform conversation begins. None of the recent infrastructure improvements change what they are checking, because none of them change the fundamental cost structure of agentic reasoning.

Gate 1 - is the logic fully deterministic? If every valid output for every valid input can be enumerated in unit tests, the problem does not need an LLM. A rules engine executes in microseconds at zero inference cost and cannot produce a plausible-but-wrong answer. NeuBird AI's production ops agents, which have resolved over a million alerts and saved enterprises over $2 million in engineering hours, work because alert triage logic that can be expressed as rules runs in deterministic code, and the LLM only handles cases where pattern-matching is insufficient. That boundary is not incidental to the system's reliability. It is the reason for it.

Gate 2 - is zero hallucination tolerance required? With over 80% of databases now being built by AI agents, per Databricks' State of AI Agents report, the surface area for hallucination-induced data errors has grown significantly. In domains where a wrong answer is a compliance event (financial calculation, medical logic, regulatory determinations), irreducible LLM output uncertainty is disqualifying regardless of model version or prompt engineering effort. Exit to deterministic code or classical ML with bounded output spaces.

Gate 3 - is a sub-100ms latency SLA required? LLM inference is faster than it was eighteen months ago. It is not fast enough for payment transaction processing, real-time fraud scoring, or live inventory management. A three-agent system with MCP tool calls has a P50 latency measured in seconds. These problems need purpose-built transactional architecture.

Gate 4 - is regulatory explainability required? A2A enables complex agent coordination and delegation. It does not make LLM reasoning reproducible in a regulatory sense. Temperature above zero means the same input produces different outputs across invocations. Regulators in financial services, healthcare, and consumer credit require deterministic, auditable decision rationale. Exit to deterministic workflow with structured audit logging at every decision point.

Five production failure modes - one of them new

The four original anti-patterns are still showing up in production. A fifth has been added by scale.

Routing data retrieval through a reasoning loop. A direct API call returns account status in under 10ms. Routing the same request through an LLM reasoning step adds hundreds of milliseconds, consumes tokens on every call, and introduces output parsing on data that is already structured. The agent calls a structured tool. The tool calls the API. The agent never acts as the integration layer.

Encoding business rules in prompts. Rules expressed in prompt text drift as models update. They produce probabilistic output across invocations and fail in ways that are difficult to reproduce and diagnose.
A rule that must evaluate correctly every time belongs in a deterministic tool function: unit-tested, version-controlled, independently deployable via MCP.

No approval gate on CRUD operations. CRUD operations without a human approval step will eventually misfire on the input that testing did not cover. The gate needs to be designed before deployment, not added after the first incident involving a financial posting, a customer-facing communication, or a data deletion.

Monolithic agent for all domains. A single agent accumulating every domain leads predictably to topic collision, context overflow, and maintenance that becomes impossible as scope expands. Specialized agents per domain, coordinating through A2A, is the architecture that scales.

Ungoverned agent sprawl. This is the new one, and currently the most prevalent. OutSystems found 94% of organizations concerned about it, with only 12% having a centralized response. Teams building agents independently across fragmented stacks, without shared governance, evaluation standards, or audit infrastructure, produce exactly the same organizational debt that shadow IT created, but with higher stakes, because these systems make autonomous decisions rather than just storing and retrieving data. The fix is treating governance as an architectural input before deployment, not a compliance requirement after something breaks.

The infrastructure is ready. The judgment is not.

The tier decision sequence has not changed. Does the problem need natural language understanding or dynamic generation? No — deterministic system, stop. Can a business team own it through standard connectors with no sub-3-second latency SLA and no payload-level compliance requirement? Yes — low-code. Does it need custom orchestration, multi-agent coordination, or audit-grade observability? Yes — pro-code with MCP and A2A. Does it need both a conversational surface and deep backend reasoning? Hybrid, with a governance layer spanning both.

What has changed is that governance is no longer optional infrastructure to add when you have time. The data is unambiguous: companies with governance tools get over 12 times more AI projects into production than those without. Evaluation pipelines, distributed tracing across agent boundaries, human oversight gates, and centralised agent lifecycle management are not overhead. They are what converts experiments into production systems. The teams still stuck in pilot are not stuck because the technology failed them. They are stuck because they skipped this layer.

The protocols are standardised. The frameworks are mature. The infrastructure exists. None of that is what is holding most enterprise agent programmes back. What is holding them back is a selection problem disguised as a technology problem — teams building agents before asking whether agents are warranted, choosing platforms before running the four gates, and treating governance as a checkpoint rather than an architectural input.

I have built agents that should have been workflow engines. Not because the technology was wrong, but because nobody stopped early enough to ask whether it was necessary. The four gates in this article exist because I learned those lessons at clients' expense, not mine.

The most useful thing I can offer any team starting an agentic AI project is not a framework selection guide. It is permission to say no — and a clear basis for saying it.

Take the four gates framework to your next architecture review.
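For teams that want the gates and tier sequence in a form they can drop into that review, here is a minimal sketch of the triage logic as described in this article; the function and parameter names are illustrative, not a product feature:

```python
def triage(needs_nlu: bool, fully_deterministic: bool, zero_hallucination_tolerance: bool,
           sub_100ms_sla: bool, regulatory_explainability: bool,
           business_team_ownable: bool, needs_deep_reasoning: bool) -> str:
    """Apply the four gates first, then the tier decision sequence."""
    # Gates 1-4: any hit exits to a non-agentic architecture.
    if fully_deterministic or zero_hallucination_tolerance \
            or sub_100ms_sla or regulatory_explainability:
        return "deterministic system (rules engine / classical ML / transactional architecture)"
    if not needs_nlu:
        return "deterministic system: no language understanding required"
    # Tier sequence.
    if business_team_ownable and not needs_deep_reasoning:
        return "low-code (e.g. Copilot Studio)"
    if business_team_ownable and needs_deep_reasoning:
        return "hybrid: low-code surface + pro-code multi-agent backend"
    return "pro-code multi-agent (MCP + A2A) with governance built in"
```

The ordering is the point: the exits come before the tiers, so the question of whether an agent is warranted is answered before any platform is chosen.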
If you have already shipped agents to production, I would like to hear what worked and what did not - comment below.

What to do next

Three concrete steps depending on where you are right now.

If you have pilots that have not reached production: run them through the four gates in this article before the next sprint. Gate 1 alone will eliminate a meaningful percentage of them. The ones that survive all four are your real candidates for production investment. Download the attached file for the gated checklist and take it into your next architecture review.

If you are starting a new agent project: do not open a platform before you have answered the gate questions. Once you have confirmed an agent is warranted and identified the tier, start here: Copilot Studio guided setup for low-code scenarios, or Foundry Agent Service for pro-code patterns with MCP and multi-agent coordination built in. Build governance infrastructure (evaluation pipeline, distributed tracing, HITL gates) before you scale, not after.

If you have already shipped agents to production: share what worked and what did not in the Azure AI Tech Community — tag posts with #AgentArchitecture. The most useful signal for teams still in pilot is hearing from practitioners who have been through production, not vendors describing what production should look like.

References

OutSystems — State of AI Development Report — https://www.outsystems.com/1/state-ai-development-report
Databricks — State of AI Agents Report — https://www.databricks.com/resources/ebook/state-of-ai-agents
Gartner — 2025 Microsoft 365 and Copilot Survey — https://www.gartner.com/en/documents/6548002 (paywalled primary source; publicly reported via techpartner.news: https://www.techpartner.news/news/gartner-microsoft-copilot-hype-offset-by-roi-and-readiness-realities-618118)
Anthropic — Model Context Protocol (MCP) — https://modelcontextprotocol.io
Google Cloud — Agent-to-Agent Protocol (A2A) — https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability
NeuBird AI — NeuBird AI Closes $19.3M Funding Round to Scale Agentic AI Across Enterprise Production Operations
Yao et al. — ReAct: Synergizing Reasoning and Acting in Language Models — https://arxiv.org/abs/2210.03629
Gregor Hohpe & Bobby Woolf — Enterprise Integration Patterns (Addison-Wesley) — https://www.enterpriseintegrationpatterns.com

Centralizing Enterprise API Access for Agent-Based Architectures
Problem Statement

When building AI agents or automation solutions, calling enterprise APIs directly often means configuring individual HTTP actions within each agent for every API. While this works for simple scenarios, it quickly becomes repetitive and difficult to manage as complexity grows. The challenge becomes more pronounced when a single business domain exposes multiple APIs, or when the same APIs are consumed by multiple agents. This leads to duplicated configurations, higher maintenance effort, inconsistent behavior, and increased governance and security risks.

A more scalable approach is to centralize and reuse API access. By grouping APIs by business domain using an API management layer, shaping those APIs through a Model Context Protocol (MCP) server, and exposing the MCP server as a standardized tool or connector, agents can consume business capabilities in a consistent, reusable, and governable manner. This pattern not only reduces duplication and configuration overhead but also enables stronger versioning, security controls, observability, and domain-driven ownership, making agent-based systems easier to scale and operate in enterprise environments.

Designing Agent-Ready APIs with Azure API Management, an MCP Server, and Copilot Studio

As enterprises increasingly adopt AI-powered assistants and Copilots, API design must evolve to meet the needs of intelligent agents. Traditional APIs, often designed for user interfaces or backend integrations, can expose excessive data, lack intent-level abstraction, and increase security risk when consumed directly by AI systems. This document outlines a practical, enterprise-ready approach: organize APIs in Azure API Management (APIM), introduce a Model Context Protocol (MCP) server to shape and control context, and integrate the solution with Microsoft Copilot Studio. The goal is to make APIs truly agent-ready: secure, scalable, reusable, and easy to govern.

Architecture at a glance

Back-end services expose domain APIs.
Azure API Management (APIM) groups and governs those APIs (products, policies, authentication, throttling, versions).
An MCP server calls APIM, orchestrates and filters responses, and returns concise, model-friendly outputs.
Copilot Studio connects to the MCP server and invokes a small set of predictable operations to satisfy user intents.

Why Traditional API Designs Fall Short for AI Agents

Enterprise APIs have historically been built around CRUD operations and service-to-service integration patterns. While this works well for deterministic applications, AI agents work best with intent-driven operations and context-aware responses. When agents consume traditional APIs directly, common issues include overly verbose payloads, multiple calls to satisfy a single user intent, and insufficient guardrails for read vs. write operations. The result can be unpredictable agent behavior that is difficult to test, validate, and govern.

Structuring APIs Effectively in Azure API Management

Azure API Management (APIM) is the control plane between enterprise systems and AI agents. A well-structured APIM instance improves security, discoverability, and governance through products, policies, subscriptions, and analytics.

Key design principles for agent consumption

Organize APIs by business capability (for example, Customer, Orders, Billing) rather than technical layers.
Expose agent-facing APIs via dedicated APIM products to enable controlled access, throttling, versioning, and independent lifecycle management.
Prefer read-only operations where possible; scope write operations narrowly and protect them with explicit checks, approvals, and least-privilege identities. Action-oriented APIs must be carefully scoped and gated.

The Role of the MCP Server in Agent-Based Architectures

APIM provides governance and security, but agents also need an intent-level interface and model-friendly responses. A Model Context Protocol (MCP) server fills this gap by acting as a mediator between Copilot Studio and APIM-exposed APIs. Instead of exposing many back-end endpoints directly to the agent, the MCP server can orchestrate multiple API calls, filter irrelevant fields, enforce business rules, enrich results with additional context, and format responses into concise, LLM-friendly schemas. This makes agent behavior more reliable and easier to validate.

By introducing this abstraction layer, Copilot interactions become simpler, safer, and more deterministic. The agent interacts with a small number of well-defined MCP operations that encapsulate enterprise logic without exposing internal complexity. (A minimal code sketch of this response-shaping pattern appears after the setup guide below.)

Designing an Effective MCP Server

An MCP server should have a focused responsibility: shaping context for AI models. It should not replace core back-end services; it should adapt enterprise capabilities for agent consumption. MCP does not orchestrate enterprise workflows on its own; it standardizes how agents discover and invoke external tools and APIs through a structured protocol interface, while orchestration, intent resolution, and policy-driven execution are handled by the agent runtime or host framework.

What MCP should do

Call APIM-managed APIs and orchestrate multi-step retrieval when needed.
Apply security checks and business rules consistently.
Filter and minimize payloads (return only the fields needed for the intent).
Normalize and reshape responses into stable, predictable JSON schemas.
Handle errors and edge cases with safe, descriptive messages.

What MCP should not do

Avoid implementing complex transactional workflows, long-running processes, or UI-specific formatting in MCP; those belong in backend systems. Keeping MCP lightweight ensures it remains scalable, testable, and easy to maintain.

Step by step guide

1) Create an MCP server in Azure API Management (APIM)

Open the Azure portal (portal.azure.com).
Go to your API Management instance.
In the left navigation, expand APIs.
Create (or select) an API group for the business domain you want to expose (for example, Orders or Customers).
Add the relevant APIs/operations to that API group.
Create or select an APIM product dedicated to agent usage, and ensure the product requires a subscription (subscription key).
Create an MCP server in APIM and map it to the API (or API group) you want to expose as MCP operations.
In the MCP server settings, ensure Subscription key required is enabled.
From the product's Subscriptions page, copy the subscription key you will use in Copilot Studio.

Note: Using an API Management subscription key to access MCP operations is one supported way to authenticate and consume enterprise APIs. However, this approach is best suited for initial setups, demos, or scenarios where key-based access is explicitly required. For production-grade enterprise solutions, Microsoft recommends using managed identity-based access control. Managed identities for Azure resources eliminate the need to manage secrets such as subscription keys or client secrets, integrate natively with Microsoft Entra ID, and support fine-grained role-based access control (RBAC). This approach improves security posture while significantly reducing operational and governance overhead for agent and service-to-service integrations. Wherever possible, agents and MCP servers should authenticate using managed identities to ensure secure, scalable, and compliant access to enterprise APIs.

2) Create a Copilot Studio agent and connect to the APIM MCP server using a subscription key

Copilot Studio natively supports Model Context Protocol (MCP) servers as tools. When an agent is connected to an MCP server, the tool metadata — including operation names, inputs, and outputs — is automatically discovered and kept in sync, reducing manual configuration and maintenance overhead.

Sign in to Copilot Studio.
Create a new agent and add clear instructions describing when to use the MCP tool and how to present results (for example, concise summaries plus key fields).
Open Tools > Add tool > Model Context Protocol, then choose Create.
Enter the MCP server details: the server endpoint URL (copy this from your MCP server in APIM); Authentication (select API Key); and the header name (use the subscription key header required by your APIM configuration).
Select Create new connection, paste the APIM subscription key, and save.
Test the tool in the agent by prompting for a domain-specific task (for example, "Get order status for 12345"). Validate that responses are concise and that errors are handled safely.

Operational best practices and guardrails

Least privilege by default: create separate APIM products and identities for agent scenarios; avoid broad access to internal APIs.
Prefer intent-level operations: expose fewer, higher-level MCP operations instead of many low-level endpoints.
Protect write operations: require explicit parameters, validation, and (when appropriate) approval flows; keep "read" and "write" tools separate.
Stable schemas: return predictable JSON shapes and limit optional fields to reduce prompt brittleness.
Observability: log MCP requests/responses (with sensitive fields redacted), monitor APIM analytics, and set alerts for failures and throttling.
Versioning: version MCP operations and APIM APIs; deprecate safely.
Security hygiene: treat subscription keys as secrets, rotate regularly, and avoid exposing them in prompts or logs.
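To show what the response-shaping pattern looks like in code, here is a minimal sketch of one MCP-style tool function, assuming a hypothetical Orders API behind APIM. The endpoint, field names, and payload shape are invented for illustration; Ocp-Apim-Subscription-Key is APIM's default subscription header, and for production the managed-identity approach described above is preferred:

```python
import os

import requests

APIM_BASE = "https://contoso-apim.azure-api.net/orders"  # hypothetical APIM product URL
SUBSCRIPTION_KEY = os.environ["APIM_SUBSCRIPTION_KEY"]   # treat as a secret; never log it

def get_order_status(order_id: str) -> dict:
    """MCP-style tool: call the APIM-governed API, then return only the
    fields an agent needs, in a stable JSON shape."""
    response = requests.get(
        f"{APIM_BASE}/{order_id}",
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        timeout=10,
    )
    response.raise_for_status()
    raw = response.json()
    # Filter and normalize: the agent gets a concise, predictable schema,
    # not the full back-end payload.
    return {
        "orderId": raw.get("id"),
        "status": raw.get("status"),
        "expectedDelivery": raw.get("delivery", {}).get("expectedDate"),
    }
```

Registered as an MCP operation, a function like this is what keeps agent behavior predictable: the back-end payload can evolve, but the shape the agent sees stays stable and minimal.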
Summary

As organizations scale agent-based and Copilot-driven solutions, directly exposing enterprise APIs to AI agents quickly becomes complex and risky. Centralizing API access through Azure API Management, shaping agent-ready context via a Model Context Protocol (MCP) server, and consuming those capabilities through Copilot Studio establishes a clean and governable architecture. This pattern reduces duplication, enforces consistent security controls, and enables intent-driven API consumption without exposing unnecessary backend complexity. By combining domain-aligned API products, lightweight MCP operations, and least-privilege identity-based access, enterprises can confidently scale AI agents while maintaining strong governance, observability, and operational control.

References

Azure API Management (APIM) – Overview
Azure API Management – Key Concepts
Azure MCP Server Documentation (Model Context Protocol)
Extend your agent with Model Context Protocol
Managed identities for Azure resources – Overview