Autonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes
When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m.

Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc kubectl commands.

This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams.

Core concepts

Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow:

- Incident platform. Where incidents originate. In this demo, that is Azure Monitor.
- Built-in Azure capabilities. The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors.
- Connectors. Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers.
- Permission levels. Reader for investigation and read-oriented access, Privileged for operational changes when allowed.
- Run modes. Review for approval-gated execution and Autonomous for direct execution.

The most important production controls are permission level and run mode, not prompt quality. Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability.

The safest production rollout path:

1. Start: Reader + Review
2. Then: Privileged + Review
3. Finally: Privileged + Autonomous, only for narrow, trusted incident paths.
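One way to keep the two controls apart is to treat them as independent inputs to a single decision about a change-making action. The sketch below is only a mental model written in C#: the `DecideWrite` function, its string values, and its outcomes are hypothetical and are not part of any Azure SRE Agent API. It also assumes, as a simplification, that the Reader level never applies operational changes by itself.

```csharp
using System;

// Hypothetical mental model only; none of these names come from Azure SRE Agent.
// Simplifying assumption: Reader never applies operational changes by itself.
static string DecideWrite(string permission, string runMode) => (permission, runMode) switch
{
    ("Reader", _) => "investigate only",
    ("Privileged", "Review") => "apply after human approval",
    ("Privileged", "Autonomous") => "apply directly",
    _ => throw new ArgumentException($"Unknown combination: {permission} + {runMode}")
};

// The three stages of the rollout path, in order.
var rolloutPath = new[]
{
    ("Reader", "Review"),
    ("Privileged", "Review"),
    ("Privileged", "Autonomous"),
};

foreach (var (permission, runMode) in rolloutPath)
{
    Console.WriteLine($"{permission} + {runMode}: {DecideWrite(permission, runMode)}");
}
// Reader + Review: investigate only
// Privileged + Review: apply after human approval
// Privileged + Autonomous: apply directly
```

The point of the model is that moving one step along the rollout path changes exactly one input, so each step can be validated separately before the next is enabled.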
Demo environment

The full scripts and manifests are available if you want to reproduce this. Demo repository: github.com/hailugebru/azure-sre-agents-aks. The README includes setup and configuration details.

The environment uses an AKS cluster with node auto-provisioning (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the AKS Store sample microservices application, and Azure SRE Agent configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal. It provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself.

    Azure Monitor (Alert) → Action Group (Webhook) → Azure SRE Agent (Investigate / Fix) → AKS Cluster (Recover)
        ↓
    Teams notification + GitHub issue → GitHub Agent → PR for review

How the agent was configured

Configuration came down to four things: scope, permissions, incident intake, and response mode.

I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take. I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction: permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly.

For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates. I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow-up after remediation. For the complete setup, see the README.

A note on context.
The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts.

Two incidents, two response modes

These incidents occurred on the same cluster in one session and illustrate two realistic operating modes:

- Alert-triggered automation. The agent acts when Azure Monitor fires.
- Ad hoc chat investigation. An engineer sees a symptom first and asks the agent to investigate.

Both matter in real environments. The first is your scale path. The second is your operator-assist path.

Incident 1. CPU starvation (alert driven, ~8 min MTTR)

The makeline-service deployment manifest contained a CPU and memory configuration that was not viable for startup:

    resources:
      requests:
        cpu: 1m
        memory: 6Mi
      limits:
        cpu: 5m
        memory: 20Mi

Within five minutes, Azure Monitor fired the pod-not-healthy Sev1 alert. The agent picked it up immediately. Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code:

"Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario."

That is the kind of distinction that often takes an on-call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure.

The agent then:

- Identified three additional CPU-throttled pods at 112 to 200% of configured limit using kubectl top.
- Patched four workloads: makeline-service, virtual-customer, virtual-worker, and mongodb.
- Verified that all affected pods returned to a healthy running state with 0 restarts cluster wide.

Azure SRE Agent's Incident History blade confirming full cluster recovery: 4 patches applied, 0 unhealthy pods, no human intervention required.

Outcome. Full cluster recovery in ~8 minutes, 0 human interventions.

Incident 2. OOMKilled (chat driven, ~4 min MTTR)

For the second case, I deployed a deliberately undersized version of order-service:

    kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets

I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. CrashLoopBackOff is a container waiting reason, not a pod phase, so production coverage should come from Prometheus-based crash-loop signals rather than pod phase alone. Here is the PromQL query I use in Azure Monitor to catch this class of failure:

    sum by (namespace, pod) (
      (
        max_over_time(
          kube_pod_container_status_waiting_reason{
            namespace="pets",
            reason="CrashLoopBackOff"
          }[5m]
        ) == 1
      )
      and on (namespace, pod, container)
      (
        increase(
          kube_pod_container_status_restarts_total{
            namespace="pets"
          }[15m]
        ) > 0
      )
    ) > 0

This query fires when a container has been in CrashLoopBackOff within the last 5 minutes and its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces.

The chat prompt:

"The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it."

The agent's reasoning:

"Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload."
The agent increased the memory limit (20Mi to 128Mi) and request (10Mi to 50Mi), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts.

Outcome. Service recovered in ~4 minutes without any manual cluster interaction.

Side-by-side comparison

| Dimension | Incident 1: CPU starvation | Incident 2: OOMKilled |
| --- | --- | --- |
| Trigger | Azure Monitor alert (automated) | Engineer chat prompt (ad hoc) |
| Failure mode | CPU too low for startup probe to pass | Memory limit too low for process to start |
| Key signal | Exit code 1, probe timeout | Exit code 137, empty container logs |
| Blast radius | 4 workloads affected cluster wide | 1 workload in target namespace |
| Remediation | CPU request/limit patches across 4 deployments | Memory request/limit patch on 1 deployment |
| MTTR | ~8 min | ~4 min |
| Human interventions | 0 | 0 |

Why this matters

Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc kubectl commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow.

The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps.

In this lab, the impact was measurable:

| Metric | This demo with Azure SRE Agent |
| --- | --- |
| Alert to recovery | ~4 to 8 min |
| Human interventions | 0 |
| Scope of investigation | Cluster wide, automated |
| Correlate evidence and diagnose | ~2 min |
| Apply fix and verify | ~4 min |
| Post-incident follow-up | GitHub issue + draft PR |

These results came from a controlled run on April 10, 2026. Real-world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments.
Teams + GitHub follow-up

Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post-incident path matters.

After Incident 1 resolved, Azure SRE Agent used the GitHub connector to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix. The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review still remains the final control point before merge.

Setup details for the GitHub connector are in the demo repo README, and the official reference is in the Azure SRE Agent docs.

Azure SRE Agent fixes the live issue, and the GitHub follow-up prepares the durable source change so future deployments do not reintroduce the same configuration problem.

The operations-to-engineering handoff: Azure SRE Agent fixed the live cluster; GitHub Copilot agent prepares the durable source change so the same misconfiguration can't ship again.

In parallel, the Teams connector posted milestone updates during the incident:

- Investigation started.
- Root cause and remediation identified.
- Incident resolved.

Teams handled real-time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery.

Key takeaways

Three things to carry forward:

- Treat Azure SRE Agent as a governed incident response system, not a chatbot with infrastructure access. The most important controls are permission levels and run modes, not prompt quality.
- Anchor detection in your existing incident platforms. For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live.
- Use connectors to extend the workflow outward. Teams for real-time coordination, GitHub for durable engineering follow-up.

Start where you're comfortable. If you are just getting your feet wet, begin with one resource group, one incident type, and Review mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling Autonomous. Expand only once each layer is trusted.

Next steps

- Add Prometheus-based alert coverage for ImagePullBackOff and node resource pressure to complement the pod phase rule.
- Expand to multi-cluster managed scopes once the single-cluster path is trusted and validated.
- Explore how NAP and Azure SRE Agent complement each other: NAP manages infrastructure capacity, while the agent investigates and remediates incidents.

I'd like to thank Cary Chai, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review. His feedback sharpened both the accuracy and quality of this post.

Build Multi-Agent AI Apps on Azure App Service with Microsoft Agent Framework 1.0
Part 1 of 3: Multi-Agent AI on Azure App Service

This is part 1 of a three-part series on deploying and working with multi-agent AI on Azure App Service. Follow along to learn how to deploy, manage, observe, and secure your agents on Azure App Service.

A couple of months ago, we published a three-part series showing how to build multi-agent AI systems on Azure App Service using preview packages from the Microsoft Agent Framework (MAF) (formerly AutoGen / Semantic Kernel Agents). The series walked through async processing, the request-reply pattern, and client-side multi-agent orchestration, all running on App Service.

Since then, Microsoft Agent Framework has reached 1.0 GA, unifying AutoGen and Semantic Kernel into a single, production-ready agent platform. This post is a fresh start with the GA bits. We'll rebuild our travel-planner sample on the stable API surface, call out the breaking changes from preview, and get you up and running fast.

All of the code is in the companion repo: seligj95/app-service-multi-agent-maf-otel.

What Changed in MAF 1.0 GA

The 1.0 release is more than a version bump. Here's what moved:

- Unified platform. AutoGen and Semantic Kernel agent capabilities have converged into Microsoft.Agents.AI. One package, one API surface.
- Stable APIs with long-term support. The 1.0 contract is now locked for servicing. No more preview churn.
- Breaking change: Instructions on options removed. In preview, you set instructions through ChatClientAgentOptions.Instructions. In GA, pass them directly to the ChatClientAgent constructor.
- Breaking change: RunAsync parameter rename. The thread parameter is now session (type AgentSession). If you were using named arguments, this is a compile error.
- Microsoft.Extensions.AI upgraded. The framework moved from the 9.x preview of Microsoft.Extensions.AI to the stable 10.4.1 release.
- OpenTelemetry integration built in. The builder pipeline now includes UseOpenTelemetry() out of the box (more on that in Blog 2).
Our project references reflect the GA stack:

    <PackageReference Include="Microsoft.Agents.AI" Version="1.0.0" />
    <PackageReference Include="Microsoft.Extensions.AI" Version="10.4.1" />
    <PackageReference Include="Azure.AI.OpenAI" Version="2.1.0" />

Why Azure App Service for AI Agents?

If you're building with Microsoft Agent Framework, you need somewhere to run your agents. You could reach for Kubernetes, containers, or serverless, but for most agent workloads, Azure App Service is the sweet spot. Here's why:

- No infrastructure management. App Service is fully managed. No clusters to configure, no container orchestration to learn. Deploy your .NET or Python agent code and it just runs.
- Always On. Agent workflows can take minutes. App Service's Always On feature (on Premium tiers) ensures your background workers never go cold, so agents are ready to process requests instantly.
- WebJobs for background processing. Long-running agent workflows don't belong in HTTP request handlers. App Service's built-in WebJob support gives you a dedicated background worker that shares the same deployment, configuration, and managed identity, with no separate compute resource needed.
- Managed Identity everywhere. Zero secrets in your code. App Service's system-assigned managed identity authenticates to Azure OpenAI, Service Bus, Cosmos DB, and Application Insights automatically. No connection strings, no API keys, no rotation headaches.
- Built-in observability. Native integration with Application Insights and OpenTelemetry means you can see exactly what your agents are doing in production (more on this in Part 2).
- Enterprise-ready. VNet integration, deployment slots for safe rollouts, custom domains, auto-scaling rules, and built-in authentication. All the things you'll need when your agent POC becomes a production service.
- Cost-effective. A single P0v4 instance (~$75/month) hosts both your API and WebJob worker. Compare that to running separate container apps or a Kubernetes cluster for the same workload.

The bottom line: App Service lets you focus on building your agents, not managing infrastructure. And since MAF supports both .NET and Python, both first-class citizens on App Service, you're covered regardless of your language preference.

Architecture Overview

The sample is a travel planner that coordinates six specialized agents to build a personalized trip itinerary. Users fill out a form (destination, dates, budget, interests), and the system returns a comprehensive travel plan complete with weather forecasts, currency advice, a day-by-day itinerary, and a budget breakdown.

The Six Agents

- Currency Converter: calls the Frankfurter API for real-time exchange rates
- Weather Advisor: calls the National Weather Service API for forecasts and packing tips
- Local Knowledge Expert: cultural insights, customs, and hidden gems
- Itinerary Planner: day-by-day scheduling with timing and costs
- Budget Optimizer: allocates spend across categories and suggests savings
- Coordinator: assembles everything into a polished final plan

Four-Phase Workflow

| Phase | Agents | Execution |
| --- | --- | --- |
| 1: Parallel Gathering | Currency, Weather, Local Knowledge | Task.WhenAll |
| 2: Itinerary | Itinerary Planner | Sequential (uses Phase 1 context) |
| 3: Budget | Budget Optimizer | Sequential (uses Phase 2 output) |
| 4: Assembly | Coordinator | Final synthesis |

Infrastructure

- Azure App Service (P0v4): hosts the API and a continuous WebJob for background processing
- Azure Service Bus: decouples the API from heavy AI work (async request-reply)
- Azure Cosmos DB: stores task state, results, and per-agent chat histories (24-hour TTL)
- Azure OpenAI (GPT-4o): powers all agent LLM calls
- Application Insights + Log Analytics: monitoring and diagnostics

ChatClientAgent Deep Dive

At the core of every agent is ChatClientAgent from Microsoft.Agents.AI.
It wraps an IChatClient (from Microsoft.Extensions.AI) with instructions, a name, a description, and optionally a set of tools. This is client-side orchestration: you control the chat history, lifecycle, and execution order. No server-side Foundry agent resources are created.

Here's the BaseAgent pattern used by all six agents in the sample:

    // BaseAgent.cs: constructor for agents with tools
    Agent = new ChatClientAgent(
            chatClient,
            instructions: Instructions,
            name: AgentName,
            description: Description,
            tools: chatOptions.Tools?.ToList())
        .AsBuilder()
        .UseOpenTelemetry(sourceName: AgentName)
        .Build();

Notice the builder pipeline: .AsBuilder().UseOpenTelemetry(...).Build(). This opts every agent into the framework's built-in OpenTelemetry instrumentation with a single line. We'll explore what that telemetry looks like in Blog 2.

Invoking an agent is equally straightforward:

    // BaseAgent.cs: InvokeAsync
    public async Task<ChatMessage> InvokeAsync(
        IList<ChatMessage> chatHistory,
        CancellationToken cancellationToken = default)
    {
        var response = await Agent.RunAsync(
            chatHistory,
            session: null,
            options: null,
            cancellationToken);

        return response.Messages.LastOrDefault()
            ?? new ChatMessage(ChatRole.Assistant, "No response generated.");
    }

Key things to note:

- session: null is the renamed parameter (was thread in preview). We pass null because we manage chat history ourselves.
- The agent receives the full chatHistory list, so context accumulates across turns.
- Simple agents (Local Knowledge, Itinerary Planner, Budget Optimizer, Coordinator) use the tool-less constructor; agents that call external APIs (Currency, Weather) use the constructor that accepts ChatOptions with tools.

Tool Integration

Two of our agents, Weather Advisor and Currency Converter, call real external APIs through the MAF tool-calling pipeline. Tools are registered using AIFunctionFactory.Create() from Microsoft.Extensions.AI.
Here's how the WeatherAdvisorAgent wires up its tool:

    // WeatherAdvisorAgent.cs
    private static ChatOptions CreateChatOptions(
        IWeatherService weatherService, ILogger logger)
    {
        var chatOptions = new ChatOptions
        {
            Tools = new List<AITool>
            {
                AIFunctionFactory.Create(
                    GetWeatherForecastFunction(weatherService, logger))
            }
        };
        return chatOptions;
    }

GetWeatherForecastFunction returns a Func<double, double, int, Task<string>> that the model can call with latitude, longitude, and number of days. Under the hood, it hits the National Weather Service API and returns a formatted forecast string. The Currency Converter follows the same pattern with the Frankfurter API.

This is one of the nicest parts of the GA API: you write a plain C# method, wrap it with AIFunctionFactory.Create(), and the framework handles the JSON schema generation, function-call parsing, and response routing automatically.

Multi-Phase Workflow Orchestration

The TravelPlanningWorkflow class coordinates all six agents. The key insight is that the orchestration is just C# code: no YAML, no graph DSL, no special runtime. You decide when agents run, what context they receive, and how results flow between phases.

    // Phase 1: Parallel Information Gathering
    var gatheringTasks = new[]
    {
        GatherCurrencyInfoAsync(request, state, progress, cancellationToken),
        GatherWeatherInfoAsync(request, state, progress, cancellationToken),
        GatherLocalKnowledgeAsync(request, state, progress, cancellationToken)
    };
    await Task.WhenAll(gatheringTasks);

After Phase 1 completes, results are stored in a WorkflowState object, a simple dictionary-backed container that holds per-agent chat histories and contextual data:

    // WorkflowState.cs
    public Dictionary<string, object> Context { get; set; } = new();
    public Dictionary<string, List<ChatMessage>> AgentChatHistories { get; set; } = new();

Phases 2–4 run sequentially, each pulling context from the previous phase.
For example, the Itinerary Planner receives weather and local knowledge gathered in Phase 1:

    var localKnowledge = state.GetFromContext<string>("LocalKnowledge") ?? "";
    var weatherAdvice = state.GetFromContext<string>("WeatherAdvice") ?? "";
    var itineraryChatHistory = state.GetChatHistory("ItineraryPlanner");

    itineraryChatHistory.Add(new ChatMessage(ChatRole.User,
        $"Create a detailed {days}-day itinerary for {request.Destination}..."
        + $"\n\nWEATHER INFORMATION:\n{weatherAdvice}"
        + $"\n\nLOCAL KNOWLEDGE & TIPS:\n{localKnowledge}"));

    var itineraryResponse = await _itineraryAgent.InvokeAsync(
        itineraryChatHistory, cancellationToken);

This pattern of parallel fan-out followed by sequential context enrichment is simple, testable, and easy to extend. Need a seventh agent? Add it to the appropriate phase and wire it into WorkflowState.

Async Request-Reply Pattern

A multi-agent workflow with six LLM calls (some with tool invocations) can easily run 30–60 seconds. That's well beyond typical HTTP timeout expectations and not a great user experience for a synchronous request. We use the Async Request-Reply pattern to handle this:

1. The API receives the travel plan request and immediately queues a message to Service Bus.
2. It stores an initial task record in Cosmos DB with status queued and returns a taskId to the client.
3. A continuous WebJob (running as a separate process on the same App Service plan) picks up the message, executes the full multi-agent workflow, and writes the result back to Cosmos DB.
4. The client polls the API for status updates until the task reaches completed.

This pattern keeps the API responsive, makes the heavy work retriable (Service Bus handles retries and dead-lettering), and lets the WebJob run independently: you can restart it without affecting the API. We covered this pattern in detail in the previous series, so we won't repeat the plumbing here.
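For readers who have not seen the previous series, the shape of the pattern can be sketched in a few lines. This is a minimal in-memory illustration, not the sample's implementation: a `Channel<string>` stands in for the Service Bus queue, a `ConcurrentDictionary` stands in for the Cosmos DB task container, and the `Submit` helper and the intermediate "processing" status are assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Channels;
using System.Threading.Tasks;

// In-memory stand-ins (assumptions for illustration only):
// the channel plays the role of the Service Bus queue,
// the dictionary plays the role of the Cosmos DB task container.
var queue = Channel.CreateUnbounded<string>();
var store = new ConcurrentDictionary<string, (string Status, string Result)>();

// "API": record the task as queued, enqueue a message, return the id immediately.
string Submit()
{
    var taskId = Guid.NewGuid().ToString("N");
    store[taskId] = ("queued", "");
    queue.Writer.TryWrite(taskId);
    return taskId;
}

// "WebJob": drain the queue, run the slow work, write the result back.
var worker = Task.Run(async () =>
{
    await foreach (var taskId in queue.Reader.ReadAllAsync())
    {
        store[taskId] = ("processing", "");
        await Task.Delay(200); // placeholder for the multi-agent workflow
        store[taskId] = ("completed", "travel plan goes here");
    }
});

// "Client": submit once, then poll until the task reaches completed.
var id = Submit();
while (store[id].Status != "completed")
{
    Console.WriteLine($"status: {store[id].Status}");
    await Task.Delay(50);
}
Console.WriteLine($"result: {store[id].Result}");

queue.Writer.Complete(); // lets the worker's read loop finish
await worker;
```

The submit call returns before any work has run, which is the property that keeps the real API responsive; the worker and the client only ever communicate through the stored task record.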
Deploy with azd

The repo is wired up with the Azure Developer CLI for one-command provisioning and deployment:

    git clone https://github.com/seligj95/app-service-multi-agent-maf-otel.git
    cd app-service-multi-agent-maf-otel
    azd auth login
    azd up

azd up provisions the following resources via Bicep:

- Azure App Service (P0v4 Windows) with a continuous WebJob
- Azure Service Bus namespace and queue
- Azure Cosmos DB account, database, and containers
- Azure AI Services (Azure OpenAI with GPT-4o deployment)
- Application Insights and Log Analytics workspace
- Managed Identity with all necessary role assignments

After deployment completes, azd outputs the App Service URL. Open it in your browser, fill in the travel form, and watch six agents collaborate on your trip plan in real time.

What's Next

We now have a production-ready multi-agent app running on App Service with the GA Microsoft Agent Framework. But how do you actually observe what these agents are doing? When six agents are making LLM calls, invoking tools, and passing context between phases, you need visibility into every step.

In the next post, we'll dive deep into how we instrumented these agents with OpenTelemetry and the new Agents (Preview) view in Application Insights, giving you full visibility into agent runs, token usage, tool calls, and model performance. You already saw the .UseOpenTelemetry() call in the builder pipeline; Blog 2 shows what that telemetry looks like end to end and how to light up the new Agents experience in the Azure portal. Stay tuned!
Resources

- Sample repo: app-service-multi-agent-maf-otel
- Microsoft Agent Framework 1.0 GA Announcement
- Microsoft Agent Framework Documentation
- Previous Series, Part 3: Client-Side Multi-Agent Orchestration on App Service
- Microsoft.Extensions.AI Documentation
- Azure App Service Documentation
- Blog 2: Monitor AI Agents on App Service with OpenTelemetry and the New Application Insights Agents View
- Blog 3: Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit

Monitor AI Agents on App Service with OpenTelemetry and the New Application Insights Agents View
Part 2 of 3: In Blog 1, we deployed a multi-agent travel planner on Azure App Service using the Microsoft Agent Framework (MAF) 1.0 GA. This post dives deep into how we instrumented those agents with OpenTelemetry and lit up the brand-new Agents (Preview) view in Application Insights. 📋 Prerequisite: This post assumes you've followed the guidance in Blog 1 to deploy the multi-agent travel planner to Azure App Service. If you haven't deployed the app yet, start there first — you'll need a running App Service with the agents, Service Bus, Cosmos DB, and Azure OpenAI provisioned before the monitoring steps in this post will work. Deploying Agents Is Only Half the Battle In Blog 1, we walked through deploying a multi-agent travel planning application on Azure App Service. Six specialized agents — a Coordinator, Currency Converter, Weather Advisor, Local Knowledge Expert, Itinerary Planner, and Budget Optimizer — work together to generate comprehensive travel plans. The architecture uses an ASP.NET Core API backed by a WebJob for async processing, Azure Service Bus for messaging, and Azure OpenAI for the brains. But here's the thing: deploying agents to production is only half the battle. Once they're running, you need answers to questions like: Which agent is consuming the most tokens? How long does the Itinerary Planner take compared to the Weather Advisor? Is the Coordinator making too many LLM calls per workflow? When something goes wrong, which agent in the pipeline failed? Traditional APM gives you HTTP latencies and exception rates. That's table stakes. For AI agents, you need to see inside the agent — the model calls, the tool invocations, the token spend. And that's exactly what Application Insights' new Agents (Preview) view delivers, powered by OpenTelemetry and the GenAI semantic conventions. Let's break down how it all works. 
The Agents (Preview) View in Application Insights Azure Application Insights now includes a dedicated Agents (Preview) blade that provides unified monitoring purpose-built for AI agents. It's not just a generic dashboard — it understands agent concepts natively. Whether your agents are built with Microsoft Agent Framework, Azure AI Foundry, Copilot Studio, or a third-party framework, this view lights up as long as your telemetry follows the GenAI semantic conventions. Here's what you get out of the box: Agent dropdown filter — A dropdown populated by gen_ai.agent.name values from your telemetry. In our travel planner, this shows all six agents: "Travel Planning Coordinator", "Currency Conversion Specialist", "Weather & Packing Advisor", "Local Expert & Cultural Guide", "Itinerary Planning Expert", and "Budget Optimization Specialist". You can filter the entire dashboard to one agent or view them all. Token usage metrics — Visualizations of input and output token consumption, broken down by agent. Instantly see which agents are the most expensive to run. Operational metrics — Latency distributions, error rates, and throughput for each agent. Spot performance regressions before users notice. End-to-end transaction details — Click into any trace to see the full workflow: which agents were invoked, what tools they called, how long each step took. The "simple view" renders agent steps in a story-like format that's remarkably easy to follow. Grafana integration — One-click export to Azure Managed Grafana for custom dashboards and alerting. The key insight: this view isn't magic. It works because the telemetry is structured using well-defined semantic conventions. Let's look at those next. 📖 Docs: Application Insights Agents (Preview) view documentation GenAI Semantic Conventions — The Foundation The entire Agents view is powered by the OpenTelemetry GenAI semantic conventions. 
These are a standardized set of span attributes that describe AI agent behavior in a way that any observability backend can understand. Think of them as the "contract" between your instrumented code and Application Insights. Let's walk through the key attributes and why each one matters:

gen_ai.agent.name

This is the human-readable name of the agent. In our travel planner, each agent sets this via the name parameter when constructing the MAF ChatClientAgent, for example "Weather & Packing Advisor" or "Budget Optimization Specialist". This is what populates the agent dropdown in the Agents view. Without this attribute, Application Insights would have no way to distinguish one agent from another in your telemetry. It's the single most important attribute for agent-level monitoring.

gen_ai.agent.description

A brief description of what the agent does. Our Weather Advisor, for example, is described as "Provides weather forecasts, packing recommendations, and activity suggestions based on destination weather conditions." This metadata helps operators and on-call engineers quickly understand an agent's role without diving into source code. It shows up in trace details and helps contextualize what you're looking at when debugging.

gen_ai.agent.id

A unique identifier for the agent instance. In MAF, this is typically an auto-generated GUID. While gen_ai.agent.name is the human-friendly label, gen_ai.agent.id is the machine-stable identifier. If you rename an agent, the ID stays the same, which is important for tracking agent behavior across code deployments.

gen_ai.operation.name

The type of operation being performed. Values include "chat" for standard LLM calls and "execute_tool" for tool/function invocations. In our travel planner, when the Weather Advisor calls the GetWeatherForecast function via NWS, or when the Currency Converter calls ConvertCurrency via the Frankfurter API, those tool calls get their own spans with gen_ai.operation.name = "execute_tool". This lets you measure LLM think-time separately from tool execution time, a critical distinction for performance optimization.

gen_ai.request.model / gen_ai.response.model

The model used for the request and the model that actually served the response (these can differ when providers do model routing). In our case, both are "gpt-4o" since that's what we deploy via Azure OpenAI. These attributes let you track model usage across agents, spot unexpected model assignments, and correlate performance changes with model updates.

gen_ai.usage.input_tokens / gen_ai.usage.output_tokens

Token consumption per LLM call. This is what powers the token usage visualizations in the Agents view. The Coordinator agent, which aggregates results from all five specialist agents, tends to have higher output token counts because it's synthesizing a full travel plan. The Currency Converter, which makes focused API calls, uses fewer tokens overall. These attributes let you answer the question "which agent is costing me the most?" and, more importantly, let you set alerts when token usage spikes unexpectedly.

gen_ai.system

The AI system or provider. In our case, this is "openai" (set by the Azure OpenAI client instrumentation). If you're using multiple AI providers (say, Azure OpenAI for planning and a local model for classification), this attribute lets you filter and compare.

Together, these attributes create a rich, structured view of agent behavior that goes far beyond generic tracing. They're the reason Application Insights can render agent-specific dashboards with token breakdowns, latency distributions, and end-to-end workflow views. Without these conventions, all you'd see is opaque HTTP calls to an OpenAI endpoint.

💡 Key takeaway: The GenAI semantic conventions are what transform generic distributed traces into agent-aware observability. They're the bridge between your code and the Agents view.
Any framework that emits these attributes — MAF, Semantic Kernel, LangChain — can light up this dashboard. Two Layers of OpenTelemetry Instrumentation Our travel planner sample instruments at two distinct levels, each capturing different aspects of agent behavior. Let's look at both. Layer 1: IChatClient-Level Instrumentation The first layer instruments at the IChatClient level using Microsoft.Extensions.AI . This is where we wrap the Azure OpenAI chat client with OpenTelemetry: var client = new AzureOpenAIClient(azureOpenAIEndpoint, new DefaultAzureCredential()); // Wrap with OpenTelemetry to emit GenAI semantic convention spans return client.GetChatClient(modelDeploymentName).AsIChatClient() .AsBuilder() .UseOpenTelemetry() .Build(); This single .UseOpenTelemetry() call intercepts every LLM call and emits spans with: gen_ai.system — the AI provider (e.g., "openai" ) gen_ai.request.model / gen_ai.response.model — which model was used gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — token consumption per call gen_ai.operation.name — the operation type ( "chat" ) Think of this as the "LLM layer" — it captures what the model is doing regardless of which agent called it. It's model-centric telemetry. Layer 2: Agent-Level Instrumentation The second layer instruments at the agent level using MAF 1.0 GA's built-in OpenTelemetry support. 
This happens in the BaseAgent class that all our agents inherit from:

    Agent = new ChatClientAgent(
            chatClient,
            instructions: Instructions,
            name: AgentName,
            description: Description,
            tools: chatOptions.Tools?.ToList())
        .AsBuilder()
        .UseOpenTelemetry(sourceName: AgentName)
        .Build();

The .UseOpenTelemetry(sourceName: AgentName) call on the MAF agent builder emits a different set of spans:

- gen_ai.agent.name: the human-readable agent name (e.g., "Weather & Packing Advisor")
- gen_ai.agent.description: what the agent does
- gen_ai.agent.id: the unique agent identifier
- Agent invocation traces: spans that represent the full lifecycle of an agent call

This is the "agent layer". It captures which agent is doing the work and provides the identity information that powers the Agents view dropdown and per-agent filtering.

Why Both Layers?

When both layers are active, you get the richest possible telemetry. The agent-level spans nest around the LLM-level spans, creating a trace hierarchy that looks like:

    Agent: "Weather & Packing Advisor" (gen_ai.agent.name)
    └── chat (gen_ai.operation.name)
        ├── model: gpt-4o, input_tokens: 450, output_tokens: 120
        └── execute_tool: GetWeatherForecast
            └── chat (follow-up with tool results)
                └── model: gpt-4o, input_tokens: 680, output_tokens: 350

There is a tradeoff: with both layers active, you may see some span duplication, since both the IChatClient wrapper and the MAF agent wrapper emit spans for the same underlying LLM call. If you find the telemetry too noisy, you can disable one layer:

- Agent layer only (remove .UseOpenTelemetry() from the IChatClient): you get agent identity but lose per-call token breakdowns.
- IChatClient layer only (remove .UseOpenTelemetry() from the agent builder): you get detailed LLM metrics but lose agent identity in the Agents view.

For the fullest experience with the Agents (Preview) view, we recommend keeping both layers active.
The official sample uses both, and the Agents view is designed to handle the overlapping spans gracefully. 📖 Docs: MAF Observability Guide Exporting Telemetry to Application Insights Emitting OpenTelemetry spans is only useful if they land somewhere you can query them. The good news is that Azure App Service and Application Insights have deep native integration — App Service can auto-instrument your app, forward platform logs, and surface health metrics out of the box. For a full overview of monitoring capabilities, see Monitor Azure App Service. For our AI agent scenario, we go beyond the built-in platform telemetry. We need the GenAI semantic convention spans that we configured in the previous sections to flow into App Insights so the Agents (Preview) view can render them. Our travel planner has two host processes — the ASP.NET Core API and a WebJob — and each requires a slightly different exporter setup. ASP.NET Core API — Azure Monitor OpenTelemetry Distro For the API, it's a single line. The Azure Monitor OpenTelemetry Distro handles everything: // Configure OpenTelemetry with Azure Monitor for traces, metrics, and logs. // The APPLICATIONINSIGHTS_CONNECTION_STRING env var is auto-discovered. builder.Services.AddOpenTelemetry().UseAzureMonitor(); That's it. The distro automatically: Discovers the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable Configures trace, metric, and log exporters to Application Insights Sets up appropriate sampling and batching Registers standard ASP.NET Core HTTP instrumentation This is the recommended approach for any ASP.NET Core application. One NuGet package ( Azure.Monitor.OpenTelemetry.AspNetCore ), one line of code, zero configuration files. WebJob — Manual Exporter Setup The WebJob is a non-ASP.NET Core host (it uses Host.CreateApplicationBuilder ), so the distro's convenience method isn't available. 
Instead, we configure the exporters explicitly: // Configure OpenTelemetry with Azure Monitor for the WebJob (non-ASP.NET Core host). // The APPLICATIONINSIGHTS_CONNECTION_STRING env var is auto-discovered. builder.Services.AddOpenTelemetry() .ConfigureResource(r => r.AddService("TravelPlanner.WebJob")) .WithTracing(t => t .AddSource("*") .AddAzureMonitorTraceExporter()) .WithMetrics(m => m .AddMeter("*") .AddAzureMonitorMetricExporter()); builder.Logging.AddOpenTelemetry(o => o.AddAzureMonitorLogExporter()); A few things to note: .AddSource("*") — Subscribes to all trace sources, including the ones emitted by MAF's .UseOpenTelemetry(sourceName: AgentName) . In production, you might narrow this to specific source names for performance. .AddMeter("*") — Similarly captures all metrics, including the GenAI metrics emitted by the instrumentation layers. .ConfigureResource(r => r.AddService("TravelPlanner.WebJob")) — Tags all telemetry with the service name so you can distinguish API vs. WebJob telemetry in Application Insights. The connection string is still auto-discovered from the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable — no need to pass it explicitly. The key difference between these two approaches is ceremony, not capability. Both send the same GenAI spans to Application Insights; the Agents view works identically regardless of which exporter setup you use. 📖 Docs: Azure Monitor OpenTelemetry Distro Infrastructure as Code — Provisioning the Monitoring Stack The monitoring infrastructure is provisioned via Bicep modules alongside the rest of the application's Azure resources. Here's how it fits together. 
Log Analytics Workspace infra/core/monitor/loganalytics.bicep creates the Log Analytics workspace that backs Application Insights: resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2023-09-01' = { name: name location: location tags: tags properties: { sku: { name: 'PerGB2018' } retentionInDays: 30 } } Application Insights infra/core/monitor/appinsights.bicep creates a workspace-based Application Insights resource connected to Log Analytics: resource appInsights 'Microsoft.Insights/components@2020-02-02' = { name: name location: location tags: tags kind: 'web' properties: { Application_Type: 'web' WorkspaceResourceId: logAnalyticsWorkspaceId } } output connectionString string = appInsights.properties.ConnectionString Wiring It All Together In infra/main.bicep , the Application Insights connection string is passed as an app setting to the App Service: appSettings: { APPLICATIONINSIGHTS_CONNECTION_STRING: appInsights.outputs.connectionString // ... other app settings } This is the critical glue: when the app starts, the OpenTelemetry distro (or manual exporters) auto-discover this environment variable and start sending telemetry to your Application Insights resource. No connection strings in code, no configuration files — it's all infrastructure-driven. The same connection string is available to both the API and the WebJob since they run on the same App Service. All agent telemetry from both host processes flows into a single Application Insights resource, giving you a unified view across the entire application. See It in Action Once the application is deployed and processing travel plan requests, here's how to explore the agent telemetry in Application Insights. Step 1: Open the Agents (Preview) View In the Azure portal, navigate to your Application Insights resource. In the left nav, look for Agents (Preview) under the Investigations section. This opens the unified agent monitoring dashboard. 
Step 2: Filter by Agent The agent dropdown at the top of the page is populated by the gen_ai.agent.name values in your telemetry. You'll see all six agents listed: Travel Planning Coordinator Currency Conversion Specialist Weather & Packing Advisor Local Expert & Cultural Guide Itinerary Planning Expert Budget Optimization Specialist Select a specific agent to filter the entire dashboard — token usage, latency, error rate — down to that one agent. Step 3: Review Token Usage The token usage tile shows total input and output token consumption over your selected time range. Compare agents to find your biggest spenders. In our testing, the Coordinator agent consistently uses the most output tokens because it aggregates and synthesizes results from all five specialists. Step 4: Drill into Traces Click "View Traces with Agent Runs" to see all agent executions. Each row represents a workflow run. You can filter by time range, status (success/failure), and specific agent. Step 5: End-to-End Transaction Details Click any trace to open the end-to-end transaction details. The "simple view" renders the agent workflow as a story — showing each step, which agent handled it, how long it took, and what tools were called. For a full travel plan, you'll see the Coordinator dispatch work to each specialist, tool calls to the NWS weather API and Frankfurter currency API, and the final aggregation step. Grafana Dashboards The Agents (Preview) view in Application Insights is great for ad-hoc investigation. For ongoing monitoring and alerting, Azure Managed Grafana provides prebuilt dashboards specifically designed for agent workloads. From the Agents view, click "Explore in Grafana" to jump directly into these dashboards: Agent Framework Dashboard — Per-agent metrics including token usage trends, latency percentiles, error rates, and throughput over time. Pin this to your operations wall. 
Agent Framework Workflow Dashboard: Workflow-level metrics showing how multi-agent orchestrations perform end-to-end. See how long complete travel plans take, identify bottleneck agents, and track success rates.

These dashboards query the same underlying data in Log Analytics, so there's zero additional instrumentation needed. If your telemetry lights up the Agents view, it lights up Grafana too.

Key Packages Summary

Here are the NuGet packages that make this work, pulled from the actual project files:

| Package | Version | Purpose |
| --- | --- | --- |
| Azure.Monitor.OpenTelemetry.AspNetCore | 1.3.0 | Azure Monitor OTEL Distro for ASP.NET Core (API). One-line setup for traces, metrics, and logs. |
| Azure.Monitor.OpenTelemetry.Exporter | 1.3.0 | Azure Monitor OTEL exporter for non-ASP.NET Core hosts (WebJob). Trace, metric, and log exporters. |
| Microsoft.Agents.AI | 1.0.0 | MAF 1.0 GA: ChatClientAgent, .UseOpenTelemetry() for agent-level instrumentation. |
| Microsoft.Extensions.AI | 10.4.1 | IChatClient abstraction with .UseOpenTelemetry() for LLM-level instrumentation. |
| OpenTelemetry.Extensions.Hosting | 1.11.2 | OTEL dependency injection integration for Host.CreateApplicationBuilder (WebJob). |
| Microsoft.Extensions.AI.OpenAI | 10.4.1 | OpenAI/Azure OpenAI adapter for IChatClient. Bridges the Azure OpenAI SDK to the M.E.AI abstraction. |

Wrapping Up

Let's zoom out. So far in this three-part series, we've gone from zero to a fully observable, production-grade multi-agent AI application on Azure App Service:

- Blog 1 covered deploying the multi-agent travel planner with MAF 1.0 GA: the agents, the architecture, the infrastructure.
- Blog 2 (this post) showed how to instrument those agents with OpenTelemetry, explained the GenAI semantic conventions that make agent-aware monitoring possible, and walked through the new Agents (Preview) view in Application Insights.
- Blog 3 will show you how to secure those agents for production with the Microsoft Agent Governance Toolkit.
The pattern is straightforward:

1. Add .UseOpenTelemetry() at the IChatClient level for LLM metrics.
2. Add .UseOpenTelemetry(sourceName: AgentName) at the MAF agent level for agent identity.
3. Export to Application Insights via the Azure Monitor distro (one line) or manual exporters.
4. Wire the connection string through Bicep and environment variables.
5. Open the Agents (Preview) view and start monitoring.

With MAF 1.0 GA's built-in OpenTelemetry support and Application Insights' new Agents view, you get production-grade observability for AI agents with minimal code. The GenAI semantic conventions ensure your telemetry is structured, portable, and understood by any compliant backend. And because it's all standard OpenTelemetry, you're not locked into any single vendor: swap the exporter and your telemetry goes to Jaeger, Grafana, Datadog, or wherever you need it.

Now go see what your agents are up to and check out Blog 3.

Resources

- Sample repository: seligj95/app-service-multi-agent-maf-otel
- App Insights Agents (Preview) view: Documentation
- GenAI Semantic Conventions: OpenTelemetry GenAI Registry
- MAF Observability Guide: Microsoft Agent Framework Observability
- Azure Monitor OpenTelemetry Distro: Enable OpenTelemetry for .NET
- Grafana Agent Framework Dashboard: aka.ms/amg/dash/af-agent
- Grafana Workflow Dashboard: aka.ms/amg/dash/af-workflow
- Blog 1: Deploy Multi-Agent AI Apps on Azure App Service with MAF 1.0 GA
- Blog 3: Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit

Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit
Part 3 of 3 — Multi-Agent AI on Azure App Service In Blog 1, we built a multi-agent travel planner with Microsoft Agent Framework 1.0 on App Service. In Blog 2, we added observability with OpenTelemetry and the new Application Insights Agents view. Now in Part 3, we secure those agents for production with the Microsoft Agent Governance Toolkit. This post assumes you've followed the guidance in Blog 1 to deploy the multi-agent travel planner to Azure App Service. If you haven't deployed the app yet, start there first. The governance gap Our travel planner works. It's observable. But here's the question I'm hearing from customers: "How do I make sure my agents don't do something they shouldn't?" It's a fair question. Our six agents — Coordinator, Currency Converter, Weather Advisor, Local Knowledge, Itinerary Planner, and Budget Optimizer — can call external APIs, process user data, and make autonomous decisions. In a demo, that's impressive. In production, that's a risk surface. Consider what can go wrong with ungoverned agents: Unauthorized API calls — An agent calls an external API it was never intended to use, leaking data or incurring costs Sensitive data exposure — An agent passes PII to a third-party service without consent controls Runaway token spend — A recursive agent loop burns through your OpenAI budget in minutes Tool misuse — A prompt injection tricks an agent into executing a tool it shouldn't Cascading failures — One agent's error propagates through the entire multi-agent workflow These aren't theoretical. In December 2025, OWASP published the Top 10 for Agentic Applications — the first formal taxonomy of risks specific to autonomous AI agents, including goal hijacking, tool misuse, identity abuse, memory poisoning, and rogue agents. Regulators are paying attention too: the EU AI Act's high-risk AI obligations take effect in August 2026, and the Colorado AI Act becomes enforceable in June 2026. 
The bottom line: if you're running agents in production, you need governance. Not eventually, now.

What the Agent Governance Toolkit does

The Agent Governance Toolkit is an open-source project (MIT license) from Microsoft that brings runtime security governance to autonomous AI agents. It's the first toolkit to address all 10 OWASP agentic AI risks with deterministic, sub-millisecond policy enforcement.

The toolkit is organized into 7 packages:

| Package | What it does | Think of it as... |
| --- | --- | --- |
| Agent OS | Stateless policy engine, intercepts every action before execution (<0.1ms p99) | The kernel for AI agents |
| Agent Mesh | Cryptographic identity (DIDs), inter-agent trust protocol, dynamic trust scoring | mTLS for agents |
| Agent Runtime | Execution rings (like CPU privilege levels), saga orchestration, kill switch | Process isolation for agents |
| Agent SRE | SLOs, error budgets, circuit breakers, chaos engineering | SRE practices for agents |
| Agent Compliance | Automated governance verification, regulatory mapping (EU AI Act, HIPAA, SOC2) | Compliance-as-code |
| Agent Marketplace | Plugin lifecycle management, Ed25519 signing, supply-chain security | Package manager security |
| Agent Lightning | RL training governance with policy-enforced runners | Safe training guardrails |

The toolkit is available in Python, TypeScript, Rust, Go, and .NET. It's framework-agnostic: it works with MAF, LangChain, CrewAI, Google ADK, and more. For our ASP.NET Core travel planner, we'll use the .NET SDK via NuGet (Microsoft.AgentGovernance).

For this blog, we're focusing on three packages:

- Agent OS: the policy engine that intercepts and evaluates every agent action
- Agent Compliance: regulatory mapping and audit trail generation
- Agent SRE: SLOs and circuit breakers for agent reliability

How easy it was to add governance

Here's the part that surprised me. I expected adding governance to a production agent system to be a multi-hour effort: new infrastructure, complex configuration, extensive refactoring.
Instead, it took about 30 minutes. Here's exactly what we changed:

Step 1: Add the NuGet package

One new package reference added to TravelPlanner.Shared.csproj, next to the two that were already there:

    <ItemGroup>
      <!-- Existing packages -->
      <PackageReference Include="Azure.Monitor.OpenTelemetry.AspNetCore" Version="1.3.0" />
      <PackageReference Include="Microsoft.Agents.AI" Version="1.0.0" />
      <!-- NEW: Agent Governance Toolkit (single package, all features included) -->
      <PackageReference Include="Microsoft.AgentGovernance" Version="3.0.2" />
    </ItemGroup>

Step 2: Create the policy file

One new file: governance-policies.yaml in the project root. This is where all your governance rules live:

    apiVersion: governance.toolkit/v1
    name: travel-planner-governance
    description: Policy enforcement for the multi-agent travel planner on App Service
    scope: global
    defaultAction: deny
    rules:
      - name: allow-currency-conversion
        condition: "tool == 'ConvertCurrency'"
        action: allow
        priority: 10
        description: Allow Currency Converter agent to call Frankfurter exchange rate API
      - name: allow-weather-forecast
        condition: "tool == 'GetWeatherForecast'"
        action: allow
        priority: 10
        description: Allow Weather Advisor agent to call NWS forecast API
      - name: allow-weather-alerts
        condition: "tool == 'GetWeatherAlerts'"
        action: allow
        priority: 10
        description: Allow Weather Advisor agent to check NWS weather alerts

Step 3: One line in BaseAgent.cs

This is the moment. Here's our BaseAgent.cs before:

    Agent = new ChatClientAgent(
            chatClient,
            instructions: Instructions,
            name: AgentName,
            description: Description)
        .AsBuilder()
        .UseOpenTelemetry(sourceName: AgentName)
        .Build();

And after, with the same builder chain held in a builder variable instead of being built immediately:

    var kernel = serviceProvider.GetService<GovernanceKernel>();
    if (kernel is not null)
        builder.UseGovernance(kernel, AgentName);
    Agent = builder.Build();

One line of intent, two lines of null-safety.
The .UseGovernance(kernel, AgentName) call intercepts every tool/function invocation in the agent's pipeline, evaluating it against the loaded policies before execution. If the GovernanceKernel isn't registered (governance disabled), agents work exactly as before — no crash, no code change needed. Here's the full updated constructor using IServiceProvider to optionally resolve governance: using AgentGovernance; using Microsoft.Extensions.DependencyInjection; public abstract class BaseAgent : IAgent { protected readonly ILogger Logger; protected readonly AgentOptions Options; protected readonly AIAgent Agent; // Constructor for simple agents without tools protected BaseAgent( ILogger logger, IOptions<AgentOptions> options, IChatClient chatClient, IServiceProvider serviceProvider) { Logger = logger; Options = options.Value; var builder = new ChatClientAgent( chatClient, instructions: Instructions, name: AgentName, description: Description) .AsBuilder() .UseOpenTelemetry(sourceName: AgentName); var kernel = serviceProvider.GetService<GovernanceKernel>(); if (kernel is not null) builder.UseGovernance(kernel, AgentName); Agent = builder.Build(); } // Constructor for agents with tools protected BaseAgent( ILogger logger, IOptions<AgentOptions> options, IChatClient chatClient, ChatOptions chatOptions, IServiceProvider serviceProvider) { Logger = logger; Options = options.Value; var builder = new ChatClientAgent( chatClient, instructions: Instructions, name: AgentName, description: Description, tools: chatOptions.Tools?.ToList()) .AsBuilder() .UseOpenTelemetry(sourceName: AgentName); var kernel = serviceProvider.GetService<GovernanceKernel>(); if (kernel is not null) builder.UseGovernance(kernel, AgentName); Agent = builder.Build(); } // ... rest unchanged } Step 4: DI registrations in Program.cs A few lines to wire up governance in the dependency injection container: using AgentGovernance; // ... existing builder setup ... 
// Configure OpenTelemetry with Azure Monitor (existing — from Blog 2) builder.Services.AddOpenTelemetry().UseAzureMonitor(); // NEW: Configure Agent Governance Toolkit // Load policy from YAML, register as singleton. Agents resolve via IServiceProvider. var policyPath = Path.Combine(builder.Environment.ContentRootPath, "governance-policies.yaml"); if (File.Exists(policyPath)) { try { var yaml = File.ReadAllText(policyPath); var kernel = new GovernanceKernel(new GovernanceOptions { EnableAudit = true, EnableMetrics = true }); kernel.LoadPolicyFromYaml(yaml); builder.Services.AddSingleton(kernel); Console.WriteLine($"[Governance] Loaded policies from {policyPath}"); } catch (Exception ex) { Console.WriteLine($"[Governance] Failed to load: {ex.Message}. Running without governance."); } } That's it. Your agents are now governed. Let me repeat that because it's the core message of this blog: we added production governance to a six-agent system by adding one NuGet package, creating one YAML policy file, adding a few lines to our base agent class, and registering the governance kernel in DI. No new infrastructure. No complex rewiring. No multi-sprint project. If you followed Blog 1 and Blog 2, you can do this in 30 minutes. Policy flexibility deep-dive The YAML policy language is intentionally simple to start with, but it supports real complexity when you need it. Let's walk through what each policy in our file does. API allowlists and blocklists Our travel planner calls two external APIs: Frankfurter (currency exchange) and the National Weather Service. The defaultAction: deny combined with explicit allow rules ensures agents can only call these approved tools. 
If an agent attempts to call any other function — whether through a prompt injection or a bug — the call is blocked before it executes: defaultAction: deny rules: - name: allow-currency-conversion condition: "tool == 'ConvertCurrency'" action: allow priority: 10 - name: allow-weather-forecast condition: "tool == 'GetWeatherForecast'" action: allow priority: 10 When a blocked call happens, you'll see output like this in your logs: [Governance] Tool call 'DeleteDatabase' blocked for agent 'LocalKnowledgeAgent': No matching rules; default action is deny. Condition language The condition field supports equality checks, pattern matching, and boolean logic. You can match on tool name, agent ID, or any key in the evaluation context: # Match a specific tool condition: "tool == 'ConvertCurrency'" # Match multiple tools with OR condition: "tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'" # Match by agent condition: "agent == 'CurrencyConverterAgent' and tool == 'ConvertCurrency'" Priority and conflict resolution When multiple rules match, the toolkit evaluates by priority (higher number = higher priority). A deny rule at priority 100 will override an allow rule at priority 10. This lets you layer broad allows with specific denies: rules: - name: allow-all-weather-tools condition: "tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'" action: allow priority: 10 - name: block-during-maintenance condition: "tool == 'GetWeatherForecast'" action: deny priority: 100 description: Temporarily block NWS calls during API maintenance Advanced: OPA Rego and Cedar The YAML policy language handles most scenarios, but for teams with advanced needs, the toolkit also supports OPA Rego and Cedar policy languages. 
You can mix them — use YAML for simple rules and Rego for complex conditional logic: # policies/advanced.rego — Example: time-based access control package travel_planner.governance default allow_tool_call = false allow_tool_call { input.agent == "CurrencyConverterAgent" input.tool == "get_exchange_rate" time.weekday(time.now_ns()) != "Sunday" # Markets closed } Start simple with YAML. Add complexity only when you need it. Why App Service for governed agent workloads You might be wondering: why does hosting platform matter for governance? It matters a lot. The governance toolkit handles the application-level policies, but a production agent system also needs platform-level security, networking, identity, and deployment controls. App Service gives you these out of the box. Managed Identity Governance policies enforce what agents can access. Managed Identity handles how they authenticate — without secrets to manage, rotate, or leak. Our travel planner already uses DefaultAzureCredential for Azure OpenAI, Cosmos DB, and Service Bus. Governance layers on top of this identity foundation. VNet Integration + Private Endpoints The governance toolkit enforces API allowlists at the application level. App Service's VNet integration and private endpoints enforce network boundaries at the infrastructure level. This is defense in depth: even if a governance policy is misconfigured, the network layer prevents unauthorized egress. Your agents can only reach the networks you've explicitly allowed. Easy Auth App Service's built-in authentication (Easy Auth) protects your agent APIs without custom code. Before a request even reaches your governance engine, App Service has already validated the caller's identity. No custom auth middleware. No JWT parsing. Just toggle it on. Deployment Slots This is underrated for governance. With deployment slots, you can test new governance policies in a staging slot before swapping to production. 
Deploy updated governance-policies.yaml to staging, run your test suite, verify the policies work as expected, and then swap. Zero-downtime policy updates with full rollback capability. App Insights integration Governance audit events flow into the same Application Insights instance we configured in Blog 2. This means your governance decisions appear alongside your OTel traces in the Agents view. One pane of glass for agent behavior and governance enforcement. Always-on + WebJobs Our travel planner uses WebJobs for long-running agent workflows. With App Service's Always-on feature, those workflows stay warm, and governance is continuous — no cold-start gaps where agents run unmonitored. azd deployment One command deploys the full governed stack — application code, governance policies, infrastructure, and monitoring: azd up App Service gives you the enterprise production features governance needs — identity, networking, observability, safe deployment — out of the box. The governance toolkit handles agent-level policy enforcement; App Service handles platform-level security. Together, they're a complete governed agent platform. Governance audit events in App Insights In Blog 2, we set up OpenTelemetry and the Application Insights Agents view to monitor agent behavior. With the governance toolkit, those same traces now include governance audit events — every policy decision is recorded as a span attribute on the agent's trace. When you open a trace in the Agents view, you'll see governance events inline: Policy: api-allowlist → ALLOWED — CurrencyConverterAgent called Frankfurter API, permitted Policy: token-budget → ALLOWED — Request used 3,200 tokens, within per-request limit of 8,000 Policy: rate-limit → THROTTLED — WeatherAdvisorAgent exceeded 60 calls/min, request delayed For deeper analysis, use KQL to query governance events directly. 
Here's a query that finds all policy violations in the last 24 hours:

    // Find all governance policy violations in the last 24 hours
    traces
    | where timestamp > ago(24h)
    | extend decision = tostring(customDimensions["governance.decision"])
    // Skip traces that carry no governance decision at all
    | where isnotempty(decision) and decision != "ALLOWED"
    | extend agentName = tostring(customDimensions["agent.name"]),
             policyName = tostring(customDimensions["governance.policy"]),
             violationReason = tostring(customDimensions["governance.reason"]),
             targetUrl = tostring(customDimensions["tool.target_url"])
    | project timestamp, agentName, policyName, decision, violationReason, targetUrl
    | order by timestamp desc

And here's one for tracking token budget consumption across agents:

    // Token budget consumption by agent over the last hour
    customMetrics
    | where timestamp > ago(1h)
    | where name == "governance.tokens.consumed"
    | extend agentName = tostring(customDimensions["agent.name"])
    | summarize totalTokens = sum(value),
                avgTokensPerRequest = avg(value),
                maxTokensPerRequest = max(value)
        by agentName, bin(timestamp, 5m)
    | order by totalTokens desc

This is the power of integrating governance with your existing observability stack. You don't need a separate governance dashboard; everything lives in the same App Insights workspace you already know.

SRE for agents

The Agent SRE package brings Site Reliability Engineering practices to agent systems. This was the part that got me most excited, because it addresses a question I hear constantly: "How do I know my agents are actually reliable?"

Service Level Objectives (SLOs)

We defined SLOs in our policy file:

    slos:
      - name: weather-agent-latency
        agent: "WeatherAdvisorAgent"
        metric: latency-p99
        target: 5000ms
        window: 5m

This says: "The Weather Advisor Agent must respond within 5 seconds at the 99th percentile, measured over a 5-minute rolling window." When the SLO is breached, the toolkit emits an alert event and can trigger automated responses.
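To make the SLO definition concrete, here is a minimal sketch of the arithmetic behind "p99 latency under 5000 ms over a 5-minute rolling window". SloSketch, Percentile, and IsBreached are hypothetical names invented for this illustration. They are not part of Microsoft.AgentGovernance, and the toolkit may well compute percentiles differently (this sketch uses the nearest-rank method).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, NOT a toolkit API: it only illustrates what
// "latency-p99, target 5000ms, window 5m" means arithmetically.
public static class SloSketch
{
    // Nearest-rank percentile: the smallest sample that has at least
    // `percent`% of all samples at or below it.
    public static double Percentile(IEnumerable<double> samples, int percent)
    {
        if (percent < 1 || percent > 100)
            throw new ArgumentOutOfRangeException(nameof(percent));
        var sorted = samples.OrderBy(x => x).ToArray();
        if (sorted.Length == 0)
            throw new ArgumentException("No samples.", nameof(samples));
        // Integer ceiling of percent * n / 100, so rank is always in 1..n.
        int rank = (percent * sorted.Length + 99) / 100;
        return sorted[rank - 1];
    }

    // True when the p99 latency of the calls inside the rolling window
    // (now - window, now] exceeds the target.
    public static bool IsBreached(
        IEnumerable<(DateTimeOffset At, double LatencyMs)> calls,
        DateTimeOffset now,
        TimeSpan window,
        double targetMs)
    {
        var recent = calls
            .Where(c => c.At > now - window && c.At <= now)
            .Select(c => c.LatencyMs)
            .ToList();
        // Assumption: an empty window counts as "not breached".
        return recent.Count > 0 && Percentile(recent, 99) > targetMs;
    }
}
```

With 100 calls in the window, the nearest-rank p99 is the 99th-slowest call, so a single 6-second outlier does not breach a 5000 ms target but two of them do. That tolerance for rare outliers is the reason to set the target on a percentile rather than on the maximum.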
Circuit breakers Circuit breakers prevent cascading failures. If an agent fails 5 times in a row, the circuit opens, and subsequent requests get a fast failure response instead of waiting for another timeout: circuit-breakers: - agent: "*" failure-threshold: 5 recovery-timeout: 30s half-open-max-calls: 2 After 30 seconds, the circuit enters a half-open state, allowing 2 test calls through. If those succeed, the circuit closes and normal operation resumes. If they fail, the circuit opens again. This pattern is battle-tested in microservices — now it protects your agents too. Error budgets Error budgets tie SLOs to business decisions. If your Coordinator Agent's success rate target is 99.5% over a 15-minute window, that means you have an error budget of 0.5%. When the budget is consumed, the toolkit can automatically reduce agent autonomy — for example, requiring human approval for high-risk actions until the error budget recovers. SRE practices turn agent reliability from a hope into a measurable, enforceable contract. Architecture Here's how everything fits together after adding governance: ┌─────────────────────────────────────────────────────────────────┐ │ Azure App Service │ │ ┌──────────────┐ ┌─────────────────────────────────────┐ │ │ │ Frontend │───▶│ ASP.NET Core API │ │ │ │ (Static) │ │ │ │ │ └──────────────┘ │ ┌─────────────────────────────┐ │ │ │ │ │ Coordinator Agent │ │ │ │ │ │ ┌───────┐ ┌────────────┐ │ │ │ │ │ │ │ OTel │─▶│ Governance │ │ │ │ │ │ │ └───────┘ │ Engine │ │ │ │ │ │ │ │ ┌────────┐ │ │ │ │ │ │ │ │ │Policies│ │ │ │ │ │ │ │ │ └────────┘ │ │ │ │ │ │ │ └─────┬──────┘ │ │ │ │ │ └───────────────────┼─────────┘ │ │ │ │ ┌───────────────────┼──────────┐ │ │ │ │ │ Specialist Agents │ │ │ │ │ │ │ (Currency, Weather, etc.) 
│ │ │ │ │ │ Each with OTel + Governance │ │ │ │ │ └───────────────────┼──────────┘ │ │ │ └──────────────────────┼──────────────┘ │ │ │ │ │ ┌────────────┐ ┌───────────┐ ┌───────────┼─────────┐ │ │ │ Managed │ │ VNet │ │ App Insights │ │ │ │ Identity │ │Integration│ │ (Traces + │ │ │ │ (no keys) │ │(network │ │ Governance Audit) │ │ │ │ │ │ boundary) │ │ │ │ │ └────────────┘ └───────────┘ └─────────────────────┘ │ └──────────────────────────────┬──────────────────────────────────┘ │ Only allowed APIs ▼ ┌──────────────────────┐ │ External APIs │ │ ✅ Frankfurter API │ │ ✅ NWS Weather API │ │ ❌ Everything else │ └──────────────────────┘ The key insight: governance is a transparent layer in the agent pipeline. It sits between the agent's decision and the action's execution. The agent code doesn't know or care about governance — it just builds the agent with .UseGovernance() and the policy engine handles the rest. Bring it to your own agents We've shown governance with Microsoft Agent Framework on .NET, but the toolkit is framework-agnostic. 
Here's how to add it to other popular frameworks: LangChain (Python) from agent_governance import PolicyEngine, GovernanceCallbackHandler policy_engine = PolicyEngine.from_yaml("governance-policies.yaml") # Add governance as a LangChain callback handler agent = create_react_agent( llm=llm, tools=tools, callbacks=[GovernanceCallbackHandler(policy_engine)] ) CrewAI (Python) from agent_governance import PolicyEngine from agent_governance.integrations.crewai import GovernanceTaskDecorator policy_engine = PolicyEngine.from_yaml("governance-policies.yaml") # Add governance as a CrewAI task decorator @GovernanceTaskDecorator(policy_engine) def research_task(agent, context): return agent.execute(context) Google ADK (Python) from agent_governance import PolicyEngine from agent_governance.integrations.google_adk import GovernancePlugin policy_engine = PolicyEngine.from_yaml("governance-policies.yaml") # Add governance as a Google ADK plugin agent = Agent( model="gemini-2.0-flash", tools=[...], plugins=[GovernancePlugin(policy_engine)] ) TypeScript / Node.js import { PolicyEngine } from '@microsoft/agentmesh-sdk'; const policyEngine = PolicyEngine.fromYaml('governance-policies.yaml'); // Use as middleware in your agent pipeline agent.use(policyEngine.middleware()); Every integration hooks into the framework's native extension points — callbacks, decorators, plugins, middleware — so adding governance doesn't require rewriting your agent code. Install the package, point it at your policy file, and you're governed. 
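If it helps to see what "hooking into an extension point" amounts to, here is a framework-free sketch of the same idea: a decorator that checks an allowlist before a tool call runs. It does not use the toolkit's API. The `governed` decorator, the `PolicyViolation` exception, and the two hostnames are assumptions made for illustration.

```python
from functools import wraps
from urllib.parse import urlparse

# Assumed hostnames for the two allowed APIs; adjust to your own policy.
ALLOWED_HOSTS = {"api.frankfurter.app", "api.weather.gov"}


class PolicyViolation(Exception):
    """Raised when a tool call targets a host outside the allowlist."""


def governed(allowed_hosts):
    """Wrap a tool so the allowlist is checked before the tool body runs."""
    def decorator(tool):
        @wraps(tool)
        def wrapper(url, *args, **kwargs):
            host = urlparse(url).hostname
            if host not in allowed_hosts:
                raise PolicyViolation(f"api-allowlist blocked call to {host!r}")
            return tool(url, *args, **kwargs)
        return wrapper
    return decorator


@governed(ALLOWED_HOSTS)
def fetch(url):
    # Stand-in for a real HTTP call, so the sketch runs offline.
    return f"fetched {url}"


print(fetch("https://api.weather.gov/alerts"))
try:
    fetch("https://example.com/data")
except PolicyViolation as err:
    print(err)
```

The tool body never learns that a policy exists, which is the same separation the integrations above aim for.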
What's next
This wraps up our three-part series on building production-ready multi-agent AI applications on Azure App Service:
Blog 1: Build. Deploy a multi-agent travel planner with Microsoft Agent Framework 1.0
Blog 2: Monitor. Add observability with OpenTelemetry and the Application Insights Agents view
Blog 3: Govern. Secure agents for production with the Agent Governance Toolkit (you are here)
The progression is intentional: first make it work, then make it visible, then make it safe. The consistent theme across all three parts is that App Service makes each step easier: managed hosting for Blog 1, integrated monitoring for Blog 2, and platform-level security features for Blog 3.
Next steps for your agents
Explore the Agent Governance Toolkit: star the repo, browse the 20 tutorials, try the demo.
Customize policies for your compliance needs: start with our YAML template and adapt it to your domain. Healthcare teams: enable HIPAA mappings. Finance teams: add SOC2 controls.
Explore Agent Mesh for multi-agent trust: if you have agents communicating across services or trust boundaries, Agent Mesh's cryptographic identity and trust scoring add another layer of defense.
Deploy the sample: clone our travel planner repo, run azd up, and see governed agents in action.
AI agents are becoming autonomous decision-makers in high-stakes domains. The question isn't whether we need governance; it's whether we build it proactively, before incidents force our hand. With the Agent Governance Toolkit and Azure App Service, you can add production governance to your agents today, in about 30 minutes.

Build and Host MCP Apps on Azure App Service
MCP Apps are here, and they're a game-changer for building AI tools with interactive UIs. If you've been following the Model Context Protocol (MCP) ecosystem, you've probably heard about the MCP Apps spec — the first official MCP extension that lets your tools return rich, interactive UIs that render directly inside AI chat clients like Claude Desktop, ChatGPT, VS Code Copilot, Goose, and Postman. And here's the best part: you can host them on Azure App Service. In this post, I'll walk you through building a weather widget MCP App and deploying it to App Service. You'll have a production-ready MCP server serving interactive UIs in under 10 minutes. What Are MCP Apps? MCP Apps extend the Model Context Protocol by combining tools (the functions your AI client can call) with UI resources (the interactive interfaces that display the results). The pattern is simple: A tool declares a _meta.ui.resourceUri in its metadata When the tool is invoked, the MCP host fetches that UI resource The UI renders in a sandboxed iframe inside the chat client The key insight? MCP Apps are just web apps — HTML, JavaScript, and CSS served through MCP. And that's exactly what App Service does best. The MCP Apps spec supports cross-client rendering, so the same UI works in Claude Desktop, VS Code Copilot, ChatGPT, and other MCP-enabled clients. Your weather widget, map viewer, or data dashboard becomes a universal component in the AI ecosystem. Why App Service for MCP Apps? Azure App Service is a natural fit for hosting MCP Apps. Here's why: Always On — No cold starts. Your UI resources are served instantly, every time. Easy Auth — Secure your MCP endpoint with Entra ID authentication out of the box, no code required. Custom domains + TLS — Professional MCP server endpoints with your own domain and managed certificates. Deployment slots — Canary and staged rollouts for MCP App updates without downtime. 
Sidecars — Run backend services (Redis, message queues, monitoring agents) alongside your MCP server. App Insights — Built-in telemetry to see which tools and UIs are being invoked, response times, and error rates. Now, these are all capabilities you can add to a production MCP App, but the sample we're building today keeps things simple. We're focusing on the core pattern: serving MCP tools with interactive UIs from App Service. The production features are there when you need them. When to Use Functions vs App Service for MCP Apps Before we dive into the code, let's talk about Azure Functions. The Functions team has done great work with their MCP Apps quickstart, and if serverless is your preferred model, that's a fantastic option. Functions and App Service both host MCP Apps beautifully — they just serve different needs. Azure Functions Azure App Service Best for New, purpose-built MCP Apps that benefit from serverless scaling MCP Apps that need always-on hosting, persistent state, or are part of larger web apps Scaling Scale to zero, pay per invocation Dedicated plans, always running Cold start Possible (mitigated by premium plan) None (Always On) Deployment azd up with Functions template azd up with App Service template MCP Apps quickstart Available This blog post! Additional capabilities Event-driven triggers, durable functions Easy Auth, custom domains, deployment slots, sidecars Think of it this way: if you're building a new MCP App from scratch and want serverless economics, go with Functions. If you're adding MCP capabilities to an existing web app, need zero cold starts, or want production features like Easy Auth and deployment slots, App Service is your friend. Build the Weather Widget MCP App Let's build a simple MCP App that fetches weather data from the Open-Meteo API and displays it in an interactive widget. The sample uses ASP.NET Core for the MCP server and Vite for the frontend UI. 
Here's the structure: app-service-mcp-app-sample/ ├── src/ │ ├── Program.cs # MCP server setup │ ├── WeatherTool.cs # Weather tool with UI metadata │ ├── WeatherUIResource.cs # MCP resource serving the UI │ ├── WeatherService.cs # Open-Meteo API integration │ └── app/ # Vite frontend (weather widget) │ └── src/ │ └── weather-app.ts # MCP Apps SDK integration ├── .vscode/ │ └── mcp.json # VS Code MCP server config ├── azure.yaml # Azure Developer CLI config └── infra/ # Bicep infrastructure Program.cs — MCP Server Setup The MCP server is an ASP.NET Core app that registers tools and UI resources: using ModelContextProtocol; var builder = WebApplication.CreateBuilder(args); // Register WeatherService builder.Services.AddSingleton<WeatherService>(sp => new WeatherService(WeatherService.CreateDefaultClient())); // Add MCP Server with HTTP transport, tools, and resources builder.Services.AddMcpServer() .WithHttpTransport(t => t.Stateless = true) .WithTools<WeatherTool>() .WithResources<WeatherUIResource>(); var app = builder.Build(); // Map MCP endpoints (no auth required for this sample) app.MapMcp("/mcp").AllowAnonymous(); app.Run(); AddMcpServer() configures the MCP protocol handler. WithHttpTransport() enables Streamable HTTP with stateless mode (no session management needed). WithTools<WeatherTool>() registers our weather tool, and WithResources<WeatherUIResource>() registers the UI resource that the MCP host will fetch and render. MapMcp("/mcp") maps the MCP endpoint at /mcp . WeatherTool.cs — Tool with UI Metadata The WeatherTool class defines the tool and uses the [McpMeta] attribute to declare a ui metadata block containing the resourceUri . 
This tells the MCP host where to fetch the interactive UI: using System.ComponentModel; using ModelContextProtocol.Server; [McpServerToolType] public class WeatherTool { private readonly WeatherService _weatherService; public WeatherTool(WeatherService weatherService) { _weatherService = weatherService; } [McpServerTool] [Description("Get current weather for a location via Open-Meteo. Returns weather data that displays in an interactive widget.")] [McpMeta("ui", JsonValue = """{"resourceUri": "ui://weather/index.html"}""")] public async Task<object> GetWeather( [Description("City name to check weather for (e.g., Seattle, New York, Miami)")] string location) { var result = await _weatherService.GetCurrentWeatherAsync(location); return result; } } The key line is the [McpMeta("ui", ...)] attribute. This adds _meta.ui.resourceUri to the tool definition, pointing to the ui://weather/index.html resource. When the AI client calls this tool, the host fetches that resource and renders it in a sandboxed iframe alongside the tool result. WeatherUIResource.cs — UI Resource The UI resource class serves the bundled HTML as an MCP resource with the ui:// scheme and text/html;profile=mcp-app MIME type required by the MCP Apps spec: using ModelContextProtocol.Protocol; using ModelContextProtocol.Server; [McpServerResourceType] public class WeatherUIResource { [McpServerResource( UriTemplate = "ui://weather/index.html", Name = "weather_ui", MimeType = "text/html;profile=mcp-app")] public static ResourceContents GetWeatherUI() { var filePath = Path.Combine( AppContext.BaseDirectory, "app", "dist", "index.html"); var html = File.ReadAllText(filePath); return new TextResourceContents { Uri = "ui://weather/index.html", MimeType = "text/html;profile=mcp-app", Text = html }; } } The [McpServerResource] attribute registers this method as the handler for the ui://weather/index.html resource. 
When the host fetches it, the bundled single-file HTML (built by Vite) is returned with the correct MIME type. WeatherService.cs — Open-Meteo API Integration The WeatherService class handles geocoding and weather data from the Open-Meteo API. Nothing MCP-specific here — it's just a standard HTTP client that geocodes a city name and fetches current weather observations. The UI Resource (Vite Frontend) The app/ directory contains a TypeScript app built with Vite that renders the weather widget. It uses the @modelcontextprotocol/ext-apps SDK to communicate with the host: import { App } from "@modelcontextprotocol/ext-apps"; const app = new App({ name: "Weather Widget", version: "1.0.0" }); // Handle tool results from the server app.ontoolresult = (params) => { const data = parseToolResultContent(params.content); if (data) render(data); }; // Adapt to host theme (light/dark) app.onhostcontextchanged = (ctx) => { if (ctx.theme) applyTheme(ctx.theme); }; await app.connect(); The SDK's App class handles the postMessage communication with the host. When the tool returns weather data, ontoolresult fires and the widget renders the temperature, conditions, humidity, and wind. The app also adapts to the host's theme so it looks native in both light and dark mode. The frontend is bundled into a single index.html file using Vite and the vite-plugin-singlefile plugin, which inlines all JavaScript and CSS. This makes it easy to serve as a single MCP resource. Run Locally To run the sample locally, you'll need the .NET 9 SDK and Node.js 18+ installed. Clone the repo and run: # Clone the repo git clone https://github.com/seligj95/app-service-mcp-app-sample.git cd app-service-mcp-app-sample # Build the frontend cd src/app npm install npm run build # Run the MCP server cd .. dotnet run The server starts on http://localhost:5000 . 
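Before wiring up a client, a quick smoke test can confirm the endpoint is reachable. The sketch below builds a JSON-RPC `tools/list` request for the local `/mcp` endpoint using only the standard library. Two assumptions to flag: that the server, running in stateless mode, answers `tools/list` without a prior `initialize` exchange, and that port 5000 matches your launch settings. The helper names are hypothetical.

```python
import json
import urllib.request


def build_tools_list_request(endpoint="http://localhost:5000/mcp"):
    """Build (but do not send) a JSON-RPC tools/list request for an MCP endpoint."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
    return urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP clients advertise both response formats.
            "Accept": "application/json, text/event-stream",
        },
    )


def send(request, timeout=10):
    """Send the request and return (status, body text). Requires the server to be running."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


request = build_tools_list_request()
print(request.get_method(), request.full_url)
# With the server running: status, text = send(request)
```

If the call succeeds, the response should list the weather tool along with its `_meta` block; if the server rejects the request, try a full MCP client instead, since some hosts require the `initialize` handshake first.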
Now connect from VS Code Copilot: Open your workspace in VS Code The sample includes a .vscode/mcp.json that configures the local MCP server: { "servers": { "local-mcp-appservice": { "type": "http", "url": "http://localhost:5000/mcp" } } } Open the GitHub Copilot Chat panel Ask: "What's the weather in Seattle?" Copilot will invoke the GetWeather tool, and the interactive weather widget will render inline in the chat: Weather widget MCP App rendering inline in VS Code Copilot Chat Deploy to Azure Deploying to Azure is even easier. The sample includes an azure.yaml file and Bicep templates for App Service, so you can deploy with a single command: cd app-service-mcp-app-sample azd auth login azd up azd up will: Provision an App Service plan and web app in your subscription Build the .NET app and Vite frontend Deploy the app to App Service Output the public MCP endpoint URL After deployment, azd will output a URL like https://app-abc123.azurewebsites.net . Update your .vscode/mcp.json to point to the remote server: { "servers": { "remote-weather-app": { "type": "http", "url": "https://app-abc123.azurewebsites.net/mcp" } } } From that point forward, your MCP App is live. Any AI client that supports MCP Apps can invoke your weather tool and render the interactive widget — no local server required. What's Next? You've now built and deployed an MCP App to Azure App Service. Here's what you can explore next: Read the MCP Apps spec to understand the full capabilities of the extension, including input forms, persistent state, and multi-step workflows. Check out the ext-apps examples on GitHub — there are samples for map viewers, PDF renderers, system monitors, and more. Try the Azure Functions MCP Apps quickstart if you want to build a serverless MCP App. Learn about hosting remote MCP servers in App Service for more patterns and best practices. Clone the sample repo and customize it for your own use cases. 
And remember: App Service gives you a full production hosting platform for your MCP Apps. You can add Easy Auth to secure your endpoints with Entra ID, wire up App Insights for telemetry, configure custom domains and TLS certificates, and set up deployment slots for blue/green rollouts. These features make App Service a great choice when you're ready to take your MCP App to production. If you build something cool with MCP Apps and App Service, let me know. I'd love to see what you create!

Take Control of Every Message: Partial Failure Handling for Service Bus Triggers in Azure Functions
The Problem: All-or-Nothing Batch Processing in Azure Service Bus Azure Service Bus is one of the most widely used messaging services for building event-driven applications on Azure. When you use Azure Functions with a Service Bus trigger in batch mode, your function receives multiple messages at once for efficient, high-throughput processing. But what happens when one message in the batch fails? Your function receives a batch of 50 Service Bus messages. 49 process perfectly. 1 fails. What happens? In the default model, the entire batch fails. All 50 messages go back on the queue and get reprocessed, including the 49 that already succeeded. This leads to: Duplicate processing — messages that were already handled successfully get processed again Wasted compute — you pay for re-executing work that already completed Infinite retry loops — if that one "poison" message keeps failing, it blocks the entire batch indefinitely Idempotency burden — your downstream systems must handle duplicates gracefully, adding complexity to every consumer This is the classic all-or-nothing batch failure problem. Azure Functions solves it with per-message settlement. The Solution: Per-Message Settlement for Azure Service Bus Azure Functions gives you direct control over how each individual message is settled in real time, as you process it. Instead of treating the batch as all-or-nothing, you settle each message independently based on its processing outcome. 
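The difference between the two models is easy to see in a small simulation. This sketch is purely illustrative and touches no Service Bus API: it takes a map of message IDs to processing outcomes and reports which messages still need handling under each model. The function names and the failing message ID are made up.

```python
def unsettled_all_or_nothing(outcomes):
    """Whole batch returns to the queue if any single message failed."""
    if all(outcomes.values()):
        return []
    return sorted(outcomes)


def unsettled_per_message(outcomes):
    """Only the failed messages remain to be retried or dead-lettered."""
    return sorted(message_id for message_id, ok in outcomes.items() if not ok)


# A batch of 50 messages where message 17 fails.
batch = {message_id: True for message_id in range(1, 51)}
batch[17] = False

print(len(unsettled_all_or_nothing(batch)))  # every message is reprocessed
print(unsettled_per_message(batch))          # only the failure remains
```

One failure costs 50 redeliveries in the first model and one in the second, which is where the duplicate-processing and compute savings come from.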
With Service Bus message settlement actions in Azure Functions, you can:

Action | What It Does
Complete | Remove the message from the queue (successfully processed)
Abandon | Release the lock so the message returns to the queue for retry, optionally modifying application properties
Dead-letter | Move the message to the dead-letter queue (poison message handling)
Defer | Keep the message in the queue but make it retrievable only by sequence number

This means in a batch of 50 messages, you can:
Complete 47 that processed successfully
Abandon 2 that hit a transient error (with updated retry metadata)
Dead-letter 1 that is malformed and will never succeed
All in a single function invocation. No reprocessing of successful messages. No building failure response objects. No all-or-nothing.

Why This Matters

1. Eliminates Duplicate Processing
When you complete messages individually, successfully processed messages are immediately removed from the queue. There's no chance of them being redelivered, even if other messages in the same batch fail.

2. Enables Granular Error Handling
Different failures deserve different treatments. A malformed message should be dead-lettered immediately. A message that failed due to a transient database timeout should be abandoned for retry. A message that requires manual intervention should be deferred. Per-message settlement gives you this granularity.

3. Implements Exponential Backoff Without External Infrastructure
By combining abandon with modified application properties, you can track retry counts per message and implement exponential backoff patterns directly in your function code, with no additional queues or Durable Functions required.

4. Reduces Cost
You stop paying for redundant re-execution of already-successful work. In high-throughput systems processing millions of messages, this can be a material cost reduction.

5. Simplifies Idempotency Requirements
When successful messages are never redelivered, your downstream systems don't need to guard against duplicates as aggressively. This reduces architectural complexity and potential for bugs.

Before: One Message = One Function Invocation
Before batch support, there was no cardinality option: Azure Functions processed each Service Bus message as a separate function invocation. If your queue had 50 messages, the runtime spun up 50 individual executions.

Single-Message Processing (The Old Way)

    import { app, InvocationContext } from '@azure/functions';

    async function handleOrderMessage(
        message: unknown, // ← One message at a time, no batch
        context: InvocationContext
    ): Promise<void> {
        try {
            const order = message as Order; // Order type defined elsewhere
            await processOrder(order); // business logic, defined elsewhere
        } catch (error) {
            context.error('Failed to process message:', error);
            // With auto-complete (the default), rethrowing is the only way to signal failure.
            throw error;
        }
    }

    app.serviceBusQueue('processOrder', {
        connection: 'ServiceBusConnection',
        queueName: 'orders-queue',
        handler: handleOrderMessage,
    });

What this cost you:

50 messages on the queue | Old (single-message) | New (batch + settlement)
Function invocations | 50 separate invocations | 1 invocation
Connection overhead | 50 separate DB/API connections | 1 connection, reused across batch
Compute cost | 50× invocation overhead | 1× invocation overhead
Settlement control | Binary: throw or don't | 4 actions per message

Every message paid the full price of a function invocation: startup, connection setup, teardown. At scale (millions of messages/day), this was a significant cost and latency penalty. And when a message failed, your only options were to throw (retry the whole message) or swallow the error (lose it silently).

Code Examples
Let's see how this looks across all three major Azure Functions language stacks.
Node.js (TypeScript with @azure/functions-extensions-servicebus)

    import '@azure/functions-extensions-servicebus';
    import { app, InvocationContext } from '@azure/functions';
    import { ServiceBusMessageContext, messageBodyAsJson } from '@azure/functions-extensions-servicebus';

    interface Order { id: string; product: string; amount: number; }

    export async function processOrderBatch(
        sbContext: ServiceBusMessageContext,
        context: InvocationContext
    ): Promise<void> {
        const { messages, actions } = sbContext;
        for (const message of messages) {
            try {
                const order = messageBodyAsJson<Order>(message);
                await processOrder(order); // business logic, defined elsewhere
                await actions.complete(message); // ✅ Done
            } catch (error) {
                context.error(`Failed ${message.messageId}:`, error);
                await actions.deadletter(message); // ☠️ Poison
            }
        }
    }

    app.serviceBusQueue('processOrderBatch', {
        connection: 'ServiceBusConnection',
        queueName: 'orders-queue',
        sdkBinding: true,
        autoCompleteMessages: false,
        cardinality: 'many',
        handler: processOrderBatch,
    });

Key points:
Enable sdkBinding: true and autoCompleteMessages: false to gain manual settlement control
ServiceBusMessageContext provides both the messages array and actions object
Settlement actions: complete(), abandon(), deadletter(), defer()
Application properties can be passed to abandon() for retry tracking
Built-in helpers like messageBodyAsJson<T>() handle Buffer-to-object parsing
Full sample: serviceBusSampleWithComplete

Python (V2 Programming Model)

    import json
    import logging
    from typing import List

    import azure.functions as func
    import azurefunctions.extensions.bindings.servicebus as servicebus

    app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

    @app.service_bus_queue_trigger(arg_name="messages",
                                   queue_name="orders-queue",
                                   connection="SERVICEBUS_CONNECTION",
                                   auto_complete_messages=False,
                                   cardinality="many")
    def process_order_batch(messages: List[servicebus.ServiceBusReceivedMessage],
                            message_actions: servicebus.ServiceBusMessageActions):
        for message in messages:
            try:
                order = json.loads(message.body)
                process_order(order)
                message_actions.complete(message)  # ✅ Done
            except Exception as e:
                logging.error(f"Failed {message.message_id}: {e}")
                message_actions.dead_letter(message)  # ☠️ Poison

    def process_order(order):
        logging.info(f"Processing order: {order['id']}")

Key points:
Uses azurefunctions.extensions.bindings.servicebus for SDK-type bindings with ServiceBusReceivedMessage
Supports both queue and topic triggers with cardinality="many" for batch processing
Each message exposes SDK properties like body, enqueued_time_utc, lock_token, message_id, and sequence_number
Full sample: servicebus_samples_settlement

.NET (C# Isolated Worker)

    using Azure.Messaging.ServiceBus;
    using Microsoft.Azure.Functions.Worker;
    using Microsoft.Extensions.Logging; // needed for ILogger<T>

    public class ServiceBusBatchProcessor(ILogger<ServiceBusBatchProcessor> logger)
    {
        [Function(nameof(ProcessOrderBatch))]
        public async Task ProcessOrderBatch(
            // IsBatched = true is required to bind an array of messages in the isolated worker
            [ServiceBusTrigger("orders-queue", Connection = "ServiceBusConnection", IsBatched = true)]
            ServiceBusReceivedMessage[] messages,
            ServiceBusMessageActions messageActions)
        {
            foreach (var message in messages)
            {
                try
                {
                    var order = message.Body.ToObjectFromJson<Order>();
                    await ProcessOrder(order);
                    await messageActions.CompleteMessageAsync(message); // ✅ Done
                }
                catch (Exception ex)
                {
                    logger.LogError(ex, "Failed {MessageId}", message.MessageId);
                    await messageActions.DeadLetterMessageAsync(message); // ☠️ Poison
                }
            }
        }

        private Task ProcessOrder(Order order) => Task.CompletedTask;
    }

    public record Order(string Id, string Product, decimal Amount);

Key points:
Inject ServiceBusMessageActions directly alongside the message array
Each message is individually settled with CompleteMessageAsync, DeadLetterMessageAsync, or AbandonMessageAsync
Application properties can be modified on abandon to track retry metadata
Full sample: ServiceBusReceivedMessageFunctions.cs

HTTP Triggers in Azure SRE Agent: From Jira Ticket to Automated Investigation
Introduction Many teams run their observability, incident management, ticketing, and deployment on platforms outside of Azure—Jira, Opsgenie, Grafana, Zendesk, GitLab, Jenkins, Harness, or homegrown internal tools. These are the systems where alerts fire, tickets get filed, deployments happen, and operational decisions are made every day. HTTP Triggers make it easy to connect any of them to Azure SRE Agent—turning events from any platform into automated agent actions with a simple HTTP POST. No manual copy-paste, no context-switching, no delay between detection and response. In this blog, we'll demonstrate by connecting Jira to SRE Agent—so that every new incident ticket automatically triggers an investigation, and the agent posts its findings back to the Jira ticket when it's done. The Scenario: Jira Incident → Automated Investigation Your team manages production applications backed by Azure PostgreSQL Flexible Server. You use Jira for incident tracking. Today, when a P1 or P2 incident is filed, your on-call engineer has to manually triage—reading through the ticket, checking dashboards, querying logs, correlating recent deployments—before they can even begin working on a fix. Some teams have Jira automations that route or label tickets, but the actual investigation still starts with a human. HTTP Triggers let you bring SRE Agent directly into that existing workflow. Instead of adding another tool for engineers to check, the agent meets them where they already work. Jira ticket created → SRE Agent automatically investigates → Agent writes findings back to Jira The on-call engineer opens the Jira ticket and the investigation is already there—root cause analysis, evidence from logs and metrics, and recommended next steps—posted as a comment by the agent. Here's how to set this up. 
Architecture Overview Here's the end-to-end flow we'll build: Jira — A new issue is created in your project Logic App — The Jira connector detects the new issue, and the Logic App calls the SRE Agent HTTP Trigger, using Managed Identity for authentication HTTP Trigger — The agent prompt is rendered with the Jira ticket details (key, summary, priority, etc.) via payload placeholders Agent Investigation — The agent uses Jira MCP tools to read the ticket and search related issues, queries Azure logs, metrics, and recent deployments, then posts its findings back to the Jira ticket as a comment How HTTP Triggers Work Every HTTP Trigger you create in Azure SRE Agent exposes a unique webhook URL: https://<your-agent>.<instance>.azuresre.ai/api/v1/httptriggers/trigger/<trigger-id> When an external system sends a POST request to this URL with a JSON payload, the SRE Agent: Validates the trigger exists and is enabled Renders your agent prompt by injecting payload values into {payload.X} placeholders Creates a new investigation thread (or reuses an existing one) Executes the agent with the rendered prompt—autonomously or in review mode Records the execution in the trigger's history for auditing Payload Placeholders The real power of HTTP Triggers is in payload placeholders. When you configure a trigger, you write an agent prompt with {payload.X} tokens that get replaced at runtime with values from the incoming JSON. For example, a prompt like: Investigate Jira incident {payload.key}: {payload.summary} (Priority: {payload.priority}) Gets rendered with actual incident data before the agent sees it, giving it immediate context to begin investigating. If your prompt doesn't use any placeholders, the raw JSON payload is automatically appended to the prompt, so the agent always has access to the full context regardless. 
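The rendering rule described above can be sketched in a few lines. This is an illustration of the behavior as described, not the agent's actual code: the `render_prompt` helper is hypothetical, the sample ticket values are invented, and leaving unknown placeholders untouched is an assumption (the post does not say what happens when a key is missing).

```python
import json
import re

# Matches tokens such as {payload.key} or {payload.ticketUrl}.
PLACEHOLDER = re.compile(r"\{payload\.([A-Za-z0-9_]+)\}")


def render_prompt(template, payload):
    """Replace {payload.X} tokens with payload values.

    If the template has no placeholders at all, append the raw JSON payload
    so the agent still sees the full context.
    """
    if not PLACEHOLDER.search(template):
        return template + "\n\n" + json.dumps(payload, indent=2)

    def substitute(match):
        key = match.group(1)
        # Assumption: a placeholder with no matching key is left as-is.
        return str(payload.get(key, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Investigate Jira incident {payload.key}: {payload.summary} "
    "(Priority: {payload.priority})"
)
payload = {"key": "OPS-42", "summary": "DB connection timeouts", "priority": "P1"}

print(render_prompt(template, payload))
print(render_prompt("Triage the incoming ticket.", payload))
```

The second call shows the fallback path: with no placeholders in the prompt, the whole payload rides along as JSON.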
Thread Modes HTTP Triggers support two thread modes: New Thread (recommended for incidents): Every trigger invocation creates a fresh investigation thread, giving each incident its own isolated workspace Same Thread: All invocations share a single thread, building up a continuous conversation—useful for accumulating alerts from a single source Authenticating External Platforms The HTTP Trigger endpoint is secured with Azure AD authentication, ensuring only authorized callers can create agent investigation threads. Every request requires a valid bearer token scoped to the SRE Agent's data plane. External platforms like Jira send standard HTTP webhooks and don't natively acquire Azure AD tokens. To bridge this, you can use any Azure service that supports Managed Identity as an intermediary—this approach means zero secrets to store or rotate in the external platform. Common options include: Approach Best For Azure Logic Apps Native connectors for many platforms, no code required, visual workflow designer Azure Functions Simple relay with ~15 lines of code, clean URL for any webhook source API Management (APIM) Enterprise environments needing rate limiting, IP filtering, or API key management All three support Managed Identity and can transparently acquire the Azure AD token before forwarding requests to the SRE Agent HTTP Trigger. In this walkthrough, we'll use Azure Logic Apps with the built-in Jira connector. Step-by-Step: Connecting Jira to SRE Agent Prerequisites An Azure SRE Agent resource deployed in your subscription A Jira Cloud project with API token access An Azure subscription for the Logic App Step 1: Set Up the Jira MCP Connector First, let's give the SRE Agent the ability to interact with Jira directly. 
In your agent's MCP Tool settings, add the Jira connector:

| Setting | Value |
| --- | --- |
| Package | mcp-atlassian (npm, version 2.0.0) |
| Transport | STDIO |

Configure these environment variables:

| Variable | Value |
| --- | --- |
| ATLASSIAN_BASE_URL | https://your-site.atlassian.net |
| ATLASSIAN_EMAIL | Your Jira account email |
| ATLASSIAN_API_TOKEN | Your Jira API token |

Once the connector is added, select the specific MCP tools you want the agent to use. The connector provides 18 Jira tools out of 80 available. For our incident investigation workflow, the key tools include:

- jira-mcp_read_jira_issue: Read details from a Jira issue by issue key
- jira-mcp_search_jira_issues: Search for Jira issues using JQL (Jira Query Language)
- jira-mcp_add_jira_comment: Add a comment to a Jira issue (post investigation findings back)
- jira-mcp_list_jira_projects: List available Jira projects
- jira-mcp_create_jira_issue: Create a new Jira issue

This gives the SRE Agent bidirectional access to Jira: it can read ticket details, fetch comments, query related issues, and post investigation findings back as comments on the original ticket. This closes the loop so your on-call engineers see the agent's analysis directly in Jira without switching tools.

Step 2: Create the HTTP Trigger

Navigate to Builder → HTTP Triggers in the SRE Agent UI and click Create.
| Setting | Value |
| --- | --- |
| Name | jira-incident-handler |
| Agent Mode | Autonomous |
| Thread Mode | New Thread (one investigation per incident) |
| Sub-Agent (optional) | Select a specialized incident response agent |

Agent Prompt:

A new Jira incident has been filed that requires investigation:

Jira Ticket: {payload.key}
Summary: {payload.summary}
Priority: {payload.priority}
Reporter: {payload.reporter}
Description: {payload.description}
Jira URL: {payload.ticketUrl}

Investigate this incident by:

1. Identifying the affected Azure resources mentioned in the description
2. Querying recent metrics and logs for anomalies
3. Checking for recent deployments or configuration changes
4. Providing a structured analysis with Root Cause, Evidence, and Recommended Actions

Once your investigation is complete, use the Jira MCP tools to post a summary of your findings as a comment on the original ticket ({payload.key}).

After saving, enable the trigger and open the trigger detail view. Copy the Trigger URL; you'll need it for the Logic App.

Step 3: Create the Azure Logic App

In the Azure Portal, create a new Logic App:

| Setting | Value |
| --- | --- |
| Type | Consumption (Multi-tenant, Stateful) |
| Name | jira-sre-agent-bridge |
| Region | Same region as your SRE Agent (e.g., East US 2) |
| Resource Group | Same resource group as your SRE Agent (recommended for simplicity) |

Step 4: Enable Managed Identity

In the Logic App → Identity → System assigned:

1. Set Status to On
2. Click Save

Step 5: Assign the SRE Agent Admin Role

Navigate to your SRE Agent resource → Access control (IAM) → Add role assignment:

| Setting | Value |
| --- | --- |
| Role | SRE Agent Admin |
| Assign to | Managed Identity → select your Logic App |

This grants the Logic App's Managed Identity the data-plane permissions needed to invoke HTTP Triggers.

Important: The Contributor role alone is not sufficient. Contributor covers the Azure control plane, but SRE Agent uses a separate data plane with its own RBAC. The SRE Agent Admin role provides the required data-plane permissions.
Step 6: Create the Jira Connection

Open the Logic App designer. When you add the Jira trigger, the designer prompts you to create a connection:

| Setting | Value |
| --- | --- |
| Connection name | jira-connection |
| Jira instance | https://your-site.atlassian.net |
| Email | Your Jira email |
| API Token | Your Jira API token |

Step 7: Configure the Logic App Workflow

Switch to the Logic App Code view and paste this workflow definition:

    {
      "definition": {
        "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
        "contentVersion": "1.0.0.0",
        "triggers": {
          "When_a_new_issue_is_created_(V2)": {
            "recurrence": { "interval": 3, "frequency": "Minute" },
            "splitOn": "@triggerBody()",
            "type": "ApiConnection",
            "inputs": {
              "host": {
                "connection": {
                  "name": "@parameters('$connections')['jira']['connectionId']"
                }
              },
              "method": "get",
              "path": "/v2/new_issue_trigger/search",
              "queries": {
                "X-Request-Jirainstance": "https://YOUR-SITE.atlassian.net",
                "projectKey": "YOUR_PROJECT_ID"
              }
            }
          }
        },
        "actions": {
          "Call_SRE_Agent_HTTP_Trigger": {
            "runAfter": {},
            "type": "Http",
            "inputs": {
              "uri": "https://YOUR-AGENT.azuresre.ai/api/v1/httptriggers/trigger/YOUR-TRIGGER-ID",
              "method": "POST",
              "headers": { "Content-Type": "application/json" },
              "body": {
                "key": "@{triggerBody()?['key']}",
                "summary": "@{triggerBody()?['fields']?['summary']}",
                "priority": "@{triggerBody()?['fields']?['priority']?['name']}",
                "reporter": "@{triggerBody()?['fields']?['reporter']?['displayName']}",
                "description": "@{triggerBody()?['fields']?['description']}",
                "ticketUrl": "@{concat('https://YOUR-SITE.atlassian.net/browse/', triggerBody()?['key'])}"
              },
              "authentication": {
                "type": "ManagedServiceIdentity",
                "audience": "https://azuresre.dev"
              }
            }
          }
        },
        "outputs": {},
        "parameters": {
          "$connections": { "type": "Object", "defaultValue": {} }
        }
      },
      "parameters": {
        "$connections": {
          "type": "Object",
          "value": {
            "jira": {
              "id": "/subscriptions/YOUR-SUB/providers/Microsoft.Web/locations/YOUR-REGION/managedApis/jira",
              "connectionId": "/subscriptions/YOUR-SUB/resourceGroups/YOUR-RG/providers/Microsoft.Web/connections/jira",
              "connectionName": "jira"
            }
          }
        }
      }
    }

Replace the YOUR-* placeholders, including YOUR_PROJECT_ID, with your actual values. To find your Jira project ID, navigate to https://your-site.atlassian.net/rest/api/3/project/YOUR-PROJECT-KEY in your browser and find the "id" field in the JSON response.

The critical piece is the authentication block:

    "authentication": {
      "type": "ManagedServiceIdentity",
      "audience": "https://azuresre.dev"
    }

This tells the Logic App to automatically acquire an Azure AD token for the SRE Agent data plane and attach it as a Bearer token. No secrets, no expiration management, no manual token refresh.

After pasting the JSON and clicking Save, switch back to the Designer view. The Logic App automatically generates the visual workflow from the code: you'll see the Jira trigger ("When a new issue is created (V2)") connected to the HTTP action ("Call SRE Agent HTTP Trigger") as a two-step flow, with all the field mappings and authentication settings already configured.

What Happens Inside the Agent

When the HTTP Trigger fires, the SRE Agent receives a fully contextualized prompt with all the Jira incident data injected:

A new Jira incident has been filed that requires investigation:

Jira Ticket: KAN-16
Summary: Elevated API Response Times — PostgreSQL Table Lock Causing Request Blocking on Listings Service
Priority: High
Reporter: Vineela Suri
Description: Severity: P2 — High. Affected Service: Production API (octopets-prod-postgres). Impact: End users experience slow or unresponsive listing pages.
Jira URL: https://your-site.atlassian.net/browse/KAN-16

Investigate this incident by:

1. Identifying the affected Azure resources mentioned in the description
2. Querying recent metrics and logs for anomalies
...

The agent then uses its configured tools to investigate: Azure CLI to query metrics, Kusto to analyze logs, and the Jira MCP connector to read the ticket for additional context.
Once the investigation is complete, the agent posts its findings as a comment directly on the Jira ticket, closing the loop without any manual copy-paste. Each execution is recorded in the trigger's history with timestamp, thread ID, success status, duration, and an AI-generated summary, giving you full observability into your automated investigation pipeline.

Extending to Other Platforms

The pattern we built here works for any external platform that isn't natively supported by SRE Agent. The core architecture stays the same:

External Platform → Auth Bridge (Managed Identity) → SRE Agent HTTP Trigger

You only need to swap the inbound side of the bridge. For example:

| External Platform | Auth Bridge Configuration |
| --- | --- |
| Jira | Logic App with Jira V2 connector (polling) |
| OpsGenie | Logic App with OpsGenie connector, or Azure Function relay receiving OpsGenie webhooks |
| Datadog | Azure Function relay or APIM policy receiving Datadog webhook notifications |
| Grafana | Azure Function relay or APIM policy receiving Grafana alert webhooks |
| Splunk | APIM with webhook endpoint and Managed Identity forwarding |
| Custom / Internal tools | Logic App HTTP trigger, Azure Function relay, or APIM (any service that supports Managed Identity) |

The SRE Agent HTTP Trigger and the Managed Identity authentication remain the same regardless of the source platform. You configure the trigger once, set up the auth bridge, and connect as many external sources as needed. Each trigger can have its own tailored prompt, sub-agent, and thread mode optimized for the type of incoming event.
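For the relay-style bridges in the table above, the forwarding step might look roughly like the following Python sketch. It is written under stated assumptions rather than taken from the product documentation: the names build_trigger_request and relay are hypothetical, the token scope is the Logic App audience with a "/.default" suffix appended by assumption, and token acquisition is passed in as a callable so the sketch stays independent of any particular Azure SDK.

```python
import json
import urllib.request

# Assumption: the audience used in the Logic App workflow, plus the
# "/.default" suffix commonly used when requesting a token by scope.
SRE_AGENT_SCOPE = "https://azuresre.dev/.default"


def build_trigger_request(trigger_url, event, token):
    """Build the authenticated POST that forwards an event to the HTTP Trigger."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        trigger_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def relay(trigger_url, event, get_token, send=urllib.request.urlopen):
    """Acquire a token, forward the event, and return the HTTP status code.

    get_token is a hypothetical callable that takes a scope and returns a
    bearer token string; send defaults to urllib's urlopen.
    """
    token = get_token(SRE_AGENT_SCOPE)
    request = build_trigger_request(trigger_url, event, token)
    with send(request) as response:
        return response.status
```

Inside an Azure Function with a Managed Identity, get_token could plausibly wrap the azure-identity package, for example lambda scope: ManagedIdentityCredential().get_token(scope).token; treat that wiring, like the scope string, as something to verify against your own environment.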
Key Takeaways

HTTP Triggers extend Azure SRE Agent's reach to any external platform:

- Connect What You Use: If your incident platform isn't natively supported, HTTP Triggers provide the integration point, with no code changes to SRE Agent required
- Secure by Design: Azure AD authentication with Managed Identity keeps the data plane protected while making integration straightforward through standard Azure services
- Bidirectional with MCP: Combine HTTP Triggers (inbound) with MCP connectors (outbound) for full round-trip integration: receive incidents automatically and post findings back to the source platform
- Full Observability: Every trigger execution is recorded with timestamps, thread IDs, duration, and AI-generated summaries
- Flexible Context Injection: Payload placeholders let you craft precise investigation prompts from incident data, while raw payload passthrough ensures the agent always has full context

Getting Started

HTTP Triggers are available now in the Azure SRE Agent platform:

1. Create a Trigger: Navigate to Builder → HTTP Triggers → Create. Define your agent prompt with {payload.X} placeholders
2. Set Up an Auth Bridge: Use Logic Apps, Azure Functions, or APIM with Managed Identity to handle Azure AD authentication
3. Connect Your Platform: Point your external platform at the bridge and create a test event

Within minutes, you'll have an automated pipeline that turns every incident ticket into an AI-driven investigation.

Learn More

- HTTP Triggers Documentation
- Agent Hooks Blog Post: Governance controls for automated investigations
- YAML Schema Reference
- SRE Agent Getting Started Guide

Ready to extend your SRE Agent to platforms it doesn't support natively? Set up your first HTTP Trigger today at sre.azure.com.

Migrating Ant Builds to Maven with GitHub Copilot app modernization
Many legacy Java applications still rely on Apache Ant for building, packaging, and dependency management. While Ant remains flexible, it lacks the structured lifecycle, dependency resolution, and ecosystem support that modern build tools like Maven provide. Migrating from Ant to Maven improves maintainability, build reproducibility, and IDE compatibility, and enables modern Java workflows such as dependency upgrades, framework updates, and containerization.

GitHub Copilot app modernization accelerates this transition by analyzing an Ant-based project, generating a migration plan, and applying transformations to produce a Maven-based build aligned with modern Java tooling.

What GitHub Copilot app modernization Supports

GitHub Copilot app modernization can help teams:

- Detect Ant build scripts (build.xml) and related custom task files
- Recommend Maven project structure and lifecycle alignment
- Generate an initial pom.xml with matched project metadata
- Map Ant targets to Maven phases where possible
- Identify external dependencies and translate them into Maven coordinates
- Migrate resource directories and compiled output locations
- Surface code or configuration changes required for a Maven-driven build
- Validate the new Maven configuration through iterative builds

This modernizes the build foundation before performing other upgrades such as JDK, Spring, Jakarta, or container-readiness transformations.

Project Analysis

When you open an Ant-based project in Visual Studio Code or IntelliJ IDEA, GitHub Copilot app modernization performs an analysis:

- Detects build.xml and auxiliary Ant scripts
- Identifies classpaths defined across Ant targets
- Evaluates manually referenced JARs in lib directories
- Inspects source layout and output directories
- Determines project metadata such as groupId, artifactId, and version
- Determines whether frameworks or libraries require updates before Maven migration

This analysis forms the basis of the migration plan.
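To make the analysis step more concrete, here is a small Python sketch of the kind of inventory it describes: Ant targets with their dependencies, plus classpath entries. It is not how GitHub Copilot app modernization works internally; the function name inventory_ant_build and the sample build file are hypothetical, and only the common path, classpath, fileset, and pathelement elements are handled.

```python
import xml.etree.ElementTree as ET

SAMPLE_BUILD_XML = """
<project name="legacy-app" default="jar">
  <path id="compile.classpath">
    <fileset dir="lib" includes="*.jar"/>
    <pathelement location="build/classes"/>
  </path>
  <target name="compile"/>
  <target name="test" depends="compile"/>
  <target name="jar" depends="compile, test"/>
</project>
"""


def inventory_ant_build(build_xml_text):
    """Return the project name, target dependencies, and classpath entries."""
    root = ET.fromstring(build_xml_text)

    # Each target maps to the list of targets named in its depends attribute.
    targets = {}
    for target in root.iter("target"):
        depends = target.get("depends", "")
        targets[target.get("name")] = [d.strip() for d in depends.split(",") if d.strip()]

    # Collect direct children of <path> and <classpath> containers.
    classpath = []
    for container in root.iter():
        if container.tag not in ("path", "classpath"):
            continue
        for child in container:
            if child.tag == "pathelement" and child.get("location"):
                classpath.append(child.get("location"))
            elif child.tag == "fileset" and child.get("dir"):
                classpath.append(child.get("dir") + "/" + child.get("includes", "**"))

    return {"project": root.get("name"), "targets": targets, "classpath": classpath}


if __name__ == "__main__":
    print(inventory_ant_build(SAMPLE_BUILD_XML))
```

An inventory like this is also a useful checklist when reviewing the generated migration plan: every target and classpath entry should map to a Maven phase, plugin, or dependency, or appear in the manual-review notes.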
Migration Plan Generation

GitHub Copilot app modernization produces a migration plan that outlines:

- The recommended Maven project layout (src/main/java, src/test/java, resources directories)
- A generated pom.xml with discovered dependencies
- Mapped Ant targets to Maven lifecycle phases (compile, test, package)
- Plugin configurations needed to replicate custom Ant functionality
- Suggested removal of lib directory JARs in favor of dependency management
- Notes on unsupported or manual-review areas (custom Ant tasks, script-heavy targets, specialized packaging logic)

You can review and adjust the plan before proceeding.

Automated Transformations

Once confirmed, GitHub Copilot app modernization applies targeted updates:

- Generates the project's pom.xml
- Migrates dependency JAR references to Maven dependency entries
- Moves source and resource files into Maven-compatible structure
- Updates ignore files, build output directories, and paths
- Introduces common Maven plugins for compiler, surefire, assembly, or shading
- Suggests replacements for custom Ant tasks if built-in Maven plugins exist

This automated work removes most of the manual lifting normally required for Ant → Maven transitions.

Build & Fix Iteration

After applying the transformations, the tool attempts to build the new Maven project:

1. Runs the build
2. Captures missing dependencies, incorrect scopes, or misaligned plugin versions
3. Suggests targeted fixes
4. Applies adjustments and rebuilds
5. Iterates until the project compiles or no further automated fixes are possible

This helps stabilize the migration quickly.

Security & Behavior Validation

GitHub Copilot app modernization also performs additional validation:

- Flags CVEs introduced or resolved through dependency discovery
- Alerts you to behavioral differences between Ant-driven and Maven-driven builds
- Highlights test failures, packaging differences, or altered classpaths that may need review

These findings allow developers to refine the migration safely.
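One of the transformations listed above, turning lib directory JARs into Maven dependency entries, starts with guessing coordinates from file names. The Python sketch below shows a simple filename heuristic for that first guess. It is an illustration only and not the tool's method: guess_coordinates is a hypothetical helper, the groupId cannot be derived from a file name and is left unset, and unusual names can be split incorrectly, which is one reason the dependency mappings need review.

```python
import re

# Heuristic: artifact name, a hyphen, then a version that starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def guess_coordinates(jar_filename):
    """Guess artifactId and version from a JAR file name; groupId stays unknown."""
    match = JAR_NAME.match(jar_filename)
    if match is None:
        # No recognizable version suffix: keep the bare name for manual review.
        artifact = jar_filename[:-4] if jar_filename.endswith(".jar") else jar_filename
        return {"groupId": None, "artifactId": artifact, "version": None}
    return {
        "groupId": None,  # must be looked up, e.g. in a repository index
        "artifactId": match.group("artifact"),
        "version": match.group("version"),
    }


if __name__ == "__main__":
    for name in ["commons-lang3-3.12.0.jar", "guava-31.1-jre.jar", "mylib.jar"]:
        print(name, guess_coordinates(name))
```

Entries that come back without a version, or with a surprising artifactId, are good candidates for the manual-review list rather than automatic inclusion in pom.xml.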
Expected Output

After the migration, you can expect:

- A newly generated and fully structured Maven project
- A populated pom.xml with dependencies, plugins, and metadata
- Updated project layout aligned with Maven standards
- Removed or deprecated Ant build files where appropriate
- Aligned dependency versions ready for further modernization
- A summary file detailing:
  - Build changes
  - Dependency mappings
  - Code or config adjustments
  - Remaining manual review items

Developer Responsibilities

While GitHub Copilot app modernization automates the mechanical migration from Ant to Maven, developers remain responsible for:

- Reviewing tests and build artifacts for behavioral differences
- Validating packaging steps for WAR/EAR/JAR outputs
- Replacing complex custom Ant scripts with proper Maven plugins
- Verifying deployment and CI workflows dependent on Ant build logic
- Confirming integration points that rely on Ant-specific tasks or ordering

Once validated, the Maven-based structure becomes a strong foundation for further modernization such as JDK upgrades, Spring migration, Jakarta adoption, and containerization.

Learn More

For project setup and the complete modernization workflow, refer to the Microsoft Learn guide for upgrading Java projects with GitHub Copilot app modernization.

Quickstart: Upgrade a Java Project with GitHub Copilot App Modernization | Microsoft Learn

A Practical Path Forward for Heroku Customers with Azure
On February 6, 2026, Heroku announced it is moving to a sustaining engineering model focused on stability, security, reliability, and ongoing support. Many customers are now reassessing how their application platforms will support today’s workloads and future innovation. Microsoft is committed to helping customers migrate and modernize applications from platforms like Heroku to Azure.