# Event-Driven IaC Operations with Azure SRE Agent: Terraform Drift Detection via HTTP Triggers
## What Happens After `terraform plan` Finds Drift?

If your team is like most, the answer looks something like this:

1. A nightly `terraform plan` runs and finds 3 drifted resources
2. A notification lands in Slack or Teams
3. Someone files a ticket
4. During the next sprint, an engineer opens 4 browser tabs — Terraform state, Azure Portal, Activity Log, Application Insights — and spends 30 minutes piecing together what happened
5. They discover the drift was caused by an on-call engineer who scaled up the App Service during a latency incident at 2 AM
6. They revert the drift with `terraform apply`
7. The app goes down, because they just scaled it back down while the bug that caused the incident is still deployed

Step 7 is the one nobody talks about. Drift detection tooling has gotten remarkably good — scheduled plans, speculative runs, drift alerts — but the output is always the same: a list of differences. What changed. Not why. Not whether it's safe to fix. The gap isn't detection. It's everything that happens after detection.

HTTP Triggers in Azure SRE Agent close that gap. They turn the structured output that drift detection already produces — webhook payloads, plan summaries, run notifications — into the starting point of an autonomous investigation. Detection feeds the agent. The agent does the rest: correlates with incidents, reads source code, classifies severity, recommends context-aware remediation, notifies the team, and even ships a fix. Here's what that looks like end to end.
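To make the handoff concrete, here is a minimal sketch (in Python, not the agent's actual implementation) of the substitution an HTTP Trigger performs: fields from the drift webhook payload are slotted into `{payload.X}` placeholders in the agent prompt. The field names mirror the trigger prompt shown later in this post; the payload values are illustrative.

```python
import re

# Illustrative drift-notification payload; field names follow the
# {payload.X} placeholders used in the trigger prompt (assumed shape).
payload = {
    "workspace_name": "prod-app-service",
    "organization_name": "contoso",
    "run_id": "run-abc123",
    "run_message": "Nightly drift check: 3 drifted attributes",
}

PROMPT_TEMPLATE = (
    "A Terraform Cloud run has completed and detected infrastructure drift.\n"
    "Workspace: {payload.workspace_name}\n"
    "Organization: {payload.organization_name}\n"
    "Run ID: {payload.run_id}\n"
    "Run Message: {payload.run_message}"
)

def render_prompt(template: str, payload: dict) -> str:
    # Replace each {payload.<field>} placeholder with the payload value,
    # leaving unknown placeholders untouched.
    return re.sub(
        r"\{payload\.(\w+)\}",
        lambda m: str(payload.get(m.group(1), m.group(0))),
        template,
    )

prompt = render_prompt(PROMPT_TEMPLATE, payload)
print(prompt.splitlines()[1])  # → Workspace: prod-app-service
```

The rendered prompt is what the agent receives when the trigger fires; everything downstream starts from this structured context.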
## What you'll see in this blog

- An agent that classifies drift as Benign, Risky, or Critical — not just "changed"
- Incident correlation that links a SKU change to a latency spike in Application Insights
- A remediation recommendation that says "Do NOT revert" — and why reverting would cause an outage
- A Teams notification with the full investigation summary
- An agent that reviews its own performance, finds gaps, and improves its own skill file
- A pull request the agent created on its own to fix the root cause

## The Pipeline: Detection to Resolution in One Webhook

The architecture is straightforward. Terraform Cloud (or any drift detection tool) sends a webhook when it finds drift. An Azure Logic App adds authentication. The SRE Agent's HTTP Trigger receives it and starts an autonomous investigation.

The end-to-end pipeline: Terraform Cloud detects drift and sends a webhook. The Logic App adds Azure AD authentication via Managed Identity. The SRE Agent's HTTP Trigger fires, and the agent autonomously investigates across 7 dimensions.

## Setting Up the Pipeline

### Step 1: Deploy the Infrastructure with Terraform

We start with a simple Azure App Service running a Node.js application, deployed via Terraform. The Terraform configuration defines the desired state:

- App Service Plan: B1 (Basic) — single vCPU, ~$13/mo
- App Service: Node 20-lts with TLS 1.2
- Tags: `environment: demo`, `managed_by: terraform`, `project: sre-agent-iac-blog`

```hcl
resource "azurerm_service_plan" "demo" {
  name                = "iacdemo-plan"
  resource_group_name = azurerm_resource_group.demo.name
  location            = azurerm_resource_group.demo.location
  os_type             = "Linux"
  sku_name            = "B1"
}
```

A Logic App is also deployed to act as the authentication bridge between Terraform Cloud webhooks and the SRE Agent's HTTP Trigger endpoint, using Managed Identity to acquire Azure AD tokens. Learn more about HTTP Triggers here.

### Step 2: Create the Drift Analysis Skill

Skills are domain knowledge files that teach the agent how to approach a problem.
We create a `terraform-drift-analysis` skill with an 8-step workflow:

1. Identify Scope — Which resource group and resources to check
2. Detect Drift — Compare Terraform config against Azure reality
3. Correlate with Incidents — Check Activity Log and App Insights
4. Classify Severity — Benign, Risky, or Critical
5. Investigate Root Cause — Read source code from the connected repository
6. Generate Drift Report — Structured summary with severity-coded table
7. Recommend Smart Remediation — Context-aware: don't blindly revert
8. Notify Team — Post findings to Microsoft Teams

The key insight in the skill: "NEVER revert critical drift that is actively mitigating an incident." This teaches the agent to think like an experienced SRE, not just a diff tool.

### Step 3: Create the HTTP Trigger

In the SRE Agent UI, we create an HTTP Trigger named `tfc-drift-handler` with a 7-step agent prompt:

```text
A Terraform Cloud run has completed and detected infrastructure drift.

Workspace: {payload.workspace_name}
Organization: {payload.organization_name}
Run ID: {payload.run_id}
Run Message: {payload.run_message}

STEP 1 — DETECT DRIFT: Compare Terraform configuration against actual Azure state...
STEP 2 — CORRELATE WITH INCIDENTS: Check Azure Activity Log and App Insights...
STEP 3 — CLASSIFY SEVERITY: Rate each drift item as Benign, Risky, or Critical...
STEP 4 — INVESTIGATE ROOT CAUSE: Read the application source code...
STEP 5 — GENERATE DRIFT REPORT: Produce a structured summary...
STEP 6 — RECOMMEND SMART REMEDIATION: Context-aware recommendations...
STEP 7 — NOTIFY TEAM: Post a summary to Microsoft Teams...
```

### Step 4: Connect GitHub and Teams

We connect two integrations in the SRE Agent Connectors settings:

- Code Repository: GitHub — so the agent can read application source code during investigations
- Notification: Microsoft Teams — so the agent can post drift reports to the team channel

## The Incident Story

### Act 1: The Latency Bug

Our demo app has a subtle but devastating bug.
The `/api/data` endpoint calls `processLargeDatasetSync()` — a function that sorts an array on every iteration, creating an O(n² log n) blocking operation. On a B1 App Service Plan (single vCPU), this blocks the Node.js event loop entirely. Under load, response times spike from milliseconds to 25–58 seconds, with 502 Bad Gateway errors from the Azure load balancer.

### Act 2: The On-Call Response

An on-call engineer sees the latency alerts and responds — not through Terraform, but directly through the Azure Portal and CLI. They:

1. Add diagnostic tags — `manual_update=True`, `changed_by=portal_user` (benign)
2. Downgrade TLS from 1.2 to 1.0 while troubleshooting (risky — security regression)
3. Scale the App Service Plan from B1 to S1 to throw more compute at the problem (critical — cost increase from ~$13/mo to ~$73/mo)

The incident is partially mitigated — S1 has more compute, so latency drops from catastrophic to merely bad. Everyone goes back to sleep. Nobody updates Terraform.

### Act 3: The Drift Check Fires

The next morning, a nightly speculative Terraform plan runs and detects 3 drifted attributes. The notification webhook fires, flowing through the Logic App auth bridge to the SRE Agent HTTP Trigger. The agent wakes up and begins its investigation.

## What the Agent Found

### Layer 1: Drift Detection

The agent compares Terraform configuration against Azure reality and produces a severity-classified drift report. Three drift items detected:

- Critical: App Service Plan SKU changed from B1 (~$13/mo) to S1 (~$73/mo) — a +462% cost increase
- Risky: Minimum TLS version downgraded from 1.2 to 1.0 — a security regression vulnerable to BEAST and POODLE attacks
- Benign: Additional tags (`changed_by: portal_user`, `manual_update: True`) — cosmetic, no functional impact

### Layer 2: Incident Correlation

Here's where the agent goes beyond simple drift detection.
It queries Application Insights and discovers a performance incident correlated with the SKU change. Key findings from the incident correlation:

- 97.6% of requests (40 of 41) were impacted by high latency
- The `/api/data` endpoint does not exist in the repository source code — the deployed application has diverged from the codebase
- The endpoint likely contains a blocking synchronous pattern — Node.js runs on a single event loop, and any synchronous blocking call would explain 26–58s response times
- The SKU scale-up from B1 to S1 was an attempt to mitigate latency by adding more compute, but scaling cannot fix application-level blocking code on a single-threaded Node.js server

### Layer 3: Smart Remediation

This is the insight that separates an autonomous agent from a reporting tool. Instead of blindly recommending "revert all drift," the agent produces context-aware remediation recommendations:

- Tags (Benign) → Safe to revert anytime via `terraform apply -target`
- TLS 1.0 (Risky) → Revert immediately — the TLS downgrade is a security risk unrelated to the incident
- SKU S1 (Critical) → DO NOT revert until the `/api/data` performance root cause is fixed

This is the logic an experienced SRE would apply. Blindly running `terraform apply` to revert all drift would scale the app back down to B1 while the blocking code is still deployed — turning a mitigated incident into an active outage.
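This classification and remediation logic can be sketched as a small decision function. The following Python is an illustrative model of the rules described above, not the agent's actual implementation; the rule names and the 100% cost threshold are assumptions made for the example.

```python
def classify_drift(item: dict) -> str:
    """Classify a drift item as Benign, Risky, or Critical.

    Illustrative rules modeled on the post: security regressions are
    Risky, large cost increases are Critical, cosmetic tag changes are
    Benign, and unknown drift defaults to Risky out of caution.
    """
    if item.get("kind") == "tls_downgrade":
        return "Risky"
    if item.get("kind") == "sku_change":
        old_cost, new_cost = item["old_cost"], item["new_cost"]
        pct_increase = (new_cost - old_cost) / old_cost * 100
        # B1 (~$13/mo) -> S1 (~$73/mo) is roughly a +462% increase
        if pct_increase > 100:
            return "Critical"
    if item.get("kind") == "tags":
        return "Benign"
    return "Risky"

def recommend(severity: str, mitigating_active_incident: bool) -> str:
    # The skill's key rule: never revert critical drift that is
    # actively mitigating an incident.
    if severity == "Critical" and mitigating_active_incident:
        return "DO NOT revert until the root cause is fixed"
    if severity == "Risky":
        return "Revert immediately"
    return "Safe to revert anytime"

sku = {"kind": "sku_change", "old_cost": 13, "new_cost": 73}
print(classify_drift(sku))          # → Critical
print(recommend("Critical", True))  # → DO NOT revert until the root cause is fixed
```

The point of the second function is exactly the scenario above: severity alone is not enough, because a Critical drift item that is mitigating a live incident must not be reverted.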
### Layer 4: Investigation Summary

The agent produces a complete summary tying everything together. Key findings in the summary:

- Actor: surivineela@microsoft.com made all changes via Azure Portal at ~23:19 UTC
- Performance incident: `/api/data` averaging 25–57s latency, affecting 97.6% of requests
- Code-infrastructure mismatch: `/api/data` exists in production but not in the repository source code
- Root cause: the SKU scale-up was emergency incident response, not unauthorized drift

### Layer 5: Teams Notification

The agent posts a structured drift report to the team's Microsoft Teams channel. The on-call engineer opens Teams in the morning and sees everything they need: what drifted, why it drifted, and exactly what to do about it — without logging into any dashboard.

## The Payoff: A Self-Improving Agent

Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for.

### The Agent Improved Its Own Skill

The agent performed an Execution Review — analyzing what worked and what didn't during its investigation — and found 5 gaps in its own `terraform-drift-analysis.md` skill file.

What worked well:

- Drift detection via `az` CLI comparison against Terraform HCL was straightforward
- Activity Log correlation identified the actor and timing
- Application Insights telemetry revealed the performance incident driving the SKU change

Gaps it found and fixed:

1. No incident correlation guidance — the skill didn't instruct checking App Insights
2. No code-infrastructure mismatch detection — no guidance to verify deployed code matches the repository
3. No smart remediation logic — didn't warn against reverting critical drift during active incidents
4. Report template missing an incident correlation column
5. No Activity Log integration guidance — didn't instruct checking who made changes and when

The agent then edited its own skill file to incorporate these learnings.
Next time it runs a drift analysis, it will include incident correlation, code-infra mismatch checks, and smart remediation logic by default. This is a learning loop — every investigation makes the agent better at future investigations.

### The Agent Created a PR

Without being asked, the agent identified the root-cause code issue and proactively created a pull request to fix it. The PR includes:

- App safety fixes: adding `MAX_DELAY_MS` and `SERVER_TIMEOUT_MS` constants to prevent unbounded latency
- Skill improvements: incorporating incident correlation, code-infra mismatch detection, and smart remediation logic

From a single webhook: drift detected → incident correlated → root cause found → team notified → skill improved → fix shipped.

## Key Takeaways

- Drift detection is not enough. Knowing that B1 changed to S1 is table stakes. Knowing it changed because of a latency incident, and that reverting it would cause an outage — that's the insight that matters.
- Context-aware remediation prevents outages. Blindly running `terraform apply` after drift would have scaled the app back to B1 while blocking code was still deployed. The agent's "DO NOT revert SKU" recommendation is the difference between fixing drift and causing a P1.
- Skills create a learning loop. The agent's self-review and skill improvement mean every investigation makes the next one better — without human intervention.
- HTTP Triggers connect any platform. The auth bridge pattern (Logic App + Managed Identity) works for Terraform Cloud, but the same architecture applies to any webhook source: GitHub Actions, Jenkins, Datadog, PagerDuty, custom internal tools.
- The agent acts, not just reports. From a single webhook: drift detected, incident correlated, root cause identified, team notified via Teams, skill improved, and PR created. End-to-end in one autonomous session.
## Getting Started

HTTP Triggers are available now in Azure SRE Agent:

1. Create a Skill — Teach the agent your operational runbook (in this case, drift analysis with severity classification and smart remediation)
2. Create an HTTP Trigger — Define your agent prompt with `{payload.X}` placeholders and connect it to a skill
3. Set Up an Auth Bridge — Deploy a Logic App with Managed Identity to handle Azure AD token acquisition
4. Connect Your Source — Point Terraform Cloud (or any webhook-capable platform) at the Logic App URL
5. Connect GitHub + Teams — Give the agent access to source code and team notifications

Within minutes, you'll have an autonomous pipeline that turns infrastructure drift events into fully contextualized investigations — with incident correlation, root cause analysis, and smart remediation recommendations. The full implementation guide, Terraform files, skill definitions, and demo scripts are available in this repository.

# Explaining what GitHub Copilot Modernization can (and cannot) do
In the last post, we looked at the workflow: assess, plan, execute. You get reports you can review, and the agent makes changes you can inspect. If you don't know it yet, GitHub Copilot Modernization is the new agentic tool that supports you in modernizing older applications. Could it help with that old .NET Framework 4.8 app, or even that forgotten VB.NET script?

You're probably not modernizing one small app. It's probably a handful of projects, each with its own stack of blockers: different frameworks, different databases, different dependencies frozen in time because nobody wants to touch them. GitHub Copilot Modernization handles two big categories: upgrading .NET projects to newer versions and migrating .NET apps to Azure. But what does that look like?

## Upgrading .NET Projects

Let's say you've got an ASP.NET app running on .NET Framework 4.8, or a web API stuck on .NET Core 3.1. Unfortunately, getting it to .NET 9 or 10 isn't just updating a target framework property. Here's what the upgrade workflow handles in Visual Studio:

Assessment first. The agent examines your project structure, dependencies, and code patterns. It generates an Assessment Report UI, which shows both the app information (used to create the plan) and the Cloud Readiness (for Azure deployment).

Then planning. Once you approve the assessment, it moves to planning. Here you get upgrade strategies, refactoring approaches, dependency upgrade paths, and risk mitigations documented in a `plan.md` file at `.appmod/.migration`. You can check and edit that Markdown before moving forward, or ask in the Copilot Chat window to change it.

```markdown
# .NET 10.0 Upgrade Plan

## Execution Steps

Execute steps below sequentially one by one in the order they are listed.

1. Validate that a .NET 10.0 SDK required for this upgrade is installed on the machine and if not, help to get it installed.
2. Ensure that the SDK version specified in global.json files is compatible with the .NET 10.0 upgrade.
3. Upgrade src\eShopLite.StoreFx\eShopLite.StoreFx.csproj

## Settings

This section contains settings and data used by execution steps.

### Excluded projects

No projects are excluded from this upgrade.

### Aggregate NuGet packages modifications across all projects

NuGet packages used across all selected projects or their dependencies that need version update in projects that reference them
```

Then execution. After you approve the plan, the agent breaks it into discrete tasks in a `tasks.md` file. Each task gets validation criteria. As it works, it updates the file with checkboxes and completion percentages so you can track progress. It makes code changes, verifies builds, and runs tests. If it hits a problem, it tries to identify the cause and apply a fix. Go to the GitHub Copilot Chat window and type:

```text
The plan and progress tracker look good to me. Go ahead with the migration.
```

It usually creates Git commits for each portion so you can review what changed or roll back if you need to. If you don't need Git commits for the changes, you can ask the agent at the start not to commit anything.

The agent primarily focuses on ASP.NET, ASP.NET Core, Blazor, Razor Pages, MVC, and Web API. It can also handle Azure Functions, WPF, Windows Forms, console apps, class libraries, and test projects.

## What It Handles Well (and What It Doesn't)

The agent is good at code-level transformations: updating `TargetFramework` in .csproj files, upgrading NuGet packages, replacing deprecated APIs with their modern equivalents, fixing breaking changes like removed `BinaryFormatter` methods, running builds, and validating test suites. It can handle repetitive work across multiple projects in a solution without you needing to track every dependency manually.
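As a simple illustration of the kind of project-file change involved, here is a hypothetical before/after of a `TargetFramework` edit on an SDK-style web project. The file content and versions are examples for illustration, not taken from the post:

```xml
<!-- Before: SDK-style project targeting .NET Core 3.1 -->
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>
</Project>

<!-- After: the same project retargeted to .NET 10 by the upgrade -->
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net10.0</TargetFramework>
  </PropertyGroup>
</Project>
```

The one-line edit is trivial in isolation; the value of the agent is doing it across many projects while also chasing the NuGet upgrades and API breaks that the retarget causes.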
It's also solid at applying predefined Azure migration patterns: swapping plaintext credentials for managed identity, replacing file I/O with Azure Blob Storage calls, moving authentication from on-prem Active Directory to Microsoft Entra ID. These are structured transformations with clear before-and-after code patterns.

But here's where you may need to pay closer attention:

- Language and framework coverage: It works mainly with C# projects. If your codebase includes complex Entity Framework migrations that rely on hand-tuned database scripts, the agent won't rewrite those for you. It also won't handle third-party UI framework patterns that don't map cleanly to ASP.NET Core conventions, or APIs with breaking changes between .NET Framework and later .NET versions. Web Forms migration is underway.
- Configuration and infrastructure: The agent doesn't migrate IIS-specific web.config settings that don't have direct equivalents in Kestrel or ASP.NET Core. It won't automatically set up a CI/CD pipeline or other modernization features; for those, you need to implement them with Copilot's help. If you've got frontend frameworks bundled with ASP.NET (like an older Angular app served through MVC), you'll need to separate and upgrade that layer yourself.
- Learning and memory: The agent uses your code as context during the session, and if you correct a fix or update the plan, it tries to apply that learning within the same session. But those corrections don't persist across future upgrades. You can encode internal standards using custom skills, but that requires deliberate setup.
- Offline and deployment: There's no offline mode; the agent needs connectivity to run. And while it can help prepare your app for Azure deployment, it doesn't manage the actual infrastructure provisioning or ongoing operations — that's still on you.
- Guarantees: The suggestions aren't guaranteed to follow best practices. The agent won't always pick the best migration path. It won't catch every edge case.
You're reviewing the work; pay attention to the results before putting them into production.

What it does handle: the tedious parts. Reading dependency graphs. Finding all the places a deprecated API is used. Updating project files. Writing boilerplate for managed identity. Fixing compilation errors that follow a predictable pattern.

## Where to Start

If you've been staring at a modernization backlog, pick one project and see what the agent comes up with. You don't have to commit to upgrading your entire portfolio. Try it on one project and see if it saves you time. Modernization at scale still happens application by application, repo by repo, and decision by decision. GitHub Copilot Modernization just makes each one a little less painful. Experiment with it!

# Using an AI Agent to Troubleshoot and Fix Azure Function App Issues
TOC

- Preparation
- Troubleshooting Workflow
- Conclusion

## Preparation

### Required tools

- AI agent: for example, Copilot CLI, OpenCode, Hermes, or OpenClaw. In this example, we use Copilot CLI.
- Model access: for example, Anthropic Claude Opus.
- Relevant skills: this example does not use skills, but relevant skills can speed up troubleshooting.

### Compliance with your organization

Enterprise-level projects are sensitive, so you must confirm with the appropriate stakeholders before using these tools. Enterprise environments may also have strict standards for AI agent usage.

### Network limitations

- If the process involves restarting the Function App container or changing related settings, communication between the user and the agent may be interrupted, and you will need to use `/resume`.
- If the agent needs internet access for its investigation, the app must have outbound connectivity. If the Kudu container cannot be used because of network issues, this type of investigation cannot be carried out.

### Permission limitations

- If you are using Azure blessed images, according to the official documentation, the containers use the fixed password `Docker!`. If you are using a custom container, you will need to provide an additional login method.
- For resources the agent does not already have permission to investigate, you will need to enable a system-assigned managed identity (SAMI) and assign the appropriate RBAC roles.

## Troubleshooting Workflow

Let's use a classic case where an HTTP trigger cannot be tested from the Azure Portal: clicking Test/Run in the Azure Portal produces an error message, yet the home page does not show any abnormal status.

At this point, we first obtain the Function App's SAMI and assign it the Owner role for the entire resource group. This is only for demonstration purposes.
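For reference, the identity and role assignment can be done from the Azure CLI. This is a hedged sketch: the resource names are placeholders, and granting Owner on the whole resource group is deliberately over-broad for the demo.

```shell
# Enable the system-assigned managed identity and capture its principal ID
principalId=$(az functionapp identity assign \
  --name <function-app-name> \
  --resource-group <resource-group> \
  --query principalId -o tsv)

# Demo only: grant Owner on the resource group (scope down in practice)
az role assignment create \
  --assignee "$principalId" \
  --role "Owner" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
```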
In practice, you should follow the principle of least privilege and scope permissions down to only the specific resources and operations that are actually required.

Next, go to the Kudu container — the always-on maintenance container dedicated to the app — and install and enable Copilot CLI. Then describe the problem you are encountering. After the agent processes the issue and interacts with you further, it can generate a reasonable investigation report. In this example, it appears that the Function App's Storage Account access key had been rotated previously, but the Function App had not updated the corresponding environment variable.

Once we understand the issue, we could perform the follow-up actions ourselves. However, to demonstrate the agent's capabilities, you can also allow it to fix the problem directly, provided that you have granted the corresponding permissions through SAMI. During the process, the container restart will disconnect the session, so you will need to return to the Kudu container and resume the previous session so it can continue. Finally, it will inform you that the issue has been fixed, and then you can validate the result. The validation confirms the repair was successful.

## Conclusion

After each repair, we can even extract the experience from that case into a skill and store it in a Storage Account for future reuse. In this way, we can not only reduce the agent's initial investigation time for similar issues, but also save tokens, making both time and cost management more efficient.

# Build Multi-Agent AI Apps on Azure App Service with Microsoft Agent Framework 1.0
*Part 1 of 3 — Multi-Agent AI on Azure App Service.* This is part 1 of a 3-part series on deploying and working with multi-agent AI on Azure App Service. Follow along to learn how to deploy, manage, observe, and secure your agents on Azure App Service.

A couple of months ago, we published a three-part series showing how to build multi-agent AI systems on Azure App Service using preview packages from the Microsoft Agent Framework (MAF, formerly AutoGen / Semantic Kernel Agents). The series walked through async processing, the request-reply pattern, and client-side multi-agent orchestration — all running on App Service.

Since then, Microsoft Agent Framework has reached 1.0 GA — unifying AutoGen and Semantic Kernel into a single, production-ready agent platform. This post is a fresh start with the GA bits. We'll rebuild our travel-planner sample on the stable API surface, call out the breaking changes from preview, and get you up and running fast. All of the code is in the companion repo: seligj95/app-service-multi-agent-maf-otel.

## What Changed in MAF 1.0 GA

The 1.0 release is more than a version bump. Here's what moved:

- Unified platform. AutoGen and Semantic Kernel agent capabilities have converged into `Microsoft.Agents.AI`. One package, one API surface.
- Stable APIs with long-term support. The 1.0 contract is now locked for servicing. No more preview churn.
- Breaking change — `Instructions` on options removed. In preview, you set instructions through `ChatClientAgentOptions.Instructions`. In GA, pass them directly to the `ChatClientAgent` constructor.
- Breaking change — `RunAsync` parameter rename. The `thread` parameter is now `session` (type `AgentSession`). If you were using named arguments, this is a compile error.
- Microsoft.Extensions.AI upgraded. The framework moved from the 9.x preview of Microsoft.Extensions.AI to the stable 10.4.1 release.
- OpenTelemetry integration built in. The builder pipeline now includes `UseOpenTelemetry()` out of the box — more on that in Blog 2.
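The two breaking changes can be summarized in a short before/after sketch. The GA form matches the constructor and `RunAsync` usage shown later in this post; the preview form is reconstructed from the description above and may not match the exact preview overloads.

```csharp
// Preview (no longer compiles in 1.0 GA):
// instructions lived on ChatClientAgentOptions, and RunAsync took `thread`
var previewAgent = new ChatClientAgent(chatClient,
    new ChatClientAgentOptions { Instructions = "You are a travel planner." });
await previewAgent.RunAsync(messages, thread: null);

// 1.0 GA: instructions move to the constructor, `thread` becomes `session`
var gaAgent = new ChatClientAgent(chatClient,
    instructions: "You are a travel planner.");
await gaAgent.RunAsync(messages, session: null);
```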
Our project references reflect the GA stack:

```xml
<PackageReference Include="Microsoft.Agents.AI" Version="1.0.0" />
<PackageReference Include="Microsoft.Extensions.AI" Version="10.4.1" />
<PackageReference Include="Azure.AI.OpenAI" Version="2.1.0" />
```

## Why Azure App Service for AI Agents?

If you're building with Microsoft Agent Framework, you need somewhere to run your agents. You could reach for Kubernetes, containers, or serverless — but for most agent workloads, Azure App Service is the sweet spot. Here's why:

- No infrastructure management — App Service is fully managed. No clusters to configure, no container orchestration to learn. Deploy your .NET or Python agent code and it just runs.
- Always On — Agent workflows can take minutes. App Service's Always On feature (on Premium tiers) ensures your background workers never go cold, so agents are ready to process requests instantly.
- WebJobs for background processing — Long-running agent workflows don't belong in HTTP request handlers. App Service's built-in WebJob support gives you a dedicated background worker that shares the same deployment, configuration, and managed identity — no separate compute resource needed.
- Managed Identity everywhere — Zero secrets in your code. App Service's system-assigned managed identity authenticates to Azure OpenAI, Service Bus, Cosmos DB, and Application Insights automatically. No connection strings, no API keys, no rotation headaches.
- Built-in observability — Native integration with Application Insights and OpenTelemetry means you can see exactly what your agents are doing in production (more on this in Part 2).
- Enterprise-ready — VNet integration, deployment slots for safe rollouts, custom domains, auto-scaling rules, and built-in authentication. All the things you'll need when your agent POC becomes a production service.
- Cost-effective — A single P0v4 instance (~$75/month) hosts both your API and WebJob worker.
Compare that to running separate container apps or a Kubernetes cluster for the same workload.

The bottom line: App Service lets you focus on building your agents, not managing infrastructure. And since MAF supports both .NET and Python — both first-class citizens on App Service — you're covered regardless of your language preference.

## Architecture Overview

The sample is a travel planner that coordinates six specialized agents to build a personalized trip itinerary. Users fill out a form (destination, dates, budget, interests), and the system returns a comprehensive travel plan complete with weather forecasts, currency advice, a day-by-day itinerary, and a budget breakdown.

### The Six Agents

- Currency Converter — calls the Frankfurter API for real-time exchange rates
- Weather Advisor — calls the National Weather Service API for forecasts and packing tips
- Local Knowledge Expert — cultural insights, customs, and hidden gems
- Itinerary Planner — day-by-day scheduling with timing and costs
- Budget Optimizer — allocates spend across categories and suggests savings
- Coordinator — assembles everything into a polished final plan

### Four-Phase Workflow

| Phase | Agents | Execution |
|-------|--------|-----------|
| 1 — Parallel Gathering | Currency, Weather, Local Knowledge | `Task.WhenAll` |
| 2 — Itinerary | Itinerary Planner | Sequential (uses Phase 1 context) |
| 3 — Budget | Budget Optimizer | Sequential (uses Phase 2 output) |
| 4 — Assembly | Coordinator | Final synthesis |

### Infrastructure

- Azure App Service (P0v4) — hosts the API and a continuous WebJob for background processing
- Azure Service Bus — decouples the API from heavy AI work (async request-reply)
- Azure Cosmos DB — stores task state, results, and per-agent chat histories (24-hour TTL)
- Azure OpenAI (GPT-4o) — powers all agent LLM calls
- Application Insights + Log Analytics — monitoring and diagnostics

## ChatClientAgent Deep Dive

At the core of every agent is `ChatClientAgent` from `Microsoft.Agents.AI`.
It wraps an `IChatClient` (from `Microsoft.Extensions.AI`) with instructions, a name, a description, and optionally a set of tools. This is client-side orchestration — you control the chat history, lifecycle, and execution order. No server-side Foundry agent resources are created.

Here's the `BaseAgent` pattern used by all six agents in the sample:

```csharp
// BaseAgent.cs — constructor for agents with tools
Agent = new ChatClientAgent(
        chatClient,
        instructions: Instructions,
        name: AgentName,
        description: Description,
        tools: chatOptions.Tools?.ToList())
    .AsBuilder()
    .UseOpenTelemetry(sourceName: AgentName)
    .Build();
```

Notice the builder pipeline: `.AsBuilder().UseOpenTelemetry(...).Build()`. This opts every agent into the framework's built-in OpenTelemetry instrumentation with a single line. We'll explore what that telemetry looks like in Blog 2.

Invoking an agent is equally straightforward:

```csharp
// BaseAgent.cs — InvokeAsync
public async Task<ChatMessage> InvokeAsync(
    IList<ChatMessage> chatHistory,
    CancellationToken cancellationToken = default)
{
    var response = await Agent.RunAsync(
        chatHistory,
        session: null,
        options: null,
        cancellationToken);

    return response.Messages.LastOrDefault()
        ?? new ChatMessage(ChatRole.Assistant, "No response generated.");
}
```

Key things to note:

- `session: null` — this is the renamed parameter (it was `thread` in preview). We pass `null` because we manage chat history ourselves.
- The agent receives the full `chatHistory` list, so context accumulates across turns.
- Simple agents (Local Knowledge, Itinerary Planner, Budget Optimizer, Coordinator) use the tool-less constructor; agents that call external APIs (Currency, Weather) use the constructor that accepts `ChatOptions` with tools.

## Tool Integration

Two of our agents — Weather Advisor and Currency Converter — call real external APIs through the MAF tool-calling pipeline. Tools are registered using `AIFunctionFactory.Create()` from `Microsoft.Extensions.AI`.
Here's how the `WeatherAdvisorAgent` wires up its tool:

```csharp
// WeatherAdvisorAgent.cs
private static ChatOptions CreateChatOptions(
    IWeatherService weatherService,
    ILogger logger)
{
    var chatOptions = new ChatOptions
    {
        Tools = new List<AITool>
        {
            AIFunctionFactory.Create(
                GetWeatherForecastFunction(weatherService, logger))
        }
    };

    return chatOptions;
}
```

`GetWeatherForecastFunction` returns a `Func<double, double, int, Task<string>>` that the model can call with latitude, longitude, and number of days. Under the hood, it hits the National Weather Service API and returns a formatted forecast string. The Currency Converter follows the same pattern with the Frankfurter API.

This is one of the nicest parts of the GA API: you write a plain C# method, wrap it with `AIFunctionFactory.Create()`, and the framework handles the JSON schema generation, function-call parsing, and response routing automatically.

## Multi-Phase Workflow Orchestration

The `TravelPlanningWorkflow` class coordinates all six agents. The key insight is that the orchestration is just C# code — no YAML, no graph DSL, no special runtime. You decide when agents run, what context they receive, and how results flow between phases.

```csharp
// Phase 1: Parallel Information Gathering
var gatheringTasks = new[]
{
    GatherCurrencyInfoAsync(request, state, progress, cancellationToken),
    GatherWeatherInfoAsync(request, state, progress, cancellationToken),
    GatherLocalKnowledgeAsync(request, state, progress, cancellationToken)
};
await Task.WhenAll(gatheringTasks);
```

After Phase 1 completes, results are stored in a `WorkflowState` object — a simple dictionary-backed container that holds per-agent chat histories and contextual data:

```csharp
// WorkflowState.cs
public Dictionary<string, object> Context { get; set; } = new();
public Dictionary<string, List<ChatMessage>> AgentChatHistories { get; set; } = new();
```

Phases 2–4 run sequentially, each pulling context from the previous phase.
For example, the Itinerary Planner receives weather and local knowledge gathered in Phase 1: var localKnowledge = state.GetFromContext<string>("LocalKnowledge") ?? ""; var weatherAdvice = state.GetFromContext<string>("WeatherAdvice") ?? ""; var itineraryChatHistory = state.GetChatHistory("ItineraryPlanner"); itineraryChatHistory.Add(new ChatMessage(ChatRole.User, $"Create a detailed {days}-day itinerary for {request.Destination}..." + $"\n\nWEATHER INFORMATION:\n{weatherAdvice}" + $"\n\nLOCAL KNOWLEDGE & TIPS:\n{localKnowledge}")); var itineraryResponse = await _itineraryAgent.InvokeAsync( itineraryChatHistory, cancellationToken); This pattern — parallel fan-out followed by sequential context enrichment — is simple, testable, and easy to extend. Need a seventh agent? Add it to the appropriate phase and wire it into WorkflowState . Async Request-Reply Pattern A multi-agent workflow with six LLM calls (some with tool invocations) can easily run 30–60 seconds. That's well beyond typical HTTP timeout expectations and not a great user experience for a synchronous request. We use the Async Request-Reply pattern to handle this: The API receives the travel plan request and immediately queues a message to Service Bus. It stores an initial task record in Cosmos DB with status queued and returns a taskId to the client. A continuous WebJob (running as a separate process on the same App Service plan) picks up the message, executes the full multi-agent workflow, and writes the result back to Cosmos DB. The client polls the API for status updates until the task reaches completed . This pattern keeps the API responsive, makes the heavy work retriable (Service Bus handles retries and dead-lettering), and lets the WebJob run independently — you can restart it without affecting the API. We covered this pattern in detail in the previous series, so we won't repeat the plumbing here. 
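To get a feel for the flow without any Azure plumbing, here's a minimal stdlib-only Python simulation of the async request-reply lifecycle — an in-memory queue standing in for Service Bus and a dict standing in for the Cosmos DB task records. All names here are hypothetical illustrations, not the sample's actual code:

```python
import queue
import threading
import time
import uuid

task_store = {}             # stand-in for Cosmos DB task records
work_queue = queue.Queue()  # stand-in for the Service Bus queue

def submit(request: dict) -> str:
    """API side: record the task, enqueue the work, return a taskId immediately."""
    task_id = str(uuid.uuid4())
    task_store[task_id] = {"status": "queued", "result": None}
    work_queue.put((task_id, request))
    return task_id

def worker():
    """WebJob side: pick up the message, run the slow workflow, write the result back."""
    task_id, request = work_queue.get()
    task_store[task_id]["status"] = "running"
    # Imagine 30-60 seconds of multi-agent LLM calls happening here.
    result = f"Travel plan for {request['destination']}"
    task_store[task_id].update(status="completed", result=result)

def poll(task_id: str, timeout: float = 5.0) -> dict:
    """Client side: poll the status record until the task completes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        record = task_store[task_id]
        if record["status"] == "completed":
            return record
        time.sleep(0.05)
    raise TimeoutError(task_id)

threading.Thread(target=worker, daemon=True).start()
tid = submit({"destination": "Tokyo"})
print(poll(tid)["result"])  # -> Travel plan for Tokyo
```

The important property the simulation shows: `submit` returns instantly with a handle, and the slow work happens on a different thread of execution — which is exactly why the real API stays responsive while the WebJob grinds through agent calls.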
Deploy with azd The repo is wired up with the Azure Developer CLI for one-command provisioning and deployment: git clone https://github.com/seligj95/app-service-multi-agent-maf-otel.git cd app-service-multi-agent-maf-otel azd auth login azd up azd up provisions the following resources via Bicep: Azure App Service (P0v4 Windows) with a continuous WebJob Azure Service Bus namespace and queue Azure Cosmos DB account, database, and containers Azure AI Services (Azure OpenAI with GPT-4o deployment) Application Insights and Log Analytics workspace Managed Identity with all necessary role assignments After deployment completes, azd outputs the App Service URL. Open it in your browser, fill in the travel form, and watch six agents collaborate on your trip plan in real time. What's Next We now have a production-ready multi-agent app running on App Service with the GA Microsoft Agent Framework. But how do you actually observe what these agents are doing? When six agents are making LLM calls, invoking tools, and passing context between phases — you need visibility into every step. In the next post, we'll dive deep into how we instrumented these agents with OpenTelemetry and the new Agents (Preview) view in Application Insights — giving you full visibility into agent runs, token usage, tool calls, and model performance. You already saw the .UseOpenTelemetry() call in the builder pipeline; Blog 2 shows what that telemetry looks like end to end and how to light up the new Agents experience in the Azure portal. Stay tuned! 
Resources Sample repo — app-service-multi-agent-maf-otel Microsoft Agent Framework 1.0 GA Announcement Microsoft Agent Framework Documentation Previous Series — Part 3: Client-Side Multi-Agent Orchestration on App Service Microsoft.Extensions.AI Documentation Azure App Service Documentation Blog 2: Monitor AI Agents on App Service with OpenTelemetry and the New Application Insights Agents View Blog 3: Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit

Monitor AI Agents on App Service with OpenTelemetry and the New Application Insights Agents View
Part 2 of 3: In Blog 1, we deployed a multi-agent travel planner on Azure App Service using the Microsoft Agent Framework (MAF) 1.0 GA. This post dives deep into how we instrumented those agents with OpenTelemetry and lit up the brand-new Agents (Preview) view in Application Insights. 📋 Prerequisite: This post assumes you've followed the guidance in Blog 1 to deploy the multi-agent travel planner to Azure App Service. If you haven't deployed the app yet, start there first — you'll need a running App Service with the agents, Service Bus, Cosmos DB, and Azure OpenAI provisioned before the monitoring steps in this post will work. Deploying Agents Is Only Half the Battle In Blog 1, we walked through deploying a multi-agent travel planning application on Azure App Service. Six specialized agents — a Coordinator, Currency Converter, Weather Advisor, Local Knowledge Expert, Itinerary Planner, and Budget Optimizer — work together to generate comprehensive travel plans. The architecture uses an ASP.NET Core API backed by a WebJob for async processing, Azure Service Bus for messaging, and Azure OpenAI for the brains. But here's the thing: deploying agents to production is only half the battle. Once they're running, you need answers to questions like: Which agent is consuming the most tokens? How long does the Itinerary Planner take compared to the Weather Advisor? Is the Coordinator making too many LLM calls per workflow? When something goes wrong, which agent in the pipeline failed? Traditional APM gives you HTTP latencies and exception rates. That's table stakes. For AI agents, you need to see inside the agent — the model calls, the tool invocations, the token spend. And that's exactly what Application Insights' new Agents (Preview) view delivers, powered by OpenTelemetry and the GenAI semantic conventions. Let's break down how it all works. 
The Agents (Preview) View in Application Insights Azure Application Insights now includes a dedicated Agents (Preview) blade that provides unified monitoring purpose-built for AI agents. It's not just a generic dashboard — it understands agent concepts natively. Whether your agents are built with Microsoft Agent Framework, Azure AI Foundry, Copilot Studio, or a third-party framework, this view lights up as long as your telemetry follows the GenAI semantic conventions. Here's what you get out of the box: Agent dropdown filter — A dropdown populated by gen_ai.agent.name values from your telemetry. In our travel planner, this shows all six agents: "Travel Planning Coordinator", "Currency Conversion Specialist", "Weather & Packing Advisor", "Local Expert & Cultural Guide", "Itinerary Planning Expert", and "Budget Optimization Specialist". You can filter the entire dashboard to one agent or view them all. Token usage metrics — Visualizations of input and output token consumption, broken down by agent. Instantly see which agents are the most expensive to run. Operational metrics — Latency distributions, error rates, and throughput for each agent. Spot performance regressions before users notice. End-to-end transaction details — Click into any trace to see the full workflow: which agents were invoked, what tools they called, how long each step took. The "simple view" renders agent steps in a story-like format that's remarkably easy to follow. Grafana integration — One-click export to Azure Managed Grafana for custom dashboards and alerting. The key insight: this view isn't magic. It works because the telemetry is structured using well-defined semantic conventions. Let's look at those next. 📖 Docs: Application Insights Agents (Preview) view documentation GenAI Semantic Conventions — The Foundation The entire Agents view is powered by the OpenTelemetry GenAI semantic conventions. 
These are a standardized set of span attributes that describe AI agent behavior in a way that any observability backend can understand. Think of them as the "contract" between your instrumented code and Application Insights. Let's walk through the key attributes and why each one matters: gen_ai.agent.name This is the human-readable name of the agent. In our travel planner, each agent sets this via the name parameter when constructing the MAF ChatClientAgent — for example, "Weather & Packing Advisor" or "Budget Optimization Specialist" . This is what populates the agent dropdown in the Agents view. Without this attribute, Application Insights would have no way to distinguish one agent from another in your telemetry. It's the single most important attribute for agent-level monitoring. gen_ai.agent.description A brief description of what the agent does. Our Weather Advisor, for example, is described as "Provides weather forecasts, packing recommendations, and activity suggestions based on destination weather conditions." This metadata helps operators and on-call engineers quickly understand an agent's role without diving into source code. It shows up in trace details and helps contextualize what you're looking at when debugging. gen_ai.agent.id A unique identifier for the agent instance. In MAF, this is typically an auto-generated GUID. While gen_ai.agent.name is the human-friendly label, gen_ai.agent.id is the machine-stable identifier. If you rename an agent, the ID stays the same, which is important for tracking agent behavior across code deployments. gen_ai.operation.name The type of operation being performed. Values include "chat" for standard LLM calls and "execute_tool" for tool/function invocations. In our travel planner, when the Weather Advisor calls the GetWeatherForecast function via NWS, or when the Currency Converter calls ConvertCurrency via the Frankfurter API, those tool calls get their own spans with gen_ai.operation.name = "execute_tool" . 
This lets you measure LLM think-time separately from tool execution time — a critical distinction for performance optimization. gen_ai.request.model / gen_ai.response.model The model used for the request and the model that actually served the response (these can differ when providers do model routing). In our case, both are "gpt-4o" since that's what we deploy via Azure OpenAI. These attributes let you track model usage across agents, spot unexpected model assignments, and correlate performance changes with model updates. gen_ai.usage.input_tokens / gen_ai.usage.output_tokens Token consumption per LLM call. This is what powers the token usage visualizations in the Agents view. The Coordinator agent, which aggregates results from all five specialist agents, tends to have higher output token counts because it's synthesizing a full travel plan. The Currency Converter, which makes focused API calls, uses fewer tokens overall. These attributes let you answer the question "which agent is costing me the most?" — and more importantly, let you set alerts when token usage spikes unexpectedly. gen_ai.system The AI system or provider. In our case, this is "openai" (set by the Azure OpenAI client instrumentation). If you're using multiple AI providers — say, Azure OpenAI for planning and a local model for classification — this attribute lets you filter and compare. Together, these attributes create a rich, structured view of agent behavior that goes far beyond generic tracing. They're the reason Application Insights can render agent-specific dashboards with token breakdowns, latency distributions, and end-to-end workflow views. Without these conventions, all you'd see is opaque HTTP calls to an OpenAI endpoint. 💡 Key takeaway: The GenAI semantic conventions are what transform generic distributed traces into agent-aware observability. They're the bridge between your code and the Agents view. 
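To make the payoff concrete, here's a stdlib-only Python sketch of the rollup behind the token-usage tile — grouping the `gen_ai.usage.*` counters by `gen_ai.agent.name`. The span dicts below are illustrative values, not real trace data, and the aggregation function is a hypothetical simplification of what the Agents view computes:

```python
from collections import defaultdict

# Illustrative span attributes following the GenAI semantic conventions.
spans = [
    {"gen_ai.agent.name": "Travel Planning Coordinator",
     "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 900},
    {"gen_ai.agent.name": "Weather & Packing Advisor",
     "gen_ai.usage.input_tokens": 450, "gen_ai.usage.output_tokens": 120},
    {"gen_ai.agent.name": "Weather & Packing Advisor",
     "gen_ai.usage.input_tokens": 680, "gen_ai.usage.output_tokens": 350},
]

def tokens_by_agent(spans):
    """Sum input/output tokens per agent — the per-agent breakdown the token tile renders."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for span in spans:
        agent = span.get("gen_ai.agent.name", "(unknown)")
        totals[agent]["input"] += span.get("gen_ai.usage.input_tokens", 0)
        totals[agent]["output"] += span.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)

print(tokens_by_agent(spans))
```

Without `gen_ai.agent.name` on each span, this grouping is impossible — every LLM call collapses into one anonymous bucket, which is precisely the failure mode of generic APM.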
Any framework that emits these attributes — MAF, Semantic Kernel, LangChain — can light up this dashboard. Two Layers of OpenTelemetry Instrumentation Our travel planner sample instruments at two distinct levels, each capturing different aspects of agent behavior. Let's look at both. Layer 1: IChatClient-Level Instrumentation The first layer instruments at the IChatClient level using Microsoft.Extensions.AI . This is where we wrap the Azure OpenAI chat client with OpenTelemetry: var client = new AzureOpenAIClient(azureOpenAIEndpoint, new DefaultAzureCredential()); // Wrap with OpenTelemetry to emit GenAI semantic convention spans return client.GetChatClient(modelDeploymentName).AsIChatClient() .AsBuilder() .UseOpenTelemetry() .Build(); This single .UseOpenTelemetry() call intercepts every LLM call and emits spans with: gen_ai.system — the AI provider (e.g., "openai" ) gen_ai.request.model / gen_ai.response.model — which model was used gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — token consumption per call gen_ai.operation.name — the operation type ( "chat" ) Think of this as the "LLM layer" — it captures what the model is doing regardless of which agent called it. It's model-centric telemetry. Layer 2: Agent-Level Instrumentation The second layer instruments at the agent level using MAF 1.0 GA's built-in OpenTelemetry support. 
This happens in the BaseAgent class that all our agents inherit from: Agent = new ChatClientAgent( chatClient, instructions: Instructions, name: AgentName, description: Description, tools: chatOptions.Tools?.ToList()) .AsBuilder() .UseOpenTelemetry(sourceName: AgentName) .Build(); The .UseOpenTelemetry(sourceName: AgentName) call on the MAF agent builder emits a different set of spans: gen_ai.agent.name — the human-readable agent name (e.g., "Weather & Packing Advisor" ) gen_ai.agent.description — what the agent does gen_ai.agent.id — the unique agent identifier Agent invocation traces — spans that represent the full lifecycle of an agent call This is the "agent layer" — it captures which agent is doing the work and provides the identity information that powers the Agents view dropdown and per-agent filtering. Why Both Layers? When both layers are active, you get the richest possible telemetry. The agent-level spans nest around the LLM-level spans, creating a trace hierarchy that looks like:

Agent: "Weather & Packing Advisor" (gen_ai.agent.name)
└── chat (gen_ai.operation.name)
    ├── model: gpt-4o, input_tokens: 450, output_tokens: 120
    └── execute_tool: GetWeatherForecast
        └── chat (follow-up with tool results)
            └── model: gpt-4o, input_tokens: 680, output_tokens: 350

There is a tradeoff: with both layers active, you may see some span duplication since both the IChatClient wrapper and the MAF agent wrapper emit spans for the same underlying LLM call. If you find the telemetry too noisy, you can disable one layer: Agent layer only (remove .UseOpenTelemetry() from the IChatClient ) — You get agent identity but lose per-call token breakdowns. IChatClient layer only (remove .UseOpenTelemetry() from the agent builder) — You get detailed LLM metrics but lose agent identity in the Agents view. For the fullest experience with the Agents (Preview) view, we recommend keeping both layers active.
The official sample uses both, and the Agents view is designed to handle the overlapping spans gracefully. 📖 Docs: MAF Observability Guide Exporting Telemetry to Application Insights Emitting OpenTelemetry spans is only useful if they land somewhere you can query them. The good news is that Azure App Service and Application Insights have deep native integration — App Service can auto-instrument your app, forward platform logs, and surface health metrics out of the box. For a full overview of monitoring capabilities, see Monitor Azure App Service. For our AI agent scenario, we go beyond the built-in platform telemetry. We need the GenAI semantic convention spans that we configured in the previous sections to flow into App Insights so the Agents (Preview) view can render them. Our travel planner has two host processes — the ASP.NET Core API and a WebJob — and each requires a slightly different exporter setup. ASP.NET Core API — Azure Monitor OpenTelemetry Distro For the API, it's a single line. The Azure Monitor OpenTelemetry Distro handles everything: // Configure OpenTelemetry with Azure Monitor for traces, metrics, and logs. // The APPLICATIONINSIGHTS_CONNECTION_STRING env var is auto-discovered. builder.Services.AddOpenTelemetry().UseAzureMonitor(); That's it. The distro automatically: Discovers the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable Configures trace, metric, and log exporters to Application Insights Sets up appropriate sampling and batching Registers standard ASP.NET Core HTTP instrumentation This is the recommended approach for any ASP.NET Core application. One NuGet package ( Azure.Monitor.OpenTelemetry.AspNetCore ), one line of code, zero configuration files. WebJob — Manual Exporter Setup The WebJob is a non-ASP.NET Core host (it uses Host.CreateApplicationBuilder ), so the distro's convenience method isn't available. 
Instead, we configure the exporters explicitly: // Configure OpenTelemetry with Azure Monitor for the WebJob (non-ASP.NET Core host). // The APPLICATIONINSIGHTS_CONNECTION_STRING env var is auto-discovered. builder.Services.AddOpenTelemetry() .ConfigureResource(r => r.AddService("TravelPlanner.WebJob")) .WithTracing(t => t .AddSource("*") .AddAzureMonitorTraceExporter()) .WithMetrics(m => m .AddMeter("*") .AddAzureMonitorMetricExporter()); builder.Logging.AddOpenTelemetry(o => o.AddAzureMonitorLogExporter()); A few things to note: .AddSource("*") — Subscribes to all trace sources, including the ones emitted by MAF's .UseOpenTelemetry(sourceName: AgentName) . In production, you might narrow this to specific source names for performance. .AddMeter("*") — Similarly captures all metrics, including the GenAI metrics emitted by the instrumentation layers. .ConfigureResource(r => r.AddService("TravelPlanner.WebJob")) — Tags all telemetry with the service name so you can distinguish API vs. WebJob telemetry in Application Insights. The connection string is still auto-discovered from the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable — no need to pass it explicitly. The key difference between these two approaches is ceremony, not capability. Both send the same GenAI spans to Application Insights; the Agents view works identically regardless of which exporter setup you use. 📖 Docs: Azure Monitor OpenTelemetry Distro Infrastructure as Code — Provisioning the Monitoring Stack The monitoring infrastructure is provisioned via Bicep modules alongside the rest of the application's Azure resources. Here's how it fits together. 
Log Analytics Workspace infra/core/monitor/loganalytics.bicep creates the Log Analytics workspace that backs Application Insights: resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2023-09-01' = { name: name location: location tags: tags properties: { sku: { name: 'PerGB2018' } retentionInDays: 30 } } Application Insights infra/core/monitor/appinsights.bicep creates a workspace-based Application Insights resource connected to Log Analytics: resource appInsights 'Microsoft.Insights/components@2020-02-02' = { name: name location: location tags: tags kind: 'web' properties: { Application_Type: 'web' WorkspaceResourceId: logAnalyticsWorkspaceId } } output connectionString string = appInsights.properties.ConnectionString Wiring It All Together In infra/main.bicep , the Application Insights connection string is passed as an app setting to the App Service: appSettings: { APPLICATIONINSIGHTS_CONNECTION_STRING: appInsights.outputs.connectionString // ... other app settings } This is the critical glue: when the app starts, the OpenTelemetry distro (or manual exporters) auto-discover this environment variable and start sending telemetry to your Application Insights resource. No connection strings in code, no configuration files — it's all infrastructure-driven. The same connection string is available to both the API and the WebJob since they run on the same App Service. All agent telemetry from both host processes flows into a single Application Insights resource, giving you a unified view across the entire application. See It in Action Once the application is deployed and processing travel plan requests, here's how to explore the agent telemetry in Application Insights. Step 1: Open the Agents (Preview) View In the Azure portal, navigate to your Application Insights resource. In the left nav, look for Agents (Preview) under the Investigations section. This opens the unified agent monitoring dashboard. 
Step 2: Filter by Agent The agent dropdown at the top of the page is populated by the gen_ai.agent.name values in your telemetry. You'll see all six agents listed: Travel Planning Coordinator Currency Conversion Specialist Weather & Packing Advisor Local Expert & Cultural Guide Itinerary Planning Expert Budget Optimization Specialist Select a specific agent to filter the entire dashboard — token usage, latency, error rate — down to that one agent. Step 3: Review Token Usage The token usage tile shows total input and output token consumption over your selected time range. Compare agents to find your biggest spenders. In our testing, the Coordinator agent consistently uses the most output tokens because it aggregates and synthesizes results from all five specialists. Step 4: Drill into Traces Click "View Traces with Agent Runs" to see all agent executions. Each row represents a workflow run. You can filter by time range, status (success/failure), and specific agent. Step 5: End-to-End Transaction Details Click any trace to open the end-to-end transaction details. The "simple view" renders the agent workflow as a story — showing each step, which agent handled it, how long it took, and what tools were called. For a full travel plan, you'll see the Coordinator dispatch work to each specialist, tool calls to the NWS weather API and Frankfurter currency API, and the final aggregation step. Grafana Dashboards The Agents (Preview) view in Application Insights is great for ad-hoc investigation. For ongoing monitoring and alerting, Azure Managed Grafana provides prebuilt dashboards specifically designed for agent workloads. From the Agents view, click "Explore in Grafana" to jump directly into these dashboards: Agent Framework Dashboard — Per-agent metrics including token usage trends, latency percentiles, error rates, and throughput over time. Pin this to your operations wall. 
Agent Framework Workflow Dashboard — Workflow-level metrics showing how multi-agent orchestrations perform end-to-end. See how long complete travel plans take, identify bottleneck agents, and track success rates. These dashboards query the same underlying data in Log Analytics, so there's zero additional instrumentation needed. If your telemetry lights up the Agents view, it lights up Grafana too. Key Packages Summary Here are the NuGet packages that make this work, pulled from the actual project files:

| Package | Version | Purpose |
| --- | --- | --- |
| Azure.Monitor.OpenTelemetry.AspNetCore | 1.3.0 | Azure Monitor OTEL Distro for ASP.NET Core (API). One-line setup for traces, metrics, and logs. |
| Azure.Monitor.OpenTelemetry.Exporter | 1.3.0 | Azure Monitor OTEL exporter for non-ASP.NET Core hosts (WebJob). Trace, metric, and log exporters. |
| Microsoft.Agents.AI | 1.0.0 | MAF 1.0 GA — ChatClientAgent, .UseOpenTelemetry() for agent-level instrumentation. |
| Microsoft.Extensions.AI | 10.4.1 | IChatClient abstraction with .UseOpenTelemetry() for LLM-level instrumentation. |
| OpenTelemetry.Extensions.Hosting | 1.11.2 | OTEL dependency injection integration for Host.CreateApplicationBuilder (WebJob). |
| Microsoft.Extensions.AI.OpenAI | 10.4.1 | OpenAI/Azure OpenAI adapter for IChatClient. Bridges the Azure OpenAI SDK to the M.E.AI abstraction. |

Wrapping Up Let's zoom out. In this three-part series, so far we've gone from zero to a fully observable, production-grade multi-agent AI application on Azure App Service: Blog 1 covered deploying the multi-agent travel planner with MAF 1.0 GA — the agents, the architecture, the infrastructure. Blog 2 (this post) showed how to instrument those agents with OpenTelemetry, explained the GenAI semantic conventions that make agent-aware monitoring possible, and walked through the new Agents (Preview) view in Application Insights. Blog 3 will show you how to secure those agents for production with the Microsoft Agent Governance Toolkit.
The pattern is straightforward: Add .UseOpenTelemetry() at the IChatClient level for LLM metrics. Add .UseOpenTelemetry(sourceName: AgentName) at the MAF agent level for agent identity. Export to Application Insights via the Azure Monitor distro (one line) or manual exporters. Wire the connection string through Bicep and environment variables. Open the Agents (Preview) view and start monitoring. With MAF 1.0 GA's built-in OpenTelemetry support and Application Insights' new Agents view, you get production-grade observability for AI agents with minimal code. The GenAI semantic conventions ensure your telemetry is structured, portable, and understood by any compliant backend. And because it's all standard OpenTelemetry, you're not locked into any single vendor — swap the exporter and your telemetry goes to Jaeger, Grafana, Datadog, or wherever you need it. Now go see what your agents are up to and check out Blog 3. Resources Sample repository: seligj95/app-service-multi-agent-maf-otel App Insights Agents (Preview) view: Documentation GenAI Semantic Conventions: OpenTelemetry GenAI Registry MAF Observability Guide: Microsoft Agent Framework Observability Azure Monitor OpenTelemetry Distro: Enable OpenTelemetry for .NET Grafana Agent Framework Dashboard: aka.ms/amg/dash/af-agent Grafana Workflow Dashboard: aka.ms/amg/dash/af-workflow Blog 1: Deploy Multi-Agent AI Apps on Azure App Service with MAF 1.0 GA Blog 3: Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit

Azure Functions Ignite 2025 Update
Azure Functions is redefining event-driven applications and high-scale APIs in 2025, accelerating innovation for developers building the next generation of intelligent, resilient, and scalable workloads. This year, our focus has been on empowering AI and agentic scenarios: remote MCP server hosting, bulletproofing agents with Durable Functions, and first-class support for critical technologies like OpenTelemetry, .NET 10 and Aspire. With major advances in serverless Flex Consumption, enhanced performance, security, and deployment fundamentals across Elastic Premium and Flex, Azure Functions is the platform of choice for building modern, enterprise-grade solutions. Remote MCP Model Context Protocol (MCP) has taken the world by storm, offering an agent a mechanism to discover and work deeply with the capabilities and context of tools. When you want to expose MCP/tools to your enterprise or the world securely, we recommend you think deeply about building remote MCP servers that are designed to run securely at scale. Azure Functions is uniquely optimized to run your MCP servers at scale, offering serverless and highly scalable features of Flex Consumption plan, plus two flexible programming model options discussed below. All come together using the hardened Functions service plus new authentication modes for Entra and OAuth using Built-in authentication. Remote MCP Triggers and Bindings Extension GA Back in April, we shared a new extension that allows you to author MCP servers using functions with the MCP tool trigger. That MCP extension is now generally available, with support for C#(.NET), Java, JavaScript (Node.js), Python, and Typescript (Node.js). The MCP tool trigger allows you to focus on what matters most: the logic of the tool you want to expose to agents. Functions will take care of all the protocol and server logistics, with the ability to scale out to support as many sessions as you want to throw at it. 
[Function(nameof(GetSnippet))] public object GetSnippet( [McpToolTrigger(GetSnippetToolName, GetSnippetToolDescription)] ToolInvocationContext context, [BlobInput(BlobPath)] string snippetContent ) { return snippetContent; } New: Self-hosted MCP Server (Preview) If you’ve built servers with official MCP SDKs and want to run them as remote cloud‑scale servers without re‑writing any code, this public preview is for you. You can now self‑host your MCP server on Azure Functions—keep your existing Python, TypeScript, .NET, or Java code and get rapid 0 to N scaling, built-in server authentication and authorization, consumption-based billing, and more from the underlying Azure Functions service. This feature complements the Azure Functions MCP extension for building MCP servers using the Functions programming model (triggers & bindings). Pick the path that fits your scenario—build with the extension or standard MCP SDKs. Either way you benefit from the same scalable, secure, and serverless platform. Use the official MCP SDKs: @mcp.tool() async def get_alerts(state: str) -> str: """Get weather alerts for a US state. Args: state: Two-letter US state code (e.g. CA, NY) """ url = f"{NWS_API_BASE}/alerts/active/area/{state}" data = await make_nws_request(url) if not data or "features" not in data: return "Unable to fetch alerts or no alerts found." if not data["features"]: return "No active alerts for this state."
alerts = [format_alert(feature) for feature in data["features"]] return "\n---\n".join(alerts) Use Azure Functions Flex Consumption Plan's serverless compute using Custom Handlers in host.json: { "version": "2.0", "configurationProfile": "mcp-custom-handler", "customHandler": { "description": { "defaultExecutablePath": "python", "arguments": ["weather.py"] }, "http": { "DefaultAuthorizationLevel": "anonymous" }, "port": "8000" } } Learn more about MCPTrigger and self-hosted MCP servers at https://aka.ms/remote-mcp Built-in MCP server authorization (Preview) The built-in authentication and authorization feature can now be used for MCP server authorization, using a new preview option. You can quickly define identity-based access control for your MCP servers with Microsoft Entra ID or other OpenID Connect providers. Learn more at https://aka.ms/functions-mcp-server-authorization. Better together with Foundry agents Microsoft Foundry is the starting point for building intelligent agents, and Azure Functions is the natural next step for extending those agents with remote MCP tools. Running your tools on Functions gives you clean separation of concerns, reuse across multiple agents, and strong security isolation. And with built-in authorization, Functions enables enterprise-ready authentication patterns, from calling downstream services with the agent’s identity to operating on behalf of end users with their delegated permissions. Build your first remote MCP server and connect it to your Foundry agent at https://aka.ms/foundry-functions-mcp-tutorial. Agents Microsoft Agent Framework 2.0 (Public Preview Refresh) We’re excited about the preview refresh 2.0 release of Microsoft Agent Framework that builds on battle hardened work from Semantic Kernel and AutoGen. Agent Framework is an outstanding solution for building multi-agent orchestrations that are both simple and powerful. 
Azure Functions is a strong fit to host Agent Framework with the service’s extreme scale, serverless billing, and enterprise grade features like VNET networking and built-in auth. Durable Task Extension for Microsoft Agent Framework (Preview) The durable task extension for Microsoft Agent Framework transforms how you build production-ready, resilient and scalable AI agents by bringing the proven durable execution (survives crashes and restarts) and distributed execution (runs across multiple instances) capabilities of Azure Durable Functions directly into the Microsoft Agent Framework. Combined with Azure Functions for hosting and event-driven execution, you can now deploy stateful, resilient AI agents that automatically handle session management, failure recovery, and scaling, freeing you to focus entirely on your agent logic. Key features of the durable task extension include: Serverless Hosting: Deploy agents on Azure Functions with auto-scaling from thousands of instances to zero, while retaining full control in a serverless architecture. 
Automatic Session Management: Agents maintain persistent sessions with full conversation context that survives process crashes, restarts, and distributed execution across instances Deterministic Multi-Agent Orchestrations: Coordinate specialized durable agents with predictable, repeatable, code-driven execution patterns Human-in-the-Loop with Serverless Cost Savings: Pause for human input without consuming compute resources or incurring costs Built-in Observability with Durable Task Scheduler: Deep visibility into agent operations and orchestrations through the Durable Task Scheduler UI dashboard Create a durable agent: endpoint = os.getenv("AZURE_OPENAI_ENDPOINT") deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini") # Create an AI agent following the standard Microsoft Agent Framework pattern agent = AzureOpenAIChatClient( endpoint=endpoint, deployment_name=deployment_name, credential=AzureCliCredential() ).create_agent( instructions="""You are a professional content writer who creates engaging, well-structured documents for any given topic. When given a topic, you will: 1. Research the topic using the web search tool 2. Generate an outline for the document 3. Write a compelling document with proper formatting 4. Include relevant examples and citations""", name="DocumentPublisher", tools=[ AIFunctionFactory.Create(search_web), AIFunctionFactory.Create(generate_outline) ] ) # Configure the function app to host the agent with durable session management app = AgentFunctionApp(agents=[agent]) app.run() Durable Task Scheduler dashboard for agent and agent workflow observability and debugging For more information on the durable task extension for Agent Framework, see the announcement: https://aka.ms/durable-extension-for-af-blog. Flex Consumption Updates As you know, Flex Consumption means serverless without compromise. 
It combines elastic scale and pay‑for‑what‑you‑use pricing with the controls you expect: per‑instance concurrency, longer executions, VNet/private networking, and Always Ready instances to minimize cold starts. Since launching GA at Ignite 2024 last year, Flex Consumption has had tremendous growth with over 1.5 billion function executions per day and nearly 40 thousand apps. Here’s what’s new for Ignite 2025: 512 MB instance size (GA). Right‑size lighter workloads, scale farther within default quota. Availability Zones (GA). Distribute instances across zones. Rolling updates (Public Preview). Unlock zero-downtime deployments of code or config by setting a single configuration. See below for more information. Even more improvements including: new diagnostic settings to route logs/metrics, use Key Vault App Config references, new regions, and Custom Handler support. To get started, review Flex Consumption samples, or dive into the documentation to see how Flex can support your workloads. Migrating to Azure Functions Flex Consumption Migrating to Flex Consumption is simple with our step-by-step guides and agentic tools. Move your Azure Functions apps or AWS Lambda workloads, update your code and configuration, and take advantage of new automation tools. With Linux Consumption retiring, now is the time to switch. For more information, see: Migrate Consumption plan apps to the Flex Consumption plan Migrate AWS Lambda workloads to Azure Functions Durable Functions Durable Functions introduces powerful new features to help you build resilient, production-ready workflows: Distributed Tracing: lets you track requests across components and systems, giving you deep visibility into orchestration and activities with support for App Insights and OpenTelemetry. Extended Sessions support in .NET isolated: improves performance by caching orchestrations in memory, ideal for fast sequential activities and large fan-out/fan-in patterns. 
Orchestration versioning (public preview): enables zero-downtime deployments and backward compatibility, so you can safely roll out changes without disrupting in-flight workflows Durable Task Scheduler Updates Durable Task Scheduler Dedicated SKU (GA): Now generally available, the Dedicated SKU offers advanced orchestration for complex workflows and intelligent apps. It provides predictable pricing for steady workloads, automatic checkpointing, state protection, and advanced monitoring for resilient, reliable execution. Durable Task Scheduler Consumption SKU (Public Preview): The new Consumption SKU brings serverless, pay-as-you-go orchestration to dynamic and variable workloads. It delivers the same orchestration capabilities with flexible billing, making it easy to scale intelligent applications as needed. For more information see: https://aka.ms/dts-ga-blog OpenTelemetry support in GA Azure Functions OpenTelemetry is now generally available, bringing unified, production-ready observability to serverless applications. Developers can now export logs, traces, and metrics using open standards—enabling consistent monitoring and troubleshooting across every workload. Key capabilities include: Unified observability: Standardize logs, traces, and metrics across all your serverless workloads for consistent monitoring and troubleshooting. Vendor-neutral telemetry: Integrate seamlessly with Azure Monitor or any OpenTelemetry-compliant backend, ensuring flexibility and choice. Broad language support: Works with .NET (isolated), Java, JavaScript, Python, PowerShell, and TypeScript. Start using OpenTelemetry in Azure Functions today to unlock standards-based observability for your apps. For step-by-step guidance on enabling OpenTelemetry and configuring exporters for your preferred backend, see the documentation. Deployment with Rolling Updates (Preview) Achieving zero-downtime deployments has never been easier. 
The Flex Consumption plan now offers rolling updates as a site update strategy. Set a single property, and all future code deployments and configuration changes will be released with zero-downtime. Instead of restarting all instances at once, the platform now drains existing instances in batches while scaling out the latest version to match real-time demand. This ensures uninterrupted in-flight executions and resilient throughput across your HTTP, non-HTTP, and Durable workloads – even during intensive scale-out scenarios. Rolling updates are now in public preview. Learn more at https://aka.ms/functions/rolling-updates. Secure Identity and Networking Everywhere By Design Security and trust are paramount. Azure Functions incorporates proven best practices by design, with full support for managed identity—eliminating secrets and simplifying secure authentication and authorization. Flex Consumption and other plans offer enterprise-grade networking features like VNETs, private endpoints, and NAT gateways for deep protection. The Azure Portal streamlines secure function creation, and updated scenarios and samples showcase these identity and networking capabilities in action. Built-in authentication (discussed above) enables inbound client traffic to use identity as well. Check out our updated Functions Scenarios page with quickstarts or our secure samples gallery to see these identity and networking best practices in action. .NET 10 Azure Functions now supports .NET 10, bringing in a great suite of new features and performance benefits for your code. .NET 10 is supported on the isolated worker model, and it’s available for all plan types except Linux Consumption. As a reminder, support ends for the legacy in-process model on November 10, 2026, and the in-process model is not being updated with .NET 10. To stay supported and take advantage of the latest features, migrate to the isolated worker model. 
Aspire Aspire is an opinionated stack that simplifies development of distributed applications in the cloud. The Azure Functions integration for Aspire enables you to develop, debug, and orchestrate an Azure Functions .NET project as part of an Aspire solution. Aspire publish deploys your functions directly to Azure Functions on Azure Container Apps. Aspire 13 includes an updated preview version of the Functions integration that acts as a release candidate with go-live support. The package will be moved to GA quality with Aspire 13.1. Java 25, Node.js 24 Azure Functions now supports Java 25 and Node.js 24 in preview. You can now develop functions using these versions locally and deploy them to Azure Functions plans. Learn how to upgrade your apps to these versions here. In Summary Ready to build what’s next? Update your Azure Functions Core Tools today and explore the latest samples and quickstarts to unlock new capabilities for your scenarios. The guided quickstarts run and deploy in under 5 minutes, and incorporate best practices—from architecture to security to deployment. We’ve made it easier than ever to scaffold, deploy, and scale real-world solutions with confidence. The future of intelligent, scalable, and secure applications starts now—jump in and see what you can create!
Govern AI Agents on App Service with the Microsoft Agent Governance Toolkit
Part 3 of 3 — Multi-Agent AI on Azure App Service In Blog 1, we built a multi-agent travel planner with Microsoft Agent Framework 1.0 on App Service. In Blog 2, we added observability with OpenTelemetry and the new Application Insights Agents view. Now in Part 3, we secure those agents for production with the Microsoft Agent Governance Toolkit. This post assumes you've followed the guidance in Blog 1 to deploy the multi-agent travel planner to Azure App Service. If you haven't deployed the app yet, start there first. The governance gap Our travel planner works. It's observable. But here's the question I'm hearing from customers: "How do I make sure my agents don't do something they shouldn't?" It's a fair question. Our six agents — Coordinator, Currency Converter, Weather Advisor, Local Knowledge, Itinerary Planner, and Budget Optimizer — can call external APIs, process user data, and make autonomous decisions. In a demo, that's impressive. In production, that's a risk surface. Consider what can go wrong with ungoverned agents: Unauthorized API calls — An agent calls an external API it was never intended to use, leaking data or incurring costs Sensitive data exposure — An agent passes PII to a third-party service without consent controls Runaway token spend — A recursive agent loop burns through your OpenAI budget in minutes Tool misuse — A prompt injection tricks an agent into executing a tool it shouldn't Cascading failures — One agent's error propagates through the entire multi-agent workflow These aren't theoretical. In December 2025, OWASP published the Top 10 for Agentic Applications — the first formal taxonomy of risks specific to autonomous AI agents, including goal hijacking, tool misuse, identity abuse, memory poisoning, and rogue agents. Regulators are paying attention too: the EU AI Act's high-risk AI obligations take effect in August 2026, and the Colorado AI Act becomes enforceable in June 2026. 
The bottom line: if you're running agents in production, you need governance. Not eventually — now. What the Agent Governance Toolkit does The Agent Governance Toolkit is an open-source project (MIT license) from Microsoft that brings runtime security governance to autonomous AI agents. It's the first toolkit to address all 10 OWASP agentic AI risks with deterministic, sub-millisecond policy enforcement. The toolkit is organized into 7 packages:

| Package | What it does | Think of it as... |
| --- | --- | --- |
| Agent OS | Stateless policy engine, intercepts every action before execution (<0.1ms p99) | The kernel for AI agents |
| Agent Mesh | Cryptographic identity (DIDs), inter-agent trust protocol, dynamic trust scoring | mTLS for agents |
| Agent Runtime | Execution rings (like CPU privilege levels), saga orchestration, kill switch | Process isolation for agents |
| Agent SRE | SLOs, error budgets, circuit breakers, chaos engineering | SRE practices for agents |
| Agent Compliance | Automated governance verification, regulatory mapping (EU AI Act, HIPAA, SOC2) | Compliance-as-code |
| Agent Marketplace | Plugin lifecycle management, Ed25519 signing, supply-chain security | Package manager security |
| Agent Lightning | RL training governance with policy-enforced runners | Safe training guardrails |

The toolkit is available in Python, TypeScript, Rust, Go, and .NET. It's framework-agnostic — it works with MAF, LangChain, CrewAI, Google ADK, and more. For our ASP.NET Core travel planner, we'll use the .NET SDK via NuGet (Microsoft.AgentGovernance). For this blog, we're focusing on three packages:
- Agent OS — the policy engine that intercepts and evaluates every agent action
- Agent Compliance — regulatory mapping and audit trail generation
- Agent SRE — SLOs and circuit breakers for agent reliability

How easy it was to add governance Here's the part that surprised me. I expected adding governance to a production agent system to be a multi-hour effort — new infrastructure, complex configuration, extensive refactoring. 
Instead, it took about 30 minutes. Here's exactly what we changed: Step 1: Add NuGet packages One new package reference added to TravelPlanner.Shared.csproj alongside the existing ones: <ItemGroup> <!-- Existing packages --> <PackageReference Include="Azure.Monitor.OpenTelemetry.AspNetCore" Version="1.3.0" /> <PackageReference Include="Microsoft.Agents.AI" Version="1.0.0" /> <!-- NEW: Agent Governance Toolkit (single package, all features included) --> <PackageReference Include="Microsoft.AgentGovernance" Version="3.0.2" /> </ItemGroup> Step 2: Create the policy file One new file: governance-policies.yaml in the project root. This is where all your governance rules live: apiVersion: governance.toolkit/v1 name: travel-planner-governance description: Policy enforcement for the multi-agent travel planner on App Service scope: global defaultAction: deny rules: - name: allow-currency-conversion condition: "tool == 'ConvertCurrency'" action: allow priority: 10 description: Allow Currency Converter agent to call Frankfurter exchange rate API - name: allow-weather-forecast condition: "tool == 'GetWeatherForecast'" action: allow priority: 10 description: Allow Weather Advisor agent to call NWS forecast API - name: allow-weather-alerts condition: "tool == 'GetWeatherAlerts'" action: allow priority: 10 description: Allow Weather Advisor agent to check NWS weather alerts Step 3: One line in BaseAgent.cs This is the moment. Here's our BaseAgent.cs before: Agent = new ChatClientAgent( chatClient, instructions: Instructions, name: AgentName, description: Description) .AsBuilder() .UseOpenTelemetry(sourceName: AgentName) .Build(); And after: var kernel = serviceProvider.GetService<GovernanceKernel>(); if (kernel is not null) builder.UseGovernance(kernel, AgentName); Agent = builder.Build(); One line of intent, two lines of null-safety. 
The .UseGovernance(kernel, AgentName) call intercepts every tool/function invocation in the agent's pipeline, evaluating it against the loaded policies before execution. If the GovernanceKernel isn't registered (governance disabled), agents work exactly as before — no crash, no code change needed. Here's the full updated constructor using IServiceProvider to optionally resolve governance: using AgentGovernance; using Microsoft.Extensions.DependencyInjection; public abstract class BaseAgent : IAgent { protected readonly ILogger Logger; protected readonly AgentOptions Options; protected readonly AIAgent Agent; // Constructor for simple agents without tools protected BaseAgent( ILogger logger, IOptions<AgentOptions> options, IChatClient chatClient, IServiceProvider serviceProvider) { Logger = logger; Options = options.Value; var builder = new ChatClientAgent( chatClient, instructions: Instructions, name: AgentName, description: Description) .AsBuilder() .UseOpenTelemetry(sourceName: AgentName); var kernel = serviceProvider.GetService<GovernanceKernel>(); if (kernel is not null) builder.UseGovernance(kernel, AgentName); Agent = builder.Build(); } // Constructor for agents with tools protected BaseAgent( ILogger logger, IOptions<AgentOptions> options, IChatClient chatClient, ChatOptions chatOptions, IServiceProvider serviceProvider) { Logger = logger; Options = options.Value; var builder = new ChatClientAgent( chatClient, instructions: Instructions, name: AgentName, description: Description, tools: chatOptions.Tools?.ToList()) .AsBuilder() .UseOpenTelemetry(sourceName: AgentName); var kernel = serviceProvider.GetService<GovernanceKernel>(); if (kernel is not null) builder.UseGovernance(kernel, AgentName); Agent = builder.Build(); } // ... rest unchanged } Step 4: DI registrations in Program.cs A few lines to wire up governance in the dependency injection container: using AgentGovernance; // ... existing builder setup ... 
// Configure OpenTelemetry with Azure Monitor (existing — from Blog 2) builder.Services.AddOpenTelemetry().UseAzureMonitor(); // NEW: Configure Agent Governance Toolkit // Load policy from YAML, register as singleton. Agents resolve via IServiceProvider. var policyPath = Path.Combine(builder.Environment.ContentRootPath, "governance-policies.yaml"); if (File.Exists(policyPath)) { try { var yaml = File.ReadAllText(policyPath); var kernel = new GovernanceKernel(new GovernanceOptions { EnableAudit = true, EnableMetrics = true }); kernel.LoadPolicyFromYaml(yaml); builder.Services.AddSingleton(kernel); Console.WriteLine($"[Governance] Loaded policies from {policyPath}"); } catch (Exception ex) { Console.WriteLine($"[Governance] Failed to load: {ex.Message}. Running without governance."); } } That's it. Your agents are now governed. Let me repeat that because it's the core message of this blog: we added production governance to a six-agent system by adding one NuGet package, creating one YAML policy file, adding a few lines to our base agent class, and registering the governance kernel in DI. No new infrastructure. No complex rewiring. No multi-sprint project. If you followed Blog 1 and Blog 2, you can do this in 30 minutes. Policy flexibility deep-dive The YAML policy language is intentionally simple to start with, but it supports real complexity when you need it. Let's walk through what each policy in our file does. API allowlists and blocklists Our travel planner calls two external APIs: Frankfurter (currency exchange) and the National Weather Service. The defaultAction: deny combined with explicit allow rules ensures agents can only call these approved tools. 
If an agent attempts to call any other function — whether through a prompt injection or a bug — the call is blocked before it executes: defaultAction: deny rules: - name: allow-currency-conversion condition: "tool == 'ConvertCurrency'" action: allow priority: 10 - name: allow-weather-forecast condition: "tool == 'GetWeatherForecast'" action: allow priority: 10 When a blocked call happens, you'll see output like this in your logs: [Governance] Tool call 'DeleteDatabase' blocked for agent 'LocalKnowledgeAgent': No matching rules; default action is deny. Condition language The condition field supports equality checks, pattern matching, and boolean logic. You can match on tool name, agent ID, or any key in the evaluation context: # Match a specific tool condition: "tool == 'ConvertCurrency'" # Match multiple tools with OR condition: "tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'" # Match by agent condition: "agent == 'CurrencyConverterAgent' and tool == 'ConvertCurrency'" Priority and conflict resolution When multiple rules match, the toolkit evaluates by priority (higher number = higher priority). A deny rule at priority 100 will override an allow rule at priority 10. This lets you layer broad allows with specific denies: rules: - name: allow-all-weather-tools condition: "tool == 'GetWeatherForecast' or tool == 'GetWeatherAlerts'" action: allow priority: 10 - name: block-during-maintenance condition: "tool == 'GetWeatherForecast'" action: deny priority: 100 description: Temporarily block NWS calls during API maintenance Advanced: OPA Rego and Cedar The YAML policy language handles most scenarios, but for teams with advanced needs, the toolkit also supports OPA Rego and Cedar policy languages. 
You can mix them — use YAML for simple rules and Rego for complex conditional logic: # policies/advanced.rego — Example: time-based access control package travel_planner.governance default allow_tool_call = false allow_tool_call { input.agent == "CurrencyConverterAgent" input.tool == "get_exchange_rate" time.weekday(time.now_ns()) != "Sunday" # Markets closed } Start simple with YAML. Add complexity only when you need it. Why App Service for governed agent workloads You might be wondering: why does hosting platform matter for governance? It matters a lot. The governance toolkit handles the application-level policies, but a production agent system also needs platform-level security, networking, identity, and deployment controls. App Service gives you these out of the box. Managed Identity Governance policies enforce what agents can access. Managed Identity handles how they authenticate — without secrets to manage, rotate, or leak. Our travel planner already uses DefaultAzureCredential for Azure OpenAI, Cosmos DB, and Service Bus. Governance layers on top of this identity foundation. VNet Integration + Private Endpoints The governance toolkit enforces API allowlists at the application level. App Service's VNet integration and private endpoints enforce network boundaries at the infrastructure level. This is defense in depth: even if a governance policy is misconfigured, the network layer prevents unauthorized egress. Your agents can only reach the networks you've explicitly allowed. Easy Auth App Service's built-in authentication (Easy Auth) protects your agent APIs without custom code. Before a request even reaches your governance engine, App Service has already validated the caller's identity. No custom auth middleware. No JWT parsing. Just toggle it on. Deployment Slots This is underrated for governance. With deployment slots, you can test new governance policies in a staging slot before swapping to production. 
Deploy updated governance-policies.yaml to staging, run your test suite, verify the policies work as expected, and then swap. Zero-downtime policy updates with full rollback capability. App Insights integration Governance audit events flow into the same Application Insights instance we configured in Blog 2. This means your governance decisions appear alongside your OTel traces in the Agents view. One pane of glass for agent behavior and governance enforcement. Always-on + WebJobs Our travel planner uses WebJobs for long-running agent workflows. With App Service's Always-on feature, those workflows stay warm, and governance is continuous — no cold-start gaps where agents run unmonitored. azd deployment One command deploys the full governed stack — application code, governance policies, infrastructure, and monitoring: azd up App Service gives you the enterprise production features governance needs — identity, networking, observability, safe deployment — out of the box. The governance toolkit handles agent-level policy enforcement; App Service handles platform-level security. Together, they're a complete governed agent platform. Governance audit events in App Insights In Blog 2, we set up OpenTelemetry and the Application Insights Agents view to monitor agent behavior. With the governance toolkit, those same traces now include governance audit events — every policy decision is recorded as a span attribute on the agent's trace. When you open a trace in the Agents view, you'll see governance events inline: Policy: api-allowlist → ALLOWED — CurrencyConverterAgent called Frankfurter API, permitted Policy: token-budget → ALLOWED — Request used 3,200 tokens, within per-request limit of 8,000 Policy: rate-limit → THROTTLED — WeatherAdvisorAgent exceeded 60 calls/min, request delayed For deeper analysis, use KQL to query governance events directly. 
Here's a query that finds all policy violations in the last 24 hours: // Find all governance policy violations in the last 24 hours traces | where timestamp > ago(24h) | where customDimensions["governance.decision"] != "ALLOWED" | extend agentName = tostring(customDimensions["agent.name"]), policyName = tostring(customDimensions["governance.policy"]), decision = tostring(customDimensions["governance.decision"]), violationReason = tostring(customDimensions["governance.reason"]), targetUrl = tostring(customDimensions["tool.target_url"]) | project timestamp, agentName, policyName, decision, violationReason, targetUrl | order by timestamp desc And here's one for tracking token budget consumption across agents: // Token budget consumption by agent over the last hour customMetrics | where timestamp > ago(1h) | where name == "governance.tokens.consumed" | extend agentName = tostring(customDimensions["agent.name"]) | summarize totalTokens = sum(value), avgTokensPerRequest = avg(value), maxTokensPerRequest = max(value) by agentName, bin(timestamp, 5m) | order by totalTokens desc This is the power of integrating governance with your existing observability stack. You don't need a separate governance dashboard — everything lives in the same App Insights workspace you already know. SRE for agents The Agent SRE package brings Site Reliability Engineering practices to agent systems. This was the part that got me most excited, because it addresses a question I hear constantly: "How do I know my agents are actually reliable?" Service Level Objectives (SLOs) We defined SLOs in our policy file: slos: - name: weather-agent-latency agent: "WeatherAdvisorAgent" metric: latency-p99 target: 5000ms window: 5m This says: "The Weather Advisor Agent must respond within 5 seconds at the 99th percentile, measured over a 5-minute rolling window." When the SLO is breached, the toolkit emits an alert event and can trigger automated responses. 
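The rolling-window check the YAML above describes (p99 latency against a 5000 ms target over a 5-minute window) can be sketched in a few lines of Python. This is an illustrative sketch of the windowing math only, not the toolkit's implementation; the SloWindow class and its methods are invented for the example:

```python
import time

class SloWindow:
    """Rolling-window latency SLO check mirroring the YAML above:
    metric: latency-p99, target: 5000 ms, window: 5 minutes."""

    def __init__(self, target_ms=5000.0, window_s=300.0):
        self.target_ms = target_ms
        self.window_s = window_s
        self.samples = []  # list of (timestamp_s, latency_ms)

    def record(self, latency_ms, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_ms))
        # Keep only samples inside the rolling window
        cutoff = now - self.window_s
        self.samples = [(t, v) for t, v in self.samples if t >= cutoff]

    def p99(self):
        values = sorted(v for _, v in self.samples)
        if not values:
            return 0.0
        # Nearest-rank style p99: index at 99% of the sample count
        idx = min(len(values) - 1, int(0.99 * len(values)))
        return values[idx]

    def breached(self):
        return self.p99() > self.target_ms

slo = SloWindow()
for i in range(100):
    slo.record(200.0, now=float(i))  # 100 fast responses
print(slo.breached())                # False: p99 is 200 ms
slo.record(9000.0, now=100.0)        # a slow tail appears
slo.record(9500.0, now=101.0)
print(slo.breached())                # True: p99 now exceeds 5000 ms
```

The point of the sketch is the shape of the contract: samples age out of the window, the percentile is recomputed continuously, and a breach is a boolean you can wire to an alert or an automated response.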
Circuit breakers Circuit breakers prevent cascading failures. If an agent fails 5 times in a row, the circuit opens, and subsequent requests get a fast failure response instead of waiting for another timeout: circuit-breakers: - agent: "*" failure-threshold: 5 recovery-timeout: 30s half-open-max-calls: 2 After 30 seconds, the circuit enters a half-open state, allowing 2 test calls through. If those succeed, the circuit closes and normal operation resumes. If they fail, the circuit opens again. This pattern is battle-tested in microservices — now it protects your agents too. Error budgets Error budgets tie SLOs to business decisions. If your Coordinator Agent's success rate target is 99.5% over a 15-minute window, that means you have an error budget of 0.5%. When the budget is consumed, the toolkit can automatically reduce agent autonomy — for example, requiring human approval for high-risk actions until the error budget recovers. SRE practices turn agent reliability from a hope into a measurable, enforceable contract. Architecture Here's how everything fits together after adding governance: ┌─────────────────────────────────────────────────────────────────┐ │ Azure App Service │ │ ┌──────────────┐ ┌─────────────────────────────────────┐ │ │ │ Frontend │───▶│ ASP.NET Core API │ │ │ │ (Static) │ │ │ │ │ └──────────────┘ │ ┌─────────────────────────────┐ │ │ │ │ │ Coordinator Agent │ │ │ │ │ │ ┌───────┐ ┌────────────┐ │ │ │ │ │ │ │ OTel │─▶│ Governance │ │ │ │ │ │ │ └───────┘ │ Engine │ │ │ │ │ │ │ │ ┌────────┐ │ │ │ │ │ │ │ │ │Policies│ │ │ │ │ │ │ │ │ └────────┘ │ │ │ │ │ │ │ └─────┬──────┘ │ │ │ │ │ └───────────────────┼─────────┘ │ │ │ │ ┌───────────────────┼──────────┐ │ │ │ │ │ Specialist Agents │ │ │ │ │ │ │ (Currency, Weather, etc.) 
│ │ │ │ │ │ Each with OTel + Governance │ │ │ │ │ └───────────────────┼──────────┘ │ │ │ └──────────────────────┼──────────────┘ │ │ │ │ │ ┌────────────┐ ┌───────────┐ ┌───────────┼─────────┐ │ │ │ Managed │ │ VNet │ │ App Insights │ │ │ │ Identity │ │Integration│ │ (Traces + │ │ │ │ (no keys) │ │(network │ │ Governance Audit) │ │ │ │ │ │ boundary) │ │ │ │ │ └────────────┘ └───────────┘ └─────────────────────┘ │ └──────────────────────────────┬──────────────────────────────────┘ │ Only allowed APIs ▼ ┌──────────────────────┐ │ External APIs │ │ ✅ Frankfurter API │ │ ✅ NWS Weather API │ │ ❌ Everything else │ └──────────────────────┘ The key insight: governance is a transparent layer in the agent pipeline. It sits between the agent's decision and the action's execution. The agent code doesn't know or care about governance — it just builds the agent with .UseGovernance() and the policy engine handles the rest. Bring it to your own agents We've shown governance with Microsoft Agent Framework on .NET, but the toolkit is framework-agnostic. 
Here's how to add it to other popular frameworks: LangChain (Python) from agent_governance import PolicyEngine, GovernanceCallbackHandler policy_engine = PolicyEngine.from_yaml("governance-policies.yaml") # Add governance as a LangChain callback handler agent = create_react_agent( llm=llm, tools=tools, callbacks=[GovernanceCallbackHandler(policy_engine)] ) CrewAI (Python) from agent_governance import PolicyEngine from agent_governance.integrations.crewai import GovernanceTaskDecorator policy_engine = PolicyEngine.from_yaml("governance-policies.yaml") # Add governance as a CrewAI task decorator @GovernanceTaskDecorator(policy_engine) def research_task(agent, context): return agent.execute(context) Google ADK (Python) from agent_governance import PolicyEngine from agent_governance.integrations.google_adk import GovernancePlugin policy_engine = PolicyEngine.from_yaml("governance-policies.yaml") # Add governance as a Google ADK plugin agent = Agent( model="gemini-2.0-flash", tools=[...], plugins=[GovernancePlugin(policy_engine)] ) TypeScript / Node.js import { PolicyEngine } from '@microsoft/agentmesh-sdk'; const policyEngine = PolicyEngine.fromYaml('governance-policies.yaml'); // Use as middleware in your agent pipeline agent.use(policyEngine.middleware()); Every integration hooks into the framework's native extension points — callbacks, decorators, plugins, middleware — so adding governance doesn't require rewriting your agent code. Install the package, point it at your policy file, and you're governed. 
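Whatever the framework, the evaluation semantics described earlier stay the same: default-deny, with the highest-priority matching rule winning. A minimal Python sketch of that loop (the rule shape mirrors the YAML shown earlier; the lambda-based matcher is a stand-in for the toolkit's condition language, not its real API):

```python
def evaluate(rules, context, default_action="deny"):
    """Return the action of the highest-priority matching rule,
    or the default action (deny) when nothing matches."""
    matched = [r for r in rules if r["condition"](context)]
    if not matched:
        return default_action
    # Higher priority wins: a priority-100 deny overrides a priority-10 allow
    return max(matched, key=lambda r: r["priority"])["action"]

rules = [
    {"name": "allow-all-weather-tools",
     "condition": lambda c: c["tool"] in ("GetWeatherForecast", "GetWeatherAlerts"),
     "action": "allow", "priority": 10},
    {"name": "block-during-maintenance",
     "condition": lambda c: c["tool"] == "GetWeatherForecast",
     "action": "deny", "priority": 100},
]

print(evaluate(rules, {"tool": "GetWeatherAlerts"}))    # allow
print(evaluate(rules, {"tool": "GetWeatherForecast"}))  # deny (maintenance block wins)
print(evaluate(rules, {"tool": "DeleteDatabase"}))      # deny (default action)
```

Three branches, three behaviors: an explicit allow, a higher-priority deny layered over it, and the default-deny catch-all for anything unlisted.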
What's next This wraps up our three-part series on building production-ready multi-agent AI applications on Azure App Service: Blog 1: Build — Deploy a multi-agent travel planner with Microsoft Agent Framework 1.0 Blog 2: Monitor — Add observability with OpenTelemetry and the Application Insights Agents view Blog 3: Govern — Secure agents for production with the Agent Governance Toolkit (you are here) The progression is intentional: first make it work, then make it visible, then make it safe. And the consistent theme across all three parts is that App Service makes each step easier — managed hosting for Blog 1, integrated monitoring for Blog 2, and platform-level security features for Blog 3. Next steps for your agents Explore the Agent Governance Toolkit — star the repo, browse the 20 tutorials, try the demo Customize policies for your compliance needs — start with our YAML template and adapt it to your domain. Healthcare teams: enable HIPAA mappings. Finance teams: add SOC2 controls. Explore Agent Mesh for multi-agent trust — if you have agents communicating across services or trust boundaries, Agent Mesh's cryptographic identity and trust scoring add another layer of defense Deploy the sample — clone our travel planner repo, run azd up , and see governed agents in action AI agents are becoming autonomous decision-makers in high-stakes domains. The question isn't whether we need governance — it's whether we build it proactively, before incidents force our hand. With the Agent Governance Toolkit and Azure App Service, you can add production governance to your agents today. In about 30 minutes.
PHP 8.5 is now available on Azure App Service for Linux
PHP 8.5 is now available on Azure App Service for Linux

PHP 8.5 is now available on Azure App Service for Linux across all public regions. You can create a new PHP 8.5 app through the Azure portal, automate it with the Azure CLI, or deploy using ARM/Bicep templates.

PHP 8.5 brings several useful runtime improvements. It includes better diagnostics: fatal errors now provide a backtrace, which can make troubleshooting easier. It also adds the pipe operator (|>) for cleaner, more readable code, along with broader improvements in syntax, performance, and type safety. You can take advantage of these improvements while continuing to use the deployment and management experience you already know in App Service.

For the full list of features, deprecations, and migration notes, see the official PHP 8.5 release page: https://www.php.net/releases/8.5/en.php

Getting started

- Create a PHP web app in Azure App Service
- Configure a PHP app for Azure App Service
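If you want to script the Azure CLI path mentioned above, a minimal sketch looks like the following. The resource names and region are placeholders, and the exact runtime string can vary by CLI version — `az webapp list-runtimes` shows what your installed CLI accepts:

```shell
# List the PHP runtimes the CLI knows about on Linux
az webapp list-runtimes --os-type linux | grep -i php

# Create a resource group, a Linux App Service plan, and a PHP 8.5 web app
# (names and region are placeholders; adjust to your environment)
az group create --name my-php-rg --location eastus
az appservice plan create --name my-php-plan --resource-group my-php-rg \
  --is-linux --sku B1
az webapp create --name my-php-app --resource-group my-php-rg \
  --plan my-php-plan --runtime "PHP:8.5"
```

These commands require an authenticated `az login` session and a subscription with quota for the chosen SKU.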
A simpler way to deploy your code to Azure App Service for Linux

We’ve added a new deployment experience for Azure App Service for Linux that makes it easier to get your code running on your web app.

To get started, go to the Kudu/SCM site for your app: <sitename>.scm.azurewebsites.net. From there, open the new Deployments experience. You can now deploy your app by simply dragging and dropping a zip file containing your code.

Once your file is uploaded, App Service shows you the contents of the zip so you can quickly verify what you’re about to deploy. If your application is already built and ready to run, you also have the option to skip the server-side build. Otherwise, App Service can handle the build step for you.

When you’re ready, select Deploy. The deployment starts right away, and you can follow each phase of the process as it happens. The experience shows clear progress through upload, build, and deployment, along with deployment logs to help you understand what’s happening behind the scenes. After the deployment succeeds, you can also view runtime logs, which makes it easier to confirm that your app has started successfully.

This experience is ideal if you’re getting started with Azure App Service and want the quickest path from code to a running app. For production workloads and teams with established release processes, you’ll typically continue using an automated CI/CD pipeline (for example, GitHub Actions or Azure DevOps) for repeatable deployments.

We’re continuing to improve the developer experience on App Service for Linux. Give it a try and let us know what you think.
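The same zip-based flow can also be driven from the command line when you don't want to use the browser UI. A hedged sketch using the Azure CLI follows — the resource group and app names are placeholders, and you can check `az webapp deploy --help` for the flags your CLI version supports:

```shell
# Package the app (assumes you're in the project root)
zip -r app.zip .

# Optional: ask App Service to run the server-side (Oryx) build on deploy,
# mirroring the "build for me" option in the portal experience
az webapp config appsettings set --resource-group my-rg --name my-linux-app \
  --settings SCM_DO_BUILD_DURING_DEPLOYMENT=true

# Push the zip to App Service; --type zip triggers a zip deployment
az webapp deploy --resource-group my-rg --name my-linux-app \
  --src-path app.zip --type zip
```

As with the portal flow, skip the app setting if your application is already built and ready to run.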
Deploying to Azure Web App from Azure DevOps Using UAMI

TOC

- UAMI Configuration
- App Configuration
- Azure DevOps Configuration
- Logs

UAMI Configuration

Create a User Assigned Managed Identity (UAMI) with no additional configuration. Its Object ID will be referenced in later steps, particularly in the Logs section.

App Configuration

On an existing Azure Web App, enable Diagnostic Settings and configure it to retain certain types of logs, such as Access Audit Logs. These logs are discussed in the final section of this article.

Next, navigate to Access Control (IAM) and assign the previously created User Assigned Managed Identity the Website Contributor role.

Azure DevOps Configuration

Go to Azure DevOps → Project Settings → Service Connections, and create a new ARM (Azure Resource Manager) connection. While creating the connection:

- Select the corresponding User Assigned Managed Identity
- Grant it appropriate permissions at the Resource Group level

During this process, you will be prompted to sign in again using your own account. This authentication will later be reflected in the deployment logs discussed below.

Assuming the following deployment template is used in the pipeline, you will notice that additional steps appear in the deployment process compared to traditional service principal–based authentication.

Logs

A few minutes after deployment, related log records will appear. In the AppServiceAuditLogs table, you can observe that the deployment initiator is shown as the Object ID of the UAMI, and the Source is listed as Azure (DevOps). This indicates that the User Assigned Managed Identity is authorized under my user context, while the deployment action itself is initiated by Azure DevOps.
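The deployment template itself isn't reproduced in this post, so here is a minimal sketch of what such a pipeline might look like. The service connection name, app name, and package path are illustrative assumptions — substitute the ARM service connection you created above:

```yaml
# Hypothetical azure-pipelines.yml sketch; names are placeholders
trigger:
  - main

pool:
  vmImage: ubuntu-latest

steps:
  # Deploy using the ARM service connection backed by the UAMI
  - task: AzureWebApp@1
    inputs:
      azureSubscription: 'uami-arm-connection'  # the service connection created above
      appType: 'webAppLinux'
      appName: 'my-web-app'
      package: '$(System.DefaultWorkingDirectory)/app.zip'
```

Because the service connection authenticates with the UAMI rather than a service principal secret, there are no credentials to rotate, and the identity's Object ID is what surfaces in the AppServiceAuditLogs records described above.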