microservices
112 TopicsAutonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes
When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m. Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc kubectl commands. This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams. Core concepts Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow: Incident platform. Where incidents originate. In this demo, that is Azure Monitor. Built-in Azure capabilities. The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors. Connectors. Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers. Permission levels. Reader for investigation and read oriented access, privileged for operational changes when allowed. Run modes. Review for approval-gated execution and Autonomous for direct execution. The most important production controls are permission level and run mode, not prompt quality. Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability. The safest production rollout path: Start: Reader + Review Then: Privileged + Review Finally: Privileged + Autonomous. Only for narrow, trusted incident paths. Demo environment The full scripts and manifests are available if you want to reproduce this: Demo repository: github.com/hailugebru/azure-sre-agents-aks. The README includes setup and configuration details. The environment uses an AKS cluster with node auto-provisioning (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the AKS Store sample microservices application, and Azure SRE Agent configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal. It provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself. Azure Monitor → Action Group → Azure SRE Agent → AKS Cluster (Alert) (Webhook) (Investigate / Fix) (Recover) ↓ Teams notification + GitHub issue → GitHub Agent → PR for review How the agent was configured Configuration came down to four things: scope, permissions, incident intake, and response mode. I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take. I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction: permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly. For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates. I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow up after remediation. For the complete setup, see the README. A note on context. The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts. Two incidents, two response modes These incidents occurred on the same cluster in one session and illustrate two realistic operating modes: Alert triggered automation. The agent acts when Azure Monitor fires. Ad hoc chat investigation. An engineer sees a symptom first and asks the agent to investigate. Both matter in real environments. The first is your scale path. The second is your operator assist path. Incident 1. CPU starvation (alert driven, ~8 min MTTR) The makeline-service deployment manifest contained a CPU and memory configuration that was not viable for startup: resources: requests: cpu: 1m memory: 6Mi limits: cpu: 5m memory: 20Mi Within five minutes, Azure Monitor fired the pod-not-healthy Sev1 alert. The agent picked it up immediately. Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code: "Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario." That is the kind of distinction that often takes an on call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure. The agent then: Identified three additional CPU-throttled pods at 112 to 200% of configured limit using kubectl top. Patched four workloads: makeline-service, virtual-customer, virtual-worker, and mongodb. Verified that all affected pods returned to healthy running state with 0 restarts cluster wide. Azure SRE Agent's Incident History blade confirming full cluster recovery: 4 patches applied, 0 unhealthy pods — no human intervention required. Outcome. Full cluster recovery in ~8 minutes, 0 human interventions. Incident 2. OOMKilled (chat driven, ~4 min MTTR) For the second case, I deployed a deliberately undersized version of order-service: kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. CrashLoopBackOff is a container waiting reason, not a pod phase, so production coverage should come from Prometheus based crash-loop signals rather than pod phase alone. Here is the PromQL query I use in Azure Monitor to catch this class of failure: sum by (namespace, pod) ( ( max_over_time( kube_pod_container_status_waiting_reason{ namespace="pets", reason="CrashLoopBackOff" }[5m] ) == 1 ) and on (namespace, pod, container) ( increase( kube_pod_container_status_restarts_total{ namespace="pets" }[15m] ) > 0 ) ) > 0 This query fires when a container has been in CrashLoopBackOff within the last 5 minutes and its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces. The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it. The agent's reasoning: "Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload." The agent increased the memory limit (20Mi to 128Mi) and request (10Mi to 50Mi), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts. Outcome. Service recovered in ~4 minutes without any manual cluster interaction. Side by side comparison Dimension Incident 1: CPU starvation Incident 2: OOMKilled Trigger Azure Monitor alert (automated) Engineer chat prompt (ad hoc) Failure mode CPU too low for startup probe to pass Memory limit too low for process to start Key signal Exit code 1, probe timeout Exit code 137, empty container logs Blast radius 4 workloads affected cluster wide 1 workload in target namespace Remediation CPU request/limit patches across 4 deployments Memory request/limit patch on 1 deployment MTTR ~8 min ~4 min Human interventions 0 0 Why this matters Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc kubectl commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow. The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps. In this lab, the impact was measurable: Metric This demo with Azure SRE Agent Alert to recovery ~4 to 8 min Human interventions 0 Scope of investigation Cluster wide, automated Correlate evidence and diagnose ~2 min Apply fix and verify ~4 min Post incident follow-up GitHub issue + draft PR These results came from a controlled run on April 10, 2026. Real world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments. Teams + GitHub follow-up Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post incident path matters. After Incident 1 resolved, Azure SRE Agent used the GitHub connector to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix. The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review still remains the final control point before merge. Setup details for the GitHub connector are in the demo repo README, and the official reference is in the Azure SRE Agent docs. Azure SRE Agent fixes the live issue, and the GitHub follow-up prepares the durable source change so future deployments do not reintroduce the same configuration problem. The operations to engineering handoff: Azure SRE Agent fixed the live cluster; GitHub Copilot agent prepares the durable source change so the same misconfiguration can't ship again. In parallel, the Teams connector posted milestone updates during the incident: Investigation started. Root cause and remediation identified. Incident resolved. Teams handled real time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery. Key takeaways Three things to carry forward Treat Azure SRE Agent as a governed incident response system, not a chatbot with infrastructure access. The most important controls are permission levels and run modes, not prompt quality. Anchor detection in your existing incident platforms. For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live. Use connectors to extend the workflow outward. Teams for real time coordination, GitHub for durable engineering follow-up. Start where you're comfortable. If you are just getting your feet wet, begin with one resource group, one incident type, and Review mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling Autonomous. Expand only once each layer is trusted. Next steps Add Prometheus based alert coverage for ImagePullBackOff and node resource pressure to complement the pod phase rule. Expand to multi cluster managed scopes once the single cluster path is trusted and validated. Explore how NAP and Azure SRE Agent complement each other — NAP manages infrastructure capacity, while the agent investigates and remediates incidents. I'd like to thank Cary Chai, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review — his feedback sharpened both the accuracy and quality of this post.644Views0likes0CommentsGive your Foundry Agent Custom Tools with MCP Servers on Azure Functions
This blog post is for developers who have an MCP server deployed to Azure Functions and want to connect it to Microsoft Foundry agents. It walks through why you'd want to do this, the different authentication options available, and how to get your agent calling your MCP tools. Connect your MCP server on Azure Functions to Foundry Agent If you've been following along with this blog series, you know that Azure Functions is a great place to host remote MCP servers. You get scalable infrastructure, built-in auth, and serverless billing. All the good stuff. But hosting an MCP server is only half the picture. The real value comes when something actually uses those tools. Microsoft Foundry lets you build AI agents that can reason, plan, and take actions. By connecting your MCP server to an agent, you're giving it access to your custom tools, whether that's querying a database, calling an API, or running some business logic. The agent discovers your tools, decides when to call them, and uses the results to respond to the user. Why connect MCP servers to Foundry agents? You might already have an MCP server that works great with VS Code, VS, Cursor, or other MCP clients. Connecting that same server to a Foundry agent means you can reuse those tools in a completely different context, i.e. in an enterprise AI agent that your team or customers interact with. No need to rebuild anything. Your MCP server stays the same; you're just adding another consumer. Prerequisites Before proceeding, make sure you have the following: 1. An MCP server deployed to Azure Functions. If you don't have one yet, you can deploy one quickly by following one of the samples: Python TypeScript .NET 2. A Foundry project with a deployed model and a Foundry agent Authentication options Depending on where you are in development, you can pick what makes sense and upgrade later. Here's a summary: Method Description When to use Key-based (default) Agent authenticates by passing a shared function access key in the request header. This method is the default authentication for HTTP endpoints in Functions. Development, or when Entra auth isn't required. Microsoft Entra Agent authenticates using either its own identity (agent identity) or the shared identity of the Foundry project (project managed identity). Use agent identity for production scenarios, but limit shared identity to development. OAuth identity passthrough Agent prompts users to sign in and authorize access, using the provided token to authenticate. Production, when each user must authenticate individually. Unauthenticated Agent makes unauthenticated calls. Development only, or tools that access only public information. Connect your MCP server to your Foundry agent If your server uses key-based auth or is unauthenticated, it should be relatively straightforward to set up the connection from a Foundry agent. The Microsoft Entra and OAuth identity passthrough are options that require extra steps to set up. Check out detailed step-by-step instructions for each authentication method. At a high level, the process looks like this: Enable built-in MCP authentication : When you deploy a server to Azure Functions, key-based auth is the default. You'll need to disable that and enable built-in MCP auth instead. If you deployed one of the sample servers in the Prerequisite section, this step is already done for you. Get your MCP server endpoint URL: For MCP extension-based servers, it's https://<FUNCTION_APP_NAME>.azurewebsites.net/runtime/webhooks/mcp Get your credentials based on your chosen auth method: a managed identity configuration, OAuth credentials Add the MCP server as a tool in the Foundry portal by navigating to your agent, adding a new MCP tool, and providing the endpoint and credentials. Microsoft Entra connection required fields OAuth Identity required fields Once the server is configured as a tool, test it in the Agent Builder playground by sending a prompt that triggers one of your MCP tools. Closing thoughts What I find exciting about this is the composability. You build your MCP server once and it works everywhere: VS Code, VS, Cursor, ChatGPT, and now Foundry agents. The MCP protocol is becoming the universal interface for tool use in AI, and Azure Functions makes it easy to host these servers at scale and with security. Are you building agents with Foundry? Have you connected your MCP servers to other clients? I'd love to hear what tools you're exposing and how you're using them. Share with us your thoughts! What's next In the next blog post, we'll go deeper into other MCP topics and cover new MCP features and developments in Azure Functions. Stay tuned!460Views0likes0CommentsMCP Apps on Azure Functions: Quick Start with TypeScript
Azure Functions makes hosting MCP apps simple: build locally, create a secure endpoint, and deploy fast with Azure Developer CLI (azd). This guide shows you how using a weather app example. What Are MCP Apps? MCP Apps let MCP servers return interactive HTML interfaces such as data visualizations, forms, dashboards that render directly inside MCP-compatible hosts (Visual Studio Code Copilot, Claude, ChatGPT, etc.). Learn more about MCP Apps in the official documentation. Having an interactive UI removes many restrictions that plain texts have, such as if your scenario has: Interactive Data: Replacing lists with clickable maps or charts for deep exploration. Complex Setup: Use one-page forms instead of long, back-and-forth questioning. Rich Media: Embed native viewers to pan, zoom, or rotate 3D models and documents. Live Updates: Maintain real-time dashboards that refresh without new prompts. Workflow Management: Handle multi-step tasks like approvals with navigation buttons and persistent state. MCP App Hosting as a Feature Azure Functions provides an easy abstraction to help you build MCP servers without having to learn the nitty-gritty of the MCP protocol. When hosting your MCP App on Functions, you get: MCP tools (server logic): Handle client requests, call backend services, return structured data - Azure Functions manages the MCP protocol details for you MCP resources (UI payloads such as app widgets): Serve interactive HTML, JSON documents, or formatted content - just focus on your UI logic Secure HTTPS access: Built-in authentication using Azure Functions keys, plus built-in MCP authentication with OAuth support for enterprise-grade security Easy deployment with Bicep and azd: Infrastructure as Code for reliable deployments Local development: Test and debug locally before deploying Auto-scaling: Azure Functions handles scaling, retries, and monitoring automatically The weather app in this repo is an example of this feature, not the only use case. Architecture Overview Example: The classic Weather App The sample implementation includes: A GetWeather MCP tool that fetches weather by location (calls Open-Meteo geocoding and forecast APIs) A Weather Widget MCP resource that serves interactive HTML/JS code (runs in the client; fetches data via GetWeather tool) A TypeScript service layer that abstracts API calls and data transformation (runs on the server) Bidirectional communication: client-side UI calls server-side tools, receives data, renders locally Local and remote testing flow for MCP clients (via MCP Inspector, VS Code, or custom clients) How UI Rendering Works in MCP Apps In the Weather App example: Azure Functions serves getWeatherWidget as a resource → returns weather-app.ts compiled to HTML/JS Client renders the Weather Widget UI User interacts with the widget or requests are made internally The widget calls the getWeather tool → server processes and returns weather data The widget renders the weather data on the client side This architecture keeps the UI responsive locally while using server-side logic and data on demand. Quick Start Checkout repository: https://github.com/Azure-Samples/remote-mcp-functions-typescript Run locally: npm install npm run build func start Local endpoint: http://0.0.0.0:7071/runtime/webhooks/mcp Deploy to Azure: azd provision azd deploy Remote endpoint: https://.azurewebsites.net/runtime/webhooks/mcp TypeScript MCP Tools Snippet (Get Weather service) In Azure Functions, you define MCP tools using app.mcpTool(). The toolName and description tell clients what this tool does, toolProperties defines the input arguments (like location as a string), and handler points to your function that processes the request. app.mcpTool("getWeather", { toolName: "GetWeather", description: "Returns current weather for a location via Open-Meteo.", toolProperties: { location: arg.string().describe("City name to check weather for") }, handler: getWeather, }); Resource Trigger Snippet (Weather App Hook) MCP resources are defined using app.mcpResource(). The uri is how clients reference this resource, resourceName and description provide metadata, mimeType tells clients what type of content to expect, and handler is your function that returns the actual content (like HTML for a widget). app.mcpResource("getWeatherWidget", { uri: "ui://weather/index.html", resourceName: "Weather Widget", description: "Interactive weather display for MCP Apps", mimeType: "text/html;profile=mcp-app", handler: getWeatherWidget, }); Sample repos and references Complete sample repository with TypeScript implementation: https://github.com/Azure-Samples/remote-mcp-functions-typescript Official MCP extension documentation: https://learn.microsoft.com/azure/azure-functions/functions-bindings-mcp?pivots=programming-language-typescript Java sample: https://github.com/Azure-Samples/remote-mcp-functions-java .NET sample: https://github.com/Azure-Samples/remote-mcp-functions-dotnet Python sample: https://github.com/Azure-Samples/remote-mcp-functions-python MCP Inspector: https://github.com/modelcontextprotocol/inspector Final Takeaway MCP Apps are just MCP servers but they represent a paradigm shift by transforming the AI from a text-based chatbot into a functional interface. Instead of forcing users to navigate complex tasks through back-and-forth conversations, these apps embed interactive UIs and tools directly into the chat, significantly improving the user experience and the usefulness of MCP servers. Azure Functions allows developers to quickly build and host an MCP app by providing an easy abstraction and deployment experience. The platform also provides built-in features to secure and scale your MCP apps, plus a serverless pricing model so you can just focus on the business logic.400Views1like0CommentsThe Durable Task Scheduler Consumption SKU is now Generally Available
Today, we're excited to announce that the Durable Task Scheduler Consumption SKU has reached General Availability. Developers can now run durable workflows and agents on Azure with pay-per-use pricing, no storage to manage, no capacity to plan, and no idle costs. Just create a scheduler, connect your app, and start orchestrating. Whether you're coordinating AI agent workflows, processing event-driven pipelines, or running background jobs, the Consumption SKU is ready to go. GET STARTED WITH THE DURABLE TASK SCHEDULER CONSUMPTION SKU Since launching the Consumption SKU in public preview last November, we've seen incredible adoption and have incorporated feedback from developers around the world to ensure the GA release is truly production ready. “The Durable Task Scheduler has become a foundational piece of what we call ‘workflows’. It gives us the reliability guarantees we need for processing financial documents and sensitive workflows, while keeping the programming model straightforward. The combination of durable execution, external event correlation, deterministic idempotency, and the local emulator experience has made it a natural fit for our event-driven architecture. We have been delighted with the consumption SKUs cost model for our lower environments.”– Emily Lewis, CarMax What is the Durable Task Scheduler? If you're new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background: Announcing Limited Early Access of the Durable Task Scheduler Announcing Workflow in Azure Container Apps with the Durable Task Scheduler Announcing Dedicated SKU GA & Consumption SKU Public Preview In brief, the Durable Task Scheduler is a fully managed orchestration backend for durable execution on Azure, meaning your workflows and agent sessions can reliably resume and run to completion, even through process failures, restarts, and scaling events. Whether you’re running workflows or orchestrating durable agents, it handles task scheduling, state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing their own execution engines and storage backends. The Durable Task Scheduler works across Azure compute environments: Azure Functions: Using the Durable Functions extension across all Function App SKUs, including Flex Consumption. Azure Container Apps: Using the Durable Functions or Durable Task SDKs with built-in workflow support and auto-scaling. Any compute: Azure Kubernetes Service, Azure App Service, or any environment where you can run the Durable Task SDKs (.NET, Python, Java, JavaScript). Why choose the Consumption SKU? With the Consumption SKU you’re charged only for actions dispatched, with no minimum commitments or idle costs. There’s no capacity to size or throughput to reserve. Create a scheduler, connect your app, and you’re running. The Consumption SKU is a natural fit for workloads with unpredictable or bursty usage patterns: AI agent orchestration: Multi-step agent workflows that call LLMs, retrieve data, and take actions. Users trigger these on demand, so volume is spiky and pay-per-use avoids idle costs between bursts. Event-driven pipelines: Processing events from queues, webhooks, or streams with reliable orchestration and automatic checkpointing, where volumes spike and dip unpredictably. API-triggered workflows: User signups, form submissions, payment flows, and other request-driven processing where volume varies throughout the day. Distributed transactions: Retries and compensation logic across microservices with durable sagas that survive failures and restarts. What's included in the Consumption SKU at GA The Consumption SKU has been hardened based on feedback and real-world usage during the public preview. Here's what's included at GA: Performance Up to 500 actions per second: Sufficient throughput for a wide range of workloads, with the option to move to the Dedicated SKU for higher-scale scenarios. Up to 30 days of data retention: View and manage orchestration history, debug failures, and audit execution data for up to 30 days. Built-in monitoring dashboard Filter orchestrations by status, drill into execution history, view visual Gantt and sequence charts, and manage orchestrations (pause, resume, terminate, or raise events), all from the dashboard, secured with Role-Based Access Control (RBAC). Identity-based security The Consumption SKU uses Entra ID for authentication and RBAC for authorization. No SAS tokens or access keys to manage, just assign the appropriate role and connect. Get started with the Durable Task Scheduler today The Consumption SKU is available now Generally Available. Provision a scheduler in the Azure portal, connect your app, and start orchestrating. You only pay for what you use. Documentation Getting started Samples Pricing Consumption SKU docs We'd love to hear your feedback. Reach out to us by filing an issue on our GitHub repository504Views0likes0CommentsMigrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)
What is ACI/Virtual Nodes? Azure Container Instances (ACI) is a fully-managed serverless container platform which gives you the ability to run containers on-demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM‑backed node pools. From a developer’s perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI’s serverless infrastructure, enabling fast scale‑out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short‑lived workloads where speed and cost efficiency matter more than long‑running capacity planning. Introducing the next generation of Virtual Nodes on ACI The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add‑on and delivering a more Kubernetes‑native, flexible, and scalable experience when bursting workloads from AKS to ACI. In this article I will demonstrate how you can migrate an existing AKS cluster using the Virtual Nodes managed add-on (legacy), to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm. More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide. Please note that all code samples within this guide are examples only, and are provided without warranty/support. Background Virtual Nodes on ACI is rebuilt from the ground-up, and includes several fixes and enhancements, for instance: Added support/features VNet peering, outbound traffic to the internet with network security groups Init containers Host aliases Arguments for exec in ACI Persistent Volumes and Persistent Volume Claims Container hooks Confidential containers (see supported regions list here) ACI standby pools Support for image pulling via Private Link and Managed Identity (MSI) Planned future enhancements Kubernetes network policies Support for IPv6 Windows containers Port Forwarding Note: The new generation of the add-on is managed via Helm rather than as an AKS managed add-on. Requirements & limitations Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB memory on one of the AKS cluster’s VMs Each Virtual Nodes node supports up to 200 pods DaemonSets are not supported Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (Kubenet is not supported, nor is overlay networking) Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, and therefore the steps below show an example of removing Virtual Nodes managed add-on and its resources and then installing the Virtual Nodes on ACI Helm chart. In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet, however if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here. Be sure to use a new subnet CIDR within the VNet address space, which doesn't overlap with other subnets nor the AKS CIDRS for nodes/pods and ClusterIP services. To minimise disruption, we'll first install the Virtual Nodes on ACI Helm chart, before then removing the legacy managed add-on and its resources. Prerequisites A recent version of the Azure CLI An Azure subscription with sufficient ACI quota for your selected region Helm Deployment steps Initialise environment variables location=northeurope rg=rg-virtualnode-demo vnetName=vnet-virtualnode-demo clusterName=aks-virtualnode-demo aksSubnetName=subnet-aks vnSubnetName=subnet-vn Create the new Virtual Nodes on ACI subnet with the specific name value of cg (a custom subnet can be used by following the steps here): vnSubnetId=$(az network vnet subnet create \ --resource-group $rg \ --vnet-name $vnetName \ --name cg \ --address-prefixes <your subnet CIDR> \ --delegations Microsoft.ContainerInstance/containerGroups --query id -o tsv) Assign the cluster's -kubelet identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet: nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv) nodeRgId=$(az group show -n $nodeRg --query id -o tsv) agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv) agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv) az role assignment create \ --assignee-object-id "$agentPoolIdentityObjectId" \ --assignee-principal-type ServicePrincipal \ --role "Contributor" \ --scope "$nodeRgId" az role assignment create \ --assignee-object-id "$agentPoolIdentityObjectId" \ --assignee-principal-type ServicePrincipal \ --role "Network Contributor" \ --scope "$vnSubnetId" Download the cluster's kubeconfig file: az aks get-credentials -n $clusterName -g $rg Clone the virtualnodesOnAzureContainerInstances GitHub repo: git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git Install the Virtual Nodes on ACI Helm chart: helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode Confirm the Virtual Nodes node shows within the cluster and is in a Ready state (virtualnode-n): $ kubectl get node NAME STATUS ROLES AGE VERSION aks-nodepool1-35702456-vmss000000 Ready <none> 4h13m v1.33.6 aks-nodepool1-35702456-vmss000001 Ready <none> 4h13m v1.33.6 virtualnode-0 Ready <none> 162m v1.33.7 Scale-down any running Virtual Nodes workloads (example below): kubectl scale deploy <deploymentName> -n <namespace> --replicas=0 Drain and cordon the legacy Virtual Nodes node: kubectl drain virtual-node-aci-linux Disable the Virtual Nodes managed add-on (legacy): az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node Export a backup of the original subnet configuration: az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json Delete the original subnet (subnets cannot be renamed and therefore must be re-created): az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName Delete the previous (legacy) Virtual Nodes node from the cluster: kubectl delete node virtual-node-aci-linux Test and confirm pod scheduling on Virtual Node: apiVersion: v1 kind: Pod metadata: annotations: name: demo-pod spec: containers: - command: - /bin/bash - -c - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done' image: mcr.microsoft.com/azure-cli name: hello-world-counter resources: limits: cpu: 2250m memory: 2256Mi requests: cpu: 100m memory: 128Mi nodeSelector: virtualization: virtualnode2 tolerations: - effect: NoSchedule key: virtual-kubelet.io/provider operator: Exists If the pod successfully starts on the Virtual Node, you should see similar to the below: $ kubectl get pod -o wide demo-pod NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES demo-pod 1/1 Running 0 95s 10.241.0.4 vnode2-virtualnode-0 <none> <none> Modify the nodeSelector and tolerations properties of your Virtual Nodes workloads to match the requirements of Virtual Nodes on ACI (see details below) Modify your deployments to run on Virtual Nodes on ACI For Virtual Nodes managed add-on (legacy), the following nodeSelector and tolerations are used to run pods on Virtual Nodes: nodeSelector: kubernetes.io/role: agent kubernetes.io/os: linux type: virtual-kubelet tolerations: - key: virtual-kubelet.io/provider operator: Exists - key: azure.com/aci effect: NoSchedule For Virtual Nodes on ACI, the nodeSelector/tolerations are slightly different: nodeSelector: virtualization: virtualnode2 tolerations: - effect: NoSchedule key: virtual-kubelet.io/provider operator: Exists Troubleshooting Check the virtual-node-admission-controller and virtualnode-n pods are running within the vn2 namespace: $ kubectl get pod -n vn2 NAME READY STATUS RESTARTS AGE virtual-node-admission-controller-54cb7568f5-b7hnr 1/1 Running 1 (5h21m ago) 5h21m virtualnode-0 6/6 Running 6 (4h48m ago) 4h51m If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use kubectl describe pod to validate). If the virtualnode-n pod is crashing, check the logs of the proxycri container to see whether there are any Managed Identity permissions issues (the cluster's -agentpool MSI needs to have Contributor access on the infrastructure resource group): kubectl logs -n vn2 virtualnode-0 -c proxycri Further troubleshooting guidance is available within the official documentation. Support If you have issues deploying or using Virtual Nodes on ACI, add a GitHub issue here645Views3likes0CommentsBeyond the Desktop: The Future of Development with Microsoft Dev Box and GitHub Codespaces
The modern developer platform has already moved past the desktop. We’re no longer defined by what’s installed on our laptops, instead we look at what tooling we can use to move from idea to production. An organisations developer platform strategy is no longer a nice to have, it sets the ceiling for what’s possible, an organisation can’t iterate it's way to developer nirvana if the foundation itself is brittle. A great developer platform shrinks TTFC (time to first commit), accelerates release velocity, and maybe most importantly, helps alleviate everyday frictions that lead to developer burnout. Very few platforms deliver everything an organization needs from a developer platform in one product. Modern development spans multiple dimensions, local tooling, cloud infrastructure, compliance, security, cross-platform builds, collaboration, and rapid onboarding. The options organizations face are then to either compromise on one or more of these areas or force developers into rigid environments that slow productivity and innovation. This is where Microsoft Dev Box and GitHub Codespaces come into play. On their own, each addresses critical parts of the modern developer platform: Microsoft Dev Box provides a full, managed cloud workstation. Dev Box gives developers a consistent, high-performance environment while letting central IT apply strict governance and control. Internally at Microsoft, we estimate that usage of Dev Box by our development teams delivers savings of 156 hours per year per developer purely on local environment setup and upkeep. We have also seen significant gains in other key SPACE metrics reducing context-switching friction and improving build/test cycles. Although the benefits of Dev Box are clear in the results demonstrated by our customers it is not without its challenges. The biggest challenge often faced by Dev Box customers is its lack of native Linux support. At the time of writing and for the foreseeable future Dev Box does not support native Linux developer workstations. While WSL2 provides partial parity, I know from my own engineering projects it still does not deliver the full experience. This is where GitHub Codespaces comes into this story. GitHub Codespaces delivers instant, Linux-native environments spun up directly from your repository. It’s lightweight, reproducible, and ephemeral ideal for rapid iteration, PR testing, and cross-platform development where you need Linux parity or containerized workflows. Unlike Dev Box, Codespaces can run fully in Linux, giving developers access to native tools, scripts, and runtimes without workarounds. It also removes much of the friction around onboarding: a new developer can open a repository and be coding in minutes, with the exact environment defined by the project’s devcontainer.json. That said, Codespaces isn’t a complete replacement for a full workstation. While it’s perfect for isolated project work or ephemeral testing, it doesn’t provide the persistent, policy-controlled environment that enterprise teams often require for heavier workloads or complex toolchains. Used together, they fill the gaps that neither can cover alone: Dev Box gives the enterprise-grade foundation, while Codespaces provides the agile, cross-platform sandbox. For organizations, this pairing sets a higher ceiling for developer productivity, delivering a truly hybrid, agile and well governed developer platform. Better Together: Dev Box and GitHub Codespaces in action Together, Microsoft Dev Box and GitHub Codespaces deliver a hybrid developer platform that combines consistency, speed, and flexibility. Teams can spin up full, policy-compliant Dev Box workstations preloaded with enterprise tooling, IDEs, and local testing infrastructure, while Codespaces provides ephemeral, Linux-native environments tailored to each project. One of my favourite use cases is having local testing setups like a Docker Swarm cluster, ready to go in either Dev Box or Codespaces. New developers can jump in and start running services or testing microservices immediately, without spending hours on environment setup. Anecdotally, my time to first commit and time to delivering “impact” has been significantly faster on projects where one or both technologies provide local development services out of the box. Switching between Dev Boxes and Codespaces is seamless every environment keeps its own libraries, extensions, and settings intact, so developers can jump between projects without reconfiguring or breaking dependencies. The result is a turnkey, ready-to-code experience that maximizes productivity, reduces friction, and lets teams focus entirely on building, testing, and shipping software. To showcase this value, I thought I would walk through an example scenario. In this scenario I want to simulate a typical modern developer workflow. Let's look at a day in the life of a developer on this hybrid platform building an IOT project using Python and React. Spin up a ready-to-go workstation (Dev Box) for Windows development and heavy builds. Launch a Linux-native Codespace for cross-platform services, ephemeral testing, and PR work. Run "local" testing like a Docker Swarm cluster, database, and message queue ready to go out-of-the-box. Switch seamlessly between environments without losing project-specific configurations, libraries, or extensions. 9:00 AM – Morning Kickoff on Dev Box I start my day on my Microsoft Dev Box, which gives me a fully-configured Windows environment with VS Code, design tools, and Azure integrations. I select my teams project, and the environment is pre-configured for me through the Dev Box catalogue. Fortunately for me, its already provisioned. I could always self service another one using the "New Dev Box" button if I wanted too. I'll connect through the browser but I could use the desktop app too if I wanted to. My Tasks are: Prototype a new dashboard widget for monitoring IoT device temperature. Use GUI-based tools to tweak the UI and preview changes live. Review my Visio Architecture. Join my morning stand up. Write documentation notes and plan API interactions for the backend. In a flash, I have access to my modern work tooling like Teams, I have this projects files already preloaded and all my peripherals are working without additional setup. Only down side was that I did seem to be the only person on my stand up this morning? Why Dev Box first: GUI-heavy tasks are fast and responsive. Dev Box’s environment allows me to use a full desktop. Great for early-stage design, planning, and visual work. Enterprise Apps are ready for me to use out of the box (P.S. It also supports my multi-monitor setup). I use my Dev Box to make a very complicated change to my IoT dashboard. Changing the title from "IoT Dashboard" to "Owain's IoT Dashboard". I preview this change in a browser live. (Time for a coffee after this hardwork). The rest of the dashboard isnt loading as my backend isnt running... yet. 10:30 AM – Switching to Linux Codespaces Once the UI is ready, I push the code to GitHub and spin up a Linux-native GitHub Codespace for backend development. Tasks: Implement FastAPI endpoints to support the new IoT feature. Run the service on my Codespace and debug any errors. Why Codespaces now: Linux-native tools ensure compatibility with the production server. Docker and containerized testing run natively, avoiding WSL translation overhead. The environment is fully reproducible across any device I log in from. 12:30 PM – Midday Testing & Sync I toggle between Dev Box and Codespaces to test and validate the integration. I do this in my Dev Box Edge browser viewing my codespace (I use my Codespace in a browser through this demo to highlight the difference in environments. In reality I would leverage the VSCode "Remote Explorer" extension and its GitHub Codespace integration to use my Codespace from within my own desktop VSCode but that is personal preference) and I use the same browser to view my frontend preview. I update the environment variable for my frontend that is running locally in my Dev Box and point it at the port running my API locally on my Codespace. In this case it was a web socket connection and HTTPS calls to port 8000. I can make this public by changing the port visibility in my Codespace. https://fluffy-invention-5x5wp656g4xcp6x9-8000.app.github.dev/api/devices wss://fluffy-invention-5x5wp656g4xcp6x9-8000.app.github.dev/ws This allows me to: Preview the frontend widget on Dev Box, connecting to the backend running in Codespaces. Make small frontend adjustments in Dev Box while monitoring backend logs in Codespaces. Commit changes to GitHub, keeping both environments in sync and leveraging my CI/CD for deployment to the next environment. We can see the Dev Box running local frontend and the Codespace running the API connected to each other, making requests and displaying the data in the frontend! Hybrid advantage: Dev Box handles GUI previews comfortably and allows me to live test frontend changes. Codespaces handles production-aligned backend testing and Linux-native tools. Dev Box allows me to view all of my files in one screen with potentially multiple Codespaces running in browser of VS Code Desktop. Due to all of those platform efficiencies I have completed my days goals within an hour or two and now I can spend the rest of my day learning about how to enable my developers to inner source using GitHub CoPilot and MCP (Shameless plug). The bottom line There are some additional considerations when architecting a developer platform for an enterprise such as private networking and security not covered in this post but these are implementation details to deliver the described developer experience. Architecting such a platform is a valuable investment to deliver the developer platform foundations we discussed at the top of the article. While in this demo I have quickly built I was working in a mono repository in real engineering teams it is likely (I hope) that an application is built of many different repositories. The great thing about Dev Box and Codespaces is that this wouldn’t slow down the rapid development I can achieve when using both. My Dev Box would be specific for the project or development team, pre loaded with all the tools I need and potentially some repos too! When I need too I can quickly switch over to Codespaces and work in a clean isolated environment and push my changes. In both cases any changes I want to deliver locally are pushed into GitHub (Or ADO), merged and my CI/CD ensures that my next step, potentially a staging environment or who knows perhaps *Whispering* straight into production is taken care of. Once I’m finished I delete my Codespace and potentially my Dev Box if I am done with the project, knowing I can self service either one of these anytime and be up and running again! Now is there overlap in terms of what can be developed in a Codespace vs what can be developed in Azure Dev Box? Of course, but as organisations prioritise developer experience to ensure release velocity while maintaining organisational standards and governance then providing developers a windows native and Linux native service both of which are primarily charged on the consumption of the compute* is a no brainer. There are also gaps that neither fill at the moment for example Microsoft Dev Box only provides windows compute while GitHub Codespaces only supports VS Code as your chosen IDE. It's not a question of which service do I choose for my developers, these two services are better together! *Changes have been announced to Dev Box pricing. A W365 license is already required today and dev boxes will continue to be managed through Azure. For more information please see: Microsoft Dev Box capabilities are coming to Windows 365 - Microsoft Dev Box | Microsoft Learn1.4KViews2likes0CommentsExtend SRE Agent with MCP: Build an Agentic Workflow to Triage Customer Issues
Your inbox is full. GitHub issues piling up. "App not working." "How do I configure alerts?" "Please add dark mode." You open each one, figure out what it is, ask for more info, add labels, route to the right team. An hour later, you're still sorting issues. Sound familiar? The Triage Tax Every L1 support engineer, PM, and on-call developer who's handled customer issues knows this pain. When tickets come in, you're not solving problems, you're sorting them. Read the issue. Is it a bug or a question? Check the docs. Does this feature exist? Ask for more info. Wait two days. Re-triage. Add labels. Route to engineering. It's tedious. It requires judgment, you need to understand the product, know what info is needed, check documentation. And honestly? It's work that nobody volunteers for but someone has to do. In large organizations, it gets even more complex. The issue doesn't just need to be triaged, it needs to be routed to the right engineering team. Is this an auth issue? Frontend? Backend? Infrastructure? A wrong routing decision means delays, re-assignments, and frustrated customers. What if an AI agent could do this for you? Enter Azure SRE Agent + MCP Here's what I built: I gave SRE Agent access to my GitHub and PagerDuty accounts via MCP, uploaded my triage rubric as a markdown file, and set it to run twice a day. No more reading every ticket manually. No more asking the same "please provide more info" questions. No more morning triage sessions. What My Setup Looks Like My app's customer issues come in through GitHub. My team uses PagerDuty to track bugs and incidents. So I connected both via MCP to the SRE Agent. I also uploaded my triage logic as a .md file on how to classify issues, what info is required for each category, which labels to use, which team handles what. And since I didn't want to run this workflow manually, I set up a scheduled task to trigger it twice a day. Now it just runs. I verify its work if I want to. What the Agent Does Fetches all open, unlabeled GitHub issues Reads each issue and classifies it (bug, doc question, feature request) Checks if required info is present Posts a comment asking for details if needed, or acknowledges the issue Adds appropriate labels Creates a PagerDuty incident for bugs ready for engineering Moves to the next issue How I Built It: Step by Step Let me walk you through exactly how I set this up inside SRE Agent. Step 1: Create an SRE Agent I created a new SRE Agent in the Azure portal. Since this workflow triages GitHub issues and not Azure resources, I didn't need to configure any Azure resource groups or subscriptions. Just an agent. Step 2: Connect MCP Servers I added two MCP servers to give the agent access to my tools: GitHub MCP– Fetch issues, post comments, add labels PagerDuty MCP – Create incidents for bugs that need dev team's attention MCP (Model Context Protocol) lets you bring any API into the agent. If your tool has an API, you can connect it. Step 3: Create Subagents I created two focused subagents, each with a specific job and only the tools it needs: GitHub Issue Triager "You are expert in triaging GitHub issues, classifying them into categories such as user needs to supply additional information, bug, documentation question, or feature request. Use the knowledge base to search for the right document that helps you with performing this triaging. Perform all actions autonomously without waiting for user input. Hand off to Incident Creator for the issues you classified as bugs." Tools: GitHub MCP (issues, labels, comments) Incident Creator Here "You are expert in managing incidents in PagerDuty, listing services, incidents, creating incidents with all details. Once done, hand off back to GitHub Issue Triager." Tools: PagerDuty MCP (services, incidents) The handoff between them creates a workflow. They collaborate without human involvement. Step 4: Add Your Knowledge I uploaded my triage logic as a .md file to the agent's knowledge base. This is my rubric - my mental model for how to triage issues: How do I classify bugs vs. doc questions vs. feature requests? What info is required for each category? What labels do I use? When should an incident be created? Which team handles which type of issue? I wrote it down the way I'd explain it to a new teammate. The agent searches and follows it. Step 5: Add a Scheduled Task I didn't want to trigger this workflow manually every time. SRE Agent supports scheduled tasks, workflows that run automatically on a cadence. I set up a trigger to run twice a day: morning and evening. Now the workflow is fully automated. Here is the end to end automated agentic workflow to triage customer tickets. Why MCP Matters Every team uses different tools. Maybe your customer issues live in Zendesk, incidents go to ServiceNow and you use Jira or Azure DevOps. SRE Agent doesn't lock you in. With MCP, you connect to whatever tools you already use. The agent orchestrates across them. That's the extensibility model: your tools, your workflow, orchestrated by the agent. The Result Before: 2 hours every morning sorting tickets. After: By the time anyone logs in, issues are labeled, missing-info requests are posted, urgent bugs have incidents, and feature requests are acknowledged. Your team can finally focus on the complex stuff not sorting tickets. Why This Matters Faster response times. Issues get acknowledged in minutes, not days. Consistent classification. No "this should have been a P1" moments. No tickets bouncing between teams. Happier customers. They get a response immediately even if it's just "we're looking into it." Focus on what matters. Your team should be solving problems, not sorting them. The Bottom Line Triage isn't the job, it's the tax on the job. It quietly eats the hours your team could spend building, debugging, and shipping. You don't need to build a custom triage bot. You don't need to wire up webhooks and write glue code. You give the SRE agent your tools, your logic, and a schedule and it handles the sorting. Use GitHub? Connect GitHub. Use Zendesk? Connect Zendesk. PagerDuty, ServiceNow, Jira - whatever your team runs on, the agent meets you there. Stop sorting tickets. Start shipping. A Few Tips Test MCP endpoints before configuring them in the SRE agent Give each subagent only the tools it needs, don't enable everything Start read-only until you trust the classification, then enable comments Do You Still Want to Triage Issues Manually? What tools does your team use to track customer-reported issues and incidents? Let us know in the comments, we'd love to hear how you'd use this workflow with your stack. Is triage your most toilsome workflow or is there something even worse eating your team's time? Let us know in the comments.811Views1like0CommentsUnlocking Application Modernisation with GitHub Copilot
AI-driven modernisation is unlocking new opportunities you may not have even considered yet. It's also allowing organisations to re-evaluate previously discarded modernisation attempts that were considered too hard, complex or simply didn't have the skills or time to do. During Microsoft Build 2025, we were introduced to the concept of Agentic AI modernisation and this post from Ikenna Okeke does a great job of summarising the topic - Reimagining App Modernisation for the Era of AI | Microsoft Community Hub. This blog post however, explores the modernisation opportunities that you may not even have thought of yet, the business benefits, how to start preparing your organisation, empowering your teams, and identifying where GitHub Copilot can help. I’ve spent the last 8 months working with customers exploring usage of GitHub Copilot, and want to share what my team members and I have discovered in terms of new opportunities to modernise, transform your applications, bringing some fun back into those migrations! Let’s delve into how GitHub Copilot is helping teams update old systems, move processes to the cloud, and achieve results faster than ever before. Background: The Modernisation Challenge (Then vs Now) Modernising legacy software has always been hard. In the past, teams faced steep challenges: brittle codebases full of technical debt, outdated languages (think decades-old COBOL or VB6), sparse documentation, and original developers long gone. Integrating old systems with modern cloud services often requiring specialised skills that were in short supply – for example, check out this fantastic post from Arvi LiVigni (@arilivigni ) which talks about migrating from COBOL “the number of developers who can read and write COBOL isn’t what it used to be,” making those systems much harder to update". Common pain points included compatibility issues, data migrations, high costs, security vulnerabilities, and the constant risk that any change could break critical business functions. It’s no wonder many modernisation projects stalled or were “put off” due to their complexity and risk. So, what’s different now (circa 2025) compared to two years ago? In a word: Intelligent AI assistance. Tools like GitHub Copilot have emerged as AI pair programmers that dramatically lower the barriers to modernisation. Arvi’s post talks about how only a couple of years ago, developers had to comb through documentation and Stack Overflow for clues when deciphering old code or upgrading frameworks. Today, GitHub Copilot can act like an expert co-developer inside your IDE, ready to explain mysterious code, suggest updates, and even rewrite legacy code in modern languages. This means less time fighting old code and more time implementing improvements. As Arvi says “nine times out of 10 it gives me the right answer… That speed – and not having to break out of my flow – is really what’s so impactful.” In short, AI coding assistants have evolved from novel experiments to indispensable tools, reimagining how we approach software updates and cloud adoption. I’d also add from my own experience – the models we were using 12 months ago have already been superseded by far superior models with ability to ingest larger context and tackle even further complexity. It's easier to experiment, and fail, bringing more robust outcomes – with such speed to create those proof of concepts, experimentation and failing faster, this has also unlocked the ability to test out multiple hypothesis’ and get you to the most confident outcome in a much shorter space of time. Modernisation is easier now because AI reduces the heavy lifting. Instead of reading the 10,000-line legacy program alone, a developer can ask Copilot to explain what the code does or even propose a refactored version. Rather than manually researching how to replace an outdated library, they can get instant recommendations for modern equivalents. These advancements mean that tasks which once took weeks or months can now be done in days or hours – with more confidence and less drudgery - more fun! The following sections will dive into specific opportunities unlocked by GitHub Copilot across the modernisation journey which you may not even have thought of. Modernisation Opportunities Unlocked by Copilot Modernising an application isn’t just about updating code – it involves bringing everyone and everything up to speed with cloud-era practices. Below are several scenarios and how GitHub Copilot adds value, with the specific benefits highlighted: 1. AI-Assisted Legacy Code Refactoring and Upgrades Instant Code Comprehension: GitHub Copilot can explain complex legacy code in plain English, helping developers quickly understand decades-old logic without scouring scarce documentation. For example, you can highlight a cryptic COBOL or C++ function and ask Copilot to describe what it does – an invaluable first step before making any changes. This saves hours and reduces errors when starting a modernisation effort. Automated Refactoring Suggestions: The AI suggests modern replacements for outdated patterns and APIs, and can even translate code between languages. For instance, Copilot can help convert a COBOL program into JavaScript or C# by recognising equivalent constructs. It also uses transformation tools (like OpenRewrite for Java/.NET) to systematically apply code updates – e.g. replacing all legacy HTTP calls with a modern library in one sweep. Developers remain in control, but GitHub Copilot handles the tedious bulk edits. Bulk Code Upgrades with AI: GitHub Copilot’s App Modernisation capabilities can analyse an entire codebase and generate a detailed upgrade plan, then execute many of the code changes automatically. It can upgrade framework versions (say from .NET Framework 4.x to .NET 6, or Java 8 to Java 17) by applying known fix patterns and even fixing compilation errors after the upgrade. Teams can finally tackle those hundreds of thousand-line enterprise applications – a task that could take multiple years with GitHub Copilot handling the repetitive changes. Technical Debt Reduction: By cleaning up old code and enforcing modern best practices, GitHub Copilot helps chip away at years of technical debt. The modernised codebase is more maintainable and stable, which lowers the long-term risk hanging over critical business systems. Notably, the tool can even scan for known security vulnerabilities during refactoring as it updates your code. In short, each legacy component refreshed with GitHub Copilot comes out safer and easier to work on, instead of remaining a brittle black box. 2. Accelerating Cloud Migration and Azure Modernisation Guided Azure Migration Planning: GitHub Copilot can assess a legacy application’s cloud readiness and recommend target Azure services for each component. For instance, it might suggest migrating an on-premises database to Azure SQL, moving file storage to Azure Blob Storage, and converting background jobs to Azure Functions. This provides a clear blueprint to confidently move an app from servers to Azure PaaS. One-Click Cloud Transformations: GitHub Copilot comes with predefined migration tasksthat automate the code changes required for cloud adoption. With one click, you can have the AI apply dozens of modifications across your codebase. For example: File storage: Replace local file read/writes with Azure Blob Storage SDK calls. Email/Comms: Swap out SMTP email code for Azure Communication Services or SendGrid. Identity: Migrate authentication from Windows AD to Azure AD (Entra ID) libraries. Configuration: Remove hard-coded configurations and use Azure App Configuration or Key Vault for secrets. GitHub Copilot performs these transformations consistently, following best practices (like using connection strings from Azure settings). After applying the changes, it even fixes any compile errors automatically, so you’re not left with broken builds. What used to require reading countless Azure migration guides is now handled in minutes. Automated Validation & Deployment: Modernisation doesn’t stop at code changes. GitHub Copilot can also generate unit tests to validate that the application still behaves correctly after the migration. It helps ensure that your modernised, cloud-ready app passes all its checks before going live. When you’re ready to deploy, GitHub Copilot can produce the necessary Infrastructure-as-Code templates (e.g. Azure Resource Manager Bicep files or Terraform configs) and even set up CI/CD pipeline scripts for you. In other words, the AI can configure the Azure environment and deployment process end-to-end. This dramatically reduces manual effort and error, getting your app to the cloud faster and with greater confidence. Integrations: GitHub Copilot also helps tackle larger migration scenarios that were previously considered too complex. For example, many enterprises want to retire expensive proprietary integration platforms like MuleSoft or Apigee and use Azure-native services instead, but rewriting hundreds of integration workflows was daunting. Now, GitHub Copilot can assist in translating those workflows: for instance, converting an Apigee API proxy into an Azure API Management policy, or a MuleSoft integration into an Azure Logic App. Multi-Cloud Migrations: if you plan to consolidate from other clouds into Azure, GitHub Copilot can suggest equivalent Azure services and SDK calls to replace AWS or GCP-specific code. These AI-assisted conversions significantly cut down the time needed to reimplement functionality on Azure. The business impact can be substantial. By lowering the effort of such migrations, GitHub Copilot makes it feasible to pursue opportunities that deliver big cost savings and simplification. 3. Boosting Developer Productivity and Quality Instant Unit Tests (TDD Made Easy): Writing tests for old code can be tedious, but GitHub Copilot can generate unit test cases on the fly. Developers can highlight an existing function and ask Copilot to create tests; it will produce meaningful test methods covering typical and edge scenarios. This makes it practical to apply test-driven development practices even to legacy systems – you can quickly build a safety net of tests before refactoring. By catching bugs early through these AI-generated tests, teams gain confidence to modernise code without breaking things. It essentially injects quality into the process from the start, which is crucial for successful modernisation. DevOps Automation: GitHub Copilot helps modernise your build and deployment process as well. It can draft CI/CD pipeline configurations, Dockerfiles, Kubernetes manifests, and other DevOps scripts by leveraging its knowledge of common patterns. For example, when setting up a GitHub Actions workflow to deploy your app, GitHub Copilot will autocomplete significant parts (like build steps, test runs, deployment jobs) based on the project structure. This not only saves time but also ensures best practices (proper caching, dependency installation, etc.) are followed by default. Microsoft even provides an extension where you can describe your Azure infrastructure needs in plain language and have GitHub Copilot generate the corresponding templates and pipeline YAML. By automating these pieces, teams can move to cloud-based, automated deployments much faster. Behaviour-Driven Development Support: Teams practicing BDD write human-readable scenarios (e.g. using Gherkin syntax) describing application behaviour. GitHub Copilot’s AI is adept at interpreting such descriptions and suggesting step definition code or test implementations to match. For instance, given a scenario “When a user with no items checks out, then an error message is shown,” GitHub Copilot can draft the code for that condition or the test steps required. This helps bridge the gap between non-technical specifications and actual code. It makes BDD more efficient and accessible, because even if team members aren’t strong coders, the AI can translate their intent into working code that developers can refine. Quality and Consistency: By using AI to handle boilerplate and repetitive tasks, developers can focus more on high-value improvements. GitHub Copilot’s suggestions are based on a vast corpus of code, which often means it surfaces well-structured, idiomatic patterns. Starting from these suggestions, developers are less likely to introduce errors or reinvent the wheel, which leads to more consistent code quality across the project. The AI also often reminds you of edge cases (for example, suggesting input validation or error handling code that might be missed), contributing to a more robust application. In practice, many teams find that adopting GitHub Copilot results in fewer bugs and quicker code reviews, as the code is cleaner on the first pass. It’s like having an extra set of eyes on every pull request, ensuring standards are met. Business Benefits of AI-Powered Modernisation Bringing together the technical advantages above, what’s the payoff for the business and stakeholders? Modernising with GitHub Copilot can yield multiple tangible and intangible benefits: Accelerated Time-to-Market: Modernisation projects that might have taken a year can potentially be completed in a few months, or an upgrade that took weeks can be done in days. This speed means you can deliver new features to customers sooner and respond faster to market changes. It also reduces downtime or disruption since migrations happen more swiftly. Cost Savings: By automating repetitive work and reducing the effort required from highly paid senior engineers, GitHub Copilot can trim development costs. Faster project completion also means lower overall project cost. Additionally, running modernised apps on cloud infrastructure (with updated code) often lowers operational costs due to more efficient resource usage and easier maintenance. There’s also an opportunity cost benefit: developers freed up by Copilot can work on other value-adding projects in parallel. Improved Quality & Reliability: GitHub Copilot’s contributions to testing, bug-fixing, and even security (like patching known vulnerabilities during upgrades) result in more robust applications. Modernised systems have fewer outages and security incidents than shaky legacy ones. Stakeholders will appreciate that with GitHub Copilot, modernisation doesn’t mean “trading one set of bugs for another” – instead, you can increase quality as you modernise (GitHub’s research noted higher code quality when using Copilot, as developers are less likely to introduce errors or skip tests). Business Agility: A modernised application (especially one refactored for cloud) is typically more scalable and adaptable. New integrations or features can be added much faster once the platform is up-to-date. GitHub Copilot helps clear the modernisation hurdle, after which the business can innovate on a solid, flexible foundation (for example, once a monolith is broken into microservices or moved to Azure PaaS, you can iterate on it much faster in the future). AI-assisted modernisation thus unlocks future opportunities (like easier expansion, integrations, AI features, etc.) that were impractical on the legacy stack. Employee Satisfaction and Innovation: Developer happiness is a subtle but important benefit. When tedious work is handled by AI, developers can spend more time on creative tasks – designing new features, improving user experience, exploring new technologies. This can foster a culture of innovation. Moreover, being seen as a company that leverages modern tools (like AI Copilot) helps attract and retain top tech talent. Teams that successfully modernise critical systems with Copilot will gain confidence to tackle other ambitious projects, creating a positive feedback loop of improvement. To sum up, GitHub Copilot acts as a force-multiplier for application modernisation. It enables organisations to do more with less: convert legacy “boat anchors” into modern, cloud-enabled assets rapidly, while improving quality and developer morale. This aligns IT goals with business goals – faster delivery, greater efficiency, and readiness for the future. Call to Action: Embrace the Future of Modernisation GitHub Copilot has proven to be a catalyst for transforming how we approach legacy systems and cloud adoption. If you’re excited about the possibilities, here are next steps and what to watch for: Start Experimenting: If you haven’t already, try GitHub Copilot on a sample of your code. Use Copilot or Copilot Chat to explain a piece of old code or generate a unit test. Seeing it in action on your own project can build confidence and spark ideas for where to apply it. Identify a Pilot Project: Look at your application portfolio for a candidate that’s ripe for modernisation – maybe a small legacy service that could be moved to Azure, or a module that needs a refactor. Use GitHub Copilot to assess and estimate the effort. Often, you’ll find tasks once deemed “too hard” might now be feasible. Early successes will help win support for larger initiatives. Stay Tuned for Our Upcoming Blog Series: This post is just the beginning. In forthcoming posts, we’ll dive deeper into: Setting Up Your Organisation for Copilot Adoption: Practical tips on preparing your enterprise environment – from licensing and security considerations to training programs. We’ll discuss best practices (like running internal awareness campaigns, defining success metrics, and creating Copilot champions in your teams) to ensure a smooth rollout. Empowering Your Colleagues: How to foster a culture that embraces AI assistance. This includes enabling continuous learning, sharing prompt techniques and knowledge bases, and addressing any scepticism. We’ll cover strategies to support developers in using Copilot effectively, so that everyone from new hires to veteran engineers can amplify their productivity. Identifying High-Impact Modernisation Areas: Guidance on spotting where GitHub Copilot can add the most value. We’ll look at different domains – code, cloud, tests, data – and how to evaluate opportunities (for example, using telemetry or feedback to find repetitive tasks suited for AI, or legacy components with high ROI if modernised). Engage and Share: As you start leveraging Copilot for modernisation, share your experiences and results. Success stories (even small wins like “GitHub Copilot helped reduce our code review times” or “we migrated a component to Azure in 1 sprint”) can build momentum within your organisation and the broader community. We invite you to discuss and ask questions in the comments or in our tech community forums. Take a look at the new App Modernisation Guidance—a comprehensive, step-by-step playbook designed to help organisations: Understand what to modernise and why Migrate and rebuild apps with AI-first design Continuously optimise with built-in governance and observability Modernisation is a journey, and AI is the new compass and Copilot to guide the way. By embracing tools like GitHub Copilot, you position your organisation to break through modernisation barriers that once seemed insurmountable. The result is not just updated software, but a more agile, cloud-ready business and a happier, more productive development team. Now is the time to take that step. Empower your team with Copilot, and unlock the full potential of your applications and your developers. Stay tuned for more insights in our next posts, and let’s modernise what’s possible together!2KViews4likes1CommentReimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent and context aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach - drawing from proven practices in internal Azure operations - the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams operations, delivering strong ROI for teams seeking sustainable AIOps. An Operations Agent that adapts to your playbooks Azure SRE Agent is an AI powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues, to automating operational workflows and seamless integration with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update. What’s New: Automation, Integration, and Extensibility Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release: No-code Sub-Agent Builder: Rapidly create custom automations without writing code. Flexible, event-driven triggers: Instantly respond to incidents and operational changes. Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources. Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP. Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box. Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs. Sub-Agent Builder: Custom Automation, No Code Required Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature helps address the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions. Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage. Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch, and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more. Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires. Flexible Triggers: Automate on Your Terms Invoke the agent to respond automatically to mission-critical events, not wait for manual commands. This feature helps speed up incident response and eliminate missed opportunities for efficiency. Multi-Source Triggers: Go beyond chat-based interactions, and trigger the agent to automatically respond to Incident Management and Ticketing systems like PagerDuty and ServiceNow, Observability Alerting systems like Azure Monitor Alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks. Additional trigger sources such as GitHub issues, Azure DevOps pipelines, email, etc. will be added over time. This means automation can start exactly when and where you need it. Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events - like deployments, incidents, or customer requests. Vital for reducing downtime, it ensures that business-critical actions happen automatically and promptly. Expanded Data Connectivity: Unified Observability and Troubleshooting Integrate data, enabling comprehensive diagnostics and troubleshooting and faster, more informed decision-making by eliminating silos and speeding up issue resolution. Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via the Remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation. Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through Remote MCP servers, enabling it to retrieve needed files on its own. This approach utilizes your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents. Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments. Custom Actions: Automate Anything, Anywhere Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration. Out-of-the-Box Actions: Instantly automate common tasks like running azcli, kubectl, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead. Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication. Bring Your Own Actions: Drop in your own Remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence. Prebuilt Operations Scenarios Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction. Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation of your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure CosmosDB, Azure VMs, etc. Support for additional resource types is being added continually, please see product documentation for the latest information. Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance. Handle Complex Investigations: Enable the deep investigation mode that uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues. Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management. Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs - capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting, as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. Automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment. GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle. Ready to get started? Azure SRE Agent home page Product overview Pricing Page Pricing Calculator Pricing Blog Demo recordings Deployment samples What’s Next? Give us feedback: Your feedback is critical - You can Thumbs Up / Thumbs Down each interaction or thread, or go to the “Give Feedback” button in the agent to give us in-product feedback - or you can create issues or just share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent. We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!4.6KViews1like0Comments