cloud native
From "Maybe Next Quarter" to "Running Before Lunch" on Container Apps - Modernizing Legacy .NET App
In early 2025, we wanted to modernize Jon Galloway's MVC Music Store - a classic ASP.NET MVC 5 app running on .NET Framework 4.8 with Entity Framework 6. The goal was straightforward: address vulnerabilities, enable managed identity, and deploy to Azure Container Apps and Azure SQL. No more plaintext connection strings. No more passwords in config files.

We hit a wall immediately. Entity Framework on .NET Framework did not support Azure.Identity or DefaultAzureCredential. We could not just add a NuGet package and call it done - we'd need EF Core, which means modern .NET, and with it a rewrite of the data layer, the identity system, the startup pipeline, and the views. The engineering team estimated one week of dedicated developer work. As a product manager without extensive .NET modernization experience, I wasn't able to complete it quickly on my own, so the project was placed in the backlog.

This was before GitHub Copilot "Agent" mode. GitHub Copilot app modernization (a specialized agent with skills for modernization) existed but only offered assessment - it could tell you what needed to change, but couldn't make the end-to-end changes for you.

Fast-forward one year. The full modernization agent is available. I sat down with the same app and the same goal. A few hours later, it was running on .NET 10 on Azure Container Apps with managed identity, Key Vault integration, and zero plaintext credentials. Thank you, GitHub Copilot app modernization! And while we were at it, GitHub Copilot helped modernize the experience as well, built more tests, and generated more synthetic data for testing.

Why Azure Container Apps?
Azure Container Apps is an ideal deployment target for this modernized MVC Music Store application because it provides a serverless, fully managed container hosting environment. It abstracts away infrastructure management while natively supporting the key security and operational features this project required.
It pairs naturally with infrastructure-as-code deployments, and its per-second billing on a consumption plan keeps costs minimal for a lightweight web app like this, eliminating the overhead of managing Kubernetes clusters while still giving you the container portability that modern .NET apps benefit from. That is why I asked Copilot to modernize to Azure Container Apps - here's how it went - Phase 1: Assessment GitHub Copilot App Modernization started by analyzing the codebase and producing a detailed assessment: Framework gap analysis - .NET Framework 4.0 → .NET 10, identifying every breaking change Dependency inventory - Entity Framework 6 (not EF Core), MVC 5 references, System.Web dependencies Security findings - plaintext SQL connection strings in Web.config, no managed identity support API surface changes - Global.asax → Program.cs minimal hosting, System.Web.Mvc → Microsoft.AspNetCore.Mvc The assessment is not a generic checklist. It reads your code - your controllers, your DbContext, your views - and maps a concrete modernization path. For this app, the key finding was clear: EF 6 on .NET Framework cannot support DefaultAzureCredential. The entire data layer needs to move to EF Core on modern .NET to unlock passwordless authentication. Phase 2: Code & Dependency Modernization This is where last year's experience ended and this year's began. 
The agent performed the actual modernization:

Project structure:
- .csproj converted from legacy XML format to SDK-style targeting net10.0
- Global.asax replaced with Program.cs using minimal hosting
- packages.config → NuGet PackageReference entries

Data layer (the hard part):
- Entity Framework 6 → EF Core with Microsoft.EntityFrameworkCore.SqlServer
- DbContext rewritten with OnModelCreating fluent configuration
- System.Data.Entity → Microsoft.EntityFrameworkCore namespace throughout
- EF Core migrations generated from scratch
- Database seeding moved to a proper DbSeeder pattern with MigrateAsync()

Identity:
- ASP.NET Membership → ASP.NET Core Identity with ApplicationUser, ApplicationDbContext
- Cookie authentication configured through ConfigureApplicationCookie

Security (the whole trigger for this modernization):
- Azure.Identity + DefaultAzureCredential integrated in Program.cs
- Azure Key Vault configuration provider added via Azure.Extensions.AspNetCore.Configuration.Secrets
- Connection strings use Authentication=Active Directory Default - no passwords anywhere
- Application Insights wired through OpenTelemetry

Views:
- Razor views updated from MVC 5 helpers to ASP.NET Core Tag Helpers and conventions
- _Layout.cshtml and all partials migrated

The code changes touched every layer of the application. This is not a find-and-replace - it's a structural rewrite that maintains functional equivalence.

Phase 3: Local Testing
After modernization, the app builds, runs locally, and connects to a local SQL Server (or SQL in a container). EF Core migrations apply cleanly, the seed data loads, and you can browse albums, add to cart, and check out. The identity system works. The Key Vault integration gracefully skips when KeyVaultName isn't configured - meaning local dev and Azure use the same Program.cs with zero code branches.
Phase 4: azd up and Deployment to Azure
The agent also generates the deployment infrastructure:
- azure.yaml - AZD service definition pointing to the Dockerfile, targeting Azure Container Apps
- Dockerfile - Multi-stage build using mcr.microsoft.com/dotnet/sdk:10.0 and aspnet:10.0
- infra/main.bicep - Full IaC including:
  - Azure Container Apps with system + user-assigned managed identity
  - Azure SQL Server with Azure AD-only authentication (no SQL auth)
  - Azure Key Vault with RBAC, Secrets Officer role for the managed identity
  - Container Registry with ACR Pull role assignment
  - Application Insights + Log Analytics
  - All connection strings injected as Container App secrets - using Active Directory Default, not passwords

One command: azd up
It provisions everything, builds the container, pushes to ACR, and deploys to Container Apps. The app starts, runs MigrateAsync() on first boot, seeds the database, and serves traffic. Managed identity handles all auth to SQL and Key Vault. No credentials stored anywhere.

What Changed in a Year

| | Early 2025 | Now |
|---|---|---|
| Assessment | Available | Available |
| Automated code modernization | Semi-manual | ✅ Full modernization agent |
| Infrastructure generation | Semi-manual | ✅ Bicep + AZD generated |
| Time to complete | Weeks | ✅ Hours |

The technology didn't just improve incrementally. The gap between "assessment" and "done" collapsed. A year ago, knowing what to do and being able to do it were very different things. Now they're the same step.

Who This Is For
If you have a .NET Framework app sitting on a backlog because "the modernization is too expensive" - revisit that assumption. The process changed. GitHub Copilot app modernization helps you rewrite your data layer, generates your infrastructure, and gets you to azd up. It can help you generate tests to increase your code coverage. If you have feature requests, or want to further optimize the code for scale, bring your requirements, logs, or profile traces: you can take care of all of that during the modernization process.
MVC Music Store went from .NET Framework 4.0 with Entity Framework 6 and plaintext SQL credentials to .NET 10 on Azure Container Apps with managed identity, Key Vault, and zero secrets in code. In an afternoon. That backlog item might be a lunch break now 😊. Really. Find your legacy apps and try it yourself.

Next steps
- Modernize your .NET or Java apps with GitHub Copilot app modernization - https://aka.ms/ghcp-appmod
- Open your legacy application in Visual Studio or Visual Studio Code to start the process
- Deploy to Azure Container Apps - https://aka.ms/aca/start

How ACR Runs Multi-Tenancy at Scale: Stamp Rebalancing and Why You Never See It Happen
By Johnson Shi, Richard Yuan, Yi Zha, Susan Shi, Jeanine Burke, Bin Du, Clark Porter, Bernie Harris, Eric Du Introduction Two of the most common questions we hear from teams running container workloads at scale on Azure Container Registry (ACR) are: "How does ACR keep my registry's performance predictable when I'm sharing infrastructure with thousands of other tenants?" — Cloud services are inherently multi-tenant. What does ACR actually do to keep my workload from competing with my neighbors? "What happens when one tenant's workload grows large enough to affect the shared infrastructure?" — Is there an active intervention, or does the system just absorb the noise? In this post, we clarify how ACR runs its multi-tenant fleet: the stamp architecture that underpins ACR's infrastructure in every Azure region, the practice of proactively rebalancing registries between stamps when one stamp gets hot, and the additional stamp isolation options available for exceptional workloads. Running multi-tenancy well at scale isn't passive — it's an active operational practice, and customers benefit from it every day without seeing it happen. Key Takeaways An ACR registry can be geo-replicated: a registry can have geo-replicas (which are both read and write-enabled) in multiple Azure regions. Each geo-replica is served by an ACR stamp — independent deployment units that underpin ACR regional infrastructure, each made up of VMSS-backed compute pools and a pool of storage accounts, that together serve many registries belonging to many tenants. Stamps are simultaneously a capacity pool, a fault domain, and an update domain. When a stamp gets hot, ACR proactively rebalances by moving registries to a less-utilized stamp in the same region. The registry endpoint does not change; the move is transparent to the customer. 
For exceptional workloads where rebalancing alone would just transfer the problem, ACR can provide additional stamp isolation — placing registries on stamps with fewer co-tenants, providing better traffic isolation, fault domain separation, and update domain independence. This also structurally improves the stamps the tenant used to share with everyone else. ACR engineering uses a mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals (operational telemetry) to decide when to rebalance stamps. Hot-node P95 CPU, discussed in this post, is one of the proactive signals we use — for each 1-minute bin, take the hottest node's average CPU, then percentile across bins. Pool-average hides per-node hot-spotting; single-sample Max is too noisy. All of this is currently manual. Rebalancing decisions, migrations, and isolation provisioning are operator-driven today. We are actively investing in standardizing and automating the practice — automated stamp rebalancing and lifecycle management are on the roadmap. Background What is a stamp? A stamp is ACR's unit of deployment within a region. At a high level, ACR has the following components within a region to serve registry data plane operations: VMSS-backed compute pools. Virtual Machine Scale Sets are Azure's primitive for running a managed group of identical VMs that autoscale together. Each stamp has a pool of VMs that handle authentication, manifest operations, tag resolution, and registry-side metadata — the coordination layer of a container pull — plus a separate pool of VMs running the dataproxy component, which sits between clients and storage. For private endpoint pulls, when a client pulls a layer, the dataproxy fetches from storage (or its local cache) and streams the bytes back; it is effectively a private endpoint and streaming cache layered together. A pool of storage accounts. 
Each ACR region has its own set of Azure Storage accounts that hold the actual blob (layer) data and manifest content for the geo-replicas residing on them. Storage accounts are multi-tenant within a stamp and region - multiple registries' blobs may land in the same group of accounts, with strict multi-tenant isolation controls and authorization enforcement.

Each ACR region typically contains multiple stamps serving many tenants' registries. For geo-replicated registries, a geo-replica in a region is bound to exactly one underlying ACR stamp. A geo-replicated registry's global endpoint (<registry>.azurecr.io), geo-replica regional endpoints, and geo-replica dedicated data endpoints are resolved via DNS - backed by ACR's own Traffic Manager profile - to a specific stamp serving that region's geo-replica.

The key conceptual point: a stamp is simultaneously a capacity pool (autoscale operates on it), a fault domain (incidents on the stamp affect all its tenants), and an update domain (rollouts progress through update domains within the stamp). When we move a registry between stamps in the same region, we are moving it between all three at once - and the customer's endpoint URLs do not change. From the customer's perspective, the migration is fully seamless: there are no endpoint changes, no DNS updates to make, and no action required on their part. The registry continues to work exactly as before, and the customer does not need to know or care that the underlying stamp has changed.

Why multi-tenancy at scale is an active practice
The naive picture is: provision enough capacity, autoscale handles the rest. This works in steady state. It does not work when one tenant's workload grows enough to systematically influence stamp behavior, when traffic shape is bursty enough that averages understate peaks, or when a single large tenant's blast radius becomes uncomfortably concentrated on a shared stamp. None of these is something a passive autoscaler will fix.
They require an operator decision: this registry would be better served on that stamp. ACR engineering does this continuously — from routine rebalancing to providing additional isolation for exceptional workloads. How We Do It: Stamp Rebalancing Stamp rebalancing — a recurring practice Several signals can trigger a stamp rebalancing decision — reactive signals such as sustained errors, outages, throttling that customers observe or that we observe in our own telemetry, low throughput on a stamp, or proactive signals like hot-node P95 CPU (described in this post below) breaching a threshold. The most recent rebalancing work used hot-node P95 as the proactive trigger; other rebalancing decisions have been driven by the reactive signals just listed. When any of these fires, ACR engineering identifies the registries contributing most to the problem and picks one or more to move to a less-utilized stamp in the same region. The mechanism is straightforward: we initiate elevated operator actions, the control plane re-binds the registry's home_stamp field, DNS routing follows, in-flight requests on the source stamp drain in 30–60 seconds, and new traffic lands on the destination stamp. The cutover takes minutes. The customer's registry endpoint does not change. Most customers never know it happened; the ones whose registry moved typically see better latency afterward. Rebalancing to an existing cooler stamp is a recurring practice that resolves most multi-tenant pressure. For exceptional workloads where rebalancing to another shared stamp would just transfer the problem, ACR may provide additional stamp isolation — placing registries on stamps with fewer co-tenants, giving the tenant better traffic isolation, fault domain separation, and update domain independence while also structurally improving the stamps that tenant used to share with everyone else. 
Rebalancing at different scales ACR applies rebalancing across a spectrum of scenarios, from moving a handful of registries to a cooler stamp to providing additional stamp isolation for exceptional workloads. The decision criterion is workload size relative to the shared fleet — if moving a tenant to a different shared stamp would just transfer the hot-stamp problem to the destination, additional stamp isolation is the right answer. For everyone else, rebalancing to an existing stamp is sufficient. Both are manual today; both stamp provisioning and rebalancing mechanisms described are on ACR's roadmap to be automated with less operator involvement. Hot-node P95: one of the signals we use proactively Rebalancing decisions are driven by a mix of reactive and proactive signals. Reactive signals — outages, sustained error rates, frequent throttling, low throughput that customers report or that we see in our own telemetry — are the obvious triggers. But waiting for these means waiting for a customer-visible problem. Proactive signals let us intervene before that happens. Hot-node P95 CPU, showcased in this post, is one of the proactive signals we use, and it was the primary signal for the most recent rebalancing work described in the example below. The choice of CPU metric matters. Three candidates: Pool-average CPU. Averages every node in the pool. Hides per-node hot-spotting — a pool with 6% average CPU can still have one node at 99%. Single-sample Max CPU. The highest 1-minute sample. Captures spikes, but is dominated by single-bin noise that doesn't represent sustained load. Hot-node P95 CPU. For each 1-minute bin, take the hottest node's average CPU. Then percentile across bins over a representative 12-hour peak window. This is "how hot is the worst node, most of the time." Hot-node P95 captures sustained per-node load without being noisy, and it tracks customer-visible behavior more closely than either alternative. 
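The three candidate metrics above can be compared side by side on the same data. What follows is a minimal Python sketch, not ACR's telemetry pipeline: the sample layout (timestamp, node id, CPU percent), the function names, the 100-bin synthetic window, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def _bin_node_averages(samples, bin_seconds=60):
    """Average CPU per (time bin, node). `samples` holds (ts_seconds, node_id, cpu_percent)."""
    grouped = defaultdict(list)
    for ts, node, cpu in samples:
        grouped[(int(ts // bin_seconds), node)].append(cpu)
    return {key: mean(vals) for key, vals in grouped.items()}


def pool_average_cpu(samples):
    """Pool-average CPU: every sample from every node, averaged together."""
    return mean(cpu for _, _, cpu in samples)


def max_cpu(samples, bin_seconds=60):
    """Single-sample Max: the highest per-node bin average seen anywhere."""
    return max(_bin_node_averages(samples, bin_seconds).values())


def hot_node_p95(samples, bin_seconds=60, percentile=95):
    """For each bin take the hottest node's average, then take a percentile across bins."""
    hottest = {}
    for (bin_id, _node), avg in _bin_node_averages(samples, bin_seconds).items():
        hottest[bin_id] = max(hottest.get(bin_id, 0.0), avg)
    series = sorted(hottest.values())
    if not series:
        raise ValueError("no samples")
    # Nearest-rank percentile (an assumed method; the post does not name one).
    rank = max(1, math.ceil(percentile * len(series) / 100))
    return series[rank - 1]


# Synthetic data: node "a" idles at 10%; node "b" runs at 40% with a single 96% spike.
samples = [(60 * i, "a", 10.0) for i in range(100)]
samples += [(60 * i, "b", 96.0 if i == 0 else 40.0) for i in range(100)]
```

On this synthetic input the pool average sits near 25% and hides node "b" entirely, Max reports the one-bin 96% spike, and hot-node P95 reports the 40% that the hottest node sustains for most of the window.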
A concrete illustration from a recent regional resize: on one shared stamp's dataproxy pool, Max CPU touched 96% — alarming if read alone. But hot-node P95 was 43%, meaning most of the time even the hottest node was comfortably loaded; the 96% was a single 1-minute spike. Using Max as the operating signal would have triggered an unnecessary intervention. Using pool-average would have missed real hot-spotting elsewhere. Hot-node P95 is the right operating point for this particular signal — and it is one input among several that feed the broader rebalancing decision. A Recent Example: Rebalancing Large AI Workloads for Additional Isolation We recently completed the rebalancing of registries belonging to one of the largest AI workloads in the region, providing additional isolation to address the scale of their traffic. The customer's workload had grown to the point where its presence on the shared stamps was systematically influencing stamp behavior — variability that affected their own pull latency, and variability that affected every other tenant on the same shared stamps. The customer had 40 registries homed across two shared stamps in the region, with a severely long-tailed traffic distribution: the top four registries carried 96.7% of the customer's traffic. When that much load is concentrated in four registries, the migration cannot proceed as one batch. We moved them in phases, smallest to largest, with observation windows between phases: Idle and small-traffic tail first — about thirty low-traffic registries, used to validate the cutover tooling against the destination stamp. Medium-traffic registries next — in sub-batches with 24 hours of observation between them. The top four, one at a time — each individually with 48 hours of observation between cutovers. Order: smallest to largest, so each cutover was a sanity check at increasing load. 
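The phased ordering above can be written down as a small planning routine. This is a hypothetical Python sketch, not ACR's cutover tooling: the `Phase` type, `plan_phases`, the tail-share threshold, the medium sub-batch size, and the observation window after the tail phase are assumptions. Only the smallest-to-largest order, the 24-hour and 48-hour windows, and the one-at-a-time handling of the top four come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    registries: list
    observe_hours: int  # observation window after this cutover


def plan_phases(traffic, top_n=4, tail_share=0.0005, medium_batch=5, tail_observe_hours=24):
    """Order cutovers smallest to largest. `traffic` maps registry name -> request volume."""
    total = sum(traffic.values())
    ordered = sorted(traffic, key=traffic.get)  # ascending by traffic
    split = max(0, len(ordered) - top_n)
    rest, top = ordered[:split], ordered[split:]

    # Tail: registries whose share of total traffic is at or below the (assumed) threshold.
    tail = [r for r in rest if total == 0 or traffic[r] / total <= tail_share]
    medium = [r for r in rest if r not in tail]

    phases = []
    if tail:
        phases.append(Phase("tail", tail, tail_observe_hours))
    for i in range(0, len(medium), medium_batch):
        phases.append(Phase("medium", medium[i:i + medium_batch], 24))
    for r in top:  # the largest registries move individually, smallest of them first
        phases.append(Phase("top", [r], 48))
    return phases
```

Because every phase carries its own observation window, a regression seen after a small cutover can halt the plan before the heaviest registries move.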
The cumulative effect on the shared stamps the customer had previously occupied:

| Shared stamp + pool | Hot-node P95 CPU change | Max CPU change |
|---|---|---|
| Stamp A - registry pool | -7% | flat |
| Stamp A - dataproxy pool | -34% | 96% → 64% |
| Stamp B - registry pool | -33% | -3 percentage points |
| Stamp B - dataproxy pool | -44% | -5 percentage points |

Stamp A dataproxy is the headline. The hottest node went from briefly touching 96% to maxing out at 64%, with sustained hot-node P95 dropping from 43% to 28.5%. Every other tenant homed on Stamp A - most with no idea this rebalancing happened - now runs on a structurally healthier pool, with more headroom, lower tail latency under load, and lower risk of CPU-driven incidents during traffic spikes. Stamp B saw similar relief.

After the rebalancing, we right-sized the shared stamps downward - lowering the VMSS minimum instance count on each to match the new traffic level. Hot-node P95 was the primary signal driving this resize work, the same proactive signal that motivated the rebalancing in the first place: when hot traffic leaves a shared stamp, capacity right-sizing follows.

Findings
ACR runs this recurring stamp rebalancing practice for one reason: to give customers more guaranteed performance - higher and more predictable pull throughput, lower tail latency, better fault and update isolation - whether through routine rebalancing or additional isolation for exceptional workloads. Every tenant on the rebalanced stamps gets more headroom, more predictable behavior under load, and a smaller blast radius for any single incident or rollout. Three things happen continuously in any ACR region to make this real: registries get rebalanced between stamps as load patterns shift, exceptional workloads get additional stamp isolation when no shared stamp can absorb them sustainably, and stamps get continuously right-sized when load enters or leaves.
All three are operator-driven today, all three are being invested in for automation, and all three are guided by a combination of reactive signals (outages, errors, throttling) and proactive signals (hot-node P95 CPU is one of them).

The thesis is straightforward: cloud multi-tenancy at scale is not a passive property of the architecture. It is an active operational practice that exists to give customers guaranteed performance and predictable behavior. The customers who benefit most from it are usually the customers who never notice it's happening.

Summary

| Question | Answer |
|---|---|
| How does ACR keep multi-tenant performance predictable at scale? | By actively moving registries between stamps as load shifts - rebalancing in the common case, providing additional isolation for exceptional workloads. |
| What is a stamp? | An ACR deployment unit within a region's geo-replica: VMSS-backed registry and dataproxy compute pools plus a pool of storage accounts. Simultaneously a capacity pool, fault domain, and update domain. A region typically contains multiple stamps. |
| Do customers see when their registry moves between stamps? | No. Stamps are within a region; the global endpoint and any regional endpoint URLs do not change. The cutover takes minutes; in-flight requests drain in 30–60 seconds. |
| Does providing additional isolation only help the isolated tenant? | No - every other tenant who was sharing a stamp with that workload also benefits, because the largest source of variability has been removed from the shared fleet. |
| What signals drive these decisions? | A mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals from our own telemetry. Hot-node P95 CPU - the 95th percentile, across a 12-hour peak window, of the hottest node's CPU in each 1-minute bin - is one of the proactive signals, and it was the primary signal for the most recent rebalancing work. |
| Is all of this automated? | Not yet. Rebalancing, isolation provisioning, and migrations are operator-driven today. Standardizing and automating these practices is an active investment. |

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself. Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis. First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives. The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline. That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible. Where we started In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction. 
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start. We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale. The Inversion: Three bets The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself. The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. 
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last. Bet 1: The Filesystem as the Agent's World Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it. Code Repositories: the highest-leverage context Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it. 
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible.

But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:
- Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.
- Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.
- Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes - the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store
Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out.

But the deeper problem was retrieval. In SRE context, embedding similarity is a weak proxy for relevance. "KV cache regression" and "prompt prefix instability" may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance.

We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed.

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary
The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism.

Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with the gh CLI, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free. Bet 2: Context Layering Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action. Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration. Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions. Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed. Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope. Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start. Bet 3: Frugal Context Management Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast. 
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal. Tool result compression via the filesystem Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive. Context Pruning and Auto Compact Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work. Parallel subagents The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another. The Feedback loop These architectural bets have enabled us to close the original scaling gap. 
Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, and 400s from code bugs that produced malformed payloads. These paper cuts caused investigations to stall midway, and some conversations broke entirely. So we set up a daily monitoring task for these failures. The agent searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

Analyzed our user churn rate and built dashboards we now review weekly.
Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
Ran security analysis and found vulnerabilities in the read path.
Helped fill out parts of its own Responsible AI review, with strict human review.
Handled customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.

You Can Scale MCP Servers Behind a Load Balancer on App Service — Here's How
Most MCP servers in the wild are single-instance processes. That's fine when they're driving a local Claude or VS Code session — but it's the wrong shape for a production agent fleet that has to absorb traffic spikes, ride through deploys, and survive instance failures. The good news: the MCP spec already grew up. The 2025-06-18 revision formalizes stateless HTTP transport (and the current 2025-11-25 revision keeps it), which means a single request carries everything the server needs to answer. No long-lived connection, no in-process session table, no sticky-session hacks to keep a client glued to one box. That tiny protocol change unlocks something big: you can stick an MCP server behind App Service's built-in load balancer and scale it like any other web API. This post walks through how, with a runnable sample. Sample: seligj95/app-service-mcp-stateless-scale-python. One azd up and you have a stateless FastAPI MCP server running on three App Service instances behind the platform load balancer, with a staging slot, Application Insights, and a k6 script that visualizes load distribution from the client side. Why "stateless" is the whole story Earlier MCP transports leaned on persistent connections — SSE channels and WebSocket-style sessions where the server held per-client state in memory (open tools, subscriptions, partial streams). That model is great for a local IDE talking to a local process. It's hostile to load balancing, because routing a follow-up request to a different instance breaks the session. The stateless HTTP transport flips that. Each request is a complete JSON-RPC envelope ( initialize , tools/list , tools/call ), every response is self-contained, and the server is allowed to forget the client between requests. Any instance can serve any call. That is the property a load balancer needs. 
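To make the stateless property concrete, here is a minimal sketch in plain Python. It is not code from the sample: the Instance class and the add tool are hypothetical stand-ins invented for illustration. The only point it demonstrates is that a self-contained JSON-RPC envelope gets the same answer from any instance, because no instance keeps per-client state.

```python
import json
from typing import Any, Dict, Optional


def make_envelope(method: str, msg_id: int,
                  params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a self-contained JSON-RPC 2.0 request."""
    envelope: Dict[str, Any] = {"jsonrpc": "2.0", "id": msg_id, "method": method}
    if params is not None:
        envelope["params"] = params
    return envelope


class Instance:
    """Hypothetical stand-in for one server instance. It stores nothing per client."""

    def __init__(self, name: str) -> None:
        self.name = name

    def handle(self, envelope: Dict[str, Any]) -> Dict[str, Any]:
        params = envelope.get("params") or {}
        # "add" is an illustrative tool, not one of the sample's tools.
        if envelope["method"] == "tools/call" and params.get("name") == "add":
            args = params.get("arguments", {})
            result = args["a"] + args["b"]
            return {
                "jsonrpc": "2.0",
                "id": envelope["id"],
                "result": {"content": [{"type": "text", "text": json.dumps(result)}]},
            }
        return {
            "jsonrpc": "2.0",
            "id": envelope["id"],
            "error": {"code": -32601, "message": "Method not found"},
        }


# The same request, routed to three different "instances".
request = make_envelope("tools/call", 1, {"name": "add", "arguments": {"a": 2, "b": 3}})
answers = [Instance(n).handle(request) for n in ("instance-a", "instance-b", "instance-c")]
all_equal = all(answer == answers[0] for answer in answers)
```

Because handle reads everything it needs from the envelope, a load balancer is free to send each call anywhere.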
In the sample, every tool is a pure function of its arguments: whoami reports the serving instance, lookup_fact reads a static dictionary, compute_primes runs a sieve. None of them touches per-client memory. That's not a constraint of the protocol; it's a discipline you adopt to keep statelessness intact.

Why App Service, and not Functions or AKS

Functions and AKS are two of the many great options for hosting an MCP server, depending on what the server is used for. The use case we are discussing here is a scaled MCP server, i.e. an MCP server that must reach a large and broad audience. Here are a few defaults that make App Service a solid option for this scenario:

Always On. Reasoning tools call into LLMs and external APIs; latencies routinely sit in the multi-second range. On the Consumption plan, Functions caps a single execution at five minutes by default (ten at most) and aggressively scales workers to zero between bursts, which kills warm caches. App Service keeps the process resident.

Horizontal scale is one parameter. Pick a Premium SKU, set the plan's capacity to N, and you have N instances behind a managed load balancer. No VMSS to declare, no ingress controller to wire up, no Service to reconcile.

Deployment slots. Swap a warmed-up staging slot into production for zero-downtime deploys. Critical when your "API" is an LLM tool surface that an agent is actively driving.

Easy Auth. OAuth 2.1 in front of the MCP endpoint without writing the flow yourself: turn on the App Service authentication blade and point it at Entra ID. The sample leaves this off so the deploy is one command, but the wiring is a checkbox away.

The TL;DR: it's PaaS that already knows how to run a stateless long-lived process at horizontal scale, which is exactly the shape of a scaled MCP server.

The FastAPI MCP server, end-to-end stateless

The whole transport is one POST handler.
The full source is in main.py, but here are the load-bearing pieces:

    @app.post("/mcp")
    async def mcp_endpoint(request: Request):
        body = await request.json()
        method = body.get("method", "")
        msg_id = body.get("id")

        if method == "initialize":
            return {"jsonrpc": "2.0", "id": msg_id, "result": _server_info()}

        if method == "tools/list":
            return {"jsonrpc": "2.0", "id": msg_id, "result": {"tools": [...]}}

        if method == "tools/call":
            params = body.get("params", {})
            result = await MCP_TOOLS[params["name"]]["function"](**params.get("arguments", {}))
            return {
                "jsonrpc": "2.0",
                "id": msg_id,
                "result": {"content": [{"type": "text", "text": json.dumps(result)}]},
            }

There is no session table. There is no client_id cookie. There is no AsyncIterator held open between requests. initialize, tools/list, and tools/call all return in a single round trip, which is the shape App Service's load balancer expects.

The most useful debugging tool in the sample is whoami:

    async def tool_whoami() -> Dict[str, Any]:
        return {
            "instance_id": os.environ.get("WEBSITE_INSTANCE_ID", "local"),
            "hostname": socket.gethostname(),
            ...
        }

WEBSITE_INSTANCE_ID is unique per App Service worker. Call whoami a few times from your MCP client and the value rotates, which is the load balancer working. If it doesn't rotate, something is pinning your traffic (almost always the ARR Affinity cookie; we'll get there).
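That rotation check can also be run from a script rather than an MCP client. The sketch below is one way to do it with the standard library only. It assumes the tool is registered under the name whoami and that the response has the shape the handler above returns; check_rotation and the other helper names are made up for this example. Note that urllib does not store cookies by default, so an ARRAffinity cookie set by the server is not replayed on later calls.

```python
import json
import urllib.request
from collections import Counter
from typing import Any, Dict


def whoami_request(msg_id: int) -> bytes:
    """Serialize a tools/call envelope for the (assumed) tool name 'whoami'."""
    envelope = {
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "tools/call",
        "params": {"name": "whoami", "arguments": {}},
    }
    return json.dumps(envelope).encode("utf-8")


def instance_id_from(response: Dict[str, Any]) -> str:
    """Pull instance_id out of the JSON text the handler wraps in content[0]."""
    payload = json.loads(response["result"]["content"][0]["text"])
    return payload["instance_id"]


def check_rotation(base_url: str, calls: int = 12) -> Counter:
    """POST whoami `calls` times and count how often each instance answered."""
    hits: Counter = Counter()
    for i in range(calls):
        req = urllib.request.Request(
            base_url.rstrip("/") + "/mcp",
            data=whoami_request(i),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            hits[instance_id_from(json.load(resp))] += 1
    return hits


# Offline check of the parsing logic against a hand-written response.
sample = {
    "jsonrpc": "2.0",
    "id": 0,
    "result": {"content": [{"type": "text", "text": json.dumps({"instance_id": "local"})}]},
}
parsed_id = instance_id_from(sample)
```

Calling check_rotation("https://<your-app>.azurewebsites.net") against a multi-instance deployment should return more than one key if traffic is spreading; a single key suggests something is pinning requests to one worker.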
The Bicep that actually makes it scale

The infra is a P0v3 plan with capacity: 3, a web app with affinity disabled, and a staging slot on the same plan:

    resource appServicePlan 'Microsoft.Web/serverfarms@2024-04-01' = {
      name: name
      sku: {
        name: 'P0v3'
        capacity: instanceCount // 3 by default
      }
      properties: {
        reserved: true
      }
    }

    resource web 'Microsoft.Web/sites@2024-04-01' = {
      name: name
      properties: {
        serverFarmId: appServicePlanId
        httpsOnly: true
        clientAffinityEnabled: false // ← the one line that matters
        siteConfig: {
          linuxFxVersion: 'PYTHON|3.11'
          alwaysOn: true
          healthCheckPath: '/health'
          appCommandLine: 'python -m uvicorn main:app --host 0.0.0.0 --port 8000'
        }
      }
    }

    resource staging 'Microsoft.Web/sites/slots@2024-04-01' = {
      parent: web
      name: 'staging'
      properties: { /* same shape: separate hostname, same plan */ }
    }

The single most important line in that template is clientAffinityEnabled: false. Client affinity defaults to on in App Service, which sets the ARRAffinity cookie and pins every subsequent request from a given client to the instance that handled the first one. That default exists because legacy ASP.NET apps used in-process session state. Stateless MCP does not. Leaving affinity on silently undoes everything we just built.

Premium v3 (P0v3) is the floor for two reasons: it gives Always On and unlocks deployment slots. Below that tier you don't get either.

Application Insights without writing telemetry code

The sample drops one line of bootstrap into main.py:

    import os

    from azure.monitor.opentelemetry import configure_azure_monitor

    if os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"):
        configure_azure_monitor(logger_name="mcp")

The Azure Monitor OpenTelemetry distro auto-instruments FastAPI and outbound HTTP. Every request span App Service emits is tagged with cloud_RoleInstance, which Application Insights populates from WEBSITE_INSTANCE_ID. That makes the question "is traffic actually spreading across my instances?"
a one-liner in Logs:

    requests
    | where timestamp > ago(15m)
    | where name contains "/mcp"
    | summarize count() by cloud_RoleInstance
    | order by count_ desc

If you see three roughly-equal rows, you're done. If you see one row, your client is sending ARRAffinity cookies; turn affinity off and redeploy.

Deploy

    azd auth login
    azd up

That provisions the resource group, plan, web app, staging slot, Log Analytics workspace, and Application Insights resource, then deploys the Python app via Oryx. The output prints both WEB_URI and WEB_STAGING_URI. Open the production URI: the home page renders the instance ID that served it. Refresh. The ID changes.

To swap the staging slot into production with no downtime:

    az webapp deployment slot swap \
      --resource-group <rg> --name <app> \
      --slot staging --target-slot production

App Service warms the staging instances, redirects traffic, and the old production becomes the new staging. It is the classic blue-green pattern, but free.

Prove it scales

The sample ships a k6 script that hammers /mcp with tools/call requests and tags every response with the instance_id the server returned:

    BASE_URL=https://<your-app>.azurewebsites.net \
      k6 run --summary-export=summary.json loadtest/k6-mcp.js
    jq '.metrics.mcp_instance_hits.values' summary.json

The output groups hits per instance tag. On a three-instance plan with a 60-second steady load you should see something close to:

    {
      "count": 1842,
      "instance0d3e2f...": 614,
      "instance7a91bc...": 612,
      "instance19f0c4...": 616
    }

Roughly 33% on each box: the App Service load balancer round-robining new connections, with no help from the application.

What I'd do next

The sample is intentionally a starting point. Two extensions are the obvious next moves:

Add Easy Auth. Turn on App Service authentication, pick Entra ID, require auth on /mcp. The token surfaces as headers; your tool handlers can use it to identify the calling agent without you owning any of the OAuth machinery.

Autoscale on CPU.
instanceCount: 3 is a starting point. Wire up Microsoft.Insights/autoscalesettings against the plan and let it scale from 3 to 10 instances when the prime-counting tool drives CPU up. The architecture already supports it; that's the whole point of stateless.

Try it

Sample repo: github.com/seligj95/app-service-mcp-stateless-scale-python
MCP spec: modelcontextprotocol.io/specification/2025-11-25
App Service docs: learn.microsoft.com/azure/app-service/overview

If you ship something with it, I'd love to hear how it held up.

Announcing Public Preview of Argo CD extension in AKS Azure Portal Experience
We are excited to announce the public preview of Argo CD in the Azure Portal for Azure Kubernetes Service. As GitOps becomes the standard for deploying and operating applications at scale, customers need a way to adopt GitOps with simpler onboarding, secure defaults, and integrated workflows. With Argo CD now available directly in the Portal, teams can enable and manage GitOps without the complexity of manual setup. Bringing GitOps into the AKS experience Argo CD is widely used across Kubernetes environments, but setup often requires manual configuration across identity, networking, and registry integrations. With the Azure Portal experience, customers can: Enable Argo CD directly from the AKS cluster Configure identity, access, ingress, and registry integration in a guided flow Manage and monitor GitOps workflows through Argo CD UI This reduces onboarding friction and helps you reach your first successful GitOps deployment faster. Trusted identity and secure access The Argo CD experience integrates with Microsoft Entra ID to provide a secure, enterprise-ready foundation: Secure authentication using Workload Identity federation to Azure Container Registry (ACR) and Azure DevOps, removing long-lived credentials and hard-coded secrets Single Sign-On (SSO) using existing Azure identities Enterprise-grade hardening and security This preview includes built-in improvements to strengthen security posture: Images built on Azure Linux for reduced CVEs and improved baseline security Optional automatic patch updates to stay current while maintaining control over change management Parity with upstream Argo CD Argo CD in AKS remains aligned with the upstream open-source project, supporting: High availability (HA) configurations for production workloads Hub-and-spoke architectures for multi-cluster GitOps Application and ApplicationSet for scalable deployment across fleets Getting Started We invite you to explore the Argo CD experience in the Azure Portal and share feedback. 
To get started, go to your AKS cluster in the Azure Portal, navigate to the GitOps experience, and select Enable Argo CD. Follow the guided setup to configure identity, access, ingress, and registry integration with secure defaults. Once enabled, you can monitor your deployment and view application health and sync status from the Argo CD UI linked in the GitOps blade. For customers who prefer automation and scripting, the Argo CD extension is also available in public preview through the Azure CLI.

NOTE: You can choose between Flux and Argo CD as your GitOps solution based on your needs. The Argo CD option is available during the initial GitOps setup experience, while existing Flux users will continue to see their current configuration.

Managing Multi‑Tenant Azure Resource with SRE Agent and Lighthouse
Azure SRE Agent is an AI‑powered reliability assistant that helps teams diagnose and resolve production issues faster while reducing operational toil. It analyzes logs, metrics, alerts, and deployment data to perform root cause analysis and recommend or execute mitigations with human approval. It can integrate with Azure services across the subscriptions and resource groups that you need to monitor and manage.

Today’s enterprise customers live in a multi-tenant world, for reasons that include acquisitions, complex corporate structures, managed service providers, and IT partners. Azure Lighthouse enables enterprise IT teams and managed service providers to manage resources across multiple Azure tenants from a single control plane. In this demo I will walk you through how to set up Azure SRE Agent to manage and monitor multi-tenant resources delegated through Azure Lighthouse.

Navigate to Azure SRE Agent and select Create agent. Fill in the required details along with the deployment region and deploy the SRE agent. Once the deployment is complete, hit Set up your agent. Select the Azure resources you would like your agent to analyze, such as resource groups or subscriptions. This opens a pop-up window where you select the subscriptions and resource groups under the same tenant that you would like SRE Agent to monitor and manage.

Great, so far so good 👍

As a Managed Service Provider (MSP), you have multiple tenants that you manage via Azure Lighthouse, and you need SRE Agent to have access to those. To demo this, we will need to set up Azure Lighthouse with the correct set of roles and configuration to delegate access to the management subscription where the centralized SRE agent is running.

In the Azure portal, search for Lighthouse. Navigate to the Lighthouse home page and select Manage your customers.
On the My customers overview, select Create ARM Template. Provide a Name and Description. Select Subscription as the Delegated scope. Select + Add authorization, which opens the Add authorization window. Select the Principal type; I am selecting User for demo purposes. The pop-up window lets you select users from a list. Select the checkbox next to the user you want to delegate the subscription to and hit Select. Then select the Role that you would like to assign to that user from the managing tenant in the delegated tenant, and select Add. You can add multiple roles by adding further authorizations for the selected user.

This step is important: the delegation must carry the right role for SRE Agent to add the subscription as an Azure source. Azure SRE Agent requires the Owner or User Access Administrator RBAC role to assign the subscription to its list of managed resources. If an appropriate role is not assigned, you will see an error when selecting the delegated subscriptions in SRE Agent's Managed resources. Per the Lighthouse role support documentation, the Owner role isn’t supported, and the User Access Administrator role is supported only for a limited purpose. Refer to the Azure Lighthouse documentation for additional information. If the role is not defined correctly, you might see an error stating:

🛑 Failed to add Role assignment: “The 'delegatedRoleDefinitionIds' property is required when using certain roleDefinitionIds for authorization.”

To allow a principalId to assign roles to a managed identity in the customer tenant, set its roleDefinitionId to User Access Administrator. Download the ARM template and add the specific Azure built-in roles that you want to grant in the delegatedRoleDefinitionIds property. You can include any supported Azure built-in role except for User Access Administrator or Owner.
This example shows a principalId with User Access Administrator role that can assign two built in roles to managed identities in the customer tenant: Contributor and Log Analytics Contributor. { "principalId": "00000000-0000-0000-0000-000000000000", "principalIdDisplayName": "Policy Automation Account", "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9", "delegatedRoleDefinitionIds": [ "b24988ac-6180-42a0-ab88-20f7382dd24c", "92aaf0da-9dab-42b6-94a3-d43ce8d16293" ] } In addition SRE agent would require certain roles at the managed identity level in order to access and operate on those services. Locate SRE agent User assigned managed identity and add roles to the service principal. For the demo purpose I am assigning Reader, Monitoring Reader, and Log Analytics Reader role. Here is the sample ARM template used for this demo. { "$schema": "https://schema.management.azure.com/schemas/2019-08-01/subscriptionDeploymentTemplate.json#", "contentVersion": "1.0.0.0", "parameters": { "mspOfferName": { "type": "string", "metadata": { "description": "Specify a unique name for your offer" }, "defaultValue": "lighthouse-sre-demo" }, "mspOfferDescription": { "type": "string", "metadata": { "description": "Name of the Managed Service Provider offering" }, "defaultValue": "lighthouse-sre-demo" } }, "variables": { "mspRegistrationName": "[guid(parameters('mspOfferName'))]", "mspAssignmentName": "[guid(parameters('mspOfferName'))]", "managedByTenantId": "6e03bca1-4300-400d-9e80-000000000000", "authorizations": [ { "principalId": "504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "e40ec5ca-96e0-45a2-b4ff-59039f2c2b59", "principalIdDisplayName": "Pranab Mandal" }, { "principalId": "504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9", "delegatedRoleDefinitionIds": [ "b24988ac-6180-42a0-ab88-20f7382dd24c", "92aaf0da-9dab-42b6-94a3-d43ce8d16293" ], "principalIdDisplayName": "Pranab Mandal" }, { "principalId": 
"504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c", "principalIdDisplayName": "Pranab Mandal" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "acdd72a7-3385-48ef-bd42-f606fba81ae7", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "43d0d8ad-25c7-4714-9337-8ba259a9fe05", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "73c42c96-874c-492b-b04d-ab87d138a893", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" } ] }, "resources": [ { "type": "Microsoft.ManagedServices/registrationDefinitions", "apiVersion": "2022-10-01", "name": "[variables('mspRegistrationName')]", "properties": { "registrationDefinitionName": "[parameters('mspOfferName')]", "description": "[parameters('mspOfferDescription')]", "managedByTenantId": "[variables('managedByTenantId')]", "authorizations": "[variables('authorizations')]" } }, { "type": "Microsoft.ManagedServices/registrationAssignments", "apiVersion": "2022-10-01", "name": "[variables('mspAssignmentName')]", "dependsOn": [ "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]" ], "properties": { "registrationDefinitionId": "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]" } } ], "outputs": { "mspOfferName": { "type": "string", "value": "[concat('Managed by', ' ', parameters('mspOfferName'))]" }, "authorizations": { "type": "array", "value": "[variables('authorizations')]" } } } Login to the customers tenant and navigate to the service provides from the Azure Portal. From the Service Providers overview screen, select Service provider offers from the left navigation pane. From the top menu, select the Add offer drop down and select Add via template. 
In the Upload Offer Template window, drag and drop or upload the template file that was created in the earlier step and hit Upload. Once the file is uploaded, select Review + Create. The template takes a few minutes to deploy, after which a successful deployment page is displayed. Navigate to Delegations from the Lighthouse overview and verify that you see the delegated subscription and the assigned role.

Once the Lighthouse delegation is set up, sign in to the managing tenant and navigate to the deployed SRE agent. Navigate to Azure resources from the top menu or via Settings > Managed resources. Select Add subscriptions to choose the customer subscriptions that you need SRE Agent to manage. Adding a subscription automatically adds the required permissions for the agent. Once the appropriate roles are added, the subscriptions are ready for the agent to manage and monitor resources within them.

Summary - Benefits

This blog post demonstrates how Azure SRE Agent can be used to centrally monitor and manage Azure resources across multiple tenants by integrating it with Azure Lighthouse, a common requirement for enterprises and managed service providers operating in complex, multi-tenant environments. The benefits include:

Centralized SRE operations across multiple Azure tenants
Secure, role-based access using delegated resource management
Reduced operational overhead for MSPs and enterprise IT teams
Unified visibility into resource health and reliability across customer environments

Announcing AWS with Azure SRE Agent: Cross-Cloud Investigation using the brand new AWS DevOps Agent
Overview Connect Azure SRE Agent to AWS services using the official AWS MCP server. Query AWS documentation, execute any of the 15,000+ AWS APIs, run operational workflows, and kick off incident investigations through AWS DevOps Agent, which is now generally available. The AWS MCP server connects Azure SRE Agent to AWS documentation, APIs, regional availability data, pre-built operational workflows (Agent SOPs), and AWS DevOps Agent for incident investigation. When connected, the proxy exposes 23 MCP tools organized into four categories: documentation and knowledge, API execution, guided workflows, and DevOps Agent operations. How it works The MCP Proxy for AWS runs as a local stdio process that SRE Agent spawns via uvx . The proxy handles AWS authentication using credentials you provide as environment variables. No separate infrastructure or container deployment is needed. In the portal, you use the generic MCP server (User provided connector) option with stdio transport. Key capabilities Area Capabilities Documentation Search all AWS docs, API references, and best practices; retrieve pages as markdown API execution Execute authenticated calls across 15,000+ AWS APIs with syntax validation and error handling Agent SOPs Pre-built multi-step workflows following AWS Well-Architected principles Regional info List all AWS regions, check service and feature availability by region Infrastructure Provision VPCs, databases, compute instances, storage, and networking resources Troubleshooting Analyze CloudWatch logs, CloudTrail events, permission issues, and application failures Cost management Set up billing alerts, analyze resource usage, and review cost data DevOps Agent Start AWS incident investigations, read root cause analyses, get remediation recommendations, and chat with AWS DevOps Agent Note: The AWS MCP Server is free to use. You pay only for the AWS resources consumed by API calls made through the server. All actions respect your existing IAM policies. 
Prerequisites Azure SRE Agent resource deployed in Azure AWS account with IAM credentials configured uv package manager installed on the SRE Agent host (used to run the MCP proxy via uvx ) IAM permissions: aws-mcp:InvokeMcp , aws-mcp:CallReadOnlyTool , and optionally aws-mcp:CallReadWriteTool Step 1: Create AWS access keys The AWS MCP server authenticates using AWS access keys (an Access Key ID and a Secret Access Key). These keys are tied to an IAM user in your AWS account. You create them in the AWS Management Console. Navigate to IAM in the AWS Console Sign in to the AWS Management Console In the top search bar, type IAM and select IAM from the results (Direct URL: https://console.aws.amazon.com/iam/ ) In the left sidebar, select Users (Direct URL: https://console.aws.amazon.com/iam/home#/users ) Create a dedicated IAM user Create a dedicated user for SRE Agent rather than reusing a personal account. This makes it easy to scope permissions and rotate keys independently. Select Create user Enter a descriptive user name (e.g., sre-agent-mcp ) Do not check "Provide user access to the AWS Management Console" (this user only needs programmatic access) Select Next Select Attach policies directly Select Create policy (opens in a new tab) and paste the following JSON in the JSON editor: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "aws-mcp:InvokeMcp", "aws-mcp:CallReadOnlyTool", "aws-mcp:CallReadWriteTool" ], "Resource": "*" } ] } Select Next, give the policy a name (e.g., SREAgentMCPAccess ), and select Create policy Back on the Create user tab, select the refresh button in the policy list, search for SREAgentMCPAccess , and check it Select Next > Create user Generate access keys After the user is created, generate the access keys that SRE Agent will use: From the Users list, select the user you just created (e.g., sre-agent-mcp ) Select the Security credentials tab Scroll down to the Access keys section Select Create access key For the 
use case, select Third-party service Check the confirmation checkbox and select Next Optionally add a description tag (e.g., Azure SRE Agent ) and select Create access key Copy both values immediately: Value Example format Where you'll use it Access Key ID <your-access-key-id> Connector environment variable AWS_ACCESS_KEY_ID Secret Access Key <your-secret-access-key> Connector environment variable AWS_SECRET_ACCESS_KEY Important: The Secret Access Key is shown only once on this screen. If you close the page without copying it, you must delete the key and create a new one. Select Download .csv file as a backup, then store the file securely and delete it after configuring the connector. Tip: For production use, also add service-specific IAM permissions for the AWS APIs you want SRE Agent to call. The MCP permissions above grant access to the MCP server itself, but individual API calls (e.g., ec2:DescribeInstances , logs:GetQueryResults ) require their own IAM actions. Start broad for testing, then scope down using the principle of least privilege. Required permissions summary Permission Description Required? aws-mcp:InvokeMcp Base access to the AWS MCP server Yes aws-mcp:CallReadOnlyTool Read operations (describe, list, get, search) Yes aws-mcp:CallReadWriteTool Write operations (create, update, delete resources) Optional Step 2: Add the MCP connector Connect the AWS MCP server to your SRE Agent using the portal. The proxy runs as a local stdio process that SRE Agent spawns via uvx . It handles SigV4 signing using the AWS credentials you provide as environment variables. Determine the AWS MCP endpoint for your region The AWS MCP server has regional endpoints. 
Choose the one matching your AWS resources: AWS Region MCP Endpoint URL us-east-1 (default) https://aws-mcp.us-east-1.api.aws/mcp us-west-2 https://aws-mcp.us-west-2.api.aws/mcp eu-west-1 https://aws-mcp.eu-west-1.api.aws/mcp Note: Without the --metadata AWS_REGION=<region> argument, operations default to us-east-1 . You can always override the region in your query. Using the Azure portal In Azure portal, navigate to your SRE Agent resource Select Builder > Connectors Select Add connector Select MCP server (User provided connector) and select Next Configure the connector with these values: Field Value Name aws-mcp Connection type stdio Command python3 Arguments -c , __import__('subprocess').check_call(['pip','install','-q','mcp-proxy-for-aws']);__import__('os').execlp('mcp-proxy-for-aws','mcp-proxy-for-aws','https://aws-mcp.us-east-1.api.aws/mcp','--metadata','AWS_REGION=us-west-2') Environment variables AWS_ACCESS_KEY_ID=<your-access-key-id> , AWS_SECRET_ACCESS_KEY=<your-secret-access-key> Select Next to review Select Add connector This is equivalent to the following MCP client configuration used by tools like Claude Desktop or Amazon Kiro CLI: { "mcpServers": { "aws-mcp": { "command": "uvx", "args": [ "mcp-proxy-for-aws@latest", "https://aws-mcp.us-east-1.api.aws/mcp", "--metadata", "AWS_REGION=us-west-2" ] } } } Important: Store the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY securely. In the portal, environment variables for connectors are stored encrypted. For production deployments, consider using a dedicated IAM user with scoped-down permissions (see Step 1). Never commit credentials to source control. Tip: If your SRE Agent host already has AWS credentials configured (e.g., via aws configure or an instance profile), the proxy will pick them up automatically from the environment. In that case, you can omit the explicit AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. 
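The regional endpoints and client configuration above follow a regular pattern, which can be sketched in Python. This is a minimal illustration rather than part of the official setup: the helper names `mcp_endpoint` and `client_config` are hypothetical, and the URL pattern is inferred only from the three endpoints listed in the table, so other regions are deliberately rejected.

```python
import json

# Regions listed in the endpoint table above. Other regions are not
# assumed to have an endpoint, so they are rejected instead of guessed.
KNOWN_REGIONS = ("us-east-1", "us-west-2", "eu-west-1")


def mcp_endpoint(region: str = "us-east-1") -> str:
    """Return the AWS MCP endpoint URL for a region from the table."""
    if region not in KNOWN_REGIONS:
        raise ValueError(f"no documented AWS MCP endpoint for {region!r}")
    return f"https://aws-mcp.{region}.api.aws/mcp"


def client_config(endpoint_region: str, default_region: str) -> dict:
    """Build the uvx-based MCP client configuration shown above.

    endpoint_region selects the MCP endpoint; default_region is passed
    as AWS_REGION metadata and becomes the default for operations.
    """
    return {
        "mcpServers": {
            "aws-mcp": {
                "command": "uvx",
                "args": [
                    "mcp-proxy-for-aws@latest",
                    mcp_endpoint(endpoint_region),
                    "--metadata",
                    f"AWS_REGION={default_region}",
                ],
            }
        }
    }


if __name__ == "__main__":
    # Reproduces the example configuration from this step.
    print(json.dumps(client_config("us-east-1", "us-west-2"), indent=2))
```

Keeping the endpoint region and the `AWS_REGION` metadata as separate parameters mirrors the example above, where the us-east-1 endpoint is used while operations default to us-west-2.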
Note: After adding the connector, the agent service initializes the MCP connection. This may take up to 30 seconds as uvx downloads the proxy package on first run (~89 dependencies). If the connector does not show Connected status after a minute, see the Troubleshooting section below. Step 3: Add an AWS skill Skills give agents domain knowledge and best practices for specific tool sets. Create an AWS skill so your agent knows how to troubleshoot AWS services, provision infrastructure, and follow operational workflows. Tip: Why skills over subagents? Skills inject domain knowledge into the main agent's context, so it can use AWS expertise without handing off to a separate agent. Conversation context stays intact and there's no handoff latency. Use a subagent when you need full isolation with its own system prompt and tool restrictions. Navigate to Builder > Skills Select Add skill Paste the following skill configuration: api_version: azuresre.ai/v1 kind: SkillConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: aws_infrastructure_operations display_name: AWS Infrastructure & Operations description: | AWS infrastructure and operations: EC2, EKS, Lambda, S3, RDS, CloudWatch, CloudTrail, IAM, VPC, and others. Also covers AWS DevOps Agent for incident investigation, root cause analysis, and remediation. Use for querying AWS resources, investigating issues, provisioning infrastructure, searching documentation, running AWS API calls via the AWS MCP server, and coordinating investigations between Azure SRE Agent and AWS DevOps Agent. instructions: | ## Overview The AWS MCP Server is a managed remote MCP server that gives AI assistants authenticated access to AWS services. It combines documentation access, authenticated API execution, and pre-built Agent SOPs in a single interface. **Authentication:** Handled automatically by the MCP Proxy for AWS, running as a local stdio process. 
All actions respect existing IAM policies configured in the connector environment variables. **Regional endpoints:** The MCP server has regional endpoints. The proxy is configured with a default region; you can override by specifying a region in your queries (e.g., "list my EC2 instances in eu-west-1"). ## Searching Documentation Use aws___search_documentation to find information across all AWS docs. ## Executing AWS API Calls Use aws___call_aws to execute authenticated AWS API calls. The tool handles SigV4 signing and provides syntax validation. ## Using Agent SOPs Use aws___retrieve_agent_sop to find and follow pre-built workflows. SOPs provide step-by-step guidance following AWS Well-Architected principles. ## Regional Operations Use aws___list_regions to see all available AWS regions and aws___get_regional_availability to check service support in specific regions. ## AWS DevOps Agent Integration The AWS MCP server includes tools for AWS DevOps Agent: - aws___list_agent_spaces / aws___create_agent_space: Manage AgentSpaces - aws___create_investigation: Start incident investigations (5-8 min async) - aws___get_task: Poll investigation status - aws___list_journal_records: Read root cause analysis - aws___list_recommendations / aws___get_recommendation: Get remediation steps - aws___start_evaluation: Run proactive infrastructure evaluations - aws___create_chat / aws___send_message: Chat with AWS DevOps Agent ## Troubleshooting | Issue | Solution | |-------|----------| | Access denied errors | Verify IAM policy includes aws-mcp:InvokeMcp and aws-mcp:CallReadOnlyTool | | API call fails | Check IAM policy includes the specific service action | | Wrong region results | Specify the region explicitly in your query | | Proxy connection error | Verify uvx is installed and the proxy can reach aws-mcp.region.api.aws | mcp_connectors: - aws-mcp Select Save Note: The mcp_connectors: - aws-mcp at the bottom links this skill to the connector you created in Step 2. 
The skill's instructions teach the agent how to use the 23 AWS MCP tools effectively. Step 4: Test the integration Open a new chat session with your SRE Agent and try these example prompts to verify the connection is working. Quick verification Start with this simple test to confirm the AWS MCP proxy is connected and authenticating correctly: What AWS regions are available? If the agent returns a list of regions, the connection is working. If you see authentication errors, go back and verify the IAM credentials and permissions from Step 1. Documentation and knowledge Search AWS documentation for EKS best practices for production clusters What AWS regions support Amazon Bedrock? Read the AWS documentation page about S3 bucket policies Infrastructure queries List all my running EC2 instances in us-east-1 Show me the details of my EKS cluster named "production-cluster" What Lambda functions are deployed in my account? CloudWatch and monitoring What CloudWatch alarms are currently in ALARM state? Show me the CPU utilization metrics for my RDS instance over the last 24 hours Search CloudWatch Logs for errors in the /aws/lambda/my-function log group Troubleshooting workflows My EC2 instance i-0abc123 is not reachable. Help me troubleshoot. My Lambda function is timing out. Walk me through the investigation. Find an Agent SOP for troubleshooting EKS pod scheduling failures Cross-cloud scenarios My Azure Function is failing when calling AWS S3. Check if there are any S3 service issues and review the bucket policy for "my-data-bucket". Compare the health of my AWS EKS cluster with my Azure AKS cluster. 
AWS DevOps Agent investigations List all available AWS DevOps Agent spaces in my account Create an AWS DevOps Agent investigation for the high error rate on my Lambda function "order-processor" in us-west-2 Start a chat with AWS DevOps Agent about my EKS cluster performance Cross-agent investigation (Azure SRE Agent + AWS DevOps Agent) My application is failing across both Azure and AWS. Start an AWS DevOps Agent investigation for the AWS side while you check Azure Monitor for errors on the Azure side. Then combine the findings into a unified root cause analysis. What's New: AWS DevOps Agent Integration The AWS MCP server now includes full integration with AWS DevOps Agent, which recently became generally available. This means Azure SRE Agent can start autonomous incident investigations on AWS infrastructure and get back root cause analyses and remediation recommendations — all within the same chat session. Available tools by category AgentSpace management Tool Description aws___list_agent_spaces Discover available AgentSpaces aws___get_agent_space Get AgentSpace details including ARN and configuration aws___create_agent_space Create a new AgentSpace for investigations Investigation lifecycle Tool Description aws___create_investigation Start an incident investigation (async, 5-8 min) aws___get_task Poll investigation task status aws___list_tasks List investigation tasks with filters aws___list_journal_records Read root cause analysis journal aws___list_executions List execution runs for a task aws___list_recommendations Get prioritized mitigation recommendations aws___get_recommendation Get full remediation specification Proactive evaluations Tool Description aws___start_evaluation Start an evaluation to find preventive recommendations aws___list_goals List evaluation goals and criteria Real-time chat Tool Description aws___create_chat Start a real-time chat session with AWS DevOps Agent aws___list_chats List recent chat sessions aws___send_message Send a message 
and get a streamed response

Cross-Agent Investigation Workflow
With the AWS MCP server connected, SRE Agent can run parallel investigations across both clouds. Here's how the cross-agent workflow works:
1. Start an AWS investigation: Ask SRE Agent to create an AWS DevOps Agent investigation for the AWS-side symptoms
2. Investigate Azure in parallel: While the AWS investigation runs (5-8 minutes), SRE Agent uses its native tools to check Azure Monitor, Log Analytics, and resource health
3. Read AWS results: When the investigation completes, SRE Agent reads the journal records and recommendations
4. Correlate findings: SRE Agent combines both sets of findings into a single root cause analysis with remediation steps for both clouds

Common cross-cloud scenarios:
- Azure app calling AWS services: Investigate Azure Function errors that correlate with AWS API failures
- Hybrid deployments: Check AWS EKS clusters alongside Azure AKS clusters during multi-cloud outages
- Data pipeline issues: Trace data flow across Azure Event Hubs and AWS Kinesis or SQS
- Agent-to-agent investigation: Start an AWS DevOps Agent investigation for the AWS side while Azure SRE Agent checks Azure resources in parallel

Architecture
The integration uses a stdio proxy architecture. SRE Agent spawns the proxy as a child process, and the proxy forwards requests to the AWS MCP endpoint:

    Azure SRE Agent
        |
        | stdio (local process)
        v
    mcp-proxy-for-aws (spawned via uvx)
        |
        | Authenticated HTTPS requests
        v
    AWS MCP Server (aws-mcp.<region>.api.aws)
        |
        |--- Authenticated AWS API calls --> AWS Services
        |                                    (EC2, S3, CloudWatch, EKS, Lambda, etc.)
        |
        '--- DevOps Agent API calls ------> AWS DevOps Agent
                                             |-- AgentSpaces (workspaces)
                                             |-- Investigations (async root cause analysis)
                                             |-- Recommendations (remediation specs)
                                             '-- Chat sessions (real-time interaction)

Troubleshooting

Authentication and connectivity issues
| Error | Cause | Solution |
|-------|-------|----------|
| 403 Forbidden | IAM user lacks MCP permissions | Add aws-mcp:InvokeMcp and aws-mcp:CallReadOnlyTool to the IAM policy |
| 401 Unauthorized | Invalid or expired AWS credentials | Rotate access keys and update the connector environment variables |
| Proxy fails to start | uvx not installed or not on PATH | Install uv on the SRE Agent host |
| Connection timeout | Proxy cannot reach the AWS MCP endpoint | Verify outbound HTTPS (port 443) is allowed to aws-mcp.<region>.api.aws |
| Connector added but tools not available | MCP connections are initialized at agent startup | Redeploy or restart the agent service from the Azure portal |
| Slow first connection | uvx downloads ~89 dependencies on first run | Wait up to 30 seconds for the initial connection |

API and permission issues
| Error | Cause | Solution |
|-------|-------|----------|
| AccessDenied on API call | IAM user lacks the service-specific permission | Add the required IAM action (e.g., ec2:DescribeInstances ) to the user's policy |
| CallReadWriteTool denied | Write permission not granted | Add aws-mcp:CallReadWriteTool to the IAM policy |
| Wrong region data | Proxy configured for a different region | Update the AWS_REGION metadata in the connector arguments, or specify the region in your query |
| API not found | Newly released or unsupported API | Use aws___suggest_aws_commands to find the correct API name |

Verify the connection
Test that the proxy can authenticate by opening a new chat session and asking: What AWS regions are available? If the agent returns a list of regions, the connection is working. If you see authentication errors, verify the IAM credentials and permissions from Step 1.
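For the "Connection timeout" case in the troubleshooting table, a quick outbound check from the SRE Agent host can help separate network problems from credential problems. The following is a minimal sketch, assuming you can run Python on that host; the helper names `mcp_host` and `can_reach` are hypothetical. It only tests TCP reachability on port 443 and says nothing about whether authentication will succeed.

```python
import socket


def mcp_host(region: str) -> str:
    """Hostname of the regional AWS MCP endpoint (pattern from this guide)."""
    return f"aws-mcp.{region}.api.aws"


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    DNS failures, refused connections, and timeouts all surface as
    OSError subclasses, so each of them is reported as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = mcp_host("us-east-1")
    status = "reachable" if can_reach(host) else "NOT reachable"
    print(f"{host}:443 is {status}")
```

If the host is reachable but the connector still fails, the authentication and permission rows in the tables above are the more likely cause.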
Re-authorize the integration
If you encounter persistent authentication issues:
1. Navigate to the IAM console
2. Select the user created in Step 1
3. Navigate to Security credentials > Access keys
4. Deactivate or delete the old access key
5. Create a new access key
6. Update the connector environment variables in the SRE Agent portal with the new credentials

Related content
- AWS MCP Server documentation
- MCP Proxy for AWS on GitHub
- AWS MCP Server tools reference
- AWS DevOps Agent documentation
- AWS DevOps Agent GA announcement
- AWS IAM documentation

Microsoft 365 multi-agent workflow with Microsoft Agent Framework
Learn how to design and run a multi‑agent workflow with Microsoft Agent Framework: from building a coordinated set of specialized agents and tools, to hosting and deploying them with Azure AI Foundry, and finally exposing the same workflow to users in Microsoft 365 (Teams or Copilot). This walkthrough demonstrates a practical end‑to‑end pattern for orchestrating agents, adding tools, and packaging the solution for real‑world applications.

An AI led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub.
This is due to the inevitable move towards fully agentic, end-to-end SDLCs. We may not yet be at a point where software engineers are managing fleets of agents creating the billion-dollar AI abstraction layer, but (as I will evidence in this article) we are certainly on the precipice of such a world. Before we dive into the reality of agentic development today, let me examine two very different modules from university and their relevance in an AI-first development environment. Manual Requirements Translation. At university I dedicated two whole years to a unit called “Systems Design”. This was one of my favourite units, primarily focused on requirements translation. Often, I would receive a scenario between “The Proprietor” and “The Proprietor’s wife”, who seemed to be in a never-ending cycle of new product ideas. These tasks would be analysed, broken down, manually refined, and then mapped to some kind of early-stage application architecture (potentially some pseudo-code and a UML diagram or two). The big intellectual effort in this exercise was taking human intention and turning it into something tangible to build from (BA’s). Today, by the time I have opened Notepad and started to decipher requirements, an agent can already have created a comprehensive list, a service blueprint, and a code scaffold to start the process (*cough* spec-kit *cough*). Manual debugging. Need I say any more? Old-school debugging with print()’s and breakpoints is dead. I spent countless hours learning to debug in a classroom and then later with my own software, stepping through execution line by line, reading through logs, and understanding what to look for; where correlation did and didn’t mean causation. 
I think back to my year at IBM as a fresh-faced intern in a cloud engineering team, where around 50% of my time was debugging different issues until it was sufficiently “narrowed down”, and then reading countless Stack Overflow posts figuring out the actual change I would need to make to a PowerShell script or Jenkins pipeline. Already in Azure, with the emergence of SRE agents, that debug process looks entirely different. The debug process for software even more so… #terminallastcommand WHY IS THIS NOT RUNNING? #terminallastcommand Review these logs and surface errors relating to XYZ. As I said: breakpoints are dead, for now at least. Caveat – Is this a good thing? One more deviation from the main core of the article if you would be so kind (if you are not as kind skip to the implementation walkthrough below). Is this actually a good thing? Is a software engineering degree now worthless? What if I love printf()? I don’t know is my answer today, at the start of 2026. Two things worry me: one theoretical and one very real. To start with the theoretical: today AI takes a significant amount of the “donkey work” away from developers. How does this impact cognitive load at both ends of the spectrum? The list that “donkey work” encapsulates is certainly growing. As a result, on one end of the spectrum humans are left with the complicated parts yet to be within an agent’s remit. This could have quite an impact on our ability to perform tasks. If we are constantly dealing with the complex and advanced, when do we have time to re-root ourselves in the foundations? Will we see an increase in developer burnout? How do technical people perform without the mundane or routine tasks? I often hear people who have been in the industry for years discuss how simple infrastructure, computing, development, etc. were 20 years ago, almost with a longing to return to a world where today’s zero trust, globally replicated architectures are a twinkle in an architect’s eye. 
Is constantly working on only the most complex problems a good thing? At the other end of the spectrum, what if the performance of AI tooling and agents outperforms our wildest expectations? Suddenly, AI tools and agents are picking up more and more of today’s complicated and advanced tasks. Will developers, architects, and organisations lose some ability to innovate? Fundamentally, we are not talking about artificial general intelligence when we say AI; we are talking about incredibly complex predictive models that can augment the existing ideas they are built upon but are not, in themselves, innovators. Put simply, in the words of Scott Hanselman: “Spicy auto-complete”. Does increased reliance on these agents in more and more of our business processes remove the opportunity for innovative ideas? For example, if agents were football managers, would we ever have graduated from Neil Warnock and Mick McCarthy football to Pep? Would every agent just augment a ‘lump it long and hope’ approach? We hear about learning loops, but can these learning loops evolve into “innovation loops?” Past the theoretical and the game of 20 questions, the very real concern I have is off the back of some data shared recently on Stack Overflow traffic. We can see in the diagram below that Stack Overflow traffic has dipped significantly since the release of GitHub Copilot in October 2021, and as the product has matured that trend has only accelerated. Data from 12 months ago suggests that Stack Overflow has lost 77% of new questions compared to 2022… Stack Overflow democratises access to problem-solving (I have to be careful not to talk in past tense here), but I will admit I cannot remember the last time I was reviewing Stack Overflow or furiously searching through solutions that are vaguely similar to my own issue. This causes some concern over the data available in the future to train models. Today, models can be grounded in real, tested scenarios built by developers in anger. 
What happens with this question drop when API schemas change, when the technology built for today is old and deprecated, and the dataset is stale and never returning to its peak? How do we mitigate this impact? There is potential for some closed-loop type continuous improvement in the future, but do we think this is a scalable solution? I am unsure. So, back to the question: “Is this a good thing?” It’s great today; the long-term impacts are yet to be seen. If we think that AGI may never be achieved, or is at least a very distant horizon, then understanding the foundations of your technical discipline is still incredibly important. Developers will not only be the managers of their fleet of agents, but also the janitors mopping up the mess when there is an accident (albeit likely mopping with AI-augmented tooling).

An AI First SDLC Today – The Reality
Enough reflection and nostalgia (I don’t think that’s why you clicked the article), let’s start building something. For the rest of this article I will be building an AI-led, agent-powered software development lifecycle. The example I will be building is an AI-generated weather dashboard. It’s a simple example, but if agents can generate, test, deploy, observe, and evolve this application, it suggests that today, and into the future, the process can likely scale to more complex domains. Let’s start with the entry point: the problem statement that we will build from. “As a user I want to view real time weather data for my city so that I can plan my day.” We will use this as the single input for our AI-led SDLC. This is what we will pass to Spec Kit, and then watch our app and subsequent features get built in front of our eyes. The goal is that we will use: - Spec Kit to get going and move from textual idea to requirements and scaffold. - A coding agent to implement our plan. - A quality agent to assess the output and quality of the code.
- GitHub Actions that not only host the agents (Abstracted) but also handle the build and deployment. - An SRE agent proactively monitoring and opening issues automatically. The end to end flow that we will review through this article is the following: Step 1: Spec-driven development - Spec First, Code Second A big piece of realising an AI-led SDLC today relies on spec-driven development (SDD). One of the best summaries for SDD that I have seen is: “Version control for your thinking”. Instead of huge specs that are stale and buried in a knowledge repository somewhere, SDD looks to make them a first-class citizen within the SDLC. Architectural decisions, business logic, and intent can be captured and versioned as a product evolves; an executable artefact that evolves with the project. In 2025, GitHub released the open-source Spec Kit: a tool that enables the goal of placing a specification at the centre of the engineering process. Specs drive the implementation, checklists, and task breakdowns, steering an agent towards the end goal. This article from GitHub does a great job explaining the basics, so if you’d like to learn more it’s a great place to start (https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/). In short, Spec Kit generates requirements, a plan, and tasks to guide a coding agent through an iterative, structured development process. Through the Spec Kit constitution, organisational standards and tech-stack preferences are adhered to throughout each change. I did notice one (likely intentional) gap in functionality that would cement Spec Kit’s role in an autonomous SDLC. That gap is that the implement stage is designed to run within an IDE or client coding agent. You can now, in the IDE, toggle between task implementation locally or with an agent in the cloud. That is great but again it still requires you to drive through the IDE. 
Thinking about this in the context of an AI-led SDLC (where we are pushing tasks from Spec Kit to a coding agent outside of my own desktop), it was clear that a bridge was needed. As a result, I used Spec Kit to create the Spec-to-issue tool. This allows us to take the tasks and plan generated by Spec Kit, parse the important parts, and automatically create a GitHub issue, with the option to auto-assign the coding agent. From the perspective of an autonomous AI-led SDLC, Spec Kit really is the entry point that triggers the flow. How Spec Kit is surfaced to users will vary depending on the organisation and the context of the users. For the rest of this demo I use Spec Kit to create a weather app calling out to the OpenWeather API, and then add additional features with new specs. With one simple prompt of “/speckit.specify <application feature/idea/change>”, I suddenly had a really clear breakdown of the tasks and plan required to get to my desired end state, while respecting the context and preferences I had previously set in my Spec Kit constitution. There, I had stated a desire for test-driven development, a required level of test coverage, and that all solutions were to be Azure native. The real benefit compared to prompting the coding agent directly is the breakdown: splitting one large task into small, individually measurable components that are clear and methodical considerably improves the coding agent’s ability to perform them. We can see an example below of not just creating a whole application but also of another spec that iterates on an existing application and adds a feature. We can see the result of the spec creation, the issue in our GitHub repo and, most importantly for the next step, that our coding agent, GitHub Copilot, has been assigned automatically.
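To make the Spec-to-issue idea concrete, here is a minimal sketch of the parsing half of such a bridge. It is not the actual tool: the task line format (`- [ ] T001 Description`), the helper names, and the assignee login are all assumptions for illustration. The resulting dictionary has the shape of a JSON body for GitHub's "create an issue" REST endpoint (`title`, `body`, `assignees`), but nothing is sent here, and how the coding agent is actually assigned is not shown.

```python
import re
from typing import List, Optional

# Assumed task line format: "- [ ] T001 Description" (checkbox, ID, text).
TASK_LINE = re.compile(r"^- \[(?P<done>[ xX])\] (?P<id>T\d+)\s+(?P<text>.+)$")


def parse_tasks(markdown: str) -> List[dict]:
    """Extract checkbox tasks from a tasks file, ignoring all other lines."""
    tasks = []
    for line in markdown.splitlines():
        match = TASK_LINE.match(line.strip())
        if match:
            tasks.append({
                "id": match["id"],
                "text": match["text"].strip(),
                "done": match["done"].lower() == "x",
            })
    return tasks


def build_issue(title: str, tasks: List[dict],
                assignee: Optional[str] = None) -> dict:
    """Build an issue payload with the tasks rendered as a checklist."""
    checklist = "\n".join(
        f"- [{'x' if t['done'] else ' '}] {t['id']} {t['text']}" for t in tasks
    )
    payload = {"title": title, "body": f"## Tasks\n\n{checklist}"}
    if assignee:
        # Hypothetical login; the real agent assignment may work differently.
        payload["assignees"] = [assignee]
    return payload


SAMPLE = """# Tasks: geolocation feature
- [ ] T001 Add browser geolocation lookup
- [x] T002 Write unit tests for the weather client
Some free-text note that is not a task.
"""

if __name__ == "__main__":
    issue = build_issue("Add geolocation", parse_tasks(SAMPLE),
                        "example-agent-login")
    print(issue["body"])
```

Rendering the tasks as a checklist in the issue body keeps the small, measurable units from the spec visible to whoever, or whatever, picks up the issue.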
Step 2: GitHub Coding Agent - Iterative, autonomous software creation
Talking of coding agents, GitHub Copilot’s coding agent is an autonomous agent in GitHub that can take a scoped development task and work on it in the background using the repository’s context. It can make code changes and produce concrete outputs like commits and pull requests for a developer to review. The developer stays in control by reviewing, requesting changes, or taking over at any point. This does the heavy lifting in our AI-led SDLC. We have already seen great success with customers who have adopted the coding agent when it comes to carrying out menial tasks to save developers time. These coding agents can work in parallel to human developers and with each other. In our example we see that the coding agent creates a new branch for its changes, and creates a PR which it starts working on as it ticks off the various tasks generated in our spec. One huge positive of the coding agent that sets it apart from other similar solutions is the transparency in decision-making and actions taken. The monitoring and observability built directly into the feature means that the agent’s “thinking” is easily visible: the iterations and steps being taken can be viewed in full sequence in the Agents tab. Furthermore, the action that the agent is running is also transparently available to view in the Actions tab, meaning problems can be assessed very quickly. By the time the coding agent is finished, it has run the required tests and, in the case of a UI change, even goes as far as calling the Playwright MCP server and screenshotting the change to showcase in the PR. We are then asked to review the change. In this demo, I also created a GitHub Action that is triggered when a PR review is requested: it creates the required resources in Azure and surfaces the (in this case) Azure Container Apps revision URL, making it even smoother for the human in the loop to evaluate the changes.
Just like any normal PR, if changes are required comments can be left; when they are, the coding agent can pick them up and action what is needed. It’s also worth noting that for any manual intervention here, GitHub Codespaces would work very well for making minor changes or performing testing on an agent’s branch. We can even see that the unit tests specified in our spec have been executed by our coding agent. The pattern used here (Spec Kit -> coding agent) overcomes one of the biggest challenges we see with the coding agent. Unlike an IDE-based coding agent, the GitHub.com coding agent is left to its own iterations and implementation without input until the PR review. This can lead to subpar performance, especially compared to IDE agents which have constant input and interruption. The concise and considered breakdown generated from Spec Kit provides the structure and foundation for the agent to execute on; very little is left to interpretation for the coding agent.

Step 3: GitHub Code Quality Review (Human in the loop with agent assistance.)
GitHub Code Quality is a feature (currently in preview) that proactively identifies code quality risks and opportunities for enhancement both in PRs and through repository scans. These are surfaced within a PR and also in repo-level scoreboards. This means that PRs can now extend existing static code analysis: Copilot can action CodeQL, PMD, and ESLint scanning on top of the new, in-context code quality findings and autofixes. Furthermore, we receive a summary of the actual changes made. This can be used to assist the human in the loop in understanding what changes have been made and whether enhancements or improvements are required. Thinking about this in the context of review coverage, one of the challenges in already-lean development teams is finding the time to give PRs proper attention. Now, with AI-assisted quality scanning, we can be more confident in our overall evaluation and test coverage.
I would expect that use of these tools alongside existing human review processes would increase repository code quality and reduce uncaught errors. The data points support this too. The Qodo 2025 AI Code Quality report showed that usage of AI code reviews increased quality improvements to 81% (from 55%). A similar 2026 Atlassian RovoDev study showed that 38.7% of comments left by AI agents in code reviews led to additional code fixes. LLMs in their current form are never going to achieve 100% accuracy; however, these are still significant gains in one of the most important (and often neglected) parts of the SDLC. With a significant number of software supply chain attacks recently, it is also not a stretch to imagine that many projects could benefit from "independently" (I use this term loosely) reviewed and summarised PRs and commits. In the future, this could potentially be done by a specialist sub-agent during a PR or merge, focused on identifying malicious code that may be hidden within otherwise normal contributions, a case in point being the "near-miss" XZ Utils attack.

Step 4: GitHub Actions for build and deploy - No agents here, just deterministic automation.
This step will be our briefest, as the idea of CI/CD and automation needs no introduction. It is worth noting that while I am sure there are additional opportunities for using agents within a build and deploy pipeline, I have not investigated them. I often speak with customers about deterministic and non-deterministic business process automation, and the importance of distinguishing between the two. Some processes were created to be deterministic because that is all that was available at the time; the number of conditions required to deal with N possible flows just did not scale. However, now those processes can be non-deterministic.
Good examples include IVR decision trees in customer service or hard-coded sales routines to retain a customer regardless of context; these would benefit from less determinism in their execution. However, some processes remain best as deterministic flows: financial transactions, policy engines, document ingestion. While all these flows may be part of an AI solution in the future (possibly as a tool an agent calls, or as part of a larger agent-based orchestration), the processes themselves are deterministic for a reason. Just because we could have dynamic decision-making doesn’t mean we should. Infrastructure deployment and CI/CD pipelines are a good example of this, in my opinion. We could have an agent decide what service best fits our codebase and which region we should deploy to, but do we really want to, and do the benefits outweigh the potential negatives? In this process flow we use a deterministic GitHub Action to deploy our weather application into our “development” environment and then promote it through the environments until we reach production, where we want to ensure that the application is running smoothly. We also use an action, as mentioned above, to deploy and surface our agent’s changes. In Azure Container Apps we can do this in a secure sandbox environment called a “Dynamic Session” to ensure strong isolation of what is essentially “untrusted code”. Enterprises often view the building and development of AI applications as something that requires a completely new process to take to production. While certain additional processes are new (evaluation, model deployment, etc.), many of our traditional SDLC principles are just as relevant as ever, CI/CD pipelines being a great example: checked-in code that is predictably deployed alongside required services to run tests or promote through environments. Whether you are deploying a Java calculator app or a multi-agent customer service bot, CI/CD is non-negotiable even in this new world.
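To make the "deterministic" point concrete, here is a minimal Python sketch of a fixed-order promotion. It is an illustration only, not the GitHub Actions workflow used in the demo: the `staging` environment, the `promote` function, and the `deploy` and `healthy` callbacks are all hypothetical names introduced for this example.

```python
from typing import Callable, List

# Hypothetical environment order; the post only names "development" and production.
ENVIRONMENTS = ["development", "staging", "production"]


def promote(
    image_tag: str,
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy one artifact to each environment in a fixed order.

    Promotion stops at the first environment whose health check fails,
    so the same inputs always produce the same sequence of deployments.
    """
    promoted: List[str] = []
    for env in ENVIRONMENTS:
        deploy(env, image_tag)      # same artifact everywhere, no per-run decisions
        if not healthy(env):
            break                   # never promote past an unhealthy environment
        promoted.append(env)
    return promoted


# Stub callbacks standing in for real deployment and health-check steps.
deployed_to: List[str] = []
all_green = promote("weather-app:1.0", lambda env, tag: deployed_to.append(env), lambda env: True)
stopped_early = promote("weather-app:1.1", lambda env, tag: None, lambda env: env != "staging")
```

The only branch is a health gate, so reviewers can read off every possible outcome in advance. That property is what would be given up if an agent chose targets or regions dynamically.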
We can see that our geolocation feature is running on our Azure Container Apps revision, and we can begin to evaluate whether we agree with Copilot that all the feature requirements have been met. In this case they have. If they hadn't, we'd just jump into the PR and add a new comment with "@copilot" requesting our changes.

Step 5: SRE Agent - Proactive agentic day two operations.

The SRE agent service on Azure is an operations-focused agent that continuously watches a running service using telemetry such as logs, metrics, and traces. When it detects incidents or reliability risks, it can investigate signals, correlate likely causes, and propose or initiate response actions such as opening issues, creating runbook-guided fixes, or escalating to an on-call engineer. It effectively automates parts of day two operations while keeping humans in control of approval and remediation. It can be run in two different permission models. In the first, it has a reader role and can temporarily take user permissions for approved actions when they are identified. In the second, it has a privileged level that allows it to autonomously take approved actions on resources and resource types within the resource groups it is monitoring. In our example, our SRE agent could take actions to ensure our container app runs as intended: restarting pods, changing traffic allocations, and alerting for secret expiry. The SRE agent can also perform detailed debugging to save human SREs time, summarising the issue and the fixes tried so far, and narrowing down potential root causes to reduce time to resolution, even across the most complex issues. My initial concern with these types of autonomous fixes (be it VPA on Kubernetes or an SRE agent across your infrastructure) is always that they can very quickly mask problems, or become an anti-pattern where you have drift between your IaC and what is actually running in Azure. One of my favourite features of SRE agents is sub-agents.
Sub-agents can be created to handle very specific tasks that the primary SRE agent can leverage. Examples include alerting, report generation, and potentially other third-party integrations or tooling that require a more concise context. In my example, I created a GitHub sub-agent to be called by the primary agent after every issue that is resolved. When called, the GitHub sub-agent creates an issue summarising the origin, context, and resolution. This really brings us full circle. We can then potentially assign this to our coding agent to implement the fix before we proceed with the rest of the cycle; for example, a change where a port is incorrect in some Bicep, or min scale has been adjusted because of latency observed by the SRE agent. These are quick fixes that can be easily implemented by a coding agent, subsequently creating an autonomous feedback loop with human review.

Conclusion

The journey through this AI-led SDLC demonstrates that it is possible, with today’s tooling, to improve any existing SDLC with AI assistance, evolving from simply using a chat interface in an IDE. By combining spec-driven development with Spec Kit, autonomous coding agents, AI-augmented quality checks, deterministic CI/CD pipelines, and proactive SRE agents, we see an emerging ecosystem where human creativity and oversight guide an increasingly capable fleet of collaborative agents. As with all AI solutions we design today, I remind myself that “this is as bad as it gets”. If the last two years are anything to go by, the rate of change in this space means this article may look very different in 12 months. I imagine Spec-to-issue will no longer be required as a bridge, as native solutions evolve to make this process even smoother. There are also some areas of an AI-led SDLC that are not included in this post, such as reviewing the inner-loop process or the use of existing enterprise patterns and blueprints.
I also did not review the use of third-party plugins or tools available through GitHub. These would make for an interesting expansion of the demo. We also did not look at the creation of custom coding agents, which could be hosted in Microsoft Foundry; this is especially pertinent with the recent announcement of Anthropic models now being available to deploy in Foundry. Does today’s tooling mean that developers, QAs, and engineers are no longer required? Absolutely not (and if I am honest, I can’t see that changing any time soon). However, it is clear that in the next 12 months, enterprises who reshape their SDLC (and any other business process) to become one augmented by agents will innovate faster, learn faster, and deliver faster, leaving organisations who resist this shift struggling to keep up.

The Durable Task Scheduler Consumption SKU is now Generally Available
Today, we're excited to announce that the Durable Task Scheduler Consumption SKU has reached General Availability. Developers can now run durable workflows and agents on Azure with pay-per-use pricing, no storage to manage, no capacity to plan, and no idle costs. Just create a scheduler, connect your app, and start orchestrating. Whether you're coordinating AI agent workflows, processing event-driven pipelines, or running background jobs, the Consumption SKU is ready to go. GET STARTED WITH THE DURABLE TASK SCHEDULER CONSUMPTION SKU Since launching the Consumption SKU in public preview last November, we've seen incredible adoption and have incorporated feedback from developers around the world to ensure the GA release is truly production ready. “The Durable Task Scheduler has become a foundational piece of what we call ‘workflows’. It gives us the reliability guarantees we need for processing financial documents and sensitive workflows, while keeping the programming model straightforward. The combination of durable execution, external event correlation, deterministic idempotency, and the local emulator experience has made it a natural fit for our event-driven architecture. We have been delighted with the consumption SKUs cost model for our lower environments.”– Emily Lewis, CarMax What is the Durable Task Scheduler? If you're new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background: Announcing Limited Early Access of the Durable Task Scheduler Announcing Workflow in Azure Container Apps with the Durable Task Scheduler Announcing Dedicated SKU GA & Consumption SKU Public Preview In brief, the Durable Task Scheduler is a fully managed orchestration backend for durable execution on Azure, meaning your workflows and agent sessions can reliably resume and run to completion, even through process failures, restarts, and scaling events. 
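The idea of a workflow that can "resume and run to completion" after a failure can be illustrated with a small replay sketch in plain Python. This is a conceptual toy, not the Durable Task SDK or the scheduler's actual implementation: `ReplayContext`, `call_activity` as written here, and the order-processing functions are hypothetical, and an in-memory list stands in for the state the service would persist.

```python
from typing import Any, Callable, Dict, List


class ReplayContext:
    """Toy stand-in for a durable execution context (not the real SDK)."""

    def __init__(self, history: List[Any]) -> None:
        self.history = history      # results recorded by earlier runs
        self._cursor = 0

    def call_activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            result = self.history[self._cursor]   # replay: reuse the recorded result
        else:
            result = fn(*args)                    # first execution: run and checkpoint
            self.history.append(result)
        self._cursor += 1
        return result


calls: Dict[str, int] = {"fetch": 0, "charge": 0}


def fetch_order(order_id: str) -> Dict[str, Any]:
    calls["fetch"] += 1
    return {"id": order_id, "total": 42}


def charge(order: Dict[str, Any]) -> str:
    calls["charge"] += 1
    if calls["charge"] == 1:
        raise RuntimeError("simulated process crash")
    return f"charged {order['total']}"


def workflow(ctx: ReplayContext, order_id: str) -> str:
    order = ctx.call_activity(fetch_order, order_id)
    return ctx.call_activity(charge, order)


history: List[Any] = []             # hypothetical persisted state
try:
    workflow(ReplayContext(history), "A1")         # first attempt fails at the charge step
except RuntimeError:
    pass
result = workflow(ReplayContext(history), "A1")    # restart: fetch is replayed, not re-run
```

On the second attempt the completed step is served from recorded history, so only the step that failed runs again. In a managed service the history would live in durable storage rather than in process memory.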
Whether you’re running workflows or orchestrating durable agents, it handles task scheduling, state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing their own execution engines and storage backends. The Durable Task Scheduler works across Azure compute environments: Azure Functions: Using the Durable Functions extension across all Function App SKUs, including Flex Consumption. Azure Container Apps: Using the Durable Functions or Durable Task SDKs with built-in workflow support and auto-scaling. Any compute: Azure Kubernetes Service, Azure App Service, or any environment where you can run the Durable Task SDKs (.NET, Python, Java, JavaScript). Why choose the Consumption SKU? With the Consumption SKU you’re charged only for actions dispatched, with no minimum commitments or idle costs. There’s no capacity to size or throughput to reserve. Create a scheduler, connect your app, and you’re running. The Consumption SKU is a natural fit for workloads with unpredictable or bursty usage patterns: AI agent orchestration: Multi-step agent workflows that call LLMs, retrieve data, and take actions. Users trigger these on demand, so volume is spiky and pay-per-use avoids idle costs between bursts. Event-driven pipelines: Processing events from queues, webhooks, or streams with reliable orchestration and automatic checkpointing, where volumes spike and dip unpredictably. API-triggered workflows: User signups, form submissions, payment flows, and other request-driven processing where volume varies throughout the day. Distributed transactions: Retries and compensation logic across microservices with durable sagas that survive failures and restarts. What's included in the Consumption SKU at GA The Consumption SKU has been hardened based on feedback and real-world usage during the public preview. 
Here's what's included at GA:

Performance
Up to 500 actions per second: Sufficient throughput for a wide range of workloads, with the option to move to the Dedicated SKU for higher-scale scenarios.
Up to 30 days of data retention: View and manage orchestration history, debug failures, and audit execution data for up to 30 days.

Built-in monitoring dashboard
Filter orchestrations by status, drill into execution history, view visual Gantt and sequence charts, and manage orchestrations (pause, resume, terminate, or raise events), all from the dashboard, secured with Role-Based Access Control (RBAC).

Identity-based security
The Consumption SKU uses Entra ID for authentication and RBAC for authorization. There are no SAS tokens or access keys to manage; just assign the appropriate role and connect.

Get started with the Durable Task Scheduler today
The Consumption SKU is now Generally Available. Provision a scheduler in the Azure portal, connect your app, and start orchestrating. You only pay for what you use.

Documentation: Getting started, Samples, Pricing, Consumption SKU docs

We'd love to hear your feedback. Reach out to us by filing an issue on our GitHub repository.