# Heroku Entered Maintenance Mode — Here's Your Next Move
Heroku has entered sustaining engineering — no new features, no new enterprise contracts. If you're running production workloads on the platform, you're probably thinking about what comes next. Azure Container Apps is worth a serious look. Scale-to-zero pricing, event-driven autoscaling, built-in microservice support, serverless GPUs, and an active roadmap — it's a container platform that handles everything from a simple web app to AI-native workloads, and you only pay for what you use.

I migrated a real Heroku app to Container Apps to pressure-test the experience. Here's what I learned, what to watch out for, and how you can do it in an afternoon.

## Why Container Apps is the natural next chapter

The philosophy carries over directly. Where Heroku had `git push`, Container Apps has:

```bash
az containerapp up --name my-app --source . --environment my-env
```

One command. If you have a Dockerfile, it builds and deploys your app directly. No local Docker install, no manual registry push — code in, URL out. That part didn't change.

The concept mapping is tight:

| Heroku | Azure Container Apps |
| --- | --- |
| Dyno | Container App replica |
| Procfile process types | Separate Container Apps |
| Heroku add-ons | Azure managed services |
| Config vars | Environment variables + secrets |
| `heroku run` one-off dynos | Container Apps Jobs |
| Heroku Pipelines | GitHub Actions |
| Heroku Scheduler | Scheduled Container Apps Jobs |

Container Apps also includes capabilities you'd need to piece together separately on Heroku — KEDA-powered autoscaling from any event source, Dapr for service-to-service communication, traffic splitting across revisions for safe rollouts, and scale to zero so you stop paying when nothing's running.

**Simplest path?** If your app is a straightforward web server and you don't want containers at all, Azure App Service (`az webapp up`) is also available. But for most Heroku workloads — especially anything with workers, background jobs, or variable traffic — Container Apps is the better fit.
## What a real migration looks like

I took a Node.js + Redis todo app from Heroku and moved it to Container Apps. The app is intentionally boring — Express server, Redis for storage, one web process. This is roughly what a lot of Heroku apps look like, and the migration took about 90 minutes end-to-end (most of that waiting for Redis to provision).

### Step 1: Export what you have

```bash
heroku config --json --app my-heroku-app > heroku-config.json
heroku apps:info --app my-heroku-app
heroku addons --app my-heroku-app
```

You want three things: your environment variables, your add-on list, and your app metadata. The config export is the most important one — it's every secret and connection string your app needs.

### Step 2: Create the Azure backing services

For each Heroku add-on, create the Azure equivalent. Here are the common ones:

| Heroku add-on | Azure service | CLI command |
| --- | --- | --- |
| Heroku Postgres | Azure Database for PostgreSQL | `az postgres flexible-server create` |
| Heroku Redis | Azure Cache for Redis | `az redis create` |
| Heroku Scheduler | Container Apps Jobs | `az containerapp job create` |
| SendGrid | SendGrid via Marketplace | (Portal) |
| Papertrail / LogDNA | Azure Monitor + Log Analytics | (Enabled by default) |

For my todo app, I needed Redis:

```bash
az redis create \
  --name my-redis \
  --resource-group my-rg \
  --location swedencentral \
  --sku Basic --vm-size c0
```

One thing to know: Azure Cache for Redis takes 10–20 minutes to provision. Heroku's Redis add-on takes about two minutes. Budget the time.

### Step 3: Containerize

If you don't have a Dockerfile, write one. For a Node app this is about 10 lines:

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
CMD ["node", "server.js"]
```

Don't have a Dockerfile? Point GitHub Copilot at the migration repo and it will generate one for your stack — Node, Python, Ruby, Java, or Go. The repo includes templates and a containerization skill that inspects your app and produces a production-ready Dockerfile.
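Between containerizing and deploying, you'll also want to translate the Step 1 config export into Container Apps secrets. A small helper can generate the `key=value` pairs for the CLI (a sketch only; the function names are mine, and the lowercase-hyphenated naming rule for secret names is an assumption worth verifying against your own keys):

```python
import json

def to_secret_name(heroku_key: str) -> str:
    """Convert a Heroku config var name like REDIS_URL to a
    lowercase, hyphenated secret name like redis-url."""
    return heroku_key.lower().replace("_", "-")

def build_secret_args(config: dict) -> list[str]:
    """Produce the key=value pairs to pass to `az containerapp secret set --secrets`."""
    return [f"{to_secret_name(k)}={v}" for k, v in sorted(config.items())]

# Example shape of the `heroku config --json` export (values are made up)
config = {"REDIS_URL": "rediss://:key@host:6380", "SESSION_SECRET": "s3cret"}
print(" ".join(build_secret_args(config)))
# redis-url=rediss://:key@host:6380 session-secret=s3cret
```

In practice you'd load `heroku-config.json` with `json.load` and paste the output into the secret-set command shown in Step 5.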
### Step 4: Build, push, deploy

I used Azure Container Registry (ACR) for the build. No local Docker install needed.

```bash
az acr create --name myacr --resource-group my-rg --sku Basic
az acr build --registry myacr --image my-app:v1 .
```

Then create the Container App:

```bash
az containerapp create \
  --name my-app \
  --resource-group my-rg \
  --environment my-env \
  --image myacr.azurecr.io/my-app:v1 \
  --registry-server myacr.azurecr.io \
  --target-port 8080 \
  --ingress external \
  --min-replicas 1
```

### Step 5: Wire up the config

This is where Heroku's `config:get` maps to Container Apps' secrets and environment variables. One gotcha I hit: you have to set secrets before you reference them in environment variables. If you try to do both at once, the deployment fails.

```bash
# Set the secret first
az containerapp secret set \
  --name my-app \
  --resource-group my-rg \
  --secrets redis-url="rediss://:ACCESS_KEY@my-redis.redis.cache.windows.net:6380"

# Then reference it
az containerapp update \
  --name my-app \
  --resource-group my-rg \
  --set-env-vars "REDIS_URL=secretref:redis-url"
```

### Step 6: Verify and cut over

Hit the Azure URL, test your routes, check that data flows through the new Redis instance. When you're satisfied, update your DNS CNAME to point at the Container Apps FQDN.

Total time: ~90 minutes, and most of that was waiting for Redis to provision. The actual migration work was about 30 minutes of CLI commands.

## Lessons from the field

I want to be upfront about what to watch for — these are the things that would waste your time if you hit them unprepared.

📌 **Register Azure providers first.** If your subscription has never used Container Apps, you need to register the resource providers before creating anything:

```bash
az provider register --namespace Microsoft.App --wait
az provider register --namespace Microsoft.OperationalInsights --wait
```

This takes a minute or two. Without it, resource creation fails with confusing error messages.

📌 **Set secrets before referencing them in env vars.**
The CLI doesn't warn you — it just fails the deployment. Always `az containerapp secret set` first, then `az containerapp update --set-env-vars`.

📌 **Budget time for Azure resource provisioning.** Azure Cache for Redis takes 10–20 minutes vs Heroku's ~2 minutes. Enterprise-grade infrastructure takes a bit longer to spin up — plan accordingly and provision backing services in parallel.

None of these are blockers. They're the kind of things a migration guide should tell you upfront — and ours does.

## Migrate today, build intelligent apps tomorrow

Once your app is on Container Apps, you're on a platform built for AI-native workloads too. No second migration required:

- **Serverless GPU** — attach GPU compute to your Container Apps for inference workloads. Run models alongside your app, same environment, same deployment pipeline. No separate ML infrastructure to manage.
- **Dynamic Sessions** — spin up isolated, sandboxed code execution environments on demand. Build AI agents that execute tools, run LLM-generated code safely, or offer interactive coding experiences — all within your existing Container Apps environment.

These aren't separate services you bolt on — they're configuration changes on the platform you're already running on.

**Building AI-native?** Container Apps pairs naturally with Azure AI Foundry — one place to access state-of-the-art models from both OpenAI and Anthropic, manage prompts, evaluate outputs, and deploy endpoints. Your app runs on Container Apps; your intelligence runs on Foundry. Same subscription, same identity, no glue code between clouds.

The applications being migrated today won't look the same in two years. A platform that grows with you — from web app to intelligent service — means you make this move once.

## You don't have to figure this out alone

We've built the resources to make this migration fast and repeatable:

- 📖 **Official Migration Guide** — End-to-end walkthrough covering assessment, containerization, service mapping, and deployment. Start here.
- 🤖 **Agent-Assisted Migration Repo** — An open-source repository designed to work with GitHub Copilot and other AI coding assistants. It includes an AGENTS.md file and six migration skills that walk you through the entire process — from Heroku assessment to DNS cutover — with real CLI commands, Dockerfile templates for five languages, Bicep IaC, and GitHub Actions workflows. Point Copilot at the repo alongside your app's source code, and it becomes a migration pair-programmer: running the right commands, generating Dockerfiles, setting up CI/CD, and flagging things you might miss. This isn't a magic *migrate my app* button. It's more like having a colleague who has done this migration twenty times sitting next to you while you do it.

## The cost math works

Let's talk numbers.

| Heroku plan | Monthly cost | Container Apps equivalent | Monthly cost |
| --- | --- | --- | --- |
| Standard-1X (idle most of the day) | $25/mo | Consumption plan (scale to zero) | ~$0–5/mo |
| Performance-L (steady traffic) | $500/mo | Dedicated plan with autoscaling | Meaningfully less |
| 10 low-traffic apps across dev/staging/prod | $250+/mo | Consumption plan with free grants | Near zero |

Container Apps' monthly free grant covers 180,000 vCPU-seconds and 2 million requests. For apps that idle most of the day, that's often enough to pay nothing at all.

The biggest savings come from workloads that don't run 24/7. Heroku charges for every hour a dyno is running, period. Container Apps charges for actual compute consumption and scales to zero when there's no traffic.

## Get started

1. **Inventory** — Run `heroku apps` and `heroku addons` to see what you have.
2. **Pick a pilot** — Choose a non-critical app for your first migration.
3. **Migrate** — Follow the official migration guide, or point GitHub Copilot at the migration repo and let it pair with you.

Azure Container Apps gives you a production-grade container platform with scale-to-zero economics, an active roadmap, and a path to AI-native workloads — all from a single `az containerapp up` command.
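As a back-of-the-envelope check on the free-grant math above, here is a rough calculation (the grant figures come from the article; the workload numbers are made-up assumptions for illustration):

```python
# Monthly Container Apps consumption free grant (from the article)
FREE_VCPU_SECONDS = 180_000
FREE_REQUESTS = 2_000_000

# Hypothetical mostly-idle app: a 0.25 vCPU replica active ~2 hours/day for 30 days
vcpu = 0.25
active_seconds = 2 * 3600 * 30
used_vcpu_seconds = vcpu * active_seconds   # 54,000 vCPU-seconds
requests = 50_000                            # well under the request grant

within_grant = used_vcpu_seconds <= FREE_VCPU_SECONDS and requests <= FREE_REQUESTS
print(used_vcpu_seconds, within_grant)  # 54000.0 True
```

Under those assumptions the app sits comfortably inside the grant, which is the "near zero" row in the cost table; memory consumption is billed separately and would need the same check.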
If you're evaluating what comes after Heroku, start here.

👉 Start your migration · Clone the migration repo · Explore Azure Container Apps

# Migrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)
## What is ACI/Virtual Nodes?

Azure Container Instances (ACI) is a fully managed serverless container platform which gives you the ability to run containers on demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM-backed node pools. From a developer's perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI's serverless infrastructure, enabling fast scale-out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short-lived workloads where speed and cost efficiency matter more than long-running capacity planning.

## Introducing the next generation of Virtual Nodes on ACI

The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add-on and delivering a more Kubernetes-native, flexible, and scalable experience when bursting workloads from AKS to ACI.

In this article I will demonstrate how you can migrate an existing AKS cluster using the Virtual Nodes managed add-on (legacy) to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm.

More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide.

Please note that all code samples within this guide are examples only, and are provided without warranty/support.
## Background

Virtual Nodes on ACI is rebuilt from the ground up, and includes several fixes and enhancements, for instance:

**Added support/features**

- VNet peering, outbound traffic to the internet with network security groups
- Init containers
- Host aliases
- Arguments for exec in ACI
- Persistent Volumes and Persistent Volume Claims
- Container hooks
- Confidential containers (see supported regions list here)
- ACI standby pools
- Support for image pulling via Private Link and Managed Identity (MSI)

**Planned future enhancements**

- Kubernetes network policies
- Support for IPv6
- Windows containers
- Port forwarding

Note: The new generation of the add-on is managed via Helm rather than as an AKS managed add-on.

## Requirements & limitations

- Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB memory on one of the AKS cluster's VMs
- Each Virtual Nodes node supports up to 200 pods
- DaemonSets are not supported
- Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (Kubenet is not supported, nor is overlay networking)

## Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart

For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, and therefore the steps below show an example of removing the Virtual Nodes managed add-on and its resources and then installing the Virtual Nodes on ACI Helm chart.

In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet; however, if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here. Be sure to use a new subnet CIDR within the VNet address space, which doesn't overlap with other subnets nor the AKS CIDRs for nodes/pods and ClusterIP services.

To minimise disruption, we'll first install the Virtual Nodes on ACI Helm chart, before then removing the legacy managed add-on and its resources.
### Prerequisites

- A recent version of the Azure CLI
- An Azure subscription with sufficient ACI quota for your selected region
- Helm

### Deployment steps

Initialise environment variables:

```bash
location=northeurope
rg=rg-virtualnode-demo
vnetName=vnet-virtualnode-demo
clusterName=aks-virtualnode-demo
aksSubnetName=subnet-aks
vnSubnetName=subnet-vn
```

Create the new Virtual Nodes on ACI subnet with the specific name value of `cg` (a custom subnet can be used by following the steps here):

```bash
vnSubnetId=$(az network vnet subnet create \
  --resource-group $rg \
  --vnet-name $vnetName \
  --name cg \
  --address-prefixes <your subnet CIDR> \
  --delegations Microsoft.ContainerInstance/containerGroups --query id -o tsv)
```

Assign the cluster's `-kubelet` identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet:

```bash
nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv)
nodeRgId=$(az group show -n $nodeRg --query id -o tsv)
agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv)
agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv)

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Contributor" \
  --scope "$nodeRgId"

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Network Contributor" \
  --scope "$vnSubnetId"
```

Download the cluster's kubeconfig file:

```bash
az aks get-credentials -n $clusterName -g $rg
```

Clone the virtualnodesOnAzureContainerInstances GitHub repo:

```bash
git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git
```

Install the Virtual Nodes on ACI Helm chart:

```bash
helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode
```

Confirm the Virtual Nodes node shows within the cluster and is in a Ready state
(`virtualnode-n`):

```
$ kubectl get node
NAME                                STATUS   ROLES    AGE     VERSION
aks-nodepool1-35702456-vmss000000   Ready    <none>   4h13m   v1.33.6
aks-nodepool1-35702456-vmss000001   Ready    <none>   4h13m   v1.33.6
virtualnode-0                       Ready    <none>   162m    v1.33.7
```

Scale down any running Virtual Nodes workloads (example below):

```bash
kubectl scale deploy <deploymentName> -n <namespace> --replicas=0
```

Drain and cordon the legacy Virtual Nodes node:

```bash
kubectl drain virtual-node-aci-linux
```

Disable the Virtual Nodes managed add-on (legacy):

```bash
az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node
```

Export a backup of the original subnet configuration:

```bash
az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json
```

Delete the original subnet (subnets cannot be renamed and therefore must be re-created):

```bash
az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName
```

Delete the previous (legacy) Virtual Nodes node from the cluster:

```bash
kubectl delete node virtual-node-aci-linux
```

Test and confirm pod scheduling on the Virtual Node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
  name: demo-pod
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done'
    image: mcr.microsoft.com/azure-cli
    name: hello-world-counter
    resources:
      limits:
        cpu: 2250m
        memory: 2256Mi
      requests:
        cpu: 100m
        memory: 128Mi
  nodeSelector:
    virtualization: virtualnode2
  tolerations:
  - effect: NoSchedule
    key: virtual-kubelet.io/provider
    operator: Exists
```

If the pod successfully starts on the Virtual Node, you should see similar to the below:

```
$ kubectl get pod -o wide demo-pod
NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
demo-pod   1/1     Running   0          95s   10.241.0.4   vnode2-virtualnode-0   <none>           <none>
```

Modify the `nodeSelector` and `tolerations` properties of your Virtual Nodes workloads to match the requirements of Virtual Nodes on ACI (see details below).

### Modify your deployments to run on Virtual Nodes on ACI

For Virtual Nodes managed add-on (legacy), the following `nodeSelector` and `tolerations` are used to run pods on Virtual Nodes:

```yaml
nodeSelector:
  kubernetes.io/role: agent
  kubernetes.io/os: linux
  type: virtual-kubelet
tolerations:
- key: virtual-kubelet.io/provider
  operator: Exists
- key: azure.com/aci
  effect: NoSchedule
```

For Virtual Nodes on ACI, the `nodeSelector`/`tolerations` are slightly different:

```yaml
nodeSelector:
  virtualization: virtualnode2
tolerations:
- effect: NoSchedule
  key: virtual-kubelet.io/provider
  operator: Exists
```

### Troubleshooting

Check the `virtual-node-admission-controller` and `virtualnode-n` pods are running within the `vn2` namespace:

```
$ kubectl get pod -n vn2
NAME                                                 READY   STATUS    RESTARTS        AGE
virtual-node-admission-controller-54cb7568f5-b7hnr   1/1     Running   1 (5h21m ago)   5h21m
virtualnode-0                                        6/6     Running   6 (4h48m ago)   4h51m
```

If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use `kubectl describe pod` to validate).
If the `virtualnode-n` pod is crashing, check the logs of the `proxycri` container to see whether there are any Managed Identity permissions issues (the cluster's `-agentpool` MSI needs to have Contributor access on the infrastructure resource group):

```bash
kubectl logs -n vn2 virtualnode-0 -c proxycri
```

Further troubleshooting guidance is available within the official documentation.

### Support

If you have issues deploying or using Virtual Nodes on ACI, add a GitHub issue here.

# The Swarm Diaries: What Happens When You Let AI Agents Loose on a Codebase
## The Idea

Single-agent coding assistants are impressive, but they have a fundamental bottleneck: they think serially. Ask one to build a full CLI app with a database layer, a command parser, pretty output, and tests, and it'll grind through each piece one by one. Industry benchmarks bear this out: AIMultiple's 2026 agentic coding benchmark measured Claude Code CLI completing full-stack tasks in ~12 minutes on average, with other CLI agents ranging from 3 to 14 minutes depending on the tool. A three-week real-world test by Render.com found single-agent coding workflows taking 10–30 minutes for multi-file feature work.

But these subtasks don't depend on each other. A storage agent doesn't need to wait for the CLI agent. A test writer doesn't need to watch the renderer work. What if they all ran at the same time?

The hypothesis was straightforward: a swarm of specialized agents should beat a single generalist on at least two of three pillars — speed, quality, or cost. The architecture looked clean on a whiteboard. The reality was messier. But first, let me explain the machinery that makes this possible.

## How It's Wired: Brains and Hands

The system runs on a brains-and-hands split.

The brain is an Azure Durable Task Scheduler (DTS) orchestration — a deterministic workflow that decomposes the goal into a task DAG, fans agents out in parallel, merges their branches, and runs quality gates. If the worker crashes mid-run, DTS replays from the last checkpoint. No work lost. Simple LLM calls — the planner that decomposes the goal, the judge that scores the output — run as lightweight DTS activities. One call, no tools, cheap.

The hands are Microsoft Agent Framework (MAF) agents, each running in its own Docker container. One sandbox per agent, each with its own git clone, filesystem, and toolset. When an agent's LLM decides to edit a file or run a build, the call routes through middleware to that agent's isolated container. No two agents ever touch the same workspace.
These complex agents — coders, researchers, the integrator — run as DTS durable entities with full agentic loops and turn-level checkpointing.

The split matters because LLM reasoning and code execution have completely different reliability profiles. The brain checkpoints and replays deterministically. The hands are ephemeral — if a container dies, spin up a new one and replay the agent's last turn. This separation is what lets you run five agents in parallel without them stepping on each other's git branches, build artifacts, or file handles.

It's also what made every bug I was about to encounter debuggable. When something broke, I always knew which side broke — the orchestration logic, or the agent behavior. That distinction saved me more hours than any other design decision.

## The First Run Produced Nothing

After hours of vibe-coding the foundation — Pydantic models, skill prompts, a prompt builder, a context store, sixteen architectural decisions documented in ADRs — I wired up the seven-phase orchestration and hit go.

All five agents returned empty responses. Every single one.

The logs showed agents "running" but producing zero output. I stared at the code for an embarrassingly long time before I found it. The planner returned task IDs as integers — `1, 2, 3`. The sandbox provisioner stored them as string keys — `"1", "2", "3"`. When the orchestrator did `sandbox_map.get(1)`, it got `None`. No sandbox meant no middleware. The agents were literally talking to thin air — making LLM calls with no tools attached, like a carpenter showing up to a job site with no hammer.

The fix was one line. The lesson was bigger: LLMs don't respect type contracts. They'll return an integer when you expect a string, a list when you expect a dict, and a confident hallucination when they have nothing to say. Every boundary between AI-generated data and deterministic systems needs defensive normalization.

This would not be the last time I learned that lesson.
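The defensive-normalization idea is simple to enforce at every LLM boundary. A minimal sketch (the function name and dictionary are illustrative, not from the actual codebase):

```python
def normalize_task_id(raw) -> str:
    """Coerce whatever the LLM returned (int or str) to a canonical string key."""
    if isinstance(raw, bool):  # bool is a subclass of int, reject it explicitly
        raise TypeError("booleans are never valid task IDs")
    if isinstance(raw, (int, str)):
        return str(raw).strip()
    raise TypeError(f"unsupported task ID type: {type(raw).__name__}")

# Provisioner stores string keys; planner may hand back ints.
sandbox_map = {"1": "sandbox-a", "2": "sandbox-b"}

# Normalize at the boundary, before every lookup.
assert sandbox_map.get(normalize_task_id(1)) == "sandbox-a"
```

Raising on unexpected types, rather than silently returning `None`, is the point: the original bug was invisible precisely because the failed lookup produced no error.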
## The Seven-Minute Merge

Once agents actually ran and produced code, a new problem emerged. I watched the logs on a run that took twenty-one minutes total. Four agents finished their work in about twelve minutes. The remaining seven minutes were the LLM integrator merging four branches — eight to thirty tool calls per merge, using the premium model, to do what `git merge --no-edit` does in five seconds.

I was paying for a premium LLM to run `git diff`, read both sides of every file, and write a merged version. For branches that merged cleanly. With zero conflicts.

The fix was obvious in retrospect: try `git merge` first. If it succeeds — great, five seconds, done. Only call the LLM integrator when there are actual conflicts to resolve. Merge time dropped from seven minutes to under thirty seconds. I felt a little silly for not doing this from the start.

## When Agents Build Different Apps

The merge speedup felt like a win until I looked at what was actually being merged. The storage agent had built a JSON-file backend. The CLI agent had written its commands against SQLite. Both modules were well-written. They compiled individually. Together, nothing worked — the CLI tried to import a `Storage` class that didn't exist in the JSON backend.

This was the moment I realized the agents weren't really a team. They were strangers who happened to be assigned to the same project, each interpreting the goal in their own way.

The fix was the single most impactful change in the entire project: contract-first planning. Instead of just decomposing the goal into tasks, the planner now generates API contracts — function signatures, class shapes, data model definitions — and injects them into every agent's prompt. "Here's what the `Storage` class looks like. Here's what `Task` looks like. Build against these interfaces."

Before contracts, three of six branches conflicted and the quality score was 28. After contracts, zero of four branches conflicted and the score hit 68.
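A contract in this sense is just a shared interface definition rendered into every agent's prompt. A minimal sketch of the idea (the `Task` shape mirrors the todo-app example; the rendering helper is hypothetical, not the project's actual code):

```python
from dataclasses import dataclass, fields

@dataclass
class Task:
    """Shared data model every agent must build against."""
    id: str
    description: str
    priority: int = 0

def render_contract(cls) -> str:
    """Render a dataclass as a contract snippet for injection into agent prompts."""
    lines = [f"class {cls.__name__}:"]
    for f in fields(cls):
        type_name = f.type if isinstance(f.type, str) else f.type.__name__
        lines.append(f"    {f.name}: {type_name}")
    return "\n".join(lines)

contract = render_contract(Task)
prompt = f"Build against these interfaces. Do not invent your own shapes.\n\n{contract}"
print(contract)
```

Because the same rendered text goes into every agent's prompt, a storage agent and a CLI agent at least agree on the names and fields they are coding against, which is exactly the mismatch the JSON-vs-SQLite run exposed.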
It turns out the plan isn't just a plan. In a multi-agent system, the plan is the product. A brilliant plan with mediocre agents produces working code. A vague plan with brilliant agents produces beautiful components that don't fit together.

## The Agent Who Lied

PR #4 came back with what looked like a solid result. The test writer reported three test files with detailed coverage summaries. The JSON output was meticulous — file names, function names, which modules each test covered.

Then I checked `tool_call_count: 0`.

The test writer hadn't written a single file. It hadn't even opened a file. It received zero tools — because the skill loader normalized `test_writer` to underscores while the tool registry used `test-writer` with hyphens. The lookup failed silently. The agent got no tools, couldn't do any work, and did what LLMs do when they can't fulfill a request but feel pressure to answer: it made something up. Confidently.

This happened in three of our first four evaluation runs. I called them "phantom agents" — they showed up to work, clocked in, filed a report, and went home without lifting a finger.

The fix had two parts. First, obviously, fix the hyphen/underscore normalization. Second, and more importantly: add a zero-tool-call guard. If an agent that should be writing files reports success with zero tool calls, don't believe it. Nudge it and retry.

The deeper lesson stuck with me: agents will never tell you they failed. They'll report success with elaborate detail. You have to verify what they actually did, not what they said they did.

## The Integrator Who Took Shortcuts

Even with contracts preventing mismatched architectures, merge conflicts still happened when multiple agents touched the same files. The LLM integrator's job was to resolve these conflicts intelligently, preserving logic from both sides. Instead, facing a gnarly conflict in `models.py`, it ran:

```bash
git restore --source=HEAD -- models.py
```

One command.
Silently destroyed one agent's entire implementation — the `Task` class, the constants, the schema version — gone. The integrator committed the lobotomized file and reported "merge resolved successfully."

The downstream damage was immediate. `storage.py` imported symbols that no longer existed. The judge scored 43 out of 100. The fixer agent had to spend five minutes reconstructing the data model from scratch.

But that wasn't even the worst shortcut. On other runs, the integrator replaced conflicting code with:

```python
def add_task(desc, priority=0):
    pass  # TODO: implement storage layer
```

When an LLM is asked to resolve a hard conflict, it'll sometimes pick the easiest valid output — delete everything and write a placeholder. Technically valid Python. Functionally a disaster.

Fixing this required explicit prompt guardrails:

- Never run `git restore --source=HEAD`
- Never replace implementations with `pass  # TODO` placeholders
- When two implementations conflict, keep the more complete one
- After resolving each file, read it back and verify the expected symbols still exist

The lesson: LLMs optimize for the path of least resistance. Under pressure, "valid" and "useful" diverge sharply.

## Demolishing the House for a Leaky Faucet

When the judge scored a run below 70, the original retry strategy was: start over. Re-plan. Re-provision five sandboxes. Re-run all agents. Re-merge. Re-judge. Seven minutes and a non-trivial cloud bill, all because one agent missed an import statement.

This was absurd. Most failures weren't catastrophic — they were close. A missing model field. A broken import. An unhandled error case. The code was 90% right. Starting from scratch was like tearing down a house because the bathroom faucet leaks.

So I built the fixer agent: a premium-tier model that receives the judge's specific complaints and makes surgical edits directly on the integrator's branch. No new sandboxes, no new branches, no merge step. The first time it ran, the score jumped from 43 to 89.5.
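The symbol-verification guardrail, by the way, is cheap to enforce deterministically rather than trusting the integrator's self-report. A sketch of the idea (the helper name is made up; it checks only top-level definitions):

```python
import ast

def surviving_symbols(source: str) -> set[str]:
    """Top-level class, function, and constant names defined in a Python module."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            names.add(node.name)
        elif isinstance(node, ast.Assign):
            names.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return names

# The lobotomized merge result: only a placeholder function survived.
merged = "def add_task(desc, priority=0):\n    pass\n"
expected = {"Task", "add_task", "SCHEMA_VERSION"}
missing = expected - surviving_symbols(merged)
assert missing == {"Task", "SCHEMA_VERSION"}  # the data model was destroyed
```

Running a check like this after each resolved file turns "merge resolved successfully" from a claim into something the orchestrator can falsify before the judge ever sees the branch.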
Three minutes instead of seven. And it solved the problem that actually existed, rather than hoping a second roll of the dice would land better.

Of course, the fixer's first implementation had its own bug — it ran in a new sandbox, created a new branch, and occasionally conflicted with the code it was trying to fix. The fix to the fixer: just edit in place on the integrator's existing sandbox. No branch, no merge, no drama.

## How Others Parallelize (and Why We Went Distributed)

Most multi-agent coding frameworks today parallelize by spawning agents as local processes on a single developer machine. Depending on the framework, there's typically a lead agent or orchestrator that breaks the task down into subtasks, spins up new agents to handle each piece, and combines their work when they finish — often through parallel tmux sessions or subprocess pools sharing a local filesystem. It's simple, it's fast to set up, and for many tasks it works.

But local parallelization hits a ceiling. All agents share one machine's CPU, memory, and disk I/O. Five agents each running `npm install` or `cargo build` compete for the same 32 GB of RAM. There's no true filesystem isolation — two agents can clobber the same file if the orchestrator doesn't carefully sequence writes. Recovery from a crash means restarting the entire local process tree. And scaling from 3 agents to 10 means buying a bigger machine.

Our swarm takes a different approach: fully distributed execution. Each agent runs in its own Docker container with its own filesystem, git clone, and compute allocation — provisioned on AKS, ACA, or any container host. Four agents get four independent resource pools. If one container dies, DTS replays that agent from its last checkpoint in a fresh container without affecting the others. Git branch-per-agent isolation means zero filesystem conflicts by design.
The trade-off is overhead: container provisioning, network latency, and the merge step add wall-clock time that a local tmux setup avoids. On a small two-agent task, local parallelization on a fast laptop probably wins. But for tasks with 4+ agents doing real work — cloning repos, installing dependencies, running builds and tests — independent resource pools and crash isolation matter. Our benchmarks on a 4-agent helpdesk system showed the swarm completing in ~8 minutes with zero resource contention, producing 1,029 lines across 14 files with 4 clean branch merges.

## The Scorecard

After all of this, did the swarm actually beat a single agent? I ran head-to-head benchmarks: same prompt, same model (GPT-5-nano), solo agent vs. swarm, scored by a Sonnet 4.6 judge on a four-criterion rubric. Two tasks — a simple URL shortener (Render.com's benchmark prompt) and a complex helpdesk ticket system. All runs are public — you can review every line of generated code:

| Task | Solo Agent PR | Swarm PR |
| --- | --- | --- |
| URL Shortener | PR #1 | PR #2 |
| Helpdesk System | PR #3 | PR #4 |

| | URL Shortener (Simple) | Helpdesk System (Complex) |
| --- | --- | --- |
| Quality (rubric, /5) | Solo 1.9 → Swarm 2.5 (+32%) | Solo 2.3 → Swarm 2.95 (+28%) |
| Speed | Solo 2.5 min → Swarm 5.5 min (2.2×) | Solo 1.75 min → Swarm ~8 min (~4.5×) |
| Tokens | 7.7K → 30K (3.9×) | 11K → 39K (3.4×) |

The pattern held across both tasks: +28–32% quality improvement, at the cost of 2–4× more time and ~3.5× more tokens. On the complex task, the quality gains broadened — the swarm produced better code organization (3/5 vs 2/5), actually wrote tests (code:test ratio 0 → 0.15), and generated 5× more files with cleaner decomposition. On the simple task, the gap came entirely from security practices: environment variables, parameterized queries, and proper `.gitignore` rules that the solo agent skipped entirely.

Industry benchmarks from AIMultiple and Render.com show single CLI agents averaging 10–15 minutes on comparable full-stack tasks.
Our swarm runs in 5–12 minutes depending on parallelizability — but the real win is quality, not speed. Specialized agents with a narrow, well-defined scope tend to be more thorough: the solo agent skipped tests and security practices entirely, while the swarm's dedicated agents actually addressed them. Two out of three pillars — with a caveat the size of your task. On small, tightly-coupled problems, just use one good agent. On larger, parallelizable work with three or more independent modules? The swarm earns its keep. What I Actually Learned The Rules That Stuck Contract-first planning. Define interfaces before writing implementations. The plan isn’t just a guide — it’s the product. Deterministic before LLM. Try git merge before calling the LLM integrator. Run ruff check before asking an agent to debug. Use code when you can; use AI when you must. Validate actions, not claims. An agent that reports “merge resolved successfully” may have deleted everything. Check tool call counts. Read the actual diff. Trust nothing. Cheap recovery over expensive retries. A fixer agent that patches one file beats re-running five agents from scratch. The cost of failure should be proportional to the failure. Not every problem needs a swarm. If the task fits in one agent’s context window, adding four more just adds overhead. The sweet spot is 3+ genuinely independent modules. The Bigger Picture The biggest surprise? Building a multi-agent AI system is more about software engineering than AI engineering. The hard problems weren’t prompt design or model selection — they were contracts between components, isolation of concerns, idempotent operations, observability, and recovery strategies. Principles that have been around since the 1970s. The agents themselves are almost interchangeable. Swap GPT for Claude, change the temperature, fine-tune the system prompt — it barely matters if your orchestration is broken. 
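The "deterministic before LLM" and "validate actions, not claims" rules reduce to a few lines of orchestration logic. A sketch with hypothetical callables standing in for the real merge and LLM steps (not the swarm's actual API):

```python
def integrate(branches, try_git_merge, llm_resolve):
    """Merge agent branches, cheapest mechanism first. `try_git_merge`
    and `llm_resolve` are stand-ins that return True on success."""
    outcomes = {}
    for branch in branches:
        if try_git_merge(branch):       # deterministic path: no tokens spent
            outcomes[branch] = "git"
        elif llm_resolve(branch):       # LLM integrator only on real conflicts
            outcomes[branch] = "llm"
        else:
            outcomes[branch] = "fixer"  # escalate to the in-place fixer agent
    return outcomes

def validate_merge(claimed_ok, tool_calls, diff_lines):
    """Validate actions, not claims: a 'successful' merge with zero tool
    calls or an empty diff is treated as suspect."""
    return claimed_ok and tool_calls > 0 and diff_lines > 0
```

The cost gradient is the point: code first, LLM second, a narrowly-scoped fixer last — and no agent's self-report is trusted without checking what it actually did.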
What matters is how you decompose work, how you share context, how you merge results, and how you recover from failure. Get the engineering right, and the AI just works. Get it wrong, and no model on earth will save you.

By the Numbers

The codebase is ~7,400 lines of Python across 230 tests and 141 commits. Over 10+ evaluation runs, the swarm processed a combined ~200K+ tokens, merged 20+ branches, and resolved conflicts ranging from trivial (package.json version bumps) to gnarly (overlapping data models). It’s built on Azure Durable Task Scheduler, Microsoft Agent Framework, and containerized sandboxes that run anywhere Docker does — AKS, ACA, or a plain docker run on your laptop. And somewhere in those 141 commits is a one-line fix for an integer-vs-string bug that took me an embarrassingly long time to find.

References

Azure Durable Task Scheduler — Deterministic workflow orchestration with replay, checkpointing, and fan-out/fan-in patterns.
Microsoft Agent Framework (MAF) — Python agent framework for tool-calling, middleware, and structured output.
Azure Kubernetes Service (AKS) — Managed Kubernetes for running containerized agent workloads at scale.
Azure Container Apps (ACA) — Serverless container platform for simpler deployments.
Azure OpenAI Service — Hosts the GPT models used by planner, coder, and judge agents.

Built with Azure DTS, Microsoft Agent Framework, and containerized sandboxes (Docker, AKS, ACA — your choice). And a lot of grep through log files.

After Ingress NGINX: Migrating to Application Gateway for Containers
If you're running Ingress NGINX on AKS, you've probably seen the announcements by now. The community Ingress NGINX project is being retired, upstream maintenance ends in March 2026, and Microsoft's extended support for the Application Routing add-on runs out in November 2026. Migrating to another solution is unavoidable. There are a few places you can go. This post focuses on Application Gateway for Containers: what it is, why it's worth the move, and how to actually do it. Microsoft has also released a migration utility that handles most of the translation work from your existing Ingress resources, so we'll cover that too.

Ingress NGINX Retirement

Ingress NGINX has been the default choice for Kubernetes HTTP routing for years. It's reliable, well-understood, and it appears in roughly half the "getting started with AKS" tutorials on the internet. So the retirement announcement caught a lot of teams off guard. In November 2025, the Kubernetes SIG Network and Security Response Committee announced that the community ingress-nginx project would enter best-effort maintenance until March 2026, after which there will be no further releases, bug fixes, or security patches. It had been running on a small group of volunteers for years, had accumulated serious technical debt from its flexible annotation model, and the maintainers couldn't sustain it. For AKS, the timeline depends on how you're running it. If you self-installed via Helm, you're directly exposed to the March 2026 upstream deadline; after that, you're on your own for CVEs. If you're using the Application Routing add-on, Microsoft has committed to critical security patches until November 2026, but nothing beyond that. No new features, no general bug fixes.

Application Gateway for Containers

Application Gateway for Containers (AGC) is Azure's managed Layer 7 load balancer for AKS, and it's the successor to both the classic Application Gateway Ingress Controller and the Ingress API approach more broadly.
It went GA in late 2024 and added WAF support in November 2025. The architecture splits across two planes. On the Azure side, you have the AGC resource itself, a managed load balancer that sits outside your cluster and handles the actual traffic. It has child resources for frontends (the public entry points, each with an auto-generated FQDN) and an association that links it to a dedicated delegated subnet in your VNet. Unlike the older App Gateway Ingress Controller, AGC is a standalone Azure resource; you don't deploy a separate App Gateway instance. On the Kubernetes side, the ALB Controller runs as a small deployment in your cluster. It watches for Gateway API resources (Gateways, HTTPRoutes, and the various AGC policy types) and translates them into configuration on the AGC resource. When you create or update an HTTPRoute, the controller picks it up and pushes the changes to the data plane. AGC supports both Gateway API and the Ingress API, which means you don't have to convert everything to Gateway API resources in one shot. Gateway API is where the richer functionality lives, though, so it's worth planning the move to it. For deployment, you have two options:

Bring Your Own (BYO) — you create the AGC resource, frontend, and subnet association in Azure yourself using the CLI, portal, Bicep, Terraform, or whatever tool you prefer. The ALB Controller then references the resource by ID. This gives you full control over the Azure-side lifecycle and fits well into existing IaC pipelines.

Managed by ALB Controller — you define an ApplicationLoadBalancer custom resource in Kubernetes and the ALB Controller creates and manages the Azure resources for you. Simpler to get started, but the Azure resource lifecycle is tied to the Kubernetes resource — which some teams find uncomfortable for production workloads.

One prerequisite worth flagging upfront: AGC requires Azure CNI or Azure CNI Overlay.
Kubenet has been deprecated and will be fully retired in 2028, so if you're on Kubenet, you'll need to plan a CNI migration alongside this work. There is an in-place cluster migration process that lets you do this without rebuilding your cluster.

Why Choose AGC Over Other Alternatives?

AGC's architecture is different from running an in-cluster ingress controller, and worth understanding before you start. The data plane runs outside your cluster entirely. With NGINX you're running pods that consume node resources, need upgrading, and can themselves become a reliability concern. With AGC, that's Azure's problem. You're not patching an ingress controller or sizing nodes around it. The ALB Controller does run a small number of pods in your cluster, but they're lightweight, watching Kubernetes resources and syncing configuration to the Azure data plane. They're not in the traffic path, and their resource footprint is minimal. Ingress and HTTPRoute resources still reference Kubernetes Services as usual. Application Gateway for Containers runs an Azure‑managed data plane outside the cluster and routes traffic directly to backend pod IPs using Kubernetes Endpoint/EndpointSlice data, rather than relying on in‑cluster ingress pods. This enables faster convergence as pods scale and allows health probing and traffic management to be handled at the gateway layer. WAF is built in, using the same Azure WAF policies you might already have. If you're currently running a separate Application Gateway in front of your cluster purely for WAF, AGC removes that extra hop and gives you one fewer resource to keep current. Configuration changes push to the data plane near-instantly, without a reload cycle. NGINX reloads its config when routes change, which is mostly fine, but noticeable if you're in a high-churn environment with frequent deployments. Building on Gateway API from the start also means you're not doing this twice. It's where Kubernetes ingress is heading, and AGC fully supports it.
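Stepping back to the per-pod routing point above: the gateway's data plane only needs the addresses of endpoints whose ready condition is true. A simplified sketch, with the EndpointSlice modeled as a plain dict that mirrors the Kubernetes schema:

```python
def ready_addresses(endpoint_slice):
    """Collect pod IPs a gateway data plane would treat as routable."""
    ips = []
    for ep in endpoint_slice.get("endpoints", []):
        if ep.get("conditions", {}).get("ready"):
            ips.extend(ep.get("addresses", []))
    return ips

slice_obj = {
    "kind": "EndpointSlice",
    "endpoints": [
        {"addresses": ["10.244.1.5"], "conditions": {"ready": True}},
        {"addresses": ["10.244.2.9"], "conditions": {"ready": False}},  # still starting up
    ],
}
print(ready_addresses(slice_obj))  # → ['10.244.1.5']
```

In reality the ALB Controller consumes these objects through the Kubernetes API watch stream; the dict here just mirrors the shape of the data it acts on.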
By taking advantage of the Gateway API, you define your configuration once in a proxy-agnostic manner and can easily switch the underlying proxy later if you need to, avoiding vendor lock-in.

Planning Your Migration

Before you run any tooling or touch any manifests, spend some time understanding what you actually have. Start by inventorying your Ingress NGINX resources across all clusters and namespaces. You want to know how many Ingress objects you have, which annotations they're using, and whether there's anything non-standard such as custom snippets, Lua configurations, or anything that leans heavily on NGINX-specific behaviour. The migration utility will flag most of this, but knowing upfront means fewer surprises. Next, confirm your cluster prerequisites: AGC requires Azure CNI or Azure CNI Overlay plus workload identity, so check that workload identity is enabled on your cluster, and if you're on Kubenet, that CNI migration needs to happen first. Then decide on your deployment model before generating any output. BYO gives you full control over the AGC resource lifecycle and slots into existing IaC pipelines cleanly, but requires you to pre-create Azure resources. Managed is simpler to get started with but ties the Azure resource lifecycle to Kubernetes objects, which can feel uncomfortable for production workloads. Finally, decide whether you want to migrate from Ingress API to Gateway API as part of this work, or keep your existing Ingress resources and just swap the controller. AGC supports both. Doing both at once is more work upfront but gets you to the right place in a single migration. Keeping Ingress resources is lower risk in the short term, but you'll need to do the API migration later regardless.

Introducing the AGC Migration Utility

Microsoft released the AGC Migration Utility in January 2026 as an official CLI tool to handle the conversion of existing Ingress resources to Gateway API resources compatible with AGC.
It doesn't modify anything on your cluster. It reads your existing configuration and generates YAML you can review and apply when you're ready. One thing to be aware of is that the migration utility only generates Gateway API resources, so if you use it, you're moving off the Ingress API at the same time as moving off NGINX. There's no flag to produce Ingress resources for AGC instead. If you want to land on AGC but keep Ingress resources for now, you'll need to set that up manually. There are two input modes. In files mode, you point it at a directory of YAML manifests and it converts them locally without needing cluster access. In cluster mode, it connects to your current kubeconfig context and reads Ingress resources directly from a live cluster. Both produce the same output. Alongside the converted YAML, the utility produces a migration report covering every annotation it encountered. Each annotation gets a status: completed, warning, not-supported, or error. The warning and not-supported statuses are where you'll need to do some manual work. These represent annotations that either migrated with caveats, or have no AGC equivalent at all. The coverage of NGINX annotations is broad: URL rewrites, SSL redirects, session affinity, backend protocol, mTLS, WAF, canary routing by weight or header, permanent and temporary redirects, and custom hostnames. Most of the common patterns are covered. Before you run a full conversion, it's worth doing a --dry-run pass first to get a clear picture of what needs manual attention.

Migrating Step by Step

With prerequisites confirmed and your deployment model chosen, here's how the migration looks end to end.

1. Get the utility

Pre-built binaries for Linux, macOS, and Windows are available on the GitHub releases page. Download the binary for your platform and make it executable. If you'd prefer to build from source, clone the repo and run ./build.sh from the root, which produces binaries in the bin folder.

2.
Run a dry-run against your manifests

Before generating any output, run in dry-run mode to see what the migration report looks like. This tells you which annotations are fully supported, which need manual attention, and which have no AGC equivalent.

./agc-migration files --provider nginx --ingress-class nginx --dry-run ./manifests/*.yaml

If you'd rather read directly from your cluster, use cluster mode:

./agc-migration cluster --provider nginx --ingress-class nginx --dry-run

3. Review the migration report

Work through the report before proceeding. Anything marked not-supported needs a plan. The next section covers the most common gaps, but the report itself includes specific recommendations for each issue it finds.

4. Set up AGC and install the ALB Controller

Before applying any generated resources you need AGC running in Azure and the ALB Controller installed in your cluster. The setup process is well documented, so rather than reproduce it here, follow the official quickstart at aka.ms/agc. Make sure you note the resource ID of your AGC instance if you're using BYO deployment, as you'll need it in the next step.

5. Generate the converted resources

Run the utility again with your chosen deployment flag to generate output:

# BYO
./agc-migration files --provider nginx --ingress-class nginx \
  --byo-resource-id $AGC_ID \
  --output-dir ./output \
  ./manifests/*.yaml

# Managed
./agc-migration files --provider nginx --ingress-class nginx \
  --managed-subnet-id $SUBNET_ID \
  --output-dir ./output \
  ./manifests/*.yaml

6. Review and apply the generated resources

Check the generated Gateway, HTTPRoute, and policy resources against your expected routing behaviour before applying anything. Apply to a non-production cluster first if you can.

kubectl apply -f ./output/

7. Validate and cut over

Test your routes before updating DNS.
Running both NGINX and AGC in parallel while you validate is a sensible approach; route test traffic to AGC while NGINX continues serving production, then update your DNS records to point to the AGC frontend FQDN once you're satisfied. 8. Decommission NGINX Once traffic has been running through AGC cleanly, uninstall the NGINX controller and remove the old Ingress resources. Two ingress controllers watching the same resources will cause confusion sooner or later. What the Migration Utility Doesn't Handle The utility covers a lot of ground, but there are some gaps you should be clear on. Annotations marked not-supported in the migration report have no direct AGC equivalent and won't appear in the generated output. The most common for NGINX users are custom snippets and lua-based configurations, which allow arbitrary NGINX config to be injected directly into the server block. There's no equivalent in AGC or Gateway API. If you're relying on these, you'll need to work out whether AGC's native routing capabilities can cover the same requirements through HTTPRoute filters, URL rewrites, or header manipulation. The utility doesn't migrate TLS certificates or update certificate references in the generated resources. Your existing Kubernetes Secrets containing certificates should carry over without changes, but verify that the Secret references in your generated Gateway and HTTPRoute resources are correct before cutting over. DNS cutover is outside the scope of the utility entirely. Once your AGC frontend is provisioned it gets an auto-generated FQDN, and you'll need to update your DNS records or CNAME entries accordingly. Any GitOps or CI/CD pipelines that reference your Ingress resources by name or apply them from a specific path will also need updating to reflect the new Gateway API resource types and output structure. Conclusion For many, the retirement of Ingress NGINX is unwanted complexity and extra work. 
If you have to migrate though, you can use it as an opportunity to land on a significantly better architecture: Gateway API as your routing layer, WAF and per-pod load balancing built in, and an ingress controller that's fully managed by Azure rather than running in your cluster. The migration utility can take care of a lot of the mechanical conversion work. Rather than manually rewriting Ingress resources into Gateway API equivalents and mapping NGINX annotations to their AGC counterparts, the utility does that translation for you and produces a migration report that tells you exactly what it couldn't handle. Running a dry-run against your manifests is a good first step to get a clear picture of your annotation coverage and what needs manual attention before you commit to a timeline. Full documentation for AGC is at aka.ms/agc and the migration utility repo is at github.com/Azure/Application-Gateway-for-Containers-Migration-Utility. The Ingress NGINX retirement is close: the standalone implementation retires at the end of March 2026. Using the App Routing add-on for AKS gives you a little breathing room until November 2026, but that's still not long. Make sure you have a solution in place before these dates to avoid running unsupported and potentially vulnerable software on your critical infrastructure.

Help wanted: Refresh articles in Azure Architecture Center (AAC)
I’m the Project Manager for architecture review boards (ARBs) in the Azure Architecture Center (AAC). We’re looking for subject matter experts to help us improve the freshness of the AAC, Cloud Adoption Framework (CAF), and Well-Architected Framework (WAF) repos. This opportunity is currently limited to Microsoft employees only. As an ARB member, your main focus is to review, update, and maintain content to meet quarterly freshness targets. Your involvement directly impacts the quality, relevance, and direction of Azure Patterns & Practices content across AAC, CAF, and WAF. The content in these repos reaches almost 900,000 unique readers per month, so your time investment has a big, global impact. The expected commitment is 4-6 hours per month, including attendance at weekly or bi-weekly sync meetings. Become an ARB member to gain: Increased visibility and credibility as a subject‑matter expert by contributing to Microsoft‑authored guidance used by customers and partners worldwide. Broader internal reach and networking without changing roles or teams. Attribution on Microsoft Learn articles that you own. Opportunity to take on expanded roles over time (for example, owning a set of articles, mentoring contributors, or helping shape ARB direction). We’re recruiting new members across several ARBs. Our highest needs are in the Web ARB, Containers ARB, and Data & Analytics ARB: The Web ARB focuses on modern web application architecture on Azure—App Service and PaaS web apps, APIs and API Management, ingress and networking (Application Gateway, Front Door, DNS), security and identity, and designing for reliability, scalability, and disaster recovery. The Containers ARB focuses on containerized and Kubernetes‑based architectures—AKS design and operations, networking and ingress, security and identity, scalability, and reliability for production container platforms. 
The Data & Analytics ARB focuses on data platform and analytics architectures—data ingestion and integration, analytics and reporting, streaming and real‑time scenarios, data security and governance, and designing scalable, reliable data solutions on Azure. We’re also looking for people to take ownership of other articles across AAC, CAF, and WAF. These articles span many areas, including application and solution architectures, containers and compute, networking and security, governance and observability, data and integration, and reliability and operational best practices. You don’t need to know everything—deep expertise in one or two areas and an interest in keeping Azure architecture guidance accurate and current is what matters most. Please reply to this post if you’re interested in becoming an ARB member, and I’ll follow up with next steps. If you prefer, you can email me at v-jodimartis@microsoft.com. Thanks! 🙂

Health-Aware Failover for Azure Container Registry Geo-Replication
Azure Container Registry (ACR) supports geo-replication: one registry resource with active-active (primary-primary), write-enabled geo-replicas across multiple Azure regions. You can push or pull through any replica, and ACR asynchronously replicates content and metadata to all other replicas using an eventual consistency model. For geo-replicated registries, ACR exposes a global endpoint like contoso.azurecr.io; that URL is backed by Azure Traffic Manager, which routes requests to the replica with the best network performance profile (usually the closest region). That's the promise. But TM routing at the global endpoint was latency-aware, not fully workload-health-aware: it could see whether the regional front door responded, but not whether that region could successfully serve real pull and push traffic end to end. This post walks through how we connected ACR Health Monitor's deep dependency checks to Traffic Manager so the global endpoint avoids routing to degraded replicas, improving failover outcomes and reducing customer-facing errors during regional incidents. The Problem: Healthy on the Outside, Broken on the Inside Traffic Manager routes traffic using performance-based routing, directing each DNS query to the endpoint with the lowest latency for the caller. To decide whether an endpoint is viable, TM periodically probes a health endpoint — and for ACR, that health check tested exactly one thing: is the reverse proxy responding? The problem is that a container registry is much more than a web server. A successful docker pull touches storage (where layers and manifests live), caching infrastructure, authentication and authorization services, and the metadata service. Any one of those backend dependencies can fail independently while the reverse proxy keeps happily returning 200 OK to Traffic Manager's health probes. 
This meant that during real outages — a storage degradation in a region, a caching failure, an authentication service disruption — Traffic Manager had no idea anything was wrong. It kept sending customers straight into a broken region, and those customers got 500 errors on their pull and push operations. We saw this pattern play out across multiple incidents: storage degradations, caching failures, VM outages, and full datacenter events — each lasting hours, all cases where geo-replicated registries had healthy replicas in other regions that could have served traffic, but Traffic Manager kept routing to the degraded region because the shallow health check passed. The Manual Workaround (and Its Failure Mode) Customers could work around this by manually disabling the affected endpoint: az acr replication update --name contoso --region eastus --region-endpoint-enabled false But this required customers to detect the outage, identify the affected region, and manually disable the endpoint — all during an active incident. Worse, in the most severe scenarios, the manual workaround could not be reliably executed. The endpoint-disable operation itself routes through the regional resource provider — the very infrastructure that's degraded. You can't tell the control plane to reroute traffic away from a region when the control plane in that region is the thing that's down. Customers were stuck. How Health Monitor Solves This ACR runs an internal service called Health Monitor within its data plane infrastructure. Its original job was narrowly scoped: it tracked the health of individual nodes so that the load balancer could route traffic to healthy instances within a region. What it didn't do was share that health signal with Traffic Manager for cross-region routing. We extended Health Monitor with a new deep health endpoint that aggregates the health status of multiple critical data plane dependencies. 
Rather than just asking "is the reverse proxy up?", this endpoint answers the real question: "can this region actually serve container registry requests right now?" Before we walk through the implementation details, here is a simplified before-and-after view:

(Diagram: request routing before and after health-aware failover.)

What Gets Checked

The deep health endpoint evaluates the availability of:

Storage — The storage layer that holds image layers and manifests. This is the most fundamental dependency; if storage is unreachable, no image operations can succeed.
Caching infrastructure — Used for caching and distributed coordination. Failures here degrade push operations and can affect pull latency.
Container availability — The health of the internal services that process registry API requests.
Authentication services — The authorization pipeline that validates whether a caller has permission to pull or push.
Metadata service — For registries using metadata search capabilities, the metadata service is also monitored.

If the health evaluation determines that the region cannot reliably serve requests, the endpoint returns unhealthy. Traffic Manager sees the failure, degrades the endpoint, and routes subsequent DNS queries to the next-lowest-latency replica — all automatically, with no customer intervention required.

Per-Registry Intelligence

Getting regional health right was the first step — but we needed to go further. A blunt "is the region healthy?" check would be too coarse. In each region, ACR distributes customer data across a large pool of storage accounts. A storage degradation might affect only a subset of those accounts — meaning most registries in the region are fine, and only those whose data lives on the affected accounts need to fail over. Health Monitor evaluates health on a per-registry basis. When a Traffic Manager probe arrives, Health Monitor determines which backing resources that specific registry depends on and evaluates health against those specific resources — not the region's overall health.
This means that if contoso.azurecr.io depends on resources that are experiencing errors but fabrikam.azurecr.io depends on healthy ones in the same region, only Contoso's traffic gets rerouted. Fabrikam keeps getting served locally with no unnecessary latency penalty. The same per-registry logic applies to other dependencies. If a registry has metadata search enabled and the metadata service is down, that registry's endpoint goes unhealthy. If another registry in the same region doesn't use metadata search, it stays healthy. Tuning for Stability Failing over too eagerly is almost as bad as not failing over at all. A transient blip shouldn't send traffic across the continent. We tuned the thresholds so that the endpoint is only marked unhealthy after a sustained pattern of failures — not a single transient error. The end-to-end failover timing — from the onset of a real dependency failure through Health Monitor detection, Traffic Manager probe cycles, and DNS TTL propagation — is on the order of minutes, not seconds. This is deliberately conservative: fast enough to catch real regional degradation, but slow enough to ride out the kind of transient errors that resolve on their own. For context, Traffic Manager itself probes endpoints every 30 seconds and requires multiple consecutive failures before degrading an endpoint, and DNS TTL adds additional propagation delay before all clients switch to the new region. It's worth noting that DNS-based failover has an inherent limitation: even after Traffic Manager updates its DNS response, existing clients may continue reaching the degraded endpoint until their local DNS cache expires. Docker daemons, container runtimes, and CI/CD systems all cache DNS resolutions. The failover is not instantaneous — but it is automatic, which is a dramatic improvement over the previous state where failover either required manual intervention or simply didn't happen. 
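The per-registry scoping and the consecutive-failure threshold described above can be modeled in a few lines. Everything here is illustrative — the names, numbers, and logic are a toy model, not ACR internals:

```python
class RegistryProbe:
    """Evaluate one registry against only its own dependencies, and flip
    to unhealthy only after `threshold` consecutive failing probes."""
    def __init__(self, deps, threshold=3):
        self.deps = set(deps)
        self.threshold = threshold
        self.failures = 0

    def probe(self, degraded):
        # A failure counts only if one of *this registry's* resources is degraded.
        self.failures = self.failures + 1 if self.deps & set(degraded) else 0
        return "unhealthy" if self.failures >= self.threshold else "healthy"

degraded = {"storage-acct-17"}
contoso = RegistryProbe({"storage-acct-17", "auth", "metadata"})
fabrikam = RegistryProbe({"storage-acct-03", "auth"})

# Contoso degrades only after sustained failures; Fabrikam never does.
print([contoso.probe(degraded) for _ in range(3)])   # → ['healthy', 'healthy', 'unhealthy']
print([fabrikam.probe(degraded) for _ in range(3)])  # → ['healthy', 'healthy', 'healthy']
```

The counter reset on any successful probe is what makes the check ride out transient blips instead of failing over on the first error.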
Health Monitor's Own Resilience A natural question: what happens if Health Monitor itself fails? Health Monitor is designed to fail-open. If the monitor process is unable to evaluate dependencies — because it has crashed, is restarting, or cannot reach a dependency to check its status — the health endpoint returns healthy, preserving the pre-existing routing behavior. This ensures that a Health Monitor failure cannot itself cause a false failover. The system degrades gracefully back to the original latency-based routing rather than introducing a new failure mode. How Routing Changed The change is transparent to customers. They still access their registry through the same myregistry.azurecr.io hostname. The difference is that the system behind that hostname is now actively steering them away from degraded regions instead of blindly routing on latency alone. What Customers Should Know For registries with geo-replication enabled, this improvement is automatic — no configuration changes or action required: Pull operations benefit the most. When traffic is rerouted to a healthy replica, image layers are served from that replica's storage. For images that have completed replication to the target region, pulls succeed seamlessly. For recently pushed images that haven't yet replicated, a pull from the failover region may not find the image until replication catches up. If your workflow pushes an image and immediately pulls from a different region, consider building in retry logic or checking replication status before pulling. Push operations are more nuanced. If failover or DNS re-resolution happens during an in-flight push, that push can fail and may need to be retried. This failure mode is not new to health-aware failover; it can already occur when DNS resolves a client to a different region during a push. During failover, customers should expect both higher push latency and a higher chance of retries for long-running uploads. 
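That retry logic can be as simple as an exponential backoff wrapper. A sketch where `pull` stands in for your registry client call (an assumption for illustration, not a real ACR SDK API):

```python
import time

def pull_with_retry(pull, attempts=5, base_delay=1.0):
    """Retry a pull with exponential backoff to ride out DNS failover
    and replication lag; re-raise once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return pull()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The same wrapper works for pushes, provided the publish step is idempotent so a retried upload cannot leave a half-published artifact behind.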
For production pipelines, use retry logic and design publish steps to be idempotent. Single-region registries are unaffected by this change. Traffic Manager is only involved when replicas exist; registries without geo-replication continue to route directly to their single region. In the edge case where the only region is degraded, Traffic Manager has nowhere else to route, so it continues routing to the original endpoint — the same behavior as before. Observability When a failover occurs, customers can observe the routing change through several signals: Increased pull latency from a different region — if your monitoring shows image pull times increasing, it may indicate traffic has been rerouted to a more distant replica. Azure Resource Health — check the Resource Health blade for your registry to see if there's a known issue in your primary region. Replication status — the replication health API shows the status of each replica, which can help confirm whether a specific region is experiencing issues. We're actively working on improving the observability story here — including richer signals for when routing changes occur and which region is currently serving your traffic. Rollout and Safety We rolled this out incrementally, following Azure's safe deployment practices across ring-based deployment stages. The migration involved updating each registry's Traffic Manager configuration to use the new deep health evaluation. This is controlled at the Traffic Manager level, making it straightforward to roll back a specific registry or region if needed. We also built in safeguards to quickly revert to previous routing behavior if needed. If Health Monitor's deep health evaluation were to malfunction and falsely report regions as unhealthy, we can disable it and revert to the original pass-through behavior — the same shallow health check as before — as a safety net. 
The Outcome

Since rolling out Health Monitor-based routing, geo-replicated registries now automatically fail over during the types of regional degradation events that previously required manual intervention or resulted in extended customer impact. The classes of incidents we tracked — storage outages, caching failures, VM disruptions, and authentication service degradation — now trigger automatic rerouting to healthy replicas.

This is one piece of a broader effort to improve ACR's resilience for geo-replicated registries. Other recent and ongoing work includes improving replication consistency for rapid tag overwrites, enabling cross-region pull-through for images that haven't finished replicating, and optimizing the replication service's resource utilization for large registries.

Geo-replication has always been ACR's answer to multi-region availability. Health Monitor makes sure that promise holds when it matters most — when something goes wrong.

To learn more about ACR geo-replication, see Geo-replication in Azure Container Registry. To configure geo-replication for your registry, see Enable geo-replication.

Announcing general availability for the Azure SRE Agent
Today, we're excited to announce the General Availability (GA) of Azure SRE Agent — your AI-powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

Proactive Health Monitoring and Auto-Communication Now Available for Azure Container Registry
Today, we're introducing Azure Container Registry's (ACR) latest service health enhancement: automated communication through Azure Service Health alerts. When ACR detects degradation in critical operations—authentication, image push, and pull—your teams are now proactively notified through Azure Service Health, delivering better transparency and faster communication without waiting for manual incident reporting.

For platform teams, SRE organizations, and enterprises with strict SLA requirements, this means container registry health events are now communicated automatically and integrated into your existing incident management and observability workflows.

Background: Why Registry Availability Matters

Container registries sit at the heart of modern software delivery. Every CI/CD pipeline build, every Kubernetes pod startup, and every production deployment depends on the ability to authenticate, push artifacts, and pull images reliably. When a registry experiences degradation—even briefly—the downstream impact can cascade quickly: failed pipelines, delayed deployments, and application startup failures across multiple clusters and environments.

Until now, ACR customers discovered service issues primarily through two paths: monitoring their own workloads for symptoms (failed pulls, auth errors), or checking the Azure Status page reactively. Neither approach gives your team the head start needed to coordinate an effective response before impact is felt.

Auto-Communication Through Azure Service Health Alerts

ACR now provides faster communication when:

- Degradation is detected in your region
- Automated remediation is in progress
- Engineering teams have been engaged and are actively mitigating

These notifications arrive through Azure Service Health, the same platform your teams already use to track planned maintenance and health advisories across all your Azure resources.
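These notifications only reach your team if an activity log alert rule routes them somewhere. As a rough, illustrative sketch of the programmatic route, here is the kind of rule body an ARM or Bicep template would declare; the field names are assumed from the common Microsoft.Insights/activityLogAlerts shape and the placeholder IDs are deliberate, so verify against the current ARM reference before use:

```python
import json

# Sketch of a Service Health activity log alert rule body (schema assumed;
# check the Microsoft.Insights/activityLogAlerts ARM reference).
alert_rule = {
    "type": "Microsoft.Insights/activityLogAlerts",
    "name": "acr-service-health-alert",
    "location": "Global",
    "properties": {
        "enabled": True,
        "scopes": ["/subscriptions/<subscription-id>"],
        "condition": {
            "allOf": [
                # Service Health events surface in the activity log
                # under the "ServiceHealth" category.
                {"field": "category", "equals": "ServiceHealth"},
                # Narrow the rule to Container Registry events.
                {"field": "properties.impactedServices[*].ServiceName",
                 "containsAny": ["Azure Container Registry"]},
            ]
        },
        "actions": {
            "actionGroups": [
                {"actionGroupId": "/subscriptions/<subscription-id>"
                                  "/resourceGroups/<rg>/providers"
                                  "/microsoft.insights/actionGroups/<group>"}
            ]
        },
    },
}

template_fragment = json.dumps(alert_rule, indent=2)
```

The action group referenced at the end is where email, SMS, or webhook targets (PagerDuty, Opsgenie, ServiceNow) are attached.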
You receive timely visibility into registry health events—with rich context including tracking IDs, affected regions, impacted resources, and mitigation timelines—without needing to open a support request or continuously monitor dashboards.

Who Benefits

This capability delivers value across every team that depends on container registry availability:

- Enterprise platform teams managing centralized registries for large organizations will receive early warning before CI/CD pipelines begin failing across hundreds of development teams.
- SRE organizations can integrate ACR health signals into their existing incident management workflows—via webhook integration with PagerDuty, Opsgenie, ServiceNow, and similar tools—rather than relying on synthetic monitoring or customer reports.
- Teams with strict SLA requirements can now correlate production incidents with documented ACR service events, supporting post-incident reviews and customer communication.
- All ACR customers gain a level of registry observability that previously required custom monitoring infrastructure to approximate.

A Part of ACR's Broader Observability Strategy

Automated Service Health communication is one component of ACR's ongoing investment in service health and observability. Combined with Azure Monitor metrics and diagnostic logs and events, Service Health alerts give your teams a layered observability posture:

| Signal | What It Tells You |
| --- | --- |
| Service Health alerts | ACR-wide service events in your regions, with official mitigation status |
| Azure Monitor metrics | Registry-level request rates, success rates, and storage utilization (available soon) |
| Diagnostic logs | Repository and operation-level audit trail |

What's next: We are working on exposing additional ACR metrics through Azure Monitor, giving you deeper visibility into registry operations—such as authentication, pull and push API requests, and error breakdowns—directly in the Azure portal.
This will enable self-service diagnostics, allowing your teams to investigate and troubleshoot registry issues independently without opening a support request.

Getting Started

To configure Service Health alerts for ACR, navigate to Service Health in the Azure portal, create an alert rule filtering on Container Registry, and attach an action group with your preferred notification channels (email, SMS, webhook). Alerts can also be created programmatically via ARM templates or Bicep for infrastructure-as-code workflows.

For the full step-by-step setup guide—including recommended alert configurations for production-critical, maintenance awareness, and comprehensive monitoring scenarios—see Configure Service Health alerts for Azure Container Registry.

Microsoft Azure at KubeCon Europe 2026 | Amsterdam, NL - March 23-26
Microsoft Azure is coming back to Amsterdam for KubeCon + CloudNativeCon Europe 2026 in two short weeks, from March 23-26! As a Diamond Sponsor, we have a full week of sessions, hands-on activities, and ways to connect with the engineers behind AKS and our open-source projects. Here's what's on the schedule:

Azure Day with Kubernetes: 23 March 2026

Before the main conference begins, join us at Hotel Casa Amsterdam for a free, full-day technical event built around AKS (registration required for entry - capacity is limited!). Whether you're early in your Kubernetes journey, running clusters at scale, or building AI apps, the day is designed to give you practical guidance from Microsoft product and engineering teams.

Morning sessions cover what's new in AKS, including how teams are building and running AI apps on Kubernetes. In the afternoon, pick your track:

- Hands-on AKS Labs: Instructor-led labs to put the morning's concepts into practice.
- Expert Roundtables: Small-group conversations with AKS engineers on topics like security, autoscaling, AI workloads, and performance. Bring your hard questions.
- Evening: Drinks on us.

Capacity is limited, so secure your spot before it closes: aka.ms/AKSDayEU

KubeCon + CloudNativeCon: 24-26 March 2026

There will be lots going on at the main conference! Here's what to add to your calendar:

- Keynote (24 March): Jorge Palma takes the stage to tackle a question the industry is actively wrestling with: can AI agents reliably operate and troubleshoot Kubernetes at scale, and should they?
- Customer Keynote (24 March): Wayve's Mukund Muralikrishnan shares how they handle GPU scheduling across multi-tenant inference workloads using Kueue, providing a practical look at what production AI infrastructure actually requires.
- Demo Theatre (25 March): Anson Qian and Jorge Palma walk through a Kubernetes-native approach to cross-cloud AI inference, covering elastic autoscaling with Karpenter and GPU capacity scheduling across clouds.
- Sessions: Microsoft engineers are presenting across all three days on topics including multi-cluster networking, supply chain security, observability, Istio in production, and more. Full list below.
- Find our team in the Project Pavilion at kiosks for Inspektor Gadget, Headlamp, Drasi, Radius, Notary Project, Flatcar, ORAS, Ratify, and Istio.
- Brendan Burns, Kubernetes co-founder and Microsoft CVP & Technical Fellow, will also share his thoughts on the latest developments and key Microsoft announcements related to open-source, cloud native, and AI application development in his KubeCon Europe blog on March 24.

Come find us at Microsoft Azure booth #200 all three days. We'll be running short demos and sessions on AKS, running Kubernetes at scale, AI workloads, and cloud-native topics throughout the show, plus fun activations and opportunities to unlock special swag.

Read on below for full details on our KubeCon sessions and booth theater presentations:

Sponsored Keynote

Date: Tues 24 March 2026
Start Time: 10:18 AM CET
Room: Hall 12
Title: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation
Speakers: Jorge Palma, Natan Yellin (Robusta)

As AI agents increasingly write our code, can they also operate and troubleshoot our infrastructure? More importantly, should they? This keynote explores the practical reality of deploying AI agents to maintain Kubernetes clusters at scale. We'll demonstrate HolmesGPT, an open-source CNCF sandbox project that connects LLMs to operational and observability data to diagnose production issues. You'll see how agents reduce MTTR by correlating logs, metrics, and cluster state far faster than manual investigation. Then we'll tackle the harder problem: moving from diagnosis to remediation. We'll show how agents with remediation policies can detect and fix issues autonomously, within strict RBAC boundaries, approval workflows, and audit trails.
We'll be honest about the challenges: LLM non-determinism, building trust, and why guardrails are non-negotiable. This isn't about replacing SREs; it's about multiplying their effectiveness so they can focus on creative problem-solving and system design.

Customer Keynote

Date: Tues 24 March 2026
Start Time: 9:37 AM CET
Room: Hall 12
Title: Rules of the road for shared GPUs: AI inference scheduling at Wayve
Speaker: Mukund Muralikrishnan, Wayve Technologies

As AI inference workloads grow in both scale and diversity, predictable access to GPUs becomes as important as raw throughput, especially in large, multi-tenant Kubernetes clusters. At Wayve, Kubernetes underpins a wide range of inference workloads, from latency-sensitive evaluation and validation to large-scale synthetic data generation supporting the development of an end-to-end self-driving system. These workloads run side by side, have very different priorities, and all compete for the same GPU capacity.

In this keynote, we will share how we manage scheduling and resources for multi-tenant AI inference on Kubernetes. We will explain why default Kubernetes scheduling falls short, and how we use Kueue, a Kubernetes-native queueing and admission control solution, to operate shared GPU clusters reliably at scale. This approach gives teams predictable GPU allocations, improves cluster utilisation, and reduces operational noise. We will close by briefly showing how frameworks like Ray fit into this model as Wayve scales its AI Driver platform.

KubeCon Theatre Demo

Date: Wed 25 March 2026
Start Time: 13:15 CET
Room: Hall 1-5 | Solutions Showcase | Demo Theater
Title: Building cross-cloud AI inference on Kubernetes with OSS
Speakers: Anson Qian, Jorge Palma

Operating AI inference under bursty, latency-sensitive workloads is hard enough on a single cluster. It gets harder when GPU capacity is fragmented across regions and cloud providers.
This demo walks through a Kubernetes-native pattern for cross-cloud AI inference, using an incident triage and root cause analysis workflow as the example. The stack is built on open-source capabilities for lifecycle management, inference, autoscaling, and cross-cloud capacity scheduling. We will specifically highlight Karpenter for elastic autoscaling and a GPU flex nodes project for scheduling capacity across multiple cloud providers into a single cluster. Models, inference endpoints, and GPU resources are treated as first-class Kubernetes objects, enabling elastic scaling, stable routing under traffic spikes, and cross-provider failover without a separate AI control plane.

KubeCon Europe 2026 Sessions with Microsoft Speakers

| Speaker | Title |
| --- | --- |
| Jorge Palma | Microsoft keynote: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation |
| Anson Qian, Jorge Palma | Microsoft demo: Building cross-cloud AI inference on Kubernetes with OSS |
| Will Tsai | Leveling up with Radius: Custom Resources and Headlamp Integration for Real-World Workloads |
| Simone Rodigari | Demystifying the Kubernetes Network Stack (From Pod to Pod) |
| Joaquin Rodriguez | Privacy as Infrastructure: Declarative Data Protection for AI on Kubernetes |
| Cijo Thomas | ⚡Lightning Talk: “Metrics That Lie”: Understanding OpenTelemetry’s Cardinality Capping and Its Implications |
| Gaurika Poplai | ⚡Lightning Talk: Compliance as Code Meets Developer Portals: Kyverno + Backstage in Action |
| Mereta Degutyte & Anubhab Majumdar | Network Flow Aggregation: Pay for the Logs You Care About! |
| Niranjan Shankar | Expl(AI)n Like I’m 5: An Introduction To AI-Native Networking |
| Danilo Chiarlone | Running Wasmtime in Hardware-Isolated Microenvironments |
| Jack Francis | Cluster Autoscaler Evolution |
| Jackie Maertens | Cloud Native Theater \| Istio Day: Running State of the Art Inference with Istio and LLM-D |
| Jackie Maertens & Mitch Connors | Bob and Alice Revisited: Understanding Encryption in Kubernetes |
| Mitch Connors | Istio in Production: Expected Value, Results, and Effort at GitHub Scale |
| Mitch Connors | Evolution or Revolution: Istio as the Network Platform for Cloud Native |
| René Dudfield | Ping SRE? I Am the SRE! Awesome Fun I Had Drawing a Zine for Troubleshooting Kubernetes Deployments |
| René Dudfield & Santhosh Nagaraj | Does Your Project Want a UI in Kubernetes-SIGs/headlamp? |
| Bridget Kromhout | How Will Customized Kubernetes Distributions Work for You? A Discussion on Options and Use Cases |
| Kenneth Kilty | AI-Powered Cloud Native Modernization: From Real Challenges to Concrete Solutions |
| Mike Morris | Building the Next Generation of Multi-Cluster with Gateway API |
| Toddy Mladenov, Flora Taagen & Dallas Delaney | Beyond Image Pull-Time: Ensuring Runtime Integrity With Image Layer Signing |

Microsoft Booth Theatre Sessions

Tues 24 March (11:00 - 18:00)

- Zero-Migration AI with Drasi: Bridge Your Existing Infrastructure to Modern Workflows
- Bringing real-time Kubernetes observability to AI agents via Model Context Protocol
- Secure Kubernetes Across the Stack: Supply Chain to Runtime
- Cut the Noise, Cut the Bill: Cost‑Smart Network Observability for Kubernetes
- AKS everywhere: one Kubernetes experience from Cloud to Edge
- Teaching AI to Build Better AKS Clusters with Terraform
- AKS-Flex: autoscale GPU nodes from Azure and neocloud like Nebius using karpenter
- Block Game with Block Storage: Running Minecraft on Kubernetes with local NVMe
- When One Cluster Fails: Keeping Kubernetes Services Online with Cilium ClusterMesh
- You Spent How Much? Controlling Your AI Spend with Istio + agentgateway
- Azure Front Door Edge Actions: Hardware-protected CDN functions in Azure
- Secure Your Sensitive Workloads with Confidential Containers on Azure Red Hat OpenShift
- AKS Automatic
- Anyscale on Azure

Wed 25 March

- Kubernetes Answers without AI (And That's Okay)
- Accelerating Cloud‑Native and AI Workloads with Azure Linux on AKS
- Codeless OpenTelemetry: Auto‑Instrumenting Kubernetes Apps in Minutes
- Life After ingress-nginx: Modern Kubernetes Ingress on AKS
- Modern Apps, Faster: Modernization with AKS + GitHub Copilot App Mod
- Get started developing on AKS
- Encrypt Everything, Complicate Nothing: Rethinking Kubernetes Workload Network Security
- From Repo to Running on AKS with GitHub Copilot
- Simplify Multi‑Cluster App Traffic with Azure Kubernetes Application Network
- Open Source with Chainguard and Microsoft: Better Together on AKS
- Accelerating Cloud-Native Delivery for Developers: API-Driven Platforms with Radius
- Operate Kubernetes at Scale with Azure Kubernetes Fleet Manager

Thurs 26 March

- Oooh Wee! An AKS GUI! – Deploy, Secure & Collaborate in Minutes (No CLI Required)
- Sovereign Kubernetes: Run AKS Where the Cloud Can’t Go
- Thousand Pods, One SAN: Burst-Scaling Stateful Apps with Azure Container Storage + Elastic SAN

There will also be a wide variety of demos running at our booth throughout the show – be sure to swing by to chat with the team. We look forward to seeing you at KubeCon Europe 2026 in Amsterdam!

Psst! Local or coming in to Amsterdam early? You can also catch the Microsoft team at:

- Cloud Native Rejekts on 21 March
- Maintainer Summit on 22 March

Even simpler to Safely Execute AI-generated Code with Azure Container Apps Dynamic Sessions
AI agents are writing code. The question is: where does that code run? If it runs in your process, a single hallucinated `import os; os.remove('/')` can ruin your day. Azure Container Apps dynamic sessions solve this with on-demand sandboxed environments — Hyper-V isolated, fully managed, and ready in milliseconds.

Thanks to your feedback, dynamic sessions are now easier to use with AI via MCP. Agents can quickly start a session interpreter and safely run code – all using a built-in MCP endpoint. Additionally, new starter samples show how to invoke dynamic sessions from Microsoft Agent Framework with a code interpreter and with a custom container for even more versatility.

What Are Dynamic Sessions?

A session pool maintains a reservoir of pre-warmed, isolated sandboxes. When your app needs one, it's allocated instantly via REST API. When idle, it's destroyed automatically after the configured session cooldown period.

What you get:

- Strong isolation - Each session runs in its own Hyper-V sandbox — enterprise-grade security
- Millisecond startup - Pre-warmed pool eliminates cold starts
- Fully managed - No infra to maintain — automatic lifecycle, cleanup, scaling
- Simple access - Single HTTP endpoint, session identified by a unique ID
- Scalable - Hundreds to thousands of concurrent sessions

Two Session Types

1. Code Interpreter — Run Untrusted Code Safely

Code interpreter sessions accept inline code, run it in a Hyper-V sandbox, and return the output. Sessions support network egress and persistent file systems within the session lifetime. Three runtimes are available:

- Python — Ships with popular libraries pre-installed (NumPy, pandas, matplotlib, etc.). Ideal for AI-generated data analysis, math computation, and chart generation.
- Node.js — Comes with common npm packages. Great for server-side JavaScript execution, data transformation, and scripting.
- Shell — A full Linux shell environment where agents can run arbitrary commands, install packages, start processes, manage files, and chain multi-step workflows. Unlike the Python/Node.js interpreters, shell sessions expose a complete OS — ideal for agent-driven DevOps, build/test environments, CLI tool execution, and multi-process pipelines.

2. Custom Containers — Bring Your Own Runtime

Custom container sessions let you run your own container image in the same isolated, on-demand model. Define your image, and Container Apps handles the pooling, scaling, and lifecycle. Typical use cases are hosting proprietary runtimes, custom code interpreters, and specialized tool chains. This sample (Azure Samples) dives deeper into custom containers with Microsoft Agent Framework orchestration.

MCP Support for Dynamic Sessions

Dynamic sessions also support the Model Context Protocol (MCP) on both shell and Python session types. This turns a session pool into a remote MCP server that AI agents can connect to — enabling tool execution, file system access, and shell commands in a secure, ephemeral environment. With an MCP-enabled shell session, an Azure Foundry agent can spin up a Flask app, run system commands, or install packages — all in an isolated container that vanishes when done.

The MCP server is enabled with a single property on the session pool (isMCPServerEnabled: true), and the resulting endpoint + API key can be plugged directly into Azure Foundry as a connected tool. For a step-by-step walkthrough, see How to add an MCP tool to your Azure Foundry agent using dynamic sessions.

Deep Dive: Building an AI Travel Agent with Code Interpreter Sessions

Let's walk through a sample implementation — a travel planning agent that uses dynamic sessions for both static code execution (weather research) and LLM-generated code execution (charting).
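One practical note before the deep dive: a session is addressed purely by the `identifier` query parameter on the pool endpoint, so a common pattern is to derive a stable identifier per user conversation. Requests carrying the same identifier land in the same sandbox while it is alive, so files written by earlier executions remain visible; a new conversation gets a fresh, isolated session. This sketch uses illustrative names, not code from the sample:

```python
import hashlib

def session_identifier(user_id: str, conversation_id: str) -> str:
    """Derive a stable, opaque session identifier for one user conversation.

    Hashing avoids leaking raw user IDs into request URLs while still
    routing repeated calls to the same sandbox."""
    raw = f"{user_id}:{conversation_id}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]

# Same conversation -> same sandbox; different conversation -> different one.
a = session_identifier("alice", "conv-1")
b = session_identifier("alice", "conv-1")
c = session_identifier("alice", "conv-2")
```

The sample's hard-coded identifiers (weather-session-1, chart-session-1) are the simplest form of the same idea: whoever controls the identifier controls session affinity.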
Full source: github.com/jkalis-MS/AIAgent-ACA-DynamicSession

Architecture

| Component | Purpose |
| --- | --- |
| Microsoft Agent Framework | Agent runtime with middleware, telemetry, and DevUI |
| Azure OpenAI (GPT-4o) | LLM for conversation and code generation |
| ACA Session Pools | Sandboxed Python code interpreter |
| Azure Container Apps | Hosts the agent in a container |
| Application Insights | Observability for agent spans |

The agent ships with two variants, switchable in the Agent Framework DevUI — tools running in an ACA dynamic session (sandboxed) and tools running locally (no isolation) — making the security value immediately visible.

Scenario A: Static Code in a Sandbox — Weather Research

The agent sends pre-written Python code to the session pool to fetch live weather data. The code runs with network egress enabled, calls the Open-Meteo API, and returns formatted results — all without touching the host process.

```python
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://dynamicsessions.io/.default")

response = requests.post(
    f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=weather-session-1",
    headers={"Authorization": f"Bearer {token.token}"},
    json={"properties": {
        "codeInputType": "inline",
        "executionType": "synchronous",
        "code": weather_code,  # Python that calls the Open-Meteo API
    }},
)
result = response.json()["properties"]["stdout"]
```

Scenario B: LLM-Generated Code in a Sandbox — Dynamic Charting

This is where it gets interesting. The user asks "plot a chart comparing Miami and Tokyo weather." The agent:

1. Fetches weather data
2. Asks Azure OpenAI to generate matplotlib code using a tightly-scoped system prompt
3. Safety-checks the generated code for forbidden imports (subprocess, os.system, etc.)
4. Wraps the code with data injection and sends it to the sandbox
5. Downloads the resulting PNG from the sandbox's /mnt/data/ directory

```python
from openai import AzureOpenAI

# 1. LLM generates chart code
client = AzureOpenAI(azure_endpoint=endpoint, api_key=key, api_version="2024-12-01-preview")
generated_code = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": CODE_GEN_PROMPT},
        {"role": "user", "content": f"Weather data: {weather_json}"},
    ],
    temperature=0.2,
).choices[0].message.content

# 2. Execute in sandbox
requests.post(
    f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=chart-session-1",
    headers={"Authorization": f"Bearer {token.token}"},
    json={"properties": {
        "codeInputType": "inline",
        "executionType": "synchronous",
        "code": f"import json, matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nweather_data = json.loads('{weather_json}')\n{generated_code}",
    }},
)

# 3. Download the chart
img = requests.get(
    f"{pool_endpoint}/files/content/chart.png?api-version=2024-02-02-preview&identifier=chart-session-1",
    headers={"Authorization": f"Bearer {token.token}"},
).content
```

The result is a dark-themed, dual-subplot chart comparing the maximum and minimum temperature forecasts, rendered by the Chart Weather tool inside the dynamic session.

Authentication

The agent uses DefaultAzureCredential locally and ManagedIdentityCredential when deployed. Tokens are cached and refreshed automatically:

```python
from azure.identity import DefaultAzureCredential

# Uses ManagedIdentityCredential automatically when deployed to Container Apps
token = DefaultAzureCredential().get_token("https://dynamicsessions.io/.default")
auth_header = f"Bearer {token.token}"
```

Observability

The agent uses Application Insights for end-to-end tracing.
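Stepping back briefly to step 3 of the charting flow: the safety check on LLM-generated code is worth sketching. This is an illustrative version with a placeholder blocklist, not the sample's actual implementation, and it is defense in depth rather than the primary boundary; the Hyper-V sandbox is what actually contains malicious or buggy code:

```python
import re

# Illustrative blocklist; the real sample's forbidden list may differ.
FORBIDDEN_PATTERNS = [
    r"\bimport\s+subprocess\b",
    r"\bos\.system\b",
    r"\bimport\s+socket\b",
    r"\b__import__\b",
    r"\beval\s*\(",
    r"\bexec\s*\(",
]

def is_code_safe(code: str) -> bool:
    """Reject LLM-generated code containing obviously dangerous constructs
    before it is even sent to the sandbox."""
    return not any(re.search(p, code) for p in FORBIDDEN_PATTERNS)

ok = is_code_safe("import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])")
bad = is_code_safe("import subprocess; subprocess.run(['rm', '-rf', '/'])")
```

A cheap static screen like this catches the most blatant attempts early and produces a clearer error for the user than a sandbox-side failure would.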
The Microsoft Agent Framework exposes OpenTelemetry spans for invoke_agent, chat, and execute tool — wired to Azure Monitor with custom exporters:

```python
from azure.monitor.opentelemetry import configure_azure_monitor
from agent_framework.observability import create_resource, enable_instrumentation

# Configure Azure Monitor first
configure_azure_monitor(
    connection_string="InstrumentationKey=...",
    resource=create_resource(),  # Uses OTEL_SERVICE_NAME, etc.
    enable_live_metrics=True,
)

# Then activate Agent Framework's telemetry code paths
# (optional if ENABLE_INSTRUMENTATION and/or ENABLE_SENSITIVE_DATA are set as env vars)
enable_instrumentation(enable_sensitive_data=False)
```

This gives you traces for every agent invocation, tool execution (including sandbox timing), and LLM call — visible in Application Insights transaction search and in the end-to-end transaction view in the new Agents blade. You can also open a detailed dashboard by clicking Explore in Grafana.

Session pools emit their own metrics and logs for monitoring sandbox utilization and performance. Combined with the agent-level Application Insights traces, you get full visibility from user prompt → agent → LLM → sandbox execution → response — across both your application and the infrastructure running untrusted code.

Deploy with One Command

The project includes full Bicep infrastructure-as-code. A single azd up provisions Azure OpenAI, Container Apps, the session pool (with egress enabled), Container Registry, Application Insights, and all role assignments.

```shell
azd auth login
azd up
```

Next Steps

- Dynamic sessions documentation – Microsoft Learn
- MCP + Shell sessions tutorial - How to add an MCP tool to your Foundry agent
- Custom container sessions sample - github.com/Azure-Samples/dynamic-sessions-custom-container
- AI Agent + Dynamic Sessions - github.com/jkalis-MS/AIAgent-ACA-DynamicSession