Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent — your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

Migrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)
What is ACI/Virtual Nodes?

Azure Container Instances (ACI) is a fully-managed serverless container platform which gives you the ability to run containers on-demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM‑backed node pools. From a developer’s perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI’s serverless infrastructure, enabling fast scale‑out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short‑lived workloads where speed and cost efficiency matter more than long‑running capacity planning.

Introducing the next generation of Virtual Nodes on ACI

The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add‑on and delivering a more Kubernetes‑native, flexible, and scalable experience when bursting workloads from AKS to ACI. In this article I will demonstrate how you can migrate an existing AKS cluster using the Virtual Nodes managed add-on (legacy) to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm. More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide. Please note that all code samples within this guide are examples only, and are provided without warranty/support.
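Because Virtual Nodes appear to developers as ordinary Kubernetes nodes, it can help to tell them apart programmatically by their labels. The sketch below is illustrative only, not part of the product: it keys off the selector labels used in this article — type=virtual-kubelet for the legacy add-on and virtualization=virtualnode2 for the new generation.

```python
def classify_virtual_node(labels: dict) -> str:
    """Classify a node from its labels: v2 virtual node, legacy virtual node, or VM node.

    Label conventions taken from this article's nodeSelector examples:
    - new generation:  virtualization=virtualnode2
    - legacy add-on:   type=virtual-kubelet
    """
    if labels.get("virtualization") == "virtualnode2":
        return "virtual-node-v2"
    if labels.get("type") == "virtual-kubelet":
        return "virtual-node-legacy"
    return "vm-node"


# Example: labels as they might appear on nodes in a mixed cluster
print(classify_virtual_node({"virtualization": "virtualnode2"}))          # virtual-node-v2
print(classify_virtual_node({"type": "virtual-kubelet"}))                 # virtual-node-legacy
print(classify_virtual_node({"agentpool": "nodepool1"}))                  # vm-node
```

You could feed this the `labels` field of each item from `kubectl get nodes -o json` to audit which workloads still land on the legacy node.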
Background

Virtual Nodes on ACI is rebuilt from the ground up, and includes several fixes and enhancements, for instance:

Added support/features
- VNet peering, outbound traffic to the internet with network security groups
- Init containers
- Host aliases
- Arguments for exec in ACI
- Persistent Volumes and Persistent Volume Claims
- Container hooks
- Confidential containers (see supported regions list here)
- ACI standby pools
- Support for image pulling via Private Link and Managed Identity (MSI)

Planned future enhancements
- Kubernetes network policies
- Support for IPv6
- Windows containers
- Port forwarding

Note: The new generation of the add-on is managed via Helm rather than as an AKS managed add-on.

Requirements & limitations
- Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB memory on one of the AKS cluster’s VMs
- Each Virtual Nodes node supports up to 200 pods
- DaemonSets are not supported
- Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (Kubenet is not supported, nor is overlay networking)
- Virtual Nodes on ACI is incompatible with API server authorized IP ranges for AKS (because of the subnet delegation to ACI)

Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart

For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, and therefore the steps below show an example of removing the Virtual Nodes managed add-on and its resources and then installing the Virtual Nodes on ACI Helm chart. In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet; however, if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here.
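Given the requirements above — each virtualnode replica reserves 3 vCPUs and 12 GiB on one of the cluster's VMs and supports up to 200 pods — a quick back-of-the-envelope sizing check can be sketched as follows. This is an illustrative calculation only; the 200-pod and 3 vCPU / 12 GiB figures are the documented per-deployment limits.

```python
import math

PODS_PER_VIRTUALNODE = 200      # documented per-node pod limit
VCPUS_PER_REPLICA = 3           # reserved on a VM-backed AKS node per replica
MEM_GIB_PER_REPLICA = 12


def virtualnode_replicas_needed(peak_pods: int) -> int:
    """Minimum number of virtualnode replicas to host peak_pods burst pods."""
    return math.ceil(peak_pods / PODS_PER_VIRTUALNODE)


def host_resources_reserved(replicas: int) -> tuple:
    """Total (vCPUs, GiB) the replicas reserve on the cluster's VM node pools."""
    return (replicas * VCPUS_PER_REPLICA, replicas * MEM_GIB_PER_REPLICA)


print(virtualnode_replicas_needed(450))   # 3 replicas for a 450-pod burst
print(host_resources_reserved(3))         # (9, 36) -> 9 vCPUs, 36 GiB reserved
```

In other words, a 450-pod burst needs three replicas, which in turn need 9 vCPUs and 36 GiB of headroom on the VM-backed node pools before any bursting happens.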
Prerequisites
- A recent version of the Azure CLI
- An Azure subscription with sufficient ACI quota for your selected region
- Helm

Deployment steps

1. Initialise environment variables:

       location=northeurope
       rg=rg-virtualnode-demo
       vnetName=vnet-virtualnode-demo
       clusterName=aks-virtualnode-demo
       aksSubnetName=subnet-aks
       vnSubnetName=subnet-vn

2. Scale down any running Virtual Nodes workloads (example below):

       kubectl delete deploy <deploymentName> -n <namespace>

3. Drain and cordon the legacy Virtual Nodes node:

       kubectl drain virtual-node-aci-linux

4. Disable the Virtual Nodes managed add-on (legacy):

       az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node

5. Export a backup of the original subnet configuration:

       az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json

6. Delete the original subnet (subnets cannot be renamed and therefore must be re-created):

       az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName

7. Create the new Virtual Nodes on ACI subnet (replicate the configuration of the original subnet but with the specific name value of cg):

       vnSubnetId=$(az network vnet subnet create \
         --resource-group $rg \
         --vnet-name $vnetName \
         --name cg \
         --address-prefixes 10.241.0.0/16 \
         --delegations Microsoft.ContainerInstance/containerGroups \
         --query id -o tsv)

8. Assign the cluster's -kubelet identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet:

       nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv)
       nodeRgId=$(az group show -n $nodeRg --query id -o tsv)
       agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv)
       agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv)

       az role assignment create \
         --assignee-object-id "$agentPoolIdentityObjectId" \
         --assignee-principal-type ServicePrincipal \
         --role "Contributor" \
         --scope "$nodeRgId"

       az role assignment create \
         --assignee-object-id "$agentPoolIdentityObjectId" \
         --assignee-principal-type ServicePrincipal \
         --role "Network Contributor" \
         --scope "$vnSubnetId"

9. Download the cluster's kubeconfig file:

       az aks get-credentials -n $clusterName -g $rg

10. Clone the virtualnodesOnAzureContainerInstances GitHub repo:

        git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git

11. Install the Virtual Nodes on ACI Helm chart:

        helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode

12. Confirm the Virtual Nodes node (virtualnode-n) shows within the cluster and is in a Ready state:

        $ kubectl get node
        NAME                                STATUS   ROLES    AGE     VERSION
        aks-nodepool1-35702456-vmss000000   Ready    <none>   4h13m   v1.33.6
        aks-nodepool1-35702456-vmss000001   Ready    <none>   4h13m   v1.33.6
        virtualnode-0                       Ready    <none>   162m    v1.33.7

13. Delete the previous Virtual Nodes node from the cluster:

        kubectl delete node virtual-node-aci-linux

14. Test and confirm pod scheduling on the Virtual Node:

        apiVersion: v1
        kind: Pod
        metadata:
          annotations:
          name: demo-pod
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done'
              image: mcr.microsoft.com/azure-cli
              name: hello-world-counter
              resources:
                limits:
                  cpu: 2250m
                  memory: 2256Mi
                requests:
                  cpu: 100m
                  memory: 128Mi
          nodeSelector:
            virtualization: virtualnode2
          tolerations:
            - effect: NoSchedule
              key: virtual-kubelet.io/provider
              operator: Exists

If the pod successfully starts on the Virtual Node, you should see output similar to the below:

    $ kubectl get pod -o wide demo-pod
    NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
    demo-pod   1/1     Running   0          95s   10.241.0.4   vnode2-virtualnode-0   <none>           <none>

Modify your deployments to run on Virtual Nodes on ACI

For the Virtual Nodes managed add-on (legacy), the following nodeSelector and tolerations are used to run pods on Virtual Nodes:

    nodeSelector:
      kubernetes.io/role: agent
      kubernetes.io/os: linux
      type: virtual-kubelet
    tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      - key: azure.com/aci
        effect: NoSchedule

For Virtual Nodes on ACI, the nodeSelector/tolerations are slightly different:

    nodeSelector:
      virtualization: virtualnode2
    tolerations:
      - effect: NoSchedule
        key: virtual-kubelet.io/provider
        operator: Exists

Troubleshooting

Check the virtual-node-admission-controller and virtualnode-n pods are running within the vn2 namespace:

    $ kubectl get pod -n vn2
    NAME                                                 READY   STATUS    RESTARTS        AGE
    virtual-node-admission-controller-54cb7568f5-b7hnr   1/1     Running   1 (5h21m ago)   5h21m
    virtualnode-0                                        6/6     Running   6 (4h48m ago)   4h51m

If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use kubectl describe pod to validate).
If the virtualnode-n pod is crashing, check the logs of the proxycri container to see whether there are any Managed Identity permissions issues (the cluster's -agentpool MSI needs to have Contributor access on the infrastructure resource group):

    kubectl logs -n vn2 virtualnode-0 -c proxycri

Further troubleshooting guidance is available within the official documentation.

Support

If you have issues deploying or using Virtual Nodes on ACI, raise a GitHub issue here.

Microsoft Azure at KubeCon Europe 2026 | Amsterdam, NL - March 23-26
Microsoft Azure is coming back to Amsterdam for KubeCon + CloudNativeCon Europe 2026 in two short weeks, from March 23-26! As a Diamond Sponsor, we have a full week of sessions, hands-on activities, and ways to connect with the engineers behind AKS and our open-source projects. Here's what's on the schedule:

Azure Day with Kubernetes: 23 March 2026

Before the main conference begins, join us at Hotel Casa Amsterdam for a free, full-day technical event built around AKS (registration required for entry - capacity is limited!). Whether you're early in your Kubernetes journey, running clusters at scale, or building AI apps, the day is designed to give you practical guidance from Microsoft product and engineering teams. Morning sessions cover what's new in AKS, including how teams are building and running AI apps on Kubernetes. In the afternoon, pick your track:

- Hands-on AKS Labs: Instructor-led labs to put the morning's concepts into practice.
- Expert Roundtables: Small-group conversations with AKS engineers on topics like security, autoscaling, AI workloads, and performance. Bring your hard questions.
- Evening: Drinks on us.

Capacity is limited, so secure your spot before it closes: aka.ms/AKSDayEU

KubeCon + CloudNativeCon: 24-26 March 2026

There will be lots going on at the main conference! Here's what to add to your calendar:

- Keynote (24 March): Jorge Palma takes the stage to tackle a question the industry is actively wrestling with: can AI agents reliably operate and troubleshoot Kubernetes at scale, and should they?
- Customer Keynote (24 March): Wayve's Mukund Muralikrishnan shares how they handle GPU scheduling across multi-tenant inference workloads using Kueue, providing a practical look at what production AI infrastructure actually requires.
- Demo Theatre (25 March): Anson Qian and Jorge Palma walk through a Kubernetes-native approach to cross-cloud AI inference, covering elastic autoscaling with Karpenter and GPU capacity scheduling across clouds.
- Sessions: Microsoft engineers are presenting across all three days on topics ranging from multi-cluster networking, supply chain security, observability, Istio in production, and more. Full list below.
- Project Pavilion: Find our team at kiosks for Inspektor Gadget, Headlamp, Drasi, Radius, Notary Project, Flatcar, ORAS, Ratify, and Istio.

Brendan Burns, Kubernetes co-founder and Microsoft CVP & Technical Fellow, will also share his thoughts on the latest developments and key Microsoft announcements related to open-source, cloud native, and AI application development in his KubeCon Europe blog on March 24.

Come find us at Microsoft Azure booth #200 all three days. We'll be running short demos and sessions on AKS, running Kubernetes at scale, AI workloads, and cloud-native topics throughout the show, plus fun activations and opportunities to unlock special swag. Read on below for full details on our KubeCon sessions and booth theater presentations.

Sponsored Keynote

Date: Tues 24 March 2026
Start Time: 10:18 AM CET
Room: Hall 12
Title: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation
Speakers: Jorge Palma, Natan Yellin (Robusta)

As AI agents increasingly write our code, can they also operate and troubleshoot our infrastructure? More importantly, should they? This keynote explores the practical reality of deploying AI agents to maintain Kubernetes clusters at scale. We'll demonstrate HolmesGPT, an open-source CNCF sandbox project that connects LLMs to operational and observability data to diagnose production issues. You'll see how agents reduce MTTR by correlating logs, metrics, and cluster state far faster than manual investigation. Then we'll tackle the harder problem: moving from diagnosis to remediation. We'll show how agents with remediation policies can detect and fix issues autonomously, within strict RBAC boundaries, approval workflows, and audit trails. We'll be honest about challenges: LLM non-determinism, building trust, and why guardrails are non-negotiable. This isn't about replacing SREs; it's about multiplying their effectiveness so they can focus on creative problem-solving and system design.

Customer Keynote

Date: Tues 24 March 2026
Start Time: 9:37 AM CET
Room: Hall 12
Title: Rules of the road for shared GPUs: AI inference scheduling at Wayve
Speaker: Mukund Muralikrishnan, Wayve Technologies

As AI inference workloads grow in both scale and diversity, predictable access to GPUs becomes as important as raw throughput, especially in large, multi-tenant Kubernetes clusters. At Wayve, Kubernetes underpins a wide range of inference workloads, from latency-sensitive evaluation and validation to large-scale synthetic data generation supporting the development of an end-to-end self-driving system. These workloads run side by side, have very different priorities, and all compete for the same GPU capacity. In this keynote, we will share how we manage scheduling and resources for multi-tenant AI inference on Kubernetes. We will explain why default Kubernetes scheduling falls short, and how we use Kueue, a Kubernetes-native queueing and admission control solution, to operate shared GPU clusters reliably at scale. This approach gives teams predictable GPU allocations, improves cluster utilisation, and reduces operational noise. We will close by briefly showing how frameworks like Ray fit into this model as Wayve scales its AI Driver platform.

KubeCon Theatre Demo

Date: Wed 25 March 2026
Start Time: 13:15 CET
Room: Hall 1-5 | Solutions Showcase | Demo Theater
Title: Building cross-cloud AI inference on Kubernetes with OSS
Speakers: Anson Qian, Jorge Palma

Operating AI inference under bursty, latency-sensitive workloads is hard enough on a single cluster. It gets harder when GPU capacity is fragmented across regions and cloud providers.
This demo walks through a Kubernetes-native pattern for cross-cloud AI inference, using an incident triage and root cause analysis workflow as the example. The stack is built on open-source capabilities for lifecycle management, inference, autoscaling, and cross-cloud capacity scheduling. We will specifically highlight Karpenter for elastic autoscaling and a GPU flex nodes project for scheduling capacity across multiple cloud providers into a single cluster. Models, inference endpoints, and GPU resources are treated as first-class Kubernetes objects, enabling elastic scaling, stable routing under traffic spikes, and cross-provider failover without a separate AI control plane.

KubeCon Europe 2026 Sessions with Microsoft Speakers

- Jorge Palma — Microsoft keynote: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation
- Anson Qian, Jorge Palma — Microsoft demo: Building cross-cloud AI inference on Kubernetes with OSS
- Will Tsai — Leveling up with Radius: Custom Resources and Headlamp Integration for Real-World Workloads
- Simone Rodigari — Demystifying the Kubernetes Network Stack (From Pod to Pod)
- Joaquin Rodriguez — Privacy as Infrastructure: Declarative Data Protection for AI on Kubernetes
- Cijo Thomas — ⚡Lightning Talk: “Metrics That Lie”: Understanding OpenTelemetry’s Cardinality Capping and Its Implications
- Gaurika Poplai — ⚡Lightning Talk: Compliance as Code Meets Developer Portals: Kyverno + Backstage in Action
- Mereta Degutyte & Anubhab Majumdar — Network Flow Aggregation: Pay for the Logs You Care About!
- Niranjan Shankar — Expl(AI)n Like I’m 5: An Introduction To AI-Native Networking
- Danilo Chiarlone — Running Wasmtime in Hardware-Isolated Microenvironments
- Jack Francis — Cluster Autoscaler Evolution
- Jackie Maertens — Cloud Native Theater | Istio Day: Running State of the Art Inference with Istio and LLM-D
- Jackie Maertens & Mitch Connors — Bob and Alice Revisited: Understanding Encryption in Kubernetes
- Mitch Connors — Istio in Production: Expected Value, Results, and Effort at GitHub Scale
- Mitch Connors — Evolution or Revolution: Istio as the Network Platform for Cloud Native
- René Dudfield — Ping SRE? I Am the SRE! Awesome Fun I Had Drawing a Zine for Troubleshooting Kubernetes Deployments
- René Dudfield & Santhosh Nagaraj — Does Your Project Want a UI in Kubernetes-SIGs/headlamp?
- Bridget Kromhout — How Will Customized Kubernetes Distributions Work for You? A Discussion on Options and Use Cases
- Kenneth Kilty — AI-Powered Cloud Native Modernization: From Real Challenges to Concrete Solutions
- Mike Morris — Building the Next Generation of Multi-Cluster with Gateway API
- Toddy Mladenov, Flora Taagen & Dallas Delaney — Beyond Image Pull-Time: Ensuring Runtime Integrity With Image Layer Signing

Microsoft Booth Theatre Sessions

Tues 24 March (11:00 - 18:00)
- Zero-Migration AI with Drasi: Bridge Your Existing Infrastructure to Modern Workflows
- Bringing real-time Kubernetes observability to AI agents via Model Context Protocol
- Secure Kubernetes Across the Stack: Supply Chain to Runtime
- Cut the Noise, Cut the Bill: Cost‑Smart Network Observability for Kubernetes
- AKS everywhere: one Kubernetes experience from Cloud to Edge
- Teaching AI to Build Better AKS Clusters with Terraform
- AKS-Flex: autoscale GPU nodes from Azure and neocloud like Nebius using karpenter
- Block Game with Block Storage: Running Minecraft on Kubernetes with local NVMe
- When One Cluster Fails: Keeping Kubernetes Services Online with Cilium ClusterMesh
- You Spent How Much? Controlling Your AI Spend with Istio + agentgateway
- Azure Front Door Edge Actions: Hardware-protected CDN functions in Azure
- Secure Your Sensitive Workloads with Confidential Containers on Azure Red Hat OpenShift
- AKS Automatic
- Anyscale on Azure

Wed 25 March
- Kubernetes Answers without AI (And That's Okay)
- Accelerating Cloud‑Native and AI Workloads with Azure Linux on AKS
- Codeless OpenTelemetry: Auto‑Instrumenting Kubernetes Apps in Minutes
- Life After ingress-nginx: Modern Kubernetes Ingress on AKS
- Modern Apps, Faster: Modernization with AKS + GitHub Copilot App Mod
- Get started developing on AKS
- Encrypt Everything, Complicate Nothing: Rethinking Kubernetes Workload Network Security
- From Repo to Running on AKS with GitHub Copilot
- Simplify Multi‑Cluster App Traffic with Azure Kubernetes Application Network
- Open Source with Chainguard and Microsoft: Better Together on AKS
- Accelerating Cloud-Native Delivery for Developers: API-Driven Platforms with Radius
- Operate Kubernetes at Scale with Azure Kubernetes Fleet Manager

Thurs 26 March
- Oooh Wee! An AKS GUI! – Deploy, Secure & Collaborate in Minutes (No CLI Required)
- Sovereign Kubernetes: Run AKS Where the Cloud Can’t Go
- Thousand Pods, One SAN: Burst-Scaling Stateful Apps with Azure Container Storage + Elastic SAN

There will also be a wide variety of demos running at our booth throughout the show – be sure to swing by to chat with the team. We look forward to seeing you at KubeCon Europe 2026 in Amsterdam!

Psst! Local or coming in to Amsterdam early? You can also catch the Microsoft team at:
- Cloud Native Rejekts on 21 March
- Maintainer Summit on 22 March

Even simpler to Safely Execute AI-generated Code with Azure Container Apps Dynamic Sessions
AI agents are writing code. The question is: where does that code run? If it runs in your process, a single hallucinated import os; os.remove('/') can ruin your day. Azure Container Apps dynamic sessions solve this with on-demand sandboxed environments — Hyper-V isolated, fully managed, and ready in milliseconds. Thanks to your feedback, Dynamic Sessions are now easier to use with AI via MCP. Agents can quickly start a session interpreter and safely run code – all using a built-in MCP endpoint. Additionally, new starter samples show how to invoke dynamic sessions from Microsoft Agent Framework with a code interpreter and with a custom container for even more versatility.

What Are Dynamic Sessions?

A session pool maintains a reservoir of pre-warmed, isolated sandboxes. When your app needs one, it’s allocated instantly via REST API. When idle, it’s destroyed automatically after the configured session cool-down period. What you get:

- Strong isolation - Each session runs in its own Hyper-V sandbox — enterprise-grade security
- Millisecond startup - Pre-warmed pool eliminates cold starts
- Fully managed - No infra to maintain — automatic lifecycle, cleanup, scaling
- Simple access - Single HTTP endpoint, session identified by a unique ID
- Scalable - Hundreds to thousands of concurrent sessions

Two Session Types

1. Code Interpreter — Run Untrusted Code Safely

Code interpreter sessions accept inline code, run it in a Hyper-V sandbox, and return the output. Sessions support network egress and persistent file systems within the session lifetime. Three runtimes are available:

- Python — Ships with popular libraries pre-installed (NumPy, pandas, matplotlib, etc.). Ideal for AI-generated data analysis, math computation, and chart generation.
- Node.js — Comes with common npm packages. Great for server-side JavaScript execution, data transformation, and scripting.
- Shell — A full Linux shell environment where agents can run arbitrary commands, install packages, start processes, manage files, and chain multi-step workflows. Unlike the Python/Node.js interpreters, shell sessions expose a complete OS — ideal for agent-driven DevOps, build/test environments, CLI tool execution, and multi-process pipelines.

2. Custom Containers — Bring Your Own Runtime

Custom container sessions let you run your own container image in the same isolated, on-demand model. Define your image, and Container Apps handles the pooling, scaling, and lifecycle. Typical use cases are hosting proprietary runtimes, custom code interpreters, and specialized tool chains. This sample (Azure Samples) dives deeper into Custom Containers with Microsoft Agent Framework orchestration.

MCP Support for Dynamic Sessions

Dynamic sessions also support Model Context Protocol (MCP) on both shell and Python session types. This turns a session pool into a remote MCP server that AI agents can connect to — enabling tool execution, file system access, and shell commands in a secure, ephemeral environment. With an MCP-enabled shell session, an Azure Foundry agent can spin up a Flask app, run system commands, or install packages — all in an isolated container that vanishes when done. The MCP server is enabled with a single property on the session pool (isMCPServerEnabled: true), and the resulting endpoint + API key can be plugged directly into Azure Foundry as a connected tool. For a step-by-step walkthrough, see How to add an MCP tool to your Azure Foundry agent using dynamic sessions.

Deep Dive: Building an AI Travel Agent with Code Interpreter Sessions

Let’s walk through a sample implementation — a travel planning agent that uses dynamic sessions for both static code execution (weather research) and LLM-generated code execution (charting).
Full source: github.com/jkalis-MS/AIAgent-ACA-DynamicSession

Architecture

The travel agent is built from the following components:

- Microsoft Agent Framework — Agent runtime with middleware, telemetry, and DevUI
- Azure OpenAI (GPT-4o) — LLM for conversation and code generation
- ACA Session Pools — Sandboxed Python code interpreter
- Azure Container Apps — Hosts the agent in a container
- Application Insights — Observability for agent spans

The agent is implemented with two variants switchable in the Agent Framework DevUI — tools running in an ACA Dynamic Session (sandbox) and tools running locally (no isolation) — making the security value immediately visible.

Scenario A: Static Code in a Sandbox — Weather Research

The agent sends pre-written Python code to the session pool to fetch live weather data. The code runs with network egress enabled, calls the Open-Meteo API, and returns formatted results — all without touching the host process.

    import requests
    from azure.identity import DefaultAzureCredential

    credential = DefaultAzureCredential()
    token = credential.get_token("https://dynamicsessions.io/.default")

    response = requests.post(
        f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=weather-session-1",
        headers={"Authorization": f"Bearer {token.token}"},
        json={"properties": {
            "codeInputType": "inline",
            "executionType": "synchronous",
            "code": weather_code,  # Python that calls Open-Meteo API
        }},
    )
    result = response.json()["properties"]["stdout"]

Scenario B: LLM-Generated Code in a Sandbox — Dynamic Charting

This is where it gets interesting. The user asks “plot a chart comparing Miami and Tokyo weather.” The agent:

1. Fetches weather data
2. Asks Azure OpenAI to generate matplotlib code using a tightly-scoped system prompt
3. Safety-checks the generated code for forbidden imports (subprocess, os.system, etc.)
4. Wraps the code with data injection and sends it to the sandbox
5. Downloads the resulting PNG from the sandbox’s /mnt/data/ directory

    from openai import AzureOpenAI

    # 1. LLM generates chart code
    client = AzureOpenAI(azure_endpoint=endpoint, api_key=key, api_version="2024-12-01-preview")
    generated_code = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": CODE_GEN_PROMPT},
                  {"role": "user", "content": f"Weather data: {weather_json}"}],
        temperature=0.2,
    ).choices[0].message.content

    # 2. Execute in sandbox
    requests.post(
        f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=chart-session-1",
        headers={"Authorization": f"Bearer {token.token}"},
        json={"properties": {
            "codeInputType": "inline",
            "executionType": "synchronous",
            "code": f"import json, matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nweather_data = json.loads('{weather_json}')\n{generated_code}",
        }},
    )

    # 3. Download the chart
    img = requests.get(
        f"{pool_endpoint}/files/content/chart.png?api-version=2024-02-02-preview&identifier=chart-session-1",
        headers={"Authorization": f"Bearer {token.token}"},
    ).content

The result is a dark-themed dual-subplot chart comparing the maximum and minimum temperature forecasts, rendered by the Chart Weather tool in a Dynamic Session.

Authentication

The agent uses DefaultAzureCredential locally and ManagedIdentityCredential when deployed. Tokens are cached and refreshed automatically:

    from azure.identity import DefaultAzureCredential

    token = DefaultAzureCredential().get_token("https://dynamicsessions.io/.default")
    auth_header = f"Bearer {token.token}"
    # Uses ManagedIdentityCredential automatically when deployed to Container Apps

Observability

The agent uses Application Insights for end-to-end tracing.
The Microsoft Agent Framework exposes OpenTelemetry spans for invoke_agent, chat, and execute tool — wired to Azure Monitor with custom exporters:

    from azure.monitor.opentelemetry import configure_azure_monitor
    from agent_framework.observability import create_resource, enable_instrumentation

    # Configure Azure Monitor first
    configure_azure_monitor(
        connection_string="InstrumentationKey=...",
        resource=create_resource(),  # Uses OTEL_SERVICE_NAME, etc.
        enable_live_metrics=True,
    )

    # Then activate Agent Framework's telemetry code paths,
    # optional if ENABLE_INSTRUMENTATION and/or ENABLE_SENSITIVE_DATA are set in env vars
    enable_instrumentation(enable_sensitive_data=False)

This gives you traces for every agent invocation, tool execution (including sandbox timing), and LLM call — visible in transaction search and the end-to-end transaction view in the new Agents blade in Application Insights. You can also open a detailed dashboard by clicking Explore in Grafana. Session pools emit their own metrics and logs for monitoring sandbox utilization and performance. Combined with the agent-level Application Insights traces, you get full visibility from the user prompt → agent → LLM → sandbox execution → response — across both your application and the infrastructure running untrusted code.

Deploy with One Command

The project includes full Bicep infrastructure-as-code. A single azd up provisions Azure OpenAI, Container Apps, the Session Pool (with egress enabled), Container Registry, Application Insights, and all role assignments.

    azd auth login
    azd up

Next Steps

- Dynamic sessions documentation – Microsoft Learn
- MCP + Shell sessions tutorial - How to add an MCP tool to your Foundry agent
- Custom container sessions sample - github.com/Azure-Samples/dynamic-sessions-custom-container
- AI Agent + Dynamic Sessions - github.com/jkalis-MS/AIAgent-ACA-DynamicSession

Rethinking Background Workloads with Azure Functions on Azure Container Apps
Objective Azure Container Apps provides a flexible platform for running background workloads, supporting multiple execution models to address different workload needs. Two commonly used models are: Azure Functions on Azure Container Apps - overview of Azure functions Azure Container Apps Jobs – overview of Container App Jobs Both are first‑class capabilities on the same platform and are designed for different types of background processing. This blog explores Use Cases where Azure Functions on Azure Container Apps are best suited Use Cases where Container App Jobs provide advantages Use Cases where Azure Functions on Azure Container Apps Are suited Azure Functions on Azure Container Apps are particularly well suited for event‑driven and workflow‑oriented background workloads, where work is initiated by external signals and coordination is a core concern. The following use cases illustrate scenarios where the Functions programming model aligns naturally with the workload, allowing teams to focus on business logic while the platform handles triggering, scaling, and coordination. Event‑Driven Data Ingestion Pipelines For ingestion pipelines where data arrives asynchronously and unpredictably. Example: A retail company processes inventory updates from hundreds of suppliers. Files land in Blob Storage overnight, varying widely in size and arrival time. In this scenario: Each file is processed independently as it arrives Execution is driven by actual data arrival, not schedules Parallelism and retries are handled by the platform .blob_trigger(arg_name="blob", path="inventory-uploads/{name}", connection="StorageConnection") async def process_inventory(blob: func.InputStream): data = blob.read() # Transform and load to database await transform_and_load(data, blob.name) Multi‑Step, Event‑Driven Processing Workflows Functions works well for workloads that involve multiple dependent steps, where each step can fail independently and must be retried or resumed safely. 
Example: An order processing workflow that includes validation, inventory checks, payment capture, and fulfilment notifications. Using Durable Functions: Workflow state persisted automatically Each step can be retried independently Execution resumes from the point of failure rather than restarting Durable Functions on Container Apps solves this declaratively: .orchestration_trigger(context_name="context") def order_workflow(context: df.DurableOrchestrationContext): order = context.get_input() # Each step is independently retryable with built-in checkpointing validated = yield context.call_activity("validate_order", order) inventory = yield context.call_activity("check_inventory", validated) payment = yield context.call_activity("capture_payment", inventory) yield context.call_activity("notify_fulfillment", payment) return {"status": "completed", "order_id": order["id"]} Scheduled, Recurring Background Tasks For time‑based background work that runs on a predictable cadence and is closely tied to application logic. Example: Daily financial summaries, weekly aggregations, or month‑end reconciliation reports. Timer‑triggered Functions allow: Schedules to be defined in code Logic to be versioned alongside application code Execution to run in the same Container Apps environment as other services .timer_trigger(schedule="0 0 6 * * *", arg_name="timer") async def daily_financial_summary(timer: func.TimerRequest): if timer.past_due: logging.warning("Timer is running late!") await generate_summary(date.today() - timedelta(days=1)) await send_to_stakeholders() Long‑Running, Parallelizable Workloads Scenarios which require long‑running workloads to be decomposed into smaller units of work and coordinated as a workflow. Example: A large data migration processing millions of records. 
With Durable Functions:

- Work is split into independent batches
- Batches execute in parallel across multiple instances
- Progress is checkpointed automatically
- Failures are isolated to individual batches

```python
@app.orchestration_trigger(context_name="context")
def migration_orchestrator(context: df.DurableOrchestrationContext):
    batches = yield context.call_activity("get_migration_batches")
    # Process all batches in parallel across multiple instances
    tasks = [context.call_activity("migrate_batch", b) for b in batches]
    results = yield context.task_all(tasks)
    yield context.call_activity("generate_report", results)
```

Use Cases Where Container Apps Jobs Are a Best Fit

Azure Container Apps Jobs are well suited for workloads that require explicit execution control or full ownership of the runtime and lifecycle. Common examples include:

Batch Processing Using Existing Container Images

Teams often have existing containerized batch workloads such as data processors, ETL tools, or analytics jobs that are already packaged and validated. When refactoring these workloads into the Functions programming model is not desirable, Container Apps Jobs allow them to run unchanged while integrating into the Container Apps environment.

Large-Scale Data Migrations and One-Time Operations

Jobs are a natural fit for one-time or infrequently run migrations, such as schema upgrades, backfills, or bulk data transformations. These workloads are typically:

- Explicitly triggered
- Closely monitored
- Designed to run to completion under controlled conditions

The ability to manage execution, retries, and shutdown behavior directly is often important in these scenarios.

Custom Runtime or Specialized Dependency Workloads

Some workloads rely on:

- Specialized runtimes
- Native system libraries
- Third-party tools or binaries

When these requirements fall outside the supported Functions runtimes, Container Apps Jobs provide the flexibility to define the runtime environment exactly as needed.
Externally Orchestrated or Manually Triggered Workloads

In some architectures, execution is coordinated by an external system such as:

- A CI/CD pipeline
- An operations workflow
- A custom scheduler or control plane

Container Apps Jobs integrate well into these models, where execution is initiated explicitly rather than driven by platform-managed triggers.

Long-Running, Single-Instance Processing

For workloads that are intentionally designed to run as a single execution unit, without fan-out, trigger-based scaling, or workflow orchestration, Jobs provide a straightforward execution model. This includes tasks where parallelism, retries, and state handling are implemented directly within the application.

Making the Choice

| Consideration | Azure Functions on Azure Container Apps | Azure Container Apps Jobs |
| --- | --- | --- |
| Trigger model | Event-driven (files, messages, timers, HTTP, events) | Explicit execution (manual, scheduled, or externally triggered) |
| Scaling behavior | Automatic scaling based on trigger volume / queue depth | Fixed or explicitly defined parallelism |
| Programming model | Functions programming model with triggers, bindings, Durable Functions | General container execution model |
| State management | Built-in state, retries, and checkpointing via Durable Functions | Custom state management required |
| Workflow orchestration | Native support using Durable Functions | Must be implemented manually |
| Boilerplate required | Minimal (no polling, retry, or coordination code) | Higher (polling, retries, lifecycle handling) |
| Runtime flexibility | Limited to supported Functions runtimes | Full control over runtime and dependencies |

Getting Started on Functions on Azure Container Apps

If you're already running on Container Apps, adding Functions is straightforward: your Functions run alongside your existing apps, sharing the same networking, observability, and scaling infrastructure.
Check out the documentation for details: Getting Started on Functions on Azure Container Apps

```bash
# Create a Functions app in your existing Container Apps environment
az functionapp create \
  --name my-batch-processor \
  --resource-group my-resource-group \
  --storage-account mystorageaccount \
  --environment my-container-apps-env \
  --workload-profile-name "Consumption" \
  --runtime python \
  --functions-version 4
```

Getting Started on Container App Jobs on Azure Container Apps

If you already have an Azure Container Apps environment, you can create a job using the Azure CLI. Check out the documentation for details: Jobs in Azure Container Apps

```bash
az containerapp job create \
  --name my-job \
  --resource-group my-resource-group \
  --environment my-container-apps-env \
  --trigger-type Manual \
  --image mcr.microsoft.com/k8se/quickstart-jobs:latest \
  --cpu 0.25 \
  --memory 0.5Gi
```

Quick Links

- Azure Functions on Azure Container Apps overview
- Create your Azure Functions app through custom containers on Azure Container Apps
- Run event-driven and batch workloads with Azure Functions on Azure Container Apps
Introduction

Azure Application Gateway for Containers is a managed Azure service designed to handle incoming traffic for container-based applications. It brings Layer-7 load balancing, routing, TLS termination, and web application protection outside of the Kubernetes cluster and into an Azure-managed data plane. By separating traffic management from the cluster itself, the service reduces operational complexity while providing a more consistent, secure, and scalable way to expose container workloads on Azure.

Service Overview

What Application Gateway for Containers does

Azure Application Gateway for Containers is a managed Layer-7 load balancing and ingress service built specifically for containerized workloads. Its main job is to receive incoming application traffic (HTTP/HTTPS), apply routing and security rules, and forward that traffic to the right backend containers running in your Kubernetes cluster. Instead of deploying and operating an ingress controller inside the cluster, Application Gateway for Containers runs outside the cluster, as an Azure-managed data plane. It integrates natively with Kubernetes through the Gateway API (and Ingress API), translating Kubernetes configuration into fully managed Azure networking behavior.

In practical terms, it handles:

- HTTP/HTTPS routing based on hostnames, paths, headers, and methods
- TLS termination and certificate management
- Web Application Firewall (WAF) protection
- Scaling and high availability of the ingress layer

All of this is provided as a managed Azure service, without running ingress pods in your cluster.

What problems it solves

Application Gateway for Containers addresses several common challenges teams face with traditional Kubernetes ingress setups:

Operational overhead: Running ingress controllers inside the cluster means managing upgrades, scaling, certificates, and availability yourself. Moving ingress to a managed Azure service significantly reduces this burden.
Security boundaries: By keeping traffic management and WAF outside the cluster, you reduce the attack surface of the Kubernetes environment and keep security controls aligned with Azure-native services.

Consistency across environments: Platform teams can offer a standard, Azure-managed ingress layer that behaves the same way across clusters and environments, instead of relying on different in-cluster ingress configurations.

Separation of responsibilities: Infrastructure teams manage the gateway and security policies, while application teams focus on Kubernetes resources like routes and services.

How it differs from classic Application Gateway

While both services share the "Application Gateway" name, they target different use cases and operating models. The classic Azure Application Gateway is a general-purpose Layer-7 load balancer primarily designed for VM-based or service-based backends. It relies on centralized configuration through Azure resources and is not Kubernetes-native by design.

Application Gateway for Containers, on the other hand:

- Is designed specifically for container platforms
- Uses Kubernetes APIs (Gateway API / Ingress) instead of manual listener and rule configuration
- Separates control plane and data plane more cleanly
- Enables faster, near real-time updates driven by Kubernetes changes
- Avoids running ingress components inside the cluster

In short, classic Application Gateway is infrastructure-first, while Application Gateway for Containers is platform- and Kubernetes-first.

Architecture at a Glance

At a high level, Azure Application Gateway for Containers is built around a clear separation between control plane and data plane. This separation is one of the key architectural ideas behind the service and explains many of its benefits.

Control plane and data plane

The control plane is responsible for configuration and orchestration.
It listens to Kubernetes resources, such as Gateway API or Ingress objects, and translates them into a running gateway configuration. When you create or update routing rules, TLS settings, or security policies in Kubernetes, the control plane picks up those changes and applies them automatically.

The data plane is where traffic actually flows. It handles incoming HTTP and HTTPS requests, applies routing rules, performs TLS termination, and forwards traffic to the correct backend services inside your cluster. This data plane is fully managed by Azure and runs outside of the Kubernetes cluster, providing isolation and high availability by design. Because the data plane is not deployed as pods inside the cluster, it does not consume cluster resources and does not need to be scaled or upgraded by the customer.

Managed components vs customer responsibilities

One of the goals of Application Gateway for Containers is to reduce what customers need to operate, while still giving them control where it matters.

Managed by Azure:

- Application Gateway for Containers data plane
- Scaling, availability, and patching of the gateway
- Integration with Azure networking
- Web Application Firewall engine and updates
- Translation of Kubernetes configuration into gateway rules

Customer-managed:

- Kubernetes resources (Gateway API or Ingress)
- Backend services and workloads
- TLS certificates and references
- Routing and security intent (hosts, paths, policies)
- Network design and connectivity to the cluster

This split allows platform teams to keep ownership of the underlying Azure infrastructure, while application teams interact with the gateway using familiar Kubernetes APIs. The result is a cleaner operating model with fewer moving parts inside the cluster. In short, Application Gateway for Containers acts as an Azure-managed ingress layer, driven by Kubernetes configuration but operated outside the cluster.
This architecture keeps traffic management simple, scalable, and aligned with Azure-native networking and security services.

Traffic Handling and Routing

This section explains what happens to a request from the moment it reaches Azure until it is forwarded to a container running in your cluster.

Traffic Flow: From Internet to Pod

Azure Application Gateway for Containers (AGC) acts as the specialized "front door" for your Kubernetes workloads. By sitting outside the cluster, it manages high-volume traffic ingestion so your environment remains focused on application logic rather than networking overhead.

The Request Journey

Once a request is initiated by a client, such as a browser or an API, it follows a streamlined path to your container:

1. Entry via Public Frontend: The request reaches AGC's public frontend endpoint. Note: While private frontends are currently the most requested feature and are under high-priority development, the service currently supports public-facing endpoints.
2. Rule Evaluation: AGC evaluates the incoming request against the routing rules you've defined using standard Kubernetes resources (Gateway API or Ingress).
3. Direct Pod Proxying: Once a rule is matched, AGC forwards the traffic directly to the backend pods within your cluster.
4. Azure Native Delivery: Because AGC operates as a managed data plane outside the cluster, traffic reaches your workloads via Azure networking. This removes the need for managing scaling or resource contention for in-cluster ingress pods.

Flexibility in Security and Routing

The architecture is designed to be as "hands-off" or as "hands-on" as your security policy requires:

Optional TLS Offloading: You have full control over the encryption lifecycle. Depending on your specific use case, you can choose to perform TLS termination at the gateway to offload the compute-intensive decryption, or maintain encryption all the way to the container for end-to-end security.
Simplified Infrastructure: By using AGC, you eliminate the "hop" typically required by in-cluster controllers, allowing the gateway to communicate with pods with minimal latency and high predictability.

Kubernetes Integration

Application Gateway for Containers is designed to integrate natively with Kubernetes, allowing teams to manage ingress behavior using familiar Kubernetes resources instead of Azure-specific configuration. This makes the service feel like a natural extension of the Kubernetes platform rather than an external load balancer.

Gateway API as the primary integration model

The Gateway API is the preferred and recommended way to integrate Application Gateway for Containers with Kubernetes. With the Gateway API:

- Platform teams define the Gateway and control how traffic enters the cluster.
- Application teams define routes (such as HTTPRoute) to expose their services.
- Responsibilities are clearly separated, supporting multi-team and multi-namespace environments.

Application Gateway for Containers supports core Gateway API resources such as:

- GatewayClass
- Gateway
- HTTPRoute

When these resources are created or updated, Application Gateway for Containers automatically translates them into gateway configuration and applies the changes in near real time.

Ingress API support

For teams that already use the traditional Kubernetes Ingress API, Application Gateway for Containers also provides Ingress support. This allows:

- Reuse of existing Ingress manifests
- A smoother migration path from older ingress controllers
- Gradual adoption of Gateway API over time

Ingress resources are associated with Application Gateway for Containers using a specific ingress class. While fully functional, the Ingress API offers fewer capabilities and less flexibility compared to the Gateway API.
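To make the Layer-7 route matching described in these resources more concrete, here is a small, self-contained Python sketch of host plus longest-path-prefix backend selection. This is purely illustrative (the `pick_backend` helper and its route structure are invented for this example, not part of any Azure or Kubernetes API):

```python
def pick_backend(routes, host, path):
    """Select the backend whose host matches and whose path prefix is longest."""
    best = None
    for route in routes:
        if route["host"] == host and path.startswith(route["path_prefix"]):
            if best is None or len(route["path_prefix"]) > len(best["path_prefix"]):
                best = route
    return best["backend"] if best else None

routes = [
    {"host": "myapp.example.com", "path_prefix": "/", "backend": "web-service"},
    {"host": "myapp.example.com", "path_prefix": "/api", "backend": "api-service"},
]

print(pick_backend(routes, "myapp.example.com", "/api/orders"))  # longest prefix wins: api-service
```

A real gateway does considerably more (header and method matches, weights, precedence rules), but the longest-match intuition carries over to how overlapping PathPrefix rules are resolved.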
How teams interact with the service

A key benefit of this integration model is the clean separation of responsibilities:

Platform teams:

- Provision and manage Application Gateway for Containers
- Define gateways, listeners, and security boundaries
- Own network and security policies

Application teams:

- Define routes using Kubernetes APIs
- Control how their applications are exposed
- Do not need direct access to Azure networking resources

This approach enables self-service for application teams while keeping governance and security centralized.

Why this matters

By integrating deeply with Kubernetes APIs, Application Gateway for Containers avoids custom controllers, sidecars, or ingress pods inside the cluster. Configuration stays declarative, changes are automated, and the operational model stays consistent with Kubernetes best practices.

Security Capabilities

Security is a core part of Azure Application Gateway for Containers and one of the main reasons teams choose it over in-cluster ingress controllers. The service brings Azure-native security controls directly in front of your container workloads, without adding complexity inside the cluster.

Web Application Firewall (WAF)

Application Gateway for Containers integrates with Azure Web Application Firewall (WAF) to protect applications against common web attacks such as SQL injection, cross-site scripting, and other OWASP Top 10 threats. A key differentiator of this service is that it leverages Microsoft's global threat intelligence. This provides an enterprise-grade layer of security that constantly evolves to block emerging threats, a significant advantage over many open-source or standard competitor WAF solutions.

Because the WAF operates within the managed data plane, it offers several operational benefits:

Zero Cluster Footprint: No WAF-specific pods or components are required to run inside your Kubernetes cluster, saving resources for your actual applications.
Edge Protection: Security rules and policies are applied at the Azure network edge, ensuring malicious traffic is blocked before it ever reaches your workloads.

Automated Maintenance: All rule updates, patching, and engine maintenance are handled entirely by Azure.

Centralized Governance: WAF policies can be managed centrally, ensuring consistent security enforcement across multiple teams and namespaces, a critical requirement for regulated environments.

TLS and certificate handling

TLS termination happens directly at the gateway. HTTPS traffic is decrypted at the edge, inspected, and then forwarded to backend services. Key points:

- Certificates are referenced from Kubernetes configuration
- TLS policies are enforced by the Azure-managed gateway
- Applications receive plain HTTP traffic, keeping workloads simpler

This approach allows teams to standardize TLS behavior across clusters and environments, while avoiding certificate logic inside application pods.

Network isolation and exposure control

Because Application Gateway for Containers runs outside the cluster, it provides a clear security boundary between external traffic and Kubernetes workloads. Common patterns include:

- Internet-facing gateways with WAF protection
- Private gateways for internal or zero-trust access
- Controlled exposure of only selected services

By keeping traffic management and security at the gateway layer, clusters remain more isolated and easier to protect.

Security by design

Overall, the security model follows a simple principle: inspect, protect, and control traffic before it enters the cluster. This reduces the attack surface of Kubernetes, centralizes security controls, and aligns container ingress with Azure's broader security ecosystem.

Scale, Performance, and Limits

Azure Application Gateway for Containers is built to handle production-scale traffic without requiring customers to manage capacity, scaling rules, or availability of the ingress layer.
Scalability and performance are handled as part of the managed service.

Interoperability: The Best of Both Worlds

A common hesitation when adopting cloud-native networking is the fear of vendor lock-in. Many organizations worry that using a provider-specific ingress service will tie their application logic too closely to a single cloud's proprietary configuration. Azure Application Gateway for Containers (AGC) addresses this directly by utilizing the Kubernetes Gateway API as its primary integration model. This creates a powerful decoupling between how you define your traffic and how that traffic is actually delivered.

Standardized API, Managed Execution

By adopting this model, you gain two critical advantages simultaneously:

- Zero Vendor Lock-In (Standardized API): Your routing logic is defined using the open-source Kubernetes Gateway API standard. Because HTTPRoute and Gateway resources are community-driven standards, your configuration remains portable and familiar to any Kubernetes professional, regardless of the underlying infrastructure.
- Zero Operational Overhead (Managed Implementation): While the interface is a standard Kubernetes API, the implementation is a high-performance Azure-managed service. You gain the benefits of an enterprise-grade load balancer (automatic scaling, high availability, and integrated security) without the burden of managing, patching, or troubleshooting proxy pods inside your cluster.

The "Pragmatic" Advantage

As highlighted in recent architectural discussions, moving from traditional Ingress to the Gateway API is about more than just new features; it's about interoperability. It allows platform teams to offer a consistent, self-service experience to developers while retaining the ability to leverage the best-in-class performance and security that only a native cloud provider can offer.
The result is a future-proof architecture: your teams use the industry-standard language of Kubernetes to describe what they need, and Azure provides the managed muscle to make it happen.

Scaling model

Application Gateway for Containers uses an automatic scaling model. The gateway data plane scales up or down based on incoming traffic patterns, without manual intervention. From an operator's perspective:

- There are no ingress pods to scale
- No node capacity planning for ingress
- No separate autoscaler to configure

Scaling is handled entirely by Azure, allowing teams to focus on application behavior rather than ingress infrastructure.

Performance characteristics

Because the data plane runs outside the Kubernetes cluster, ingress traffic does not compete with application workloads for CPU or memory. This often results in:

- More predictable latency
- Better isolation between traffic management and application execution
- Consistent performance under load

The service supports common production requirements such as:

- High concurrent connections
- Low-latency HTTP and HTTPS traffic
- Near real-time configuration updates driven by Kubernetes changes

Service limits and considerations

Like any managed service, Application Gateway for Containers has defined limits that architects should be aware of when designing solutions. These include limits around:

- Number of listeners and routes
- Backend service associations
- Certificates and TLS configurations
- Throughput and connection scaling thresholds

These limits are documented and enforced by the platform to ensure stability and predictable behavior. For most application platforms, these limits are well above typical usage. However, they should be reviewed early when designing large multi-tenant or high-traffic environments.

Designing with scale in mind

The key takeaway is that Application Gateway for Containers removes ingress scaling from the cluster and turns it into an Azure-managed concern.
This simplifies operations and provides a stable, high-performance entry point for container workloads.

When to Use (and When Not to Use)

| Scenario | Use it? | Why |
| --- | --- | --- |
| Kubernetes workloads on Azure | ✅ Yes | The service is designed specifically for container platforms and integrates natively with Kubernetes APIs. |
| Need for managed Layer-7 ingress | ✅ Yes | Routing, TLS, and scaling are handled by Azure without in-cluster components. |
| Enterprise security requirements (WAF, TLS policies) | ✅ Yes | Built-in Azure WAF and centralized TLS enforcement simplify security. |
| Platform team managing ingress for multiple apps | ✅ Yes | Clear separation between platform and application responsibilities. |
| Multi-tenant Kubernetes clusters | ✅ Yes | Gateway API model supports clean ownership boundaries and isolation. |
| Desire to avoid running ingress controllers in the cluster | ✅ Yes | No ingress pods, no cluster resource consumption. |
| VM-based or non-container backends | ❌ No | Classic Application Gateway is a better fit for non-container workloads. |
| Simple, low-traffic test or dev environments | ❌ Maybe not | A lightweight in-cluster ingress may be simpler and more cost-effective. |
| Need for custom or unsupported L7 features | ❌ Maybe not | Some advanced or niche ingress features may not yet be available. |
| Non-Kubernetes platforms | ❌ No | The service is tightly integrated with Kubernetes APIs. |

When to Choose a Different Path: Azure Container Apps

While Application Gateway for Containers provides the ultimate control for Kubernetes environments, not every project requires that level of infrastructure management. For teams that don't need the full flexibility of Kubernetes and are looking for the fastest path to running containers on Azure without managing clusters or ingress infrastructure at all, Azure Container Apps offers a specialized alternative. It provides a fully managed, serverless container platform that handles scaling, ingress, and networking automatically "out of the box".
Key Differences at a Glance

| Feature | AGC + Kubernetes | Azure Container Apps |
| --- | --- | --- |
| Control | Granular control over cluster and ingress. | Fully managed, serverless experience. |
| Management | You manage the cluster; Azure manages the gateway. | Azure manages both the platform and ingress. |
| Best For | Complex, multi-team, or highly regulated environments. | Rapid development and simplified operations. |

Appendix - Routing configuration examples

The following examples show how Application Gateway for Containers can be configured using both Gateway API and Ingress API for common routing and TLS scenarios. More examples can be found here, in the detailed documentation.

HTTP listener

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - backendRefs:
    - name: app-service
      port: 80
```

Path routing logic

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: path-routing
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: api-service
      port: 80
  - backendRefs:
    - name: web-service
      port: 80
```

Weighted canary / rollout

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: canary-route
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - backendRefs:
    - name: app-v1
      port: 80
      weight: 80
    - name: app-v2
      port: 80
      weight: 20
```

TLS Termination

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: azure-alb-external
  tls:
  - hosts:
    - app.contoso.com
    secretName: tls-cert
  rules:
  - host: app.contoso.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80
```
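The weighted canary pattern relies on the data plane splitting traffic proportionally to the declared weights. The following self-contained Python sketch shows the idea behind weight-proportional backend selection, using the same 80/20 split; it is an illustration of the concept, not how Application Gateway for Containers is actually implemented:

```python
import random
from collections import Counter

def choose_backend(backends, rng):
    """Pick one backend with probability proportional to its weight."""
    total = sum(b["weight"] for b in backends)
    roll = rng.uniform(0, total)
    cumulative = 0
    for b in backends:
        cumulative += b["weight"]
        if roll < cumulative:
            return b["name"]
    return backends[-1]["name"]

# Mirrors the canary-route example: weight 80 to app-v1, weight 20 to app-v2
backends = [{"name": "app-v1", "weight": 80}, {"name": "app-v2", "weight": 20}]
rng = random.Random(42)  # seeded for reproducibility
counts = Counter(choose_backend(backends, rng) for _ in range(10_000))
# app-v1 receives roughly 80% of the simulated requests
```

Shifting the weights in the HTTPRoute (for example 50/50, then 0/100) progressively moves real traffic to the new version in exactly this proportional way.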
The Kubernetes Steering Committee has announced that the Nginx Ingress controller will be retired in March 2026. That's not far away, and once this happens Nginx Ingress will not receive any further updates, including security patches. Continuing to run the standalone Nginx Ingress controller past the end of March could open you up to security risks.

Azure Kubernetes Service (AKS) offers a managed routing add-on which also implements Nginx as the Ingress controller. Microsoft has recently committed to supporting this version of Nginx Ingress until November 2026. There is also an updated version of the App Routing add-on in the works, which will be based on Istio to allow for a transition off Nginx Ingress. This new App Routing add-on will support Gateway API based ingress only, so there will be some migration required if you are using the Ingress API. There is tooling available to support migration from Ingress to Gateway API, such as the Ingress2Gateway tool.

If you are already using the App Routing add-on then you are supported until November and have extra time to either move to the new Istio based solution when it is released or migrate to another solution such as App Gateway for Containers. However, if you are running the standalone version of Nginx Ingress, you may want to consider migrating to the App Routing add-on to give you some extra time. To be very clear, migrating to the App Routing add-on does not solve the problem; it buys you some more time until November and sets you up for a transition to the future Istio based App Routing add-on. Once you complete this migration you will need to plan to either move to the new version based on Istio, or migrate to another Ingress solution, before November.

The rest of this article walks through migrating from BYO Nginx to the App Routing add-on without disrupting your existing traffic.

How Parallel Running Works

The key to a zero-downtime migration is that both controllers can run simultaneously.
Each controller uses a different IngressClass, so Kubernetes routes traffic based on which class your Ingress resources reference. Your BYO Nginx uses the nginx IngressClass and runs in the ingress-nginx namespace. The App Routing add-on uses the webapprouting.kubernetes.azure.com IngressClass and runs in the app-routing-system namespace. They operate completely independently, each with its own load balancer IP.

This means you can:

- Enable the add-on alongside your existing controller
- Create new Ingress resources targeting the add-on (or duplicate existing ones)
- Validate everything works via the add-on's IP
- Cut over DNS or backend pool configuration
- Remove the old Ingress resources once you're satisfied

At no point does your production traffic go offline.

Enabling the App Routing add-on

Start by enabling the add-on on your existing cluster. This doesn't touch your BYO Nginx installation.

```bash
az aks approuting enable \
  --resource-group <resource-group> \
  --name <cluster-name>
```

Wait for the add-on to deploy. You can verify it's running by checking the app-routing-system namespace:

```bash
kubectl get pods -n app-routing-system
kubectl get svc -n app-routing-system
```

You should see the Nginx controller pod running and a service called nginx with a load balancer IP. This IP is separate from your BYO controller's IP.

```bash
# Get both IPs for comparison
BYO_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
ADDON_IP=$(kubectl get svc nginx -n app-routing-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo "BYO Nginx IP: $BYO_IP"
echo "Add-on IP: $ADDON_IP"
```

Both controllers are now running. Your existing applications continue to use the BYO controller because their Ingress resources still reference ingressClassName: nginx.

Migrating Applications: The Parallel Ingress Approach

For production workloads, create a second Ingress resource that targets the add-on.
This lets you validate everything before cutting over traffic. Here's an example. Your existing Ingress might look like this:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress-byo
  namespace: myapp
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx  # BYO controller
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 80
```

Create a new Ingress for the add-on:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress-add-on
  namespace: myapp
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: webapprouting.kubernetes.azure.com  # add-on controller
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 80
```

Apply this new Ingress resource. The add-on controller picks it up and configures routing, but your production traffic still flows through the BYO controller because DNS (or your backend pool) still points to the old IP.

Validating Before Cutover

Test the new route via the add-on IP before touching anything else:

```bash
# For public ingress with DNS
curl -H "Host: myapp.example.com" http://$ADDON_IP

# For private ingress, test from a VM in the VNet
curl -H "Host: myapp.example.com" http://$ADDON_IP
```

Run your full validation suite against this IP. Check TLS certificates, path routing, authentication, rate limiting, custom headers, and anything else your application depends on. If you have monitoring or synthetic tests, point them at the add-on IP temporarily to gather confidence. If something doesn't work, you can troubleshoot without affecting production. The BYO controller is still handling all real traffic.

Cutover to the Routing Add-on

If your ingress has a public IP and you're using DNS to route traffic, the cutover is straightforward. Lower your DNS TTL well in advance. Set it to 60 seconds at least an hour before you plan to cut over.
This ensures changes propagate quickly and you can roll back fast if needed. When you're ready, update your DNS A record to point to the add-on IP.

If your ingress has a private IP and sits behind App Gateway, API Management, or Front Door, the cutover involves updating the backend pool instead of DNS.

In-Place Patching: The Faster But Riskier Option

If you're migrating a non-critical application or an internal service where some downtime is acceptable, you can patch the ingressClassName in place:

```bash
kubectl patch ingress myapp-ingress-byo -n myapp \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/ingressClassName","value":"webapprouting.kubernetes.azure.com"}]'
```

This is atomic from Kubernetes' perspective. The BYO controller immediately drops the route, and the add-on controller immediately picks it up. In practice, there's usually a few seconds gap while the add-on configures Nginx and reloads. Once this change is made, the Ingress will not work until you update your DNS or backend pool details to point to the new IP.

Decommissioning the BYO Nginx Controller

Once all your applications are migrated and you're confident everything works, you can remove the BYO controller. First, verify nothing is still using it:

```bash
kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \
  | grep -v "webapprouting"
```

If that returns only the header row (or is empty), you're clear to proceed. If it shows any Ingress resources, you've still got work to do.

Remove the BYO Nginx Helm release:

```bash
helm uninstall ingress-nginx -n ingress-nginx
kubectl delete namespace ingress-nginx
```

The Azure Load Balancer provisioned for the BYO controller will be deprovisioned automatically. Verify only the add-on IngressClass remains:

```bash
kubectl get ingressclass
```

You should see only webapprouting.kubernetes.azure.com.
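As an aside on the in-place patch used earlier in this section: the `--type='json'` flag tells kubectl to send an RFC 6902 JSON Patch, and a single `replace` operation swaps the value at `/spec/ingressClassName`. This minimal Python sketch reproduces what that one operation does to the object (for illustration only; kubectl and the API server handle this for you):

```python
def apply_replace_op(obj, path, value):
    """Apply a JSON Patch 'replace' operation (RFC 6902) to a nested dict."""
    keys = [k for k in path.split("/") if k]  # "/spec/ingressClassName" -> ["spec", "ingressClassName"]
    target = obj
    for key in keys[:-1]:
        target = target[key]
    if keys[-1] not in target:
        # 'replace' (unlike 'add') requires the target value to already exist
        raise KeyError(f"no existing value at {path}")
    target[keys[-1]] = value
    return obj

ingress = {"spec": {"ingressClassName": "nginx"}}
apply_replace_op(ingress, "/spec/ingressClassName",
                 "webapprouting.kubernetes.azure.com")
# ingress["spec"]["ingressClassName"] now names the add-on's class
```

Because the whole patch is applied as one API request, there is never a state where the Ingress has no class, which is why the handover between controllers is atomic from Kubernetes' point of view.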
Key Differences Between BYO Nginx and the App Routing add-on

The add-on runs the same Nginx binary, so most of your configuration carries over. However, there are a few differences worth noting.

TLS Certificates: The BYO setup typically uses cert-manager or manual Secrets for certificates. The add-on supports this, but it also integrates natively with Azure Key Vault. If you want to use Key Vault, you need to configure the add-on with the appropriate annotations. Otherwise, your existing cert-manager setup continues to work.

DNS Management: If you're using external-dns with your BYO controller, it works with the add-on too. The add-on also has native integration with Azure DNS zones if you want to simplify your setup. This is optional.

Custom Nginx Configuration: With BYO Nginx, you have full access to the ConfigMap and can customise the global Nginx configuration extensively. The add-on restricts this because it's managed by Azure. If you've done significant customisation (Lua scripts, custom modules, etc.), audit carefully to ensure the add-on supports what you need. Most standard configurations work fine.

Annotations: The nginx.ingress.kubernetes.io/* annotations work the same way. The add-on adds some Azure-specific annotations for WAF integration and other features, but your existing annotations should carry over without changes.

What Comes Next

This migration gets you onto a supported platform, but it's temporary. November 2026 is not far away, and you'll need to plan your next move. Microsoft is building a new App Routing add-on based on Istio. This is expected later in 2026 and will likely become the long-term supported option. Keep an eye on Azure updates for announcements about preview availability and migration paths. If you need something production-ready sooner, App Gateway for Containers is worth evaluating. It's built on Envoy and supports the Kubernetes Gateway API, which is the future direction for ingress in Kubernetes.
The Gateway API is more expressive than the Ingress API and is designed to be vendor-neutral. For now, getting off the unsupported BYO Nginx controller is the priority. The App Routing add-on gives you the breathing room to make an informed decision about your long-term strategy rather than rushing into something because you're running out of time.

Beyond the Desktop: The Future of Development with Microsoft Dev Box and GitHub Codespaces
The modern developer platform has already moved past the desktop. We're no longer defined by what's installed on our laptops; instead, we look at what tooling we can use to move from idea to production. An organisation's developer platform strategy is no longer a nice-to-have: it sets the ceiling for what's possible, and an organisation can't iterate its way to developer nirvana if the foundation itself is brittle. A great developer platform shrinks TTFC (time to first commit), accelerates release velocity, and, maybe most importantly, helps alleviate the everyday frictions that lead to developer burnout.

Very few platforms deliver everything an organisation needs from a developer platform in one product. Modern development spans multiple dimensions: local tooling, cloud infrastructure, compliance, security, cross-platform builds, collaboration, and rapid onboarding. The options organisations face are then to either compromise on one or more of these areas or force developers into rigid environments that slow productivity and innovation. This is where Microsoft Dev Box and GitHub Codespaces come into play. On their own, each addresses critical parts of the modern developer platform.

Microsoft Dev Box provides a full, managed cloud workstation. Dev Box gives developers a consistent, high-performance environment while letting central IT apply strict governance and control. Internally at Microsoft, we estimate that usage of Dev Box by our development teams delivers savings of 156 hours per year per developer purely on local environment setup and upkeep. We have also seen significant gains in other key SPACE metrics, reducing context-switching friction and improving build/test cycles. Although the benefits of Dev Box are clear in the results demonstrated by our customers, it is not without its challenges. The biggest challenge often faced by Dev Box customers is its lack of native Linux support.
At the time of writing, and for the foreseeable future, Dev Box does not support native Linux developer workstations. While WSL2 provides partial parity, I know from my own engineering projects that it still does not deliver the full experience. This is where GitHub Codespaces comes into the story.

GitHub Codespaces delivers instant, Linux-native environments spun up directly from your repository. It's lightweight, reproducible, and ephemeral: ideal for rapid iteration, PR testing, and cross-platform development where you need Linux parity or containerized workflows. Unlike Dev Box, Codespaces can run fully in Linux, giving developers access to native tools, scripts, and runtimes without workarounds. It also removes much of the friction around onboarding: a new developer can open a repository and be coding in minutes, with the exact environment defined by the project's devcontainer.json. That said, Codespaces isn't a complete replacement for a full workstation. While it's perfect for isolated project work or ephemeral testing, it doesn't provide the persistent, policy-controlled environment that enterprise teams often require for heavier workloads or complex toolchains.

Used together, they fill the gaps that neither can cover alone: Dev Box gives the enterprise-grade foundation, while Codespaces provides the agile, cross-platform sandbox. For organisations, this pairing sets a higher ceiling for developer productivity, delivering a truly hybrid, agile, and well-governed developer platform.

Better Together: Dev Box and GitHub Codespaces in action

Together, Microsoft Dev Box and GitHub Codespaces deliver a hybrid developer platform that combines consistency, speed, and flexibility. Teams can spin up full, policy-compliant Dev Box workstations preloaded with enterprise tooling, IDEs, and local testing infrastructure, while Codespaces provides ephemeral, Linux-native environments tailored to each project.
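That onboarding story can be reduced to a couple of GitHub CLI commands. A sketch, where the repository name is a placeholder and the Codespace is built from whatever the repo's devcontainer.json defines:

```shell
# Create a Codespace from a repository's default devcontainer definition,
# then list your running Codespaces.
gh codespace create --repo contoso/iot-dashboard --branch main
gh codespace list
```

From there, `gh codespace code` or the browser editor drops you straight into a ready-to-code environment.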
One of my favourite use cases is having local testing setups, like a Docker Swarm cluster, ready to go in either Dev Box or Codespaces. New developers can jump in and start running services or testing microservices immediately, without spending hours on environment setup. Anecdotally, my time to first commit and time to delivering "impact" has been significantly faster on projects where one or both technologies provide local development services out of the box. Switching between Dev Boxes and Codespaces is seamless: every environment keeps its own libraries, extensions, and settings intact, so developers can jump between projects without reconfiguring or breaking dependencies. The result is a turnkey, ready-to-code experience that maximizes productivity, reduces friction, and lets teams focus entirely on building, testing, and shipping software.

To showcase this value, let's walk through an example scenario that simulates a typical modern developer workflow: a day in the life of a developer on this hybrid platform building an IoT project using Python and React.

- Spin up a ready-to-go workstation (Dev Box) for Windows development and heavy builds.
- Launch a Linux-native Codespace for cross-platform services, ephemeral testing, and PR work.
- Run "local" testing like a Docker Swarm cluster, database, and message queue ready to go out of the box.
- Switch seamlessly between environments without losing project-specific configurations, libraries, or extensions.

9:00 AM – Morning Kickoff on Dev Box

I start my day on my Microsoft Dev Box, which gives me a fully configured Windows environment with VS Code, design tools, and Azure integrations. I select my team's project, and the environment is pre-configured for me through the Dev Box catalogue. Fortunately for me, it's already provisioned. I could always self-service another one using the "New Dev Box" button if I wanted to.
I'll connect through the browser, but I could use the desktop app too if I wanted to. My tasks are:

- Prototype a new dashboard widget for monitoring IoT device temperature.
- Use GUI-based tools to tweak the UI and preview changes live.
- Review my Visio architecture.
- Join my morning stand-up.
- Write documentation notes and plan API interactions for the backend.

In a flash, I have access to my modern work tooling like Teams, I have this project's files already preloaded, and all my peripherals are working without additional setup. Only downside was that I did seem to be the only person on my stand-up this morning?

Why Dev Box first:

- GUI-heavy tasks are fast and responsive. Dev Box's environment allows me to use a full desktop.
- Great for early-stage design, planning, and visual work.
- Enterprise apps are ready for me to use out of the box (P.S. It also supports my multi-monitor setup).

I use my Dev Box to make a very complicated change to my IoT dashboard: changing the title from "IoT Dashboard" to "Owain's IoT Dashboard". I preview this change in a browser live. (Time for a coffee after this hard work.) The rest of the dashboard isn't loading as my backend isn't running... yet.

10:30 AM – Switching to Linux Codespaces

Once the UI is ready, I push the code to GitHub and spin up a Linux-native GitHub Codespace for backend development.

Tasks:

- Implement FastAPI endpoints to support the new IoT feature.
- Run the service on my Codespace and debug any errors.

Why Codespaces now:

- Linux-native tools ensure compatibility with the production server.
- Docker and containerized testing run natively, avoiding WSL translation overhead.
- The environment is fully reproducible across any device I log in from.

12:30 PM – Midday Testing & Sync

I toggle between Dev Box and Codespaces to test and validate the integration. I do this in my Dev Box Edge browser, viewing my Codespace (I use my Codespace in a browser through this demo to highlight the difference in environments.
In reality, I would leverage the VS Code "Remote Explorer" extension and its GitHub Codespaces integration to use my Codespace from within my own desktop VS Code, but that is personal preference), and I use the same browser to view my frontend preview. I update the environment variable for my frontend that is running locally in my Dev Box and point it at the port running my API locally on my Codespace. In this case it was a WebSocket connection and HTTPS calls to port 8000. I can make these public by changing the port visibility in my Codespace.

https://fluffy-invention-5x5wp656g4xcp6x9-8000.app.github.dev/api/devices
wss://fluffy-invention-5x5wp656g4xcp6x9-8000.app.github.dev/ws

This allows me to:

- Preview the frontend widget on Dev Box, connecting to the backend running in Codespaces.
- Make small frontend adjustments in Dev Box while monitoring backend logs in Codespaces.
- Commit changes to GitHub, keeping both environments in sync and leveraging my CI/CD for deployment to the next environment.

We can see the Dev Box running the local frontend and the Codespace running the API, connected to each other, making requests and displaying the data in the frontend!

Hybrid advantage:

- Dev Box handles GUI previews comfortably and allows me to live-test frontend changes.
- Codespaces handles production-aligned backend testing and Linux-native tools.
- Dev Box allows me to view all of my files on one screen, with potentially multiple Codespaces running in the browser or VS Code Desktop.

Due to all of those platform efficiencies, I have completed my day's goals within an hour or two, and now I can spend the rest of my day learning about how to enable my developers to inner source using GitHub Copilot and MCP (shameless plug).

The bottom line

There are some additional considerations when architecting a developer platform for an enterprise, such as private networking and security, not covered in this post, but these are implementation details to deliver the described developer experience.
Architecting such a platform is a valuable investment to deliver the developer platform foundations we discussed at the top of the article. While the demo I quickly built here used a mono repository, in real engineering teams it is likely (I hope) that an application is built from many different repositories. The great thing about Dev Box and Codespaces is that this wouldn't slow down the rapid development I can achieve when using both. My Dev Box would be specific to the project or development team, preloaded with all the tools I need and potentially some repos too! When I need to, I can quickly switch over to Codespaces, work in a clean isolated environment, and push my changes. In both cases, any changes I want to deliver locally are pushed into GitHub (or ADO) and merged, and my CI/CD ensures that the next step, potentially a staging environment or, who knows, perhaps *whispering* straight into production, is taken care of. Once I'm finished, I delete my Codespace and potentially my Dev Box if I am done with the project, knowing I can self-service either one of these anytime and be up and running again!

Now, is there overlap in terms of what can be developed in a Codespace vs what can be developed in a Dev Box? Of course. But as organisations prioritise developer experience to ensure release velocity while maintaining organisational standards and governance, providing developers both a Windows-native and a Linux-native service, both of which are primarily charged on the consumption of the compute*, is a no-brainer. There are also gaps that neither fills at the moment: for example, Microsoft Dev Box only provides Windows compute, while GitHub Codespaces only supports VS Code as your chosen IDE. It's not a question of which service to choose for your developers; these two services are better together!

*Changes have been announced to Dev Box pricing. A W365 license is already required today and dev boxes will continue to be managed through Azure.
For more information please see: Microsoft Dev Box capabilities are coming to Windows 365 - Microsoft Dev Box | Microsoft Learn

Exploring Traffic Manager Integration for External DNS
When you deploy externally accessible applications into Kubernetes, there is usually a requirement to create some DNS records pointing to these applications, to allow your users to resolve them. Rather than manually creating these DNS records, there are tools that will do this work for you, one of which is External DNS. External DNS can watch your Kubernetes resource configuration for specific annotations and then use these to create DNS records in your DNS zone. It has integrations with many DNS providers, including Azure DNS. This solution works well and is in use by many customers using AKS and Azure DNS.

Where we hit a limitation with External DNS in Azure is in scenarios where we need to distribute traffic across multiple clusters for load balancing and global distribution. There are a few ways to achieve this global distribution in Azure; one way is to use Azure Traffic Manager. Unlike something like Azure Front Door, Azure Traffic Manager is a DNS-based global load balancer. Traffic is directed to your different AKS clusters based on DNS resolution using Traffic Manager. When a user queries your Traffic Manager CNAME, they hit the Traffic Manager DNS servers, which then return a DNS record for a specific cluster based on your load balancing configuration. Traffic Manager can then introduce advanced load balancing scenarios such as:

- Geographic routing - direct users to the nearest endpoint based on their location
- Weighted distribution - split traffic percentages (e.g., 80% to one region, 20% to another)
- Priority-based failover - automatic disaster recovery with primary/backup regions
- Performance-based routing - direct users to the endpoint with lowest latency

So, given that Azure Traffic Manager is essentially a DNS service, it would be good if we could manage it using External DNS.
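For context, the resource we would like External DNS to manage is an ordinary Traffic Manager profile. Creating an equivalent one by hand with the Azure CLI looks roughly like this; the names and health-probe settings are placeholders:

```shell
# A weighted Traffic Manager profile with an HTTPS health probe.
az network traffic-manager profile create \
  --name my-app-global \
  --resource-group my-tm-rg \
  --routing-method Weighted \
  --unique-dns-name my-app-global \
  --ttl 30 \
  --protocol HTTPS \
  --port 443 \
  --path "/health"
```

The `--unique-dns-name` becomes the `<name>.trafficmanager.net` FQDN that clients resolve; everything else in this post is about automating exactly this kind of configuration from Kubernetes annotations.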
We already use External DNS to create DNS records for each of our individual clusters, so why not use it to also create the Traffic Manager DNS configuration to load balance across them? This would also provide the added benefit of allowing us to change our load balancing strategy or configuration by making changes to our External DNS annotations. Unfortunately, an External DNS integration for Traffic Manager doesn't currently exist. In this post, I'll walk through a proof of concept for a provider I built to explore whether this integration is viable, and share what I learned along the way.

External DNS Webhook Provider

External DNS has two types of integrations. The first, and most common for "official" providers, are "in-tree" providers, which are the integrations created by External DNS contributors that sit within the central External DNS repository. This includes the Azure DNS provider. The second type is the webhook provider, which allows external contributors to easily create their own providers without the need to submit them to the core External DNS repo and go through that release process. We are going to use the webhook provider.

External DNS has begun the process of phasing out "in-tree" providers and replacing them with webhook ones. No new "in-tree" providers will be accepted.
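For reference, pointing External DNS at a webhook provider is a matter of a couple of flags. A minimal sketch, assuming the webhook runs as a sidecar on the upstream default address of localhost:8888:

```shell
# Run External DNS against a webhook provider instead of an in-tree one.
external-dns \
  --source=service \
  --provider=webhook \
  --webhook-provider-url=http://localhost:8888 \
  --txt-owner-id=my-cluster
```

In a real deployment these arguments live in the External DNS Deployment manifest, with the webhook container running alongside it in the same pod.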
By using the webhook provider mechanism, I was able to create a proof-of-concept Azure Traffic Manager provider that does the following:

- Watches Kubernetes Services for Traffic Manager annotations
- Automatically creates and manages Traffic Manager profiles and endpoints
- Syncs state between your Kubernetes clusters and Azure
- Enables annotation-driven configuration - no manual Traffic Manager management needed
- Handles duplication, so that when your second cluster attempts to create a Traffic Manager record it adds an endpoint to the existing instance
- Works alongside the standard External DNS Azure provider for complete DNS automation

Here's what the architecture looks like:

Example Configuration: Multi-Region Deployment

Let me walk you through a practical example using this provider to see how it works. We will deploy an application across two Azure regions with weighted traffic distribution and automatic failover. This example assumes you have built and deployed the PoC provider as discussed below.

Step 1: Deploy Your Application

You deploy your application to both East US and West US AKS clusters. Each deployment is a standard Kubernetes Service of type LoadBalancer. We then apply annotations that tell our External DNS provider how to create and configure the Traffic Manager resource. These annotations are defined as part of our plugin.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-east
  namespace: production
  annotations:
    # Standard External DNS annotation
    external-dns.alpha.kubernetes.io/hostname: my-app-east.example.com
    # Enable Traffic Manager integration
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-enabled: "true"
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-resource-group: "my-tm-rg"
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-profile-name: "my-app-global"
    # Weighted routing configuration
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-weight: "70"
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-endpoint-name: "east-us"
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-endpoint-location: "eastus"
    # Health check configuration
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-monitor-path: "/health"
    external-dns.alpha.kubernetes.io/webhook-traffic-manager-monitor-protocol: "HTTPS"
spec:
  type: LoadBalancer
  ports:
  - port: 443
    targetPort: 8080
  selector:
    app: my-app
```

The West US deployment has identical annotations, except:

- Weight: "30" (sending 30% of traffic here initially)
- Endpoint name: "west-us"
- Location: "westus"

Step 2: Automatic Resource Creation

When you deploy these Services, here's what happens automatically:

1. Azure Load Balancer provisions and assigns public IPs to each Service
2. External DNS (Azure provider) creates A records:
   - my-app-east.example.com → East US LB IP
   - my-app-west.example.com → West US LB IP
3. External DNS (webhook provider) creates a Traffic Manager profile named my-app-global
4. The webhook adds endpoints to the profile:
   - East endpoint (weight: 70, target: my-app-east.example.com)
   - West endpoint (weight: 30, target: my-app-west.example.com)
5. External DNS (Azure provider) creates a CNAME: my-app.example.com → Traffic Manager FQDN

Now when users access my-app.example.com, Traffic Manager routes 70% of traffic to East US and 30% to West US, with automatic health checking on both endpoints.
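Once the webhook has done its work, you can sanity-check the resulting profile from the Azure CLI. A sketch using the resource names from the example above; the query projection is just one way to slice the output:

```shell
# List the profile's endpoints with their weights and health status.
az network traffic-manager endpoint list \
  --resource-group my-tm-rg \
  --profile-name my-app-global \
  --query "[].{name:name, target:target, weight:weight, status:endpointMonitorStatus}" \
  --output table
```

An `endpointMonitorStatus` of Online on both endpoints confirms the health probes are passing before you rely on the profile for real traffic.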
Step 3: Gradual Traffic Migration

Want to shift more traffic to West US? Just update the annotations:

```shell
kubectl annotate service my-app-east \
  external-dns.alpha.kubernetes.io/webhook-traffic-manager-weight="50" \
  --overwrite

kubectl annotate service my-app-west \
  external-dns.alpha.kubernetes.io/webhook-traffic-manager-weight="50" \
  --overwrite
```

Within minutes, traffic distribution automatically adjusts to 50/50. This enables:

- Blue-green deployments - test new versions with small traffic percentages
- Canary releases - gradually increase traffic to new deployments
- Geographic optimisation - adjust weights based on user distribution

Step 4: Automatic Failover

If the East US cluster becomes unhealthy, Traffic Manager's health checks detect this and automatically fail over 100% of traffic to West US, no manual intervention required.

How It Works

The webhook implements External DNS's webhook provider protocol:

1. Negotiate Capabilities - External DNS queries the webhook to determine supported features and versions.

2. Adjust Endpoints - External DNS sends all discovered endpoints to the webhook. The webhook:
   - Filters for Services with Traffic Manager annotations
   - Validates configuration
   - Enriches endpoints with metadata
   - Returns only Traffic Manager-enabled endpoints

3. Record State - External DNS queries the webhook for current Traffic Manager state. The webhook:
   - Syncs profiles from Azure
   - Converts to External DNS endpoint format
   - Returns CNAME records pointing to Traffic Manager FQDNs

4. Apply Changes - External DNS sends CREATE/UPDATE/DELETE operations. The webhook:
   - Creates Traffic Manager profiles as needed
   - Adds/updates/removes endpoints
   - Configures health monitoring
   - Updates in-memory state cache

The webhook uses the Azure SDK for Go to interact with the Traffic Manager API and maintains an in-memory cache of profile state to optimise performance and reduce API calls.
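If you want to see that contract in action, you can poke the webhook's HTTP API directly. The paths and media type below follow the upstream External DNS webhook specification; the port is whatever your deployment exposes (8888 here is an assumption):

```shell
# Negotiate capabilities and the domain filter the provider supports.
curl -s http://localhost:8888/ \
  -H "Accept: application/external.dns.webhook+json;version=1"

# Fetch the current Traffic Manager state as External DNS endpoints.
curl -s http://localhost:8888/records \
  -H "Accept: application/external.dns.webhook+json;version=1"
```

This is a handy debugging loop: the `/records` response is exactly what External DNS sees when reconciling, so any drift between Azure and the webhook's cache shows up here first.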
Proof of Concept

🛑 Important: This is a Proof of Concept

This project is provided as example code to demonstrate the integration pattern between External DNS and Traffic Manager. It is not a supported product and comes with no SLAs, warranties, or commitments. The code is published to help you understand how to build this type of integration. If you decide to implement something similar for your production environment, you should treat this as inspiration and build your own solution that you can properly test, secure, and maintain. Think of this as a blueprint, not a finished product.

With that caveat out of the way, if you want to experiment with this approach, the PoC is available on GitHub: github.com/sam-cogan/external-dns-traffic-manager. The readme file contains detailed instructions on how to deploy the PoC into single and multi-cluster environments, along with demo applications to try it out.

Use Cases

This integration unlocks several powerful scenarios:

- Multi-Region High Availability - Deploy your application across multiple Azure regions with automatic DNS-based load balancing and health-based failover. No additional load balancers or gateways required.
- Blue-Green Deployments - Deploy a new version alongside your current version, send 5% of traffic to test, gradually increase, and roll back instantly if issues arise by changing annotations.
- Geographic Distribution - Route European users to your Europe region and US users to your US region automatically using Traffic Manager's geographic routing with the same annotation-based approach.
- Disaster Recovery - Configure priority-based routing with your primary region at priority 1 and DR region at priority 2. Traffic automatically fails over when health checks fail.
- Cost Optimisation - Use weighted routing to balance traffic across regions based on capacity and costs. Send more traffic to regions where you have reserved capacity or lower egress costs.
Considerations and Future Work

This is a proof of concept and should be thoroughly tested before production use. Some areas for improvement:

Current Limitations

- In-memory state only - no persistent storage (restarts require resync)
- Basic error handling - needs more robust retry logic
- Limited observability - could use more metrics and events
- Manual CRD cleanup - DNSEndpoint CRDs need manual cleanup when switching providers

Potential Enhancements

- Support for more endpoint types - currently focuses on ExternalEndpoints
- Advanced health check configuration - custom intervals, timeouts, and thresholds
- Metric-based routing decisions - integrate with Azure Monitor for intelligent routing
- GitOps integration - Flux/ArgoCD examples and best practices
- Helm chart - simplified deployment

If you try this out or have ideas for improvements, please open an issue or PR on GitHub.

Wrapping Up

This proof of concept shows that External DNS and Traffic Manager can work together nicely. Since Traffic Manager is really just an advanced DNS service, bringing it into External DNS's annotation-driven workflow makes a lot of sense. You get the same declarative approach for both basic DNS records and sophisticated traffic routing. While this isn't production-ready code (and you shouldn't use it as-is), it demonstrates a viable pattern. If you're dealing with multi-region Kubernetes deployments and need intelligent DNS-based routing, this might give you some ideas for building your own solution. The code is out there for you to learn from, break, and hopefully improve upon. If you build something based on this or have feedback on the approach, I'd be interested to hear about it.
Resources

- GitHub Repository: github.com/sam-cogan/external-dns-traffic-manager
- External DNS Documentation: kubernetes-sigs.github.io/external-dns
- Azure Traffic Manager: learn.microsoft.com/azure/traffic-manager
- Webhook Provider Guide: External DNS Webhook Tutorial

Simplifying Image Signing with Notary Project and Artifact Signing (GA)
Securing container images is a foundational part of protecting modern cloud-native applications. Teams need a reliable way to ensure that the images moving through their pipelines are authentic, untampered, and produced by trusted publishers. We're excited to share an updated approach that combines the Notary Project, the CNCF standard for signing and verifying OCI artifacts, with Artifact Signing (formerly Trusted Signing), which is now generally available as a managed signing service.

The Notary Project provides an open, interoperable framework for signing and verification across container images and other OCI artifacts, while Notary Project tools like Notation and Ratify enable enforcement in CI/CD pipelines and Kubernetes environments. Artifact Signing complements this by removing the operational complexity of certificate management through short-lived certificates, verified Azure identities, and role-based access control, without changing the underlying standards. If you previously explored container image signing using Trusted Signing, the core workflows remain unchanged. As Artifact Signing reaches GA, customers will see updated terminology across documentation and tooling, while existing Notary Project-based integrations continue to work without disruption. Together, Notary Project and Artifact Signing make it easier for teams to adopt image signing as a scalable platform capability, helping ensure that only trusted artifacts move from build to deployment with confidence.

Get started

- Sign container images using Notation CLI
- Sign container images in CI/CD pipelines
- Verify container images in CI/CD pipelines
- Verify container images in AKS
- Extend signing and verification to all OCI artifacts in registries

Related content

- Simplifying Code Signing for Windows Apps: Artifact Signing (GA)
- Simplify Image Signing and Verification with Notary Project (preview article)
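To give a flavour of what the "Sign container images using Notation CLI" path involves, a minimal Notation invocation might look like the following. The registry reference and digest are placeholders, and the commands assume signing credentials (for example, the Artifact Signing plugin and a trust policy for verification) have already been configured per the linked guides:

```shell
# Sign an image by digest, then verify the signature against the
# configured trust policy.
notation sign myregistry.azurecr.io/myapp@sha256:<digest>
notation verify myregistry.azurecr.io/myapp@sha256:<digest>
```

Signing by digest rather than tag is the recommended practice, since tags are mutable and a signature should bind to exact image content.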