containers
403 TopicsInside ACR Artifact Cache: Pull-Through Caching at Scale
By: Akash Singhal, Luis Dieguez, Kiran Challa, Nathan Anderson, Tony Vargas, Caroline Barker, Ren Shao, Mabel Egba, Toddy Mladenov, Johnson Shi Introduction For many customers, Azure Container Registry (ACR) is the only registry their workloads can trust, even when images and artifacts originate from a different registry such as Docker Hub, Microsoft Artifact Registry, GitHub Container Registry, Quay, another ACR, or a private registry. ACR Artifact Cache makes this many-to-one model practical by letting a platform team map a downstream ACR repository path to an upstream source repository. Here, upstream means the source registry and repository ACR contacts on behalf of the customer, and downstream means the ACR-facing path customers pull from. From the outside, the experience looks like a normal pull from ACR. Inside the service, that pull moves through the same multi-tenant registry platform that serves ACR traffic across regions, clouds, and data plane stamps. This series is about the gap between that simple external experience and the internal system. The goal is to show what happens inside ACR, why the system is designed this way, and how those design choices shape the behavior customers ultimately observe. Some implementation details are simplified, and the system continues to evolve. The request paths and design constraints are representative, but this article intentionally avoids service-by-service internals that are not necessary to understand the feature. For this overview, the useful mental model is: serve now, hydrate for later. Later sections will show where that model helps, and where it creates engineering pressure. Why serve upstream content from ACR? Pulling directly from an upstream is often sufficient for development, but production systems need stronger guarantees from the pull path. The failure modes are familiar to anyone who has operated containerized workloads at scale: an upstream registry is slow or temporarily unavailable an upstream applies rate limits or burst protection credentials for various upstream sources need to be handled safely ACR-to-ACR scenarios should avoid customer-managed credentials entirely by using managed identity network policy expects pulls to stay inside an approved network boundary a platform team wants one shared, sanitized catalog of public content for first-party consumption while individual teams pull only what they need Let’s take Docker Hub as a concrete example. Docker Hub pull rate limits mean that unauthenticated users and Docker Personal users can exhaust their allowed pulls in a time window, causing shared build agents or Kubernetes nodes to receive rate-limit errors instead of images. That is a useful example because it makes the upstream dependency visible, but it is not the whole story. The broader engineering problem is that upstream-sourced artifacts should behave like local registry dependencies once a customer chooses to route them through ACR. Artifact Cache addresses that problem by letting customers map a downstream ACR namespace to an upstream namespace, pull through ACR, and allow ACR to materialize content locally as it is requested. A pull-through cache inside ACR Azure Container Registry operates across 60+ Azure regions and 6 public and sovereign clouds, serves hundreds of thousands of registries, and handles billions of requests per day. Artifact Cache is only one part of that larger service, but it is large enough to be a distributed systems problem in its own right: more than 100 million image pulls per day, petabyte-scale egress, upstreams with different behavior, and customers who expect registry pulls to remain predictable. This scale matters because Artifact Cache is not deployed beside ACR as a separate service. It is part of the same registry system that serves normal pushes, pulls, tag listing, catalog operations, authentication flows, private networking scenarios, and other registry API traffic. That means Artifact Cache has to fit into ACR's existing resource model and request-serving model. Customers configure cache rules and authentication boundaries through the control plane, then their pulls are served through the data plane. The next sections follow those two parts in order: first the resources customers create, then the runtime path those resources affect. The customer workflow The setup begins in the control plane, where customers define the relationship between an ACR namespace and an upstream source. A customer starts with an ACR and chooses an upstream repository. In the examples below, myregistry.azurecr.io is the customer's ACR login server. The dockerhub/library/node path is the downstream ACR namespace the customer wants to use for cached content. The authentication model depends on the upstream: For a public upstream, the cache rule may not need credentials. For a private upstream, the customer stores upstream credential material in their Azure Key Vault, creates a credential set that references those secrets, and then associates that credential set with a cache rule. At access time, ACR uses the system-assigned managed identity associated with the cache rule to read the referenced Key Vault secrets, so the customer controls access by granting that identity the required secret permissions. ACR materializes those credentials only when it needs to contact the upstream, so the customer-owned Key Vault remains the secret store. For an ACR-to-ACR upstream, the customer can use a user-assigned managed identity. In that scenario, credential sets are not part of the flow; managed identity replaces the credential-set and Key Vault path. At a high level, the customer defines a namespace mapping: docker pull myregistry.azurecr.io/dockerhub/library/node:latest maps to: docker pull docker.io/library/node:latest In ACR, that mapping is stored as a cache rule: a control-plane resource that maps a downstream ACR path to an upstream source path. If the upstream requires authentication, the cache rule links to the appropriate credential boundary: a credential set backed by customer-owned Key Vault secrets, or a user-assigned managed identity for ACR-to-ACR. This is where the control-plane/data-plane split shows up. The control plane manages registry configuration through surfaces such as CLI, portal, Bicep, ARM templates, and other Azure Resource Manager clients. ARM sends those resource operations to the ACR control plane, which creates or updates the cache rule and, when needed, the credential set as child resources under the registry. Those resources do not own customer secrets or identities directly; they link to existing Azure resources such as the customer's Key Vault or an optional user-assigned managed identity. Later, the data plane uses that persisted configuration to decide whether a runtime registry request, such as a pull or tag listing, should be handled by Artifact Cache. After setup, the runtime path begins with the simplest possible pull: docker pull myregistry.azurecr.io/dockerhub/library/node:latest To understand what happens after that command, we need a map of the ACR components that participate in the request path. The ACR components involved The architecture needed for this overview is much smaller than ACR's full internal service graph. ACR is a regionalized service. The control plane operates at the regional level, while data plane stamps serve hot-path registry traffic for the registries assigned to them. A registry is pinned to a stamp, and high-traffic regions may have more than one stamp. Stamp architecture is an ACR concept covered in more detail in the stamp rebalancing post; this article only needs the simplified model below. For this article, ACR has three important boundaries: The regional control plane manages registry resources and provisioning operations. The data plane stamp serves hot-path registry traffic for registries pinned to that stamp. The storage layer holds downstream registry metadata, blobs, and storage-backed event queues. At this level of detail, a data plane stamp is composed of a few major runtime substrates. The registry data plane virtual machine scale set (VMSS) is the core ACR data plane. It runs containerized services including the frontend, the registry API entry point that receives and routes OCI and ACR-specific requests. The data proxy VMSS also runs containerized services and serves selected blob-content paths. It serves eligible blob-content traffic behind ACR's dedicated data endpoint; see the ACR data endpoint documentation. The stamp also includes a runtime cluster for additional data plane services, including services that are not on the hot path. This article will not explain why ACR uses both VMSS-based services and a runtime cluster inside the data plane stamp. That tradeoff is useful context, but it belongs in a separate deep dive. For Artifact Cache, the important point is narrower: the stamp contains the runtime substrates that participate in data plane serving, including runtime-cluster services that process async import and hydration work. The component list is: Component Role Region control plane Manages registry resources and provisioning operations Data plane stamp Serves pinned registries in a region Registry data plane VMSS Core ACR data plane for OCI and ACR-specific APIs Frontend Handles OCI registry API traffic inside the registry data plane Data proxy VMSS Serves selected blob-content paths, including Artifact Cache Runtime Kubernetes Cluster Hosts additional data plane services, including async import and hydration workers Cache rule Maps downstream ACR path to upstream path Credential set or managed identity Provides the upstream authentication boundary when needed Cache Backend service Handles cache-rule-backed pulls Storage queue Regional storage resource used for hydration events Metadata/blob storage Stores downstream manifests, tags, digests, and layer blobs Import workers Run in the data plane runtime cluster and hydrate downstream content asynchronously Upstream registry Public, private, or another ACR registry used as the source The diagram below is a component map rather than a step-by-step pull trace. It shows one visible data plane stamp in West US for myregistry.azurecr.io, with a muted marker to indicate that larger regions can contain multiple stamps. The stamp contains a registry data plane VMSS, a data proxy VMSS, and a runtime Kubernetes cluster. Regional metadata/blob storage and the storage queue sit outside the stamp boundary. The storage queue is also outside the regional control plane cluster; it is a storage resource consumed by data plane runtime-cluster workers. First artifact pull Now return to the pull request: docker pull myregistry.azurecr.io/dockerhub/library/node:latest The request reaches the data plane stamp where myregistry is pinned. The frontend in the registry data plane VMSS handles the registry API request and forwards it to the Cache Backend Service, which checks whether the requested repository path matches a cache rule. If there is no matching cache rule, the request follows the normal ACR path. If a cache rule matches, Artifact Cache logic applies. The next check is local state. ACR looks at downstream metadata and blob storage to determine whether the requested manifest and blobs are already available locally. If the content is present, ACR can serve it from the downstream registry path. If the content is not available locally, ACR resolves the upstream repository path from the cache rule. If the upstream requires authentication, ACR uses the configured auth boundary for that upstream: a credential set for private upstreams, or a user-assigned managed identity for ACR-to-ACR upstreams. The request can then be served through the upstream-backed data path, with the data proxy handling the blob content path. The first pull does not need to wait for durable hydration to complete before the client receives content. Serving the pull and hydrating the downstream registry are related operations, but they are deliberately separated. The trace above follows the same node:latest image used in the setup example. On a cache miss, the data plane queues an async import event for the requested image while still serving the client request. Manifest content returns through the frontend path. For layer blobs, the frontend returns a redirect to the data proxy, and the client follows that redirect while the data proxy streams blob content from the upstream CDN. The data plane serves the customer request, but it also detects that durable downstream state needs to be populated. That durable work is where hydration comes in. Hydration Hydration is the process that materializes upstream content into the downstream ACR registry. ACR performs hydration asynchronously because the data plane workload can be bursty and variable. A deployment or scale-out event can cause many clients to request the same not-yet-hydrated image at nearly the same time. Image size, layer count, multi-platform manifest trees, upstream behavior, queue depth, and retry behavior all matter in a multi-tenant service. The north star is to coordinate those requests: collapse duplicate work, hydrate the content from upstream, and serve all waiting clients without turning one customer action into unnecessary upstream load. That coordination problem is challenging at ACR scale, and we are continuing to improve it. The existing async import path gives Artifact Cache a durable and scalable foundation while that serving path continues to evolve. At a high level, the data plane queues an import event. A notification service consumes the event and dispatches work to import workers in the data plane runtime cluster. Those workers fetch the required content from the upstream registry and write manifests, tags, digests, and layer blobs into ACR metadata and blob storage. When import workers complete, they notify the notification service, which can publish completion signals through ACR eventing surfaces such as Event Grid and webhooks. This allows customers to use webhooks to detect when cached content is fully available locally. You can read more about how it works here. The mental model is that the first pull can serve immediately, while hydration makes future local serving durable. A follow-up post will go deeper on the work ACR does to reduce upstream load during this hydration window. Later pulls After hydration completes, later pulls for the same content can be served from ACR. For digest references, the model is relatively direct because a digest is content-addressed. If ACR has the requested digest and its blobs downstream, the data plane can serve that content locally. Tags are more subtle because tags can change. A tag such as latest is a name that can point to different content over time. Artifact Cache therefore must care about freshness semantics for tag-based pulls. This is one of the reasons a pull-through cache becomes more complex than "fetch once and forget." The benefit is not only lower latency. ACR also reduces repeated dependency on the upstream for content that has already been materialized downstream. Guarding the pull path Once content is hydrated, ACR must serve that content from the customer's registry boundary even when the upstream is slow, unavailable, or returning errors. That distinction matters for tag-based pulls: ACR may need upstream checks to reason about freshness, but an upstream failure should not automatically prevent ACR from serving content that is already available downstream. Artifact Cache also must be careful about how it behaves when upstreams are unhealthy. If an upstream starts returning 5xx errors or throttling requests, ACR should avoid amplifying the problem by repeatedly sending customer-triggered requests upstream. Circuit breaking and upstream work minimization are part of being a good steward of both customer traffic and upstream registry limits. More details to follow in subsequent posts. There is a separate availability question inside ACR: what happens if Artifact Cache-specific components, such as the cache backend path, are operationally unavailable? ACR handles that case gracefully by falling back to normal registry pull behavior: it checks the customer's registry state and serves the image if the requested content already exists in ACR. In other words, cache-backend unavailability should not block pulls for content that is already present in the registry. What we will explore next This overview is the map for the rest of the series. The following posts will go deeper into the parts of the system where the design pressure is highest. Minimizing upstream work We will start with how Artifact Cache avoids making more upstream requests than necessary. This becomes difficult when many clients request the same not-yet-hydrated image at the same time. A Kubernetes scale-out event is the classic example: many nodes may ask for the same image concurrently, and the system must avoid turning one customer's action into unnecessary duplicate upstream work. Making Artifact Cache observable to customers We will also look at how customers understand whether their cache rule is healthy, whether credentials are usable, and why a pull failed. This is hard because a failed pull can involve customer configuration, Key Vault access, managed identity configuration, upstream credentials, upstream availability, data plane request handling, or asynchronous hydration. The engineering challenge is to expose the right customer-facing health and debug signals without turning internal topology into the user interface. Repository semantics in Artifact Cache Finally, we will look at repository semantics. Once upstream content becomes local, the repository is no longer just a mirror. Tags can move upstream, digest references are content-addressed, and customers may push their own content into downstream repositories. The visible repository state can involve both upstream-derived content and customer-owned downstream writes. Closing Artifact Cache is designed to make upstream-sourced artifacts behave like ACR-served content once customers choose to route those artifacts through their registry. The design goal is that customers can pull from ACR and reason about the result using ACR boundaries: registry configuration, local serving, customer-visible health, and predictable repository semantics.189Views2likes0CommentsIntroducing On-demand Sandboxes for Azure Durable Task Scheduler (Private Preview)
Maybe it needs a native toolchain. Maybe it runs untrusted customer or LLM-generated code. Maybe it needs Python from a .NET orchestrator, or bursty compute that should scale to zero when the work is done. Today, we're thrilled to announce On-demand Sandboxes for Azure Durable Task Scheduler, now available in private preview. On-demand Sandboxes lets you move those individual workflow steps to managed, isolated compute while your orchestrator stays exactly where it is. Tell DTS which steps should run in isolation, provide a container image with the step code, and DTS handles provisioning, scaling, and teardown. No infrastructure to manage, no idle costs, no orchestrator changes. Sign up for On-demand Sandboxes Private Preview Today → Availability: On-demand Sandboxes targets the standalone Durable Task SDKs used outside the Azure Functions host — for apps running on Azure Container Apps, Azure Kubernetes Service, App Service, or anywhere else you self-host. The private preview supports the .NET and Python Durable Task SDKs, with additional language SDKs and Azure Functions support coming soon. What is Azure Durable Task Scheduler? The Durable Task Scheduler is a fully managed backend for durable execution on Azure. It can serve as the backend for a Durable Function App using the Durable Functions extension, or as the backend for an app leveraging the Durable Task SDKs in other compute environments, such as Azure Container Apps, Azure Kubernetes Service, or Azure App Service. For a deeper introduction, see the Durable Task Scheduler overview or the full Durable Task documentation. Why On-demand Sandboxes? Most activities belong in-process. They're fast, simple, and co-located with your orchestrator. But sometimes you hit a step that doesn't fit: it needs a native binary, a different language runtime, per-invocation isolation, or bursty compute you don't want to keep warm. On-demand Sandboxes gives you a way to handle those exceptions without spinning up dedicated infrastructure or managing scaling policies in Azure Kubernetes Service or Azure Container Apps. Activity-level granularity. Move individual steps to managed compute, not your whole app. Per-activity or per-invocation isolation. Each execution runs in a clean, microVM-backed sandbox. Ideal for untrusted code, customer plugins, or LLM-generated logic. Cross-runtime flexibility. Run a Python inference step from a .NET orchestrator. No compromise on either side. Scale-to-zero. Pay for CPU and memory per second of execution, not infrastructure that waits. No orchestrator changes. Your orchestration code and hosting model don't change at all. Here are a few scenarios where On-demand Sandboxes shines: Native toolchains. Package ffmpeg, LibreOffice, or Pandoc in a container without dragging them into your main app. CPU-heavy preprocessing. OCR, layout extraction, or image processing can scale independently of the rest of your workflow. Cross-runtime workflows. A .NET orchestrator dispatches a Python inference step. No compromises. Sandboxed code execution. Run customer plugins or LLM-generated code with a clean boundary on every invocation. Multi-tenant isolation. Tenant-specific steps get dedicated boundaries while everything else stays in-process. Bursty event-driven workloads. Steps that spike hard but rarely may not justify always-on infrastructure. Sub-second cold starts mean you get capacity when you need it without paying to keep it warm. How it works On-demand Sandboxes uses a two-part model: a worker profile in your orchestrator app that tells DTS which activities to offload, and a worker image that contains those activity implementations. Your orchestrator still calls activities the same way it always has; the decision to run one activity in a sandbox lives in the profile configuration. 1. Declare a sandbox worker profile In the app that hosts your orchestrator, define a sandbox worker profile. The profile gives DTS the container image, resource shape, concurrency setting, and activity names that should run in a sandbox: using Microsoft.DurableTask.Worker.AzureManaged.Sandbox; [SandboxWorkerProfile("code-executor")] internal sealed class CodeSandboxWorkerProfile : ISandboxWorkerProfile { public void Configure(SandboxOptions options) { options.ContainerImage = Environment.GetEnvironmentVariable("DTS_SANDBOX_IMAGE") ?? throw new InvalidOperationException("DTS_SANDBOX_IMAGE is required."); options.Cpu = "1000m"; options.Memory = "2048Mi"; options.MaxConcurrentActivities = 1; options.AddActivity(TaskNames.ExecuteCode); } } Then enable on-demand sandbox discovery when you configure the Durable Task worker in the main app: workerBuilder.AddTasks(tasks => tasks.AddAllGeneratedTasks()); workerBuilder.UseDurableTaskScheduler(options => { options.EndpointAddress = Environment.GetEnvironmentVariable("DTS_ENDPOINT"); options.TaskHubName = Environment.GetEnvironmentVariable("DTS_TASK_HUB"); options.Credential = credential; }); workerBuilder.EnableSandboxes(); Here's what the profile configuration does: SandboxWorkerProfile: a friendly profile id for this sandbox setup. It groups the activity, image, and resource settings for monitoring and reuse across deployments. ContainerImage: the container image (from your registry) that contains the activity implementations. Cpu / Memory: the resource shape for each worker instance. Sized per your activity's needs. MaxConcurrentActivities: how many activities a single worker instance can process concurrently. AddActivity: the specific activity to offload. Only activities added to a sandbox worker profile execute in DTS-managed isolated compute; everything else stays in-process. The orchestrator call site doesn't change: ExecuteCodeOutput execution = await context.CallActivityAsync<ExecuteCodeOutput>( TaskNames.ExecuteCode, new ExecuteCodeInput(pythonCode, input.CsvData)); ExecuteCode is not registered in the main app's in-process activity list. When the orchestrator calls it, DTS uses the codegen profile to route the work to the sandbox image. 2. Build the worker image The worker image is a container you own. In most apps, this worker lives in a separate project from the orchestrator host so it can have its own entry point, dependencies, and container image. It registers the activity implementations it can run and opts in to managed execution with UseSandboxWorker(): builder.Services.AddDurableTaskWorker(workerBuilder => { workerBuilder.AddTasks(tasks => { tasks.AddActivity<ExecuteCodeActivity>(); }); workerBuilder.UseSandboxWorker(); }); UseSandboxWorker() is the key line. It signals that this worker runs in DTS-managed compute. The sandbox worker does not need to configure the DTS endpoint, task hub, profile id, or credentials; DTS injects the runtime settings when it starts the container. The activity implementations themselves are standard Durable Task activities. There's nothing special about the activity code: it can call a runtime with different dependencies, such as Python and pandas, while running in an isolated container instead of in your main app's process. Package the image like any containerized service, including whatever runtimes and native tools the activity needs. Push it to your container registry (e.g., Azure Container Registry) and reference the image in the worker profile's ContainerImage option. View logs in the DTS dashboard Once your sandbox activities are running, you can view their execution logs directly in the Durable Task Scheduler dashboard. The dashboard shows real-time output from your managed workers, including stdout, stderr, and activity lifecycle events. This gives you full visibility into what's happening inside the sandbox without needing to configure external log sinks or set up your own observability pipeline. Demo Get started On-demand Sandboxes is in private preview. To get access, sign up here. We'll enable the feature on your scheduler and help you get your first sandbox activity running. Once you're in, the workflow is straightforward: declare a sandbox worker profile in your orchestrator app, build and push a worker image, and DTS takes care of the rest. Sign up for On-demand Sandboxes Private Preview Today → Documentation: Durable Task Scheduler overview Samples: Azure-Samples/Durable-Task-Scheduler Pricing: Azure Durable Task Scheduler pricing Questions, feedback, or ideas? Open an issue in the Durable-Task-Scheduler GitHub repo. We'd love to hear from you.252Views0likes0CommentsIntroducing Azure Container Apps Sandboxes: Secure Infrastructure for Agentic Workloads
Today we are announcing the public preview of Azure Container Apps Sandboxes - a new first-class resource type that gives you fast, secure, ephemeral compute environments with built-in suspend and resume. This is the underlying infrastructure on which products like Cloud sandboxes in GitHub Copilot, Foundry Hosted Agents, and Azure Container Apps Express are built, you now have the opportunity to build your solutions leveraging this infrastructure. Azure Container Apps Sandboxes unlocks two massive opportunities. For platform developers and ISVs, sandboxes give you the same isolated compute fabric that powers many Microsoft products. You get the building blocks to create your own multi-tenant platform on proven, enterprise-scale infrastructure. For AI agents, sandboxes become a self-configurable tool that lets agents extend their own capabilities on the fly. An agent can spin up a fresh sandbox in milliseconds and use it to execute untrusted code, compile source, test HTTP requests against a live app, launch a browser session, or tackle whatever needs a quick and scalable infrastructure. On one side it empowers humans to build platforms, on the other it empowers agents to build their own capabilities. Both get enterprise-grade isolation, instant startup, and snapshot-based persistence out of the box. We'll walk through the resource model, sandbox lifecycle, the features that set Sandboxes apart - like snapshots, lifecycle policies, network egress controls, volumes, and managed identities - and show you how to get started with the portal and CLI. What Are Container Apps Sandboxes? Container Apps Sandboxes are secure, isolated compute environments that start in sub-second time, scale to thousands, and cost nothing when idle. Each sandbox runs in its own hardware-isolated microVM boundary - fully separated from the host, the platform, and every other sandbox. You bring your own Open Container Initiative (OCI) image, and Sandboxes handle the rest: provisioning from prewarmed pools, strong multi-tenant isolation, and snapshot-based suspend/resume that preserves full memory and disk state across sessions. There are many ways Sandboxes can help you build your next project - here are a few: Your own build & test systems - wire a Sandbox into your CI/CD flow to run builds while your laptop stays cool. Agents that can run anything safely - an agent spawns a sandbox, executes work inside it, and returns the output with no agent host privileges required. Agent swarms - decompose a research question, spawn N sandbox workers in parallel (each pinned to its own image and egress policy), and synthesize the result. Early access customers are already unlocking significant benefits by leveraging Azure Container Apps Sandboxes. "With Azure Container Apps sandboxes, SitecoreAI can safely enable agents to take real action. The combination of multi-tenant isolation, rapid scale-out, and full automation allows Sitecore to run long-lived, autonomous agents that securely execute code, manage workflows, and interact with enterprise systems within secure, governed environments. With this foundation, we can build agents that do real work: assembling content, personalizing experiences, and optimizing campaigns in production. Agents that operate continuously, learn from results, and improve over time, so our customers get better outcomes without giving up control." - Mo Cherif, VP of AI and Innovation, Sitecore "We got early access to Azure Container Apps Sandboxes, and got the first prototype integrated with Atlas AI in hours, and it's already shaping a new Atlas AI capability that we plan to launch in preview in Q3. It gives every Atlas AI agent a safe, sandboxed workspace (file system, terminal, code execution) on a customer's live data in Cognite Data Fusion. The value: Industrial process, reliability, and production engineers spend days and weeks on questions like "which wells are underperforming and why?" These questions are tractable but expensive, so they are asked rarely and decisions are made on gut feel. With this, an agent pulls the data, runs the analysis, cross-references maintenance and inspection records, and returns a cited draft in minutes. Sandboxes make it practical: Aligned feature set, per-customer isolation, pause/resume across multi-day investigations, scale-to-zero economics." - Kelvin Sundli, Product manager, Atlas AI, Cognite Resource Model: Sandbox Groups and Sandboxes The top-level ARM resource is Microsoft.App/SandboxGroups. A Sandbox Group is the management boundary for a collection of sandboxes that share configuration - think of it like a Container Apps Environment, but purpose-built for sandboxes. When you create a Sandbox Group, you specify: Subscription, Resource Group, and Region Sandbox defaults (optional): default CPU, memory, disk, max sandbox count, and default idle timeout Networking: optionally deploy into a custom VNet with a dedicated subnet for private networking Identity: System or user assigned Entra identity. Individual sandboxes are created within a Sandbox Group. Each sandbox has its own source (disk image or snapshot), resource tier, lifecycle policy, network egress policy, environment variables, ports, volumes, and connections. Sandbox Lifecycle Sandboxes have a well-defined lifecycle with the following states: State Description Creating Provisioning the sandbox from a disk image or snapshot Running Actively executing - backed by a live microVM Idle System-suspended after inactivity; can auto-resume on the next request Suspended Full state (memory + disk) preserved as a snapshot; no compute costs Resuming Restoring from a suspended or idle state - sub-second for most workloads Stopped User-initiated stop; can be resumed Stopping Graceful shutdown in progress Deleting Teardown in progress The key insight here is the distinction between Idle and Suspended. When a sandbox goes idle (e.g., no traffic for a configured timeout), the system can automatically suspend it and capture a snapshot. When a new request arrives, the sandbox resumes transparently. This gives you scale-to-zero economics with stateful compute - something that wasn't possible before without significant custom engineering. Disk Images: Bring Your Own Container Sandboxes boot from Disk Images - Open Container Initiative (OCI) images converted into an optimized root filesystem format. You point to any OCI image (public or private registry), and the platform builds a bootable disk image from it. You can start with public, pre-built images maintained by the platform (for example, Ubuntu base images), or bring your own private images. For private registries, you can authenticate with username/token or use a user-assigned managed identity for Azure Container Registry (ACR) – integrated with Azure as you expect. Snapshots: Full-State Persistence Snapshots capture the complete state of a running sandbox - memory, disk, and all running processes. When you resume a sandbox from a snapshot, every process, open file handle, and in-memory data structure is restored exactly as it was. A snapshot captures the full state of a running sandbox: memory pages, disk, processes. Two ways to make one - automatically on suspend, or manually on demand. Three things they're great for: Checkpointing mid-task so a long-running agent can resume exactly where it left off Cloning an environment that's already warm - dependencies installed, caches populated, services running Shipping a "ready-to-go" state that resumes in sub-second instead of cold-booting Snapshots are free during the preview, after which they will be stored as Azure Blob Storage at standard rates. Each snapshot records the source sandbox, resource allocation (CPU, memory, disk), and container metadata - so what you get back is exactly what you snapshotted. Resource Tiers Every sandbox is assigned to a resource tier that determines its CPU, memory, and disk allocation: Tier CPU Memory Disk XS 0.25 vCPU 0.5 GB 5 GB S 0.5 vCPU 1 GB 10 GB M (default) 1vCPU 2 GB 20 GB L 2 vCPU 4 GB 40 GB XL 4 vCPU 8 GB 80 GB When creating a sandbox from a snapshot, the resource tier is inherited from the snapshot and cannot be changed - this ensures the restored environment has the exact resources it was running with when the snapshot was taken. Lifecycle Policies: Auto-Suspend and Auto-Delete Every sandbox can be configured with lifecycle policies that automate state transitions and cleanup: Auto-Suspend Idle timeout: How long a sandbox can sit idle before being suspended (configurable: 1m, 2m, 5m, 10m, 30m, 60m) Suspend mode: Disk + Memory (default): Full snapshot including memory state - resume picks up exactly where you left off, with all processes and in-memory data intact. Disk: Only the disk is preserved; the VM restarts fresh on resume. Useful when you only need file persistence, not process continuity. Auto-Delete Automatically delete sandboxes after a configurable number of days of inactivity Prevents accumulation of abandoned sandboxes that consume snapshot storage These lifecycle policies are what make Sandboxes economically viable at scale. A platform serving thousands of tenants can configure aggressive idle timeouts (say, 60 seconds) with Memory suspend mode, and each tenant's sandbox disappears from the billing meter almost immediately - but resumes in sub-second time the moment they return. Network Egress Policy For scenarios involving untrusted code - AI agents executing LLM-generated scripts, multi-tenant SaaS with user-submitted workloads - controlling outbound network access is critical. Sandboxes provide a per-sandbox Network Egress Policy: Default action: Allow or Deny all outbound traffic Host rules: Domain-pattern rules (e.g., *.github.com → Allow) to permit specific destinations Custom CIDR rules: Network-level rules for IP ranges (e.g., 10.0.0.0/8 → Deny) Skip egress proxy: Option to bypass the egress proxy entirely when custom VNet routing handles policy enforcement This means you can run a sandbox in a deny-by-default posture and allowlist only the specific endpoints it needs (your API server, a package registry, etc.) - without setting up NSGs or firewall appliances. Managed Volumes: Persistent and Shared Storage Sandboxes support two types of mountable volumes, both managed by Microsoft: Volume Type Backed By Best For Managed Azure Blob Azure Blob Storage Shared data across sandboxes, file uploads/downloads, persistent artifacts Managed Data Disk Azure Disk Storage High-performance storage for databases, build caches, large working sets - only available to one sandbox at a time Blob volumes come with a built-in file explorer in the portal - you can browse, upload, download, create folders, and drag-and-drop files directly. Data Disk volumes provide dedicated block storage with configurable sizes. Secrets and Identity Secrets Sandbox Groups support key-value secrets scoped to the group. Secrets can be created, edited, and referenced by sandboxes within the group. These secrets can be used in egress policies to modify requests with transform or header-injection rules, without exposing the secrets to code running inside the sandbox. Managed Identity Sandbox Groups support both system-assigned and user-assigned managed identities, with full RBAC role assignment management. This means your sandboxes can authenticate to Azure services (Key Vault, Storage, Cosmos DB, etc.) without managing credentials - the same identity model you use everywhere else in Azure. MCP Connectors and Triggers ACA Sandboxes now supports managed connectors through the Model Context Protocol (MCP), giving sandboxes access to external APIs - including Microsoft 365, Salesforce, ServiceNow, GitHub, and 1,400+ other systems - without managing credentials directly. Attach a Connector Gateway to your sandbox group, and every sandbox in the group can call external APIs through a standardized MCP interface at runtime. Pair connectors with triggers to build event-driven automation: route an Outlook email to a sandbox that triages it with an AI agent, or react to a SharePoint file upload by extracting and processing the document all without writing glue code. Triggers can fire a shell command inside a sandbox or invoke an HTTP endpoint the sandbox exposes, so your automation shapes fit naturally around your workload. The integration is built on the new Connector Namespace service (az connector-namespace), the same runtime behind Logic Apps and Power Platform connectors, now available as a programmable layer for sandboxes. See the end-to-end samples for runnable azd up-deployable examples covering email triage and document automation scenarios. The Portal Experience Azure Container Apps Sandboxes are only available in the new Azure Container Apps portal that provides a rich, IDE-like experience for working with sandboxes. Creating a Sandbox The portal offers multiple creation paths: Standard Sandbox - full configuration control over source, resources, lifecycle, networking, and volumes GitHub Copilot Sandbox - preset, Copilot CLI ready to go, GitHub credentials can be wired through the Access Token before the sandbox is created Claude Sandbox - Claude CLI pre-installed, ready for agentic coding inside the sandbox Using Coding Agents (Copilot CLI / Claude Code) If you live inside Copilot CLI or Claude Code, you don't need to learn a new CLI. Install the azure-sandbox skill once and your agent picks up the right skills: # GitHub Copilot CLI # Add as a plugin marketplace /plugin marketplace add microsoft/azure-container-apps # Install all skills /plugin install sandboxes@Azure-Container-Apps # Claude Code claude plugin add microsoft/azure-container-apps The skill runs prerequisite checks silently (az --version, az account show, node --version, aca --version), prompts only if something's missing, and maps natural-language asks to the right aca commands. Bundled runbooks cover Copilot CLI BYOK (bring your own Azure OpenAI key), the deploy-a-web-app walkthrough, and shell setup. Sandbox Detail Page Once your sandbox is running, the detail page gives you immediate access to the sandbox terminal and additional details, such as - Network Audit - real-time egress traffic log showing allowed and denied requests Monitor - live CPU, memory, disk, and network utilization charts Connectors - attached connections with an "Add" action Volumes - mounted volumes with an "Add" action Log Stream - streaming container logs Processes - running process list inside the sandbox Files - file explorer to browse the sandbox filesystem The toolbar actions let you manage the state of the sandbox - Resume or Stop. In the Ellipsis menu (⁝) you can find additional settings to manage network Egress Policy and ingress (Add port), take a Snapshot of the sandbox, Commit (save disk state as a new disk image), set Lifecycle Policy or permanently Delete the sandbox. Finally, you can see additional Details in a side panel. Getting Started with the CLI and Python SDK All sandbox and sandbox-group operations go through the aca CLI. There are no az containerapp sandbox commands, - az is only used for az login, az account show, and resource-group management. Install (CLI) # Mac, Linux curl -fsSL https://aka.ms/aca-cli-install | sh # Windows irm https://aka.ms/aca-cli-install-ps | iex Run aca --help to get started. Install (Python SDK) pip install azure-containerapps-sandbox For more details, quick start and examples on ACA CLI and Python SDK, please go to https://sandboxes.azure.com Evolution from Dynamic Sessions If you've used Azure Container Apps Dynamic Sessions, Sandboxes are the next evolution of that capability. Everything Sessions can do, Sandboxes can do - and significantly more: Capability Dynamic Sessions Sandboxes Sub-second startup ✓ ✓ Strong isolation ✓ ✓ Custom container images ✓ ✓ Custom VNet integration ✓ (Partial) ✓ Suspend/resume with Memory and Disk snapshots - ✓ Lifecycle policies (auto-suspend, auto-delete) - ✓ Network egress policy (per-sandbox) - ✓ Persistent managed volumes (Blob, Data Disk) - ✓ Managed identity (system + user-assigned) - ✓ Secrets management - ✓ Configurable resource tiers - ✓ Direct access to sandbox in Portal experience - ✓ We will continue to support Dynamic Sessions, but all new investment goes into Sandboxes. If you're building new workloads on isolated ephemeral compute, start with Sandboxes. How It All Fits Together ACA Sandboxes is a platform primitive. It's the foundation on which multiple Microsoft products are already built - including ACA Express, Cloud sandboxes in GitHub Copilot, and Foundry Hosted Agents. When you build on Sandboxes, you're building on the same infrastructure that powers Microsoft's own portfolio. This is the evolution of what we shared with Project Legion in 2024. Legion described the internal infrastructure; Sandboxes exposes it as a customer-facing primitive that you can use directly. What's Next • Deeper Azure integrations - first-class connectivity with Azure networking, identity, storage, and AI services • Enhanced SDK and CLI - richer programmatic experiences for managing sandboxes at scale • More Microsoft services built on Sandboxes - this is just the beginning Get Started Today • Portal: https://sandboxes.azure.com/ • Documentation: Azure Container Apps Sandboxes • Pricing: Azure Container Apps Pricing (per-second vCPU/memory billing, scale-to-zero, snapshots at Blob Storage rates) We'd love to hear your feedback. You can ask questions, or file issues on the Azure Container Apps GitHub (prefix with [Sandbox] for Sandboxes-specific issues).1.6KViews1like0CommentsWhat's new in Azure Container Apps at Build'26
Azure Container Apps (ACA) is a fully managed serverless container platform that enables developers to build and deploy microservices and modern applications without requiring container expertise or needing infrastructure management. ACA provides built-in autoscaling (including scale to zero), per-second billing, advanced networking, built-in observability, and simplified developer experiences across multiple programming languages and frameworks. The world of application development is shifting rapidly. Agentic AI is fundamentally changing the requirements of cloud platforms - more code is being written by AI, more apps are being deployed by agents, and more deployment stacks are being assembled autonomously. Platforms are aligning to two concurrent demands: hosting intelligent agents as first-class workloads, and giving those same agents access to empty, secure compute pools as tools they can invoke on demand. At the same time, the proliferation of AI-generated code means that platforms must offer strong isolation for untrusted workloads, instant provisioning for rapid iteration, and production-grade defaults that make the right thing the easy thing - for both humans and agents. Azure Container Apps is purpose-built for this new reality. Whether you're a developer shipping a web app in minutes or an agent spinning up ephemeral sandboxes for code execution, ACA provides the serverless foundation that meets both audiences where they are. Customers across industries are betting on ACA as the compute foundation for their AI and cloud-native workloads: Replit runs its agent-driven software creation platform on Azure, enabling enterprises like Hexaware to securely build and deploy AI-generated applications at scale with seamless procurement through Azure Marketplace. LayerX built its Ai Workforce document processing platform on Azure Container Apps, Azure OpenAI, Azure AI Search, and Cosmos DB - helping clients like Mitsui & Co. save 570 hours annually by automating manual document tasks. SJR built GX Manager with Microsoft Foundry to automate website personalization at scale - delivering production-grade, data-grounded content in seconds instead of hours of manual curation. August AI powers an AI health companion serving over 3.5 million customers on Azure infrastructure, scoring 100% on the U.S. Medical Licensing Examination and delivering potentially life-saving medical support. Photon Education created Classwise on Azure OpenAI and Foundry with Defender for Cloud security, enabling teachers to prepare lessons faster and engage students more effectively in inclusive learning environments. Microsoft Foundry Agent Service is built directly on Azure Container Apps, serving over 20,000 customers with a dedicated agent runtime that handles fast startup, tool execution, long-running operations, and enterprise-grade isolation at scale. Following the features announced at Ignite'25 and our continued momentum through early 2026, we're excited to share what's new at Build'26. This release deepens our commitment to the agentic era with new primitives for secure ephemeral compute, the fastest path from container to production, a reimagined portal experience, and continued investment in security, observability, and developer productivity. Azure Container Apps Sandboxes (Public Preview) Teams building agentic applications, multi-tenant platforms, development environments, and CI/CD systems have often had to stitch together custom infrastructure to run untrusted code safely, preserve state across sessions, and handle bursty demand without paying for idle capacity. Azure Container Apps Sandboxes addresses that challenge with a new first-class resource type that provides fast, secure, ephemeral compute environments with built-in suspend and resume capabilities. Each sandbox runs in its own hardware-isolated microVM boundary, supports standard OCI container images, and starts in sub-second time. Sandboxes can preserve memory, disk state, and preloaded libraries in a snapshot, so workloads resume quickly from the same point without incurring a cold-start reload penalty. Why Sandboxes are perfect for agents Agents can safely run AI-generated code in isolated environments with instant startup. Agents also accumulate context, intermediate results, and working state during long-running tasks. With sandbox snapshots, agents get persistent, isolated workspaces that survive across task boundaries - they can suspend and resume as needed, preserving full execution context including memory and disk. Key capabilities Sub-second startup - provision and execute immediately Hardware-isolated microVMs - strong security boundary for untrusted code Snapshot and resume - full state preservation (memory + disk) across sessions OCI container image support - bring any container Scale to zero, scale to thousands - consumption pricing with per-second billing This is the underlying infrastructure on which products like Cloud Sandbox in GitHub Copilot, Foundry Hosted Agents, and Azure Container Apps Express are built. ACA Sandboxes joins the Container Apps family alongside Apps, Jobs, Functions, and Dynamic Sessions as a foundational building block for the next generation of cloud and AI application workloads. Learn more about Azure Container Apps Sandboxes at https://aka.ms/aca/sandboxes Azure Container Apps Express (Public Preview) We recently launched Azure Container Apps Express in public preview - the simplest and fastest way to launch and scale powerful applications on Azure, from zero to hyperscale, without infrastructure decisions. It represents the first Azure compute platform purpose-built for agent and developer use alike. Express is based on years of experience running Azure Container Apps at scale. We've learned that most developers working on web apps, APIs, and agents want to deploy quickly, have automatic scaling, and avoid dealing with complex infrastructure. Express provides these capabilities - it sets up your environment in seconds, handles any amount of traffic, and removes complicated settings. This helps teams move from writing code to having a production-ready app in minutes, not hours. What makes Express different Instant provisioning - your app is running in seconds, not minutes Sub-second cold starts - fast enough for interactive UIs and on-demand agent endpoints Scale to and from zero - automatic, no configuration required Per-second billing - pay only for what you use, no environment provisioning fee Production-ready defaults - autoscaling, managed identity, secrets management, custom domains, container registry integration, revision management, and built-in observability Purpose-built for custom agents Agents need to spin up application endpoints on demand - fast, reliably, and without pre-provisioning infrastructure. Express is purpose-built for this pattern: it provisions in seconds, scales from zero instantly when an agent triggers a workload, and scales back down when the task is complete. Whether an agent is deploying a tool-use endpoint, standing up a temporary API for a multi-step workflow, or launching a web UI for human-in-the-loop review, Express gives it a production-grade, internet-reachable application with zero operational overhead. It's the fastest path from "an agent decided to deploy something" to "it's live and serving traffic." Learn more about Azure Container Apps Express at https://aka.ms/aca/express/launch-blog New Azure Container Apps Portal You open the Azure portal and want to deploy a Container App. Ten minutes later you're three blades deep, toggling settings you don't understand, wondering which workload profile is best before you even have an app. We built a different portal. One where deploying a container app takes less time than reading this paragraph. One where creating an Azure Container App is a single click. And one where experimental features ship weekly, not quarterly. Smart defaults, advanced when you need Developers care about outcomes - where their app is running and how to reach it - not starting with a configuration form. The new portal offers three creation modes to keep setup simple: Simple "one-click create" - auto-generates a unique name and provisions your app. Provide the container image and egress settings. That's it - no environment type selection, networking decisions, or container registry configuration. Advanced create - unlocks everything: custom VNets with subnet selection, managed identity for registry auth, lifecycle policies, egress controls, environment variables, custom scale rules, and more. It's a toggle at the top of the same form, not a separate workflow. Express App (Preview) - the new kind of ACA application that provisions and starts almost instantly. Observe quickly, act faster The app overview page surfaces critical information at a glance - including a unified Log Stream that brings app and system logs together in one place. Getting to the root cause now takes fewer clicks, and next steps are always one click away. Faster releases, direct feedback loop Azure Container Apps Express (Preview) and Azure Container Apps Sandboxes (Preview) are currently available only in this new portal. We ship weekly - often more. Upcoming Portal Features in settings give you an easy way to opt in to early access features and share feedback directly. Security: Defender for Cloud Serverless Containers Posture and Confidential Compute Security remains a top priority as enterprises run more sensitive and regulated workloads on Azure Container Apps. At Build'26, we're announcing two key security milestones. Public Preview: Defender for Cloud Serverless Containers Posture on Azure Container Apps Customers can now bring Azure Container Apps environments into Microsoft Defender for Cloud's Serverless Containers Posture experience, helping security teams extend posture management across more of their container estate from a single workflow. This makes it easier to gain visibility into Container Apps resources and assess risks across areas such as identity, networking, and container or image configuration. With this capability, teams can more consistently evaluate risk across container environments and use attack path analysis to identify potential exposure faster. The result is a more unified security posture, less manual effort, and stronger confidence when securing Container Apps deployments. Serverless Containers Posture is available as part of the Defender CSPM plan. Learn more at the Defender for Cloud documentation. General Availability: Confidential Compute for Azure Container Apps Confidential Compute in Azure Container Apps is now generally available, providing hardware-backed Trusted Execution Environments (TEEs) through workload profiles. This extends protection to data in use - in addition to data at rest and in transit - enabling teams to run higher-trust workloads with stronger isolation for sensitive data. With confidential computing now GA, Azure Container Apps becomes more viable for regulated, financial, healthcare, and other high-trust scenarios where organizations need hardware-enforced isolation that protects in-memory data, including from the underlying infrastructure. There is no extra charge for confidential compute workload profiles. Learn more at the Azure Confidential Computing documentation. Observability: HTTP Traffic Logs and OpenTelemetry Destinations Knowing what's happening inside your application is essential to running production workloads with confidence. At Build'26, we're announcing two enhancements that give teams deeper visibility and more flexibility in where they send telemetry. Monitor HTTP traffic in Azure Container Apps Azure Container Apps now adds a dedicated Azure Monitor diagnostic setting category - ContainerAppHTTPLogs - that exposes detailed HTTP access logs for incoming traffic. This capability is designed for high-volume request data, enabling teams to troubleshoot ingress and request-flow issues with much greater precision. With HTTP traffic logs, you can now investigate: Failed requests and error codes Latency patterns and outliers Retries and WebSocket disconnects Routing behavior and backend connectivity The result is faster issue resolution, less operational friction, and stronger confidence in running high-traffic, business-critical applications. Standard Azure Monitor log volume charges apply. Learn more at Azure Monitor pricing. Additional OpenTelemetry Destinations: New Relic, Dynatrace, Elastic Azure Container Apps enhances its managed OpenTelemetry (OTel) capabilities by expanding support for third-party observability platforms. This update introduces additional endpoint options for commonly used monitoring tools - New Relic, Dynatrace, and Elastic - extending the existing managed OpenTelemetry experience. Teams can now use a more consistent OpenTelemetry-based pipeline across Azure Monitor, Datadog, New Relic, Dynatrace, Elastic, and any OTLP-compatible endpoint, with less configuration overhead and more flexibility to route logs, metrics, and traces where they need them - without deploying or managing their own collectors. No extra charge applies. Learn more at the OpenTelemetry agents documentation. Additional Enhancements and Ecosystem Updates Beyond the headline announcements, Azure Container Apps continues to evolve with a steady cadence of improvements across the platform. Override Scale Rules in Azure Functions on Azure Container Apps Azure Functions on Container Apps has traditionally used platform-managed scaling, where triggers are automatically translated into KEDA scale rules. With the new allowScalingRuleOverride property, customers can now choose to override platform-managed scaling and define their own custom KEDA scaling rules. This enhancement is especially useful for scenarios where automatically generated KEDA rules lead to unintended scaling behavior, where workloads require custom thresholds or concurrency tuning, or where teams need standardized scaling policies across services. It works with any of the 60+ KEDA scalers - Service Bus, Kafka, PostgreSQL, HTTP concurrency, Cron, and more. Heroku Migration to Azure Container Apps With Heroku entering maintenance mode, Azure Container Apps is a natural landing zone for Heroku workloads. New guidance and tooling makes the migration path straightforward - from understanding why ACA is the right next step to a practical migration guide for hands-on implementation. Dapr v1.16 Platform Upgrade Azure Container Apps completed a staged platform upgrade to Dapr v1.16.4, bringing modernized actor scheduling, improved scalability for reminders, and updated TLS/security internals. The upgrade is fully platform-managed, with minimal customer action required for most workloads. Running AI Models on ACA Serverless GPUs The community continues to push the boundaries of what's possible with serverless GPUs on ACA. Recent highlights include running Gemma 4 with Ollama for fully private, self-hosted inference, and deploying ComfyUI for text-to-image and text-to-video workloads - all with scale-to-zero and per-second billing. Hosting Remote MCP Servers on ACA Azure Container Apps is emerging as the preferred platform for hosting Model Context Protocol (MCP) servers. With serverless scaling, idle billing, HTTP/1.1 and HTTP/2 support, and managed identity integration, ACA provides a production-ready environment for exposing tools and APIs to AI agents. Multiple tutorials and guides are now available for deploying MCP servers on ACA, including integration with Azure API Management. App Modernization with GitHub Copilot GitHub Copilot App Modernization can dramatically reduce the time required to modernize legacy applications and deploy them to ACA. A recent walkthrough demonstrated upgrading a classic ASP.NET MVC app on .NET Framework to .NET 10 and deploying it to Azure Container Apps in hours - with managed identity and Key Vault integration enabled by default. Azure Skills Repository for Container Apps The new Azure Skills repository includes comprehensive skills specifically for Azure Container Apps - covering troubleshooting, best practices, architecture patterns, security, deployment, and integration. These skills are designed to be used by AI agents and developer tools like GitHub Copilot CLI, providing rich context for building, deploying, and operating ACA workloads. It's another example of how the ACA ecosystem is evolving to be agent-native. Docker Compose for Agents Docker Compose for Agents on Container Apps (public preview) brings the familiar Compose workflow to agentic applications. Declare models, agents, and MCP tools in a single compose.yaml file and deploy unchanged from laptop to cloud - supporting LangGraph, Vercel AI SDK, Spring AI, CrewAI, and other frameworks. Learn more at the Compose for Agents documentation. What's Next Azure Container Apps is redefining how developers and agents build, deploy, and operate intelligent applications. With Sandboxes for secure ephemeral compute, Express for instant provisioning, a reimagined portal for streamlined management, and continued investment in security and observability - ACA provides the ideal foundation for the agentic era. The features announced at Build'26 deepen our commitment to making Azure Container Apps the platform where both humans and AI agents can ship production workloads with confidence, speed, and minimal operational overhead. Also, if you're at Build, come see us at the following sessions: Breakout 221: Idea to production-ready agent in seconds on AI-native runtime Demo 312: Multi-agents in action with 3 AI agents, 3 frameworks, tools & models Lab 580: Build and deploy reasoning agents with NVIDIA Nemotron and Foundry Lightning Talk 453: Building an End‑to‑End Enterprise AI Platform on Azure Or come visit us at the Azure Application Services booth #44. Visit our GitHub page for feedback, feature requests, or questions. Check out our roadmap to see what we're working on next. We look forward to hearing from you!337Views1like0CommentsHow ACR Runs Multi-Tenancy at Scale: Compute Stamp Rebalancing and Why You Never See It Happen
By Johnson Shi, Richard Yuan, Yi Zha, Susan Shi, Jeanine Burke, Bin Du, Clark Porter, Bernie Harris, Eric Du Introduction Two of the most common questions we hear from teams running container workloads at scale on Azure Container Registry (ACR) are: "How does ACR keep my registry's performance predictable when I'm sharing infrastructure with thousands of other tenants?" — Cloud services are inherently multi-tenant. What does ACR actually do to keep my workload from competing with my neighbors during high concurrency data plane API operations? "What happens when one tenant's workload grows large enough to affect the shared infrastructure?" — Is there an active intervention, or does the system just absorb the noise from concurrent registry operations? In this post, we clarify how ACR runs its multi-tenant fleet: the stamp architecture that underpins ACR's compute infrastructure in every Azure region, the practice of proactively rebalancing registries between compute stamps when one stamp gets hot from sustained registry data plane operations, and the additional stamp isolation options available for exceptional workloads. Running multi-tenancy well at scale isn't passive — it's an active operational practice, and customers benefit from it every day without seeing it happen. Key Takeaways An ACR registry can be geo-replicated: a registry can have geo-replicas (which are both read and write-enabled) in multiple Azure regions. Each geo-replica is served by an ACR compute stamp in a particular region — independent compute deployment units that underpin ACR regional infrastructure, each made up of VMSS-backed compute pools, that together serve many registry data plane operations belonging to many tenants. Compute stamps are simultaneously a compute capacity pool, a fault domain, and an update domain. Take note that compute stamps span only the compute component for ACR; ACR in each region maintains a separate pool of storage accounts shared across all compute stamps, which is not the focus of this post. When a compute stamp gets hot, ACR proactively rebalances by moving registries to a less-utilized stamp in the same region. The registry endpoint does not change; the move is transparent to the customer. For exceptional workloads where rebalancing alone would just transfer the problem, ACR can provide additional stamp isolation — placing registries on stamps with fewer co-tenants, providing better traffic isolation, fault domain separation, and update domain independence. This also structurally improves the stamps the tenant used to share with everyone else. ACR engineering uses a mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals (operational telemetry) to decide when to rebalance stamps. Hot-node P95 CPU, discussed in this post, is one of the proactive signals we use — for each 1-minute bin, take the hottest node's average CPU, then percentile across bins. Pool-average hides per-node hot-spotting; single-sample Max is too noisy. All of this is currently manual. Rebalancing decisions, migrations, and isolation provisioning are operator-driven today. We are actively investing in standardizing and automating the practice — automated stamp rebalancing and lifecycle management are on the roadmap. Background What is a stamp? A compute stamp is ACR's unit of compute deployment within a region. At a high level, ACR has the following compute components within a region to serve registry data plane operations: VMSS-backed compute pools. Virtual Machine Scale Sets are Azure's primitive for running a managed group of identical VMs that autoscale together. Each region has several compute stamps, each of which has a pool of VMs that handle registry data plane operations such as authentication, manifest operations, tag resolution, and registry-side metadata — the coordination layer of a container pull — plus a separate pool of VMs running the dataproxy component, which sits between clients and storage. For private endpoint pulls, when a client pulls a layer, the data proxy nodes of a compute stamp fetches from the regional storage pool (or from the data proxy's local compute cache) and streams the bytes back; it is effectively a private endpoint proxy and streaming compute cache layered together. Separately, each region has the following storage components shared across all stamps: A pool of storage accounts. Each ACR region has its own pool of Azure Storage accounts (currently shared across all compute stamps in the region) that hold the actual blob (layer) data and manifest content for the geo-replicas on residing them. Storage accounts are multi-tenant within a stamp and region — multiple registries' blobs may land in the same group of accounts, with strict multi-tenant isolation controls and authorization enforcement. Because the regional storage pool is not part of a compute stamp, a future blog post can cover how ACR is separately investing engineering resources to dynamically scale blobs hosted in a region's pool of storage accounts. Each ACR region typically contains multiple compute stamps serving many tenants' registries, all sharing a pool of storage accounts. For geo-replicated registries, a geo-replica in a region is bound to exactly one underlying ACR compute stamp and several underlying storage accounts. A geo-replicated registry's global endpoint (<registry>.azurecr.io), geo-replica regional endpoints, and geo-replica dedicated data endpoints are resolved via DNS — backed by ACR's own Traffic Manager profile — to a specific stamp serving that region's geo-replica. The stamp is ACR's unit of compute that handles a geo-replica's registry data plane operations and proxies requests to the underlying regional storage pool. The key conceptual point: an ACR compute stamp is simultaneously a capacity pool (autoscale operates on it), a fault domain (incidents on the stamp affect all its tenants), and an update domain (rollouts progress through update domains within the stamp). When we move a registry between compute stamps in the same region, we are moving it between all three at once — and the customer's endpoint URLs do not change. From the customer's perspective, the migration is fully seamless: there are no endpoint changes, no DNS updates to make, and no action required on their part. The registry continues to work exactly as before, and the customer does not need to know or care that the underlying stamp has changed. Why multi-tenancy at scale is an active practice The naive picture is: provision enough capacity, autoscale handles the rest. This works in steady state. It does not work when one tenant's workload grows enough to systematically influence stamp behavior, when traffic shape is bursty enough that averages understate peaks, or when a single large tenant's blast radius becomes uncomfortably concentrated on a shared stamp. None of these is something a passive autoscaler will fix. They require an operator decision: this registry would be better served on that stamp. ACR engineering does this continuously — from routine rebalancing to providing additional isolation for exceptional workloads. How We Do It: Stamp Rebalancing Stamp rebalancing — a recurring practice Several signals can trigger a stamp rebalancing decision — reactive signals such as sustained errors, outages, throttling that customers observe or that we observe in our own telemetry, low throughput on a stamp, or proactive signals like hot-node P95 CPU (described in this post below) breaching a threshold. The most recent rebalancing work used hot-node P95 as the proactive trigger; other rebalancing decisions have been driven by the reactive signals just listed. When any of these fires, ACR engineering identifies the registries contributing most to the problem and picks one or more to move to a less-utilized stamp in the same region. The mechanism is straightforward: we initiate elevated operator actions, the control plane re-binds the registry's home_stamp field, DNS routing follows, in-flight requests on the source stamp drain in 30–60 seconds, and new traffic lands on the destination stamp. The cutover takes minutes. The customer's registry endpoint does not change. Most customers never know it happened; the ones whose registry moved typically see better latency afterward. Rebalancing to an existing cooler stamp is a recurring practice that resolves most multi-tenant pressure. For exceptional workloads where rebalancing to another shared stamp would just transfer the problem, ACR may provide additional stamp isolation — placing registries on stamps with fewer co-tenants, giving the tenant better traffic isolation, fault domain separation, and update domain independence while also structurally improving the stamps that tenant used to share with everyone else. Rebalancing at different scales ACR applies rebalancing across a spectrum of scenarios, from moving a handful of registries to a cooler stamp to providing additional stamp isolation for exceptional workloads. The decision criterion is workload size relative to the shared fleet — if moving a tenant to a different shared stamp would just transfer the hot-stamp problem to the destination, additional stamp isolation is the right answer. For everyone else, rebalancing to an existing stamp is sufficient. Both are manual today; both stamp provisioning and rebalancing mechanisms described are on ACR's roadmap to be automated with less operator involvement. Hot-node P95: one of the signals we use proactively Rebalancing decisions are driven by a mix of reactive and proactive signals. Reactive signals — outages, sustained error rates, frequent throttling, low throughput that customers report or that we see in our own telemetry — are the obvious triggers. But waiting for these means waiting for a customer-visible problem. Proactive signals let us intervene before that happens. Hot-node P95 CPU, showcased in this post, is one of the proactive signals we use, and it was the primary signal for the most recent rebalancing work described in the example below. The choice of CPU metric matters. Three candidates: Pool-average CPU. Averages every node in the pool. Hides per-node hot-spotting — a pool with 6% average CPU can still have one node at 99%. Single-sample Max CPU. The highest 1-minute sample. Captures spikes, but is dominated by single-bin noise that doesn't represent sustained load. Hot-node P95 CPU. For each 1-minute bin, take the hottest node's average CPU. Then percentile across bins over a representative 12-hour peak window. This is "how hot is the worst node, most of the time." Hot-node P95 captures sustained per-node load without being noisy, and it tracks customer-visible behavior more closely than either alternative. A concrete illustration from a recent regional resize: on one shared stamp's dataproxy pool, Max CPU touched 96% — alarming if read alone. But hot-node P95 was 43%, meaning most of the time even the hottest node was comfortably loaded; the 96% was a single 1-minute spike. Using Max as the operating signal would have triggered an unnecessary intervention. Using pool-average would have missed real hot-spotting elsewhere. Hot-node P95 is the right operating point for this particular signal — and it is one input among several that feed the broader rebalancing decision. A Recent Example: Rebalancing Large AI Workloads for Additional Isolation We recently completed the rebalancing of registries belonging to one of the largest AI workloads in the region, providing additional isolation to address the scale of their traffic. The customer's workload had grown to the point where its presence on the shared stamps was systematically influencing stamp behavior — variability that affected their own pull latency, and variability that affected every other tenant on the same shared stamps. The customer had 40 registries homed across two shared stamps in the region, with a severely long-tailed traffic distribution: the top four registries carried 96.7% of the customer's traffic. When that much load is concentrated in four registries, the migration cannot proceed as one batch. We moved them in phases, smallest to largest, with observation windows between phases: Idle and small-traffic tail first — about thirty low-traffic registries, used to validate the cutover tooling against the destination stamp. Medium-traffic registries next — in sub-batches with 24 hours of observation between them. The top four, one at a time — each individually with 48 hours of observation between cutovers. Order: smallest to largest, so each cutover was a sanity check at increasing load. The cumulative effect on the shared stamps the customer had previously occupied: Shared stamp + pool Hot-Node P95 CPU change Max CPU change Stamp A — registry pool -7% flat Stamp A — dataproxy pool -34% 96% → 64% Stamp B — registry pool -33% -3 percentage points Stamp B — dataproxy pool -44% -5 percentage points Stamp A dataproxy is the headline. The hottest node went from briefly touching 96% to maxing out at 64%, with sustained hot-node P95 dropping from 43% to 28.5%. Every other tenant homed on Stamp A — most with no idea this rebalancing happened — now runs on a structurally healthier pool, with more headroom, lower tail latency under load, and lower risk of CPU-driven incidents during traffic spikes. Stamp B saw similar relief. After the rebalancing, we right-sized the shared stamps downward — lowering the VMSS minimum instance count on each to match the new traffic level. Hot-node P95 was the primary signal driving this resize work, the same proactive signal that motivated the rebalancing in the first place: when hot traffic leaves a shared stamp, capacity right-sizing follows. Findings ACR runs this recurring stamp rebalancing practice for one reason: to give customers more guaranteed performance — higher and more predictable pull throughput, lower tail latency, better fault and update isolation — whether through routine rebalancing or additional isolation for exceptional workloads. Every tenant on the rebalanced stamps gets more headroom, more predictable behavior under load, and a smaller blast radius for any single incident or rollout. Three things happen continuously in any ACR region to make this real: registries get rebalanced between stamps as load patterns shift, exceptional workloads get additional stamp isolation when no shared stamp can absorb them sustainably, and stamps get continuously right-sized when load enters or leaves. All three are operator-driven today, all three are being invested in for automation, and all three are guided by a combination of reactive signals (outages, errors, throttling) and proactive signals (hot-node P95 CPU is one of them). The thesis is straightforward: cloud multi-tenancy at scale is not a passive property of the architecture. It is an active operational practice that exists to give customers guaranteed performance and predictable behavior. The customers who benefit most from it are usually the customers who never notice it's happening. Summary Question Answer How does ACR keep multi-tenant performance predictable at scale? By actively moving registries between compute stamps as load shifts — rebalancing in the common case, providing additional isolation for exceptional workloads. What is a compute stamp? An ACR compute deployment unit within a region's geo-replica: VMSS-backed registry and data proxy compute pools. Simultaneously a compute capacity pool, fault domain, and update domain. A region typically contains multiple stamps. Take note that ACR maintains a separate pool of regional storage accounts shared across all compute stamps. Do customers see when their registry moves between stamps? No. Stamps are within a region; the global endpoint and any regional endpoint URLs do not change. The cutover takes minutes; in-flight requests drain in 30–60 seconds. Does providing additional isolation only help the isolated tenant? No — every other tenant who was sharing a stamp with that workload also benefits, because the largest source of variability has been removed from the shared fleet. What signals drive these decisions? A mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals from our own telemetry. Hot-node P95 CPU — the 95th percentile, across a 12-hour peak window, of the hottest node's CPU in each 1-minute bin — is one of the proactive signals, and it was the primary signal for the most recent rebalancing work. Is all of this automated? Not yet. Rebalancing, isolation provisioning, and migrations are operator-driven today. Standardizing and automating these practices is an active investment.290Views0likes0CommentsDeterminism over magic: the engineering design behind Azure Container Registry Regional Endpoints
By Zoey Li, Huangli Wu, Johnson Shi, Wei Meng Introduction Azure Container Registry (ACR) supports geo-replication: one registry resource with active-active replicas across multiple Azure regions. You push or pull through any replica, and ACR asynchronously replicates content to all others. For geo-replicated registries, ACR exposes a global endpoint — myregistry.azurecr.io — backed by Azure Traffic Manager (TM), which routes requests based on network performance. This works well for most workloads. But "automatic" is also "opaque": customers can't see or influence which region TM picks, and for teams with data-residency requirements or their own failover logic, not knowing which replica served a request wasn't enough. This post is an engineering deep dive into the design of Regional Endpoints: per-replica DNS names that let customers explicitly target a single registry replica while preserving the global endpoint as the automatic-failover entry point. We'll walk through the DNS topology, how authentication stays portable across endpoints, the certificate strategy, private endpoint integration, and the trade-offs we made to keep routing deterministic. The Problem: Opaque Routing, No Customer Control Traffic Manager's performance-based routing is a black box from the customer's perspective. The system picks the "best" region for each DNS resolution, but customers cannot influence or predict that choice. This created several pain points: Unpredictable routing: A client in a specific region cannot guarantee it hits the local replica. Network topology changes, TM probe timing, and DNS caching all introduce non-determinism. Troubleshooting latency spikes requires first figuring out which region served the request. Network isolation gaps: Enterprises with strict data residency requirements need deterministic in-region request handling. The global endpoint can route cross-region, which may not satisfy these requirements. A large financial services firm reported building duplicate single-region registries per geography as a workaround — abandoning geo-replication entirely. No client-side failover: Customers who wanted failover control (e.g., "try local, fall back to DR region") had no addressable per-region targets to configure in containerd mirrors or client retry logic. Reduced confidence in geo-replication: Some customers disabled geo-replication because they couldn't verify it was working as expected. Without per-region addressability, they couldn't confirm data locality or measure per-replica performance. What Regional Endpoints Are Regional Endpoints provide a dedicated DNS name for each replica region: myregistry.<region>.geo.azurecr.io For example, a registry contoso with replicas in East US and West Europe gets: contoso.eastus.geo.azurecr.io → resolves to the East US replica contoso.westeurope.geo.azurecr.io → resolves to the West Europe replica contoso.azurecr.io → unchanged global endpoint with TM routing and auto-failover Key properties: Coexistence: Regional endpoints exist alongside the global endpoint. Enabling them changes nothing about existing global endpoint behavior. No auto-failover on regional (by design): A request to contoso.eastus.geo.azurecr.io goes to East US. If East US is down, the request fails. This is the explicit trade-off — determinism means no silent rerouting. Global remains the failover entry point: Customers who want automatic failover continue using the global endpoint, now enhanced with health-aware routing. Regional endpoints complement, not replace, the global endpoint. DNL registries: Registries using Deterministic Name Labels include the hash in the hostname (e.g., contoso-ffb4cphwfsc2gbgg.eastus.geo.azurecr.io). Architecture Deep Dive DNS Infrastructure Each replica gets a stable hostname — contoso.eastus.geo.azurecr.io. For public access, this resolves via CNAME to the regional ACR registry server. For private endpoint access, it resolves through the privatelink zone (<region>.geo.privatelink.azurecr.io) to a private IP in the customer's VNet — the same two-hop CNAME pattern that the global login server and data endpoints already use when accessed via private link. How Authentication Works When a client calls a regional endpoint, the auth challenge uses the regional hostname for the token endpoint. However, the token itself is scoped to the global registry name (service=contoso.azurecr.io), making tokens interoperable across all endpoints for the same registry and requested scope. The key design choice: the realm uses the regional host (so the token endpoint matches the hostname the client is talking to), but the service stays global (so tokens are portable). A token obtained from contoso.eastus.geo.azurecr.io works equally against contoso.azurecr.io or contoso.westeurope.geo.azurecr.io. This means customers don't need separate credential stores per region — but Docker does require a docker login per hostname because it keys credential storage on the registry URL. Private Endpoint Integration Regional endpoints reuse the existing registry private endpoint group. When regional endpoints are enabled, new group members of the form registry_<region> are added alongside existing members. Regional endpoints are independent of data endpoints — enabling regional endpoints does not require or modify data endpoint configuration. IP consumption: Enabling regional endpoints adds N private IPs (one per replica region) to the existing baseline of 1 + N (1 global + N data). Total with regional endpoints: 1 + 2×N private IPs. Plan subnet capacity accordingly. When regional endpoints are enabled on a registry that already has private endpoints, the PE connections are updated asynchronously to include the new regional group members. Certificate Strategy TLS certificates are extended with wildcard SAN entries per region (e.g., *.<region>.geo.azurecr.io), supporting TLS validation for regional hostnames without issuing a certificate per registry. Lifecycle and Reversibility Regional endpoints are a registry-level feature. Once enabled, every replica automatically gets a regional hostname. The feature follows standard registry and replica operations: Enable (az acr update --regional-endpoints enabled): Public DNS records are created for all existing replicas. If private endpoints exist, they are updated asynchronously to include new regional members. Add replica: The new region automatically gets its regional DNS record. If private endpoints exist, they are updated to include the new region. Remove replica: That region's regional DNS record is removed. If private endpoints exist, the corresponding member is removed. Disable (az acr update --regional-endpoints disabled): All regional DNS records are removed. If private endpoints exist, the regional members are removed. The global endpoint continues working throughout. The feature can be re-enabled later. This is separate from the per-replication --global-endpoint-routing property (previously --region-endpoint-enabled, renamed in Azure CLI 2.87.0), which controls whether a replica participates in Traffic Manager routing on the global endpoint. That property has no effect on regional endpoint access. Design Trade-offs Regional Endpoints make the region explicit, so they intentionally do not auto-fail over. If East US is down, contoso.eastus.geo.azurecr.io fails — it does not silently mean West US. This is the point: determinism means no silent rerouting. In practice, ACR replicas are zone-redundant by default, spreading across multiple availability zones within a region — so a full regional outage is significantly less likely than a single-zone failure. Customers who want automatic failover across regions continue using the global endpoint. Other trade-offs we made: Docker login is per hostname. Tokens are interoperable across endpoints, but credential stores are hostname-scoped. az acr login --endpoint <region> provides a convenience shortcut. All-or-nothing enablement. Regional endpoints are enabled for all replicas simultaneously — you cannot enable for one region and not another. This simplifies the control plane and avoids partial-state confusion. More private IPs. Each regional endpoint adds one private IP per PE. Plan for 1 + 2×N IPs for N replicas. Replication lag is not masked. A pull to contoso.eastus.geo.azurecr.io will fail if the image hasn't replicated to East US yet. Regional endpoints guarantee region affinity, not data availability. For push-then-immediate-pull workflows, use retries or check replication status before pulling from a different region. What Customers Should Know Enabling Regional Endpoints Requires Azure CLI 2.86.0 or later. # Enable on existing registry az acr update -n myregistry --regional-endpoints enabled # Verify endpoints az acr show-endpoints -n myregistry Authentication # Login to a regional endpoint az acr login -n myregistry --endpoint eastus # Or manually with Docker (tokens are interoperable) TOKEN=$(az acr login -n myregistry --expose-token --query accessToken -o tsv) echo $TOKEN | docker login myregistry.eastus.geo.azurecr.io -u 00000000-0000-0000-0000-000000000000 --password-stdin Tokens are interoperable — a token obtained from any endpoint (regional or global) works across all endpoints. Docker requires a separate docker login per hostname since it keys credentials on the registry URL. TM Routing Disable The existing az acr replication update --global-endpoint-routing disabled removes a region from Traffic Manager routing on the global endpoint. This does not affect direct regional endpoint access — contoso.eastus.geo.azurecr.io remains reachable regardless of TM endpoint status. Rollout and Safety The feature is designed for safe, incremental adoption: Additive: Enabling creates new DNS hostnames. Existing global endpoint records and behavior are untouched. Reversible: Disabling removes the regional DNS records. The global endpoint continues working throughout. Isolated failure domain: A regional endpoint issue affects only traffic explicitly sent to that hostname. Global endpoint traffic is completely unaffected. No customer-side infrastructure required: No agents, sidecars, or additional services to deploy. The feature has been validated through integration testing. Outcome and What's Next Regional Endpoints complete the addressability story for geo-replicated registries. Customers now have two complementary access patterns: Global endpoint (contoso.azurecr.io): Automatic routing with health-aware failover. Best for workloads that want resilience without operational overhead. Regional endpoints (contoso.<region>.geo.azurecr.io): Deterministic, pinned-to-region access. Best for compliance, troubleshooting, and client-controlled failover strategies. Together, these give customers both explicit control and automatic failover — the combination that was previously impossible without abandoning geo-replication. Looking ahead, we are exploring ways to improve pull behavior when content has not yet replicated locally — addressing the replication lag limitation without sacrificing region affinity. To learn more about ACR geo-replication, see Geo-replication in Azure Container Registry. To enable regional endpoints, see the CLI reference or the Azure portal.178Views1like0CommentsEven simpler to Safely Execute AI-generated Code with Azure Container Apps Dynamic Sessions
AI agents are writing code. The question is: where does that code run? If it runs in your process, a single hallucinated import os; os.remove('/') can ruin your day. Azure Container Apps dynamic sessions solve this with on-demand sandboxed environments - Hyper-V isolated, fully managed, and ready in milliseconds. Thanks to your feedback, Dynamic Sessions are now easier to use with AI via MCP. Agents can quickly start a session interpreter and safely run code - all using a built-in MCP endpoint. Additionally - new starter samples show how to invoke dynamic sessions from Microsoft Agent Framework with code interpreter and with a custom container for even more versatility. What Are Dynamic Sessions? A session pool maintains a reservoir of pre-warmed, isolated sandboxes. When your app needs one, it’s allocated instantly via REST API. When idle, it’s destroyed automatically after provided session cool down period. What you get: Strong isolation - Each session runs in its own Hyper-V sandbox - enterprise-grade security Millisecond startup -Pre-warmed pool eliminates cold starts Fully managed - No infra to maintain - automatic lifecycle, cleanup, scaling Simple access - Single HTTP endpoint, session identified by a unique ID Scalable - Hundreds to thousands of concurrent sessions Two Session Types 1. Code Interpreter — Run Untrusted Code Safely Code interpreter sessions accept inline code, run it in a Hyper-V sandbox, and return the output. Sessions support network egress and persistent file systems within the session lifetime. Three runtimes are available: Python - Ships with popular libraries pre-installed (NumPy, pandas, matplotlib, etc.). Ideal for AI-generated data analysis, math computation, and chart generation. Node.js - Comes with common npm packages. Great for server-side JavaScript execution, data transformation, and scripting. Shell - A full Linux shell environment where agents can run arbitrary commands, install packages, start processes, manage files, and chain multi-step workflows. Unlike Python/Node.js interpreters, shell sessions expose a complete OS - ideal for agent-driven DevOps, build/test environments, CLI tool execution, and multi-process pipelines. 2. Custom Containers — Bring Your Own Runtime Custom container sessions let you run your own container image in the same isolated, on-demand model. Define your image, and Container Apps handles the pooling, scaling, and lifecycle. Typical use cases are hosting proprietary runtimes, custom code interpreters, and specialized tool chains. This sample (Azure Samples) dives deeper into Customer Containers with Microsoft agent Framework orchestration. MCP Support for Dynamic Sessions Dynamic sessions also support Model Context Protocol (MCP) on both shell and Python session types. This turns a session pool into a remote MCP server that AI agents can connect to - enabling tool execution, file system access, and shell commands in a secure, ephemeral environment. With an MCP-enabled shell session, an Azure Foundry agent can spin up a Flask app, run system commands, or install packages - all in an isolated container that vanishes when done. The MCP server is enabled with a single property on the session pool (isMCPServerEnabled: true), and the resulting endpoint + API key can be plugged directly into Azure Foundry as a connected tool. For a step-by-step walkthrough, see How to add an MCP tool to your Azure Foundry agent using dynamic sessions. Deep Dive: Building an AI Travel Agent with Code Interpreter Sessions Let’s walk through a sample implementation - a travel planning agent that uses dynamic sessions for both static code execution (weather research) and LLM-generated code execution (charting). Full source: github.com/jkalis-MS/AIAgent-ACA-DynamicSession Architecture Travel Agent Architecture Component Purpose Microsoft Agent Framework Agent runtime with middleware, telemetry, and DevUI Azure OpenAI (GPT-4o) LLM for conversation and code generation ACA Session Pools Sandboxed Python code interpreter Azure Container Apps Hosts the agent in a container Application Insights Observability for agent spans The agent implements with two variants switchable in the Agent Framework DevUI - tools in ACA Dynamic Session (sandbox) and tools running locally (no isolation) - making the security value immediately visible. Scenario A: Static Code in a Sandbox - Weather Research The agent sends pre-written Python code to the session pool to fetch live weather data. The code runs with network egress enabled, calls the Open-Meteo API, and returns formatted results - all without touching the host process. import requests from azure.identity import DefaultAzureCredential credential = DefaultAzureCredential() token = credential.get_token("https://dynamicsessions.io/.default") response = requests.post( f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=weather-session-1", headers={"Authorization": f"Bearer {token.token}"}, json={"properties": { "codeInputType": "inline", "executionType": "synchronous", "code": weather_code, # Python that calls Open-Meteo API }}, ) result = response.json()["properties"]["stdout"] Scenario B: LLM-Generated Code in a Sandbox - Dynamic Charting This is where it gets interesting. The user asks “plot a chart comparing Miami and Tokyo weather.” The agent: Fetches weather data Asks Azure OpenAI to generate matplotlib code using a tightly-scoped system prompt Safety-checks the generated code for forbidden imports (subprocess, os.system, etc.) Wraps the code with data injection and sends it to the sandbox Downloads the resulting PNG from the sandbox’s /mnt/data/ directory from openai import AzureOpenAI # 1. LLM generates chart code client = AzureOpenAI(azure_endpoint=endpoint, api_key=key, api_version="2024-12-01-preview") generated_code = client.chat.completions.create( model="gpt-4o", messages=[{"role": "system", "content": CODE_GEN_PROMPT}, {"role": "user", "content": f"Weather data: {weather_json}"}], temperature=0.2, ).choices[0].message.content # 2. Execute in sandbox requests.post( f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=chart-session-1", headers={"Authorization": f"Bearer {token.token}"}, json={"properties": { "codeInputType": "inline", "executionType": "synchronous", "code": f"import json, matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nweather_data = json.loads('{weather_json}')\n{generated_code}", }}, ) # 3. Download the chart img = requests.get( f"{pool_endpoint}/files/content/chart.png?api-version=2024-02-02-preview&identifier=chart-session-1", headers={"Authorization": f"Bearer {token.token}"}, ).content The result is a dark-themed dual-subplot chart comparing maximal and minimal temperature forecast chart example rendered by the Chart Weather tool in Dynamic Session: Authentication The agent uses DefaultAzureCredential locally and ManagedIdentityCredential when deployed. Tokens are cached and refreshed automatically: from azure.identity import DefaultAzureCredential token = DefaultAzureCredential().get_token("https://dynamicsessions.io/.default") auth_header = f"Bearer {token.token}" # Uses ManagedIdentityCredential automatically when deployed to Container Apps Observability The agent uses Application Insights for end-to-end tracing. The Microsoft Agent Framework exposes OpenTelemetry spans for invoke_agent, chat, and execute tool - wired to Azure Monitor with custom exporters: from azure.monitor.opentelemetry import configure_azure_monitor from agent_framework.observability import create_resource, enable_instrumentation # Configure Azure Monitor first configure_azure_monitor( connection_string="InstrumentationKey=...", resource=create_resource(), # Uses OTEL_SERVICE_NAME, etc. enable_live_metrics=True, ) # Then activate Agent Framework's telemetry code paths, optional if ENABLE_INSTRUMENTATION and/or ENABLE_SENSITIVE_DATA are set in env vars enable_instrumentation(enable_sensitive_data=False) This gives you traces for every agent invocation, tool execution (including sandbox timing), and LLM call - visible in the Application Insights transaction search and end-to-end transaction view in the new Agents blade in Application Insights. You can also open a detailed dashboard by clicking Explore in Grafana. Session pools emit their own metrics and logs for monitoring sandbox utilization and performance. Combined with the agent-level Application Insights traces, you can get full visibility from the user prompt → agent → LLM → sandbox execution → response across both your application and the infrastructure running untrusted code. Deploy with One Command The project includes full Bicep infrastructure-as-code. A single azd up provisions Azure OpenAI, Container Apps, Session Pool (with egress enabled), Container Registry, Application Insights, and all role assignments. azd auth login azd up Next Steps Dynamic sessions documentation – Microsoft Learn MCP + Shell sessions tutorial - How to add an MCP tool to your Foundry agent Custom container sessions sample - github.com/Azure-Samples/dynamic-sessions-custom-container AI Agent + Dynamic Sessions - github.com/jkalis-MS/AIAgent-ACA-DynamicSession1.2KViews0likes0CommentsAnnouncing Public Preview of Argo CD extension in AKS Azure Portal Experience
We are excited to announce the public preview of Argo CD in the Azure Portal for Azure Kubernetes Service. As GitOps becomes the standard for deploying and operating applications at scale, customers need a way to adopt GitOps with simpler onboarding, secure defaults, and integrated workflows. With Argo CD now available directly in the Portal, teams can enable and manage GitOps without the complexity of manual setup. Bringing GitOps into the AKS experience Argo CD is widely used across Kubernetes environments, but setup often requires manual configuration across identity, networking, and registry integrations. With the Azure Portal experience, customers can: Enable Argo CD directly from the AKS cluster Configure identity, access, ingress, and registry integration in a guided flow Manage and monitor GitOps workflows through Argo CD UI This reduces onboarding friction and helps you reach your first successful GitOps deployment faster. Trusted identity and secure access The Argo CD experience integrates with Microsoft Entra ID to provide a secure, enterprise-ready foundation: Secure authentication using Workload Identity federation to Azure Container Registry (ACR) and Azure DevOps, removing long-lived credentials and hard-coded secrets Single Sign-On (SSO) using existing Azure identities Enterprise-grade hardening and security This preview includes built-in improvements to strengthen security posture: Images built on Azure Linux for reduced CVEs and improved baseline security Optional automatic patch updates to stay current while maintaining control over change management Parity with upstream Argo CD Argo CD in AKS remains aligned with the upstream open-source project, supporting: High availability (HA) configurations for production workloads Hub-and-spoke architectures for multi-cluster GitOps Application and ApplicationSet for scalable deployment across fleets Getting Started We invite you to explore the Argo CD experience in the Azure Portal and share feedback. To get started, go to your AKS cluster in the Azure Portal, navigate to the GitOps experience, and select Enable Argo CD. Follow the guided setup to configure identity, access, ingress, and registry integration with secure defaults. Once enabled, you can monitor your deployment and view application health and sync status from the Argo CD UI linked in the GitOps blade. For customers who prefer automation and scripting, the Argo CD extension is also available via Azure CLI public preview. NOTE: You can choose between Flux and Argo CD as your GitOps solution based on your needs. The Argo CD option is available during the initial GitOps setup experience, while existing Flux users will continue to see their current configuration.376Views0likes0CommentsSet Up Endpoint DLP Evidence Collection on your Azure Blob Storage
Endpoint Data Loss Prevention (Endpoint DLP) is part of the Microsoft Purview Data Loss Prevention (DLP) suite of features you can use to discover and protect sensitive items across Microsoft 365 services. Microsoft Endpoint DLP allows you to detect and protect sensitive content across onboarded Windows 10, Windows 11 and macOS devices. Learn more about all of Microsoft's DLP offerings. Before you start setting up the storage, you should review Get started with collecting files that match data loss prevention policies from devices | Microsoft Learn to understand the licensing, permissions, device onboarding and your requirements. Prerequisites Before you begin, ensure the following prerequisites are met: You have an active Azure subscription. You have the necessary permissions to create and configure resources in Azure. You have setup endpoint Data Loss Prevention policy on your devices Configure the Azure Blob Storage You can follow these steps to create an Azure Blob Storage using the Azure portal. For other methods refer to Create a storage account - Azure Storage | Microsoft Learn Sign in to the Azure Storage Accounts with your account credentials. Click on + Create On the Basics tab, provide the essential information for your storage account. After you complete the Basics tab, you can choose to further customize your new storage account, or you accept the default options and proceed. Learn more about azure storage account properties Once you have provided all the information click on the Networking tab. In network access, select Enable public access from all networks while creating the storage account. Click on Review + create to validate the settings. Once the validation passes, click on Create to create the storage Wait for deployment of the resource to be completed and then click on Go to resource. Once the newly created Blob Storage is opened, on the left panel click on Data Storage -> Containers Click on + Containers. Provide the name and other details and then click on Create Once your container is successfully created, click on it. Assign relevant permissions to the Azure Blob Storage Once the container is created, using Microsoft Entra authorization, you must configure two sets of permissions (role groups) on it: One for the administrators and investigators so they can view and manage evidence One for users who need to upload items to Azure from their devices Best practice is to enforce least privilege for all users, regardless of role. By enforcing least privilege, you ensure that user permissions are limited to only those permissions necessary for their role. We will use portal to create these custom roles. Learn more about custom roles in Azure RBAC Open the container and in the left panel click on Access Control (IAM) Click on the Roles tab. It will open a list of all available roles. Open context menu of Owner role using ellipsis button (…) and click on Clone. Now you can create a custom role. Click on Start from scratch. We have to create two new custom roles. Based on the role you are creating enter basic details like name and description and then click on JSON tab. JSON tab gives you the details of the custom role including the permissions added to that role. For owner role JSON looks like this: Now edit these permissions and replace them with permissions required based on the role: Investigator Role: Copy the permissions available at Permissions on Azure blob for administrators and investigators and paste it in the JSON section. User Role: Copy the permissions available at Permissions on Azure blob for usersand paste it in the JSON section. Once you have created these two new roles, we will assign these roles to relevant users. Click on Role Assignments tab, then on Add + and on Add role assignment. Search for the role and click on it. Then click on Members tab Click on + Select Members. Add the users or user groups you want to add for that role and click on Select Investigator role – Assign this role to users who are administrators and investigators so they can view and manage evidence User role – Assign this role to users who will be under the scope of the DLP policy and from whose devices items will be uploaded to the storage Once you have added the users click on Review+Assign to save the changes. Now we can add this storage to DLP policy. For more information on configuring the Azure Blob Storage access, refer to these articles: How to authorize access to blob data in the Azure portal Assign share-level permissions. Configure storage in your DLP policy Once you have configured the required permissions on the Azure Blob Storage, we will add the storage to DLP endpoint settings. Learn more about configuring DLP policy Open the storage you want to use. In left panel click on Data Storage -> Containers. Then select the container you want to add to DLP settings. Click on the Context Menu (… button) and then Container Properties. Copy the URL Open the Data Loss Prevention Settings. Click on Endpoint Settings and then on Setup evidence collection for file activities on devices. Select Customer Managed Storage option and then click on Add Storage Give the storage name and copy the container URL we copied. Then click on Save. Storage will be added to the list. Storage will be added to the list for use in the policy configuration. You can add up to 10 URLs Now open the DLP endpoint policy configuration for which you want to collect the evidence. Configure your policy using these settings: Make sure that Devices is selected in the location. In Incident reports, toggle Send an alert to admins when a rule match occurs to On. In Incident reports, select Collect original file as evidence for all selected file activities on Endpoint. Select the storage account you want to collect the evidence in for that rule using the dropdown menu. The dropdown menu shows the list of storages configured in the endpoint DLP settings. Select the activities for which you want to copy matched items to Azure storage Save the changes Please reach out to the support team if you face any issues. We hope this guide is helpful and we look forward to your feedback. Thank you, Microsoft Purview Data Loss Prevention Team4.1KViews6likes2CommentsHardening OpenClaw on AKS: Mitigating Container Escapes with Kata microVM Isolation
What is OpenClaw, and what security challenges does it pose with container escapes? OpenClaw is an open-source autonomous AI agent designed for power users and developers to automate tasks, such as managing emails, files, and scheduling via chat apps like WhatsApp or Telegram. While OpenClaw functions as a powerful autonomous assistant, its runtime model creates a massive security paradox: to be truly useful, the agent requires broad permissions to your filesystem and APIs, yet this "God Mode" access often lacks the rigorous containerized isolation typical of enterprise workloads. Because many users run the framework natively rather than within a hardened sandbox, the primary security challenge is that a single malicious "Skill" or an indirect prompt injection can escalate into full system compromise. This structural vulnerability, exemplified by high-profile exploits like CVE-2026-25253, transforms the agent from a helpful tool into a high-risk entry point for lateral movement and data exfiltration within a private network. Why container escapes matter in OpenClaw-style deployments: because containers share the host kernel, a successful container escape turns a single compromised container into a host compromise (or at least a compromise of other co-located workloads). This is especially important when OpenClaw runs code from many tenants, many teams, or varying trust levels on the same worker nodes. That soft isolation is often permeable due to the following structural and configuration-based weaknesses: Shared-kernel attack surface: the container boundary is not a hypervisor boundary. Kernel vulnerabilities (e.g., privilege escalation bugs) can allow a process in a container to gain host-level privileges. Excessive privileges / misconfiguration: running with --privileged, broad Linux capabilities, hostPath mounts, access to the Docker socket, or device passthrough (e.g., /dev/kvm, /dev/fuse) can provide direct paths to host control. Filesystem and namespace boundary breaks: mount namespace confusion, writable host mounts, or mistakes in chroot/pivot_root handling can expose host files and credentials. Supply-chain and image risk: a malicious image or dependency can execute within the container and then attempt escalation/escape. Blast radius: once the host is compromised, attackers can access node-level secrets (service account tokens, registry creds), tamper with the runtime, sniff traffic, or pivot to other containers and the broader cluster. In short, OpenClaw’s security challenge is not that containers are inherently insecure, but that the isolation boundary is thinner than a VM boundary. When the threat model includes adversarial code execution, a “container-only” isolation strategy often requires additional hardening or a stronger sandbox. What are MicroVMs and Kata Containers, and how do they help mitigate OpenClaw container-escape risks? MicroVMs are lightweight virtual machines optimized for running short-lived or container-like workloads with much lower overhead than traditional VMs. They use hardware virtualization (via a hypervisor such as KVM) but keep the device model and boot path minimal, reducing startup time and the overall attack surface compared to a full general-purpose VM. Kata Containers is an “OCI-compatible containers in a VM” approach: it runs each container (or pod sandbox) inside a dedicated microVM by default (implementation varies by runtime/config). To the orchestration layer (e.g., Kubernetes), it still looks like a container runtime, but isolation is provided by a hypervisor boundary rather than only namespaces/cgroups. Stronger isolation boundary: a container escape that relies on Linux kernel exploitation is far less likely to directly compromise the host, because the workload’s “host” kernel is typically the guest kernel inside the microVM. Reduced blast radius: compromise is contained to the microVM/pod sandbox; lateral movement to other workloads on the same node becomes significantly harder. Smaller and more controllable attack surface: minimal device models, tighter default privileges, and fewer host mounts/devices exposed to the workload. Defense-in-depth with container controls: you still can (and should) apply seccomp, capabilities dropping, read-only root filesystems, and LSMs inside the guest, but the hypervisor boundary becomes an additional layer. Better fit for hostile multi-tenant workloads: when OpenClaw executes third-party jobs/plugins, Kata-style sandboxing aligns better with an adversarial threat model. Solution overview Figure 1 illustrates a Kubernetes-based sandboxing architecture for running OpenClaw workloads with stronger isolation. The design keeps the developer experience and packaging model of containers (OCI images, Kubernetes scheduling) while ensuring that untrusted agent code executes inside a microVM boundary using Kata Containers. This reduces the likelihood that a container escape can compromise the underlying node or other co-located workloads. Key components: (1) Application gateway for HTTPS traffic to the backend, (2) Kubernetes as the orchestration, scheduling and policy enforcement plane, (3) a container runtime (e.g., containerd) configured with a Kata Containers runtime class, (4) KVM-backed microVMs that provide the isolation boundary for each untrusted workload and (5) Azure files for persistent storage which allows scaling of OpenClaw. Figure 1: Solution architecture diagram End-to-end flow: Traffic Entry via Application Gateway: Incoming user requests (e.g., from WhatsApp or Discord) first hit the Azure Application Gateway. Orchestration in AKS: The traffic is routed into an Azure Kubernetes Service (AKS) cluster, which manages the lifecycle of the OpenClaw agent and its associated "Skills." Hardened Execution via Kata Containers: Instead of running in standard shared-kernel containers, the OpenClaw agent runs inside Kata Containers. This provides a dedicated lightweight VM for the agent, creating a hardware-level isolation boundary that prevents "container escapes" from compromising the host. Stateful Storage in Azure Files: The agent interacts with Azure Files to read and write persistent data, such as conversation history, configuration files, and downloaded assets, ensuring data remains available even if the container is restarted. Security posture: by shifting isolation from “shared-kernel containers” to “containers inside microVMs,” the architecture limits the blast radius of kernel-level exploits and common escape paths. Even if an attacker achieves code execution within an OpenClaw container, they must additionally break the microVM/hypervisor boundary to affect the node or neighboring workloads, providing a strong defense-in-depth improvement over standard container alone. Implement the solution This section describes how to deploy the solution architecture. In this post, you’ll perform the following tasks: Create a Kata VM-isolated AKS node pool Mount a NFS persistent storage Create the application ConfigMap Deploy the OpenClaw gateway Expose the gateway internally Set up TLS termination Route external traffic through the Azure application gateway for containers. Ensure that you have the following prerequisites deployed before moving to the next section: An AKS cluster provisioned in Azure An Azure NFS File Share with private link enabled. An Application gateway for containers managed by ALB controller Kubectl configured and pointing to the cluster Az CLI authenticated with the correct subscription Initialise environment variables In your Linux terminal, export these variables with your own values. They will be used in later commands. export cluster_name=<CLUSTER_NAME> export resource_group=<RESOURCE_GROUP> Create the AKS Node Pool with Kata VM Isolation The OpenClaw gateway pods require Kata VM isolation (runtimeClassName: kata-vm-isolation). You must create a dedicated AKS node pool that supports this runtime before deploying any workloads. Use the Azure CLI to add a node pool with the Kata VM isolation workload runtime to your existing AKS cluster: az aks nodepool add \ --resource-group $resource_group \ --cluster-name $cluster_name \ --name katanp \ --node-count 2 \ --node-vm-size Standard_D4s_v3 \ --os-sku AzureLinux \ --workload-runtime KataMshvVmIsolation \ --labels agentpool=katanp **Important:** The `--workload-runtime KataMshvVmIsolation` flag enables the `kata-vm-isolation` runtime class on the node pool. The VM size must support nested virtualization (D-series v3/v5, E-series v3/v5, etc.). Create NFS Persistent Volume The deployment uses an Azure Files NFS share for persistent workspace storage. The PersistentVolume must exist before the PVC can bind to it. Replace volumeHandle and volumeAttributes with your own Azure Files values. cat <<EOF | kubectl apply -f - apiVersion: v1 kind: PersistentVolume metadata: name: openclaw-nfs-pv spec: capacity: storage: 100Gi accessModes: - ReadWriteMany persistentVolumeReclaimPolicy: Retain mountOptions: - sec=sys - noresvport - actimeo=30 csi: driver: file.csi.azure.com volumeHandle: <resource-group>#<storage-account>#<share-name> volumeAttributes: resourceGroup: <resource-group> shareName: <share-name> protocol: nfs server: <storage-account>.privatelink.file.core.windows.net EOF Verify that the persistent volume is created. kubectl get pv openclaw-nfs-pv Figure 2: Persistent volume Create the NFS PersistentVolumeClaim The PVC binds to the PV created. The deployment references this PVC by name (`pvc-openclaw-nfs`). cat <<EOF | kubectl apply -f - apiVersion: v1 kind: PersistentVolumeClaim metadata: # The name of the PVC name: pvc-openclaw-nfs spec: accessModes: - ReadWriteMany resources: requests: # The real storage capacity in the claim storage: 50Gi # This field must be the same as the storage class name in StorageClass storageClassName: "" volumeName: openclaw-nfs-pv EOF Verify that the persistent volume claim is created successfully. The status should show bound. Figure 3: Persistent Volume Claim Create the ConfigMap The ConfigMap provides the openclaw.json configuration file to the gateway pods. It configures allowed CORS origins for the control UI and the gateway token. Replace the allowed origins with your own ALB frontend URL. The ConfigMap also stores the gateway auth token, so DO NOT hardcode your token here. Always keep it as a variable rather than storing it in plain text so that, if attackers gain access to this file, they cannot see the OpenClaw gateway auth token. cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: openclaw-config data: openclaw.json: | { "gateway": { "auth": { "token": "${AUTH_TOKEN}" }, "controlUi": { "allowedOrigins": [ "https://<YOUR ALB FRONTEND URL>.alb.azure.com" ] } } } EOF Create the Auth Token Secret The OpenClaw gateway requires an authentication token to secure access. The deployment references a Kubernetes Secret named openclaw-auth-token and injects it into the container as the AUTH_TOKEN environment variable via secretKeyRef. Generate a random token (or use an existing one) and create the kubernetes secret. # Generate a random 32-byte hex token AUTH_TOKEN=$(openssl rand -hex 32) echo "$AUTH_TOKEN" # save this — you'll need it to authenticate with the gateway kubectl create secret generic openclaw-auth-token \ --from-literal=token="$AUTH_TOKEN" If the secret does not exist when the deployment is applied, pods will fail with `CreateContainerConfigError`. Deploy the OpenClaw Gateway This is the main application deployment. It depends on all previous steps: - Kata node pool (pods require runtimeClassName: kata-vm-isolation and nodeSelector: agentpool=katanp) - PVC (pvc-openclaw-nfs for persistent workspace data) - ConfigMap (openclaw-config for openclaw.json) Key details: - Runs 2 replicas with a rolling update strategy - Uses an init container to copy the config file to a writable volume - Exposes port 18789 - Includes liveness and readiness probes on /health - Resource requests: 500m CPU, 512Mi memory - Resource limits: 2 CPU, 2Gi memory cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: Deployment metadata: name: openclaw-gateway spec: replicas: 2 selector: matchLabels: app: openclaw-gateway strategy: type: RollingUpdate rollingUpdate: maxUnavailable: 1 maxSurge: 1 template: metadata: labels: app: openclaw-gateway spec: runtimeClassName: kata-vm-isolation nodeSelector: agentpool: katanp securityContext: fsGroup: 1000 initContainers: - name: copy-openclaw-config image: alpine/openclaw:latest env: - name: HOME value: /writable command: - sh - -c - | cp /config/openclaw.json /writable/openclaw.json \ && chown 1000:1000 /writable/openclaw.json \ && echo "--- Config file contents ---" \ && cat /writable/openclaw.json volumeMounts: - name: openclaw-config-volume mountPath: /config - name: openclaw-writable mountPath: /writable containers: - name: gateway image: alpine/openclaw:latest ports: - containerPort: 18789 env: - name: NODE_OPTIONS value: "--max-old-space-size=4096" - name: AUTH_TOKEN valueFrom: secretKeyRef: name: openclaw-auth-token key: token # Start gateway the way the tutorial indicates command: ["openclaw", "gateway"] args: ["run", "--allow-unconfigured", "--bind", "lan"] volumeMounts: - name: openclaw-writable mountPath: /home/node/.openclaw - name: openclaw-data mountPath: /home/node/workspace subPath: workspace resources: requests: cpu: "500m" memory: "2Gi" limits: cpu: "1000m" memory: "4Gi" livenessProbe: httpGet: path: /health port: 18789 initialDelaySeconds: 60 periodSeconds: 15 failureThreshold: 3 readinessProbe: httpGet: path: /health port: 18789 initialDelaySeconds: 10 periodSeconds: 5 volumes: - name: openclaw-data persistentVolumeClaim: claimName: pvc-openclaw-nfs - name: openclaw-config-volume configMap: name: openclaw-config items: - key: openclaw.json path: openclaw.json - name: openclaw-writable emptyDir: {} --- apiVersion: v1 kind: Service metadata: name: openclaw-gateway-service spec: type: ClusterIP selector: app: openclaw-gateway ports: - protocol: TCP port: 18789 targetPort: 18789 EOF Verify that the deployment succeeds. Wait until all pods show `Running` and `READY 2/2`. kubectl get deployment openclaw-gateway kubectl get pods -l app=openclaw-gateway Figure 4: OpenClaw deployment Create the TLS secret (for HTTPS) The Application Gateway for Containers references a TLS secret (gateway-tls-secret) for HTTPS termination. This blog post uses a self-signed certificate; in a production environment, use a certificate signed by a certificate authority. Replace `<path-to-tls-cert>` and `<path-to-tls-key>` with paths to your TLS certificate and private key files. kubectl create secret tls gateway-tls-secret \ --cert=<path-to-tls-cert> \ --key=<path-to-tls-key> Create the Gateway The Gateway resource defines the HTTPS listener on the Azure Application Load Balancer (ALB). Update the `alb.network.azure.com/application-gateway-id` annotation to match your ALB traffic controller resource ID. You will also need to reference the gateway-tls-secret to enable HTTPS. cat <<EOF | kubectl apply -f - apiVersion: gateway.networking.k8s.io/v1 kind: Gateway metadata: name: https annotations: alb.network.azure.com/application-gateway-id: /subscriptions/<subscription id>/resourceGroups/mc_openclaw_openclaw-cluster_centralus/providers/Microsoft.ServiceNetworking/trafficControllers/<alb id> alb.networking.azure.io/alb-namespace: default alb.networking.azure.io/alb-name: alb-openclaw spec: gatewayClassName: azure-alb-external listeners: - name: https protocol: HTTPS port: 443 allowedRoutes: namespaces: from: All tls: mode: Terminate certificateRefs: - kind: Secret group: "" name: gateway-tls-secret EOF kubectl get gateway https Wait until the Gateway shows a `Programmed=True` condition. Create the HTTPRoute The HTTPRoute connects the Gateway to the backend Service. It routes all traffic (`/` prefix) from the HTTPS Gateway to `openclaw-gateway-service` on port 18789. cat <<EOF | kubectl apply -f - kind: HTTPRoute apiVersion: gateway.networking.k8s.io/v1 metadata: name: http-route spec: parentRefs: - name: https rules: - matches: - path: type: PathPrefix value: / backendRefs: - name: openclaw-gateway-service kind: Service namespace: default port: 18789 EOF Test OpenClaw application Get the external endpoint. kubectl get gateway https -o jsonpath='{.status.addresses[0].value}' Paste the endpoint into your browser to reach the OpenClaw application. If you are using a self-signed certificate, you will see a “Not secure” warning; click Advanced to proceed. In a production environment with a certificate signed by a certificate authority, you should not see that warning. Figure 5: OpenClaw Authentication Paste in your Gateway Token (the auth token created earlier). You will notice that even though the token is valid, it throws back a “pairing required” error. Pairing is required in OpenClaw whenever a new device, browser profile, or CLI client attempts to connect to the gateway for the first time, ensuring only authorized clients can control the AI agent. POD=$(kubectl get pod -l app=openclaw-gateway -o jsonpath='{.items[0].metadata.name}') POD2=$(kubectl get pod -l app=openclaw-gateway -o jsonpath='{.items[1].metadata.name}') TOKEN=$(kubectl get secret openclaw-auth-token -o jsonpath='{.data.token}' | base64 -d) kubectl exec "$POD" -c gateway -- openclaw devices approve --latest --token "$TOKEN" kubectl exec "$POD2" -c gateway -- openclaw devices approve --latest --token "$TOKEN" You should see a message like the one in the image below. You can now open the OpenClaw application and start using it. Figure 6: OpenClaw pairing success message Figure 7: OpenClaw Application You have successfully deployed OpenClaw within a microVM hosted on Azure Kubernetes Service. Test microVM kernel isolation From within the OpenClaw pod, try to read the host’s root filesystem via /proc/1/root. You should see an error like: ls: cannot access '/proc/1/root/etc/kubernetes': No such file or directory. kubectl exec -it "$POD" -c gateway -- ls /proc/1/root/etc/kubernetes 2>&1 In a standard container deployment, PID 1 inside the container is still running on the host kernel, so traversing /proc/1/root/ exposes the host's root filesystem — including sensitive paths like /etc/kubernetes (which holds kubelet credentials). With Kata VM isolation, the picture is completely different. When we run ls /proc/1/root/etc/kubernetes from inside the OpenClaw pod, it returns "No such file or directory". This is because PID 1 is no longer a process on the host — it's running inside a dedicated guest VM with its own kernel. The /proc/1/root/ path leads to the microVM's root filesystem, not the host's, and that microVM has no knowledge of the node's Kubernetes configuration or machine identity. The host is simply invisible. This is the core security guarantee of Kata Containers: even if an attacker achieves a full container escape, there is nothing to escape to — they land inside a lightweight VM boundary, not on the shared host, making lateral movement to other pods or the node itself impossible. Conclusion This post discussed why running OpenClaw workloads in standard containers can be risky when the workload includes untrusted or semi-trusted code: containers share the host Linux kernel, so a single container escape or privileged misconfiguration can expand into node-level compromise and a much larger blast radius. To address this, we introduced microVM-based sandboxing with Kata Containers on Azure Kubernetes Service (AKS) and walked through an implementation approach (a node pool with Kata VM isolation, storage, gateway deployment, and ingress). Finally, we validated the isolation properties by demonstrating that common host-visibility techniques (for example, probing /proc/1/root) no longer reveal host paths when the workload runs inside a microVM. Separate kernel boundary: Kata runs the container inside a microVM, so the workload executes against a guest kernel rather than the shared host kernel—kernel exploits and escape attempts don’t directly translate into host control. Host filesystem is no longer “in scope”: paths that often leak host context in standard containers (for example, traversals via /proc) resolve inside the microVM’s filesystem, not the node’s root filesystem. Reduced blast radius per workload: each sandbox has its own VM boundary, making it much harder to pivot from one compromised workload to other pods/containers on the same node. Stronger default device and privilege separation: the hypervisor boundary and minimal virtual device model limit exposure to host devices and privileged interfaces that commonly enable breakouts. Defense-in-depth still applies: you can keep container hardening (seccomp, capability dropping, read-only filesystems, restricted mounts) while gaining an additional isolation layer that is independent of Linux namespaces/cgroups. Overall, this post helps you deploy OpenClaw on AKS with Kata microVM isolation so you can run agent workloads with a significantly reduced risk of host-kernel compromise from container escape techniques.