Shaping what Azure SRE Agent does: Tool Permissions and Hooks

When an AI agent runs against production, the first question every security team asks is "What can it do, who decided it could, and what stops it from doing something it should not?" Azure SRE Agent reached general availability in March. Since then, teams inside Microsoft and customers running it against real production workloads have asked for the same thing: finer-grained controls over what the agent can do on its own and a clear answer to who governs each call that reaches a tool.

Today at Build 2026, we are releasing global tool access policies as one of a set of new governance controls. This post covers how they work. Tool access policies give security and platform teams a single place to define which tools the agent can invoke, under what conditions, and what requires human approval before it runs. Underneath those policies sits the identity the agent runs as, the bedrock that every other control layer depends on. It is defense in depth applied to agent behavior: layers of control, each one holding on its own, so that governing the agent is something you can read, audit, and reason about as you scale it across production.

Identity is the bedrock: managed identity today, agent identity next

Start here, because nothing else matters if you skip it. The identity the SRE Agent runs as, and the Azure RBAC role assignments on that identity, are the most powerful boundary the agent works inside of. If your role assignments do not grant the agent access to a resource, none of the controls below come into play, because the agent cannot reach the resource to begin with. Network rules, tool permissions, hooks, and connector contracts all sit on top of an RBAC story that you write. The features in this post add layers above that floor. They do not replace it.

Today the SRE Agent operates as a managed identity, and your RBAC role assignments on that identity govern what it can do. This is the bedrock, and it is the same model your other Azure workloads already use.
You assign roles, you scope them, and the agent inherits exactly what you granted and nothing more. Everything that follows assumes the bedrock is in place.

With identity settled, the next question is the obvious one: where is the agent allowed to send its traffic?

Permissions: govern what the agent does with a tool

Identity decides what the agent can reach. Permissions decide what the agent does with the access it has, down to the individual tool. Two levels cover the range: a point-and-click grid for the common cases, and hooks when a decision needs your own code.

The grid is the easy mode. Every tool the agent can use, built-in tools along with MCP servers, services, and custom tools, shows up in one searchable list with two switches. On/Off sets whether the tool is available at all; turn it off and the agent cannot use it. Allow/Ask sets what happens when it is on: Allow lets the agent run the tool automatically; Ask requires a human to approve every time, except in Autonomous mode. Select tools in bulk to flip a whole category at once, filter by category or permission, and use the Advanced permissions tab when you want rules that apply at global, per-agent, or per-thread scope instead of tool by tool. Defaults stay put until you touch them, and the engine is fail-closed: if a rule cannot be evaluated, the call is blocked rather than allowed. That covers most of what teams need.

Underneath those switches are three rules (allow, ask, and deny), and the Advanced tab is where you set them by scope. Global rules apply to every agent and thread, Agent rules to one custom agent, and Thread rules to a single conversation. Deny is the hard one: it blocks the tool outright no matter the run mode, and a deny at a higher scope always wins, so an Allow at thread scope cannot reopen something denied globally. That split is deliberate.
A platform team sets the Global guardrails that should never be crossed and the Asks that always need a human, and service teams add their own Allow rules at Agent scope for routine work, without being able to override the guardrails above them.

    Platform team, Global scope:
      deny:  bash(az * delete *)        - never delete, on any agent or thread
      deny:  bash(kubectl delete *)
      ask:   bash(az webapp restart *)  - always confirm, even in Autonomous
      allow: bash(az monitor *)         - auto-approve monitoring queries

    Service team, Agent scope:
      allow: bash(kubectl get *)        - routine read-only work
      allow: bash(kubectl describe *)

Two details make this safe to lean on. Rules match the canonicalized tool invocation rather than the raw text, so enforcement holds no matter how the command was assembled. And fail-closed has a softer edge than a hard stop: a cached last-known-good policy covers transient failures, so a blip in the policy store blocks the call rather than silently widening access.

You can find these under Capabilities > Tools missions.

The layer worth spending time on is hooks. Allow and Ask answer "should this tool run." Hooks answer "should this specific call run, given exactly what it is about to do." A hook fires before the agent runs a tool and receives the actual call, parameters and all. Your code then decides the outcome and can reshape it: rewrite parameters before they are sent, inject extra context into the pipeline as a user message so the agent reconsiders before its next step, block the call outright, or redirect the agent toward a safer path. Because your code sees the real parameters, the decision can depend on anything you can express in code: which resource the call targets, whether a value falls outside an allowed range, the time of day, the result of an external policy lookup. This is where you write the rule the grid cannot.

Two kinds of hook, mixable on the same agent. Command hooks are a script you write; reach for these when code is enough.
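As a rough illustration of a command hook that decides on the real parameters, here is a minimal Python sketch. It follows the contract this post describes further down (the call arrives as JSON on stdin; printing nothing allows it, and a block decision with a reason stops it). Only `tool_name`, `decision`, and `reason` come from the post's own Bash example; the tool name `Bash` and the `parameters` and `command` keys are assumptions made for illustration, as are the helper names.

```python
import json
from typing import Optional, TextIO

# Assumptions for illustration only: the shell tool is called "Bash" and its
# arguments arrive under "parameters" -> "command". Check the real payload
# your hook receives before relying on these names.
SHELL_TOOL = "Bash"
LISTING_COMMANDS = {"ls", "dir"}
REASON = "Shell directory listing is blocked; use the ListDir tool instead."


def decide(event: dict) -> Optional[dict]:
    """Return a block decision for shell listings, or None to allow the call."""
    if event.get("tool_name") != SHELL_TOOL:
        return None
    params = event.get("parameters") or {}
    words = str(params.get("command", "")).split()
    if words and words[0] in LISTING_COMMANDS:
        return {"decision": "block", "reason": REASON}
    return None


def run(stdin: TextIO, stdout: TextIO) -> int:
    """Hook entry point: read one call as JSON, answer on stdout, exit code 0."""
    decision = decide(json.load(stdin))
    if decision is not None:
        stdout.write(json.dumps(decision) + "\n")
    return 0
```

To use it as a script, the last line of the hook body would be something like `sys.exit(run(sys.stdin, sys.stdout))` after `import sys`. Keeping the decision in a pure function makes the rule easy to unit test without the agent in the loop.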
Prompt hooks put a separate LLM in the loop as a judge that evaluates the call in context; reach for these when the decision needs reasoning rather than a fixed rule.

A real example from our own internal test agent: when the agent tries to list files through the shell with ls or dir, a hook blocks the call. The agent absorbs the signal, reconsiders, and reaches for the ListDir tool instead. The hook did not argue with a human. It shaped what happened next.

As with the grid, configure nothing and the agent behaves exactly as it does today. Both are additive.

Authoring one is a short form. You name the hook, pick the event (Pre Tool Use, so it runs before the call), and set a tool matcher, either picked from the tool menu or written as a regex like (FetchWebpage|SearchMemory) with anchors and lookaheads when you need them, so the hook fires only on the calls you care about. You set a timeout and a fail mode (Block, so a hook that errors or hangs stops the call rather than waving it through), and you write the body in Bash or Python.

A command hook reads the call as JSON on stdin (the event name, the tool name, its parameters, and the call id) and answers on stdout. Print nothing and exit zero to allow. Return a block decision with a reason to stop the call, and that reason is what the agent reads back. You can also substitute: run a cheaper or safer version yourself, block the real call, and hand your own output back as the result, so the agent never runs the expensive or risky original.

    #!/bin/bash
    input=$(cat)
    tool=$(echo "$input" | jq -r '.tool_name')

    # Block one tool, with a reason the agent will read
    if [ "$tool" = "ExampleToolName" ]; then
      echo '{"decision":"block","reason":"Blocked ExampleToolName by hook policy."}'
      exit 0
    fi

    # Otherwise allow: print nothing and exit 0
    exit 0

You can find these under Builder > Hooks.

Each layer holds on its own

The layers stack. Identity is the floor: your RBAC assignments decide what the agent can reach at all.
Permissions, the grid and hooks together, decide what it does with a tool. You author each layer, each one holds whether or not the layer above it behaves as expected, and all of it configures through the same ARM and Bicep surface your platform team already uses, reproducible the way the rest of your Azure estate is.

The upgrade path is additive and non-breaking. Existing agents keep working. Turn on each control when you are ready, in the order your governance requires.

There is more coming. We run Azure SRE Agent inside Microsoft on our own production workloads, so we feel the same gaps you do, and the next round is shaped by what we hear from teams running it in production today. Which control is doing the most for you, and which one are you still waiting on? Let us know and thank you!

Getting started

- Create new SRE Agent: https://aka.ms/sreagent
- SRE Agent Documentation: https://aka.ms/sreagent/newdocs
- SRE Agent recipes: https://aka.ms/sreagent/recipes
- Build 2026 Announcement: https://aka.ms/Build26/blog/SREAgent

Designing for High Availability: The Operational Reference for Running a Geo-Replicated ACR

By Johnson Shi, Zoey (Zhuyu) Li, Huangli Wu

Introduction

Three of the most common questions we hear from enterprise teams running geo-replicated Azure Container Registries (ACR) are:

- "How do I control which region serves my traffic?" When my AKS clusters are spread across regions, can I pin each one to its co-located replica, or am I stuck with however the global endpoint routes?
- "What happens during a regional incident: is failover automatic or do I have to act?" If the registry in one region degrades, does the global endpoint reroute on its own, or do I need to manually disable the affected replica?
- "What happens after the region recovers: does traffic return on its own?" Is there a cooldown, a quarantine, or any manual step before failback?

We answer those head-on, then go deeper on the operational details that come up when you actually run a geo-replicated registry: authentication across endpoint switches, throttling under load concentration, eventual-consistency failure modes, home region outage scope, webhooks, and private endpoint interaction.

We draw on the official geo-replication docs, the global endpoint health-aware failover blog, the regional endpoints engineering design implementation, the regional endpoints public preview and private preview announcements, and the ACR reference for various registry endpoints. This post also draws on notes from the ACR product team on roadmap items that aren't yet documented elsewhere.

Key Takeaways

- Health-aware failover is automatic. When the registry in a region degrades, the global endpoint reroutes away from it on the order of minutes, evaluated per-registry. No customer action required.
- Failback is automatic too. Once health-aware failover marks a region healthy again, the global endpoint resumes routing to it. There is no cooldown period.
- Health-aware failover applies only to global endpoint operations.
  It does not apply to regional endpoints (you're talking to one replica, period) or to dedicated data endpoints (the redirect is per-region).
- Health-aware failover is not triggered by throttling. It responds to regional ACR service health and Azure infrastructure health, not HTTP 429 responses. Use regional endpoints to manage per-replica throttling.
- Regional endpoints (Step 2a) give you explicit per-region URLs for workloads that need affinity, capacity planning, push/pull consistency, troubleshooting, or client-side failover. Use myregistry.<region>.geo.azurecr.io. Regional endpoints are available on Premium SKU registries.
- For workloads that don't need pinning, do nothing (Step 2b). The global endpoint plus health-aware failover handles routing automatically.
- Re-authenticate when switching endpoints. Each global or regional endpoint is its own authenticated surface; re-auth via az acr login, SDK auth, or the Kubernetes ACR credential provider on endpoint change.
- Don't run a long-lived DNS cache for the global endpoint. ACR purges DNS server-side on disable and during failover; a long-lived client cache works against that.
- For production workloads, enable dedicated data endpoints for security and DNS predictability on layer downloads.
- ACR is working on bounded staleness consistency for cross-replica eventual-consistency failure modes; see the FAQ.

Background

What is ACR geo-replication?

Geo-replication is a Premium SKU feature that turns a single ACR registry into a multi-region, multi-write service. Every geo-replica in every region is writable (you can push, pull, and delete from any of them), and content syncs asynchronously between replicas under an eventual consistency model. Per-push replication time scales with the size and number of images being pushed. Similarly, when creating a new geo-replica, the time to populate the new geo-replica scales with the total size of the registry.

A geo-replicated registry exposes a global endpoint at myregistry.azurecr.io. Behind that endpoint, ACR uses an internal traffic manager to direct each request to the replica with the best network performance profile for the caller: usually the closest replica, but not always. When clients are equidistant from multiple replicas, or when the closest replica is experiencing Azure infrastructure degradation, requests may be routed elsewhere.

A geo-replicated registry also exposes a regional endpoint at myregistry.<region>.geo.azurecr.io, which lets clients pin API requests to a specific geo-replica instead of using the global endpoint, whose routing among geo-replicas is managed by Azure.

Zone redundancy is always enabled for geo-replicas in regions where Azure has multiple availability zones. In those regions, ACR automatically spreads replica data across multiple availability zones within each region to protect against zonal outages.

Endpoints and data endpoints: what goes where

A common point of confusion: when you push or pull, not every request goes to the same place. The registry endpoints (global endpoint and regional endpoints), as well as the data endpoint, do different jobs. Your choice of data endpoint configuration has real consequences for security and resilience.

Two kinds of traffic flow during a typical pull:

- Registry API traffic: authentication, manifest reads/writes, tag resolution, referrers, repository operations, blob location lookups, listing, metadata. This is everything except the actual layer (blob) bytes. All these API requests go to the global endpoint (myregistry.azurecr.io) or, if you've pinned your clients to call these APIs on a specific geo-replica, that geo-replica's regional endpoint (myregistry.<region>.geo.azurecr.io). Behind the scenes, the global endpoint internally proxies these requests to a specific geo-replica.
- Layer (blob) downloads: when the client asks for a blob, the registry doesn't serve the bytes itself.
  It returns an HTTP 307 redirect to a regional data endpoint (separate endpoint from the global endpoint or regional endpoints), and the client follows the redirect to download the layer from that region.

Where that 307 sends you depends on whether you've enabled the registry's dedicated data endpoints feature:

| Configuration | Layer downloads redirect to |
|---|---|
| Default (no dedicated data endpoints) | *.blob.core.windows.net (the underlying Azure storage account) |
| Dedicated data endpoints enabled | myregistry.<region>.data.azurecr.io for the region you were routed to |
| Private endpoints enabled | myregistry.<region>.data.azurecr.io for the region you were routed to |

Regional by design. Dedicated data endpoints always land you on a specific geo-replica's data endpoint; there is no "global data endpoint." With the global endpoint as your registry endpoint, the 307 redirect picks the data endpoint for whichever region the global endpoint chose to serve you. With a regional endpoint pinned to a specific region, the 307 always redirects you to that same region's data endpoint, never cross-region.

Why dedicated data endpoints matter. Dedicated data endpoints are a Premium SKU feature that exists primarily to address security and firewall scoping. By default, layer downloads redirect to *.blob.core.windows.net, a wildcard storage FQDN. Firewall rules to allow that wildcard either let all Azure storage accounts through or none of them, which raises data exfiltration concerns and isn't tightly scoped to your registry. Dedicated data endpoints replace the wildcard with a fully qualified domain in your registry's own domain (myregistry.<region>.data.azurecr.io), so firewall rules can be scoped tightly to your specific registry, in your specific regions.

That same design choice can also make layer downloads more predictable during routing changes.
With dedicated data endpoints, the data endpoint FQDN is known ahead of time and lives in the registry's domain: one predictable hostname per region, configured once. Without them, the layer download has to resolve a wildcard storage FQDN that points to whichever storage account the registry happens to have provisioned, which is a separate DNS resolution path with its own routing behavior and its own caching profile. Dedicated data endpoints simplify the DNS picture by aligning the data path with the registry path and keeping the entire pull experience inside one set of predictable, scoped FQDNs. For any geo-replicated registry where security and high availability matter, enable dedicated data endpoints.

Note: Health-aware failover applies only to operations against the global endpoint, not to regional endpoints or dedicated data endpoints. Take note that health-aware failover only kicks in and directs traffic away from a geo-replica when an Azure region is experiencing significant infrastructure degradation. At this stage, it does not kick in to redirect traffic to another geo-replica if a client's data plane API requests are throttled. See the relevant section below for the full scope of when health-aware failover does and does not kick in.

The three traffic control tools

ACR geo-replication gives you three complementary tools for controlling where traffic lands. Each one solves a different class of problem, and customers most often run into trouble when they reach for the wrong one.
We name them up front and use these names throughout the post:

| Tool | Who controls it | What it does | Use cases |
|---|---|---|---|
| Health-aware failover | Platform (automatic) | Reroutes the global endpoint away from a region whose registry can't reliably serve requests | Regional incidents, automatic recovery |
| Replica enable/disable for global routing | Customer (manual) | Excludes a specific replica from global endpoint routing without deleting it; data continues syncing | DR rehearsals, planned maintenance, quarantining a replica without losing it |
| Regional endpoints | Customer (per request) | Dedicated per-region URLs (myregistry.<region>.geo.azurecr.io) that bypass the internal traffic manager entirely | Pinning AKS clusters to co-located replicas, push/pull consistency, capacity planning, troubleshooting, client-side failover |

Health-aware failover and replica enable/disable both act on the global endpoint. Regional endpoints are a separate URL surface that coexists with the global endpoint; enabling them does not disable the global endpoint myregistry.azurecr.io. You can use both simultaneously and choose per workload.

The behavior in question

When the registry in one region experiences a real degradation, there are three possible answers to "what happens?":

- (A) Nothing automatic. The customer must manually disable the affected region's endpoint to stop traffic from being routed there.
- (B) The system detects the regional front-door failure and reroutes within seconds.
- (C) A per-registry health evaluation detects the degradation and reroutes the global endpoint within minutes, with no customer action. After the region recovers, routing resumes automatically.

The answer today is (C). Before health-aware failover, customers were stuck closer to (A): the system could see whether the regional reverse proxy responded, but not whether the registry could actually serve real pull and push traffic end to end. Health-aware failover closes that gap.

We walk through all three tools in the next section, in order: setting up geo-replication, using regional endpoints to pin specific workloads, keeping the global endpoint for everything else, the manual replica disable mechanism, re-enabling participation in global routing, and what to expect when health-aware failover triggers.

Walkthrough

The following steps assume an existing Premium SKU registry and the Azure CLI logged in. We use myregistry as the registry name, myrg as the resource group, and eastus as the home region. Substitute your own values (<your-registry>, <your-rg>, and <your-region>) for your environment.

Prerequisites

- A Premium SKU ACR registry (geo-replication requires Premium)
- Azure CLI (az) installed and logged in
- For regional endpoints (Step 2a): Azure CLI 2.86.0 or later. All regional endpoints commands (--regional-endpoints, az acr show-endpoints, az acr login --endpoint) are available natively in Azure CLI 2.86.0+. If you previously installed the acrregionalendpoint private preview CLI extension, uninstall it with az extension remove --name acrregionalendpoint to prevent conflicts with the built-in CLI commands.

Step 1: Add a West US replica to a registry that lives in East US

Geo-replication requires the Premium SKU. The create call below fails on Basic or Standard.

    # Confirm the registry is Premium
    az acr show --name myregistry --resource-group myrg \
      --query sku.name --output tsv
    # Premium

    # Create a West US geo-replica
    az acr replication create --registry myregistry --location westus

    # Confirm both replicas are present
    az acr replication list --registry myregistry --output table

    NAME    LOCATION    PROVISIONING STATE    STATUS    REGION ENDPOINT ENABLED
    ------  ----------  --------------------  --------  -----------------------
    eastus  eastus      Succeeded             online    True
    westus  westus      Succeeded             online    True

Pushes and pulls continue working through the existing replica throughout initial sync.
Because the registry is multi-region, multi-write, the existing replica keeps serving traffic while the new replica catches up in the background. Initial replica seeding time is a function of registry size (the total number and cumulative size of images already in the registry that need to be replicated to the new replica), not the size of any single image.

Step 2a: Pin workloads to specific regions using regional endpoints

Use regional endpoints when a workload needs explicit per-region control. The five common cases:

- Regional affinity: an AKS cluster in East US should pull from the East US replica, every time, without ever hopping to a more distant replica because of a network performance fluctuation.
- Predictable routing: workloads that need to know exactly which replica will serve them, for benchmarking, capacity planning, or in-region traffic SLAs.
- Push/pull consistency: pinning both ends of a publish-then-deploy flow to the same replica eliminates eventual-consistency races.
- Troubleshooting: reproducing an issue on a specific replica requires sending traffic to that specific replica.
- Client-side failover: customers with their own health checks and business rules want to implement failover on their own terms, on signals only they can see.

Enable regional endpoints on the registry:

    az acr update -n myregistry -g myrg --regional-endpoints enabled

When enabled, ACR automatically creates per-region login server URLs for every existing geo-replica. No per-region configuration is needed.

Note: Regional endpoints can be enabled on any Premium SKU registry, even without geo-replication. A registry without geo-replication has a single geo-replica in the home region, which gets one regional endpoint URL. However, the feature is most useful when your registry has at least two geo-replicas, where you can pin different workloads to different replicas for routing control and capacity distribution.

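To make the regional-affinity case concrete, the sketch below picks an image reference for a workload: it pins to the co-located replica's regional endpoint when one exists and otherwise falls back to the global endpoint. It is an illustrative helper, not part of any ACR SDK; `image_reference` and its parameters are hypothetical names, and the fallback to the global endpoint is a design choice assumed here rather than something ACR does for you.

```python
from typing import Iterable


def image_reference(registry: str, repository: str, tag: str,
                    cluster_region: str, replica_regions: Iterable[str]) -> str:
    """Build an image reference pinned to the cluster's own region if possible.

    Assumes replica_regions lists the registry's geo-replica locations
    (for example, taken from `az acr replication list`).
    """
    region = cluster_region.lower()
    known = {r.lower() for r in replica_regions}
    if region in known:
        # Regional endpoint: requests go to this one replica only.
        host = f"{registry}.{region}.geo.azurecr.io"
    else:
        # No co-located replica: let the global endpoint choose.
        host = f"{registry}.azurecr.io"
    return f"{host}/{repository}:{tag}"
```

For example, `image_reference("myregistry", "myapp", "v1", "eastus", ["eastus", "westus"])` yields the East US regional URL, which is the value you would place in the `image:` field of that cluster's deployment manifest.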
Push to a specific region using its regional endpoint:

    # Log in to the West US regional endpoint
    az acr login --name myregistry --endpoint westus

    # Tag and push using the regional endpoint URL
    docker tag myapp:v1 myregistry.westus.geo.azurecr.io/myapp:v1
    docker push myregistry.westus.geo.azurecr.io/myapp:v1

Pin AKS deployments to their co-located replica by using regional endpoint URLs in the deployment manifest. The example below shows two clusters in different regions; each cluster references the regional endpoint for its own region's replica (assuming replicas exist in both eastus and westeurope):

    # East US-based AKS cluster pulls from the East US replica
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-eastus
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.eastus.geo.azurecr.io/myapp:v1
    ---
    # West Europe-based AKS cluster pulls from the West Europe replica
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-westeurope
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.westeurope.geo.azurecr.io/myapp:v1

This eliminates cross-region pulls when global routing would otherwise prefer a different replica for a given client, and it gives you a per-region traffic profile you can plan capacity against.

Regional endpoint operational tips

View all endpoints. Use az acr show-endpoints to see all endpoint URLs for your registry: global, regional (if enabled), and dedicated data endpoints (if enabled):

    az acr show-endpoints --name myregistry --resource-group myrg

Import from a specific geo-replica. When importing images between registries, you can use a regional endpoint to import from a specific geo-replica of the source registry. This is useful when you want predictable network paths or need to import from a replica in a specific region:

    az acr import \
      --name mydownstreamregistry \
      --source myupstreamregistry.westeurope.geo.azurecr.io/myapp:v1 \
      --image myapp:v1

Firewall rules for regional endpoints.
If you use firewall rules, allow access to the following endpoints for each geo-replica that clients connect to:

| Endpoint | Purpose |
|---|---|
| myregistry.<region>.geo.azurecr.io | Regional endpoint for registry operations |
| myregistry.azurecr.io | Global endpoint (if also used) |
| myregistry.<region>.data.azurecr.io | Layer downloads (if using private endpoints or dedicated data endpoints) |
| *.blob.core.windows.net | Layer downloads (if not using private endpoints or dedicated data endpoints) |

For the full list of endpoint types and FQDN patterns, see the ACR reference for various registry endpoints.

DNS-based routing without changing manifests. If you don't want to maintain different deployment manifests per region, you can keep all manifests pointing to the global endpoint (myregistry.azurecr.io) and use software-defined networking or a regional traffic manager to resolve the global endpoint to the appropriate regional endpoint based on the originating region's traffic. This achieves the same co-location goals as regional endpoints (predictable routing and reduced latency) without embedding region-specific URLs in your deployment manifests.

Step 2b: Keep using the global endpoint for everything else

For workloads that don't need explicit pinning, do nothing. The global endpoint at myregistry.azurecr.io continues to work exactly as before, and the global endpoint plus health-aware failover gives you intelligent routing across replicas without configuration. ACR picks the best replica for each client based on network performance and reroutes during regional incidents.

Regional endpoints coexist with the global endpoint; enabling them does not disable myregistry.azurecr.io. You can use both simultaneously and choose per workload, mixing pinned workloads (Step 2a) with workloads that ride the global endpoint (Step 2b) in the same registry.

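When a registry mixes pinned and global-endpoint workloads, the firewall table from the operational tips above translates into a small allowlist per client network. The sketch below is one way to generate it, using only the FQDN patterns listed in that table; `firewall_allowlist` and its parameter names are hypothetical, and the flag `dedicated_data_endpoints` is assumed to be true whenever dedicated data endpoints or private endpoints are in use.

```python
from typing import Iterable, List


def firewall_allowlist(registry: str, regions: Iterable[str],
                       use_global_endpoint: bool = True,
                       dedicated_data_endpoints: bool = True) -> List[str]:
    """List the hostnames a client needs to reach for the given geo-replicas."""
    hosts: List[str] = []
    if use_global_endpoint:
        hosts.append(f"{registry}.azurecr.io")
    for region in regions:
        # Regional endpoint for registry API operations.
        hosts.append(f"{registry}.{region}.geo.azurecr.io")
        if dedicated_data_endpoints:
            # Layer downloads stay inside the registry's own domain.
            hosts.append(f"{registry}.{region}.data.azurecr.io")
    if not dedicated_data_endpoints:
        # Default layer download target: the wildcard storage FQDN.
        hosts.append("*.blob.core.windows.net")
    return hosts
```

Comparing the two outputs (with and without dedicated data endpoints) shows why the dedicated option scopes more tightly: the wildcard storage entry disappears and every remaining name is specific to the registry.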
Step 3: Take a replica out of global endpoint routing

Use this when you need to keep a replica alive but stop it from serving global-endpoint traffic: for DR rehearsals, planned maintenance, or troubleshooting an isolated replica.

    # Exclude the West US replica from global endpoint routing
    az acr replication update --registry myregistry --name westus \
      --global-endpoint-routing false

Confirm the change:

    az acr replication list --registry myregistry --output table

    NAME    LOCATION    PROVISIONING STATE    STATUS    REGION ENDPOINT ENABLED
    ------  ----------  --------------------  --------  -----------------------
    eastus  eastus      Succeeded             online    True
    westus  westus      Succeeded             online    False

Requests to myregistry.azurecr.io no longer route to West US. The replica still receives replicated content and continues to replicate its own content out to other replicas, and storage quota and per-replica costs continue to accrue. If regional endpoints are enabled, the West US regional endpoint URL also continues to work; --global-endpoint-routing controls only the replica's participation in global endpoint routing.

A note on naming. The CLI flag --global-endpoint-routing (on az acr replication update) and the regional endpoints feature (enabled via az acr update --regional-endpoints enabled) are two different things despite the similar names. --global-endpoint-routing controls whether a replica participates in global endpoint routing. The regional endpoints feature creates per-region URLs (myregistry.<region>.geo.azurecr.io) that bypass the global endpoint entirely. They are independent controls.

In Azure CLI 2.86.0 and later, the old --region-endpoint-enabled flag has been renamed to --global-endpoint-routing. The old flag name is deprecated and will be removed in Azure CLI 2.87.0 (June 2026). If you have existing scripts or automation that use --region-endpoint-enabled, update them to use --global-endpoint-routing.

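For teams with many scripts to update, the rename can be applied mechanically. The following is a minimal sketch, assuming the new flag accepts the same values as the old one (the post describes it as a rename); `migrate_flag` is a hypothetical helper, and any rewritten script should still be reviewed before it runs against a registry.

```python
import re
from typing import Tuple

OLD_FLAG = "--region-endpoint-enabled"
NEW_FLAG = "--global-endpoint-routing"

# Match the old flag only as a whole token, so a longer flag that merely
# starts with the same text is left untouched.
_OLD_FLAG_PATTERN = re.compile(r"(?<![\w-])" + re.escape(OLD_FLAG) + r"(?![\w-])")


def migrate_flag(script_text: str) -> Tuple[str, int]:
    """Return the rewritten script text and the number of replacements made."""
    return _OLD_FLAG_PATTERN.subn(NEW_FLAG, script_text)
```

A count of zero is a quick way to confirm that a script needs no change; the unrelated `--regional-endpoints` flag is never touched because it does not match the old flag name.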
CLI flags quick reference:

| Flag | Scope | Purpose |
|---|---|---|
| --regional-endpoints | Registry-level (az acr create or az acr update) | Enables dedicated regional endpoint URLs (myregistry.<region>.geo.azurecr.io) for all geo-replicas. |
| --global-endpoint-routing | Per-geo-replica (az acr replication create or az acr replication update) | Controls whether the global endpoint routes traffic to a specific geo-replica. Set to false to temporarily exclude a geo-replica from global routing. |
| --data-endpoint-enabled | Registry-level (az acr create or az acr update) | Enables dedicated data endpoints (myregistry.<region>.data.azurecr.io) for layer blob downloads. Auto-enabled when at least one private endpoint is configured. |

This bidirectional sync during disable is intentional. When you re-enable the replica, every image pushed to the registry while the replica was disabled (from any region) is already present, so the replica can serve traffic immediately with no catch-up window. If we stopped syncing on disable, re-enabling would leave the replica with stale data and force a long catch-up before it could safely serve pulls.

Step 4: Re-enable the replica to participate in global endpoint routing

Re-enable the replica:

    az acr replication update --registry myregistry --name westus \
      --global-endpoint-routing true

Listing the replicas again, as in Step 3, shows both participating in global routing:

    NAME    LOCATION    PROVISIONING STATE    STATUS    REGION ENDPOINT ENABLED
    ------  ----------  --------------------  --------  -----------------------
    eastus  eastus      Succeeded             online    True
    westus  westus      Succeeded             online    True

There is no cooldown. The global endpoint resumes routing requests to the West US replica as soon as the change takes effect on ACR's side. Because data continued syncing while the replica was disabled (Step 3), the replica is immediately ready to serve pulls, with no catch-up window.

Note on DNS during disable/enable.
When you take a replica out of global routing, ACR purges its own DNS records for that replica from the global endpoint on a fast path; there is no waiting on a published TTL on ACR's side. If clients run their own DNS cache for the global endpoint, however, those clients will keep resolving to the disabled replica until the client cache expires. We can't control client-side caches. The recommendation: do not run a long-lived DNS cache for the global endpoint. A short-lived DNS pin for the duration of a single push (covered in the DNS and Client-Side Considerations section) is fine and even helpful, but a long-lived DNS cache will make --global-endpoint-routing false look broken from the client's perspective.

Step 5: What to expect when health-aware failover triggers

Health-aware failover is automatic. ACR evaluates registry health on a per-registry basis, and when a registry in a region can't reliably serve requests, the global endpoint reroutes that registry's traffic to a healthy replica. There is no customer-invocable trigger; that's the point.

End-to-end timing is on the order of minutes: fast enough to catch real regional degradation, slow enough to ride out transient errors that resolve on their own. DNS TTL may add additional propagation delay before all clients switch to the new region.

Scope of health-aware failover. Health-aware failover applies only to operations against the global endpoint, that is, the registry API calls (auth, get manifest, get tag, get referrers, get blob location). It evaluates health when those API calls come in; it does not trigger mid-operation. Two important consequences:

- Regional endpoints are not in scope. When you talk to a regional endpoint like myregistry.westus.geo.azurecr.io, you're talking to that one replica. There is no automatic reroute. If you've pinned a workload to a regional endpoint and that region degrades, you implement client-side failover by switching the workload to a different regional endpoint.
Dedicated data endpoints are not in scope. Once a registry endpoint has redirected you to a dedicated data endpoint, you stay on that region's data endpoint for the duration of the layer download. There is no automatic reroute of an in-flight blob download. The region targeted by the redirect is decided up front by whichever registry endpoint served the blob-location call: the global endpoint chooses based on its per-registry health evaluation, and a regional endpoint always targets its own region.

The signals you can use to confirm a failover is in progress:

    # Check replication status
    az acr replication list --registry myregistry --output table

You can also check Resource Health for the registry in the Azure portal: navigate to your registry and select Resource health under the Help section to see platform-side degradation signals.

You'll typically see:

- Increased pull latency as traffic shifts to a more distant replica
- Resource Health flagging known issues in the affected region
- Replication status indicating which replicas are online

After the region recovers, the per-registry health evaluation marks it healthy again and the global endpoint resumes routing: automatic, no cooldown, no customer action. Note that health is evaluated per registry, not per region: if a degradation affects only a subset of registries in a region, only those registries are rerouted, and other registries in the same region continue to be served locally with no unnecessary latency penalty.

Not triggered by throttling. Health-aware failover is DNS-based and responds to regional ACR service health and Azure infrastructure health. It does not reroute traffic based on HTTP 429 (throttling) responses. If a geo-replica is throttling your requests but the region's infrastructure is healthy, the global endpoint continues routing you to that geo-replica. To manage throttling, use regional endpoints to spread workloads across multiple geo-replicas for better capacity distribution.
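Spreading workloads across regional endpoints, as suggested above for throttling, needs some rule for deciding which workload talks to which replica. The following is a minimal sketch of one possible rule, assuming workloads with no regional affinity (for example, CI jobs). The helper name pick_regional_endpoint, the registry name, and the region list are all hypothetical, and the hashing scheme is an illustration rather than anything ACR prescribes. It hashes the workload name with cksum so that each workload maps to the same region on every run.

```shell
# pick_regional_endpoint REGISTRY WORKLOAD REGION [REGION...]
# Hypothetical helper (not part of the Azure CLI): prints a regional endpoint
# for WORKLOAD, chosen by hashing the workload name over the given regions.
pick_regional_endpoint() {
  local registry="$1" workload="$2" hash
  shift 2
  if [ "$#" -eq 0 ]; then
    echo "at least one region is required" >&2
    return 1
  fi
  # cksum prints "<crc> <byte-count>"; the CRC serves as a stable hash.
  hash=$(printf '%s' "$workload" | cksum | cut -d ' ' -f 1)
  # Drop (hash mod region-count) regions so that "$1" is the selected one.
  shift $(( hash % $# ))
  echo "${registry}.$1.geo.azurecr.io"
}

# Example (placeholder names):
#   pick_regional_endpoint myregistry nightly-build eastus westus westeurope
```

Workloads that do have a natural home region, such as an AKS cluster co-located with a replica, are better pinned to that replica's endpoint directly than assigned by hash.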
Note on long-running pushes during a failover. A multi-layer push that spans a failover boundary can land layers and the manifest on different replicas — exactly the failure mode that DNS bouncing produces during a single push. ACR is actively tightening health-aware failover behavior to minimize cross-replica scatter during these scenarios, and the recommendation today remains: pin pushes to a single replica via a regional endpoint when push/pull consistency matters. Common Questions Q1. Performance impact during initial replica creation on a live registry Because ACR is multi-region, multi-write, the existing replica continues serving pull and push traffic throughout the period when a new replica is being seeded. Replication is asynchronous and content propagates in the background; the time to populate a new geo-replica scales with the size of the registry — the cumulative number and total size of images already in the registry — not with any single image. The docs do not publish a quantified degradation percentage or a throttling window for this period, and they do not promise zero performance impact — the safe operating assumption for a live production registry is that existing replicas continue serving traffic normally, with the new replica catching up in the background. Q2. Restricted/updating state during initial sync There is no "restricted" state for the registry during normal replica creation. Writes, control-plane operations, and pushes/pulls against existing replicas continue normally. The only time configuration changes are unavailable is during a home region outage — see the relevant FAQ item later on for the full data-plane-versus-control-plane breakdown. Q3. Cooldown periods and non-straightforward failback scenarios There is no cooldown before failback, manual or automatic. Re-enabling a replica's participation in global endpoint routing takes effect immediately on ACR's side. 
Health-aware failover returns traffic to a region as soon as its per-registry health evaluation passes again. The failback case that is not seamless: if a recently pushed image has not yet replicated to the failover region, a pull from that region may not find the image until replication catches up. This is a function of eventual consistency, not failback timing — and it's part of a broader class of issues we cover in Q4. Q4. Common pull and push failure modes during the eventual-consistency window DNS bouncing during a single push is one well-known problem, but it isn't the only one. The eventual-consistency window between geo-replicas surfaces in several recurring failure modes worth knowing about: Push-then-immediate-pull-cross-region. Pushing myapp:v1 to one region and immediately pulling it from a different region can fail with manifest unknown until replication catches up. This shows up most painfully in CI/CD pipelines where one CI runner pushes an image and thousands of pods across other regions all try to pull from their local geo-replicas at the same time. Today, customers work around this with indeterminate sleeps before scheduling expensive compute, or with retry logic, or by waiting on a replication-complete signal — none of which is a clean planning story. Tag overwrite races. Pushing myapp:v1 , then re-pushing myapp:v1 shortly after with a fix (same tag, different digest), can leave different replicas resolving the same tag to different digests during the eventual-consistency window. Delete propagation. Deleting a tag or repository in one region takes some time to propagate to other replicas. Pulls from regions where the delete hasn't yet propagated can return the supposedly-deleted content. Mid-push failover scatter. A multi-layer push that spans a health-aware failover boundary or a DNS bouncing event can land layers on one replica and the manifest on another, surfacing as manifest validation errors or blob unknown on subsequent pulls. 
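The push-then-immediate-pull race above is commonly handled with retry logic on the pulling side. The following is a minimal sketch of retry with exponential backoff in a POSIX-style shell. The helper name retry_with_backoff is hypothetical, and the attempt count and delays in the example are placeholders, not ACR recommendations.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, doubling the delay
# after each failure, and gives up after MAX_ATTEMPTS attempts.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example (placeholder values): pull from a regional endpoint, retrying while
# replication catches up.
#   retry_with_backoff 5 2 docker pull myregistry.westeurope.geo.azurecr.io/myapp:v1
```

A retry loop only bounds the wait. It does not address tag overwrite races, where a pull can succeed but resolve to an older digest.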
What ACR is doing about this. We're working on bounded staleness consistency for pushed images across all geo-replicas worldwide, which addresses these four failure modes directly. This will be covered in an upcoming blog post. If you're hitting eventual-consistency brittleness today and want to talk through your scenario, reach out to us on the Azure Container Registry GitHub repository — we want the customer signal to land in the design. Mitigations available today: Pin pushes to a single replica via a regional endpoint. Every sub-request in the push — login, blob uploads, manifest upload — goes to the same replica, eliminating the DNS bouncing and mid-push scatter classes entirely. Use a short-lived client-side DNS cache like dnsmasq scoped to the duration of a single push, only when you're not using regional endpoints. Do not run a long-lived DNS cache for the global endpoint — it interferes with --global-endpoint-routing false and with health-aware failover routing. Build retry logic into pulls that immediately follow a cross-region push. Either retry with backoff or check replication status with ACR webhooks before pulling. ACR can detect and notify you when an image or tag is available for pull in a geo-replica (say geo-replica B), after it has been pushed to another geo-replica (geo-replica A) and background replication has succeeded to geo-replica B. Design publish steps to be idempotent so retries triggered by mid-push failover are safe. Q5. Auth behavior across endpoint switches For safety, treat each global endpoint and each regional endpoint as its own authenticated surface. All registry APIs except the actual blob downloads (auth, manifests, tag resolution, referrers) flow through whichever endpoint you've chosen. If you switch from the global endpoint to a regional endpoint, or from one regional endpoint to another, re-authenticate. 
That means az acr login , fresh SDK auth, or — for AKS — letting the Kubernetes ACR credential provider handle re-auth, which it does automatically when the endpoint changes. Q6. Throttling under failover and pinning Throttling limits on registry API operations are per-replica, not per-registry. This has two operational implications: During health-aware failover, traffic that was spread across replicas can shift heavily onto whichever replicas remain in the global endpoint's routing pool. Capacity plan to spread traffic across two or three healthy replicas during a failover scenario rather than concentrating onto one — the global endpoint's routing already does this for you when multiple healthy replicas exist, but registries with only two regions configured can hit per-replica limits more easily during a failover. To mitigate, use regional endpoints to spread workloads across multiple geo-replicas and plan per-replica capacity. When pinning via regional endpoints (Step 2a), you concentrate traffic on whichever replica you've pinned to. If you've pinned all your AKS clusters to a single regional endpoint, you may hit that replica's per-region throttling limits at peak. Mitigations: pin different workloads to different regional endpoints across multiple regions for better topology mapping and capacity distribution, or use the global endpoint (Step 2b) for workloads where you don't need explicit pinning so ACR's routing can spread load. We're also working on improving the throttling metrics surfaced during health-aware failover events. Note: Health-aware failover does not reroute traffic based on HTTP 429 (throttling). If you're experiencing throttling but the region's infrastructure is healthy, the global endpoint continues routing you there. Use regional endpoints to explicitly spread load across replicas for capacity planning. Q7. Home region outage scope Geo-replication provides high availability for the data plane. 
During a home region outage, the control plane is unavailable, which means you can't create or delete replicas, modify network rules, or change replication settings until the home region recovers. ACR Tasks are also bound to the home region and don't run while it's unavailable. The data plane keeps working: Global endpoint continues routing pulls and pushes to healthy replicas. Regional endpoints continue working — you talk directly to specific replicas, and your client-side logic decides which region to use. Authentication, manifests, blob downloads, webhooks continue functioning through any healthy replica. The home region of a registry is fixed at creation and cannot be changed afterward. Microsoft's registry relocation guidance describes a redeployment procedure — creating a new registry in a different region — not an in-place change to an existing registry's home region. Note: If your registry uses a customer-managed key, review the key vault failover and redundancy guidance for maximum resilience. Key vault availability directly affects the registry's ability to encrypt and decrypt data. Q8. Webhooks during failover Webhooks fire from the replica that received the push. Because ACR also replicates content to other geo-replicas, webhooks fire from each geo-replica as the image syncs to it — so a single push results in webhook events from the receiving replica plus an event from each replica as replication completes. During a failover where pushes are routed to a different region, webhooks from those pushes fire from the new region; once the original region recovers and replication catches up, webhook events fire from there too. Webhook consumers should be designed to handle multiple events per pushed image and deduplicate as needed. Q9. 
Private endpoints with regional endpoints and dedicated data endpoints

When a private endpoint is created against a registry, the private endpoint covers all of the registry's endpoint surfaces: the global endpoint, every regional endpoint (if regional endpoints are enabled), and every regional dedicated data endpoint. A single private endpoint in one VNet can reach the global endpoint (which routes you to a suitable replica), any regional endpoint in the same or a different region, and any region's dedicated data endpoint for blob downloads.

The trade-off is private IP allocation: each endpoint surface consumes IPs in the VNet. With many replicas plus regional endpoints plus dedicated data endpoints all enabled, private endpoint creation can fail if the VNet runs out of available private IPs.

IP address consumption per feature:

| Configuration | IPs consumed per VNet |
| --- | --- |
| Initial private endpoint (global endpoint + home region dedicated data endpoint) | 2 |
| Each geo-replication region added | +1 (regional dedicated data endpoint) |
| Regional endpoints enabled | +1 per geo-replica |

Example: A registry with 3 geo-replicas (counting the home region) and regional endpoints enabled consumes 7 private IPs per VNet: 1 (global) + 3 (data) + 3 (regional). Without regional endpoints, the same registry requires 4 private IPs: 1 (global) + 3 (data).

Subnet sizing: Use at minimum a /27 (32 addresses) subnet for PE subnets on geo-replicated registries, and /24 where possible.

To check how many private IPs are already consumed on a subnet:

    az network vnet subnet show \
      --name <subnet-name> \
      --vnet-name <vnet-name> \
      --resource-group <resource-group> \
      --query "{addressPrefix:addressPrefix, usedIPs:length(ipConfigurations || \`[]\`)}" \
      --output table

See the ACR private endpoints documentation for the full IP-allocation math and sizing guidance.

Q10.
Geo-replica creation stuck for private endpoint-enabled registries

When creating a geo-replica for a registry that has private endpoints configured, the replica provisioning can get stuck in a Creating state if the identity performing the operation doesn't have sufficient permissions to create private endpoint networking resources.

Solution: Manually delete the geo-replica that got stuck in the provisioning state. Ensure the identity has the permission Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write before creating the geo-replica again.

Also verify that every PE subnet connected to the registry has free IP capacity. If any PE subnet across any connected VNet does not have enough free IPs, the replication provisioning fails and rolls back. The replica appears briefly in a Creating state and then is removed. The resulting error does not identify which subnet or VNet is exhausted.

Q11. Metrics, logs, and alerts for the three phases

We map each phase to the signals available in the Monitoring Guidance section below. The headline: Resource Health (in the Azure portal) and az acr replication list give you the platform-side signals; Azure Monitor platform metrics are collected automatically, and resource logs require Diagnostic Settings to be enabled on the customer side.

Behavior summary

| Scenario | Automatic? | Customer Action Required | Notes |
| --- | --- | --- | --- |
| Registry in a region degrades | Yes | None | Health-aware failover; per-registry; minutes-scale; global endpoint operations only |
| Region recovers after a degradation event | Yes | None | No cooldown |
| Pin AKS clusters to co-located replicas | No | Use regional endpoint URLs in deployment manifests (Step 2a) | Coexists with global endpoint |
| No pinning needed for most workloads | Yes | None; keep using myregistry.azurecr.io (Step 2b) | Global endpoint plus health-aware failover |
| Push/pull from the same replica (consistency) | No | Use a regional endpoint for both push and pull | Eliminates DNS bouncing and mid-push scatter |
| Capacity planning per region | No | Spread workloads across multiple regional endpoints | Per-replica throttling; avoid concentrating on one replica |
| DR rehearsal: take a replica out of global routing | No | az acr replication update --global-endpoint-routing false | Data continues syncing both directions; costs continue accruing |
| Re-enable replica participation in global routing | No | az acr replication update --global-endpoint-routing true | No cooldown; replica is immediately ready |
| Switch a workload between endpoints | No | Re-auth (az acr login, SDK auth, or Kubernetes ACR credential provider) | Each endpoint is its own authenticated surface |
| Initial replica seeding on a live registry | N/A | None | Existing replica continues serving traffic; seeding time scales with registry size |
| Long-running push during a failover | No | Retry; design publishes to be idempotent | Pin via regional endpoint to avoid mid-push scatter; ACR is tightening this behavior |
| Pull of a recently pushed image from a different region | No | Wait for replication, retry with backoff, or check replication status | Eventual consistency; bounded staleness consistency in development |
| Home region outage | Data plane: yes; control plane: no | Use global or regional endpoints for data plane operations | Control plane (replica config, network rules) requires home region |

DNS and Client-Side Considerations

DNS bouncing during a single
push is the most common geo-replication push problem in customer threads, and it warrants a section of its own. The failure mode. A docker push is a sequence of HTTP requests: blob uploads for each layer, then a manifest upload that references those layers by digest. If the Linux DNS resolver on the client doesn't cache myregistry.azurecr.io consistently for the duration of the push, individual sub-requests can resolve to different replicas. Because replication is eventually consistent, the manifest can land on a replica that doesn't yet have the layers it references, and the manifest validation fails. The two mitigations: Regional endpoints pin the push to a single replica end-to-end. Every sub-request — login, blob uploads, manifest upload — goes to the same replica. This is the cleanest fix and the one we recommend for any pipeline where push/pull consistency matters. A short-lived client-side DNS cache like dnsmasq scoped to the duration of a single push. For Linux VMs in Azure, follow the DNS name resolution options guidance. The pin should last the push and no longer. For other clients performing pushes, you can customize your stack's DNS resolver to have a similar short-lived DNS cache to pin the global endpoint's resolved DNS for only the duration of an image push operation. A note on long-lived DNS caching for the global endpoint. Don't run a long-lived DNS cache for myregistry.azurecr.io . ACR purges its own DNS records on the server side when a replica is taken out of global routing (Step 3) and during health-aware failover; a long-lived client-side cache will keep clients pointed at the old region after our purge, which makes both the manual disable mechanism and health-aware failover look broken from the client's perspective. Retry behavior: In-flight pushes during a failover may fail. Design publish steps to be idempotent so retries are safe. 
Pipelines that push in one region and immediately pull from a different region should retry with backoff or check replication status — eventual consistency means the pull may race ahead of replication. ACR is working on bounded staleness consistency that addresses this directly by enabling proxying (on ACR infrastructure) an image pull request from one geo-replica (if it does not have the image) to another geo-replica that has the image; see the relevant FAQ item. Note: Specific retry counts, back-off intervals, and push timeout values are application-layer decisions. The platform behavior is documented; the retry policy belongs to your client. Monitoring Guidance We map the three phases to the signals available from each source. Where a signal requires customer-side configuration, we flag it. Phase A: Initial replication (after creating a new replica) az acr replication list and az acr replication show — confirm the new replica reaches provisioningState: Succeeded and status: online , and view per-replica status. Azure Monitor platform metrics — push count, pull count, and other registry metrics are collected automatically and visible in the Azure portal under Metrics. No customer configuration is needed to view platform metrics. To export metrics or enable resource logs (detailed operation logs), configure Diagnostic Settings on the registry. Phase B: Failover (planned via replica disable, or automatic via health-aware failover) Per-replica regionEndpointEnabled state via az acr replication list — confirms whether a manual disable took effect, i.e. which replicas are currently eligible for global endpoint routing. Note: this flag reflects the manual configuration for configuring a geo-replica's global endpoint routing eligibility; it does not indicate whether health-aware failover has actively rerouted traffic away from a replica. 
Resource Health for the registry (in the Azure portal under Help > Resource health) — surfaces platform-side degradation signals during incidents. ACR does not yet expose a definitive "this region is currently serving your traffic" signal; Resource Health and client-side latency changes are the best available indicators. Pull latency from clients — increased latency from a more distant replica is the client-observable signal that traffic has rerouted. Azure Monitor platform metrics — visible per-region in the Azure portal Metrics blade. To export metrics or query them programmatically, enable Diagnostic Settings. Phase C: Failback (replica returns to global routing) az acr replication list — confirms regionEndpointEnabled: True (manual) or online status across all replicas (automatic). Pull latency normalizing as clients reach the recovered replica again. Resource Health clearing for the registry (visible in the Azure portal). Note: The health-aware failover blog calls out ongoing work to surface richer signals — including notifications for when routing changes and which region is currently serving your traffic. The signals listed above are what's available today. Pricing Considerations Storage billing vs. storage quota: Storage is billed per geo-replica — a 1 GiB image replicated to 5 geo-replicas is charged as 5 GiB of storage (1 GiB × 5 geo-replicas). However, storage quota (the tier's maximum storage limit) counts the image only once — the same 1 GiB image counts as 1 GiB toward your tier's maximum, not 5 GiB. Data transfer: Geo-replication can reduce costs by enabling in-region image pushes and pulls, which avoids cross-region data transfer charges during these push or pull operations. However, cross-region data transfer charges still apply when ACR replicates pushed content to other geo-replicas as part of eventual consistency. 
Disabled replicas still cost: When you take a replica out of global routing with --global-endpoint-routing false, storage and per-replica costs continue accruing because data continues syncing bidirectionally.

For more information, see ACR pricing.

Cleanup

Run these commands to undo the walkthrough setup. Order matters: disable regional endpoints before deleting replicas, since regional endpoint URLs depend on which replicas exist.

    # Disable regional endpoints if you enabled them in Step 2a
    az acr update -n myregistry -g myrg --regional-endpoints disabled

    # Re-enable any replicas you disabled in Step 3 (no-op if already enabled)
    az acr replication update --registry myregistry --name westus \
      --global-endpoint-routing true

    # Delete the West US replica created in Step 1
    az acr replication delete --registry myregistry --name westus

    # Confirm only the home region replica remains
    az acr replication list --registry myregistry --output table

Note: Replica deletion is a control-plane operation that requires the home region to be available. During a home region outage, replica configuration cannot be modified.

Summary Table

| Question | Answer |
| --- | --- |
| When should I use regional endpoints vs the global endpoint? | Use regional endpoints (Step 2a) for workloads that need affinity, predictable routing, push/pull consistency, troubleshooting, or client-side failover. Use the global endpoint (Step 2b) for everything else and let health-aware failover handle routing. |
| What should I enable for secure, resilient layer downloads? | Enable dedicated data endpoints. They scope firewall rules tightly to your registry and replace wildcard storage DNS with predictable per-region FQDNs. |
| How do I avoid DNS-bouncing manifest validation failures on push? | Pin pushes to a single replica via a regional endpoint. A short-lived client-side dnsmasq for the push duration is also fine if you're not using regional endpoints. |
| Should I run a long-lived DNS cache for the global endpoint? | No. ACR purges DNS server-side on disable and during failover; client-side caching works against that. |
| Do I need to re-auth when switching endpoints? | Yes. Each global or regional endpoint is its own authenticated surface. az acr login, SDK auth, or the Kubernetes ACR credential provider handles the re-auth. |
| What happens during a home region outage? | Data plane keeps working through any replica via the global endpoint or regional endpoints. Control plane operations (replica configuration, network rules) are unavailable until the home region recovers. The home region is fixed at registry creation. |
| What's ACR doing about eventual-consistency pain? | Bounded staleness consistency for cross-replica pushed images is in development and will be covered in an upcoming blog post. Reach out via GitHub if you want to share your scenario. |

For the full automation matrix (what's automatic, what requires customer action, and what to expect for each scenario), see the behavior summary above.

If you have further questions about ACR geo-replication routing, pinning, capacity planning, eventual consistency, or failover behavior, reach out to us on the Azure Container Registry GitHub repository or file feedback through the Azure portal.

Regional Endpoints for Azure Container Registry Geo-Replication — Now in Public Preview
By Johnson Shi, Zoey (Zhuyu) Li, Huangli Wu

What's new

Regional endpoints for geo-replicated Azure Container Registries are now in public preview. See the feature's official MS Learn documentation.

If you've been following since the private preview announcement, here's what changed:

- No feature flag registration. No subscription enrollment is required, so all Azure subscriptions and customers can now use this feature.
- No CLI extension. Regional endpoints commands are built into Azure CLI 2.86.0+ natively. If you installed the private preview acrregionalendpoint extension, uninstall it to avoid conflicts.
- Native CLI and portal support. With Azure CLI 2.86.0+, enable regional endpoints for all geo-replicas of a registry with az acr create --regional-endpoints enabled or az acr update --regional-endpoints enabled. The Azure portal also supports configuring regional endpoints natively.
- CLI flag rename for configuring a geo-replica's global endpoint routing (an existing separate feature). The existing flag --region-endpoint-enabled (on az acr replication create/update) has been renamed to --global-endpoint-routing.

Key clarifications:

- "--global-endpoint-routing" (formerly "--region-endpoint-enabled", on "az acr replication create / az acr replication update") controls whether a specific geo-replica participates in global endpoint routing. This is an existing feature that is different from the new registry-level "--regional-endpoints" feature discussed in this post.
- "--regional-endpoints" (on "az acr create / az acr update") enables or disables the regional endpoints feature at the registry level for all geo-replicas. This is the feature discussed in this post.

See the endpoint reference for the full breakdown of the various registry endpoints (global endpoints, regional endpoints, and data endpoints).

Regional endpoints are available on Premium SKU registries in all Azure public cloud regions.

What are regional endpoints?
Regional endpoints give you dedicated, per-region login server URLs for each geo-replica, following the URL pattern myregistry.<region>.geo.azurecr.io. For example:

    myregistry.eastus.geo.azurecr.io
    myregistry.westeurope.geo.azurecr.io

Regional endpoints coexist with the registry's global endpoint (myregistry.azurecr.io). Enabling regional endpoints doesn't disable the registry's global endpoint, which remains backed by Azure-managed routing. You can choose per workload:

- Use the global endpoint for automatic Azure-managed routing with health-aware failover, where Azure routes your requests to the geo-replica with the best network performance profile for the client.
- Use a regional endpoint when you need explicit control or routing to a specific geo-replica.

Other resources:

- For the full background on why regional endpoints exist and the problems they solve, see the private preview blog post.
- For the complete operational deep dive (health-aware failover, throttling considerations, storage quota and pricing, eventual consistency, home region outage behavior, DNS propagation, private endpoint interaction, capacity planning, and monitoring guidance), see How ACR geo-replication handles failover, failback, and traffic redirection.
- For the behind-the-scenes engineering implementation (architectural overview and the engineering system design of the feature), see Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints.

Getting started

Enable regional endpoints on an existing registry:

    az acr update -n myregistry -g myrg --regional-endpoints enabled

View all registry endpoint URLs, including the registry global endpoint, geo-replica regional endpoints, and data endpoints:

    az acr show-endpoints --name myregistry --resource-group myrg

Using regional endpoints

Authenticate to a specific regional endpoint:

    az acr login --name myregistry --endpoint eastus

Push to a specific geo-replica.
Images and tags pushed to a geo-replica via regional endpoints still propagate to all other geo-replicas under eventual consistency.

    docker tag myapp:v1 myregistry.eastus.geo.azurecr.io/myapp:v1
    docker push myregistry.eastus.geo.azurecr.io/myapp:v1

Pull an image:

    docker pull myregistry.eastus.geo.azurecr.io/myapp:v1

You can specify regional endpoints directly in Kubernetes deployment manifests if you need to pin workloads to specific regions. This ensures clusters in specific regions always pull from their colocated replica, providing predictable routing and reduced latency. By using different regional endpoints in each cluster's manifests, you can guarantee that each cluster pulls from its local replica instead of relying on Azure-managed routing.

The manifests below are abbreviated to highlight the image field; a complete apps/v1 Deployment also requires spec.selector and matching spec.template.metadata.labels.

East US cluster deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-eastus
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.eastus.geo.azurecr.io/myapp:v1

West Europe cluster deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-westeurope
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.westeurope.geo.azurecr.io/myapp:v1

When to use regional endpoints

| Scenario | What to do |
| --- | --- |
| Most workloads | Keep using the global endpoint (myregistry.azurecr.io). Health-aware failover handles routing automatically. |
| Pin AKS clusters to co-located replicas | Use regional endpoint URLs in deployment manifests. |
| CI/CD push-then-pull consistency | Pin pushes to a regional endpoint to avoid eventual-consistency races. |
| Client-side failover | Switch between regional endpoints based on your own health checks. |
| Capacity planning | Spread workloads across multiple regional endpoints to avoid per-replica throttling. |
| Troubleshooting | Target a specific geo-replica to reproduce or isolate an issue. |
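Client-side failover between regional endpoints, listed in the scenarios above, comes down to trying endpoints in preference order and using the first one that passes your own health check. The following is a minimal sketch under stated assumptions: the helper names first_healthy_endpoint and check_registry_reachable are hypothetical, and the reachability probe (any HTTP response from the registry API root at /v2/, including an unauthenticated 401) is one possible health check rather than an ACR-defined signal.

```shell
# first_healthy_endpoint HEALTH_CHECK ENDPOINT [ENDPOINT...]
# Hypothetical helper: prints the first ENDPOINT for which
# "HEALTH_CHECK ENDPOINT" exits with status 0; returns 1 if none passes.
first_healthy_endpoint() {
  local check="$1" endpoint
  shift
  for endpoint in "$@"; do
    if "$check" "$endpoint"; then
      echo "$endpoint"
      return 0
    fi
  done
  echo "no healthy regional endpoint found" >&2
  return 1
}

# Assumed health check: succeeds if the endpoint returns any HTTP response
# within 5 seconds. Without --fail, curl exits 0 even on a 401 response.
check_registry_reachable() {
  curl --silent --max-time 5 --output /dev/null "https://$1/v2/"
}

# Example (placeholder names), preferring the local region:
#   first_healthy_endpoint check_registry_reachable \
#     myregistry.eastus.geo.azurecr.io myregistry.westeurope.geo.azurecr.io
```

Because each endpoint is its own authenticated surface, a workload that switches endpoints should re-authenticate against the newly selected one, for example with az acr login --name myregistry --endpoint <region>.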
What changed from private preview

| Private preview | Public preview |
| --- | --- |
| Feature flag registration required (az feature register) | No registration needed |
| Subscription private preview enrollment and propagation wait | Immediately available to all Azure subscriptions for all Premium SKU registries in all Azure public cloud regions |
| Separate CLI extension (acrregionalendpoint) | Built into Azure CLI 2.86.0+ natively |
| No registry-level CLI flag | az acr update --regional-endpoints enabled enables regional endpoints for all geo-replicas |
| --region-endpoint-enabled flag for controlling a geo-replica's global endpoint routing via az acr replication update | Flag for controlling a geo-replica's global endpoint routing renamed to --global-endpoint-routing |
| No portal support | Native Azure portal support for enabling regional endpoints for new registries (during creation) and for existing registries |
| Private preview docs in Azure/acr | Full documentation on MS Learn |

Enabling regional endpoints in the Azure portal

You can enable regional endpoints directly from the Azure portal for both new registries (during creation) and existing registries.

If you were in the private preview

1. Uninstall the CLI extension. The private preview CLI extension conflicts with the built-in commands in Azure CLI 2.86.0+. Remove it:

    az extension remove --name acrregionalendpoint

Verify it's gone:

    az extension list --query "[?name=='acrregionalendpoint']" -o table

2. Ensure you're running Azure CLI 2.86.0 or later. Regional endpoints commands are available natively starting in Azure CLI 2.86.0. Check your version:

    az version

3. Update scripts that use --region-endpoint-enabled for controlling global endpoint routing for a geo-replica. The old flag name for controlling a geo-replica's global endpoint routing configuration is deprecated and will be removed in Azure CLI 2.87.0 (June 2026).
Update to --global-endpoint-routing : # Old (deprecated) az acr replication update --registry myregistry --name westus \ --region-endpoint-enabled false # New az acr replication update --registry myregistry --name westus \ --global-endpoint-routing false Why the rename? The old flag name --region-endpoint-enabled was confusing — it sounded like it controlled the regional endpoints feature, but it actually controlled whether a geo-replica participates in global endpoint routing. The new name --global-endpoint-routing says exactly what it does. For a full breakdown of all three CLI flags and how they relate, see the endpoint reference. Learn more Full documentation: Geo-replication in Azure Container Registry — Regional endpoints — prerequisites, CLI commands, network considerations, private endpoint integration, and troubleshooting. Operational deep dive: How ACR geo-replication handles failover, failback, and traffic redirection — health-aware failover, throttling, eventual consistency, DNS considerations, monitoring, pricing, and a full walkthrough. Behind-the-scenes engineering implementation: Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints — architectural details and the engineering system design behind the feature. Endpoint reference: Azure Container Registry endpoint reference — all endpoint types, URL formats, and CLI flags in one place. Private endpoints: Connect privately to a registry using private endpoints — IP allocation math, subnet sizing, and NIC queries for registries with regional endpoints. Firewall rules: Configure firewall access rules — which FQDNs to allow for regional endpoints. Feedback We'd love to hear how you're using regional endpoints and what we can improve. 
Reach out via:
Azure Container Registry GitHub repository: issues, feature requests, and discussion
Azure portal feedback: use the feedback button in the Azure portal on your registry's page
Regional endpoints are on the path to GA. Your feedback directly shapes the feature's direction.
VNet integration for Azure SRE Agent (preview)
For many production systems, the logs, databases, private endpoints, repositories, and runbooks an SRE Agent needs to do its job are behind network boundaries your security team already governs. VNet integration for Azure SRE Agent, now in preview, puts the agent's outbound traffic under those same controls - your virtual network, your NSG rules, your private DNS - so it reaches only what your network allows. The principle is one your security team already applies to every other workload: a component's network access shouldn't depend on the component behaving correctly. Identity governs what the agent can reach. Permissions and hooks shape what it does within reach. The network sits beneath both: it blocks any request to a destination you haven't allowed no matter what the agent decides. Why egress control matters Two reasons. First, the agent reads sensitive things by design. Inspecting logs, code, configuration, and internal systems is the whole point during an incident, which means you have to decide where that data can go. Open egress gives that data a path out of your network - a risk you wouldn't accept for any other production-adjacent workload. Second, it reasons over text it didn't write - logs, issue descriptions, tool output — which is how prompt injection gets in. Handling that is partly model safety, and Azure SRE Agent runs under Microsoft's Responsible AI standard with safety work from OpenAI and Anthropic. Network controls add another layer: an instruction that tries to reach a destination you haven't allowed can't run, because the network blocks it. For example, an agent investigating an outage might query Log Analytics, read deployment configuration, and call an internal runbook - all private resources. With VNet integration, those calls follow the routes, DNS, and firewall rules your workloads already use. A request to an external endpoint you haven't allowed fails at the network boundary. 
It doesn't depend on the model recognizing the risk and refusing; the network stops it either way. Choose an egress mode Azure SRE Agent has three egress modes, and you don't have to start at the strongest. Unrestricted - all outbound traffic allowed Limited - deny all outbound, allow an explicit list of hosts. Gives you host-level control without setting up a full VNet Azure VNet - outbound traffic goes through a delegated subnet in your network, with your NSG rules and private DNS applied. The recommended mode for production and regulated workloads. How Azure VNet mode works Outbound traffic takes one of two paths, and every call takes exactly one. Your VNet. Everything not placed on the managed path goes through a delegated subnet in your own network, where your NSG rules, private DNS, and firewall all apply. The agent is just another workload on that subnet, so it can reach what the subnet can reach: databases behind private endpoints, internal services, monitoring stores, and key vaults -the parts of production that aren't reachable from the public internet. The resources that matter most during an incident are usually the private ones. If your network connects to on-premises over ExpressRoute or VPN, the agent can reach those systems too, as long as your existing routes and rules allow it. The managed infra path. Some destinations go through Azure SRE Agent's managed infrastructure network instead - platform services the agent needs, plus optional categories you turn on: package registries, code repositories, and remote MCP servers. This path skips your VNet, so your NSG rules and Firewall Policies don't apply to it. Treat it as a deliberate exception, used only where you need it. Why public services start on the managed path Public services are hard to allow by IP address. GitHub, PyPI, npm, NuGet, apt, and the container registries run on large, changing IP ranges, and they don't map to a single Azure service tag. 
If your NSG filters by IP and port, keeping those lists up to date is constant work, and when a list falls behind, the agent can't pull a package or read a repository - and an investigation stalls on a networking problem that has nothing to do with the incident. Each category has a toggle: package registries (PyPI, npm, NuGet, apt), code repositories (GitHub, GitHub Enterprise, Azure DevOps), remote MCP servers, and a list of additional hostnames. Starting with these on the managed path keeps the agent working reliably without maintaining an IP allowlist. For build-time dependencies, that's usually fine. If you want this traffic inspected too, the next step is name-based (FQDN) egress filtering in your own network. Once your firewall can allow github.com and pypi.org by name, you can move these categories off the managed path and route them through your VNet instead Configure it Two decisions: the subnet, and what (if anything) uses the bypass. Navigate to Settings > Workspace Configuration > Network Choose Azure VNet as the egress mode. Select a subnet that is /28 or larger and delegated to `Microsoft.App/environments`. Decide which categories, if any, use the bypass. Restrict who can change the egress mode and bypass toggles. These settings widen or narrow the agent's reach, so govern them like any production network control. Test the outbound behavior before using the agent with production data. A reasonable setup for most enterprises during preview: use Azure VNet mode, keep package registries and code repositories on the bypass if you need reliable access to them, and route everything else through your VNet. Stricter environments can turn those categories off and rely on their own name-based firewall rules. What it doesn't cover yet VNet integration is in preview, with two limitations to know. It covers outbound traffic only - reaching the agent privately from inside your network isn't part of this preview. 
And connector traffic still routes over the public internet; the governance and credential isolation in Connectors V2 still apply. Use VNet integration for outbound control of the agent workspace, and combine it with identity, RBAC, tool permissions, hooks, and connector governance for a complete set of controls. Where it fits VNet integration doesn't replace identity, RBAC, tool permissions, or connector governance. It controls where traffic can go. The agent still needs the right identity and permissions to access a resource in the first place. Identity is the foundation: your RBAC assignments decide what the agent can reach. Permissions and hooks shape what it does within reach: allow/ask/deny rules control what runs, and hooks let you inspect or change a tool call before it runs. VNet integration sits underneath, controlling where traffic can go no matter what the agent tries to do. You want the agent to be capable. You also want a boundary that holds whether or not it is. Get started
Create an SRE Agent - https://aka.ms/sreagent
Documentation - https://aka.ms/sreagent/newdocs
Recipes - https://aka.ms/sreagent/recipes
Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent
Managed Connectors for SRE Agent (preview) - Govern what your agent can do
Giving an agent access to a tool is the easy part. The harder question is what it's allowed to do with that access. "Can the agent copy a file in OneDrive?" mostly answers itself. "Can it copy any file, to any destination, over one that's already there?" is the one that decides whether the integration has a governance layer. Managed Connectors is built around that second question. It expands the catalog of tools the agent can reach - OneDrive, SharePoint, Google Drive, GitLab, Power BI, Microsoft Security Copilot, with more being added regularly - and pairs it with a governance model that keeps the policy for those tools outside the agent's control. This is part of the Azure SRE Agent announcements at Build 2026 What's new Managed Connectors is the next generation of our connector experience. It significantly expands the catalog of third-party and first-party SaaS integrations available to SRE Agent and surfaces each one to the agent as a curated set of operations through the Model Context Protocol (MCP) - the same standard the agent already uses for every other tool source. Governance: the agent gets capability, you keep control The governance model is the headline of this release, so it's worth being concrete about it. When you add a connector, you walk through a short wizard - Set up connector, Configure tools, Review & Save - and the "Configure tools" step is where the policy is set. Three things make it different from "just wire the API up to the LLM": You choose what's exposed - it isn't automatic. A connector might offer 40+ operations; in the wizard you pick the ones the agent can use. The rest aren't shown to the model, so it can't call them. Parameter policy lives outside the agent. For each selected operation you can mark parameters as user-defined (pinned to a value you specify) or agent-defined (the agent fills it in). 
On the Microsoft Planner “Create a task” tool, for example, you can choose the group ID from a list of your joined groups. This means that the agent provides the task details but can’t assign the task to an arbitrary group, because the group isn’t a parameter it sees when invoking the tool. Per-tool approval is built in. Each operation has an Allow/Ask toggle integrated directly into the creation and edit wizards. "Ask" routes the call through the agent runtime's human-in-the-loop approval flow before it executes. On that same Microsoft Planner connector, you might leave read-only tools like “List tasks” or “Get plan details” on Allow, but flip “Delete a task” to Ask so a human must confirm before anything is removed. This is enforced by the agent's runtime; it is not a prompt instruction the model can be talked out of following. Credential Isolation No long-lived secrets in the agent. No API keys, no client secrets, no certificates, no OAuth tokens. All service credentials are encrypted at rest and stored outside of the agent’s trust boundary. Automatic token refresh. Once you consent, the internal connector resource keeps your tokens valid. You won't be asked to re-authenticate unless your service itself requires it. You consent once, in your own browser, with your own service. SRE Agent never proxies your password or the sign-in flow. Per-connection authorization. Each connection is bound to the specific SRE Agent instance you set it up on and cannot be used by external threat actors. How it fits together All of this is stored and evaluated outside the agent loop. Each configured connector becomes an MCP server that the SRE Agent runtime registers as a tool source, the same standard wire format the agent uses for everything else, so adoption on the model side is trivial.
Each layer does one job, and the trust boundary between "what the model decided" and "what was actually sent" is explicit and inspectable: the agent never sees the operations you didn't select, never sees the parameter slots you pinned, and cannot bypass approval on operations you marked Ask. How to try it Open the SRE Agent portal and go to Builder > Connectors. Pick a connector from the catalog with the “Preview” label and go through the creation wizard steps. At the “Set up connector” step, choose how the connector authenticates. Start with “OAuth” if you just want to sign in and see it working against your own account. At “Configure tools”, select the operations you want to expose, pin any parameters that shouldn't be agent-controlled, and mark sensitive operations as “Ask.” Review & Save. The connector is registered with the runtime and immediately available to your agent. You can enable or disable specific tools or connectors in the “Capabilities” section. Edit connector: after creating the new connector, you can go back at any point to authenticate it with a different account, add or remove operations, update tool parameters, and configure approval policies. Resources
Create new SRE Agent - https://aka.ms/sreagent
SRE Agent Documentation - https://aka.ms/sreagent/newdocs
SRE Agent recipes - https://aka.ms/sreagent/recipes
Build 2026 SRE Agent announcements - https://aka.ms/Build26/blog/SREAgent
Private Plugins with Azure SRE Agent
SREs and platform teams are building operational skills specific to their infrastructure: investigation runbooks, compliance checks, cost analysis playbooks, deployment verification procedures. The next step is making that work reusable across every agent in the organization without exposing it publicly. Today, SRE Agent supports plugin marketplaces hosted in private GitHub repositories, including GitHub Enterprise. This is part of the Azure SRE Agent announcements at Build 2026. You can now point SRE Agent at a private repo when adding a marketplace or installing a plugin. Authentication is handled per marketplace and supports OAuth, GitHub PATs, and GitHub Apps for GHE tenants. From one agent to an organization’s plugin catalog Most teams start with a single SRE Agent connected to their services. The agent learns their infrastructure, runs their runbooks, and handles their incidents. It works well. Then adoption grows. A second team stands up their own agent. Then a third. Platform engineering wants every agent to run the same compliance checks. Security needs approval hooks enforced consistently. FinOps has cost governance skills that should be standard across the organization. Suddenly the question isn’t “how do I set up my agent,” it’s “how do we share operational knowledge across all of them.” Without a distribution model, teams end up copying skill files between agents manually. A platform team writes a runbook, shares it over email or a wiki link, and each service team pastes it into their agent individually. When the runbook improves, some agents get updated, some don’t. There’s no version tracking, no central catalog, and no way to know which agent is running which version of which skill. Private marketplace support solves this. How private plugin marketplaces meet enterprise needs A platform team publishes once, every agent installs. Codify best practices as plugins in a private GitHub repo.
Service teams add that repo as a marketplace in their agents and install what they need. Compliance checks, cost governance thresholds, incident playbooks, and deployment verification procedures are all distributed through versioned plugins. Each team retains ownership. Security controls which plugins enforce approval hooks. FinOps locks cost thresholds into parameter values. Platform engineering governs infrastructure investigation patterns. The marketplace is the distribution layer for organizational standards. Versions are pinned, updates are explicit. Each installation locks to the commit at install time. A merged PR upstream does not change any agent’s behavior. Teams promote new versions on their own schedule: validate in dev, promote to staging, then production. Different agents can run different versions simultaneously. Reuse across environments and tools. The same plugin works across dev, staging, and production agents, and can be reused by local coding agents and other services that support plugins. One source of truth, not separate copies per environment. Accessing private plugin marketplaces Private repo support adds authentication to the SRE Agent's plugin workflow so your agent can clone and install from repos that require credentials. Authentication is configured once per marketplace. Every plugin within it inherits the credentials.
Auth method | When to use | Setup
OAuth | github.com repos your agent can already access | Uses your existing GitHub connection. One click.
Personal access token | Private repos in other orgs on github.com | Per-marketplace PAT. Scoped to just that marketplace.
GitHub App | GitHub Enterprise (*.ghe.com) | BYO App with private key in Azure Key Vault. Short-lived tokens minted at runtime.
Getting started In SRE Agent, navigate to Builder > Plugins, then click Add Marketplace and enter the URL of the private marketplace you want to connect to. Then click Connect to GitHub to complete the OAuth sign-in.
Click Add and you will see the plugins available from your connected marketplace. Click the plugin you want to install; in the detail view you can browse the skills packaged with it. Click Install to install the plugin. You can now see the skills imported from plugins under Capabilities > Skills > Custom Skills. The bottom line Private repo support turns the Plugin Marketplace from a public skill catalog into your organization’s internal distribution platform for operational automation. Your team writes the plugins. Your agents install them. Your GitHub permissions control who has access. Try it yourself: create a private repo with a marketplace.json and a few skills, add it as a marketplace in your agent, and install a plugin. Resources
SRE Agent documentation - https://aka.ms/sreagent/newdocs
SRE Agent overview - https://aka.ms/sreagent/newdocsoverview
Plugin Marketplace capability page - https://aka.ms/sreagent/newdocs/capabilities/plugin-marketplace
Build 2026 SRE Agent announcements - https://aka.ms/Build26/blog/SREAgent
Introducing On-demand Sandboxes for Azure Durable Task Scheduler (Private Preview)
Maybe it needs a native toolchain. Maybe it runs untrusted customer or LLM-generated code. Maybe it needs Python from a .NET orchestrator, or bursty compute that should scale to zero when the work is done. Today, we're thrilled to announce On-demand Sandboxes for Azure Durable Task Scheduler, now available in private preview. On-demand Sandboxes lets you move those individual workflow steps to managed, isolated compute while your orchestrator stays exactly where it is. Tell DTS which steps should run in isolation, provide a container image with the step code, and DTS handles provisioning, scaling, and teardown. No infrastructure to manage, no idle costs, no orchestrator changes. Sign up for On-demand Sandboxes Private Preview Today → Availability: On-demand Sandboxes targets the standalone Durable Task SDKs used outside the Azure Functions host — for apps running on Azure Container Apps, Azure Kubernetes Service, App Service, or anywhere else you self-host. The private preview supports the .NET and Python Durable Task SDKs, with additional language SDKs and Azure Functions support coming soon. What is Azure Durable Task Scheduler? The Durable Task Scheduler is a fully managed backend for durable execution on Azure. It can serve as the backend for a Durable Function App using the Durable Functions extension, or as the backend for an app leveraging the Durable Task SDKs in other compute environments, such as Azure Container Apps, Azure Kubernetes Service, or Azure App Service. For a deeper introduction, see the Durable Task Scheduler overview or the full Durable Task documentation. Why On-demand Sandboxes? Most activities belong in-process. They're fast, simple, and co-located with your orchestrator. But sometimes you hit a step that doesn't fit: it needs a native binary, a different language runtime, per-invocation isolation, or bursty compute you don't want to keep warm. 
On-demand Sandboxes gives you a way to handle those exceptions without spinning up dedicated infrastructure or managing scaling policies in Azure Kubernetes Service or Azure Container Apps. Activity-level granularity. Move individual steps to managed compute, not your whole app. Per-activity or per-invocation isolation. Each execution runs in a clean, microVM-backed sandbox. Ideal for untrusted code, customer plugins, or LLM-generated logic. Cross-runtime flexibility. Run a Python inference step from a .NET orchestrator. No compromise on either side. Scale-to-zero. Pay for CPU and memory per second of execution, not infrastructure that waits. No orchestrator changes. Your orchestration code and hosting model don't change at all. Here are a few scenarios where On-demand Sandboxes shines: Native toolchains. Package ffmpeg, LibreOffice, or Pandoc in a container without dragging them into your main app. CPU-heavy preprocessing. OCR, layout extraction, or image processing can scale independently of the rest of your workflow. Cross-runtime workflows. A .NET orchestrator dispatches a Python inference step. No compromises. Sandboxed code execution. Run customer plugins or LLM-generated code with a clean boundary on every invocation. Multi-tenant isolation. Tenant-specific steps get dedicated boundaries while everything else stays in-process. Bursty event-driven workloads. Steps that spike hard but rarely may not justify always-on infrastructure. Sub-second cold starts mean you get capacity when you need it without paying to keep it warm. How it works On-demand Sandboxes uses a two-part model: a worker profile in your orchestrator app that tells DTS which activities to offload, and a worker image that contains those activity implementations. Your orchestrator still calls activities the same way it always has; the decision to run one activity in a sandbox lives in the profile configuration. 1. 
Declare a sandbox worker profile In the app that hosts your orchestrator, define a sandbox worker profile. The profile gives DTS the container image, resource shape, concurrency setting, and activity names that should run in a sandbox: using Microsoft.DurableTask.Worker.AzureManaged.Sandbox; [SandboxWorkerProfile("code-executor")] internal sealed class CodeSandboxWorkerProfile : ISandboxWorkerProfile { public void Configure(SandboxOptions options) { options.ContainerImage = Environment.GetEnvironmentVariable("DTS_SANDBOX_IMAGE") ?? throw new InvalidOperationException("DTS_SANDBOX_IMAGE is required."); options.Cpu = "1000m"; options.Memory = "2048Mi"; options.MaxConcurrentActivities = 1; options.AddActivity(TaskNames.ExecuteCode); } } Then enable on-demand sandbox discovery when you configure the Durable Task worker in the main app: workerBuilder.AddTasks(tasks => tasks.AddAllGeneratedTasks()); workerBuilder.UseDurableTaskScheduler(options => { options.EndpointAddress = Environment.GetEnvironmentVariable("DTS_ENDPOINT"); options.TaskHubName = Environment.GetEnvironmentVariable("DTS_TASK_HUB"); options.Credential = credential; }); workerBuilder.EnableSandboxes(); Here's what the profile configuration does: SandboxWorkerProfile: a friendly profile id for this sandbox setup. It groups the activity, image, and resource settings for monitoring and reuse across deployments. ContainerImage: the container image (from your registry) that contains the activity implementations. Cpu / Memory: the resource shape for each worker instance. Sized per your activity's needs. MaxConcurrentActivities: how many activities a single worker instance can process concurrently. AddActivity: the specific activity to offload. Only activities added to a sandbox worker profile execute in DTS-managed isolated compute; everything else stays in-process. 
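The routing rule described above (only activities added to a sandbox worker profile leave the process) can be modeled in a few lines. The following Python sketch is purely conceptual: `SandboxProfile` and `route_activity` are hypothetical names invented for illustration, the image name is a placeholder, and none of this is the Durable Task SDK's API. It only mirrors the settings listed above.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class SandboxProfile:
    """Hypothetical stand-in for a sandbox worker profile (not an SDK type)."""
    profile_id: str
    container_image: str
    cpu: str = "1000m"
    memory: str = "2048Mi"
    max_concurrent_activities: int = 1
    activities: Set[str] = field(default_factory=set)

    def add_activity(self, name: str) -> None:
        """Mark one activity name as offloaded to this profile's sandbox image."""
        self.activities.add(name)


def route_activity(name: str, profiles: Iterable[SandboxProfile]) -> str:
    """Return the id of the profile that claims `name`, or "in-process" if none does."""
    for profile in profiles:
        if name in profile.activities:
            return profile.profile_id
    return "in-process"


if __name__ == "__main__":
    # Placeholder image name; in the post it comes from DTS_SANDBOX_IMAGE.
    profile = SandboxProfile("code-executor", "myregistry.azurecr.io/code-worker:v1")
    profile.add_activity("ExecuteCode")
    print(route_activity("ExecuteCode", [profile]))   # claimed by the profile
    print(route_activity("SendReport", [profile]))    # stays in-process
```

The useful property to notice is that the orchestrator never appears in this decision: routing depends only on the activity name and the profile contents.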
The orchestrator call site doesn't change: ExecuteCodeOutput execution = await context.CallActivityAsync<ExecuteCodeOutput>( TaskNames.ExecuteCode, new ExecuteCodeInput(pythonCode, input.CsvData)); ExecuteCode is not registered in the main app's in-process activity list. When the orchestrator calls it, DTS uses the code-executor profile to route the work to the sandbox image. 2. Build the worker image The worker image is a container you own. In most apps, this worker lives in a separate project from the orchestrator host so it can have its own entry point, dependencies, and container image. It registers the activity implementations it can run and opts in to managed execution with UseSandboxWorker(): builder.Services.AddDurableTaskWorker(workerBuilder => { workerBuilder.AddTasks(tasks => { tasks.AddActivity<ExecuteCodeActivity>(); }); workerBuilder.UseSandboxWorker(); }); UseSandboxWorker() is the key line. It signals that this worker runs in DTS-managed compute. The sandbox worker does not need to configure the DTS endpoint, task hub, profile id, or credentials; DTS injects the runtime settings when it starts the container. The activity implementations themselves are standard Durable Task activities. There's nothing special about the activity code: it can call a runtime with different dependencies, such as Python and pandas, while running in an isolated container instead of in your main app's process. Package the image like any containerized service, including whatever runtimes and native tools the activity needs. Push it to your container registry (e.g., Azure Container Registry) and reference the image in the worker profile's ContainerImage option. View logs in the DTS dashboard Once your sandbox activities are running, you can view their execution logs directly in the Durable Task Scheduler dashboard. The dashboard shows real-time output from your managed workers, including stdout, stderr, and activity lifecycle events.
This gives you full visibility into what's happening inside the sandbox without needing to configure external log sinks or set up your own observability pipeline. Demo Get started On-demand Sandboxes is in private preview. To get access, sign up here. We'll enable the feature on your scheduler and help you get your first sandbox activity running. Once you're in, the workflow is straightforward: declare a sandbox worker profile in your orchestrator app, build and push a worker image, and DTS takes care of the rest. Sign up for On-demand Sandboxes Private Preview Today →
Documentation: Durable Task Scheduler overview
Samples: Azure-Samples/Durable-Task-Scheduler
Pricing: Azure Durable Task Scheduler pricing
Questions, feedback, or ideas? Open an issue in the Durable-Task-Scheduler GitHub repo. We'd love to hear from you.
Introducing Azure Container Apps Sandboxes: Secure Infrastructure for Agentic Workloads
Today we are announcing the public preview of Azure Container Apps Sandboxes - a new first-class resource type that gives you fast, secure, ephemeral compute environments with built-in suspend and resume. This is the underlying infrastructure on which products like Cloud sandboxes in GitHub Copilot, Foundry Hosted Agents, and Azure Container Apps Express are built, and you can now build your own solutions on the same infrastructure. Azure Container Apps Sandboxes unlocks two massive opportunities. For platform developers and ISVs, sandboxes give you the same isolated compute fabric that powers many Microsoft products. You get the building blocks to create your own multi-tenant platform on proven, enterprise-scale infrastructure. For AI agents, sandboxes become a self-configurable tool that lets agents extend their own capabilities on the fly. An agent can spin up a fresh sandbox in milliseconds and use it to execute untrusted code, compile source, test HTTP requests against a live app, launch a browser session, or tackle anything else that needs quick, scalable infrastructure. On one side it empowers humans to build platforms; on the other it empowers agents to build their own capabilities. Both get enterprise-grade isolation, instant startup, and snapshot-based persistence out of the box. We'll walk through the resource model, sandbox lifecycle, the features that set Sandboxes apart - like snapshots, lifecycle policies, network egress controls, volumes, and managed identities - and show you how to get started with the portal and CLI. What Are Container Apps Sandboxes? Container Apps Sandboxes are secure, isolated compute environments that start in sub-second time, scale to thousands, and cost nothing when idle. Each sandbox runs in its own hardware-isolated microVM boundary - fully separated from the host, the platform, and every other sandbox.
You bring your own Open Container Initiative (OCI) image, and Sandboxes handle the rest: provisioning from prewarmed pools, strong multi-tenant isolation, and snapshot-based suspend/resume that preserves full memory and disk state across sessions. There are many ways Sandboxes can help you build your next project - here are a few: Your own build & test systems - wire a Sandbox into your CI/CD flow to run builds while your laptop stays cool. Agents that can run anything safely - an agent spawns a sandbox, executes work inside it, and returns the output with no agent host privileges required. Agent swarms - decompose a research question, spawn N sandbox workers in parallel (each pinned to its own image and egress policy), and synthesize the result. Early access customers are already unlocking significant benefits by leveraging Azure Container Apps Sandboxes. "With Azure Container Apps sandboxes, SitecoreAI can safely enable agents to take real action. The combination of multi-tenant isolation, rapid scale-out, and full automation allows Sitecore to run long-lived, autonomous agents that securely execute code, manage workflows, and interact with enterprise systems within secure, governed environments. With this foundation, we can build agents that do real work: assembling content, personalizing experiences, and optimizing campaigns in production. Agents that operate continuously, learn from results, and improve over time, so our customers get better outcomes without giving up control." - Mo Cherif, VP of AI and Innovation, Sitecore "We got early access to Azure Container Apps Sandboxes, and got the first prototype integrated with Atlas AI in hours, and it's already shaping a new Atlas AI capability that we plan to launch in preview in Q3. It gives every Atlas AI agent a safe, sandboxed workspace (file system, terminal, code execution) on a customer's live data in Cognite Data Fusion. 
The value: Industrial process, reliability, and production engineers spend days and weeks on questions like "which wells are underperforming and why?" These questions are tractable but expensive, so they are asked rarely and decisions are made on gut feel. With this, an agent pulls the data, runs the analysis, cross-references maintenance and inspection records, and returns a cited draft in minutes. Sandboxes make it practical: Aligned feature set, per-customer isolation, pause/resume across multi-day investigations, scale-to-zero economics." - Kelvin Sundli, Product manager, Atlas AI, Cognite Resource Model: Sandbox Groups and Sandboxes The top-level ARM resource is Microsoft.App/SandboxGroups. A Sandbox Group is the management boundary for a collection of sandboxes that share configuration - think of it like a Container Apps Environment, but purpose-built for sandboxes. When you create a Sandbox Group, you specify: Subscription, Resource Group, and Region Sandbox defaults (optional): default CPU, memory, disk, max sandbox count, and default idle timeout Networking: optionally deploy into a custom VNet with a dedicated subnet for private networking Identity: System or user assigned Entra identity. Individual sandboxes are created within a Sandbox Group. Each sandbox has its own source (disk image or snapshot), resource tier, lifecycle policy, network egress policy, environment variables, ports, volumes, and connections. 
Sandbox Lifecycle
Sandboxes have a well-defined lifecycle with the following states:
State | Description
Creating | Provisioning the sandbox from a disk image or snapshot
Running | Actively executing - backed by a live microVM
Idle | System-suspended after inactivity; can auto-resume on the next request
Suspended | Full state (memory + disk) preserved as a snapshot; no compute costs
Resuming | Restoring from a suspended or idle state - sub-second for most workloads
Stopped | User-initiated stop; can be resumed
Stopping | Graceful shutdown in progress
Deleting | Teardown in progress
The key insight here is the distinction between Idle and Suspended. When a sandbox goes idle (e.g., no traffic for a configured timeout), the system can automatically suspend it and capture a snapshot. When a new request arrives, the sandbox resumes transparently. This gives you scale-to-zero economics with stateful compute - something that wasn't possible before without significant custom engineering.
Disk Images: Bring Your Own Container
Sandboxes boot from Disk Images - Open Container Initiative (OCI) images converted into an optimized root filesystem format. You point to any OCI image (public or private registry), and the platform builds a bootable disk image from it. You can start with public, pre-built images maintained by the platform (for example, Ubuntu base images), or bring your own private images. For private registries, you can authenticate with username/token or use a user-assigned managed identity for Azure Container Registry (ACR), integrated with Azure as you would expect.
Snapshots: Full-State Persistence
Snapshots capture the complete state of a running sandbox - memory, disk, and all running processes. When you resume a sandbox from a snapshot, every process, open file handle, and in-memory data structure is restored exactly as it was.
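The lifecycle states listed earlier in this section can be read as a small state machine. The Python sketch below encodes them with a transition graph; note that the graph itself is an assumption inferred from the state descriptions (for example, that Stopped sandboxes pass through Resuming), not a published specification, and the helper names are hypothetical.

```python
from enum import Enum


class State(Enum):
    CREATING = "Creating"
    RUNNING = "Running"
    IDLE = "Idle"
    SUSPENDED = "Suspended"
    RESUMING = "Resuming"
    STOPPING = "Stopping"
    STOPPED = "Stopped"
    DELETING = "Deleting"


# Assumed transitions, inferred from the state descriptions; not an official spec.
TRANSITIONS = {
    State.CREATING: {State.RUNNING, State.DELETING},
    State.RUNNING: {State.IDLE, State.STOPPING, State.DELETING},
    State.IDLE: {State.SUSPENDED, State.RESUMING, State.DELETING},
    State.SUSPENDED: {State.RESUMING, State.DELETING},
    State.RESUMING: {State.RUNNING},
    State.STOPPING: {State.STOPPED},
    State.STOPPED: {State.RESUMING, State.DELETING},
    State.DELETING: set(),
}


def can_transition(current: State, target: State) -> bool:
    """Return True if the assumed graph allows current -> target."""
    return target in TRANSITIONS[current]


def walk(path):
    """Validate a sequence of states and return the final one.

    Raises ValueError on the first hop the assumed graph does not allow.
    """
    for current, target in zip(path, path[1:]):
        if not can_transition(current, target):
            raise ValueError(f"{current.value} -> {target.value} is not allowed")
    return path[-1]
```

A request arriving at an idle-then-suspended sandbox would follow Running, Idle, Suspended, Resuming, Running in this model, which is the scale-to-zero path described above.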
There are two ways to create a snapshot: automatically on suspend, or manually on demand. Snapshots are great for three things:
Checkpointing mid-task so a long-running agent can resume exactly where it left off
Cloning an environment that's already warm - dependencies installed, caches populated, services running
Shipping a "ready-to-go" state that resumes in sub-second time instead of cold-booting
Snapshots are free during the preview, after which they will be stored in Azure Blob Storage and billed at standard rates. Each snapshot records the source sandbox, resource allocation (CPU, memory, disk), and container metadata - so what you get back is exactly what you snapshotted.
Resource Tiers
Every sandbox is assigned to a resource tier that determines its CPU, memory, and disk allocation:
Tier | CPU | Memory | Disk
XS | 0.25 vCPU | 0.5 GB | 5 GB
S | 0.5 vCPU | 1 GB | 10 GB
M (default) | 1 vCPU | 2 GB | 20 GB
L | 2 vCPU | 4 GB | 40 GB
XL | 4 vCPU | 8 GB | 80 GB
When creating a sandbox from a snapshot, the resource tier is inherited from the snapshot and cannot be changed - this ensures the restored environment has the exact resources it was running with when the snapshot was taken.
Lifecycle Policies: Auto-Suspend and Auto-Delete
Every sandbox can be configured with lifecycle policies that automate state transitions and cleanup:
Auto-Suspend
Idle timeout: How long a sandbox can sit idle before being suspended (configurable: 1m, 2m, 5m, 10m, 30m, 60m)
Suspend mode:
Disk + Memory (default): Full snapshot including memory state - resume picks up exactly where you left off, with all processes and in-memory data intact.
Disk: Only the disk is preserved; the VM restarts fresh on resume. Useful when you only need file persistence, not process continuity.
Auto-Delete
Automatically delete sandboxes after a configurable number of days of inactivity
Prevents accumulation of abandoned sandboxes that consume snapshot storage
These lifecycle policies are what make Sandboxes economically viable at scale.
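As a rough illustration of the resource tiers and lifecycle policies described above, the following Python sketch encodes the tier table and a policy object that decides what to do with an inactive sandbox. The tier values and the allowed idle timeouts come from the text; everything else (the names LifecyclePolicy, next_action, smallest_tier, the mode labels, the 5-minute default, and the rule that auto-delete takes precedence over suspend) is an assumption made for the example, not platform behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Tier table from the text: (vCPU, memory in GB, disk in GB), smallest first.
TIERS = {
    "XS": (0.25, 0.5, 5),
    "S": (0.5, 1, 10),
    "M": (1, 2, 20),
    "L": (2, 4, 40),
    "XL": (4, 8, 80),
}

# Idle timeouts listed in the text (1m, 2m, 5m, 10m, 30m, 60m), in seconds.
ALLOWED_IDLE_TIMEOUTS_S = (60, 120, 300, 600, 1800, 3600)
SUSPEND_MODES = ("disk+memory", "disk")  # hypothetical labels for the two modes


@dataclass(frozen=True)
class LifecyclePolicy:
    idle_timeout_s: int = 300            # assumed default, not documented
    suspend_mode: str = "disk+memory"
    auto_delete_days: Optional[int] = None

    def __post_init__(self):
        if self.idle_timeout_s not in ALLOWED_IDLE_TIMEOUTS_S:
            raise ValueError(f"unsupported idle timeout: {self.idle_timeout_s}s")
        if self.suspend_mode not in SUSPEND_MODES:
            raise ValueError(f"unsupported suspend mode: {self.suspend_mode}")


def next_action(policy: LifecyclePolicy, idle_seconds: int) -> str:
    """Return 'delete', 'suspend', or 'keep' for a sandbox idle this long."""
    if policy.auto_delete_days is not None and idle_seconds >= policy.auto_delete_days * 86400:
        return "delete"
    if idle_seconds >= policy.idle_timeout_s:
        return "suspend"
    return "keep"


def smallest_tier(cpu: float, memory_gb: float, disk_gb: float) -> str:
    """Pick the smallest tier that satisfies all three requirements."""
    for name, (tier_cpu, tier_mem, tier_disk) in TIERS.items():
        if tier_cpu >= cpu and tier_mem >= memory_gb and tier_disk >= disk_gb:
            return name
    raise ValueError("no tier is large enough for the requested resources")
```

With a 60-second timeout, a sandbox idle for 59 seconds is kept and one idle for 60 seconds is suspended, which is the aggressive scale-to-zero configuration the text has in mind.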
A platform serving thousands of tenants can configure aggressive idle timeouts (say, 60 seconds) with the Disk + Memory suspend mode, and each tenant's sandbox disappears from the billing meter almost immediately - but resumes in sub-second time the moment they return.
Network Egress Policy
For scenarios involving untrusted code - AI agents executing LLM-generated scripts, multi-tenant SaaS with user-submitted workloads - controlling outbound network access is critical. Sandboxes provide a per-sandbox Network Egress Policy:
Default action: Allow or Deny all outbound traffic
Host rules: Domain-pattern rules (e.g., *.github.com → Allow) to permit specific destinations
Custom CIDR rules: Network-level rules for IP ranges (e.g., 10.0.0.0/8 → Deny)
Skip egress proxy: Option to bypass the egress proxy entirely when custom VNet routing handles policy enforcement
This means you can run a sandbox in a deny-by-default posture and allowlist only the specific endpoints it needs (your API server, a package registry, etc.) - without setting up NSGs or firewall appliances.
Managed Volumes: Persistent and Shared Storage
Sandboxes support two types of mountable volumes, both managed by Microsoft:
Volume Type | Backed By | Best For
Managed Azure Blob | Azure Blob Storage | Shared data across sandboxes, file uploads/downloads, persistent artifacts
Managed Data Disk | Azure Disk Storage | High-performance storage for databases, build caches, large working sets - only available to one sandbox at a time
Blob volumes come with a built-in file explorer in the portal - you can browse, upload, download, create folders, and drag-and-drop files directly. Data Disk volumes provide dedicated block storage with configurable sizes.
Secrets and Identity
Secrets
Sandbox Groups support key-value secrets scoped to the group. Secrets can be created, edited, and referenced by sandboxes within the group.
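Stepping back to the Network Egress Policy described above, the Python sketch below shows one way such a policy could be evaluated: a default action plus host-pattern and CIDR rules. The rule shapes mirror the examples in the text (*.github.com → Allow, 10.0.0.0/8 → Deny), but the EgressPolicy class, its field names, and the "first matching rule wins" precedence are assumptions for illustration; the platform's actual evaluation order is not stated here.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network


@dataclass
class EgressPolicy:
    """Hypothetical model of a per-sandbox egress policy."""
    default_action: str = "Deny"
    host_rules: list = field(default_factory=list)  # [(domain pattern, action)]
    cidr_rules: list = field(default_factory=list)  # [(CIDR string, action)]

    def decide(self, destination: str) -> str:
        """Return 'Allow' or 'Deny' for a hostname or IP literal."""
        try:
            addr = ip_address(destination)
        except ValueError:
            addr = None  # not an IP literal, so treat it as a hostname

        if addr is not None:
            # Assumed precedence: the first CIDR rule containing the address wins.
            for cidr, action in self.cidr_rules:
                if addr in ip_network(cidr):
                    return action
        else:
            host = destination.lower().rstrip(".")
            # Assumed precedence: the first matching host pattern wins.
            for pattern, action in self.host_rules:
                if fnmatchcase(host, pattern.lower()):
                    return action
        return self.default_action
```

A deny-by-default policy with a single allow rule for *.github.com would then permit api.github.com and reject everything else, matching the allowlist posture described above. Note that in this sketch the pattern *.github.com does not match the bare domain github.com, so a real allowlist may need both entries.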
Group secrets can be used in egress policies to modify requests with transform or header-injection rules, without exposing the secrets to code running inside the sandbox.
Managed Identity
Sandbox Groups support both system-assigned and user-assigned managed identities, with full RBAC role assignment management. This means your sandboxes can authenticate to Azure services (Key Vault, Storage, Cosmos DB, etc.) without managing credentials - the same identity model you use everywhere else in Azure.
MCP Connectors and Triggers
ACA Sandboxes now supports managed connectors through the Model Context Protocol (MCP), giving sandboxes access to external APIs - including Microsoft 365, Salesforce, ServiceNow, GitHub, and 1,400+ other systems - without managing credentials directly. Attach a Connector Gateway to your sandbox group, and every sandbox in the group can call external APIs through a standardized MCP interface at runtime. Pair connectors with triggers to build event-driven automation: route an Outlook email to a sandbox that triages it with an AI agent, or react to a SharePoint file upload by extracting and processing the document - all without writing glue code. Triggers can fire a shell command inside a sandbox or invoke an HTTP endpoint the sandbox exposes, so your automation shapes fit naturally around your workload. The integration is built on the new Connector Namespace service (az connector-namespace), the same runtime behind Logic Apps and Power Platform connectors, now available as a programmable layer for sandboxes. See the end-to-end samples for runnable azd up-deployable examples covering email triage and document automation scenarios.
The Portal Experience
Azure Container Apps Sandboxes are only available in the new Azure Container Apps portal, which provides a rich, IDE-like experience for working with sandboxes.
Creating a Sandbox
The portal offers multiple creation paths:
Standard Sandbox - full configuration control over source, resources, lifecycle, networking, and volumes
GitHub Copilot Sandbox - preset with Copilot CLI ready to go; GitHub credentials can be wired through the Access Token before the sandbox is created
Claude Sandbox - Claude CLI pre-installed, ready for agentic coding inside the sandbox
Using Coding Agents (Copilot CLI / Claude Code)
If you live inside Copilot CLI or Claude Code, you don't need to learn a new CLI. Install the azure-sandbox skill once and your agent picks up the right skills:
# GitHub Copilot CLI
# Add as a plugin marketplace
/plugin marketplace add microsoft/azure-container-apps
# Install all skills
/plugin install sandboxes@Azure-Container-Apps
# Claude Code
claude plugin add microsoft/azure-container-apps
The skill runs prerequisite checks silently (az --version, az account show, node --version, aca --version), prompts only if something's missing, and maps natural-language asks to the right aca commands. Bundled runbooks cover Copilot CLI BYOK (bring your own Azure OpenAI key), the deploy-a-web-app walkthrough, and shell setup.
Sandbox Detail Page
Once your sandbox is running, the detail page gives you immediate access to the sandbox terminal and additional details, such as:
Network Audit - real-time egress traffic log showing allowed and denied requests
Monitor - live CPU, memory, disk, and network utilization charts
Connectors - attached connections with an "Add" action
Volumes - mounted volumes with an "Add" action
Log Stream - streaming container logs
Processes - running process list inside the sandbox
Files - file explorer to browse the sandbox filesystem
The toolbar actions let you manage the state of the sandbox - Resume or Stop.
In the Ellipsis menu (⁝) you can find additional settings to manage the network Egress Policy and ingress (Add port), take a Snapshot of the sandbox, Commit (save disk state as a new disk image), set a Lifecycle Policy, or permanently Delete the sandbox. Finally, you can see additional Details in a side panel.
Getting Started with the CLI and Python SDK
All sandbox and sandbox-group operations go through the aca CLI. There are no az containerapp sandbox commands - az is only used for az login, az account show, and resource-group management.
Install (CLI)
# Mac, Linux
curl -fsSL https://aka.ms/aca-cli-install | sh
# Windows
irm https://aka.ms/aca-cli-install-ps | iex
Run aca --help to get started.
Install (Python SDK)
pip install azure-containerapps-sandbox
For more details, a quick start, and examples for the ACA CLI and Python SDK, go to https://sandboxes.azure.com
Evolution from Dynamic Sessions
If you've used Azure Container Apps Dynamic Sessions, Sandboxes are the next evolution of that capability. Everything Sessions can do, Sandboxes can do - and significantly more:
Capability | Dynamic Sessions | Sandboxes
Sub-second startup | ✓ | ✓
Strong isolation | ✓ | ✓
Custom container images | ✓ | ✓
Custom VNet integration | ✓ (Partial) | ✓
Suspend/resume with Memory and Disk snapshots | - | ✓
Lifecycle policies (auto-suspend, auto-delete) | - | ✓
Network egress policy (per-sandbox) | - | ✓
Persistent managed volumes (Blob, Data Disk) | - | ✓
Managed identity (system + user-assigned) | - | ✓
Secrets management | - | ✓
Configurable resource tiers | - | ✓
Direct access to sandbox in Portal experience | - | ✓
We will continue to support Dynamic Sessions, but all new investment goes into Sandboxes. If you're building new workloads on isolated ephemeral compute, start with Sandboxes.
How It All Fits Together
ACA Sandboxes is a platform primitive. It's the foundation on which multiple Microsoft products are already built - including ACA Express, Cloud sandboxes in GitHub Copilot, and Foundry Hosted Agents.
When you build on Sandboxes, you're building on the same infrastructure that powers Microsoft's own portfolio. This is the evolution of what we shared with Project Legion in 2024. Legion described the internal infrastructure; Sandboxes exposes it as a customer-facing primitive that you can use directly. What's Next • Deeper Azure integrations - first-class connectivity with Azure networking, identity, storage, and AI services • Enhanced SDK and CLI - richer programmatic experiences for managing sandboxes at scale • More Microsoft services built on Sandboxes - this is just the beginning Get Started Today • Portal: https://sandboxes.azure.com/ • Documentation: Azure Container Apps Sandboxes • Pricing: Azure Container Apps Pricing (per-second vCPU/memory billing, scale-to-zero, snapshots at Blob Storage rates) We'd love to hear your feedback. You can ask questions, or file issues on the Azure Container Apps GitHub (prefix with [Sandbox] for Sandboxes-specific issues).
What's new in Azure Container Apps at Build'26
Azure Container Apps (ACA) is a fully managed serverless container platform that enables developers to build and deploy microservices and modern applications without requiring container expertise or needing infrastructure management. ACA provides built-in autoscaling (including scale to zero), per-second billing, advanced networking, built-in observability, and simplified developer experiences across multiple programming languages and frameworks. The world of application development is shifting rapidly. Agentic AI is fundamentally changing the requirements of cloud platforms - more code is being written by AI, more apps are being deployed by agents, and more deployment stacks are being assembled autonomously. Platforms are aligning to two concurrent demands: hosting intelligent agents as first-class workloads, and giving those same agents access to empty, secure compute pools as tools they can invoke on demand. At the same time, the proliferation of AI-generated code means that platforms must offer strong isolation for untrusted workloads, instant provisioning for rapid iteration, and production-grade defaults that make the right thing the easy thing - for both humans and agents. Azure Container Apps is purpose-built for this new reality. Whether you're a developer shipping a web app in minutes or an agent spinning up ephemeral sandboxes for code execution, ACA provides the serverless foundation that meets both audiences where they are. Customers across industries are betting on ACA as the compute foundation for their AI and cloud-native workloads: Replit runs its agent-driven software creation platform on Azure, enabling enterprises like Hexaware to securely build and deploy AI-generated applications at scale with seamless procurement through Azure Marketplace. LayerX built its Ai Workforce document processing platform on Azure Container Apps, Azure OpenAI, Azure AI Search, and Cosmos DB - helping clients like Mitsui & Co. 
save 570 hours annually by automating manual document tasks. SJR built GX Manager with Microsoft Foundry to automate website personalization at scale - delivering production-grade, data-grounded content in seconds instead of hours of manual curation. August AI powers an AI health companion serving over 3.5 million customers on Azure infrastructure, scoring 100% on the U.S. Medical Licensing Examination and delivering potentially life-saving medical support. Photon Education created Classwise on Azure OpenAI and Foundry with Defender for Cloud security, enabling teachers to prepare lessons faster and engage students more effectively in inclusive learning environments. Microsoft Foundry Agent Service is built directly on Azure Container Apps, serving over 20,000 customers with a dedicated agent runtime that handles fast startup, tool execution, long-running operations, and enterprise-grade isolation at scale. Following the features announced at Ignite'25 and our continued momentum through early 2026, we're excited to share what's new at Build'26. This release deepens our commitment to the agentic era with new primitives for secure ephemeral compute, the fastest path from container to production, a reimagined portal experience, and continued investment in security, observability, and developer productivity. Azure Container Apps Sandboxes (Public Preview) Teams building agentic applications, multi-tenant platforms, development environments, and CI/CD systems have often had to stitch together custom infrastructure to run untrusted code safely, preserve state across sessions, and handle bursty demand without paying for idle capacity. Azure Container Apps Sandboxes addresses that challenge with a new first-class resource type that provides fast, secure, ephemeral compute environments with built-in suspend and resume capabilities. Each sandbox runs in its own hardware-isolated microVM boundary, supports standard OCI container images, and starts in sub-second time. 
Sandboxes can preserve memory, disk state, and preloaded libraries in a snapshot, so workloads resume quickly from the same point without incurring a cold-start reload penalty. Why Sandboxes are perfect for agents Agents can safely run AI-generated code in isolated environments with instant startup. Agents also accumulate context, intermediate results, and working state during long-running tasks. With sandbox snapshots, agents get persistent, isolated workspaces that survive across task boundaries - they can suspend and resume as needed, preserving full execution context including memory and disk. Key capabilities Sub-second startup - provision and execute immediately Hardware-isolated microVMs - strong security boundary for untrusted code Snapshot and resume - full state preservation (memory + disk) across sessions OCI container image support - bring any container Scale to zero, scale to thousands - consumption pricing with per-second billing This is the underlying infrastructure on which products like Cloud Sandbox in GitHub Copilot, Foundry Hosted Agents, and Azure Container Apps Express are built. ACA Sandboxes joins the Container Apps family alongside Apps, Jobs, Functions, and Dynamic Sessions as a foundational building block for the next generation of cloud and AI application workloads. Learn more about Azure Container Apps Sandboxes at https://aka.ms/aca/sandboxes Azure Container Apps Express (Public Preview) We recently launched Azure Container Apps Express in public preview - the simplest and fastest way to launch and scale powerful applications on Azure, from zero to hyperscale, without infrastructure decisions. It represents the first Azure compute platform purpose-built for agent and developer use alike. Express is based on years of experience running Azure Container Apps at scale. We've learned that most developers working on web apps, APIs, and agents want to deploy quickly, have automatic scaling, and avoid dealing with complex infrastructure. 
Express provides these capabilities - it sets up your environment in seconds, handles any amount of traffic, and removes complicated settings. This helps teams move from writing code to having a production-ready app in minutes, not hours. What makes Express different Instant provisioning - your app is running in seconds, not minutes Sub-second cold starts - fast enough for interactive UIs and on-demand agent endpoints Scale to and from zero - automatic, no configuration required Per-second billing - pay only for what you use, no environment provisioning fee Production-ready defaults - autoscaling, managed identity, secrets management, custom domains, container registry integration, revision management, and built-in observability Purpose-built for custom agents Agents need to spin up application endpoints on demand - fast, reliably, and without pre-provisioning infrastructure. Express is purpose-built for this pattern: it provisions in seconds, scales from zero instantly when an agent triggers a workload, and scales back down when the task is complete. Whether an agent is deploying a tool-use endpoint, standing up a temporary API for a multi-step workflow, or launching a web UI for human-in-the-loop review, Express gives it a production-grade, internet-reachable application with zero operational overhead. It's the fastest path from "an agent decided to deploy something" to "it's live and serving traffic." Learn more about Azure Container Apps Express at https://aka.ms/aca/express/launch-blog New Azure Container Apps Portal You open the Azure portal and want to deploy a Container App. Ten minutes later you're three blades deep, toggling settings you don't understand, wondering which workload profile is best before you even have an app. We built a different portal. One where deploying a container app takes less time than reading this paragraph. One where creating an Azure Container App is a single click. And one where experimental features ship weekly, not quarterly. 
Smart defaults, advanced when you need Developers care about outcomes - where their app is running and how to reach it - not starting with a configuration form. The new portal offers three creation modes to keep setup simple: Simple "one-click create" - auto-generates a unique name and provisions your app. Provide the container image and egress settings. That's it - no environment type selection, networking decisions, or container registry configuration. Advanced create - unlocks everything: custom VNets with subnet selection, managed identity for registry auth, lifecycle policies, egress controls, environment variables, custom scale rules, and more. It's a toggle at the top of the same form, not a separate workflow. Express App (Preview) - the new kind of ACA application that provisions and starts almost instantly. Observe quickly, act faster The app overview page surfaces critical information at a glance - including a unified Log Stream that brings app and system logs together in one place. Getting to the root cause now takes fewer clicks, and next steps are always one click away. Faster releases, direct feedback loop Azure Container Apps Express (Preview) and Azure Container Apps Sandboxes (Preview) are currently available only in this new portal. We ship weekly - often more. Upcoming Portal Features in settings give you an easy way to opt in to early access features and share feedback directly. Security: Defender for Cloud Serverless Containers Posture and Confidential Compute Security remains a top priority as enterprises run more sensitive and regulated workloads on Azure Container Apps. At Build'26, we're announcing two key security milestones. 
Public Preview: Defender for Cloud Serverless Containers Posture on Azure Container Apps Customers can now bring Azure Container Apps environments into Microsoft Defender for Cloud's Serverless Containers Posture experience, helping security teams extend posture management across more of their container estate from a single workflow. This makes it easier to gain visibility into Container Apps resources and assess risks across areas such as identity, networking, and container or image configuration. With this capability, teams can more consistently evaluate risk across container environments and use attack path analysis to identify potential exposure faster. The result is a more unified security posture, less manual effort, and stronger confidence when securing Container Apps deployments. Serverless Containers Posture is available as part of the Defender CSPM plan. Learn more at the Defender for Cloud documentation. General Availability: Confidential Compute for Azure Container Apps Confidential Compute in Azure Container Apps is now generally available, providing hardware-backed Trusted Execution Environments (TEEs) through workload profiles. This extends protection to data in use - in addition to data at rest and in transit - enabling teams to run higher-trust workloads with stronger isolation for sensitive data. With confidential computing now GA, Azure Container Apps becomes more viable for regulated, financial, healthcare, and other high-trust scenarios where organizations need hardware-enforced isolation that protects in-memory data, including from the underlying infrastructure. There is no extra charge for confidential compute workload profiles. Learn more at the Azure Confidential Computing documentation. Observability: HTTP Traffic Logs and OpenTelemetry Destinations Knowing what's happening inside your application is essential to running production workloads with confidence. 
At Build'26, we're announcing two enhancements that give teams deeper visibility and more flexibility in where they send telemetry. Monitor HTTP traffic in Azure Container Apps Azure Container Apps now adds a dedicated Azure Monitor diagnostic setting category - ContainerAppHTTPLogs - that exposes detailed HTTP access logs for incoming traffic. This capability is designed for high-volume request data, enabling teams to troubleshoot ingress and request-flow issues with much greater precision. With HTTP traffic logs, you can now investigate: Failed requests and error codes Latency patterns and outliers Retries and WebSocket disconnects Routing behavior and backend connectivity The result is faster issue resolution, less operational friction, and stronger confidence in running high-traffic, business-critical applications. Standard Azure Monitor log volume charges apply. Learn more at Azure Monitor pricing. Additional OpenTelemetry Destinations: New Relic, Dynatrace, Elastic Azure Container Apps enhances its managed OpenTelemetry (OTel) capabilities by expanding support for third-party observability platforms. This update introduces additional endpoint options for commonly used monitoring tools - New Relic, Dynatrace, and Elastic - extending the existing managed OpenTelemetry experience. Teams can now use a more consistent OpenTelemetry-based pipeline across Azure Monitor, Datadog, New Relic, Dynatrace, Elastic, and any OTLP-compatible endpoint, with less configuration overhead and more flexibility to route logs, metrics, and traces where they need them - without deploying or managing their own collectors. No extra charge applies. Learn more at the OpenTelemetry agents documentation. Additional Enhancements and Ecosystem Updates Beyond the headline announcements, Azure Container Apps continues to evolve with a steady cadence of improvements across the platform. 
Override Scale Rules in Azure Functions on Azure Container Apps Azure Functions on Container Apps has traditionally used platform-managed scaling, where triggers are automatically translated into KEDA scale rules. With the new allowScalingRuleOverride property, customers can now choose to override platform-managed scaling and define their own custom KEDA scaling rules. This enhancement is especially useful for scenarios where automatically generated KEDA rules lead to unintended scaling behavior, where workloads require custom thresholds or concurrency tuning, or where teams need standardized scaling policies across services. It works with any of the 60+ KEDA scalers - Service Bus, Kafka, PostgreSQL, HTTP concurrency, Cron, and more. Heroku Migration to Azure Container Apps With Heroku entering maintenance mode, Azure Container Apps is a natural landing zone for Heroku workloads. New guidance and tooling makes the migration path straightforward - from understanding why ACA is the right next step to a practical migration guide for hands-on implementation. Dapr v1.16 Platform Upgrade Azure Container Apps completed a staged platform upgrade to Dapr v1.16.4, bringing modernized actor scheduling, improved scalability for reminders, and updated TLS/security internals. The upgrade is fully platform-managed, with minimal customer action required for most workloads. Running AI Models on ACA Serverless GPUs The community continues to push the boundaries of what's possible with serverless GPUs on ACA. Recent highlights include running Gemma 4 with Ollama for fully private, self-hosted inference, and deploying ComfyUI for text-to-image and text-to-video workloads - all with scale-to-zero and per-second billing. Hosting Remote MCP Servers on ACA Azure Container Apps is emerging as the preferred platform for hosting Model Context Protocol (MCP) servers. 
With serverless scaling, idle billing, HTTP/1.1 and HTTP/2 support, and managed identity integration, ACA provides a production-ready environment for exposing tools and APIs to AI agents. Multiple tutorials and guides are now available for deploying MCP servers on ACA, including integration with Azure API Management. App Modernization with GitHub Copilot GitHub Copilot App Modernization can dramatically reduce the time required to modernize legacy applications and deploy them to ACA. A recent walkthrough demonstrated upgrading a classic ASP.NET MVC app on .NET Framework to .NET 10 and deploying it to Azure Container Apps in hours - with managed identity and Key Vault integration enabled by default. Azure Skills Repository for Container Apps The new Azure Skills repository includes comprehensive skills specifically for Azure Container Apps - covering troubleshooting, best practices, architecture patterns, security, deployment, and integration. These skills are designed to be used by AI agents and developer tools like GitHub Copilot CLI, providing rich context for building, deploying, and operating ACA workloads. It's another example of how the ACA ecosystem is evolving to be agent-native. Docker Compose for Agents Docker Compose for Agents on Container Apps (public preview) brings the familiar Compose workflow to agentic applications. Declare models, agents, and MCP tools in a single compose.yaml file and deploy unchanged from laptop to cloud - supporting LangGraph, Vercel AI SDK, Spring AI, CrewAI, and other frameworks. Learn more at the Compose for Agents documentation. What's Next Azure Container Apps is redefining how developers and agents build, deploy, and operate intelligent applications. With Sandboxes for secure ephemeral compute, Express for instant provisioning, a reimagined portal for streamlined management, and continued investment in security and observability - ACA provides the ideal foundation for the agentic era. 
The features announced at Build'26 deepen our commitment to making Azure Container Apps the platform where both humans and AI agents can ship production workloads with confidence, speed, and minimal operational overhead. Also, if you're at Build, come see us at the following sessions: Breakout 221: Idea to production-ready agent in seconds on AI-native runtime Demo 312: Multi-agents in action with 3 AI agents, 3 frameworks, tools & models Lab 580: Build and deploy reasoning agents with NVIDIA Nemotron and Foundry Lightning Talk 453: Building an End‑to‑End Enterprise AI Platform on Azure Or come visit us at the Azure Application Services booth #44. Visit our GitHub page for feedback, feature requests, or questions. Check out our roadmap to see what we're working on next. We look forward to hearing from you!
Introducing Azure Container Apps Express!
Three years ago, a 15-second cold start was industry-leading. Today, developers and AI agents expect sub-second. The speed bar has moved, and the tooling needs to move with it. After running Azure Container Apps for years, we've learned something important: for most developers, the ACA environment is an unnecessary construct. It adds provisioning time, configuration surface, and cognitive overhead — when all you really want is to run your app with scaling, networking, and operations handled for you. At the same time, a new class of workloads has emerged. Agent-first platforms — systems where AI agents deploy endpoints on demand, spin up tool-use APIs, and tear them down when work is done — demand an even more radical focus on speed and simplicity. Every second of provisioning delay is wasted agent productivity. Today, we're launching Azure Container Apps Express in Public Preview — the fastest, simplest way to go from a container image to an internet-reachable app on Azure, ready for many production-style workloads. What Is ACA Express? ACA Express removes the infrastructure decisions. There's no environment to provision, no networking to configure, no scaling rules to write. You bring a container image, Express handles everything else. Behind the scenes, Express runs your container on pre-provisioned capacity with sensible defaults baked in — so you skip environment setup without giving up ACA's serverless model. There's more coming in this space soon — keep watching. 
Here's what that means in practice:
- Instant provisioning: your app is running in seconds, not minutes
- Sub-second cold starts: fast enough for interactive UIs and on-demand agent endpoints
- Scale to and from zero: automatic, no configuration required (full scaling controls coming soon)
- Per-second billing: pay only for what you use
- Production-ready defaults: ingress, secrets, environment variables, and observability are built in

Express is purpose-built for two audiences: developers who want to ship fast (SaaS apps, APIs, web dashboards, prototypes) and agents that deploy on demand (MCP servers, tool-use endpoints, multi-step workflow APIs, human-in-the-loop UIs). If you've ever waited for an ACA environment to provision, only to realize you didn't need half of the configuration options it asked you for, Express is your answer.

What You Can Do Today

Note: West Central US is currently the only available region. We will expand to new regions over the coming days.

Express is in Public Preview starting today. It's a deliberate early ship: there's a meaningful feature gap compared to the existing Azure Container Apps offering, and we're filling it fast. New capabilities are landing on a rapid cadence throughout the preview, and by Microsoft Build in June, Express should be close to feature-complete. For the current list of supported features, known gaps, and what's on the way, see the Express documentation.

We'd rather put valuable technology in your hands early and iterate with you than wait behind closed doors for perfection.

Who Is Express For?
Scenario | Why Express
SaaS apps and APIs | Deploy and scale without infrastructure planning
AI app frontends | Chat UIs and copilot frontends that scale with usage spikes
MCP servers | Expose API endpoints for AI agents in seconds
Agent workflows | Spin up endpoints on demand, tear down when done
Prototypes and startups | Go from idea to production in minutes
Web dashboards | Internal tools with instant availability

Get Started

Express is available now in Public Preview. Try it:
- Azure Container Apps Express overview: concepts, capabilities, and the current feature support matrix
- Deploy your first app with the Azure CLI: step-by-step quickstart
- New Azure Container Apps Portal: create and manage Express apps alongside your existing Container Apps resources

Have questions? Check the Azure Container Apps Express FAQ for answers to common questions about pricing, limits, regions, and the road to GA.

We're building Express in the open and we want to hear from you. Tell us what features matter most, what works, and what doesn't. Reach out on the Azure Container Apps GitHub or in the comments below.