Shaping what Azure SRE Agent does: Tool Permissions and Hooks
When an AI agent runs against production, the first question every security team asks is "What can it do, who decided it could, and what stops it from doing something it should not?" Azure SRE Agent reached general availability in March. Since then, teams inside Microsoft and customers running it against real production workloads have asked for the same thing: finer-grained controls over what the agent can do on its own and a clear answer to who governs each call that reaches a tool. Today at Build 2026, we are releasing global tool access policies as one of a set of new governance controls. This post covers how they work.

Tool access policies give security and platform teams a single place to define which tools the agent can invoke, under what conditions, and what requires human approval before it runs. Underneath those policies sits the identity the agent runs as, the bedrock that every other control layer depends on. It is defense in depth applied to agent behavior: layers of control, each one holding on its own, so that governing the agent is something you can read, audit, and reason about as you scale it across production.

Identity is the bedrock: managed identity today, agent identity next

Start here, because nothing else matters if you skip it. The identity the SRE Agent runs as, and the Azure RBAC role assignments on that identity, are the most powerful boundary the agent works inside of. If your role assignments do not grant the agent access to a resource, none of the controls below come into play, because the agent cannot reach the resource to begin with. Network rules, tool permissions, hooks, and connector contracts all sit on top of an RBAC story that you write. The features in this post add layers above that floor. They do not replace it.

Today the SRE Agent operates as a managed identity, and your RBAC role assignments on that identity govern what it can do. This is the bedrock, and it is the same model your other Azure workloads already use.
You assign roles, you scope them, and the agent inherits exactly what you granted and nothing more. Everything that follows assumes the bedrock is in place. With identity settled, the next question is the obvious one: where is the agent allowed to send its traffic?

Permissions: govern what the agent does with a tool

Identity decides what the agent can reach. Permissions decide what the agent does with the access it has, down to the individual tool. Two levels cover the range: a point-and-click grid for the common cases, and hooks when a decision needs your own code.

The grid is the easy mode. Every tool the agent can use, built-in tools along with MCP servers, services, and custom tools, shows up in one searchable list with two switches. On/Off sets whether the tool is available at all; turn it off and the agent cannot use it. Allow/Ask sets what happens when it is on: Allow lets the agent run the tool automatically, Ask requires a human to approve every time, except in Autonomous mode. Select tools in bulk to flip a whole category at once, filter by category or permission, and use the Advanced permissions tab when you want rules that apply at global, per-agent, or per-thread scope instead of tool by tool. Defaults stay put until you touch them, and the engine is fail-closed: if a rule cannot be evaluated, the call is blocked rather than allowed. That covers most of what teams need.

Underneath those switches are three rules, allow, ask, and deny, and the Advanced tab is where you set them by scope. Global rules apply to every agent and thread, Agent rules to one custom agent, Thread rules to a single conversation. Deny is the hard one: it blocks the tool outright no matter the run mode, and a deny at a higher scope always wins, so an Allow at thread scope cannot reopen something denied globally. That split is deliberate.
A platform team sets the Global guardrails that should never be crossed and the Asks that always need a human, and service teams add their own Allow rules at Agent scope for routine work, without being able to override the guardrails above them.

Platform team, Global scope:
    deny: bash(az * delete *)         - never delete, on any agent or thread
    deny: bash(kubectl delete *)
    ask: bash(az webapp restart *)    - always confirm, even in Autonomous
    allow: bash(az monitor *)         - auto-approve monitoring queries

Service team, Agent scope:
    allow: bash(kubectl get *)        - routine read-only work
    allow: bash(kubectl describe *)

Two details make this safe to lean on. Rules match the canonicalized tool invocation rather than the raw text, so enforcement holds no matter how the command was assembled. And fail-closed has a softer edge than a hard stop: a cached last-known-good policy covers transient failures, so a blip in the policy store blocks the call rather than silently widening access.

You can find these under Capabilities > Tools missions.

The layer worth spending time on is hooks. Allow and Ask answer "should this tool run." Hooks answer "should this specific call run, given exactly what it is about to do." A hook fires before the agent runs a tool and receives the actual call, parameters and all. Your code then decides the outcome and can reshape it: rewrite parameters before they are sent, inject extra context into the pipeline as a user message so the agent reconsiders before its next step, block the call outright, or redirect the agent toward a safer path. Because your code sees the real parameters, the decision can depend on anything you can express in code: which resource the call targets, whether a value falls outside an allowed range, the time of day, the result of an external policy lookup. This is where you write the rule the grid cannot.

Two kinds of hook, mixable on the same agent. Command hooks are a script you write; reach for these when code is enough.
Prompt hooks put a separate LLM in the loop as a judge that evaluates the call in context; reach for these when the decision needs reasoning rather than a fixed rule.

A real example from our own internal test agent: when the agent tries to list files through the shell with ls or dir, a hook blocks the call. The agent absorbs the signal, reconsiders, and reaches for the ListDir tool instead. The hook did not argue with a human. It shaped what happened next. As with the grid, configure nothing and the agent behaves exactly as it does today. Both are additive.

Authoring one is a short form. You name the hook, pick the event (Pre Tool Use, so it runs before the call), and set a tool matcher, either picked from the tool menu or written as a regex like (FetchWebpage|SearchMemory) with anchors and lookaheads when you need them, so the hook fires only on the calls you care about. You set a timeout and a fail mode (Block, so a hook that errors or hangs stops the call rather than waving it through), and you write the body in Bash or Python.

A command hook reads the call as JSON on stdin (the event name, the tool name, its parameters, and the call id) and answers on stdout. Print nothing and exit zero to allow. Return a block decision with a reason to stop the call, and that reason is what the agent reads back. You can also substitute: run a cheaper or safer version yourself, block the real call, and hand your own output back as the result, so the agent never runs the expensive or risky original.

    #!/bin/bash
    input=$(cat)
    tool=$(echo "$input" | jq -r '.tool_name')

    # Block one tool, with a reason the agent will read
    if [ "$tool" = "ExampleToolName" ]; then
      echo '{"decision":"block","reason":"Blocked ExampleToolName by hook policy."}'
      exit 0
    fi

    # Otherwise allow: print nothing and exit 0
    exit 0

You can find these under Builder > Hooks

Each layer holds on its own

The layers stack. Identity is the floor: your RBAC assignments decide what the agent can reach at all.
Permissions, the grid and hooks together, decide what it does with a tool. You author each layer, each one holds whether or not the layer above it behaves as expected, and all of it configures through the same ARM and Bicep surface your platform team already uses, reproducible the way the rest of your Azure estate is.

The upgrade path is additive and non-breaking. Existing agents keep working. Turn on each control when you are ready, in the order your governance requires.

There is more coming. We run Azure SRE Agent inside Microsoft on our own production workloads, so we feel the same gaps you do, and the next round is shaped by what we hear from teams running it in production today. Which control is doing the most for you, and which one are you still waiting on? Let us know and thank you!

Getting started
Create new SRE Agent: https://aka.ms/sreagent
SRE Agent Documentation: https://aka.ms/sreagent/newdocs
SRE Agent recipes: https://aka.ms/sreagent/recipes
Build 2026 Announcement: https://aka.ms/Build26/blog/SREAgent

Reference Architecture for a High Scale Moodle Environment on Azure
Introduction

Moodle is an open-source learning platform that was developed in 1999 by Martin Dougiamas, a computer scientist and educator from Australia. Moodle stands for Modular Object-Oriented Dynamic Learning Environment, and it is written in PHP, a popular web programming language. Moodle aims to provide educators and learners with a flexible and customizable online environment for teaching and learning, where they can create and access courses, activities, resources, and assessments. Moodle also supports collaboration, communication, and feedback among users, as well as various plugins and integrations with other systems and tools.

Moodle is widely used around the world by schools, universities, businesses, and other organizations, with over 100 million registered users and 250,000 registered sites as of 2020. Moodle is also supported by a large and active community of developers, educators, and users, who contribute to its development, documentation, translation, and support. [URL] is the official website of the Moodle project, where anyone can download the software, join the forums, access the documentation, participate in events, and find out more about Moodle.

Goal

The goal for this architecture is to have a Moodle environment that can handle 400k concurrent users and scale its application resources in and out according to usage. Using Azure managed services to minimize operational burden was a design premise, because standard Moodle reference architectures are based on Virtual Machines, which come with a heavy operational cost.

Challenges

Because Moodle is a monolithic application, scaling it in a modern cloud native environment is challenging.
We chose Kubernetes as the compute platform because it allows us to build Moodle as an immutable artifact that scales out and in quickly and automatically, and that recovers from potential failures by simply recreating its Deployments, with no Virtual Machine resources to maintain. This introduces the concept of pets vs cattle[1] to a scenario where at first glance it wouldn't be feasible.

Since Moodle is written in PHP, it has no concept of database connection pooling, so its underlying database is heavily impacted by new client requests. That makes an external pooling solution necessary, and it had to be custom tailored to handle the number of connections in a heavy-traffic setup like this, instead of using Azure Database for PostgreSQL's built-in pgbouncer. The same effect is also observed in its Redis implementation, where a custom Redis cluster had to be created, because using Azure Cache for Redis would incur prohibitive costs due to the way it is set up for more general usage.

1 - https://learn.microsoft.com/en-us/dotnet/architecture/cloud-native/definition#the-cloud

Architecture

This architecture uses Azure managed (PaaS) components to minimize operational burden: Azure Kubernetes Service to run Moodle, Azure Storage Account to host course content, Azure Database for PostgreSQL Flexible Server as its database, and Azure Front Door to expose the application to the public and to cache commonly used assets. The solution also leverages Azure Availability Zones to distribute its components across different zones in the region to optimize its availability.

Provisioning the solution

The provisioning has two parts: setting up the infrastructure and the application. The first part uses Terraform to deploy easily. The second part involves creating Moodle's database and configuring the application for optimal performance based on the templates, number of users, etc.
and installing templates, courses, plugins etc. The following steps walk you through all tasks needed to have this job done.

Clone the repository

    $ git clone https://github.com/Azure-Samples/moodle-high-scale

Provision the infrastructure

    $ cd infra/
    $ az login
    $ az group create --name moodle-high-scale --location <region>
    $ terraform init
    $ terraform plan -var moodle-environment=production
    $ terraform apply -var moodle-environment=production
    $ az aks get-credentials --name moodle-high-scale --resource-group moodle-high-scale

Provision the Redis Cluster

    $ cd ../manifests/redis-cluster
    $ kubectl apply -f redis-configmap.yaml
    $ kubectl apply -f redis-cluster.yaml
    $ kubectl apply -f redis-service.yaml

Wait for all the replicas to be running

    $ ./init.sh

Type 'yes' when prompted.

Deploy Moodle and its services

Change image in moodle-service.yaml and also adjust the moodle data storage account name in the nfs-pv.yaml (see commented lines in the files)

    $ cd ../../images/moodle
    $ az acr build --registry moodlehighscale<suffix> -t moodle:v0.1 --file Dockerfile .
    $ cd ../../manifests
    $ kubectl apply -f pgbouncer-deployment.yaml
    $ kubectl apply -f nfs-pv.yaml
    $ kubectl apply -f nfs-pvc.yaml
    $ kubectl apply -f moodle-service.yaml
    $ kubectl -n moodle get svc --watch

Provision the frontend configuration that will be used to expose Moodle and its assets publicly

    $ cd ../frontend
    $ terraform init
    $ terraform plan
    $ terraform apply

Approve the private endpoint connection request from Front Door in the moodle-svc-pls resource: Private Link Services > moodle-svc-pls > Private Endpoint Connections > select the request from Front Door and click Approve.
Install database

    $ kubectl -n moodle exec -it deployment/moodle-deployment -- /bin/bash
    $ php /var/www/html/admin/cli/install_database.php --adminuser=admin_user --adminpass=admin_pass --agree-license

Deploy Moodle Cron

Change image in moodle-cron.yaml

    $ cd ../manifests
    $ kubectl apply -f moodle-cron.yaml

Your Moodle installation is now ready to use!

Conclusion

You can create a Moodle environment that is scalable and reliable in minutes with a very simple approach, without having to deal with the hassle of operating its parts that normally comes with standard Moodle installations.

Designing for High Availability: The Operational Reference for Running a Geo-Replicated ACR
By Johnson Shi, Zoey (Zhuyu) Li, Huangli Wu

Introduction

Three of the most common questions we hear from enterprise teams running geo-replicated Azure Container Registries (ACR) are:

"How do I control which region serves my traffic?" When my AKS clusters are spread across regions, can I pin each one to its co-located replica, or am I stuck with however the global endpoint routes?

"What happens during a regional incident: is failover automatic or do I have to act?" If the registry in one region degrades, does the global endpoint reroute on its own, or do I need to manually disable the affected replica?

"What happens after the region recovers: does traffic return on its own?" Is there a cooldown, a quarantine, or any manual step before failback?

We answer those head-on, then go deeper on the operational details that come up when you actually run a geo-replicated registry: authentication across endpoint switches, throttling under load concentration, eventual-consistency failure modes, home region outage scope, webhooks, and private endpoint interaction. We draw on the official geo-replication docs, the global endpoint health-aware failover blog, the regional endpoints engineering design implementation, the regional endpoints public preview and private preview announcements, and the ACR reference for various registry endpoints. This post also draws on notes from the ACR product team about roadmap items that aren't yet documented elsewhere.

Key Takeaways

Health-aware failover is automatic. When the registry in a region degrades, the global endpoint reroutes away from it on the order of minutes, evaluated per-registry. No customer action required. Failback is automatic too. Once health-aware failover marks a region healthy again, the global endpoint resumes routing to it. There is no cooldown period. Health-aware failover applies only to global endpoint operations.
It does not apply to regional endpoints (you're talking to one replica, period) or to dedicated data endpoints (the redirect is per-region). Health-aware failover is not triggered by throttling. It responds to regional ACR service health and Azure infrastructure health, not HTTP 429 responses. Use regional endpoints to manage per-replica throttling. Regional endpoints (Step 2a) give you explicit per-region URLs for workloads that need affinity, capacity planning, push/pull consistency, troubleshooting, or client-side failover. Use myregistry.<region>.geo.azurecr.io . Regional endpoints are available on Premium SKU registries. For workloads that don't need pinning, do nothing (Step 2b). The global endpoint plus health-aware failover handles routing automatically. Re-authenticate when switching endpoints. Each global or regional endpoint is its own authenticated surface; re-auth via az acr login , SDK auth, or the Kubernetes ACR credential provider on endpoint change. Don't run a long-lived DNS cache for the global endpoint. ACR purges DNS server-side on disable and during failover; a long-lived client cache works against that. For production workloads, enable dedicated data endpoints for security and DNS predictability on layer downloads. ACR is working on bounded staleness consistency for cross-replica eventual-consistency failure modes; see the FAQ. Background What is ACR geo-replication? Geo-replication is a Premium SKU feature that turns a single ACR registry into a multi-region, multi-write service. Every geo-replica in every region is writable — you can push, pull, and delete from any of them — and content syncs asynchronously between replicas under an eventual consistency model. Per-push replication time scales with the size and number of images being pushed. Similarly, when creating a new geo-replica, the time to populate the new geo-replica scales with the total size of the registry. 
A geo-replicated registry exposes a global endpoint at myregistry.azurecr.io . Behind that endpoint, ACR uses an internal traffic manager to direct each request to the replica with the best network performance profile for the caller — usually the closest replica, but not always. When clients are equidistant from multiple replicas, or when the closest replica is experiencing Azure infrastructure degradation, requests may be routed elsewhere. A geo-replicated registry also exposes a regional endpoint at myregistry.<region>.geo.azurecr.io , which allows clients to pin API requests to a specific geo-replica in lieu of global endpoints, which has Azure-managed routing among geo-replicas. Zone redundancy is always enabled for geo-replicas in regions where Azure has multiple availability zones — in those regions, ACR automatically spreads replica data across multiple availability zones within each region to protect against zonal outages. Endpoints and data endpoints: what goes where A common point of confusion: when you push or pull, not every request goes to the same place. The registry endpoints (global endpoint and regional endpoints), as well as the data endpoint, do different jobs. Your choice of data endpoint configuration has real consequences for security and resilience. Two kinds of traffic flow during a typical pull: Registry API traffic — authentication, manifest reads/writes, tag resolution, referrers, repository operations, blob location lookups, listing, metadata. This is everything except the actual layer (blob) bytes. All these API requests go to the global endpoint ( myregistry.azurecr.io ) or, if you've pinned your clients to call these APIs to a specific geo-replica, a geo-replica's regional endpoint ( myregistry.<region>.geo.azurecr.io ). Behind the scenes, the global endpoint internally proxies these requests to a specific geo-replica. Layer (blob) downloads — when the client asks for a blob, the registry doesn't serve the bytes itself. 
It returns an HTTP 307 redirect to a regional data endpoint (separate endpoint from the global endpoint or regional endpoints), and the client follows the redirect to download the layer from that region.

Where that 307 sends you depends on whether you've enabled the registry's dedicated data endpoints feature:

| Configuration | Layer downloads redirect to |
|---|---|
| Default (no dedicated data endpoints) | *.blob.core.windows.net (the underlying Azure storage account) |
| Dedicated data endpoints enabled | myregistry.<region>.data.azurecr.io for the region you were routed to |
| Private endpoints enabled | myregistry.<region>.data.azurecr.io for the region you were routed to |

Regional by design. Dedicated data endpoints always land you on a specific geo-replica's data endpoint; there is no "global data endpoint." With the global endpoint as your registry endpoint, the 307 redirect picks the data endpoint for whichever region the global endpoint chose to serve you. With a regional endpoint pinned to a specific region, the 307 always redirects you to that same region's data endpoint, never cross-region.

Why dedicated data endpoints matter. Dedicated data endpoints are a Premium SKU feature that exists primarily to address security and firewall scoping. By default, layer downloads redirect to *.blob.core.windows.net, a wildcard storage FQDN. Firewall rules to allow that wildcard either let all Azure storage accounts through or none of them, which raises data exfiltration concerns and isn't tightly scoped to your registry. Dedicated data endpoints replace the wildcard with a fully qualified domain in your registry's own domain (myregistry.<region>.data.azurecr.io), so firewall rules can be scoped tightly to your specific registry, in your specific regions. That same design choice can also make layer downloads more predictable during routing changes.
With dedicated data endpoints, the data endpoint FQDN is known ahead of time and lives in the registry's domain — one predictable hostname per region, configured once. Without them, the layer download has to resolve a wildcard storage FQDN that points to whichever storage account the registry happens to have provisioned, which is a separate DNS resolution path with its own routing behavior and its own caching profile. Dedicated data endpoints simplify the DNS picture by aligning the data path with the registry path and keeping the entire pull experience inside one set of predictable, scoped FQDNs. For any geo-replicated registry where security and high availability matter, enable dedicated data endpoints. Note: Health-aware failover applies only to operations against the global endpoint, not to regional endpoints or dedicated data endpoints. Take note that health-aware failover only kicks in and directs traffic away from a geo-replica when an Azure region is experiencing significant infrastructure degradation. At this stage, it does not kick in to redirect traffic to another geo-replica if a client's data plane API requests are throttled. See the relevant section below for the full scope when health-aware auto failover kicks in or not. The three traffic control tools ACR geo-replication gives you three complementary tools for controlling where traffic lands. Each one solves a different class of problem, and customers most often run into trouble when they reach for the wrong one. 
We name them up front and use these names throughout the post:

| Tool | Who controls it | What it does | Use cases |
|---|---|---|---|
| Health-aware failover | Platform (automatic) | Reroutes the global endpoint away from a region whose registry can't reliably serve requests | Regional incidents, automatic recovery |
| Replica enable/disable for global routing | Customer (manual) | Excludes a specific replica from global endpoint routing without deleting it; data continues syncing | DR rehearsals, planned maintenance, quarantining a replica without losing it |
| Regional endpoints | Customer (per request) | Dedicated per-region URLs (myregistry.<region>.geo.azurecr.io) that bypass the internal traffic manager entirely | Pinning AKS clusters to co-located replicas, push/pull consistency, capacity planning, troubleshooting, client-side failover |

Health-aware failover and replica enable/disable both act on the global endpoint. Regional endpoints are a separate URL surface that coexists with the global endpoint; enabling them does not disable the global endpoint myregistry.azurecr.io. You can use both simultaneously and choose per workload.

The behavior in question

When the registry in one region experiences a real degradation, there are three possible answers to "what happens?":

(A) Nothing automatic. The customer must manually disable the affected region's endpoint to stop traffic from being routed there.
(B) The system detects the regional front-door failure and reroutes within seconds.
(C) A per-registry health evaluation detects the degradation and reroutes the global endpoint within minutes, with no customer action. After the region recovers, routing resumes automatically.

The answer today is (C). Before health-aware failover, customers were stuck closer to (A): the system could see whether the regional reverse proxy responded, but not whether the registry could actually serve real pull and push traffic end to end. Health-aware failover closes that gap.
We walk through all three tools in the next section, in order: setting up geo-replication, using regional endpoints to pin specific workloads, keeping the global endpoint for everything else, the manual replica disable mechanism, re-enabling participation in global routing, and what to expect when health-aware failover triggers.

Walkthrough

The following steps assume an existing Premium SKU registry and the Azure CLI logged in. We use myregistry as the registry name, myrg as the resource group, and eastus as the home region. Substitute <your-registry>, <your-rg>, and <your-region> for your environment.

Prerequisites

A Premium SKU ACR registry (geo-replication requires Premium)
Azure CLI (az) installed and logged in
For regional endpoints (Step 2a): Azure CLI 2.86.0 or later. All regional endpoints commands (--regional-endpoints, az acr show-endpoints, az acr login --endpoint) are available natively in Azure CLI 2.86.0+. If you previously installed the acrregionalendpoint private preview CLI extension, uninstall it with az extension remove --name acrregionalendpoint to prevent conflicts with the built-in CLI commands.

Step 1: Add a West US replica to a registry that lives in East US

Geo-replication requires the Premium SKU. The create call below fails on Basic or Standard.

    # Confirm the registry is Premium
    az acr show --name myregistry --resource-group myrg \
      --query sku.name --output tsv
    # Premium

    # Create a West US geo-replica
    az acr replication create --registry myregistry --location westus

    # Confirm both replicas are present
    az acr replication list --registry myregistry --output table

    NAME    LOCATION    PROVISIONING STATE    STATUS    REGION ENDPOINT ENABLED
    ------  ----------  --------------------  --------  -----------------------
    eastus  eastus      Succeeded             online    True
    westus  westus      Succeeded             online    True

Pushes and pulls continue working through the existing replica throughout initial sync.
Because the registry is multi-region, multi-write, the existing replica keeps serving traffic while the new replica catches up in the background. Initial replica seeding time is a function of registry size — the total number and cumulative size of images already in the registry that need to be replicated to the new replica — not the size of any single image. Step 2a: Pin workloads to specific regions using regional endpoints Use regional endpoints when a workload needs explicit per-region control. The five common cases: Regional affinity — an AKS cluster in East US should pull from the East US replica, every time, without ever hopping to a more distant replica because of a network performance fluctuation. Predictable routing — workloads that need to know exactly which replica will serve them, for benchmarking, capacity planning, or in-region traffic SLAs. Push/pull consistency — pinning both ends of a publish-then-deploy flow to the same replica eliminates eventual-consistency races. Troubleshooting — reproducing an issue on a specific replica requires sending traffic to that specific replica. Client-side failover — customers with their own health checks and business rules want to implement failover on their own terms, on signals only they can see. Enable regional endpoints on the registry: az acr update -n myregistry -g myrg --regional-endpoints enabled When enabled, ACR automatically creates per-region login server URLs for every existing geo-replica. No per-region configuration is needed. Note: Regional endpoints can be enabled on any Premium SKU registry, even without geo-replication. A registry without geo-replication has a single geo-replica in the home region, which gets one regional endpoint URL. However, the feature is most useful when your registry has at least two geo-replicas, where you can pin different workloads to different replicas for routing control and capacity distribution. 
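The regional affinity case above comes down to choosing a login server per cluster. Below is a minimal Python sketch of that choice, assuming you already know which regions have replicas. The helper name `pinned_image_reference` and the fallback to the global endpoint when no co-located replica exists are illustrative assumptions of this sketch, not ACR behavior; only the two hostname patterns come from this post.

```python
from typing import Iterable


def pinned_image_reference(registry: str, repository: str, tag: str,
                           cluster_region: str,
                           replica_regions: Iterable[str]) -> str:
    """Build an image reference for a cluster in `cluster_region`.

    Uses the regional endpoint when a replica exists in the cluster's region.
    Falling back to the global endpoint otherwise is a choice made by this
    sketch (hypothetical policy), not something ACR does for you.
    """
    if cluster_region in set(replica_regions):
        login_server = f"{registry}.{cluster_region}.geo.azurecr.io"
    else:
        login_server = f"{registry}.azurecr.io"
    return f"{login_server}/{repository}:{tag}"


if __name__ == "__main__":
    replicas = ["eastus", "westeurope"]
    for region in ["eastus", "westeurope", "japaneast"]:
        print(region, "->",
              pinned_image_reference("myregistry", "myapp", "v1", region, replicas))
```

A helper like this could feed a manifest templating step, so each cluster's deployment references its own region's replica without hand-edited URLs.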
Push to a specific region using its regional endpoint:

    # Log in to the West US regional endpoint
    az acr login --name myregistry --endpoint westus

    # Tag and push using the regional endpoint URL
    docker tag myapp:v1 myregistry.westus.geo.azurecr.io/myapp:v1
    docker push myregistry.westus.geo.azurecr.io/myapp:v1

Pin AKS deployments to their co-located replica by using regional endpoint URLs in the deployment manifest. The example below shows two clusters in different regions; each cluster references the regional endpoint for its own region's replica (assuming replicas exist in both eastus and westeurope):

    # East US-based AKS cluster pulls from the East US replica
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-eastus
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.eastus.geo.azurecr.io/myapp:v1
    ---
    # West Europe-based AKS cluster pulls from the West Europe replica
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-westeurope
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.westeurope.geo.azurecr.io/myapp:v1

This eliminates cross-region pulls when global routing would otherwise prefer a different replica for a given client, and it gives you a per-region traffic profile you can plan capacity against.

Regional endpoint operational tips

View all endpoints. Use az acr show-endpoints to see all endpoint URLs for your registry: global, regional (if enabled), and dedicated data endpoints (if enabled):

    az acr show-endpoints --name myregistry --resource-group myrg

Import from a specific geo-replica. When importing images between registries, you can use a regional endpoint to import from a specific geo-replica of the source registry. This is useful when you want predictable network paths or need to import from a replica in a specific region:

    az acr import \
      --name mydownstreamregistry \
      --source myupstreamregistry.westeurope.geo.azurecr.io/myapp:v1 \
      --image myapp:v1

Firewall rules for regional endpoints.
If you use firewall rules, allow access to the following endpoints for each geo-replica that clients connect to:

| Endpoint | Purpose |
|---|---|
| myregistry.<region>.geo.azurecr.io | Regional endpoint for registry operations |
| myregistry.azurecr.io | Global endpoint (if also used) |
| myregistry.<region>.data.azurecr.io | Layer downloads (if using private endpoints or dedicated data endpoints) |
| *.blob.core.windows.net | Layer downloads (if not using private endpoints or dedicated data endpoints) |

For the full list of endpoint types and FQDN patterns, see the ACR reference for various registry endpoints.

DNS-based routing without changing manifests. If you don't want to maintain different deployment manifests per region, you can keep all manifests pointing to the global endpoint (myregistry.azurecr.io) and use software-defined networking or a regional traffic manager to resolve the global endpoint to the appropriate regional endpoint based on the originating region's traffic. This achieves the same co-location goals as regional endpoints (predictable routing and reduced latency) without embedding region-specific URLs in your deployment manifests.

Step 2b: Keep using the global endpoint for everything else

For workloads that don't need explicit pinning, do nothing. The global endpoint at myregistry.azurecr.io continues to work exactly as before, and the global endpoint plus health-aware failover gives you intelligent routing across replicas without configuration. ACR picks the best replica for each client based on network performance and reroutes during regional incidents.

Regional endpoints coexist with the global endpoint; enabling them does not disable myregistry.azurecr.io. You can use both simultaneously and choose per workload, mixing pinned workloads (Step 2a) with workloads that ride the global endpoint (Step 2b) in the same registry.
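The firewall table earlier in this section can be turned into a per-registry allowlist mechanically. The following Python sketch shows one way to do that, under stated assumptions: the function name `firewall_allowlist` and its parameters are hypothetical, and the flag `data_endpoints_in_use` stands for "private endpoints or dedicated data endpoints are enabled", which is the condition the table uses. It only builds strings from the FQDN patterns listed above; it does not query Azure.

```python
from typing import Iterable, List


def firewall_allowlist(registry: str, regions: Iterable[str],
                       data_endpoints_in_use: bool,
                       include_global: bool = True) -> List[str]:
    """Return the FQDNs a firewall would need to allow for one registry.

    `data_endpoints_in_use` should be True when private endpoints or
    dedicated data endpoints are enabled (hypothetical parameter name).
    """
    fqdns: List[str] = []
    if include_global:
        # Global endpoint, only needed if clients also use it
        fqdns.append(f"{registry}.azurecr.io")
    for region in regions:
        # Regional endpoint for registry operations
        fqdns.append(f"{registry}.{region}.geo.azurecr.io")
        if data_endpoints_in_use:
            # Layer downloads stay inside the registry's own domain
            fqdns.append(f"{registry}.{region}.data.azurecr.io")
    if not data_endpoints_in_use:
        # Layer downloads go to the wildcard storage FQDN
        fqdns.append("*.blob.core.windows.net")
    return fqdns


if __name__ == "__main__":
    for fqdn in firewall_allowlist("myregistry", ["eastus", "westus"], True):
        print(fqdn)
```

The contrast between the two branches illustrates the scoping point made earlier: with dedicated data endpoints every entry is specific to the registry, while the default configuration needs the storage wildcard.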
Step 3: Take a replica out of global endpoint routing Use this when you need to keep a replica alive but stop it from serving global-endpoint traffic — for DR rehearsals, planned maintenance, or troubleshooting an isolated replica. # Exclude the West US replica from global endpoint routing az acr replication update --registry myregistry --name westus \ --global-endpoint-routing false Confirm the change: az acr replication list --registry myregistry --output table NAME LOCATION PROVISIONING STATE STATUS REGION ENDPOINT ENABLED ------ ---------- -------------------- -------- ----------------------- eastus eastus Succeeded online True westus westus Succeeded online False Requests to myregistry.azurecr.io no longer route to West US. The replica still receives replicated content — and continues to replicate its own content out to other replicas — and storage quota and per-replica costs continue to accrue. If regional endpoints are enabled, the West US regional endpoint URL also continues to work; --global-endpoint-routing controls only the replica's participation in global endpoint routing. A note on naming. The CLI flag --global-endpoint-routing (on az acr replication update ) and the regional endpoints feature (enabled via az acr update --regional-endpoints enabled ) are two different things despite the similar names. --global-endpoint-routing controls whether a replica participates in global endpoint routing. The regional endpoints feature creates per-region URLs ( myregistry.<region>.geo.azurecr.io ) that bypass the global endpoint entirely. They are independent controls. In Azure CLI 2.86.0 and later, the old --region-endpoint-enabled flag has been renamed to --global-endpoint-routing . The old flag name is deprecated and will be removed in Azure CLI 2.87.0 (June 2026). If you have existing scripts or automation that use --region-endpoint-enabled , update them to use --global-endpoint-routing . 
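For scripts that still use the deprecated flag name, the rename is a plain text substitution. A minimal sketch, assuming the flag appears literally in your scripts; rename_flag is a hypothetical helper that filters stdin to stdout so you can review the result before overwriting anything.

```bash
#!/usr/bin/env bash
# Rewrite the deprecated flag name to the new one.
# rename_flag is a hypothetical helper; it reads stdin and writes stdout.
rename_flag() {
  sed 's/--region-endpoint-enabled/--global-endpoint-routing/g'
}

echo 'az acr replication update --registry myregistry --name westus --region-endpoint-enabled false' \
  | rename_flag
```

To see which files are affected before changing them, grep -rl -- '--region-endpoint-enabled' . lists every file under the current directory that still contains the old flag.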
CLI flags quick reference:

| Flag | Scope | Purpose |
| --- | --- | --- |
| --regional-endpoints | Registry-level (az acr create or az acr update) | Enables dedicated regional endpoint URLs (myregistry.<region>.geo.azurecr.io) for all geo-replicas. |
| --global-endpoint-routing | Per-geo-replica (az acr replication create or az acr replication update) | Controls whether the global endpoint routes traffic to a specific geo-replica. Set to false to temporarily exclude a geo-replica from global routing. |
| --data-endpoint-enabled | Registry-level (az acr create or az acr update) | Enables dedicated data endpoints (myregistry.<region>.data.azurecr.io) for layer blob downloads. Auto-enabled when at least one private endpoint is configured. |

This bidirectional sync during disable is intentional. When you re-enable the replica, every image pushed to the registry while the replica was disabled, from any region, is already present, so the replica can serve traffic immediately with no catch-up window. If we stopped syncing on disable, re-enabling would leave the replica with stale data and force a long catch-up before it could safely serve pulls.

Step 4: Re-enable the replica to participate in global endpoint routing

Re-enable the replica:

    az acr replication update --registry myregistry --name westus \
      --global-endpoint-routing true

    NAME    LOCATION  PROVISIONING STATE  STATUS  REGION ENDPOINT ENABLED
    ------  --------  ------------------  ------  -----------------------
    eastus  eastus    Succeeded           online  True
    westus  westus    Succeeded           online  True

There is no cooldown. The global endpoint resumes routing requests to the West US replica as soon as the change takes effect on ACR's side. Because data continued syncing while the replica was disabled (Step 3), the replica is immediately ready to serve pulls, with no catch-up window.

Note on DNS during disable/enable.
When you take a replica out of global routing, ACR purges its own DNS records for that replica from the global endpoint on a fast path — there is no waiting on a published TTL on ACR's side. If clients run their own DNS cache for the global endpoint, however, those clients will keep resolving to the disabled replica until the client cache expires. We can't control client-side caches. The recommendation: do not run a long-lived DNS cache for the global endpoint. A short-lived DNS pin for the duration of a single push (covered in the DNS and Client-Side Considerations section) is fine and even helpful — but a long-lived DNS cache will make --global-endpoint-routing false look broken from the client's perspective. Step 5: What to expect when health-aware failover triggers Health-aware failover is automatic. ACR evaluates registry health on a per-registry basis, and when a registry in a region can't reliably serve requests, the global endpoint reroutes that registry's traffic to a healthy replica. There is no customer-invocable trigger — that's the point. End-to-end timing is on the order of minutes — fast enough to catch real regional degradation, slow enough to ride out transient errors that resolve on their own. DNS TTL may add additional propagation delay before all clients switch to the new region. Scope of health-aware failover. Health-aware failover applies only to operations against the global endpoint — the registry API calls (auth, get manifest, get tag, get referrers, get blob location). It evaluates health when those API calls come in; it does not trigger mid-operation. Two important consequences: Regional endpoints are not in scope. When you talk to a regional endpoint like myregistry.westus.geo.azurecr.io , you're talking to that one replica. There is no automatic reroute. If you've pinned a workload to a regional endpoint and that region degrades, you implement client-side failover by switching the workload to a different regional endpoint. 
Dedicated data endpoints are not in scope. Once a registry endpoint has redirected you to a dedicated data endpoint, you stay on that region's data endpoint for the duration of the layer download. There is no automatic reroute of an in-flight blob download. The region targeted by the redirect is decided up front by whichever registry endpoint served the blob-location call: the global endpoint chooses based on its per-registry health evaluation, and a regional endpoint always targets its own region. The signals you can use to confirm a failover is in progress: # Check replication status az acr replication list --registry myregistry --output table You can also check Resource Health for the registry in the Azure portal — navigate to your registry and select Resource health under the Help section to see platform-side degradation signals. You'll typically see: Increased pull latency as traffic shifts to a more distant replica Resource Health flagging known issues in the affected region Replication status indicating which replicas are online After the region recovers, the per-registry health evaluation marks it healthy again and the global endpoint resumes routing — automatic, no cooldown, no customer action. Note that health is evaluated per registry, not per region: if a degradation affects only a subset of registries in a region, only those registries are rerouted, and other registries in the same region continue to be served locally with no unnecessary latency penalty. Not triggered by throttling. Health-aware failover is DNS-based and responds to regional ACR service health and Azure infrastructure health. It does not reroute traffic based on HTTP 429 (throttling) responses. If a geo-replica is throttling your requests but the region's infrastructure is healthy, the global endpoint continues routing you to that geo-replica. To manage throttling, use regional endpoints to spread workloads across multiple geo-replicas for better capacity distribution. 
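Because regional endpoints sit outside the scope of health-aware failover, a workload pinned to one has to supply its own fallback. One way to sketch that client-side failover is to try an ordered list of regional endpoints until one responds. The helper name first_working_endpoint and the probe-command convention are assumptions made for illustration; in practice the probe might be a docker pull through the endpoint, as in the commented example.

```bash
#!/usr/bin/env bash
# Try a probe command against each endpoint in order; print the first that succeeds.
# first_working_endpoint is a hypothetical helper. The probe is any command or
# function that takes the endpoint hostname as its argument and exits 0 on success.
# Usage: first_working_endpoint <probe-cmd> <endpoint>...
first_working_endpoint() {
  local probe="$1"; shift
  local endpoint
  for endpoint in "$@"; do
    if "$probe" "$endpoint" >/dev/null 2>&1; then
      printf '%s\n' "$endpoint"
      return 0
    fi
  done
  return 1  # no endpoint responded
}

# Example probe (assumption): pull a known image through the given endpoint.
pull_probe() { docker pull "$1/myapp:v1"; }

# Not run here; requires Docker and a logged-in registry:
# first_working_endpoint pull_probe \
#   myregistry.westus.geo.azurecr.io myregistry.eastus.geo.azurecr.io
```

Listing the co-located replica first keeps the normal path local and only reaches for a more distant region when the probe fails.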
Note on long-running pushes during a failover. A multi-layer push that spans a failover boundary can land layers and the manifest on different replicas — exactly the failure mode that DNS bouncing produces during a single push. ACR is actively tightening health-aware failover behavior to minimize cross-replica scatter during these scenarios, and the recommendation today remains: pin pushes to a single replica via a regional endpoint when push/pull consistency matters. Common Questions Q1. Performance impact during initial replica creation on a live registry Because ACR is multi-region, multi-write, the existing replica continues serving pull and push traffic throughout the period when a new replica is being seeded. Replication is asynchronous and content propagates in the background; the time to populate a new geo-replica scales with the size of the registry — the cumulative number and total size of images already in the registry — not with any single image. The docs do not publish a quantified degradation percentage or a throttling window for this period, and they do not promise zero performance impact — the safe operating assumption for a live production registry is that existing replicas continue serving traffic normally, with the new replica catching up in the background. Q2. Restricted/updating state during initial sync There is no "restricted" state for the registry during normal replica creation. Writes, control-plane operations, and pushes/pulls against existing replicas continue normally. The only time configuration changes are unavailable is during a home region outage — see the relevant FAQ item later on for the full data-plane-versus-control-plane breakdown. Q3. Cooldown periods and non-straightforward failback scenarios There is no cooldown before failback, manual or automatic. Re-enabling a replica's participation in global endpoint routing takes effect immediately on ACR's side. 
Health-aware failover returns traffic to a region as soon as its per-registry health evaluation passes again. The failback case that is not seamless: if a recently pushed image has not yet replicated to the failover region, a pull from that region may not find the image until replication catches up. This is a function of eventual consistency, not failback timing — and it's part of a broader class of issues we cover in Q4. Q4. Common pull and push failure modes during the eventual-consistency window DNS bouncing during a single push is one well-known problem, but it isn't the only one. The eventual-consistency window between geo-replicas surfaces in several recurring failure modes worth knowing about: Push-then-immediate-pull-cross-region. Pushing myapp:v1 to one region and immediately pulling it from a different region can fail with manifest unknown until replication catches up. This shows up most painfully in CI/CD pipelines where one CI runner pushes an image and thousands of pods across other regions all try to pull from their local geo-replicas at the same time. Today, customers work around this with indeterminate sleeps before scheduling expensive compute, or with retry logic, or by waiting on a replication-complete signal — none of which is a clean planning story. Tag overwrite races. Pushing myapp:v1 , then re-pushing myapp:v1 shortly after with a fix (same tag, different digest), can leave different replicas resolving the same tag to different digests during the eventual-consistency window. Delete propagation. Deleting a tag or repository in one region takes some time to propagate to other replicas. Pulls from regions where the delete hasn't yet propagated can return the supposedly-deleted content. Mid-push failover scatter. A multi-layer push that spans a health-aware failover boundary or a DNS bouncing event can land layers on one replica and the manifest on another, surfacing as manifest validation errors or blob unknown on subsequent pulls. 
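Of the failure modes above, the push-then-immediate-pull race is the one most easily absorbed on the client: retry the pull with backoff until replication catches up. A minimal sketch, assuming a generic wrapper named retry_with_backoff (hypothetical). The attempt count and delays are placeholders, since the retry policy is an application-level decision.

```bash
#!/usr/bin/env bash
# Run a command until it succeeds, doubling the delay between attempts.
# retry_with_backoff is a hypothetical helper; tune attempts and delay yourself.
# Usage: retry_with_backoff <max-attempts> <initial-delay-seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2"; shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (not run here): pull from the local replica after a cross-region push.
# retry_with_backoff 5 2 docker pull myregistry.westeurope.geo.azurecr.io/myapp:v1
```

The same wrapper suits idempotent publish steps, where a push interrupted by a failover can simply be re-run.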
What ACR is doing about this. We're working on bounded staleness consistency for pushed images across all geo-replicas worldwide, which addresses these four failure modes directly. This will be covered in an upcoming blog post. If you're hitting eventual-consistency brittleness today and want to talk through your scenario, reach out to us on the Azure Container Registry GitHub repository — we want the customer signal to land in the design. Mitigations available today: Pin pushes to a single replica via a regional endpoint. Every sub-request in the push — login, blob uploads, manifest upload — goes to the same replica, eliminating the DNS bouncing and mid-push scatter classes entirely. Use a short-lived client-side DNS cache like dnsmasq scoped to the duration of a single push, only when you're not using regional endpoints. Do not run a long-lived DNS cache for the global endpoint — it interferes with --global-endpoint-routing false and with health-aware failover routing. Build retry logic into pulls that immediately follow a cross-region push. Either retry with backoff or check replication status with ACR webhooks before pulling. ACR can detect and notify you when an image or tag is available for pull in a geo-replica (say geo-replica B), after it has been pushed to another geo-replica (geo-replica A) and background replication has succeeded to geo-replica B. Design publish steps to be idempotent so retries triggered by mid-push failover are safe. Q5. Auth behavior across endpoint switches For safety, treat each global endpoint and each regional endpoint as its own authenticated surface. All registry APIs except the actual blob downloads (auth, manifests, tag resolution, referrers) flow through whichever endpoint you've chosen. If you switch from the global endpoint to a regional endpoint, or from one regional endpoint to another, re-authenticate. 
That means az acr login , fresh SDK auth, or — for AKS — letting the Kubernetes ACR credential provider handle re-auth, which it does automatically when the endpoint changes. Q6. Throttling under failover and pinning Throttling limits on registry API operations are per-replica, not per-registry. This has two operational implications: During health-aware failover, traffic that was spread across replicas can shift heavily onto whichever replicas remain in the global endpoint's routing pool. Capacity plan to spread traffic across two or three healthy replicas during a failover scenario rather than concentrating onto one — the global endpoint's routing already does this for you when multiple healthy replicas exist, but registries with only two regions configured can hit per-replica limits more easily during a failover. To mitigate, use regional endpoints to spread workloads across multiple geo-replicas and plan per-replica capacity. When pinning via regional endpoints (Step 2a), you concentrate traffic on whichever replica you've pinned to. If you've pinned all your AKS clusters to a single regional endpoint, you may hit that replica's per-region throttling limits at peak. Mitigations: pin different workloads to different regional endpoints across multiple regions for better topology mapping and capacity distribution, or use the global endpoint (Step 2b) for workloads where you don't need explicit pinning so ACR's routing can spread load. We're also working on improving the throttling metrics surfaced during health-aware failover events. Note: Health-aware failover does not reroute traffic based on HTTP 429 (throttling). If you're experiencing throttling but the region's infrastructure is healthy, the global endpoint continues routing you there. Use regional endpoints to explicitly spread load across replicas for capacity planning. Q7. Home region outage scope Geo-replication provides high availability for the data plane. 
During a home region outage, the control plane is unavailable, which means you can't create or delete replicas, modify network rules, or change replication settings until the home region recovers. ACR Tasks are also bound to the home region and don't run while it's unavailable. The data plane keeps working: Global endpoint continues routing pulls and pushes to healthy replicas. Regional endpoints continue working — you talk directly to specific replicas, and your client-side logic decides which region to use. Authentication, manifests, blob downloads, webhooks continue functioning through any healthy replica. The home region of a registry is fixed at creation and cannot be changed afterward. Microsoft's registry relocation guidance describes a redeployment procedure — creating a new registry in a different region — not an in-place change to an existing registry's home region. Note: If your registry uses a customer-managed key, review the key vault failover and redundancy guidance for maximum resilience. Key vault availability directly affects the registry's ability to encrypt and decrypt data. Q8. Webhooks during failover Webhooks fire from the replica that received the push. Because ACR also replicates content to other geo-replicas, webhooks fire from each geo-replica as the image syncs to it — so a single push results in webhook events from the receiving replica plus an event from each replica as replication completes. During a failover where pushes are routed to a different region, webhooks from those pushes fire from the new region; once the original region recovers and replication catches up, webhook events fire from there too. Webhook consumers should be designed to handle multiple events per pushed image and deduplicate as needed. Q9. 
Private endpoints with regional endpoints and dedicated data endpoints

When a private endpoint is created against a registry, it covers all of the registry's endpoint surfaces: the global endpoint, every regional endpoint (if regional endpoints are enabled), and every regional dedicated data endpoint. A single private endpoint in one VNet can reach the global endpoint (which routes you to a suitable replica), any regional endpoint in the same or a different region, and any region's dedicated data endpoint for blob downloads.

The trade-off is private IP allocation: each endpoint surface consumes IPs in the VNet. With many replicas plus regional endpoints plus dedicated data endpoints all enabled, private endpoint creation can fail if the VNet runs out of available private IPs.

IP address consumption per feature:

| Configuration | IPs consumed per VNet |
| --- | --- |
| Initial private endpoint (global endpoint + home region dedicated data endpoint) | 2 |
| Each geo-replication region added | +1 (regional dedicated data endpoint) |
| Regional endpoints enabled | +1 per geo-replica |

Example: A registry with 3 geo-replicas (home region included) and regional endpoints enabled consumes 7 private IPs per VNet: 1 (global) + 3 (data) + 3 (regional). Without regional endpoints, the same registry requires 4 private IPs: 1 (global) + 3 (data).

Subnet sizing: Use at minimum a /27 subnet (32 addresses, of which Azure reserves 5) for PE subnets on geo-replicated registries, and a /24 where possible. To check how many private IPs are already consumed on a subnet:

    az network vnet subnet show \
      --name <subnet-name> \
      --vnet-name <vnet-name> \
      --resource-group <resource-group> \
      --query "{addressPrefix:addressPrefix, usedIPs:length(ipConfigurations || \`[]\`)}" \
      --output table

See the ACR private endpoints documentation for the full IP-allocation math and sizing guidance.

Q10.
Geo-replica creation stuck for private endpoint-enabled registries

When creating a geo-replica for a registry that has private endpoints configured, the replica provisioning can get stuck in a Creating state if the identity performing the operation doesn't have sufficient permissions to create private endpoint networking resources.

Solution: Manually delete the geo-replica that got stuck in the provisioning state. Ensure the identity has the permission Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write before creating the geo-replica again.

Also verify that every PE subnet connected to the registry has free IP capacity. If any PE subnet across any connected VNet does not have enough free IPs, the replication provisioning fails and rolls back: the replica appears briefly in a Creating state and then is removed. The resulting error does not identify which subnet or VNet is exhausted.

Q11. Metrics, logs, and alerts for the three phases

We map each phase to the signals available in the Monitoring Guidance section below. The headline: Resource Health (in the Azure portal) and az acr replication list give you the platform-side signals; Azure Monitor platform metrics are collected automatically, and resource logs require Diagnostic Settings to be enabled on the customer side.

Behavior summary

| Scenario | Automatic? | Customer Action Required | Notes |
| --- | --- | --- | --- |
| Registry in a region degrades | Yes | None | Health-aware failover; per-registry; minutes-scale; global endpoint operations only |
| Region recovers after a degradation event | Yes | None | No cooldown |
| Pin AKS clusters to co-located replicas | No | Use regional endpoint URLs in deployment manifests (Step 2a) | Coexists with global endpoint |
| No pinning needed for most workloads | Yes | None; keep using myregistry.azurecr.io (Step 2b) | Global endpoint plus health-aware failover |
| Push/pull from the same replica (consistency) | No | Use a regional endpoint for both push and pull | Eliminates DNS bouncing and mid-push scatter |
| Capacity planning per region | No | Spread workloads across multiple regional endpoints | Per-replica throttling; avoid concentrating on one replica |
| DR rehearsal: take a replica out of global routing | No | az acr replication update --global-endpoint-routing false | Data continues syncing both directions; costs continue accruing |
| Re-enable replica participation in global routing | No | az acr replication update --global-endpoint-routing true | No cooldown; replica is immediately ready |
| Switch a workload between endpoints | No | Re-auth (az acr login, SDK auth, or Kubernetes ACR credential provider) | Each endpoint is its own authenticated surface |
| Initial replica seeding on a live registry | N/A | None | Existing replica continues serving traffic; seeding time scales with registry size |
| Long-running push during a failover | No | Retry; design publishes to be idempotent | Pin via regional endpoint to avoid mid-push scatter; ACR is tightening this behavior |
| Pull of a recently pushed image from a different region | No | Wait for replication, retry with backoff, or check replication status | Eventual consistency; bounded staleness consistency in development |
| Home region outage | Data plane: yes; control plane: no | Use global or regional endpoints for data plane operations | Control plane (replica config, network rules) requires home region |

DNS and Client-Side Considerations

DNS bouncing during a single
push is the most common geo-replication push problem in customer threads, and it warrants a section of its own. The failure mode. A docker push is a sequence of HTTP requests: blob uploads for each layer, then a manifest upload that references those layers by digest. If the Linux DNS resolver on the client doesn't cache myregistry.azurecr.io consistently for the duration of the push, individual sub-requests can resolve to different replicas. Because replication is eventually consistent, the manifest can land on a replica that doesn't yet have the layers it references, and the manifest validation fails. The two mitigations: Regional endpoints pin the push to a single replica end-to-end. Every sub-request — login, blob uploads, manifest upload — goes to the same replica. This is the cleanest fix and the one we recommend for any pipeline where push/pull consistency matters. A short-lived client-side DNS cache like dnsmasq scoped to the duration of a single push. For Linux VMs in Azure, follow the DNS name resolution options guidance. The pin should last the push and no longer. For other clients performing pushes, you can customize your stack's DNS resolver to have a similar short-lived DNS cache to pin the global endpoint's resolved DNS for only the duration of an image push operation. A note on long-lived DNS caching for the global endpoint. Don't run a long-lived DNS cache for myregistry.azurecr.io . ACR purges its own DNS records on the server side when a replica is taken out of global routing (Step 3) and during health-aware failover; a long-lived client-side cache will keep clients pointed at the old region after our purge, which makes both the manual disable mechanism and health-aware failover look broken from the client's perspective. Retry behavior: In-flight pushes during a failover may fail. Design publish steps to be idempotent so retries are safe. 
Pipelines that push in one region and immediately pull from a different region should retry with backoff or check replication status — eventual consistency means the pull may race ahead of replication. ACR is working on bounded staleness consistency that addresses this directly by enabling proxying (on ACR infrastructure) an image pull request from one geo-replica (if it does not have the image) to another geo-replica that has the image; see the relevant FAQ item. Note: Specific retry counts, back-off intervals, and push timeout values are application-layer decisions. The platform behavior is documented; the retry policy belongs to your client. Monitoring Guidance We map the three phases to the signals available from each source. Where a signal requires customer-side configuration, we flag it. Phase A: Initial replication (after creating a new replica) az acr replication list and az acr replication show — confirm the new replica reaches provisioningState: Succeeded and status: online , and view per-replica status. Azure Monitor platform metrics — push count, pull count, and other registry metrics are collected automatically and visible in the Azure portal under Metrics. No customer configuration is needed to view platform metrics. To export metrics or enable resource logs (detailed operation logs), configure Diagnostic Settings on the registry. Phase B: Failover (planned via replica disable, or automatic via health-aware failover) Per-replica regionEndpointEnabled state via az acr replication list — confirms whether a manual disable took effect, i.e. which replicas are currently eligible for global endpoint routing. Note: this flag reflects the manual configuration for configuring a geo-replica's global endpoint routing eligibility; it does not indicate whether health-aware failover has actively rerouted traffic away from a replica. 
Resource Health for the registry (in the Azure portal under Help > Resource health) — surfaces platform-side degradation signals during incidents. ACR does not yet expose a definitive "this region is currently serving your traffic" signal; Resource Health and client-side latency changes are the best available indicators. Pull latency from clients — increased latency from a more distant replica is the client-observable signal that traffic has rerouted. Azure Monitor platform metrics — visible per-region in the Azure portal Metrics blade. To export metrics or query them programmatically, enable Diagnostic Settings. Phase C: Failback (replica returns to global routing) az acr replication list — confirms regionEndpointEnabled: True (manual) or online status across all replicas (automatic). Pull latency normalizing as clients reach the recovered replica again. Resource Health clearing for the registry (visible in the Azure portal). Note: The health-aware failover blog calls out ongoing work to surface richer signals — including notifications for when routing changes and which region is currently serving your traffic. The signals listed above are what's available today. Pricing Considerations Storage billing vs. storage quota: Storage is billed per geo-replica — a 1 GiB image replicated to 5 geo-replicas is charged as 5 GiB of storage (1 GiB × 5 geo-replicas). However, storage quota (the tier's maximum storage limit) counts the image only once — the same 1 GiB image counts as 1 GiB toward your tier's maximum, not 5 GiB. Data transfer: Geo-replication can reduce costs by enabling in-region image pushes and pulls, which avoids cross-region data transfer charges during these push or pull operations. However, cross-region data transfer charges still apply when ACR replicates pushed content to other geo-replicas as part of eventual consistency. 
Disabled replicas still cost: When you take a replica out of global routing with --global-endpoint-routing false , storage and per-replica costs continue accruing because data continues syncing bidirectionally. For more information, see ACR pricing. Cleanup Run these commands to undo the walkthrough setup. Order matters: disable regional endpoints before deleting replicas, since regional endpoint URLs depend on which replicas exist. # Disable regional endpoints if you enabled them in Step 2a az acr update -n myregistry -g myrg --regional-endpoints disabled # Re-enable any replicas you disabled in Step 3 (no-op if already enabled) az acr replication update --registry myregistry --name westus \ --global-endpoint-routing true # Delete the West US replica created in Step 1 az acr replication delete --registry myregistry --name westus # Confirm only the home region replica remains az acr replication list --registry myregistry --output table Note: Replica deletion is a control-plane operation that requires the home region to be available. During a home region outage, replica configuration cannot be modified. Summary Table Question Answer When should I use regional endpoints vs the global endpoint? Use regional endpoints (Step 2a) for workloads that need affinity, predictable routing, push/pull consistency, troubleshooting, or client-side failover. Use the global endpoint (Step 2b) for everything else and let health-aware failover handle routing. What should I enable for secure, resilient layer downloads? Enable dedicated data endpoints. They scope firewall rules tightly to your registry and replace wildcard storage DNS with predictable per-region FQDNs. How do I avoid DNS-bouncing manifest validation failures on push? Pin pushes to a single replica via a regional endpoint. A short-lived client-side dnsmasq for the push duration is also fine if you're not using regional endpoints. Should I run a long-lived DNS cache for the global endpoint? No. 
ACR purges DNS server-side on disable and during failover; client-side caching works against that.

Do I need to re-auth when switching endpoints? Yes. Each global or regional endpoint is its own authenticated surface. az acr login, SDK auth, or the Kubernetes ACR credential provider handles the re-auth.

What happens during a home region outage? Data plane keeps working through any replica via the global endpoint or regional endpoints. Control plane operations (replica configuration, network rules) are unavailable until the home region recovers. The home region is fixed at registry creation.

What's ACR doing about eventual-consistency pain? Bounded staleness consistency for cross-replica pushed images is in development and will be covered in an upcoming blog post. Reach out via GitHub if you want to share your scenario.

For the full automation matrix (what's automatic, what requires customer action, and what to expect for each scenario), see the behavior summary above. If you have further questions about ACR geo-replication routing, pinning, capacity planning, eventual consistency, or failover behavior, reach out to us on the Azure Container Registry GitHub repository or file feedback through the Azure portal.

Connect Metrics to Traces with Exemplars in Azure Monitor
Following Microsoft’s recent GA announcement for OpenTelemetry (OTel) support, we are excited to announce support for Exemplars for customers instrumenting metrics with Prometheus or OpenTelemetry and traces using OpenTelemetry, enhancing Azure Monitor’s integrated observability experience for cloud-native applications. Modern cloud-native applications generate enormous volumes of telemetry. Metrics help teams detect that something is wrong, but traces explain why. Exemplars bridge these two worlds by attaching trace references directly to metric data points, making it dramatically easier to pivot from a spike in latency or errors to the exact distributed trace responsible for the issue. With Azure Monitor, customers can now ingest metrics with exemplars and visualize them in Azure Managed Grafana. This enables seamless correlation between metrics and traces, helping engineering teams troubleshoot issues faster and reduce mean time to resolution (MTTR). Why Exemplars Matter Traditional monitoring workflows often require users to manually correlate data across multiple systems. Exemplars simplify this workflow by embedding trace context directly into metric samples. For example, if a latency metric spikes at a specific timestamp, the exemplar associated with that data point can link directly to the distributed trace responsible for the outlier. 
This provides several benefits: Faster root cause analysis Quicker transition from aggregate metrics to request-level details Simplified debugging workflows for SRE and platform teams Better observability experiences for microservices and distributed applications Unified Observability with Azure Monitor With Azure Monitor and Azure Managed Grafana, you can now: Ingest OTLP or Prometheus metrics with exemplars into Azure Monitor Workspace Store and analyze traces in Azure Monitor Application Insights Visualize exemplar markers directly in Grafana charts Navigate from a metric spike to the exact distributed trace associated with that data point By combining these signals in a single observability platform, organizations can correlate infrastructure health, application behavior, and request traces without context switching between tooling. How It Works Once metrics, exemplars, and traces are ingested into Azure Monitor, Azure Managed Grafana can consume exemplar information from the configured Prometheus data source. When exemplars are enabled in Grafana dashboards, users will see markers associated with individual metric data points. Selecting an exemplar opens the associated trace in Azure Monitor, providing end-to-end diagnostic context. Getting Started Setup data ingestion: Instrument your application to emit OpenTelemetry traces, OpenTelemetry or Prometheus metrics with exemplars, and enable ingestion of the same to Azure Monitor using OpenTelemetry Collector. Follow the instructions in Ingest OTLP Data into Azure Monitor with OTel Collector - Azure Monitor | Microsoft Learn. After this step, you will have the Log Analytics Workspace, Azure Monitor Workspace and Application Insights resources all set up to store the telemetry data. Create an Azure Managed Grafana instance and connect it with the Azure Monitor Workspace by navigating to your Azure Monitor Workspace in the Azure portal and then clicking on “Linked Grafana workspaces”. 
To learn more, see Manage an Azure Monitor workspace - Azure Monitor | Microsoft Learn.

Optionally, enable Azure Managed Prometheus on your AKS cluster, or use remote-write, and configure it to use the same Azure Monitor Workspace to centralize infrastructure and application metrics.

Enable Exemplars in Azure Managed Grafana:
After setting up data ingestion, confirm that logs and traces are flowing into the Log Analytics Workspace and that metrics are flowing into the Azure Monitor Workspace.

Step 1: Enable Exemplars on Prometheus Data Source in Azure Managed Grafana
1. Navigate to Connections -> Data Sources in Azure Managed Grafana. Because you have connected Azure Managed Grafana to the Azure Monitor Workspace, the data source (Managed_Prometheus_<AMW-Name>) is already configured. If the data source is not configured, follow the steps here to add your Azure Monitor Workspace as a data source.
2. Open the data source configuration.
3. Click Add Exemplars to enable exemplar support.

Step 2: Configure Trace Linking with Azure Monitor
1. In the exemplar configuration section, toggle Internal Link to On.
2. Select Azure Monitor as the data source.
3. In Label Name, enter the name of the field in the labels object that holds the trace ID, for example trace_id.
4. Click Save & Test.

This configuration enables direct navigation from exemplar markers in Grafana charts to the associated traces stored in Azure Monitor. Azure Managed Grafana also supports trace correlation with other solutions such as Jaeger. To use your own trace solution, use the appropriate links.

Step 3: Enable Exemplars in Dashboards
1. Navigate to a Grafana dashboard that uses your configured Prometheus data source.
2. Open the panel options for a metrics chart.
3. Toggle Exemplars to On.

Once enabled, exemplar markers appear on supported metric visualizations. Clicking a marker shows the exemplar details along with an option to open the corresponding distributed trace in Azure Monitor.
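To make the idea of "embedding trace context directly into metric samples" concrete, here is a minimal sketch of what a single metric sample with an exemplar looks like in the OpenMetrics text format. It is an illustration only: the function name, the metric name, and the trace ID value are hypothetical, and label-value escaping is omitted. The exemplar label is called trace_id so that it matches the Label Name entered in Step 2.

```python
def format_exemplar_sample(metric, labels, value, trace_id, exemplar_value, timestamp=None):
    """Render one OpenMetrics sample line with a trace_id exemplar attached.

    Hypothetical helper for illustration; real instrumentation libraries
    produce this line for you.
    """
    def render(pairs):
        if not pairs:
            return ""
        return "{" + ",".join(f'{k}="{v}"' for k, v in pairs.items()) + "}"

    exemplar_labels = {"trace_id": trace_id}

    # OpenMetrics limits the combined length of exemplar label names and
    # values to 128 characters.
    size = sum(len(k) + len(v) for k, v in exemplar_labels.items())
    if size > 128:
        raise ValueError("exemplar label set exceeds 128 characters")

    # Sample first, then '#', then the exemplar labels and the observed value.
    line = f"{metric}{render(labels)} {value} # {render(exemplar_labels)} {exemplar_value}"
    if timestamp is not None:
        line += f" {timestamp}"
    return line


line = format_exemplar_sample(
    "http_request_duration_seconds_bucket",  # assumed metric name
    {"le": "0.5"},
    12,
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",  # made-up trace ID
    exemplar_value=0.43,
)
print(line)
```

The part after the `#` is the exemplar: the bucket count stays an aggregate, while the exemplar carries one concrete request's trace ID and observed value, which is what Grafana uses to link a marker to a trace.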
To learn more, visit https://aka.ms/azmon-exemplars

Regional Endpoints for Azure Container Registry Geo-Replication — Now in Public Preview
By Johnson Shi, Zoey (Zhuyu) Li, Huangli Wu

What's new
Regional endpoints for geo-replicated Azure Container Registries are now in public preview. See the feature's official MS Learn documentation. If you've been following since the private preview announcement, here's what changed:

- No feature flag registration. No subscription enrollment is needed, so all Azure subscriptions and customers can now use this feature.
- No CLI extension. Regional endpoints commands are built into Azure CLI 2.86.0+ natively. If you installed the private preview acrregionalendpoint extension, uninstall it to avoid conflicts.
- Native CLI and portal support. With Azure CLI 2.86.0+, enable regional endpoints for all geo-replicas of a registry with az acr create --regional-endpoints enabled or az acr update --regional-endpoints enabled. The Azure portal also supports configuring regional endpoints natively.
- CLI flag rename for configuring a geo-replica's global endpoint routing (an existing, separate feature). The existing flag --region-endpoint-enabled (on az acr replication create/update) has been renamed to --global-endpoint-routing.

Key clarifications:
- --global-endpoint-routing (formerly --region-endpoint-enabled, on az acr replication create / az acr replication update) controls whether a specific geo-replica participates in global endpoint routing. This is an existing feature that is different from the new registry-level --regional-endpoints feature discussed in this post.
- --regional-endpoints (on az acr create / az acr update) enables or disables the regional endpoints feature at the registry level for all geo-replicas. This is the feature discussed in this post.

See the endpoint reference for the full breakdown of the various registry endpoints (global endpoints, regional endpoints, and data endpoints).

Regional endpoints are available on Premium SKU registries in all Azure public cloud regions.

What are regional endpoints?
Regional endpoints give you dedicated, per-region login server URLs for each geo-replica with the following URL pattern: myregistry.eastus.geo.azurecr.io myregistry.westeurope.geo.azurecr.io Regional endpoints coexist with the registry's global endpoint ( myregistry.azurecr.io ) — enabling regional endpoints doesn't disable a registry's global endpoint that is backed by Azure-managed routing. You can choose per workload: You can use the global endpoint with automatic Azure-managed routing with health-aware failover, where Azure will route your requests to the geo-replica with the best network performance profile to the client. You can use a regional endpoint when you need explicit control or routing to a specific geo-replica. Other resources: For the full background on why regional endpoints exist and the problems they solve, see the private preview blog post. For the complete operational deep dive — health-aware failover, throttling considerations, storage quota and pricing, eventual consistency, home region outage behavior, DNS propagation, private endpoint interaction, capacity planning, and monitoring guidance — see How ACR geo-replication handles failover, failback, and traffic redirection. For the behind-the-scenes engineering implementation — architectural overview and the engineering system design of the feature — see Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints. Getting started Enable regional endpoints on an existing registry: az acr update -n myregistry -g myrg --regional-endpoints enabled View all registry endpoint URLs, including the registry global endpoint, geo-replica regional endpoints, and data endpoints: az acr show-endpoints --name myregistry --resource-group myrg Using regional endpoints Authenticate to a specific regional endpoint: az acr login --name myregistry --endpoint eastus Push to a specific geo-replica. 
Images and tags pushed to a geo-replica via regional endpoints still propagate to all other geo-replicas under eventual consistency.

    docker tag myapp:v1 myregistry.eastus.geo.azurecr.io/myapp:v1
    docker push myregistry.eastus.geo.azurecr.io/myapp:v1

Pull an image:

    docker pull myregistry.eastus.geo.azurecr.io/myapp:v1

You can specify regional endpoints directly in Kubernetes deployment manifests if you need to pin workloads to specific regions. This ensures clusters in specific regions always pull from their co-located replica, providing predictable routing and reduced latency. By using different regional endpoints in each cluster's manifests, you can choose to guarantee that each cluster pulls from its local replica instead of relying on Azure-managed routing.

East US cluster deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-eastus
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.eastus.geo.azurecr.io/myapp:v1

West Europe cluster deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-westeurope
    spec:
      template:
        spec:
          containers:
          - name: myapp
            image: myregistry.westeurope.geo.azurecr.io/myapp:v1

When to use regional endpoints

Scenario | What to do
Most workloads | Keep using the global endpoint (myregistry.azurecr.io). Health-aware failover handles routing automatically.
Pin AKS clusters to co-located replicas | Use regional endpoint URLs in deployment manifests.
CI/CD push-then-pull consistency | Pin pushes to a regional endpoint to avoid eventual-consistency races.
Client-side failover | Switch between regional endpoints based on your own health checks.
Capacity planning | Spread workloads across multiple regional endpoints to avoid per-replica throttling.
Troubleshooting | Target a specific geo-replica to reproduce or isolate an issue.
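If you generate manifests or pipeline configuration per region, pinning comes down to rewriting the registry host according to the URL pattern shown earlier (myregistry.<region>.geo.azurecr.io). The following is a minimal sketch under that assumption. The helper names regional_login_server and pin_image are hypothetical and are not part of any Azure SDK or CLI.

```python
GLOBAL_SUFFIX = ".azurecr.io"


def regional_login_server(registry: str, region: str) -> str:
    """Build a regional endpoint host from the documented URL pattern."""
    return f"{registry}.{region}.geo.azurecr.io"


def pin_image(image_ref: str, region: str) -> str:
    """Rewrite a global-endpoint image reference to a regional endpoint.

    Only accepts references of the form <registry>.azurecr.io/<repo>[:tag];
    anything else (already-regional hosts, data endpoints, other registries)
    is rejected rather than guessed at.
    """
    host, sep, rest = image_ref.partition("/")
    if not sep or not host.endswith(GLOBAL_SUFFIX):
        raise ValueError(f"not a global ACR endpoint reference: {image_ref}")
    registry = host[: -len(GLOBAL_SUFFIX)]
    if not registry or "." in registry:
        raise ValueError(f"not a global ACR endpoint reference: {image_ref}")
    return f"{regional_login_server(registry, region)}/{rest}"


# One image reference per cluster region, for example when templating manifests.
for region in ("eastus", "westeurope"):
    print(pin_image("myregistry.azurecr.io/myapp:v1", region))
```

Rejecting unexpected hosts instead of rewriting them is a deliberate choice in this sketch: a silently wrong registry host tends to surface much later as an image pull failure.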
What changed from private preview

Private preview | Public preview
Feature flag registration required (az feature register) | No registration needed
Subscription private preview enrollment and propagation wait | Immediately available to all Azure subscriptions for all Premium SKU registries in all Azure public cloud regions
Separate CLI extension (acrregionalendpoint) | Built into Azure CLI 2.86.0+ natively
No registry-level CLI flag | az acr update --regional-endpoints enabled enables regional endpoints for all geo-replicas
--region-endpoint-enabled flag for controlling a geo-replica's global endpoint routing via az acr replication update | Flag for controlling a geo-replica's global endpoint routing renamed to --global-endpoint-routing
No portal support | Native Azure portal support for enabling regional endpoints for new registries (during creation) and for existing registries
Private preview docs in Azure/acr | Full documentation on MS Learn

Enabling regional endpoints in the Azure portal
You can enable regional endpoints directly from the Azure portal for both new registries (during creation), as well as existing registries.

If you were in the private preview

1. Uninstall the CLI extension. The private preview CLI extension conflicts with the built-in commands in Azure CLI 2.86.0+. Remove it:

    az extension remove --name acrregionalendpoint

Verify it's gone:

    az extension list --query "[?name=='acrregionalendpoint']" -o table

2. Ensure you're running Azure CLI 2.86.0 or later. Regional endpoints commands are available natively starting in Azure CLI 2.86.0. Check your version:

    az version

3. Update scripts that use --region-endpoint-enabled for controlling global endpoint routing for a geo-replica. The old flag name for controlling a geo-replica's global endpoint routing configuration is deprecated and will be removed in Azure CLI 2.87.0 (June 2026).
Update to --global-endpoint-routing:

    # Old (deprecated)
    az acr replication update --registry myregistry --name westus \
      --region-endpoint-enabled false

    # New
    az acr replication update --registry myregistry --name westus \
      --global-endpoint-routing false

Why the rename?
The old flag name --region-endpoint-enabled was confusing: it sounded like it controlled the regional endpoints feature, but it actually controlled whether a geo-replica participates in global endpoint routing. The new name --global-endpoint-routing says exactly what it does. For a full breakdown of all three CLI flags and how they relate, see the endpoint reference.

Learn more
- Full documentation: Geo-replication in Azure Container Registry - Regional endpoints: prerequisites, CLI commands, network considerations, private endpoint integration, and troubleshooting.
- Operational deep dive: How ACR geo-replication handles failover, failback, and traffic redirection: health-aware failover, throttling, eventual consistency, DNS considerations, monitoring, pricing, and a full walkthrough.
- Behind-the-scenes engineering implementation: Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints: architectural details and the engineering system design behind the feature.
- Endpoint reference: Azure Container Registry endpoint reference: all endpoint types, URL formats, and CLI flags in one place.
- Private endpoints: Connect privately to a registry using private endpoints: IP allocation math, subnet sizing, and NIC queries for registries with regional endpoints.
- Firewall rules: Configure firewall access rules: which FQDNs to allow for regional endpoints.

Feedback
We'd love to hear how you're using regional endpoints and what we can improve.
Reach out via:
- Azure Container Registry GitHub repository: issues, feature requests, and discussion
- Azure portal feedback: use the feedback button in the Azure portal on your registry's page

Regional endpoints are on the path to GA. Your feedback directly shapes the feature's direction.

What’s new in Observability at Build 2026
At Build 2026, Azure Monitor introduces major advancements in end-to-end observability, extending across AI agents, applications, and infrastructure with OpenTelemetry at its core. New capabilities with Azure Copilot Observability agent, SLI/SLO support, and smarter alerting help teams move faster from detection to root cause while reducing noise and manual effort. Together, these innovations enable developers and SREs to operate modern, AI-driven systems with greater insight, efficiency, and alignment to customer experience.

VNet integration for Azure SRE Agent (preview)
For many production systems, the logs, databases, private endpoints, repositories, and runbooks an SRE Agent needs to do its job are behind network boundaries your security team already governs. VNet integration for Azure SRE Agent, now in preview, puts the agent's outbound traffic under those same controls - your virtual network, your NSG rules, your private DNS - so it reaches only what your network allows. The principle is one your security team already applies to every other workload: a component's network access shouldn't depend on the component behaving correctly. Identity governs what the agent can reach. Permissions and hooks shape what it does within reach. The network sits beneath both: it blocks any request to a destination you haven't allowed no matter what the agent decides. Why egress control matters Two reasons. First, the agent reads sensitive things by design. Inspecting logs, code, configuration, and internal systems is the whole point during an incident, which means you have to decide where that data can go. Open egress gives that data a path out of your network - a risk you wouldn't accept for any other production-adjacent workload. Second, it reasons over text it didn't write - logs, issue descriptions, tool output — which is how prompt injection gets in. Handling that is partly model safety, and Azure SRE Agent runs under Microsoft's Responsible AI standard with safety work from OpenAI and Anthropic. Network controls add another layer: an instruction that tries to reach a destination you haven't allowed can't run, because the network blocks it. For example, an agent investigating an outage might query Log Analytics, read deployment configuration, and call an internal runbook - all private resources. With VNet integration, those calls follow the routes, DNS, and firewall rules your workloads already use. A request to an external endpoint you haven't allowed fails at the network boundary. 
It doesn't depend on the model recognizing the risk and refusing; the network stops it either way.

Choose an egress mode
Azure SRE Agent has three egress modes, and you don't have to start at the strongest.

- Unrestricted - all outbound traffic allowed.
- Limited - deny all outbound, allow an explicit list of hosts. Gives you host-level control without setting up a full VNet.
- Azure VNet - outbound traffic goes through a delegated subnet in your network, with your NSG rules and private DNS applied. The recommended mode for production and regulated workloads.

How Azure VNet mode works
Outbound traffic takes one of two paths, and every call takes exactly one.

Your VNet. Everything not placed on the managed path goes through a delegated subnet in your own network, where your NSG rules, private DNS, and firewall all apply. The agent is just another workload on that subnet, so it can reach what the subnet can reach: databases behind private endpoints, internal services, monitoring stores, and key vaults - the parts of production that aren't reachable from the public internet. The resources that matter most during an incident are usually the private ones. If your network connects to on-premises over ExpressRoute or VPN, the agent can reach those systems too, as long as your existing routes and rules allow it.

The managed infra path. Some destinations go through Azure SRE Agent's managed infrastructure network instead - platform services the agent needs, plus optional categories you turn on: package registries, code repositories, and remote MCP servers. This path skips your VNet, so your NSG rules and Firewall Policies don't apply to it. Treat it as a deliberate exception, used only where you need it.

Why public services start on the managed path
Public services are hard to allow by IP address. GitHub, PyPI, npm, NuGet, apt, and the container registries run on large, changing IP ranges, and they don't map to a single Azure service tag.
If your NSG filters by IP and port, keeping those lists up to date is constant work, and when a list falls behind, the agent can't pull a package or read a repository - and an investigation stalls on a networking problem that has nothing to do with the incident.

Each category has a toggle: package registries (PyPI, npm, NuGet, apt), code repositories (GitHub, GitHub Enterprise, Azure DevOps), remote MCP servers, and a list of additional hostnames. Starting with these on the managed path keeps the agent working reliably without maintaining an IP allowlist. For build-time dependencies, that's usually fine.

If you want this traffic inspected too, the next step is name-based (FQDN) egress filtering in your own network. Once your firewall can allow github.com and pypi.org by name, you can move these categories off the managed path and route them through your VNet instead.

Configure it
Two decisions: the subnet, and what (if anything) uses the bypass, meaning the managed path that skips your VNet.

1. Navigate to Settings > Workspace Configuration > Network.
2. Choose Azure VNet as the egress mode.
3. Select a subnet that is /28 or larger and delegated to `Microsoft.App/environments`.
4. Decide which categories, if any, use the bypass.
5. Restrict who can change the egress mode and bypass toggles. These settings widen or narrow the agent's reach, so govern them like any production network control.
6. Test the outbound behavior before using the agent with production data.

A reasonable setup for most enterprises during preview: use Azure VNet mode, keep package registries and code repositories on the bypass if you need reliable access to them, and route everything else through your VNet. Stricter environments can turn those categories off and rely on their own name-based firewall rules.

What it doesn't cover yet
VNet integration is in preview, with two limitations to know. It covers outbound traffic only - reaching the agent privately from inside your network isn't part of this preview.
And connector traffic still routes over the public internet; the governance and credential isolation in Connectors V2 still apply. Use VNet integration for outbound control of the agent workspace, and combine it with identity, RBAC, tool permissions, hooks, and connector governance for a complete set of controls.

Where it fits
VNet integration doesn't replace identity, RBAC, tool permissions, or connector governance. It controls where traffic can go. The agent still needs the right identity and permissions to access a resource in the first place.

- Identity is the foundation: your RBAC assignments decide what the agent can reach.
- Permissions and hooks shape what it does within reach: allow/ask/deny rules control what runs, and hooks let you inspect or change a tool call before it runs.
- VNet integration sits underneath, controlling where traffic can go no matter what the agent tries to do.

You want the agent to be capable. You also want a boundary that holds whether or not it is.

Get started
- Create an SRE Agent - https://aka.ms/sreagent
- Documentation - https://aka.ms/sreagent/newdocs
- Recipes - https://aka.ms/sreagent/recipes
- Build 2026 Announcement - https://aka.ms/Build26/blog/SREAgent

Inside ACR Artifact Cache: Pull-Through Caching at Scale
By: Akash Singhal, Luis Dieguez, Kiran Challa, Nathan Anderson, Tony Vargas, Caroline Barker, Ren Shao, Mabel Egba, Toddy Mladenov, Johnson Shi Introduction For many customers, Azure Container Registry (ACR) is the only registry their workloads can trust, even when images and artifacts originate from a different registry such as Docker Hub, Microsoft Artifact Registry, GitHub Container Registry, Quay, another ACR, or a private registry. ACR Artifact Cache makes this many-to-one model practical by letting a platform team map a downstream ACR repository path to an upstream source repository. Here, upstream means the source registry and repository ACR contacts on behalf of the customer, and downstream means the ACR-facing path customers pull from. From the outside, the experience looks like a normal pull from ACR. Inside the service, that pull moves through the same multi-tenant registry platform that serves ACR traffic across regions, clouds, and data plane stamps. This series is about the gap between that simple external experience and the internal system. The goal is to show what happens inside ACR, why the system is designed this way, and how those design choices shape the behavior customers ultimately observe. Some implementation details are simplified, and the system continues to evolve. The request paths and design constraints are representative, but this article intentionally avoids service-by-service internals that are not necessary to understand the feature. For this overview, the useful mental model is: serve now, hydrate for later. Later sections will show where that model helps, and where it creates engineering pressure. Why serve upstream content from ACR? Pulling directly from an upstream is often sufficient for development, but production systems need stronger guarantees from the pull path. 
The failure modes are familiar to anyone who has operated containerized workloads at scale:

- an upstream registry is slow or temporarily unavailable
- an upstream applies rate limits or burst protection
- credentials for various upstream sources need to be handled safely
- ACR-to-ACR scenarios should avoid customer-managed credentials entirely by using managed identity
- network policy expects pulls to stay inside an approved network boundary
- a platform team wants one shared, sanitized catalog of public content for first-party consumption while individual teams pull only what they need

Let’s take Docker Hub as a concrete example. Docker Hub pull rate limits mean that unauthenticated users and Docker Personal users can exhaust their allowed pulls in a time window, causing shared build agents or Kubernetes nodes to receive rate-limit errors instead of images. That is a useful example because it makes the upstream dependency visible, but it is not the whole story. The broader engineering problem is that upstream-sourced artifacts should behave like local registry dependencies once a customer chooses to route them through ACR. Artifact Cache addresses that problem by letting customers map a downstream ACR namespace to an upstream namespace, pull through ACR, and allow ACR to materialize content locally as it is requested.

A pull-through cache inside ACR
Azure Container Registry operates across 60+ Azure regions and 6 public and sovereign clouds, serves hundreds of thousands of registries, and handles billions of requests per day. Artifact Cache is only one part of that larger service, but it is large enough to be a distributed systems problem in its own right: more than 100 million image pulls per day, petabyte-scale egress, upstreams with different behavior, and customers who expect registry pulls to remain predictable. This scale matters because Artifact Cache is not deployed beside ACR as a separate service.
It is part of the same registry system that serves normal pushes, pulls, tag listing, catalog operations, authentication flows, private networking scenarios, and other registry API traffic. That means Artifact Cache has to fit into ACR's existing resource model and request-serving model. Customers configure cache rules and authentication boundaries through the control plane, then their pulls are served through the data plane. The next sections follow those two parts in order: first the resources customers create, then the runtime path those resources affect. The customer workflow The setup begins in the control plane, where customers define the relationship between an ACR namespace and an upstream source. A customer starts with an ACR and chooses an upstream repository. In the examples below, myregistry.azurecr.io is the customer's ACR login server. The dockerhub/library/node path is the downstream ACR namespace the customer wants to use for cached content. The authentication model depends on the upstream: For a public upstream, the cache rule may not need credentials. For a private upstream, the customer stores upstream credential material in their Azure Key Vault, creates a credential set that references those secrets, and then associates that credential set with a cache rule. At access time, ACR uses the system-assigned managed identity associated with the cache rule to read the referenced Key Vault secrets, so the customer controls access by granting that identity the required secret permissions. ACR materializes those credentials only when it needs to contact the upstream, so the customer-owned Key Vault remains the secret store. For an ACR-to-ACR upstream, the customer can use a user-assigned managed identity. In that scenario, credential sets are not part of the flow; managed identity replaces the credential-set and Key Vault path. 
At a high level, the customer defines a namespace mapping: docker pull myregistry.azurecr.io/dockerhub/library/node:latest maps to: docker pull docker.io/library/node:latest In ACR, that mapping is stored as a cache rule: a control-plane resource that maps a downstream ACR path to an upstream source path. If the upstream requires authentication, the cache rule links to the appropriate credential boundary: a credential set backed by customer-owned Key Vault secrets, or a user-assigned managed identity for ACR-to-ACR. This is where the control-plane/data-plane split shows up. The control plane manages registry configuration through surfaces such as CLI, portal, Bicep, ARM templates, and other Azure Resource Manager clients. ARM sends those resource operations to the ACR control plane, which creates or updates the cache rule and, when needed, the credential set as child resources under the registry. Those resources do not own customer secrets or identities directly; they link to existing Azure resources such as the customer's Key Vault or an optional user-assigned managed identity. Later, the data plane uses that persisted configuration to decide whether a runtime registry request, such as a pull or tag listing, should be handled by Artifact Cache. After setup, the runtime path begins with the simplest possible pull: docker pull myregistry.azurecr.io/dockerhub/library/node:latest To understand what happens after that command, we need a map of the ACR components that participate in the request path. The ACR components involved The architecture needed for this overview is much smaller than ACR's full internal service graph. ACR is a regionalized service. The control plane operates at the regional level, while data plane stamps serve hot-path registry traffic for the registries assigned to them. A registry is pinned to a stamp, and high-traffic regions may have more than one stamp. 
Stamp architecture is an ACR concept covered in more detail in the stamp rebalancing post; this article only needs the simplified model below. For this article, ACR has three important boundaries: The regional control plane manages registry resources and provisioning operations. The data plane stamp serves hot-path registry traffic for registries pinned to that stamp. The storage layer holds downstream registry metadata, blobs, and storage-backed event queues. At this level of detail, a data plane stamp is composed of a few major runtime substrates. The registry data plane virtual machine scale set (VMSS) is the core ACR data plane. It runs containerized services including the frontend, the registry API entry point that receives and routes OCI and ACR-specific requests. The data proxy VMSS also runs containerized services and serves selected blob-content paths. It serves eligible blob-content traffic behind ACR's dedicated data endpoint; see the ACR data endpoint documentation. The stamp also includes a runtime cluster for additional data plane services, including services that are not on the hot path. This article will not explain why ACR uses both VMSS-based services and a runtime cluster inside the data plane stamp. That tradeoff is useful context, but it belongs in a separate deep dive. For Artifact Cache, the important point is narrower: the stamp contains the runtime substrates that participate in data plane serving, including runtime-cluster services that process async import and hydration work. 
The component list is:

Component | Role
Region control plane | Manages registry resources and provisioning operations
Data plane stamp | Serves pinned registries in a region
Registry data plane VMSS | Core ACR data plane for OCI and ACR-specific APIs
Frontend | Handles OCI registry API traffic inside the registry data plane
Data proxy VMSS | Serves selected blob-content paths, including Artifact Cache
Runtime Kubernetes Cluster | Hosts additional data plane services, including async import and hydration workers
Cache rule | Maps downstream ACR path to upstream path
Credential set or managed identity | Provides the upstream authentication boundary when needed
Cache Backend service | Handles cache-rule-backed pulls
Storage queue | Regional storage resource used for hydration events
Metadata/blob storage | Stores downstream manifests, tags, digests, and layer blobs
Import workers | Run in the data plane runtime cluster and hydrate downstream content asynchronously
Upstream registry | Public, private, or another ACR registry used as the source

The diagram below is a component map rather than a step-by-step pull trace. It shows one visible data plane stamp in West US for myregistry.azurecr.io, with a muted marker to indicate that larger regions can contain multiple stamps. The stamp contains a registry data plane VMSS, a data proxy VMSS, and a runtime Kubernetes cluster. Regional metadata/blob storage and the storage queue sit outside the stamp boundary. The storage queue is also outside the regional control plane cluster; it is a storage resource consumed by data plane runtime-cluster workers.

First artifact pull
Now return to the pull request:

    docker pull myregistry.azurecr.io/dockerhub/library/node:latest

The request reaches the data plane stamp where myregistry is pinned. The frontend in the registry data plane VMSS handles the registry API request and forwards it to the Cache Backend Service, which checks whether the requested repository path matches a cache rule.
If there is no matching cache rule, the request follows the normal ACR path. If a cache rule matches, Artifact Cache logic applies.

The next check is local state. ACR looks at downstream metadata and blob storage to determine whether the requested manifest and blobs are already available locally. If the content is present, ACR can serve it from the downstream registry path.

If the content is not available locally, ACR resolves the upstream repository path from the cache rule. If the upstream requires authentication, ACR uses the configured auth boundary for that upstream: a credential set for private upstreams, or a user-assigned managed identity for ACR-to-ACR upstreams. The request can then be served through the upstream-backed data path, with the data proxy handling the blob content path.

The first pull does not need to wait for durable hydration to complete before the client receives content. Serving the pull and hydrating the downstream registry are related operations, but they are deliberately separated.

The trace above follows the same node:latest image used in the setup example. On a cache miss, the data plane queues an async import event for the requested image while still serving the client request. Manifest content returns through the frontend path. For layer blobs, the frontend returns a redirect to the data proxy, and the client follows that redirect while the data proxy streams blob content from the upstream CDN.

The data plane serves the customer request, but it also detects that durable downstream state needs to be populated. That durable work is where hydration comes in.

Hydration

Hydration is the process that materializes upstream content into the downstream ACR registry. ACR performs hydration asynchronously because the data plane workload can be bursty and variable. A deployment or scale-out event can cause many clients to request the same not-yet-hydrated image at nearly the same time.
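The separation between serving a pull and hydrating the downstream registry can be sketched with a toy model. Everything here is an assumption made for illustration: the `Registry` class, its in-memory dictionaries, and the simple "skip if already queued" check are not ACR APIs or ACR's queueing behavior.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy stand-in for downstream state; none of these names are ACR APIs."""
    local: dict[str, bytes] = field(default_factory=dict)     # hydrated content by reference
    upstream: dict[str, bytes] = field(default_factory=dict)  # pretend upstream registry
    import_queue: deque = field(default_factory=deque)        # pending async import events

    def pull(self, ref: str) -> tuple[str, bytes]:
        """Serve a cache-rule-matched reference and return (source, content)."""
        if ref in self.local:
            return "local", self.local[ref]
        # Cache miss: serve through the upstream-backed path right away.
        content = self.upstream[ref]
        # Queue hydration without blocking the client (simplified de-duplication).
        if ref not in self.import_queue:
            self.import_queue.append(ref)
        return "upstream", content

    def run_import_worker(self) -> None:
        """Drain the queue and materialize content downstream (hydration)."""
        while self.import_queue:
            ref = self.import_queue.popleft()
            self.local[ref] = self.upstream[ref]
```

In this model the first `pull` returns content from the upstream side and leaves one import event in the queue; once `run_import_worker` has run, the same reference is served from local state.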
Image size, layer count, multi-platform manifest trees, upstream behavior, queue depth, and retry behavior all matter in a multi-tenant service.

The north star is to coordinate those requests: collapse duplicate work, hydrate the content from upstream, and serve all waiting clients without turning one customer action into unnecessary upstream load. That coordination problem is challenging at ACR scale, and we are continuing to improve it. The existing async import path gives Artifact Cache a durable and scalable foundation while that serving path continues to evolve.

At a high level, the data plane queues an import event. A notification service consumes the event and dispatches work to import workers in the data plane runtime cluster. Those workers fetch the required content from the upstream registry and write manifests, tags, digests, and layer blobs into ACR metadata and blob storage.

When import workers complete, they notify the notification service, which can publish completion signals through ACR eventing surfaces such as Event Grid and webhooks. This allows customers to use webhooks to detect when cached content is fully available locally. You can read more about how it works here.

The mental model is that the first pull can serve immediately, while hydration makes future local serving durable. A follow-up post will go deeper on the work ACR does to reduce upstream load during this hydration window.

Later pulls

After hydration completes, later pulls for the same content can be served from ACR. For digest references, the model is relatively direct because a digest is content-addressed. If ACR has the requested digest and its blobs downstream, the data plane can serve that content locally.

Tags are more subtle because tags can change. A tag such as latest is a name that can point to different content over time. Artifact Cache therefore must care about freshness semantics for tag-based pulls.
This is one of the reasons a pull-through cache becomes more complex than "fetch once and forget." The benefit is not only lower latency. ACR also reduces repeated dependency on the upstream for content that has already been materialized downstream.

Guarding the pull path

Once content is hydrated, ACR must serve that content from the customer's registry boundary even when the upstream is slow, unavailable, or returning errors. That distinction matters for tag-based pulls: ACR may need upstream checks to reason about freshness, but an upstream failure should not automatically prevent ACR from serving content that is already available downstream.

Artifact Cache also must be careful about how it behaves when upstreams are unhealthy. If an upstream starts returning 5xx errors or throttling requests, ACR should avoid amplifying the problem by repeatedly sending customer-triggered requests upstream. Circuit breaking and upstream work minimization are part of being a good steward of both customer traffic and upstream registry limits. More details to follow in subsequent posts.

There is a separate availability question inside ACR: what happens if Artifact Cache-specific components, such as the cache backend path, are operationally unavailable? ACR handles that case gracefully by falling back to normal registry pull behavior: it checks the customer's registry state and serves the image if the requested content already exists in ACR. In other words, cache-backend unavailability should not block pulls for content that is already present in the registry.

What we will explore next

This overview is the map for the rest of the series. The following posts will go deeper into the parts of the system where the design pressure is highest.

Minimizing upstream work

We will start with how Artifact Cache avoids making more upstream requests than necessary. This becomes difficult when many clients request the same not-yet-hydrated image at the same time.
A Kubernetes scale-out event is the classic example: many nodes may ask for the same image concurrently, and the system must avoid turning one customer's action into unnecessary duplicate upstream work.

Making Artifact Cache observable to customers

We will also look at how customers understand whether their cache rule is healthy, whether credentials are usable, and why a pull failed. This is hard because a failed pull can involve customer configuration, Key Vault access, managed identity configuration, upstream credentials, upstream availability, data plane request handling, or asynchronous hydration. The engineering challenge is to expose the right customer-facing health and debug signals without turning internal topology into the user interface.

Repository semantics in Artifact Cache

Finally, we will look at repository semantics. Once upstream content becomes local, the repository is no longer just a mirror. Tags can move upstream, digest references are content-addressed, and customers may push their own content into downstream repositories. The visible repository state can involve both upstream-derived content and customer-owned downstream writes.

Closing

Artifact Cache is designed to make upstream-sourced artifacts behave like ACR-served content once customers choose to route those artifacts through their registry. The design goal is that customers can pull from ACR and reason about the result using ACR boundaries: registry configuration, local serving, customer-visible health, and predictable repository semantics.

Is Your Monitoring Actually Working? What's New in Monitoring Coverage
Monitoring is only useful when the right signals are collected, the right alerts are in place, and the data is actually flowing when teams need it. In large Azure environments, confirming all three across every VM and AKS cluster can still take too much manual work.

At Microsoft Ignite, we introduced Monitoring Coverage in Azure Monitor, a centralized preview experience for finding coverage gaps and enabling recommended VM and container monitoring at scale. At Microsoft Build, we are expanding that experience with two new capabilities that make monitoring easier to operationalize: data flow status and at-scale recommended alert enablement for virtual machines and Azure Kubernetes Service (AKS).

With these updates, teams can move beyond asking whether monitoring was configured. They can see whether recommended monitoring is enabled, whether important alert coverage is missing, and whether configuration issues may prevent monitoring data from reaching its destination.

[Figure: Monitoring Coverage overview with recommendations and data flow status.]

What is Monitoring Coverage?

Monitoring Coverage in Azure Monitor gives you a single place to review recommended monitoring across supported Azure resources. The Overview page summarizes coverage across your selected scope, shows Azure Advisor observability recommendations, and provides quick actions to enable recommended monitoring settings.

Coverage is grouped into basic, partial, and enhanced monitoring so you can quickly understand whether a resource is using only default monitoring or has the Microsoft-recommended configuration enabled. From there, you can drill into the Monitoring Details tab to review individual resources and take action.

New: data flow status

The most important question after enabling monitoring is simple: is the data flowing? Data flow status helps answer that question directly from Monitoring Coverage.
The new data flow status summary shows how many resources need attention, passed initial checks, or are not configured for validation. It also highlights top resources that need attention so operators can start with the most important issues first.

When you open data flow status for a resource, Azure Monitor shows validation checks across areas such as:

- Resource configuration
- Data collection rule associations
- Network connectivity
- Data flows to the configured destination

Detected issues are prioritized at the top of the details pane, and each validation check includes a recommended action. After making a fix, you can run validation again to confirm that data flow issues are resolved.

[Figure: Data flow status details with validation checks and recommended actions.]

Alternatively, you can visualize your data flows and identify problems from there.

New: enable recommended alerts at scale

Monitoring Coverage now also helps close alerting gaps. From the Overview page, you can see recommendations such as Enable VM Recommended Alerts and Enable AKS Recommended Alerts, then select Apply to configure recommended alert rules from a centralized flow.

For virtual machines, you can enable alerts across an entire subscription or choose selected resources. Subscription scope is useful when you want recommended alerts to apply broadly, including to future VMs in the selected subscription. Selected resource scope gives you more granular control when you want to enable alert rules for a specific set of VMs.

The enablement flow lets you review recommended alert rules, adjust thresholds, and configure notification options such as email, Azure Resource Manager role notifications, Azure mobile app notifications, or an existing action group. Some VMs may already have alerts configured, and new rules are designed not to duplicate existing alerts.
For AKS, Monitoring Coverage can surface recommended alert gaps and start the same guided pattern: review impacted resources, configure recommended alert settings, and use Review + Enable to create the alert rules.

A resource-centric view for follow-up

The Monitoring Details tab brings coverage and data flow into the same resource list. Two columns are especially useful for triage: Monitoring coverage and Data flow status. Select either value to open resource-level details.

Monitoring coverage details show what is configured for the resource, including VM Insights, recommended alerts, data collection rules, data sources, destinations, and agent version when available. Data flow details show validation results and recommended remediation steps. This makes it easier to move from a high-level gap to the specific resource and configuration that needs attention.

Getting started

Monitoring Coverage is available in preview from the Azure portal. Open Monitor, select Monitoring Coverage (preview), and choose the subscriptions and resources you want to review.

From the Overview page, you can:

- Review coverage across VMs and AKS resources.
- Apply recommendations to enable VM Insights, container monitoring, and recommended alerts.
- Use data flow status to find resources whose monitoring data needs attention.
- Open Monitoring Details for resource-level coverage and validation results.

A few preview notes: enablement operations include up to 100 resources at a time, and enabling monitoring or alert rules may create data collection rules, deploy Azure Monitor Agent, configure destinations, or create alert rules. Data collection, workspace ingestion, and alert rules may incur costs based on the settings you enable. To learn more, see Monitoring coverage in Azure Monitor (preview).

Looking ahead

Monitoring Coverage is part of our continued work to make Azure Monitor easier to operationalize at scale.
We want teams to spend less time hunting for monitoring gaps and more time acting on reliable, validated signals. We would love your feedback as you try these new Build updates and we look to expand support beyond this set of resource types. Use the Azure portal feedback options or share feedback through your Microsoft account team.

Managed Connectors for SRE Agent (preview) - Govern what your agent can do
Giving an agent access to a tool is the easy part. The harder question is what it's allowed to do with that access. "Can the agent copy a file in OneDrive?" mostly answers itself. "Can it copy any file, to any destination, over one that's already there?" is the one that decides whether the integration has a governance layer.

Managed Connectors is built around that second question. It expands the catalog of tools the agent can reach - OneDrive, SharePoint, Google Drive, GitLab, Power BI, Microsoft Security Copilot, with more being added regularly - and pairs it with a governance model that keeps the policy for those tools outside the agent's control. This is part of the Azure SRE Agent announcements at Build 2026.

What's new

Managed Connectors is the next generation of our connector experience. It significantly expands the catalog of third-party and first-party SaaS integrations available to SRE Agent and surfaces each one to the agent as a curated set of operations through the Model Context Protocol (MCP) - the same standard the agent already uses for every other tool source.

Governance: the agent gets capability, you keep control

The governance model is the headline of this release, so it's worth being concrete about it. When you add a connector, you walk through a short wizard - Set up connector, Configure tools, Review & Save - and the "Configure tools" step is where the policy is set. Three things make it different from "just wire the API up to the LLM":

- You choose what's exposed - it isn't automatic. A connector might offer 40+ operations; in the wizard you pick the ones the agent can use. The rest aren't shown to the model, so it can't call them.
- Parameter policy lives outside the agent. For each selected operation you can mark parameters as user-defined (pinned to a value you specify) or agent-defined (the agent fills it in).
On the Microsoft Planner "Create a task" tool, for example, you can choose the group ID from a list of your joined groups. The agent then provides the task details but can't assign the task to an arbitrary group, because the group ID isn't a parameter it sees when invoking the tool.
- Per-tool approval is built in. Each operation has an Allow/Ask toggle integrated directly into the creation and edit wizards. "Ask" routes the call through the agent runtime's human-in-the-loop approval flow before it executes. On that same Microsoft Planner connector, you might leave read-only tools like "List tasks" or "Get plan details" on Allow, but flip "Delete a task" to Ask so a human must confirm before anything is removed. This is enforced in the agent's runtime; it is not a prompt instruction the model can be talked out of following.

Credential Isolation

- No long-lived secrets in the agent. No API keys, no client secrets, no certificates, no OAuth tokens. All service credentials are encrypted at rest and stored outside of the agent's trust boundary.
- Automatic token refresh. Once you consent, the internal connector resource keeps your tokens valid. You won't be asked to re-authenticate unless your service itself requires it.
- You consent once, in your own browser, with your own service. SRE Agent never proxies your password or the sign-in flow.
- Per-connection authorization. Each connection is bound to the specific SRE Agent instance you set it up on and cannot be used by external threat actors.

How it fits together

All of this is stored and evaluated outside the agent loop. Each configured connector becomes an MCP server that the SRE Agent runtime registers as a tool source, the same standard wire format the agent uses for everything else, so adoption on the model side is trivial.
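The three governance controls described above (exposed operations, pinned parameters, and Allow/Ask) can be sketched as a small policy check that runs outside the model. This is a minimal illustration under stated assumptions: `ToolPolicy`, `visible_params`, `invoke`, the tool names, and the `approve` callback are hypothetical names for this example and are not part of the SRE Agent or MCP APIs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolPolicy:
    """Hypothetical per-operation policy, held outside the agent."""
    pinned: dict[str, Any] = field(default_factory=dict)  # user-defined parameters
    approval: str = "allow"                               # "allow" or "ask"


def visible_params(all_params: list[str], policy: ToolPolicy) -> list[str]:
    """Parameters the agent may fill in; pinned ones are hidden from it."""
    return [p for p in all_params if p not in policy.pinned]


def invoke(tool: str, agent_args: dict[str, Any], policies: dict[str, ToolPolicy],
           approve: Callable[[str, dict[str, Any]], bool]) -> dict[str, Any]:
    """Apply policy to an agent-proposed call and return the arguments actually sent."""
    if tool not in policies:
        # Operations that were never selected are not callable at all.
        raise PermissionError(f"{tool} is not exposed to the agent")
    policy = policies[tool]
    overlap = set(agent_args) & set(policy.pinned)
    if overlap:
        raise PermissionError(f"agent may not set pinned parameters: {sorted(overlap)}")
    final_args = {**agent_args, **policy.pinned}  # pinned values come from policy only
    if policy.approval == "ask" and not approve(tool, final_args):
        raise PermissionError(f"{tool} was not approved")
    return final_args
```

For example, with a `create_task` policy that pins `group_id`, the agent supplies only the task details and the pinned group is merged in by the policy layer; a `delete_task` policy set to `"ask"` raises unless the `approve` callback returns `True`.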
Each layer does one job, and the trust boundary between "what the model decided" and "what was actually sent" is explicit and inspectable: the agent never sees the operations you didn't select, never sees the parameter slots you pinned, and cannot bypass approval on operations you marked Ask.

How to try it

1. Open the SRE Agent portal and go to Builder > Connectors.
2. Pick a connector from the catalog with the "Preview" label and go through the creation wizard steps.
3. At the "Set up connector" step, choose how the connector authenticates. Start with "OAuth" if you just want to sign in and see it working against your own account.
4. At "Configure tools", select the operations you want to expose, pin any parameters that shouldn't be agent-controlled, and mark sensitive operations as "Ask."
5. Review & Save. The connector is registered with the runtime and immediately available to your agent. You can enable or disable specific tools or connectors in the "Capabilities" section.
6. Edit connector: after creating the new connector, you can go back at any point to authenticate it with a different account, add or remove operations, update tool parameters, and configure approval policies.

Resources

- Create new SRE Agent: https://aka.ms/sreagent
- SRE Agent Documentation: https://aka.ms/sreagent/newdocs
- SRE Agent recipes: https://aka.ms/sreagent/recipes
- Build 2026 SRE Agent announcements: https://aka.ms/Build26/blog/SREAgent