containers

409 Topics

Follow up to Azure Host OS Article
When will there be a follow up to the Azure Host OS article? It would be great to learn more about how the NT kernel is being used inside Microsoft outside of standard Windows. This is the article: Azure Host OS
Steskalj
Jul 11, 2026 Place Windows Server Insiders
58Views
0likes
1Comment
Lock Down AKS End to End with Application Gateway for Containers and Managed Cilium L7
Hello Folks! If your AKS cluster looks like most production clusters I have walked through, one of two things is true. Either nobody writes any network policies and every pod can talk to every other pod, so one compromised container blows up the entire blast radius. Or somebody wrote a few coarse rules along the lines of “namespace A talks to namespace B over port 80”, which sounds secure right up until an attacker realizes that port 80 is exactly where they were planning to live anyway. Real attacks happen at Layer 7, dressed up like ordinary HTTP traffic, and L3 / L4 plumbing cannot tell the difference. That is the gap session MAIS09 from the Microsoft Azure Infrastructure Summit 2026 closes. Vyshnavi Namani and Darshil Shah from the Azure Networking product team walked through how two AKS-managed add-ons, Application Gateway for Containers (AGC) and Cilium L7 via Advanced Container Networking Services (ACNS), can lock down the entire path from the public internet to a single pod. No NGINX. No external WAF appliance. No third-party CNI to babysit. 📺 Watch the session: Why IT Pros Should Care Let me cut to the chase. If you operate AKS clusters today, this session matters because: You probably still have an ingress controller and an external WAF stitched together with annotations and prayers. AGC plus ACNS collapses that stack into first-party add-ons that AKS owns end to end. Both Application Gateway for Containers and Advanced Container Networking Services are generally available. This is not a preview demo, this is production. Security is finally readable. Every rule is a YAML object. Code review, audit, GitOps. No more “what does this NGINX config map even do anymore” archaeology. It actually works on a real attack pattern. The demo shows WAF killing a SQL-injection-style GET that Cilium would have happily forwarded, because the method (GET) was on the allow list. If you have ever had to explain to an auditor why a single compromised pod could pivot across your whole cluster, this is your exit ramp. The AKS Security Gap This Closes Most clusters are protected by a load balancer at the edge and basically nothing inside. The cluster door looks like a vault, but the hallways are wide open. Cilium calls this the lateral movement problem, and it is exactly how Kubernetes attacks unfold in the wild. Compromise a pod, then phone home, then pivot. What MAIS09 demonstrates is something different. AGC is the L7 front door (the metal detector at the lobby). ACNS Cilium L7 is the lock on every pod’s office door. Both speak HTTP. Both enforce identity. Both are managed by AKS itself. The legacy alternative, Application Gateway Ingress Controller (AGIC), bolted a full Application Gateway onto your cluster through a translator. Two services, two lifecycles, two finger-pointing teams when something broke. AGC is the successor, built from scratch for Kubernetes, speaking the Gateway API natively, enabled with a single AKS flag. AKS provisions the controller, wires the identity, delegates the subnet, and owns the upgrades. You own the policies. AGC + Managed Cilium, End to End Here is the mental model from the session. Picture four concentric layers of defense between the public internet and a pod. AGC front end. One Azure resource, one public DNS name, and (thanks to the Kubernetes Gateway API) multiple hostnames behind the same IP. The demo runs Contoso, Fabrikam, and Adventure Works on a single AGC public IP using three HTTPRoute objects. One infrastructure, three websites. Real cost savings, real ownership clarity (platform owns the Gateway, app teams own the HTTPRoutes). Azure WAF on AGC. This is the content inspector. It runs the OWASP Core Rule Set (DRS 2.1 in the demo) against every incoming request, looks for SQL injection, cross-site scripting, path traversal, and the rest of the OWASP Top 10, and returns a 403 before the packet ever touches your pod. Microsoft maintains the rule set, you bind it to AGC via a SecurityPolicy. ACNS Cilium L7 ingress on every pod. This is where identity-based policy lives. Rules key off pod labels, not IPs, because IPs change every time the cluster autoscaler does its job. The demo uses an allow-agc-l7-get-only CiliumNetworkPolicy that lets the AGC backend reach the tenant pods, but only with GET or GET /products. Anything else, POST, PUT, DELETE, gets a Cilium-synthesized 403 before NGINX ever sees the request. ACNS east-west and egress policy. Two more policies do the heavy lifting inside. client-may-call-contoso-get-only lets the client pod reach Contoso with GET, and only Contoso. A default-deny baseline blocks everything else (pod-to-pod and pod-to-internet) with a single carve-out for kube-dns on port 53. The magic is that the same Cilium engine handles north-south, east-west, and egress with one consistent identity model. eBPF in the Linux kernel does the enforcement on the same node as the pod, so the decision happens before the packet leaves the host. No sidecars, no iptables sprawl, no daemonset you need to upgrade by hand. Real-world Scenarios The demo walks through six tests and the results map directly onto things you are probably trying to solve right now: Multi-site hosting on one IP. Three hostnames, one AGC, three 200 OKs from three different backend pods. If you are paying for three load balancers today, you can stop. WAF blocks a malicious GET that ACNS would have let through. This is the punch line of why you need both layers. The method (GET) is on the Cilium allow list, but the payload is a SQLi pattern. WAF returns 403 at the edge. Defense in depth, working as advertised. Method enforcement at the pod door. GET returns 200, POST/PUT/DELETE return 403, GET /admin returns 403, GET /products returns 200. Cilium is doing actual HTTP inspection, not just dropping packets. East-west enforcement with readable verdicts. Client to Contoso GET is 200. Same client, same destination, POST is 403 (L7 deny, TCP completed). Client to Fabrikam is 000 (L4 drop, no TCP handshake). Reading the difference between 403 and 000 is now a debuggable signal, not a mystery. Default-deny egress kills phone-home. A pod tries to reach bing.com. DNS resolves (the carve-out works), TCP SYN goes nowhere, wget gives up with exit code 1. If that pod was compromised and trying to exfiltrate data, this is where the attack chain dies. Selective allow still works. Same pod, same tools, but a DNS lookup against kube-dns inside the cluster returns instantly. We did not unplug the network. We locked it down with a purpose. Honest tradeoffs to call out. The session does not pretend everything is free. AGC introduces a billed subnet association and a managed identity you do not manage in BYO mode. Cilium L7 needs the Cilium data plane (ACNS Container Network Security features are Cilium-only). The Envoy proxy that handles L7 inspection has a cost only when you actually enforce L7, which is a fair deal in my book. Getting Started If you want to try this on a cluster of your own, three flags do most of the work on az aks create: --network-dataplane cilium (turns on the eBPF data plane) --enable-acns (enables Advanced Container Networking Services, including Hubble observability and Cilium L7 policy) --enable-app-routing or the ALB add-on flag (provisions the AGC controller as an AKS-managed add-on) From there you write four YAML objects: a default-deny CiliumNetworkPolicy, an allow-DNS carve-out, an AGC ingress allow with method and path constraints, and your east-west allow rules. The session repo includes the full set so you can clone and follow along. One bonus worth knowing about. ACNS ships Hubble out of the box, with pre-built Azure Managed Grafana dashboards. Flow logs, service maps, policy hit counts. Even on pods that are not yet under L7 enforcement, you get observability for free. When something breaks at 2 a.m., you have an audit trail instead of a tcpdump. Resources Azure Application Gateway for Containers documentation Set up Layer 7 policies with Advanced Container Networking Services AKS security concepts Cluster security best practices for AKS Container Network Observability for AKS (Hubble, Prometheus, Grafana) Advanced Container Networking Services hands-on lab Use cases of Advanced Network Observability for AKS (Azure Networking Blog) Watch the Rest of the Summit If MAIS09 hit the spot, there are dozens more sessions in the same playlist covering AKS networking at scale, Azure Local, AVM, the new Deployment Agent, and a lot more. Grab a coffee and binge. Microsoft Azure Infrastructure Summit 2026 playlist Cheers! Pierre Roman
Pierre_Roman
Jul 09, 2026 Place ITOps Talk Blog
139Views
1like
1Comment
IPv6 Dual-Stack Endpoints for Azure Container Registry (Public Preview)
By Johnson Shi, Aviral Takkar, Bin Du Introduction Two of the most common networking questions we hear from teams running Azure Container Registry (ACR) are: "Can my registry serve clients on IPv6 networks?" — Teams operating IPv6-only or dual-stack networks need their container registry reachable over IPv6. "How do we start moving registry traffic toward IPv6 without breaking anything?" — Organizations guarding against IPv4 address exhaustion, or operating under IPv6 transition mandates, want a migration path that doesn't disrupt existing IPv4 clients. Today, we're announcing the public preview of IPv6 dual-stack endpoints for Azure Container Registry for public endpoints and firewall rules, with IPv6 over private endpoints planned for GA. Set your registry's endpoint protocol to IPv4AndIPv6 , and its endpoints become reachable over both IPv4 and IPv6 — so IPv4-only, dual-stack, and IPv6-capable clients all connect to the same registry, each over whichever protocol their network stack selects. Key Takeaways ACR registries now support an endpointProtocol setting with two values: IPv4 (default) and IPv4AndIPv6 (dual stack, preview). Dual stack is additive — your registry continues serving IPv4 clients exactly as before. There is no IPv6-only mode. Dual stack requires dedicated data endpoints to be enabled ( --data-endpoint-enabled true ), and dedicated data endpoints require the Premium SKU. The service enforces this requirement. You can enable it today with Azure CLI 2.87.0 via az acr update --endpoint-protocol IPv4AndIPv6 . FQDN-based client firewall rules keep working unchanged; IP-based allowlists need to account for IPv6 traffic. Limitation: This public preview covers IPv6 for the registry's public endpoints and firewall rules only. IPv6 over private endpoints is planned for a future release. Limitation: ACR Tasks isn't supported on a registry that has IPv6 dual-stack enabled. Tasks does not work when the endpoint protocol isIPv6 dual-stack, including quick builds (with az acr build) and quick task runs (with az acr run). Support is planned for a future release. How to enable it On an existing registry (Azure CLI 2.87.0 or later) Dual stack requires dedicated data endpoints, so enable both in a single update: az acr update --name <your-registry> --data-endpoint-enabled true --endpoint-protocol IPv4AndIPv6 If dedicated data endpoints are already enabled, set the endpoint protocol on its own: az acr update --name <your-registry> --endpoint-protocol IPv4AndIPv6 Verify the configuration: az acr show --name <your-registry> --query "{endpointProtocol:endpointProtocol, dataEndpointEnabled:dataEndpointEnabled}" { "dataEndpointEnabled": true, "endpointProtocol": "IPv4AndIPv6" } Note: If your clients sit behind a firewall and you're enabling dedicated data endpoints for the first time, add firewall rules for <your-registry>.<region>.data.azurecr.io before enabling — switching from *.blob.core.windows.net to dedicated data endpoints changes where layer blobs are downloaded from. See Dedicated data endpoints for details. Reverting to IPv4 Dual stack is reversible at any time: az acr update --name <your-registry> --endpoint-protocol IPv4 Reverting the endpoint protocol leaves dedicated data endpoints enabled; disable them separately if desired. Scope of this preview This public preview enables IPv6 for the registry's public endpoints — the login server, dedicated data endpoints, and regional endpoints (if enabled). IPv6 over private endpoints isn't part of this preview. Support is planned for a future release. Until then, registries reached through a private endpoint continue to use IPv4. Additionally, IPv6 dual-stack support for ACR Tasks, including support for `az acr build` and `az acr run`, are not supported in the public preview. Support is planned for a future release. Requirements and how features compose Requirement Why Premium SKU Dedicated data endpoints are a Premium feature. Dedicated data endpoints enabled IPv4AndIPv6 requires dataEndpointEnabled: true ; the service rejects the setting otherwise. Azure CLI 2.87.0+ Adds --endpoint-protocol to az acr update . For geo-replicated registries, the endpoint protocol is a registry-level setting, and dedicated data endpoints exist in every replica region. Firewall guidance: rules based on registry FQDNs — the login server, dedicated data endpoints, and regional endpoints (if enabled) — continue to work unchanged for dual-stack registries; only IP-address-based allowlists need updating for IPv6. To learn more, see IPv6 dual-stack endpoints in Azure Container Registry (preview) and the ACR endpoint reference. If you have further questions about IPv6 dual-stack endpoints or dedicated data endpoints, reach out to us on the Azure Container Registry GitHub repository or file feedback through the Azure portal.
johnsonshi_msft
Jun 24, 2026 Place Apps on Azure Blog
177Views
1like
0Comments
How Many Copies of Each Layer Does Your Container Registry Actually Need?
Authors: Payal Mahesh and Vicky Lin Azure Container Registry team: Jeanine Burke and Johnson Shi Introduction It's Monday morning. You spin up a fresh 1,000-node AKS cluster for a big training run or a fleet-wide rollout. Every node reaches for the same large container image at the same instant. What actually happens in the next ten minutes - and whether your pods reach Ready in 9 minutes or 14 - turns out to depend on a single number you've probably never thought about: how many copies of each image layer exist behind your registry. At the surface, you see a single capacity number for your registry size - but behind that abstraction, Azure Container Registry maintains copies of your layer data to optimize pull performance. That number of copies directly determines the read throughput available per layer. Each copy can serve requests independently, so distributing the layer across storage allows it to be read in parallel. More copies mean more independent readers - and higher aggregate throughput when thousands of nodes pull at once. The intuitive answer is that more is better: add copies, get faster pulls. When we actually tested it at 1,000-node scale, the truth turned out to be more interesting: A few extra copies helped a little. A moderate number helped a lot, and eliminated storage throttling entirely. A large number helped no more than the moderate one. A huge number actually made pulls slower again. Think of it like opening checkout lanes at a grocery store. Opening a few more lanes when the store is slammed cuts the line dramatically. Past a certain point, though, extra lanes barely help, because by then it's the customers, not the cashiers, who are the bottleneck. And open too many? Now the staff is spread thin and tripping over each other, and the line moves worse than it did at the sweet spot. This post walks through what we measured, why the curve bends where it does, and what we're building next so finding that sweet spot isn't something anyone has to do by hand. Key Takeaways There's a sweet spot, not a slope. Adding copies per layer cut pod-startup P99 by 27% and raised P50 per-node egress throughput by 244%, but only up to a point. Past that, the returns vanish, and far past it, latency actually regresses. Storage throttling is the real enemy. The win comes from spreading load across enough storage backends that no single backend gets pinned at its egress ceiling. Once throttling is gone, more copies stop helping. Storage scale alone has a ceiling. Even at the sweet spot, the per-backend egress limit caps total throughput. The next jump in performance has to come from somewhere else, which is exactly what we're building (see What's Next). This isn't something customers should need to manage. We're building a proactive, on-demand storage scaling capability that automatically grows the footprint before throttling happens and shrinks it back when the burst is over. A quick bit of background Within a region, the layer data behind your container images is backed by Azure storage. The number of copies ACR maintains per layer determines how many independent storage backends a concurrent-pull workload can spread its reads across. That's what matters, because each backend has a finite egress ceiling. Once concurrent reads against one backend get close to that ceiling, requests start getting throttled, and your pulls slow down in proportion. The principle is simple: more copies per layer means more backends serving the same data, which means more total egress headroom and fewer throttled requests. What we wanted data on was how many, and where it stops helping. How we tested We ran a controlled series of large-scale pull tests against ACR Premium on a roughly 1,000-node cluster, with every node pulling the same large image cold at the same time (no local cache on any node). The only thing we changed between runs was the number of per-layer copies behind a single registry endpoint. Everything else, including rate limits, the image, node count, and concurrency, stayed constant. For each run we measured pod-startup latency (P50/P90/P99), end-to-end storage read latency, egress throughput distributions (P50-P99.9), and storage throttling events. Pod-startup latency is our headline metric, because it's the one number that reflects the actual customer experience no matter where the bottleneck happens to be. Per-node egress throughput matters too, though. It tells you directly how much pull bandwidth ACR delivers to your fleet, and it's usually what customers have in mind when they ask how much faster extra copies will make their pulls. We report egress as a distribution rather than a single average, since per-request and per-time-window views can tell very different stories about the same set of pulls. These are observations from a single controlled environment, not a service guarantee. Absolute numbers will move with image size, node count, layer composition, network topology, and concurrency. What we found We tested five configurations, sweeping from a low baseline number of per-layer copies up to a very high one. We name them by relative copy count rather than exact instance counts: Baseline: the lowest level, our reference point. Low: a modest step up from Baseline. Mid: a meaningful step up from Low. Higher: a further step up from Mid. Very high: the largest configuration we tested, well above Higher. Here are the numbers. All percent changes are relative to Baseline. Configuration Pod startup P50 Pod startup P90 Pod startup P99 Storage throttling events Peak per-backend egress Baseline (fewest copies) 9m 36s 11m 0s 14m 16s Many; all top backends above the egress ceiling Highest Low 9m 27s (−2%) 10m 14s (−7%) 12m 59s (−9%) Some; one backend still above the ceiling High Mid 9m 25s (−2%) 9m 45s (−11%) 10m 22s (−27%) Zero Below the ceiling Higher 9m 20s (−3%) 9m 37s (−13%) 10m 22s (−27%) Zero Well below the ceiling Very high 9m 28s (−1%) 10m 31s (−4%) 13m 48s (−3%) Zero Lowest Look at the P99 pod-startup column from top to bottom: 14m 16s, 12m 59s, 10m 22s, 10m 22s, 13m 48s. It improves, flattens out, then climbs back up. Three things explain that shape: 1. The win: Throttling falls off a cliff at the Mid configuration As we added copies per layer, per-backend egress fell and storage-side throttling decreased. At the Mid configuration, throttling errors hit zero, and they stayed at zero for every configuration above it. The upside isn't just that the errors went away, though. It's raw pull bandwidth. At the Mid sweet spot, the typical node saw its P50 egress throughput jump 244% over Baseline. With load spread across enough copies, each node pulled its layers off storage much faster, not just without stalling. For a workload owner, that's the difference between watching pods come up in a steady stream and watching them stall for tens of seconds at a time while throttling clears. Same image, same node count, same registry, very different experience. To put it in concrete terms: if your team runs a daily AI training kickoff that needs all 1,000 nodes pulling before the job can start, this is the difference between starting on time and starting four minutes late every day. Over a quarter of training runs, that adds up. 2. The surprise: more copies made pulls slower This is the finding that genuinely surprised us. Going from Higher to Very high, the largest configuration we tested, cost us 3 minutes and 26 seconds at P99: 10m 22s climbing back up to 13m 48s. That gave back almost the entire benefit we'd built up over the previous four configurations. Tail storage-read latency at Very high actually came out worse than Baseline. The Very high run is where the wheels came off, and the reason is the trade-off underneath. Once storage throttling is gone, more copies stop buying you anything, and the cost of fanning reads across that many backends starts to take over. The throughput distribution shows it clearly. P50 and P75 throughput had been climbing steadily and getting smoother through Mid and Higher, then dropped sharply at Very high while the peak P99/P99.9 spikes came back. Spread the same load across too many backends and it fragments into smaller, less consistent bursts. The takeaway is that "more is better" stops being true past the sweet spot, and the failure mode is quiet. You won't see throttling errors. You'll just see your pulls get slower. 3. What we didn't expect: at few copies, the hottest backend is what hurts you At the lowest copy counts, pull traffic wasn't spread evenly across the underlying storage footprint. Some backends absorbed far more traffic than others. As we added copies, that distribution evened out and the hottest backends cooled down. The implication is sharp. You can saturate the busiest backend, and trigger throttling, even when the total headroom across all your backends is large in aggregate. What matters is the load on the hottest backend, not the average. That's exactly the failure mode that demand-driven, proactive scaling (described below) is meant to head off before it happens. So how should you think about this? You don't size copies yourself; ACR manages the storage footprint behind your registry. Still, it helps to understand what moves the sweet spot, because the shape of your own workload is what decides where it lands. The bigger your worst-case concurrent burst (more nodes, larger images, higher concurrency), the more copies per layer it takes to keep pulls off the throttling ceiling, and the further out the sweet spot sits. Smaller workloads may already be sitting on the flat part of the curve. One thing is worth saying plainly. The storage footprint underneath is managed by ACR and shared across many registries, so there's no fixed, private storage budget that maps one-to-one to your workload. The sweet spot isn't a number you compute and provision; it's a behavior the platform has to land on for you, which is exactly why we're moving toward demand-driven scaling that handles it automatically. That's what brings us to what we're building next. What's next: proactive, on-demand storage scaling and a caching layer The fixed-copy tests above answer the question "how many should the ACR system provision?" but they assume a single, static answer. Real workloads aren't static. A 1,000-node burst happens at deploy time, not at 3 a.m. on a Tuesday. And no matter how many copies are provisioned, the per-backend storage ceiling still bounds peak deliverable throughput. So we're investing along two complementary directions. 1. Proactive, demand-driven storage scaling We're building a capability that adjusts the number of per-layer copies automatically based on real-time pull demand: Proactive, not reactive. The system scales the storage footprint before concurrent pull pressure pushes any single backend near the throttling threshold, so throttling is prevented before it forms rather than cleaned up after the fact. On-demand scale-out. The footprint expands automatically as sustained pull demand grows. Scale-in when demand subsides. The footprint contracts so you're not paying for steady-state capacity you only needed during a burst. Tiering for cold content. Long-tail, rarely-pulled content can sit on colder storage, so the redundant footprint of frequently-pulled content doesn't pay full hot-storage cost everywhere. The benefit to customers is straightforward: smoother pulls under burst, higher delivered throughput on average, no permanent over-provisioning, and no manual re-tuning as workloads grow. 2. A caching layer to absorb burst beyond the storage ceiling Even a perfectly scaled storage footprint runs into the per-backend egress ceiling at extreme scale. To push past it, we're investing in a caching layer in the registry service that absorbs burst traffic before it ever reaches storage. A pull surge that hits the same set of layers, which is the common case for fleet-wide deployments, can be served largely from cache. That takes a lot of load off any single storage backend and complements the storage scaling above. We'll share results from this work in follow-up posts. If you have questions about scaling ACR for your workload, or about how we measure storage performance, reach out on the Azure Container Registry GitHub repository. Note: All results in this post are based on controlled internal testing configurations and are intended to illustrate general scaling behavior rather than prescribe exact configurations.
payalmahesh
Jun 22, 2026 Place Apps on Azure Blog
226Views
0likes
0Comments
Tutorial:A graceful process to develop and deploy Docker Containers to Azure with Visual Studio Code
Creating and deploying Docker containers to Azure resources manually can be a complicated and time-consuming process. This tutorial outlines a graceful process for developing and deploying a Linux Docker container on your Windows PC, making it easy to deploy to Azure resources. This tutorial emphasizes using the user interface to complete most of the steps, making the process more reliable and understandable. While there are a few steps that require the use of command lines, the majority of tasks can be completed using the UI. This focus on the UI is what makes the process graceful and user-friendly. In this tutorial, we will use a Python Flask application as an example, but the steps should be similar for other languages such as Node.js. Prerequisites: Before you begin, you'll need to have the following prerequisites set up: WSL 2 installation WSL provides a great way to develop your Linux application on a Windows machine, without worrying about compatibility issues when running in a Linux environment. We recommend installing WSL 2 as it has better support with Docker. To install WSL 2, open PowerShell or Windows Command Prompt in administrator mode, enter below command: wsl --install And then restart your machine. You'll also need to install the WSL extension in your Visual Studio Code. Python 3 installation Run “wsl” in your command prompt. Then run following commands to install python 3.10 (if you use Python 3.5 or a lower version, you may need to install venv by yourself): sudo apt-get update sudo apt-get upgrade sudo apt install python3.10 Docker for Linux You'll need to install Docker in your Linux environment. For Ubuntu, please refer to below official documentation: https://docs.docker.com/engine/install/ubuntu/ Docker for Windows To create an image for your application in WSL, you'll need Docker Desktop for Windows. Download the installer from below Docker website and run the downloaded file to install it. https://www.docker.com/products/docker-desktop/ Steps for Developing and Deployment 1. Connect Visual Studio Code to WSL To develop your project in Visual Studio Code in WSL, you need to click the bottom left blue button: Then select “Connect to WSL” or “Connect to WSL using Distro”: 2. Install some extensions for Visual Studio Code Below two extensions have to be installed after you connect Visual Studio Code to WSL. The Docker extension can help you create Dockerfile automatically and highlight the syntax of Dockerfile. Please search and install via Visual Studio Code Extension. To deploy your container to Azure in Visual Studio Code, you also need to have Azure Tools installed. 3. Create your project folder Click "Terminal" in menu, and click "New Terminal": Then you should see a terminal for your WSL. I use a quick simple Flask application here for example, so I run below command to clone its git project: git clone https://github.com/Azure-Samples/msdocs-python-flask-webapp-quickstart 4. Python Environment setup (optional) After you install Python 3 and create project folder. It is recommended to create your own project python environment. It makes your runtime and modules easy to be managed. To setup your Python Environment in your project, you need to run below commands in the terminal: cd msdocs-python-flask-webapp-quickstart python3 -m venv .venv Then after you open the folder, you will be able to see some folders are created in your project: Then if you open the app.py file, you can see it used the newly created python environment as your python environment: If you open a new terminal, you also find the prompt shows that you are now in new python environment as well: Then run below command to install the modules required in the requirement.txt: pip install -r requirements.txt 5. Generate a Dockerfile for your application To create a docker image, you need to have a Dockerfile for your application. You can use Docker extension to create the Dockerfile for you automatically. To do this, enter ctrl+shift+P and search "Dockerfile" in your Visual Studio Code. Then select “Docker: Add Docker Files to Workspace” You will be required to select your programming languages and framework(It also supports other language such as node.js, java, node). I select “Python Flask”. Firstly, you will be asked to select the entry point file. I select app.py for my project. Secondly, you will be asked the port your application listens on. I select 80. Finally, you will be asked if Docker Compose file is included. I select no as it is not multi-container. A Dockefile like below is generated: Note: If you do not have requirements.txt file in the project, the Docker extension will create one for you. However, it DOES NOT contain all the modules you installed for this project. Therefore, it is recommended to have the requirements.txt file before you create the Dockerfile. You can run below command in the terminal to create the requirements.txt file: pip freeze > requirements.txt After the file is generated, please add “gunicorn” in the requirements.txt if there is no "gunicorn" as the Dockerfile use it to launch your application for Flask application. Please review the Dockerfile it generated and see if there is anything need to modify. You will also find there is a .dockerignore file is generated too. It contains the file and the folder to be excluded from the image. Please also check it too see if it meets your requirement. 6. Build the Docker Image You can use the Docker command line to build image. However, you can also right-click anywhere in the Dockefile and select build image to build the image: Please make sure that you have Docker Desktop running in your Windows. Then you should be able to see the docker image with the name of the project and tag as "latest" in the Docker extension. 7. Push the Image to Azure Container Registry Click "Run" for the Docker image you created and check if it works as you expected. Then, you can push it to the Azure Container Registry (ACR). Click "Push" and select "Azure". You may need to create a new registry if there isn't one. Answer the questions that Visual Studio Code asks you, such as subscription and ACR name, and then push the image to the ACR. 8. Deploy the image to Azure Resources Follow the instructions in the following documents to deploy the image to the corresponding Azure resource: Azure App Service or Azure Container App: Deploy a containerized app to Azure (visualstudio.com) Opens in new window or tab Container Instance: Deploy container image from Azure Container Registry using a service principal - Azure Container Instances | Microsoft Learn Opens in new window or tab
yorkzhang
Jun 18, 2026 Place Apps on Azure Blog
6.9KViews
4likes
1Comment
Introducing Azure Container Apps Sandboxes: Secure Infrastructure for Agentic Workloads
Today we are announcing the public preview of Azure Container Apps Sandboxes - a new first-class resource type that gives you fast, secure, ephemeral compute environments with built-in suspend and resume. This is the underlying infrastructure on which products like Cloud sandboxes in GitHub Copilot, Foundry Hosted Agents, and Azure Container Apps Express are built, you now have the opportunity to build your solutions leveraging this infrastructure. Azure Container Apps Sandboxes unlocks two massive opportunities. For platform developers and ISVs, sandboxes give you the same isolated compute fabric that powers many Microsoft products. You get the building blocks to create your own multi-tenant platform on proven, enterprise-scale infrastructure. For AI agents, sandboxes become a self-configurable tool that lets agents extend their own capabilities on the fly. An agent can spin up a fresh sandbox in milliseconds and use it to execute untrusted code, compile source, test HTTP requests against a live app, launch a browser session, or tackle whatever needs a quick and scalable infrastructure. On one side it empowers humans to build platforms, on the other it empowers agents to build their own capabilities. Both get enterprise-grade isolation, instant startup, and snapshot-based persistence out of the box. We'll walk through the resource model, sandbox lifecycle, the features that set Sandboxes apart - like snapshots, lifecycle policies, network egress controls, volumes, and managed identities - and show you how to get started with the portal and CLI. What Are Container Apps Sandboxes? Container Apps Sandboxes are secure, isolated compute environments that start in sub-second time, scale to thousands, and cost nothing when idle. Each sandbox runs in its own hardware-isolated microVM boundary - fully separated from the host, the platform, and every other sandbox. You bring your own Open Container Initiative (OCI) image, and Sandboxes handle the rest: provisioning from prewarmed pools, strong multi-tenant isolation, and snapshot-based suspend/resume that preserves full memory and disk state across sessions. There are many ways Sandboxes can help you build your next project - here are a few: Your own build & test systems - wire a Sandbox into your CI/CD flow to run builds while your laptop stays cool. Agents that can run anything safely - an agent spawns a sandbox, executes work inside it, and returns the output with no agent host privileges required. Agent swarms - decompose a research question, spawn N sandbox workers in parallel (each pinned to its own image and egress policy), and synthesize the result. Early access customers are already unlocking significant benefits by leveraging Azure Container Apps Sandboxes. "With Azure Container Apps sandboxes, SitecoreAI can safely enable agents to take real action. The combination of multi-tenant isolation, rapid scale-out, and full automation allows Sitecore to run long-lived, autonomous agents that securely execute code, manage workflows, and interact with enterprise systems within secure, governed environments. With this foundation, we can build agents that do real work: assembling content, personalizing experiences, and optimizing campaigns in production. Agents that operate continuously, learn from results, and improve over time, so our customers get better outcomes without giving up control." - Mo Cherif, VP of AI and Innovation, Sitecore "We got early access to Azure Container Apps Sandboxes, and got the first prototype integrated with Atlas AI in hours, and it's already shaping a new Atlas AI capability that we plan to launch in preview in Q3. It gives every Atlas AI agent a safe, sandboxed workspace (file system, terminal, code execution) on a customer's live data in Cognite Data Fusion. The value: Industrial process, reliability, and production engineers spend days and weeks on questions like "which wells are underperforming and why?" These questions are tractable but expensive, so they are asked rarely and decisions are made on gut feel. With this, an agent pulls the data, runs the analysis, cross-references maintenance and inspection records, and returns a cited draft in minutes. Sandboxes make it practical: Aligned feature set, per-customer isolation, pause/resume across multi-day investigations, scale-to-zero economics." - Kelvin Sundli, Product manager, Atlas AI, Cognite Resource Model: Sandbox Groups and Sandboxes The top-level ARM resource is Microsoft.App/SandboxGroups. A Sandbox Group is the management boundary for a collection of sandboxes that share configuration - think of it like a Container Apps Environment, but purpose-built for sandboxes. When you create a Sandbox Group, you specify: Subscription, Resource Group, and Region Sandbox defaults (optional): default CPU, memory, disk, max sandbox count, and default idle timeout Networking: optionally deploy into a custom VNet with a dedicated subnet for private networking Identity: System or user assigned Entra identity. Individual sandboxes are created within a Sandbox Group. Each sandbox has its own source (disk image or snapshot), resource tier, lifecycle policy, network egress policy, environment variables, ports, volumes, and connections. Sandbox Lifecycle Sandboxes have a well-defined lifecycle with the following states: State Description Creating Provisioning the sandbox from a disk image or snapshot Running Actively executing - backed by a live microVM Idle System-suspended after inactivity; can auto-resume on the next request Suspended Full state (memory + disk) preserved as a snapshot; no compute costs Resuming Restoring from a suspended or idle state - sub-second for most workloads Stopped User-initiated stop; can be resumed Stopping Graceful shutdown in progress Deleting Teardown in progress The key insight here is the distinction between Idle and Suspended. When a sandbox goes idle (e.g., no traffic for a configured timeout), the system can automatically suspend it and capture a snapshot. When a new request arrives, the sandbox resumes transparently. This gives you scale-to-zero economics with stateful compute - something that wasn't possible before without significant custom engineering. Disk Images: Bring Your Own Container Sandboxes boot from Disk Images - Open Container Initiative (OCI) images converted into an optimized root filesystem format. You point to any OCI image (public or private registry), and the platform builds a bootable disk image from it. You can start with public, pre-built images maintained by the platform (for example, Ubuntu base images), or bring your own private images. For private registries, you can authenticate with username/token or use a user-assigned managed identity for Azure Container Registry (ACR) – integrated with Azure as you expect. Snapshots: Full-State Persistence Snapshots capture the complete state of a running sandbox - memory, disk, and all running processes. When you resume a sandbox from a snapshot, every process, open file handle, and in-memory data structure is restored exactly as it was. A snapshot captures the full state of a running sandbox: memory pages, disk, processes. Two ways to make one - automatically on suspend, or manually on demand. Three things they're great for: Checkpointing mid-task so a long-running agent can resume exactly where it left off Cloning an environment that's already warm - dependencies installed, caches populated, services running Shipping a "ready-to-go" state that resumes in sub-second instead of cold-booting Snapshots are free during the preview, after which they will be stored as Azure Blob Storage at standard rates. Each snapshot records the source sandbox, resource allocation (CPU, memory, disk), and container metadata - so what you get back is exactly what you snapshotted. Resource Tiers Every sandbox is assigned to a resource tier that determines its CPU, memory, and disk allocation: Tier CPU Memory Disk XS 0.25 vCPU 0.5 GB 5 GB S 0.5 vCPU 1 GB 10 GB M (default) 1vCPU 2 GB 20 GB L 2 vCPU 4 GB 40 GB XL 4 vCPU 8 GB 80 GB When creating a sandbox from a snapshot, the resource tier is inherited from the snapshot and cannot be changed - this ensures the restored environment has the exact resources it was running with when the snapshot was taken. Lifecycle Policies: Auto-Suspend and Auto-Delete Every sandbox can be configured with lifecycle policies that automate state transitions and cleanup: Auto-Suspend Idle timeout: How long a sandbox can sit idle before being suspended (configurable: 1m, 2m, 5m, 10m, 30m, 60m) Suspend mode: Disk + Memory (default): Full snapshot including memory state - resume picks up exactly where you left off, with all processes and in-memory data intact. Disk: Only the disk is preserved; the VM restarts fresh on resume. Useful when you only need file persistence, not process continuity. Auto-Delete Automatically delete sandboxes after a configurable number of days of inactivity Prevents accumulation of abandoned sandboxes that consume snapshot storage These lifecycle policies are what make Sandboxes economically viable at scale. A platform serving thousands of tenants can configure aggressive idle timeouts (say, 60 seconds) with Memory suspend mode, and each tenant's sandbox disappears from the billing meter almost immediately - but resumes in sub-second time the moment they return. Network Egress Policy For scenarios involving untrusted code - AI agents executing LLM-generated scripts, multi-tenant SaaS with user-submitted workloads - controlling outbound network access is critical. Sandboxes provide a per-sandbox Network Egress Policy: Default action: Allow or Deny all outbound traffic Host rules: Domain-pattern rules (e.g., *.github.com → Allow) to permit specific destinations Custom CIDR rules: Network-level rules for IP ranges (e.g., 10.0.0.0/8 → Deny) Skip egress proxy: Option to bypass the egress proxy entirely when custom VNet routing handles policy enforcement This means you can run a sandbox in a deny-by-default posture and allowlist only the specific endpoints it needs (your API server, a package registry, etc.) - without setting up NSGs or firewall appliances. Managed Volumes: Persistent and Shared Storage Sandboxes support two types of mountable volumes, both managed by Microsoft: Volume Type Backed By Best For Managed Azure Blob Azure Blob Storage Shared data across sandboxes, file uploads/downloads, persistent artifacts Managed Data Disk Azure Disk Storage High-performance storage for databases, build caches, large working sets - only available to one sandbox at a time Blob volumes come with a built-in file explorer in the portal - you can browse, upload, download, create folders, and drag-and-drop files directly. Data Disk volumes provide dedicated block storage with configurable sizes. Secrets and Identity Secrets Sandbox Groups support key-value secrets scoped to the group. Secrets can be created, edited, and referenced by sandboxes within the group. These secrets can be used in egress policies to modify requests with transform or header-injection rules, without exposing the secrets to code running inside the sandbox. Managed Identity Sandbox Groups support both system-assigned and user-assigned managed identities, with full RBAC role assignment management. This means your sandboxes can authenticate to Azure services (Key Vault, Storage, Cosmos DB, etc.) without managing credentials - the same identity model you use everywhere else in Azure. MCP Connectors and Triggers ACA Sandboxes now supports managed connectors through the Model Context Protocol (MCP), giving sandboxes access to external APIs - including Microsoft 365, Salesforce, ServiceNow, GitHub, and 1,400+ other systems - without managing credentials directly. Attach a Connector Gateway to your sandbox group, and every sandbox in the group can call external APIs through a standardized MCP interface at runtime. Pair connectors with triggers to build event-driven automation: route an Outlook email to a sandbox that triages it with an AI agent, or react to a SharePoint file upload by extracting and processing the document all without writing glue code. Triggers can fire a shell command inside a sandbox or invoke an HTTP endpoint the sandbox exposes, so your automation shapes fit naturally around your workload. The integration is built on the new Connector Namespace service (az connector-namespace), the same runtime behind Logic Apps and Power Platform connectors, now available as a programmable layer for sandboxes. See the end-to-end samples for runnable azd up-deployable examples covering email triage and document automation scenarios. The Portal Experience Azure Container Apps Sandboxes are only available in the new Azure Container Apps portal that provides a rich, IDE-like experience for working with sandboxes. Creating a Sandbox The portal offers multiple creation paths: Standard Sandbox - full configuration control over source, resources, lifecycle, networking, and volumes GitHub Copilot Sandbox - preset, Copilot CLI ready to go, GitHub credentials can be wired through the Access Token before the sandbox is created Claude Sandbox - Claude CLI pre-installed, ready for agentic coding inside the sandbox Using Coding Agents (Copilot CLI / Claude Code) If you live inside Copilot CLI or Claude Code, you don't need to learn a new CLI. Install the azure-sandbox skill once and your agent picks up the right skills: # GitHub Copilot CLI # Add as a plugin marketplace /plugin marketplace add microsoft/azure-container-apps # Install all skills /plugin install sandboxes@Azure-Container-Apps # Claude Code claude plugin add microsoft/azure-container-apps The skill runs prerequisite checks silently (az --version, az account show, node --version, aca --version), prompts only if something's missing, and maps natural-language asks to the right aca commands. Bundled runbooks cover Copilot CLI BYOK (bring your own Azure OpenAI key), the deploy-a-web-app walkthrough, and shell setup. Sandbox Detail Page Once your sandbox is running, the detail page gives you immediate access to the sandbox terminal and additional details, such as - Network Audit - real-time egress traffic log showing allowed and denied requests Monitor - live CPU, memory, disk, and network utilization charts Connectors - attached connections with an "Add" action Volumes - mounted volumes with an "Add" action Log Stream - streaming container logs Processes - running process list inside the sandbox Files - file explorer to browse the sandbox filesystem The toolbar actions let you manage the state of the sandbox - Resume or Stop. In the Ellipsis menu (⁝) you can find additional settings to manage network Egress Policy and ingress (Add port), take a Snapshot of the sandbox, Commit (save disk state as a new disk image), set Lifecycle Policy or permanently Delete the sandbox. Finally, you can see additional Details in a side panel. Getting Started with the CLI and Python SDK All sandbox and sandbox-group operations go through the  aca  CLI. There are no az containerapp sandbox commands, - az is only used for az login, az account show, and resource-group management. Install (CLI) # Mac, Linux curl -fsSL https://aka.ms/aca-cli-install | sh # Windows irm https://aka.ms/aca-cli-install-ps | iex Run aca --help to get started. Install (Python SDK) pip install azure-containerapps-sandbox For more details, quick start and examples on ACA CLI and Python SDK, please go to https://sandboxes.azure.com Evolution from Dynamic Sessions If you've used Azure Container Apps Dynamic Sessions, Sandboxes are the next evolution of that capability. Everything Sessions can do, Sandboxes can do - and significantly more: Capability Dynamic Sessions Sandboxes Sub-second startup ✓ ✓ Strong isolation ✓ ✓ Custom container images ✓ ✓ Custom VNet integration ✓ (Partial) ✓ Suspend/resume with Memory and Disk snapshots - ✓ Lifecycle policies (auto-suspend, auto-delete) - ✓ Network egress policy (per-sandbox) - ✓ Persistent managed volumes (Blob, Data Disk) - ✓ Managed identity (system + user-assigned) - ✓ Secrets management - ✓ Configurable resource tiers - ✓ Direct access to sandbox in Portal experience - ✓ We will continue to support Dynamic Sessions, but all new investment goes into Sandboxes. If you're building new workloads on isolated ephemeral compute, start with Sandboxes. How It All Fits Together ACA Sandboxes is a platform primitive. It's the foundation on which multiple Microsoft products are already built - including ACA Express, Cloud sandboxes in GitHub Copilot, and Foundry Hosted Agents. When you build on Sandboxes, you're building on the same infrastructure that powers Microsoft's own portfolio. This is the evolution of what we shared with Project Legion in 2024. Legion described the internal infrastructure; Sandboxes exposes it as a customer-facing primitive that you can use directly. What's Next • Deeper Azure integrations - first-class connectivity with Azure networking, identity, storage, and AI services • Enhanced SDK and CLI - richer programmatic experiences for managing sandboxes at scale • More Microsoft services built on Sandboxes - this is just the beginning Get Started Today • Portal: https://sandboxes.azure.com/ • Documentation: Azure Container Apps Sandboxes • Pricing: Azure Container Apps Pricing (per-second vCPU/memory billing, scale-to-zero, snapshots at Blob Storage rates) We'd love to hear your feedback. You can ask questions, or file issues on the Azure Container Apps GitHub (prefix with [Sandbox] for Sandboxes-specific issues).
vyomnagrani
Jun 10, 2026 Place Apps on Azure Blog
5.5KViews
3likes
1Comment
Reference Architecture for a High Scale Moodle Environment on Azure
Introduction Moodle is an open-source learning platform that was developed in 1999 by Martin Dougiamas, a computer scientist and educator from Australia. Moodle stands for Modular Object-Oriented Dynamic Learning Environment, and it is written in PHP, a popular web programming language. Moodle aims to provide educators and learners with a flexible and customizable online environment for teaching and learning, where they can create and access courses, activities, resources, and assessments. Moodle also supports collaboration, communication, and feedback among users, as well as various plugins and integrations with other systems and tools. Moodle is widely used around the world by schools, universities, businesses, and other organizations, with over 100 million registered users and 250,000 registered sites as of 2020. Moodle is also supported by a large and active community of developers, educators, and users, who contribute to its development, documentation, translation, and support. [URL] is the official website of the Moodle project, where anyone can download the software, join the forums, access the documentation, participate in events, and find out more about Moodle. Goal The goal for this architecture is to have a Moodle environment that can handle 400k concurrent users and scale in and out its application resources according to usage. Using Azure managed services to minimize operational burden was a design premise because standard Moodle reference architectures are based on Virtual Machines that comes with a heavy operational cost. Challenges Being a monolith application, scaling Moodle in a modern cloud native environment is challenging. We choose to use Kubernetes as its computing provider due to the fact that it allow us to build a Moodle artifact in an immutable way that allows it to scale out and in when needed in a fast and automatic way and also recover from potential failures by simply recreating its Deployments without the need to maintain Virtual Machine resources, introducing the concept of pets vs cattle[1] to a scenario that at first glance wouldn't be feasible. Since Moodle is written in PHP it has no concept of database polling, creating a scenario where its underlying database is heavily impacted by new client requests, making it necessary to use an external database pooling solution that had to be custom tailored in order to handle the amount of connections for a heavy-traffic setup like this instead of using Azure Database for PostgreSQL's built-in pgbouncer. The same effect is also observed in its Redis implementation, where a custom Redis cluster had to be created, whereas using Azure Cache for Redis would incur prohibitive costs due to the way it is set up for a more general usage. 1 - https://learn.microsoft.com/en-us/dotnet/architecture/cloud-native/definition#the-cloud Architecture This architecture uses Azure managed (PaaS) components to minimize operational burden by using Azure Kubernetes Service to run Moodle, Azure Storage Account to host course content, Azure Database for PostgreSQL Flexible Server as its database and Azure Front Door to expose the application to the public as well as caching commonly used assets. The solution also leverages Azure Availability Zones to distribute its component across different zones in the region to optimize its availability. Provisioning the solution The provisioning has two parts: setting up the infrastructure and the application. The first part uses Terraform to deploy easily. The second part involves creating Moodle's database and configuring the application for optimal performance based on the templates, number of users, etc. and installing templates, courses, plugins etc. The following steps walk you through all tasks needed to have this job done. Clone the repository $ git clone https://github.com/Azure-Samples/moodle-high-scale Provision the infrastructure $ cd infra/ $ az login $ az group create --name moodle-high-scale --location <region> $ terraform init $ terraform plan -var moodle-environment=production $ terraform apply -var moodle-environment=production $ az aks get-credentials --name moodle-high-scale --resource-group moodle-high-scale Provision the Redis Cluster $ cd ../manifests/redis-cluster $ kubectl apply -f redis-configmap.yaml $ kubectl apply -f redis-cluster.yaml $ kubectl apply -f redis-service.yaml Wait for all the replicas to be running $ ./init.sh Type 'yes' when prompted. Deploy Moodle and its services Change image in moodle-service.yaml and also adjust the moodle data storage account name in the nfs-pv.yaml (see commented lines in the files) $ cd ../../images/moodle $ az acr build --registry moodlehighscale<suffix> -t moodle:v0.1 --file Dockerfile . $ cd ../../manifests $ kubectl apply -f pgbouncer-deployment.yaml $ kubectl apply -f nfs-pv.yaml $ kubectl apply -f nfs-pvc.yaml $ kubectl apply -f moodle-service.yaml $ kubectl -n moodle get svc –watch Provision the frontend configuration that will be used to expose Moodle and its assets publicly $ cd ../frontend $ terraform init $ terraform plan $ terraform apply Approve the private endpoint connection request from Frontdoor in moodle-svc-pls resource. Private Link Services > moodle-svc-pls > Private Endpoint Connections > Select the request from Front Door and click on Approve. Install database $ kubectl -n moodle exec -it deployment/moodle-deployment -- /bin/bash $ php /var/www/html/admin/cli/install_database.php --adminuser=admin_user --adminpass=admin_pass --agree-license Deploy Moodle Cron Change image in moodle-cron.yaml $ cd ../manifests $ kubectl apply -f moodle-cron.yaml Your Moodle installation is now ready to use! Conclusion You can create a Moodle environment that is scalable and reliable in minutes with a very simple approach, without having to deal with the hassle of operating its parts that normally comes with standard Moodle installations.
GuilhermeFrança
Jun 09, 2026 Place Apps on Azure Blog
1.9KViews
8likes
1Comment
Designing for High Availability: The Operational Reference for Running a Geo-Replicated ACR
By Johnson Shi, Zoey (Zhuyu) Li, Huangli Wu Introduction Three of the most common questions we hear from enterprise teams running geo-replicated Azure Container Registries (ACR) are: "How do I control which region serves my traffic?" — When my AKS clusters are spread across regions, can I pin each one to its co-located replica, or am I stuck with however the global endpoint routes? "What happens during a regional incident — is failover automatic or do I have to act?" — If the registry in one region degrades, does the global endpoint reroute on its own, or do I need to manually disable the affected replica? "What happens after the region recovers — does traffic return on its own?" — Is there a cooldown, a quarantine, or any manual step before failback? We answer those head-on, then go deeper on the operational details that come up when you actually run a geo-replicated registry: authentication across endpoint switches, throttling under load concentration, eventual-consistency failure modes, home region outage scope, webhooks, and private endpoint interaction. We draw on the official geo-replication docs, the global endpoint health-aware failover blog, the regional endpoints engineering design implementation, the regional endpoints public preview and private preview announcements, and the ACR reference for various registry endpoints, . This post also draws notes from the ACR product team on roadmap items that aren't yet documented elsewhere. Key Takeaways Health-aware failover is automatic. When the registry in a region degrades, the global endpoint reroutes away from it on the order of minutes, evaluated per-registry. No customer action required. Failback is automatic too. Once health-aware failover marks a region healthy again, the global endpoint resumes routing to it. There is no cooldown period. Health-aware failover applies only to global endpoint operations. It does not apply to regional endpoints (you're talking to one replica, period) or to dedicated data endpoints (the redirect is per-region). Health-aware failover is not triggered by throttling. It responds to regional ACR service health and Azure infrastructure health, not HTTP 429 responses. Use regional endpoints to manage per-replica throttling. Regional endpoints (Step 2a) give you explicit per-region URLs for workloads that need affinity, capacity planning, push/pull consistency, troubleshooting, or client-side failover. Use myregistry.<region>.geo.azurecr.io . Regional endpoints are available on Premium SKU registries. For workloads that don't need pinning, do nothing (Step 2b). The global endpoint plus health-aware failover handles routing automatically. Re-authenticate when switching endpoints. Each global or regional endpoint is its own authenticated surface; re-auth via az acr login , SDK auth, or the Kubernetes ACR credential provider on endpoint change. Don't run a long-lived DNS cache for the global endpoint. ACR purges DNS server-side on disable and during failover; a long-lived client cache works against that. For production workloads, enable dedicated data endpoints for security and DNS predictability on layer downloads. ACR is working on bounded staleness consistency for cross-replica eventual-consistency failure modes; see the FAQ. Background What is ACR geo-replication? Geo-replication is a Premium SKU feature that turns a single ACR registry into a multi-region, multi-write service. Every geo-replica in every region is writable — you can push, pull, and delete from any of them — and content syncs asynchronously between replicas under an eventual consistency model. Per-push replication time scales with the size and number of images being pushed. Similarly, when creating a new geo-replica, the time to populate the new geo-replica scales with the total size of the registry. A geo-replicated registry exposes a global endpoint at myregistry.azurecr.io . Behind that endpoint, ACR uses an internal traffic manager to direct each request to the replica with the best network performance profile for the caller — usually the closest replica, but not always. When clients are equidistant from multiple replicas, or when the closest replica is experiencing Azure infrastructure degradation, requests may be routed elsewhere. A geo-replicated registry also exposes a regional endpoint at myregistry.<region>.geo.azurecr.io , which allows clients to pin API requests to a specific geo-replica in lieu of global endpoints, which has Azure-managed routing among geo-replicas. Zone redundancy is always enabled for geo-replicas in regions where Azure has multiple availability zones — in those regions, ACR automatically spreads replica data across multiple availability zones within each region to protect against zonal outages. Endpoints and data endpoints: what goes where A common point of confusion: when you push or pull, not every request goes to the same place. The registry endpoints (global endpoint and regional endpoints), as well as the data endpoint, do different jobs. Your choice of data endpoint configuration has real consequences for security and resilience. Two kinds of traffic flow during a typical pull: Registry API traffic — authentication, manifest reads/writes, tag resolution, referrers, repository operations, blob location lookups, listing, metadata. This is everything except the actual layer (blob) bytes. All these API requests go to the global endpoint ( myregistry.azurecr.io ) or, if you've pinned your clients to call these APIs to a specific geo-replica, a geo-replica's regional endpoint ( myregistry.<region>.geo.azurecr.io ). Behind the scenes, the global endpoint internally proxies these requests to a specific geo-replica. Layer (blob) downloads — when the client asks for a blob, the registry doesn't serve the bytes itself. It returns an HTTP 307 redirect to a regional data endpoint (separate endpoint from the global endpoint or regional endpoints), and the client follows the redirect to download the layer from that region. Where that 307 sends you depends on whether you've enabled the registry's dedicated data endpoints feature: Configuration Layer downloads redirect to Default (no dedicated data endpoints) *.blob.core.windows.net (the underlying Azure storage account) Dedicated data endpoints enabled myregistry.<region>.data.azurecr.io for the region you were routed to Private endpoints enabled myregistry.<region>.data.azurecr.io for the region you were routed to Regional by design. Dedicated data endpoints always land you on a specific geo-replica's data endpoint — there is no "global data endpoint." With the global endpoint as your registry endpoint, the 307 redirect picks the data endpoint for whichever region the global endpoint chose to serve you. With a regional endpoint pinned to a specific region, the 307 always redirects you to that same region's data endpoint — never cross-region. Why dedicated data endpoints matter. Dedicated data endpoints are a Premium SKU feature that exists primarily to address security and firewall scoping. By default, layer downloads redirect to *.blob.core.windows.net — a wildcard storage FQDN. Firewall rules to allow that wildcard either let all Azure storage accounts through or none of them, which raises data exfiltration concerns and isn't tightly scoped to your registry. Dedicated data endpoints replace the wildcard with a fully qualified domain in your registry's own domain — myregistry.<region>.data.azurecr.io — so firewall rules can be scoped tightly to your specific registry, in your specific regions. That same design choice can also make layer downloads more predictable during routing changes. With dedicated data endpoints, the data endpoint FQDN is known ahead of time and lives in the registry's domain — one predictable hostname per region, configured once. Without them, the layer download has to resolve a wildcard storage FQDN that points to whichever storage account the registry happens to have provisioned, which is a separate DNS resolution path with its own routing behavior and its own caching profile. Dedicated data endpoints simplify the DNS picture by aligning the data path with the registry path and keeping the entire pull experience inside one set of predictable, scoped FQDNs. For any geo-replicated registry where security and high availability matter, enable dedicated data endpoints. Note: Health-aware failover applies only to operations against the global endpoint, not to regional endpoints or dedicated data endpoints. Take note that health-aware failover only kicks in and directs traffic away from a geo-replica when an Azure region is experiencing significant infrastructure degradation. At this stage, it does not kick in to redirect traffic to another geo-replica if a client's data plane API requests are throttled. See the relevant section below for the full scope when health-aware auto failover kicks in or not. The three traffic control tools ACR geo-replication gives you three complementary tools for controlling where traffic lands. Each one solves a different class of problem, and customers most often run into trouble when they reach for the wrong one. We name them up front and use these names throughout the post: Tool Who controls it What it does Use cases Health-aware failover Platform (automatic) Reroutes the global endpoint away from a region whose registry can't reliably serve requests Regional incidents, automatic recovery Replica enable/disable for global routing Customer (manual) Excludes a specific replica from global endpoint routing without deleting it; data continues syncing DR rehearsals, planned maintenance, quarantining a replica without losing it Regional endpoints Customer (per request) Dedicated per-region URLs ( myregistry.<region>.geo.azurecr.io ) that bypass the internal traffic manager entirely Pinning AKS clusters to co-located replicas, push/pull consistency, capacity planning, troubleshooting, client-side failover Health-aware failover and replica enable/disable both act on the global endpoint. Regional endpoints are a separate URL surface that coexists with the global endpoint — enabling them does not disable the global endpoint myregistry.azurecr.io . You can use both simultaneously and choose per workload. The behavior in question When the registry in one region experiences a real degradation, there are three possible answers to "what happens?": (A) Nothing automatic. The customer must manually disable the affected region's endpoint to stop traffic from being routed there. (B) The system detects the regional front-door failure and reroutes within seconds. (C) A per-registry health evaluation detects the degradation and reroutes the global endpoint within minutes, with no customer action. After the region recovers, routing resumes automatically. The answer today is (C). Before health-aware failover, customers were stuck closer to (A) — the system could see whether the regional reverse proxy responded, but not whether the registry could actually serve real pull and push traffic end to end. Health-aware failover closes that gap. We walk through all three tools in the next section, in order: setting up geo-replication, using regional endpoints to pin specific workloads, keeping the global endpoint for everything else, the manual replica disable mechanism, re-enabling participation in global routing, and what to expect when health-aware failover triggers. Walkthrough The following steps assume an existing Premium SKU registry and the Azure CLI logged in. We use myregistry as the registry name, myrg as the resource group, and eastus as the home region. Substitute <your-registry> , <your-rg> , and <your-region> for your environment. Prerequisites A Premium SKU ACR registry (geo-replication requires Premium) Azure CLI ( az ) installed and logged in For regional endpoints (Step 2a): Azure CLI 2.86.0 or later. All regional endpoints commands ( --regional-endpoints , az acr show-endpoints , az acr login --endpoint ) are available natively in Azure CLI 2.86.0+. If you previously installed the acrregionalendpoint private preview CLI extension, uninstall it with az extension remove --name acrregionalendpoint to prevent conflicts with the built-in CLI commands. Step 1: Add a West US replica to a registry that lives in East US Geo-replication requires the Premium SKU. The create call below fails on Basic or Standard. # Confirm the registry is Premium az acr show --name myregistry --resource-group myrg \ --query sku.name --output tsv # Premium # Create a West US geo-replica az acr replication create --registry myregistry --location westus # Confirm both replicas are present az acr replication list --registry myregistry --output table NAME LOCATION PROVISIONING STATE STATUS REGION ENDPOINT ENABLED ------ ---------- -------------------- -------- ----------------------- eastus eastus Succeeded online True westus westus Succeeded online True Pushes and pulls continue working through the existing replica throughout initial sync. Because the registry is multi-region, multi-write, the existing replica keeps serving traffic while the new replica catches up in the background. Initial replica seeding time is a function of registry size — the total number and cumulative size of images already in the registry that need to be replicated to the new replica — not the size of any single image. Step 2a: Pin workloads to specific regions using regional endpoints Use regional endpoints when a workload needs explicit per-region control. The five common cases: Regional affinity — an AKS cluster in East US should pull from the East US replica, every time, without ever hopping to a more distant replica because of a network performance fluctuation. Predictable routing — workloads that need to know exactly which replica will serve them, for benchmarking, capacity planning, or in-region traffic SLAs. Push/pull consistency — pinning both ends of a publish-then-deploy flow to the same replica eliminates eventual-consistency races. Troubleshooting — reproducing an issue on a specific replica requires sending traffic to that specific replica. Client-side failover — customers with their own health checks and business rules want to implement failover on their own terms, on signals only they can see. Enable regional endpoints on the registry: az acr update -n myregistry -g myrg --regional-endpoints enabled When enabled, ACR automatically creates per-region login server URLs for every existing geo-replica. No per-region configuration is needed. Note: Regional endpoints can be enabled on any Premium SKU registry, even without geo-replication. A registry without geo-replication has a single geo-replica in the home region, which gets one regional endpoint URL. However, the feature is most useful when your registry has at least two geo-replicas, where you can pin different workloads to different replicas for routing control and capacity distribution. Push to a specific region using its regional endpoint: # Log in to the West US regional endpoint az acr login --name myregistry --endpoint westus # Tag and push using the regional endpoint URL docker tag myapp:v1 myregistry.westus.geo.azurecr.io/myapp:v1 docker push myregistry.westus.geo.azurecr.io/myapp:v1 Pin AKS deployments to their co-located replica by using regional endpoint URLs in the deployment manifest. The example below shows two clusters in different regions; each cluster references the regional endpoint for its own region's replica (assuming replicas exist in both eastus and westeurope ): # East US-based AKS cluster pulls from the East US replica apiVersion: apps/v1 kind: Deployment metadata: name: myapp-eastus spec: template: spec: containers: - name: myapp image: myregistry.eastus.geo.azurecr.io/myapp:v1 --- # West Europe-based AKS cluster pulls from the West Europe replica apiVersion: apps/v1 kind: Deployment metadata: name: myapp-westeurope spec: template: spec: containers: - name: myapp image: myregistry.westeurope.geo.azurecr.io/myapp:v1 This eliminates cross-region pulls when global routing would otherwise prefer a different replica for a given client, and it gives you a per-region traffic profile you can plan capacity against. Regional endpoint operational tips View all endpoints. Use az acr show-endpoints to see all endpoint URLs for your registry — global, regional (if enabled), and dedicated data endpoints (if enabled): az acr show-endpoints --name myregistry --resource-group myrg Import from a specific geo-replica. When importing images between registries, you can use a regional endpoint to import from a specific geo-replica of the source registry. This is useful when you want predictable network paths or need to import from a replica in a specific region: az acr import \ --name mydownstreamregistry \ --source myupstreamregistry.westeurope.geo.azurecr.io/myapp:v1 \ --image myapp:v1 Firewall rules for regional endpoints. If you use firewall rules, allow access to the following endpoints for each geo-replica that clients connect to: Endpoint Purpose myregistry.<region>.geo.azurecr.io Regional endpoint for registry operations myregistry.azurecr.io Global endpoint (if also used) myregistry.<region>.data.azurecr.io Layer downloads (if using private endpoints or dedicated data endpoints) *.blob.core.windows.net Layer downloads (if not using private endpoints or dedicated data endpoints) For the full list of endpoint types and FQDN patterns, see the ACR reference for various registry endpoints. DNS-based routing without changing manifests. If you don't want to maintain different deployment manifests per region, you can keep all manifests pointing to the global endpoint ( myregistry.azurecr.io ) and use software-defined networking or a regional traffic manager to resolve the global endpoint to the appropriate regional endpoint based on the originating region's traffic. This achieves the same co-location goals as regional endpoints — predictable routing and reduced latency — without embedding region-specific URLs in your deployment manifests. Step 2b: Keep using the global endpoint for everything else For workloads that don't need explicit pinning, do nothing. The global endpoint at myregistry.azurecr.io continues to work exactly as before, and the global endpoint plus health-aware failover gives you intelligent routing across replicas without configuration. ACR picks the best replica for each client based on network performance and reroutes during regional incidents. Regional endpoints coexist with the global endpoint — enabling them does not disable myregistry.azurecr.io . You can use both simultaneously and choose per workload, mixing pinned workloads (Step 2a) with workloads that ride the global endpoint (Step 2b) in the same registry. Step 3: Take a replica out of global endpoint routing Use this when you need to keep a replica alive but stop it from serving global-endpoint traffic — for DR rehearsals, planned maintenance, or troubleshooting an isolated replica. # Exclude the West US replica from global endpoint routing az acr replication update --registry myregistry --name westus \ --global-endpoint-routing false Confirm the change: az acr replication list --registry myregistry --output table NAME LOCATION PROVISIONING STATE STATUS REGION ENDPOINT ENABLED ------ ---------- -------------------- -------- ----------------------- eastus eastus Succeeded online True westus westus Succeeded online False Requests to myregistry.azurecr.io no longer route to West US. The replica still receives replicated content — and continues to replicate its own content out to other replicas — and storage quota and per-replica costs continue to accrue. If regional endpoints are enabled, the West US regional endpoint URL also continues to work; --global-endpoint-routing controls only the replica's participation in global endpoint routing. A note on naming. The CLI flag --global-endpoint-routing (on az acr replication update ) and the regional endpoints feature (enabled via az acr update --regional-endpoints enabled ) are two different things despite the similar names. --global-endpoint-routing controls whether a replica participates in global endpoint routing. The regional endpoints feature creates per-region URLs ( myregistry.<region>.geo.azurecr.io ) that bypass the global endpoint entirely. They are independent controls. In Azure CLI 2.86.0 and later, the old --region-endpoint-enabled flag has been renamed to --global-endpoint-routing . The old flag name is deprecated and will be removed in Azure CLI 2.87.0 (June 2026). If you have existing scripts or automation that use --region-endpoint-enabled , update them to use --global-endpoint-routing . CLI flags quick reference: Flag Scope Purpose --regional-endpoints Registry-level ( az acr create or az acr update ) Enables dedicated regional endpoint URLs ( myregistry.<region>.geo.azurecr.io ) for all geo-replicas. --global-endpoint-routing Per-geo-replica ( az acr replication create or az acr replication update ) Controls whether the global endpoint routes traffic to a specific geo-replica. Set to false to temporarily exclude a geo-replica from global routing. --data-endpoint-enabled Registry-level ( az acr create or az acr update ) Enables dedicated data endpoints ( myregistry.<region>.data.azurecr.io ) for layer blob downloads. Auto-enabled when at least one private endpoint is configured. This bidirectional sync during disable is intentional. When you re-enable the replica, every image pushed to the registry while the replica was disabled — from any region — is already present, so the replica can serve traffic immediately with no catch-up window. If we stopped syncing on disable, re-enabling would leave the replica with stale data and force a long catch-up before it could safely serve pulls. Step 4: Re-enable the replica to participate in global endpoint routing Re-enable the replica: az acr replication update --registry myregistry --name westus \ --global-endpoint-routing true NAME LOCATION PROVISIONING STATE STATUS REGION ENDPOINT ENABLED ------ ---------- -------------------- -------- ----------------------- eastus eastus Succeeded online True westus westus Succeeded online True There is no cooldown. The global endpoint resumes routing requests to the West US replica as soon as the change takes effect on ACR's side. Because data continued syncing while the replica was disabled (Step 3), the replica is immediately ready to serve pulls — no catch-up window. Note on DNS during disable/enable. When you take a replica out of global routing, ACR purges its own DNS records for that replica from the global endpoint on a fast path — there is no waiting on a published TTL on ACR's side. If clients run their own DNS cache for the global endpoint, however, those clients will keep resolving to the disabled replica until the client cache expires. We can't control client-side caches. The recommendation: do not run a long-lived DNS cache for the global endpoint. A short-lived DNS pin for the duration of a single push (covered in the DNS and Client-Side Considerations section) is fine and even helpful — but a long-lived DNS cache will make --global-endpoint-routing false look broken from the client's perspective. Step 5: What to expect when health-aware failover triggers Health-aware failover is automatic. ACR evaluates registry health on a per-registry basis, and when a registry in a region can't reliably serve requests, the global endpoint reroutes that registry's traffic to a healthy replica. There is no customer-invocable trigger — that's the point. End-to-end timing is on the order of minutes — fast enough to catch real regional degradation, slow enough to ride out transient errors that resolve on their own. DNS TTL may add additional propagation delay before all clients switch to the new region. Scope of health-aware failover. Health-aware failover applies only to operations against the global endpoint — the registry API calls (auth, get manifest, get tag, get referrers, get blob location). It evaluates health when those API calls come in; it does not trigger mid-operation. Two important consequences: Regional endpoints are not in scope. When you talk to a regional endpoint like myregistry.westus.geo.azurecr.io , you're talking to that one replica. There is no automatic reroute. If you've pinned a workload to a regional endpoint and that region degrades, you implement client-side failover by switching the workload to a different regional endpoint. Dedicated data endpoints are not in scope. Once a registry endpoint has redirected you to a dedicated data endpoint, you stay on that region's data endpoint for the duration of the layer download. There is no automatic reroute of an in-flight blob download. The region targeted by the redirect is decided up front by whichever registry endpoint served the blob-location call: the global endpoint chooses based on its per-registry health evaluation, and a regional endpoint always targets its own region. The signals you can use to confirm a failover is in progress: # Check replication status az acr replication list --registry myregistry --output table You can also check Resource Health for the registry in the Azure portal — navigate to your registry and select Resource health under the Help section to see platform-side degradation signals. You'll typically see: Increased pull latency as traffic shifts to a more distant replica Resource Health flagging known issues in the affected region Replication status indicating which replicas are online After the region recovers, the per-registry health evaluation marks it healthy again and the global endpoint resumes routing — automatic, no cooldown, no customer action. Note that health is evaluated per registry, not per region: if a degradation affects only a subset of registries in a region, only those registries are rerouted, and other registries in the same region continue to be served locally with no unnecessary latency penalty. Not triggered by throttling. Health-aware failover is DNS-based and responds to regional ACR service health and Azure infrastructure health. It does not reroute traffic based on HTTP 429 (throttling) responses. If a geo-replica is throttling your requests but the region's infrastructure is healthy, the global endpoint continues routing you to that geo-replica. To manage throttling, use regional endpoints to spread workloads across multiple geo-replicas for better capacity distribution. Note on long-running pushes during a failover. A multi-layer push that spans a failover boundary can land layers and the manifest on different replicas — exactly the failure mode that DNS bouncing produces during a single push. ACR is actively tightening health-aware failover behavior to minimize cross-replica scatter during these scenarios, and the recommendation today remains: pin pushes to a single replica via a regional endpoint when push/pull consistency matters. Common Questions Q1. Performance impact during initial replica creation on a live registry Because ACR is multi-region, multi-write, the existing replica continues serving pull and push traffic throughout the period when a new replica is being seeded. Replication is asynchronous and content propagates in the background; the time to populate a new geo-replica scales with the size of the registry — the cumulative number and total size of images already in the registry — not with any single image. The docs do not publish a quantified degradation percentage or a throttling window for this period, and they do not promise zero performance impact — the safe operating assumption for a live production registry is that existing replicas continue serving traffic normally, with the new replica catching up in the background. Q2. Restricted/updating state during initial sync There is no "restricted" state for the registry during normal replica creation. Writes, control-plane operations, and pushes/pulls against existing replicas continue normally. The only time configuration changes are unavailable is during a home region outage — see the relevant FAQ item later on for the full data-plane-versus-control-plane breakdown. Q3. Cooldown periods and non-straightforward failback scenarios There is no cooldown before failback, manual or automatic. Re-enabling a replica's participation in global endpoint routing takes effect immediately on ACR's side. Health-aware failover returns traffic to a region as soon as its per-registry health evaluation passes again. The failback case that is not seamless: if a recently pushed image has not yet replicated to the failover region, a pull from that region may not find the image until replication catches up. This is a function of eventual consistency, not failback timing — and it's part of a broader class of issues we cover in Q4. Q4. Common pull and push failure modes during the eventual-consistency window DNS bouncing during a single push is one well-known problem, but it isn't the only one. The eventual-consistency window between geo-replicas surfaces in several recurring failure modes worth knowing about: Push-then-immediate-pull-cross-region. Pushing myapp:v1 to one region and immediately pulling it from a different region can fail with manifest unknown until replication catches up. This shows up most painfully in CI/CD pipelines where one CI runner pushes an image and thousands of pods across other regions all try to pull from their local geo-replicas at the same time. Today, customers work around this with indeterminate sleeps before scheduling expensive compute, or with retry logic, or by waiting on a replication-complete signal — none of which is a clean planning story. Tag overwrite races. Pushing myapp:v1 , then re-pushing myapp:v1 shortly after with a fix (same tag, different digest), can leave different replicas resolving the same tag to different digests during the eventual-consistency window. Delete propagation. Deleting a tag or repository in one region takes some time to propagate to other replicas. Pulls from regions where the delete hasn't yet propagated can return the supposedly-deleted content. Mid-push failover scatter. A multi-layer push that spans a health-aware failover boundary or a DNS bouncing event can land layers on one replica and the manifest on another, surfacing as manifest validation errors or blob unknown on subsequent pulls. What ACR is doing about this. We're working on bounded staleness consistency for pushed images across all geo-replicas worldwide, which addresses these four failure modes directly. This will be covered in an upcoming blog post. If you're hitting eventual-consistency brittleness today and want to talk through your scenario, reach out to us on the Azure Container Registry GitHub repository — we want the customer signal to land in the design. Mitigations available today: Pin pushes to a single replica via a regional endpoint. Every sub-request in the push — login, blob uploads, manifest upload — goes to the same replica, eliminating the DNS bouncing and mid-push scatter classes entirely. Use a short-lived client-side DNS cache like dnsmasq scoped to the duration of a single push, only when you're not using regional endpoints. Do not run a long-lived DNS cache for the global endpoint — it interferes with --global-endpoint-routing false and with health-aware failover routing. Build retry logic into pulls that immediately follow a cross-region push. Either retry with backoff or check replication status with ACR webhooks before pulling. ACR can detect and notify you when an image or tag is available for pull in a geo-replica (say geo-replica B), after it has been pushed to another geo-replica (geo-replica A) and background replication has succeeded to geo-replica B. Design publish steps to be idempotent so retries triggered by mid-push failover are safe. Q5. Auth behavior across endpoint switches For safety, treat each global endpoint and each regional endpoint as its own authenticated surface. All registry APIs except the actual blob downloads (auth, manifests, tag resolution, referrers) flow through whichever endpoint you've chosen. If you switch from the global endpoint to a regional endpoint, or from one regional endpoint to another, re-authenticate. That means az acr login , fresh SDK auth, or — for AKS — letting the Kubernetes ACR credential provider handle re-auth, which it does automatically when the endpoint changes. Q6. Throttling under failover and pinning Throttling limits on registry API operations are per-replica, not per-registry. This has two operational implications: During health-aware failover, traffic that was spread across replicas can shift heavily onto whichever replicas remain in the global endpoint's routing pool. Capacity plan to spread traffic across two or three healthy replicas during a failover scenario rather than concentrating onto one — the global endpoint's routing already does this for you when multiple healthy replicas exist, but registries with only two regions configured can hit per-replica limits more easily during a failover. To mitigate, use regional endpoints to spread workloads across multiple geo-replicas and plan per-replica capacity. When pinning via regional endpoints (Step 2a), you concentrate traffic on whichever replica you've pinned to. If you've pinned all your AKS clusters to a single regional endpoint, you may hit that replica's per-region throttling limits at peak. Mitigations: pin different workloads to different regional endpoints across multiple regions for better topology mapping and capacity distribution, or use the global endpoint (Step 2b) for workloads where you don't need explicit pinning so ACR's routing can spread load. We're also working on improving the throttling metrics surfaced during health-aware failover events. Note: Health-aware failover does not reroute traffic based on HTTP 429 (throttling). If you're experiencing throttling but the region's infrastructure is healthy, the global endpoint continues routing you there. Use regional endpoints to explicitly spread load across replicas for capacity planning. Q7. Home region outage scope Geo-replication provides high availability for the data plane. During a home region outage, the control plane is unavailable, which means you can't create or delete replicas, modify network rules, or change replication settings until the home region recovers. ACR Tasks are also bound to the home region and don't run while it's unavailable. The data plane keeps working: Global endpoint continues routing pulls and pushes to healthy replicas. Regional endpoints continue working — you talk directly to specific replicas, and your client-side logic decides which region to use. Authentication, manifests, blob downloads, webhooks continue functioning through any healthy replica. The home region of a registry is fixed at creation and cannot be changed afterward. Microsoft's registry relocation guidance describes a redeployment procedure — creating a new registry in a different region — not an in-place change to an existing registry's home region. Note: If your registry uses a customer-managed key, review the key vault failover and redundancy guidance for maximum resilience. Key vault availability directly affects the registry's ability to encrypt and decrypt data. Q8. Webhooks during failover Webhooks fire from the replica that received the push. Because ACR also replicates content to other geo-replicas, webhooks fire from each geo-replica as the image syncs to it — so a single push results in webhook events from the receiving replica plus an event from each replica as replication completes. During a failover where pushes are routed to a different region, webhooks from those pushes fire from the new region; once the original region recovers and replication catches up, webhook events fire from there too. Webhook consumers should be designed to handle multiple events per pushed image and deduplicate as needed. Q9. Private endpoints with regional endpoints and dedicated data endpoints When a private endpoint is created against a registry, the private endpoint covers all of the registry's endpoint surfaces — the global endpoint, every regional endpoint (if regional endpoints are enabled), and every regional dedicated data endpoint. A single private endpoint in one VNet can reach the global endpoint (which routes you to a suitable replica), any regional endpoint in the same or a different region, and any region's dedicated data endpoint for blob downloads. The trade-off is private IP allocation: each endpoint surface consumes IPs in the VNet. With many replicas plus regional endpoints plus dedicated data endpoints all enabled, private endpoint creation can fail if the VNet runs out of available private IPs. IP address consumption per feature: Configuration IPs consumed per VNet Initial private endpoint (global endpoint + home region dedicated data endpoint) 2 Each geo-replication region added +1 (regional dedicated data endpoint) Regional endpoints enabled +1 per geo-replica Example: A registry with 3 geo-replicas and regional endpoints enabled consumes 7 private IPs per VNet: 1 (global) + 3 (data) + 3 (regional). Without regional endpoints, the same registry requires 4 private IPs: 1 (global) + 3 (data). Subnet sizing: Use at minimum a /27 (32 addresses) subnet for PE subnets on geo-replicated registries, and /24 where possible. To check how many private IPs are already consumed on a subnet: az network vnet subnet show \ --name <subnet-name> \ --vnet-name <vnet-name> \ --resource-group <resource-group> \ --query "{addressPrefix:addressPrefix, usedIPs:length(ipConfigurations || \`[]\`)}" \ --output table See the ACR private endpoints documentation for the full IP-allocation math and sizing guidance. Q10. Geo-replica creation stuck for private endpoint-enabled registries When creating a geo-replica for a registry that has private endpoints configured, the replica provisioning can get stuck in a Creating state if the identity performing the operation doesn't have sufficient permissions to create private endpoint networking resources. Solution: Manually delete the geo-replica that got stuck in the provisioning state. Ensure the identity has the permission Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write before creating the geo-replica again. Also verify that every PE subnet connected to the registry has free IP capacity — if any PE subnet across any connected VNet does not have enough free IPs, the replication provisioning fails and rolls back. The replica appears briefly in a Creating state and then is removed. The resulting error does not identify which subnet or VNet is exhausted. Q11. Metrics, logs, and alerts for the three phases We map each phase to the signals available in the Monitoring Guidance section below. The headline: Resource Health (in the Azure portal) and az acr replication list give you the platform-side signals; Azure Monitor platform metrics are collected automatically, and resource logs require Diagnostic Settings to be enabled on the customer side. Behavior summary Scenario Automatic? Customer Action Required Notes Registry in a region degrades Yes None Health-aware failover; per-registry; minutes-scale; global endpoint operations only Region recovers after a degradation event Yes None No cooldown Pin AKS clusters to co-located replicas No Use regional endpoint URLs in deployment manifests (Step 2a) Coexists with global endpoint No pinning needed for most workloads Yes None — keep using myregistry.azurecr.io (Step 2b) Global endpoint plus health-aware failover Push/pull from the same replica (consistency) No Use a regional endpoint for both push and pull Eliminates DNS bouncing and mid-push scatter Capacity planning per region No Spread workloads across multiple regional endpoints Per-replica throttling; avoid concentrating on one replica DR rehearsal: take a replica out of global routing No az acr replication update --global-endpoint-routing false Data continues syncing both directions; costs continue accruing Re-enable replica participation in global routing No az acr replication update --global-endpoint-routing true No cooldown; replica is immediately ready Switch a workload between endpoints No Re-auth ( az acr login , SDK auth, or Kubernetes ACR credential provider) Each endpoint is its own authenticated surface Initial replica seeding on a live registry N/A None Existing replica continues serving traffic; seeding time scales with registry size Long-running push during a failover No Retry; design publishes to be idempotent Pin via regional endpoint to avoid mid-push scatter; ACR is tightening this behavior Pull of a recently pushed image from a different region No Wait for replication, retry with backoff, or check replication status Eventual consistency; bounded staleness consistency in development Home region outage Data plane: yes; control plane: no Use global or regional endpoints for data plane operations Control plane (replica config, network rules) requires home region DNS and Client-Side Considerations DNS bouncing during a single push is the most common geo-replication push problem in customer threads, and it warrants a section of its own. The failure mode. A docker push is a sequence of HTTP requests: blob uploads for each layer, then a manifest upload that references those layers by digest. If the Linux DNS resolver on the client doesn't cache myregistry.azurecr.io consistently for the duration of the push, individual sub-requests can resolve to different replicas. Because replication is eventually consistent, the manifest can land on a replica that doesn't yet have the layers it references, and the manifest validation fails. The two mitigations: Regional endpoints pin the push to a single replica end-to-end. Every sub-request — login, blob uploads, manifest upload — goes to the same replica. This is the cleanest fix and the one we recommend for any pipeline where push/pull consistency matters. A short-lived client-side DNS cache like dnsmasq scoped to the duration of a single push. For Linux VMs in Azure, follow the DNS name resolution options guidance. The pin should last the push and no longer. For other clients performing pushes, you can customize your stack's DNS resolver to have a similar short-lived DNS cache to pin the global endpoint's resolved DNS for only the duration of an image push operation. A note on long-lived DNS caching for the global endpoint. Don't run a long-lived DNS cache for myregistry.azurecr.io . ACR purges its own DNS records on the server side when a replica is taken out of global routing (Step 3) and during health-aware failover; a long-lived client-side cache will keep clients pointed at the old region after our purge, which makes both the manual disable mechanism and health-aware failover look broken from the client's perspective. Retry behavior: In-flight pushes during a failover may fail. Design publish steps to be idempotent so retries are safe. Pipelines that push in one region and immediately pull from a different region should retry with backoff or check replication status — eventual consistency means the pull may race ahead of replication. ACR is working on bounded staleness consistency that addresses this directly by enabling proxying (on ACR infrastructure) an image pull request from one geo-replica (if it does not have the image) to another geo-replica that has the image; see the relevant FAQ item. Note: Specific retry counts, back-off intervals, and push timeout values are application-layer decisions. The platform behavior is documented; the retry policy belongs to your client. Monitoring Guidance We map the three phases to the signals available from each source. Where a signal requires customer-side configuration, we flag it. Phase A: Initial replication (after creating a new replica) az acr replication list and az acr replication show — confirm the new replica reaches provisioningState: Succeeded and status: online , and view per-replica status. Azure Monitor platform metrics — push count, pull count, and other registry metrics are collected automatically and visible in the Azure portal under Metrics. No customer configuration is needed to view platform metrics. To export metrics or enable resource logs (detailed operation logs), configure Diagnostic Settings on the registry. Phase B: Failover (planned via replica disable, or automatic via health-aware failover) Per-replica regionEndpointEnabled state via az acr replication list — confirms whether a manual disable took effect, i.e. which replicas are currently eligible for global endpoint routing. Note: this flag reflects the manual configuration for configuring a geo-replica's global endpoint routing eligibility; it does not indicate whether health-aware failover has actively rerouted traffic away from a replica. Resource Health for the registry (in the Azure portal under Help > Resource health) — surfaces platform-side degradation signals during incidents. ACR does not yet expose a definitive "this region is currently serving your traffic" signal; Resource Health and client-side latency changes are the best available indicators. Pull latency from clients — increased latency from a more distant replica is the client-observable signal that traffic has rerouted. Azure Monitor platform metrics — visible per-region in the Azure portal Metrics blade. To export metrics or query them programmatically, enable Diagnostic Settings. Phase C: Failback (replica returns to global routing) az acr replication list — confirms regionEndpointEnabled: True (manual) or online status across all replicas (automatic). Pull latency normalizing as clients reach the recovered replica again. Resource Health clearing for the registry (visible in the Azure portal). Note: The health-aware failover blog calls out ongoing work to surface richer signals — including notifications for when routing changes and which region is currently serving your traffic. The signals listed above are what's available today. Pricing Considerations Storage billing vs. storage quota: Storage is billed per geo-replica — a 1 GiB image replicated to 5 geo-replicas is charged as 5 GiB of storage (1 GiB × 5 geo-replicas). However, storage quota (the tier's maximum storage limit) counts the image only once — the same 1 GiB image counts as 1 GiB toward your tier's maximum, not 5 GiB. Data transfer: Geo-replication can reduce costs by enabling in-region image pushes and pulls, which avoids cross-region data transfer charges during these push or pull operations. However, cross-region data transfer charges still apply when ACR replicates pushed content to other geo-replicas as part of eventual consistency. Disabled replicas still cost: When you take a replica out of global routing with --global-endpoint-routing false , storage and per-replica costs continue accruing because data continues syncing bidirectionally. For more information, see ACR pricing. Cleanup Run these commands to undo the walkthrough setup. Order matters: disable regional endpoints before deleting replicas, since regional endpoint URLs depend on which replicas exist. # Disable regional endpoints if you enabled them in Step 2a az acr update -n myregistry -g myrg --regional-endpoints disabled # Re-enable any replicas you disabled in Step 3 (no-op if already enabled) az acr replication update --registry myregistry --name westus \ --global-endpoint-routing true # Delete the West US replica created in Step 1 az acr replication delete --registry myregistry --name westus # Confirm only the home region replica remains az acr replication list --registry myregistry --output table Note: Replica deletion is a control-plane operation that requires the home region to be available. During a home region outage, replica configuration cannot be modified. Summary Table Question Answer When should I use regional endpoints vs the global endpoint? Use regional endpoints (Step 2a) for workloads that need affinity, predictable routing, push/pull consistency, troubleshooting, or client-side failover. Use the global endpoint (Step 2b) for everything else and let health-aware failover handle routing. What should I enable for secure, resilient layer downloads? Enable dedicated data endpoints. They scope firewall rules tightly to your registry and replace wildcard storage DNS with predictable per-region FQDNs. How do I avoid DNS-bouncing manifest validation failures on push? Pin pushes to a single replica via a regional endpoint. A short-lived client-side dnsmasq for the push duration is also fine if you're not using regional endpoints. Should I run a long-lived DNS cache for the global endpoint? No. ACR purges DNS server-side on disable and during failover; client-side caching works against that. Do I need to re-auth when switching endpoints? Yes. Each global or regional endpoint is its own authenticated surface. az acr login , SDK auth, or the Kubernetes ACR credential provider handles the re-auth. What happens during a home region outage? Data plane keeps working through any replica via the global endpoint or regional endpoints. Control plane operations (replica configuration, network rules) are unavailable until the home region recovers. The home region is fixed at registry creation. What's ACR doing about eventual-consistency pain? Bounded staleness consistency for cross-replica pushed images is in development and will be covered in an upcoming blog post. Reach out via GitHub if you want to share your scenario. For the full automation matrix — what's automatic, what requires customer action, and what to expect for each scenario — see the behavior summary above. If you have further questions about ACR geo-replication routing, pinning, capacity planning, eventual consistency, or failover behavior, reach out to us on the Azure Container Registry GitHub repository or file feedback through the Azure portal.
johnsonshi_msft
Jun 08, 2026 Place Apps on Azure Blog
260Views
0likes
0Comments
Regional Endpoints for Azure Container Registry Geo-Replication — Now in Public Preview
By Johnson Shi, Zoey (Zhuyu) Li, Huangli Wu What's new Regional endpoints for geo-replicated Azure Container Registries are now in public preview. See the feature's official MS Learn documentation. If you've been following since the private preview announcement, here's what changed: No feature flag registration. No subscription enrollment so all Azure subscriptions and customers can now use this feature. No CLI extension. Regional endpoints commands are built into Azure CLI 2.86.0+ natively. If you installed the private preview acrregionalendpoint extension, uninstall it to avoid conflicts. Native CLI and portal support. With Azure CLI 2.86.0+, enable regional endpoints for all geo-replicas of a registry with az acr create --regional-endpoints enabled or az acr update --regional-endpoints enabled . The Azure portal also supports configuring regional endpoints natively. CLI flag rename for configuring a geo-replica's global endpoint routing (an existing separate feature). The existing flag --region-endpoint-enabled (on az acr replication create/update ) has been renamed to --global-endpoint-routing . Key clarifications: "--global-endpoint-routing" (formerly "--region-endpoint-enabled" on "az acr replication create / az acr replication update") — controls whether a specific geo-replica participates in global endpoint routing. This is an existing feature that is different from the new registry-level "--regional-endpoints" feature being discussed in this post. "--regional-endpoints" (on az "acr create / az acr update") — enables or disables the regional endpoints feature at the registry level for all geo-replicas. This is the feature discussed in this post. See the endpoint reference for the full breakdown of the various registry endpoints (global endpoints, regional endpoints, and data endpoints). Regional endpoints are available on Premium SKU registries in all Azure public cloud regions. What are regional endpoints? Regional endpoints give you dedicated, per-region login server URLs for each geo-replica with the following URL pattern: myregistry.eastus.geo.azurecr.io myregistry.westeurope.geo.azurecr.io Regional endpoints coexist with the registry's global endpoint ( myregistry.azurecr.io ) — enabling regional endpoints doesn't disable a registry's global endpoint that is backed by Azure-managed routing. You can choose per workload: You can use the global endpoint with automatic Azure-managed routing with health-aware failover, where Azure will route your requests to the geo-replica with the best network performance profile to the client. You can use a regional endpoint when you need explicit control or routing to a specific geo-replica. Other resources: For the full background on why regional endpoints exist and the problems they solve, see the private preview blog post. For the complete operational deep dive — health-aware failover, throttling considerations, storage quota and pricing, eventual consistency, home region outage behavior, DNS propagation, private endpoint interaction, capacity planning, and monitoring guidance — see How ACR geo-replication handles failover, failback, and traffic redirection. For the behind-the-scenes engineering implementation — architectural overview and the engineering system design of the feature — see Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints. Getting started Enable regional endpoints on an existing registry: az acr update -n myregistry -g myrg --regional-endpoints enabled View all registry endpoint URLs, including the registry global endpoint, geo-replica regional endpoints, and data endpoints: az acr show-endpoints --name myregistry --resource-group myrg Using regional endpoints Authenticate to a specific regional endpoint: az acr login --name myregistry --endpoint eastus Push to a specific geo-replica. Images and tags pushed to a geo-replica via regional endpoints still propagate to all other geo-replicas under eventual consistency. docker tag myapp:v1 myregistry.eastus.geo.azurecr.io/myapp:v1 docker push myregistry.eastus.geo.azurecr.io/myapp:v1 Pull an image: docker pull myregistry.eastus.geo.azurecr.io/myapp:v1 You can specify regional endpoints directly in Kubernetes deployment manifests if you need to pin workloads to specific regions. This ensures clusters in specific regions always pull from their colocated replica, providing predictable routing and reduced latency. By using different regional endpoints in each cluster's manifests, you can choose to guarantee that each cluster pulls from its local replica instead of relying on Azure-managed routing. East US cluster deployment: apiVersion: apps/v1 kind: Deployment metadata: name: myapp-eastus spec: template: spec: containers: - name: myapp image: myregistry.eastus.geo.azurecr.io/myapp:v1 West Europe cluster deployment: apiVersion: apps/v1 kind: Deployment metadata: name: myapp-westeurope spec: template: spec: containers: - name: myapp image: myregistry.westeurope.geo.azurecr.io/myapp:v1 When to use regional endpoints Scenario What to do Most workloads Keep using the global endpoint ( myregistry.azurecr.io ). Health-aware failover handles routing automatically. Pin AKS clusters to co-located replicas Use regional endpoint URLs in deployment manifests. CI/CD push-then-pull consistency Pin pushes to a regional endpoint to avoid eventual-consistency races. Client-side failover Switch between regional endpoints based on your own health checks. Capacity planning Spread workloads across multiple regional endpoints to avoid per-replica throttling. Troubleshooting Target a specific geo-replica to reproduce or isolate an issue. What changed from private preview Private preview Public preview Feature flag registration required ( az feature register ) No registration needed Subscription private preview enrollment and propagation wait Immediately available to all Azure subscriptions for all Premium SKU registries in all Azure public cloud regions. Separate CLI extension ( acrregionalendpoint ) Built into Azure CLI 2.86.0+ natively No registry-level CLI flag az acr update --regional-endpoints enabled enables regional endpoints for all geo-replicas --region-endpoint-enabled flag for controlling a geo-replica's global endpoint routing via az acr replication update Flag for controlling a geo-replica's global endpoint routing renamed to --global-endpoint-routing No portal support Native Azure portal support for enabling regional endpoints for new registries (during creation) and for existing registries Private preview docs in Azure/acr Full documentation on MS Learn Enabling regional endpoints in the Azure portal You can enable regional endpoints directly from the Azure portal for both new registries (during creation), as well as existing registries: If you were in the private preview 1. Uninstall the CLI extension. The private preview CLI extension conflicts with the built-in commands in Azure CLI 2.86.0+. Remove it: az extension remove --name acrregionalendpoint Verify it's gone: az extension list --query "[?name=='acrregionalendpoint']" -o table 2. Ensure you're running Azure CLI 2.86.0 or later. Regional endpoints commands are available natively starting in Azure CLI 2.86.0. Check your version: az version 3. Update scripts that use --region-endpoint-enabled for controlling global endpoint routing for a geo-replica. The old flag name for controlling a geo-replica's global endpoint routing configuration is deprecated and will be removed in Azure CLI 2.87.0 (June 2026). Update to --global-endpoint-routing : # Old (deprecated) az acr replication update --registry myregistry --name westus \ --region-endpoint-enabled false # New az acr replication update --registry myregistry --name westus \ --global-endpoint-routing false Why the rename? The old flag name --region-endpoint-enabled was confusing — it sounded like it controlled the regional endpoints feature, but it actually controlled whether a geo-replica participates in global endpoint routing. The new name --global-endpoint-routing says exactly what it does. For a full breakdown of all three CLI flags and how they relate, see the endpoint reference. Learn more Full documentation: Geo-replication in Azure Container Registry — Regional endpoints — prerequisites, CLI commands, network considerations, private endpoint integration, and troubleshooting. Operational deep dive: How ACR geo-replication handles failover, failback, and traffic redirection — health-aware failover, throttling, eventual consistency, DNS considerations, monitoring, pricing, and a full walkthrough. Behind-the-scenes engineering implementation: Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints — architectural details and the engineering system design behind the feature. Endpoint reference: Azure Container Registry endpoint reference — all endpoint types, URL formats, and CLI flags in one place. Private endpoints: Connect privately to a registry using private endpoints — IP allocation math, subnet sizing, and NIC queries for registries with regional endpoints. Firewall rules: Configure firewall access rules — which FQDNs to allow for regional endpoints. Feedback We'd love to hear how you're using regional endpoints and what we can improve. Reach out via: Azure Container Registry GitHub repository — issues, feature requests, and discussion Azure portal feedback — use the feedback button in the Azure portal on your registry's page Regional endpoints are on the path to GA. Your feedback directly shapes the feature's direction.
johnsonshi_msft
Jun 05, 2026 Place Apps on Azure Blog
285Views
1like
1Comment
Inside ACR Artifact Cache: Pull-Through Caching at Scale
By: Akash Singhal, Luis Dieguez, Kiran Challa, Nathan Anderson, Tony Vargas, Caroline Barker, Ren Shao, Mabel Egba, Toddy Mladenov, Johnson Shi Introduction For many customers, Azure Container Registry (ACR) is the only registry their workloads can trust, even when images and artifacts originate from a different registry such as Docker Hub, Microsoft Artifact Registry, GitHub Container Registry, Quay, another ACR, or a private registry. ACR Artifact Cache makes this many-to-one model practical by letting a platform team map a downstream ACR repository path to an upstream source repository. Here, upstream means the source registry and repository ACR contacts on behalf of the customer, and downstream means the ACR-facing path customers pull from. From the outside, the experience looks like a normal pull from ACR. Inside the service, that pull moves through the same multi-tenant registry platform that serves ACR traffic across regions, clouds, and data plane stamps. This series is about the gap between that simple external experience and the internal system. The goal is to show what happens inside ACR, why the system is designed this way, and how those design choices shape the behavior customers ultimately observe. Some implementation details are simplified, and the system continues to evolve. The request paths and design constraints are representative, but this article intentionally avoids service-by-service internals that are not necessary to understand the feature. For this overview, the useful mental model is: serve now, hydrate for later. Later sections will show where that model helps, and where it creates engineering pressure. Why serve upstream content from ACR? Pulling directly from an upstream is often sufficient for development, but production systems need stronger guarantees from the pull path. The failure modes are familiar to anyone who has operated containerized workloads at scale: an upstream registry is slow or temporarily unavailable an upstream applies rate limits or burst protection credentials for various upstream sources need to be handled safely ACR-to-ACR scenarios should avoid customer-managed credentials entirely by using managed identity network policy expects pulls to stay inside an approved network boundary a platform team wants one shared, sanitized catalog of public content for first-party consumption while individual teams pull only what they need Let’s take Docker Hub as a concrete example. Docker Hub pull rate limits mean that unauthenticated users and Docker Personal users can exhaust their allowed pulls in a time window, causing shared build agents or Kubernetes nodes to receive rate-limit errors instead of images. That is a useful example because it makes the upstream dependency visible, but it is not the whole story. The broader engineering problem is that upstream-sourced artifacts should behave like local registry dependencies once a customer chooses to route them through ACR. Artifact Cache addresses that problem by letting customers map a downstream ACR namespace to an upstream namespace, pull through ACR, and allow ACR to materialize content locally as it is requested. A pull-through cache inside ACR Azure Container Registry operates across 60+ Azure regions and 6 public and sovereign clouds, serves hundreds of thousands of registries, and handles billions of requests per day. Artifact Cache is only one part of that larger service, but it is large enough to be a distributed systems problem in its own right: more than 100 million image pulls per day, petabyte-scale egress, upstreams with different behavior, and customers who expect registry pulls to remain predictable. This scale matters because Artifact Cache is not deployed beside ACR as a separate service. It is part of the same registry system that serves normal pushes, pulls, tag listing, catalog operations, authentication flows, private networking scenarios, and other registry API traffic. That means Artifact Cache has to fit into ACR's existing resource model and request-serving model. Customers configure cache rules and authentication boundaries through the control plane, then their pulls are served through the data plane. The next sections follow those two parts in order: first the resources customers create, then the runtime path those resources affect. The customer workflow The setup begins in the control plane, where customers define the relationship between an ACR namespace and an upstream source. A customer starts with an ACR and chooses an upstream repository. In the examples below, myregistry.azurecr.io is the customer's ACR login server. The dockerhub/library/node path is the downstream ACR namespace the customer wants to use for cached content. The authentication model depends on the upstream: For a public upstream, the cache rule may not need credentials. For a private upstream, the customer stores upstream credential material in their Azure Key Vault, creates a credential set that references those secrets, and then associates that credential set with a cache rule. At access time, ACR uses the system-assigned managed identity associated with the cache rule to read the referenced Key Vault secrets, so the customer controls access by granting that identity the required secret permissions. ACR materializes those credentials only when it needs to contact the upstream, so the customer-owned Key Vault remains the secret store. For an ACR-to-ACR upstream, the customer can use a user-assigned managed identity. In that scenario, credential sets are not part of the flow; managed identity replaces the credential-set and Key Vault path. At a high level, the customer defines a namespace mapping: docker pull myregistry.azurecr.io/dockerhub/library/node:latest maps to: docker pull docker.io/library/node:latest In ACR, that mapping is stored as a cache rule: a control-plane resource that maps a downstream ACR path to an upstream source path. If the upstream requires authentication, the cache rule links to the appropriate credential boundary: a credential set backed by customer-owned Key Vault secrets, or a user-assigned managed identity for ACR-to-ACR. This is where the control-plane/data-plane split shows up. The control plane manages registry configuration through surfaces such as CLI, portal, Bicep, ARM templates, and other Azure Resource Manager clients. ARM sends those resource operations to the ACR control plane, which creates or updates the cache rule and, when needed, the credential set as child resources under the registry. Those resources do not own customer secrets or identities directly; they link to existing Azure resources such as the customer's Key Vault or an optional user-assigned managed identity. Later, the data plane uses that persisted configuration to decide whether a runtime registry request, such as a pull or tag listing, should be handled by Artifact Cache. After setup, the runtime path begins with the simplest possible pull: docker pull myregistry.azurecr.io/dockerhub/library/node:latest To understand what happens after that command, we need a map of the ACR components that participate in the request path. The ACR components involved The architecture needed for this overview is much smaller than ACR's full internal service graph. ACR is a regionalized service. The control plane operates at the regional level, while data plane stamps serve hot-path registry traffic for the registries assigned to them. A registry is pinned to a stamp, and high-traffic regions may have more than one stamp. Stamp architecture is an ACR concept covered in more detail in the stamp rebalancing post; this article only needs the simplified model below. For this article, ACR has three important boundaries: The regional control plane manages registry resources and provisioning operations. The data plane stamp serves hot-path registry traffic for registries pinned to that stamp. The storage layer holds downstream registry metadata, blobs, and storage-backed event queues. At this level of detail, a data plane stamp is composed of a few major runtime substrates. The registry data plane virtual machine scale set (VMSS) is the core ACR data plane. It runs containerized services including the frontend, the registry API entry point that receives and routes OCI and ACR-specific requests. The data proxy VMSS also runs containerized services and serves selected blob-content paths. It serves eligible blob-content traffic behind ACR's dedicated data endpoint; see the ACR data endpoint documentation. The stamp also includes a runtime cluster for additional data plane services, including services that are not on the hot path. This article will not explain why ACR uses both VMSS-based services and a runtime cluster inside the data plane stamp. That tradeoff is useful context, but it belongs in a separate deep dive. For Artifact Cache, the important point is narrower: the stamp contains the runtime substrates that participate in data plane serving, including runtime-cluster services that process async import and hydration work. The component list is: Component Role Region control plane Manages registry resources and provisioning operations Data plane stamp Serves pinned registries in a region Registry data plane VMSS Core ACR data plane for OCI and ACR-specific APIs Frontend Handles OCI registry API traffic inside the registry data plane Data proxy VMSS Serves selected blob-content paths, including Artifact Cache Runtime Kubernetes Cluster Hosts additional data plane services, including async import and hydration workers Cache rule Maps downstream ACR path to upstream path Credential set or managed identity Provides the upstream authentication boundary when needed Cache Backend service Handles cache-rule-backed pulls Storage queue Regional storage resource used for hydration events Metadata/blob storage Stores downstream manifests, tags, digests, and layer blobs Import workers Run in the data plane runtime cluster and hydrate downstream content asynchronously Upstream registry Public, private, or another ACR registry used as the source The diagram below is a component map rather than a step-by-step pull trace. It shows one visible data plane stamp in West US for myregistry.azurecr.io, with a muted marker to indicate that larger regions can contain multiple stamps. The stamp contains a registry data plane VMSS, a data proxy VMSS, and a runtime Kubernetes cluster. Regional metadata/blob storage and the storage queue sit outside the stamp boundary. The storage queue is also outside the regional control plane cluster; it is a storage resource consumed by data plane runtime-cluster workers. First artifact pull Now return to the pull request: docker pull myregistry.azurecr.io/dockerhub/library/node:latest The request reaches the data plane stamp where myregistry is pinned. The frontend in the registry data plane VMSS handles the registry API request and forwards it to the Cache Backend Service, which checks whether the requested repository path matches a cache rule. If there is no matching cache rule, the request follows the normal ACR path. If a cache rule matches, Artifact Cache logic applies. The next check is local state. ACR looks at downstream metadata and blob storage to determine whether the requested manifest and blobs are already available locally. If the content is present, ACR can serve it from the downstream registry path. If the content is not available locally, ACR resolves the upstream repository path from the cache rule. If the upstream requires authentication, ACR uses the configured auth boundary for that upstream: a credential set for private upstreams, or a user-assigned managed identity for ACR-to-ACR upstreams. The request can then be served through the upstream-backed data path, with the data proxy handling the blob content path. The first pull does not need to wait for durable hydration to complete before the client receives content. Serving the pull and hydrating the downstream registry are related operations, but they are deliberately separated. The trace above follows the same node:latest image used in the setup example. On a cache miss, the data plane queues an async import event for the requested image while still serving the client request. Manifest content returns through the frontend path. For layer blobs, the frontend returns a redirect to the data proxy, and the client follows that redirect while the data proxy streams blob content from the upstream CDN. The data plane serves the customer request, but it also detects that durable downstream state needs to be populated. That durable work is where hydration comes in. Hydration Hydration is the process that materializes upstream content into the downstream ACR registry. ACR performs hydration asynchronously because the data plane workload can be bursty and variable. A deployment or scale-out event can cause many clients to request the same not-yet-hydrated image at nearly the same time. Image size, layer count, multi-platform manifest trees, upstream behavior, queue depth, and retry behavior all matter in a multi-tenant service. The north star is to coordinate those requests: collapse duplicate work, hydrate the content from upstream, and serve all waiting clients without turning one customer action into unnecessary upstream load. That coordination problem is challenging at ACR scale, and we are continuing to improve it. The existing async import path gives Artifact Cache a durable and scalable foundation while that serving path continues to evolve. At a high level, the data plane queues an import event. A notification service consumes the event and dispatches work to import workers in the data plane runtime cluster. Those workers fetch the required content from the upstream registry and write manifests, tags, digests, and layer blobs into ACR metadata and blob storage. When import workers complete, they notify the notification service, which can publish completion signals through ACR eventing surfaces such as Event Grid and webhooks. This allows customers to use webhooks to detect when cached content is fully available locally. You can read more about how it works here. The mental model is that the first pull can serve immediately, while hydration makes future local serving durable. A follow-up post will go deeper on the work ACR does to reduce upstream load during this hydration window. Later pulls After hydration completes, later pulls for the same content can be served from ACR. For digest references, the model is relatively direct because a digest is content-addressed. If ACR has the requested digest and its blobs downstream, the data plane can serve that content locally. Tags are more subtle because tags can change. A tag such as latest is a name that can point to different content over time. Artifact Cache therefore must care about freshness semantics for tag-based pulls. This is one of the reasons a pull-through cache becomes more complex than "fetch once and forget." The benefit is not only lower latency. ACR also reduces repeated dependency on the upstream for content that has already been materialized downstream. Guarding the pull path Once content is hydrated, ACR must serve that content from the customer's registry boundary even when the upstream is slow, unavailable, or returning errors. That distinction matters for tag-based pulls: ACR may need upstream checks to reason about freshness, but an upstream failure should not automatically prevent ACR from serving content that is already available downstream. Artifact Cache also must be careful about how it behaves when upstreams are unhealthy. If an upstream starts returning 5xx errors or throttling requests, ACR should avoid amplifying the problem by repeatedly sending customer-triggered requests upstream. Circuit breaking and upstream work minimization are part of being a good steward of both customer traffic and upstream registry limits. More details to follow in subsequent posts. There is a separate availability question inside ACR: what happens if Artifact Cache-specific components, such as the cache backend path, are operationally unavailable? ACR handles that case gracefully by falling back to normal registry pull behavior: it checks the customer's registry state and serves the image if the requested content already exists in ACR. In other words, cache-backend unavailability should not block pulls for content that is already present in the registry. What we will explore next This overview is the map for the rest of the series. The following posts will go deeper into the parts of the system where the design pressure is highest. Minimizing upstream work We will start with how Artifact Cache avoids making more upstream requests than necessary. This becomes difficult when many clients request the same not-yet-hydrated image at the same time. A Kubernetes scale-out event is the classic example: many nodes may ask for the same image concurrently, and the system must avoid turning one customer's action into unnecessary duplicate upstream work. Making Artifact Cache observable to customers We will also look at how customers understand whether their cache rule is healthy, whether credentials are usable, and why a pull failed. This is hard because a failed pull can involve customer configuration, Key Vault access, managed identity configuration, upstream credentials, upstream availability, data plane request handling, or asynchronous hydration. The engineering challenge is to expose the right customer-facing health and debug signals without turning internal topology into the user interface. Repository semantics in Artifact Cache Finally, we will look at repository semantics. Once upstream content becomes local, the repository is no longer just a mirror. Tags can move upstream, digest references are content-addressed, and customers may push their own content into downstream repositories. The visible repository state can involve both upstream-derived content and customer-owned downstream writes. Closing Artifact Cache is designed to make upstream-sourced artifacts behave like ACR-served content once customers choose to route those artifacts through their registry. The design goal is that customers can pull from ACR and reason about the result using ACR boundaries: registry configuration, local serving, customer-visible health, and predictable repository semantics.
akashsinghal
Jun 02, 2026 Place Apps on Azure Blog
479Views
2likes
0Comments