azure container registry


Regional Endpoints for Azure Container Registry Geo-Replication — Now in Public Preview
By Johnson Shi, Zoey (Zhuyu) Li, Huangli Wu

What's new

Regional endpoints for geo-replicated Azure Container Registries are now in public preview. See the feature's official MS Learn documentation.

If you've been following since the private preview announcement, here's what changed:

- No feature flag registration. No subscription enrollment is needed, so all Azure subscriptions and customers can now use this feature.
- No CLI extension. Regional endpoints commands are built into Azure CLI 2.86.0+ natively. If you installed the private preview acrregionalendpoint extension, uninstall it to avoid conflicts.
- Native CLI and portal support. With Azure CLI 2.86.0+, enable regional endpoints for all geo-replicas of a registry with az acr create --regional-endpoints enabled or az acr update --regional-endpoints enabled. The Azure portal also supports configuring regional endpoints natively.
- CLI flag rename for configuring a geo-replica's global endpoint routing (an existing, separate feature). The existing flag --region-endpoint-enabled (on az acr replication create/update) has been renamed to --global-endpoint-routing.

Key clarifications:

- --global-endpoint-routing (formerly --region-endpoint-enabled, on az acr replication create / az acr replication update) controls whether a specific geo-replica participates in global endpoint routing. This is an existing feature that is different from the new registry-level --regional-endpoints feature discussed in this post.
- --regional-endpoints (on az acr create / az acr update) enables or disables the regional endpoints feature at the registry level for all geo-replicas. This is the feature discussed in this post.

See the endpoint reference for the full breakdown of the various registry endpoints (global endpoints, regional endpoints, and data endpoints).

Regional endpoints are available on Premium SKU registries in all Azure public cloud regions.

What are regional endpoints?

Regional endpoints give you dedicated, per-region login server URLs for each geo-replica with the following URL pattern:

    myregistry.eastus.geo.azurecr.io
    myregistry.westeurope.geo.azurecr.io

Regional endpoints coexist with the registry's global endpoint (myregistry.azurecr.io). Enabling regional endpoints doesn't disable a registry's global endpoint, which remains backed by Azure-managed routing. You can choose per workload:

- Use the global endpoint for automatic Azure-managed routing with health-aware failover, where Azure routes your requests to the geo-replica with the best network performance profile for the client.
- Use a regional endpoint when you need explicit control or routing to a specific geo-replica.

Other resources:

- For the full background on why regional endpoints exist and the problems they solve, see the private preview blog post.
- For the complete operational deep dive (health-aware failover, throttling considerations, storage quota and pricing, eventual consistency, home region outage behavior, DNS propagation, private endpoint interaction, capacity planning, and monitoring guidance), see How ACR geo-replication handles failover, failback, and traffic redirection.
- For the behind-the-scenes engineering implementation (architectural overview and the engineering system design of the feature), see Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints.

Getting started

Enable regional endpoints on an existing registry:

    az acr update -n myregistry -g myrg --regional-endpoints enabled

View all registry endpoint URLs, including the registry global endpoint, geo-replica regional endpoints, and data endpoints:

    az acr show-endpoints --name myregistry --resource-group myrg

Using regional endpoints

Authenticate to a specific regional endpoint:

    az acr login --name myregistry --endpoint eastus

Push to a specific geo-replica. Images and tags pushed to a geo-replica via regional endpoints still propagate to all other geo-replicas under eventual consistency.

    docker tag myapp:v1 myregistry.eastus.geo.azurecr.io/myapp:v1
    docker push myregistry.eastus.geo.azurecr.io/myapp:v1

Pull an image:

    docker pull myregistry.eastus.geo.azurecr.io/myapp:v1

You can specify regional endpoints directly in Kubernetes deployment manifests if you need to pin workloads to specific regions. This ensures clusters in specific regions always pull from their colocated replica, providing predictable routing and reduced latency. By using different regional endpoints in each cluster's manifests, you can guarantee that each cluster pulls from its local replica instead of relying on Azure-managed routing.

East US cluster deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-eastus
    spec:
      # selector and labels added so the manifest validates; the label value is illustrative
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
          - name: myapp
            image: myregistry.eastus.geo.azurecr.io/myapp:v1

West Europe cluster deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-westeurope
    spec:
      # selector and labels added so the manifest validates; the label value is illustrative
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
          - name: myapp
            image: myregistry.westeurope.geo.azurecr.io/myapp:v1

When to use regional endpoints

Scenario | What to do
Most workloads | Keep using the global endpoint (myregistry.azurecr.io). Health-aware failover handles routing automatically.
Pin AKS clusters to co-located replicas | Use regional endpoint URLs in deployment manifests.
CI/CD push-then-pull consistency | Pin pushes to a regional endpoint to avoid eventual-consistency races.
Client-side failover | Switch between regional endpoints based on your own health checks.
Capacity planning | Spread workloads across multiple regional endpoints to avoid per-replica throttling.
Troubleshooting | Target a specific geo-replica to reproduce or isolate an issue.
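The client-side failover scenario in the table above can be sketched in a few lines of Python. This is a minimal illustration, not an official client. The function names (regional_endpoint, pick_endpoint) and the injected is_healthy callback are hypothetical, and falling back to the global endpoint when no preferred region is healthy is one possible policy rather than a documented recommendation. Only the hostname pattern comes from the post.

```python
from typing import Callable, Iterable


def regional_endpoint(registry: str, region: str) -> str:
    """Build a regional login server name: <registry>.<region>.geo.azurecr.io."""
    return f"{registry}.{region}.geo.azurecr.io"


def pick_endpoint(
    registry: str,
    preferred_regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> str:
    """Return the first healthy regional endpoint in preference order.

    is_healthy is a caller-supplied check (hypothetical here); it receives
    the endpoint hostname and returns True if the caller considers it usable.
    """
    for region in preferred_regions:
        host = regional_endpoint(registry, region)
        if is_healthy(host):
            return host
    # Assumed policy: fall back to the global endpoint, which keeps
    # Azure-managed routing and health-aware failover.
    return f"{registry}.azurecr.io"


if __name__ == "__main__":
    # Simulate an outage of the East US replica with a fake health check.
    down = {"myregistry.eastus.geo.azurecr.io"}
    host = pick_endpoint(
        "myregistry", ["eastus", "westeurope"], lambda h: h not in down
    )
    print(f"{host}/myapp:v1")
```

In practice the health check would be whatever signal you already trust, for example a recent successful pull or a probe result, and the chosen hostname would be substituted into the image reference your deployment tooling renders.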
What changed from private preview

Private preview | Public preview
Feature flag registration required (az feature register) | No registration needed
Subscription private preview enrollment and propagation wait | Immediately available to all Azure subscriptions for all Premium SKU registries in all Azure public cloud regions
Separate CLI extension (acrregionalendpoint) | Built into Azure CLI 2.86.0+ natively
No registry-level CLI flag | az acr update --regional-endpoints enabled enables regional endpoints for all geo-replicas
--region-endpoint-enabled flag for controlling a geo-replica's global endpoint routing via az acr replication update | Flag for controlling a geo-replica's global endpoint routing renamed to --global-endpoint-routing
No portal support | Native Azure portal support for enabling regional endpoints for new registries (during creation) and for existing registries
Private preview docs in Azure/acr | Full documentation on MS Learn

Enabling regional endpoints in the Azure portal

You can enable regional endpoints directly from the Azure portal for both new registries (during creation) and existing registries.

If you were in the private preview

1. Uninstall the CLI extension. The private preview CLI extension conflicts with the built-in commands in Azure CLI 2.86.0+. Remove it:

    az extension remove --name acrregionalendpoint

Verify it's gone:

    az extension list --query "[?name=='acrregionalendpoint']" -o table

2. Ensure you're running Azure CLI 2.86.0 or later. Regional endpoints commands are available natively starting in Azure CLI 2.86.0. Check your version:

    az version

3. Update scripts that use --region-endpoint-enabled for controlling global endpoint routing for a geo-replica. The old flag name is deprecated and will be removed in Azure CLI 2.87.0 (June 2026). Update to --global-endpoint-routing:

    # Old (deprecated)
    az acr replication update --registry myregistry --name westus \
      --region-endpoint-enabled false

    # New
    az acr replication update --registry myregistry --name westus \
      --global-endpoint-routing false

Why the rename? The old flag name --region-endpoint-enabled was confusing: it sounded like it controlled the regional endpoints feature, but it actually controlled whether a geo-replica participates in global endpoint routing. The new name --global-endpoint-routing says exactly what it does. For a full breakdown of all three CLI flags and how they relate, see the endpoint reference.

Learn more

- Full documentation: Geo-replication in Azure Container Registry, Regional endpoints. Covers prerequisites, CLI commands, network considerations, private endpoint integration, and troubleshooting.
- Operational deep dive: How ACR geo-replication handles failover, failback, and traffic redirection. Covers health-aware failover, throttling, eventual consistency, DNS considerations, monitoring, pricing, and a full walkthrough.
- Behind-the-scenes engineering implementation: Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints. Covers architectural details and the engineering system design behind the feature.
- Endpoint reference: Azure Container Registry endpoint reference. Lists all endpoint types, URL formats, and CLI flags in one place.
- Private endpoints: Connect privately to a registry using private endpoints. Covers IP allocation math, subnet sizing, and NIC queries for registries with regional endpoints.
- Firewall rules: Configure firewall access rules. Lists which FQDNs to allow for regional endpoints.

Feedback

We'd love to hear how you're using regional endpoints and what we can improve. Reach out via:

- Azure Container Registry GitHub repository: issues, feature requests, and discussion
- Azure portal feedback: use the feedback button in the Azure portal on your registry's page

Regional endpoints are on the path to GA. Your feedback directly shapes the feature's direction.
johnsonshi_msft
Jun 05, 2026 Apps on Azure Blog
Inside ACR Artifact Cache: Pull-Through Caching at Scale
By: Akash Singhal, Luis Dieguez, Kiran Challa, Nathan Anderson, Tony Vargas, Caroline Barker, Ren Shao, Mabel Egba, Toddy Mladenov, Johnson Shi

Introduction

For many customers, Azure Container Registry (ACR) is the only registry their workloads can trust, even when images and artifacts originate from a different registry such as Docker Hub, Microsoft Artifact Registry, GitHub Container Registry, Quay, another ACR, or a private registry.

ACR Artifact Cache makes this many-to-one model practical by letting a platform team map a downstream ACR repository path to an upstream source repository. Here, upstream means the source registry and repository ACR contacts on behalf of the customer, and downstream means the ACR-facing path customers pull from.

From the outside, the experience looks like a normal pull from ACR. Inside the service, that pull moves through the same multi-tenant registry platform that serves ACR traffic across regions, clouds, and data plane stamps.

This series is about the gap between that simple external experience and the internal system. The goal is to show what happens inside ACR, why the system is designed this way, and how those design choices shape the behavior customers ultimately observe. Some implementation details are simplified, and the system continues to evolve. The request paths and design constraints are representative, but this article intentionally avoids service-by-service internals that are not necessary to understand the feature.

For this overview, the useful mental model is: serve now, hydrate for later. Later sections will show where that model helps, and where it creates engineering pressure.

Why serve upstream content from ACR?

Pulling directly from an upstream is often sufficient for development, but production systems need stronger guarantees from the pull path. The failure modes are familiar to anyone who has operated containerized workloads at scale:

- an upstream registry is slow or temporarily unavailable
- an upstream applies rate limits or burst protection
- credentials for various upstream sources need to be handled safely
- ACR-to-ACR scenarios should avoid customer-managed credentials entirely by using managed identity
- network policy expects pulls to stay inside an approved network boundary
- a platform team wants one shared, sanitized catalog of public content for first-party consumption while individual teams pull only what they need

Let’s take Docker Hub as a concrete example. Docker Hub pull rate limits mean that unauthenticated users and Docker Personal users can exhaust their allowed pulls in a time window, causing shared build agents or Kubernetes nodes to receive rate-limit errors instead of images.

That is a useful example because it makes the upstream dependency visible, but it is not the whole story. The broader engineering problem is that upstream-sourced artifacts should behave like local registry dependencies once a customer chooses to route them through ACR. Artifact Cache addresses that problem by letting customers map a downstream ACR namespace to an upstream namespace, pull through ACR, and allow ACR to materialize content locally as it is requested.

A pull-through cache inside ACR

Azure Container Registry operates across 60+ Azure regions and 6 public and sovereign clouds, serves hundreds of thousands of registries, and handles billions of requests per day. Artifact Cache is only one part of that larger service, but it is large enough to be a distributed systems problem in its own right: more than 100 million image pulls per day, petabyte-scale egress, upstreams with different behavior, and customers who expect registry pulls to remain predictable.

This scale matters because Artifact Cache is not deployed beside ACR as a separate service. It is part of the same registry system that serves normal pushes, pulls, tag listing, catalog operations, authentication flows, private networking scenarios, and other registry API traffic.

That means Artifact Cache has to fit into ACR's existing resource model and request-serving model. Customers configure cache rules and authentication boundaries through the control plane, then their pulls are served through the data plane. The next sections follow those two parts in order: first the resources customers create, then the runtime path those resources affect.

The customer workflow

The setup begins in the control plane, where customers define the relationship between an ACR namespace and an upstream source.

A customer starts with an ACR and chooses an upstream repository. In the examples below, myregistry.azurecr.io is the customer's ACR login server. The dockerhub/library/node path is the downstream ACR namespace the customer wants to use for cached content.

The authentication model depends on the upstream:

- For a public upstream, the cache rule may not need credentials.
- For a private upstream, the customer stores upstream credential material in their Azure Key Vault, creates a credential set that references those secrets, and then associates that credential set with a cache rule. At access time, ACR uses the system-assigned managed identity associated with the cache rule to read the referenced Key Vault secrets, so the customer controls access by granting that identity the required secret permissions. ACR materializes those credentials only when it needs to contact the upstream, so the customer-owned Key Vault remains the secret store.
- For an ACR-to-ACR upstream, the customer can use a user-assigned managed identity. In that scenario, credential sets are not part of the flow; managed identity replaces the credential-set and Key Vault path.
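The three authentication cases above can be summarized as a small decision function. The sketch below is purely illustrative: UpstreamKind, CacheRuleSketch, and auth_boundary are hypothetical names, not ACR resource types or API fields, and rejecting an ACR-to-ACR rule that has no identity is an assumption of this sketch rather than documented service behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class UpstreamKind(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    ACR = "acr"


@dataclass(frozen=True)
class CacheRuleSketch:
    """Hypothetical, simplified view of a cache rule's auth configuration."""
    upstream_kind: UpstreamKind
    credential_set: Optional[str] = None          # name of a credential set, if any
    user_assigned_identity: Optional[str] = None  # identity resource name, if any


def auth_boundary(rule: CacheRuleSketch) -> str:
    """Return which auth boundary the rule would use to reach its upstream."""
    if rule.upstream_kind is UpstreamKind.ACR:
        # ACR-to-ACR: managed identity replaces the credential-set path.
        if rule.credential_set is not None:
            raise ValueError("credential sets are not part of the ACR-to-ACR flow")
        if rule.user_assigned_identity is None:
            # Assumption of this sketch: require an identity for ACR upstreams.
            raise ValueError("ACR-to-ACR upstream needs a user-assigned identity")
        return "managed-identity"
    if rule.upstream_kind is UpstreamKind.PRIVATE:
        # Private upstream: secrets stay in the customer's Key Vault and are
        # referenced through a credential set.
        if rule.credential_set is None:
            raise ValueError("private upstream needs a credential set")
        return "credential-set"
    # Public upstream: credentials are optional.
    return "credential-set" if rule.credential_set else "anonymous"


if __name__ == "__main__":
    print(auth_boundary(CacheRuleSketch(UpstreamKind.PUBLIC)))
    print(auth_boundary(CacheRuleSketch(UpstreamKind.PRIVATE, credential_set="hub-creds")))
    print(auth_boundary(CacheRuleSketch(UpstreamKind.ACR, user_assigned_identity="cache-id")))
```

The point of the sketch is only that the upstream type determines which boundary applies; in every case the secret material or identity itself stays in a customer-owned Azure resource.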
At a high level, the customer defines a namespace mapping:

    docker pull myregistry.azurecr.io/dockerhub/library/node:latest

maps to:

    docker pull docker.io/library/node:latest

In ACR, that mapping is stored as a cache rule: a control-plane resource that maps a downstream ACR path to an upstream source path. If the upstream requires authentication, the cache rule links to the appropriate credential boundary: a credential set backed by customer-owned Key Vault secrets, or a user-assigned managed identity for ACR-to-ACR.

This is where the control-plane/data-plane split shows up. The control plane manages registry configuration through surfaces such as CLI, portal, Bicep, ARM templates, and other Azure Resource Manager clients. ARM sends those resource operations to the ACR control plane, which creates or updates the cache rule and, when needed, the credential set as child resources under the registry. Those resources do not own customer secrets or identities directly; they link to existing Azure resources such as the customer's Key Vault or an optional user-assigned managed identity. Later, the data plane uses that persisted configuration to decide whether a runtime registry request, such as a pull or tag listing, should be handled by Artifact Cache.

After setup, the runtime path begins with the simplest possible pull:

    docker pull myregistry.azurecr.io/dockerhub/library/node:latest

To understand what happens after that command, we need a map of the ACR components that participate in the request path.

The ACR components involved

The architecture needed for this overview is much smaller than ACR's full internal service graph.

ACR is a regionalized service. The control plane operates at the regional level, while data plane stamps serve hot-path registry traffic for the registries assigned to them. A registry is pinned to a stamp, and high-traffic regions may have more than one stamp. Stamp architecture is an ACR concept covered in more detail in the stamp rebalancing post; this article only needs the simplified model below.

For this article, ACR has three important boundaries:

- The regional control plane manages registry resources and provisioning operations.
- The data plane stamp serves hot-path registry traffic for registries pinned to that stamp.
- The storage layer holds downstream registry metadata, blobs, and storage-backed event queues.

At this level of detail, a data plane stamp is composed of a few major runtime substrates. The registry data plane virtual machine scale set (VMSS) is the core ACR data plane. It runs containerized services including the frontend, the registry API entry point that receives and routes OCI and ACR-specific requests. The data proxy VMSS also runs containerized services and serves selected blob-content paths. It serves eligible blob-content traffic behind ACR's dedicated data endpoint; see the ACR data endpoint documentation. The stamp also includes a runtime cluster for additional data plane services, including services that are not on the hot path.

This article will not explain why ACR uses both VMSS-based services and a runtime cluster inside the data plane stamp. That tradeoff is useful context, but it belongs in a separate deep dive. For Artifact Cache, the important point is narrower: the stamp contains the runtime substrates that participate in data plane serving, including runtime-cluster services that process async import and hydration work.
The component list is:

Component | Role
Region control plane | Manages registry resources and provisioning operations
Data plane stamp | Serves pinned registries in a region
Registry data plane VMSS | Core ACR data plane for OCI and ACR-specific APIs
Frontend | Handles OCI registry API traffic inside the registry data plane
Data proxy VMSS | Serves selected blob-content paths, including Artifact Cache
Runtime Kubernetes Cluster | Hosts additional data plane services, including async import and hydration workers
Cache rule | Maps downstream ACR path to upstream path
Credential set or managed identity | Provides the upstream authentication boundary when needed
Cache Backend service | Handles cache-rule-backed pulls
Storage queue | Regional storage resource used for hydration events
Metadata/blob storage | Stores downstream manifests, tags, digests, and layer blobs
Import workers | Run in the data plane runtime cluster and hydrate downstream content asynchronously
Upstream registry | Public, private, or another ACR registry used as the source

The diagram below is a component map rather than a step-by-step pull trace. It shows one visible data plane stamp in West US for myregistry.azurecr.io, with a muted marker to indicate that larger regions can contain multiple stamps. The stamp contains a registry data plane VMSS, a data proxy VMSS, and a runtime Kubernetes cluster. Regional metadata/blob storage and the storage queue sit outside the stamp boundary. The storage queue is also outside the regional control plane cluster; it is a storage resource consumed by data plane runtime-cluster workers.

First artifact pull

Now return to the pull request:

    docker pull myregistry.azurecr.io/dockerhub/library/node:latest

The request reaches the data plane stamp where myregistry is pinned. The frontend in the registry data plane VMSS handles the registry API request and forwards it to the Cache Backend Service, which checks whether the requested repository path matches a cache rule. If there is no matching cache rule, the request follows the normal ACR path. If a cache rule matches, Artifact Cache logic applies.

The next check is local state. ACR looks at downstream metadata and blob storage to determine whether the requested manifest and blobs are already available locally. If the content is present, ACR can serve it from the downstream registry path.

If the content is not available locally, ACR resolves the upstream repository path from the cache rule. If the upstream requires authentication, ACR uses the configured auth boundary for that upstream: a credential set for private upstreams, or a user-assigned managed identity for ACR-to-ACR upstreams. The request can then be served through the upstream-backed data path, with the data proxy handling the blob content path.

The first pull does not need to wait for durable hydration to complete before the client receives content. Serving the pull and hydrating the downstream registry are related operations, but they are deliberately separated.

The trace above follows the same node:latest image used in the setup example. On a cache miss, the data plane queues an async import event for the requested image while still serving the client request. Manifest content returns through the frontend path. For layer blobs, the frontend returns a redirect to the data proxy, and the client follows that redirect while the data proxy streams blob content from the upstream CDN.

The data plane serves the customer request, but it also detects that durable downstream state needs to be populated. That durable work is where hydration comes in.

Hydration

Hydration is the process that materializes upstream content into the downstream ACR registry.

ACR performs hydration asynchronously because the data plane workload can be bursty and variable. A deployment or scale-out event can cause many clients to request the same not-yet-hydrated image at nearly the same time. Image size, layer count, multi-platform manifest trees, upstream behavior, queue depth, and retry behavior all matter in a multi-tenant service.

The north star is to coordinate those requests: collapse duplicate work, hydrate the content from upstream, and serve all waiting clients without turning one customer action into unnecessary upstream load. That coordination problem is challenging at ACR scale, and we are continuing to improve it. The existing async import path gives Artifact Cache a durable and scalable foundation while that serving path continues to evolve.

At a high level, the data plane queues an import event. A notification service consumes the event and dispatches work to import workers in the data plane runtime cluster. Those workers fetch the required content from the upstream registry and write manifests, tags, digests, and layer blobs into ACR metadata and blob storage.

When import workers complete, they notify the notification service, which can publish completion signals through ACR eventing surfaces such as Event Grid and webhooks. This allows customers to use webhooks to detect when cached content is fully available locally. You can read more about how it works here.

The mental model is that the first pull can serve immediately, while hydration makes future local serving durable. A follow-up post will go deeper on the work ACR does to reduce upstream load during this hydration window.

Later pulls

After hydration completes, later pulls for the same content can be served from ACR.

For digest references, the model is relatively direct because a digest is content-addressed. If ACR has the requested digest and its blobs downstream, the data plane can serve that content locally.

Tags are more subtle because tags can change. A tag such as latest is a name that can point to different content over time. Artifact Cache therefore must care about freshness semantics for tag-based pulls. This is one of the reasons a pull-through cache becomes more complex than "fetch once and forget."

The benefit is not only lower latency. ACR also reduces repeated dependency on the upstream for content that has already been materialized downstream.

Guarding the pull path

Once content is hydrated, ACR must serve that content from the customer's registry boundary even when the upstream is slow, unavailable, or returning errors. That distinction matters for tag-based pulls: ACR may need upstream checks to reason about freshness, but an upstream failure should not automatically prevent ACR from serving content that is already available downstream.

Artifact Cache also must be careful about how it behaves when upstreams are unhealthy. If an upstream starts returning 5xx errors or throttling requests, ACR should avoid amplifying the problem by repeatedly sending customer-triggered requests upstream. Circuit breaking and upstream work minimization are part of being a good steward of both customer traffic and upstream registry limits. More details to follow in subsequent posts.

There is a separate availability question inside ACR: what happens if Artifact Cache-specific components, such as the cache backend path, are operationally unavailable? ACR handles that case gracefully by falling back to normal registry pull behavior: it checks the customer's registry state and serves the image if the requested content already exists in ACR. In other words, cache-backend unavailability should not block pulls for content that is already present in the registry.

What we will explore next

This overview is the map for the rest of the series. The following posts will go deeper into the parts of the system where the design pressure is highest.

Minimizing upstream work

We will start with how Artifact Cache avoids making more upstream requests than necessary. This becomes difficult when many clients request the same not-yet-hydrated image at the same time. A Kubernetes scale-out event is the classic example: many nodes may ask for the same image concurrently, and the system must avoid turning one customer's action into unnecessary duplicate upstream work.

Making Artifact Cache observable to customers

We will also look at how customers understand whether their cache rule is healthy, whether credentials are usable, and why a pull failed. This is hard because a failed pull can involve customer configuration, Key Vault access, managed identity configuration, upstream credentials, upstream availability, data plane request handling, or asynchronous hydration. The engineering challenge is to expose the right customer-facing health and debug signals without turning internal topology into the user interface.

Repository semantics in Artifact Cache

Finally, we will look at repository semantics. Once upstream content becomes local, the repository is no longer just a mirror. Tags can move upstream, digest references are content-addressed, and customers may push their own content into downstream repositories. The visible repository state can involve both upstream-derived content and customer-owned downstream writes.

Closing

Artifact Cache is designed to make upstream-sourced artifacts behave like ACR-served content once customers choose to route those artifacts through their registry. The design goal is that customers can pull from ACR and reason about the result using ACR boundaries: registry configuration, local serving, customer-visible health, and predictable repository semantics.
akashsinghal
Jun 02, 2026 Apps on Azure Blog
Regional Endpoints for Geo-Replicated Azure Container Registries (Private Preview)
Imagine you're running Kubernetes clusters in multiple Azure regions—East US, West Europe, and Southeast Asia. You've configured ACR with geo-replication so your container images are available everywhere, but you've noticed something frustrating: you can't control which replica your clusters pull from. Sometimes your East US cluster pulls from West Europe, and you have no way to pin it to the co-located replica or troubleshoot why routing behaves unexpectedly. This scenario highlights a fundamental challenge with geo-replicated container registries: while Azure-managed routing optimizes for performance, it doesn't provide explicit control for custom failover strategies, troubleshooting, regional affinity, or predictable routing. Regional endpoints solve this by letting you choose exactly which region handles your requests. Background: How Geo-Replication Works Today Geo-replication allows you to maintain copies of your container registry in multiple Azure regions around the world. This means your container images are stored closer to where your applications run, reducing download times and improving reliability. You maintain a single registry name (like myregistry.azurecr.io), and Azure automatically routes your requests to the most suitable replica. The Challenge: Azure-Managed Routing Limitations While geo-replication has been invaluable for global deployments, the automatic routing creates challenges for some customers. When you push or pull images from a geo-replicated registry, Azure-managed routing automatically directs your request to the most suitable replica based on the client's network performance profile. While this Azure-managed routing works well for many scenarios, it creates several challenges for customers with specific requirements: Misrouting Issues: Azure-managed routing may not always select the replica you expect, particularly if network conditions fluctuate or if you're testing specific regional behavior. 
Geographic Ambiguity: Clients located equidistant from two replicas may experience unpredictable routing as Azure switches between them based on minor network performance variations. Push/Pull Consistency: Images pushed to one replica may be pulled from another during geo-replication synchronization, creating temporary inconsistencies that can impact deployment pipelines. For more details on troubleshooting push operations with geo-replicated registries, see Troubleshoot push operations. Lack of Regional Affinity: Clients may want to establish regional affinity between their applications and a specific replica, but Azure-managed routing doesn't provide a way to maintain this affinity. No Client-Side Failover: Without the ability to target specific replicas, you cannot implement client-side failover strategies or disaster recovery logic that explicitly switches between regions based on your own health checks and business rules. Introducing Regional Endpoints Regional endpoints solve these challenges by providing direct access to specific geo-replicated regions through dedicated login server URLs. Instead of relying solely on the global endpoint (myregistry.azurecr.io) with Azure-managed routing, you can now target specific regional replicas using the pattern: myregistry.<region-name>.geo.azurecr.io For example: myregistry.eastus.geo.azurecr.io myregistry.westeurope.geo.azurecr.io Important: Regional endpoints coexist with a geo-replicated registry's global endpoint at myregistry.azurecr.io. Enabling regional endpoints doesn't disable or replace the global endpoint - you can use both simultaneously. This allows you to use the global endpoint for most operations while selectively using regional endpoints when you need explicit regional control. How It Works Regional endpoints function as login servers—the URL endpoints you use to authenticate and interact with your registry—for specific geo-replicated regions. 
When you authenticate and interact with a regional endpoint instead of a registry's global endpoint, all your registry operations (authentication, artifact uploads/downloads, repository operations, and metadata actions) go directly to that specific regional replica, bypassing Azure-managed routing entirely.

Downloading layer blobs (the actual container image layers) still follows your registry's existing configuration:

For registries without Private Endpoints or Dedicated Data Endpoints, layer blob downloads still redirect to Azure storage accounts (*.blob.core.windows.net).
For registries with Private Endpoints or Dedicated Data Endpoints enabled, layer blob downloads redirect to the corresponding region's dedicated data endpoint (myregistry.<region-name>.data.azurecr.io).

Here's how the architecture compares:

Global Endpoint (Azure-Managed Routing):
Client
↓
myregistry.azurecr.io (Azure-managed routing)
↓
Geo-Replica with the Best Network Performance Profile
↓
Geo-Replica's Data Endpoint (blob storage or dedicated data endpoint)

Regional Endpoint (Customer-Specified Routing):
Client
↓
myregistry.<region-name>.geo.azurecr.io (client-managed routing)
↓
Specific Regional Geo-Replica
↓
Geo-Replica's Data Endpoint (blob storage or dedicated data endpoint)

Regional vs. Global Endpoints

Endpoint Type | URL Format | Purpose | Use Case
Global Endpoint | myregistry.azurecr.io | Login server with Azure-managed routing | Default, optimal for most scenarios
Regional Endpoint | myregistry.<region-name>.geo.azurecr.io | Login server for specific regional replica | Predictable routing, client-side failover, regional affinity, troubleshooting
Dedicated Data Endpoint | myregistry.<region-name>.data.azurecr.io | Layer blob downloads for Private Endpoint and Dedicated Data Endpoint-enabled registries | Automatic blob download redirect from login server
Storage Account | *.blob.core.windows.net | Layer blob downloads for registries without Private Endpoints or Dedicated Data Endpoints | Automatic blob download redirect from login server

Getting Started with Private Preview

Prerequisites

To participate in the regional endpoints private preview, you'll need:

Premium SKU: Regional endpoints are available exclusively on Premium tier registries
Azure CLI: Version 2.74.0 or later for the --regional-endpoints flag
API version: The feature is available in all production regions in Azure Public Cloud via the 2026-01-01-preview ACR ARM API version

NOTE: During private preview, regional endpoints are only available in Azure Public Cloud. Support for Azure Government, Azure China, and other national clouds will be available in public preview and beyond.

NOTE: Regional endpoints can be enabled on any Premium SKU registry, even without geo-replication. A registry without geo-replication has a single geo-replica in the home region, which gets one regional endpoint URL. However, the feature is most useful when your registry has at least two geo-replicas.

Step 1: Register the feature flag

Register the RegionalEndpoints feature flag for your subscription:

az feature register \
  --namespace Microsoft.ContainerRegistry \
  --name RegionalEndpoints

The feature registration is auto-approved and takes approximately 1 hour to propagate.
You can check the status with: az feature show \ --namespace Microsoft.ContainerRegistry \ --name RegionalEndpoints Wait until the state shows Registered before proceeding. Step 2: Propagate the registration Once the feature flag shows Registered, propagate the registration to your subscription's resource provider: az provider register -n Microsoft.ContainerRegistry Step 3: Install the preview CLI extension Download the preview CLI extension wheel file from https://aka.ms/acr/regionalendpoints/download and install it: az extension add \ --source acrregionalendpoint-1.0.0b1-py3-none-any.whl \ --allow-preview true What to Expect Once setup is complete, you can: Enable regional endpoints on both new and existing registries Access preview documentation Provide feedback via our GitHub roadmap Technical Deep Dive Enabling Regional Endpoints Enabling regional endpoints is simple and can be done for both new and existing registries: # Enable for new registry az acr create -n myregistry -g myrg -l <region-name> --regional-endpoints enabled --sku Premium # Enable for existing registry az acr update -n myregistry -g myrg --regional-endpoints enabled When you enable regional endpoints, ACR automatically creates login server URLs for all your geo-replicated regions. There's no need to manually configure individual regions - they're all available immediately. 
Authentication and Pushing/Pulling Images Using regional endpoints follows the same authentication experience as a geo-replicated registry's global endpoint: # Login to a specific regional endpoint az acr login --name myregistry --endpoint eastus # Tag an image with the regional endpoint URL docker tag myapp:v1 myregistry.eastus.geo.azurecr.io/myapp:v1 # Push images to the regional endpoint docker push myregistry.eastus.geo.azurecr.io/myapp:v1 # Pull images from the regional endpoint docker pull myregistry.eastus.geo.azurecr.io/myapp:v1 Regional endpoints support all the same authentication mechanisms as the global endpoint: Microsoft Entra, service principals, managed identities, and admin credentials. Kubernetes Integration One of the most powerful uses of regional endpoints is in Kubernetes deployments. You can specify regional endpoints directly in your deployment manifests, ensuring that Kubernetes clusters in specific regions always pull from their local replica: # East US-based AKS cluster deployment apiVersion: apps/v1 kind: Deployment metadata: name: myapp-eastus spec: template: spec: containers: - name: myapp image: myregistry.eastus.geo.azurecr.io/myapp:v1 --- # West Europe-based AKS cluster deployment apiVersion: apps/v1 kind: Deployment metadata: name: myapp-westeurope spec: template: spec: containers: - name: myapp image: myregistry.westeurope.geo.azurecr.io/myapp:v1 Integration with Dedicated Data Endpoints Regional endpoints work seamlessly with ACR's existing dedicated data endpoints feature. If your registry has dedicated data endpoints enabled, blob downloads from regional endpoints will automatically redirect to the dedicated data endpoints for that region, maintaining all the security benefits of scoped firewall rules without wildcard storage access. 
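For the per-region Kubernetes manifests shown earlier in this section, one way to avoid hand-editing image references is to derive the regional reference from the global one. The following is a minimal Python sketch, not an official tool. The helper name to_regional_image is hypothetical, and it assumes only the two hostname patterns described in this post (myregistry.azurecr.io and myregistry.<region-name>.geo.azurecr.io). Any other hostname shape is rejected rather than guessed at.

```python
GLOBAL_SUFFIX = ".azurecr.io"


def to_regional_image(image_ref: str, region: str) -> str:
    """Rewrite 'myregistry.azurecr.io/repo:tag' to target one regional endpoint.

    Hypothetical helper for illustration; it only understands the global
    endpoint hostname pattern described in the post.
    """
    host, sep, path = image_ref.partition("/")
    if not sep or not host.endswith(GLOBAL_SUFFIX):
        raise ValueError(f"not an ACR global-endpoint reference: {image_ref!r}")
    registry = host[: -len(GLOBAL_SUFFIX)]
    # A dot left in the name means the host is already a regional or data
    # endpoint (for example 'myregistry.eastus.geo'), so refuse to rewrite it.
    if not registry or "." in registry:
        raise ValueError(f"not an ACR global-endpoint reference: {image_ref!r}")
    return f"{registry}.{region}.geo.azurecr.io/{path}"


print(to_regional_image("myregistry.azurecr.io/myapp:v1", "eastus"))
# prints: myregistry.eastus.geo.azurecr.io/myapp:v1
```

A templating step in a deployment pipeline could call this once per cluster region, so each cluster's manifest pins to its co-located replica.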
Integration with Private Endpoints For registries with Private Endpoints enabled, enabling regional endpoints creates an additional private IP address allocation for each geo-replicated region in all associated virtual networks (VNets). For example, if you have a registry with 3 existing geo-replicas and enable regional endpoints, each VNet with a private endpoint to your registry will consume 3 additional private IPs (one per regional endpoint). Firewall and Network Configuration When using regional endpoints, you'll need to configure your firewall rules to allow access to the specific endpoints you plan to use: # Registry operations using regional endpoints myregistry.<region-name>.geo.azurecr.io # Registry operations using the existing global endpoint for Azure-managed routing myregistry.azurecr.io # Layer blob downloads (required if your registry configuration has private endpoints or dedicated data endpoints enabled) myregistry.<region-name>.data.azurecr.io # Layer blob downloads (required if your registry configuration does NOT have private endpoints and does NOT have dedicated data endpoints enabled) *.blob.core.windows.net Related Resources Regional endpoints for geo-replicated registries (Preview) Geo-replication in Azure Container Registry Mitigate data exfiltration with dedicated data endpoints Connect privately to an Azure container registry using Azure Private Link Configure rules to access an Azure container registry behind a firewall
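Returning to the firewall rules listed under Firewall and Network Configuration: the set of hostnames to allow depends on which regions you use and on whether the registry has private endpoints or dedicated data endpoints enabled. The sketch below simply assembles that list from the patterns given in this post. The function name firewall_hosts and its parameters are illustrative assumptions, not part of any Azure SDK or CLI.

```python
def firewall_hosts(registry: str, regions: list[str], data_endpoints: bool) -> list[str]:
    """Return the hostnames a client firewall would need to allow.

    data_endpoints=True models a registry with private endpoints or
    dedicated data endpoints enabled; False models the default
    storage-account redirect for layer blob downloads.
    """
    hosts = [f"{registry}.azurecr.io"]  # global endpoint (Azure-managed routing)
    hosts += [f"{registry}.{r}.geo.azurecr.io" for r in regions]  # regional endpoints
    if data_endpoints:
        hosts += [f"{registry}.{r}.data.azurecr.io" for r in regions]
    else:
        hosts.append("*.blob.core.windows.net")
    return hosts


for host in firewall_hosts("myregistry", ["eastus", "westeurope"], data_endpoints=True):
    print(host)
```

If a client only ever uses regional endpoints, the global endpoint entry can be dropped from the result. It is included here because the post lists both.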
johnsonshi_msft
Jun 02, 2026 Place Apps on Azure Blog
How ACR Artifact Cache Handles Multi-Arch Images: What Gets Cached and When Webhooks Fire
Clarifying Azure Container Registry's Artifact Cache behavior for multi-architecture container images, and how to use webhooks to detect when an image is fully cached locally and no longer being pulled through from upstream.
johnsonshi_msft
Jun 02, 2026 Place Apps on Azure Blog
How ACR Runs Multi-Tenancy at Scale: Compute Stamp Rebalancing and Why You Never See It Happen
By Johnson Shi, Richard Yuan, Yi Zha, Susan Shi, Jeanine Burke, Bin Du, Clark Porter, Bernie Harris, Eric Du Introduction Two of the most common questions we hear from teams running container workloads at scale on Azure Container Registry (ACR) are: "How does ACR keep my registry's performance predictable when I'm sharing infrastructure with thousands of other tenants?" — Cloud services are inherently multi-tenant. What does ACR actually do to keep my workload from competing with my neighbors during high concurrency data plane API operations? "What happens when one tenant's workload grows large enough to affect the shared infrastructure?" — Is there an active intervention, or does the system just absorb the noise from concurrent registry operations? In this post, we clarify how ACR runs its multi-tenant fleet: the stamp architecture that underpins ACR's compute infrastructure in every Azure region, the practice of proactively rebalancing registries between compute stamps when one stamp gets hot from sustained registry data plane operations, and the additional stamp isolation options available for exceptional workloads. Running multi-tenancy well at scale isn't passive — it's an active operational practice, and customers benefit from it every day without seeing it happen. Key Takeaways An ACR registry can be geo-replicated: a registry can have geo-replicas (which are both read and write-enabled) in multiple Azure regions. Each geo-replica is served by an ACR compute stamp in a particular region — independent compute deployment units that underpin ACR regional infrastructure, each made up of VMSS-backed compute pools, that together serve many registry data plane operations belonging to many tenants. Compute stamps are simultaneously a compute capacity pool, a fault domain, and an update domain. 
Take note that compute stamps span only the compute component for ACR; ACR in each region maintains a separate pool of storage accounts shared across all compute stamps, which is not the focus of this post. When a compute stamp gets hot, ACR proactively rebalances by moving registries to a less-utilized stamp in the same region. The registry endpoint does not change; the move is transparent to the customer. For exceptional workloads where rebalancing alone would just transfer the problem, ACR can provide additional stamp isolation — placing registries on stamps with fewer co-tenants, providing better traffic isolation, fault domain separation, and update domain independence. This also structurally improves the stamps the tenant used to share with everyone else. ACR engineering uses a mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals (operational telemetry) to decide when to rebalance stamps. Hot-node P95 CPU, discussed in this post, is one of the proactive signals we use — for each 1-minute bin, take the hottest node's average CPU, then percentile across bins. Pool-average hides per-node hot-spotting; single-sample Max is too noisy. All of this is currently manual. Rebalancing decisions, migrations, and isolation provisioning are operator-driven today. We are actively investing in standardizing and automating the practice — automated stamp rebalancing and lifecycle management are on the roadmap. Background What is a stamp? A compute stamp is ACR's unit of compute deployment within a region. At a high level, ACR has the following compute components within a region to serve registry data plane operations: VMSS-backed compute pools. Virtual Machine Scale Sets are Azure's primitive for running a managed group of identical VMs that autoscale together. 
Each region has several compute stamps, each of which has a pool of VMs that handle registry data plane operations such as authentication, manifest operations, tag resolution, and registry-side metadata (the coordination layer of a container pull), plus a separate pool of VMs running the dataproxy component, which sits between clients and storage. For private endpoint pulls, when a client pulls a layer, the data proxy nodes of a compute stamp fetch from the regional storage pool (or from the data proxy's local compute cache) and stream the bytes back; the data proxy is effectively a private endpoint proxy and streaming compute cache layered together.

Separately, each region has the following storage components shared across all stamps:

A pool of storage accounts. Each ACR region has its own pool of Azure Storage accounts (currently shared across all compute stamps in the region) that hold the actual blob (layer) data and manifest content for the geo-replicas residing on them. Storage accounts are multi-tenant within a stamp and region: multiple registries' blobs may land in the same group of accounts, with strict multi-tenant isolation controls and authorization enforcement. Because the regional storage pool is not part of a compute stamp, a future blog post can cover how ACR is separately investing engineering resources to dynamically scale blobs hosted in a region's pool of storage accounts.

Each ACR region typically contains multiple compute stamps serving many tenants' registries, all sharing a pool of storage accounts. For geo-replicated registries, a geo-replica in a region is bound to exactly one underlying ACR compute stamp and several underlying storage accounts. A geo-replicated registry's global endpoint (<registry>.azurecr.io), geo-replica regional endpoints, and geo-replica dedicated data endpoints are resolved via DNS, backed by ACR's own Traffic Manager profile, to a specific stamp serving that region's geo-replica.
The stamp is ACR's unit of compute that handles a geo-replica's registry data plane operations and proxies requests to the underlying regional storage pool. The key conceptual point: an ACR compute stamp is simultaneously a capacity pool (autoscale operates on it), a fault domain (incidents on the stamp affect all its tenants), and an update domain (rollouts progress through update domains within the stamp). When we move a registry between compute stamps in the same region, we are moving it between all three at once — and the customer's endpoint URLs do not change. From the customer's perspective, the migration is fully seamless: there are no endpoint changes, no DNS updates to make, and no action required on their part. The registry continues to work exactly as before, and the customer does not need to know or care that the underlying stamp has changed. Why multi-tenancy at scale is an active practice The naive picture is: provision enough capacity, autoscale handles the rest. This works in steady state. It does not work when one tenant's workload grows enough to systematically influence stamp behavior, when traffic shape is bursty enough that averages understate peaks, or when a single large tenant's blast radius becomes uncomfortably concentrated on a shared stamp. None of these is something a passive autoscaler will fix. They require an operator decision: this registry would be better served on that stamp. ACR engineering does this continuously — from routine rebalancing to providing additional isolation for exceptional workloads. How We Do It: Stamp Rebalancing Stamp rebalancing — a recurring practice Several signals can trigger a stamp rebalancing decision — reactive signals such as sustained errors, outages, throttling that customers observe or that we observe in our own telemetry, low throughput on a stamp, or proactive signals like hot-node P95 CPU (described in this post below) breaching a threshold. 
The most recent rebalancing work used hot-node P95 as the proactive trigger; other rebalancing decisions have been driven by the reactive signals just listed. When any of these fires, ACR engineering identifies the registries contributing most to the problem and picks one or more to move to a less-utilized stamp in the same region. The mechanism is straightforward: we initiate elevated operator actions, the control plane re-binds the registry's home_stamp field, DNS routing follows, in-flight requests on the source stamp drain in 30–60 seconds, and new traffic lands on the destination stamp. The cutover takes minutes. The customer's registry endpoint does not change. Most customers never know it happened; the ones whose registry moved typically see better latency afterward. Rebalancing to an existing cooler stamp is a recurring practice that resolves most multi-tenant pressure. For exceptional workloads where rebalancing to another shared stamp would just transfer the problem, ACR may provide additional stamp isolation — placing registries on stamps with fewer co-tenants, giving the tenant better traffic isolation, fault domain separation, and update domain independence while also structurally improving the stamps that tenant used to share with everyone else. Rebalancing at different scales ACR applies rebalancing across a spectrum of scenarios, from moving a handful of registries to a cooler stamp to providing additional stamp isolation for exceptional workloads. The decision criterion is workload size relative to the shared fleet — if moving a tenant to a different shared stamp would just transfer the hot-stamp problem to the destination, additional stamp isolation is the right answer. For everyone else, rebalancing to an existing stamp is sufficient. Both are manual today; both stamp provisioning and rebalancing mechanisms described are on ACR's roadmap to be automated with less operator involvement. 
Hot-node P95: one of the signals we use proactively Rebalancing decisions are driven by a mix of reactive and proactive signals. Reactive signals — outages, sustained error rates, frequent throttling, low throughput that customers report or that we see in our own telemetry — are the obvious triggers. But waiting for these means waiting for a customer-visible problem. Proactive signals let us intervene before that happens. Hot-node P95 CPU, showcased in this post, is one of the proactive signals we use, and it was the primary signal for the most recent rebalancing work described in the example below. The choice of CPU metric matters. Three candidates: Pool-average CPU. Averages every node in the pool. Hides per-node hot-spotting — a pool with 6% average CPU can still have one node at 99%. Single-sample Max CPU. The highest 1-minute sample. Captures spikes, but is dominated by single-bin noise that doesn't represent sustained load. Hot-node P95 CPU. For each 1-minute bin, take the hottest node's average CPU. Then percentile across bins over a representative 12-hour peak window. This is "how hot is the worst node, most of the time." Hot-node P95 captures sustained per-node load without being noisy, and it tracks customer-visible behavior more closely than either alternative. A concrete illustration from a recent regional resize: on one shared stamp's dataproxy pool, Max CPU touched 96% — alarming if read alone. But hot-node P95 was 43%, meaning most of the time even the hottest node was comfortably loaded; the 96% was a single 1-minute spike. Using Max as the operating signal would have triggered an unnecessary intervention. Using pool-average would have missed real hot-spotting elsewhere. Hot-node P95 is the right operating point for this particular signal — and it is one input among several that feed the broader rebalancing decision. 
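To make the three candidate metrics concrete, here is a small Python sketch that computes each of them from per-node, per-minute CPU averages. It is an illustration under stated assumptions, not ACR's telemetry pipeline: the input shape (one dict of node name to CPU percent per 1-minute bin), the function names, and the nearest-rank percentile definition are all assumptions, since the post does not specify them. The data is synthetic.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (an assumed definition, not taken from the post)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def hot_node_p95(bins: list[dict[str, float]]) -> float:
    """For each 1-minute bin take the hottest node, then P95 across bins."""
    hottest_per_bin = [max(b.values()) for b in bins]
    return percentile(hottest_per_bin, 95)


def pool_average(bins: list[dict[str, float]]) -> float:
    """Average over every node in every bin; hides per-node hot-spotting."""
    samples = [cpu for b in bins for cpu in b.values()]
    return sum(samples) / len(samples)


def max_cpu(bins: list[dict[str, float]]) -> float:
    """Highest single sample; dominated by one-bin spikes."""
    return max(cpu for b in bins for cpu in b.values())


# Synthetic window: 100 one-minute bins, three nodes. Node "a" runs at 40%
# with a single spike to 96%; nodes "b" and "c" idle at 5%.
bins = [{"a": 40.0, "b": 5.0, "c": 5.0} for _ in range(100)]
bins[10]["a"] = 96.0

print(hot_node_p95(bins))            # 40.0: the sustained load on the worst node
print(max_cpu(bins))                 # 96.0: the single spike
print(round(pool_average(bins), 2))  # 16.85: the idle nodes mask node "a"
```

The synthetic numbers echo the pattern described above: Max reacts to one spike, the pool average understates the busy node, and hot-node P95 lands on the load that node carries most of the time.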
A Recent Example: Rebalancing Large AI Workloads for Additional Isolation

We recently completed the rebalancing of registries belonging to one of the largest AI workloads in the region, providing additional isolation to address the scale of their traffic. The customer's workload had grown to the point where its presence on the shared stamps was systematically influencing stamp behavior: variability that affected their own pull latency, and variability that affected every other tenant on the same shared stamps.

The customer had 40 registries homed across two shared stamps in the region, with a severely long-tailed traffic distribution: the top four registries carried 96.7% of the customer's traffic. When that much load is concentrated in four registries, the migration cannot proceed as one batch. We moved them in phases, smallest to largest, with observation windows between phases:

Idle and small-traffic tail first: about thirty low-traffic registries, used to validate the cutover tooling against the destination stamp.
Medium-traffic registries next: in sub-batches with 24 hours of observation between them.
The top four, one at a time: each individually with 48 hours of observation between cutovers. Order: smallest to largest, so each cutover was a sanity check at increasing load.

The cumulative effect on the shared stamps the customer had previously occupied:

Shared stamp + pool | Hot-Node P95 CPU change | Max CPU change
Stamp A registry pool | -7% | flat
Stamp A dataproxy pool | -34% | 96% → 64%
Stamp B registry pool | -33% | -3 percentage points
Stamp B dataproxy pool | -44% | -5 percentage points

Stamp A dataproxy is the headline. The hottest node went from briefly touching 96% to maxing out at 64%, with sustained hot-node P95 dropping from 43% to 28.5%.
Every other tenant homed on Stamp A — most with no idea this rebalancing happened — now runs on a structurally healthier pool, with more headroom, lower tail latency under load, and lower risk of CPU-driven incidents during traffic spikes. Stamp B saw similar relief. After the rebalancing, we right-sized the shared stamps downward — lowering the VMSS minimum instance count on each to match the new traffic level. Hot-node P95 was the primary signal driving this resize work, the same proactive signal that motivated the rebalancing in the first place: when hot traffic leaves a shared stamp, capacity right-sizing follows. Findings ACR runs this recurring stamp rebalancing practice for one reason: to give customers more guaranteed performance — higher and more predictable pull throughput, lower tail latency, better fault and update isolation — whether through routine rebalancing or additional isolation for exceptional workloads. Every tenant on the rebalanced stamps gets more headroom, more predictable behavior under load, and a smaller blast radius for any single incident or rollout. Three things happen continuously in any ACR region to make this real: registries get rebalanced between stamps as load patterns shift, exceptional workloads get additional stamp isolation when no shared stamp can absorb them sustainably, and stamps get continuously right-sized when load enters or leaves. All three are operator-driven today, all three are being invested in for automation, and all three are guided by a combination of reactive signals (outages, errors, throttling) and proactive signals (hot-node P95 CPU is one of them). The thesis is straightforward: cloud multi-tenancy at scale is not a passive property of the architecture. It is an active operational practice that exists to give customers guaranteed performance and predictable behavior. The customers who benefit most from it are usually the customers who never notice it's happening. 
Summary

Question | Answer
How does ACR keep multi-tenant performance predictable at scale? | By actively moving registries between compute stamps as load shifts: rebalancing in the common case, providing additional isolation for exceptional workloads.
What is a compute stamp? | An ACR compute deployment unit within a region's geo-replica: VMSS-backed registry and data proxy compute pools. Simultaneously a compute capacity pool, fault domain, and update domain. A region typically contains multiple stamps. Take note that ACR maintains a separate pool of regional storage accounts shared across all compute stamps.
Do customers see when their registry moves between stamps? | No. Stamps are within a region; the global endpoint and any regional endpoint URLs do not change. The cutover takes minutes; in-flight requests drain in 30–60 seconds.
Does providing additional isolation only help the isolated tenant? | No. Every other tenant who was sharing a stamp with that workload also benefits, because the largest source of variability has been removed from the shared fleet.
What signals drive these decisions? | A mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals from our own telemetry. Hot-node P95 CPU (the 95th percentile, across a 12-hour peak window, of the hottest node's CPU in each 1-minute bin) is one of the proactive signals, and it was the primary signal for the most recent rebalancing work.
Is all of this automated? | Not yet. Rebalancing, isolation provisioning, and migrations are operator-driven today. Standardizing and automating these practices is an active investment.
johnsonshi_msft
Jun 02, 2026 Place Apps on Azure Blog
Determinism over magic: the engineering design behind Azure Container Registry Regional Endpoints
By Zoey Li, Huangli Wu, Johnson Shi, Wei Meng Introduction Azure Container Registry (ACR) supports geo-replication: one registry resource with active-active replicas across multiple Azure regions. You push or pull through any replica, and ACR asynchronously replicates content to all others. For geo-replicated registries, ACR exposes a global endpoint — myregistry.azurecr.io — backed by Azure Traffic Manager (TM), which routes requests based on network performance. This works well for most workloads. But "automatic" is also "opaque": customers can't see or influence which region TM picks, and for teams with data-residency requirements or their own failover logic, not knowing which replica served a request wasn't enough. This post is an engineering deep dive into the design of Regional Endpoints: per-replica DNS names that let customers explicitly target a single registry replica while preserving the global endpoint as the automatic-failover entry point. We'll walk through the DNS topology, how authentication stays portable across endpoints, the certificate strategy, private endpoint integration, and the trade-offs we made to keep routing deterministic. The Problem: Opaque Routing, No Customer Control Traffic Manager's performance-based routing is a black box from the customer's perspective. The system picks the "best" region for each DNS resolution, but customers cannot influence or predict that choice. This created several pain points: Unpredictable routing: A client in a specific region cannot guarantee it hits the local replica. Network topology changes, TM probe timing, and DNS caching all introduce non-determinism. Troubleshooting latency spikes requires first figuring out which region served the request. Network isolation gaps: Enterprises with strict data residency requirements need deterministic in-region request handling. The global endpoint can route cross-region, which may not satisfy these requirements. 
A large financial services firm reported building duplicate single-region registries per geography as a workaround — abandoning geo-replication entirely. No client-side failover: Customers who wanted failover control (e.g., "try local, fall back to DR region") had no addressable per-region targets to configure in containerd mirrors or client retry logic. Reduced confidence in geo-replication: Some customers disabled geo-replication because they couldn't verify it was working as expected. Without per-region addressability, they couldn't confirm data locality or measure per-replica performance. What Regional Endpoints Are Regional Endpoints provide a dedicated DNS name for each replica region: myregistry.<region>.geo.azurecr.io For example, a registry contoso with replicas in East US and West Europe gets: contoso.eastus.geo.azurecr.io → resolves to the East US replica contoso.westeurope.geo.azurecr.io → resolves to the West Europe replica contoso.azurecr.io → unchanged global endpoint with TM routing and auto-failover Key properties: Coexistence: Regional endpoints exist alongside the global endpoint. Enabling them changes nothing about existing global endpoint behavior. No auto-failover on regional (by design): A request to contoso.eastus.geo.azurecr.io goes to East US. If East US is down, the request fails. This is the explicit trade-off — determinism means no silent rerouting. Global remains the failover entry point: Customers who want automatic failover continue using the global endpoint, now enhanced with health-aware routing. Regional endpoints complement, not replace, the global endpoint. DNL registries: Registries using Deterministic Name Labels include the hash in the hostname (e.g., contoso-ffb4cphwfsc2gbgg.eastus.geo.azurecr.io). Architecture Deep Dive DNS Infrastructure Each replica gets a stable hostname — contoso.eastus.geo.azurecr.io. For public access, this resolves via CNAME to the regional ACR registry server. 
For private endpoint access, it resolves through the privatelink zone (<region>.geo.privatelink.azurecr.io) to a private IP in the customer's VNet — the same two-hop CNAME pattern that the global login server and data endpoints already use when accessed via private link. How Authentication Works When a client calls a regional endpoint, the auth challenge uses the regional hostname for the token endpoint. However, the token itself is scoped to the global registry name (service=contoso.azurecr.io), making tokens interoperable across all endpoints for the same registry and requested scope. The key design choice: the realm uses the regional host (so the token endpoint matches the hostname the client is talking to), but the service stays global (so tokens are portable). A token obtained from contoso.eastus.geo.azurecr.io works equally against contoso.azurecr.io or contoso.westeurope.geo.azurecr.io. This means customers don't need separate credential stores per region — but Docker does require a docker login per hostname because it keys credential storage on the registry URL. Private Endpoint Integration Regional endpoints reuse the existing registry private endpoint group. When regional endpoints are enabled, new group members of the form registry_<region> are added alongside existing members. Regional endpoints are independent of data endpoints — enabling regional endpoints does not require or modify data endpoint configuration. IP consumption: Enabling regional endpoints adds N private IPs (one per replica region) to the existing baseline of 1 + N (1 global + N data). Total with regional endpoints: 1 + 2×N private IPs. Plan subnet capacity accordingly. When regional endpoints are enabled on a registry that already has private endpoints, the PE connections are updated asynchronously to include the new regional group members. 
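The IP arithmetic above is easy to check when sizing a subnet. This short Python sketch encodes the formula as stated (1 global + N data, plus N regional when the feature is enabled). The function name private_ip_count is illustrative, and the count is per private endpoint.

```python
def private_ip_count(replica_regions: int, regional_endpoints: bool) -> int:
    """Private IPs consumed by one private endpoint, per the formula in the post."""
    if replica_regions < 1:
        raise ValueError("a registry has at least one replica region (its home region)")
    baseline = 1 + replica_regions  # 1 global login server + N data endpoints
    regional = replica_regions if regional_endpoints else 0  # N regional endpoints
    return baseline + regional


for n in (1, 3, 5):
    print(n, private_ip_count(n, False), private_ip_count(n, True))
# 1 2 3
# 3 4 7
# 5 6 11
```

For a registry with 3 replicas, enabling regional endpoints takes each private endpoint from 4 to 7 private IPs, that is, 1 + 2×3.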
Certificate Strategy TLS certificates are extended with wildcard SAN entries per region (e.g., *.<region>.geo.azurecr.io), supporting TLS validation for regional hostnames without issuing a certificate per registry. Lifecycle and Reversibility Regional endpoints are a registry-level feature. Once enabled, every replica automatically gets a regional hostname. The feature follows standard registry and replica operations: Enable (az acr update --regional-endpoints enabled): Public DNS records are created for all existing replicas. If private endpoints exist, they are updated asynchronously to include new regional members. Add replica: The new region automatically gets its regional DNS record. If private endpoints exist, they are updated to include the new region. Remove replica: That region's regional DNS record is removed. If private endpoints exist, the corresponding member is removed. Disable (az acr update --regional-endpoints disabled): All regional DNS records are removed. If private endpoints exist, the regional members are removed. The global endpoint continues working throughout. The feature can be re-enabled later. This is separate from the per-replication --global-endpoint-routing property (previously --region-endpoint-enabled, renamed in Azure CLI 2.87.0), which controls whether a replica participates in Traffic Manager routing on the global endpoint. That property has no effect on regional endpoint access. Design Trade-offs Regional Endpoints make the region explicit, so they intentionally do not auto-fail over. If East US is down, contoso.eastus.geo.azurecr.io fails — it does not silently mean West US. This is the point: determinism means no silent rerouting. In practice, ACR replicas are zone-redundant by default, spreading across multiple availability zones within a region — so a full regional outage is significantly less likely than a single-zone failure. Customers who want automatic failover across regions continue using the global endpoint. 
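Because regional endpoints never reroute on their own, any failover has to live in the client. The Python sketch below shows one possible shape for that logic: try an ordered list of hostnames and stop at the first one that serves the pull. It is a sketch under assumptions, not a recommended implementation. The function name pull_with_failover is hypothetical, and the actual pull is injected as a callable so the ordering logic stays independent of any particular container tooling.

```python
from typing import Callable, Sequence


def pull_with_failover(
    repo_and_tag: str,
    hosts: Sequence[str],
    pull: Callable[[str], None],
) -> str:
    """Try each login server in order; return the host that served the pull.

    'pull' is any callable that raises on failure. What counts as a failure
    (timeout, 5xx, manifest not found) is the caller's policy decision.
    """
    errors = []
    for host in hosts:
        ref = f"{host}/{repo_and_tag}"
        try:
            pull(ref)
            return host
        except Exception as exc:
            errors.append(f"{host}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))


# Example policy: local region first, then the DR region, then the global
# endpoint as a last resort.
ORDER = [
    "contoso.eastus.geo.azurecr.io",
    "contoso.westeurope.geo.azurecr.io",
    "contoso.azurecr.io",
]
```

One way to supply pull is a thin wrapper around the Docker CLI, for example lambda ref: subprocess.run(["docker", "pull", ref], check=True). Whether the global endpoint belongs at the end of the list depends on whether silent cross-region routing is acceptable for the workload.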
Other trade-offs we made:

Docker login is per hostname. Tokens are interoperable across endpoints, but credential stores are hostname-scoped. az acr login --endpoint <region> provides a convenience shortcut.
All-or-nothing enablement. Regional endpoints are enabled for all replicas simultaneously; you cannot enable for one region and not another. This simplifies the control plane and avoids partial-state confusion.
More private IPs. Each regional endpoint adds one private IP per PE. Plan for 1 + 2×N IPs for N replicas.
Replication lag is not masked. A pull to contoso.eastus.geo.azurecr.io will fail if the image hasn't replicated to East US yet. Regional endpoints guarantee region affinity, not data availability. For push-then-immediate-pull workflows, use retries or check replication status before pulling from a different region.

What Customers Should Know

Enabling Regional Endpoints

Requires Azure CLI 2.86.0 or later.

    # Enable on existing registry
    az acr update -n myregistry --regional-endpoints enabled

    # Verify endpoints
    az acr show-endpoints -n myregistry

Authentication

    # Login to a regional endpoint
    az acr login -n myregistry --endpoint eastus

    # Or manually with Docker (tokens are interoperable)
    TOKEN=$(az acr login -n myregistry --expose-token --query accessToken -o tsv)
    echo "$TOKEN" | docker login myregistry.eastus.geo.azurecr.io -u 00000000-0000-0000-0000-000000000000 --password-stdin

Tokens are interoperable: a token obtained from any endpoint (regional or global) works across all endpoints. Docker requires a separate docker login per hostname since it keys credentials on the registry URL.

TM Routing Disable

The existing az acr replication update --global-endpoint-routing disabled removes a region from Traffic Manager routing on the global endpoint. This does not affect direct regional endpoint access: contoso.eastus.geo.azurecr.io remains reachable regardless of TM endpoint status.
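For the replication-lag trade-off noted above, where a pull from a regional endpoint can fail until the image has replicated to that region, one option is to wrap the pull in a retry with backoff. This is a generic sketch under stated assumptions: retry is a hypothetical helper (not part of Docker or Azure CLI), and the attempt count and delays are placeholders to tune for your own pipeline.

```shell
#!/bin/sh
# Generic retry helper (hypothetical name, not part of Docker or Azure CLI).
# retry <max_attempts> <initial_delay_seconds> <command> [args...]
# Doubles the delay after each failed attempt.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage (requires Docker and a prior docker login to the regional hostname):
#   retry 5 2 docker pull contoso.eastus.geo.azurecr.io/myapp:1.0
```

Retrying against the same regional hostname keeps region affinity intact; the alternative of falling back to another region trades that affinity for faster success.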
Rollout and Safety

The feature is designed for safe, incremental adoption:

Additive: Enabling creates new DNS hostnames. Existing global endpoint records and behavior are untouched.
Reversible: Disabling removes the regional DNS records. The global endpoint continues working throughout.
Isolated failure domain: A regional endpoint issue affects only traffic explicitly sent to that hostname. Global endpoint traffic is completely unaffected.
No customer-side infrastructure required: No agents, sidecars, or additional services to deploy.

The feature has been validated through integration testing.

Outcome and What's Next

Regional Endpoints complete the addressability story for geo-replicated registries. Customers now have two complementary access patterns:

Global endpoint (contoso.azurecr.io): Automatic routing with health-aware failover. Best for workloads that want resilience without operational overhead.
Regional endpoints (contoso.<region>.geo.azurecr.io): Deterministic, pinned-to-region access. Best for compliance, troubleshooting, and client-controlled failover strategies.

Together, these give customers both explicit control and automatic failover, the combination that was previously impossible without abandoning geo-replication.

Looking ahead, we are exploring ways to improve pull behavior when content has not yet replicated locally, addressing the replication lag limitation without sacrificing region affinity.

To learn more about ACR geo-replication, see Geo-replication in Azure Container Registry. To enable regional endpoints, see the CLI reference or the Azure portal.
zoeyli
Jun 01, 2026 Place Apps on Azure Blog
Microsoft 365 multi-agent workflow with Microsoft Agent Framework
Learn how to design and run a multi‑agent workflow with Microsoft Agent Framework: from building a coordinated set of specialized agents and tools, to hosting and deploying them with Azure AI Foundry, and finally exposing the same workflow to users in Microsoft 365 (Teams or Copilot). This walkthrough demonstrates a practical end‑to‑end pattern for orchestrating agents, adding tools, and packaging the solution for real‑world applications.
Vincent_Giraud
Apr 23, 2026 Place Apps on Azure Blog
Azure Container Registry Premium SKU Now Supports 100 TiB Storage
Today, we're excited to announce that Azure Container Registry Premium SKU now supports up to 100 TiB of registry storage, a 2.5x increase from the previous 40 TiB limit and a 5x increase from the original 20 TiB limit just two years ago. We've also improved geo-replication data sync speed, reducing data sync times for new replicas.

We're also introducing an updated Portal experience for storage capacity visibility, a long-standing customer request. You can now monitor your storage consumption directly from the Monitoring tab in the Azure Portal Overview blade, making it easier to track usage against your registry limits.

Imagine you're managing container infrastructure for a large enterprise. Your teams have embraced containerization, migrating critical workloads from VMs to containers for improved composability and deployment velocity. Meanwhile, your AI and machine learning teams are storing increasingly large model artifacts, agent tooling, and pipeline outputs in your registry. You've watched your storage consumption climb steadily toward the 40 TiB limit, and you're evaluating complex workarounds like splitting workloads across multiple registries.

With today's announcement, that constraint is lifted. Premium SKU registries now support up to 100 TiB, giving you the headroom to consolidate workloads and scale confidently.

Background: Container and AI Adoption Drive Storage Growth

Organizations continue to adopt containers at an accelerating pace. The migration from virtual machines to containerized architectures (driven by the composability, portability, and operational benefits of containers) shows no signs of slowing. At the same time, the AI revolution has introduced new storage demands: large language models, vision models, agent frameworks, and their associated tooling all require substantial registry capacity. These parallel trends have pushed many enterprises toward the previous 40 TiB limit faster than anticipated.
The Challenge: Storage Constraints at Scale

For organizations operating at scale, the 40 TiB limit created operational challenges:

Multi-Registry Complexity: Teams were forced to split workloads across multiple registries, complicating access control, networking, and operational visibility.
Architectural Workarounds: Some organizations implemented custom garbage collection and artifact lifecycle policies specifically to stay under limits, rather than based on actual retention requirements.
Growth Planning Uncertainty: Rapidly growing AI workloads made capacity planning difficult, with some organizations uncertain whether they could consolidate new model artifacts in their primary registry.
Geo-Replication Provisioning: Syncing data to new geo-replicas for expanding global footprints took longer than desired, slowing regional expansion.

Introducing 100 TiB Storage Limits

Premium SKU registries now support up to 100 TiB of storage, a 2.5x increase that provides substantial headroom for continued growth. This limit applies to the total storage across all repositories in a single registry. We've also improved geo-replication data sync speed when expanding your registry's global footprint with new replicas, as detailed in the table below.

What's Changing

Aspect | Previous | New
Premium SKU Storage Limit | 40 TiB | 100 TiB
Basic/Standard SKU Limits | Unchanged | Unchanged

No Action Required

The new 100 TiB limit is automatically available for all Premium SKU registries. There's no migration, feature flag, or configuration change required; your registry can now grow beyond 40 TiB without any intervention.
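To put the new ceiling in concrete terms, the headroom calculation can be sketched in a few lines of shell. This is an illustration only: usage_percent and LIMIT_BYTES are hypothetical names, and the sketch assumes you already have the registry's consumed storage as a byte count from whatever usage query you rely on.

```shell
#!/bin/sh
# 100 TiB expressed in bytes (100 * 2^40).
LIMIT_BYTES=$((100 * 1024 * 1024 * 1024 * 1024))

# usage_percent <used_bytes>
# Prints the whole-number percentage of the 100 TiB limit in use
# (integer division, so the result is rounded down).
usage_percent() {
  echo $(($1 * 100 / LIMIT_BYTES))
}

# Example: a registry holding 40 TiB, the previous Premium limit
usage_percent $((40 * 1024 * 1024 * 1024 * 1024))   # prints 40
```

A registry that was full under the old limit therefore sits at 40 percent of the new one.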
Who Benefits

This storage increase is particularly valuable for:

Enterprise platform teams managing centralized container registries for large organizations with hundreds of development teams
AI and ML teams storing large model artifacts, training outputs, and inference containers
Organizations migrating from VMs who are consolidating legacy workloads into containerized architectures
Global enterprises using geo-replication across many regions, where storage is replicated to each replica

Some of the world's largest AI and financial services organizations have been operating near the previous limit and will benefit immediately from this increase.

Getting Started

Check Your Current Usage

You can view your registry's current storage consumption and the new 100 TiB limit in the Azure Portal (under the Monitoring tab in the Overview blade) or via CLI:

    # View registry storage usage and limits. The registry size limit will be under MaximumStorageCapacity.
    az acr show-usage --name myregistry --output table

The Portal, CLI, and REST API/SDKs all now reflect the increased 100 TiB capacity. You can programmatically query your registry's storage usage via the List Usages REST API, making it easy to integrate capacity monitoring into your existing tooling and dashboards.

Upgrade to Premium SKU

The 100 TiB storage limit is exclusive to Premium SKU. If you're on Basic or Standard and need higher storage capacity, upgrading to Premium unlocks the full 100 TiB limit along with geo-replication, enhanced throughput, private endpoints, and other enterprise features:

    # Upgrade to Premium SKU
    az acr update --name myregistry --sku Premium

Related Resources

Azure Container Registry service tiers and limits
Geo-replication in Azure Container Registry
Best practices for Azure Container Registry
Azure Container Registry pricing
List Usages REST API
johnsonshi_msft
Mar 16, 2026 Place Apps on Azure Blog
Health-Aware Failover for Azure Container Registry Geo-Replication
Azure Container Registry (ACR) supports geo-replication: one registry resource with active-active (primary-primary), write-enabled geo-replicas across multiple Azure regions. You can push or pull through any replica, and ACR asynchronously replicates content and metadata to all other replicas using an eventual consistency model. For geo-replicated registries, ACR exposes a global endpoint like contoso.azurecr.io; that URL is backed by Azure Traffic Manager, which routes requests to the replica with the best network performance profile (usually the closest region).

That's the promise. But TM routing at the global endpoint was latency-aware, not fully workload-health-aware: it could see whether the regional front door responded, but not whether that region could successfully serve real pull and push traffic end to end. This post walks through how we connected ACR Health Monitor's deep dependency checks to Traffic Manager so the global endpoint avoids routing to degraded replicas, improving failover outcomes and reducing customer-facing errors during regional incidents.

The Problem: Healthy on the Outside, Broken on the Inside

Traffic Manager routes traffic using performance-based routing, directing each DNS query to the endpoint with the lowest latency for the caller. To decide whether an endpoint is viable, TM periodically probes a health endpoint, and for ACR, that health check tested exactly one thing: is the reverse proxy responding?

The problem is that a container registry is much more than a web server. A successful docker pull touches storage (where layers and manifests live), caching infrastructure, authentication and authorization services, and the metadata service. Any one of those backend dependencies can fail independently while the reverse proxy keeps happily returning 200 OK to Traffic Manager's health probes.
This meant that during real outages (a storage degradation in a region, a caching failure, an authentication service disruption) Traffic Manager had no idea anything was wrong. It kept sending customers straight into a broken region, and those customers got 500 errors on their pull and push operations.

We saw this pattern play out across multiple incidents: storage degradations, caching failures, VM outages, and full datacenter events, each lasting hours. In all of these cases, geo-replicated registries had healthy replicas in other regions that could have served traffic, but Traffic Manager kept routing to the degraded region because the shallow health check passed.

The Manual Workaround (and Its Failure Mode)

Customers could work around this by manually disabling the affected endpoint:

    az acr replication update --registry contoso --name eastus --region-endpoint-enabled false

But this required customers to detect the outage, identify the affected region, and manually disable the endpoint, all during an active incident. Worse, in the most severe scenarios, the manual workaround could not be reliably executed. The endpoint-disable operation itself routes through the regional resource provider, the very infrastructure that's degraded. You can't tell the control plane to reroute traffic away from a region when the control plane in that region is the thing that's down. Customers were stuck.

How Health Monitor Solves This

ACR runs an internal service called Health Monitor within its data plane infrastructure. Its original job was narrowly scoped: it tracked the health of individual nodes so that the load balancer could route traffic to healthy instances within a region. What it didn't do was share that health signal with Traffic Manager for cross-region routing.

We extended Health Monitor with a new deep health endpoint that aggregates the health status of multiple critical data plane dependencies.
Rather than just asking "is the reverse proxy up?", this endpoint answers the real question: "can this region actually serve container registry requests right now?"

Before we walk through the implementation details, here is a simplified before-and-after view:

[Figures: Before, After]

What Gets Checked

The deep health endpoint evaluates the availability of:

Storage: The storage layer that holds image layers and manifests. This is the most fundamental dependency; if storage is unreachable, no image operations can succeed.
Caching infrastructure: Used for caching and distributed coordination. Failures here degrade push operations and can affect pull latency.
Container availability: The health of the internal services that process registry API requests.
Authentication services: The authorization pipeline that validates whether a caller has permission to pull or push.
Metadata service: For registries using metadata search capabilities, the metadata service is also monitored.

If the health evaluation determines that the region cannot reliably serve requests, the endpoint returns unhealthy. Traffic Manager sees the failure, degrades the endpoint, and routes subsequent DNS queries to the next-lowest-latency replica, all automatically, with no customer intervention required.

Per-Registry Intelligence

Getting regional health right was the first step, but we needed to go further. A blunt "is the region healthy?" check would be too coarse. In each region, ACR distributes customer data across a large pool of storage accounts. A storage degradation might affect only a subset of those accounts, meaning most registries in the region are fine, and only those whose data lives on the affected accounts need to fail over.

Health Monitor evaluates health on a per-registry basis. When a Traffic Manager probe arrives, Health Monitor determines which backing resources that specific registry depends on and evaluates health against those specific resources, not the region's overall health.
This means that if contoso.azurecr.io depends on resources that are experiencing errors but fabrikam.azurecr.io depends on healthy ones in the same region, only Contoso's traffic gets rerouted. Fabrikam keeps getting served locally with no unnecessary latency penalty.

The same per-registry logic applies to other dependencies. If a registry has metadata search enabled and the metadata service is down, that registry's endpoint goes unhealthy. If another registry in the same region doesn't use metadata search, it stays healthy.

Tuning for Stability

Failing over too eagerly is almost as bad as not failing over at all. A transient blip shouldn't send traffic across the continent. We tuned the thresholds so that the endpoint is only marked unhealthy after a sustained pattern of failures, not a single transient error.

The end-to-end failover timing (from the onset of a real dependency failure through Health Monitor detection, Traffic Manager probe cycles, and DNS TTL propagation) is on the order of minutes, not seconds. This is deliberately conservative: fast enough to catch real regional degradation, but slow enough to ride out the kind of transient errors that resolve on their own. For context, Traffic Manager itself probes endpoints every 30 seconds and requires multiple consecutive failures before degrading an endpoint, and DNS TTL adds additional propagation delay before all clients switch to the new region.

It's worth noting that DNS-based failover has an inherent limitation: even after Traffic Manager updates its DNS response, existing clients may continue reaching the degraded endpoint until their local DNS cache expires. Docker daemons, container runtimes, and CI/CD systems all cache DNS resolutions. The failover is not instantaneous, but it is automatic, which is a dramatic improvement over the previous state where failover either required manual intervention or simply didn't happen.
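The "sustained failures, not a single blip" rule and the rough timing budget described above can be illustrated with a short sketch. Everything here is an assumption for illustration: the post does not publish ACR's actual thresholds or DNS TTL, so the values 3 and 60 are placeholders, and endpoint_state and estimate_failover_seconds are hypothetical names. Only the 30-second probe interval comes from the text.

```shell
#!/bin/sh
# Illustrative only: ACR's real thresholds and TTL are not published in this post.

# endpoint_state <threshold> <result> [<result> ...]
# Each result is "ok" or "fail", oldest first. Prints "degraded" only if the
# most recent <threshold> results are all failures; otherwise "online".
endpoint_state() {
  threshold=$1
  shift
  streak=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any success resets the streak
    fi
  done
  if [ "$streak" -ge "$threshold" ]; then
    echo degraded
  else
    echo online
  fi
}

# estimate_failover_seconds <probe_interval> <failures_required> <dns_ttl>
# Rough time from the first failed probe until cached DNS answers expire.
# It excludes Health Monitor's own detection window.
estimate_failover_seconds() {
  echo $(($1 * $2 + $3))
}

endpoint_state 3 ok fail ok fail      # prints online (isolated blips)
endpoint_state 3 ok fail fail fail    # prints degraded (sustained failure)
estimate_failover_seconds 30 3 60     # prints 150
```

With these placeholder numbers the estimate lands at two and a half minutes, consistent with the "minutes, not seconds" framing above.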
Health Monitor's Own Resilience

A natural question: what happens if Health Monitor itself fails? Health Monitor is designed to fail-open. If the monitor process is unable to evaluate dependencies (because it has crashed, is restarting, or cannot reach a dependency to check its status), the health endpoint returns healthy, preserving the pre-existing routing behavior. This ensures that a Health Monitor failure cannot itself cause a false failover. The system degrades gracefully back to the original latency-based routing rather than introducing a new failure mode.

How Routing Changed

The change is transparent to customers. They still access their registry through the same myregistry.azurecr.io hostname. The difference is that the system behind that hostname is now actively steering them away from degraded regions instead of blindly routing on latency alone.

What Customers Should Know

For registries with geo-replication enabled, this improvement is automatic; no configuration changes or action are required:

Pull operations benefit the most. When traffic is rerouted to a healthy replica, image layers are served from that replica's storage. For images that have completed replication to the target region, pulls succeed seamlessly. For recently pushed images that haven't yet replicated, a pull from the failover region may not find the image until replication catches up. If your workflow pushes an image and immediately pulls from a different region, consider building in retry logic or checking replication status before pulling.
Push operations are more nuanced. If failover or DNS re-resolution happens during an in-flight push, that push can fail and may need to be retried. This failure mode is not new to health-aware failover; it can already occur when DNS resolves a client to a different region during a push. During failover, customers should expect both higher push latency and a higher chance of retries for long-running uploads.
For production pipelines, use retry logic and design publish steps to be idempotent.
Single-region registries are unaffected by this change. Traffic Manager is only involved when replicas exist; registries without geo-replication continue to route directly to their single region. In the edge case where the only region is degraded, Traffic Manager has nowhere else to route, so it continues routing to the original endpoint, the same behavior as before.

Observability

When a failover occurs, customers can observe the routing change through several signals:

Increased pull latency from a different region: if your monitoring shows image pull times increasing, it may indicate traffic has been rerouted to a more distant replica.
Azure Resource Health: check the Resource Health blade for your registry to see if there's a known issue in your primary region.
Replication status: the replication health API shows the status of each replica, which can help confirm whether a specific region is experiencing issues.

We're actively working on improving the observability story here, including richer signals for when routing changes occur and which region is currently serving your traffic.

Rollout and Safety

We rolled this out incrementally, following Azure's safe deployment practices across ring-based deployment stages. The migration involved updating each registry's Traffic Manager configuration to use the new deep health evaluation. This is controlled at the Traffic Manager level, making it straightforward to roll back a specific registry or region if needed.

We also built in safeguards to quickly revert to previous routing behavior if needed. If Health Monitor's deep health evaluation were to malfunction and falsely report regions as unhealthy, we can disable it and revert to the original pass-through behavior (the same shallow health check as before) as a safety net.
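The pieces described in this post (aggregated dependency checks, evaluation against only the dependencies a given registry uses, fail-open behavior, and a switch to revert to pass-through) can be pictured with a small conceptual sketch. This is not ACR's implementation: deep_health, DEEP_HEALTH_ENABLED, the exit-status convention, and the stub checks are all hypothetical and exist only to make the decision logic concrete.

```shell
#!/bin/sh
# Conceptual sketch, not ACR's implementation. Each check is a command whose
# exit status means: 0 = dependency healthy, 1 = dependency unhealthy,
# anything else = the check itself could not be evaluated.

# deep_health <check_command> [<check_command> ...]
# The caller passes only the checks relevant to one registry.
# Prints "healthy" or "unhealthy".
deep_health() {
  # Safety switch (hypothetical variable): revert to pass-through behavior.
  if [ "${DEEP_HEALTH_ENABLED:-1}" != "1" ]; then
    echo healthy
    return 0
  fi
  for check in "$@"; do
    rc=0
    "$check" || rc=$?
    if [ "$rc" -eq 1 ]; then
      echo unhealthy   # a confirmed dependency failure
      return 0
    fi
    # rc 0: healthy. Any other rc: evaluation failed, so fail open and continue.
  done
  echo healthy
}

# Stub checks standing in for real dependency probes.
storage_ok()    { return 0; }
metadata_down() { return 1; }
monitor_error() { return 2; }

deep_health storage_ok                 # prints healthy
deep_health storage_ok metadata_down   # prints unhealthy
deep_health storage_ok monitor_error   # prints healthy (fail open)
```

A registry without metadata search would simply never have metadata_down in its check list, which is how the same regional fault can affect one registry's endpoint and not another's.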
The Outcome

Since rolling out Health Monitor-based routing, geo-replicated registries now automatically fail over during the types of regional degradation events that previously required manual intervention or resulted in extended customer impact. The classes of incidents we tracked (storage outages, caching failures, VM disruptions, and authentication service degradation) now trigger automatic rerouting to healthy replicas.

This is one piece of a broader effort to improve ACR's resilience for geo-replicated registries. Other recent and ongoing work includes improving replication consistency for rapid tag overwrites, enabling cross-region pull-through for images that haven't finished replicating, and optimizing the replication service's resource utilization for large registries.

Geo-replication has always been ACR's answer to multi-region availability. Health Monitor makes sure that promise holds when it matters most: when something goes wrong.

To learn more about ACR geo-replication, see Geo-replication in Azure Container Registry. To configure geo-replication for your registry, see Enable geo-replication.
johnsonshi_msft
Mar 16, 2026 Place Apps on Azure Blog
Proactive Health Monitoring and Auto-Communication Now Available for Azure Container Registry
Today, we're introducing Azure Container Registry's (ACR) latest service health enhancement: auto-communication through Azure Service Health alerts. When ACR detects degradation in critical operations (authentication, image push, and pull), your teams are now proactively notified through Azure Service Health, delivering better transparency and faster communication without waiting for manual incident reporting.

For platform teams, SRE organizations, and enterprises with strict SLA requirements, this means container registry health events are now communicated automatically and integrated into your existing incident management and observability workflows.

Background: Why Registry Availability Matters

Container registries sit at the heart of modern software delivery. Every CI/CD pipeline build, every Kubernetes pod startup, and every production deployment depends on the ability to authenticate, push artifacts, and pull images reliably. When a registry experiences degradation, even briefly, the downstream impact can cascade quickly: failed pipelines, delayed deployments, and application startup failures across multiple clusters and environments.

Until now, ACR customers discovered service issues primarily through two paths: monitoring their own workloads for symptoms (failed pulls, auth errors), or checking the Azure Status page reactively. Neither approach gives your team the head start needed to coordinate an effective response before impact is felt.

Auto-Communication Through Azure Service Health Alerts

ACR now provides faster communication when:

Degradation is detected in your region
Automated remediation is in progress
Engineering teams have been engaged and are actively mitigating

These notifications arrive through Azure Service Health, the same platform your teams already use to track planned maintenance and health advisories across all your Azure resources.
You receive timely visibility into registry health events (with rich context including tracking IDs, affected regions, impacted resources, and mitigation timelines) without needing to open a support request or continuously monitor dashboards.

Who Benefits

This capability delivers value across every team that depends on container registry availability:

Enterprise platform teams managing centralized registries for large organizations will receive early warning before CI/CD pipelines begin failing across hundreds of development teams.
SRE organizations can integrate ACR health signals into their existing incident management workflows (via webhook integration with PagerDuty, Opsgenie, ServiceNow, and similar tools) rather than relying on synthetic monitoring or customer reports.
Teams with strict SLA requirements can now correlate production incidents with documented ACR service events, supporting post-incident reviews and customer communication.
All ACR customers gain a level of registry observability that previously required custom monitoring infrastructure to approximate.

A Part of ACR's Broader Observability Strategy

Automated Service Health auto-communication is one component of ACR's ongoing investment in service health and observability. Combined with Azure Monitor metrics, diagnostic logs and events, Service Health alerts give your teams a layered observability posture:

Signal | What It Tells You
Service Health alerts | ACR-wide service events in your regions, with official mitigation status
Azure Monitor metrics | Registry-level request rates, success rates, and storage utilization. This will be available soon
Diagnostic logs | Repository and operation-level audit trail

What's next: We are working on exposing additional ACR metrics through Azure Monitor, giving you deeper visibility into registry operations (such as authentication, pull and push API requests, and error breakdowns) directly in the Azure portal.
This will enable self-service diagnostics, allowing your teams to investigate and troubleshoot registry issues independently without opening a support request.

Getting Started

To configure Service Health alerts for ACR, navigate to Service Health in the Azure portal, create an alert rule filtering on Container Registry, and attach an action group with your preferred notification channels (email, SMS, webhook). Alerts can also be created programmatically via ARM templates or Bicep for infrastructure-as-code workflows.

For the full step-by-step setup guide (including recommended alert configurations for production-critical, maintenance awareness, and comprehensive monitoring scenarios), see Configure Service Health alerts for Azure Container Registry.
FeynmanZhou
Mar 12, 2026 Place Apps on Azure Blog