
Apps on Azure Blog

Health-Aware Failover for Azure Container Registry Geo-Replication

johshmsft · Microsoft
Mar 16, 2026

Making geo-replicated ACR registries health-aware, not just latency-aware.

Azure Container Registry (ACR) supports geo-replication: one registry resource with active-active (primary-primary), write-enabled geo-replicas across multiple Azure regions. You can push or pull through any replica, and ACR asynchronously replicates content and metadata to all other replicas using an eventual consistency model.

For geo-replicated registries, ACR exposes a global endpoint like contoso.azurecr.io; that URL is backed by Azure Traffic Manager, which routes requests to the replica with the best network performance profile (usually the closest region).

That's the promise. But Traffic Manager routing at the global endpoint was latency-aware, not fully workload-health-aware: it could see whether the regional front door responded, but not whether that region could successfully serve real pull and push traffic end to end.

This post walks through how we connected ACR Health Monitor's deep dependency checks to Traffic Manager so the global endpoint avoids routing to degraded replicas, improving failover outcomes and reducing customer-facing errors during regional incidents.

The Problem: Healthy on the Outside, Broken on the Inside

Traffic Manager routes traffic using performance-based routing, directing each DNS query to the endpoint with the lowest latency for the caller. To decide whether an endpoint is viable, TM periodically probes a health endpoint — and for ACR, that health check tested exactly one thing: is the reverse proxy responding?

The problem is that a container registry is much more than a web server. A successful docker pull touches storage (where layers and manifests live), caching infrastructure, authentication and authorization services, and the metadata service. Any one of those backend dependencies can fail independently while the reverse proxy keeps happily returning 200 OK to Traffic Manager's health probes.

This meant that during real outages — a storage degradation in a region, a caching failure, an authentication service disruption — Traffic Manager had no idea anything was wrong. It kept sending customers straight into a broken region, and those customers got 500 errors on their pull and push operations.
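To make the gap concrete, here is a minimal sketch (not ACR's actual code) contrasting a shallow probe, which only consults the reverse proxy, with what a real pull actually requires. `shallow_probe` and `real_pull` are hypothetical stand-ins:

```python
# Hypothetical sketch: why a shallow probe passes while real operations fail.
# The probe only confirms the front door responds; a pull also needs storage
# and auth, either of which can be down independently.

def shallow_probe(proxy_up: bool) -> int:
    """Old-style health check: only the reverse proxy is consulted."""
    return 200 if proxy_up else 503

def real_pull(proxy_up: bool, storage_up: bool, auth_up: bool) -> int:
    """A docker pull needs every backend dependency to succeed."""
    if not proxy_up:
        return 503
    if not (storage_up and auth_up):
        return 500  # backend failure surfaces as a server error
    return 200

# Storage is degraded: the probe still reports healthy, pulls return 500.
assert shallow_probe(proxy_up=True) == 200
assert real_pull(proxy_up=True, storage_up=False, auth_up=True) == 500
```

Traffic Manager only ever saw the first function's answer, so it had no reason to reroute.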

We saw this pattern play out across multiple incidents: storage degradations, caching failures, VM outages, and full datacenter events — each lasting hours, all cases where geo-replicated registries had healthy replicas in other regions that could have served traffic, but Traffic Manager kept routing to the degraded region because the shallow health check passed.

The Manual Workaround (and Its Failure Mode)

Customers could work around this by manually disabling the affected endpoint:

az acr replication update --registry contoso --name eastus --region-endpoint-enabled false

But this required customers to detect the outage, identify the affected region, and manually disable the endpoint — all during an active incident. Worse, in the most severe scenarios, the manual workaround could not be reliably executed. The endpoint-disable operation itself routes through the regional resource provider — the very infrastructure that's degraded. You can't tell the control plane to reroute traffic away from a region when the control plane in that region is the thing that's down. Customers were stuck.

How Health Monitor Solves This

ACR runs an internal service called Health Monitor within its data plane infrastructure. Its original job was narrowly scoped: it tracked the health of individual nodes so that the load balancer could route traffic to healthy instances within a region. What it didn't do was share that health signal with Traffic Manager for cross-region routing.

We extended Health Monitor with a new deep health endpoint that aggregates the health status of multiple critical data plane dependencies. Rather than just asking "is the reverse proxy up?", this endpoint answers the real question: "can this region actually serve container registry requests right now?"

Before we walk through the implementation details, here is a simplified before-and-after view:

[Diagram] Before: Traffic Manager probes only the regional reverse proxy.

[Diagram] After: Traffic Manager probes Health Monitor's deep health endpoint, which evaluates the region's backend dependencies.

What Gets Checked

The deep health endpoint evaluates the availability of:

  • Storage — The storage layer that holds image layers and manifests. This is the most fundamental dependency; if storage is unreachable, no image operations can succeed.
  • Caching infrastructure — Provides content caching and distributed coordination. Failures here degrade push operations and can affect pull latency.
  • Container availability — The health of the internal services that process registry API requests.
  • Authentication services — The authorization pipeline that validates whether a caller has permission to pull or push.
  • Metadata service — For registries using metadata search capabilities, the metadata service is also monitored.

If the health evaluation determines that the region cannot reliably serve requests, the endpoint returns unhealthy. Traffic Manager sees the failure, degrades the endpoint, and routes subsequent DNS queries to the next-lowest-latency replica — all automatically, with no customer intervention required.
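The aggregation itself can be sketched in a few lines. This is an illustrative model, not ACR's implementation; the check names and the all-or-nothing policy are assumptions for the example:

```python
# Illustrative sketch of a deep health endpoint: run each dependency check
# and report unhealthy (503) if any critical dependency fails, so Traffic
# Manager degrades the endpoint and routes queries elsewhere.
from typing import Callable, Dict, Tuple

def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Evaluate every dependency; any failure marks the region unhealthy."""
    results = {name: check() for name, check in checks.items()}
    status = 200 if all(results.values()) else 503
    return status, results

checks = {
    "storage": lambda: True,
    "cache": lambda: True,
    "auth": lambda: False,   # simulate an authentication service outage
    "metadata": lambda: True,
}
status, results = deep_health(checks)
assert status == 503 and results["auth"] is False
```

A real implementation would add timeouts and severity weighting per dependency, but the contract Traffic Manager sees is just the status code.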

Per-Registry Intelligence

Getting regional health right was the first step — but we needed to go further. A blunt "is the region healthy?" check would be too coarse. In each region, ACR distributes customer data across a large pool of storage accounts. A storage degradation might affect only a subset of those accounts — meaning most registries in the region are fine, and only those whose data lives on the affected accounts need to fail over.

Health Monitor evaluates health on a per-registry basis. When a Traffic Manager probe arrives, Health Monitor determines which backing resources that specific registry depends on and evaluates health against those specific resources — not the region's overall health.

This means that if contoso.azurecr.io depends on resources that are experiencing errors but fabrikam.azurecr.io depends on healthy ones in the same region, only Contoso's traffic gets rerouted. Fabrikam keeps getting served locally with no unnecessary latency penalty.

The same per-registry logic applies to other dependencies. If a registry has metadata search enabled and the metadata service is down, that registry's endpoint goes unhealthy. If another registry in the same region doesn't use metadata search, it stays healthy.
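The per-registry idea can be sketched as a lookup from registry to its specific backing resources. The registry names, storage account identifiers, and dependency map below are all hypothetical:

```python
# Hypothetical per-registry evaluation: only the resources a given registry
# actually depends on are consulted, so a degraded storage account fails
# over only the registries whose data lives on it.
UNHEALTHY_STORAGE = {"storacct-07"}  # accounts currently experiencing errors

REGISTRY_DEPS = {
    "contoso":  {"storage": "storacct-07", "metadata_search": False},
    "fabrikam": {"storage": "storacct-12", "metadata_search": False},
}

def registry_healthy(name: str, metadata_service_up: bool = True) -> bool:
    deps = REGISTRY_DEPS[name]
    if deps["storage"] in UNHEALTHY_STORAGE:
        return False
    if deps["metadata_search"] and not metadata_service_up:
        return False  # only registries that use metadata search care
    return True

assert registry_healthy("contoso") is False    # rerouted to another replica
assert registry_healthy("fabrikam") is True    # keeps being served locally
```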

Tuning for Stability

Failing over too eagerly is almost as bad as not failing over at all. A transient blip shouldn't send traffic across the continent. We tuned the thresholds so that the endpoint is only marked unhealthy after a sustained pattern of failures — not a single transient error.

The end-to-end failover timing — from the onset of a real dependency failure through Health Monitor detection, Traffic Manager probe cycles, and DNS TTL propagation — is on the order of minutes, not seconds. This is deliberately conservative: fast enough to catch real regional degradation, but slow enough to ride out the kind of transient errors that resolve on their own. For context, Traffic Manager itself probes endpoints every 30 seconds and requires multiple consecutive failures before degrading an endpoint, and DNS TTL adds additional propagation delay before all clients switch to the new region.
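A sustained-failure gate of this kind can be sketched as a consecutive-failure counter; the threshold value here is illustrative, not ACR's actual setting:

```python
# Sketch of a sustained-failure gate: the endpoint is only marked unhealthy
# after N consecutive failed evaluations, so a single transient blip never
# triggers a cross-region failover.
class SustainedFailureGate:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def report(self, check_passed: bool) -> bool:
        """Record one evaluation; return True while the endpoint is healthy."""
        self.consecutive_failures = 0 if check_passed else self.consecutive_failures + 1
        return self.consecutive_failures < self.threshold

gate = SustainedFailureGate(threshold=3)
assert gate.report(False) is True   # single blip: still healthy
assert gate.report(True) is True    # a success resets the counter
assert gate.report(False) is True
assert gate.report(False) is True
assert gate.report(False) is False  # third consecutive failure: unhealthy
```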

It's worth noting that DNS-based failover has an inherent limitation: even after Traffic Manager updates its DNS response, existing clients may continue reaching the degraded endpoint until their local DNS cache expires. Docker daemons, container runtimes, and CI/CD systems all cache DNS resolutions. The failover is not instantaneous — but it is automatic, which is a dramatic improvement over the previous state where failover either required manual intervention or simply didn't happen.

Health Monitor's Own Resilience

A natural question: what happens if Health Monitor itself fails? Health Monitor is designed to fail-open. If the monitor process is unable to evaluate dependencies — because it has crashed, is restarting, or cannot reach a dependency to check its status — the health endpoint returns healthy, preserving the pre-existing routing behavior. This ensures that a Health Monitor failure cannot itself cause a false failover. The system degrades gracefully back to the original latency-based routing rather than introducing a new failure mode.
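The fail-open contract is simple enough to sketch directly; this is an illustrative model of the behavior described above, not the actual code:

```python
# Fail-open sketch: if the health evaluation itself errors out, report
# healthy so a Health Monitor fault can never cause a false failover.
def probe_response(evaluate) -> int:
    try:
        return 200 if evaluate() else 503
    except Exception:
        # Monitor crashed or cannot reach the dependency: degrade gracefully
        # back to latency-based routing instead of taking the region out.
        return 200

def broken_evaluator():
    raise RuntimeError("health monitor unavailable")

assert probe_response(lambda: True) == 200   # healthy region
assert probe_response(lambda: False) == 503  # confirmed-unhealthy region
assert probe_response(broken_evaluator) == 200  # fail-open
```

Note the asymmetry: only a *confirmed* dependency failure returns unhealthy; uncertainty always resolves toward the old behavior.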

How Routing Changed

The change is transparent to customers. They still access their registry through the same myregistry.azurecr.io hostname. The difference is that the system behind that hostname is now actively steering them away from degraded regions instead of blindly routing on latency alone.

What Customers Should Know

For registries with geo-replication enabled, this improvement is automatic — no configuration changes or action required:

  • Pull operations benefit the most. When traffic is rerouted to a healthy replica, image layers are served from that replica's storage. For images that have completed replication to the target region, pulls succeed seamlessly. For recently pushed images that haven't yet replicated, a pull from the failover region may not find the image until replication catches up. If your workflow pushes an image and immediately pulls from a different region, consider building in retry logic or checking replication status before pulling.
  • Push operations are more nuanced. If failover or DNS re-resolution happens during an in-flight push, that push can fail and may need to be retried. This failure mode is not new to health-aware failover; it can already occur when DNS resolves a client to a different region during a push. During failover, customers should expect both higher push latency and a higher chance of retries for long-running uploads. For production pipelines, use retry logic and design publish steps to be idempotent.
  • Single-region registries are unaffected by this change. Traffic Manager is only involved when replicas exist; registries without geo-replication continue to route directly to their single region. In the edge case where the only region is degraded, Traffic Manager has nowhere else to route, so it continues routing to the original endpoint — the same behavior as before.
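The retry advice above can be sketched as a simple exponential-backoff wrapper. `do_pull` is a hypothetical stand-in for whatever client call your pipeline uses to pull or push:

```python
# Hedged sketch of client-side retry with exponential backoff, for pulls or
# pushes that may transiently fail while replication catches up or DNS
# re-resolves to a different replica.
import time

def pull_with_retry(do_pull, attempts: int = 4, base_delay: float = 1.0):
    for attempt in range(attempts):
        try:
            return do_pull()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Simulate: the first two pulls fail (image not yet replicated), third succeeds.
calls = {"n": 0}
def flaky_pull():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("manifest unknown")
    return "pulled"

assert pull_with_retry(flaky_pull, base_delay=0.01) == "pulled"
assert calls["n"] == 3
```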

Observability

When a failover occurs, customers can observe the routing change through several signals:

  • Increased pull latency from a different region — if your monitoring shows image pull times increasing, it may indicate traffic has been rerouted to a more distant replica.
  • Azure Resource Health — check the Resource Health blade for your registry to see if there's a known issue in your primary region.
  • Replication status — the replication health API shows the status of each replica, which can help confirm whether a specific region is experiencing issues.

We're actively working on improving the observability story here — including richer signals for when routing changes occur and which region is currently serving your traffic.

Rollout and Safety

We rolled this out incrementally, following Azure's safe deployment practices across ring-based deployment stages. The migration involved updating each registry's Traffic Manager configuration to use the new deep health evaluation. This is controlled at the Traffic Manager level, making it straightforward to roll back a specific registry or region if needed.

We also built in safeguards to quickly revert to previous routing behavior if needed. If Health Monitor's deep health evaluation were to malfunction and falsely report regions as unhealthy, we can disable it and revert to the original pass-through behavior — the same shallow health check as before — as a safety net.

The Outcome

Since rolling out Health Monitor-based routing, geo-replicated registries now automatically fail over during the types of regional degradation events that previously required manual intervention or resulted in extended customer impact. The classes of incidents we tracked — storage outages, caching failures, VM disruptions, and authentication service degradation — now trigger automatic rerouting to healthy replicas.

This is one piece of a broader effort to improve ACR's resilience for geo-replicated registries. Other recent and ongoing work includes improving replication consistency for rapid tag overwrites, enabling cross-region pull-through for images that haven't finished replicating, and optimizing the replication service's resource utilization for large registries.

Geo-replication has always been ACR's answer to multi-region availability. Health Monitor makes sure that promise holds when it matters most — when something goes wrong.

To learn more about ACR geo-replication, see Geo-replication in Azure Container Registry. To configure geo-replication for your registry, see Enable geo-replication.
