Applies to: Azure SQL Managed Instance
Scenario: Geo-DR using failover groups (FOG) between two SQL managed instances in different regions, deployed in a hub–spoke topology with a third-party firewall/NVA in the hub.
Overview
Failover groups for Azure SQL Managed Instance provide a managed way to replicate all user databases from a primary managed instance to a secondary managed instance in another region, with stable listener endpoints that remain constant across geo-failovers.
In this post, we share a field scenario where FOG creation failed even though the customer had “connectivity” established between regions. The root cause turned out to be network path behavior in a hub–spoke environment with centralized firewalling, specifically around required ports, routing, and traffic handling.
Customer environment
Network topology (high level)
- Spoke VNet (Region 1): Primary SQL MI deployed into a dedicated subnet
- Spoke VNet (Region 2): Secondary SQL MI deployed into a dedicated subnet
- Hub VNet: Centralized security with third‑party firewall (NVA)
- Routing: Spoke → Hub (Firewall) → Spoke for east‑west traffic
SQL Managed Instance is deployed inside a customer VNet/subnet and relies on both data-plane and control-plane communication, so network design choices directly affect how the service behaves.
Problem statement
During deployment, Failover Group creation failed (or remained stuck), and the system did not progress to seeding/replication from the primary to the secondary instance.
Symptoms we observed
- Failover group creation repeatedly failed or timed out.
- Replication did not begin between primary and secondary.
- “Basic reachability” (e.g., VNet connectivity) appeared to be in place, but the platform validation still failed.
What makes this scenario tricky (why “it looks connected” but still fails)
Failover group replication between two SQL MI instances requires bidirectional connectivity between the managed instance subnets over specific ports. Microsoft documents the requirement to allow inbound and outbound TCP connections on:
- TCP 5022
- TCP 11000–11999
In hub–spoke architectures, traffic commonly transits a firewall/NVA with:
- UDR steering
- Stateful inspection
- NAT/SNAT policies
- Optional TLS inspection
- Session idle timeouts
Any of these can alter traffic characteristics in ways that break replication initialization—even if a simple ping/route test passes (and many PaaS endpoints don’t even support ping).
Investigation approach (how we narrowed it down)
We used a structured, layered approach:
- Validate documented FOG prerequisites for SQL MI:
  - Secondary managed instance must be empty.
  - Primary and secondary instance configuration should match (compute, storage, service tier).
  - VNets must have non-overlapping address ranges.
  - Both instances must be in the same DNS zone (secondary created with primary DNS zone ID).
  - NSG rules must allow required ports (5022 and 11000–11999) inbound/outbound between MI subnets.
- Validate cross-instance network connectivity using SQL MI-aware methods. A practical technique is to run connectivity tests from within SQL MI (for example, using SQL Agent job scripts) to validate the required ports and endpoints used by failover groups.
- Inspect the hub firewall path. We reviewed how the firewall handled:
  - East‑west traffic between the spoke subnets
  - Session persistence and state tracking
  - NAT/SNAT behavior
  - Inspection policies for the port ranges
Root cause hypotheses (7 common failure scenarios in Hub–Spoke + Firewall)
Below are seven realistic scenarios we consider in this topology:
1) Required ports not fully allowed
Failover group replication requires TCP 5022 and TCP 11000–11999 to be open inbound/outbound between MI subnets.
Common failure modes:
- only 5022 opened (range missing)
- opened one direction only
- opened at NSG but blocked at firewall (or vice versa)
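The failure modes above can be caught with a small coverage check run against each enforcement point's allow rules. The following is a minimal sketch, not an Azure API: the `AllowRule` type and `uncovered_ranges` function are hypothetical names, and the model deliberately ignores source/destination prefixes, protocol, rule priority, and deny rules. It only answers the question "is any required port missing in this direction?".

```python
from dataclasses import dataclass

# Ports the article lists as required between the two MI subnets.
REQUIRED_RANGES = [(5022, 5022), (11000, 11999)]


@dataclass(frozen=True)
class AllowRule:
    """Hypothetical, simplified allow rule (not an Azure SDK type)."""
    direction: str   # "inbound" or "outbound"
    port_start: int
    port_end: int


def uncovered_ranges(rules, direction):
    """Return the required port ranges that no allow rule covers for a direction."""
    required = set()
    for start, end in REQUIRED_RANGES:
        required.update(range(start, end + 1))

    # Remove every port that some allow rule in this direction covers.
    for rule in rules:
        if rule.direction == direction:
            required.difference_update(range(rule.port_start, rule.port_end + 1))

    # Compress the remaining ports back into contiguous (start, end) ranges.
    gaps = []
    for port in sorted(required):
        if gaps and gaps[-1][1] == port - 1:
            gaps[-1][1] = port
        else:
            gaps.append([port, port])
    return [tuple(gap) for gap in gaps]


# Example: 5022 opened both ways, but the 11000-11999 range only outbound.
rules = [
    AllowRule("inbound", 5022, 5022),
    AllowRule("outbound", 5022, 5022),
    AllowRule("outbound", 11000, 11999),
]
for direction in ("inbound", "outbound"):
    print(direction, uncovered_ranges(rules, direction))
```

Running the same check separately for the NSG on each MI subnet and for the hub firewall policy makes the "opened at NSG but blocked at firewall" case visible as a gap at one enforcement point only.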
2) NSG rules correct, but firewall policy overrides them
Even if NSGs are correct, a hub firewall may still deny or partially allow flows based on rule order or zone policies.
3) Asymmetric routing due to UDRs
If the outbound path is forced through the firewall but the return path is different, stateful firewalls will drop return traffic.
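One way to reason about this before deployment is to compare the next hop each spoke would choose for the other spoke's MI address. The sketch below assumes a simplified model: each route table is a list of user-defined `(prefix, next_hop)` pairs, selection is by longest prefix match only, and both directions are expected to traverse the same firewall IP. The function names and all addresses are hypothetical, and real Azure route selection also involves system and BGP routes that this model omits.

```python
import ipaddress


def next_hop(route_table, destination):
    """Pick the next hop for a destination by longest prefix match.

    route_table is a list of (prefix, next_hop) pairs; returns None when
    no user-defined route matches (the platform default would apply).
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in route_table:
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, hop)
    return best[1] if best else None


def check_symmetry(primary_routes, secondary_routes, primary_ip, secondary_ip):
    """Return (is_symmetric, forward_hop, reverse_hop) for the MI-to-MI path."""
    forward = next_hop(primary_routes, secondary_ip)
    reverse = next_hop(secondary_routes, primary_ip)
    return forward == reverse, forward, reverse


# Hypothetical example: the primary spoke sends traffic to the firewall
# (10.0.0.4), but the secondary spoke has a more specific route that
# bypasses it on the way back.
primary_routes = [("10.2.0.0/16", "10.0.0.4")]
secondary_routes = [("0.0.0.0/0", "10.0.0.4"), ("10.1.0.0/16", "vnet-peering")]

print(check_symmetry(primary_routes, secondary_routes, "10.1.0.5", "10.2.0.5"))
```

In this example the forward path goes through the firewall while the return path does not, which is exactly the pattern a stateful firewall would treat as an unknown session.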
4) SNAT/NAT applied on east‑west traffic
Some firewall designs SNAT spoke-to-spoke flows. Replication initialization can fail if the expected addressing/identity is changed in transit.
5) Session idle timeout on firewall/NVA
Replication flows can be long-lived; aggressive idle timeouts can interrupt establishment or maintenance.
6) DNS zone mismatch / DNS zone ID not reused (Doc prerequisite)
Both instances must be in the same DNS zone; the secondary must be created using the primary instance’s DNS zone ID. Once assigned, the DNS zone can’t be modified.
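Because the DNS zone appears in the instance host name (`<mi_name>.<dns_zone>.database.windows.net`), a quick comparison of the two host names can confirm this prerequisite. This is a minimal sketch assuming the public Azure cloud suffix; the function names and the example host names are hypothetical.

```python
def dns_zone(fqdn):
    """Extract the DNS zone label from a SQL MI host name.

    Assumes the public-cloud format <mi_name>.<dns_zone>.database.windows.net;
    other clouds use a different suffix and are not handled here.
    """
    labels = fqdn.strip().rstrip(".").lower().split(".")
    if len(labels) < 5 or labels[-3:] != ["database", "windows", "net"]:
        raise ValueError(f"unexpected SQL MI host name: {fqdn!r}")
    # The zone is the label immediately before the fixed suffix.
    return labels[-4]


def same_dns_zone(primary_fqdn, secondary_fqdn):
    """True when both instances report the same DNS zone label."""
    return dns_zone(primary_fqdn) == dns_zone(secondary_fqdn)


# Hypothetical host names for illustration only.
primary = "mi-primary.a1b2c3d4e5f6.database.windows.net"
secondary = "mi-secondary.a1b2c3d4e5f6.database.windows.net"
print(same_dns_zone(primary, secondary))
```

A mismatch found here cannot be fixed in place, since the DNS zone is not modifiable after assignment; the secondary would need to be recreated with the primary's DNS zone ID.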
7) Address space overlap
The primary and secondary VNet ranges must not overlap (including peered VNets).
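The overlap check is easy to automate with Python's standard `ipaddress` module. The sketch below compares every prefix on the primary side (including peered VNets) with every prefix on the secondary side; the function name and the address ranges are hypothetical examples.

```python
import ipaddress
from itertools import product


def overlapping_prefixes(primary_prefixes, secondary_prefixes):
    """Return every (primary, secondary) prefix pair whose ranges overlap."""
    overlaps = []
    for a, b in product(primary_prefixes, secondary_prefixes):
        if ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b)):
            overlaps.append((a, b))
    return overlaps


# Hypothetical address plans: list each VNet's address space plus the
# address spaces of any VNets peered to it.
primary_side = ["10.1.0.0/16", "10.0.0.0/24"]
secondary_side = ["10.1.128.0/17"]

print(overlapping_prefixes(primary_side, secondary_side))
```

An empty result means the address plans are disjoint; any returned pair identifies the exact prefixes that need to be re-planned before the failover group is created.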
Resolution (what we changed to make FOG creation succeed)
We applied changes in the following order to reduce risk and isolate impact:
Step 1 — Make the documented prerequisites true end-to-end
- Confirmed non-overlapping address spaces across the two spoke VNets.
- Confirmed the secondary MI was empty and configured with matching tier/size.
- Confirmed the secondary MI used the primary DNS zone ID, ensuring both instances were in the same DNS zone.
Step 2 — Ensure port openness across all enforcement points
- Ensured both NSGs and firewall policies allowed TCP 5022 and TCP 11000–11999, inbound and outbound, between the two SQL MI subnets.
Step 3 — Stabilize the routed path through the Hub firewall
- Ensured symmetric routing (same path both directions) for MI subnet ↔ MI subnet flows.
- Reviewed firewall policies to avoid unintended translation or disruption of east‑west traffic.
Step 4 — Validate using SQL MI-aware connectivity tests
- Verified required port reachability using the SQL MI failover group connectivity testing approach (tests run from within SQL MI, for example as SQL Agent job scripts).
Result
After implementing the changes, the failover group was created successfully, seeding completed, and databases started replicating from primary to secondary.
Lessons learned (what to check before deploying in Hub–Spoke + Firewall)
If your customer uses centralized firewalling in hub–spoke, bake these into your pre-flight checklist:
Pre-flight checklist (recommended)
- Address space planning: No overlap across primary/secondary VNets and peered VNets.
- DNS zone planning: Secondary must reuse the primary DNS zone ID; DNS zone cannot be modified later.
- Ports: Allow TCP 5022 and TCP 11000–11999 inbound/outbound between MI subnets across NSG + firewall.
- Topology risk reduction: Keep MI-to-MI traffic path stable; avoid unnecessary inspection/transformation.
- Validation: Use SQL MI-aware connectivity validation (not just generic network tests).