Forum Discussion
neajmorshad
Feb 10, 2025 · Copper Contributor
SQL Server Distributed AG's Forwarder Is Not Syncing After Primary AG's Internal Failover
I have set up a SQL Server Distributed Availability Group (DAG) in Kubernetes using SQL Server on Ubuntu images. The setup consists of two availability groups (AGs) across two separate clusters:
Setup Details:
Primary Cluster (AG1) Pods: ag1-0 (primary), ag1-1, ag1-2. The primary is exposed via a LoadBalancer service.
Remote Cluster (AG2) Pods: ag2-0 (primary of AG2, acting as the DAG forwarder), ag2-1, ag2-2. The forwarder (ag2-0) is exposed via a LoadBalancer service.
Distributed AG Configuration:
AG1 and AG2 are part of the DAG. Each AG’s primary is dynamically selected using the pod's label role=primary. LISTENER_URL in the DAG configuration points to the LoadBalancer service of each AG.
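For context, the DAG definition looks roughly like this (the LoadBalancer hostnames and port are placeholders, and the exact mode values shown are illustrative, not copied from my actual setup):

```sql
-- Run on the global primary (AG1's primary replica).
-- LISTENER_URL points at each AG's LoadBalancer service, not an individual pod.
CREATE AVAILABILITY GROUP [DAG]
   WITH (DISTRIBUTED)
   AVAILABILITY GROUP ON
      'AG1' WITH
      (
         LISTENER_URL = 'tcp://ag1-lb.cluster1.example:5022',  -- placeholder address
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
         FAILOVER_MODE = MANUAL,
         SEEDING_MODE = AUTOMATIC
      ),
      'AG2' WITH
      (
         LISTENER_URL = 'tcp://ag2-lb.cluster2.example:5022',  -- placeholder address
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
         FAILOVER_MODE = MANUAL,
         SEEDING_MODE = AUTOMATIC
      );
-- AG2's side is then joined with ALTER AVAILABILITY GROUP [DAG] JOIN ... on its primary (the forwarder).
```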
Issue: DAG Not Syncing After AG1 Failover
For testing, I triggered a failover in AG1 using:
`ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;`
The global primary changed from ag1-0 to ag1-1, and I updated the role=primary label accordingly (removed it from ag1-0, added it to ag1-1). However, AG2 (the forwarder and its replicas) stopped syncing and became unhealthy.
From ag2-0 (forwarder) logs, I only see connection timeouts and disconnections from the global primary. AG2 is not automatically reconnecting to the new primary (ag1-1), even though the LoadBalancer service in LISTENER_URL now points to ag1-1.
Logs from ag2-0 (forwarder) show:
A connection timeout has occurred while attempting to establish a connection to GLOBAL PRIMARY. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance
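To see the state from the forwarder's side, a query along these lines against the standard HADR DMVs shows the DAG replicas' connection and sync health (the filter on is_distributed isolates the DAG rows; run on ag2-0):

```sql
-- Run on the forwarder (ag2-0): connection state and last connect error
-- for the distributed AG's members (AG1 and AG2 appear as replicas of the DAG).
SELECT ag.name                         AS dag_name,
       ar.replica_server_name          AS member_ag,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc,
       ars.last_connect_error_description
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
  ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states ars
  ON ars.replica_id = ar.replica_id
WHERE ag.is_distributed = 1;
```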
Steps I Tried:
- Checked DAG Configuration – The LISTENER_URL is correctly set to the LoadBalancer of AG1, which now points to ag1-1 (see the metadata query sketched after this list).
- Ran the Resume Command:
`ALTER DATABASE [agtestdb] SET HADR RESUME;`
This did not resolve the issue.
- Verified Network Connectivity
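For the DAG configuration check above, something like the following shows the LISTENER_URL that is actually stored for each member AG (standard AG metadata views; nothing here is specific to this setup):

```sql
-- Show the endpoint URL (the LISTENER_URL value) and seeding mode stored
-- for each member AG of the distributed availability group.
SELECT ag.name                AS dag_name,
       ar.replica_server_name AS member_ag,
       ar.endpoint_url        AS listener_url,
       ar.seeding_mode_desc
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
  ON ar.group_id = ag.group_id
WHERE ag.is_distributed = 1;
```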
Questions:
- What steps are required to ensure AG2 correctly syncs with the new global primary (ag1-1) after an internal failover within AG1?
- Is there a specific command that needs to be run on the forwarder (ag2-0) or the new global primary (ag1-1) to reestablish synchronization?
- Why isn’t AG2 automatically reconnecting, even though the LoadBalancer service points to the correct primary?
- Are there any best practices for handling SQL Server DAG failovers in Kubernetes?
Any insights would be greatly appreciated!
- SivertSolem · Iron Contributor
`ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;`
There's your starting issue. From the documentation: "Forces failover of the availability group, with possible data loss, to the failover target. The failover target will take over the primary role and recover its copy of each database and bring them online as the new primary databases. On any remaining secondary replicas, every secondary database is suspended until manually resumed. When the former primary replica becomes available, it will switch to the secondary role, and its databases will become suspended secondary databases."
ALTER AVAILABILITY GROUP (Transact-SQL) - SQL Server | Microsoft Learn
What you most likely wanted to do was the `ALTER AVAILABILITY GROUP [AG1] FAILOVER;` command.
FORCE_FAILOVER_ALLOW_DATA_LOSS is only for situations where the primary is unavailable, automatic failover has not occurred, and the regular FAILOVER command does not work.
As for why resuming doesn't work, I'll have to admit I don't personally have sufficient experience with DAGs.
I would suggest verifying whether the databases are suspended on the primary as well, and running a MODIFY on your global primary to reset your modes.
Resetting SEEDING_MODE is how you'd restart (or cancel) automatic seeding attempts, for example.

```sql
ALTER AVAILABILITY GROUP [DAG1]
MODIFY AVAILABILITY GROUP ON
   'AG1' WITH
   (
      AVAILABILITY_MODE = YourCurrentMode,
      FAILOVER_MODE = YourCurrentMode,
      SEEDING_MODE = YourCurrentMode
   ),
   'AG2' WITH
   (
      AVAILABILITY_MODE = YourCurrentMode,
      FAILOVER_MODE = YourCurrentMode,
      SEEDING_MODE = YourCurrentMode
   );
GO
```
- neajmorshad · Copper Contributor
As I am handling failover manually, the availability groups are created with `FAILOVER_MODE = MANUAL` (following etcd's Raft implementation for leader election and failover decisions), and I only fail over when the primary is unavailable. So `FORCE_FAILOVER_ALLOW_DATA_LOSS` needs to be used.
My primary AG (AG1) is properly synced and healthy (the databases were suspended after running the force failover command, but I resumed them manually; the old primary ag1-0 has also rejoined successfully). The new primary is taking new writes, and all the secondaries, including the old primary ag1-0, are syncing with the latest global primary ag1-1. So I guess it's not an issue with the `FORCE_FAILOVER_ALLOW_DATA_LOSS` command.
I also tried `ALTER AVAILABILITY GROUP [DAG] MODIFY` on the global primary to reset the `SEEDING_MODE` and restart automatic seeding, which didn't solve the issue either.
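In case it helps narrow things down, this is the kind of database-level check that can be run on the forwarder to see why data movement is suspended (standard HADR DMV; nothing here is specific to my setup):

```sql
-- Run on the forwarder (ag2-0): per-database sync state and suspend reason
-- for the local replica's databases.
SELECT DB_NAME(drs.database_id)       AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.suspend_reason_desc,
       drs.last_hardened_lsn,
       drs.last_redone_time
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1;
```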