Forum Discussion
SQL Server AG Failover - Automatic Failover
Hello,
I am looking for a straight and definitive answer that I was hoping someone could answer for me. I want to trust what Copilot says, but I really need to hear it from Microsoft and I can't find any documentation from Microsoft confirming my question.
My Environment:
- 2 replicas in datacenter 1
- 1 replica in datacenter 2
All three (3) replicas are set to synchronous-commit mode with automatic failover.
I tested the failover manually between all three (3) replicas without issue.
When I test the automatic failover - I take down both replicas in datacenter 1 at the same time to simulate a datacenter outage. I look at the replica in datacenter 2 and it is just says (Resolving...) next to the replica name. The replica does not come online and the DB is not moved.
When I was searching I couldn't find out why. So I turned to Copilot not solve the issue, but to see if it could point me in the right direction.
I tell Copilot my setup and what happened. Copilot responded stating that by design from Microsoft you cannot have more than two (2) replicas set to synchronous-commit mode with automatic failover in a SQL Server AG instance. That if more than two (2) are set for automatic failover. The SQL Server AG will use the first two (2) replicas it sees in its metadata and ignore the rest. Copilot went into detail about why this is designed this way, but the amount of information would make this post longer than it already is.
If this is true - then when I took down both replicas in datacenter 1, SQL Server AG only saw those two (2) replicas in datacenter 1 as the available replicas to use for an automatic failover and thus why the replica in datacenter 2 did not come online and the DB not being moved
So let's do a test.
I brought back up the two (2) replicas in datacenter 1. Then I made a change in the AG proprieties. I set the 2nd replica in datacenter 1 to manual. So 1 replica is set to automatic failover and 1 replica is set to manual failover in datacenter 1. The replica in datacenter 2 is set to automatic failover
I then take down both replicas in datacenter 1 again to simulate the "outage" and the replica in datacenter 2 comes online and the DB is moved.
So is Copilot right? Can there only be two (2) replicas allowed to have/use automatic failover? I cannot find a definitive answer confirming this.
Or is my configuration wrong/missing something and if it is, could you please point me in the right direction on how to get this resolved?
2 Replies
- SivertSolemIron Contributor
Microsoft documentation claims you can have up to five replicas with synchronous commit and automatic failover.
SQL Server 2019 (15.x) increases the maximum number of synchronous replicas to 5, up from 3 in SQL Server 2017 (14.x). You can configure this group of five replicas to have automatic failover within the group. There's one primary replica, plus four synchronous secondary replicas.
What is an Always On availability group? - SQL Server Always On | Microsoft Learn
I believe in your first scenario, you met this constraint:
Automatic failover requires that both the primary replica and the target secondary replica run under synchronous-commit mode with the failover mode set to Automatic. In addition, the secondary replica must already be synchronized, have WSFC quorum, and meet the conditions specified by the flexible failover policy of the availability group.
Scenario 1:
- All three replicas set to automatic failover (2 on Site 1, 1 on Site 2)
- Stop both nodes on Site 1
- Last node in resolving state
I would assume that in this scenario, the failover cluster first moved the primary to the second replica on Site 1.
When that replica was removed, the WSFC no longer had quorum, and the replica on Site 2 could not be elected as primary. (or it hadn't re-synchronized after initial failover)
The loss of quorum would be especially true if you brought down both site 1 nodes simultaneously.
Scenario 2:
- One replica Site 1 and Site 2 automatic failover, one replica Site 1 manual failover
- Stop both nodes on Site 1
- Last node is primary
For this scenario, if you brought down the Site 1 nodes in series, starting with the node configured for automatic failover, WSFC had quorum and the replicas were synchronized. Failover occurred when the first node went down.
--
A Secondary replica will never promote to primary, if it cannot reach quorum.
A Primary replica may stay as a primary replica, if it still has the votes (otherwise it'll also stop)
Quorum is based on votes. For SQL Server Availability Groups, all replicas that participate in automatic failover should have a vote. You may have configured a "witness". It has voting power, but cannot run cluster resources (and does not store cluster configuration).
The necessary votes to have quorum is "(all possible votes / 2) + 1".
For three nodes, that means two votes required.These measures are in place to ensure that you do not end up with two replicas acting as primaries (split-brain).
That could result in replicas getting different updates, and you get a mess on your hands. - iamaamirloneCopper Contributor
It supports automatic failure only between one primary and one secondary. Even if you set multiple replicas to automatic, you will see only one is honored. You only need to configure one synchronous secondary with automatic fail over, keep the all others set to manual.