Forum Discussion
Availability Group failover issue
This is a solid breakdown of the issue, and you’ve already done some of the right troubleshooting steps. The error message during the attempted failover strongly suggests that, despite both replicas showing as synchronized beforehand, at the moment of failover one or more databases on Node B were not in a synchronized, failover-ready state or had not fully joined the availability group. This can happen when synchronization health looks fine on the surface but the latest log records haven’t yet been hardened on the secondary.
A few things to consider:
✅ Regularly validate database join/sync state – Get-DbaAgDatabase (from dbatools) or sys.dm_hadr_database_replica_states can give a more accurate picture of database readiness than the AG dashboard (see the query sketch after this list).
🛠️ Check for tempdb / I/O anomalies – Since the trigger was an I/O error in tempdb, it’s worth validating the storage subsystem for intermittent corruption or latency, which can ripple into the AG’s stability (latency check below).
🔄 Health detection & failover sensitivity – The cluster may attempt failover before the secondary is fully ready. Adjusting the failure conditions, as you’ve already started with the role’s “maximum failures” and failover-period settings, can reduce false or premature failovers (AG-level options shown below).
🕵️ Verbose logging during failover attempts – Review the AlwaysOn_health extended events session (and the Windows cluster log) to catch the precise database state at failover time (XE query below).
🧩 Node C configuration – Since Node C is manual failover only, make sure quorum and witness settings still give the cluster a healthy vote configuration; failback behavior is sometimes influenced by how the cluster perceives node votes (vote check below).
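For the join/sync-state check, a rough T-SQL sketch you can run on the primary is below. It joins sys.dm_hadr_database_replica_states with sys.dm_hadr_database_replica_cluster_states (which is where is_failover_ready lives); no specific database or AG names are assumed:

```sql
-- Per-database failover readiness across every replica in the AG.
-- For automatic failover to succeed, the secondary's databases should show
-- is_failover_ready = 1 and synchronization_state_desc = 'SYNCHRONIZED'.
SELECT
    ag.name                        AS ag_name,
    ar.replica_server_name,
    dcs.database_name,
    dcs.is_database_joined,
    dcs.is_failover_ready,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.log_send_queue_size,
    drs.redo_queue_size
FROM sys.dm_hadr_database_replica_cluster_states AS dcs
JOIN sys.dm_hadr_database_replica_states AS drs
    ON  dcs.replica_id        = drs.replica_id
    AND dcs.group_database_id = drs.group_database_id
JOIN sys.availability_replicas AS ar
    ON dcs.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag
    ON ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name, dcs.database_name;
```

Any database showing is_failover_ready = 0 or a growing redo queue right before a failover is the one the cluster will refuse to bring online automatically.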
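For the tempdb / I/O point, average stall times per file are a quick first look at storage latency. The numbers are cumulative since the last restart, so take two snapshots a few minutes apart if you want a current rate:

```sql
-- Average I/O stall (ms) per tempdb file since the last SQL Server restart.
-- Sustained high write stalls point at the storage layer rather than the AG.
SELECT
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    CASE WHEN vfs.num_of_reads  = 0 THEN 0
         ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_stall_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
JOIN sys.master_files AS mf
    ON  vfs.database_id = mf.database_id
    AND vfs.file_id     = mf.file_id
ORDER BY avg_write_stall_ms DESC;
```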
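On failover sensitivity, the cluster role’s “maximum failures in the specified period” is one knob; the AG itself has two more that control how aggressively the health check declares the instance unhealthy. The values below are only illustrative and [MyAG] is a placeholder for your AG name:

```sql
-- FAILURE_CONDITION_LEVEL controls which conditions can trigger an automatic
-- failover (1 = least sensitive, 5 = most sensitive; 3 is the default).
ALTER AVAILABILITY GROUP [MyAG]
SET (FAILURE_CONDITION_LEVEL = 3);

-- HEALTH_CHECK_TIMEOUT is how long (in ms) sp_server_diagnostics can go silent
-- before the instance is treated as unresponsive; raising it from the default
-- 30000 ms makes the health check more tolerant of brief stalls.
ALTER AVAILABILITY GROUP [MyAG]
SET (HEALTH_CHECK_TIMEOUT = 60000);
```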
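For the logging point, the AlwaysOn_health session is normally present (and set to auto-start when the AG was created through the wizard) and writes .xel files to the instance’s LOG directory. A sketch that reads them, assuming the default file location and naming, is:

```sql
-- Recent events from the AlwaysOn_health extended events session.
-- With only a file name pattern, the function should read from the default
-- LOG directory of the instance.
SELECT
    x.event_xml.value('(event/@name)[1]',      'varchar(100)') AS event_name,
    x.event_xml.value('(event/@timestamp)[1]', 'datetime2(3)') AS event_time_utc,
    x.event_xml                                                 AS event_data
FROM sys.fn_xe_file_target_read_file('AlwaysOn_health*.xel', NULL, NULL, NULL) AS f
CROSS APPLY (SELECT CAST(f.event_data AS XML)) AS x(event_xml)
ORDER BY event_time_utc DESC;
```

Filter around the failover timestamp for state-change and error_reported events; they usually show exactly what state the databases were in when the cluster tried to move the role.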
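And for the Node C / quorum point, SQL Server exposes the cluster’s view of votes directly, so you can confirm the witness and each node’s vote without leaving SSMS:

```sql
-- Cluster members and their quorum votes as seen from SQL Server.
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;

-- Overall quorum model and current state.
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;
```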
In short, the issue doesn’t mean auto-failover is unreliable, but rather that the secondary replica wasn’t 100% failover-ready in that moment. Proactive health checks, storage validation, and cluster tuning should reduce the chance of this happening again.