SQL Server Failover Cluster Resource/Group stop itself due lost connection to storage

Copper Contributor

I have an environment where we had;

  • MSSQL Cluster
  • Data, Log and Quorom on iSCSI storage (AWS FSx Ontap)
  • Fileshare witness (AWS FSx Windows)

 

Just out of curiosity, can MSSQL Server Cluster Resource/Group can stop itself when it notice on OS level where the SQL Server Cluster lost connection to the storage?

 

Because MSSQL Server connection to SQL server to iSCSI storage has intermittent connection issues due to backend having maintenance/patching on the software/hardware level.

 

I read on another forum post here, he stated SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete, thus caused the cluster resource 'SQL Server (Name)' (resource type 'SQL Server', DLL 'sqsrvres.dll') to exceed its time-out threshold.

 

Does above forum cases is similar to the lost connection to the iSCSI storage, since all data/log is inside there when ever the application run do write/read query process to the database.

 

If that the case, is there any timeout that we can change to increase before SQL cluster decide to failover to stop the SQL services. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.

 

If my question or information is not detail feel free to ask about it. Hope can hear from you guys advice on this.

 

Thanks.

 

4 Replies

@afiqazmi31 - Hi just checking if you made any progress on this. We are experiencing identical issues. 

@afiqazmi31 @RodVK If you're willing to accept higher timeouts before automatic failovers, it appears that you want to adjust the Health Check Timeout.

select name, health_check_timeout from sys.availability_groups

As described in this Learn article, sp_server_diagnostics returns results at 1/3 of the timeout threshold.
Configure a flexible automatic failover policy for an availability group - SQL Server Always On | Mi...

 

sp_server_diagnostics is what checks the io subsystem.

 

In other words, to increase the acceptable timeout, set health_check_timeout to a higher value (milliseconds).

 

ALTER AVAILABILITY GROUP <AG> SET (HEALTH_CHECK_TIMEOUT = 60000);

Hi @SivertSolem thanks for the reply but we are using Failover cluster instance, and the rhs.exe is bugchecking after storage seems to disappear after a failover. Not using Always on Availability groups. 

@RodVK 

Aah, apologies. That should have been given by using iSCSI as quorum.

The health check timeout is still configurable, but in this case in the Failover Cluster Configuration.

https://learn.microsoft.com/en-us/sql/sql-server/failover-clusters/windows/configure-healthchecktime...


Powershell

Import-Module FailoverClusters  
  
$fci = "SQL Server (INST1)"  
Get-ClusterResource $fci | Set-ClusterParameter HealthCheckTimeout 60000

 T-SQL

ALTER SERVER CONFIGURATION   
SET FAILOVER CLUSTER PROPERTY HealthCheckTimeout = 60000;