Oct 11 2023 02:11 AM - edited Oct 11 2023 02:19 AM
I have an environment where we had;
Just out of curiosity, can MSSQL Server Cluster Resource/Group can stop itself when it notice on OS level where the SQL Server Cluster lost connection to the storage?
Because MSSQL Server connection to SQL server to iSCSI storage has intermittent connection issues due to backend having maintenance/patching on the software/hardware level.
I read on another forum post here, he stated SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete, thus caused the cluster resource 'SQL Server (Name)' (resource type 'SQL Server', DLL 'sqsrvres.dll') to exceed its time-out threshold.
Does above forum cases is similar to the lost connection to the iSCSI storage, since all data/log is inside there when ever the application run do write/read query process to the database.
If that the case, is there any timeout that we can change to increase before SQL cluster decide to failover to stop the SQL services. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.
If my question or information is not detail feel free to ask about it. Hope can hear from you guys advice on this.
Thanks.
Apr 07 2024 11:39 PM
@afiqazmi31 - Hi just checking if you made any progress on this. We are experiencing identical issues.
Apr 09 2024 02:02 AM
@afiqazmi31 @RodVK If you're willing to accept higher timeouts before automatic failovers, it appears that you want to adjust the Health Check Timeout.
select name, health_check_timeout from sys.availability_groups
As described in this Learn article, sp_server_diagnostics returns results at 1/3 of the timeout threshold.
Configure a flexible automatic failover policy for an availability group - SQL Server Always On | Mi...
sp_server_diagnostics is what checks the io subsystem.
In other words, to increase the acceptable timeout, set health_check_timeout to a higher value (milliseconds).
ALTER AVAILABILITY GROUP <AG> SET (HEALTH_CHECK_TIMEOUT = 60000);
Apr 09 2024 05:04 PM
Hi @SivertSolem thanks for the reply but we are using Failover cluster instance, and the rhs.exe is bugchecking after storage seems to disappear after a failover. Not using Always on Availability groups.
Apr 10 2024 12:56 AM
Aah, apologies. That should have been given by using iSCSI as quorum.
The health check timeout is still configurable, but in this case in the Failover Cluster Configuration.
Powershell
Import-Module FailoverClusters
$fci = "SQL Server (INST1)"
Get-ClusterResource $fci | Set-ClusterParameter HealthCheckTimeout 60000
T-SQL
ALTER SERVER CONFIGURATION
SET FAILOVER CLUSTER PROPERTY HealthCheckTimeout = 60000;