Apr 11 2023 01:08 AM - edited Apr 11 2023 01:13 AM
Hello Friends
I've been testing Cluster Failover for a couple of days.
I have a problem with virtual machines, or maybe everything works as designed and I'm just bothering you. I have two nodes. I hard power off the node that currently hosts the virtual machine. After a few seconds, the cluster detects the failure. It does not start the virtual machine on the other node right away, but puts it in the "Unmonitored" state. Only after a few minutes does it start the virtual machine on the second node. Can this be sped up somehow?
If it matters, the quorum is based on the iSCSI drive.
Thank you in advance for your help.
Apr 11 2023 03:31 AM
There are several steps you can take to speed up the cluster failover process:
Optimize the Network: Ensure that the network between the cluster nodes is optimized for performance. This can include using high-speed network adapters, optimizing network settings, and ensuring that there are no bottlenecks in the network.
Allocate Sufficient Resources: Ensure that the cluster nodes have sufficient resources, including CPU, memory, and storage, to handle the failover process. If resources are limited, the failover process may take longer.
Apr 11 2023 08:50 AM
Solution: @Darvin1705 Running Get-Cluster | fl will show you a parameter called "ResiliencyDefaultPeriod", which is set to 240 seconds (4 minutes). During a node outage, the cluster waits up to 4 minutes for the failing node to become responsive again before taking action. This is the period during which the VMs remain in the "Unmonitored" state. You should not change this parameter, because in a real scenario 4 minutes is roughly what devices like network switches need to reboot and recover. Only after that does the cluster conclude that something is seriously wrong and start failing over the VMs to the surviving, responsive nodes.
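If it helps, here is a minimal, read-only sketch for checking this value without scrolling through the full Get-Cluster | fl output. It assumes the FailoverClusters PowerShell module is installed and that you run it on one of the cluster nodes. The extra properties listed next to ResiliencyDefaultPeriod are my own selection of related settings, not something mentioned above.

```powershell
# Assumes the FailoverClusters module is available (it ships with the Failover Clustering feature/RSAT).
Import-Module FailoverClusters

# Read-only: list only the resiliency-related cluster properties.
# ResiliencyDefaultPeriod is in seconds; the reply above quotes 240.
Get-Cluster | Format-List Name, ResiliencyLevel, ResiliencyDefaultPeriod, QuarantineThreshold, QuarantineDuration
```

This only displays the settings; it does not change anything on the cluster.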