Apr 01 2020 04:09 AM
Dear Support,
We have a 4-node failover cluster. When I unplug both power cords of one of the nodes, the VMs on that node go into an Unmonitored state and fail to move to another host, and the node becomes Isolated.
svr1: 2 VMs (mgtsvr1: 8 vCPUs, 24 GB RAM; adsvr1: 8 vCPUs, 16 GB RAM)
svr2: 2 VMs (recsvr1: 14 vCPUs, 16 GB RAM; failsvr2: 14 vCPUs, 16 GB RAM)
svr3: 2 VMs (recsvr2: 14 vCPUs, 16 GB RAM; failsvr1: 14 vCPUs, 16 GB RAM)
svr4: 1 VM (fmgtsvr1: 8 vCPUs, 24 GB RAM)
svr1-4 server specs are as follows:
RAM: 64 GB
CPU: Xeon Gold 6134 3.20 GHz x2
Disk: 600 GB
Please advise.
Apr 02 2020 01:53 AM - edited Apr 02 2020 01:57 AM
Make sure you have a cluster witness correctly configured, and run failover cluster validation to confirm everything is green.
If the validation shows no errors, all warnings are accounted for, and the cluster witness is configured correctly and visible from all nodes at all times, try your scenario again.
Additionally, make sure the storage where you placed the VMs' configuration files and VHDs is visible and usable from all nodes.
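As a quick sanity check of the points above, a PowerShell sketch using the FailoverClusters module (run on any cluster node; verify cmdlet availability on your build):

```powershell
# Full cluster validation report (also exercises storage and network checks)
Test-Cluster

# Current quorum/witness configuration
Get-ClusterQuorum

# All nodes should report State = Up
Get-ClusterNode

# Cluster Shared Volumes should be Online and visible cluster-wide
Get-ClusterSharedVolume
```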
I forgot: make sure that all your VMs can run on any node. Test this with live migration: try to live-migrate every VM from its node to every other node, and do this for every node and every VM. If, for example, your VMs use static RAM and no other node has enough free memory to start a failed VM, it will not fail over. You should be able to pause and drain any one node at any time and still have a consistent failover possible.
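The every-VM-to-every-node live-migration test above can be scripted. A minimal sketch, assuming all clustered VM roles are eligible to run anywhere (FailoverClusters module cmdlets):

```powershell
# Attempt to live-migrate every clustered VM role to every other node.
# A failure here (e.g. not enough static RAM on the target) flags a VM
# that cannot fail over to that node.
foreach ($vmName in (Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine').Name) {
    foreach ($node in Get-ClusterNode) {
        # Re-read the current owner, since each successful move changes it
        $owner = (Get-ClusterGroup -Name $vmName).OwnerNode.Name
        if ($node.Name -ne $owner) {
            Move-ClusterVirtualMachineRole -Name $vmName -Node $node.Name -MigrationType Live
        }
    }
}
```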
Apr 08 2020 12:53 PM
Does the virtual machine ever move after a period of time? In Windows Server 2016 Failover Clustering, we added Compute and VM Resiliency. It was for those cases where a node or its storage has a transient error and loses connectivity for a brief period. Prior to Windows Server 2016, if a node went down, everything on it would move over (powered off and back on), which caused issues if the node was only down for, say, 30 seconds. With resiliency, the VM either continues running (if the physical disk is still accessible) or goes into a Paused-Critical state until the drive comes back, as long as that happens within the allowed window; once the node is back and rejoined, the VM starts again. The setting for this is ResiliencyDefaultPeriod, and the default period is 240 seconds (4 minutes). So it would be 4 minutes before the VM would be moved to another node.
To see what this is set for would be the commands:
Cluster property:
(Get-Cluster).ResiliencyDefaultPeriod
Group common property for granular control:
(Get-ClusterGroup "My VM").ResiliencyPeriod
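If four minutes is too long for your environment, these properties can also be lowered by assignment. A sketch with illustrative values (the "My VM" group name is an example; test any change before relying on it in production):

```powershell
# Cluster-wide default, in seconds (shipped default is 240)
(Get-Cluster).ResiliencyDefaultPeriod = 120

# Per-VM override for one clustered role
(Get-ClusterGroup "My VM").ResiliencyPeriod = 120

# Related property: how the cluster decides when to isolate a node
(Get-Cluster).ResiliencyLevel
```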