Microsoft Tech Community

Failover cluster fails to move VMs to another node when a node goes down


Dear Support,

 

We have a failover cluster made up of four server nodes. When I unplug both power cords of one node, the VMs on that node go into the Unmonitored state and the cluster fails to move them to another host. The node itself then shows as Isolated.

svr1: 2 VMs (mgtsvr1 (8 vCPUs, 24 GB RAM), adsvr1 (8 vCPUs, 16 GB RAM))

svr2: 2 VMs (recsvr1 (14 vCPUs, 16 GB RAM), failsvr2 (14 vCPUs, 16 GB RAM))

svr3: 2 VMs (recsvr2 (14 vCPUs, 16 GB RAM), failsvr1 (14 vCPUs, 16 GB RAM))

svr4: 1 VM (fmgtsvr1 (8 vCPUs, 24 GB RAM))

The spec of each server (svr1 to svr4) is as follows:

RAM: 64 GB

CPU: 2 x Xeon Gold 6134 3.20 GHz

ROM: 600 GB

Please advise

 

 

 

3 Replies
Hi all,

Has anyone else run into this situation?

Make sure you have a cluster witness correctly configured and run the failover cluster validation to make sure everything is in the green.

If the validation shows no errors and all warnings are accounted for, and your cluster witness is configured correctly and visible from all nodes all the time, try your scenario again.
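
The witness and validation checks above could be run roughly as follows. This is a minimal sketch, assuming it is run in an elevated PowerShell session on one of the cluster nodes with the FailoverClusters module installed. Validation can be intrusive for storage, so a maintenance window is the safer choice.

```powershell
# Sketch only: run on a cluster node in an elevated session.
Import-Module FailoverClusters

# Show the quorum configuration, including the witness resource (if any).
Get-ClusterQuorum | Format-List *

# Show each node's state and its quorum vote.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

# Run cluster validation against the local cluster; a report file is produced.
Test-Cluster
```

If no witness resource is listed, or a node shows no vote, that is worth fixing before repeating the power-off test.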

 

Additionally, make sure the storage that holds the VMs' configuration files and VHDs is visible and usable from all nodes.
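
One way to check storage visibility, assuming the VMs are stored on Cluster Shared Volumes (the original post does not say which storage type is used), is sketched below.

```powershell
# Sketch only: assumes the VM files live on Cluster Shared Volumes (CSV).
Import-Module FailoverClusters

# List each CSV with its current state and owner node.
Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode

# Show how every node sees every CSV (direct access vs. redirected I/O).
Get-ClusterSharedVolumeState |
    Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason
```

A volume that is unavailable or redirected on one node only would point to a storage path problem on that node.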

 

I forgot: make sure that all your VMs can run on any node. Test this with live migration: try to live migrate every VM from each node to every other node. If, for example, your VMs use static RAM and no other node has enough free memory to start the failed VM, it will not fail over. You should be able to pause and drain any one node at any time; only then is a consistent failover possible.
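
The live migration test described above could be scripted along these lines. This is a rough sketch rather than a definitive procedure: it assumes all nodes are up, that live migration is already configured, and that it is acceptable to move production VMs around while it runs.

```powershell
# Sketch only: live migrates every clustered VM across all nodes, then back.
Import-Module FailoverClusters

$nodes    = Get-ClusterNode | Where-Object State -eq 'Up'
$vmGroups = Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine'

foreach ($vm in $vmGroups) {
    # Remember where the VM started so it can be returned afterwards.
    $originalNode = $vm.OwnerNode.Name
    $targets = $nodes | Where-Object Name -ne $originalNode

    foreach ($target in $targets) {
        try {
            Move-ClusterVirtualMachineRole -Name $vm.Name -Node $target.Name `
                -MigrationType Live -ErrorAction Stop | Out-Null
            Write-Host "OK:     $($vm.Name) -> $($target.Name)"
        }
        catch {
            # A failure here (for example, not enough memory) is what would
            # also block an automatic failover to that node.
            Write-Warning "FAILED: $($vm.Name) -> $($target.Name): $_"
        }
    }

    # Move the VM back to the node it started on.
    Move-ClusterVirtualMachineRole -Name $vm.Name -Node $originalNode `
        -MigrationType Live | Out-Null
}
```

The pause-and-drain check can be done per node with Suspend-ClusterNode -Drain followed by Resume-ClusterNode once the node is empty.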

@Pang_Lau 


Does the virtual machine ever move after a period of time? In Windows Server 2016 Failover Clustering, we added Compute and VM Resiliency. It is meant for cases where a node or its storage has a transient error and loses connectivity for a brief period. Prior to Windows Server 2016, if a node went down, all of its VMs would move over (powered off and on again). This caused issues if the node was only down for, say, 30 seconds, because the VMs were restarted even though the node was back and had rejoined the cluster shortly afterwards.

With resiliency, the VM either continues running during this time if its physical disk is still accessible, or goes into a paused-critical state until the drive comes back, as long as that happens within the allowed time period. The setting for this is ResiliencyPeriod (ResiliencyDefaultPeriod at the cluster level), and the default is 240 seconds (4 minutes). So it would be 4 minutes before the VM is moved to another node.

 

To see what this is set to, use these commands:

 

Cluster property: 

(Get-Cluster).ResiliencyDefaultPeriod

Group common property for granular control: 

(Get-ClusterGroup "My VM").ResiliencyPeriod
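
To watch this behavior during the power-off test, something like the following could be run from a surviving node. It is a sketch only, assuming Windows Server 2016 or later. The commented-out line at the end is an illustrative example of shortening the period, not a recommendation.

```powershell
# Sketch only: run from a node that is still up during the test.
Import-Module FailoverClusters

# Cluster-wide resiliency and quarantine settings.
Get-Cluster | Format-List ResiliencyLevel, ResiliencyDefaultPeriod,
    QuarantineThreshold, QuarantineDuration

# Node states: the powered-off node is expected to show as Isolated first.
Get-ClusterNode | Format-Table Name, State

# VM groups with their owner, state, and per-group resiliency period.
Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' |
    Format-Table Name, OwnerNode, State, ResiliencyPeriod

# Illustrative only: shorten the cluster-wide period to 60 seconds.
# (Get-Cluster).ResiliencyDefaultPeriod = 60
```

If the VMs do move to another node once the period has elapsed, the cluster is behaving as designed and only the waiting time is in question.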