Virtual Machine Compute Resiliency in Windows Server 2016

Former Employee

Mar 15, 2019

In today’s cloud-scale environments, commonly built on commodity hardware, transient failures have become more common than hard failures. In these circumstances, reacting aggressively to transient failures can cause more downtime than it prevents. Windows Server 2016 therefore introduces increased Virtual Machine (VM) resiliency to intra-cluster communication failures in your compute cluster.

Interesting Transient Failure Scenarios

The following are some potentially transient scenarios where it would be beneficial for your VM to be more resilient to intra-cluster communication failures:

Node disconnected: The cluster service attempts to connect to all active nodes. The disconnected (Isolated) node cannot talk to any node in an active cluster membership.

Cluster Service crash: The Cluster Service on a node is down. The node is not communicating with any other node.

Asymmetric disconnect: The Cluster Service is attempting to connect to all active nodes. The isolated node can talk to at least one node in active cluster membership.

New Failover Clustering States

In Windows Server 2016, three new states have been introduced to reflect the new Failover Cluster workflow in the event of transient failures:

A new VM state, Unmonitored, has been introduced in Failover Cluster Manager to reflect a VM that is no longer monitored by the cluster service.

Two new cluster node states have been introduced to reflect nodes which are not in active membership but were host to VM role(s) before being removed from active membership:

Isolated:
- The node is no longer in an active membership
- The node continues to host the VM role

Quarantine:
- The node is no longer allowed to join the cluster for a fixed time period (default: 2 hours)
- This action prevents flapping nodes from negatively impacting other nodes and the overall cluster health
- By default, a node is quarantined if it ungracefully leaves the cluster three times within an hour
- VMs hosted by the node are gracefully drained once quarantined
- No more than 25% of nodes can be quarantined at any given time

The node can be brought out of quarantine by running the Failover Clustering PowerShell cmdlet Start-ClusterNode with the -CQ or -ClearQuarantine flag.
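As a minimal sketch of that step, the commands below first list the cluster nodes and then clear quarantine on one of them. The node name "Node1" is hypothetical, and the StatusInformation column is an assumption about where the Isolated or Quarantined detail is reported; if your build does not expose that property, the column is simply empty.

```powershell
# Requires the FailoverClusters module (a cluster node, or a machine with the
# Failover Clustering management tools installed).
Import-Module FailoverClusters

# List every node with its current state.
# Assumption: StatusInformation carries the Isolated/Quarantined detail.
Get-ClusterNode | Format-Table Name, State, StatusInformation

# "Node1" is a hypothetical node name. -ClearQuarantine (alias -CQ) lifts the
# quarantine so the node is allowed to rejoin the cluster.
Start-ClusterNode -Name "Node1" -ClearQuarantine
```

Before clearing quarantine, it is worth confirming that the underlying cause of the repeated failures has been addressed; otherwise the node may be quarantined again.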

VM Compute Resiliency Workflow in Windows Server 2016

The VM resiliency workflow in a compute cluster is as follows:

In the event of a “transient” intra-cluster communication failure on a node hosting VMs, the node is placed into an Isolated state and removed from its active cluster membership. The VMs on the node are now considered to be in an Unmonitored state by the cluster service.

If the isolated node continues to experience intra-cluster communication failures, after a certain period (default of 4 minutes), the VM is failed over to a suitable node in the cluster, and the node is now moved to a Down state.

If a node is isolated a certain number of times (default three times) within an hour, it is placed into a Quarantine state for a certain period (default two hours) and all the VMs from the node are drained to a suitable node in the cluster.
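The quarantine rule in the last step can be illustrated with a small, self-contained sketch. This is not the cluster service's actual implementation: the function name Test-ShouldQuarantine is hypothetical, and treating "within an hour" as a sliding window that ends at the current time is an assumption made only for illustration.

```powershell
# Illustrative model only; the real decision is made inside the cluster service.
function Test-ShouldQuarantine {
    param(
        # Times at which the node was isolated (may be an empty array).
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [datetime[]]$IsolationTimes,

        # The moment at which the rule is evaluated.
        [Parameter(Mandatory)]
        [datetime]$Now,

        # Defaults mirror the values described above: 3 isolations in 1 hour.
        [int]$Threshold = 3,
        [timespan]$Window = [timespan]::FromHours(1)
    )

    # Keep only the isolation events that fall inside the window ending at $Now.
    $recent = @($IsolationTimes | Where-Object { $_ -le $Now -and ($Now - $_) -le $Window })

    # Quarantine once the number of recent events reaches the threshold.
    return ($recent.Count -ge $Threshold)
}

# Usage: three isolations in the last hour meet the default threshold.
$now = Get-Date
$events = @($now.AddMinutes(-50), $now.AddMinutes(-20), $now.AddMinutes(-5))
Test-ShouldQuarantine -IsolationTimes $events -Now $now
```

With the defaults, the usage example returns True; moving the oldest event back by more than an hour would leave only two events in the window and return False.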

Configuring Node Isolation and Quarantine settings

To achieve the desired Service Level Agreement guarantees for your environment, you can configure the following cluster settings, controlling how your node is placed in isolation or quarantine:

Setting: ResiliencyLevel
Description: Defines how unknown failures are handled
Default: 2
Values:
- 1: Allow the node to be in the Isolated state only if the node gave a notification and went away for a known reason; otherwise fail immediately. Known reasons include a Cluster Service crash or asymmetric connectivity between nodes.
- 2: Always let a node go to the Isolated state and give it time before taking over ownership of the VMs.
PowerShell: (Get-Cluster).ResiliencyLevel = <value>

Setting: ResiliencyPeriod
Description: Duration to allow a VM to run isolated (in seconds)
Default: 240
Values:
- 0: Reverts to pre-Windows Server 2016 behavior
PowerShell:
- Cluster property: (Get-Cluster).ResiliencyDefaultPeriod = <value>
- Group common property for granular control: (Get-ClusterGroup "My VM").ResiliencyPeriod = <value>
Note: A value of -1 for the group property causes the cluster property to be used.

Setting: QuarantineThreshold
Description: Number of failures before a node is quarantined
Default: 3
PowerShell: (Get-Cluster).QuarantineThreshold = <value>

Setting: QuarantineDuration
Description: Duration to disallow cluster node join (in seconds)
Default: 7200
Values:
- 0xFFFFFFFF: Never allow the node to join
PowerShell: (Get-Cluster).QuarantineDuration = <value>
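The following sketch shows how these settings might be inspected and changed together. The numeric values and the group name "My VM" are examples only, not recommendations; choose values that match your own availability targets.

```powershell
# Requires the FailoverClusters module and permission to administer the cluster.
Import-Module FailoverClusters

# Show the current cluster-wide resiliency and quarantine settings.
Get-Cluster | Format-List ResiliencyLevel, ResiliencyDefaultPeriod, QuarantineThreshold, QuarantineDuration

# Example value: let VMs run isolated for 2 minutes instead of the default 4.
(Get-Cluster).ResiliencyDefaultPeriod = 120

# "My VM" is a placeholder group name: give one VM a longer isolation window
# (10 minutes) than the cluster-wide default.
(Get-ClusterGroup "My VM").ResiliencyPeriod = 600

# Example values: quarantine after 5 failures, and keep a quarantined node
# out of the cluster for 1 hour.
(Get-Cluster).QuarantineThreshold = 5
(Get-Cluster).QuarantineDuration = 3600
```

Setting the group property back to -1 returns that VM to the cluster-wide ResiliencyDefaultPeriod.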