Draining Nodes for Planned Maintenance with Windows Server 2012

Microsoft

Mar 15, 2019

First published on MSDN on Apr 03, 2012

Windows Server 2012 Failover Clusters are easier to manage and maintain with the new “Node Drain” and “Resume with Failback” features. These features allow nodes to be gracefully drained for planned maintenance. This functionality is part of the infrastructure that enables “Cluster Aware Updating” (CAU) for patching nodes in a cluster.

Overview

Bringing an individual node down for planned maintenance is a common administrative task, for example to install a Service Pack or perform hardware upgrades.

On a Windows Server 2008 R2 Failover Cluster, this is a manual process where you place a cluster node in the PAUSED state, and then move individual Roles (workloads) to the other nodes in the cluster as outlined in this KB article.

In Windows Server 2012, conducting planned maintenance on Failover Clusters is dramatically simplified, as these steps are automated by the Node Drain (or Node Maintenance Mode) feature.

Node Drain

Using Node Drain, you can automate moving the Roles (workloads) off of a cluster node. Think of Node Drain as an enhanced, workload-aware Node Pause.

Steps automated by Node Drain:

1) The cluster node is put in the PAUSED state, which prevents workloads hosted on other nodes from moving to the node.

2) The Roles (workloads) currently owned by the cluster node are sorted according to their Priority order. (Priority of Roles is another new Failover Clustering feature in Windows Server 2012.)

3) The Roles are then distributed to the other active nodes in the cluster in priority order. Node Drain works with all workloads running on the cluster. For virtual machines, it leverages live migrations and memory-aware intelligent placement.

4) When all the Roles have been moved off of the cluster node, the Node Drain operation is complete.
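
Before draining, it can help to preview which Roles a node currently owns and how they rank. The following is a minimal sketch, not part of the feature itself: the function name Get-NodeRolesByPriority and the node name in the example call are hypothetical, and sorting from the highest Priority value down is an assumption about what "priority order" means here.

```powershell
# Sketch only: requires the FailoverClusters module when actually called.
function Get-NodeRolesByPriority {
    param(
        [Parameter(Mandatory)]
        [string]$NodeName   # hypothetical node name supplied by the caller
    )
    # List the roles (cluster groups) owned by the node,
    # highest Priority value first (assumed ordering).
    Get-ClusterNode -Name $NodeName |
        Get-ClusterGroup |
        Sort-Object -Property Priority -Descending |
        Select-Object -Property Name, Priority, State
}

# Example call (placeholder node name):
# Get-NodeRolesByPriority -NodeName 'Node1'
```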

Initiating Node Drain through Failover Cluster Manager:

Initiating Node Drain through the Failover Cluster Manager snap-in is a simple one-click operation:

1) Open Failover Cluster Manager (CluAdmin.msc)

2) On the left-hand pane, navigate to Nodes

3) Right-click the node you wish to drain

4) Under Pause, select Drain Roles

Note: If you select “Do Not Drain Roles”, the node is simply paused, as in Windows Server 2008 R2.

Initiating Node Drain through PowerShell:

You can initiate Node Drain using the “Suspend-ClusterNode” PowerShell command.

Additional advanced options for draining nodes are available through PowerShell, including:

Parameter	Purpose
Drain	Initiates Node Drain
TargetNode	The destination node to which all drained roles will be moved or live migrated
ForceDrain	Forces the drain to proceed even when a group cannot be moved normally, either because no other node can host it or because it is in a locked state
Wait	Waits for the Node Drain operation to complete before the command returns
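
The parameters above can be combined in a small wrapper. This is a sketch under stated assumptions rather than a recommended tool: Start-NodeDrain and the node names are hypothetical, while Suspend-ClusterNode and its -Drain, -TargetNode and -ForceDrain parameters come from the FailoverClusters module.

```powershell
# Sketch only: requires the FailoverClusters module when actually called.
function Start-NodeDrain {
    param(
        [Parameter(Mandatory)]
        [string]$NodeName,      # node to drain (placeholder name)
        [string]$TargetNode,    # optional single destination node
        [switch]$Force          # maps to -ForceDrain
    )
    # Build the argument set so optional parameters are passed only when given.
    $drainArgs = @{ Name = $NodeName; Drain = $true }
    if ($TargetNode) { $drainArgs.TargetNode = $TargetNode }
    if ($Force)      { $drainArgs.ForceDrain = $true }

    # Pause the node and move its roles to other nodes.
    Suspend-ClusterNode @drainArgs
}

# Example calls (placeholder node names):
# Start-NodeDrain -NodeName 'Node1'
# Start-NodeDrain -NodeName 'Node1' -TargetNode 'Node2'
# Start-NodeDrain -NodeName 'Node1' -Force
```

Splatting keeps the optional parameters out of the call unless the caller supplies them, so the cluster chooses the destinations by default.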

Status of Drained Node:

When a Node Drain is initiated, the command returns the NodeDrainStatus property, indicating that the cluster node has begun the node drain operation. You can track the status of the on-going node drain operation using these two cluster node common properties:

Node Common Property	Values	Purpose
NodeDrainStatus	0 – Not Initiated; 1 – In Progress; 2 – Completed; 3 – Failed	Indicates the current status of the Node Drain.
NodeDrainTarget	Cluster Node Id	ID of the cluster node to which all workloads will be moved. This ID is set when you use the TargetNode parameter.
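
When scripting around a drain, the numeric status is easier to read as a label. The helper below is a minimal sketch that only encodes the values from the table above. The function name is hypothetical, and the commented query assumes that the PowerShell module surfaces the two common properties on node objects as DrainStatus and DrainTarget.

```powershell
# Map the numeric NodeDrainStatus values from the table to readable labels.
function ConvertTo-DrainStatusLabel {
    param(
        [Parameter(Mandatory)]
        [int]$Status
    )
    switch ($Status) {
        0 { 'Not Initiated' }
        1 { 'In Progress' }
        2 { 'Completed' }
        3 { 'Failed' }
        default { "Unknown ($Status)" }
    }
}

# Assumed way to view the drain state of every node (property names are an assumption):
# Get-ClusterNode | Select-Object Name, State, DrainStatus, DrainTarget
```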

Node Drain Failure:

Node Drain will fail if a virtual machine’s Live Migration fails for any reason, or if a Role cannot be moved because the node being drained is the last possible owner node for the Role.

Upon encountering an error with an individual role, the node drain operation continues to drain the remaining roles hosted on the node. The status of the node drain is set to “3” only after the remaining roles are drained from the cluster node.

You can then restart the Node Drain, optionally specifying the “-ForceDrain” parameter to override any errors encountered during the initial node drain.

Rebooting a Drained Node:

Once a node is drained, it will remain in the PAUSED state across reboots to prevent any roles from moving to that node, until the node is resumed. This keeps the node drained for the duration of the maintenance window.

Node Resume with Failback

When a node is drained, the cluster will remember the workload(s) that were moved off of the node. When resuming the node after maintenance, you have the option of moving back all the workload(s) to the cluster node. This will restore the cluster back to the original state it was in before the maintenance.

Steps automated by Node Resume with Failback:

1) The cluster node is removed from the PAUSED state, which enables workload(s) to move to this node.

2) The workload(s) that were originally drained from the node are moved back using Failback.

If a failback policy restricts failback to a specific window, resume honors that setting and the roles fail back once the window opens.

Resuming Node through Failover Cluster Manager:

1) Open Failover Cluster Manager (CluAdmin.msc)

2) On the left-hand pane, navigate to Nodes

3) Right-click the node you wish to resume

4) Under Resume, select Fail Roles Back

Note: If you select “Do Not Fail Roles Back”, the node is simply resumed, as in Windows Server 2008 R2.

Resuming Node through PowerShell:

You can resume a node using the Resume-ClusterNode PowerShell command.

Additional advanced options for resuming nodes are available through PowerShell, including:

Name	Value	Purpose
Failback	NoFailback – Don’t fail back workloads; Immediate – Fail back immediately; Policy – Fail back during the configured window	Defines the type of failback to expect after the node is resumed.
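
As an illustration of the Failback values, a thin wrapper might look like the following. Treat it as a sketch: Resume-DrainedNode and the node name are hypothetical, and defaulting to Immediate is a choice made for this example, not a statement about the cmdlet's own default.

```powershell
# Sketch only: requires the FailoverClusters module when actually called.
function Resume-DrainedNode {
    param(
        [Parameter(Mandatory)]
        [string]$NodeName,      # node to resume (placeholder name)

        # The three failback types described above.
        [ValidateSet('NoFailback', 'Immediate', 'Policy')]
        [string]$Failback = 'Immediate'   # default chosen for this example only
    )
    # Take the node out of the PAUSED state and apply the chosen failback type.
    Resume-ClusterNode -Name $NodeName -Failback $Failback
}

# Example calls (placeholder node name):
# Resume-DrainedNode -NodeName 'Node1'                     # fail roles back now
# Resume-DrainedNode -NodeName 'Node1' -Failback Policy    # wait for the failback window
```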

Additional Information:

Cancelling Node Drain:

Draining a node may be a long running operation. A Node Drain that is in progress can be cancelled by initiating a Node Resume. This will cause the Node Drain operation to stop, and if Fail Roles Back is specified, the drained workloads which were moved will be moved back to the cluster node.

Configuring the Move Type for a Virtual Machine

Node Drain and Node Resume with Failback will leverage Live Migration for virtual machines so that a node can be drained with no downtime. Live Migration may at times be a long running operation, and there may be scenarios where you wish to quickly drain a node. Node draining provides the flexibility to allow configuration of how VMs should be moved, using either Live Migration or Quick Migration.

You also have granular control over the move type based on the priority setting of the VM. This is configured with the private property NodeDrainMoveTypeThreshold of the Virtual Machine resource type:

Name	Value	Purpose
NodeDrainMoveTypeThreshold (Private Property)	Priority of Virtual Machines	Virtual Machines with Priority equal to or higher than the specified priority will be moved using Live Migration. Virtual Machines with Priority lower than the specified priority will be moved using Quick Migration.
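
The threshold rule can be worked through with a few values. The function below is only an illustration of the comparison described above, not the cluster's implementation. Its name is hypothetical, and the example assumes the usual numeric Priority values of 3000 (High), 2000 (Medium) and 1000 (Low).

```powershell
# Illustrates the NodeDrainMoveTypeThreshold rule: at or above the threshold
# a VM is live migrated, below it the VM is quick migrated.
function Get-DrainMoveType {
    param(
        [Parameter(Mandatory)]
        [int]$VmPriority,   # Priority of the virtual machine's role

        [Parameter(Mandatory)]
        [int]$Threshold     # value of NodeDrainMoveTypeThreshold
    )
    if ($VmPriority -ge $Threshold) { 'Live Migration' } else { 'Quick Migration' }
}

# With a threshold of 2000 (Medium, assumed value):
Get-DrainMoveType -VmPriority 3000 -Threshold 2000   # Live Migration
Get-DrainMoveType -VmPriority 2000 -Threshold 2000   # Live Migration
Get-DrainMoveType -VmPriority 1000 -Threshold 2000   # Quick Migration
```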

Example PowerShell commands to view or modify this private property:

Creating property:
Get-ClusterResourceType "Virtual Machine" | Set-ClusterParameter -Multiple @{"NodeDrainMoveTypeThreshold"="3000"} -Create

Modifying created property:
Get-ClusterResourceType "Virtual Machine" | Set-ClusterParameter -Multiple @{"NodeDrainMoveTypeThreshold"="3000"}

Reading property:
Get-ClusterResourceType "Virtual Machine" | Get-ClusterParameter NodeDrainMoveTypeThreshold

Conclusion:

Node Drain is a great new time-saving feature in Windows Server 2012 Failover Clustering for conducting planned maintenance. Using this feature, you can easily drain the workload(s) off of a cluster node in a single click, and easily restore them when maintenance operations are completed on the cluster node.

Thanks!

Amitabh Tamhane
Program Manager II
Clustering & High Availability
Microsoft

Lokesh Koppolu
Principal Development Lead
Clustering & High Availability
Microsoft
