No such thing as a Heartbeat Network
Published Mar 25 2019 03:49 PM 47.4K Views
Microsoft

This is a blog that has been a long time coming.  From time to time, we get a request about how to configure networking in Failover Clusters.  One of the questions we get is how the heartbeat network should be configured, and that is the focus of this blog.  I am here to say that there is no such thing as a heartbeat network, and there never was.

 

Please allow me to give a little background and explain.

 

In Failover Clustering on Windows 2003 and earlier, you could define which network was used for Cluster Communication.  Below is a picture for reference.

 

Heartbeat1.jpg

 

In the picture above, we would want to select Private for our Cluster Communication so as not to use the Public network, which carries all WAN traffic.  All Cluster Communication between nodes (joins, registry updates/changes, etc.) would go only over this network if it is up.  As the picture shows, the networks are called Public and Private.  As years went by, some started calling the Private network a Heartbeat network.

 

Heartbeats are small packets (134 bytes) that travel over UDP Port 3343 on all networks configured for Cluster use between all nodes.  They serve multiple purposes.

 

  • Establish whether a Cluster Network is up or down
  • Establish routes between nodes
  • Determine whether the health of a node is good or bad
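
The cluster's own view of which networks are up or down can be inspected from PowerShell. The following is a minimal sketch, assuming it runs on a node of an existing cluster with the FailoverClusters module installed; the choice of columns is only one reasonable option.

```powershell
# Assumes this runs on a node of an existing Failover Cluster
# with the FailoverClusters module available.
Import-Module FailoverClusters

# One row per Cluster Network. State is the cluster's view (for example Up or Down).
# Role shows whether the network is enabled for cluster use:
# 0 = none, 1 = cluster only, 3 = cluster and client
# (newer versions may display names instead of numbers).
Get-ClusterNetwork |
    Format-Table Name, State, Role, Address, AddressMask -AutoSize

# One row per node/network pairing, useful for spotting a single node
# whose interface on a given network is down.
Get-ClusterNetworkInterface |
    Format-Table Node, Network, Name, State -AutoSize
```

Both commands are read-only, so they are safe to run when you only want to see what the cluster currently reports.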

So let's say I have Private set as my priority network for Cluster Communications.  If it is up, we send our communication through it.  But what happens if that network isn't reliable?  If a node tries to join and packets are dropping, the join could fail.  In that case, you either determine where the problem is and fix it, or go back into the Cluster properties and set Public as the priority.

 

Starting in Windows 2008 Failover Clusters, the concept of Public and Private networks went out the window.  We will now send Cluster Communication over any of our networks.  One of the reasons for this was reliability.  With that change, we also gave the heartbeats an additional purpose.

 

  • Determine the fastest and most reliable routes between nodes

Since we are now determining the fastest and most reliable routes, we could use different networks between nodes for our communication.  Take the below as an example.

 

Heartbeat2.jpg

 

We have three individual networks between our nodes:

 

 

  • Blue – 10 Gbps used for backups and administration only
  • Green – 40 Gbps used for communicating out on the WAN to clients
  • Red – 40 Gbps used for communicating out on the WAN to clients

 

 

As a refresher, here is what the heartbeats are doing:

 

  • Establish whether a Cluster Network is up or down
  • Establish routes between nodes
  • Determine whether the health of a node is good or bad
  • Determine the fastest and most reliable routes between nodes

What the heartbeats are going to tell the Cluster is to use one of the faster networks for its communication.  With that as the case, it is going to use either the Red or the Green network.  If the heartbeats start detecting that neither of these is reliable (e.g. dropped packets, network congestion, etc.), it will automatically switch and use the Blue network.  That's it, nothing extra for you to configure.

 

So to wrap things up, remember these things about Failover Clusters and Heartbeats.

 

  1. There is no such thing as a heartbeat network or a network dedicated to heartbeats
  2. Heartbeat packets are lightweight (134 bytes in size)
  3. Heartbeats are sensitive to latency
  4. Bandwidth is not an important factor; quality of service is.  If your network is all teamed, ensure you have set up Network QoS policies for our UDP 3343 traffic.
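
As a rough illustration of the last point, a QoS policy matching the heartbeat traffic on UDP 3343 might look like the sketch below. The policy name "ClusterHeartbeat" is hypothetical, and the 802.1p priority value of 7 is an assumption rather than a recommendation from this blog; choose the action that matches your own network design, and note that priority tagging only helps if the adapters and switches are configured to honor it.

```powershell
# Hypothetical policy name; the priority value is an assumption,
# not a value prescribed by this article.
New-NetQosPolicy -Name "ClusterHeartbeat" `
    -IPProtocolMatchCondition UDP `
    -IPDstPortMatchCondition 3343 `
    -PriorityValue8021Action 7

# Confirm the policy exists and review what it matches.
Get-NetQosPolicy -Name "ClusterHeartbeat"
```

The policy would need to be created on every node, since QoS policies are local to each server.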

For more information regarding configuring networks in a Cluster, please see Microsoft Ignite session:

 

Failover Clustering Networking Essentials

 

Happy Clustering !!!!

 

John Marlin

Senior Program Manager

Microsoft Corporation

Twitter: @Johnmarlin_MSFT

13 Comments
Copper Contributor

While I understand your theory behind this, in practice there are other products besides Windows clustering where a heartbeat network is still a thing.

 

Now with that said I will use the term Network loosely... As I don't really consider a crossover cable between two physical servers a network per se. 

 

There are systems out there, such as drbd and other technologies (I'm just using this as an example), where a heartbeat is only sent out over one network and configured as such. Other configurations may exist to approximate where Windows clustering has progressed to, but not everything is there... Yet.

Copper Contributor

Thanks for the write up John!

 

So should we still use

 

(Get-Cluster).PlumbAllCrossSubnetRoutes = 1

 

to use all routes for cluster comms? 

Microsoft

PlumbAllCrossSubnetRoutes is different.  This property is used to detect if there are other routes we can find.  For example, Node1 has 1.1.1.1 and 192.1.1.1 networks.  Node2 has 1.1.1.2 and 192.1.1.2 networks.  Cluster Network 1 would comprise the 1.x network.  Cluster Network 2 would comprise the 192.x network.  What this property setting would do is periodically see if it can make a connection between 1.1.1.1 and 192.1.1.2 as well as 1.1.1.2 and 192.1.1.1.  If it can, NetFt will add it to its internal route table.  If it cannot, it will check again at a later time.

 

More Info:

https://techcommunity.microsoft.com/t5/Failover-Clustering/What-the-heck-is-PlumbAllCrossSubnetRoute...
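
For reference, reading and setting the property from the question can be sketched as follows, assuming the commands run on a cluster node with the FailoverClusters module installed. Only the value 1 from the question is shown; other values are covered in the linked article.

```powershell
# Assumes this runs on a node of an existing Failover Cluster.
Import-Module FailoverClusters

# Show the current value of the cluster common property.
(Get-Cluster).PlumbAllCrossSubnetRoutes

# Set it to 1, as in the question above.
# Note there is no space between the dot and the property name.
(Get-Cluster).PlumbAllCrossSubnetRoutes = 1
```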

 

Copper Contributor

So with (Get-Cluster).PlumbAllCrossSubnetRoutes = 1 set, will the cluster still prefer the faster routes, or spread cluster comms across all known networks in a round robin or load balancing fashion? Or do all comms stick with one network path until it fails or slows down?

 

Thanks!

Microsoft

The Cluster networking component NetFt is what decides which path it will take for Cluster communications.  PlumbAllCrossSubnetRoutes simply tries to establish whether a new route can be found that NetFt can use if it needs it.  The communication itself is not round robin.  NetFt will establish a route for the communications and will generally tend to stay on it unless a better one is found.

Copper Contributor

Great, thanks for your help John!

Copper Contributor

John,

"Heartbeats are sensitive to latency"

Question: Does this heartbeat (keep-alive) information refer to a Converged Network topology or a Non-Converged Network topology, or is there no difference from your point of view?

 

 

Microsoft

There is no difference.  The questions are: can we see the packet and the sequence number in the packet, and can we return?  It is simply a matter of whether we can connect through the network.

Copper Contributor

Thank you, John.

 

From our viewpoint, we see the difference as actual bandwidth per network.  Items that can compete for cycles on an interface, like multiple networks, are not a factor in a dedicated (non-converged) topology.  The keep-alive (heartbeat) using all available networks is a smart move!

 

JP

Very Valuable info, thank you!

Search engines should index and show these useful and original articles at top when people search for them.

Copper Contributor

What happens if my cluster heartbeat network fails on a single node, or on both nodes?

Microsoft

@rajeshmokashi17 as the title of this article explains, there is 'no such thing as a heartbeat network'.  But assuming that you are referring to a Cluster Network which has been enabled for Cluster Communications: if this is the only Cluster Network with Cluster Communications enabled and it fails on one node, the other node would attempt to fail over any resources which are dependent on that network and bring them online.  If that cluster network fails on both nodes, the cluster will not be able to function, so the cluster service will stop on both nodes and will continuously try to restart until the nodes are able to reach each other and reform the cluster.  Ideally you should have redundancy for all networks and NICs to avoid any single points of failure, in which case a single failed network should not have any significant impact as long as the redundant networks/devices remain connected.
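
One way to check whether a cluster is in the single-network situation described in this reply is to count the networks enabled for Cluster Communications. This is a minimal sketch, assuming it runs on a cluster node with the FailoverClusters module installed; the warning text and the threshold of two networks are illustrative choices, not a rule from the article.

```powershell
# Assumes this runs on a node of an existing Failover Cluster.
Import-Module FailoverClusters

# Networks enabled for Cluster Communications have Role 1 (cluster only)
# or 3 (cluster and client). Role 0 means the cluster does not use the network.
$clusterNets = @(Get-ClusterNetwork | Where-Object { [int]$_.Role -in 1, 3 })

$clusterNets | Format-Table Name, State, Role, Address -AutoSize

# Illustrative check: with fewer than two such networks there is
# no alternate path for cluster traffic if one fails.
if ($clusterNets.Count -lt 2) {
    Write-Warning "Only $($clusterNets.Count) network(s) enabled for Cluster Communications."
}
```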

Copper Contributor

Hello,

 

We believe we have our cluster networks set up correctly, but when we take backups of databases, the backup traffic goes across our backup network AND our production network.  We wanted to disable all cluster traffic on the backup network but now fear that this will affect heartbeat traffic - or at the very least we'd only have 1 network where heartbeat traffic could cross.

 

Anyone have any ideas about keeping backup traffic on the correct network?

 

Cheers,
MR.

Last update: Mar 25 2019 03:48 PM