Mar 17 2023 12:43 AM
Hi,
We recently had a big network outage and were surprised by how our Hyper-V environment handled it. A core switch needed replacement, and failover of our network to the second core switch took 1.5 minutes because of broadcast traffic. During that time we saw all Hyper-V hosts stop and start VMs, lose CSV volumes, and stop cluster services.
Our environment is all 2016 Hyper-V hosts. Each host has 4 NICs: Nic0 = management, Nic1 = live migration, and Nic2 and Nic3 are a team for VM data. These are UCS blades, meaning that all NICs are virtual hardware (but seen by Windows as physical) and are connected to the same backplane. All those backplanes come together in a UCS domain, which consists of 2 fabric interconnects, feed A and B. If feed A were to fail, feed B would pick up, and vice versa. They are running active/active. We have multiple of these domains and they are all connected to our core network switches. The hosts of a Hyper-V cluster are always in the same datacenter, but spread over multiple domains. We don't stretch clusters over datacenters. When the network blackout happened, multiple clusters in site A were hit by it.
From the log analysis I'm trying to draw some conclusions about the inner workings of Hyper-V and CSVs and would like to check with you if I'm right.
- I've read a lot about the cluster timeout settings (SameSubnetDelay and SameSubnetThreshold) and have come to believe that they only apply to the loss of one host. In other words, if one host dies, the remaining hosts will use SameSubnetDelay and SameSubnetThreshold to wait before taking action. They do NOT apply to the host that gets isolated. So if we have a complete network blackout, these settings don't prevent VMs from going down.
- The biggest issue that caused VMs to go down is probably that the Hyper-V hosts lost their CSV volumes. In the logs I see that they lost almost all of the CSVs they were connected to, but not all. I'm wondering whether the CSVs they didn't lose are the ones they owned, but I can't find any evidence of that because I don't know which CSVs each host owned before the outage. Since all the CSVs are FC connected, I was at first puzzled about how this could have happened. I then came to the conclusion that if a host gets isolated and can therefore no longer talk to the owner node of a CSV volume over the network, it will drop that CSV volume even though the FC connection is still working.
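The open question in the second point (did each host keep only the CSVs it owned?) could be checked after the fact if an ownership snapshot were taken before the maintenance. Below is a minimal Python sketch of that comparison. The host names, CSV names, and data are hypothetical, and it assumes the owner map and the list of lost CSVs have already been collected from the cluster and the logs by some other means.

```python
from typing import Dict, Set, Tuple


def compare_kept_to_owned(
    owners: Dict[str, str], host: str, lost: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (owned_but_lost, kept_but_not_owned) for one host.

    owners: CSV name -> owner node, snapshotted BEFORE the outage.
            Assumption: every CSV in this map was visible to `host`.
    lost:   CSV names the host's log reports as lost during the outage.
    """
    owned = {csv for csv, owner in owners.items() if owner == host}
    kept = set(owners) - lost
    return owned - kept, kept - owned


# Hypothetical snapshot and log findings, for illustration only.
owners_before = {
    "CSV01": "HV-A1",
    "CSV02": "HV-A2",
    "CSV03": "HV-A1",
    "CSV04": "HV-A3",
}
lost_on_hv_a1 = {"CSV02", "CSV04"}

owned_but_lost, kept_but_not_owned = compare_kept_to_owned(
    owners_before, "HV-A1", lost_on_hv_a1
)
# The "kept exactly what it owned" hypothesis holds only if both sets are empty.
hypothesis_holds = not owned_but_lost and not kept_but_not_owned
print(hypothesis_holds)
```

If either returned set is non-empty for any host, the ownership explanation does not fully account for what the logs show.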
I would love to hear whether my conclusions above are correct and whether there is anything I can do to prevent this from happening again during our next network maintenance.
We are planning to create a completely separate network that doesn't run through the core and use that as an extra heartbeat, but it will take some time and money to get it running.
Mar 17 2023 09:27 PM
Based on your description, it seems that the network outage caused the Hyper-V hosts to lose connectivity to the CSV volumes, which resulted in the VMs being stopped and the cluster services being interrupted. This is because the CSV volumes are a shared storage resource that relies on the network for communication between the hosts and the storage.
Your understanding of the SameSubnetDelay and SameSubnetThreshold settings is correct. These settings are used by the cluster to determine how long to wait before taking action when a host becomes unavailable within the same subnet. In the case of a complete network blackout, these settings do not apply, and the remaining hosts will take action immediately.
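To put the two settings in perspective against the 1.5-minute failover, here is a rough Python sketch of the arithmetic. It assumes the usual reading that a node is declared down after SameSubnetThreshold consecutive missed heartbeats sent every SameSubnetDelay milliseconds. The values of 1000 ms and 10 are illustrative assumptions, not figures taken from this cluster; check the real ones with Get-Cluster.

```python
def heartbeat_tolerance_ms(delay_ms: int, threshold: int) -> int:
    """Time without heartbeats before a node is treated as down."""
    return delay_ms * threshold


def threshold_to_outlast(outage_ms: int, delay_ms: int) -> int:
    """Smallest threshold whose tolerance is strictly longer than the outage."""
    return outage_ms // delay_ms + 1


# Illustrative values only (assumptions, not read from the cluster in question).
DELAY_MS = 1000      # assumed SameSubnetDelay
THRESHOLD = 10       # assumed SameSubnetThreshold
OUTAGE_MS = 90_000   # the 1.5-minute failover described in the first post

tolerance = heartbeat_tolerance_ms(DELAY_MS, THRESHOLD)
print(tolerance)                                   # tolerance in milliseconds
print(OUTAGE_MS > tolerance)                       # did the outage exceed it?
print(threshold_to_outlast(OUTAGE_MS, DELAY_MS))   # threshold needed at this delay
```

With these assumed values the tolerance is far shorter than the outage. The sketch does not check whether the cluster would accept a threshold large enough to cover it, or whether setting one would be wise.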
Regarding the CSV volumes, it's possible that the volumes that were not lost were owned by the hosts that were still up and running. When a host loses connectivity to a CSV volume, it will drop the volume, and the ownership of the volume will be transferred to another host in the cluster. If a host is the owner of a CSV volume and it loses network connectivity, the ownership will not be transferred to another host until the SameSubnetThreshold has been exceeded.
In terms of preventing this from happening again during network maintenance, creating a separate network for heartbeat communication is a good idea. This will provide an additional layer of redundancy and help ensure that the cluster can remain online during a network outage. Additionally, you may want to consider implementing a redundant storage network, such as a separate FC fabric, to provide additional resilience for the CSV volumes.
Mar 19 2023 11:19 PM
@Mark_Albin Thank you for your response and for confirming my thoughts. In the last line you mention a redundant FC network/fabric. How would this help without the extra Ethernet network? Even a second FC fabric would rely on the network communication being available, correct?
Currently the CSV volumes have 4 paths to each host.
Jan 16 2024 10:58 AM
Gabrie - we experienced the same issue last week when a network ARP event took one of my Hyper-V clusters down. We had data corruption because the CSV control information could not be communicated over Ethernet. The Fibre Channel was fine, and we do have redundant FC (Alpha/Omega) networks.
Did you ever deploy the out-of-band set of switches in your environment? If so, which networks are you routing across them: Management (538), Live Migration (3807), or Cluster Network (3808)? It can't be the Cluster Network, so would one of the others even prevent the underlying issue?
Any advice you can provide would be very much appreciated.
Nancy Freeman
Jan 17 2024 04:56 PM
This post is nearly a year old. I suggest you create a new discussion in order to increase visibility.
I'm going to make the following assumptions: your use of Alpha/Omega is what is usually classified as A/B fabric, and the Management (538), Live Migration (3807), and Cluster Network (3808) networks you mention are VLANs that you have defined within your environment.
I do not see a Cluster Storage Network in your list. You may be depending upon cluster storage's failover behavior for this. The Cluster Storage network will use the Cluster Network if the Cluster Storage Network is unavailable.
In my humble opinion, Cluster Storage in Hyper-V using Fibre Channel is not ready for prime time. Essentially, it is a non-clustering filesystem that achieves clustering by coordinating writes through an out of band communication channel (the Cluster Storage Network.) This leaves you with problems if EITHER the FC or IP network has a brief outage. If your storage were over IP, then this would make sense. It does not make sense for Fibre Channel. An actual clustering filesystem will have a mechanism for coordination over the same pipe as the storage.
If you are using Hyper-V to support enterprise workloads or systems that are intolerant of storage fragility, do not use CSVs. You still have the option of using traditional storage or SMBv3.
In the environments I support, I place systems that do not require live migration (such as domain controllers, clustered app servers, clustered DB servers, etc.) on traditional storage and have the rest use SMBv3 from a SAN/NAS.
You can certainly take steps to mitigate these types of issues by using a completely separate network on separate physical NICs, but careful architecture is required to make sure you do not simply move the problem from one place to another. A short outage on your isolated network will have the same impact as a short outage on your non-isolated network. My recommendation would be to use both.