Mar 17 2023 12:43 AM
We recently had a big network outage and were surprised by how our Hyper-V environment handled it. A core switch needed replacement, and failover to the second core switch took about 1.5 minutes because of broadcast traffic. In that window we saw all Hyper-V hosts stop and restart VMs, lose CSV volumes, and stop cluster services.
Our environment is all 2016 Hyper-V hosts. Each host has 4 NICs: NIC0 = management, NIC1 = live migration, NIC2 and NIC3 are a team for VM data. These are UCS blades, meaning all NICs are virtual hardware (seen by Windows as physical) and are connected to the same backplane. The backplanes come together in a UCS domain consisting of two fabric interconnects, feed A and B, running active/active; if feed A fails, feed B picks up, and vice versa. We have multiple of these domains, all connected to our core network switches. The hosts of a Hyper-V cluster are always in the same datacenter but spread over multiple domains; we don't stretch clusters across datacenters. When the network blackout happened, multiple clusters in site A were affected.
From the log analysis I'm trying to draw some conclusions about the inner workings of Hyper-V and CSVs, and I'd like to check with you whether I'm right.
- I've read a lot about the cluster timeout settings (SameSubnetDelay and SameSubnetThreshold) and have come to believe they apply only to the loss of one host. In other words, if one host dies, the remaining hosts use SameSubnetDelay and SameSubnetThreshold to decide how long to wait before taking action. They are NOT applicable to the host that gets isolated, so in a complete network blackout they don't prevent VMs from going down.
- The biggest issue, and probably what caused VMs to go down, is that the Hyper-V hosts lost their CSV volumes. In the logging I see that they lost almost all the CSVs they were connected to, but not all. I'm wondering whether the CSVs they didn't lose are the ones they owned, but I can't find evidence of that because I don't know which CSVs they owned before the outage. Since all the CSVs are FC-connected, I was first puzzled how this could have happened, but then concluded that if a host gets isolated and can therefore no longer talk to the owner node of a CSV volume over the network, it will drop that CSV even though the FC connection is still working.
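As a cross-check, the heartbeat settings in question can be read on any cluster node with the FailoverClusters PowerShell module. This is a minimal sketch; the values noted in the comments are the Windows Server 2016 defaults:

```powershell
# Read the current heartbeat tuning for the cluster (run on any node)
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

# 2016 defaults: SameSubnetDelay = 1000 ms, SameSubnetThreshold = 10,
# i.e. a node is declared down after roughly 10 seconds of missed heartbeats.

# Raising the threshold buys time during brief network blips, at the cost
# of slower detection of a genuinely dead node:
(Get-Cluster).SameSubnetThreshold = 20
```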
Would love to hear if my conclusions above are correct and if there is anything I can do to prevent this from happening again with our next network maintenance.
We are planning to create a completely separate network that doesn't run through the core and use that as an extra heartbeat, but it will take some time and money to get it running.
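Once that network is in place, the cluster's view of its networks can be inspected and restricted with PowerShell. A sketch, assuming the new network shows up as a cluster network named "Heartbeat" (a placeholder name for your environment):

```powershell
# List cluster networks and what the cluster may use them for
# Role: 0 = not used, 1 = cluster traffic only (heartbeat/CSV), 3 = cluster and client
Get-ClusterNetwork | Format-Table Name, Role, Address, State

# Restrict the new network to internal cluster traffic only
# ("Heartbeat" is a placeholder for whatever name the cluster assigns it)
(Get-ClusterNetwork -Name "Heartbeat").Role = 1
```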
Mar 17 2023 09:27 PM
Based on your description, the network outage caused the Hyper-V hosts to lose connectivity to the CSV volumes, which resulted in the VMs being stopped and the cluster services being interrupted. CSVs are a shared storage resource that relies on the cluster network for metadata coordination between the hosts, even when the block storage itself is FC-attached.
Your understanding of the SameSubnetDelay and SameSubnetThreshold settings is correct. These settings are used by the cluster to determine how long to wait before taking action when a host becomes unavailable within the same subnet. In the case of a complete network blackout, these settings do not apply, and the remaining hosts will take action immediately.
Regarding the CSV volumes, it's possible that the volumes that were not lost were owned by the hosts that were still up and running. When a host loses connectivity to a CSV volume, it will drop the volume, and the ownership of the volume will be transferred to another host in the cluster. If a host is the owner of a CSV volume and it loses network connectivity, the ownership will not be transferred to another host until the SameSubnetThreshold has been exceeded.
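One practical step: recording CSV ownership before the next maintenance window would let you test the ownership theory against the logs afterwards. A minimal sketch (the disk and node names in the second command are placeholders):

```powershell
# Snapshot which node owns each CSV before maintenance starts
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State

# Optionally place ownership deliberately ahead of the maintenance window
Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node "HOST01"
```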
In terms of preventing this from happening again during network maintenance, creating a separate network for heartbeat communication is a good idea. This will provide an additional layer of redundancy and help ensure that the cluster can remain online during a network outage. Additionally, you may want to consider implementing a redundant storage network, such as a separate FC fabric, to provide additional resilience for the CSV volumes.
Mar 19 2023 11:19 PM
@Mark_Albin Thank you for your response and for confirming my thoughts. In the last line you mention a redundant FC network/fabric. How would this help without the extra Ethernet network? Even a second FC fabric would still rely on the cluster network communication being available, correct?
Currently the CSV volumes have 4 paths to each host.
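For completeness, the path count and MPIO timers can be verified per host; a sketch, assuming the in-box Windows MPIO feature is handling the FC paths:

```powershell
# Summary of MPIO-claimed disks, their path counts and load-balance policy
mpclaim -s -d

# MPIO timers that govern how long a lost path is tolerated
Get-MPIOSetting
```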