Forum Discussion
What happens to Cluster shared volumes during network outage?
Based on your description, it seems that the network outage caused the Hyper-V hosts to lose connectivity to the CSV volumes, which resulted in the VMs being stopped and the cluster services being interrupted. This is because the CSV volumes are a shared storage resource that relies on the network for communication between the hosts and the storage.
Your understanding of the SameSubnetDelay and SameSubnetThreshold settings is correct. SameSubnetDelay is the interval between cluster heartbeats (in milliseconds), and SameSubnetThreshold is the number of consecutive missed heartbeats tolerated before a node on the same subnet is considered down. In the case of a complete network blackout, these settings do not apply, and the remaining hosts will take action immediately.
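As a rough illustration of how the two settings interact: assuming SameSubnetDelay is the heartbeat interval in milliseconds and SameSubnetThreshold is the number of missed heartbeats tolerated, their product approximates how long a node can stay silent before the cluster reacts. The values used below (1000 ms, 10 heartbeats) and the function names are illustrative assumptions, not readings from the cluster discussed here.

```python
def heartbeat_tolerance_ms(same_subnet_delay_ms: int, same_subnet_threshold: int) -> int:
    """Approximate silence (in ms) tolerated before a node is considered down."""
    return same_subnet_delay_ms * same_subnet_threshold


def outage_exceeds_tolerance(outage_ms: int, delay_ms: int, threshold: int) -> bool:
    """True if an outage lasts at least as long as the heartbeat tolerance."""
    return outage_ms >= heartbeat_tolerance_ms(delay_ms, threshold)


# Illustrative values only; check the real ones on your own cluster.
print(heartbeat_tolerance_ms(1000, 10))
print(outage_exceeds_tolerance(3_000, 1000, 10))
print(outage_exceeds_tolerance(15_000, 1000, 10))
```

With these example values, a 3-second blip stays under the tolerance while a 15-second outage does not.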
Regarding the CSV volumes, it's possible that the volumes that were not lost were owned by the hosts that were still up and running. When a host loses connectivity to a CSV volume, it will drop the volume, and the ownership of the volume will be transferred to another host in the cluster. If a host is the owner of a CSV volume and it loses network connectivity, the ownership will not be transferred to another host until the SameSubnetThreshold has been exceeded.
In terms of preventing this from happening again during network maintenance, creating a separate network for heartbeat communication is a good idea. This will provide an additional layer of redundancy and help ensure that the cluster can remain online during a network outage. Additionally, you may want to consider implementing a redundant storage network, such as a separate FC fabric, to provide additional resilience for the CSV volumes.
Mark_Albin Thank you for your response and for confirming my thoughts. In the last line you mention a redundant FC network/fabric. How would this help without the extra Ethernet network? Even a second FC fabric would still rely on network communication being available, correct?
Currently the CSV volumes have 4 paths to each host.
- nvfreema (Copper Contributor), Jan 16, 2024
Gabrie - we experienced the same issue last week when a network ARP event took one of my Hyper-V clusters down. We had data corruption because the CSV control traffic could not be exchanged over Ethernet. The Fibre Channel side was fine, and we do have redundant FC (Alpha/Omega) networks.
Did you ever deploy the out-of-band set of switches in your environment? If so, which networks are you routing across them: Management (538), Live Migration (3807), or Cluster Network (3808)? It can't be the Cluster Network, so would one of the others even prevent the underlying issue?
Any advice you can provide would be very much appreciated.
Nancy Freeman
- Taysol (Copper Contributor), Jan 18, 2024
This post is nearly a year old. I suggest you create a new discussion in order to increase visibility.
I'm going to make the following assumptions: your use of Alpha/Omega is what is usually classified as an A/B fabric, and the Management (538), Live Migration (3807), and Cluster Network (3808) networks you mention are VLANs that you have defined within your environment.
I do not see a Cluster Storage Network in your list. You may be depending upon cluster storage's failover behavior for this. The Cluster Storage network will use the Cluster Network if the Cluster Storage Network is unavailable.
In my humble opinion, Cluster Storage in Hyper-V using Fibre Channel is not ready for prime time. Essentially, it is a non-clustering filesystem that achieves clustering by coordinating writes through an out-of-band communication channel (the Cluster Storage Network). This leaves you with problems if EITHER the FC or the IP network has a brief outage. If your storage were over IP, then this would make sense. It does not make sense for Fibre Channel. An actual clustering filesystem will have a mechanism for coordination over the same pipe as the storage.
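The dependency described above can be sketched as a tiny model. This is a deliberately simplified reading of the claim in this post, not a description of how CSV is implemented: the `NodeLinks` type and `csv_healthy` function are hypothetical names, and real behavior (for example redirected I/O) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class NodeLinks:
    fc_up: bool  # Fibre Channel data path to the shared LUN
    ip_up: bool  # IP network carrying the CSV coordination traffic


def csv_healthy(links: NodeLinks) -> bool:
    """Simplified claim from the post: trouble starts if EITHER link drops."""
    return links.fc_up and links.ip_up


for links in (
    NodeLinks(fc_up=True, ip_up=True),
    NodeLinks(fc_up=True, ip_up=False),   # FC fine, Ethernet outage
    NodeLinks(fc_up=False, ip_up=True),   # Ethernet fine, FC outage
):
    print(links, csv_healthy(links))
```

The second case is the one reported earlier in this thread: the FC fabrics stayed up, but the loss of the IP path was still enough to cause problems.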
If you are using Hyper-V to support enterprise workloads or systems that are intolerant of storage fragility, do not use CSVs. You still have the option of using traditional storage or SMBv3.
In the environments I support, I place systems that do not require live migration (such as Domain Controllers, clustered app servers, clustered db servers, etc.) on traditional storage and have the rest use SMBv3 from a SAN/NAS.
You can certainly take steps to mitigate these types of issues by using a completely separate network with separate physical NICs, but careful architecture is required to make sure you do not simply move the problem from one place to another. A short outage on your isolated network will have the same impact as a short outage on your non-isolated network. My recommendation would be to use both.
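To illustrate why using both networks helps, here is a minimal sketch that assumes cluster communication survives as long as at least one of the two networks is up. The outage windows are made-up example data (start and end in seconds), and `down_seconds` is a hypothetical helper.

```python
def down_seconds(outages):
    """Expand (start, end) outage windows into the set of affected seconds."""
    seconds = set()
    for start, end in outages:
        seconds.update(range(start, end))  # end is exclusive
    return seconds


# Made-up example data: outage windows on each network, in seconds.
isolated_outages = [(100, 105)]
shared_outages = [(102, 103), (300, 320)]

isolated_down = down_seconds(isolated_outages)
shared_down = down_seconds(shared_outages)

# Relying on one network: every one of its outages interrupts the cluster.
print(len(isolated_down), len(shared_down))

# Using both: only seconds where BOTH are down interrupt the cluster.
both_down = isolated_down & shared_down
print(sorted(both_down))
```

In this example each network alone has several seconds of downtime, but the two are down together for only a single second, which is the only window in which cluster communication would be lost under this assumption.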