Hi Gabrie van Zanten
Sandy25, I haven't heard back from you on the invite and would like to continue our exchange.
I don't want to turn this informational thread into a support thread.
If someone is stuck with their SRs, you are invited to join AzureStackHCI.slack.com. It's a strong and active volunteer community for Azure Stack HCI, S2D and Hyper-V in general.
That said, I would like to address the individual issues brought up here.
Setting CPU compatibility (masking) takes the VM back to a roughly 1985-level feature set, or at least close to it, which causes some modern applications to stop running because they depend on newer CPU instructions. We'd need more granular CPU masking, like EVC in VMware.
This limitation is addressed in Windows Server 2025 and Azure Stack HCI 22H2 or newer.
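For context, this is how the (rather blunt) compatibility mode is toggled today; a minimal sketch with a placeholder VM name. On Windows Server 2025 / Azure Stack HCI 22H2 the same cmdlet exposes the newer, less restrictive compatibility mode, so check Get-Help Set-VMProcessor on your build:

```powershell
# Classic processor compatibility mode: masks the guest CPU down to a lowest
# common denominator feature set so the VM can live-migrate across CPU
# generations. The VM must be powered off; "MyVM" is a placeholder.
Stop-VM -Name "MyVM"
Set-VMProcessor -VMName "MyVM" -CompatibilityForMigrationEnabled $true
Start-VM -Name "MyVM"

# Verify the setting:
Get-VMProcessor -VMName "MyVM" |
    Select-Object VMName, CompatibilityForMigrationEnabled
```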
Many here are reporting CSV/owner-node timeouts or network performance issues.
There are a lot of potential causes. When you report these, it is crucial to state whether you run Hyper-V with S2D (or Azure Stack HCI) or Hyper-V with a SAN. I have had quite a few customers suffering from these issues, and all of them got resolved.
Common misconfigurations are (a read-only PowerShell spot-check follows after the list):
- Storage timeout in the registry not set to the value recommended by the SAN vendor or OEM; this applies to S2D and HCI as well.
- Use of LBFO teaming instead of SET teaming on Windows Server 2019 or newer (2016 is a story of its own; I would not use it for workloads anymore). LBFO limits SMB Multichannel and pins throughput to certain CPU cores. If those cores are under high load and poor RSS defaults come on top, congestion and therefore timeouts can occur.
- Hyper-V and cluster settings not using SMB Multichannel, but compression or plain TCP instead.
- Use of ReFS CSV Volumes in SAN scenarios. Don't do this!
- A range of networking configuration issues:
- Live migration settings not consistent across nodes; Hyper-V and cluster live migration settings not consistent with each other.
- Live migration using 1 GbE interfaces while faster ones were available, leading to congestion and then CSV timeouts.
- Incorrect RSS configurations (defaults might not be suitable for all scenarios and depend on the NIC vendor).
- Incorrect RDMA configurations on hosts or switches. RDMA enabled but not (properly) configured.
- Using RDMA RoCEv2 with no VLAN ID. QoS won't work without it.
- Enabling SR-IOV on the SET switch without first adjusting the SR-IOV settings in the UEFI firmware for the NICs, including the number of virtual functions (e.g. set to 127). SR-IOV won't work when these UEFI settings are wrong and the SET switch is created with SR-IOV enabled.
Correcting these settings later requires removing the SET switch and creating it again.
- Inconsistent Jumbo Packet configurations, which lead to costly packet fragmentation.
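Here is the read-only spot-check mentioned above; a sketch that only reports and changes nothing. The cmdlets and registry path are standard, but the target values (timeout, MTU, VLAN IDs) are vendor- and design-specific, so compare the output against your OEM/SAN documentation and across all nodes:

```powershell
# Read-only spot-check for the most common misconfigurations above.
# Run on every node and compare the output for consistency.

# 1. Storage timeout (target value is SAN/OEM specific)
Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Disk" -Name TimeOutValue

# 2. Teaming: SET switches report EmbeddedTeamingEnabled = True,
#    any output from Get-NetLbfoTeam means legacy LBFO is still in use
Get-VMSwitch | Select-Object Name, EmbeddedTeamingEnabled
Get-NetLbfoTeam

# 3. SMB Multichannel / RDMA as seen by the SMB client
Get-SmbClientConfiguration | Select-Object EnableMultiChannel
Get-SmbClientNetworkInterface | Select-Object FriendlyName, RdmaCapable, LinkSpeed

# 4. RDMA state per NIC and VLAN IDs on the host vNICs
#    (RoCEv2 needs a VLAN ID for PFC/QoS to work)
Get-NetAdapterRdma | Select-Object Name, Enabled
Get-VMNetworkAdapterVlan -ManagementOS

# 5. Jumbo packet consistency across all adapters
Get-NetAdapterAdvancedProperty -Name * -RegistryKeyword "*JumboPacket" |
    Select-Object Name, DisplayValue

# 6. Live migration settings (must match across nodes)
Get-VMHost | Select-Object VirtualMachineMigrationEnabled,
    VirtualMachineMigrationPerformanceOption, MaximumVirtualMachineMigrations
```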
Given the network setup complexity and the many issues I have outlined, Network ATC and Network HUD in Windows Server 2025 (and already in Azure Stack HCI 22H2 and newer) will ease things considerably. But again, one needs to adopt them.
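To illustrate what Network ATC takes off your plate: a single converged intent replaces most of the manual SET/QoS/RDMA plumbing. A minimal sketch assuming the NetworkATC module of those builds, with placeholder adapter names:

```powershell
# Declare one converged intent (management + compute + storage) for two pNICs.
# Network ATC then creates the SET switch and keeps VLAN, QoS/DCB and RDMA
# settings consistent across all cluster nodes, and flags configuration drift.
Add-NetIntent -Name "Converged" -Management -Compute -Storage -AdapterName "pNIC1","pNIC2"

# Check the rollout and configuration state per node:
Get-NetIntentStatus
```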
Unfortunately, with some Intel and Broadcom NIC models, I and others have experienced driver and firmware issues in Windows scenarios more often. These typically show up as problems with the SET switch or as performance impacts due to poor offloading defaults or offloading bugs, often driver-related. So far Mellanox CX-5 / CX-6 have been more stable (yet).
A remote friend of mine, Alexander Fuchs, has been hunting these issues for years. They especially surface when link speeds are higher than 1 or 10 GbE. He has published an optimization script on GitHub, but a caveat: the topic is very complex, so I do not endorse it as a one-size-fits-all solution. Some say it helped a lot, some disagree.
Mind that the cluster validation result inside Windows Admin Center (not the downloadable .htm file) is more verbose than what PowerShell or Failover Cluster Manager show, especially for networking with RDMA.
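If you still want something scriptable, you can at least scope validation to the relevant categories and read the full report afterwards; the node names below are placeholders:

```powershell
# Run only the network-related validation categories (add "Storage Spaces Direct"
# for S2D/HCI clusters). The detailed .htm report is written to
# C:\Windows\Cluster\Reports on the node running the test.
Test-Cluster -Node "Node01","Node02" -Include "Network","Inventory"
```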
Depending on the load in the cluster and the cluster size, you will experience performance issues when any of these most common things are not done properly.
Generally, these nitty-gritty issues are addressed better in later OS versions, such as Windows Server 2022.
Windows Server 2025 still has to prove this, but given how stable Azure Stack HCI 22H2 is with regard to networking, I am confident about it.
I find the Azure Stack HCI to be limiting for Enterprise Virtualization. The number of nodes per cluster is limited to 16 as compared to 64 in Hyper-V with Windows Server. Hyper-V supports SAN storage and Azure Stack HCI does not.
sunnykb, sorry to say, but large clusters also contribute to a larger surface for issues,
especially due to differing hardware, inconsistent settings, and driver/firmware skew.
Windows Server does not have proper lifecycle management. There is a solution for this in the summary.
You might want to read https://techcommunity.microsoft.com/t5/windows-server-for-it-pro/quot-only-16-nodes-per-cluster-but-vmware-quot-limitations-and/m-p/4105136
Quickly finding which VMs sit on a specific CSV volume can only be done through Windows Explorer on the host.
This is something I am badly missing, as well as a Microsoft-provided script to balance VMs across CSVs, especially on S2D and HCI.
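Until something built-in exists, a small inventory like this gets you the VM-to-CSV mapping; a sketch assuming the Hyper-V and FailoverClusters modules, run on a cluster node, with VHDs living under C:\ClusterStorage:

```powershell
# Map clustered VMs to the CSV volume their virtual disks live on.
Get-ClusterNode | ForEach-Object { Get-VM -ComputerName $_.Name } |
    ForEach-Object {
        $vm = $_
        Get-VMHardDiskDrive -VM $vm | ForEach-Object {
            [pscustomobject]@{
                VM        = $vm.Name
                OwnerHost = $vm.ComputerName
                CsvVolume = ($_.Path -split '\\')[0..2] -join '\'  # e.g. C:\ClusterStorage\Volume01
                VhdPath   = $_.Path
            }
        }
    } | Sort-Object CsvVolume, VM | Format-Table -AutoSize
```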
We had to create a separate, isolated network to prevent major failures caused by just a short network loss.
Generally a good idea, as well as a separate AD for the cluster operation itself. Both also for security reasons.
I have often heard of GPOs breaking things for failover clusters and S2D / HCI.
Azure Stack HCI 23H2 cluster nodes, by default, sit in an OU with GPO inheritance blocked.
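If you want to mirror that for a 22H2 or plain Windows Server cluster OU, the GroupPolicy module can report and block inheritance; the OU distinguished name below is a placeholder:

```powershell
# Requires the GroupPolicy RSAT module and rights on the OU.
# The OU path is a placeholder - adjust it to your environment.
$ou = "OU=HCI-Cluster,DC=contoso,DC=com"

# Check whether GPO inheritance is currently blocked on the OU
(Get-GPInheritance -Target $ou).GpoInheritanceBlocked

# Block inheritance so only GPOs linked directly to the OU apply
Set-GPInheritance -Target $ou -IsBlocked Yes
```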
What you can do to prevent issues, in short:
- For HCI and SAN, consider the checklist above. It is not meant to be complete despite its length.
- Isolate the AD domain and the networking for the cluster networks using VLANs.
- Strongly consider link <> port <> NIC redundancy in your topology, not only link <> switch-port redundancy, even if learn.microsoft.com does not show this in its topology examples. If a NIC fails, for whatever reason, you are still safe and operational. I have seen quite a few clusters go down over such things.
- For S2D and HCI with Dell, consider their OMIMSWAC tool for Windows Admin Center and its one-time license (during or after the order; mind that S2D and Azure Stack HCI are not the same license, it depends on the OS and hardware).
It offers an HCP check that allows cluster-wide checks against their best practices and cluster-wide consistency checks of drivers and firmware.
Other vendors offer Windows Admin Center extensions for similar purposes as well.
Often the firmware gets patched but the Windows drivers do not, especially chipset drivers.
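Driver skew at least is quick to spot from PowerShell; a sketch run from any cluster node (firmware versions still have to come from the vendor tooling):

```powershell
# Compare NIC driver versions across all cluster nodes in one view.
$nodes = (Get-ClusterNode).Name
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Get-NetAdapter -Physical |
        Select-Object Name, InterfaceDescription, DriverVersion, DriverDate
} | Sort-Object InterfaceDescription, PSComputerName |
    Format-Table PSComputerName, Name, InterfaceDescription, DriverVersion, DriverDate -AutoSize
```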
Mind that OEM solution plugins are not available in the Windows Admin Center New Gateway preview, nor from the Azure portal.
Also, Azure Stack HCI 23H2 does not use these anymore; 22H2 does.