Live Migration issue on a 4-node S2D cluster (switchless)


Hi Community,
we've built a Windows Server 2022 (DC) hyperconverged Hyper-V S2D cluster with a switchless topology and 4 nodes.
Cluster connectivity is provided by 6x 100GbE on each node, with 2x 100GbE to each other node (Intel 810 with RDMA/iWARP and DCB configured), no SET teaming.
VM connectivity is provided by 4x 10GbE on each node (Intel 710, no RDMA), teamed in a SET.
As usual in hyperconverged S2D, there are no separate networks for cluster, storage, live migration, etc.
In this setup, there are 12 cluster networks that represent the "100GbE RDMA full mesh", each containing two of the Intel 810 NICs.
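As a quick sanity check of the numbers above, here is a minimal Python sketch that enumerates the point-to-point links of a switchless full mesh. The node names and the helper functions are made up for illustration and are not taken from the actual cluster configuration; the only inputs from the description are 4 nodes and 2 links between each pair of nodes.

```python
from itertools import combinations


def full_mesh_networks(nodes, links_per_pair):
    """Return one (node_a, node_b, link_index) entry per point-to-point network."""
    return [
        (a, b, link)
        for a, b in combinations(nodes, 2)          # every unordered node pair
        for link in range(1, links_per_pair + 1)    # parallel links per pair
    ]


def ports_per_node(nodes, links_per_pair):
    """Each node needs links_per_pair ports towards every other node."""
    return (len(nodes) - 1) * links_per_pair


# Hypothetical node names; 2 links per node pair as described in the post.
nodes = ["node1", "node2", "node3", "node4"]
networks = full_mesh_networks(nodes, links_per_pair=2)

print(len(networks))             # number of cluster networks
print(ports_per_node(nodes, 2))  # 100GbE ports needed on each node
```

With 4 nodes there are 6 node pairs, so 2 links per pair gives the 12 cluster networks and 6 ports per node mentioned above. Each network has exactly two NICs in it, one at each end of the cable.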

 

Now to the problem:
The system performs very well, apart from one problem with Live Migration:
LM is set to use SMB and is limited to 5 GB/s, i.e. 40 Gbit/s.
Moving a VM from one node to another takes a very different amount of time depending on the source and target node.
For example, moving "Test-VM" from node1 to node2 takes 10 seconds and starts immediately.
Moving from node2 to node3 behaves the same, but moving from node3 to node4 OR back from node3 to node2 takes 1-2 minutes, with a 30-60 second delay after the migration is initiated.
When the problem occurs, no errors are logged (neither in Cluster Manager nor in the system event log).
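To put the bandwidth limit into perspective, here is a small Python sketch of the arithmetic. The function names and the 16 GB VM memory size are assumptions for illustration only (the memory size of "Test-VM" is not stated here); the 5 GB/s limit is the one configured above.

```python
def gbytes_to_gbits(gbytes_per_s):
    """Convert a rate in GB/s to Gbit/s (8 bits per byte)."""
    return gbytes_per_s * 8


def transfer_seconds(vm_memory_gb, limit_gbytes_per_s):
    """Ideal time to copy the VM memory once at the configured limit."""
    return vm_memory_gb / limit_gbytes_per_s


limit = 5  # GB/s, the configured LM limit

print(gbytes_to_gbits(limit))       # limit expressed in Gbit/s
print(transfer_seconds(16, limit))  # seconds for a hypothetical 16 GB VM
```

Under these assumptions the pure copy time at the limit is only a few seconds, so a 30-60 second delay before the migration even starts does not look like a bandwidth problem.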

 

After a longer investigation, I figured out the following:
Under Failover Cluster Manager -> Networks -> Live Migration networks, all 12 networks are listed and checked (management and VM networks are unchecked).
The live migration paths represented by the first 3 or 4 networks work as expected (fast); the others do not (very slow, but still working without errors).


So the problem can be influenced here: moving all networks of a single node to the top of the list (e.g. all networks on node3) results in "normal" migration speed to all other nodes when node3 is the source of the migration. All other migrations (even back to node3) are very slow.
I've already checked the network configs on all nodes and set manual metrics for all connections and networks (the same for all SMB networks).
So this seems to be a problem with the switchless design rather than a bad config on a single node, since the behavior can only be changed by editing the order of the networks in Cluster Manager.


Maybe someone can help me with this. I'd be grateful for any tips.

Thanks,
VoNovo

4 Replies

@VoNovo I have the same issue. Did you ever find a solution?

Sorry, I never found a solution. I'm still having these performance issues…
Same issue here...
3-node switchless cluster (Dell AX750 with a 50 Gbit/s SMB connection to each node). LM takes a very long time to start, sometimes several minutes...
Drivers & patches are up to date, no errors in Failover Cluster Manager...