Storage Spaces Direct with Cavium FastLinQ® 41000

Published Apr 10 2019 04:31 AM
First published on TECHNET on Sep 21, 2017
Hello, Claus here again. I am very excited about how the RDMA networking landscape is evolving. We took RDMA mainstream in Windows Server 2012 when we introduced SMB Direct, and even more so in Windows Server 2016, where Storage Spaces Direct leverages SMB Direct for east-west traffic.

More partners than ever offer RDMA-enabled network adapters, and most focus on either iWARP or RoCE. In this post, we take a closer look at the Microsoft SDDC-Premium certified Cavium FastLinQ® 41000 RDMA adapter, which comes in 10G, 25G, 40G and even 50G versions. The FastLinQ® NIC is unique in that it supports both iWARP and RoCE, and can run both at the same time. This gives customers great flexibility: they can deploy the RDMA technology of their choice, or they can connect Hyper-V hosts with RoCE adapters and Hyper-V hosts with iWARP adapters to the same Storage Spaces Direct cluster equipped with FastLinQ® 41000 NICs.

Figure 1: Cavium FastLinQ® 41000

We use a 4-node cluster, each node configured with the following hardware:

  • DellEMC PowerEdge R730XD

  • 2x Intel® Xeon® E5-2697v4 (18 cores @ 2.3 GHz)

  • 128GiB DDR4 DRAM

  • 4x 800GB Dell Express Flash NVMe SM1715

  • 8x 800GB Toshiba PX04SHB080 SSD

  • Cavium FastLinQ® QL41262H 25GbE Adapter (2-Port)

  • BIOS configuration

    • BIOS performance profile

    • C-states disabled

    • Hyper-Threading (HT) on

We deployed Windows Server 2016 Storage Spaces Direct and VMFleet with:

  • 4x 3-way mirror CSV volumes

  • Cache configured for read/write

  • 18 VMs per node
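Each VM in the fleet generates its load with DISKSPD. The post does not list the exact VMFleet or DISKSPD parameters, so the following is only a hypothetical sketch of how the throughput and IOPS workloads described in this post could map onto DISKSPD command lines. The queue depths 1, 2, 4 and 8, one thread per VM, random access and the target file name are our assumptions; `-b`, `-w`, `-o`, `-t`, `-r` and `-L` are standard DISKSPD switches for block size, write percentage, outstanding IOs per thread, threads per target, random IO and latency statistics.

```python
from typing import List

# Assumed sweep points; the post only says "various queue depths".
QUEUE_DEPTHS = [1, 2, 4, 8]
TARGET = "testfile.dat"  # hypothetical target file inside each VM


def diskspd_args(block: str, write_pct: int, queue_depth: int) -> List[str]:
    """Build one DISKSPD command line (as an argument list) for a single VM.

    -b  IO (block) size, e.g. 512K or 4K
    -w  write percentage (0 means 100% read)
    -o  outstanding IOs per thread per target, i.e. the queue depth
    -t  threads per target (one thread per VM is an assumption)
    -r  random IO (an assumption; the post does not say)
    -L  collect latency statistics
    """
    return ["diskspd.exe", f"-b{block}", f"-w{write_pct}",
            f"-o{queue_depth}", "-t1", "-r", "-L", TARGET]


# Throughput sweep: 512KB IOs, 100% read.
throughput_sweep = [diskspd_args("512K", 0, qd) for qd in QUEUE_DEPTHS]
# IOPS sweep: 4KB IOs, 90% read / 10% write.
iops_sweep = [diskspd_args("4K", 10, qd) for qd in QUEUE_DEPTHS]

for args in throughput_sweep + iops_sweep:
    print(" ".join(args))
```

Keeping each command line as an argument list rather than a single string makes it easy to hand to a process launcher without worrying about shell quoting.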

First, we configured VMFleet for throughput. Each VM runs DISKSPD with a 512KB IO size, 100% read, at various queue depths:

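512KB IO, 100% read

| Queue Depth | iWARP BW (GB/s) | iWARP Read latency (ms) | RoCE BW (GB/s) | RoCE Read latency (ms) | iWARP and RoCE BW (GB/s) | iWARP and RoCE Read latency (ms) |
|---|---|---|---|---|---|---|
| 1 | 33.0 | 1.1 | 32.2 | 1.2 | 33.2 | - |
| 2 | 39.7 | 1.9 | 39.4 | 1.9 | 40.1 | - |
| 4 | 41.0 | 3.7 | 40.6 | 3.7 | 41.0 | - |
| 8 | 41.4 | 7.4 | 41.1 | 7.4 | 41.6 | - |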


Aggregate throughput is very close to what is possible with the cache devices in the system. Also, aggregate throughput and latency are very consistent whether we use iWARP, RoCE or both at the same time. In these tests, DCB is configured to enable PFC for RoCE, while iWARP runs without any DCB configuration.
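A quick way to sanity-check the throughput results is Little's law: IOs in flight = IO rate x latency. The sketch below applies it to the iWARP columns of the 512KB table. It assumes queue depths of 1, 2, 4 and 8 per VM, one queue per VM (72 VMs across 4 nodes), 512KB meaning 512 x 1024 bytes and GB/s meaning 10^9 bytes per second; the bandwidth and latency values are taken from the table.

```python
# Little's law: IOs in flight = IO rate x latency.
IO_SIZE_BYTES = 512 * 1024   # 512KB IOs, assuming binary KB
BYTES_PER_GB = 1e9           # assuming decimal GB in the GB/s column
VMS = 4 * 18                 # 4 nodes x 18 VMs per node

# (assumed queue depth per VM, iWARP BW in GB/s, iWARP read latency in ms)
ROWS = [(1, 33.0, 1.1), (2, 39.7, 1.9), (4, 41.0, 3.7), (8, 41.4, 7.4)]


def ios_in_flight(bw_gb_s: float, latency_ms: float) -> float:
    """Apply Little's law to one row of the throughput table."""
    iops = bw_gb_s * BYTES_PER_GB / IO_SIZE_BYTES
    return iops * latency_ms / 1000.0


results = [(qd, ios_in_flight(bw, lat), VMS * qd) for qd, bw, lat in ROWS]
for qd, measured, expected in results:
    print(f"QD {qd}: {measured:6.1f} IOs in flight (72 VMs x QD = {expected})")
```

Under these assumptions every row lands within about 4% of 72 x queue depth. Because bandwidth is nearly flat between the last two rows (41.0 vs 41.4 GB/s), latency is the only term left to absorb the extra outstanding IOs, which is why read latency roughly doubles from 3.7 ms to 7.4 ms.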

Next, we reconfigured VMFleet for IOPS. Each VM runs DISKSPD with a 4KB IO size, 90% read and 10% write, at various queue depths:

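4KB IO, 90% read / 10% write

| Queue Depth | iWARP IOPS | iWARP Read latency (ms) | RoCE IOPS | RoCE Read latency (ms) | iWARP and RoCE IOPS | iWARP and RoCE Read latency (ms) |
|---|---|---|---|---|---|---|
| 1 | 272,588 | 0.253 | 268,107 | 0.258 | 271,004 | - |
| 2 | 484,532 | 0.284 | 481,493 | 0.287 | 482,564 | - |
| 4 | 748,090 | 0.367 | 729,442 | 0.375 | 740,107 | - |
| 8 | 1,177,243 | 0.465 | 1,161,534 | 0.474 | 1,164,115 | - |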


Again, IOPS and latency numbers are very similar and consistent for iWARP, RoCE, or both at the same time.
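To put a number on "very similar", this small sketch computes the spread between the three configurations in each row of the 4KB table, defined as (highest IOPS - lowest IOPS) / highest IOPS. The IOPS values come from the table; labeling the rows as queue depths 1, 2, 4 and 8 is an assumption.

```python
# 4KB IOPS results (values from the table), keyed by assumed queue depth.
# Tuple order: (iWARP, RoCE, iWARP and RoCE)
IOPS = {
    1: (272_588, 268_107, 271_004),
    2: (484_532, 481_493, 482_564),
    4: (748_090, 729_442, 740_107),
    8: (1_177_243, 1_161_534, 1_164_115),
}

# Spread = (highest - lowest) / highest, per queue depth.
spreads = {qd: (max(v) - min(v)) / max(v) for qd, v in IOPS.items()}

for qd, spread in spreads.items():
    print(f"QD {qd}: {spread:.2%} spread across iWARP, RoCE and both")
```

The largest spread is roughly 2.5%, in the third row (748,090 vs 729,442 IOPS); the other rows stay under 2%.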

As mentioned at the beginning, more and more partners are offering RDMA network adapters, most focusing on either iWARP or RoCE. The Cavium FastLinQ® 41000 can do both, which means customers can deploy either or both, or even change over time if the need arises. The numbers look very good and consistent regardless of whether it is used with iWARP, RoCE or both at the same time.

What do you think?

Until next time
