RDMA (RoCE) Test Fails Across Different Subnets

Hello everyone,

Let's say pSMBNIC1 and pSMBNIC2 are the names of the NICs to be used for RDMA on each node of a 3-node cluster. IP address assignments are as follows:
pSMBNIC1 = 192.168.207.31 (N1), 192.168.207.32 (N2), 192.168.207.33 (N3)
pSMBNIC2 = 192.168.206.51 (N1), 192.168.206.52 (N2), 192.168.206.53 (N3)

The RDMA test succeeds in the following scenarios:
From pSMBNIC1 of N1 to pSMBNIC1 of N2 (192.168.207.32) and pSMBNIC1 of N3 (192.168.207.33)
From pSMBNIC1 of N2 to pSMBNIC1 of N1 (192.168.207.31) and pSMBNIC1 of N3 (192.168.207.33)
From pSMBNIC1 of N3 to pSMBNIC1 of N1 (192.168.207.31) and pSMBNIC1 of N2 (192.168.207.32)
From pSMBNIC2 of N1 to pSMBNIC2 of N2 (192.168.206.52) and pSMBNIC2 of N3 (192.168.206.53)
From pSMBNIC2 of N2 to pSMBNIC2 of N1 (192.168.206.51) and pSMBNIC2 of N3 (192.168.206.53)
From pSMBNIC2 of N3 to pSMBNIC2 of N1 (192.168.206.51) and pSMBNIC2 of N2 (192.168.206.52)


However, the RDMA test fails in the scenarios listed below, with this error:

ERROR: RDMA traffic test FAILED: Please check
ERROR: a) physical switch port configuration for Priority Flow Control.
ERROR: b) job owner has write permission at \\192.168.206.51\C$

From pSMBNIC1 of N1 to pSMBNIC2 of N2 (192.168.206.52) and pSMBNIC2 of N3 (192.168.206.53)
From pSMBNIC1 of N2 to pSMBNIC2 of N1 (192.168.206.51) and pSMBNIC2 of N3 (192.168.206.53)
From pSMBNIC1 of N3 to pSMBNIC2 of N1 (192.168.206.51) and pSMBNIC2 of N2 (192.168.206.52)
From pSMBNIC2 of N1 to pSMBNIC1 of N2 (192.168.207.32) and pSMBNIC1 of N3 (192.168.207.33)
From pSMBNIC2 of N2 to pSMBNIC1 of N1 (192.168.207.31) and pSMBNIC1 of N3 (192.168.207.33)
From pSMBNIC2 of N3 to pSMBNIC1 of N1 (192.168.207.31) and pSMBNIC1 of N2 (192.168.207.32)


In other words, the RDMA tests pass when both NICs are on the same subnet and fail when they are on different subnets.

Is this normal? I have already enabled PFC. And even if PFC were not enabled, how would the tests pass within the same subnet?


Please advise.

Thank you in advance.

1 Reply

Hi @HasanHasib,

It is not normal for RDMA tests to pass within a subnet but fail across subnets. There are a few possible causes, including an incorrect switch configuration, a firewall blocking RDMA traffic, or incorrect RDMA settings on the nodes.
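To make the pass/fail pattern concrete, here is a minimal Python sketch that groups the addresses from the question by subnet. The /24 prefix length is an assumption (the post does not state the mask), and `ADDRESSES`, `same_subnet` and `classify_pairs` are illustrative names only. Hosts on different subnets can only reach each other if a route exists between the two networks, so that is worth checking alongside the points below.

```python
import ipaddress
from itertools import permutations

PREFIX_LEN = 24  # assumption: the question does not state the subnet mask

# Addresses taken from the question, keyed by (node, NIC name).
ADDRESSES = {
    ("N1", "pSMBNIC1"): "192.168.207.31",
    ("N2", "pSMBNIC1"): "192.168.207.32",
    ("N3", "pSMBNIC1"): "192.168.207.33",
    ("N1", "pSMBNIC2"): "192.168.206.51",
    ("N2", "pSMBNIC2"): "192.168.206.52",
    ("N3", "pSMBNIC2"): "192.168.206.53",
}


def network_of(ip, prefix_len=PREFIX_LEN):
    """Return the IP network an address belongs to for the given prefix length."""
    return ipaddress.ip_interface(f"{ip}/{prefix_len}").network


def same_subnet(src_ip, dst_ip, prefix_len=PREFIX_LEN):
    """True when both addresses fall in the same network."""
    return network_of(src_ip, prefix_len) == network_of(dst_ip, prefix_len)


def classify_pairs(addresses):
    """Split all node-to-node (source, target) pairs into same-subnet and cross-subnet."""
    same, cross = [], []
    for (src_key, src_ip), (dst_key, dst_ip) in permutations(addresses.items(), 2):
        if src_key[0] == dst_key[0]:
            continue  # skip pairs that live on the same node
        bucket = same if same_subnet(src_ip, dst_ip) else cross
        bucket.append((src_key, dst_key))
    return same, cross


if __name__ == "__main__":
    same, cross = classify_pairs(ADDRESSES)
    print("Same-subnet pairs:")
    for src, dst in same:
        print(f"  {src} -> {dst}")
    print("Cross-subnet pairs (need a route between the subnets):")
    for src, dst in cross:
        print(f"  {src} -> {dst}")
```

Under the /24 assumption, the same-subnet pairs line up with the passing scenarios in the question and the cross-subnet pairs with the failing ones.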

To troubleshoot the issue, you can try the following:

  1. Check the switch configuration to make sure that Priority Flow Control (PFC) is enabled on all ports connecting the nodes.
  2. Check the firewall configuration on all nodes to make sure that RDMA ports are open.
  3. Check the RDMA settings on all nodes to make sure that RDMA is enabled and that the same RDMA technology (iWARP or RoCE) is configured on all nodes.
  4. Try running the RDMA test again after making the necessary changes.
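Since the error also mentions write access to the C$ share, a quick sanity check is whether plain SMB (TCP port 445) is reachable from a given source NIC to a given target address at all. The sketch below is only a basic TCP reachability probe, not an RDMA test: a success does not prove that RDMA works, but a failure on the cross-subnet pairs would point to routing or firewalling rather than PFC. The `can_connect` helper is a hypothetical name, and the example addresses are the ones from the question.

```python
import socket

SMB_PORT = 445  # standard TCP port for SMB


def can_connect(source_ip, target_ip, port=SMB_PORT, timeout=3.0):
    """Try a TCP connection to target_ip:port with the socket bound to source_ip.

    Binding sets the source address of the connection; the path the packets
    actually take is still decided by the operating system's routing table.
    """
    try:
        with socket.create_connection(
            (target_ip, port), timeout=timeout, source_address=(source_ip, 0)
        ):
            return True
    except OSError:
        # Covers refused connections, timeouts, missing routes and bind failures.
        return False


if __name__ == "__main__":
    # Run on N1: pSMBNIC1 of N1 towards pSMBNIC2 of N2 (a failing pair in the question).
    print(can_connect("192.168.207.31", "192.168.206.52"))
```

Running the same probe for a same-subnet pair and a cross-subnet pair from the same node gives a quick comparison before rerunning the RDMA test.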



Kindest regards,


Leon Pavesic