There has been growing interest in the HPC-optimized VM images that we publish.
While those images (CentOS-HPC 7.6, 7.7) were originally targeted at the SR-IOV enabled HPC VMs (HB, HC, HB_v2), they are conceptually useful for the now SR-IOV enabled GPU VMs (NCr_v3, NDr_v2) too. Note that the GPU VMs additionally require the NVIDIA GPU drivers, installed either through the VM extension or manually.
Typically, we find that users of the HPC VMs running traditional HPC applications prefer CentOS as their OS, while users of AI/ML applications running on the GPU VMs tend to prefer Ubuntu. The CentOS-HPC VM OS images (>=7.6 for the SR-IOV enabled VMs, and <=7.5 for the non-SR-IOV enabled VMs) provide a ready-to-use VM image with the appropriate drivers and MPI runtimes. Such a pre-packaged, ready-to-use experience isn't yet available for Ubuntu on Azure.
This article attempts to consolidate guidance on configuring InfiniBand (IB) for Ubuntu across both SR-IOV and non-SR-IOV enabled HPC and GPU VMs. Specifically, it focuses on getting the right drivers set up and bringing up the appropriate IB interface on the VMs. At the time of writing, the following steps apply at least to the Ubuntu 18.04 LTS image by Canonical on the Azure Marketplace.
Non-SR-IOV enabled VMs
The IB interface eth1 should come up with an RDMA IP address.
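One way to confirm that eth1 actually received an address is to parse the one-line output of the `ip` tool. The following is a minimal sketch, not an official procedure: the helper names `ipv4_addresses` and `interface_ipv4` are hypothetical, and it assumes the iproute2 `ip` command is installed on the VM.

```python
import subprocess
from typing import List


def ipv4_addresses(ip_output: str) -> List[str]:
    """Extract IPv4 addresses from `ip -o -4 addr show` output."""
    addresses = []
    for line in ip_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <ifname> inet <addr>/<prefix> ..."
        if "inet" in fields:
            cidr = fields[fields.index("inet") + 1]
            addresses.append(cidr.split("/")[0])
    return addresses


def interface_ipv4(ifname: str = "eth1") -> List[str]:
    """Return the IPv4 addresses of ifname, or [] if none can be read."""
    try:
        result = subprocess.run(
            ["ip", "-o", "-4", "addr", "show", "dev", ifname],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # The `ip` tool is not installed; treat it as "no address found".
        return []
    return ipv4_addresses(result.stdout)
```

Calling `interface_ipv4("eth1")` on the VM returns a list of addresses. An empty list suggests the IB interface did not come up.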
The IB-related kernel modules are no longer auto-loaded on Ubuntu. This is a departure from earlier practice, where the kernel modules were built into the image. They are now available as loadable modules so that a user can install the Mellanox OFED driver.
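Because the modules are no longer loaded automatically, it can help to check which ones are present before debugging further. The sketch below reads `/proc/modules`. The list in `EXPECTED_MODULES` is an assumption for illustration (common IB/RDMA module names), not a list taken from this article, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed module names, for illustration only; adjust to your driver stack.
EXPECTED_MODULES = ("ib_uverbs", "rdma_ucm", "ib_umad", "ib_ipoib")


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text, expected=EXPECTED_MODULES):
    """Return the expected modules that are not currently loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in expected if name not in loaded]


def check_host(path="/proc/modules"):
    """Run the check against the live module list of this machine."""
    return missing_modules(Path(path).read_text())
```

Any name returned by `check_host()` can then be loaded with `sudo modprobe <name>`. Note that modules compiled directly into the kernel do not appear in `/proc/modules`, so a name reported as missing is not always a problem.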
Support for the NetworkDirect driver stack (the vmbus-rdma-driver required in the non-SR-IOV VMs) was dropped in the 5.3 kernel that ships in the 18.04-LTS 18.04.202004290 image in the Marketplace. This may lead to issues in bringing up the IB interface, as reported here. This may be addressed by Canonical starting with kernel 5.4.
As a workaround, an older image with kernel 5.0 (for example, Canonical UbuntuServer 18.04-LTS 18.04.202004080 with the 5.0.0-1036-azure kernel) still includes the "hv_network_direct" module that is missing from the newer image, and works fine.
Ubuntu 20.04 also does not show this issue.
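The kernel observations above can be turned into a quick check on a running VM. This is a rough sketch: the function names and the status strings are hypothetical, and it encodes only the two kernel series named in this article (5.0 reported working, 5.3 reported affected). Every other kernel is deliberately reported as unknown rather than guessed.

```python
import platform
import re


def kernel_major_minor(release):
    """Parse a `uname -r` string such as '5.0.0-1036-azure' into (5, 0)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def network_direct_status(release):
    """Classify a kernel release using only the cases described above."""
    version = kernel_major_minor(release)
    if version == (5, 0):
        return "reported working"   # e.g. 5.0.0-1036-azure
    if version == (5, 3):
        return "reported affected"  # hv_network_direct missing
    return "unknown"                # not covered by this article


if __name__ == "__main__":
    current = platform.release()
    print(current, "->", network_direct_status(current))
```

On a non-SR-IOV VM, a result of "reported affected" suggests switching to the older image described above before troubleshooting the IB interface any further.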