By Hugo Affaticati, Program Manager
Useful resources
Information on the NC A100 v4-series: Microsoft
Information on Multi-Instance GPU (MIG): NVIDIA
Prerequisites
Deploy a virtual machine from the Microsoft Azure portal.
Key values
- Size: NC24ads A100 v4 (NC48ads A100 v4 and NC96ads A100 v4 are also available)
- Image: Ubuntu HPC 18.04 (recommended; Ubuntu HPC 20.04 is also available)
- Availability: no redundancy required for benchmarking
Step 1: NVIDIA driver and CUDA
Verify the NVIDIA driver version:
cd /mnt
nvidia-smi
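The version check can also be scripted. The following is a minimal sketch, not part of the original procedure: driver_needs_update is a hypothetical helper name, and the comparison against 510 only looks at the major version number.

```shell
# Hypothetical helper: succeeds (exit 0) when the driver's major version is below 510.
driver_needs_update() {
    major="${1%%.*}"    # keep the text before the first dot, e.g. "470.82.01" -> "470"
    [ "$major" -lt 510 ]
}

# Example with a literal version string:
if driver_needs_update "470.82.01"; then
    echo "driver older than 510: update the driver and CUDA"
fi

# On the VM, the version string can be read from nvidia-smi:
# driver_needs_update "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```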
If the driver version is lower than 510, update both the driver and CUDA. The packages below target Ubuntu 18.04:
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo wget https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda-repo-ubuntu1804-11-6-local_11.6.1-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-6-local_11.6.1-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
Restart the machine:
sudo reboot
Step 2: Docker
Next, update Docker to the latest version:
cd /mnt
sudo apt update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt update
sudo apt-get install docker-ce
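Set the Docker root directory in the Docker daemon configuration file. The path below sits under /mnt/resource_nvme, the mount point that Step 4 uses for the NVMe disks.
sudo vi /etc/docker/daemon.json
Add the following line after the opening curly bracket. Keep the trailing comma only if other entries follow it; if the file is empty, wrap the line in curly brackets and drop the comma.
"data-root": "/mnt/resource_nvme/data",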
Verify the installation, then restart and enable Docker:
docker --version
sudo systemctl restart docker
sudo systemctl enable docker
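Because daemon.json must remain valid JSON, it can help to see the smallest valid form of the file. The sketch below prints a configuration that only sets the data root; minimal_daemon_json is a hypothetical helper, and the assumption is that the file holds no other settings. If your file already contains entries, keep them and add the line alongside.

```shell
# Hypothetical helper: print a minimal daemon.json that only sets the data root.
# Assumption: no other settings are needed in the file.
minimal_daemon_json() {
    printf '{\n    "data-root": "%s"\n}\n' "$1"
}

minimal_daemon_json "/mnt/resource_nvme/data"

# After restarting Docker on the VM, the active root directory can be read back with:
# docker info --format '{{.DockerRootDir}}'
```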
Add your user to the docker group (groupadd reports an error if the group already exists, which is safe to ignore):
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
You should now be able to run the following command without permission errors:
docker info
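If docker info still reports a permission error, the current shell session may not have picked up the new group. A small sketch for checking this, where in_session_group is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds when the current session's group list contains the given group.
in_session_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_session_group docker; then
    echo "this session is in the docker group"
else
    echo "this session is not in the docker group yet; log out and back in, or run newgrp docker"
fi
```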
Step 3: Enable MIG
Enable MIG mode:
sudo nvidia-smi -mig 1
You may have to reboot the machine for the change to take effect. Then verify that MIG mode is enabled:
nvidia-smi
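Whether a reboot is still needed can be inferred from the current and pending MIG modes that nvidia-smi reports. The sketch below is illustrative only; mig_state is a hypothetical helper that classifies the two values.

```shell
# Hypothetical helper: classify the MIG state from the current and pending modes.
mig_state() {
    current="$1"
    pending="$2"
    if [ "$current" = "Enabled" ]; then
        echo "enabled"
    elif [ "$pending" = "Enabled" ]; then
        echo "reboot-needed"
    else
        echo "disabled"
    fi
}

# Example with literal values:
mig_state "Disabled" "Enabled"    # prints reboot-needed

# On the VM, both values can be queried from the first GPU:
# mig_state "$(nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader | head -n 1)" \
#           "$(nvidia-smi --query-gpu=mig.mode.pending --format=csv,noheader | head -n 1)"
```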
Create seven GPU instances and their compute instances:
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19
sudo nvidia-smi mig -cci
For two or three MIG instances, use the following commands, respectively:
sudo nvidia-smi mig -cgi 9,9
sudo nvidia-smi mig -cci
or
sudo nvidia-smi mig -cgi 14,14,14
sudo nvidia-smi mig -cci
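The three layouts above differ only in the profile ID and how often it is repeated. As a sketch, the profile list could be generated from the instance count; cgi_list is a hypothetical helper, and the IDs 9, 14, and 19 are simply the ones used in this guide.

```shell
# Hypothetical helper: build the -cgi profile list for 2, 3, or 7 equal instances.
# The profile IDs (9, 14, 19) are the ones used in the commands above.
cgi_list() {
    case "$1" in
        2) id=9 ;;
        3) id=14 ;;
        7) id=19 ;;
        *) echo "unsupported instance count: $1" >&2; return 1 ;;
    esac
    list="$id"
    i=1
    while [ "$i" -lt "$1" ]; do
        list="$list,$id"
        i=$((i + 1))
    done
    echo "$list"
}

cgi_list 3    # prints 14,14,14

# On the VM:
# sudo nvidia-smi mig -cgi "$(cgi_list 7)" && sudo nvidia-smi mig -cci
```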
Display the GPU instance profiles:
sudo nvidia-smi mig -lgip
List the MIG devices:
nvidia-smi -L
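To confirm that the expected number of MIG devices was created, the listing can be counted. This is a minimal sketch: count_mig_devices is a hypothetical helper, and the sample text is an illustrative stand-in for the listing with placeholder UUIDs, not captured output.

```shell
# Hypothetical helper: count MIG device lines in `nvidia-smi -L` output read from stdin.
count_mig_devices() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the function's status at 0
    grep -c '^[[:space:]]*MIG ' || true
}

# Illustrative sample only; the UUIDs are placeholders.
sample='GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-xxxx)
  MIG 1g.10gb     Device  0: (UUID: MIG-xxxx)
  MIG 1g.10gb     Device  1: (UUID: MIG-yyyy)'
printf '%s\n' "$sample" | count_mig_devices    # prints 2

# On the VM:
# nvidia-smi -L | count_mig_devices
```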
Step 4: Mount the NVMe disks
cd /mnt
sudo vi nvme.sh
Copy and paste the following mounting script:
#!/bin/bash
NVME_DISKS_NAME=$(ls /dev/nvme*n1)
NVME_DISKS=$(ls -latr /dev/nvme*n1 | wc -l)
echo "Number of NVMe Disks: $NVME_DISKS"
if [ "$NVME_DISKS" = "0" ]
then
    exit 0
else
    mkdir -p /mnt/resource_nvme
    # Needed in case something did not unmount as expected. This will delete any data that may be left behind
    mdadm --stop /dev/md*
    # Stripe all NVMe disks into one RAID 0 array, format it as XFS, and mount it
    mdadm --create /dev/md128 -f --run --level 0 --raid-devices $NVME_DISKS $NVME_DISKS_NAME
    mkfs.xfs -f /dev/md128
    mount /dev/md128 /mnt/resource_nvme
fi
# World-writable with the sticky bit set, like /tmp
chmod 1777 /mnt/resource_nvme
Run the script with bash, as its first line specifies, to mount the disks:
sudo bash nvme.sh
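Once the script has finished, it is worth checking that the volume is actually mounted. A small sketch using the standard mountpoint utility, where is_mounted is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds when the given directory is an active mount point.
is_mounted() {
    mountpoint -q "$1"
}

if is_mounted /mnt/resource_nvme; then
    df -h /mnt/resource_nvme    # show the size and usage of the new volume
else
    echo "/mnt/resource_nvme is not mounted"
fi
```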