By Hugo Affaticati, Program Manager
Useful resources
Information on the NC A100 v4-series: Microsoft
Pre-requisites
Deploy a virtual machine from the Microsoft Azure portal.
Key values
- Size: NC24ads A100 v4 (NC48ads A100 v4 and NC96ads A100 v4 are also available)
- Image: Ubuntu HPC 20.04
- Availability: no redundancy required for benchmarking
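For readers who prefer the command line to the portal, the key values above could be passed to the Azure CLI. The following is only a sketch: the resource group and VM names are hypothetical placeholders, and the size and image identifiers are assumptions that should be checked against your subscription and region before use. The script prints the command rather than running it, so it can be reviewed first.

```shell
#!/bin/bash
# Hypothetical names: replace them with your own values.
RESOURCE_GROUP="my-benchmark-rg"
VM_NAME="nc-a100-vm"
# Assumed identifiers for the size and image listed above; verify before use.
VM_SIZE="Standard_NC24ads_A100_v4"
VM_IMAGE="microsoft-dsvm:ubuntu-hpc:2004:latest"

# Print the az command instead of executing it.
print_create_command() {
    echo "az vm create" \
        "--resource-group $RESOURCE_GROUP" \
        "--name $VM_NAME" \
        "--size $VM_SIZE" \
        "--image $VM_IMAGE" \
        "--generate-ssh-keys"
}

print_create_command
```

The image identifier can be checked with `az vm image list --publisher microsoft-dsvm --offer ubuntu-hpc --all --output table` before the printed command is run.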
Step 1: NVIDIA driver and CUDA
cd /mnt
nvidia-smi
If the driver version is lower than 525, update both the driver and CUDA:
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda-repo-ubuntu2004-12-0-local_12.0.0-525.60.13-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-0-local_12.0.0-525.60.13-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
Restart the machine so that the new driver is loaded:
sudo reboot
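Once the machine is back up, it may be worth confirming that the driver meets the 525 minimum used in this step. The sketch below reads the version through the `--query-gpu` option of nvidia-smi and compares its major component; the helper name `driver_is_recent` is illustrative and not part of any NVIDIA tool.

```shell
#!/bin/bash
# Minimum driver major version required by this guide.
MIN_DRIVER_MAJOR=525

# Succeeds (exit 0) when the given version string, e.g. "525.60.13",
# has a major component of at least MIN_DRIVER_MAJOR.
driver_is_recent() {
    major="${1%%.*}"    # keep everything before the first dot
    [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# Only query the GPU when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if driver_is_recent "$version"; then
        echo "Driver $version is recent enough"
    else
        echo "Driver $version is older than $MIN_DRIVER_MAJOR, update it"
    fi
fi
```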
Step 2: Docker
The next step is to update Docker to the latest version.
cd /mnt
sudo apt update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt update
sudo apt-get install docker-ce
Step 3: Mount the NVMe disks
cd /mnt
sudo vi nvme.sh
Copy and paste the following mounting script:
#!/bin/bash
NVME_DISKS_NAME=`ls /dev/nvme*n1`
NVME_DISKS=`ls -latr /dev/nvme*n1 | wc -l`
echo "Number of NVMe Disks: $NVME_DISKS"
if [ "$NVME_DISKS" = "0" ]
then
exit 0
else
mkdir -p /mnt/resource_nvme
# Needed in case something did not unmount as expected. This will delete any data that may be left behind
mdadm --stop /dev/md*
mdadm --create /dev/md128 -f --run --level 0 --raid-devices $NVME_DISKS $NVME_DISKS_NAME
mkfs.xfs -f /dev/md128
mount /dev/md128 /mnt/resource_nvme
fi
chmod 1777 /mnt/resource_nvme
Run the script with bash, matching its shebang, to create the RAID 0 array and mount it:
sudo bash nvme.sh
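To check that the array is actually mounted before pointing Docker at it, something like the following could be used. It is a minimal sketch that looks for the mount point in /proc/mounts; the helper name `is_mounted` and its optional second argument (an alternative mounts file) are illustrative additions, not part of the original script.

```shell
#!/bin/bash
# Succeeds when TARGET appears as a mount point in MOUNTS_FILE
# (defaults to /proc/mounts, whose second column is the mount point).
is_mounted() {
    target="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$mounts_file"
}

if is_mounted /mnt/resource_nvme; then
    echo "/mnt/resource_nvme is mounted"
    df -h /mnt/resource_nvme    # show the size of the striped volume
else
    echo "/mnt/resource_nvme is not mounted"
fi
```

The state of the md128 array itself can be inspected with `cat /proc/mdstat`.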
Step 4: Finish setting up Docker
Update the Docker root directory in the Docker daemon configuration file:
sudo vi /etc/docker/daemon.json
Paste the following lines:
{
    "data-root": "/mnt/resource_nvme/data",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
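If editing in vi is inconvenient, for example when scripting the setup, the configuration above could instead be written non-interactively. This is a sketch only: the function name `write_daemon_config` and the scratch path in the comments are illustrative, and the JSON is the same content shown above.

```shell
#!/bin/bash
# Write the daemon configuration shown above to the file given as $1.
write_daemon_config() {
    cat > "$1" <<'EOF'
{
    "data-root": "/mnt/resource_nvme/data",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Generate the file in a scratch location first, then install it as root:
#   write_daemon_config /tmp/daemon.json
#   sudo cp /tmp/daemon.json /etc/docker/daemon.json
```

Writing to a scratch file first keeps the function free of sudo and lets the result be inspected before it replaces the live configuration.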
Verify the previous steps and enable Docker:
docker --version
sudo systemctl restart docker
sudo systemctl enable docker
Add your user to the docker group so that Docker can be run without sudo:
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
You should now be able to run the following command without permission errors:
docker info
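The output of docker info also shows whether the data-root change from this step took effect. As a rough check, the root directory can be extracted with a Go template and compared with the expected path; the helper name `root_matches` is illustrative.

```shell
#!/bin/bash
# Path configured as "data-root" in /etc/docker/daemon.json.
EXPECTED_ROOT="/mnt/resource_nvme/data"

# Succeeds when the reported root directory matches the expected one.
root_matches() {
    [ "$1" = "$EXPECTED_ROOT" ]
}

# Only query Docker when the client is installed.
if command -v docker >/dev/null 2>&1; then
    # "|| true" tolerates a stopped daemon; the comparison below then fails.
    actual="$(docker info --format '{{.DockerRootDir}}' 2>/dev/null || true)"
    if root_matches "$actual"; then
        echo "Docker stores its data on the NVMe volume"
    else
        echo "Docker root is '$actual', expected '$EXPECTED_ROOT'"
    fi
fi
```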
Updated Jan 27, 2023
Azure High Performance Computing (HPC) Blog