Azure High Performance Computing (HPC) Blog

Getting started with the NC A100 v4-series

RachelPruitt
Jul 08, 2022

By Hugo Affaticati, Program Manager

 

Useful resources

Information on the NC A100 v4-series: Microsoft

 

Prerequisites

Deploy a virtual machine through the Microsoft Azure portal, using the key values below.

 

Key values

  • Size: NC24ads A100 v4 (NC48ads A100 v4 and NC96ads A100 v4 are also available)
  • Image: Ubuntu HPC 20.04
  • Availability: no redundancy required for benchmarking

 

Step 1: NVIDIA driver and CUDA

cd /mnt
nvidia-smi
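
The full nvidia-smi table is easy to misread, so the sketch below pulls out only the driver version and compares its major number against a minimum. The function name `driver_needs_update` is hypothetical, and the default of 525 is the threshold used in this step; `--query-gpu=driver_version` and `--format=csv,noheader` are standard nvidia-smi query options.

```shell
#!/bin/bash

# Hypothetical helper: succeeds (returns 0) when the major part of a driver
# version string such as "510.73.08" is below the minimum (default 525).
driver_needs_update() {
    major="${1%%.*}"
    minimum="${2:-525}"
    # Reject empty or non-numeric input instead of guessing.
    case "$major" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$major" -lt "$minimum" ]
}

# Query the installed driver only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    current=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    status=0
    driver_needs_update "$current" || status=$?
    case "$status" in
        0) echo "Driver $current is older than 525: update the driver and CUDA" ;;
        1) echo "Driver $current is recent enough" ;;
        *) echo "Could not read the driver version" ;;
    esac
fi
```

The comparison uses only the major version, so 525.60.13 counts as recent enough and 520.61.05 does not.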

 

If the driver version is lower than 525, update both the driver and CUDA:

sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda-repo-ubuntu2004-12-0-local_12.0.0-525.60.13-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-0-local_12.0.0-525.60.13-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Restart the machine to load the new driver:

sudo reboot

 

Step 2: Docker

The next step is to update Docker to the latest version.

cd /mnt
sudo apt update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt update
sudo apt-get install docker-ce
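
The add-apt-repository line relies on `$(lsb_release -cs)` expanding to the release codename, which is `focal` on Ubuntu 20.04. As a small illustration, the hypothetical helper `docker_repo_line` below builds the same source entry from an explicit architecture and codename, so the expansion can be inspected before it is added.

```shell
#!/bin/bash

# Hypothetical helper: prints the apt source entry for Docker's repository.
# $1 = architecture (e.g. amd64), $2 = Ubuntu codename (e.g. focal).
docker_repo_line() {
    printf 'deb [arch=%s] https://download.docker.com/linux/ubuntu %s stable\n' "$1" "$2"
}

# Use the local codename when lsb_release works; otherwise fall back to
# focal, the codename of Ubuntu 20.04.
codename=$(lsb_release -cs 2>/dev/null) || codename=focal

docker_repo_line amd64 "$codename"
```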

 

Step 3: Mount the NVMe disks

cd /mnt
sudo vi nvme.sh

 

Copy and paste the following mounting script:

#!/bin/bash

# Collect the NVMe namespaces (e.g. /dev/nvme0n1) and count them
NVME_DISKS_NAME=$(ls /dev/nvme*n1 2>/dev/null)
NVME_DISKS=$(echo $NVME_DISKS_NAME | wc -w)

echo "Number of NVMe Disks: $NVME_DISKS"

if [ "$NVME_DISKS" -eq 0 ]
then
    exit 0
else
    mkdir -p /mnt/resource_nvme
    # Needed in case something did not unmount as expected. This will delete any data that may be left behind
    mdadm --stop /dev/md*
    # Stripe all NVMe disks into one RAID 0 array, format it as XFS, and mount it
    mdadm --create /dev/md128 -f --run --level 0 --raid-devices $NVME_DISKS $NVME_DISKS_NAME
    mkfs.xfs -f /dev/md128
    mount /dev/md128 /mnt/resource_nvme
fi

# World-writable with the sticky bit set, like /tmp
chmod 1777 /mnt/resource_nvme

 

Run the script with bash to create the array and mount the disks:

sudo bash nvme.sh
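
Once the script has run, it is worth confirming that the array is actually mounted before pointing Docker at it. The sketch below assumes a Linux mount table at `/proc/mounts`; the helper name `is_mounted` is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when mount point $1 is listed in the mount
# table file $2 (default: /proc/mounts, where the mount point is field 2).
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' "${2:-/proc/mounts}" 2>/dev/null
}

if is_mounted /mnt/resource_nvme; then
    # Show the size and free space of the striped volume.
    df -h /mnt/resource_nvme
else
    echo "/mnt/resource_nvme is not mounted"
fi
```

Reading /proc/mdstat (`cat /proc/mdstat`) additionally lists the md128 array and its member disks.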

 

Step 4: Finish setting up Docker

Point the Docker root directory at the NVMe volume by editing the Docker daemon configuration file:

sudo vi /etc/docker/daemon.json

 

Paste the following lines:

{
    "data-root": "/mnt/resource_nvme/data",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Check the installed Docker version, then restart the service so it picks up the new configuration and enable it at boot:

docker --version
sudo systemctl restart docker
sudo systemctl enable docker
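
To confirm that the daemon picked up the new configuration after the restart, the root directory it reports can be compared with the "data-root" value from daemon.json. This is a minimal sketch: `check_data_root` is a hypothetical helper, and the comparison assumes the `DockerRootDir` field that `docker info` exposes through `--format`.

```shell
#!/bin/bash

# Value of "data-root" in /etc/docker/daemon.json from this guide.
expected_root="/mnt/resource_nvme/data"

# Hypothetical helper: compares the root directory reported by Docker ($1)
# with the expected one and prints the result.
check_data_root() {
    if [ "$1" = "$expected_root" ]; then
        echo "OK: Docker root is $1"
    else
        echo "MISMATCH: Docker root is '$1', expected $expected_root"
        return 1
    fi
}
```

On the VM, it can be called as `check_data_root "$(sudo docker info --format '{{.DockerRootDir}}')"`.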

 

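Add your user to the docker group so that Docker commands work without sudo (the -f flag keeps groupadd from failing if the group already exists):

sudo groupadd -f docker
sudo usermod -aG docker $USER
newgrp docker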

 

You should now be able to run the following command without permission errors:

docker info
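
If docker info still reports a permission error, a first check is whether the account was actually added to the group. The sketch below uses `id -nG`, which reads the group database when given a user name; `in_group` is a hypothetical helper.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when user $1 belongs to group $2
# according to the group database.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx "$2"
}

user=$(id -un)
if in_group "$user" docker; then
    echo "$user is in the docker group"
else
    echo "$user is not in the docker group yet"
fi
```

If the account is listed but the error persists, the current session has probably not picked up the new group; logging out and back in refreshes it.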

 

Updated Jan 27, 2023
Version 9.0