HPC/AI storage options for NDm_v4 (A100) Azure Kubernetes Service (AKS) clusters
Published Jun 15 2023 10:56 AM
Microsoft



 

 

Introduction

In a previous blog post we showed how to deploy an optimal NDm_v4 AKS cluster, i.e. one in which all 8 InfiniBand and GPU devices on each NDm_v4 VM are installed correctly. We then verified that the NDm_v4 Kubernetes cluster was deployed and configured correctly by running an NCCL allreduce benchmark on 2 NDm_v4 VMs (16 A100 GPUs). We will use this NDm_v4 AKS cluster as a starting point and show how to use popular Azure HPC/AI storage options (local NVMe SSDs, Azure Managed Lustre File System (AMLFS), and Azure Files + NFSv4) in it.

 

Create I/O test container

We will use fio and IOR to test the various NDm_v4+AKS storage options.

 

IOR build script (build_ior.sh)

 

 

 

 

 

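#!/bin/bash
# Build IOR from the release tarball using the MPI compiler wrapper in the container.
set -e
APP_NAME=ior
PARALLEL_BUILD=8
IOR_VERSION=3.2.1

# Download and unpack the release tarball
IOR_PACKAGE=${APP_NAME}-${IOR_VERSION}.tar.gz
wget https://github.com/hpc/ior/releases/download/${IOR_VERSION}/${IOR_PACKAGE}
tar xvf ${IOR_PACKAGE}
rm ${IOR_PACKAGE}

# CC must be in configure's environment, so set it on the same command line.
# Installing into the source directory puts the binary in ior-<version>/bin/ior.
cd ${APP_NAME}-${IOR_VERSION}
CC=$(which mpicc) ./configure --prefix=$(pwd)
make -j ${PARALLEL_BUILD}
make install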

 

 

 

FIO build script (build_fio.sh)

 

 

 

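#!/bin/bash
# Build fio from source at a tagged release.
set -e
APP_VERSION=3.22
PARALLEL_BUILD=4

# The Dockerfile below installs packages with apt, so use apt-get here rather than yum
apt-get update
apt-get install -y zlib1g-dev git
git clone https://github.com/axboe/fio.git
cd fio
git checkout tags/fio-${APP_VERSION}

# Installing into the source directory puts the binary in fio/bin/fio
./configure --prefix=$(pwd)
make -j ${PARALLEL_BUILD}
make install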

 

 

 

Dockerfile to build I/O tester container.

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.03-py3

FROM ${FROM_IMAGE_NAME}

# Refresh the package index and install in one layer so the index is never stale
RUN apt-get update && \
    apt-get -y install build-essential infiniband-diags openssh-server kmod
# Build IOR and fio inside the image (invoked via bash, so no execute bit is needed)
COPY build_ior.sh .
RUN bash ./build_ior.sh
COPY build_fio.sh .
RUN bash ./build_fio.sh

 

 Build container locally

docker build -t <ACR_NAME>.azurecr.io/<CONTAINER_NAME> .

NOTE: Choose a suitable <CONTAINER_NAME>

 

Push your container to your Azure container registry.

docker push <ACR_NAME>.azurecr.io/<CONTAINER_NAME>

Mount Azure Managed Lustre Filesystem (AMLFS)

Prerequisites

AMLFS is already deployed, see AMLFS documentation for details.

 

In this example AMLFS is deployed in its own VNET (AMLFS_VNET) and AKS is deployed in a separate VNET (AKS_VNET; default kubenet networking, with the VNET created by AKS). To mount AMLFS in AKS, the two VNETs need to be peered so that they can communicate with each other.

 

az network vnet peering create -n <PEER1_NAME> -g <AMLFS_RESOURCE_GROUP> --vnet-name <AMLFS_VNET> --allow-forwarded-traffic --allow-vnet-access --remote-vnet /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<AKS_RESOURCE_GROUP>/providers/Microsoft.Network/virtualNetworks/<AKS_VNET>

 

az network vnet peering create -n <PEER2_NAME> -g <AKS_RESOURCE_GROUP> --vnet-name <AKS_VNET> --allow-forwarded-traffic --allow-vnet-access --remote-vnet /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<AMLFS_RESOURCE_GROUP>/providers/Microsoft.Network/virtualNetworks/<AMLFS_VNET>
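The --remote-vnet argument in both peering commands is a full VNET resource ID that follows a fixed pattern. As a minimal sketch, the helper below assembles that ID from its three parts. The function name and the example values are hypothetical and only illustrate the pattern.

```shell
#!/bin/bash
# Assemble the resource ID of a VNET in the form expected by --remote-vnet.
# The function name and the example values below are hypothetical.
vnet_resource_id() {
  # $1 = subscription ID, $2 = resource group, $3 = VNET name
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Network/virtualNetworks/%s\n' \
    "$1" "$2" "$3"
}

# Example with placeholder values (not real resources)
vnet_resource_id "00000000-0000-0000-0000-000000000000" "my-aks-rg" "my-aks-vnet"
```

Alternatively, `az network vnet show -g <RESOURCE_GROUP> -n <VNET> --query id -o tsv` should return the same ID for an existing VNET, which avoids typing it by hand.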

 

Get the AMLFS CSI driver repository.

git clone https://github.com/kubernetes-sigs/azurelustre-csi-driver.git

Install the CSI driver on NDm_v4 AKS cluster (kubenet networking)

curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurelustre-csi-driver/main/deploy/install-driver.sh | bash

 

From the AMLFS CSI driver repository, edit docs/examples/storageclass_existing_lustre.yaml and update the Lustre filesystem name and the MGS IP address (see Azure portal, AMLFS, Client connection to get these values for your deployed AMLFS).

kubectl create -f docs/examples/storageclass_existing_lustre.yaml

Edit docs/examples/pvc_storageclass.yaml and make sure it has the correct storage size (storage: 16Ti).

kubectl create -f docs/examples/pvc_storageclass.yaml

 

Check the persistent volume claim for AMLFS

kubectl get pvc

NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
pvc-lustre   Bound    pvc-26081f53-bb6d-48f7-b9c3-51982c97ed68   16Ti       RWX            sc.azurelustre.csi.azure.com   11s

 NOTE:  Consult the Azure AMLFS documentation to explore other ways to mount AMLFS in AKS.

 

See below how we use/mount this AMLFS persistent volume claim and run an IOR I/O benchmark.

 

Set up NDm_v4 local NVMe SSD

The NDm_v4 VM includes about 7 TB of local NVMe SSD storage spread across 8 NVMe devices. These devices need to be configured and mounted before use (e.g. combined into a RAID 0 array, formatted with an ext4 filesystem, and mounted).
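To make the RAID 0, ext4 and mount steps concrete, here is a dry-run sketch that only prints the commands such a setup would typically involve. It is not the provisioner's actual code. The device names, the /dev/md0 array name and the function name are assumptions for illustration.

```shell
#!/bin/bash
# Dry run: print (do not execute) the commands that would stripe a set of
# NVMe devices into one RAID 0 array, format it as ext4, and mount it.
# Device names, /dev/md0 and the function name are illustrative assumptions.
print_raid0_plan() {
  mount_point="$1"
  shift   # the remaining arguments are the NVMe devices
  echo "mdadm --create /dev/md0 --level=0 --raid-devices=$# $*"
  echo "mkfs.ext4 /dev/md0"
  echo "mkdir -p $mount_point"
  echo "mount /dev/md0 $mount_point"
}

print_raid0_plan /pv-disks/scratch \
  /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
  /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
```

Printing the plan first is a deliberate choice: creating an array and a filesystem is destructive, so the real commands should only be run by the provisioner or by an operator who has checked the device list.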

 

git clone https://github.com/ams0/aks-nvme-ssd-provisioner.git

 Modify the local NVMe SSD mount point, edit aks-nvme-ssd-provisioner.sh

sed -i "s/\/pv-disks\/\$UUID/\/pv-disks\/scratch/g" aks-nvme-ssd-provisioner.sh

 Modify the Dockerfile (change execute permissions on script)

sed -i "/^COPY .*$/a RUN chmod +x \/usr\/local\/bin\/aks-nvme-ssd-provisioner.sh" Dockerfile

 Build container locally and push container to your ACR.

docker build -t ${acr_name}.azurecr.io/aks-nvme-ssd-provisioner:v1.0.2 .
docker push ${acr_name}.azurecr.io/aks-nvme-ssd-provisioner:v1.0.2

 Modify storage-local-static-provisioner.yaml to point to the correct Azure container registry.

sed -i "s/ams0/${acr_name}.azurecr.io/g" ./manifests/storage-local-static-provisioner.yaml

 Modify the node label used to signal setting up and mounting the local NVMe SSD.

sed -i "s/kubernetes.azure.com\/aks-local-ssd/aks-local-ssd/g" ./manifests/storage-local-static-provisioner.yaml

 

Update NDm_v4 node pool with “aks-local-ssd=true” label.

az aks nodepool update --cluster-name <AKS_NAME> --resource-group <RG> --nodepool-name <NAME> --labels aks-local-ssd=true

NOTE:  You could also set the nodepool aks-local-ssd label when initially deploying the NDm_v4 AKS cluster

kubectl apply -f aks-nvme-ssd-provisioner/manifests/storage-local-static-provisioner.yaml

 Verify you can see the local NVMe persistent volume

kubectl get pv

NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
local-pv-5b19e64a   7095Gi     RWO            Delete           Available           local-storage            12m

Below we show how to use this local NVMe SSD and run an FIO I/O throughput benchmark.

 

Deploy and mount Azure Files via NFSv4

 

Enable the Azure Files CSI driver in your AKS cluster

az aks update -n <AKS_NAME> -g <RG> --enable-file-driver

Create a customized storage class for Azure Files + NFSv4 (nfs-sc.yaml)

 

 

 

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
provisioner: file.csi.azure.com
allowVolumeExpansion: true
parameters:
  protocol: nfs
mountOptions:
  - nconnect=4

 

 

 

kubectl apply -f nfs-sc.yaml

 

Verify you have created the Azure files NFSv4 Storage class

kubectl get sc

NAME                           PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
azurefile-csi-nfs              file.csi.azure.com             Delete          Immediate              true                   7s

Create an Azure files NFSv4 persistent volume claim (files-nfs-pvc.yaml)

 

 

 

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: files-nfs-pvc
spec:
  storageClassName: azurefile-csi-nfs
  accessModes:
    - "ReadWriteMany"
  resources:
    requests:
      storage: 100Gi
---

 

 

 

kubectl apply -f files-nfs-pvc.yaml

 Verify that the Azure files NFSv4 persistent volume claim is created

kubectl get pvc

NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
files-nfs-pvc   Bound    pvc-4786a03f-b2fc-44a9-b9e2-ac9ccd5c292c   100Gi      RWX            azurefile-csi-nfs              3m50s

 Note: Consult the AKS documentation to see other ways to mount Azure files in AKS

 

Below we give an example of how to run an FIO benchmark using this Azure Files NFSv4 persistent volume claim.

 

Run I/O benchmarks (FIO and IOR)

We will use FIO to test and validate the local NVMe SSD storage and Azure Files via NFSv4. The IOR benchmark will be used to test and validate the AMLFS storage.

 

Example of FIO yaml script to test the NDm_v4 local NVMe SSD

 

 

 

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fio-job1
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: fio
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  /workspace/fio/bin/fio --name=write_4G --directory=/scratch --direct=1 --size=4G --bs=4M --rw=write --group_reporting --numjobs=4 --runtime=300
                  /workspace/fio/bin/fio --name=read_4G --directory=/scratch --direct=1 --size=4G --bs=4M --rw=read --group_reporting --numjobs=4 --runtime=300
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: fio
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
              - name: scratch
                mountPath: /scratch
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
          - name: scratch
            hostPath:
              path: /pv-disks/scratch
              type: Directory
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
---

 

 

 

df -h (on NDm_v4 pod)

 

Filesystem      Size  Used Avail Use% Mounted on
overlay         117G   67G   50G  58% /
tmpfs            64M     0   64M   0% /dev
/dev/md0        7.0T   28K  7.0T   1% /scratch
/dev/root       117G   67G   50G  58% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           861G   12K  861G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           443G   12K  443G   1% /proc/driver/nvidia
tmpfs           178G  5.5M  178G   1% /run/nvidia-fabricmanager/socket
devtmpfs        443G     0  443G   0% /dev/nvidia0

 

FIO I/O write/read performance measured.

WRITE: bw=7710MiB/s (8085MB/s), 7710MiB/s-7710MiB/s (8085MB/s-8085MB/s), io=16.0GiB (17.2GB), run=2125-2125msec
READ: bw=9012MiB/s (9450MB/s), 9012MiB/s-9012MiB/s (9450MB/s-9450MB/s), io=16.0GiB (17.2GB), run=1818-1818msec
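The reported numbers can be cross-checked with a little arithmetic. The job writes --size=4G for each of --numjobs=4, i.e. 16 GiB in total, which matches io=16.0GiB. The sketch below converts MiB/s to MB/s and estimates the runtime from the total size and the bandwidth. The function names are hypothetical helpers, not part of fio.

```shell
#!/bin/bash
# Sanity-check helpers for the fio summary lines above (hypothetical names).

# Convert MiB/s to MB/s (1 MiB = 1,048,576 bytes; 1 MB = 1,000,000 bytes).
mib_to_mb() {
  awk -v mib="$1" 'BEGIN { printf "%.0f\n", mib * 1048576 / 1000000 }'
}

# Expected runtime in milliseconds to move $1 GiB at $2 MiB/s.
expected_runtime_ms() {
  awk -v gib="$1" -v bw="$2" 'BEGIN { printf "%.0f\n", gib * 1024 / bw * 1000 }'
}

echo "Write: $(mib_to_mb 7710) MB/s, about $(expected_runtime_ms 16 7710) ms for 16 GiB"
echo "Read:  $(mib_to_mb 9012) MB/s, about $(expected_runtime_ms 16 9012) ms for 16 GiB"
```

Both results agree with the fio output (8085 MB/s in 2125 ms for the write, 9450 MB/s in 1818 ms for the read), so the summary lines are internally consistent.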

 

Example of IOR yaml script to test AMLFS

 

 

 

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ior-job1
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  MPI_HOST=$(cat /etc/volcano/mpiworker.host | tr "\n" ",")
                  mkdir -p /var/run/sshd; /usr/sbin/sshd
                  echo "HOSTS: $MPI_HOST"
                  mpirun --allow-run-as-root -np 16 -npernode 8 --bind-to numa --map-by ppr:8:node -hostfile /etc/volcano/mpiworker.host /workspace/ior-3.2.1/bin/ior  -a POSIX -v -i 1 -B -m -d 1 -F -w -r -t 32m -b 2G -o /mnt/myamlfs/test | tee /home/re
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  cpu: 1
              volumeMounts:
              - mountPath: "/mnt/myamlfs"
                name: myamlfs
              - mountPath: /dev/shm
                name: shm
          restartPolicy: OnFailure
          volumes:
          - name: myamlfs
            persistentVolumeClaim:
             claimName: pvc-lustre
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: mpiworker
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
              - mountPath: "/mnt/myamlfs"
                name: myamlfs
              - mountPath: /dev/shm
                name: shm
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
          - name: myamlfs
            persistentVolumeClaim:
             claimName: pvc-lustre
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
---
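For orientation, the IOR command line in this job can be sized with simple arithmetic: 2 worker replicas with 8 ranks each give the 16 ranks passed to -np, and with -F (file per process) each rank writes one block of -b 2G in transfers of -t 32m. The sketch below assumes IOR's default segment count of 1, since no -s flag is passed. The function names are hypothetical.

```shell
#!/bin/bash
# Back-of-the-envelope sizing for the IOR run above (hypothetical helper names).
# Assumes the default IOR segment count of 1, as no -s flag is passed.

# Aggregate data in GiB: $1 = number of MPI ranks (-np), $2 = block size per rank in GiB (-b)
ior_aggregate_gib() {
  echo $(( $1 * $2 ))
}

# Transfers per rank: $1 = block size in MiB (-b), $2 = transfer size in MiB (-t)
ior_transfers_per_rank() {
  echo $(( $1 / $2 ))
}

echo "Aggregate data written: $(ior_aggregate_gib 16 2) GiB"
echo "Transfers per rank:     $(ior_transfers_per_rank 2048 32)"
```

Under these assumptions the run writes and then reads back 32 GiB in total, as 64 transfers of 32 MiB per rank.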

 

 

 

 

df -h (on NDm_v4 pod)

Filesystem              Size  Used Avail Use% Mounted on
overlay                 117G   68G   49G  58% /
tmpfs                    64M     0   64M   0% /dev
10.2.0.6@tcp:/lustrefs   16T  1.3M   16T   1% /mnt/myamlfs
tmpfs                   8.0G     0  8.0G   0% /dev/shm
tmpfs                   861G   16K  861G   1% /root/.ssh
/dev/root               117G   68G   49G  58% /etc/hosts
tmpfs                   861G   12K  861G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                   443G   12K  443G   1% /proc/driver/nvidia
tmpfs                   178G  9.1M  178G   1% /run/nvidia-fabricmanager/socket
devtmpfs                443G     0  443G   0% /dev/nvidia0

 IOR read/write I/O benchmark result 

Max Write: 2263.20 MiB/sec (2373.14 MB/sec)
Max Read:  1829.75 MiB/sec (1918.63 MB/sec)

Note: This is in line with the expected performance of the deployed 16 TiB AMLFS at 125 MB/s per TiB (maximum throughput ~2000 MB/s).
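The expected figure is simply the filesystem size multiplied by the per-TiB throughput. As a small worked example (with hypothetical function names), the sketch below computes it and expresses the measured IOR results as a percentage of that value.

```shell
#!/bin/bash
# Expected AMLFS throughput and measured-vs-expected ratio (hypothetical helper names).

# Expected MB/s: $1 = filesystem size in TiB, $2 = throughput in MB/s per TiB
amlfs_expected_mbps() {
  echo $(( $1 * $2 ))
}

# Measured throughput as a rounded percentage of expected: $1 = measured MB/s, $2 = expected MB/s
percent_of_expected() {
  awk -v m="$1" -v e="$2" 'BEGIN { printf "%.0f\n", m / e * 100 }'
}

expected=$(amlfs_expected_mbps 16 125)
echo "Expected:  ${expected} MB/s"
echo "Max write: $(percent_of_expected 2373.14 "$expected")% of expected"
echo "Max read:  $(percent_of_expected 1918.63 "$expected")% of expected"
```

With the numbers above this gives 2000 MB/s expected, with the write at roughly 119% and the read at roughly 96% of that value.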

 

Example FIO benchmark yaml file to test the Azure Files+NFSv4 filesystem.

 

 

 

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fio-files-job1
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: fio-files
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  /workspace/fio/bin/fio --name=write_4G --directory=/mnt/azurefiles --direct=1 --size=4G --bs=4M --rw=write --group_reporting --numjobs=4 --runtime=300
                  /workspace/fio/bin/fio --name=read_4G --directory=/mnt/azurefiles --direct=1 --size=4G --bs=4M --rw=read --group_reporting --numjobs=4 --runtime=300
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: fio-files
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
              - name: persistent-files
                mountPath: /mnt/azurefiles
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
          - name: persistent-files
            persistentVolumeClaim:
              claimName: files-nfs-pvc
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
---

 

 

 

df -h run from pod on NDm_v4.

Filesystem                                                                                                        Size  Used Avail Use% Mounted on
overlay                                                                                                           117G   68G   49G  58% /
tmpfs                                                                                                              64M     0   64M   0% /dev
fc48afa1c2d754f21873a54.file.core.windows.net:/fc48afa1c2d754f21873a54/pvcn-4786a03f-b2fc-44a9-b9e2-ac9ccd5c292c  100G     0  100G   0% /mnt/azurefiles
/dev/root                                                                                                         117G   68G   49G  58% /etc/hosts
shm                                                                                                                64M     0   64M   0% /dev/shm
tmpfs                                                                                                             861G   12K  861G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                             443G   12K  443G   1% /proc/driver/nvidia
tmpfs                                                                                                             178G  6.0M  178G   1% /run/nvidia-fabricmanager/socket
devtmpfs                                                                                                          443G     0  443G   0% /dev/nvidia0

 

FIO, write/read I/O throughput (100GiB Azure files share mounted via NFSv4)

WRITE: bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=16.0GiB (17.2GB), run=125570-125570msec
READ: bw=158MiB/s (166MB/s), 158MiB/s-158MiB/s (166MB/s-166MB/s), io=16.0GiB (17.2GB), run=103711-103711msec
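When comparing several storage options it can help to pull the bandwidth figure out of the fio summary lines automatically. The following is a minimal sketch, assuming the summary line has the shape shown above and reports bandwidth in MiB/s (fio may choose other units such as KiB/s or GiB/s, which this pattern does not handle). The function name is hypothetical.

```shell
#!/bin/bash
# Extract the bandwidth in MiB/s from fio summary lines read on stdin.
# Hypothetical helper; only handles lines of the form "WRITE: bw=<n>MiB/s ...".
fio_bw_mib() {
  sed -n 's/^ *[A-Z]*: bw=\([0-9.]*\)MiB\/s.*/\1/p'
}

echo "WRITE: bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=16.0GiB (17.2GB), run=125570-125570msec" | fio_bw_mib
```

Lines that do not match the pattern are silently skipped, so the full fio output can be piped through the helper.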

 

 Conclusion

Popular HPC/AI storage options can easily be consumed in NDm_v4 AKS cluster environments. In this blog post we demonstrated how to set up and consume local NVMe SSDs, AMLFS, and Azure Files + NFSv4, and validated these storage options by running FIO and IOR I/O benchmarks.

Last update: Jun 15 2023 01:40 PM