In a previous blog post we showed how to deploy an optimal NDm_v4 AKS cluster, i.e. one in which all 8 InfiniBand and 8 GPU devices on each NDm_v4 VM are installed correctly. We then verified that the NDm_v4 Kubernetes cluster was deployed and configured correctly by running a NCCL allreduce benchmark on 2 NDm_v4 VMs (16 A100 GPUs). Here we use that NDm_v4 AKS cluster as a starting point and show how to use popular Azure HPC/AI storage options (such as local NVMe SSDs, Azure Managed Lustre File System (AMLFS), and Azure Files + NFSv4) in this NDm_v4 AKS cluster.
We will use fio and IOR to test the various NDm_v4+AKS storage options.
IOR build script (build_ior.sh)
#!/bin/bash
APP_NAME=ior
PARALLEL_BUILD=8
IOR_VERSION=3.2.1
IOR_PACKAGE=ior-$IOR_VERSION.tar.gz
wget https://github.com/hpc/ior/releases/download/$IOR_VERSION/$IOR_PACKAGE
tar xvf $IOR_PACKAGE
rm $IOR_PACKAGE
cd ior-$IOR_VERSION
export CC=$(which mpicc)
./configure --prefix=$(pwd)
make -j ${PARALLEL_BUILD}
make install
FIO build script (build_fio.sh)
#!/bin/bash
APP_VERSION=3.22
PARALLEL_BUILD=4
apt-get install -y zlib1g-dev git
git clone https://github.com/axboe/fio.git
cd fio
git checkout tags/fio-${APP_VERSION}
./configure --prefix=`pwd`
make -j $PARALLEL_BUILD
make install
Dockerfile to build I/O tester container.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.03-py3
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && apt-get install -y build-essential infiniband-diags openssh-server kmod
COPY build_ior.sh .
RUN ./build_ior.sh
COPY build_fio.sh .
RUN ./build_fio.sh
Build container locally
docker build -t <ACR_NAME>.azurecr.io/<CONTAINER_NAME> .
NOTE: Choose a suitable <CONTAINER_NAME>
Push your container to your Azure container registry.
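If you have not already authenticated to your registry, log in first (this assumes the Azure CLI is installed and <ACR_NAME> is the registry used above):
az acr login --name <ACR_NAME>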
docker push <ACR_NAME>.azurecr.io/<CONTAINER_NAME>
AMLFS is assumed to be already deployed; see the AMLFS documentation for details.
In this example AMLFS is deployed in its own virtual network (<AMLFS_VNET>) and AKS is deployed in a separate virtual network (<AKS_VNET>, created by AKS with the default kubenet networking). To mount AMLFS in AKS, the two VNETs need to be peered so they can communicate with each other.
az network vnet peering create -n <PEER1_NAME> -g <AMLFS_RESOURCE_GROUP> --vnet-name <AMLFS_VNET> --allow-forwarded-traffic --allow-vnet-access --remote-vnet /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<AKS_RESOURCE_GROUP>/providers/Microsoft.Network/virtualNetworks/<AKS_VNET>
az network vnet peering create -n <PEER2_NAME> -g <AKS_RESOURCE_GROUP> --vnet-name <AKS_VNET> --allow-forwarded-traffic --allow-vnet-access --remote-vnet /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<AMLFS_RESOURCE_GROUP>/providers/Microsoft.Network/virtualNetworks/<AMLFS_VNET>
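You can quickly verify that both peerings are established (using the same placeholder names as above); both commands should report Connected.
az network vnet peering show -g <AMLFS_RESOURCE_GROUP> --vnet-name <AMLFS_VNET> -n <PEER1_NAME> --query peeringState -o tsv
az network vnet peering show -g <AKS_RESOURCE_GROUP> --vnet-name <AKS_VNET> -n <PEER2_NAME> --query peeringState -o tsv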
Get the AMLFS CSI driver repository.
git clone https://github.com/kubernetes-sigs/azurelustre-csi-driver.git
Install the CSI driver on the NDm_v4 AKS cluster (kubenet networking)
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurelustre-csi-driver/main/deploy/install-driver.sh | bash
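As a quick sanity check, confirm the azurelustre CSI driver pods are running in kube-system (pod names may vary by driver version):
kubectl get pods -n kube-system | grep azurelustre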
From the AMLFS CSI driver repository, edit docs/examples/storageclass_existing_lustre.yaml and update the internal Lustre filesystem name and the MGS IP address (in the Azure portal, under your deployed AMLFS, see "Client connection" to get these values).
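As a rough illustration, the edit can be scripted as below. This is a hedged sketch: the placeholder strings (EXISTING_LUSTRE_FS_NAME, EXISTING_LUSTRE_IP_ADDRESS) are assumptions about the example file and may differ between driver versions, so check your copy of storageclass_existing_lustre.yaml; the values lustrefs and 10.2.0.6 are the ones used later in this post.
cd azurelustre-csi-driver/docs/examples
# Substitute your AMLFS internal filesystem name and MGS IP address (placeholder names assumed)
sed -i 's/EXISTING_LUSTRE_FS_NAME/lustrefs/g; s/EXISTING_LUSTRE_IP_ADDRESS/10.2.0.6/g' storageclass_existing_lustre.yaml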
kubectl create -f storageclass_existing_lustre.yaml
Edit docs/examples/pvc_storageclass.yaml and make sure it requests the correct storage size (storage: 16Ti, matching the deployed AMLFS capacity).
kubectl create -f pvc_storageclass.yaml
Check the persistent volume claim for AMLFS
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pvc-lustre Bound pvc-26081f53-bb6d-48f7-b9c3-51982c97ed68 16Ti RWX sc.azurelustre.csi.azure.com 11s
NOTE: Consult the Azure AMLFS documentation to explore other ways to mount AMLFS in AKS.
See below how we use/mount this AMLFS persistent volume claim and run an IOR I/O benchmark.
Each NDm_v4 VM includes 8 local NVMe SSDs (about 7 TB in total), which need to be configured before use (e.g. assembled into a RAID 0 array, formatted with an ext4 filesystem, and mounted). We use the aks-nvme-ssd-provisioner project to automate this.
git clone https://github.com/ams0/aks-nvme-ssd-provisioner.git
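For reference, the provisioner script does something roughly equivalent to the following (a simplified, hypothetical sketch; the actual aks-nvme-ssd-provisioner.sh in the repository is authoritative and handles device discovery and edge cases):
# Simplified sketch of what the provisioner automates on each NDm_v4 node (not the actual script)
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme*n1   # stripe the 8 NVMe devices into RAID 0
mkfs.ext4 /dev/md0                                                # create an ext4 filesystem on the array
mkdir -p /pv-disks/scratch
mount /dev/md0 /pv-disks/scratch                                  # mount point consumed by the FIO job later in this post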
Modify the local NVMe SSD mount point by editing aks-nvme-ssd-provisioner.sh
sed -i "s/\/pv-disks\/\$UUID/\/pv-disks\/scratch/g" aks-nvme-ssd-provisioner.sh
Modify the Dockerfile (to set execute permissions on the script)
sed -i "/^COPY .*$/a RUN chmod +x \/usr\/local\/bin\/aks-nvme-ssd-provisioner.sh" Dockerfile
Build container locally and push container to your ACR.
docker build -t ${acr_name}.azurecr.io/aks-nvme-ssd-provisioner:v1.0.2 .
docker push ${acr_name}.azurecr.io/aks-nvme-ssd-provisioner:v1.0.2
Modify storage-local-static-provisioner.yaml to point to the correct Azure container registry.
sed -i "s/ams0/${acr_name}.azurecr.io/g" ./manifests/storage-local-static-provisioner.yaml
Modify the node label used to signal setting up and mounting the local NVMe SSD.
sed -i "s/kubernetes.azure.com\/aks-local-ssd/aks-local-ssd/g" ./manifests/storage-local-static-provisioner.yaml
Update the NDm_v4 node pool with the "aks-local-ssd=true" label.
az aks nodepool update --cluster-name <AKS_NAME> --resource-group <RG> --name <NODEPOOL_NAME> --labels aks-local-ssd=true
NOTE: You could also set the aks-local-ssd label on the node pool when initially deploying the NDm_v4 AKS cluster.
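Verify that the label was applied to the NDm_v4 nodes before deploying the provisioner:
kubectl get nodes --show-labels | grep aks-local-ssd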
kubectl apply -f aks-nvme-ssd-provisioner/manifests/storage-local-static-provisioner.yaml
Verify you can see the local NVMe persistent volume
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-5b19e64a 7095Gi RWO Delete Available local-storage 12m
Below we show how to use this local NVMe SSD and run an FIO I/O throughput benchmark.
Enable the Azure Files CSI driver in your AKS cluster
az aks update -n <AKS_NAME> -g <RG> --enable-file-driver
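You can confirm the Azure Files CSI driver pods are running in kube-system:
kubectl get pods -n kube-system | grep csi-azurefile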
Create a customized storage class for Azure Files + NFSv4 (nfs-sc.yaml)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
provisioner: file.csi.azure.com
allowVolumeExpansion: true
parameters:
  protocol: nfs
mountOptions:
  - nconnect=4
kubectl apply -f nfs-sc.yaml
Verify that you have created the Azure Files NFSv4 storage class
kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
azurefile-csi-nfs file.csi.azure.com Delete Immediate true 7s
Create an Azure Files NFSv4 persistent volume claim (files-nfs-pvc.yaml)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: files-nfs-pvc
spec:
  storageClassName: azurefile-csi-nfs
  accessModes:
    - "ReadWriteMany"
  resources:
    requests:
      storage: 100Gi
---
kubectl apply -f files-nfs-pvc.yaml
Verify that the Azure Files NFSv4 persistent volume claim has been created
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
files-nfs-pvc Bound pvc-4786a03f-b2fc-44a9-b9e2-ac9ccd5c292c 100Gi RWX azurefile-csi-nfs 3m50s
Note: Consult the AKS documentation to see other ways to mount Azure Files in AKS.
Below we give an example of how to run an FIO benchmark using this Azure Files NFSv4 persistent volume claim.
We will use FIO to test/validate the local NVMe SSD storage and Azure Files via NFSv4; the IOR benchmark will be used to test/validate AMLFS storage.
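The Volcano job specs below are submitted and monitored with standard kubectl commands; a quick sketch (the YAML file name here is a placeholder for wherever you save the job spec):
kubectl create -f fio_local_nvme.yaml      # placeholder file name for the FIO job spec below
kubectl get pods                           # wait for the job pod to reach Running
kubectl logs -f fio-job1-fio-0             # Volcano pods are typically named <job>-<task>-<index>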
Example of FIO yaml script to test the NDm_v4 local NVMe SSD
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fio-job1
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: fio
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  /workspace/fio/bin/fio --name=write_4G --directory=/scratch --direct=1 --size=4G --bs=4M --rw=write --group_reporting --numjobs=4 --runtime=300
                  /workspace/fio/bin/fio --name=read_4G --directory=/scratch --direct=1 --size=4G --bs=4M --rw=read --group_reporting --numjobs=4 --runtime=300
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: fio
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
                - name: scratch
                  mountPath: /scratch
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
            - name: scratch
              hostPath:
                path: /pv-disks/scratch
                type: Directory
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
---
df -h (on NDm_v4 pod)
Filesystem Size Used Avail Use% Mounted on
overlay 117G 67G 50G 58% /
tmpfs 64M 0 64M 0% /dev
/dev/md0 7.0T 28K 7.0T 1% /scratch
/dev/root 117G 67G 50G 58% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 861G 12K 861G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 443G 12K 443G 1% /proc/driver/nvidia
tmpfs 178G 5.5M 178G 1% /run/nvidia-fabricmanager/socket
devtmpfs 443G 0 443G 0% /dev/nvidia0
FIO I/O write/read performance measured.
WRITE: bw=7710MiB/s (8085MB/s), 7710MiB/s-7710MiB/s (8085MB/s-8085MB/s), io=16.0GiB (17.2GB), run=2125-2125msec
READ: bw=9012MiB/s (9450MB/s), 9012MiB/s-9012MiB/s (9450MB/s-9450MB/s), io=16.0GiB (17.2GB), run=1818-1818msec
Example of IOR yaml script to test AMLFS
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ior-job1
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  MPI_HOST=$(cat /etc/volcano/mpiworker.host | tr "\n" ",")
                  mkdir -p /var/run/sshd; /usr/sbin/sshd
                  echo "HOSTS: $MPI_HOST"
                  mpirun --allow-run-as-root -np 16 -npernode 8 --bind-to numa --map-by ppr:8:node -hostfile /etc/volcano/mpiworker.host /workspace/ior-3.2.1/bin/ior -a POSIX -v -i 1 -B -m -d 1 -F -w -r -t 32m -b 2G -o /mnt/myamlfs/test | tee /home/re
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  cpu: 1
              volumeMounts:
                - mountPath: "/mnt/myamlfs"
                  name: myamlfs
                - mountPath: /dev/shm
                  name: shm
          restartPolicy: OnFailure
          volumes:
            - name: myamlfs
              persistentVolumeClaim:
                claimName: pvc-lustre
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: mpiworker
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
                - mountPath: "/mnt/myamlfs"
                  name: myamlfs
                - mountPath: /dev/shm
                  name: shm
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
            - name: myamlfs
              persistentVolumeClaim:
                claimName: pvc-lustre
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
---
df -h (on NDm_v4 pod)
Filesystem Size Used Avail Use% Mounted on
overlay 117G 68G 49G 58% /
tmpfs 64M 0 64M 0% /dev
10.2.0.6@tcp:/lustrefs 16T 1.3M 16T 1% /mnt/myamlfs
tmpfs 8.0G 0 8.0G 0% /dev/shm
tmpfs 861G 16K 861G 1% /root/.ssh
/dev/root 117G 68G 49G 58% /etc/hosts
tmpfs 861G 12K 861G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 443G 12K 443G 1% /proc/driver/nvidia
tmpfs 178G 9.1M 178G 1% /run/nvidia-fabricmanager/socket
devtmpfs 443G 0 443G 0% /dev/nvidia0
IOR read/write I/O benchmark result
Max Write: 2263.20 MiB/sec (2373.14 MB/sec)
Max Read: 1829.75 MiB/sec (1918.63 MB/sec)
Note: This is in line with the expected AMLFS performance of 125 MB/s per TiB; for the 16 TiB AMLFS deployed here, that corresponds to a maximum throughput of ~2000 MB/s.
Example FIO benchmark yaml file to test the Azure Files+NFSv4 filesystem.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fio-files-job1
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: fio-files
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  /workspace/fio/bin/fio --name=write_4G --directory=/mnt/azurefiles --direct=1 --size=4G --bs=4M --rw=write --group_reporting --numjobs=4 --runtime=300
                  /workspace/fio/bin/fio --name=read_4G --directory=/mnt/azurefiles --direct=1 --size=4G --bs=4M --rw=read --group_reporting --numjobs=4 --runtime=300
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: fio-files
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
                - name: persistent-files
                  mountPath: /mnt/azurefiles
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
            - name: persistent-files
              persistentVolumeClaim:
                claimName: files-nfs-pvc
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
---
df -h run from pod on NDm_v4.
Filesystem Size Used Avail Use% Mounted on
overlay 117G 68G 49G 58% /
tmpfs 64M 0 64M 0% /dev
fc48afa1c2d754f21873a54.file.core.windows.net:/fc48afa1c2d754f21873a54/pvcn-4786a03f-b2fc-44a9-b9e2-ac9ccd5c292c 100G 0 100G 0% /mnt/azurefiles
/dev/root 117G 68G 49G 58% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 861G 12K 861G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 443G 12K 443G 1% /proc/driver/nvidia
tmpfs 178G 6.0M 178G 1% /run/nvidia-fabricmanager/socket
devtmpfs 443G 0 443G 0% /dev/nvidia0
FIO write/read I/O throughput (100 GiB Azure Files share mounted via NFSv4)
WRITE: bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=16.0GiB (17.2GB), run=125570-125570msec
READ: bw=158MiB/s (166MB/s), 158MiB/s-158MiB/s (166MB/s-166MB/s), io=16.0GiB (17.2GB), run=103711-103711msec
All of these popular HPC/AI storage options can easily be consumed in NDm_v4 AKS cluster environments. In this blog post we demonstrated how to set up and consume local NVMe SSDs, AMLFS, and Azure Files + NFSv4, and validated these storage options by running FIO and IOR I/O benchmarks.