In the realm of AI workloads, ensuring the health and stability of compute nodes is critical. Training large AI models often spans months and relies on AI supercomputers equipped with high-end GPUs such as the NVIDIA A100 or H100, interconnected via InfiniBand for efficient communication. These training workloads are complex and interdependent, with frequent updates and communication handled by NCCL collectives. This complexity also brings challenges: any failure, such as a dropped GPU or an InfiniBand link flap, can terminate the job and force a restart from the last checkpoint.
In traditional HPC schedulers such as SLURM, job prologs are employed to execute scripts before the main job begins. These scripts are often used by customers to perform health checks before launching their workloads. Similarly, in Kubernetes, init containers serve as an effective mechanism for conducting pre-job checks. Init containers execute before the main application container within a pod, enabling the execution of health checks.
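As a sketch, the gate that a SLURM prolog (or an init container's entrypoint) implements looks like the following. The `check_node_health` function is a hypothetical stand-in for a real check suite such as azurehpc-health-checks; it is stubbed here so the control flow is runnable end to end.

```shell
#!/bin/bash
# Prolog-style health gate (sketch). check_node_health is a placeholder
# for a real check suite; a real prolog would invoke nhc or similar here.
check_node_health() {
    return 0   # stub: pretend all checks passed
}

if check_node_health; then
    STATUS="released"
    echo "node healthy: releasing job"
else
    STATUS="blocked"
    echo "node unhealthy: blocking job" >&2
    exit 1   # a non-zero exit keeps the job off this node
fi
```

The same shape carries over to Kubernetes: an init container that exits non-zero prevents the pod's main containers from ever starting.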
Ensuring healthy nodes has been a challenge on Azure for both traditional HPC and AI workloads. Due to their necessity, we have a standard set of tests for GPU/IB VMs on Azure that is published in the azurehpc-health-checks repository on GitHub. These health checks are now included on our Azure HPC images, and they are integrated and can automatically run on node startup for CycleCloud with SLURM or as a pre-job health check on Azure Machine Learning. The health checks are also distributed as a container, aznhc-nv, available on the Microsoft Artifact Registry.
Despite these advancements, we do not yet have a published solution for running these health checks on Azure Kubernetes Service (AKS). This blog post remedies that gap by providing a step-by-step guide on how to run pre-job health checks on AKS, ensuring your AI/HPC workloads run smoothly and efficiently from the start.
Note: this guide is specifically targeting the H100 GPU VMs on Azure (Standard_ND96isr_H100_v5). The healthcheck config file will need adaptation for other VM types.
First, create the necessary files for your Docker image.
Dockerfile
FROM mcr.microsoft.com/aznhc/aznhc-nv:latest
RUN cd /usr/local/bin \
&& curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
&& chmod +x kubectl
COPY ndv5.conf /azure-nhc/conf/aznhc.conf
COPY run-healthcheck.sh /azure-nhc/run-healthcheck.sh
RUN chmod +x /azure-nhc/run-healthcheck.sh
ENTRYPOINT ["/azure-nhc/run-healthcheck.sh"]
ndv5.conf
#######################################################################
###
### Hardware checks
###
* || check_hw_cpuinfo 2 96 96
* || check_hw_physmem 1915071MB 1915071MB 5%
* || check_hw_swap 0kB 0kB 3%
* || check_hw_ib 400 mlx5_0:1
* || check_hw_ib 400 mlx5_1:1
* || check_hw_ib 400 mlx5_2:1
* || check_hw_ib 400 mlx5_3:1
* || check_hw_ib 400 mlx5_4:1
* || check_hw_ib 400 mlx5_5:1
* || check_hw_ib 400 mlx5_6:1
* || check_hw_ib 400 mlx5_7:1
* || check_hw_eth lo
* || check_hw_eth eth0
* || check_hw_topology
#######################################################################
####
#### GPU checks
####
* || check_gpu_count 8
* || check_nvsmi_healthmon
* || check_gpu_xid
* || check_gpu_bw 52 350
* || check_gpu_ecc 20000000 10000
* || check_gpu_clock_throttling
* || check_nccl_allreduce 460.0 1 /azure-nhc/topofiles/ndv5-topo.xml 16G
* || check_nvlink_status
#######################################################################
####
#### Additional IB checks
####
* || check_ib_bw_gdr 380
* || check_ib_link_flapping 6
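Each line of this file uses LBNL NHC's match/check format: the field before `||` is a hostname match pattern (`*` applies the check on every node), and the remainder is the check to run with its arguments. When adapting the file for another VM size, keep the format and adjust the argument values (CPU counts, memory sizes, expected bandwidths, device names) to that SKU's specification:

```
# <hostname match> || <check> <arguments>
# "*" runs the check on every node; any failing check marks the node unhealthy.
* || check_gpu_count 8
```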
run-healthcheck.sh
#!/bin/bash
# Run NHC with the mounted configuration; results are written to the log file.
CONF_FILE=/azure-nhc/conf/aznhc.conf
LOG_FILE=/azure-nhc/aznhc.log
nhc DETACHED_MODE=0 CONFFILE="$CONF_FILE" LOGFILE="$LOG_FILE" TIMEOUT=300
# Annotate the node with the test results
kubectl annotate node "$NODE_NAME" aznhc-results="$(<"$LOG_FILE")" --overwrite
# If any check failed, taint the node so new workloads stay off it
if grep -q "ERROR: nhc: Health check failed:" "$LOG_FILE"; then
    kubectl taint nodes "$NODE_NAME" aznhc=failed:NoExecute --overwrite
    exit 1
fi
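The pass/fail logic above keys off the error string NHC writes to its log. You can simulate that detection locally with a fabricated log line, with no GPU node or cluster required:

```shell
#!/bin/bash
# Simulate the init container's pass/fail detection with a fabricated NHC log.
LOG_FILE=$(mktemp)
echo "ERROR: nhc: Health check failed: check_gpu_count" > "$LOG_FILE"

if grep -q "ERROR: nhc: Health check failed:" "$LOG_FILE"; then
    RESULT="unhealthy"
    echo "unhealthy: the node would be tainted aznhc=failed:NoExecute"
else
    RESULT="healthy"
    echo "healthy: the init container exits 0 and the workload starts"
fi
rm -f "$LOG_FILE"
```

On a healthy node the error string is absent, the script exits 0, and the pod's main containers are allowed to start.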
Build and push your Docker image (log in to your Azure Container Registry first):
export ACR_NAME=<your-acr-name>
az acr login --name $ACR_NAME
docker build -t $ACR_NAME.azurecr.io/aks-healthcheck:latest .
docker push $ACR_NAME.azurecr.io/aks-healthcheck:latest
Create a serviceaccount.yaml file to define the necessary Kubernetes service account and role bindings.
serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aksnhc-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aksnhc-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aksnhc-rolebinding
subjects:
- kind: ServiceAccount
  name: aksnhc-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: aksnhc-role
  apiGroup: rbac.authorization.k8s.io
Apply the configuration:
kubectl apply -f serviceaccount.yaml
Create a healthcheck-job.yaml file to define a Kubernetes Job that executes the health checks in an init container. This approach works for both standard and Volcano-scheduled Jobs. If the init container's health checks fail, the node is tainted with aznhc=failed:NoExecute. This taint prevents new workloads from being scheduled on the node and evicts the current Job's pod, forcing it to restart on a healthy node.
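Because the taint's effect is NoExecute, any pod without a matching toleration is evicted from the node. If you want a diagnostic pod to stay on (or land on) a tainted node for troubleshooting, its pod spec would need a toleration such as:

```yaml
# Pod spec fragment: tolerate the health-check failure taint (debug pods only)
tolerations:
- key: "aznhc"
  operator: "Equal"
  value: "failed"
  effect: "NoExecute"
```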
healthcheck-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aks-healthcheck-job
spec:
  completions: $NUM_NODES
  parallelism: $NUM_NODES
  completionMode: Indexed
  ttlSecondsAfterFinished: 300
  template:
    spec:
      serviceAccountName: aksnhc-sa
      initContainers:
      - name: healthcheck
        image: $ACR_NAME.azurecr.io/aks-healthcheck:latest
        imagePullPolicy: Always
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - mountPath: /dev/shm
          name: shmem
        - mountPath: /azure-nhc/syslog
          name: syslog-volume
          readOnly: true
        resources:
          requests:
            nvidia.com/gpu: 8
            nvidia.com/mlnxnics: 8
          limits:
            nvidia.com/gpu: 8
            nvidia.com/mlnxnics: 8
      containers:
      - name: main
        image: busybox
        command: ['sh', '-c', 'echo "run torchrun or workload here..."']
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]
        volumeMounts:
        - mountPath: /dev/shm
          name: shmem
        resources:
          requests:
            nvidia.com/gpu: 8
            nvidia.com/mlnxnics: 8
          limits:
            nvidia.com/gpu: 8
            nvidia.com/mlnxnics: 8
      restartPolicy: Never
      volumes:
      - name: shmem
        emptyDir:
          medium: Memory
          sizeLimit: 128Gi
      - name: syslog-volume
        hostPath:
          path: /var/log/syslog
          type: File
Apply the job configuration:
export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl apply -f -
After the job completes, each node's results are available in its aznhc-results annotation:
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.aznhc-results}'
To clean up the resources created for the health checks, delete the job and the service account resources:
export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl delete -f -
kubectl delete -f serviceaccount.yaml
If a node was tainted by a failed check, remove the taint once the underlying issue is resolved (note the trailing "-", which removes the taint):
kubectl taint nodes <node-name> aznhc=failed:NoExecute-
By following these steps, you can effectively run health checks as an init container on your AKS nodes. This ensures your nodes meet the required health standards before your application pods are scheduled, improving the reliability and performance of your applications.