Pre-Job Health Checks on AKS: A Guide to Stable AI Workloads
Published Jul 19 2024 06:53 AM
Microsoft


Introduction

 

In the realm of AI workloads, ensuring the health and stability of compute nodes is critical. Training large AI models often spans months and relies on advanced AI supercomputers equipped with high-end GPUs such as the NVIDIA A100 or H100, interconnected via InfiniBand for efficient communication. These training workloads are complex and tightly coupled: every node exchanges frequent updates with the others through NCCL collective communication. That coupling also makes them fragile, because a single failure in a GPU or InfiniBand link (for example, a dropped GPU or an InfiniBand link flap) can terminate the whole job and force a restart from the last checkpoint.

 

In traditional HPC schedulers such as SLURM, job prologs run scripts before the main job begins, and customers often use them to perform health checks before launching their workloads. In Kubernetes, init containers serve the same purpose: they run to completion before the main application container in a pod starts, which makes them an effective place for pre-job health checks.

Keeping nodes healthy has been a challenge on Azure for both traditional HPC and AI workloads. To address this, we publish a standard set of tests for GPU/IB VMs on Azure in the azurehpc-health-checks repository on GitHub. These health checks are now included in our Azure HPC images and can run automatically on node startup for CycleCloud with SLURM or as a pre-job health check on Azure Machine Learning. They are also distributed as a container, aznhc-nv, available on the Microsoft Artifact Registry.

 

Despite these advancements, we have not yet published a solution for running these health checks on Azure Kubernetes Service (AKS). This blog post closes that gap with a step-by-step guide to running pre-job health checks on AKS, so that your AI/HPC workloads start on healthy nodes.

 

Prerequisites

 

  1. AKS Cluster: You should have an AKS cluster set up.
  2. kubectl: Ensure kubectl is installed and configured to interact with your AKS cluster.
  3. Docker: Have Docker installed to build the Docker image.
  4. Azure Container Registry (ACR): Set up an ACR to store the Docker image.

Note: This guide specifically targets the H100 GPU VMs on Azure (Standard_ND96isr_H100_v5). The health check config file will need to be adapted for other VM types.

 

Step 1: Build the Docker Image

 

First, create the necessary files for your Docker image.

 

Dockerfile

FROM mcr.microsoft.com/aznhc/aznhc-nv:latest

RUN cd /usr/local/bin \
    && curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && chmod +x kubectl

COPY ndv5.conf /azure-nhc/conf/aznhc.conf

COPY run-healthcheck.sh /azure-nhc/run-healthcheck.sh
RUN chmod +x /azure-nhc/run-healthcheck.sh
ENTRYPOINT ["/azure-nhc/run-healthcheck.sh"]

 

ndv5.conf

#######################################################################
###
### Hardware checks
###
* || check_hw_cpuinfo 2 96 96
* || check_hw_physmem 1915071MB 1915071MB 5%
* || check_hw_swap 0kB 0kB 3%
* || check_hw_ib 400 mlx5_0:1
* || check_hw_ib 400 mlx5_1:1
* || check_hw_ib 400 mlx5_2:1
* || check_hw_ib 400 mlx5_3:1
* || check_hw_ib 400 mlx5_4:1
* || check_hw_ib 400 mlx5_5:1
* || check_hw_ib 400 mlx5_6:1
* || check_hw_ib 400 mlx5_7:1
* || check_hw_eth lo
* || check_hw_eth eth0
* || check_hw_topology

#######################################################################
####
#### GPU checks
####
* || check_gpu_count 8
* || check_nvsmi_healthmon
* || check_gpu_xid
* || check_gpu_bw 52 350
* || check_gpu_ecc 20000000 10000
* || check_gpu_clock_throttling
* || check_nccl_allreduce 460.0 1 /azure-nhc/topofiles/ndv5-topo.xml 16G
* || check_nvlink_status


#######################################################################
####
#### Additional IB checks
####
* || check_ib_bw_gdr 380
* || check_ib_link_flapping 6

 

run-healthcheck.sh

#!/bin/bash

CONF_FILE=/azure-nhc/conf/aznhc.conf
LOG_FILE=/azure-nhc/aznhc.log

# Run the node health checks in the foreground and write results to the log
nhc DETACHED_MODE=0 CONFFILE="$CONF_FILE" LOGFILE="$LOG_FILE" TIMEOUT=300

# Annotate node with test results
kubectl annotate node "$NODE_NAME" aznhc-results="$(<"$LOG_FILE")" --overwrite

# Taint the node and fail the init container if any check failed
if grep -q "ERROR:  nhc:  Health check failed:" "$LOG_FILE"; then
    kubectl taint nodes "$NODE_NAME" aznhc=failed:NoExecute --overwrite
    exit 1
fi
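
The script above stores the whole log in a node annotation and looks for a single failure marker. As a minimal sketch of how those two steps could be made easier to reuse, the helpers below print only the failing lines and trim the log before it is annotated. The function names nhc_failures and nhc_log_tail and the 64 KiB default are hypothetical and not part of the original image; the only firm assumptions are the failure marker used in run-healthcheck.sh and the 256 KiB total size limit that Kubernetes places on an object's annotations.

```shell
#!/bin/bash
# Hypothetical helpers (not part of aznhc-nv); shown as a sketch only.

# Print every line of the NHC log that reports a failed check.
# Uses the same fixed-string marker that run-healthcheck.sh greps for.
nhc_failures() {
    local log_file="$1"
    grep -F "ERROR:  nhc:  Health check failed:" "$log_file"
}

# Print at most max_bytes from the end of the log.
# Kubernetes limits all annotations on one object to 256 KiB in total,
# so a long log may need trimming before it is stored on the node.
nhc_log_tail() {
    local log_file="$1"
    local max_bytes="${2:-65536}"   # assumed default of 64 KiB
    tail -c "$max_bytes" "$log_file"
}
```

Inside run-healthcheck.sh, the annotation value could then be built with "$(nhc_log_tail "$LOG_FILE")" instead of the full file, if log size turns out to be a concern.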

 

Build and push your Docker image:

export ACR_NAME=<your-acr-name>
docker build -t $ACR_NAME.azurecr.io/aks-healthcheck:latest .
docker push $ACR_NAME.azurecr.io/aks-healthcheck:latest
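
Pushing to ACR requires an authenticated Docker session, and the AKS cluster needs permission to pull from the registry. The sketch below wraps those steps in functions, assuming the Azure CLI is installed and you are already signed in. The function names and the cluster name and resource group arguments are placeholders introduced for illustration and do not appear in the original guide.

```shell
#!/bin/bash
# Hypothetical wrappers around the build and push commands; a sketch only.

# Compose the image reference used throughout this guide.
image_ref() {
    local acr="$1"
    local tag="${2:-latest}"
    echo "${acr}.azurecr.io/aks-healthcheck:${tag}"
}

# Log in to the registry, then build and push the image.
# Stops at the first step that fails.
build_and_push() {
    local acr="$1"
    local tag="${2:-latest}"
    local ref
    ref="$(image_ref "$acr" "$tag")"
    az acr login --name "$acr" &&
        docker build -t "$ref" . &&
        docker push "$ref"
}

# Grant an AKS cluster pull access to the registry.
# Arguments: cluster name, resource group, ACR name (all placeholders).
attach_acr() {
    az aks update --name "$1" --resource-group "$2" --attach-acr "$3"
}
```

For example, build_and_push "$ACR_NAME" reproduces the two docker commands above after logging in, and attach_acr only needs to be run once per cluster.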

 

Step 2: Create Service Account and Role Bindings

 

Create a serviceaccount.yaml file to define the necessary Kubernetes service account and role bindings.

 

serviceaccount.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aksnhc-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aksnhc-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aksnhc-rolebinding
subjects:
- kind: ServiceAccount
  name: aksnhc-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: aksnhc-role
  apiGroup: rbac.authorization.k8s.io

 

Apply the configuration:

kubectl apply -f serviceaccount.yaml
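
Before running the Job, it can be worth confirming that the service account really has the node permissions the health check script relies on. The following is a minimal sketch using kubectl auth can-i with impersonation. It assumes your own account is allowed to impersonate service accounts (cluster administrators normally are), and the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical RBAC check for the service account defined above; a sketch only.

# Build the impersonation subject for a service account.
sa_subject() {
    echo "system:serviceaccount:${1}:${2}"
}

# Return 0 only if the service account may get and patch nodes,
# the two verbs granted by the aksnhc-role ClusterRole.
check_node_rbac() {
    local subject verb
    subject="$(sa_subject "${1:-default}" "${2:-aksnhc-sa}")"
    for verb in get patch; do
        if ! kubectl auth can-i "$verb" nodes --as="$subject" >/dev/null; then
            echo "missing permission: $verb nodes" >&2
            return 1
        fi
    done
}
```

Calling check_node_rbac default aksnhc-sa after applying serviceaccount.yaml should return successfully if the ClusterRoleBinding took effect.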

 

Step 3: Running the Job

 

Create a healthcheck-job.yaml file to define a Kubernetes Job that executes the health checks as an init container. This approach can be applied to both standard and Volcano-scheduled Jobs. If the init container fails its health checks, it taints the node with aznhc=failed:NoExecute and exits with a non-zero status. The taint prevents new workloads from being scheduled on the node and evicts pods already running there that do not tolerate it, so the Job's failed pod is replaced by a new one on a healthy node.

 

healthcheck-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: aks-healthcheck-job
spec:
  completions: $NUM_NODES
  parallelism: $NUM_NODES
  completionMode: Indexed
  ttlSecondsAfterFinished: 300
  template:
    spec:
      serviceAccountName: aksnhc-sa
      initContainers:
        - name: healthcheck
          image: $ACR_NAME.azurecr.io/aks-healthcheck:latest
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - mountPath: /dev/shm
              name: shmem
            - mountPath: /azure-nhc/syslog
              name: syslog-volume
              readOnly: true 
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
            limits:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
      containers:
        - name: main
          image: busybox
          command: ['sh', '-c', 'echo "run torchrun or workload here..."']
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
            - mountPath: /dev/shm
              name: shmem
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
            limits:
              nvidia.com/gpu: 8
              nvidia.com/mlnxnics: 8
      restartPolicy: Never
      volumes:
        - name: shmem
          emptyDir:
            medium: Memory
            sizeLimit: 128Gi  
        - name: syslog-volume
          hostPath:
            path: /var/log/syslog
            type: File

 

Apply the job configuration:

export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl apply -f -
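
Once the Job has run, the results live on the nodes themselves: the log is in the aznhc-results annotation and failed nodes carry the aznhc taint. The helpers below are a sketch of how you might read them back. The function names are hypothetical; the annotation key and taint key are the ones set by run-healthcheck.sh.

```shell
#!/bin/bash
# Hypothetical helpers for inspecting health check results; a sketch only.

# Print the health check log stored on one node.
show_nhc_results() {
    kubectl get node "$1" -o jsonpath='{.metadata.annotations.aznhc-results}'
}

# Read a "NAME TAINTS" table on stdin (taint keys comma-separated) and
# print the names of nodes whose taint keys include "aznhc".
filter_failed() {
    awk 'NR > 1 {
        n = split($2, keys, ",")
        for (i = 1; i <= n; i++) {
            if (keys[i] == "aznhc") { print $1; break }
        }
    }'
}

# List every node that currently carries the aznhc taint.
list_failed_nodes() {
    kubectl get nodes \
        -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key' |
        filter_failed
}
```

Running list_failed_nodes and then show_nhc_results on each name it prints gives a quick view of which nodes failed and why.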

 

Step 4: Cleaning Up

 

To clean up the resources created for the health checks, you can delete the job and the service account resources:

 

export ACR_NAME=<your-acr-name>
export NUM_NODES=<number-of-nodes>
envsubst < healthcheck-job.yaml | kubectl delete -f -
kubectl delete -f serviceaccount.yaml
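
Deleting the Job and service account does not remove the taint or annotation that a failed health check left on a node. After a node has been repaired or replaced, something like the following sketch could be used to clear that state. The function name is hypothetical; the trailing hyphen is the standard kubectl syntax for removing a taint or annotation.

```shell
#!/bin/bash
# Hypothetical helper for clearing health check state from a node; a sketch only.

# Remove the aznhc taint and the aznhc-results annotation from one node.
# Both removals are attempted; returns non-zero if either one fails.
clear_nhc_state() {
    local node="$1"
    local rc=0
    kubectl taint nodes "$node" aznhc=failed:NoExecute- || rc=1
    kubectl annotate node "$node" aznhc-results- || rc=1
    return "$rc"
}
```

For example, clear_nhc_state followed by a node name makes that node schedulable again for pods that do not tolerate the aznhc taint.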

 

Conclusion

 

By following these steps, you can run health checks as an init container on your AKS nodes. This verifies that each node meets the required health standards before your application containers start, improving the reliability and performance of your workloads.

 

Last update: Jul 19 2024 03:00 AM