Training a large AI model can take months on a very large AI supercomputer. These AI supercomputers consist of many high-end GPUs (e.g., NVIDIA A100 or H100), all connected with InfiniBand. An Azure NDv5 node has 8 H100 GPUs connected directly to each other by NVLink 4, and each GPU has a 400 Gbps InfiniBand (IB) link that lets it communicate with all the other GPUs in the AI supercomputer.
AI model training workloads are tightly coupled: at regular intervals, all the gradients need to be updated using NCCL collective communication. If any GPU or InfiniBand link fails (e.g., a dropped GPU or an InfiniBand link flap), the complete job can terminate and must then be restarted from the last checkpoint. It is imperative to identify unhealthy nodes and IB fabric so that they are excluded from the set of nodes used by the training job.
The Azurehpc node health repository provides a suite of recommended node health checks for all Azure specialized SKUs (including GPU SKUs). In this blog post we will show how to integrate a few of the GPU node health checks into AKS (Azure Kubernetes Service) in such a way that nodes failing a check are automatically cordoned or drained.
We will use Node Problem Detector (NPD) to run the GPU node health checks and draino to cordon/drain any node that fails one of them.
NPD is commonly used in Kubernetes (k8s) environments to run various cluster health checks and report any issues to the k8s API server via events/conditions. The cluster can then take action depending on how serious the condition is (e.g., for some permanent conditions, the node may be cordoned off and drained). We will leverage the NPD custom plugin monitor to run the GPU node health checks.
Note: GPU count, GPU NVLink, GPU XID and GPU ECC health checks are included (other GPU node health checks can easily be added).
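As a rough illustration of what an NPD custom plugin check looks like, here is a minimal sketch of a GPU count check. It is not the script from the Azurehpc node health repository: the function name `check_gpu_count`, the `NVIDIA_SMI` override variable and the default of 8 expected GPUs are assumptions made for this example. NPD's custom plugin monitor treats exit code 0 as OK, 1 as a problem, and anything else as unknown, and uses the script's standard output as the condition message.

```shell
# Sketch of an NPD custom plugin check (illustrative, not the Azurehpc script).
# NPD custom plugin exit codes: 0 = OK, 1 = problem detected, anything else = unknown.
# NVIDIA_SMI is a hypothetical override for the nvidia-smi binary (handy for testing).

check_gpu_count() {
    local expected="${1:-8}"   # assumed default: 8 GPUs per node
    local listing found

    # `nvidia-smi --list-gpus` prints one "GPU <n>: ..." line per visible GPU.
    if ! listing="$("${NVIDIA_SMI:-nvidia-smi}" --list-gpus 2>/dev/null)"; then
        echo "nvidia-smi failed, GPU count unknown"
        return 2
    fi

    # grep -c exits non-zero when it counts 0 lines, hence the `|| true`.
    found="$(printf '%s\n' "$listing" | grep -c '^GPU ' || true)"
    if [ "$found" -ne "$expected" ]; then
        echo "expected ${expected} GPUs, found ${found}"
        return 1
    fi
    echo "all ${expected} GPUs visible"
    return 0
}
```

A plugin script built on this sketch would end with `check_gpu_count 8; exit $?` so that NPD sees the exit code.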
Get the NPD GitHub repository
git clone https://github.com/kubernetes/node-problem-detector.git
Edit the NPD Makefile (get modified file here)
LINUX_PLATFORMS=linux_amd64
DOCKER_PLATFORMS=linux/amd64
TAG?=$(VERSION)_<UNIQUE NUMBER>
REGISTRY?=<YOUR ACR>.azurecr.io/k8s-staging-npd
BASEIMAGE:=nvcr.io/nvidia/pytorch:23.03-py3
Edit NPD Dockerfile (get modified file here)
FROM nvcr.io/nvidia/pytorch:23.03-py3 as builder-base
COPY go1.22.4.linux-amd64.tar.gz .
RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.22.4.linux-amd64.tar.gz
#RUN clean-install util-linux bash libsystemd-dev
ENTRYPOINT ["/node-problem-detector", "--config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json"]
Note: You can get the golang tarball here, go1.22.4.linux-amd64.tar.gz
Build NPD without SystemLogMonitor and SystemStatsMonitor. AKS has its own NPD, which runs the complete monitoring; we only want our NPD to run the GPU node tests.
BUILD_TAGS="disable_system_log_monitor disable_system_stats_monitor" make 2>&1 | tee make.out
Push the container image to ACR
make push 2>&1 | tee make_push.out
You could add all the GPU node health check plugins and scripts to the NPD container image, but it’s much more flexible to use a k8s ConfigMap to inject them into the container at runtime.
Edit deployment/node-problem-detector-config.yaml to add the GPU custom plugin configurations (JSON files) and GPU health check scripts (bash scripts) to the k8s ConfigMap (get modified file here).
Note: You can control how frequently the tests are run; there are parameters for this in the custom plugin JSON files.
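To make the frequency parameters concrete, the sketch below writes out what a custom plugin configuration for the GPU count check might look like. The field names follow NPD's sample custom plugin monitor configuration, where `invoke_interval` sets how often the checks run and `timeout` bounds each run. The interval values, the `source` name and the `reason` strings are assumptions for illustration and may differ from the modified file linked above.

```shell
# Writes an illustrative NPD custom plugin config for the GPU count check.
# Field names follow NPD's sample custom-plugin-monitor.json; the values,
# the "source" name and the "reason" strings are assumptions.
write_gpu_count_plugin_config() {
    local out_file="${1:-custom-plugin-gpu-count.json}"
    cat > "$out_file" <<'EOF'
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "30s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "gpu-count-custom-plugin-monitor",
  "conditions": [
    {
      "type": "GpuCount",
      "reason": "GpuCountIsCorrect",
      "message": "all expected GPUs are visible"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "GpuCount",
      "reason": "GpuCountMismatch",
      "path": "/config/plugin/check_gpu_count.sh",
      "timeout": "30s"
    }
  ]
}
EOF
}
```

The `type` under `conditions` is the name that shows up in the node's conditions, so it needs to match whatever condition names are later passed to draino.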
Edit deployment/node-problem-detector.yaml. (get modified file here)
- --config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json,/config/custom-plugin-gpu-nvlink.json,/config/custom-plugin-gpu-xid.json,/config/custom-plugin-gpu-ecc.json
image: <YOUR ACR>.azurecr.io/k8s-staging-npd/node-problem-detector:<YOUR TAG>
cpu: 240m
memory: 2048Mi
defaultMode: 0777
- key: kernel-monitor.json
  path: kernel-monitor.json
- key: docker-monitor.json
  path: docker-monitor.json
- key: custom-plugin-monitor.json
  path: custom-plugin-monitor.json
- key: check_ntp.sh
  path: plugin/check_ntp.sh
- key: custom-plugin-gpu-count.json
  path: custom-plugin-gpu-count.json
- key: check_gpu_count.sh
  path: plugin/check_gpu_count.sh
- key: custom-plugin-gpu-nvlink.json
  path: custom-plugin-gpu-nvlink.json
- key: check_gpu_nvlink.sh
  path: plugin/check_gpu_nvlink.sh
- key: custom-plugin-gpu-xid.json
  path: custom-plugin-gpu-xid.json
- key: check_gpu_xid.sh
  path: plugin/check_gpu_xid.sh
- key: custom-plugin-gpu-ecc.json
  path: custom-plugin-gpu-ecc.json
# script name assumed from the naming pattern of the entries above
- key: check_gpu_ecc.sh
  path: plugin/check_gpu_ecc.sh
Note: I have shown how to integrate 4 GPU node health checks; other GPU health checks can easily be added.
Note: You will probably need to modify the container limits (cpu/memory) depending on how many and what GPU tests you are running.
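To give a sense of how small such a check can be, here is a hedged sketch of an XID-style check that follows the same exit code convention (0 = OK, 1 = problem, other = unknown). It is not the Azurehpc script. It assumes the NVIDIA driver logs XID events to the kernel log in lines containing `NVRM: Xid`, that the container is allowed to read the kernel log, and it treats any XID as a failure, which is stricter than a production check would likely be. The function name `check_gpu_xid` and the `DMESG` override variable are made up for this example.

```shell
# Sketch of an XID check (illustrative). DMESG is a hypothetical override
# for the dmesg binary; any "NVRM: Xid" line is treated as a failure.
check_gpu_xid() {
    local log xid_lines

    if ! log="$("${DMESG:-dmesg}" 2>/dev/null)"; then
        echo "cannot read kernel log"
        return 2
    fi

    # grep exits non-zero when nothing matches, hence the `|| true`.
    xid_lines="$(printf '%s\n' "$log" | grep 'NVRM: Xid' || true)"
    if [ -n "$xid_lines" ]; then
        # Report only the most recent event to keep the condition message short.
        printf '%s\n' "$xid_lines" | tail -n 1
        return 1
    fi
    echo "no Xid errors in kernel log"
    return 0
}
```

Wiring it in follows the same pattern as the other checks: add the script and a matching custom plugin JSON file to the ConfigMap, then list the JSON file in `--config.custom-plugin-monitor`.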
The draino setup is easy: we just need to tell draino which GPU node health check events/conditions to act on (i.e., cordon/drain).
Get the draino repository
git clone https://github.com/planetlabs/draino.git
Build and push draino image/container to your ACR
docker build -t <YOUR ACR>.azurecr.io/draino .
docker push <YOUR ACR>.azurecr.io/draino
Edit the draino manifest yaml file (get modified file here)
rules:
- apiGroups: ['']
  resources: [events]
  verbs: [create, patch, update]
- apiGroups: ['']
  resources: [nodes]
  verbs: [get, watch, list, update, patch]
- apiGroups: ['']
  resources: [nodes/status]
  verbs: [patch, watch, list, update]
- apiGroups: ['']
  resources: [endpoints]
  verbs: [get, watch, list, create, patch, update]
- apiGroups: ['']
  resources: [pods]
  verbs: [get, watch, list]
- apiGroups: ['']
  resources: [pods/eviction]
  verbs: [create]
- apiGroups:
  - extensions
  - apps
  resources: [daemonsets]
  verbs: [get, watch, list]
command: [/draino, --skip-drain, --node-label=accelerator=nvidia, GpuCount, GpuNvlink, GpuXid, GpuEcc]
image: <YOUR ACR>.azurecr.io/draino:latest
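Because the draino command above passes `--node-label=accelerator=nvidia`, draino only acts on nodes carrying that label. A quick way to see which nodes qualify is sketched below. The function name `list_draino_candidates` and the `KUBECTL` override variable are assumptions for this example.

```shell
# Lists the nodes that match the label selector given to draino.
# KUBECTL is a hypothetical override for the kubectl binary.
list_draino_candidates() {
    "${KUBECTL:-kubectl}" get nodes -l accelerator=nvidia -o name
}
```

If a GPU node is missing from the output, it can be labeled with `kubectl label node <NODE NAME> accelerator=nvidia`.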
Prerequisite: a working AKS cluster. In this test we use an NDmv4 node pool (see here for how to deploy an NDmv4 AKS node pool).
Deploy NPD+GPU health checks
kubectl apply -f rbac.yaml
kubectl apply -f node-problem-detector-config.yaml
kubectl apply -f node-problem-detector.yaml
Note: You should see the node-problem-detector daemonset running on NDmv4 nodes.
Deploy special draino deployment with support for GPU node health checks
kubectl apply -f manifest.yml
Note: You should see the draino deployment.
Verify that the GPU node health checks are running: check the NDmv4 node description and look at the node events/conditions.
You can see the GpuNvlink, GpuXid and GpuCount conditions reporting normal status.
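Besides reading the full `kubectl describe node` output, the GPU conditions can be pulled out directly. The sketch below assumes the condition names start with `Gpu`, as they do in this post. The function name `show_gpu_conditions` and the `KUBECTL` override variable are made up for this example. For NPD conditions, a status of `False` means that no problem is currently detected.

```shell
# Prints "<type> <status> <reason>" for each Gpu* condition on a node.
# KUBECTL is a hypothetical override for the kubectl binary.
show_gpu_conditions() {
    local node="$1"
    "${KUBECTL:-kubectl}" get node "$node" \
        -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.reason}{"\n"}{end}' \
        | grep '^Gpu'
}
```

For example, `show_gpu_conditions <NODE NAME>` prints one line per GPU condition, which is convenient when checking many nodes in a loop.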
Now, to simulate a GPU node health check failure, we will drop one of the NDmv4 GPUs.
nvidia-smi -i 00000001:00:00.0 -pm 0
nvidia-smi drain -p 0001:00:00.0 -m 1
Note: nvidia-smi will now report 7 GPUs (instead of the expected 8).
Check the NDmv4 node events/conditions (via the node description). It shows that the GPU count test has failed and that draino has automatically cordoned the node (i.e., no new pods can be scheduled on it).
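The cordon can also be confirmed from a script. A cordoned node has `.spec.unschedulable` set to `true`, and the field is absent otherwise. The following is a minimal sketch, and the function name `node_is_cordoned` and the `KUBECTL` override variable are assumptions for this example.

```shell
# Succeeds (exit 0) when the node is cordoned, fails otherwise.
# KUBECTL is a hypothetical override for the kubectl binary.
node_is_cordoned() {
    local node="$1"
    local unschedulable
    unschedulable="$("${KUBECTL:-kubectl}" get node "$node" -o jsonpath='{.spec.unschedulable}')"
    [ "$unschedulable" = "true" ]
}
```

For example, `node_is_cordoned <NODE NAME> && echo "cordoned"`. Once the node has been repaired, `kubectl uncordon <NODE NAME>` makes it schedulable again.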
NPD runs its checks periodically, so they can overlap with a customer’s job. The timing and type of GPU node health checks you run may affect the performance of that job. One possible strategy is to run thorough node health checks from time to time on an empty cluster and to run only essential GPU node health checks that do not affect performance at regular intervals.
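One way to approximate this strategy on a single node is to let the check script pick its own tier depending on whether the GPUs are busy. This is only a sketch of the idea, not something the setup above does: the tier names `thorough` and `lightweight`, the function names and the `NVIDIA_SMI` override variable are all assumptions. It relies on `nvidia-smi --query-compute-apps`, which lists the processes currently using the GPUs for compute.

```shell
# Succeeds when no compute process is using any GPU on this node.
# NVIDIA_SMI is a hypothetical override for the nvidia-smi binary.
gpu_node_is_idle() {
    local apps
    apps="$("${NVIDIA_SMI:-nvidia-smi}" --query-compute-apps=pid --format=csv,noheader 2>/dev/null)" || return 2
    [ -z "$apps" ]
}

# Prints which (hypothetical) tier of checks to run right now.
# If the GPU state cannot be determined, fall back to the lightweight tier.
select_check_tier() {
    if gpu_node_is_idle; then
        echo "thorough"
    else
        echo "lightweight"
    fi
}
```

A plugin script could branch on `select_check_tier` and skip the expensive tests whenever a job is running on the node.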
Fully automated GPU-specific health checks integrated into AKS help improve the reliability of large AI supercomputers running training jobs. In this blog post we showed how to integrate GPU-specific health checks into NPD and then have draino watch for specific GPU failure conditions and take action (e.g., cordon/drain the node).