GPU node health checks integrated into Azure Kubernetes Service via Node Problem Detector
Published Jun 27 2024 01:23 PM
Microsoft

CormacGarvey_0-1719504295943.png

 

 

Introduction

Large AI model training can take months to complete on very large AI supercomputers. These AI supercomputers consist of many high-end GPUs (e.g., NVIDIA A100 or H100), all connected with InfiniBand. The Azure NDv5 has 8 H100 GPUs per node, connected directly to each other by NVLink 4, and each GPU has a 400 Gbps IB link that enables it to communicate with all the other GPUs on the AI supercomputer.

AI model training workloads are tightly coupled: at regular intervals, all the gradients need to be updated using NCCL collective communication. If any of the GPUs or InfiniBand links fail (e.g., a dropped GPU or an InfiniBand link flap), the complete job can terminate and must be restarted from the last checkpoint. It is imperative to identify any unhealthy nodes or IB fabric so that they are excluded from the nodes used in the training job.

The AzureHPC node health repository provides a suite of recommended node health checks for all Azure specialized SKUs (including GPU SKUs). In this blog post we will show how to integrate a few of the GPU node health checks into AKS (Azure Kubernetes Service) in such a way that

  • GPU node health checks are run at regular intervals.
  • Nodes which fail any of the GPU tests will be automatically cordoned off (to prevent any jobs being scheduled on them) and optionally drained (all pods removed from node)

We will be leveraging Node problem detector (NPD) to run the specific GPU node health checks and draino to cordon/drain any nodes that fail any of the GPU node health checks.

 

Screenshot 2024-06-27 105536.png

 

GPU node health check integration into NPD

NPD is commonly used in K8s environments to run various k8s cluster health checks and report any issues via k8s events/conditions to the k8s API server. The k8s cluster can then take some action depending on how serious the condition is (e.g., for some permanent conditions, the node may be cordoned off and drained). We will leverage the NPD custom plugin monitor to run the GPU node health checks.

 

Note: GPU count, GPU NVLink, GPU XID and GPU ECC health checks are included (other GPU node health checks can also be easily included).
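To make the custom plugin idea concrete, here is a minimal sketch of what a GPU count check could look like. It is not the script from the AzureHPC repository. It relies on the NPD custom plugin convention that exit code 0 means OK, 1 means a problem was found, and anything else is treated as unknown, with stdout used as the message. The NVIDIA_SMI override variable and the function name are hypothetical and only there to make the sketch easy to test.

```shell
#!/bin/bash
# Hypothetical sketch of an NPD custom plugin check (not the author's script).
# NPD custom plugin convention: exit 0 = OK, exit 1 = problem, other = unknown.

# Assumption: the command used to query GPUs can be overridden for testing.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

check_gpu_count() {
    local expected="$1" gpu_list found
    # One output line per visible GPU; report "unknown" if the query itself fails.
    if ! gpu_list=$("$NVIDIA_SMI" --query-gpu=name --format=csv,noheader 2>/dev/null); then
        echo "nvidia-smi failed; GPU count unknown"
        return 2
    fi
    # Count the non-empty lines (grep -c exits non-zero when the count is 0).
    found=$(printf '%s\n' "$gpu_list" | grep -c . || true)
    if [ "$found" -ne "$expected" ]; then
        echo "Expected $expected GPUs but found $found"
        return 1
    fi
    echo "Found all $expected GPUs"
    return 0
}
```

A plugin script built on this would end with something like check_gpu_count 8; exit $? so that NPD sees the return code as the script's exit status.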

 

Get the NPD github repository

git clone https://github.com/kubernetes/node-problem-detector.git

 

Edit the NPD Makefile (get modified file here)

  • Build for linux_amd64 only (not ARM)

         LINUX_PLATFORMS=linux_amd64

         DOCKER_PLATFORMS=linux/amd64

  • Provide a unique tag

                TAG?=$(VERSION)_<UNIQUE NUMBER>

  • Change registry to Azure ACR

                  REGISTRY?=<YOUR ACR>.azurecr.io/k8s-staging-npd

  • Change the BASEIMAGE

                 BASEIMAGE:=nvcr.io/nvidia/pytorch:23.03-py3

 

Edit NPD Dockerfile (get modified file here)

  • Change base container

                FROM nvcr.io/nvidia/pytorch:23.03-py3 as builder-base

  • Install golang in container

                COPY go1.22.4.linux-amd64.tar.gz .

                RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.22.4.linux-amd64.tar.gz

  • Remove unnecessary ARM packages

                #RUN clean-install util-linux bash libsystemd-dev

  • Edit entrypoint

                 ENTRYPOINT ["/node-problem-detector", "--config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json"]

 

Note: You can get the golang tarball here, go1.22.4.linux-amd64.tar.gz

 

Build NPD without SystemLogMonitor and SystemStatsMonitor. AKS has its own NPD, which already runs the complete system monitoring, so we only want our NPD to run the GPU node tests.

BUILD_TAGS="disable_system_log_monitor disable_system_stats_monitor" make 2>&1 | tee make.out

 

Push the container image to ACR

make push 2>&1 | tee make_push.out

 

You could add all the GPU node health check plugins and scripts to the NPD container, but it’s much more flexible to use a k8s configMap to inject them directly into the container at runtime.

 

Edit deployment/node-problem-detector-config.yaml to add the GPU custom plugin configurations (JSON files) and the GPU health check scripts (bash scripts) to the k8s ConfigMap. (get modified file here)

 

Note: You can control how frequently the tests are run through parameters in the custom plugin configuration files.
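As an illustration of where that frequency lives, the sketch below writes a custom plugin monitor configuration for the GPU count check. The field names follow the sample custom plugin configuration shipped in the NPD repository, and invoke_interval is the parameter that sets how often the check runs. The helper function name, the source string, the reason strings and the timeout values are assumptions made for this example, not values taken from the modified files.

```shell
#!/bin/bash
# Hypothetical sketch: generate an NPD custom plugin monitor config for the
# GPU count check. Reasons, source name and timeouts are illustrative only.

write_gpu_count_plugin_config() {
    local out_file="$1" interval="${2:-60s}"
    cat > "$out_file" <<EOF
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "${interval}",
    "timeout": "30s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "gpu-count-custom-plugin-monitor",
  "conditions": [
    {
      "type": "GpuCount",
      "reason": "GpuCountOk",
      "message": "Expected number of GPUs found"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "GpuCount",
      "reason": "GpuCountMismatch",
      "path": "/config/plugin/check_gpu_count.sh",
      "timeout": "30s"
    }
  ]
}
EOF
}
```

For example, write_gpu_count_plugin_config custom-plugin-gpu-count.json 5m would produce a configuration that runs the check every five minutes. The condition type GpuCount matches the condition name that draino is later told to act on.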

 

Edit deployment/node-problem-detector.yaml. (get modified file here)

 

  • NPD command line

- --config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json,/config/custom-plugin-gpu-nvlink.json,/config/custom-plugin-gpu-xid.json,/config/custom-plugin-gpu-ecc.json

  • Which image/container to use

                 image: <YOUR ACR>.azurecr.io/k8s-staging-npd/node-problem-detector:<YOUR TAG>

  • Container limits

                 cpu: 240m

                  memory: 2048Mi

  • Bash script permissions

                defaultMode: 0777

  • Which files to inject into the container.

                 - key: kernel-monitor.json
                   path: kernel-monitor.json
                 - key: docker-monitor.json
                   path: docker-monitor.json
                 - key: custom-plugin-monitor.json
                   path: custom-plugin-monitor.json
                 - key: check_ntp.sh
                   path: plugin/check_ntp.sh
                 - key: custom-plugin-gpu-count.json
                   path: custom-plugin-gpu-count.json
                 - key: check_gpu_count.sh
                   path: plugin/check_gpu_count.sh
                 - key: custom-plugin-gpu-nvlink.json
                   path: custom-plugin-gpu-nvlink.json
                 - key: check_gpu_nvlink.sh
                   path: plugin/check_gpu_nvlink.sh
                 - key: custom-plugin-gpu-xid.json
                   path: custom-plugin-gpu-xid.json
                 - key: check_gpu_xid.sh
                   path: plugin/check_gpu_xid.sh

 

Note: I have shown how to integrate 4 GPU node health checks; other GPU health checks can be easily added.

Note: You will probably need to modify the container limits (cpu/memory) depending on how many and what GPU tests you are running.
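To give a feel for how small an individual check can be, here is a sketch of an XID-style check. It is an assumption about how such a check might be written, not the script used in this post. It reads kernel log text on stdin and looks for the "NVRM: Xid" lines that the NVIDIA driver writes when it reports an Xid error. A production check would likely filter on specific Xid codes and on a time window, which this sketch does not do.

```shell
#!/bin/bash
# Hypothetical sketch of an Xid check (not the author's script).
# Reads kernel log lines on stdin so the caller chooses the source
# (for example: dmesg | check_gpu_xid).

check_gpu_xid() {
    local xid_lines
    # Collect every driver Xid report; grep exits non-zero when none match.
    xid_lines=$(grep 'NVRM: Xid' || true)
    if [ -n "$xid_lines" ]; then
        # Keep the message short: print only the most recent Xid line.
        printf '%s\n' "$xid_lines" | tail -n 1
        return 1
    fi
    echo "No Xid errors found"
    return 0
}
```

Used as a plugin script, this could be invoked as dmesg | check_gpu_xid; exit $?, assuming the container is allowed to read the kernel log.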

 

Draino set-up

The draino set-up is easy: we just need to tell draino which GPU node health check events/conditions to act on (e.g., cordon/drain).

 

Get the draino repository

git clone https://github.com/planetlabs/draino.git

 

Build and push draino image/container to your ACR

docker build -t <YOUR ACR>.azurecr.io/draino .
docker push <YOUR ACR>.azurecr.io/draino

 

Edit the draino manifest yaml file (get modified file here)

  • Add correct service account permission/rules so draino can access the k8s service

                 rules:
                 - apiGroups: ['']
                   resources: [events]
                   verbs: [create, patch, update]
                 - apiGroups: ['']
                   resources: [nodes]
                   verbs: [get, watch, list, update, patch]
                 - apiGroups: ['']
                   resources: [nodes/status]
                   verbs: [patch, watch, list, update]
                 - apiGroups: ['']
                   resources: [endpoints]
                   verbs: [get, watch, list, create, patch, update]
                 - apiGroups: ['']
                   resources: [pods]
                   verbs: [get, watch, list]
                 - apiGroups: ['']
                   resources: [pods/eviction]
                   verbs: [create]
                 - apiGroups:
                   - extensions
                   - apps
                   resources: [daemonsets]
                   verbs: [get, watch, list]

  • Draino command line (Only cordon GPU nodes with these GPU conditions)

                 command: [/draino, --skip-drain, --node-label=accelerator=nvidia, GpuCount, GpuNvlink, GpuXid, GpuEcc]

  • Select the correct image/container

                image: <YOUR ACR>.azurecr.io/draino:latest

 

Testing NPD+Draino GPU health checks

Prerequisites

You have a working AKS cluster. In this test we will be using a NDmv4 nodepool (See here on how to deploy an NDmv4 AKS nodepool).

 

Deploy NPD+GPU health checks

kubectl apply -f rbac.yaml
kubectl apply -f node-problem-detector-config.yaml
kubectl apply -f node-problem-detector.yaml

Note: You should see the node-problem-detector daemonset running on NDmv4 nodes.

 

Deploy special draino deployment with support for GPU node health checks

kubectl apply -f manifest.yml

 

Note: You should see the draino deployment.

 

Screenshot 2024-06-25 181534.png

 

Verify that the GPU node health checks are running (check the NDmv4 node description and look at the node events/conditions).

Screenshot 2024-06-25 182500.png

You can see the GpuNvlink, GpuXid and GpuCount conditions reporting normal status.
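If you prefer a command-line check over reading the full node description, the following sketch lists only the GPU conditions that currently report a problem. It assumes the four condition names used in this post (GpuCount, GpuNvlink, GpuXid, GpuEcc) and relies on the NPD convention that a condition status of True means the problem is present. The function name and the KUBECTL override variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the GPU conditions that report a problem on a node.

# Assumption: the kubectl command can be overridden for testing.
KUBECTL="${KUBECTL:-kubectl}"

gpu_problem_conditions() {
    local node="$1"
    # Emit one "type<TAB>status<TAB>reason" line per node condition, then keep
    # only the GPU conditions whose status is True (problem present).
    "$KUBECTL" get node "$node" \
        -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' |
        awk -F'\t' '$1 ~ /^Gpu(Count|Nvlink|Xid|Ecc)$/ && $2 == "True" { print $1 ": " $3 }'
}
```

Calling gpu_problem_conditions with a node name prints nothing when all GPU checks report normal status.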

 

Now, to simulate a GPU node health check failure, we will drop one of the NDmv4 GPUs.

nvidia-smi -i 00000001:00:00.0 -pm 0
nvidia-smi drain -p 0001:00:00.0 -m 1

 

Note: nvidia-smi will verify that there are 7 GPUs (instead of the expected 8).

 

Check the NDmv4 node events/conditions (via the node description). It shows that the GPU count test has failed and that the node has been automatically cordoned by draino (i.e., no pods can be scheduled to this node).
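The cordon can also be confirmed across the whole node pool from the command line. The sketch below assumes the GPU nodes carry the accelerator=nvidia label used in the draino command line above, and that a cordoned node shows SchedulingDisabled in the STATUS column of kubectl get nodes. The function name and the KUBECTL override variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list GPU nodes that are currently cordoned.

# Assumption: the kubectl command can be overridden for testing.
KUBECTL="${KUBECTL:-kubectl}"

cordoned_gpu_nodes() {
    # Column 1 is the node name, column 2 the status (e.g. "Ready,SchedulingDisabled").
    "$KUBECTL" get nodes -l accelerator=nvidia --no-headers |
        awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}
```

Once a node has been repaired, it can be returned to service with kubectl uncordon followed by the node name.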

 

Screenshot 2024-06-25 183153.png

 

Some additional considerations

NPD is set to run periodically and can overlap with a customer’s job. The timing and type of GPU node health checks you run may affect how well the customer job performs. One possible strategy is to perform thorough node health checks on an empty cluster from time to time and to run some essential GPU node health checks that do not affect performance at regular intervals.
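One way to approximate this strategy inside a single check script is to skip the heavier tests whenever the GPUs are busy. The sketch below is an illustration only. It assumes the check can see the compute processes running on the node's GPUs (which may require the container to share the host PID namespace), and it treats a skipped test as OK. The function name and the NVIDIA_SMI override variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: run a heavier test only when no compute job is using the GPUs.

# Assumption: the command used to query GPUs can be overridden for testing.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

run_if_gpus_idle() {
    local busy
    # One output line per compute process currently using a GPU.
    busy=$("$NVIDIA_SMI" --query-compute-apps=pid --format=csv,noheader 2>/dev/null | grep -c . || true)
    if [ "${busy:-0}" -gt 0 ]; then
        echo "Skipped: $busy compute process(es) running"
        return 0
    fi
    # GPUs look idle: run the command passed as arguments and return its status.
    "$@"
}
```

A check script could then wrap an expensive test, for example run_if_gpus_idle ./heavy_gpu_test.sh, where heavy_gpu_test.sh stands in for whatever thorough test you choose.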

 

Conclusion

Fully automated GPU-specific health checks integrated into AKS that

  • identify unhealthy GPU nodes
  • cordon nodes

help to improve the reliability of large AI supercomputers running training jobs. In this blog post we showed how to integrate GPU-specific health checks into NPD and then have draino look for specific GPU failure conditions and take some action (e.g., cordon/drain the node).

Last update: Jun 28 2024 12:45 PM