Running high-performance computing (HPC) and AI workloads in the cloud requires a flexible and scalable orchestration platform. Microsoft Azure CycleCloud, when combined with Slurm, provides an efficient solution for managing containerized applications across HPC clusters.
In this blog, we will explore how to run multi-node, multi-GPU workloads in a CycleCloud-Slurm environment using containerized workflows. We’ll cover key configurations, job submission strategies, and best practices to maximize GPU utilization across multiple nodes.
To simplify this process, we developed cyclecloud-slurm-container, a custom project that automates the setup of Pyxis and Enroot for running containerized workloads in Slurm. This tool streamlines the installation and configuration of required software, making it easier to deploy and manage containerized HPC applications.
In this approach, we integrate these scripts to enable container support in an existing CycleCloud deployment. Alternatively, Microsoft also offers a product called CycleCloud Workspace for Slurm.
Azure CycleCloud Workspace for Slurm is an Azure Marketplace solution template that simplifies the creation, configuration, and deployment of pre-defined Slurm clusters on Azure using CycleCloud. It eliminates the need for prior knowledge of Azure or Slurm. These Slurm clusters come pre-configured with PMIx v4, Pyxis, and Enroot, enabling seamless execution of containerized AI and HPC workloads.
As an example, we will demonstrate how to use the Azure Node Health Check (aznhc) Docker container to run NCCL benchmarks across multiple nodes and GPUs, showcasing the benefits of a well-optimized containerized HPC environment.
Note: aznhc is not the preferred method for running NCCL AllReduce benchmarks because it does not ship the latest NCCL libraries. For optimal performance, use the most recent NCCL libraries.
Containers bring significant benefits to HPC and AI workloads by enhancing flexibility and efficiency. When integrated with CycleCloud and Slurm, they provide:
- Portability – Package applications with all dependencies, ensuring consistent execution across different environments.
- Isolation – Run applications in separate environments to prevent conflicts and maintain system integrity.
- Reproducibility – Guarantee identical execution of workloads across multiple job submissions for reliable experimentation.
- Scalability – Dynamically scale containerized workloads across multiple nodes and GPUs, optimizing resource utilization.
By leveraging containers within CycleCloud-Slurm, users can streamline workload management, simplify software deployment, and maximize the efficiency of their HPC clusters.
Testing Environment for Running Multi-Node, Multi-GPU NCCL Workloads in CycleCloud-Slurm
Before executing NCCL benchmarks across multiple nodes and GPUs using containers in a CycleCloud-Slurm setup, ensure the following prerequisites are met:
- CycleCloud 8.x: A properly configured and running CycleCloud deployment.
- Virtual Machines: Use Standard_ND96asr_v4 VMs, which include NVIDIA GPUs and an InfiniBand network optimized for deep learning training and high-performance AI workloads.
- Slurm Configuration: Slurm version 24.05.4-2 (cyclecloud-slurm 3.0.11).
- Operating System: Ubuntu 22.04 (microsoft-dsvm:ubuntu-hpc:2204:latest), pre-configured with GPU drivers, InfiniBand drivers, and HPC tools.
- Container Runtime Setup: Deploy the cyclecloud-slurm-container project, which automates the configuration of Enroot and Pyxis for efficient container execution.
- Azure Node Health Check (AZNHC) Container: Utilize this container to validate node health and execute NCCL benchmarks for performance testing.
This setup ensures a reliable and scalable environment for evaluating NCCL performance in a containerized CycleCloud-Slurm cluster.
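Before moving on, it is worth confirming on a compute node that the GPU and InfiniBand stack provided by the ubuntu-hpc image is visible. A quick, optional sanity check (command availability depends on the image):
nvidia-smi              # should list the eight A100 GPUs on a Standard_ND96asr_v4 node
ibstat | grep -i state  # InfiniBand ports should report Active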
Configuring the CycleCloud-Slurm Container Project
Follow these steps to set up the cyclecloud-slurm-container project, which automates the configuration of Pyxis and Enroot for running containerized workloads in Slurm.
Step 1: Open a Terminal Session on the CycleCloud Server
Ensure you have access to the CycleCloud server with the CycleCloud CLI enabled.
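If the CLI has not been initialized yet, it can be pointed at your CycleCloud server first. A minimal sketch, where the server address is a placeholder and the initialize command prompts interactively for your CycleCloud URL and credentials:
ssh azureuser@<cyclecloud-server-address>
cyclecloud initialize   # prompts for the CycleCloud server URL, username, and password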
Step 2: Clone the Repository
Clone the cyclecloud-slurm-container repository from GitHub:
git clone https://github.com/vinil-v/cyclecloud-slurm-container.git
Example Output:
[azureuser@cc87 ~]$ git clone https://github.com/vinil-v/cyclecloud-slurm-container.git
Cloning into 'cyclecloud-slurm-container'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 27 (delta 2), reused 27 (delta 2), pack-reused 0
Receiving objects: 100% (27/27), done.
Resolving deltas: 100% (2/2), done.
Step 3: Upload the Project to CycleCloud Locker
Navigate to the project directory and upload it to the CycleCloud locker:
cd cyclecloud-slurm-container/
cyclecloud project upload <locker name>
Example Output:
[azureuser@cc87 cyclecloud-slurm-container]$ cyclecloud project upload "Team Shared-storage"
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
Job 43b89b46-ca66-3244-6341-d0cb746a87ad has started
Log file is located at: /home/azureuser/.azcopy/43b89b46-ca66-3244-6341-d0cb746a87ad.log
100.0 %, 14 Done, 0 Failed, 0 Pending, 14 Total, 2-sec Throughput (Mb/s): 0.028
Job 43b89b46-ca66-3244-6341-d0cb746a87ad Summary
Files Scanned at Source: 14
Files Scanned at Destination: 14
Elapsed Time (Minutes): 0.0334
Number of Copy Transfers for Files: 14
Number of Copy Transfers for Folder Properties: 0
Total Number of Copy Transfers: 14
Number of Copy Transfers Completed: 14
Number of Copy Transfers Failed: 0
Number of Deletions at Destination: 0
Total Number of Bytes Transferred: 7016
Total Number of Bytes Enumerated: 7016
Final Job Status: Completed
Upload complete!
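If you are not sure which locker name to pass to the upload command, the CLI can list the lockers available to your account (names will differ in your environment):
cyclecloud locker list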
Step 4: Explore Project Structure
The project is designed to configure Pyxis and Enroot in both the scheduler and compute nodes within a Slurm cluster. It includes the following directories:
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs
total 0
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 default
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 execute
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 scheduler
Compute Node Scripts (execute Directory)
These scripts configure NVMe storage, set up Enroot, and enable Pyxis for running container workloads.
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs/execute/cluster-init/scripts/
total 16
-rw-rw-r--. 1 azureuser azureuser 974 Apr 2 01:44 000_nvme-setup.sh
-rw-rw-r--. 1 azureuser azureuser 1733 Apr 2 01:44 001_enroot-setup.sh
-rw-rw-r--. 1 azureuser azureuser 522 Apr 2 01:44 002_pyxis-setup-execute.sh
-rw-rw-r--. 1 azureuser azureuser 350 Apr 2 01:44 README.txt
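For orientation, these setup scripts typically point Enroot's runtime, cache, and data directories at the local NVMe scratch space prepared by the NVMe setup script. The snippet below is an illustrative sketch of that kind of configuration, not the project's exact contents (paths are assumptions):
# /etc/enroot/enroot.conf (illustrative)
ENROOT_RUNTIME_PATH /mnt/nvme/enroot/runtime
ENROOT_CACHE_PATH   /mnt/nvme/enroot/cache
ENROOT_DATA_PATH    /mnt/nvme/enroot/data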
Scheduler Node Scripts (scheduler Directory)
These scripts set up Pyxis on the scheduler node to enable container execution with Slurm.
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs/scheduler/cluster-init/scripts/
total 8
-rw-rw-r--. 1 azureuser azureuser 1266 Apr 2 01:44 000_pyxis-setup-scheduler.sh
-rw-rw-r--. 1 azureuser azureuser 350 Apr 2 01:44 README.txt
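Pyxis hooks into Slurm through the SPANK plugin interface, so the scheduler-side script ultimately registers the plugin in Slurm's plugstack configuration. An illustrative sketch (the plugin path is an assumption and depends on where Pyxis is installed):
# /etc/slurm/plugstack.conf.d/pyxis.conf (illustrative)
required /usr/local/lib/slurm/spank_pyxis.so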
Configuring cyclecloud-slurm-container in the CycleCloud Portal:
- Log in to the CycleCloud web portal.
- Create a Slurm cluster. Under Required Settings, set the HPC VM Type to Standard_ND96asr_v4 (used in this example).
- Under Advanced Settings, select Ubuntu 22.04 LTS as the OS (the microsoft-dsvm:ubuntu-hpc:2204:latest image includes GPU drivers, InfiniBand drivers, and HPC utilities such as MPI).
- In your CycleCloud Slurm cluster configuration, add the cyclecloud-slurm-container project as cluster-init for both the scheduler and execute configurations.
Click Browse, navigate to the cyclecloud-slurm-container directory, and select the "scheduler" directory for the scheduler and the "execute" directory for the execute nodes.
Scheduler cluster-init section:
Execute cluster-init section:
After configuring all the settings, save the changes and start the cluster.
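If you prefer managing clusters from a CycleCloud template rather than the portal, the same attachment can be expressed as cluster-init sections on the scheduler node and the execute node array. A hedged sketch (the spec version string is an assumption):
[[node scheduler]]
    [[[cluster-init cyclecloud-slurm-container:scheduler:1.0.0]]]
[[nodearray execute]]
    [[[cluster-init cyclecloud-slurm-container:execute:1.0.0]]]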
Testing the setup:
Once the cluster is running, log in to the scheduler node and create a job script (nccl_benchmark_job.sh) like the one below.
Job script:
#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH -o nccl_allreduce_%j.log
export OMPI_MCA_coll_hcoll_enable=0 \
NCCL_IB_PCI_RELAXED_ORDERING=1 \
CUDA_DEVICE_ORDER=PCI_BUS_ID \
NCCL_SOCKET_IFNAME=eth0 \
NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
NCCL_DEBUG=WARN \
NCCL_MIN_NCHANNELS=32
CONT="mcr.microsoft.com#aznhc/aznhc-nv:latest"
PIN_MASK='ffffff000000,ffffff000000,ffffff,ffffff,ffffff000000000000000000,ffffff000000000000000000,ffffff000000000000,ffffff000000000000'
MOUNT="/opt/microsoft:/opt/microsoft"
srun --mpi=pmix \
--cpu-bind=mask_cpu:$PIN_MASK \
--container-image "${CONT}" \
--container-mounts "${MOUNT}" \
--ntasks-per-node=8 \
--cpus-per-task=12 \
--gpus-per-node=8 \
--mem=0 \
bash -c 'export LD_LIBRARY_PATH="/opt/openmpi/lib:$LD_LIBRARY_PATH"; /opt/nccl-tests/build/all_reduce_perf -b 1K -e 16G -f 2 -g 1 -c 0'
Submit an NCCL job using the following command.
The -N option sets the number of nodes used for the benchmark. This example runs on 4 nodes; change -N to the number of nodes you want.
sbatch -N 4 --gres=gpu:8 -p hpc ./nccl_benchmark_job.sh
Output:
azureuser@gpu-scheduler:~$ sbatch -N 4 --gres=gpu:8 -p hpc ./nccl_benchmark_job.sh
Submitted batch job 61
azureuser@gpu-scheduler:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
61 hpc nccl_ben azureuse CF 0:04 4 gpu-hpc-[1-4]
azureuser@gpu-scheduler:~$
Verifying the Results
After the job completes, you will find a nccl_allreduce_<jobid>.log file containing the benchmark details for review.
azureuser@gpu-scheduler:~$ cat nccl_allreduce_61.log
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
# nThread 1 nGpus 1 minBytes 1024 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 16036 on gpu-hpc-1 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 1 Group 0 Pid 16037 on gpu-hpc-1 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 2 Group 0 Pid 16038 on gpu-hpc-1 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 3 Group 0 Pid 16039 on gpu-hpc-1 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 4 Group 0 Pid 16040 on gpu-hpc-1 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 5 Group 0 Pid 16041 on gpu-hpc-1 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 6 Group 0 Pid 16042 on gpu-hpc-1 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 7 Group 0 Pid 16043 on gpu-hpc-1 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 8 Group 0 Pid 17098 on gpu-hpc-2 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 9 Group 0 Pid 17099 on gpu-hpc-2 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 10 Group 0 Pid 17100 on gpu-hpc-2 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 11 Group 0 Pid 17101 on gpu-hpc-2 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 12 Group 0 Pid 17102 on gpu-hpc-2 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 13 Group 0 Pid 17103 on gpu-hpc-2 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 14 Group 0 Pid 17104 on gpu-hpc-2 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 15 Group 0 Pid 17105 on gpu-hpc-2 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 16 Group 0 Pid 17127 on gpu-hpc-3 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 17 Group 0 Pid 17128 on gpu-hpc-3 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 18 Group 0 Pid 17129 on gpu-hpc-3 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 19 Group 0 Pid 17130 on gpu-hpc-3 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 20 Group 0 Pid 17131 on gpu-hpc-3 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 21 Group 0 Pid 17132 on gpu-hpc-3 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 22 Group 0 Pid 17133 on gpu-hpc-3 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 23 Group 0 Pid 17134 on gpu-hpc-3 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 24 Group 0 Pid 17127 on gpu-hpc-4 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 25 Group 0 Pid 17128 on gpu-hpc-4 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 26 Group 0 Pid 17129 on gpu-hpc-4 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 27 Group 0 Pid 17130 on gpu-hpc-4 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 28 Group 0 Pid 17131 on gpu-hpc-4 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 29 Group 0 Pid 17132 on gpu-hpc-4 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 30 Group 0 Pid 17133 on gpu-hpc-4 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 31 Group 0 Pid 17134 on gpu-hpc-4 device 7 [0x00] NVIDIA A100-SXM4-40GB
NCCL version 2.19.3+cuda12.2
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 256 float sum -1 53.54 0.02 0.04 N/A 55.41 0.02 0.04 N/A
2048 512 float sum -1 60.53 0.03 0.07 N/A 60.49 0.03 0.07 N/A
4096 1024 float sum -1 61.70 0.07 0.13 N/A 58.78 0.07 0.14 N/A
8192 2048 float sum -1 64.86 0.13 0.24 N/A 59.49 0.14 0.27 N/A
16384 4096 float sum -1 134.2 0.12 0.24 N/A 59.91 0.27 0.53 N/A
32768 8192 float sum -1 66.55 0.49 0.95 N/A 61.85 0.53 1.03 N/A
65536 16384 float sum -1 69.26 0.95 1.83 N/A 64.42 1.02 1.97 N/A
131072 32768 float sum -1 73.87 1.77 3.44 N/A 221.6 0.59 1.15 N/A
262144 65536 float sum -1 360.4 0.73 1.41 N/A 91.51 2.86 5.55 N/A
524288 131072 float sum -1 103.5 5.06 9.81 N/A 101.1 5.18 10.04 N/A
1048576 262144 float sum -1 115.6 9.07 17.57 N/A 118.0 8.89 17.22 N/A
2097152 524288 float sum -1 142.8 14.68 28.45 N/A 141.5 14.82 28.72 N/A
4194304 1048576 float sum -1 184.6 22.72 44.02 N/A 183.8 22.82 44.21 N/A
8388608 2097152 float sum -1 277.2 30.26 58.63 N/A 271.9 30.86 59.78 N/A
16777216 4194304 float sum -1 370.4 45.30 87.77 N/A 377.5 44.45 86.12 N/A
33554432 8388608 float sum -1 632.7 53.03 102.75 N/A 638.8 52.52 101.76 N/A
67108864 16777216 float sum -1 1016.1 66.04 127.96 N/A 1018.5 65.89 127.66 N/A
134217728 33554432 float sum -1 1885.0 71.20 137.96 N/A 1853.3 72.42 140.32 N/A
268435456 67108864 float sum -1 3353.1 80.06 155.11 N/A 3369.3 79.67 154.36 N/A
536870912 134217728 float sum -1 5920.8 90.68 175.68 N/A 5901.4 90.97 176.26 N/A
1073741824 268435456 float sum -1 11510 93.29 180.74 N/A 11733 91.52 177.31 N/A
2147483648 536870912 float sum -1 22712 94.55 183.20 N/A 22742 94.43 182.95 N/A
4294967296 1073741824 float sum -1 45040 95.36 184.76 N/A 44924 95.60 185.23 N/A
8589934592 2147483648 float sum -1 89377 96.11 186.21 N/A 89365 96.12 186.24 N/A
17179869184 4294967296 float sum -1 178432 96.28 186.55 N/A 178378 96.31 186.60 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 75.0205
#
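A quick way to pull the headline number and any NCCL warnings out of the log (using job ID 61 from this run):
grep "Avg bus bandwidth" nccl_allreduce_61.log
grep -i "WARN" nccl_allreduce_61.log   # NCCL_DEBUG=WARN surfaces warnings, if any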
Conclusion
Integrating containers with CycleCloud-Slurm for multi-node, multi-GPU workloads enables seamless scalability and portability in HPC and AI applications. By leveraging Enroot and Pyxis, we can efficiently execute containerized workloads while ensuring optimal GPU utilization.
The cyclecloud-slurm-container project simplifies the deployment process, making it easier for teams to configure, manage, and scale their workloads on Azure HPC clusters. Running NCCL benchmarks inside containers provides valuable insights into communication efficiency across GPUs and nodes, helping optimize AI and deep learning training workflows.
By following this guide, you can confidently set up and run containerized multi-node NCCL benchmarks in CycleCloud-Slurm, ensuring peak performance for your AI workloads in the cloud.
References
- ND A100 v4 Series – GPU-Accelerated Virtual Machines
- Microsoft Azure CycleCloud – Overview
- Slurm and Containers – Official Documentation
- NVIDIA Pyxis – Slurm Container Runtime
- NVIDIA Enroot – Lightweight Container Runtime
- Azure HPC VM Images – Preconfigured Images for HPC Workloads
- CycleCloud-Slurm-Container – Project Repository
- CycleCloud Workspace for Slurm