Running high-performance computing (HPC) and AI workloads in the cloud requires a flexible and scalable orchestration platform. Microsoft Azure CycleCloud, when combined with Slurm, provides an efficient solution for managing containerized applications across HPC clusters.
In this blog, we will explore how to run multi-node, multi-GPU workloads in a CycleCloud-Slurm environment using containerized workflows. We’ll cover key configurations, job submission strategies, and best practices to maximize GPU utilization across multiple nodes.
To simplify this process, we developed cyclecloud-slurm-container, a custom project that automates the setup of Pyxis and Enroot for running containerized workloads in Slurm. This tool streamlines the installation and configuration of required software, making it easier to deploy and manage containerized HPC applications.
In this approach, we integrate these scripts to enable container support in an existing CycleCloud deployment. Alternatively, Microsoft also offers a product called CycleCloud Workspace for Slurm.
Azure CycleCloud Workspace for Slurm is an Azure Marketplace solution template that simplifies the creation, configuration, and deployment of pre-defined Slurm clusters on Azure using CycleCloud. It eliminates the need for prior knowledge of Azure or Slurm. These Slurm clusters come pre-configured with PMIx v4, Pyxis, and Enroot, enabling seamless execution of containerized AI and HPC workloads.
As an example, we will demonstrate how to use the Azure Node Health Check (aznhc) Docker container to run NCCL benchmarks across multiple nodes and GPUs, showcasing the benefits of a well-optimized containerized HPC environment.
Note: aznhc is not the preferred method for running NCCL AllReduce benchmarks because it does not ship the latest NCCL libraries. For optimal performance, use the most recent NCCL libraries.
Containers bring significant benefits to HPC and AI workloads by enhancing flexibility and efficiency. When integrated with CycleCloud and Slurm, they provide:
- Portability – Package applications with all dependencies, ensuring consistent execution across different environments.
- Isolation – Run applications in separate environments to prevent conflicts and maintain system integrity.
- Reproducibility – Guarantee identical execution of workloads across multiple job submissions for reliable experimentation.
- Scalability – Dynamically scale containerized workloads across multiple nodes and GPUs, optimizing resource utilization.
By leveraging containers within CycleCloud-Slurm, users can streamline workload management, simplify software deployment, and maximize the efficiency of their HPC clusters.
Testing Environment for Running Multi-Node, Multi-GPU NCCL Workloads in CycleCloud-Slurm
Before executing NCCL benchmarks across multiple nodes and GPUs using containers in a CycleCloud-Slurm setup, ensure the following prerequisites are met:
- CycleCloud 8.x: A properly configured and running CycleCloud deployment.
- Virtual Machines: Use Standard_ND96asr_v4 VMs, which include NVIDIA GPUs and an InfiniBand network optimized for deep learning training and high-performance AI workloads.
- Slurm Configuration: Slurm version 24.05.4-2 (cyclecloud-slurm 3.0.11).
- Operating System: Ubuntu 22.04 (microsoft-dsvm:ubuntu-hpc:2204:latest), pre-configured with GPU drivers, InfiniBand drivers, and HPC tools.
- Container Runtime Setup: Deploy the cyclecloud-slurm-container project, which automates the configuration of Enroot and Pyxis for efficient container execution.
- Azure Node Health Check (AZNHC) Container: Utilize this container to validate node health and execute NCCL benchmarks for performance testing.
This setup ensures a reliable and scalable environment for evaluating NCCL performance in a containerized CycleCloud-Slurm cluster.
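Before moving on, it is worth confirming on a compute node that the GPU and InfiniBand stack provided by the ubuntu-hpc image is visible. A quick, optional sanity check (command availability depends on the image):
nvidia-smi              # should list the eight A100 GPUs on a Standard_ND96asr_v4 node
ibstat | grep -i state  # InfiniBand ports should report Active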
Configuring the CycleCloud-Slurm Container Project
Follow these steps to set up the cyclecloud-slurm-container project, which automates the configuration of Pyxis and Enroot for running containerized workloads in Slurm.
Step 1: Open a Terminal Session on the CycleCloud Server
Ensure you have access to the CycleCloud server with the CycleCloud CLI enabled.
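If the CLI has not been initialized yet, it can be pointed at your CycleCloud server first. A minimal sketch, where the server address is a placeholder and the initialize command prompts interactively for your CycleCloud URL and credentials:
ssh azureuser@<cyclecloud-server-address>
cyclecloud initialize   # prompts for the CycleCloud server URL, username, and password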
Step 2: Clone the Repository
Clone the cyclecloud-slurm-container repository from GitHub:
git clone https://github.com/vinil-v/cyclecloud-slurm-container.git
Example Output:
[azureuser@cc87 ~]$ git clone https://github.com/vinil-v/cyclecloud-slurm-container.git
Cloning into 'cyclecloud-slurm-container'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 27 (delta 2), reused 27 (delta 2), pack-reused 0
Receiving objects: 100% (27/27), done.
Resolving deltas: 100% (2/2), done.
Step 3: Upload the Project to CycleCloud Locker
Navigate to the project directory and upload it to the CycleCloud locker:
cd cyclecloud-slurm-container/
cyclecloud project upload <locker name>
Example Output:
[azureuser@cc87 cyclecloud-slurm-container]$ cyclecloud project upload "Team Shared-storage"
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
Job 43b89b46-ca66-3244-6341-d0cb746a87ad has started
Log file is located at: /home/azureuser/.azcopy/43b89b46-ca66-3244-6341-d0cb746a87ad.log
100.0 %, 14 Done, 0 Failed, 0 Pending, 14 Total, 2-sec Throughput (Mb/s): 0.028
Job 43b89b46-ca66-3244-6341-d0cb746a87ad Summary
Files Scanned at Source: 14
Files Scanned at Destination: 14
Elapsed Time (Minutes): 0.0334
Number of Copy Transfers for Files: 14
Number of Copy Transfers for Folder Properties: 0
Total Number of Copy Transfers: 14
Number of Copy Transfers Completed: 14
Number of Copy Transfers Failed: 0
Number of Deletions at Destination: 0
Total Number of Bytes Transferred: 7016
Total Number of Bytes Enumerated: 7016
Final Job Status: Completed
Upload complete!
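If you are not sure which locker name to pass to the upload command, the CLI can list the lockers available to your account (names will differ in your environment):
cyclecloud locker list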
Step 4: Explore Project Structure
The project is designed to configure Pyxis and Enroot in both the scheduler and compute nodes within a Slurm cluster. It includes the following directories:
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs
total 0
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 default
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 execute
drwxrwxr-x. 3 azureuser azureuser 26 Apr 2 01:44 scheduler
Compute Node Scripts (execute Directory)
These scripts configure NVMe storage, set up Enroot, and enable Pyxis for running container workloads.
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs/execute/cluster-init/scripts/
total 16
-rw-rw-r--. 1 azureuser azureuser 974 Apr 2 01:44 000_nvme-setup.sh
-rw-rw-r--. 1 azureuser azureuser 1733 Apr 2 01:44 001_enroot-setup.sh
-rw-rw-r--. 1 azureuser azureuser 522 Apr 2 01:44 002_pyxis-setup-execute.sh
-rw-rw-r--. 1 azureuser azureuser 350 Apr 2 01:44 README.txt
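For orientation, these setup scripts typically point Enroot's runtime, cache, and data directories at the local NVMe scratch space prepared by the NVMe setup script. The snippet below is an illustrative sketch of that kind of configuration, not the project's exact contents (paths are assumptions):
# /etc/enroot/enroot.conf (illustrative)
ENROOT_RUNTIME_PATH /mnt/nvme/enroot/runtime
ENROOT_CACHE_PATH   /mnt/nvme/enroot/cache
ENROOT_DATA_PATH    /mnt/nvme/enroot/data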
Scheduler Node Scripts (scheduler Directory)
These scripts set up Pyxis on the scheduler node to enable container execution with Slurm.
[azureuser@cc87 cyclecloud-slurm-container]$ ll specs/scheduler/cluster-init/scripts/
total 8
-rw-rw-r--. 1 azureuser azureuser 1266 Apr 2 01:44 000_pyxis-setup-scheduler.sh
-rw-rw-r--. 1 azureuser azureuser 350 Apr 2 01:44 README.txt
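Pyxis hooks into Slurm through the SPANK plugin interface, so the scheduler-side script ultimately registers the plugin in Slurm's plugstack configuration. An illustrative sketch (the plugin path is an assumption and depends on where Pyxis is installed):
# /etc/slurm/plugstack.conf.d/pyxis.conf (illustrative)
required /usr/local/lib/slurm/spank_pyxis.so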
Configuring cyclecloud-slurm-container in the CycleCloud Portal:
- Log in to the CycleCloud web portal.
- Create a Slurm cluster. Under Required Settings, set the HPC VM Type to Standard_ND96asr_v4 (used in this example).
- Under Advanced Settings, select Ubuntu 22.04 LTS as the OS (the microsoft-dsvm:ubuntu-hpc:2204:latest image includes GPU drivers, InfiniBand drivers, and HPC utilities such as MPI).
- In your CycleCloud Slurm cluster configuration, add the cyclecloud-slurm-container project as cluster-init for both the scheduler and execute configurations.
Click Browse, navigate to the cyclecloud-slurm-container directory, and select the "scheduler" directory for the scheduler and the "execute" directory for the execute nodes.
Scheduler cluster-init section:
Execute cluster-init section:
After configuring all the settings, save the changes and start the cluster.
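If you prefer managing clusters from a CycleCloud template rather than the portal, the same attachment can be expressed as cluster-init sections on the scheduler node and the execute node array. A hedged sketch (the spec version string is an assumption):
[[node scheduler]]
    [[[cluster-init cyclecloud-slurm-container:scheduler:1.0.0]]]
[[nodearray execute]]
    [[[cluster-init cyclecloud-slurm-container:execute:1.0.0]]]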
Testing the setup:
Once the cluster is running, log in to the scheduler node and create a job script (nccl_benchmark_job.sh) like the one below.
Job script:
#!/bin/bash
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH -o nccl_allreduce_%j.log
export OMPI_MCA_coll_hcoll_enable=0 \
NCCL_IB_PCI_RELAXED_ORDERING=1 \
CUDA_DEVICE_ORDER=PCI_BUS_ID \
NCCL_SOCKET_IFNAME=eth0 \
NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
NCCL_DEBUG=WARN \
NCCL_MIN_NCHANNELS=32
CONT="mcr.microsoft.com#aznhc/aznhc-nv:latest"
PIN_MASK='ffffff000000,ffffff000000,ffffff,ffffff,ffffff000000000000000000,ffffff000000000000000000,ffffff000000000000,ffffff000000000000'
MOUNT="/opt/microsoft:/opt/microsoft"
srun --mpi=pmix \
--cpu-bind=mask_cpu:$PIN_MASK \
--container-image "${CONT}" \
--container-mounts "${MOUNT}" \
--ntasks-per-node=8 \
--cpus-per-task=12 \
--gpus-per-node=8 \
--mem=0 \
bash -c 'export LD_LIBRARY_PATH="/opt/openmpi/lib:$LD_LIBRARY_PATH"; /opt/nccl-tests/build/all_reduce_perf -b 1K -e 16G -f 2 -g 1 -c 0'
Submit an NCCL job using the following command.
The -N option sets the number of nodes used for the benchmark. This example runs on 4 nodes; change -N to the number of nodes you want.
sbatch -N 4 --gres=gpu:8 -p hpc ./nccl_benchmark_job.sh
Output:
azureuser@gpu-scheduler:~$ sbatch -N 4 --gres=gpu:8 -p hpc ./nccl_benchmark_job.sh
Submitted batch job 61
azureuser@gpu-scheduler:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
61 hpc nccl_ben azureuse CF 0:04 4 gpu-hpc-[1-4]
azureuser@gpu-scheduler:~$
Verifying the Results
After the job completes, you will find a nccl_allreduce_<jobid>.log file containing the benchmark details for review.
azureuser@gpu-scheduler:~$ cat nccl_allreduce_61.log
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
pyxis: imported docker image: mcr.microsoft.com#aznhc/aznhc-nv:latest
# nThread 1 nGpus 1 minBytes 1024 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 16036 on gpu-hpc-1 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 1 Group 0 Pid 16037 on gpu-hpc-1 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 2 Group 0 Pid 16038 on gpu-hpc-1 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 3 Group 0 Pid 16039 on gpu-hpc-1 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 4 Group 0 Pid 16040 on gpu-hpc-1 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 5 Group 0 Pid 16041 on gpu-hpc-1 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 6 Group 0 Pid 16042 on gpu-hpc-1 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 7 Group 0 Pid 16043 on gpu-hpc-1 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 8 Group 0 Pid 17098 on gpu-hpc-2 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 9 Group 0 Pid 17099 on gpu-hpc-2 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 10 Group 0 Pid 17100 on gpu-hpc-2 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 11 Group 0 Pid 17101 on gpu-hpc-2 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 12 Group 0 Pid 17102 on gpu-hpc-2 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 13 Group 0 Pid 17103 on gpu-hpc-2 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 14 Group 0 Pid 17104 on gpu-hpc-2 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 15 Group 0 Pid 17105 on gpu-hpc-2 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 16 Group 0 Pid 17127 on gpu-hpc-3 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 17 Group 0 Pid 17128 on gpu-hpc-3 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 18 Group 0 Pid 17129 on gpu-hpc-3 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 19 Group 0 Pid 17130 on gpu-hpc-3 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 20 Group 0 Pid 17131 on gpu-hpc-3 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 21 Group 0 Pid 17132 on gpu-hpc-3 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 22 Group 0 Pid 17133 on gpu-hpc-3 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 23 Group 0 Pid 17134 on gpu-hpc-3 device 7 [0x00] NVIDIA A100-SXM4-40GB
# Rank 24 Group 0 Pid 17127 on gpu-hpc-4 device 0 [0x00] NVIDIA A100-SXM4-40GB
# Rank 25 Group 0 Pid 17128 on gpu-hpc-4 device 1 [0x00] NVIDIA A100-SXM4-40GB
# Rank 26 Group 0 Pid 17129 on gpu-hpc-4 device 2 [0x00] NVIDIA A100-SXM4-40GB
# Rank 27 Group 0 Pid 17130 on gpu-hpc-4 device 3 [0x00] NVIDIA A100-SXM4-40GB
# Rank 28 Group 0 Pid 17131 on gpu-hpc-4 device 4 [0x00] NVIDIA A100-SXM4-40GB
# Rank 29 Group 0 Pid 17132 on gpu-hpc-4 device 5 [0x00] NVIDIA A100-SXM4-40GB
# Rank 30 Group 0 Pid 17133 on gpu-hpc-4 device 6 [0x00] NVIDIA A100-SXM4-40GB
# Rank 31 Group 0 Pid 17134 on gpu-hpc-4 device 7 [0x00] NVIDIA A100-SXM4-40GB
NCCL version 2.19.3+cuda12.2
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 256 float sum -1 53.54 0.02 0.04 N/A 55.41 0.02 0.04 N/A
2048 512 float sum -1 60.53 0.03 0.07 N/A 60.49 0.03 0.07 N/A
4096 1024 float sum -1 61.70 0.07 0.13 N/A 58.78 0.07 0.14 N/A
8192 2048 float sum -1 64.86 0.13 0.24 N/A 59.49 0.14 0.27 N/A
16384 4096 float sum -1 134.2 0.12 0.24 N/A 59.91 0.27 0.53 N/A
32768 8192 float sum -1 66.55 0.49 0.95 N/A 61.85 0.53 1.03 N/A
65536 16384 float sum -1 69.26 0.95 1.83 N/A 64.42 1.02 1.97 N/A
131072 32768 float sum -1 73.87 1.77 3.44 N/A 221.6 0.59 1.15 N/A
262144 65536 float sum -1 360.4 0.73 1.41 N/A 91.51 2.86 5.55 N/A
524288 131072 float sum -1 103.5 5.06 9.81 N/A 101.1 5.18 10.04 N/A
1048576 262144 float sum -1 115.6 9.07 17.57 N/A 118.0 8.89 17.22 N/A
2097152 524288 float sum -1 142.8 14.68 28.45 N/A 141.5 14.82 28.72 N/A
4194304 1048576 float sum -1 184.6 22.72 44.02 N/A 183.8 22.82 44.21 N/A
8388608 2097152 float sum -1 277.2 30.26 58.63 N/A 271.9 30.86 59.78 N/A
16777216 4194304 float sum -1 370.4 45.30 87.77 N/A 377.5 44.45 86.12 N/A
33554432 8388608 float sum -1 632.7 53.03 102.75 N/A 638.8 52.52 101.76 N/A
67108864 16777216 float sum -1 1016.1 66.04 127.96 N/A 1018.5 65.89 127.66 N/A
134217728 33554432 float sum -1 1885.0 71.20 137.96 N/A 1853.3 72.42 140.32 N/A
268435456 67108864 float sum -1 3353.1 80.06 155.11 N/A 3369.3 79.67 154.36 N/A
536870912 134217728 float sum -1 5920.8 90.68 175.68 N/A 5901.4 90.97 176.26 N/A
1073741824 268435456 float sum -1 11510 93.29 180.74 N/A 11733 91.52 177.31 N/A
2147483648 536870912 float sum -1 22712 94.55 183.20 N/A 22742 94.43 182.95 N/A
4294967296 1073741824 float sum -1 45040 95.36 184.76 N/A 44924 95.60 185.23 N/A
8589934592 2147483648 float sum -1 89377 96.11 186.21 N/A 89365 96.12 186.24 N/A
17179869184 4294967296 float sum -1 178432 96.28 186.55 N/A 178378 96.31 186.60 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 75.0205
#
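A quick way to pull the headline number and any NCCL warnings out of the log (using job ID 61 from this run):
grep "Avg bus bandwidth" nccl_allreduce_61.log
grep -i "WARN" nccl_allreduce_61.log   # NCCL_DEBUG=WARN surfaces warnings, if any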
Conclusion
Integrating containers with CycleCloud-Slurm for multi-node, multi-GPU workloads enables seamless scalability and portability in HPC and AI applications. By leveraging Enroot and Pyxis, we can efficiently execute containerized workloads while ensuring optimal GPU utilization.
The cyclecloud-slurm-container project simplifies the deployment process, making it easier for teams to configure, manage, and scale their workloads on Azure HPC clusters. Running NCCL benchmarks inside containers provides valuable insights into communication efficiency across GPUs and nodes, helping optimize AI and deep learning training workflows.
By following this guide, you can confidently set up and run containerized multi-node NCCL benchmarks in CycleCloud-Slurm, ensuring peak performance for your AI workloads in the cloud.
References
- ND A100 v4 Series – GPU-Accelerated Virtual Machines
- Microsoft Azure CycleCloud – Overview
- Slurm and Containers – Official Documentation
- NVIDIA Pyxis – Slurm Container Runtime
- NVIDIA Enroot – Lightweight Container Runtime
- Azure HPC VM Images – Preconfigured Images for HPC Workloads
- CycleCloud-Slurm-Container – Project Repository
- CycleCloud Workspace for Slurm