Setting Up Slurm Cloud Bursting Using CycleCloud on Azure
Published May 15 2024 09:04 AM
Microsoft

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.

 

Slurm is a widely used open-source HPC scheduler that can manage workloads across clusters of compute nodes. Slurm can also be configured to interact with cloud resources, such as Azure CycleCloud, to dynamically add or remove nodes based on the demand of the jobs. This allows users to optimize their resource utilization and cost efficiency, as well as to access the scalability and flexibility of the cloud.

 

In this blog post, we discuss how to integrate an external Slurm scheduler with CycleCloud for hybrid HPC scenarios or for “cloud bursting”, that is, sending on-premises workloads to the cloud for processing. For demonstration purposes, we create a Slurm scheduler node in Azure to act as the external scheduler in one VNet, while the execute nodes run in CycleCloud in a separate VNet. We do not cover the networking complexities involved in hybrid scenarios.


Prerequisites

Before we start, we need to have the following items ready:

  • An Azure subscription
  • CycleCloud Version: 8.6.0-3223
  • OS version on the scheduler and execute nodes: AlmaLinux release 8.7 (almalinux:almalinux-hpc:8_7-hpc-gen2:latest)
  • Slurm Version: 23.02.7-1
  • cyclecloud-slurm Project: 3.0.6
  • An external Slurm scheduler node in Azure or on-premises. In this example, we use an Azure VM running AlmaLinux 8.7.
  • A network connection between the external Slurm Scheduler node and the CycleCloud cluster. You can use Azure Virtual Network peering, VPN gateway, ExpressRoute, or other methods to establish the connection. In this example, we are using a very basic network setup.
  • A shared file system between the external Slurm scheduler node and the CycleCloud cluster. You can use Azure NetApp Files, Azure Files, NFS, or other methods to mount the same file system on both sides. In this example, the scheduler VM acts as an NFS server.

Steps 

After we have the prerequisites ready, we can follow these steps to integrate the external Slurm Scheduler node with the CycleCloud cluster:

 

1. On CycleCloud VM:

  • Ensure the CycleCloud 8.6 VM is running and accessible via the cyclecloud CLI.
  • Clone this repository and import a cluster using the provided CycleCloud template (slurm-headless.txt).
  • We are importing a cluster named hpc1 using the slurm-headless.txt template.
git clone https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud.git
cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/templates/slurm-headless.txt

Output:

[vinil@cc86 ~]$ cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/cyclecloud-template/slurm-headless.txt
Importing cluster Slurm-HL and creating cluster hpc1....
----------
hpc1 : off
----------
Resource group:
Cluster nodes:
Total nodes: 0
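The import command needs the path to slurm-headless.txt inside the clone. As a minimal sketch, the file can be located by name rather than by a hard-coded directory, which avoids depending on the repository layout. The helper name find_template is hypothetical and not part of the repository.

```shell
#!/bin/sh
# find_template: hypothetical helper (not part of the repository).
# Prints the path of the first file called <name> under <repo_dir>,
# or nothing if no such file exists.
find_template() {
    repo_dir="$1"
    name="$2"
    find "$repo_dir" -type f -name "$name" 2>/dev/null | head -n 1
}

# Usage, on the CycleCloud VM after the git clone:
#   tpl=$(find_template slurm-cloud-bursting-using-cyclecloud slurm-headless.txt)
#   [ -n "$tpl" ] && cyclecloud import_cluster hpc1 -c Slurm-HL -f "$tpl"
```

If the helper prints nothing, the clone is incomplete or the template has been renamed, and the import is skipped rather than failing on a bad path.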

 

2. Preparing Scheduler VM:

  • Deploy a VM using the specified AlmaLinux image (skip this if you already have a Slurm scheduler).
  • Run the Slurm scheduler installation script (slurm-scheduler-builder.sh) and provide the cluster name (hpc1) when prompted.
  • This script installs and configures the Slurm scheduler.

 

git clone https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud.git
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh slurm-scheduler-builder.sh

 

Output:

------------------------------------------------------------------------------------------------------------------------------
Building Slurm scheduler for cloud bursting with Azure CycleCloud
------------------------------------------------------------------------------------------------------------------------------

Enter Cluster Name: hpc1
------------------------------------------------------------------------------------------------------------------------------

Summary of entered details:
Cluster Name: hpc1
Scheduler Hostname: masternode2
NFSServer IP Address: 10.222.1.26
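Once the builder script finishes, it can be worth confirming that the controller answers before moving on. The following is a rough sketch, assuming the script has put the standard Slurm client commands on the PATH. The helper name check_scheduler is hypothetical, while scontrol ping and sinfo are standard Slurm commands.

```shell
#!/bin/sh
# check_scheduler: hypothetical helper (not part of the repository).
# Returns 2 if the Slurm client tools are missing; otherwise asks the
# controller for its status and lists partitions and node states.
check_scheduler() {
    for tool in scontrol sinfo; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            return 2
        fi
    done
    scontrol ping   # prints whether slurmctld is UP or DOWN
    sinfo           # lists partitions and the state of their nodes
}

# Usage, on the scheduler VM:
#   check_scheduler
```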

 

3. CycleCloud UI:

  • Access the CycleCloud UI, edit the hpc1 cluster settings, and configure VM SKUs and networking settings.
  • Enter the NFS server IP address for /sched and /shared mounts in the Network Attached Storage section.
  • Save and start the hpc1 cluster.
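The start step can also be done from the command line on the CycleCloud VM. This is only a sketch: it assumes the VM SKUs, networking, and NFS settings have already been saved in the UI, and the helper name start_and_show is hypothetical. start_cluster and show_cluster are cyclecloud CLI subcommands.

```shell
#!/bin/sh
# start_and_show: hypothetical helper (not part of the repository).
# Starts the named cluster, then prints its current state.
start_and_show() {
    cluster="$1"
    if ! command -v cyclecloud >/dev/null 2>&1; then
        echo "cyclecloud CLI not found in PATH" >&2
        return 2
    fi
    cyclecloud start_cluster "$cluster" && cyclecloud show_cluster "$cluster"
}

# Usage, on the CycleCloud VM after saving the cluster settings:
#   start_and_show hpc1
```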


 

4. On Slurm Scheduler Node:

  • Integrate the external Slurm scheduler with CycleCloud using the cyclecloud-integrator.sh script.
  • Provide the CycleCloud details (username, password, and URL) when prompted. Type the details manually rather than pasting them: pasted text can carry stray whitespace, which may prevent the connection from being established.
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh cyclecloud-integrator.sh

Output:

[root@masternode2 scripts]# sh cyclecloud-integrator.sh
Please enter the CycleCloud details to integrate with the Slurm scheduler

Enter Cluster Name: hpc1
Enter CycleCloud Username: vinil
Enter CycleCloud Password:
Enter CycleCloud URL (e.g., https://10.222.1.19): https://10.222.1.19
------------------------------------------------------------------------------------------------------------------------------

Summary of entered details:
Cluster Name: hpc1
CycleCloud Username: vinil
CycleCloud URL: https://10.222.1.19

------------------------------------------------------------------------------------------------------------------------------
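Stray whitespace in the entered values is easy to miss by eye. As a small sketch, a value can be stripped of leading and trailing whitespace before it is used, for example to check that the CycleCloud URL responds. The helper name trim is hypothetical, and the curl line assumes CycleCloud is serving a self-signed certificate (hence -k).

```shell
#!/bin/sh
# trim: hypothetical helper (not part of the repository).
# Prints its argument with leading and trailing whitespace removed,
# including tabs and carriage returns left behind by copy and paste.
trim() {
    printf '%s' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Usage, on the scheduler node:
#   url=$(trim "$pasted_url")
#   curl -k -s -o /dev/null -w '%{http_code}\n' "$url"
```

The curl call prints only the HTTP status code, which is enough to tell whether the scheduler node can reach CycleCloud at that address.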

 

5. User and Group Setup:

  • Ensure that user and group IDs are consistent across all nodes.
  • A centralized user management system such as LDAP is the preferred way to keep the UID and GID consistent across all nodes.
  • In this example, we use the users.sh script to create a test user vinil and a group for job submission. (The user vinil already exists in CycleCloud.)

 

cd slurm-cloud-bursting-using-cyclecloud/scripts
sh users.sh
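To confirm that the IDs really match, the same one-line summary can be printed on the scheduler and on an execute node and the two lines compared. This is a minimal sketch; the helper name id_fingerprint is hypothetical, and it relies only on the standard id command.

```shell
#!/bin/sh
# id_fingerprint: hypothetical helper (not part of the repository).
# Prints "<user> <uid> <primary gid>" so the line can be compared
# between the scheduler and the execute nodes.
id_fingerprint() {
    user="$1"
    printf '%s %s %s\n' "$user" "$(id -u "$user")" "$(id -g "$user")"
}

# Usage, on each node:
#   id_fingerprint vinil
```

If the lines differ between nodes, files on the shared file system will appear to belong to the wrong owner and jobs may fail to read or write their data.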

 

 

6. Testing & Job Submission:

  • Log in as a test user (vinil in this example) on the Scheduler node.
  • Submit a test job to verify the setup.

 

su - vinil
srun hostname &

 

Output:

 

[root@masternode2 scripts]# su - vinil
Last login: Tue May 14 04:54:51 UTC 2024 on pts/0
[vinil@masternode2 ~]$ srun hostname &
[1] 43448
[vinil@masternode2 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1       hpc hostname    vinil CF       0:04      1 hpc1-hpc-1
[vinil@masternode2 ~]$ hpc1-hpc-1

 

You will see a new node being created in the hpc1 cluster.
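Beyond the interactive srun test, a batch job exercises the same bursting path. The sketch below writes a minimal job script; the partition name hpc is taken from the squeue output above, while the job name, output file name, and the helper write_test_job are hypothetical. It assumes the submitting directory is on the shared file system so the output file is visible from the scheduler.

```shell
#!/bin/sh
# write_test_job: hypothetical helper (not part of the repository).
# Writes a one-node batch script that prints the execute node's hostname.
write_test_job() {
    out="$1"
    cat > "$out" <<'EOF'
#!/bin/sh
#SBATCH --job-name=burst-test
#SBATCH --partition=hpc
#SBATCH --nodes=1
#SBATCH --output=burst-test-%j.out
srun hostname
EOF
}

# Usage, as the test user on the scheduler node:
#   write_test_job "$HOME/burst-test.sh"
#   sbatch "$HOME/burst-test.sh"
#   squeue -u "$USER"
```

The %j in the output file name is replaced by the job ID, so repeated submissions do not overwrite each other's output.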

 


 

Congratulations! You have successfully set up Slurm bursting with CycleCloud on Azure.

Conclusion

In this blog post, we have shown how to integrate an external Slurm Scheduler node with Azure CycleCloud for cloud bursting or hybrid HPC scenarios. This enables users to leverage the power and flexibility of the cloud for their HPC workloads, while maintaining their existing Slurm workflows and tools. We hope this guide helps you to get started with your HPC journey on Azure.

 

Reference:

GitHub repo - slurm-cloud-bursting-using-cyclecloud 

Azure CycleCloud Documentation 

Slurm documentation 

 
