OVERVIEW
Azure CycleCloud (CC) is a High Performance Computing (HPC) orchestration tool for creating and autoscaling HPC clusters in Azure using traditional schedulers (e.g. Slurm, GridEngine, PBS). By default, CC downloads and installs the scheduler packages on each node at boot, which lengthens boot time, particularly for compute nodes. Creating a custom image with the scheduler packages preinstalled can reduce the boot time by up to half. This blog demonstrates how to install Slurm packages in a custom image to be deployed by CC.
PREREQUISITES
- working CC install (mine is currently 8.2.2-1902)
- CC Slurm cluster-init version 2.5+ (mine is running 2.6.2)
- Azure CLI installed (or use Cloud Shell)
- CycleCloud CLI installed
- (optional) Azure Image Builder configured
- (optional) Azure Compute Gallery
SOLUTION
The following steps create an Azure VM, install the Slurm packages, and capture the VM as an image for deployment via CycleCloud (CC):
1. The CC Product Group provides specific versions of Slurm with a Job Submit Plugin used by Slurm to communicate with CC. The latest Slurm version provided by CC is 20.11.9-1 (Slurm project v2.7.0) and is available on the GitHub repo. The Slurm packages and Job Submit Plugin are specific to the Linux OS and version used.
NOTE: The CC Slurm repo includes the scripts needed to build the Slurm RPMs and the CC Job Submit Plugin for other versions of Slurm (e.g. 20.11.9).
The scripts can be found here with sample instructions:
## Slurm 20.11.9:
sudo -i
cd $HOME
git clone https://github.com/Azure/cyclecloud-slurm.git
mkdir /source
cp -a cyclecloud-slurm/specs/default/cluster-init/files/JobSubmitPlugin /source
sed -i 's/20.11.7/20.11.9/g' /source/JobSubmitPlugin/job_submit_cyclecloud_test.py
sed -i 's/20.11.7/20.11.9/g' ~/cyclecloud-slurm/specs/default/cluster-init/files/00-build-slurm.sh
bash ~/cyclecloud-slurm/specs/default/cluster-init/files/00-build-slurm.sh
## RPMs and Job Submit Plugin located in ~/rpmbuild/RPMS/x86_64/
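Before moving on, it can help to confirm that the build actually produced packages. The following is a minimal sketch, not part of the CC Slurm repo: `has_rpms` is a hypothetical helper name, and the default directory is the output path mentioned in the listing above.

```shell
# Hypothetical helper: succeed only if the directory holds at least one RPM.
# The body runs in a subshell "( ... )", so exit only leaves the function
# and the helper variables do not leak into the calling shell.
has_rpms() (
  dir="${1:-${HOME}/rpmbuild/RPMS/x86_64}"
  for f in "${dir}"/*.rpm; do
    # If the glob matched nothing, $f is the literal pattern and -e fails.
    if [ -e "${f}" ]; then
      exit 0
    fi
  done
  echo "No RPMs found in ${dir}" >&2
  exit 1
)

# Usage after the build script finishes:
#   has_rpms ~/rpmbuild/RPMS/x86_64 && ls -lh ~/rpmbuild/RPMS/x86_64
```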
2. Create a VM in the Azure portal or CLI using the same OS and version as your Slurm cluster (e.g. CentOS 7.9, AlmaLinux 8.6). Here is an example Azure CLI command to create the VM without a public IP (NOTE: to configure a public IP, remove the --public-ip-address "" parameter):
USERNAME=$(whoami)
SSH_KEY=$(cat ${HOME}/.ssh/id_rsa.pub)
RG= # Add your Resource Group name here
VNET= # Add your Virtual Network name here
SUBNET= # Add your Subnet name here
az vm create -n slurm-image-vm -g ${RG} --image OpenLogic:CentOS-HPC:7_9:latest \
--size Standard_D8ds_V4 --admin-username ${USERNAME} --ssh-key-values "${SSH_KEY}" \
--public-ip-address "" --os-disk-size-gb 128 --storage-sku StandardSSD_LRS \
--vnet-name ${VNET} --subnet ${SUBNET}
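Because RG, VNET, and SUBNET are left blank in the listing above, a forgotten value only surfaces as an Azure CLI error. A small pre-flight check can catch that earlier. This is a sketch under assumptions: `require_vars` is a hypothetical helper, not an Azure CLI feature, and it should only be called with literal variable names.

```shell
# Hypothetical helper: fail if any of the named shell variables is unset or empty.
# The body runs in a subshell "( ... )" so its working variables stay local.
require_vars() (
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "ERROR: variable ${name} is not set" >&2
      missing=1
    fi
  done
  [ "${missing}" -eq 0 ]
)

# Usage just before the az vm create command:
#   require_vars USERNAME SSH_KEY RG VNET SUBNET || exit 1
```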
3. SSH into the newly created VM using the credentials provisioned while creating the VM.
4. Copy the slurm-install.sh script to the VM and run it (NOTE: the script defaults to Slurm 20.11.7 on AlmaLinux 8):
sudo -i
wget -P /tmp https://raw.githubusercontent.com/themorey/cyclecloud-scripts/main/slurm-install.sh
bash /tmp/slurm-install.sh
5. Verify the installation:
[root@slurm-image-vm ~]# which sinfo
/bin/sinfo
[root@slurm-image-vm ~]# id slurm
uid=11100(slurm) gid=11100(slurm) groups=11100(slurm)
[root@slurm-image-vm ~]# ll /usr/lib64/slurm/job_submit_cyclecloud.so
-rw-r--r--. 1 root root 221608 Dec 6 2021 /usr/lib64/slurm/job_submit_cyclecloud.so
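The three checks above can be bundled into one function that reports every failure and returns a non-zero status, which is convenient if the image build is later scripted. This is a sketch only: `verify_slurm_image` is a hypothetical name, and the default plugin path is the one shown in the listing above.

```shell
# Hypothetical helper: run the three verification checks and report each failure.
# Pass a different plugin path as the first argument if your layout differs.
verify_slurm_image() (
  plugin="${1:-/usr/lib64/slurm/job_submit_cyclecloud.so}"
  rc=0
  # Slurm client binaries are on the PATH
  command -v sinfo >/dev/null 2>&1 || { echo "FAIL: sinfo not found on PATH" >&2; rc=1; }
  # The slurm service account exists
  id slurm >/dev/null 2>&1 || { echo "FAIL: user slurm does not exist" >&2; rc=1; }
  # The CC Job Submit Plugin is installed
  [ -f "${plugin}" ] || { echo "FAIL: ${plugin} not found" >&2; rc=1; }
  if [ "${rc}" -eq 0 ]; then
    echo "OK: Slurm image checks passed"
  fi
  exit "${rc}"
)

# Usage on the image VM:
#   verify_slurm_image || echo "Fix the failures above before deprovisioning"
```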
6. Deprovision the VM & exit the SSH session:
[root@slurm-image-vm ~]# waagent --deprovision+user --force
WARNING! The waagent service will be stopped.
WARNING! All SSH host key pairs will be deleted.
WARNING! Cached DHCP leases will be deleted.
WARNING! root password will be disabled. You will not be able to login as root.
WARNING! /etc/resolv.conf will be deleted.
WARNING! jmorey account and entire home directory will be deleted.
2022-06-09T18:45:31.818641Z INFO MainThread Examine /proc/net/route for primary interface
2022-06-09T18:45:31.819005Z INFO MainThread Primary interface is [eth0]
[root@slurm-image-vm ~]# exit
logout
[jmorey@slurm-image-vm ~]$ exit
logout
Connection to 10.0.22.7 closed.
7. Deallocate, generalize & capture the VM as a managed image (or capture it to an Azure Compute Gallery):
az vm deallocate -g ${RG} -n slurm-image-vm
az vm generalize -g ${RG} -n slurm-image-vm
az image create -g ${RG} -n Slurm-Image-20_11_9 --source slurm-image-vm
8. Find and copy the ResourceID of the captured image using Azure CLI or Azure Portal (CLI shown) for use with CC:
$ az image show -g ${RG} -n Slurm-Image-20_11_9 -o yaml | awk '/^id: /{print $NF}'
/subscriptions/12345678-abcd-1234-1234-abcded123456/resourceGroups/<RG_NAME>/providers/Microsoft.Compute/images/Slurm-Image-20_11_9
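As an alternative to parsing YAML with awk, the Azure CLI's global `--query` and `-o tsv` options can return the ID directly, and a quick format check guards against pasting an empty or truncated value into CC. The sketch below assumes a managed image; `is_managed_image_id` is a hypothetical helper, and its pattern deliberately rejects Azure Compute Gallery image IDs, which have a different shape.

```shell
# Hypothetical helper: check that a string looks like a managed image resource ID:
# /subscriptions/<id>/resourceGroups/<rg>/providers/Microsoft.Compute/images/<name>
is_managed_image_id() {
  printf '%s\n' "$1" | grep -Eqi \
    '^/subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Compute/images/[^/]+$'
}

# Usage:
#   IMAGE_ID=$(az image show -g "${RG}" -n Slurm-Image-20_11_9 --query id -o tsv)
#   is_managed_image_id "${IMAGE_ID}" && echo "${IMAGE_ID}"
```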
9. Update the cluster settings in the CC Portal to use the newly captured image:
- Click the Edit link to modify your cluster settings
- In the popup window, select Advanced Settings on the left vertical menu
- Check the Custom image checkbox for all the options here and paste the output from Step 8 (the image ResourceID)
- Save the settings
10. Add slurm.install = false to your CC cluster template file and re-import it so CC will not download and install the Slurm packages:
sed -i '/slurm_version$/a \\tslurm.install = false' slurm.txt
cyclecloud export_parameters <cluster_name> > cluster-params.json
cyclecloud import_cluster <cluster_name> -c Slurm -f slurm.txt -p cluster-params.json --force
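Running the sed command above a second time would add the setting again. A guarded variant avoids that. This is a sketch assuming GNU sed and a template in which a line ends with `slurm_version`, as the sed pattern above expects; `add_slurm_install_false` is a hypothetical helper name.

```shell
# Hypothetical wrapper around the sed command above (GNU sed assumed).
# It leaves the template untouched if a slurm.install setting already exists.
add_slurm_install_false() (
  template="${1:-slurm.txt}"
  if grep -q 'slurm\.install[[:space:]]*=' "${template}"; then
    echo "slurm.install is already set in ${template}; leaving it unchanged"
    exit 0
  fi
  # Append a tab-indented "slurm.install = false" after each line ending in slurm_version
  sed -i '/slurm_version$/a \\tslurm.install = false' "${template}"
)

# Usage, followed by a check that the line landed where expected:
#   add_slurm_install_false slurm.txt
#   grep -n -B1 'slurm.install' slurm.txt
```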
11. The compute nodes can be updated to use the new image without terminating/restarting the cluster. SSH into the scheduler node and run the following commands:
sudo /opt/cycle/slurm/cyclecloud_slurm.sh remove_nodes
sudo /opt/cycle/slurm/cyclecloud_slurm.sh scale
12. Start a compute node (e.g. srun --pty bash) and verify Slurm is working correctly.
CONCLUSION
The boot time of Slurm compute nodes can be decreased by up to half when Slurm is installed in a custom image and deployed by CC. The custom image can be published to an Azure Compute Gallery for replication to other regions and for better performance when scaling many compute nodes concurrently. This process can also be combined with Azure Image Builder and Azure DevOps to make it repeatable.
LEARN MORE
Learn more about Azure CycleCloud
Read more about Azure HPC + AI
Take the Azure HPC learning path