UPDATE 1/16/2024:
- added support for running BeeOND with the AlmaLinux 8.7 HPC image on compute nodes
- created a cloud-init script for the scheduler nodes (previously a manual process).
OVERVIEW
Azure CycleCloud (CC) is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, deploy/mount filesystems and automatically scale the infrastructure to run jobs efficiently at any scale.
BeeOND ("BeeGFS On Demand", pronounced like the word "beyond") is a per-job BeeGFS parallel filesystem that aggregates the local NVMe/SSDs of Azure VMs into a single filesystem for the duration of the job (NOTE: this is not persistent storage and only exists as long as the job is running...data needs to be staged into and out of BeeOND). This provides the performance benefit of a fast, shared job scratch without additional cost as the VM local NVMe/SSD is included in the cost of the VM. The BeeOND filesystem will utilize the Infiniband fabric of our H-series VMs to provide the highest bandwidth (up to 400Gbps NDR) and lowest latency compared to any other storage option.
This blog describes how to automate a BeeOND filesystem on an Azure CycleCloud orchestrated Slurm cluster. It demonstrates how to install and configure BeeOND on the compute nodes using a Cloud-init script via CycleCloud, while starting and stopping the BeeOND filesystem for each job is handled by the provided Slurm Prolog and Epilog scripts. By the end you will be able to add a BeeOND filesystem to your Slurm cluster (NOTE: creating the Slurm cluster is outside the scope of this post) with minimal interaction from the users running jobs.
REQUIREMENTS/VERSIONS:
- CycleCloud server (My CC version is 8.5.0-3196)
- Slurm cluster (My Slurm version is 22.05.10-1 and my CC Slurm release version is 3.0.5)
- Azure H-series Compute VMs (My Compute VMs are HB120rs_v3, each with 2x 900GiB ephemeral NVMe drives; a quick way to verify the local NVMe devices is shown after this list)
- My Compute OS images are openlogic:centos-hpc:7_9-gen2:7.9.2022040101 and almalinux:almalinux-hpc:8_7-hpc-gen2:8.7.2023111401
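A quick way to verify the local NVMe devices on a compute node (the device names below are what HBv3 exposes; other SKUs may differ):
# List the local NVMe block devices (HBv3 shows two ~900GiB drives, nvme0n1 and nvme1n1)
lsblk -d -o NAME,SIZE,MODEL | grep -i nvme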
TL/DR:
- Use the provided Cloud-Init script in CC to install/configure the compute partition for BeeOND (NOTE: the script is tailored to HBv3/HBv4 with 2 NVMe drives)
- Use the provided Cloud-Init script in CC to configure the scheduler for BeeOND (includes download/configuration of prolog/epilog)
- Provision (i.e. start) the BeeOND filesystem using this Slurm Prolog script
- De-provision (i.e. stop) the BeeOND filesystem using this Slurm Epilog script (a conceptual sketch of what the Prolog/Epilog do follows this list)
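Conceptually, the Prolog starts a BeeOND instance across the job's nodes (using their local NVMe as the data store) and the Epilog tears it down again. The lines below are only a rough sketch of that idea, not the scripts from the repo (those also handle logging, error handling and making sure start/stop runs only once per job); paths are illustrative:
# Prolog sketch: build a node list for the job and start BeeOND on those nodes
scontrol show hostnames "$SLURM_JOB_NODELIST" > /tmp/beeond.nodes.$SLURM_JOB_ID
beeond start -n /tmp/beeond.nodes.$SLURM_JOB_ID -d /mnt/nvme -c /mnt/beeond
# Epilog sketch: stop BeeOND and clean up the local NVMe data
beeond stop -n /tmp/beeond.nodes.$SLURM_JOB_ID -L -d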
SOLUTION:
2.1: Select the appropriate cluster from your CC cluster list
2.2: Click "Edit" to update the cluster settings
2.3: In the popup window click "Cloud-init" from the menu
2.4: Select your partition to add the BeeOND filesystem (NOTE: the default partition name is hpc but may differ in your cluster)
2.5: Copy the compute Cloud-init script from the Git repo and paste it here (a rough sketch of what this script does follows step 2.6)
2.6: Save the settings
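For reference, the compute Cloud-init script essentially prepares the local NVMe drives as the BeeOND data store and installs the BeeOND packages. The following is an illustrative sketch only; use the actual script from the repo, which also sets up the BeeGFS package repository and is tailored to HBv3/HBv4:
#cloud-config
runcmd:
  # Assemble the two local NVMe drives into a RAID0 device and mount it as the BeeOND data store
  - mdadm --create /dev/md10 --run --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
  - mkfs.xfs /dev/md10
  - mkdir -p /mnt/nvme
  - mount /dev/md10 /mnt/nvme
  # Install the BeeOND tooling (assumes the BeeGFS yum repository is configured first)
  - yum install -y beeond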
3. Copy the scheduler cloud-init script from my Git repo and add it to the CC Slurm cluster following the step 2 instructions (for step 2.4, add this script to the scheduler instead of the hpc partition). Be careful with this step if you already use Prolog/Epilog scripts, as it will replace your existing setup; more custom integration work will be needed in that case. The relevant slurm.conf settings are sketched after step 4.
4. SSH to your Slurm scheduler node and re-scale the compute nodes to update the settings in Slurm (NOTE: this step is not required when using Slurm project release version 3.0+)
sudo /opt/cycle/slurm/cyclecloud_slurm.sh remove_nodes
sudo /opt/cycle/slurm/cyclecloud_slurm.sh scale
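The scheduler cloud-init script downloads the prolog/epilog scripts and points Slurm at them. If you need to verify the result, or merge it with an existing Prolog/Epilog setup, the relevant slurm.conf entries look something like the lines below; the paths are placeholders, so check where the scheduler cloud-init script actually placed them:
# slurm.conf excerpt (paths are placeholders)
Prolog=/sched/scripts/prolog.sh
Epilog=/sched/scripts/epilog.sh
After editing slurm.conf, apply the change with sudo scontrol reconfigure.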
VALIDATE SETUP:
1. Download the test job from GitHub (NOTE: this test script is downloaded to the cluster admin's home directory by the scheduler cloud-init script. Use the command below to download the script for another user):
wget https://raw.githubusercontent.com/themorey/cyclecloud-scripts/main/slurm-beeond/beeond-test.sbatch
#!/bin/bash
#SBATCH --job-name=beeond
#SBATCH -N 2
#SBATCH -n 100
#SBATCH -p hpc
logdir="/sched/$(sudo -i jetpack config cyclecloud.cluster.name)/log"
logfile=$logdir/slurm_beeond.log
echo "#####################################################################################"
echo "df -h: "
df -h
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage: "
beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata: "
beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --getentryinfo: "
beegfs-ctl --mount=/mnt/beeond --getentryinfo /mnt/beeond
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-net: "
beegfs-net
2. Submit the job: sbatch beeond-test.sbatch
3. When the job completes you will have an output file named slurm-2.out in your home directory (assuming the job # is 2; otherwise substitute your job # in the filename). A sample job output will look like this:
#####################################################################################
df -h:
Filesystem         Size  Used Avail Use% Mounted on
devtmpfs           221G     0  221G   0% /dev
tmpfs              221G     0  221G   0% /dev/shm
tmpfs              221G   18M  221G   1% /run
tmpfs              221G     0  221G   0% /sys/fs/cgroup
/dev/sda2           30G   20G  9.6G  67% /
/dev/sda1          494M  119M  375M  25% /boot
/dev/sda15         495M   12M  484M   3% /boot/efi
/dev/sdb1          473G   73M  449G   1% /mnt/resource
/dev/md10          1.8T   69M  1.8T   1% /mnt/nvme
10.40.0.5:/sched    30G   33M   30G   1% /sched
10.40.0.5:/shared  100G   34M  100G   1% /shared
tmpfs               45G     0   45G   0% /run/user/20002
beegfs_ondemand    3.5T  103M  3.5T   1% /mnt/beeond
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage:
jm-hpc-pg0-1 [ID: 1]
jm-hpc-pg0-3 [ID: 2]
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata:
jm-hpc-pg0-1 [ID: 1]
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --getentryinfo:
Entry type: directory
EntryID: root
Metadata node: jm-hpc-pg0-1 [ID: 1]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 512K
+ Number of storage targets: desired: 4
+ Storage Pool: 1 (Default)
#####################################################################################
#####################################################################################

beegfs-net:

mgmt_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: TCP: 1 (172.17.0.1:9008);

meta_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: RDMA: 1 (172.16.1.66:9005);

storage_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: RDMA: 1 (172.16.1.66:9003);
jm-hpc-pg0-3 [ID: 2]
   Connections: RDMA: 1 (172.16.1.76:9003);
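Beyond listing the BeeOND nodes, you can also exercise the filesystem from inside a job with a quick write test. The lines below are only a smoke test (one 4 GiB dd write per node into /mnt/beeond); use a proper benchmark such as IOR for real numbers:
# One task per node writes 4 GiB into the shared BeeOND mount, then the test files are removed
srun --ntasks-per-node=1 bash -c 'dd if=/dev/zero of=/mnt/beeond/ddtest.$(hostname) bs=1M count=4096 oflag=direct'
rm -f /mnt/beeond/ddtest.*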
CONCLUSION
Creating a fast parallel filesystem on Azure does not have to be difficult or expensive. This blog has shown how the installation and configuration of a BeeOND filesystem can be automated for a Slurm cluster (it will also work with other scheduler types with some adaptation of the prolog/epilog configuration). Because BeeOND is non-persistent, shared job scratch, the data should reside on persistent storage (e.g. NFS, Blob) and be staged to and from the BeeOND mount (/mnt/beeond in these setup scripts) as part of the job script, as sketched below.
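A minimal sketch of that staging pattern, assuming job data lives on the /shared NFS export shown earlier (all paths are illustrative):
#!/bin/bash
#SBATCH -N 2
#SBATCH -p hpc

# Stage in: copy input data from persistent storage to the per-job BeeOND scratch
cp -r /shared/home/$USER/myjob/input /mnt/beeond/

# ... run the application against /mnt/beeond ...

# Stage out: copy results back before the job ends (the Epilog destroys /mnt/beeond)
cp -r /mnt/beeond/output /shared/home/$USER/myjob/results_$SLURM_JOB_ID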
LEARN MORE
Learn more about Azure CycleCloud
Read more about Azure HPC + AI
Take the Azure HPC learning path