Automate BeeOND Filesystem on Azure CycleCloud Slurm Cluster
Published Sep 29 2022


 

UPDATE 1/16/2024:
- Added support for running BeeOND with the AlmaLinux 8.7 HPC image on compute nodes.
- Created a cloud-init script for scheduler nodes (previously a manual process).

 

OVERVIEW

Azure CycleCloud (CC) is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, deploy/mount filesystems and automatically scale the infrastructure to run jobs efficiently at any scale.  

BeeOND ("BeeGFS On Demand", pronounced like the word "beyond") is a per-job BeeGFS parallel filesystem that aggregates the local NVMe/SSD drives of Azure VMs into a single filesystem for the duration of a job (NOTE: this is not persistent storage and only exists while the job is running, so data must be staged into and out of BeeOND).  This provides the performance benefit of a fast, shared job scratch space at no additional cost, as the local NVMe/SSD is included in the price of the VM.  The BeeOND filesystem uses the InfiniBand fabric of the H-series VMs to deliver higher bandwidth (up to 400Gbps NDR) and lower latency than any other storage option.

 

This blog describes how to automate a BeeOND filesystem on an Azure CycleCloud orchestrated Slurm cluster.  It demonstrates how to install and configure BeeOND on the compute nodes using a cloud-init script in CycleCloud.  Starting and stopping the BeeOND filesystem for each job is handled by the provided Slurm Prolog and Epilog scripts, as sketched below.  By the end you will be able to add a BeeOND filesystem to your Slurm cluster (NOTE: creating the Slurm cluster itself is outside the scope of this post) with minimal interaction from the users running jobs.
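To illustrate the mechanism before diving into the setup, the sketch below shows the kind of logic a Prolog/Epilog pair uses to start and stop BeeOND for a job.  It is a simplified outline only; the actual Prolog/Epilog scripts referenced below handle the details (such as running only on the job's lead node), and the nodefile path and data/mount directories here are examples.

# Prolog (simplified sketch): build a nodefile for the job and start BeeOND
NODEFILE=/tmp/beeond.nodefile.$SLURM_JOB_ID
scontrol show hostnames "$SLURM_JOB_NODELIST" > $NODEFILE
# -d = local NVMe path used for BeeGFS storage/metadata, -c = client mount point
beeond start -n $NODEFILE -d /mnt/nvme/beeond -c /mnt/beeond

# Epilog (simplified sketch): tear down the per-job filesystem and clean up
# (the -L and -d flags follow the stop example in the BeeOND documentation)
beeond stop -n $NODEFILE -L -d
rm -f $NODEFILE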

 

 

REQUIREMENTS/VERSIONS:

- CycleCloud server (My CC version is 8.5.0-3196)

- Slurm cluster (My Slurm version is 22.05.10-1 and my CC Slurm release version is 3.0.5)

- Azure H-series compute VMs (my compute VMs are HB120rs_v3, each with 2x 900GiB ephemeral NVMe drives; see the quick check below)

- My compute node OS images are openlogic:centos-hpc:7_9-gen2:7.9.2022040101 and almalinux:almalinux-hpc:8_7-hpc-gen2:8.7.2023111401
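If you want to confirm the local NVMe drives on a compute node before enabling BeeOND, a quick check such as the one below works (an HB120rs_v3 node shows two ~900GiB NVMe devices; names and counts may differ on other VM sizes):

# List the local block devices on a compute node
lsblk -d -o NAME,SIZE,TYPE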

 

 

TL;DR:

- Use the provided cloud-init script in CC to install/configure the compute partition for BeeOND (NOTE: the script is tailored to HBv3/HBv4 VMs with 2 NVMe drives)

- Use the provided cloud-init script in CC to configure the scheduler for BeeOND (includes download/configuration of the Prolog/Epilog scripts)
- Provision (i.e., start) the BeeOND filesystem using this Slurm Prolog script

- De-provision (i.e., stop) the BeeOND filesystem using this Slurm Epilog script

 

 

SOLUTION:

  1.  Ensure you have a working CC environment and Slurm cluster.
  2.  Copy the compute cloud-init script from my Git repo and add it to the CC Slurm cluster (a simplified sketch of what the script does follows these steps):

2.1:  Select the appropriate cluster from your CC cluster list
2.2:  Click "Edit" to update the cluster settings
2.3:  In the popup window, click "Cloud-init" in the menu
2.4:  Select the partition that will host the BeeOND filesystem (NOTE: the default partition name is hpc but may differ in your cluster)
2.5:  Copy the cloud-init script from the Git repo and paste it here
2.6:  Save the settings
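For context, the compute cloud-init script essentially prepares the local NVMe drives and installs the BeeOND packages.  The sketch below only illustrates that kind of setup, assuming the two HBv3 NVMe devices appear as /dev/nvme0n1 and /dev/nvme1n1 and that the RAID0 device is mounted at /mnt/nvme (matching the sample df output later in this post); use the actual script from the Git repo rather than this sketch.

#!/bin/bash
# Illustrative only -- the compute cloud-init script in the Git repo is the source of truth.

# Combine the two local NVMe drives into a single RAID0 device and mount it
mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.xfs /dev/md10
mkdir -p /mnt/nvme
mount /dev/md10 /mnt/nvme

# Install the BeeOND packages from the BeeGFS repository (repo setup omitted here)
yum install -y beeond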

 

3.  Copy the scheduler cloud-init script from my Git repo and add it to the CC Slurm cluster following the step 2 instructions (for step 2.4, add this script to the scheduler instead of the hpc partition).  Be careful with this step if you already use Prolog/Epilog scripts, as it will replace your existing setup; in that case more custom integration work will be needed.
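For reference, wiring a Prolog/Epilog into Slurm comes down to a few slurm.conf entries like the ones below.  The paths shown are placeholders, not necessarily what the scheduler cloud-init script configures, so check the script in the repo if you need to merge it with an existing Prolog/Epilog setup.

# Example slurm.conf entries for a per-job Prolog/Epilog (placeholder paths)
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh
# Run the Prolog on every node allocated to the job at job start
PrologFlags=Alloc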

 

4.  SSH to your Slurm scheduler node and re-scale the compute nodes to update the settings in Slurm (NOTE:  this step is not required when using Slurm project release version 3.0+)

 

sudo /opt/cycle/slurm/cyclecloud_slurm.sh remove_nodes
sudo /opt/cycle/slurm/cyclecloud_slurm.sh scale
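After re-scaling (or immediately, on Slurm project release 3.0+), you can confirm from the scheduler that the Prolog/Epilog settings were picked up and that the compute partition is healthy:

# Show the Prolog/Epilog paths Slurm is configured with
scontrol show config | grep -i -E 'prolog|epilog'

# Confirm the compute partition nodes are registered
sinfo -p hpc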

 

 

VALIDATE SETUP:

1.  Download the test job from GitHub (NOTE: the scheduler cloud-init script already downloads this test script to the cluster admin's home directory.  Use the command below to download the script for another user):

 

wget https://raw.githubusercontent.com/themorey/cyclecloud-scripts/main/slurm-beeond/beeond-test.sbatch

 

 

#!/bin/bash
#SBATCH --job-name=beeond
#SBATCH -N 2
#SBATCH -n 100
#SBATCH -p hpc

logdir="/sched/$(sudo -i jetpack config cyclecloud.cluster.name)/log"
logfile=$logdir/slurm_beeond.log

echo "#####################################################################################"
echo "df -h:   "
df -h
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage:   "
beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata:   "
beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --getentryinfo:   "
beegfs-ctl --mount=/mnt/beeond --getentryinfo /mnt/beeond
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-net:   "
beegfs-net

 

2.  Submit the job:  sbatch beeond-test.sbatch
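While the job runs you can watch its state from the scheduler, and the output file will appear in the directory you submitted from:

squeue -u $USER        # job state while it is pending/running
ls slurm-*.out         # output file appears here once the job completes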

 

3.  When the job completes you will have an output file named slurm-2.out in your home directory (assuming the job ID is 2; otherwise substitute your job ID in the filename).  A sample job output looks like this:

#####################################################################################
df -h:
Filesystem         Size  Used Avail Use% Mounted on
devtmpfs           221G     0  221G   0% /dev
tmpfs              221G     0  221G   0% /dev/shm
tmpfs              221G   18M  221G   1% /run
tmpfs              221G     0  221G   0% /sys/fs/cgroup
/dev/sda2           30G   20G  9.6G  67% /
/dev/sda1          494M  119M  375M  25% /boot
/dev/sda15         495M   12M  484M   3% /boot/efi
/dev/sdb1          473G   73M  449G   1% /mnt/resource
/dev/md10          1.8T   69M  1.8T   1% /mnt/nvme
10.40.0.5:/sched    30G   33M   30G   1% /sched
10.40.0.5:/shared  100G   34M  100G   1% /shared
tmpfs               45G     0   45G   0% /run/user/20002
beegfs_ondemand    3.5T  103M  3.5T   1% /mnt/beeond
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage:
jm-hpc-pg0-1 [ID: 1]
jm-hpc-pg0-3 [ID: 2]
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata:
jm-hpc-pg0-1 [ID: 1]
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --getentryinfo:
Entry type: directory
EntryID: root
Metadata node: jm-hpc-pg0-1 [ID: 1]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 512K
+ Number of storage targets: desired: 4
+ Storage Pool: 1 (Default)
#####################################################################################
#####################################################################################

beegfs-net:

mgmt_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: TCP: 1 (172.17.0.1:9008);

meta_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: RDMA: 1 (172.16.1.66:9005);

storage_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: RDMA: 1 (172.16.1.66:9003);
jm-hpc-pg0-3 [ID: 2]
   Connections: RDMA: 1 (172.16.1.76:9003);

 

 

CONCLUSION

Creating a fast parallel filesystem on Azure does not have to be difficult or expensive.  This blog has shown how the installation and configuration of a BeeOND filesystem can be automated for a Slurm cluster (it will also work with other cluster types by adapting the Prolog/Epilog configuration).  Because this is non-persistent shared job scratch, the data should reside on persistent storage (e.g., NFS, Blob) and be staged to and from the BeeOND mount (/mnt/beeond in these setup scripts) as part of the job script, as in the example below.
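As an illustration, a job script can stage data into and out of BeeOND like this; the /shared/data/myproject path and my_app binary are placeholders for your own persistent storage location and application:

#!/bin/bash
#SBATCH -N 2
#SBATCH -p hpc

# Stage input data from persistent storage (the NFS /shared mount here) into BeeOND
cp -a /shared/data/myproject /mnt/beeond/

# Run the application against the fast BeeOND scratch space
srun ./my_app /mnt/beeond/myproject

# Stage results back before the job ends (the Epilog tears BeeOND down after the job)
cp -a /mnt/beeond/myproject/results /shared/data/myproject/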

 

 

LEARN MORE

Learn more about Azure CycleCloud

Read more about Azure HPC + AI

Take the Azure HPC learning path
