Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.
In this blog post, we discuss how to integrate an external PBS master node with CycleCloud so that on-premises workloads can be sent to the cloud for processing (a pattern known as "cloud bursting") or run in hybrid HPC scenarios. For demonstration purposes, I am creating a PBS master node in Azure as the external master in one VNET, with the execute nodes managed by CycleCloud in a separate VNET. We are not discussing the networking complexities involved in hybrid scenarios.
Architecture:
Environment:
Preparing the master node:
In this example, I am using the OpenLogic CentOS-HPC 7.7 image on a Standard_D8s_v4 VM as the master (head) node for the PBS scheduler.
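If you prefer creating this VM from the Azure CLI, a sketch along these lines should work (the resource group, VNet, and admin username here are placeholders, not from this setup):
az vm create \
  --resource-group hpc-rg \
  --name hnpbs \
  --image OpenLogic:CentOS-HPC:7.7:latest \
  --size Standard_D8s_v4 \
  --vnet-name master-vnet \
  --subnet default \
  --admin-username azureuser \
  --generate-ssh-keys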
Install the prerequisites for configuring the NFS server and installing the PBSPro scheduler.
yum install python3 nfs-utils -y
Create a shared directory for the centralized home directories.
mkdir /shared
echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports
systemctl start nfs-server
systemctl enable nfs-server
Check the NFS server status:
[root@hnpbs ~]# showmount -e
Export list for hnpbs:
/shared *
Download the following packages from GitHub (I am using version 2.0.9 here; if your project uses a different version, download the packages for that release).
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/hwloc-libs-1.11.9-3.el8.x86_64.rpm
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/pbspro-debuginfo-18.1.4-0.x86_64.rpm
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/pbspro-server-18.1.4-0.x86_64.rpm
Install the PBSPro package on the master node.
yum localinstall *.rpm
Check the PBS configuration file. Note that PBS_START_MOM=0, meaning no MoM daemon runs on this host, so the master node itself will not execute jobs.
[root@hnpbs ~]# cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=hnpbs
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
Start and enable the PBS scheduler service.
systemctl start pbs
systemctl enable pbs
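Optionally, run a quick sanity check that the scheduler is up before moving on:
systemctl status pbs   # should report active (running)
qstat -B               # the PBS server should show as active
pbsnodes -a            # will list execute nodes once they join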
Preparing the CycleCloud environment:
Create the headless template using the stock template.
git clone https://github.com/Azure/cyclecloud-pbspro.git
cd /home/vinil/cyclecloud-pbspro/templates
Remove the following sections to make a headless template:
[[node server]]
[[nodearray login]]
[[[parameter serverMachineType]]]
[[[parameter SchedulerImageName]]]
[[[parameter NumberLoginNodes]]]
Update the following variables for this setup, since I am using CentOS 7 and PBSPro version 18.
[[[configuration cyclecloud.mounts.nfs_sched]]]
type = nfs
mountpoint = /sched
disabled = true
#IMPORTANT: update the master node hostname
<--............-->
[[nodearray execute]]
MachineType = $ExecuteMachineType
MaxCoreCount = $MaxExecuteCoreCount
Interruptible = $UseLowPrio
AdditionalClusterInitSpecs = $ExecuteClusterInitSpecs
[[[configuration]]]
autoscale.enabled = $Autoscale
pbspro.scheduler = hnpbs
Reference template: https://github.com/vinilvadakkepurakkal/cyclecloud-pbsproheadless/blob/main/openpbs.txt
Import the custom template into CycleCloud.
cyclecloud import_template -f openpbs.txt
You can now see a new template named “OpenPBS-headless” in the CycleCloud portal.
Create a cluster using the following parameters.
a. Select the network – a different subnet from the master node's
b. Add the NFS server IP address (Master node IP)
c. Disable the return proxy, and select CentOS 7 and PBSPro v18 as the software configuration
d. Add a cloud-init script for master node name resolution, as sketched below (change the IP and hostname based on your setup). Save and start the cluster.
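A minimal sketch of such a script, assuming 10.1.0.4 as the master node's IP address (replace with your own values):
#!/bin/bash
# Make the external PBS master resolvable from the execute nodes.
# 10.1.0.4 is a placeholder; use your master node's actual IP.
echo "10.1.0.4 hnpbs" >> /etc/hosts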
e. Once the cluster has started, add a node to the cluster. This is required to set up hostname resolution from the master node to the CycleCloud compute nodes. During startup, CycleCloud automatically creates an /etc/hosts file containing all the execute nodes and their hostnames; it is easier to use that file than to build one ourselves.
f. You will see the node in an error state; that's normal. Connect to the node and copy the /etc/hosts file:
ssh cyclecloud@10.0.0.7
$ sudo -i
# cp /etc/hosts /shared/
g. Terminate the newly added node.
Integrating the external master node with CycleCloud
Update the master node's /etc/hosts with the CycleCloud execute node hostnames (/shared/hosts is the file we copied while adding a node).
grep ip- /shared/hosts >> /etc/hosts
Download the following package and extract it on the external master node.
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/cyclecloud-pbspro-pkg-2.0.9.tar.gz
tar -zxf cyclecloud-pbspro-pkg-2.0.9.tar.gz
Run the installer scripts to set up the CycleCloud integration.
cd cyclecloud-pbspro/
./initialize_pbs.sh
./initialize_default_queues.sh
./install.sh --venv /opt/cycle/pbspro/venv
Create the autoscale.json file required by the autoscaler. Here is the command to generate it:
./generate_autoscale_json.sh --username username --password password --url https://fqdn:port --cluster-name cluster_name
Here is the output:
./generate_autoscale_json.sh --username vinil --password <password> --url https://<ipaddress_of_cc_server> --cluster-name hbpbs
testing that we can connect to CycleCloud...
success!
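The script writes the configuration alongside the installation (typically /opt/cycle/pbspro/autoscale.json). The exact schema depends on the package version, but the file is roughly of this shape (an illustrative sketch, not captured from this setup):
{
  "url": "https://<ipaddress_of_cc_server>",
  "username": "vinil",
  "password": "...",
  "cluster_name": "hbpbs",
  "default_resources": [
    { "select": {}, "name": "ncpus", "value": "node.vcpu_count" }
  ]
}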
Run the azpbs validate command to validate the configuration. It will suggest corrections to make on the external master.
[root@hnpbs cyclecloud-pbspro]# azpbs validate
ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
Edit /var/spool/pbs/sched_priv/sched_config and add ungrouped and group_id to the resources line.
#grep ^resources /var/spool/pbs/sched_priv/sched_config
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ungrouped, group_id"
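As the validation messages suggest, restart PBS after the change so the scheduler picks it up:
systemctl restart pbs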
Run azpbs validate again to verify the changes; no output means the configuration is clean.
[root@hnpbs cyclecloud-pbspro]# azpbs validate
[root@hnpbs cyclecloud-pbspro]#
Run azpbs autoscale; it should complete without any errors.
[root@hnpbs cyclecloud-pbspro]# azpbs autoscale
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK MEM NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
Testing the jobs
Create a regular user for submitting jobs. I am creating the same user that exists in CycleCloud, with the same UID and GID, and with the home directory under /shared.
groupadd -g 20001 vinil
useradd -g 20001 -u 20001 -d /shared/home/vinil -s /bin/bash vinil
Submit an interactive job using qsub -I to test the functionality.
[root@hnpbs server_priv]# su - vinil
[vinil@hnpbs ~]$ qsub -I
qsub: waiting for job 7.hnpbs to start
You can now see a new node getting created in the CycleCloud portal.
You can also see the node being created in the output of the azpbs autoscale command.
[root@hnpbs pbspro]# azpbs autoscale
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK MEM NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
execute-1 2yab3000005 7 Preparing Standard_F2s_v2 20.00g/20.00g 4.00g/4.00g 0/1 0/0 s_v2_pg0 execute execute false 26af9032ae0 3228.5 -1
Once the new node has been provisioned by CycleCloud, the interactive job starts on it:
[root@hnpbs cyclecloud-pbspro]# su - vinil
Last login: Thu Feb 10 10:38:35 UTC 2022 on pts/0
[vinil@hnpbs ~]$ qsub -I
qsub: waiting for job 7.hnpbs to start
qsub: job 7.hnpbs ready
[vinil@ip-0A00001C ~]$
The qstat and azpbs autoscale outputs are as follows:
[root@hnpbs pbspro]# qstat -an
hnpbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7.hnpbs vinil workq STDIN 11378 1 1 -- -- R 00:04
ip-0A00001C/0
[root@hnpbs pbspro]#
[root@hnpbs pbspro]# azpbs autoscale
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK MEM NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
execute-1 ip-0A00001C job-busy 7 Ready Standard_F2s_v2 20.00gb/20.00gb 4.00gb/4.00gb 0/1 0/0 s_v2_pg0 execute execute false 26af9032ae0 -1 -1
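Batch jobs trigger autoscaling the same way as the interactive test. A minimal test script (illustrative) might look like this:
#!/bin/bash
#PBS -N autoscale-test
#PBS -q workq
#PBS -l select=1:ncpus=2
# Print which node picked up the job, then keep it busy briefly.
hostname
sleep 60
Submit it with qsub as the test user and watch the node appear in the azpbs autoscale output or in the CycleCloud portal.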
We have successfully integrated an external PBS master node with CycleCloud.
NOTE: When working with an on-premises master node, make sure the required network ports for the PBS scheduler, compute nodes, file shares, license server, etc. are open so all components can communicate.
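For example, with firewalld on the master node, opening the PBS Pro defaults might look like this (the port numbers assume an unmodified PBS Pro configuration; verify them for your version):
firewall-cmd --permanent --add-port=15001-15004/tcp   # pbs_server, pbs_mom, resmon, scheduler
firewall-cmd --permanent --add-port=17001/tcp         # pbs_comm
firewall-cmd --permanent --add-service=nfs            # shared home directory export
firewall-cmd --reload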