Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.
In this blog post, we discuss how to integrate an external PBS master node with CycleCloud so that on-premises workloads can be sent to the cloud for processing (a pattern known as "cloud bursting") or run in hybrid HPC scenarios. For demonstration purposes, I am creating a PBS master node in Azure as the external master in one VNET, with the execute nodes managed by CycleCloud in a separate VNET. We are not discussing the networking complexities involved in hybrid scenarios.
Architecture:
Environment:
Preparing the master node:
In this example, I am using the OpenLogic CentOS-HPC 7.7 image on a Standard_D8s_v4 VM as the master (head) node for the PBS scheduler.
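If you prefer creating this VM from the Azure CLI, a sketch along these lines should work (the resource group, VNet, and admin username here are placeholders, not from this setup):
az vm create \
  --resource-group hpc-rg \
  --name hnpbs \
  --image OpenLogic:CentOS-HPC:7.7:latest \
  --size Standard_D8s_v4 \
  --vnet-name master-vnet \
  --subnet default \
  --admin-username azureuser \
  --generate-ssh-keys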
Install the prerequisites for configuring the NFS server and installing the PBSPro scheduler.
yum install python3 nfs-utils -y
Create a shared directory for the centralized home directories.
mkdir /shared
echo "/shared *(rw,sync,no_root_squash)" >> /etc/exports
systemctl start nfs-server
systemctl enable nfs-server
Check the NFS server status:
[root@hnpbs ~]# showmount -e
Export list for hnpbs:
/shared *
Download the following packages from GitHub (I am using version 2.0.9 here; if your project uses a different version, download the packages for that release).
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/hwloc-libs-1.11.9-3.el8.x86_64.rpm
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/pbspro-debuginfo-18.1.4-0.x86_64.rpm
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/pbspro-server-18.1.4-0.x86_64.rpm
Install the PBSPro package on the master node.
yum localinstall *.rpm
Check the PBS configuration file. Note that PBS_START_MOM=0, meaning no MoM daemon runs on this host, so the master node itself will not execute jobs.
[root@hnpbs ~]# cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=hnpbs
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
Start and enable the PBS scheduler service.
systemctl start pbs
systemctl enable pbs
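Optionally, run a quick sanity check that the scheduler is up before moving on:
systemctl status pbs   # should report active (running)
qstat -B               # the PBS server should show as active
pbsnodes -a            # will list execute nodes once they join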
Preparing the CycleCloud environment:
Create the headless template using the stock template.
git clone https://github.com/Azure/cyclecloud-pbspro.git
cd /home/vinil/cyclecloud-pbspro/templates
Remove the following sections to make a headless template:
[[node server]]
[[nodearray login]]
[[[parameter serverMachineType]]]
[[[parameter SchedulerImageName]]]
[[[parameter NumberLoginNodes]]]
Update the following variables for this setup, since I am using CentOS 7 and PBSPro version 18.
[[[configuration cyclecloud.mounts.nfs_sched]]]
type = nfs
mountpoint = /sched
disabled = true
#IMPORTANT: update the master node hostname
<--............-->
[[nodearray execute]]
MachineType = $ExecuteMachineType
MaxCoreCount = $MaxExecuteCoreCount
Interruptible = $UseLowPrio
AdditionalClusterInitSpecs = $ExecuteClusterInitSpecs
[[[configuration]]]
autoscale.enabled = $Autoscale
pbspro.scheduler = hnpbs
Reference template: https://github.com/vinilvadakkepurakkal/cyclecloud-pbsproheadless/blob/main/openpbs.txt
Import the custom template into CycleCloud.
cyclecloud import_template -f openpbs.txt
You can now see a new template named “OpenPBS-headless” in the CycleCloud portal.
Create a cluster using the following parameters.
a. Select the network – a different subnet from the master node's
b. Add the NFS server IP address (Master node IP)
c. Disable the return proxy, and select CentOS 7 and PBSPro v18 as the software configuration
d. Add a cloud-init script for master node name resolution, as sketched below (change the IP and hostname based on your setup). Save and start the cluster.
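A minimal sketch of such a script, assuming 10.1.0.4 as the master node's IP address (replace with your own values):
#!/bin/bash
# Make the external PBS master resolvable from the execute nodes.
# 10.1.0.4 is a placeholder; use your master node's actual IP.
echo "10.1.0.4 hnpbs" >> /etc/hosts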
e. Once the cluster has started, add a node to the cluster. This is required to set up hostname resolution from the master node to the CycleCloud compute nodes. During startup, CycleCloud automatically creates an /etc/hosts file containing all the execute nodes and their hostnames; it is easier to use that file than to build one ourselves.
f. You will see the node in an error state; that's normal. Connect to the node and copy the /etc/hosts file:
ssh cyclecloud@10.0.0.7
$ sudo -i
# cp /etc/hosts /shared/
g. Terminate the newly added node.
Integrating the external master node with CycleCloud
Update the master node's /etc/hosts with the CycleCloud execute node hostnames (/shared/hosts is the file we copied while adding a node).
grep ip- /shared/hosts >> /etc/hosts
Download the following package and extract it on the external master node.
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.9/cyclecloud-pbspro-pkg-2.0.9.tar.gz
tar -zxf cyclecloud-pbspro-pkg-2.0.9.tar.gz
Run the installer scripts to set up the CycleCloud integration.
cd cyclecloud-pbspro/
./initialize_pbs.sh
./initialize_default_queues.sh
./install.sh --venv /opt/cycle/pbspro/venv
Create the autoscale.json file required by the autoscaler. Here is the command to generate it:
./generate_autoscale_json.sh --username username --password password --url https://fqdn:port --cluster-name cluster_name
Here is the output:
./generate_autoscale_json.sh --username vinil --password <password> --url https://<ipaddress_of_cc_server> --cluster-name hbpbs
testing that we can connect to CycleCloud...
success!
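The script writes the configuration alongside the installation (typically /opt/cycle/pbspro/autoscale.json). The exact schema depends on the package version, but the file is roughly of this shape (an illustrative sketch, not captured from this setup):
{
  "url": "https://<ipaddress_of_cc_server>",
  "username": "vinil",
  "password": "...",
  "cluster_name": "hbpbs",
  "default_resources": [
    { "select": {}, "name": "ncpus", "value": "node.vcpu_count" }
  ]
}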
Run the azpbs validate command to validate the configuration. It will suggest corrections to make on the external master.
[root@hnpbs cyclecloud-pbspro]# azpbs validate
ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
Edit /var/spool/pbs/sched_priv/sched_config and add ungrouped and group_id to the resources line.
#grep ^resources /var/spool/pbs/sched_priv/sched_config
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ungrouped, group_id"
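As the validation messages suggest, restart PBS after the change so the scheduler picks it up:
systemctl restart pbs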
Run azpbs validate again to verify the changes; no output means the configuration is clean.
[root@hnpbs cyclecloud-pbspro]# azpbs validate
[root@hnpbs cyclecloud-pbspro]#
Run azpbs autoscale; it should complete without any errors.
[root@hnpbs cyclecloud-pbspro]# azpbs autoscale
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK MEM NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
Testing the jobs
Create a regular user for submitting jobs. I am creating the same user that exists in CycleCloud, with the same UID and GID, and with the home directory under /shared.
groupadd -g 20001 vinil
useradd -g 20001 -u 20001 -d /shared/home/vinil -s /bin/bash vinil
Submit an interactive job using qsub -I to test the functionality.
[root@hnpbs server_priv]# su - vinil
[vinil@hnpbs ~]$ qsub -I
qsub: waiting for job 7.hnpbs to start
You can now see a new node getting created in the CycleCloud portal.
You can also see the node being created in the output of the azpbs autoscale command.
[root@hnpbs pbspro]# azpbs autoscale
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK MEM NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
execute-1 2yab3000005 7 Preparing Standard_F2s_v2 20.00g/20.00g 4.00g/4.00g 0/1 0/0 s_v2_pg0 execute execute false 26af9032ae0 3228.5 -1
Once the new node has been provisioned by CycleCloud, the interactive job starts on it:
[root@hnpbs cyclecloud-pbspro]# su - vinil
Last login: Thu Feb 10 10:38:35 UTC 2022 on pts/0
[vinil@hnpbs ~]$ qsub -I
qsub: waiting for job 7.hnpbs to start
qsub: job 7.hnpbs ready
[vinil@ip-0A00001C ~]$
The qstat and azpbs autoscale outputs are as follows:
[root@hnpbs pbspro]# qstat -an
hnpbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7.hnpbs vinil workq STDIN 11378 1 1 -- -- R 00:04
ip-0A00001C/0
[root@hnpbs pbspro]#
[root@hnpbs pbspro]# azpbs autoscale
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK MEM NCPUS NGPUS EOE GROUP_ID NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
execute-1 ip-0A00001C job-busy 7 Ready Standard_F2s_v2 20.00gb/20.00gb 4.00gb/4.00gb 0/1 0/0 s_v2_pg0 execute execute false 26af9032ae0 -1 -1
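Batch jobs trigger autoscaling the same way as the interactive test. A minimal test script (illustrative) might look like this:
#!/bin/bash
#PBS -N autoscale-test
#PBS -q workq
#PBS -l select=1:ncpus=2
# Print which node picked up the job, then keep it busy briefly.
hostname
sleep 60
Submit it with qsub as the test user and watch the node appear in the azpbs autoscale output or in the CycleCloud portal.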
We have successfully integrated an external PBS master node with CycleCloud.
NOTE: When working with an on-premises master node, make sure the required network ports for the PBS scheduler, compute nodes, file shares, license server, etc. are open so all components can communicate.
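For example, with firewalld on the master node, opening the PBS Pro defaults might look like this (the port numbers assume an unmodified PBS Pro configuration; verify them for your version):
firewall-cmd --permanent --add-port=15001-15004/tcp   # pbs_server, pbs_mom, resmon, scheduler
firewall-cmd --permanent --add-port=17001/tcp         # pbs_comm
firewall-cmd --permanent --add-service=nfs            # shared home directory export
firewall-cmd --reload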