Authenticating to an Azure CycleCloud Slurm cluster with Azure Active Directory
Published Aug 25 2022 07:29 AM
Microsoft

 

Overview:

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.

 

As enterprises increasingly move to Azure Active Directory for their authentication needs, this blog explores how Azure AD and OpenSSH certificate-based authentication can be used to authenticate users to a Slurm cluster. We also utilise the recent Azure Bastion native client support feature to provide remote access to the Login Node over the public internet.

 

To summarise, we use the native Azure AD authentication for Linux to access the Login Node through the Azure Bastion host, using a temporary, automatically provisioned SSH key. Once logged in to the Login Node, the CycleCloud-provisioned user account and SSH keys handle authentication to the scheduler and compute nodes. AAD authentication can improve the security of access to our environment by making it possible to apply Conditional Access policies, for example requiring multi-factor authentication before SSH can be used.

 

aad-auth.png

 

Components:

This solution uses an existing Azure AD Tenant and very standard deployments of CycleCloud 8.2, Azure Files NFS (to provide a persistent /shared folder), a Login Node (more details later) and Azure Bastion (Standard SKU). I prefer to deploy these using Bicep. The OS used for all VMs is the AlmaLinux 8.5 HPC image.

 

Solution:

 

1) Azure Bastion: this is a typical deployment of Azure Bastion. The only additional considerations are to ensure it uses the Standard SKU and that enableTunneling is set to true.

 

 

 

 

 

resource azureBastion 'Microsoft.Network/bastionHosts@2022-01-01' = {
  name: bastionName
  location: location
  properties: {
    enableTunneling: true
    ipConfigurations: [
      {
        name: 'IpConf'
        properties: {
          subnet: {
            id: '${vnetId}/subnets/AzureBastionSubnet'
          }
          publicIPAddress: {
            id: pip.id
          }
        }
      }
    ]
  }
  sku: {
    name: 'Standard'
  }
}
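For a Bastion host that already exists, the two settings called out above can be checked from the Azure CLI. The following is a minimal sketch: the helper name check_bastion and the resource names in the usage line are placeholders, not part of the original deployment.

```shell
# Hypothetical helper: print the SKU name and the tunneling flag of a Bastion host.
# Arguments: <resource-group> <bastion-name>
check_bastion() {
  local rg="$1" bastion="$2"
  az network bastion show \
    --resource-group "$rg" \
    --name "$bastion" \
    --query "[sku.name, enableTunneling]" \
    --output tsv
}
```

Calling `check_bastion myResourceGroup myBastion` should report the Standard SKU and a true tunneling flag if the deployment matches the Bicep above.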

 

 

 

 

 

 

2) Login Node: again a standard virtual machine deployment. To interact with the Slurm cluster it should have Slurm and Munge installed, with configurations matching your Slurm cluster. The /shared folder is also mounted to provide access to the shared home folders. Typically, to enable AAD authentication for Linux we would ensure the VM has a System Assigned Managed Identity and add the AADSSHLoginForLinux extension. As the extension does not currently support AlmaLinux, the packages have been installed using Cloud Init, referencing the RHEL 8 RPMs. Additionally, note how the default home directory for new users has been changed to /shared/home and the use of NFS for home directories has been enabled.

 

 

 

 

 

#cloud-config

#https://github.com/Azure/WALinuxAgent/issues/1938
bootcmd:
  - mkdir -p /etc/systemd/system/walinuxagent.service.d
  - printf '[Unit]\nAfter=cloud-final.service\n' > /etc/systemd/system/walinuxagent.service.d/override.conf
  - sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
  - systemctl daemon-reload

yum_repos:
  packages-microsoft-com-prod:
    baseurl: https://packages.microsoft.com/rhel/8/prod/
    enabled: true
    gpgcheck: true
    gpgkey: https://packages.microsoft.com/keys/microsoft.asc
    name: packages-microsoft-com-prod

packages:
- munge
- nfs-utils
- aadsshlogin-selinux.x86_64
- aadsshlogin.x86_64

mounts:
- ["nfsshares2960c680b0a1578.file.core.windows.net:/nfsshares2960c680b0a1578/shared", /shared, nfs, "vers=4,minorversion=1,sec=sys"]

runcmd:
- mkdir -p /shared
- mount -t nfs nfsshares2960c680b0a1578.file.core.windows.net:/nfsshares2960c680b0a1578/shared /shared -o vers=4,minorversion=1,sec=sys
- cp /shared/apps/slurm/munge.key /etc/munge
- chown -R munge:munge /etc/munge/ /var/log/munge/
- chmod 0700 /etc/munge/ /var/log/munge/
- systemctl enable munge
- systemctl stop munge
- systemctl start munge
- wget https://github.com/Azure/cyclecloud-slurm/releases/download/2.4.1/slurm-20.11.0-0rc2.el8.x86_64.rpm
- wget https://github.com/Azure/cyclecloud-slurm/releases/download/2.4.1/slurm-perlapi-20.11.0-0rc2.el8.x86_64.rpm
- dnf localinstall ./slurm-20.11.0-0rc2.el8.x86_64.rpm -y
- dnf localinstall ./slurm-perlapi-20.11.0-0rc2.el8.x86_64.rpm -y
- groupadd slurm --gid 11100
- useradd -m -d /home/slurm --gid 11100 --uid 11100 slurm
- mkdir /etc/slurm
- cp /shared/apps/slurm/slurm.conf /etc/slurm
- cp /shared/apps/slurm/cyclecloud.conf /etc/slurm
- chown -R slurm:slurm /etc/slurm
- sed -i --follow-symlinks "s/HOME=.*/HOME=\/shared\/home/g" /etc/default/useradd
- setsebool -P use_nfs_home_dirs on
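To make the home directory change concrete, the sed expression from the cloud-init file can be tried on a scratch copy rather than on the real /etc/default/useradd. This is only an illustration: the sample file contents are assumed, and --follow-symlinks is left out because the scratch file is a regular file.

```shell
# Build a scratch file that resembles /etc/default/useradd (contents are assumed).
tmp=$(mktemp)
printf 'GROUP=100\nHOME=/home\nSHELL=/bin/bash\n' > "$tmp"

# Same substitution as in the cloud-init runcmd section.
sed -i "s/HOME=.*/HOME=\/shared\/home/g" "$tmp"

# Show the resulting default home directory base, then clean up.
new_home=$(grep '^HOME=' "$tmp")
echo "$new_home"   # prints HOME=/shared/home
rm -f "$tmp"
```

Note that the pattern is not anchored, so it would rewrite any line containing HOME=. On the real Login Node, `getsebool use_nfs_home_dirs` can be used to confirm that the SELinux boolean is on.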

 

 

 

 

 

To be able to access the VM, the last thing to do is to assign an RBAC role that allows the user to log in. This could be the Virtual Machine User Login role for normal users, or the Virtual Machine Administrator Login role for system administrators. Here is an example that assigns the role to a standard user:

 

 

 

 

 

# Signed-in user who will receive the role
username=$(az account show --query user.name --output tsv)
# Resource group name (not its resource ID) and the resource ID of the Login Node VM
rg=myResourceGroup
vm=$(az vm show --resource-group "$rg" --name loginnode --query id --output tsv)

az role assignment create \
    --role "Virtual Machine User Login" \
    --assignee "$username" \
    --scope "$vm"

 

 

 

 

 

More details about the necessary steps to enable the SSH login on the VM can be found here.
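For comparison, on a distribution that the extension does support, the two typical steps mentioned earlier (a system-assigned managed identity plus the AADSSHLoginForLinux extension) can be sketched with the Azure CLI as follows. The helper name enable_aad_ssh and the resource names in the usage line are placeholders; on AlmaLinux the cloud-init approach shown above is used instead.

```shell
# Hypothetical helper: enable AAD SSH login on a VM running a supported distribution.
# Arguments: <resource-group> <vm-name>
enable_aad_ssh() {
  local rg="$1" vm="$2"
  # Give the VM a system-assigned managed identity, then install the extension.
  az vm identity assign --resource-group "$rg" --name "$vm" &&
    az vm extension set \
      --publisher Microsoft.Azure.ActiveDirectory \
      --name AADSSHLoginForLinux \
      --resource-group "$rg" \
      --vm-name "$vm"
}
```

Usage would look like `enable_aad_ssh myResourceGroup loginnode`.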

 

With the role assignment authorizing our user to log in to the VM in place, we can use the Azure CLI to connect to the Login Node via Azure Bastion.
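The connection through Bastion with the native client can be sketched as below. The helper name connect_login_node and the resource names are placeholders. Depending on the Azure CLI version, the ssh and bastion CLI extensions may need to be installed first, and `az login` must already have been run as the AAD user.

```shell
# Hypothetical helper: open an SSH session to a VM through Azure Bastion using AAD.
# Arguments: <resource-group> <bastion-name> <vm-name>
connect_login_node() {
  local rg="$1" bastion="$2" vm_name="$3" vm_id
  # Bastion needs the full resource ID of the target VM.
  vm_id=$(az vm show --resource-group "$rg" --name "$vm_name" --query id --output tsv)
  az network bastion ssh \
    --resource-group "$rg" \
    --name "$bastion" \
    --target-resource-id "$vm_id" \
    --auth-type AAD
}
```

For example, `connect_login_node myResourceGroup myBastion loginnode` would start the session, with any Conditional Access prompts handled by the Azure CLI login.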

 

trevcc_2-1659946105464.png

Note the user’s home directory name and uid/gid, and create an SSH key pair for use with the cluster. These will be used to create a matching ‘local’ user in the CycleCloud user management system.
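Once logged in to the Login Node, the values needed for the CycleCloud user can be collected with a couple of standard commands. This is a sketch only: the helper names, the key file name cyclecloud_key and the RSA key type are assumptions, not requirements stated in this post.

```shell
# Print the values needed when creating the matching CycleCloud user.
show_identity() {
  printf 'user=%s uid=%s gid=%s home=%s\n' \
    "$(basename "$HOME")" "$(id -u)" "$(id -g)" "$HOME"
}

# Create a key pair if it does not exist yet, then print the public key.
# ssh-keygen prompts for a passphrase. Argument: [key-file] (assumed default below).
make_cluster_key() {
  local key_file="${1:-$HOME/.ssh/cyclecloud_key}"
  mkdir -p "$(dirname "$key_file")"
  [ -f "$key_file" ] || ssh-keygen -t rsa -b 4096 -f "$key_file"
  cat "$key_file.pub"
}

show_identity
```

Running `make_cluster_key` afterwards prints the public key to paste into the CycleCloud user record.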

 

3) CycleCloud: with user access to the Login Node now established, we must grant the user access to the Compute Nodes. For this, the built-in CycleCloud user management system is used, and we create a ‘local’ user matching our AAD principal.

 

trevcc_3-1659946177729.png

The default CycleCloud Slurm template is used to create the cluster with the default NFS share mounted from the Azure Files NFS share.

 

trevcc_4-1659946219889.png

 

Conclusion:

With the cluster started and the ‘local’ user assigned, we can update the Login Node to ensure it has the correct munge key and that slurm.conf points to the scheduler. From the Login Node, our AAD-authenticated user can now submit and run jobs on the cluster.
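A quick check from the Login Node might look like the sketch below. The helper name cluster_smoke_test is a placeholder. The munge round trip only validates the local key, and on an autoscaling cluster the srun step may wait while a compute node is provisioned.

```shell
# Hypothetical helper: basic end-to-end check run as the AAD-authenticated user.
cluster_smoke_test() {
  # Encode and decode a credential with the local munge key.
  munge -n | unmunge >/dev/null || { echo "munge check failed" >&2; return 1; }
  # List partitions and node states to confirm the scheduler is reachable.
  sinfo || return 1
  # Run a trivial job on one compute node.
  srun --nodes=1 hostname
}
```

If `cluster_smoke_test` prints the partition list followed by a compute node hostname, the munge key, slurm.conf and user mapping are consistent across the Login Node and the cluster.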

 

trevcc_5-1659946284247.png

 

Reference:

Azure AD and OpenSSH 

Azure CycleCloud

Azure HPC 

Azure Bastion - Native Client  

 

Last update: Aug 24 2022 03:27 PM