The main Azure HPC orchestration solutions, Azure Batch and Azure CycleCloud, can be easily adapted to the needs of a specific workload scenario.
One of the features enabling this flexibility is the possibility for end users to create custom OS images for their clusters. These images can be configured to contain specific libraries, applications, drivers or other dependencies required for the workload.
However, managing these customizations can quickly become time-consuming and error-prone without appropriate automation strategies.
This article defines an automated methodology to create custom HPC images through Azure Image Builder and Azure Compute Gallery. It also presents a recipe to deploy the required Azure resources through an Infrastructure as Code (IaC) approach using Bicep templates.
The procedure described in this article leverages the official Azure HPC image preparation scripts contained in the Azure/azhpc-images repository.
Azure HPC images repository
The Azure/azhpc-images repository contains the recipes for the preparation of CentOS-HPC / Ubuntu-HPC / Alma-HPC images for H-series and N-series machines, as described in the Azure HPC images documentation.
The scripts in this repository install the relevant tools, libraries and drivers for the HPC world. For example, all the layers added in the process for a CentOS 7.9 image are listed in the readme file of the official repository.
This repository is a valuable guide for customizing an HPC image starting from a standard Azure Marketplace image, since it allows installing the standard packages for an HPC scenario on a specific OS.
In this article, it will be leveraged for the customization of an Ubuntu 20.04 LTS Azure Marketplace image.
Target scenario
The methodology presented in this article is based on an Azure Image Builder instance deploying an image to an Azure Compute Gallery. The reference architecture is aligned with what is described in the Azure documentation for Azure Image Builder using a virtual network.
The Azure Image Builder instance has a User Assigned Identity with two custom roles assigned: one for reading/joining virtual networks and one for contributing to Managed Images or Compute Gallery images (the role assignments are scoped to the specific Resource Group).
The architecture uses Azure Image Builder without public IPs. This makes the procedure applicable even to organizations whose security policies do not allow public IP deployments.
When Azure Image Builder is deployed inside a Virtual Network, it leverages a Private Link Service. The Private Link Service communicates with a Proxy VM through an Azure Load Balancer, and the Azure Image Builder service interacts with the Build VM through the Proxy VM.
The images are organized in Image Definitions inside an Azure Compute Gallery. An Image Definition is a logical grouping of multiple versions of a specific image, meaning each Image Definition can contain multiple Image Versions.
Azure Image Builder will distribute the custom OS image to a specific Image Definition inside an Azure Compute Gallery, creating a new incremental Image Version.
Bicep automation
The architecture can be deployed using an Infrastructure as Code (IaC) approach.
A Bicep template has been created for this purpose, starting from the ARM templates available in the Azure Image Builder documentation.
The Bicep template contains deployment instructions for all the resources described in the previous paragraph.
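For instance, the Compute Gallery and the Image Definition hosting the custom image could be declared along these lines (a minimal Bicep sketch; the names, identifier values and API versions are illustrative and may differ from the repository template):

// Compute Gallery hosting the custom HPC images (illustrative names and API version)
resource hpcGallery 'Microsoft.Compute/galleries@2022-03-03' = {
  name: 'hpcgallery'
  location: resourceGroup().location
}

// Image Definition grouping the Ubuntu-HPC Image Versions produced by Image Builder
resource ubuntuHpcDefinition 'Microsoft.Compute/galleries/images@2022-03-03' = {
  parent: hpcGallery
  name: 'ubuntu-hpc'
  location: resourceGroup().location
  properties: {
    osType: 'Linux'
    osState: 'Generalized'
    hyperVGeneration: 'V2' // matches the Gen2 Marketplace source image
    identifier: {
      publisher: 'azhpc'
      offer: 'ubuntu-hpc'
      sku: '20_04-lts'
    }
  }
}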
Image Builder definition
The Azure Image Builder Bicep template is the core element of the deployment. The template in the repository is focused on the creation of an Ubuntu-HPC image starting from an Azure Marketplace image. The Bicep reference for Azure Image Builder describes all the possible configuration options for the Image Builder.
The source image for the build process is defined with a source object in the properties of the Azure Image Builder Bicep template:
source: {
  type: 'PlatformImage'
  publisher: 'Canonical'
  offer: '0001-com-ubuntu-server-focal'
  sku: '20_04-lts-gen2'
  version: 'latest'
}
Starting from this image, the Image Builder will apply customizations on the Build VM after the image is loaded. The customize property allows executing different operations during the image creation process:
customize: [
  {
    type: 'Shell'
    name: 'InstallUpgrades'
    inline: [
      'wget https://codeload.github.com/Azure/azhpc-images/zip/refs/heads/master -O azhpc-images-master.zip'
      'sudo apt-get install unzip'
      'unzip azhpc-images-master.zip'
      'sed -i "s%./install_nvidiagpudriver.sh%#./install_nvidiagpudriver.sh%g" azhpc-images-master/ubuntu/ubuntu-20.x/ubuntu-20.04-hpc/install.sh'
      'sed -i \'s%$UBUNTU_COMMON_DIR/install_nccl.sh%#$UBUNTU_COMMON_DIR/install_nccl.sh%g\' azhpc-images-master/ubuntu/ubuntu-20.x/ubuntu-20.04-hpc/install.sh'
      'sed -i \'s%rm /etc/%rm -f /etc/%g\' azhpc-images-master/ubuntu/common/install_monitoring_tools.sh'
      'cd azhpc-images-master/ubuntu/ubuntu-20.x/ubuntu-20.04-hpc/'
      'sudo ./install.sh'
      'cd -'
      'sudo rm -rf azhpc-images-master'
    ]
  }
]
NVIDIA drivers and NVIDIA NCCL are skipped in this example since the image is assumed to be used only for compute nodes.
If NVIDIA drivers are needed and the Build VM is a SKU without an NVIDIA card, the kernel module load will fail at the end of the NVIDIA driver installation.
This can be overcome in two ways:
The target of the image deployment is then defined through the distribute property, which allows selecting among the three available scenarios: a Managed Image, a Compute Gallery, or a VHD in a storage account.
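For the Compute Gallery scenario used in this article, the distribute property looks similar to the following (a sketch; the symbolic reference to the Image Definition and the run output name are assumptions):

distribute: [
  {
    type: 'SharedImage'
    // Resource ID of the target Image Definition in the Compute Gallery (assumed reference)
    galleryImageId: ubuntuHpcDefinition.id
    runOutputName: 'ubuntu-hpc-image'
    replicationRegions: [
      resourceGroup().location
    ]
  }
]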
The image build process involves a real Azure VM. The vmSize property allows specifying the VM SKU to be used, and this has a direct impact on the price per build of the image. Other properties, like the OS disk size and the subnet ID, can be defined through the same vmProfile object.
The Bicep template in the repository uses a Standard_D8ds_v5, while the default (for Gen 2 images) would be Standard_D4ds_v4.
Please consider that this machine selection determines most of the cost of each image builder execution.
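These settings are grouped in the vmProfile object of the image template; a minimal sketch (the OS disk size and the subnet reference are assumptions) could be:

vmProfile: {
  vmSize: 'Standard_D8ds_v5'
  osDiskSizeGB: 64
  vnetConfig: {
    // Subnet hosting the Build VM and the Proxy VM (assumed symbolic reference)
    subnetId: imageBuilderSubnet.id
  }
}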
Deployment and first execution
The deployment of the Bicep template has the following prerequisites:
The following commands in the Azure CLI download the repository and deploy the resources:
git clone https://github.com/wolfgang-desalvador/az-hpc-image-builder.git
cd az-hpc-image-builder
az deployment group create --resource-group <RESOURCE_GROUP_NAME> --template-file main.bicep
A set of mandatory parameters needs to be specified interactively at the beginning of the deployment, or provided through a Bicep parameters file.
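For example, assuming the parameter names used by the image template in the GitHub workflow later in this article (imageBuilderName, destinationGalleryName, destinationImageName), they can be passed inline to the deployment command:

az deployment group create \
  --resource-group <RESOURCE_GROUP_NAME> \
  --template-file main.bicep \
  --parameters imageBuilderName=imageBuilder destinationGalleryName=hpcgallery destinationImageName=ubuntuhpc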
The following optional parameters can be used to avoid deploying the VNET and NSG:
// Virtual Network parameters
@description('Boolean to specify if the virtual network and the network security group need to be deployed.')
param deployVirtualNetwork bool = true
@description('Image Builder Virtual Network name')
param virtualNetworkName string = '${imageBuilderName}-vnet'
@description('Image Builder subnet Network Security Group name')
param nsgName string = '${imageBuilderName}-nsg'
In main.bicep there are parameters set with default values that can be customized for specific needs.
Please note that the deployment of this Bicep file will update any resource with the same name already present in the Resource Group. For example, if a Compute Gallery with the same name is already present in the subscription, its Description will be updated with the new input parameters.
The target Resource group will contain the following resources after the completion of deployment:
The image build process can be triggered from Azure CLI or from Azure Portal:
az resource invoke-action \
  --resource-group <RESOURCE_GROUP_NAME> \
  --resource-type Microsoft.VirtualMachineImages/imageTemplates \
  -n <IMAGE_BUILDER_NAME> \
  --action Run
Another Resource Group (whose name starts with "IT_") will be created in the subscription during the image build process.
This Resource Group contains the resources required by Azure Image Builder, and they will be attached to the subnet defined for Azure Image Builder.
The Resource Group will contain a Load Balancer, a Private Link Service, a Proxy VM and a Build VM. No Public IP will be created.
At the end of each image build process, a new Image Version for the Image Definition will be added inside the Compute Gallery.
Using the custom image in an Azure CycleCloud cluster
The image ID is required to use the newly created image inside an Azure CycleCloud cluster. The ID can be obtained through the Azure Portal, looking inside the Properties of the specific Image Version.
Alternatively, it can be retrieved from the Azure CLI by printing the JSON definition of the Image Version:
az sig image-version list --resource-group <RESOURCE_GROUP_NAME> --gallery-name <GALLERY_NAME> --gallery-image-definition <GALLERY_IMAGE_DEFINITION>
In the case of the current deployment, it will become:
az sig image-version list --resource-group <RESOURCE_GROUP_NAME> --gallery-name hpcgallery --gallery-image-definition ubuntu-hpc
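To print only the image IDs, the standard --query option of the Azure CLI can be appended to the same command:

az sig image-version list \
  --resource-group <RESOURCE_GROUP_NAME> \
  --gallery-name hpcgallery \
  --gallery-image-definition ubuntu-hpc \
  --query "[].id" --output tsv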
The form of the ImageID is the following:
/subscriptions/<subscription_id>/resourceGroups/<resource_group_id>/providers/Microsoft.Compute/galleries/<gallery_name>/images/<image_definition_name>/versions/<version_number>
In Azure CycleCloud images are specified using the ImageID mentioned above.
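For example, in a CycleCloud cluster template the image can be referenced through the ImageName attribute of a node or node array (a minimal sketch; the node array name and the version number are illustrative):

[[nodearray execute]]
    ImageName = /subscriptions/<subscription_id>/resourceGroups/<resource_group_id>/providers/Microsoft.Compute/galleries/hpcgallery/images/ubuntu-hpc/versions/0.0.1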
The same operation can be performed in the UI from the "Advanced Settings" of the cluster.
For example, for a Slurm cluster the ImageID can be specified by checking the "Custom Image" box in the node OS specification.
Particular attention should be paid to the impact of an Image ID change in an Azure CycleCloud cluster. Changing the Image ID to a new version from the UI in an active cluster may lead to heterogeneous OS versions across the execute node array in case of autoscaling.
If it is critical for a workload to run on a homogeneous OS, the Image Version should be changed only by adopting proper strategies to minimize the impact.
Several options can be leveraged to manage an image version change in this case:
Automation with GitHub Actions
The Bicep template can be used to create a Continuous Delivery pipeline in GitHub using GitHub Workflows and GitHub Actions.
In this way, every change to the image definition pushed to the main branch of the GitHub repository will trigger the deployment of the updated image template file inside a target Resource Group and the execution of the image build process.
Application registration and permissions setup
The configuration of a Bicep deployment through GitHub Actions involves as a first step the creation of an Application registration in Azure AD.
The following command should be executed in the Azure CLI, with login performed for a user in the target tenant (az login) and with the correct subscription set as active (az account set). An easy solution is again to use Azure Cloud Shell.
az ad sp create-for-rbac --name <APPLICATION_NAME_OF_CHOICE> --sdk-auth
This command will output a JSON object that should be copied and used in the subsequent steps.
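The output has the following shape (values redacted and some endpoint fields omitted; this is the --sdk-auth format expected by the azure/login action):

{
  "clientId": "<GUID>",
  "clientSecret": "<SECRET>",
  "subscriptionId": "<GUID>",
  "tenantId": "<GUID>",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com/"
}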
The deployment agent will need to authenticate to Azure through the Service Principal created above.
GitHub Encrypted Secrets need to be defined in GitHub to allow the deployment agent to authenticate against Azure. Following the guide for encrypted secrets creation in a repository, three secrets should be defined, matching the names referenced in the workflow below: AZURE_CREDENTIALS (the full JSON output of the command above), AZURE_SUBSCRIPTION (the target subscription ID) and AZURE_RG (the name of the target Resource Group).
The Application registration in AD should then be granted permissions to perform the required deployment operations.
For the purpose of Image Builder deployment, the following Custom Role can be created and assigned to the Application Service Principal scoped to the target Resource Group for the deployment:
{
  "properties": {
    "roleName": "Azure Image Template Contributor",
    "description": "Allows to contribute to Azure Image Builder resources",
    "assignableScopes": [
      "/subscriptions/4b026ed5-a12a-4349-b2d1-870c7144e09d/resourceGroups/hpc-azure-image-builder"
    ],
    "permissions": [
      {
        "actions": [
          "Microsoft.VirtualMachineImages/imageTemplates/*",
          "Microsoft.Resources/deployments/*",
          "Microsoft.Compute/galleries/read",
          "Microsoft.Compute/galleries/images/read",
          "Microsoft.Compute/galleries/images/versions/read",
          "Microsoft.Network/virtualNetworks/read",
          "Microsoft.Compute/images/read",
          "Microsoft.ManagedIdentity/userAssignedIdentities/assign/action",
          "Microsoft.ManagedIdentity/userAssignedIdentities/read",
          "Microsoft.Resources/subscriptions/resourceGroups/read"
        ],
        "notActions": [],
        "dataActions": [],
        "notDataActions": []
      }
    ]
  }
}
This role grants the GitHub Application Service Principal only the minimal permissions (least privilege) required for Image Template deployment through Bicep.
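Assuming the JSON definition above is saved to a local file (here a hypothetical image-template-contributor.json), the role can be created and assigned to the Service Principal with the Azure CLI:

az role definition create --role-definition @image-template-contributor.json

az role assignment create \
  --assignee <APPLICATION_CLIENT_ID> \
  --role "Azure Image Template Contributor" \
  --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP_NAME>"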
GitHub workflow for deployment
Inside the repository there is the definition of a GitHub workflow that performs the deployment of image-builder.bicep at every push on the main branch. In GitHub, workflows are defined as YAML files. Every time the version control system receives a new commit, the workflow is automatically executed.
This workflow is triggered at every change on the main branch:
on:
  push:
    branches:
      - main
And it performs three jobs in sequence:
delete-image-builder:
  uses: ./.github/workflows/delete-image-builder.yml
  with:
    image-builder-name: 'imageBuilder'
  secrets: inherit

deploy-bicep:
  needs: delete-image-builder
  runs-on: ubuntu-latest
  steps:
    # Checkout code
    - uses: actions/checkout@main
    # Log into Azure
    - uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    # Deploy Bicep file
    - name: Deploy Bicep resources
      uses: azure/arm-deploy@v1
      with:
        subscriptionId: ${{ secrets.AZURE_SUBSCRIPTION }}
        resourceGroupName: ${{ secrets.AZURE_RG }}
        template: ./image-builder.bicep
        parameters: 'imageBuilderName=imageBuilder destinationGalleryName=hpcgallery destinationImageName=ubuntuhpc'
        failOnStdErr: false

run-builder:
  needs: [deploy-bicep, delete-image-builder]
  uses: ./.github/workflows/run-image-builder.yml
  with:
    image-builder-name: 'imageBuilder'
  secrets: inherit
The delete-image-builder and run-builder jobs are defined by calling two reusable workflows present in the repository.
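As a reference, the run-image-builder.yml reusable workflow could be structured along these lines (a sketch based on the build command shown earlier; the actual file in the repository may differ):

# Hypothetical sketch of a reusable workflow triggering the image build
on:
  workflow_call:
    inputs:
      image-builder-name:
        required: true
        type: string

jobs:
  run-image-builder:
    runs-on: ubuntu-latest
    steps:
      # Log into Azure with the Service Principal credentials
      - uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      # Start the image build on the deployed image template
      - name: Run image builder
        run: |
          az resource invoke-action \
            --resource-group ${{ secrets.AZURE_RG }} \
            --resource-type Microsoft.VirtualMachineImages/imageTemplates \
            -n ${{ inputs.image-builder-name }} \
            --action Run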
Thanks to this framework, every time a change is made to the code base, the image template is recreated and a new version of the image is built and saved in the Azure Compute Gallery.
Technically, a user can perform changes and updates to their HPC images directly from an IDE by pushing to the GitHub repository, while GitHub Actions take care of delivering the images to the Azure Compute Gallery.
This approach can be further extended to support multiple image templates/environments:
That’s all folks
If you've made it this far, congratulations: you already know the basics of automating your Azure HPC image customization using Azure Image Builder and Bicep!
#AzureHPCAI