Accelerate ANSYS Mechanical on Azure HPC with Remote Solve Manager and Azure CycleCloud
Published Feb 28 2022

Introduction

Ansys Mechanical enables engineers to solve complex structural engineering problems and make better, faster design decisions. As model complexity increases, a user can choose to leverage state-of-the-art High Performance Computing (HPC) infrastructure on Azure in terms of compute, networking, storage, etc., rather than running on a single machine.

 

This blog explains the steps to deploy an HPC environment on Azure for ANSYS Mechanical in the IaaS (Infrastructure as a Service) model. Benefits of this approach include:

(a) Ability to choose the latest infrastructure for HPC (Virtual Machines, Storage) as it becomes available on Azure.

(b) Convenience to deploy the environment in your subscription and share the cluster with other team members. 

(c) Convenience to use your existing license. 

(d) End to end deployment: do all the pre-processing, processing and post-processing tasks in the same environment without having to move/duplicate the data around. 

(e) Control over costs: Automatic scaling of the compute nodes, as they shut down once the ANSYS job is finished processing. 

 

Section I: ANSYS Solutions for HPC to use in IaaS mode

Ansys offers a few options for running in HPC mode: first, the ARC configuration, which includes its own job scheduler, and second, the Remote Solve Manager (RSM), which lets you bring your own scheduler. Azure has an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments called Azure CycleCloud. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale. This blog explains how to set up an HPC environment for ANSYS using Azure CycleCloud and the RSM method.

Mandar_AU_0-1645507212614.png

 

 

Section II: ANSYS Architecture on Azure with CycleCloud and RSM

Many users are comfortable with the Windows environment and prefer to interact with the ANSYS Mechanical GUI for pre- and post-processing tasks. Once the pre-processing tasks (e.g., meshing) are complete, large models can take days to run on a single machine, and these become perfect candidates to leverage state-of-the-art HPC infrastructure on Azure. With an aim to deliver high performance with cost optimisation, this architecture offers a Windows front-end node with a GPU for those pre/post-processing tasks and sends ANSYS computations to a Linux cluster of HPC compute nodes, all sharing the same file share. The architecture is shown below.

 

Mandar_AU_2-1645507832644.png

 

Section III: Some Recommendations

Scheduler:

  • Given that jobs are submitted to Azure through the ANSYS GUI, it is recommended to use a scheduler other than PBS. The ANSYS GUI does not take all the autoscale settings into consideration when using the PBS scheduler, and the back-end scripts for PBS may need to be changed to achieve optimal utilisation of the compute infrastructure on Azure.
  • This solution was deployed with the SLURM scheduler and has been briefly tested on Univa Grid Engine. Templates for both are available in Azure CycleCloud environments, and the following link is a good starting point should you wish to use those scheduler templates or customise a template for your environment (a brief CLI sketch appears after this list).

https://docs.microsoft.com/en-us/azure/cyclecloud/download-cluster-templates?view=cyclecloud-8

 

  • For SLURM, one might run into issues due to the temporary memory reduction on HBv2 VMs: a job that requests the full memory of the node may sit in the queued state indefinitely until noticed. To avoid this, follow Section IV - SLURM Cluster with CycleCloud below.
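As an illustration, the scheduler templates referenced above can be downloaded (for example from the Azure/cyclecloud-slurm GitHub project), customised and imported with the CycleCloud CLI. This is a minimal sketch: the repository layout, file names and the cluster name ansys-slurm are assumptions for illustration; adjust them for your environment.

    # Sketch only: fetch the public CycleCloud Slurm template, customise it, and import it.
    git clone https://github.com/Azure/cyclecloud-slurm.git
    cp cyclecloud-slurm/templates/slurm.txt ansys-slurm.txt
    # ... edit ansys-slurm.txt (VM sizes, image, autoscale limits, dampen_memory) ...
    cyclecloud import_cluster ansys-slurm -c Slurm -f ansys-slurm.txt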

Storage:

This project was set up using Azure NetApp Files (ANF) storage with NFSv3. Azure CycleCloud also contains templates to provision a parallel file system such as BeeGFS, should one want to use that for a large deployment.

  • Before mapping the NFS share to a drive on Windows, it must first be mounted on a Linux machine (the head node of the cluster), and the permissions of the share must be changed so that it is accessible from Windows after the mount (see the sketch after this list).
  • When mounting NFS on Windows, note that the Windows Client for NFS does not support NFSv4. You may want to use a third-party client should you go with ANF NFSv4 / Azure File Share Premium NFSv4, or simply use NFSv3 and enable the Client for NFS feature on Windows.
  • The Windows Client for NFS has default permissions which make the share non-writable, meaning Windows would have issues creating files on the share. If you are mounting NFSv3 on Windows, you will need to change the NFS client permissions: ANSYS creates a number of files/folders from Windows which must also be accessible to the machines in the Linux cluster, and this requires write access to the NFS share. To change the NFS client permissions, open PowerShell on the Windows node and run something similar to the following

Set-NfsClientConfiguration -DefaultAccessMode 777

 

  • ANSYS on Windows is known to communicate with the NFS share using the UNC path. This does not work when the ANF NFSv3/v4 or Azure File Share Premium NFSv4 share is mapped to a drive on Windows. If using ANF, one may want to enable dual protocol (SMB and NFS) to get this working; however, enabling SMB on ANF requires an Active Directory connection. If you do not wish to set up AD, you can use NFS on both Linux and Windows. To address this problem, a few changes are needed to the DLLs which come with the ANSYS installation, and those are available from the author.
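As mentioned in the first bullet above, here is a minimal sketch of mounting the ANF NFSv3 export on the Linux head node and opening its permissions so the Windows node can write to it. The volume IP address and export path are placeholders for your environment.

    # Run on the Linux head node; 10.0.0.4:/ansysvol is a placeholder for your ANF export.
    sudo mkdir -p /ansys_mech
    sudo mount -t nfs -o rw,hard,vers=3,tcp 10.0.0.4:/ansysvol /ansys_mech
    # Open up permissions so the Windows NFS client can create files on the share.
    sudo chmod -R 777 /ansys_mech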

 

Section IV: Implementing the architecture on Azure

This section assumes CycleCloud is already installed in your subscription. If it is not, follow the link below to deploy CycleCloud.

 

https://docs.microsoft.com/en-us/azure/cyclecloud/qs-install-marketplace?view=cyclecloud-8

 

 

SLURM Cluster with CycleCloud

  • Start with the existing SLURM template within CycleCloud and customise it to configure a cluster of HBv2 or HBv3 compute nodes with the CentOS 7.9 HPC image.

Mandar_AU_1-1645771745821.png

 

Mandar_AU_2-1645771801065.png

 

 

  • To avoid the temporary memory reduction issue mentioned under Scheduler in Section III above, include

         slurm.dampen_memory=10

          in the [configuration] part of the node array in the SLURM template, so that it looks like the screenshot and excerpt below.

    

Mandar_AU_0-1645771626327.png
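As a reference for the screenshot above, the relevant part of the node array in the template might look like the excerpt below. The machine type, image name and other values are illustrative; slurm.dampen_memory = 10 is the setting being added.

    [[nodearray hpc]]
    MachineType = Standard_HB120rs_v2
    ImageName = OpenLogic:CentOS-HPC:7_9-gen2:latest

        [[[configuration]]]
        slurm.autoscale = true
        slurm.hpc = true
        # reduce the memory reported to Slurm by 10% to absorb the temporary reduction
        slurm.dampen_memory = 10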

 

 

  • Include the Linux installation of ANSYS Structures in the blob folder of the CycleCloud project. The installer is downloadable from support.ansys.com.
  • Automate the following step via a back-end script that runs on all nodes at boot-up:
  1. Mount the ANF NFS share on all nodes of the cluster as, say, /ansys_mech, and change the share permissions so that it is accessible from the Windows node. This back-end script needs to run on all Linux nodes.
  • Automate the following steps via a back-end script that runs on the master node at boot-up (a sketch of this script follows the commands below):
  1. The back-end script on the master node should download the ANSYS installer from the blob, unpack it and install it on the share with the command below. You will need to configure the license server first so that its IP address can be included in the installation command.

./INSTALL -silent -licserverinfo 2325:1055:<ipaddress_of_licenseserver> -install_dir Linux_install_dir

     

       2. In the same script, include the command to start the RSM launcher service on the cluster:

Linux_install_dir/v212/RSM/Config/tools/linux/rsmlauncher start
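A minimal sketch of the master-node back-end script is shown below. It assumes the share is already mounted at /ansys_mech (as in the all-nodes script described above), that the installer archive was added as a blob of the CycleCloud project, and that jetpack download can fetch it to the node; the archive name and install directory are placeholders.

    #!/bin/bash
    # Sketch only: runs once on the master node at boot.
    # jetpack download fetches a blob from the CycleCloud project locker (file name is a placeholder).
    jetpack download ansys_structures_2021R2_linux.tgz /tmp/
    mkdir -p /tmp/ansys_installer
    tar -xzf /tmp/ansys_structures_2021R2_linux.tgz -C /tmp/ansys_installer

    # Silent install onto the shared volume; replace the license server address with your own.
    /tmp/ansys_installer/INSTALL -silent \
        -licserverinfo 2325:1055:<ipaddress_of_licenseserver> \
        -install_dir /ansys_mech/ansys_inc

    # Start the RSM launcher service from the shared installation.
    /ansys_mech/ansys_inc/v212/RSM/Config/tools/linux/rsmlauncher start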

 

Section V: Preparing the Windows Node (NVv4)

  • Create a VM with Windows Server 2019 Datacenter Gen2. This VM MUST be in the same virtual network as the CycleCloud cluster.
  • From a command prompt on the Windows node, create an SSH key using

          ssh-keygen -t rsa

 

           Add this public key to the list of CycleCloud users and make sure you can SSH into the Linux master node from the Windows machine.

 

  • Enable the Client for NFS feature on the Windows node.
  • Set the right permissions for the NFS client, i.e., from PowerShell run the following

         Set-NfsClientConfiguration -DefaultAccessMode 777

 

  • Map the NFS share to a Windows drive.
  • Download the Windows version of ANSYS Structures 2021 R2 from support.ansys.com and install it on the Windows node. The Windows version of ANSYS MUST be installed on the node itself, not on the share.

 

  • Due to the UNC path issue, ANSYS has recommended the following changes:
    1. Update the Ans.Rsm.JobManagement DLL located in the bin directory of the RSM folder within the ANSYS installation on the Windows node ONLY. The updated DLL is available from the author of this document.

Mandar_AU_3-1645509643358.png

 

 

            2. Update the Ans.Rsm.AppSettings.config file within the RSM/Config folder of the ANSYS installation.

 

 

Mandar_AU_4-1645509693449.png

 

  • Configure RSM on the Windows Node
    1. On the Windows machine, open RSM Configuration 2021 R2
      Mandar_AU_6-1645509931305.png

       

    2. On the RSM Configuration window, click + to add a new HPC configuration. Give it a name, e.g., Azure HPC
    3. Set the HPC type to SLURM
    4. Set the submit host to <username>@<hostname>
    5. Make sure "Use SSH protocol" is unchecked
    6. Hit Apply

Mandar_AU_0-1646110179348.png

 

 

      7. On the File Management tab, select "No file transfer is needed". Enter the name of the network share as /ansys_mech and hit Apply.

 

 

Mandar_AU_1-1646110232337.png

 

      8. On the Queues tab, hit the refresh button.

    A window will pop up asking for a password. You will need to create a user and set its password on the Linux master node. SSH into the head node and set the password using:

    sudo passwd <username>

The same password needs to be entered here.
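If the account does not already exist on the head node, create it before setting the password. A minimal sketch (the username is a placeholder):

    # On the Linux head node; "ansysuser" is a placeholder username.
    sudo useradd -m ansysuser
    sudo passwd ansysuser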

 

Mandar_AU_3-1646110502867.png

 

 

        9. In the Queue list, tick the queue which has HPC in its name and hit Apply

 

Mandar_AU_4-1646110573680.png

 

 

 

ANSYS is set to leverage Azure HPC for job processing.

 

  • Submit a test job
    1. In the RSM Configuration window, in the left margin, right-click on the cluster you have just created and select Advanced Test.
    2. Enter the Client directory, i.e., the share which you have just mapped on Windows, and hit Submit.
    3. CycleCloud should spin up an execute node and start the test job.

Mandar_AU_11-1645510030439.png

 

 

 

Mandar_AU_12-1645510030445.png

 

 

Running ANSYS jobs

  • Log on to the Windows node, start ANSYS Workbench and load the model.
  • Open the model in ANSYS Mechanical and follow the settings below to send the simulation to Azure: under File, click on Solve Process Settings.

Mandar_AU_13-1645510030482.png

 

  • Click Add Queue, enter the name of the HPC environment and hit OK, then OK again.

Mandar_AU_14-1645510030491.png

 

 

Mandar_AU_15-1645510030503.png

 

  • Select the right license and hit OK.

Mandar_AU_16-1645510030529.png

 

  • Go back to Mechanical and select the right environment to send the jobs to Azure for compute.

Mandar_AU_17-1645510030560.png

 

  • On the CycleCloud portal, you will see new nodes being created. Give it about 10 minutes; once you see a green bar, ANSYS has started using Azure compute for processing. Confirm that the ANSYS Workbench job monitor lists the jobs as Running and not Queued. This model has 3 jobs to run in parallel and each job utilises 120 cores, so although the setting in the ANSYS GUI is configured for 120 cores, CycleCloud has spun up 3 nodes, one per job.

Mandar_AU_18-1645510457484.png

 

  • CycleCloud has an idle timeout which terminates the nodes once job processing is complete. One can also place a limit on the number of nodes to spin up; in this case the limit is 6. Should there be more jobs to process than the number of nodes allowed, the scheduler will queue the jobs, or one can raise the limit on the number of nodes.

 

Conclusion

Leveraging Azure HPC infrastructure to run engineering simulation workloads such as ANSYS Mechanical, STAR-CCM+ or others can provide multiple benefits:
a. Accelerated processing using the 200 Gbps InfiniBand available on HBv2 and HBv3 nodes powered by the latest AMD processors.

b. Cluster orchestration and monitoring powered by Azure CycleCloud, which allows easy deployment of an HPC environment, provisioning auto-scalable compute and HPC storage (including parallel file systems) with a click of a button. Importantly, Azure CycleCloud can be customised for your application and requirements.

c. Ready-to-use OS images built specifically for HPC workloads, available in the Azure Marketplace, which come packed with the latest MPI libraries and compilers needed for these deployments.

 
