Build your RoseTTAFold protein AI prediction cluster on Azure CycleCloud
Published Aug 12 2021 06:00 PM

Recently, Science and Nature published the new AI-based protein-folding algorithms RoseTTAFold and AlphaFold2 on the same day, a revolutionary breakthrough for human protein structure prediction. The corresponding code repositories were also released on GitHub.

How can you adopt this new protein-folding technology to accelerate your research with the power of an HPC cluster? Azure HPC solutions can be ready in a few hours, which is far more convenient than spending months procuring on-premises servers and building a static cluster.

Azure offers several HPC platform options for quickly building purpose-built environments, plus a rich set of VM types suited to HPC workloads, including the latest AMD Milan-based HBv3 series, the HC series with high-speed InfiniBand interconnects, and the NVIDIA A100/V100/T4 GPU-accelerated NC series.

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.

CycleCloud installation

The first step is to deploy the CycleCloud server through an ARM template. Open Cloud Shell in the Azure portal and run the command below to create a service principal. Save the returned "appId", "password", and "tenant" values for later use.

az ad sp create-for-rbac --name RoseTTAFoldOnAzure --years 1
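If you prefer not to copy the values by hand, the Azure CLI's standard `--query`/`-o tsv` output options can extract them directly. The sketch below is an illustrative addition, not part of the original post: by default the command is only echoed as a dry run; export AZ_CMD='' on a machine with a logged-in Azure CLI to execute it for real.

```shell
# Dry-run sketch: AZ_CMD defaults to 'echo' so the command is printed,
# not executed; export AZ_CMD='' to run it with a logged-in Azure CLI.
AZ_CMD=${AZ_CMD-echo}

create_sp() {
  # --query (JMESPath) and -o tsv are standard Azure CLI output options
  $AZ_CMD az ad sp create-for-rbac --name RoseTTAFoldOnAzure --years 1 \
    --query "[appId,password,tenant]" -o tsv
}

create_sp
```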

Click the CycleCloud Server template link to jump to the custom deployment page in the Azure portal. Set the region to Southeast Asia and the resource group to rgCycleCloud. Provide the service principal info just created, and set a CycleCloud admin username and password for later login. Set the storage account name to sacyclecloud and leave the other parameters as-is. Click "Review + create", then "Create".


When the resources are ready, go to the "cyclecloud" VM overview page to find its DNS name. Open it in another browser tab, then log in with the admin username and password set previously.

In the upper-right "cycleadmin" drop-down menu, click "My profile -> Edit profile" and save your SSH public key string. This step is mandatory because this public key is used when scaling out VMs. Then log in to the CycleCloud server over SSH, run the initialize command, and press 'Enter' at each prompt. Finally, create an id_rsa file containing your SSH private key string. Keep this SSH window open.

cyclecloud initialize
vi ~/.ssh/id_rsa   #provide private key string
chmod 400 ~/.ssh/id_rsa

Prepare RoseTTAFold VM Image

In the Azure portal, open the VM creation page via Home -> Create a resource -> Virtual Machine. Set the basic configuration as:

  • Resource Group: rgCycleCloud
  • Virtual Machine name: vmRoseTTAFoldHPCImg
  • Region: Southeast Asia
  • Availability options: No infrastructure redundancy required
  • Image: CentOS-based 7.9 HPC Gen1, with GPU driver, CUDA, and HPC tools pre-installed (click "See all images" and search for this image in the Marketplace)
  • Size: Standard NC16as_T4_v3
  • Username: cycleadmin
  • SSH public key source: Use existing public key (if using SSH keys stored in Azure)
  • SSH public key: <your SSH public key>
  • Virtual network: azurecyclecloud (or another existing VNet)

Click 'Review + create' to validate, then create the VM.

After the VM boots into the Running state, one more step is needed to enlarge the OS disk. First stop the VM, choosing the option to reserve the VM's public IP address. Once the status shows Stopped, click the VM's Disks menu -> the OS disk link -> 'Size + performance', and set the disk size to 64 GB and the performance tier to P6 or higher. Wait until the upper-right notification shows the update completed, then go back and start the VM. The VM status will return to Running after a few minutes.
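The stop/resize/start sequence can also be scripted with the Azure CLI. Below is a dry-run sketch added for illustration (not from the original post): commands are echoed rather than executed unless AZ_CMD is exported empty, and "OS-DISK-NAME" is a placeholder to be looked up as shown in the comment.

```shell
# Dry-run sketch of the disk-resize steps; export AZ_CMD='' to execute for real.
AZ_CMD=${AZ_CMD-echo}
RG=rgCycleCloud
VM=vmRoseTTAFoldHPCImg

resize_os_disk() {
  $AZ_CMD az vm deallocate -g "$RG" -n "$VM"
  # The real OS disk name can be read with:
  #   az vm show -g $RG -n $VM --query storageProfile.osDisk.name -o tsv
  $AZ_CMD az disk update -g "$RG" -n "OS-DISK-NAME" --size-gb 64
  $AZ_CMD az vm start -g "$RG" -n "$VM"
}

resize_os_disk
```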

Use your SSH terminal to log in to this VM and run the commands below to install the RoseTTAFold application. They cover these steps:

  • Install Anaconda3. During the process, set the destination directory to /opt/anaconda3 and select 'yes' when asked whether to initialize conda.
  • Download the RoseTTAFold GitHub repo. It refers to a branch of the RoseTTAFold repo modified to suit this HPC build.
  • Configure the two conda environments.
  • Install the PyRosetta4 component in the folding conda environment. As an optional status check of PyRosetta4, start Python in the folding env and run "import pyrosetta" and "pyrosetta.init()", expecting no errors in the output.

## Install Anaconda3
chmod +x <Anaconda3-installer>.sh    # installer filename/URL was lost from the original post
sudo bash ./<Anaconda3-installer>.sh
# page through the license with the space bar
# set the destination dir as /opt/anaconda3
# select 'yes' when asked whether to run conda init
cat <<EOF | sudo tee -a /etc/profile
export PATH=\$PATH:/opt/anaconda3/bin
EOF
source /etc/profile
## Get the repo and set up the conda environments
cd /opt
sudo su
conda deactivate    # back to the VM shell
git clone <repo URL>    # branch of RosettaCommons/RoseTTAFold modified for the HPC env (URL lost from the original post)
cd RoseTTAFold
conda env create -f RoseTTAFold-linux.yml
conda env create -f folding-linux.yml
conda env list
## Install PyRosetta in the folding env
conda init bash
source ~/.bashrc
conda activate folding
# The original download link requires registration on the PyRosetta site;
# downloading implies accepting its license (links lost from the original post)
tar -vjxf PyRosetta4.Release.python37.linux.release-289.tar.bz2
cd PyRosetta4.Release.python37.linux.release-289/setup
python setup.py install
# [Optional] verify the pyrosetta lib:
# python    # then run: import pyrosetta; pyrosetta.init()
conda deactivate    # back to conda (base)
conda deactivate    # back to the VM shell

It is strongly suggested to take a snapshot of this VM's OS disk before going on. Then run this deprovision command in the SSH console and press 'y' to proceed.

sudo waagent -deprovision+user

When it completes, go back to Cloud Shell and run these commands:

az vm deallocate -n vmRoseTTAFoldHPCImg -g rgCycleCloud
az vm generalize -n vmRoseTTAFoldHPCImg -g rgCycleCloud
az image create -n imgRoseTTAhpc --source vmRoseTTAFoldHPCImg -g rgCycleCloud

After the custom image is created, open its page in the Azure portal via Images -> imgRoseTTAhpc -> Overview. Find the 'Resource ID', in the form '/subscriptions/xxx-xx-…xxxx/resourceGroups/rgCycleCloud/providers/Microsoft.Compute/images/imgRoseTTAhpc', and save it for later use.
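As an alternative to copying the Resource ID from the portal, it can be fetched with the CLI. This dry-run sketch is an illustrative addition (echoed by default; export AZ_CMD='' to run for real):

```shell
# Dry-run sketch: print (or, with AZ_CMD='', actually run) the ID lookup.
AZ_CMD=${AZ_CMD-echo}

image_id() {
  $AZ_CMD az image show -g rgCycleCloud -n imgRoseTTAhpc --query id -o tsv
}

image_id
```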


Create an HPC cluster in CycleCloud

In the CycleCloud UI, click to add a new cluster and select the Slurm scheduler type. Give the cluster a name first, e.g. clusRosetta1. Then configure the "Required settings" page as follows: choose NC16as_T4_v3 as the HPC VM type and set the auto-scaling quantity. For the network, select the 'azurecyclecloud-compute' subnet. Click "Next".


Change the CycleCloud default NFS disk size to 5000 GB (the datasets will occupy about 3 TB); it will be mounted at cluster startup. On the "Advanced settings" page, set the HPC OS type to "Custom image" and set the image ID to the 'Resource ID' saved in the previous step. Leave the other options as-is and click the bottom-right "Save" button.


Click "Start" to boot the cluster. CycleCloud will create VMs according to the configuration. After several minutes, a scheduler VM will appear in the node list. Click this item, then click the "Connect" button in the detail pane below to get a string like "cyclecloud connect scheduler -c clusRosetta1". Run this command in the CycleCloud server's SSH console to log in to the scheduler VM.


RoseTTAFold Dataset preparation 

Next, prepare the datasets, including the network weights and the reference protein PDB database. In the scheduler VM's SSH console, run the commands below to load the datasets into the NFS volume mounted in the cluster. We provide copies of these datasets in Azure Blob storage to speed up the download; you can also switch to the original links as commented. The unzip operations will take hours; it is suggested to run them in multiple SSH windows without interruption to ensure data integrity, and to check the data size with 'du -sh <directory_name>' after each unzip.


cd /shared/home/cycleadmin
git clone <repo URL>    # branch of RosettaCommons/RoseTTAFold modified for the HPC env (URL lost from the original post)
cd RoseTTAFold
## network weights
## wget <weights URL>    # URL lost from the original post
tar -zxvf weights.tar.gz
## uniref30 [46G, unzipped: 181G]
## wget <UniRef30 URL>    # URL lost from the original post
mkdir -p UniRef30_2020_06
tar -zxvf UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06
## BFD [272G, unzipped: 1.8T]
## wget <BFD URL>    # URL lost from the original post
mkdir -p bfd
tar -zxvf bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd
## structure templates (including *_a3m.ffdata, *_a3m.ffindex) [115G, unzipped: 667G]
## wget <pdb100 URL>    # URL lost from the original post
tar -zxvf pdb100_2021Mar03.tar.gz
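The suggested 'du -sh' check after unzipping can be wrapped in a small helper. The function and directory list below are an illustrative addition matching the datasets above (the weights directory name is an assumption):

```shell
# Report on-disk sizes of the unpacked datasets so they can be compared
# with the sizes noted in the comments above; missing dirs are flagged.
report_sizes() {
  for d in weights UniRef30_2020_06 bfd pdb100_2021Mar03; do
    if [ -d "$d" ]; then
      du -sh "$d"
    else
      echo "$d: not found"
    fi
  done
}

report_sizes
```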

Run a RoseTTAFold sample

There is a job submission script in the git repo (its filename was lost from this copy of the post). We can then submit a RoseTTAFold analysis job with the Slurm sbatch command in the scheduler SSH console.


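The original post's submission code block did not survive here. As a purely hypothetical sketch (the script name, partition, and pipeline command are assumptions, not taken from the HPC branch; run_pyrosetta_ver.sh is the wrapper from the upstream RoseTTAFold repo), a single-GPU Slurm batch script and its submission typically look like:

```shell
# Hypothetical illustration only: write a minimal Slurm batch script for a
# single-GPU RoseTTAFold run. Names below are placeholders, not from the repo.
cat > rosettafold_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=RosettaFold
#SBATCH --partition=hpc
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
# run_pyrosetta_ver.sh comes from the upstream RoseTTAFold repo; the HPC
# branch's wrapper may differ. input.fa / outdir are placeholders.
bash run_pyrosetta_ver.sh input.fa outdir
EOF

# On the scheduler VM the job would then be submitted with:
echo "sbatch rosettafold_job.sbatch"
```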
This sample job takes roughly 30+ minutes, covering the MSA generation, HHsearch, prediction, and modeling steps. The job's output can be checked in job<id>.out, and log files are under ~/log_<id>/, where you can find more progress info. The model logging info can be found in ./log_<id>/folding.stdout.

As an HPC cluster, it accepts multiple jobs; the Slurm scheduler allocates them to compute nodes in the cluster. Job allocation and status can be listed with the 'squeue' command, as below.

[cycleadmin@ip-0A00041F ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 7       hpc RosettaO cycleadm  R       1:13      1 hpc-pg0-1
                 8       hpc RosettaO cycleadm  R       0:08      1 hpc-pg0-2
If nodes are insufficient, CycleCloud boots new nodes to accommodate more jobs. Meanwhile, it terminates nodes with no running jobs after an idle time window to save cost. The CycleCloud UI provides more detailed status info on the cluster and its nodes. GPU utilization reaches near 100% during the prediction steps, with idle periods during other stages.

A successful run prints output like below. It produces the 5 top-ranked protein PDB results under ~/model_<id>/, named model_x.pdb.

[cycleadmin@ip-0A00041F ~]$ cat job9.out
Running HHblits of JobId rjob204
Running PSIPRED of JobId rjob204
Running hhsearch of JobId rjob204
Predicting distance and orientations of JobId rjob204
Running parallel
Running DeepAccNet-msa of JobId rjob204
Picking final models of JobId rjob204
Final models saved in: /shared/home/cycleadmin/model_204


Below is an image of the two protein PDB structures from the pyrosetta and end2end results, viewed in the PyMOL UI.


You can change parameters in the submission script to fully utilize CPU and memory according to the VM type configured for your cluster. The next step is to upload your own FASTA input files to the NFS volume and submit your RoseTTAFold jobs.
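Submitting a whole batch of inputs can be scripted. The sketch below is an illustrative addition: the script name and inputs/ layout are placeholders, and by default the sbatch commands are only echoed; export SBATCH_CMD=sbatch on the scheduler VM to submit for real.

```shell
# Dry-run sketch: one job per FASTA file. JOB_SCRIPT and the inputs/
# directory are placeholder names, not taken from the original post.
SBATCH_CMD=${SBATCH_CMD-"echo sbatch"}
JOB_SCRIPT=${JOB_SCRIPT-your_submit_script.sbatch}

submit_all() {
  for fa in inputs/*.fa; do
    [ -e "$fa" ] || continue   # skip when no FASTA files are present
    $SBATCH_CMD "$JOB_SCRIPT" "$fa"
  done
}

submit_all
```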


Tear down

If you will not keep this environment, delete the rgCycleCloud resource group to tear down all related resources directly.
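The teardown is a single CLI call. As a dry-run sketch (echoed by default; export AZ_CMD='' to execute for real, noting this irreversibly deletes everything in the group):

```shell
# Dry-run sketch of the teardown command; --yes skips the confirmation
# prompt and --no-wait returns without waiting for the deletion to finish.
AZ_CMD=${AZ_CMD-echo}

teardown() {
  $AZ_CMD az group delete -n rgCycleCloud --yes --no-wait
}

teardown
```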


Appendix Links:

Science Rosetta article: Accurate prediction of protein structures and interactions using a three-track neural network | Scie...

RoseTTAFold repo: RosettaCommons/RoseTTAFold: This package contains deep learning models and related scripts for RoseT...

RoseTTAFold branch repo for HPC: RoseTTAFold for HPC


Version history
Last update: Sep 16 2021 07:32 AM