Science and Nature recently published the newest AI-based protein folding algorithms, RoseTTAFold and AlphaFold2, on the same day, a revolutionary breakthrough for human protein structure prediction. The corresponding code repositories were also released on GitHub.
How can you adopt this new protein folding technology and accelerate your research with the massive power of an HPC cluster? It is far more convenient to use Azure HPC solutions, which can be ready in a few hours, than to procure on-premises servers and build a static cluster over several months.
Azure offers several options at the HPC platform layer for quickly building purpose-built environments. Azure also provides a rich set of VM types suited to HPC scenarios, including the latest Milan-CPU-based HBv3 series, the HC series with high-bandwidth InfiniBand interconnect, and the NC series accelerated by NVIDIA A100/V100/T4 GPUs.
Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.
The first step is to deploy the CycleCloud server through an ARM template. Open Cloud Shell in the Azure portal and run the command below to create a service principal. Save the returned "appId", "password", and "tenant" values in your notepad.
az ad sp create-for-rbac --name RoseTTAFoldOnAzure --years 1
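The command returns a JSON block roughly like the following (the values here are placeholders):

{
  "appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "displayName": "RoseTTAFoldOnAzure",
  "password": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}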
Click the CycleCloud Server template link to jump to the custom deployment page in the Azure portal. Set the region to Southeast Asia and the resource group to rgCycleCloud. Provide the service principal info just created, and set a CycleCloud admin username and password for later login. Set the storage account name to sacyclecloud and leave the other parameters as-is. Click "Review + create" and then "Create".
When the resources are ready, go to the "cyclecloud" VM overview page to find its DNS name. Open it in another browser tab, then log in with the admin username and password set previously.
In the upper-right "cycleadmin" drop-down menu, click "My profile -> Edit profile" and save your SSH public key string. This step is mandatory because the public key is used when scaling VMs. Then log in to the CycleCloud server via SSH, run the initialize command, and press 'Enter' at each prompt. Then create an id_rsa file containing your SSH private key string. Keep this SSH window open.
cyclecloud initialize
vi ~/.ssh/id_rsa    # provide private key string
chmod 400 ~/.ssh/id_rsa
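As an optional sanity check (not a required step), you can confirm the CLI is talking to the server; if initialization succeeded, the locker backed by the sacyclecloud storage account should be listed:

cyclecloud locker list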
Prepare RoseTTAFold VM Image
In the Azure portal, open the VM creation page via Home -> Create a resource -> Virtual machine. Set the basic configuration as:
Click "Review + create" to validate the settings, then create the VM.
After the VM boots into Running status, one more step is needed to enlarge the OS disk. Stop the VM first, choosing the option to reserve the VM's public IP address. Once its status is Stopped, open the VM's Disks menu, click the OS disk link, then 'Size + performance', and set the disk size to 64 GB with performance tier P6 or higher. Wait until the pop-up in the upper right confirms the update completed, then go back and start the VM. The status will return to Running a few minutes later.
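If you prefer scripting the resize, a rough CLI equivalent is sketched below, assuming the image VM is named vmRoseTTAFoldHPCImg (as used in the capture step later) and <os-disk-name> is the OS disk name shown on the VM's Disks page; the performance tier itself is still easiest to adjust in the portal as described above.

az vm deallocate -n vmRoseTTAFoldHPCImg -g rgCycleCloud
az disk update -n <os-disk-name> -g rgCycleCloud --size-gb 64
az vm start -n vmRoseTTAFoldHPCImg -g rgCycleCloud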
Use your SSH terminal to log in to this VM and run the following commands to install the RoseTTAFold application, which covers these steps:
## Install anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
chmod +x Anaconda3-2021.05-Linux-x86_64.sh
sudo bash ./Anaconda3-2021.05-Linux-x86_64.sh
# read the license, press space to page through
# set the destination dir as /opt/anaconda3
# select 'yes' when asked whether to run conda init
cat <<EOF | sudo tee -a /etc/profile
export PATH=\$PATH:/opt/anaconda3/bin
EOF
source /etc/profile

## Get repo and set up conda envs
cd /opt
sudo su
conda deactivate   # back to VM shell
git clone https://github.com/Iwillsky/RoseTTAFold.git   # branch from RosettaCommons/RoseTTAFold modified for HPC env
cd RoseTTAFold
conda env create -f RoseTTAFold-linux.yml
conda env create -f folding-linux.yml
conda env list
./install_dependencies.sh

## Install pyrosetta in the folding env
conda init bash
source ~/.bashrc
conda activate folding
# original download link: https://www.pyrosetta.org/downloads (registration required)
# Below is a copy; downloading it means you accept the license requirements at https://els2.comotion.uw.edu/product/pyrosetta
wget https://asiahpcgbb.blob.core.windows.net/rosettaonazure/PyRosetta4.Release.python37.linux.release-289.tar.bz2
tar -vjxf PyRosetta4.Release.python37.linux.release-289.tar.bz2
cd PyRosetta4.Release.python37.linux.release-289/setup
python setup.py install
# [Optional] verify the pyrosetta lib: run 'python', then input two lines:
#   >>> import pyrosetta
#   >>> pyrosetta.init()
conda deactivate   # back to conda (base)
conda deactivate   # back to VM shell
We strongly suggest taking a snapshot of this VM's OS disk before continuing. Then run this preparation command in the SSH console and press 'y' to proceed.
sudo waagent -deprovision+user
When it completes, go to Cloud Shell and run these commands:
az vm deallocate -n vmRoseTTAFoldHPCImg -g rgCycleCloud
az vm generalize -n vmRoseTTAFoldHPCImg -g rgCycleCloud
az image create -n imgRoseTTAhpc --source vmRoseTTAFoldHPCImg -g rgCycleCloud
After the custom image is created, go to its page in the Azure portal via Images -> imgRoseTTAhpc -> Overview. Find the 'RESOURCE ID' in the form '/subscriptions/xxx-xx-…xxxx/resourceGroups/rgCycleCloud/providers/Microsoft.Compute/images/imgRoseTTAhpc' and save it for later use.
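You can also fetch this ID directly in Cloud Shell instead of copying it from the portal:

az image show -n imgRoseTTAhpc -g rgCycleCloud --query id -o tsv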
Create an HPC cluster in CycleCloud
In the CycleCloud UI, add a new cluster with the Slurm scheduler type selected. Give the cluster a name first, e.g. clusRosetta1. Then configure the "Required settings" page as below: choose NC16as_T4_v3 as the HPC VM type and set the quantity in the auto-scaling configuration. For the network, select the 'azurecyclecloud-compute' subnet. Click "Next".
Change the CycleCloud default NFS disk size to 5000 GB (the datasets occupy about 3 TB); the volume will be mounted at cluster startup. On the "Advanced settings" page, set the HPC OS type to "Custom image" and set the image ID to the 'RESOURCE ID' from the previous step. Leave the other options as-is and click the "Save" button at the bottom right.
Click "Start" to boot the cluster. CycleCloud will create VMs according to the configuration. After several minutes, a scheduler VM will appear as ready in the list. Click this item, then click the "Connect" button in the detail pane below to get a string like "cyclecloud connect scheduler -c clusRosetta1". Run this command in the CycleCloud server's SSH console to log in to the scheduler VM.
RoseTTAFold Dataset preparation
Next, prepare the datasets, including the network weights and the reference protein pdb databases. In the scheduler VM's SSH console, run the commands below to load the datasets onto the NFS volume mounted in the cluster. We provide copies of these datasets on Azure Blob storage to speed up the download; you can also switch to the original links shown in the comments. Unpacking takes several hours; we suggest unpacking in multiple SSH windows without interruption to ensure data integrity, and checking the resulting sizes with 'du -sh <directory_name>' after each unpack.
cd /shared/home/cycleadmin
git clone https://github.com/Iwillsky/RoseTTAFold.git   # branch from RosettaCommons/RoseTTAFold modified for HPC env
cd RoseTTAFold

## network weights
## wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
wget https://asiahpcgbb.blob.core.windows.net/rosettaonazure/weights.tar.gz
tar -zxvf weights.tar.gz
./install_dependencies.sh

## uniref30 [46G, unzip: 181G]
## wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
wget https://asiahpcgbb.blob.core.windows.net/rosettaonazure/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar -zxvf UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

## BFD [272G, unzip: 1.8T]
## wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
wget https://asiahpcgbb.blob.core.windows.net/rosettaonazure/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar -zxvf bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

## structure templates (including *_a3m.ffdata, *_a3m.ffindex) [115G, unzip: 667G]
## wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
wget https://asiahpcgbb.blob.core.windows.net/rosettaonazure/pdb100_2021Mar03.tar.gz
tar -zxvf pdb100_2021Mar03.tar.gz
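As suggested above, verify the unpacked sizes before submitting any job. The expected values come from the comments in the script (assuming pdb100_2021Mar03.tar.gz unpacks into pdb100_2021Mar03/):

du -sh UniRef30_2020_06   # expect ~181G
du -sh bfd                # expect ~1.8T
du -sh pdb100_2021Mar03   # expect ~667G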
Run a RoseTTAFold sample
The git repo contains a job submission script named runjob.sh, so we can submit a RoseTTAFold analysis job from the scheduler's SSH console with the Slurm sbatch command, as shown below.
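A minimal sketch of a submission, assuming runjob.sh takes the input fasta path as its first argument (check the script in the repo for its exact interface); example/input.fa is the sample input shipped with the RoseTTAFold repo:

cd ~/RoseTTAFold
sbatch runjob.sh example/input.fa   # prints 'Submitted batch job <id>'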
This sample job takes an estimated 30+ minutes, covering MSA generation, HHsearch, prediction, and modeling steps. The job's output can be checked in job<id>.out, and log files are under ~/log_<id>/, where you can find more progress info. The model inference log can be found at ./log_<id>/folding.stdout.
Since this is an HPC cluster, you can submit multiple jobs, and the Slurm scheduler will allocate them to compute nodes in the cluster. Job allocation and status can be listed with the 'squeue' command, as below.
[cycleadmin@ip-0A00041F ~]$ squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
    7       hpc RosettaO cycleadm  R  1:13      1 hpc-pg0-1
    8       hpc RosettaO cycleadm  R  0:08      1 hpc-pg0-2
If the nodes are insufficient, CycleCloud will boot new nodes to accommodate the queue. Meanwhile, to save cost, CycleCloud terminates nodes that have had no job running on them for a time window. The CycleCloud UI provides more detailed status info for the cluster and nodes. GPU utilization reaches nearly 100% during the prediction steps and has idle periods in other stages of the run.
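Autoscaling can also be observed from the scheduler itself with standard Slurm commands; with Slurm's cloud-node power saving, nodes that CycleCloud has powered down carry a '~' suffix on their state:

sinfo   # powered-down cloud nodes show as 'idle~', busy ones as 'alloc'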
A successful run prints output like the following. It produces the 5 best-scoring protein pdb results under ~/model_<id>/, named model_x.pdb.
[cycleadmin@ip-0A00041F ~]$ cat job9.out
Running HHblits of JobId rjob204
Running PSIPRED of JobId rjob204
Running hhsearch of JobId rjob204
Predicting distance and orientations of JobId rjob204
Running parallel RosettaTR.py
Running DeepAccNet-msa of JobId rjob204
Picking final models of JobId rjob204
Final models saved in: /shared/home/cycleadmin/model_204
Below is an image of the two pdb protein structures from the pyrosetta and end2end results, viewed in the PyMOL UI.
You can change parameters in the submission script to fully utilize CPU and memory according to your cluster's VM type configuration. The next step is to upload your own fasta input files to the NFS volume and submit your RoseTTAFold jobs.
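As an illustrative sketch (example values, not the script's actual defaults): with the NC16as_T4_v3 nodes chosen earlier (16 vCPUs, 110 GB RAM), directives like the following at the top of runjob.sh would let a job use most of a node:

#SBATCH --cpus-per-task=16   # match the vCPU count of the VM size
#SBATCH --mem=100G           # leave headroom below the node's 110 GB RAM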
If you will not keep this environment, delete the rgCycleCloud resource group to tear down all the related resources at once.
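In Cloud Shell:

az group delete -n rgCycleCloud --yes --no-wait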
RoseTTAFold branch repo for HPC: https://github.com/Iwillsky/RoseTTAFold