“Only 10 total drugs in 46 years have been intentionally developed for childhood cancer and have reached FDA approval (reference). Cancer will affect 1,900,000 adults in North America this year, but only 16,000 children. The most common cancers in children often only account for 400-500 of those 16,000 pediatric cases, which is the reason childhood cancers do not pencil out as markets for pharmaceutical companies. The paradox is that childhood cancers are simpler, and often only have a single mutant protein (reference), and computer modeling of chemicals that bind these proteins can lead quickly to new drugs for children with cancer.”
-- Charles Keller - Scientific Director, Children's Cancer Therapy Development Institute
Charles indicates that one of the major challenges in drug design is the scarcity of computing power. Docking is a method for estimating how a molecule binds to a protein. It helps identify possible drug candidates, but it requires a lot of trial and error and substantial computational resources.
This article describes a proof of concept of docking simulation on Azure for "rhabdomyosarcoma", the most common type of soft tissue sarcoma in children. The target is the PAX3-FOXO1 fusion protein, an essential initiator of rhabdomyosarcoma. Autodock Vina is used to find compound poses that interact with the fusion protein with good binding scores.
Dataset
Tyuji Hoshino (星野 忠次)’s research team at the Laboratory of Molecular Design, Chiba University, has been working with Charles on docking screening to find candidate compounds. The dataset they are targeting, “Namiki_2019”, contains ~4.8 million compounds. Because the number of compounds is large, the researchers split them into batches for processing. The top few thousand compounds, ranked by Autodock Vina binding score, will then be re-evaluated in the next step of drug design. Usually, a docking simulation of this size would take months or even years to complete.
We chose 1,020 of these compounds (#3,988,000 ~ #3,989,019) for the proof of concept. This size is large enough to reveal the nature of the simulation, particularly its CPU, memory, and I/O utilization. It also lets us identify the most cost-effective Virtual Machine SKU and estimate the total elapsed time for simulating the whole set of compounds.
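As a quick check on the selected range and on the batching idea, here is a minimal Python sketch. The function name `split_into_batches` and the batch size of 60 are assumptions made for illustration only. The article does not state the batch size the researchers actually use.

```python
def split_into_batches(first_id: int, last_id: int, batch_size: int) -> list[tuple[int, int]]:
    """Split the inclusive ID range [first_id, last_id] into consecutive inclusive sub-ranges."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        batches.append((start, end))
        start = end + 1
    return batches


# Compound ID range used for the proof of concept (from the article).
POC_FIRST, POC_LAST = 3_988_000, 3_989_019
poc_count = POC_LAST - POC_FIRST + 1  # inclusive range

# Hypothetical batch size, chosen only for illustration.
batches = split_into_batches(POC_FIRST, POC_LAST, batch_size=60)

print(poc_count)     # number of compounds in the proof of concept
print(len(batches))  # number of batches at the assumed batch size
```

The inclusive range #3,988,000 to #3,989,019 does contain exactly 1,020 compounds.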
Architecture
The diagram in Figure 2 illustrates the high-level classic architecture of an Azure High Performance Computing (HPC) environment. The compounds to be simulated are securely stored in a 4TB volume on Azure NetApp Files (ANF) in the Standard tier. The simulation job is initiated by a PBS scheduler housed on a CycleCloud server. CycleCloud dynamically scales up and down, provisioning the number of VMs required to run the Vina simulations in parallel. These VMs are part of a VM scale set (VMSS) with accelerated networking, ensuring optimal performance. Upon completion of each simulation, output files are written back to the ANF volume for secure storage and easy access. This HPC architecture leverages the power of cloud computing to provide a scalable and efficient solution for running complex simulations like Autodock Vina.
Nature of the docking simulations
Most docking tools, including Autodock Vina, support multithreading to speed up the simulation by taking advantage of multiple CPUs. We first did some preliminary work to find the most cost-effective number of threads (CPUs) per compound. We also compared two Virtual Machine SKUs: Standard_HB120rs_v3, which features 120 AMD EPYC™ 7V73X (Milan-X) CPU cores, and Standard_D64d_v5, which features the 3rd Generation Intel® Xeon® Platinum 8370C (Ice Lake) processor. Both ran Ubuntu 20.04. We tested several rounds with different numbers of CPUs on different compounds to understand the behavior. One of the results (compound #3,985,126) is shown below.
We found that CPU=2 is the most cost-effective setting for all the tested compounds, and that the HBv3 series outperforms the Dv5 series. Autodock Vina docking simulation is very compute-intensive, with very little I/O utilization. It uses around 10GB of memory per compound, which implies that L3 cache size and memory bandwidth could be critical to overall performance. HBv3 also has a lower price per CPU than the Dv5 series. Autodock Vina docking simulation is embarrassingly parallel and can be divided into completely independent Vina jobs. Considering all of the above, we chose CPU=2 and HBv3 for our compute nodes.
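One way to reason about "cost-effective" here is to compare core-seconds per compound (threads multiplied by elapsed time). On a fully packed VM with a fixed price per core, cost per compound is roughly proportional to that product. The sketch below assumes this metric. It is not necessarily how the measurements in this article were scored, and the timings in it are made-up placeholders, not results from our tests.

```python
def core_seconds(threads: int, elapsed_seconds: float) -> float:
    """Core-seconds consumed by one docking run: threads held times wall-clock time."""
    return threads * elapsed_seconds


def most_cost_effective(timings: dict[int, float]) -> int:
    """Return the thread count with the lowest core-seconds per compound.

    `timings` maps thread count -> elapsed seconds for the same compound.
    """
    if not timings:
        raise ValueError("timings must not be empty")
    return min(timings, key=lambda t: core_seconds(t, timings[t]))


# PLACEHOLDER timings (seconds), invented purely to exercise the function.
# They are NOT measurements from this proof of concept.
placeholder_timings = {1: 1000.0, 2: 450.0, 4: 300.0, 8: 250.0}

best = most_cost_effective(placeholder_timings)
print(best)
```

More threads usually shorten the wall-clock time of a single compound, but once the speed-up is less than proportional, the extra cores are better spent on running another compound in parallel.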
Simulations configuration
Below is the PBS command we used to submit the Vina jobs:
qsub -N vina_job -l select=1:slot_type=hb120v3:ncpus=2,place=free -j oe
This command submits Vina jobs to the PBS queue, each requesting 1 node of type “hb120v3” with 2 CPUs and allowing the system to place the job freely. Below is a screenshot of the CycleCloud portal after the jobs were submitted. Because we configured each compound to run on 2 CPUs, up to 60 compounds can run on one HB120v3 VM. Therefore, 17 HB120v3 VMs were provisioned in total, providing 2,040 cores.
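The sizing arithmetic above can be sketched in a few lines of Python. The helper name `vms_needed` is hypothetical, and the numbers are the ones given in this article.

```python
import math


def vms_needed(compounds: int, cpus_per_compound: int, cores_per_vm: int) -> int:
    """Number of VMs required to run all compounds concurrently."""
    compounds_per_vm = cores_per_vm // cpus_per_compound
    if compounds_per_vm < 1:
        raise ValueError("a single compound does not fit on one VM")
    return math.ceil(compounds / compounds_per_vm)


CORES_PER_VM = 120       # Standard_HB120rs_v3
CPUS_PER_COMPOUND = 2    # ncpus=2 in the qsub command
COMPOUNDS = 1_020        # proof-of-concept subset

compounds_per_vm = CORES_PER_VM // CPUS_PER_COMPOUND
vm_count = vms_needed(COMPOUNDS, CPUS_PER_COMPOUND, CORES_PER_VM)
total_cores = vm_count * CORES_PER_VM

print(compounds_per_vm, vm_count, total_cores)
```

With these inputs every core is in use: 1,020 compounds at 2 CPUs each fill exactly 17 VMs of 120 cores.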
Results and observations
Professor Tyuji helped confirm that the docking poses predicted by the simulation are all correct.
The chart below shows the CPU utilization of one of the HB120v3 VMs. CPU usage stayed high throughout the simulation. Around 90% of the compounds finished within 10 minutes, and all 1,020 compounds completed in ~30 minutes.
Figure 7 shows more detailed metrics. Besides the high CPU utilization and peak memory usage of ~60GB, we see very little I/O or network usage. That means we might be able to replace ANF with Azure premium disks, which would meet the sizing requirements at lower cost without affecting overall performance.
The total size of the output files is ~40MB after 1,020 compounds were screened.
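To get a rough feel for output storage at full scale, the observed ~40MB can be extrapolated linearly. The sketch below assumes that output size grows in proportion to the number of compounds and uses decimal units (1 GB = 1,000 MB). The function name is hypothetical, and input files are not counted.

```python
def projected_output_gb(total_compounds: int,
                        sample_compounds: int = 1_020,
                        sample_output_mb: float = 40.0) -> float:
    """Linearly extrapolate output size (in GB) from the proof-of-concept sample."""
    mb_per_compound = sample_output_mb / sample_compounds
    return total_compounds * mb_per_compound / 1000.0


full_run_gb = projected_output_gb(4_800_000)
print(round(full_run_gb, 1))
```

Under these assumptions the full dataset would produce on the order of 190 GB of output, well within the 4TB volume used here.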
High scalability is one of the key features of Azure HPC, allowing users to scale up to very large numbers of VMs when needed. Figure 8 shows the estimate (in red) for running ~4.8 million compounds in a larger-scale environment: the total simulation computation can be completed in 10 days or less, instead of months or years. Please note that the estimate does not count other efforts such as infrastructure preparation and data movement.
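A back-of-the-envelope version of such an estimate can be sketched as follows. It is a simplified "wave" model, not the method used to produce Figure 8. It assumes every wave of concurrent compounds takes the full ~30 minutes observed for the proof of concept, which is pessimistic given that about 90% of compounds finished in 10 minutes. The VM count of 170 is a hypothetical value chosen for illustration.

```python
import math


def estimate_days(total_compounds: int,
                  vm_count: int,
                  compounds_per_vm: int = 60,    # 120 cores / 2 CPUs per compound
                  wave_minutes: float = 30.0) -> float:
    """Estimate elapsed days, assuming compounds run in full-length waves."""
    concurrent = vm_count * compounds_per_vm
    waves = math.ceil(total_compounds / concurrent)
    return waves * wave_minutes / (60 * 24)


TOTAL_COMPOUNDS = 4_800_000

# Same scale as the proof of concept (17 VMs).
days_poc_scale = estimate_days(TOTAL_COMPOUNDS, vm_count=17)

# Hypothetical larger environment (10x the proof-of-concept VM count).
days_hypothetical = estimate_days(TOTAL_COMPOUNDS, vm_count=170)

print(round(days_poc_scale, 1), round(days_hypothetical, 1))
```

Under this model the proof-of-concept scale would need about 98 days of computation, while the hypothetical 170-VM environment would need just under 10 days. The real figure depends on the actual distribution of per-compound run times and on how quickly freed cores pick up new jobs.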
Summary
This proof of concept characterized the behavior of Autodock Vina on the selected dataset: the workload is very compute-intensive, with very little I/O utilization and around 10GB of memory per compound. We found CPU=2 to be the most cost-effective setting and HBv3 the recommended SKU. Most importantly, we verified that the classic Azure HPC environment can run Autodock Vina simulations seamlessly and with great scalability. We also suggested a more cost-effective storage option. Finally, we estimated that the completion time for ~4.8 million compounds can be reduced from months or years to less than 10 days.
References
High-performance computing (HPC) on Azure - Azure Architecture Center | Microsoft Learn
What is Azure CycleCloud? - Azure CycleCloud | Microsoft Learn