The COVID-19 pandemic, along with recent breakthroughs in molecular biology, has led to exceptionally heavy use of sequence databases. Researchers use BLAST (Basic Local Alignment Search Tool), from NCBI (National Center for Biotechnology Information), to compare nucleotide or protein sequences against sequence databases and calculate the statistical significance of the matches.
Query elapsed time is critical in this task. The shorter each query runs, the more iterations users can complete in a limited period of time, which helps them meet time-to-market goals and reduce overall IT infrastructure cost.
In this article, we show how to run BLAST on Azure, analyze the execution pattern of BLAST query runs, and identify the performance bottleneck. We then propose performance tuning suggestions and verify their effect. We also examine related topics, including multi-threading scalability, performance degradation with multiple concurrent queries, and cost-effectiveness.
We examine 2 BLAST scenarios in order. Both use the most current version of BLAST at the time of writing (2.11.0) and “blastn” queries to compare nucleotide sequences against nucleotide sequences.
Running standalone BLAST with a large BLAST DB and a simple query
Please note that we are using “-num_threads 30” to utilize 30 CPUs/cores.
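As a rough illustration of where this flag sits in a full command line, the sketch below assembles a blastn invocation in Python. The database path, query file, and output file are hypothetical placeholders, not the files used in this test; only the flags -db, -query, -out, and -num_threads are standard blastn options.

```python
import shlex


def build_blastn_command(db, query, out, num_threads=30):
    """Return the argument list for a nucleotide-nucleotide blastn search."""
    return [
        "blastn",
        "-db", db,                         # BLAST database name or path prefix
        "-query", query,                   # FASTA file with the query sequence(s)
        "-out", out,                       # file that receives the search results
        "-num_threads", str(num_threads),  # number of CPU threads to use
    ]


# Hypothetical paths, for illustration only.
cmd = build_blastn_command("/mnt/resource/blast/nt", "query.fa", "results.out")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to subprocess.run without shell quoting issues.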
BLAST is installed standalone on a single virtual machine. The same query is run multiple times to compare the query elapsed time. The BLAST DB and input/output files are stored in different storage locations, including local/NVMe disks, Azure NetApp Files, and Premium managed disks. All the storage solutions provide similar IOPS and throughput capabilities.
NOTE: You can follow the step-by-step process at azurehpc/apps/blast at master · Azure/azurehpc (github.com) to create the Azure environment, install BLAST, and run the test on your own.
Below are the screenshots of Azure Insights while running BLAST on an E64dsv4 VM, with both the BLAST DB and input/output files stored on local temp SSD disks. The execution shows 2 distinct phases.
1st phase: It takes ~50% of the total execution time (left-hand part of the screenshot below). Disk I/O shows very high READ IOPS (up to ~12K) with ~1.3 GB/s throughput, while CPU utilization is only 5%~30%.
This phase appears to be IO-bound. To verify, we run “iostat” to check real-time disk IO:
$ iostat -kx 2
The IO utilization (local temp disk at “/sdc” in this case) is 100% most of the time, with massive READ operations. The long disk queue keeps the CPU from being fully utilized, so this phase is clearly IO-bound.
We now need to determine whether the IO-bound behavior is due to an IOPS/throughput limit or to latency. We verify this by storing the BLAST DB and input/output files in different locations, which we examine later.
Another question is how to avoid or mitigate the IO bottleneck. Could it be that the BLAST DB is too large to be loaded into the VM’s memory? We also verify this later by running on a VM with more than 1.2 TB of memory.
2nd phase: It takes the remaining ~50% of the total run time (right-hand part of the screenshot below). Only 1 CPU is used, at ~100% utilization, with very little networking, memory, and read/write activity. This indicates that the phase is CPU-bound and implies that the VM SKU would be the performance differentiator.
Analyze Performance Bottleneck
As discussed, the 1st phase of the run is IO-bound. We then ran tests with the BLAST DB and input/output files stored in different storage locations, including Azure NetApp Files and Premium Disks. All provide similar IOPS and throughput.
Recall that we also suspected the IO-bound behavior might be because the VM’s memory is not large enough to load the 1.2 TB BLAST DB at once. Therefore, we ran the same query on an M128dsv2 VM, which has 4 TB of memory, for comparison.
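A quick way to test this hypothesis before moving to a larger VM is to compare the on-disk size of the BLAST DB with the VM's total memory. The sketch below is a minimal Python illustration, assuming a Linux VM that exposes /proc/meminfo. The 10% headroom reserved for the OS and for BLAST itself is an arbitrary assumption, and all helper names are hypothetical.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path (e.g. the BLAST DB directory)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def parse_mem_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo content; the kernel reports it in KiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def total_memory_bytes(meminfo_path="/proc/meminfo"):
    """Read the VM's total memory in bytes (Linux only)."""
    with open(meminfo_path) as handle:
        return parse_mem_total_bytes(handle.read())


def fits_in_memory(db_bytes, mem_bytes, headroom=0.9):
    """True if the DB fits within the assumed usable fraction of memory."""
    return db_bytes <= mem_bytes * headroom


TB = 10**12
# Sizes quoted in this article: a 1.2 TB BLAST DB on a VM with 4 TB of memory.
print(fits_in_memory(1.2 * TB, 4 * TB))
```

On a real VM, the two inputs would come from directory_size_bytes on the DB directory and total_memory_bytes.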
Below are the full test results. The greener the cell, the better the query performance.
First, the M128dsv2 VM (with 4 TB of memory) performs much better than the others. When running on this VM, we see very little disk IO in the 1st phase of the run, and CPU utilization is nearly 100% the whole time. This validates the assumption that the IO-bound behavior is due to the VM’s memory not being large enough to hold the whole BLAST DB. In fact, you will see the same pattern in the small BLAST DB scenario later.
Second, for VMs like E64dsv4, HB120rsv3, or FX-48 (currently in private preview), keeping both the BLAST DB and input/output files on the local temp or NVMe disk provides the best performance, which indicates that the IO is very latency-sensitive.
We also noticed that applying a proper VM tuning profile can further improve performance. The “enterprise-storage” and “throughput-performance” profiles perform slightly better than the others. Below are the commands you can try on your own.
# check active tuned-adm profile
$ sudo tuned-adm active
# apply a tuning profile (choose one; each call replaces the active profile)
$ sudo tuned-adm profile throughput-performance
$ sudo tuned-adm profile enterprise-storage
Finally, the VM SKU matters the most when considering performance:
HB120v3 = M128dsv2 > FX48 > E64dsv4 > HC44rs > L64dsv2
Cost-Effectiveness Analysis
Now consider the cost. As HB120v3 is much cheaper than the others, it is also the most cost-effective.
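Cost-effectiveness here can be read as the cost of completing one query: the VM's hourly price multiplied by the query's elapsed time. The Python sketch below ranks SKUs by that figure. The SKU names, prices, and elapsed times in it are made-up placeholders for illustration only, not Azure list prices or results from this test; substitute your own measurements and current pricing.

```python
def cost_per_query(price_per_hour, elapsed_minutes):
    """Cost of one query run: hourly VM price times elapsed time in hours."""
    return price_per_hour * elapsed_minutes / 60.0


def rank_by_cost(skus):
    """skus maps name -> (price_per_hour, elapsed_minutes); cheapest query first."""
    costs = {name: cost_per_query(price, minutes)
             for name, (price, minutes) in skus.items()}
    return sorted(costs.items(), key=lambda item: item[1])


# Placeholder values, NOT real prices or measurements.
example = {
    "sku_a": (4.0, 30.0),
    "sku_b": (12.0, 30.0),
    "sku_c": (3.0, 60.0),
}

for name, cost in rank_by_cost(example):
    print(f"{name}: {cost:.2f} per query")
```

With these placeholder values, the SKU with the lowest hourly price is not the cheapest per query, because its longer elapsed time outweighs the lower price.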
Performance degradation with multiple queries, and CentOS vs. Ubuntu
To find out whether there is any notable performance degradation when running multiple queries, we first run a 10-thread job and record the elapsed time as the baseline. We then run multiple queries (6 or 12, depending on the number of CPUs on the VM) at the same time and record the end-to-end elapsed time, as shown in the table below. The total elapsed time increases by ~130%, as shown in the red rectangle.
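One way to quantify this kind of degradation is to compute the percentage increase in elapsed time over the single-query baseline, together with the aggregate throughput gained by running the queries side by side instead of one after another. The Python sketch below uses placeholder timings chosen only to illustrate the arithmetic; they are not measurements from this test.

```python
def slowdown_percent(baseline_min, concurrent_min):
    """Percentage increase in elapsed time relative to the single-query baseline."""
    return (concurrent_min - baseline_min) / baseline_min * 100.0


def throughput_gain(baseline_min, concurrent_min, num_queries):
    """How many times faster the batch finishes than running the queries one by one."""
    return (num_queries * baseline_min) / concurrent_min


# Placeholder timings in minutes, NOT measurements from this article.
baseline = 10.0
concurrent = 23.0
queries = 6

print(f"slowdown: {slowdown_percent(baseline, concurrent):.0f}%")
print(f"throughput gain: {throughput_gain(baseline, concurrent, queries):.2f}x")
```

Under these placeholder numbers, each query takes longer when run concurrently, but the whole batch still finishes sooner than it would sequentially.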
We also see that CentOS runs faster than Ubuntu.
Now let’s move on to the next BLAST dataset.
Running Docker BLAST with a small BLAST DB and much more complicated query strings
NCBI performs a BLAST analysis, similar to the approach described in this publication, to compare de novo aligned contigs from bacterial 16S-23S sequencing against the nucleotide collection (nt) database, using the latest version of the BLAST+ Docker image.
Please note that we are using “-num_threads 16” to utilize 16 CPUs/cores.
We tested 3 different queries, from the simplest (Analysis 1) to the most complicated (Analysis 3):
NOTE: You can follow the process at ncbi/blast_plus_docs (github.com) to run the same test on your own.
Below are the Azure Insights screenshots taken while running the BLAST queries on an HB120v3 VM, which has 120 CPU cores. Again, the execution shows 2 phases, as in the previous test.
16 CPUs ~100% usage in 1st phase ("htop" screenshot):
Very little I/O & networking usage during the whole run (Azure Insights screenshots):
Based on these observations, we found that when the VM’s memory is larger than the BLAST DB (~122 GB), the program loads the database into memory and runs the query algorithm against it for all subsequent sequences. Because the CPUs are also fully utilized, the choice of VM SKU plays a bigger role as the performance differentiator.
We also tested several performance tuning practices including:
Below are the test results on different VM SKUs. The greener the cell, the better the query performance.
As you can see:
HB120v3 > FX48 > E64dsv4 > HC44rs
One of the promises of the public cloud is scalability, so we examine how the complexity of the query strings affects scalability. Below are the elapsed times (in minutes) when we run 3 different kinds of query strings (Analysis 1, Analysis 2, and Analysis 3) on the same VM with different “num_threads” values (1, 2, 4, 8, and 16).
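Scalability across thread counts is commonly summarized with speedup (single-thread time divided by N-thread time) and parallel efficiency (speedup divided by N). The Python sketch below computes both. The elapsed times in it are placeholders for illustration, not the results of Analysis 1 to 3.

```python
def scaling_table(elapsed_by_threads):
    """Return (threads, speedup, efficiency) rows from {threads: elapsed_minutes}.

    Speedup is measured against the 1-thread run, which must be present.
    """
    base = elapsed_by_threads[1]
    rows = []
    for threads in sorted(elapsed_by_threads):
        speedup = base / elapsed_by_threads[threads]
        rows.append((threads, speedup, speedup / threads))
    return rows


# Placeholder elapsed times in minutes, NOT measurements from this article.
example = {1: 160.0, 2: 80.0, 4: 50.0, 8: 32.0, 16: 25.0}

for threads, speedup, efficiency in scaling_table(example):
    print(f"{threads:>2} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency that falls quickly as threads are added suggests that the query does not benefit much from more cores, which is useful when choosing a value for “-num_threads”.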
We have now walked you through the process of analyzing the execution pattern of 2 BLAST query scenarios and determining the performance bottleneck. We also suggested several performance tuning practices and verified their effectiveness.
We also covered related topics such as multi-threading scalability, performance degradation with multiple queries, and cost-effectiveness analysis.
Below are the key suggestions when running BLAST queries on Azure.
HB120v3 >> FX48(cost TBD) > E64dsv4 > L64ds2 > HC44rs