Next-generation sequencing (NGS) is a massively parallel sequencing technology that offers ultra-high throughput, scalability, and speed. The technology is used to determine the order of nucleotides in entire genomes or targeted regions of DNA or RNA. NGS has revolutionized the biological sciences, allowing labs to perform a wide variety of applications and study biological systems at a level never before possible.
NGS makes large-scale whole-genome sequencing (WGS) accessible and practical for the average researcher. It enables scientists to analyze the entire human genome in a single sequencing experiment, and scale to tens of thousands of genomes per year.
As the throughput of sequencing instruments increases and the cost per sample decreases, data volumes are increasing exponentially. As a result, data storage, management, and analysis are becoming a major bottleneck in the overall workflow and increasing compute costs. As state-of-the-art methods allow users to extract more information from their data and the analytical pipelines become more computationally intensive, this bottleneck is growing worse.
NVIDIA Clara Parabricks addresses these computational challenges. Clara Parabricks is the only GPU-accelerated and optimized secondary analysis software that includes industry-standard tools, plus deep learning-based tools such as DeepVariant for variant calling.
NVIDIA Clara Parabricks
NVIDIA introduced the Clara Parabricks software suite for performing analysis of NGS DNA and RNA data. It delivers results at blazing fast speeds and low cost. Clara Parabricks can analyze 30x WGS data in under 25 minutes on a single 8-GPU server, instead of 30 hours for traditional CPU-based methods. Its output matches commonly used software, making it simple to verify the accuracy of the results.
Clara Parabricks software provides at least an order of magnitude acceleration in compute time while generating identical outputs and reducing analysis costs. Clara Parabricks is available for free on NVIDIA GPU Cloud (NGC) and can be easily deployed on GPU-based Azure virtual machines (VMs).
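As a quick sanity check, the speedup implied by the two runtimes quoted above can be worked out with shell arithmetic. The yearly figure is only an idealized upper bound: it assumes one server running samples back to back with no idle time, data transfer, or failed jobs, which is an assumption of this sketch and not a measured result.

```shell
# Runtimes quoted above for one 30x WGS sample.
cpu_minutes=$((30 * 60))   # about 30 hours with CPU-based tools
gpu_minutes=25             # about 25 minutes on a single 8-GPU server

# Integer speedup factor.
speedup=$((cpu_minutes / gpu_minutes))

# Idealized upper bound on genomes per server per year
# (assumes back-to-back runs and no overhead of any kind).
genomes_per_year=$((365 * 24 * 60 / gpu_minutes))

echo "speedup: ${speedup}x"
echo "genomes per server-year (upper bound): ${genomes_per_year}"
```

The result, a factor of 72, is consistent with the "at least an order of magnitude" claim.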
Clara Parabricks provides optimal performance for multiple Azure instance types and can be used out of the box for essential bioinformatics needs. Currently, the Clara Parabricks accelerated analysis tools start from FASTQ files and perform alignment through variant calling and expression analysis, including QC tools for both types of outputs. The suite of tools can be used to support end-to-end workflows for germline, somatic and RNA-Seq pipelines, providing the flexibility to meet the individual needs of most projects. The tools can also be used individually, as drop-in replacements for steps in existing workflows.
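To make the end-to-end idea concrete, here is a minimal sketch of a germline run (FASTQ in, BAM and VCF out) through the Parabricks container. The data directory, file names, and image tag are assumptions made for illustration; check the NGC catalog for the tag you actually intend to use. The reference FASTA is also assumed to be indexed already. By default the script only prints the command; set RUN=1 to execute it on a machine with Docker and GPUs.

```shell
# Hypothetical data directory holding the reference and FASTQ files.
DATA_DIR="${DATA_DIR:-$PWD/data}"
# Assumed image tag; verify it against the NGC catalog.
PB_IMAGE="${PB_IMAGE:-nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1}"

# Print the command (dry run) unless RUN=1, in which case execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Germline pipeline: alignment through variant calling in one call.
# File names below are placeholders, not files shipped with Parabricks.
germline() {
    run docker run --rm --gpus all \
        --volume "$DATA_DIR:/workdir" \
        --workdir /workdir \
        "$PB_IMAGE" \
        pbrun germline \
        --ref /workdir/reference.fasta \
        --in-fq /workdir/sample_1.fq.gz /workdir/sample_2.fq.gz \
        --out-bam /workdir/sample.bam \
        --out-variants /workdir/sample.vcf
}

germline
```

Keeping the dry-run default makes it easy to review the exact command before committing GPU time to it.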
Figure 1 below shows most of the accelerated tools within the Clara Parabricks software package. Because the pipelines are accelerated, users can run multiple variant callers to extract the most information from their data and still generate results in less time and at lower cost than with standard baseline software. A standard 30x WGS sample can be processed in 25 minutes on an ND96asr_v4 Azure VM.
Figure 1: The NVIDIA Clara Parabricks 4.0 Toolset-Ref.
Running Parabricks 4.0 on Microsoft Azure
The prerequisites for running Parabricks 4.0 on Microsoft Azure are:
A Microsoft Azure subscription with enough Compute-VM (cores-vCPUs) quota to create GPU-based VMs (preferably NCas_T4_v3 and ND96asr_A100_v4)
An NVIDIA driver greater than version 465.32.*
Any Linux operating system that supports nvidia-docker2, with Docker version 20.10 (or higher)
Note: Clara Parabricks requires at least two GPUs per sample to run efficiently.
The fastest way to run the application is to use a predefined Ubuntu Data Science Virtual Machine image instead of standard Ubuntu. That image already includes the required NVIDIA driver; with standard Ubuntu, you need to install the driver yourself. We recommend SSH public key authentication as a fast, simple, and secure way to connect to your VM.
Once the VM is created, connect to it using ssh:
$ ssh -i private-key.pem user-id@vm-public-DNS
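If you do not yet have a key pair for that connection, one way to create it locally is sketched below. The key path and comment string are hypothetical, and the empty passphrase is only there to keep the sketch non-interactive; a passphrase-protected key is the safer choice. The Azure portal can also generate a key pair when the VM is created.

```shell
# Create an RSA key pair at the given path unless it already exists.
make_key() {
    key_path=$1
    # Create the parent directory with owner-only permissions if missing.
    mkdir -p -m 700 "$(dirname "$key_path")"
    # -N "" sets an empty passphrase (illustration only).
    [ -f "$key_path" ] ||
        ssh-keygen -q -t rsa -b 4096 -N "" -C "parabricks-vm" -f "$key_path"
    # The private key must not be readable by other users.
    chmod 600 "$key_path"
}

# Example (hypothetical path):
#   make_key "$HOME/.ssh/azure_parabricks"
#   ssh -i "$HOME/.ssh/azure_parabricks" user-id@vm-public-DNS
```

The matching public key is written next to the private key with a .pub suffix; that is the file to supply when creating the VM.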
If the NVIDIA driver is already installed, check your NVIDIA hardware and driver version using the nvidia-smi command:
$ nvidia-smi
To make sure you have nvidia-docker2 installed, run this command:
$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
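The prerequisite checks can also be scripted. The sketch below compares the installed driver and Docker versions against the minimums listed earlier and counts the visible GPUs. The function names are made up for this example, the version comparison relies on `sort -V` from GNU coreutils, and the driver check treats any 465.32.x release as newer than 465.32, which is slightly more permissive than the stated requirement.

```shell
# Succeed if version $1 is strictly greater than version $2.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$1" = "$2" ] || version_gt "$1" "$2"
}

# Report every unmet prerequisite; return non-zero if any is missing.
check_prereqs() {
    status=0
    if command -v nvidia-smi >/dev/null 2>&1; then
        driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
        gpus=$(nvidia-smi --list-gpus | wc -l)
        version_gt "$driver" "465.32" || { echo "driver $driver is too old"; status=1; }
        [ "$gpus" -ge 2 ] || { echo "found $gpus GPU(s), need at least 2"; status=1; }
    else
        echo "nvidia-smi not found"; status=1
    fi
    if command -v docker >/dev/null 2>&1; then
        docker_ver=$(docker version --format '{{.Server.Version}}' 2>/dev/null)
        version_ge "${docker_ver:-0}" "20.10" ||
            { echo "Docker ${docker_ver:-unknown} is too old or unreachable"; status=1; }
    else
        echo "docker not found"; status=1
    fi
    return "$status"
}

check_prereqs || echo "Some prerequisites are not met."
```

This check does not replace the container test above, which remains the only step that confirms GPUs are actually usable from inside Docker.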
Let’s discuss the details of the benchmarking on Microsoft Azure:
Step 1: Download data
The original source of the data can be found at this link. For the tests below, Microsoft Azure Blob Storage was used. To run the benchmark against WGS data, you need to extend the VM's disk to at least 1 TB of local space with the following steps. Then start your VM and download the 30x WGS dataset: