Cost-effective genomics analysis with Sentieon on Azure

Venkat_Malladi · ‎Jan 09 2024

This blog was co-authored by Don Freed, Sr. Bioinformatics Scientist, and Brendan Gallagher, Head of Business Development, at Sentieon, Inc.

Introduction

In our previous blog, we introduced you to Sentieon, which specializes in developing software tools for analyzing genomic data. Sentieon pipelines allow researchers and clinicians to process and analyze genomic data quickly, accurately, and efficiently with a low total cost of ownership.

To understand the performance of the Sentieon software, we benchmarked Sentieon’s DNAseq and DNAscope pipelines, version 202112.05, on Azure instances using publicly available datasets from the Illumina, PacBio HiFi, Element Biosciences, and Ultima Genomics platforms. We break down the runtime and cost of the pipelines on a wide range of currently available instances. The benchmark pipeline is available on GitHub.

Prerequisites

A Sentieon license server on Azure
An Azure VM instance that can reach the Sentieon license server, either within the same network or through public access
The benchmark datasets, downloaded and localized into an Azure storage account

Running Sentieon on Azure

The pipelines and setup scripts used in this benchmark are provided on GitHub.

Instance Setup

The script at misc/instance_setup.sh performs the initial setup of the instance, then downloads and installs the software packages used in the benchmark.

Input datasets

In these benchmarks, we use the GIAB HG002 sample sequenced on multiple sequencing platforms. Input datasets for the benchmark are recorded in config/config.yaml, with the exception of the Element dataset, which you will have to download on your own.

We recommend downloading all the files and placing them in Azure Blob Storage. You can use AzCopy to transfer the required files to your own storage account using a shared access signature (SAS) with "Write" access. We then recommend updating the configs so that each file is referenced with a shared access signature. The pipeline will automatically download the input files.

Input FASTQ files were obtained from the following URLs. For Azure URLs, append the provided SAS token using the pattern <URL>?<SAS_Token>:
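As a minimal sketch of the <URL>?<SAS_Token> pattern, the helper below appends a token to a blob URL. The function name `with_sas`, the example storage account, and the placeholder token are all hypothetical and are not part of the benchmark repository; the only logic taken from the text is that the token follows the URL after a `?`.

```python
from urllib.parse import urlsplit


def with_sas(url: str, sas_token: str) -> str:
    """Append a SAS token to a URL, following the <URL>?<SAS_Token> pattern."""
    # Tolerate tokens that were copied with a leading "?".
    token = sas_token.lstrip("?")
    # If the URL already carries a query string, extend it with "&" instead.
    separator = "&" if urlsplit(url).query else "?"
    return f"{url}{separator}{token}"


# Hypothetical blob URL and placeholder token, for illustration only.
example_url = "https://exampleaccount.blob.core.windows.net/datasets/sample.fastq.gz"
example_token = "sv=2019-02-02&sr=c&sig=PLACEHOLDER"

signed_url = with_sas(example_url, example_token)
print(signed_url)
```

The resulting string is what would go into the config in place of the bare URL, assuming the config accepts a full signed URL per file as described above.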

Illumina NovaSeq
- https://s3.amazonaws.com/genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/40x/HG0...
- https://s3.amazonaws.com/genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/40x/HG0...
Illumina HiSeq
- https://s3.amazonaws.com/genomics-benchmark-datasets/google-brain/fastq/hiseqx/wgs_pcr_free/40x/HG00...
- https://s3.amazonaws.com/genomics-benchmark-datasets/google-brain/fastq/hiseqx/wgs_pcr_free/40x/HG00...
Element Biosciences Aviti
- Downloaded from https://go.elementbiosciences.com/access-seq-datasets-060622
Ultima Genomics UG100
- https://s3.amazonaws.com/ultima-selected-1k-genomes/crams/005401-UGAv3-1-CACATCCTGCATGTGAT.cram
PacBio HiFi (SAS Token: sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=7qp%2BxGLGc%2BO2MIVzzDZY7GSqEwthyGnhXJ566KoH7As%3D)

The script at misc/run_benchmarks.sh was used to run the benchmarks. It orchestrates the localization of the input datasets, references, and model files, and the execution of the Snakemake workflows on the machine. The workflow down-samples the input data to a consistent coverage, processes the down-sampled data through the Sentieon pipeline, and calculates variant calling accuracy against the Genome in a Bottle (GIAB) v4.2.1 truth set.

Running benchmarks on Azure

The input files vary in their coverage, so the datasets with FASTQ input were down-sampled to approximately 93 billion bases (~30x coverage) prior to processing with the Sentieon secondary analysis pipelines. The Ultima CRAM file was not down-sampled and is at 40x coverage as recommended by Ultima Genomics.
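The relationship between the 93 billion base target and ~30x coverage can be checked with a short calculation. This is a sketch only: the genome length of 3.1 billion bases is an assumed round figure, the function names are hypothetical, and the 40x input is an illustrative value rather than a measured one. It does not reproduce how the workflow itself down-samples reads.

```python
# Assumed approximate length of the human genome in bases (not from the benchmark).
GENOME_SIZE_BASES = 3.1e9

# Down-sampling target stated in the text.
TARGET_BASES = 93e9


def coverage(total_bases: float, genome_size: float = GENOME_SIZE_BASES) -> float:
    """Mean depth of coverage: sequenced bases divided by genome length."""
    return total_bases / genome_size


def downsample_fraction(input_bases: float, target_bases: float = TARGET_BASES) -> float:
    """Fraction of reads to keep so the input lands on the target base count."""
    return min(1.0, target_bases / input_bases)


target_coverage = coverage(TARGET_BASES)
print(f"Target coverage: {target_coverage:.1f}x")

# Illustrative input sequenced to 40x under the same genome-size assumption.
fraction = downsample_fraction(40 * GENOME_SIZE_BASES)
print(f"Fraction of reads kept from a 40x input: {fraction:.2f}")
```

Under these assumptions the target works out to 30x, and a 40x input would keep three quarters of its reads.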

The data were processed using the hg38 reference genome. The reference genome at https://giab.s3.amazonaws.com/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.... was used for files with input in the FASTQ format. The reference genome at https://broad-references.s3.amazonaws.com/hg38/v0/Homo_sapiens_assembly38.fasta was used with the Ultima data in CRAM format, as this dataset was already aligned to this reference genome.

Benchmarking Sentieon’s variant calling pipelines with HBv3

The Sentieon software can be run on a variety of instances on Azure. We tested the scalability of the software by running on the HBv3 series of machines. These machines are optimized for applications that are driven by memory bandwidth, such as fluid dynamics, finite element analysis, and reservoir simulation, which makes them a good fit for Sentieon’s analysis pipelines. Figure 1 presents the runtime and Spot compute cost of running Sentieon’s analysis pipelines for germline variant calling across multiple sequencing technologies on a Standard_HB120rs_v3 instance in the East US region, with pricing as of the time of publication.

Figure 1: Runtime and Spot compute cost of Sentieon DNAseq and DNAscope pipelines on Standard_HB120rs_v3.

Using the Standard_HB120rs_v3, we analyzed 30x Illumina NovaSeq and HiSeq X samples from FASTQ to VCF using the DNAseq and DNAscope pipelines. On the NovaSeq sample, the DNAseq pipeline took around 30 minutes with a Spot cost of $0.18. Sentieon’s DNAscope pipeline took only about 2 minutes longer, around 32 minutes, with a Spot cost of $0.19, about 1 cent more (see Table 1).

The Ultima UG100 dataset is already aligned to the reference genome, so the pipeline performed variant calling without alignment. The DNAscope pipeline finished in 18 minutes for a Spot cost of $0.11.

Sentieon’s DNAscope LongRead pipeline for PacBio HiFi data is more computationally intensive, as it includes multiple passes of variant calling along with read-backed phasing. The DNAscope LongRead pipeline finished in 72 minutes with a Spot cost of $0.43.

The Element Biosciences AVITI system is supported by a customized Sentieon DNAscope pipeline. Sentieon’s DNAscope pipeline for Element Biosciences finished in 31 minutes with a Spot cost of $0.18.

All run times and costs can be found in Table 1.

Sample	Pipeline	Alignment (min)	Preprocessing (min)	Variant Calling (min)	Total Runtime (min)	On Demand ($)	Spot ($)
Element Aviti	DNAscope	21.90	1.35	7.38	30.63	1.84¹	0.18¹
Illumina HiSeq X	DNAseq	21.90	2.93	4.45	29.28	1.76¹	0.18¹
Illumina HiSeq X	DNAscope	21.90	1.21	7.56	30.67	1.84¹	0.18¹
Illumina NovaSeq	DNAseq	23.10	2.63	4.72	30.45	1.83¹	0.18¹
Illumina NovaSeq	DNAscope	23.10	1.18	7.68	31.96	1.92¹	0.19¹
PacBio HiFi	DNAscope	27.18	1.96	42.47	71.61	4.30¹	0.43¹
Ultima UG100	DNAscope	N/A	N/A	18.23	18.23	1.09¹	0.11¹

Table 1: Runtime and On Demand and Spot compute cost of Sentieon DNAseq and DNAscope pipelines on Standard_HB120rs_v3. Alignment includes alignment with Sentieon BWA-MEM for short-read data and alignment with Sentieon minimap2 for PacBio HiFi data. Preprocessing includes duplicate marking, base-quality score recalibration, and merging of multiple aligned files into a single file. Variant calling includes variant calling or variant candidate identification along with variant genotyping and filtering. Variant calling for PacBio HiFi data is implemented as a multi-stage pipeline. All runs were in the eastus region.
¹ Pricing is accurate at the time of publication.
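The per-run costs in Table 1 follow from runtime multiplied by an hourly price. The hourly prices themselves are not listed in this post, so the sketch below backs out the implied rates from the table's own numbers. The function names are hypothetical, and the derived rates are estimates from rounded table values, not official Azure list prices.

```python
# (sample, pipeline, total runtime in minutes, On Demand $, Spot $), copied from Table 1.
TABLE_1 = [
    ("Element Aviti", "DNAscope", 30.63, 1.84, 0.18),
    ("Illumina HiSeq X", "DNAseq", 29.28, 1.76, 0.18),
    ("Illumina HiSeq X", "DNAscope", 30.67, 1.84, 0.18),
    ("Illumina NovaSeq", "DNAseq", 30.45, 1.83, 0.18),
    ("Illumina NovaSeq", "DNAscope", 31.96, 1.92, 0.19),
    ("PacBio HiFi", "DNAscope", 71.61, 4.30, 0.43),
    ("Ultima UG100", "DNAscope", 18.23, 1.09, 0.11),
]


def implied_hourly_rate(cost: float, runtime_min: float) -> float:
    """Hourly price implied by a per-run cost and its runtime in minutes."""
    return cost / runtime_min * 60


def run_cost(hourly_rate: float, runtime_min: float) -> float:
    """Per-run compute cost for a given hourly price, rounded to cents."""
    return round(hourly_rate * runtime_min / 60, 2)


on_demand_rates = [implied_hourly_rate(od, rt) for _, _, rt, od, _ in TABLE_1]
spot_rates = [implied_hourly_rate(sp, rt) for _, _, rt, _, sp in TABLE_1]

mean_on_demand = sum(on_demand_rates) / len(on_demand_rates)
mean_spot = sum(spot_rates) / len(spot_rates)

print(f"Implied On Demand rate: ${mean_on_demand:.2f}/hour")
print(f"Implied Spot rate:      ${mean_spot:.2f}/hour")
print(f"Spot as a fraction of On Demand: {mean_spot / mean_on_demand:.0%}")
```

On the Table 1 figures, the implied rates come out at roughly $3.60 per hour On Demand and $0.36 per hour Spot, so Spot runs cost about a tenth of On Demand runs on this instance. `run_cost` can be used in the other direction to estimate the cost of a run with a different runtime.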

Sentieon benchmark across multiple instance families and architectures

The Sentieon pipelines and software are able to scale to smaller or larger instances depending on the data and on instance availability. To represent performance across architectures, we benchmarked the Sentieon DNAseq and DNAscope pipelines with the Illumina NovaSeq dataset on ARM and x86 architectures. The runtime and the On Demand and Spot compute costs are shown in Figures 2 and 3 for DNAseq and DNAscope, respectively. On Demand VMs are billed for compute capacity by the second, with no commitments or upfront payments, while Spot VMs use unused compute capacity at a discount.

Figure 2: Runtime and On Demand and Spot compute cost of the Sentieon DNAseq pipeline across various Azure machine types using Illumina NovaSeq dataset sorted by overall runtime. Larger instances provide lower runtime, while cost is generally consistent within a family but does differ between architectures.

Figure 3: Runtime and On Demand and Spot compute cost of the Sentieon DNAscope pipeline across various Azure machine types using Illumina NovaSeq dataset sorted by overall runtime. Larger instances provide lower runtime, while cost is generally consistent within a family but does differ between architectures.

For the fastest turnaround, the Sentieon DNAseq pipeline is able to process the Illumina 30x NovaSeq dataset in 30 minutes on a Standard_HB120rs_v3, with an On Demand cost of $1.83 or a Spot cost of $0.18 (see Figure 2). As another cost-effective option, DNAseq can be used on the Standard_D96ads_v5 instance with an On Demand cost of $3.34, a Spot cost of $0.33, and a turnaround time of under 40 minutes (see Figure 2). The DNAscope pipeline on the Standard_D96ads_v5 instance has an On Demand cost of $3.88, a Spot cost of $0.39, and a turnaround time of under 50 minutes (see Figure 3). Note that for the Standard_F48s_v2, an additional external disk was used to accommodate all the test data for the analysis, and its cost was not included in the overall cost.

We also ran a comparison against ARM CPUs. We used equivalent 32 vCPU machines for a direct comparison; the largest ARM instance available has 64 vCPUs, compared with 96 vCPUs for x86 (Figures 2 and 3). Table 2 shows that ARM runtimes were roughly 9 to 20 minutes longer than the x86 equivalents from Intel and AMD. On Demand cost was comparable across architectures for both the DNAscope and DNAseq pipelines. The biggest difference was in Spot pricing for AMD, at $0.33 for DNAscope and $0.30 for DNAseq, which is about one third the cost of the other options.

VM Size	Architecture	Pipeline	Total Runtime (min)	On Demand ($)	Spot ($)
D32ds_v5	x86 (Intel)	DNAscope	129.50	3.90¹	1.25¹
D32ads_v5	x86 (AMD)	DNAscope	121.96	3.35¹	0.33¹
D32pds_v5	ARM	DNAscope	142.19	3.43¹	1.10¹
D32ds_v5	x86 (Intel)	DNAseq	120.88	3.64¹	1.17¹
D32ads_v5	x86 (AMD)	DNAseq	110.51	3.04¹	0.30¹
D32pds_v5	ARM	DNAseq	129.42	3.12¹	1.00¹

Table 2: Runtime, On Demand and Spot compute cost of Sentieon DNAseq and DNAscope pipelines across 32 vCPU architectures. All runs were in the eastus region.
¹ Pricing is accurate at the time of publication.
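To make the architecture comparison in Table 2 easier to inspect, the sketch below computes the Spot cost as a fraction of the On Demand cost for each row and picks the cheapest Spot option per pipeline. The values are copied from Table 2; the helper names are hypothetical, and the ratios reflect the rounded prices at the time of publication only.

```python
# (VM size, architecture, pipeline, runtime in minutes, On Demand $, Spot $), copied from Table 2.
TABLE_2 = [
    ("D32ds_v5", "x86 (Intel)", "DNAscope", 129.50, 3.90, 1.25),
    ("D32ads_v5", "x86 (AMD)", "DNAscope", 121.96, 3.35, 0.33),
    ("D32pds_v5", "ARM", "DNAscope", 142.19, 3.43, 1.10),
    ("D32ds_v5", "x86 (Intel)", "DNAseq", 120.88, 3.64, 1.17),
    ("D32ads_v5", "x86 (AMD)", "DNAseq", 110.51, 3.04, 0.30),
    ("D32pds_v5", "ARM", "DNAseq", 129.42, 3.12, 1.00),
]


def spot_fraction(on_demand: float, spot: float) -> float:
    """Spot cost expressed as a fraction of the On Demand cost."""
    return spot / on_demand


def cheapest_spot(rows, pipeline: str):
    """Row with the lowest Spot cost for the given pipeline."""
    candidates = [row for row in rows if row[2] == pipeline]
    return min(candidates, key=lambda row: row[5])


for vm, arch, pipeline, runtime, on_demand, spot in TABLE_2:
    ratio = spot_fraction(on_demand, spot)
    print(f"{vm:<10} {arch:<12} {pipeline:<9} {runtime:7.2f} min  Spot/On Demand = {ratio:.0%}")

best_dnaseq = cheapest_spot(TABLE_2, "DNAseq")
best_dnascope = cheapest_spot(TABLE_2, "DNAscope")
print("Cheapest Spot for DNAseq:  ", best_dnaseq[0])
print("Cheapest Spot for DNAscope:", best_dnascope[0])
```

With these numbers, the AMD instance's Spot cost is about a tenth of its On Demand cost, while the Intel and ARM instances sit near a third, which is where the Spot price gap in Table 2 comes from.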

These results highlight the ability of the Sentieon software to scale up to large instances for faster turnaround and down to smaller instances as needed. We included only a subset of the potential compute options, chosen for their compute-to-price ratios. However, the Sentieon tools can also be used with other machine families, based on availability in a given region.

Conclusion

Sentieon’s DNAseq and DNAscope pipelines are highly scalable and can be used on a variety of machine types. The software can scale up to the 120 vCPU Standard_HB120rs_v3 instances for turnaround times of 30 minutes, or down to Standard_D32ads_v5 instances, where the DNAseq pipeline has a Spot cost of $0.30.

If you can get Standard_HB120rs_v3 in your preferred region, it is the cheapest per run. If it is not available, the other Spot options remain attractive, with Standard_D32ads_v5 and Standard_D96ads_v5 offering the best cost advantage. Standard_D96ads_v5 instances offer highly competitive On Demand and Spot pricing with a reasonable turnaround time: Sentieon’s DNAseq FASTQ to VCF pipeline can process Illumina 30x whole genomes for about $3.34 on On Demand machines or $0.33 on Spot machines, in under 42 minutes. Standard_D32ads_v5 runs the DNAseq pipeline for $3.04 on On Demand machines or $0.30 on Spot machines, in about 111 minutes. Across the variety of machine types we tested, Sentieon DNAseq can process 30x genomes from FASTQ to VCF for a Spot machine cost of less than $1.50.

Readers should note that all costs represent hardware costs and do not include software licensing costs.

To get started with the Sentieon software on Azure, please reach out to info@sentieon.com or visit the Sentieon website at www.sentieon.com.
