Genomic Data Storage in Azure: Basic Compression for Mapped Sequencing Data
Published Jan 23 2023

In this article, we show how compressing mapped (aligned) genomic sequencing data (BAM files) using the CRAM standard can reduce storage size (and cost) by around 63% at a compression cost of under a penny (< $0.01 USD) per sample. We demonstrate our results on a set of 62 whole genome sequencing (WGS) samples from the 1000 Genomes Project. We also give detailed instructions on how to use this method on your own data.

 

Introduction

Organizations that work with genomic data often need to store large amounts of sequencing information for large numbers of samples for long periods of time. This can result in substantial storage costs, especially when the data needs to be preserved for many years. Compressing these files can save both money and time, since smaller files can be transferred more quickly.

 

Overview of Sequencing File Types

Digital genomic data originates from sequencing instruments, which typically generate FASTQ files. These files contain the data from individual “reads,” the familiar “AGCT…” sequences from the short segments of DNA processed by the instrument. Those reads are then mapped to a location in a reference genome, resulting in a SAM or BAM file (BAM is a binary representation of SAM with optional compression). Because mapping is an expensive operation, many organizations store BAM files; storage and compression of this format will be the focus of this article. Note that some organizations will also keep the FASTQ files to preserve the original data; we will discuss how to efficiently compress and store FASTQ files in a future article.

 

Further Compressing BAM Files

While BAM files are often already compressed, there is a lot of room left for further compression. In this article, we show you how to take advantage of the CRAM 3.0 compression standard from GA4GH using freely available, open-source tools to gain substantial savings in file size without losing data fidelity (lossless compression). CRAM has a variety of options that allow you to trade off compression size with the time it takes to do the compression; we discuss some of these below.

 

In the sections below, we first show the storage savings and costs associated with CRAM compression, then offer a detailed walkthrough of how to apply CRAM compression to your own data.

 

How Much Can I Save with CRAM?

To demonstrate the potential savings, we first assembled a dataset of 62 public, representative human whole genome sequencing (WGS) samples with an average coverage of 45x (134 gigabases per sample on average) from the 1000 Genomes Project. You can download these samples for your own experiments from the 1000 Genomes Azure Open Dataset. The specific sample IDs we used are listed in the notes at the end of this article; also note that the files provided by the 1000 Genomes Project are in CRAM format, so we converted them to BAM with samtools at the default compression level. All figures presented in this article are averages across these 62 WGS samples.
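For reference, this kind of conversion is a single samtools command; a minimal sketch (file names are placeholders, and the reference FASTA must be the one the CRAMs were aligned against):

# Illustrative only: convert a CRAM back to BAM at samtools' default BAM
# compression level, using all available cores.
samtools view -@ "$(nproc)" -b --reference "reference.fa" -o "sample.bam" "sample.cram"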

 

CRAM allows for a wide range of settings to trade off size vs. compression speed; to demonstrate the range of possibilities we are showing results for three discrete settings: default, best, and archive. Default refers to running the compression tool with default options, best refers to the highest possible level of compression, and archive is an in-between setting that provides a good compromise between compression time/cost and size. Later in this article, we show detailed results on the time/size tradeoff as well as how to apply these specific settings in your own compression workflows.

 

Note that in addition to the savings from compression, there can be substantial further cost savings from choosing an appropriate Azure blob storage tier. For instance, data kept primarily for archival purposes can be stored in the archive tier, a roughly 95% cost savings over the most readily accessible tier (hot). In the table below, we show per-sample pricing per year, i.e., how much it costs to store one human genome’s 45x sequencing data for one year. The numbers use pay-as-you-go pricing for flat-namespace LRS (locally redundant storage) in the East US region; you can find pricing for your region, as well as additional discounts for reserved pricing, on this page.
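As an aside, the tier of an existing blob can be changed with the Azure CLI; in this sketch the storage account, container, and blob names are placeholders:

# Hypothetical names: move a compressed sample to the archive tier.
az storage blob set-tier --auth-mode login --account-name mystorageaccount --container-name genomics --name sample.cram --tier Archive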

 

 

Method                Sample Size              Annual Storage Cost Per Sample (USD)
                      (% of BAM Size)          Hot Tier     Cold Tier    Archive Tier
BAM                   46.7 GiB (100%)          $11.65       $8.52        $0.55
CRAM 3.0 (default)    17.3 GiB (37%)           $4.32        $3.16        $0.21
CRAM 3.0 (archive)    15.3 GiB (33%)           $3.82        $2.79        $0.18
CRAM 3.0 (best)       14.8 GiB (32%)           $3.69        $2.69        $0.18

 

To convert these pricing numbers (price per sample per year, or Psy) to price per gigabase per month (Pgm), where we have 134 gigabases per sample and 12 months per year, use the following formula:

 

Pgm = Psy / (134 × 12) ≈ Psy / 1,608

 

For example, using CRAM 3.0 (archive) and the hot storage tier (immediate access), the annual cost per sample of $3.82 is equivalent to $0.00238 per gigabase per month.

 

How Much Does Compression Cost?

Compression does require some computation, so it is important to include its cost when assessing overall costs. In the table below, we use a spot instance of a Standard_D16a_v4 virtual machine running Ubuntu Linux in the East US region, which in January 2023 had a spot price of $0.0768/hour (use this page to look up the current price for your region, including discounts for reserved pricing).

 

As shown below, the cost of compressing a sample can be less than a penny ($0.008), which is far less than the cost of storing the BAM for a year. Extraction (decompression) is even less expensive ($0.007), and is fairly independent of compression level, so there is only a negligible impact on the cost for future data usage.

 

                      CRAM Compression Per Sample        CRAM Extraction Per Sample
Method                Cost (USD)    Time (minutes)       Cost (USD)    Time (minutes)
CRAM 3.0 (default)    $0.008        6.5                  $0.007        5.3
CRAM 3.0 (archive)    $0.020        15                   $0.008        6.1
CRAM 3.0 (best)       $0.090        68                   $0.007        5.8
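The per-sample figures above are simply the VM’s hourly price multiplied by the runtime; for example, a quick sanity check of the default setting (assuming the $0.0768/hour price quoted above):

# ~6.5 minutes of compression on a $0.0768/hour VM
echo "scale=4; 0.0768 * 6.5 / 60" | bc    # ≈ 0.0083 USD per sample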

 

Compression Speed vs. Cost Tradeoff

While the table above shows the compression cost when running on a Standard_D16a_v4 VM, choosing a different VM may be desirable to trade off compression time vs. cost. The table below shows the results for compression using CRAM 3.0 (archive) on 6 different Azure VMs in the Dav4 VM series. The CRAM file generated on each VM is the same, but the choice of VM impacts the runtime and cost. Note that we can reduce the compression time from nearly an hour to under six minutes, while the cost increases 50% from $0.019 to $0.029 per sample. We do not show decompression times here because the runtime does not change substantially when a VM with more cores is used. This is due to the CRAM decompression implementation in samtools not taking significant advantage of parallelization.

 

                      CRAM 3.0 (archive) Compression Per Sample
VM Size               Cost (USD)    Time (minutes)
Standard_D4a_v4       $0.019        58
Standard_D8a_v4       $0.018        29
Standard_D16a_v4      $0.020        15
Standard_D32a_v4      $0.021        8.3
Standard_D48a_v4      $0.024        6.4
Standard_D64a_v4      $0.029        5.8

 

How Do I Use CRAM on My Data?

The tools for CRAM compression are part of the open source samtools suite, which is freely available and does not require any license fees to use (see license here). We’ll start by showing the commands for compression and decompression to demonstrate how simple they are, then lead you through the setup procedure and some additional tips to improve performance. The instructions below are for an Azure VM running Ubuntu Linux (20.04 LTS).

 

Compress a BAM to a CRAM

Once your environment is set up as described below, including downloading the reference file "reference.fa" (see “Download Reference Files” below), compressing a BAM file using all cores takes only a single line:

samtools view -@ "$(nproc)" -O "$OUTPUT_OPTIONS" --reference "reference.fa" -o "output.cram" "input.bam"

The OUTPUT_OPTIONS variable allows you to trade off size and speed as we discussed earlier. Different settings can be used to customize the compression, as documented in the samtools manual. The table below shows the different settings we used for the entries in the tables above.

 

Setting    OUTPUT_OPTIONS
default    cram
archive    cram,archive
best       cram,archive,use_lzma,level=9
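For example, to compress with the archive setting, set OUTPUT_OPTIONS accordingly and run the same command as above:

OUTPUT_OPTIONS="cram,archive"
samtools view -@ "$(nproc)" -O "$OUTPUT_OPTIONS" --reference "reference.fa" -o "output.cram" "input.bam"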

 

Extract a BAM from a CRAM

Extraction (decompression) using all cores also takes only a single command:

samtools view -@ "$(nproc)" --reference "reference.fa" -o "output.bam" "input.cram"

 

Setting Up Your Environment

In this walkthrough, we’ll first show you how to install samtools, then how to find and download the reference file for your BAMs, and finally how to build a reference cache to further improve your performance.

 

Before starting your setup, it’s best to identify the virtual machine’s temporary disk (also called the resource disk). For the standard Ubuntu Linux VM images in Azure, the temporary disk is mounted at /mnt. All data files (BAM, CRAM, reference FASTA, etc.) should be stored on the temporary disk, since it is local to the VM host; as a result, it has much lower latency and higher throughput than the OS disk or attached data disks (any part of the filesystem other than /mnt). This can have a substantial impact on performance.
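For example, on a standard Ubuntu image you might create a working directory on the temporary disk and run everything from there (the directory name is arbitrary):

# Create a working directory on the temporary disk mounted at /mnt
sudo mkdir -p /mnt/genomics
sudo chown "$USER" /mnt/genomics
cd /mnt/genomics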

 

Install samtools

The easiest way to install samtools is with the Anaconda Python distribution. If you already have a conda environment set up, you can install samtools as follows:

conda install -c bioconda samtools

If you’d prefer not to use Anaconda, you can build and install samtools using these instructions.
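If you are starting from a fresh VM with conda already installed, one way to keep things tidy is to create a dedicated environment first (the environment name is arbitrary):

conda create -n cram -c bioconda -c conda-forge samtools
conda activate cram
samtools --version    # confirm the installation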

 

Download Reference Files

Because BAM files map sequencer reads to a reference genome and CRAM makes use of this mapping to further compress the data, you’ll need to download the reference file (.fasta or .fa) that was used to create the BAM file. You can find the reference name in the BAM header with the command below (replace "input.bam" with the name of your BAM file).

samtools head "input.bam" | grep -Pom1 'UR:\K.*$'

You’ll then need to download that reference file. Standard reference files can generally be found with a quick web search. The BAM files in our 1000 Genomes dataset were created using the reference file GRCh38_full_analysis_set_plus_decoy_hla.fa, which you can download using the Data Access instructions for the 1000 Genomes Azure Open Dataset.
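For example, the reference can be copied to the VM with azcopy; the URL below is a placeholder that you should replace with the blob URL given in the Data Access instructions:

# <DATASET_URL> is a placeholder for the 1000 Genomes Azure Open Dataset blob URL.
azcopy copy "<DATASET_URL>/GRCh38_full_analysis_set_plus_decoy_hla.fa" "reference.fa"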

 

Building a Reference Cache

In the compression/decompression commands shown above, we specified the reference file on the command line with --reference. When doing this, samtools spends 1-2 minutes preparing the index for the CRAM compression/decompression operation. Since it is common to convert files as a batch, this reference preparation can be performed only once by building a local reference cache. Doing this will speed up the compression/decompression of each file; our time/price numbers in the tables above use this approach to optimize the results. The code below uses seq_cache_populate.pl (which is installed with samtools) and builds the cache using the reference “reference.fa”. The cache is then saved in a directory called “ref_cache”:

refcache_path="./ref_cache"
seq_cache_populate.pl -root "$refcache_path" "reference.fa"

We’ll now need to set some environment variables to force samtools to use the local reference cache we just built:

export REF_CACHE="$refcache_path/%2s/%2s/%s"
export REF_PATH="$refcache_path/%2s/%2s/%s"

We can now remove the “reference.fa” file, as only the cache directory is used by subsequent samtools commands. The compression/decompression commands should be run without --reference to make use of this. For example, our decompression command is now simplified to be the following:

samtools view -@ "$(nproc)" -o "output.bam" "input.cram"
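Putting it all together, a batch of BAMs can then be compressed in a simple loop, since the cache only needs to be built once (file names are illustrative):

# Compress every BAM in the current directory using the archive setting;
# REF_CACHE and REF_PATH are assumed to be exported as shown above.
for bam in *.bam; do
    samtools view -@ "$(nproc)" -O "cram,archive" -o "${bam%.bam}.cram" "$bam"
done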

 

What’s Next?

We’ve given you an overview of the size, speed, and cost for compressing mapped sequencing data (BAM files) using the CRAM standard, as well as instructions on how to use CRAM on your own data. In future articles in this series, we’ll discuss how to compress unmapped data (FASTQ) as well as approaches to compress data to even smaller sizes. If you have any feedback on this post or questions for us, please leave us a comment below.

 

Notes

1000 Genomes CRAM Samples used in this article:

HG00117, HG00150, HG00236, HG00243, HG00251, HG00320, HG00584, HG00650, HG01049, HG01083, HG01122, HG01254, HG01565, HG01795, HG02012, HG02074, HG02554, HG02624, HG02728, HG03006, HG03054, HG03069, HG03108, HG03372, HG03433, HG03452, HG03625, HG03708, HG03773, HG03778, HG03875, HG03908, HG03963, HG04001, HG04106, HG04164, HG04212, HG04214, HG04225, NA12342, NA12399, NA12400, NA12873, NA18518, NA18547, NA18915, NA18972, NA19037, NA19042, NA19063, NA19076, NA19190, NA19327, NA20520, NA20763, NA20785, NA20882, NA20906, NA21101, NA21115, NA21127, NA21143
