Genomic data in Parquet format on Azure

Microsoft

Feb 10, 2022

One of the most common formats for genomic data is Variant Call Format (VCF). VCF is a text file format, most likely stored in a compressed manner. It contains meta-information, a header, and data lines containing tab-separated text fields with genotype information. VCF is accepted and processed by many genomic tools, but not natively supported by interactive and serverless technologies like Azure Synapse, AWS Athena, Google BigQuery.

Parquet is an open-source columnar storage format designed to bring efficiency compared to row-based files. It stores metadata and supports complex nested data structures. The layout of Parquet data files is optimized for queries that process large volumes of data.  It is a recognized file format used by many systems, and it works great with serverless analytics solutions.

Due to the wide support of Parquet, having data in this format is beneficial for multiple scenarios. If you want to explore a dataset, you can run ad-hoc queries on Parquet files and get results several times faster and at less cost compared to row-based formats. If you want to incorporate genomic data into your machine learning models, you can read Parquet files from your Notebooks. You can even create your own genomic database – Parquet files ingestion is well supported by many big data analytics platforms.

We converted two Azure Genomics Data Lake datasets from VCF to Parquet: 1000 Genomes variant call information (release 20130502) and gnomAD v2.1.1, to make this data more accessible, especially for people without extensive genomics background. Datasets in Parquet format are publicly available for queries and exploration.

Conversion from VCF to Parquet format

To perform format conversion, we used Glow, an open-source toolkit for large-scale genomic analysis built on Apache Spark. To check notebooks used for dataset conversion and learn more about running them on Azure Databricks and Azure Synapse Analytics, visit our GitHub repository.

Query genomic data in Parquet format

You can query Parquet files in Azure Synapse Analytics. Serverless SQL pool enables you to query data in your data lake. Billing for a serverless SQL pool is based on the amount of data processed to run the query and not the number of nodes used to run the query.

Prerequisites: You need to have a Synapse workspace to perform analytics. Follow the instructions and create a workspace if needed (you can skip the last step and do not upload any sample data).

First, we create a database with a data source that references genomic data storage account. We provide setup scripts and sample queries for 1000 Genomes and gnomAD datasets. Setup script will create the database, data sources, database scoped credentials, and external file formats.

Open Synapse Studio, go to the Develop tab, open SQL script, and execute it.

Go to the Data tab to check that the database and table were created. Now you are ready to run sample queries.

To check the amount of data processed to run a query go to the Monitor tab.

Data schema

Default Glow schema mirrors a single row of a VCF file. Information that applies to an entire variant, such as the contig name, start and end positions, and INFO attributes, is contained in columns. The genotypes, which correspond to the GT FORMAT fields in a VCF, are contained in an array with one entry per sample. Each entry is a structure with fields that are described in the VCF header.

gnomAD dataset in Parquet format has Glow schema with removed genotypes and added hashId. The genotypes are removed since they do not contain any information in this dataset. Column hashId could be used as a unique variant id and is constructed as the hash code of all columns except genotypes using the 64-bit variant of the xxHash algorithm. Detailed gnomAD schema description is available at https://datasetgnomadparquet.blob.core.windows.net/dataset/v2.1.1/dataSchema_exomes.txt and https://datasetgnomadparquet.blob.core.windows.net/dataset/v2.1.1/dataSchema_genomes.txt

For 1000 Genomes dataset we created 2 copies of data with different schema:

Glow schema with added hashId
Flattened schema (Information that applies to an entire variant, hashId included, stays the same. The genotypes get replaced by a set of columns, one per each struct field - to achieve that we copy variant information for each entry of genotypes array)

To compare 1000 Genomes data schemas check SQL scripts with sample queries for flattened and nested data.