Healthcare and Life Sciences Blog

Scalable Genomics Annotation Analysis with OpenCRAVAT in Microsoft Azure

Erdal_Cosgun (Microsoft)
Oct 14, 2021

Co-authored by Prof. Dr. Rachel Karchin; Kym Pagel, Ph.D.; and RyangGuk Kim, Ph.D.

 

In this post, we describe how OpenCRAVAT is integrated with Azure Data Science Virtual Machines. "OpenCRAVAT is a Python package that performs genomic variant interpretation including variant impact, annotation, and scoring. There is a web-based version of OpenCRAVAT (https://run.opencravat.org) but it can also be installed locally and is easy to integrate into bioinformatics pipelines. OpenCRAVAT has a modular architecture with a wide variety of analysis modules that can be selected and installed/run based on the needs of a given study. The modules are made available via the CRAVAT Store and are developed both by the CRAVAT team and the broader variant analysis community. OpenCRAVAT is a product of the Karchin Lab at Johns Hopkins University in collaboration with In Silico Solutions and Oak Bioinformatics LLC with funding provided by the National Cancer Institute’s ITCR program."

 

Professor Rachel Karchin, Institute for Computational Medicine, Johns Hopkins University: "Microsoft Azure makes large-scale integrative analysis with OpenCRAVAT easy in interactive environments like Jupyter Notebooks and RStudio."

 

Overview

 

"OpenCRAVAT is a modular python package that is available in the pip PyPI repository. It takes a file of genomic variants as input. The most common input format is a VCF file but other formats are supported including dbSNP identifiers, 23&Me and Ancestry.com file formats. The analysis performed by OpenCRAVAT depends upon user-selected annotation and visualization options, available for download from the free OpenCRAVAT Store. In addition to the interactive user interface, OpenCRAVAT provides several output formats including text reports, Excel spreadsheets, and a SQLite database of results used by cravat_view.

 

There are more than 150 different modules in the app store. Each module can be assigned one or more tags, which include allele frequency, cancer, cardiovascular, clinical relevance, converters, evolution, functional studies, genes, interactions, literature, non-coding, reporters, variant effect prediction, variants, and visualization.

  • Converters (input formats): TSV, VCF, Ancestry.com, 23andMe, FamilyTreeDNA
  • Reporters (output formats): Text format, Excel, TSV, CSV, Annotated VCF"
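
As a rough illustration of the input side described above, the sketch below formats two placeholder variants as a minimal VCF and assembles an `oc run` command line without executing it. The variant coordinates are made up for illustration. The `-l`, `-a`, and `-t` flags and the `clinvar` module name are assumptions about the OpenCRAVAT command line that this post does not state, so check `oc run -h` on your own installation before relying on them.

```python
# Minimal VCF header: a file-format line plus the eight fixed column names.
VCF_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def format_vcf(variants):
    """Format (chrom, pos, ref, alt) tuples as minimal VCF text.

    ID, QUAL, FILTER and INFO are filled with "." (missing value).
    """
    rows = [
        f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\n"
        for chrom, pos, ref, alt in variants
    ]
    return VCF_HEADER + "".join(rows)


def build_run_command(input_path, annotators, report_types, genome="hg38"):
    """Assemble an `oc run` argument list (assumed flags; nothing is executed)."""
    return [
        "oc", "run", input_path,
        "-l", genome,          # assumed: reference genome of the input
        "-a", *annotators,     # assumed: annotation modules to apply
        "-t", *report_types,   # assumed: reporter (output format) modules
    ]


# Placeholder variants; these coordinates are illustrative, not real findings.
example_vcf = format_vcf([("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")])
example_cmd = build_run_command("example.vcf", ["clinvar"], ["excel"])
```

Writing `example_vcf` to a file named `example.vcf` and passing `example_cmd` to `subprocess.run` would start an analysis, provided the named modules are already installed.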

Kym Pagel, Assistant Research Scientist, Institute for Computational Medicine, Johns Hopkins University: "Compared to similar services, the Azure interface for developers is much more intuitive, which is incredibly valuable as data get larger and more complex."

 

OpenCRAVAT in Microsoft Azure

 

It is fairly simple to get OpenCRAVAT up and running on Microsoft Azure. We recommend selecting the F2s v2 virtual machine (VM) for small jobs, and the F16s v2 VM for heavier loads that include multiple samples with whole genome sequencing. After the VM is started, ssh into the VM and then run a few commands to install all necessary components:

  • To install OpenCRAVAT, run pip3 install open-cravat
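
For scripted VM setup, the install step can be wrapped in a small helper. This is a minimal sketch, not part of the official instructions: the function names are hypothetical, and it assumes that installing `open-cravat` places the `oc` command on the PATH.

```python
import shutil
import sys


def pip_install_command(package="open-cravat"):
    """Build a pip command tied to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def oc_available():
    """Return True if the `oc` command-line tool can be found on PATH."""
    return shutil.which("oc") is not None
```

A setup script could pass `pip_install_command()` to `subprocess.run(..., check=True)` only when `oc_available()` returns False, which keeps repeated runs cheap.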

We recommend that users pull the store modules from the Azure Genomics Data Lake when running a VM on Azure. This dataset is a mirror of the store at https://store.opencravat.org and https://run.opencravat.org. To facilitate this, we provide a small script for downloading the relevant modules.

  • Download azcopy
  • Determine the annotation and analysis modules that you’d like to download. View all available options with oc module ls -a
  • Download the import_modules.py script, and place it in the same directory as azcopy
  • To run the script, type python3 import_modules.py module1 module2
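
The steps above can be sketched as a small wrapper that checks the working directory and builds the two command lines. This is an illustration under stated assumptions: the helper names are hypothetical, the azcopy executable is assumed to be named `azcopy`, and `clinvar` and `gnomad` in the usage note are only example module names.

```python
from pathlib import Path

# Files the steps above expect to sit side by side in one directory.
REQUIRED_FILES = ("azcopy", "import_modules.py")


def missing_prerequisites(workdir):
    """Return the required files that are not present in workdir."""
    return [name for name in REQUIRED_FILES if not (Path(workdir) / name).exists()]


def list_modules_command():
    """Command from the steps above that lists all available modules."""
    return ["oc", "module", "ls", "-a"]


def import_modules_command(modules):
    """Build the import_modules.py invocation for the chosen modules."""
    if not modules:
        raise ValueError("name at least one module to download")
    return ["python3", "import_modules.py", *modules]
```

For example, `import_modules_command(["clinvar", "gnomad"])` yields the argument list for `python3 import_modules.py clinvar gnomad`, which can be run with `subprocess.run(..., cwd=workdir, check=True)` once `missing_prerequisites(workdir)` returns an empty list.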

For more information, consult the genomicsnotebook guide to downloading specific databases and deploying a Data Science VM on Azure for OpenCRAVAT at https://github.com/microsoft/genomicsnotebook/blob/main/sample-notebooks/genomics-opencravat.ipynb

 

RyangGuk Kim, CEO, Oak Bioinformatics LLC: "Microsoft Azure has been fantastic for delivering OpenCRAVAT to our clients with ease and convenience."

 

Azure Deployment Steps

 

Step 1. Visit: https://github.com/microsoft/genomicsnotebook/blob/main/sample-notebooks/genomics-opencravat.ipynb

 

 

 

Step 2. Click ‘Deploy To Azure’

Step 3. Select the relevant parameters for VM deployment: Subscription, Resource Group, etc.

 

 

Step 4. Once deployment is complete, log in to the VM over RDP. On the desktop of the Azure Data Science VM for OpenCRAVAT, users can find the installation instructions, the OpenCRAVAT documentation, and sample datasets in the mounted folder.

 

 

Step 5. On the OpenCRAVAT landing page, select the ‘Reference Genome’ and enter the path of the input file.

 

 

 

 

Step 6. Use the OpenCRAVAT store to install annotation modules on demand.

 

 

OpenCRAVAT Datasets on Azure Genomics Data Lake

 

Users can explore and use the OpenCRAVAT datasets from the Azure Genomics Data Lake. For further information on using Azure Storage Explorer with this dataset, see the Azure Storage Explorer instructions listed in the References.

 

 

 

References

 

  1. https://opencravat.org/about.html
  2. genomicsnotebook/genomics-opencravat.ipynb at main · microsoft/genomicsnotebook (github.com)
  3. OpenCravat - Azure Open Datasets | Microsoft Docs
  4. Microsoft Genomics
  5. https://github.com/microsoft/genomicsnotebook
  6. https://github.com/microsoft/genomicsnotebook/blob/main/docs/Genomics_Data_Lake_Azure_Storage_Explorer.pdf
Updated Oct 14, 2021
Version 3.0