Co-authored by Prof. Dr. Rachel Karchin, Kym Pagel, Ph.D., and RyangGuk Kim, Ph.D.
In this blog, we share how OpenCRAVAT is integrated with Azure Data Science Virtual Machines. "OpenCRAVAT is a Python package that performs genomic variant interpretation, including variant impact, annotation, and scoring. There is a web-based version of OpenCRAVAT (https://run.opencravat.org), but it can also be installed locally and is easy to integrate into bioinformatics pipelines. OpenCRAVAT has a modular architecture with a wide variety of analysis modules that can be selected, installed, and run based on the needs of a given study. The modules are made available via the CRAVAT Store and are developed both by the CRAVAT team and by the broader variant analysis community. OpenCRAVAT is a product of the Karchin Lab at Johns Hopkins University in collaboration with In Silico Solutions and Oak Bioinformatics LLC, with funding provided by the National Cancer Institute’s ITCR program."
Professor Rachel Karchin, Institute for Computational Medicine, Johns Hopkins University: "Microsoft Azure makes large-scale integrative analysis with OpenCRAVAT easy in interactive environments like Jupyter Notebooks and RStudio."
"OpenCRAVAT is a modular Python package that is available from PyPI and can be installed with pip. It takes a file of genomic variants as input. The most common input format is a VCF file, but other formats are supported, including dbSNP identifiers and the 23andMe and Ancestry.com file formats. The analysis performed by OpenCRAVAT depends on user-selected annotation and visualization options, which are available for download from the free OpenCRAVAT Store. In addition to the interactive user interface, OpenCRAVAT provides several output formats, including text reports, Excel spreadsheets, and a SQLite database of results used by cravat_view.
There are more than 150 different modules in the store. Each module can be assigned one or more tags, which include allele frequency, cancer, cardiovascular, clinical relevance, converters, evolution, functional studies, genes, interactions, literature, non-coding, reporters, variant effect prediction, variants, and visualization.
Reporters (output formats): Text format, Excel, TSV, CSV, Annotated VCF"
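Because the results are stored in a SQLite database, they can be inspected with Python's standard library alone. The sketch below is a minimal illustration, not part of OpenCRAVAT itself: the function name `summarize_result_db` is hypothetical, and the default table name `variant` is an assumption about the result layout, so list the tables first and adjust as needed.

```python
import sqlite3


def summarize_result_db(db_path, table="variant", limit=5):
    """Return (table_names, row_count, sample_rows) for a SQLite result file.

    The default table name "variant" is an assumption; if the table is not
    present, the function returns the available table names with no rows.
    """
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        # List every table so the caller can see what the file contains.
        cur.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        tables = [row[0] for row in cur.fetchall()]
        if table not in tables:
            return tables, 0, []
        # The table name was checked against sqlite_master above, so it is
        # safe to place it in the query text.
        cur.execute(f'SELECT COUNT(*) FROM "{table}"')
        count = cur.fetchone()[0]
        cur.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        rows = cur.fetchall()
        return tables, count, rows
    finally:
        conn.close()
```

Calling `summarize_result_db("example.vcf.sqlite")` (a hypothetical file name) returns the table names, the number of rows in the chosen table, and the first few rows, which is usually enough to decide what to query next.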
Kym Pagel, Assistant Research Scientist, Institute for Computational Medicine, Johns Hopkins University: "Compared to similar services, the Azure interface for developers is much more intuitive, which is incredibly valuable as data get larger and more complex."
It is fairly simple to get OpenCRAVAT up and running on Microsoft Azure. We recommend the F2s v2 virtual machine (VM) for small jobs and the F16s v2 VM for heavier loads, such as multiple samples with whole genome sequencing. After the VM has started, SSH into it and run a few commands to install all necessary components:
To install OpenCRAVAT, run: pip3 install open-cravat
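Once the package is installed, an annotation job can be scripted from Python. The following is a minimal sketch rather than an official recipe: the helper names are hypothetical, `example.vcf` is a placeholder input, and the `-l` (reference genome), `-a` (annotators), and `-t` (report types) options reflect our understanding of the `oc run` command, so confirm them with `oc run -h` on your VM. The job is only launched if the `oc` executable is found.

```python
import shutil
import subprocess


def build_oc_run_command(input_path, genome="hg38", annotators=(), reports=("text",)):
    """Build the argument list for an `oc run` job (flags are assumptions)."""
    cmd = ["oc", "run", input_path, "-l", genome]
    if annotators:
        # Annotator modules must already be installed from the store.
        cmd += ["-a", *annotators]
    if reports:
        cmd += ["-t", *reports]
    return cmd


def run_if_installed(cmd):
    """Run the command only when its executable is on PATH; else return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)


# Example: annotate a placeholder VCF with one annotator and an Excel report.
example_cmd = build_oc_run_command(
    "example.vcf", annotators=["clinvar"], reports=["excel"]
)
print(" ".join(example_cmd))
```

Building the command as a list, instead of a single shell string, avoids quoting problems when input paths contain spaces.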
We recommend that users running a VM on Azure pull the store modules from the Azure Genomics Data Lake; this dataset is a mirror of the store at https://store.opencravat.org and https://run.opencravat.org. To facilitate this, we provide a small script for downloading the relevant modules.
Step 3. Select the relevant parameters for VM deployment: Subscription, Resource Group, etc.
Step 4. Once deployment is complete, use RDP to log in to the VM. Below is the desktop of the Azure Data Science VM for OpenCRAVAT. Users can find the installation instructions, the OpenCRAVAT documentation, and sample datasets in the mounted folder.
Step 5. OpenCRAVAT landing page. Users need to select the ‘Reference Genome’ and the path of the file.
Step 6. OpenCRAVAT store for on-demand annotation modules
OpenCRAVAT Datasets on Azure Genomics Data Lake
Users can explore and use the OpenCRAVAT datasets from the Azure Genomics Data Lake. For further information on using Azure Storage Explorer with this dataset, please visit the instructions.