Blog Post

Healthcare and Life Sciences Blog
8 MIN READ

Genomic workflow managers on Microsoft Azure

manoj1116's avatar
manoj1116
Icon for Microsoft rankMicrosoft
Feb 21, 2023

Genomic secondary analysis is the process of processing and analyzing genomic data obtained from next-generation sequencing (NGS) technologies. NGS technologies enable the generation of large amounts of DNA or RNA sequence data from various biological samples, such as blood, tissue, saliva, etc. Genomic secondary analysis aims to extract meaningful information from these raw data and identify genomic variants that are associated with diseases, traits, or functions.

 

The main steps of genomic secondary analysis are:

  • Read filtering and trimming: This step involves removing low-quality reads, adapter sequences, duplicates, and other contaminants from the raw data. This improves the accuracy and efficiency of downstream analysis.
  • Alignment or assembly: This step involves mapping the reads to a reference genome (alignment) or constructing a new genome from scratch (assembly). Alignment is used for known genomes or genomes with minor variations, while assembly is used for novel genomes or genomes with large structural variations. The main output of this step is a BAM file containing aligned reads.
  • Variant calling: This step involves identifying and annotating genomic variants such as single nucleotide polymorphisms (SNPs), insertions and deletions (indels), copy number variations (CNVs), structural variations (SVs), etc. The main output of this step is a VCF file containing variant information.

Genomic secondary analysis can be performed using various tools and pipelines that are available online or on cloud platforms such as Microsoft Azure. Some examples of tools and pipelines are:

  • GATK: A toolkit developed by the Broad Institute for variant discovery and genotyping in human genomes.
  • BWA-MEM: A fast alignment tool that uses Burrows-Wheeler transform and maximal exact matches.
  • SPAdes: A de novo assembly tool that handles various types of NGS data.
  • ANNOVAR: A tool for functional annotation of genomic variants.

Genomic secondary analysis is an essential step for understanding the genetic basis of diseases, traits, and functions. It can provide insights into disease diagnosis, prognosis, treatment, prevention, evolution, phylogeny, etc.

 

Genomic workflow managers are software tools designed to automate and optimize the various steps involved in genomic data analysis. These tools are becoming increasingly important in the field of genomics, as the amount of data generated by sequencing technologies continues to grow at an unprecedented rate. By streamlining and standardizing the analysis process, genomic workflow managers can help researchers to save time, reduce errors, and improve the reproducibility of their results.

 

One of the key benefits of using a genomic workflow manager is that it allows researchers to automate the entire analysis pipeline, from start to finish. This means that they can spend less time on routine data processing tasks and more time on the more creative and intellectually challenging aspects of their research. In addition, by using a workflow manager, researchers can ensure that their analyses are carried out in a standardized and reproducible way, which is essential for ensuring that their results are trustworthy and can be compared to those of other studies.

 

There are many different genomic workflow managers available, each with their own strengths and weaknesses. Some examples include Snakemake, Nextflow, Cromwell, and Galaxy, among others. These tools vary in terms of their ease of use, scalability, compatibility with different computing environments, and the range of workflows they support. In general, however, they all offer a powerful set of features for managing and automating genomic data analysis.

 

Microsoft supports various genomics workflow managers on different Azure cloud technologies. Following is a summary view of all the important workflow managers and their support on various job schedulers:

  Azure batch Azure Cycle cloud Azure Kubernetes Service (AKS)
Cromwell  
Nextflow  
Snakemake  
Galaxy    

 

Cromwell on AzureCromwell on Azure is a system that enables users to run scientific workflows on the Azure cloud platform. It is based on Cromwell, a workflow management system developed by the Broad Institute for genomics analysis. Cromwell on Azure is an open-source project that adapts Cromwell to run natively on Azure using the GA4GH Task Execution Service (TES) backend. Users can download and run an executable file from GitHub that will automatically deploy an instance of Cromwell on Azure in their Azure subscription. Users can then upload their workflow files and input data to a storage account and submit them to Cromwell on Azure for execution. Users can also customize various settings such as VM size, disk type, region, etc. Cromwell on Azure is a powerful and flexible solution for running scientific workflows on the cloud. It enables users to take advantage of the scalability, reliability, and security of Azure while using their preferred workflow languages and tools.

 

Nextflow on AzureNextflow on Azure is a system that enables users to run data-intensive workflows on the Azure cloud platform. It is based on Nextflow, a workflow management system developed by Seqera Labs for bioinformatics and computational biology. Nextflow on Azure is an open-source project that integrates Nextflow with Azure services such as Azure Batch and Azure Kubernetes Service (AKS). Users can install Nextflow using Conda or Docker and configure it to use Azure Blob Storage as the shared file system. Users can then write their workflow scripts using DSL2 or other languages and submit them to Nextflow for execution. Users can also customize various settings such as VM size, disk type, region, etc.

 

Snakemake on AzureSnakemake on Azure is a system that enables users to run bioinformatics workflows on the Azure cloud platform. It is based on Snakemake, a workflow management system developed by Johannes Köster and colleagues for reproducible and scalable data analysis. Snakemake on Azure integrates Snakemake with Azure services such as Azure Batch and Azure CycleCloud. It simplifies the deployment and configuration of workflows and their dependencies using containers and cloud storage. It leverages Azure Batch or Azure CycleCloud to dynamically provision and scale compute resources according to the workload demand and user preferences. It supports a wide range of workflow languages and tools, such as Python, R, SAMtools, BCFtools, etc. It allows users to monitor and manage their workflows through a web-based dashboard or a command-line interface.

 

Users can install Snakemake using Conda or Docker and configure it to use Azure Blob Storage as the remote provider. Users can then write their workflow scripts using Python-based language and submit them to Snakemake for execution. Users can also customize various settings such as VM size, disk type, region, etc. Snakemake on Azure is a powerful and flexible solution for running bioinformatics workflows on the cloud. It enables users to take advantage of the scalability, reliability, and security of Azure while using their preferred workflow languages and tools.

 

Galaxy on Azure - Galaxy on Azure is a solution that combines Galaxy and Azure CycleCloud to provide an easy and efficient way of conducting genomic analysis on the cloud. Users can deploy Galaxy on Azure using Azure CycleCloud templates that automate the installation and configuration of Galaxy servers, databases, storage, and compute nodes. Galaxy on Azure has several benefits for genomic research. It allows users to access a large collection of tools for different types of analysis, such as DNA sequencing, RNA sequencing, metagenomics, proteomics, epigenetics, etc. Moreover, it leverages Azure’s features such as security, compliance, reliability, availability zones, etc. 

 

Azure Batch - Azure Batch is a cloud-based service provided by Microsoft that allows users to run large-scale parallel and high-performance computing (HPC) workloads in the cloud. The service is designed to help organizations scale their workloads easily, without having to worry about the infrastructure management, and take advantage of the computing power and scalability of the cloud. With Azure Batch, users can deploy and manage large-scale batch processing jobs across a pool of virtual machines (VMs) or dedicated nodes in the cloud. The service enables users to run a variety of workloads, including compute-intensive tasks such as data analysis, rendering, and scientific simulations. One of the key benefits of Azure Batch is its ability to automatically scale resources up or down as needed based on workload demand. Users can configure their job to run on a specific number of VMs, and Azure Batch will automatically provision and manage the required resources to meet the workload demands. This makes it easy for organizations to scale their workloads to meet changing demands without having to invest in expensive infrastructure.

 

Azure CycleCloud Azure CycleCloud is a cloud-based solution from Microsoft that simplifies the management of high-performance computing (HPC) and other large-scale computing environments. It is designed to help businesses and organizations create, manage, and optimize clusters of virtual machines in the cloud, providing users with easy access to powerful computing resources. One of the key benefits of Azure CycleCloud is that it helps businesses save time and money by automating many of the tasks associated with creating and managing HPC clusters. The solution provides a web-based interface that allows users to easily create and configure virtual machines, set up data storage, and manage their cluster's resources. Another benefit of Azure CycleCloud is that it provides users with the ability to scale their computing resources up or down depending on their needs. This means that businesses can quickly increase their computing power during times of high demand and then scale back down when demand decreases, helping them save money on computing resources.

 

Azure Kubernetes ServiceAzure Kubernetes Service or AKS, is a managed container orchestration service provided by Microsoft Azure. It allows users to deploy and manage containerized applications using the popular open-source container orchestration system, Kubernetes. AKS offers several benefits for developers and IT professionals who want to run their applications in containers. For one, it simplifies the process of deploying and managing Kubernetes clusters, as Microsoft takes care of the underlying infrastructure and management tasks. This means users can focus on developing and deploying their applications instead of managing the infrastructure.

 

Another benefit of AKS is its scalability. The service automatically scales applications up or down based on demand, ensuring that resources are used efficiently and cost-effectively. Additionally, AKS integrates with other Azure services, making it easy to incorporate other Azure tools like Azure DevOps and Azure Active Directory. AKS also provides a high level of security for containerized applications. It includes features like network security groups, Azure AD authentication, and Kubernetes RBAC (Role-Based Access Control) to ensure that only authorized users have access to the applications and data.

 

References:

Updated Apr 29, 2024
Version 3.0
No CommentsBe the first to comment