Accelerating pathogen identification by using Snakemake on Azure

Former Employee

Nov 10, 2024

Introduction

The advancement of cloud-enabled, high-throughput genomic sequencing technologies has allowed pathogen genomics and metagenomics to become potent data-centric approaches to disease surveillance and identification. These epidemiological cloud instruments are starting to play a crucial role in enabling elastic scale when responding to the emergence and evolution of various pathogens, ranging from food-borne illnesses to disease epidemics.

Detecting these pathogens requires time and significant computational power. The need to process samples is expected to grow as demand from emerging industries rises and sampling becomes more frequent. We are seeing customers use the scale and elasticity of the cloud to accelerate research output, enabling a step change in the quantity and scale of work across storage, analysis, classification and dissemination of information.

One example is our work with Fonterra, a dairy manufacturing company. We worked specifically with the Fonterra Research and Development Centre (FRDC) to pilot lifting their genomics analysis from on premises into the cloud and to assess what the speed and cost advantages could be.

Solution

To improve analysis time and meet Fonterra’s requirements for running workloads, we teamed up to deploy Snakemake in their Azure subscription.

Snakemake is a workflow management system used to create reproducible and scalable data analyses. Workflows are described in a human-readable, Python-based language. Each workflow is composed of many interdependent tasks that describe the required software and the processes that manipulate the data.
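As a rough illustration of what such a workflow file looks like, the shell snippet below writes a minimal three-rule Snakefile. The sample IDs, file paths, database location and the choice of fastp and kraken2 as tools are assumptions made for this sketch; they are not Fonterra's actual pipeline.

```shell
# Write a minimal, illustrative Snakefile.
# Sample IDs, paths and tools are hypothetical placeholders.
cat > Snakefile <<'EOF'
SAMPLES = ["sampleA", "sampleB"]  # hypothetical sample IDs

# Target rule: request one classification report per sample.
rule all:
    input:
        expand("results/{sample}.report.txt", sample=SAMPLES)

# Trim raw reads; fastp stands in for whatever QC tool the workflow needs.
rule trim_reads:
    input:
        "data/{sample}.fastq.gz"
    output:
        "trimmed/{sample}.fastq.gz"
    shell:
        "fastp -i {input} -o {output}"

# Classify trimmed reads; the database path is a placeholder assumed
# to be available inside the container image.
rule classify:
    input:
        "trimmed/{sample}.fastq.gz"
    output:
        "results/{sample}.report.txt"
    params:
        db="db/kraken2"
    shell:
        "kraken2 --db {params.db} --report {output} {input} > /dev/null"
EOF
```

Snakemake works out the task order from the file names: the classify rule depends on trim_reads because its input matches the other rule's output. With input files in place, running snakemake -n in the same directory performs a dry run and lists the jobs that would be scheduled.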

By deploying Snakemake on Azure, Fonterra was able to reduce analysis time by 93%, demonstrating significant efficiency improvements over on-premises solutions and shortening the overall pathogen management process.

The following resources are deployed to run Snakemake in Azure:

Azure resources

Batch account - The Azure Batch account is used to spin up the virtual machines that run a workflow.

Storage account - The Azure Storage account holds the workflow data in Azure Blob Storage.

Azure Container Registry - Used to build, store, and manage container images and artifacts in a private registry for all types of container deployments.
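One way to create these three resources is with the Azure CLI. The sketch below is not the deployment used in the pilot: the resource group, account names, region and SKUs are hypothetical placeholders. By default it only prints the commands, so it can be reviewed before anything is created.

```shell
# Sketch: provision the Azure resources with the Azure CLI.
# All names, the region and the SKUs are hypothetical placeholders.
RESOURCE_GROUP="snakemake-rg"
LOCATION="australiaeast"
STORAGE_ACCOUNT="snakemakestore"   # 3-24 lowercase letters and digits
BATCH_ACCOUNT="snakemakebatch"     # 3-24 lowercase letters and digits
REGISTRY="snakemakeacr"            # 5-50 letters and digits

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

provision_resources() {
  run az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
  # Storage account that holds the workflow data in Blob Storage.
  run az storage account create --name "$STORAGE_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" --location "$LOCATION" --sku Standard_LRS
  # Batch account that spins up the virtual machines running the workflow.
  run az batch account create --name "$BATCH_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" --location "$LOCATION"
  # Private registry for the Snakemake container image.
  run az acr create --name "$REGISTRY" \
    --resource-group "$RESOURCE_GROUP" --sku Basic
}

provision_resources
```

Setting DRY_RUN=0 before calling provision_resources executes the commands against the subscription selected in the Azure CLI.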

Non-Azure resources

Docker-compatible machine - A system capable of building a Docker image.

Snakemake Docker container - A Docker container featuring Snakemake along with the supplementary bioinformatics tools required for the workflow. Note that these will be unique to your workflow.

Snakemake files - The collection of rules that make up the workflow, defined in a Snakemake workflow file.
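The container described above could be defined, built and published roughly as follows. This is a sketch under stated assumptions: the registry name, image name and tag, and the tool list (fastp, kraken2) are hypothetical, and it assumes conda and an environment named snakemake are available in the public Snakemake base image. By default the build and push commands are only printed.

```shell
# Sketch: define, build and publish the workflow container.
# Registry name, image name/tag and the tool list are hypothetical.
REGISTRY="snakemakeacr"
IMAGE="$REGISTRY.azurecr.io/snakemake-genomics:v1"

# The Dockerfile starts from the public Snakemake image and adds the tools.
# Assumes conda and a "snakemake" environment exist in the base image;
# pin a specific version tag instead of "stable" for reproducible builds.
cat > Dockerfile <<'EOF'
FROM snakemake/snakemake:stable
RUN conda install -y -n snakemake -c conda-forge -c bioconda fastp kraken2 \
    && conda clean --all --yes
EOF

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

build_and_push() {
  run az acr login --name "$REGISTRY"   # authenticate Docker against the registry
  run docker build -t "$IMAGE" .        # build from the Dockerfile written above
  run docker push "$IMAGE"              # publish the image to the registry
}

build_and_push
```

Setting DRY_RUN=0 runs the commands for real, which requires Docker and the Azure CLI on the build machine.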

Architecture - Snakemake

Steps:

1. Install Docker on either an on-premises or virtual machine.
2. Create a Dockerfile that includes all the genomics tools required, using the Snakemake Docker image as the base image.
3. Build a Docker image from the Dockerfile.
4. Publish this image to a container registry.
5. Ensure the data is available in an Azure Blob Storage account.
6. Edit or develop a Snakemake workflow for the analysis needed.
7. On a machine in the network, install the Azure CLI, then create a conda environment containing Snakemake and its Azure dependencies:

   conda create -c bioconda -c conda-forge -n snakemake snakemake \
       msrest azure-batch azure-storage-blob azure-mgmt-batch azure-identity
   conda activate snakemake

8. Export the Azure credentials as environment variables:

   export AZ_BLOB_PREFIX="<Azure_Blob_name>"
   export AZ_BATCH_ACCOUNT_URL="<AZ_BATCH_ACCOUNT_URL>"
   export AZ_BATCH_ACCOUNT_KEY="<AZ_BATCH_ACCOUNT_KEY>"
   export AZ_BLOB_ACCOUNT_URL="<AZ_BLOB_ACCOUNT_URL_with_SAS>"

9. Execute Snakemake on the workflow using the Azure credentials.
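The final step can be sketched as below. The flags follow the Azure Batch tutorial for Snakemake 7, which matches the packages installed above; newer Snakemake releases moved cloud execution into executor plugins, so treat the exact options as an assumption to check against your installed version. The CONTAINER_IMAGE variable, the job count and the helper function names are hypothetical.

```shell
# Fail early if any of the four credential variables is missing.
check_env() {
  for name in AZ_BLOB_PREFIX AZ_BATCH_ACCOUNT_URL AZ_BATCH_ACCOUNT_KEY AZ_BLOB_ACCOUNT_URL; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# With DRY_RUN=1 (the default here) the command is printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Submit the workflow to Azure Batch, reading and writing data in Blob Storage.
# CONTAINER_IMAGE is a hypothetical variable; it defaults to the public image.
run_workflow() {
  check_env || return 1
  run snakemake \
    --jobs 3 \
    --default-remote-provider AzBlob \
    --default-remote-prefix "$AZ_BLOB_PREFIX" \
    --envvars AZ_BLOB_ACCOUNT_URL \
    --az-batch \
    --az-batch-account-url "$AZ_BATCH_ACCOUNT_URL" \
    --container-image "${CONTAINER_IMAGE:-snakemake/snakemake}"
}
```

After exporting the credentials, calling run_workflow prints the command for review, and DRY_RUN=0 run_workflow submits it. Checking the variables first avoids starting a Batch pool only to fail on a missing key.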

Conclusion

When comparing the same pipeline on premises to an Azure deployment, the team reduced the total analysis time by 93%. Based on these results, the team believes it is possible to take approximately 24 hours out of the pathogen management process.

Acknowledgments:

We would like to acknowledge the leadership of Darren Kinsman and the supporting contributions of Alex Novikov and Ji Zhang from FRDC, who helped test and pilot this solution in their Azure tenant.

Updated Nov 11, 2024

Version 2.0
