
Healthcare and Life Sciences Blog

Accelerating pathogen identification by using Snakemake on Azure

Venkat_Malladi
Former Employee
Nov 10, 2024

Introduction 

The advancement of cloud-enabled, high-throughput genomic sequencing technologies has allowed pathogen genomics and metagenomics to become potent data-centric approaches to disease surveillance and identification. These cloud-based epidemiological tools are starting to play a crucial role in responding, at elastic scale, to the emergence and evolution of various pathogens, ranging from food-borne illnesses to disease epidemics.

Detecting these pathogens requires time and significant computational power. The need to process samples is expected to grow as demand rises from emerging industries and as sampling becomes more frequent. We are seeing customers use the scale and elasticity of the cloud to accelerate research output, enabling a step change in the quantity and scale of work across storage, analysis, classification, and dissemination of information.

One example is our work with Fonterra, a dairy manufacturing company. We worked specifically with the Fonterra Research and Development Centre (FRDC) to pilot moving their genomics analysis from on premises into the cloud, to assess what the speed and cost advantages could be.

Solution 

To improve analysis time and meet Fonterra’s requirements for running workloads, we teamed up to deploy Snakemake in their Azure subscription.

Snakemake is a workflow management system used to create reproducible and scalable data analyses. Workflows are described in a human-readable, Python-based language. Each workflow is composed of many interdependent tasks, which describe the required software and the processes that manipulate the data.
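To make the idea of interdependent tasks concrete, here is a minimal sketch that writes a two-rule workflow file. The rule name, the file paths, and the `wc -l` command are placeholders chosen for illustration; they are not part of Fonterra's pipeline.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a minimal workflow to ./Snakefile (this overwrites any existing file).
# File names and the `wc -l` command are illustrative placeholders only.
cat > Snakefile <<'EOF'
# Target rule: asks for the final report, which pulls in the rule below.
rule all:
    input:
        "results/sample1.report.txt"

# One task: reads a FASTQ file and writes a report.
rule summarise_reads:
    input:
        "data/sample1.fastq"
    output:
        "results/sample1.report.txt"
    shell:
        "wc -l {input} > {output}"
EOF

echo "Wrote Snakefile"
```

With Snakemake installed and `data/sample1.fastq` present, `snakemake --cores 1 -n` performs a dry run that lists the planned jobs without executing them. Snakemake works out the order of tasks by matching each rule's outputs to the inputs other rules request.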

By deploying Snakemake on Azure, Fonterra was able to reduce analysis time by 93%, demonstrating significant efficiency improvements over the on-premises solution and shortening the overall pathogen management process.

The following resources are deployed to run Snakemake in Azure:   

Azure resources 

  • Batch account - The Azure Batch account is used to spin up the virtual machines that run a workflow.  
  • Azure Container Registry - Used to build, store, and manage container images and artifacts in a private registry for all types of container deployments.
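As a rough sketch, the two Azure resources above could be created with the Azure CLI as follows. The resource group, region, and account names are hypothetical placeholders, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names; replace with values valid in your subscription.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-snakemake-demo}"
LOCATION="${LOCATION:-australiaeast}"
BATCH_ACCOUNT="${BATCH_ACCOUNT:-snakemakebatchdemo}"
ACR_NAME="${ACR_NAME:-snakemakeacrdemo}"

# Print each command by default; set DRY_RUN=0 to actually call the Azure CLI.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

provision() {
  # Resource group that holds both resources.
  run az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
  # Batch account that spins up the virtual machines for the workflow.
  run az batch account create --name "$BATCH_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" --location "$LOCATION"
  # Private registry for the workflow's container image.
  run az acr create --name "$ACR_NAME" \
    --resource-group "$RESOURCE_GROUP" --sku Basic
}

provision
```

Running the script as is prints the three `az` commands so they can be reviewed before anything is created.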

Non-Azure resources 

  • Docker-compatible machine: A system capable of building a Docker image 
  • Snakemake Docker container: A Docker container featuring Snakemake along with the supplementary bioinformatics tools required for the workflow. Note that these tools will be specific to your workflow. 
  • Snakemake files: A Snakemake workflow file containing the collection of rules that make up the workflow.
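The Snakemake Docker container described above could be built and published roughly as follows. This is a sketch under assumptions: the registry name, image name, and tags are hypothetical, and the tool installation line is left as a comment because it depends on your workflow. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names; replace with your registry and image.
ACR_NAME="${ACR_NAME:-snakemakeacrdemo}"
IMAGE="${IMAGE:-pathogen-snakemake}"
TAG="${TAG:-v1}"
IMAGE_REF="${ACR_NAME}.azurecr.io/${IMAGE}:${TAG}"

# Print each command by default; set DRY_RUN=0 to actually run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Dockerfile that extends the public Snakemake image (overwrites ./Dockerfile).
# Pin SNAKEMAKE_TAG to the Snakemake version you run locally.
cat > Dockerfile <<'EOF'
ARG SNAKEMAKE_TAG=latest
FROM snakemake/snakemake:${SNAKEMAKE_TAG}
# RUN <install the bioinformatics tools your workflow needs>
EOF

build_and_push() {
  run az acr login --name "$ACR_NAME"      # authenticate Docker to the registry
  run docker build --tag "$IMAGE_REF" .    # build from the Dockerfile above
  run docker push "$IMAGE_REF"             # publish to Azure Container Registry
}

build_and_push
```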

 

Architecture - Snakemake 

 

Steps: 

  1. Install Docker on either an on-premises or virtual machine 
  2. Create a Dockerfile that includes all the genomics tools required (using the Snakemake Docker image as the base image) 
  3. Build a Docker image from the Dockerfile 
  4. Publish this image to a container registry 
  5. Ensure the data is available in an Azure Blob Storage account 
  6. Edit or develop a Snakemake workflow for the analysis needed 
  7. On a machine in the network, install the Azure CLI and all dependencies (including Snakemake): 

    conda create -c bioconda -c conda-forge -n snakemake snakemake \
      msrest azure-batch azure-storage-blob azure-mgmt-batch azure-identity
    conda activate snakemake

  8. Export the Azure credentials as environment variables: 

    export AZ_BLOB_PREFIX=<Azure_Blob_name>
    export AZ_BATCH_ACCOUNT_URL="<AZ_BATCH_ACCOUNT_URL>"
    export AZ_BATCH_ACCOUNT_KEY="<AZ_BATCH_ACCOUNT_KEY>"
    export AZ_BLOB_ACCOUNT_URL="<AZ_BLOB_ACCOUNT_URL_with_SAS>"

  9. Execute Snakemake on the workflow using the Azure credentials
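The final step might look like the sketch below. It assumes a Snakemake 7.x release, where Azure Batch execution is selected with `--az-batch`; later major versions changed how executors are selected, so check the documentation for the version you installed. The `JOBS` and `CONTAINER_IMAGE` variables and the helper function names are illustrative, not part of the original deployment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fail early if any of the exported credentials is missing or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

run_workflow() {
  require_env AZ_BLOB_PREFIX AZ_BATCH_ACCOUNT_URL \
    AZ_BATCH_ACCOUNT_KEY AZ_BLOB_ACCOUNT_URL || return 1

  # Flags assume Snakemake 7.x; AZ_BATCH_ACCOUNT_KEY is read from the environment.
  local cmd=(snakemake
    --jobs "${JOBS:-3}"
    --default-remote-provider AzBlob
    --default-remote-prefix "$AZ_BLOB_PREFIX"
    --envvars AZ_BLOB_ACCOUNT_URL
    --az-batch
    --az-batch-account-url "$AZ_BATCH_ACCOUNT_URL"
    --container-image "${CONTAINER_IMAGE:-snakemake/snakemake}")

  # Print the command by default; set DRY_RUN=0 to submit to Azure Batch.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

After exporting the variables, calling `run_workflow` prints the command for review, and `DRY_RUN=0 run_workflow` runs it. Setting `CONTAINER_IMAGE` to the image published to your container registry makes the Batch nodes use your workflow-specific tools.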

Conclusion 

When comparing the same pipeline on premises to an Azure deployment, the team was able to reduce the total time for analysis by 93%. Based on these results, the team believes it is possible to take approximately 24 hours out of the pathogen management process.

 

Acknowledgments:  

We would like to acknowledge the leadership of Darren Kinsman and the support and contributions of Alex Novikov and Ji Zhang from FRDC, who helped test and pilot this solution in their Azure tenant.

 

Updated Nov 11, 2024
Version 2.0