Healthcare and Life Sciences Blog
Secured Nextflow Deployment on Azure

Venkat Malladi
Apr 28, 2025

Introduction 

The advancement of high-throughput genomic sequencing technologies, combined with cloud computing, has revolutionized the field of genomics. Azure provides a robust platform for genomic analysis, offering scalability, flexibility, and efficiency. This blog post explores how you can use Azure and Nextflow to accelerate genomic research and analysis.

Secured Nextflow Deployment Architecture 


The architecture for deploying Nextflow on Azure involves several key components to ensure security and efficiency: 

Private Network Integration 

All Azure resources are deployed inside a virtual network.

  • Private endpoints are used to access the services, including:
      • Storage Accounts (both I/O and Batch storage)
      • Azure Batch Account (both data plane and node management)
      • Key Vault
      • Azure Container Registry
  • Public network access is explicitly disabled for all services.
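The private-endpoint pattern above can be sketched in Bicep roughly as follows. This is a minimal illustration for a single endpoint (the blob sub-resource of the I/O storage account); the resource name and parameters are hypothetical, and the actual templates live in the deployment's Bicep code.

```bicep
param privateEndpointsSubnetId string   // existing subnet from the landing zone
param storageAccountId string           // resource ID of the I/O storage account
param location string = resourceGroup().location

resource blobPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-09-01' = {
  name: 'pe-stnfio-blob'   // placeholder name
  location: location
  properties: {
    subnet: { id: privateEndpointsSubnetId }
    privateLinkServiceConnections: [
      {
        name: 'blob-connection'
        properties: {
          privateLinkServiceId: storageAccountId
          // One endpoint per sub-resource (blob, dfs, file, ...), so an ADLS
          // Gen2 account typically needs both 'blob' and 'dfs' endpoints.
          groupIds: [ 'blob' ]
        }
      }
    ]
  }
}
```

Each endpoint also needs a matching A record in the corresponding privatelink DNS zone, which this design assumes already exists in the landing zone.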

Identity and Access Management 

  • User-assigned managed identity for Batch operations
  • RBAC assignments:
      • ACR Pull role for container image access
      • Storage Blob Data Contributor for storage operations
      • Key Vault Secrets User for accessing secrets
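One of these assignments might look like the following Bicep sketch, granting the AcrPull built-in role on the registry to the managed identity. Parameter names are hypothetical; the GUID is the well-known Azure built-in role definition ID for AcrPull.

```bicep
param managedIdentityPrincipalId string
param registryName string

resource registry 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: registryName
}

// Well-known definition ID of the AcrPull built-in role.
var acrPullRoleId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '7f951dda-4ed3-4680-a7ca-43fe172d538d')

resource acrPullAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Deterministic GUID name keeps redeployments idempotent.
  name: guid(registry.id, managedIdentityPrincipalId, acrPullRoleId)
  scope: registry   // scoped to the registry only: least privilege
  properties: {
    roleDefinitionId: acrPullRoleId
    principalId: managedIdentityPrincipalId
    principalType: 'ServicePrincipal'
  }
}
```

Scoping each assignment to the individual resource, rather than the resource group, is what enforces the least-privilege principle listed later in this post.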

Secrets Management 

Azure Key Vault integration stores:

  • Batch account keys 
  • Storage account keys 
  • ACR admin password 
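For instance, the Batch account key could be written to Key Vault at deployment time with a fragment along these lines (vault and secret names are placeholders; the key value is passed in as a secure parameter rather than hard-coded):

```bicep
param keyVaultName string
@secure()
param batchAccountKey string   // e.g. from the Batch module's listKeys output

resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

resource batchKeySecret 'Microsoft.KeyVault/vaults/secrets@2023-07-01' = {
  parent: keyVault
  name: 'batch-account-key'   // placeholder secret name
  properties: {
    value: batchAccountKey
  }
}
```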

Container Security 

  • Premium SKU with enhanced security features 
  • Admin account enabled for use by Nextflow, with the credentials stored in Key Vault 
  • Network rules set to deny public access 
  • Retention policy enabled (7 days) 
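A sketch of a registry definition reflecting these settings is shown below. The registry name is a placeholder; note that the network rule set and the retention policy (which applies to untagged manifests) require the Premium SKU.

```bicep
param location string = resourceGroup().location

resource registry 'Microsoft.ContainerRegistry/registries@2023-07-01' = {
  name: 'nfregistry'   // placeholder; must be globally unique
  location: location
  sku: { name: 'Premium' }   // Premium SKU enables the security features below
  properties: {
    adminUserEnabled: true          // credentials used by Nextflow, kept in Key Vault
    publicNetworkAccess: 'Disabled' // reachable only through the private endpoint
    networkRuleSet: { defaultAction: 'Deny' }
    policies: {
      retentionPolicy: { status: 'enabled', days: 7 }
    }
  }
}
```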

Storage Account Configuration 

  • Public network access disabled 
  • Segregated storage accounts with different hierarchical namespace (HNS) configurations; two storage accounts are created: 
      • Pipeline assets: HNS disabled, so that blob versioning can be enabled 
      • I/O files: ADLS Gen2 (HNS enabled) with file-level access controls 
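The two-account split can be sketched as follows; account names are placeholders, and the only deliberate difference between the two is the isHnsEnabled flag.

```bicep
param location string = resourceGroup().location

// Pipeline assets: HNS disabled, so blob versioning can be turned on.
resource assetsAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stnfassets001'   // placeholder; must be globally unique
  location: location
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    isHnsEnabled: false
    publicNetworkAccess: 'Disabled'
    allowBlobPublicAccess: false
    minimumTlsVersion: 'TLS1_2'
  }
}

// I/O files: ADLS Gen2 (HNS enabled) for directory- and file-level ACLs.
resource ioAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stnfio001'   // placeholder; must be globally unique
  location: location
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    isHnsEnabled: true
    publicNetworkAccess: 'Disabled'
    allowBlobPublicAccess: false
    minimumTlsVersion: 'TLS1_2'
  }
}
```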

Azure Batch Security 

  • Network-isolated compute nodes 
  • Communication through the virtual network only (no public IPs for nodes) 
  • Container images pulled via the private ACR endpoint 
  • Segregated compute pools for different workload types: 
      • Dedicated pool for Nextflow runners: smaller VMs (1 vCPU and 2 GB RAM) 
      • Separate pool for pipeline execution: larger VMs for pipeline tasks 
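As an illustration, the small head-node pool could be declared as follows. The pool name and subnet parameter are hypothetical; Standard_B1ms matches the 1 vCPU / 2 GB sizing mentioned above, and the image reference is one of the Batch-published container-ready Ubuntu images.

```bicep
param batchAccountName string
param batchSubnetId string   // delegated subnet inside the virtual network

resource batchAccount 'Microsoft.Batch/batchAccounts@2023-05-01' existing = {
  name: batchAccountName
}

resource headnodePool 'Microsoft.Batch/batchAccounts/pools@2023-05-01' = {
  parent: batchAccount
  name: 'nextflow-headnodes'   // placeholder pool ID
  properties: {
    vmSize: 'Standard_B1ms'    // 1 vCPU, 2 GB RAM: enough for the Nextflow runner
    deploymentConfiguration: {
      virtualMachineConfiguration: {
        imageReference: {
          publisher: 'microsoft-azure-batch'
          offer: 'ubuntu-server-container'
          sku: '20-04-lts'
          version: 'latest'
        }
        nodeAgentSkuId: 'batch.node.ubuntu 20.04'
      }
    }
    networkConfiguration: {
      subnetId: batchSubnetId
      // Nodes receive no public IPs; all traffic stays inside the VNet.
      publicIPAddressConfiguration: { provision: 'NoPublicIPAddresses' }
    }
    scaleSettings: { fixedScale: { targetDedicatedNodes: 1 } }
  }
}
```

The pipeline-execution pool follows the same shape with a larger vmSize and its own scale settings.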

Security Best Practices Implemented 

  • Zero trust network design with no public endpoints 
  • Principle of least privilege through RBAC 
  • Secrets management through Key Vault 
  • Network isolation through private endpoints 
  • Container image security through Premium ACR 

 

Assumptions on Existing Resources 

The following resources are assumed to exist as part of an Enterprise Landing Zone and are not included in the Bicep templates: 

  • Virtual network, with peering to the hub or another means of connection to the on-premises network. 
  • Private Endpoints Subnet. 
  • DNS Zones and Records for private endpoints. 

 

For detailed deployment steps and the Bicep code:

Run Nextflow Pipeline 

A sample Nextflow pipeline is included under the Nextflow folder. To run this pipeline: 

  1. Upload the samples and references folders to the I/O Storage Account under the Nextflow blob container.
  2. Set the following environment variables in your terminal: 

 

  • RESOURCE_GROUP: the name of the Azure resource group where resources are deployed 
  • BATCH_ACCOUNT_URL: the URL endpoint of the Azure Batch account 
  • BATCH_ACCOUNT_NAME: the name of the Azure Batch account 
  • POOL_ID: the ID of the Batch Nextflow head-node pool 
  • STORAGE_ACCOUNT_NAME: the name of the storage account used for pipeline I/O 
  • AUTOSTORAGE_ACCOUNT_NAME: the name of the storage account used for pipeline assets 
  • KEYVAULT_NAME: the name of the Key Vault storing secrets and certificates 
  • CONTAINER_NAME: the name of the container registry where the Nextflow image and other images used in pipeline tasks are stored 
  • LOCAL_FOLDER_PATH: the local path where the Nextflow pipeline files are located 
  • MANAGED_IDENTITY_RESOURCE_ID: the resource ID of the user-assigned managed identity used by the Batch account and pool nodes to access Storage and Key Vault, in the format /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name> 
  • OUTPUT_PATH: the path in the I/O Storage Account where pipeline outputs of the run will be stored 

 
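Setting these variables in the terminal might look like the following (every value below is a placeholder; substitute the names from your own deployment):

```shell
# Placeholder values only -- replace each one with your deployment's names.
export RESOURCE_GROUP="rg-nextflow-secure"
export BATCH_ACCOUNT_NAME="nfbatchacct"
export BATCH_ACCOUNT_URL="https://nfbatchacct.westus2.batch.azure.com"
export POOL_ID="nextflow-headnodes"
export STORAGE_ACCOUNT_NAME="stnfio001"        # I/O storage account
export AUTOSTORAGE_ACCOUNT_NAME="stnfassets001" # pipeline assets account
export KEYVAULT_NAME="kv-nextflow"
export CONTAINER_NAME="nfregistry"             # container registry name
export LOCAL_FOLDER_PATH="$HOME/nextflow-pipeline"
export MANAGED_IDENTITY_RESOURCE_ID="/subscriptions/<subscription-id>/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>"
export OUTPUT_PATH="results/run-001"
```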

Set the values for the storage account name, Batch account name, and registry server in the nextflow.config file. Use the submit_job.py script to upload the Nextflow pipeline code and to define and submit a Batch job that runs the pipeline. The Storage Blob Data Contributor (or Owner) role and the Batch Account Contributor (or Owner) role are required to run this script. Once the Batch job completes, check the I/O storage account for the results. 
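The Azure-related portion of nextflow.config might look roughly like this sketch, using the nf-azure plugin's configuration scopes (account and registry names are placeholders; credentials come from Key Vault at submission time rather than being hard-coded here):

```groovy
// Minimal sketch of the Azure settings in nextflow.config (nf-azure plugin).
process.executor = 'azurebatch'

azure {
    storage {
        accountName = 'stnfio001'        // I/O storage account (placeholder)
    }
    batch {
        location     = 'westus2'         // placeholder region
        accountName  = 'nfbatchacct'     // Batch account (placeholder)
        autoPoolMode = false             // use the pre-created, network-isolated pools
    }
    registry {
        server = 'nfregistry.azurecr.io' // private ACR endpoint (placeholder)
    }
}
```

Disabling autoPoolMode is the design choice that keeps all tasks on the pre-provisioned, VNet-attached pools described earlier, instead of letting Nextflow create pools on the fly.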

Case Study: Using This Architecture for a Genomic Analysis Pipeline in Clinical Laboratories 

In recent years, the laboratory diagnosis of viral infections has evolved significantly, transitioning from viral cultures to molecular diagnostics. Despite these advancements, antiviral drug resistance (AVDR) testing remains largely limited to reference laboratories. With increasing clinical demand for timely AVDR results to inform treatment decisions, there is a need to expand access to this testing. The St. Paul's Hospital virology laboratory provides AVDR testing for CMV and HBV for the province of British Columbia, Canada. The team developed a bioinformatics pipeline that can be used in the clinical laboratory to automate the interpretation of AVDR results and serve as a stable platform for data analysis of future NGS-based clinical assays. The work described here builds a genomics infrastructure capable of processing complex next-generation sequencing (NGS) data efficiently and securely, leveraging Nextflow as the orchestrator and Azure Batch as the compute platform. 

Conclusion 

Azure and Nextflow offer a powerful platform for accelerating genomic research. By leveraging Azure's scalability, flexibility, and efficiency, scientists and researchers can streamline their workflows and achieve faster, more accurate results. The success of the project deployed by St. Paul's Hospital highlights an architecture that other healthcare entities can use as a template for developing in-house genomics analysis with minimal overhead and maintenance costs. 

Acknowledgments:   

We would like to acknowledge the leadership of Dr. Daniel T. Holmes and Dr. Marc Romney, and the contributions of Dr. Christopher F. Lowe, Dr. Nancy Matic, and Dr. Gordon Ritchie from the Department of Pathology and Laboratory Medicine, St Paul's Hospital, Vancouver, Canada, and of Mahdi Mobini from Daric Clouding Solutions Inc., who developed and deployed this solution in the HealthBC Landing Zone on Azure. 

Updated Apr 29, 2025
Version 4.0