Healthcare and Life Sciences Blog
Secured Nextflow Deployment on Azure

Venkat Malladi
Apr 28, 2025

Introduction 

The advancement of high-throughput genomic sequencing technologies, combined with cloud computing, has revolutionized the field of genomics. Azure provides a robust platform for genomic analysis, offering scalability, flexibility, and efficiency. This blog post explores how you can use Azure and Nextflow to accelerate genomic research and analysis.

Secured Nextflow Deployment Architecture 


The architecture for deploying Nextflow on Azure involves several key components to ensure security and efficiency: 

Private Network Integration 

All Azure resources are deployed inside a virtual network.

  • Private endpoints are used to access the services, including:
      • Storage Accounts (both I/O and Batch storage)
      • Azure Batch Account (both data plane and node management)
      • Key Vault
      • Azure Container Registry
  • Public network access is explicitly disabled for all services.
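The private-endpoint pattern above can be sketched in Bicep roughly as follows. This is a minimal illustration for a single endpoint (the blob sub-resource of the I/O storage account); the resource name and parameters are hypothetical, and the actual templates live in the deployment's Bicep code.

```bicep
param privateEndpointsSubnetId string   // existing subnet from the landing zone
param storageAccountId string           // resource ID of the I/O storage account
param location string = resourceGroup().location

resource blobPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-09-01' = {
  name: 'pe-stnfio-blob'   // placeholder name
  location: location
  properties: {
    subnet: { id: privateEndpointsSubnetId }
    privateLinkServiceConnections: [
      {
        name: 'blob-connection'
        properties: {
          privateLinkServiceId: storageAccountId
          // One endpoint per sub-resource (blob, dfs, file, ...), so an ADLS
          // Gen2 account typically needs both 'blob' and 'dfs' endpoints.
          groupIds: [ 'blob' ]
        }
      }
    ]
  }
}
```

Each endpoint also needs a matching A record in the corresponding privatelink DNS zone, which this design assumes already exists in the landing zone.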

Identity and Access Management 

  • User-assigned managed identity for Batch operations
  • RBAC assignments:
      • ACR Pull role for container image access
      • Storage Blob Data Contributor for storage operations
      • Key Vault Secrets User for accessing secrets
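One of these assignments might look like the following Bicep sketch, granting the AcrPull built-in role on the registry to the managed identity. Parameter names are hypothetical; the GUID is the well-known Azure built-in role definition ID for AcrPull.

```bicep
param managedIdentityPrincipalId string
param registryName string

resource registry 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: registryName
}

// Well-known definition ID of the AcrPull built-in role.
var acrPullRoleId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '7f951dda-4ed3-4680-a7ca-43fe172d538d')

resource acrPullAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Deterministic GUID name keeps redeployments idempotent.
  name: guid(registry.id, managedIdentityPrincipalId, acrPullRoleId)
  scope: registry   // scoped to the registry only: least privilege
  properties: {
    roleDefinitionId: acrPullRoleId
    principalId: managedIdentityPrincipalId
    principalType: 'ServicePrincipal'
  }
}
```

Scoping each assignment to the individual resource, rather than the resource group, is what enforces the least-privilege principle listed later in this post.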

Secrets Management 

Azure Key Vault integration stores:

  • Batch account keys 
  • Storage account keys 
  • ACR admin password 
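For instance, the Batch account key could be written to Key Vault at deployment time with a fragment along these lines (vault and secret names are placeholders; the key value is passed in as a secure parameter rather than hard-coded):

```bicep
param keyVaultName string
@secure()
param batchAccountKey string   // e.g. from the Batch module's listKeys output

resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

resource batchKeySecret 'Microsoft.KeyVault/vaults/secrets@2023-07-01' = {
  parent: keyVault
  name: 'batch-account-key'   // placeholder secret name
  properties: {
    value: batchAccountKey
  }
}
```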

Container Security 

  • Premium SKU with enhanced security features 
  • Admin account enabled for use by Nextflow, with the credentials stored in Key Vault 
  • Network rules set to deny public access 
  • Retention policy enabled (7 days) 
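A sketch of a registry definition reflecting these settings is shown below. The registry name is a placeholder; note that the network rule set and the retention policy (which applies to untagged manifests) require the Premium SKU.

```bicep
param location string = resourceGroup().location

resource registry 'Microsoft.ContainerRegistry/registries@2023-07-01' = {
  name: 'nfregistry'   // placeholder; must be globally unique
  location: location
  sku: { name: 'Premium' }   // Premium SKU enables the security features below
  properties: {
    adminUserEnabled: true          // credentials used by Nextflow, kept in Key Vault
    publicNetworkAccess: 'Disabled' // reachable only through the private endpoint
    networkRuleSet: { defaultAction: 'Deny' }
    policies: {
      retentionPolicy: { status: 'enabled', days: 7 }
    }
  }
}
```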

Storage Account Configuration 

  • Public network access disabled 
  • Segregated storage accounts with different hierarchical namespace (HNS) configurations; two storage accounts are created: 
      • Pipeline assets: HNS disabled, so that blob versioning can be enabled 
      • I/O files: ADLS Gen2 (HNS enabled) with file-level access controls 
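The two-account split can be sketched as follows; account names are placeholders, and the only deliberate difference between the two is the isHnsEnabled flag.

```bicep
param location string = resourceGroup().location

// Pipeline assets: HNS disabled, so blob versioning can be turned on.
resource assetsAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stnfassets001'   // placeholder; must be globally unique
  location: location
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    isHnsEnabled: false
    publicNetworkAccess: 'Disabled'
    allowBlobPublicAccess: false
    minimumTlsVersion: 'TLS1_2'
  }
}

// I/O files: ADLS Gen2 (HNS enabled) for directory- and file-level ACLs.
resource ioAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stnfio001'   // placeholder; must be globally unique
  location: location
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    isHnsEnabled: true
    publicNetworkAccess: 'Disabled'
    allowBlobPublicAccess: false
    minimumTlsVersion: 'TLS1_2'
  }
}
```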

Azure Batch Security 

  • Network-isolated compute nodes 
  • Communication through the virtual network only (no public IPs for nodes) 
  • Container images pulled via the private ACR endpoint 
  • Segregated compute pools for different workload types: 
      • Dedicated pool for Nextflow runners: smaller VMs (1 vCPU and 2 GB RAM) 
      • Separate pool for pipeline execution: larger VMs for pipeline tasks 
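As an illustration, the small head-node pool could be declared as follows. The pool name and subnet parameter are hypothetical; Standard_B1ms matches the 1 vCPU / 2 GB sizing mentioned above, and the image reference is one of the Batch-published container-ready Ubuntu images.

```bicep
param batchAccountName string
param batchSubnetId string   // delegated subnet inside the virtual network

resource batchAccount 'Microsoft.Batch/batchAccounts@2023-05-01' existing = {
  name: batchAccountName
}

resource headnodePool 'Microsoft.Batch/batchAccounts/pools@2023-05-01' = {
  parent: batchAccount
  name: 'nextflow-headnodes'   // placeholder pool ID
  properties: {
    vmSize: 'Standard_B1ms'    // 1 vCPU, 2 GB RAM: enough for the Nextflow runner
    deploymentConfiguration: {
      virtualMachineConfiguration: {
        imageReference: {
          publisher: 'microsoft-azure-batch'
          offer: 'ubuntu-server-container'
          sku: '20-04-lts'
          version: 'latest'
        }
        nodeAgentSkuId: 'batch.node.ubuntu 20.04'
      }
    }
    networkConfiguration: {
      subnetId: batchSubnetId
      // Nodes receive no public IPs; all traffic stays inside the VNet.
      publicIPAddressConfiguration: { provision: 'NoPublicIPAddresses' }
    }
    scaleSettings: { fixedScale: { targetDedicatedNodes: 1 } }
  }
}
```

The pipeline-execution pool follows the same shape with a larger vmSize and its own scale settings.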

Security Best Practices Implemented 

  • Zero trust network design with no public endpoints 
  • Principle of least privilege through RBAC 
  • Secrets management through Key Vault 
  • Network isolation through private endpoints 
  • Container image security through Premium ACR 

 

Assumptions on Existing Resources 

The following resources are assumed to exist as part of an Enterprise Landing Zone and are not included in the Bicep templates: 

  • Virtual network, with peering to the hub or another means of connection to the on-premises network. 
  • Private Endpoints Subnet. 
  • DNS Zones and Records for private endpoints. 

 

For detailed deployment steps and the Bicep code:

Run Nextflow Pipeline 

A sample Nextflow pipeline is included under the Nextflow folder. To run this pipeline: 

  1. Upload the samples and references folders to the I/O Storage Account under the Nextflow blob container.
  2. Set the following environment variables in your terminal: 

 

  • RESOURCE_GROUP: the name of the Azure resource group where resources are deployed 
  • BATCH_ACCOUNT_URL: the URL endpoint of the Azure Batch account 
  • BATCH_ACCOUNT_NAME: the name of the Azure Batch account 
  • POOL_ID: the ID of the Batch Nextflow head-node pool 
  • STORAGE_ACCOUNT_NAME: the name of the storage account used for pipeline I/O 
  • AUTOSTORAGE_ACCOUNT_NAME: the name of the storage account used for pipeline assets 
  • KEYVAULT_NAME: the name of the Key Vault storing secrets and certificates 
  • CONTAINER_NAME: the name of the container registry where the Nextflow image and other images used in pipeline tasks are stored 
  • LOCAL_FOLDER_PATH: the local path where the Nextflow pipeline files are located 
  • MANAGED_IDENTITY_RESOURCE_ID: the resource ID of the user-assigned managed identity used by the Batch account and pool nodes to access Storage and Key Vault, in the format /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name> 
  • OUTPUT_PATH: the path in the I/O Storage Account where pipeline outputs of the run will be stored 

 
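Setting these variables in the terminal might look like the following (every value below is a placeholder; substitute the names from your own deployment):

```shell
# Placeholder values only -- replace each one with your deployment's names.
export RESOURCE_GROUP="rg-nextflow-secure"
export BATCH_ACCOUNT_NAME="nfbatchacct"
export BATCH_ACCOUNT_URL="https://nfbatchacct.westus2.batch.azure.com"
export POOL_ID="nextflow-headnodes"
export STORAGE_ACCOUNT_NAME="stnfio001"        # I/O storage account
export AUTOSTORAGE_ACCOUNT_NAME="stnfassets001" # pipeline assets account
export KEYVAULT_NAME="kv-nextflow"
export CONTAINER_NAME="nfregistry"             # container registry name
export LOCAL_FOLDER_PATH="$HOME/nextflow-pipeline"
export MANAGED_IDENTITY_RESOURCE_ID="/subscriptions/<subscription-id>/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>"
export OUTPUT_PATH="results/run-001"
```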

Set the values for the storage account name, Batch account name, and registry server in the nextflow.config file. Use the submit_job.py script to upload the Nextflow pipeline code and to define and submit a Batch job that runs the pipeline. The Storage Blob Data Contributor (or Owner) role and the Batch Account Contributor (or Owner) role are required to run this script. Once the Batch job completes, check the I/O storage account for the results. 
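The Azure-related portion of nextflow.config might look roughly like this sketch, using the nf-azure plugin's configuration scopes (account and registry names are placeholders; credentials come from Key Vault at submission time rather than being hard-coded here):

```groovy
// Minimal sketch of the Azure settings in nextflow.config (nf-azure plugin).
process.executor = 'azurebatch'

azure {
    storage {
        accountName = 'stnfio001'        // I/O storage account (placeholder)
    }
    batch {
        location     = 'westus2'         // placeholder region
        accountName  = 'nfbatchacct'     // Batch account (placeholder)
        autoPoolMode = false             // use the pre-created, network-isolated pools
    }
    registry {
        server = 'nfregistry.azurecr.io' // private ACR endpoint (placeholder)
    }
}
```

Disabling autoPoolMode is the design choice that keeps all tasks on the pre-provisioned, VNet-attached pools described earlier, instead of letting Nextflow create pools on the fly.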

Case Study: Using This Architecture for a Genomic Analysis Pipeline in Clinical Laboratories 

In recent years, the laboratory diagnosis of viral infections has evolved significantly, transitioning from viral cultures to molecular diagnostics. Despite these advancements, antiviral drug resistance (AVDR) testing remains largely limited to reference laboratories. With increasing clinical demand for timely AVDR results to inform treatment decisions, there is a need to expand access to this testing. The St. Paul's Hospital virology laboratory provides AVDR testing for CMV and HBV for the province of British Columbia, Canada. The team developed a bioinformatics pipeline that can be used in the clinical laboratory to automate the interpretation of AVDR results and serve as a stable platform for data analysis of future NGS-based clinical assays. The work described here builds a genomics infrastructure capable of processing complex next-generation sequencing (NGS) data efficiently and securely, leveraging Nextflow as the orchestrator and Azure Batch as the compute platform. 

Conclusion 

Azure and Nextflow offer a powerful platform for accelerating genomic research. By leveraging Azure's scalability, flexibility, and efficiency, scientists and researchers can streamline their workflows and achieve faster, more accurate results. The success of the project deployed by St. Paul's Hospital highlights an architecture that other healthcare entities can use as a template for developing in-house genomics analysis with minimal overhead and maintenance costs. 

Acknowledgments:   

We would like to acknowledge the leadership of Dr. Daniel T. Holmes and Dr. Marc Romney, and the contributions of Dr. Christopher F. Lowe, Dr. Nancy Matic, and Dr. Gordon Ritchie from the Department of Pathology and Laboratory Medicine, St Paul's Hospital, Vancouver, Canada, and of Mahdi Mobini from Daric Clouding Solutions Inc., who developed and deployed this solution in the HealthBC Landing Zone on Azure. 

Updated Apr 29, 2025
Version 4.0