Set up a secure lockdown environment using Cromwell on Azure
Published Jul 24 2023 08:43 AM
Microsoft

Introduction:  

 

In recent years, the field of genomics has seen a rapid increase in the amount of data generated by DNA sequencing technologies. As a result, many researchers are turning to cloud-based infrastructure to store, analyze, and share their genomic data. However, the sensitive nature of genomic data, which includes personal health information and proprietary data, creates significant security concerns. A highly secure, locked-down genomics infrastructure in the cloud is therefore necessary to protect sensitive data and to maintain compliance with data privacy and security regulations (HIPAA, FedRAMP). This type of infrastructure can provide a safe and scalable platform for researchers to perform their analyses while protecting the privacy of the individuals whose data is being used.

 

We worked with FMC Corporation, an agricultural sciences company, to enable such an environment.  They have a tightly controlled secure Azure environment and every deployment must pass through rigorous security and infosec reviews. As part of their partnership with Microsoft, they are in the process of migrating all their genomic workflows and solutions to Microsoft Azure.  

 

Both parties worked on the deployment methodology for VM-enabled Cromwell on Azure. Cromwell is a workflow management system for scientific workflows, orchestrating the computing tasks needed for genomics analysis. Originally developed by the Broad Institute, Cromwell is widely used to scale genome analysis pipelines. Cromwell supports running scripts at various scales, including your local machine, a local computing cluster, and on the cloud. 

 

Cromwell on Azure configures all of the Azure resources needed to run workflows through Cromwell on the Azure cloud and uses the GA4GH TES (Task Execution Service) backend for orchestrating the tasks on Azure Batch. The installation sets up a VM host to run the Cromwell server and uses Azure Batch to spin up virtual machines that run each task in a workflow. Cromwell workflows can be written using either the WDL or the CWL scripting language.
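
To make the role of TES more concrete, the following minimal sketch builds the kind of task document that a workflow engine hands to a TES backend. The field names (name, resources, executors, inputs, outputs) follow the GA4GH TES schema; the image name, storage URLs, file paths, and the make_tes_task helper are hypothetical and only serve as illustration. It is not the code Cromwell on Azure uses internally.

```python
import json


def make_tes_task(name, image, command, inputs, outputs,
                  cpu_cores=1, ram_gb=2.0, disk_gb=10.0):
    """Build a task document using field names from the GA4GH TES schema.

    inputs and outputs are lists of (url, path) pairs: the url is where the
    file lives in storage, the path is where it appears inside the container.
    """
    return {
        "name": name,
        "resources": {
            "cpu_cores": cpu_cores,
            "ram_gb": ram_gb,
            "disk_gb": disk_gb,
        },
        "executors": [{"image": image, "command": command}],
        "inputs": [{"url": url, "path": path} for url, path in inputs],
        "outputs": [{"url": url, "path": path} for url, path in outputs],
    }


# All names below are hypothetical placeholders.
task = make_tes_task(
    name="checksum-sample",
    image="exampleregistry.azurecr.io/ubuntu:22.04",
    command=["sh", "-c", "md5sum /data/sample.fastq > /data/sample.md5"],
    inputs=[("https://examplestorage.blob.core.windows.net/inputs/sample.fastq",
             "/data/sample.fastq")],
    outputs=[("https://examplestorage.blob.core.windows.net/outputs/sample.md5",
              "/data/sample.md5")],
)

print(json.dumps(task, indent=2))
```

In a deployment like the one described here, each such task would be scheduled on an Azure Batch virtual machine, which is why the compute requirements travel with the task in the resources field.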

 

Figure: Architecture - CoA

 

Improvements:  

 

To improve security and meet FMC’s requirements for running workloads in secure environments, we teamed up to deploy Cromwell on Azure in their subscription.

 

This deployment option has been available to users since release 3.2 of Cromwell on Azure. With release 4.0 of Cromwell on Azure, the following major changes have been made:

 

  • Cosmos DB (used by TES) and MySQL (used by Cromwell) were both replaced with PostgreSQL to unify deployments on a single database platform.
  • By default, the deployer provisions an AKS cluster rather than a VM to run the containers. To use a VM, please use a prior version of CoA.

Below we describe how to set up a resource group with a storage account, virtual network, and Azure Container Registry to run Cromwell on Azure with private networking. This architecture ensures that no components have public IP addresses; all traffic is routed through the virtual network or private endpoints. Overall, the following steps are needed to deploy a fully private Cromwell on Azure environment:

 

  1. Provision a virtual network and subnets for CoA resources. 
  2. Provision a VM to run the deployer. Since the deployment will not have direct internet access, we need to create a temporary jumpbox on the virtual network with a public IP address so the deployer can have ssh access to CoA. 
  3. Create a Storage Account (plus Cosmos DB for releases prior to 4.0) to be used by CoA without public access, and establish private endpoints with the virtual network created.
  4. Provision Azure Container Registry.
  5. Deploy CoA, which provisions the rest of the resources with the appropriate configuration.
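
As an illustration of step 1, the sketch below plans non-overlapping subnets inside a virtual network address space using Python's standard ipaddress module. The address range, the /24 subnet size, and the subnet names are assumptions chosen for the example; the actual layout should follow the full instructions and your organization's network policy.

```python
import ipaddress


def plan_subnets(vnet_cidr, names, new_prefix=24):
    """Assign consecutive, non-overlapping subnets of the VNet to each name."""
    vnet = ipaddress.ip_network(vnet_cidr)
    candidates = vnet.subnets(new_prefix=new_prefix)
    plan = {}
    for name in names:
        try:
            plan[name] = next(candidates)
        except StopIteration:
            raise ValueError(
                f"{vnet_cidr} is too small for {len(names)} /{new_prefix} subnets"
            )
    return plan


# Hypothetical address space and subnet names for a private CoA deployment.
plan = plan_subnets(
    "10.1.0.0/16",
    ["vm-subnet", "aks-subnet", "batch-subnet", "private-endpoint-subnet"],
)

for name, network in plan.items():
    print(f"{name}: {network} ({network.num_addresses} addresses)")
```

Planning the ranges up front helps avoid overlaps between the subnet that hosts the private endpoints and the subnets used by the compute resources.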

For full instructions, please visit Setting up private networking for Cromwell on Azure 

 

These instructions allow users to deploy Cromwell on Azure with AKS configurations, and they also provide a guide to deploying a private TES instance, which was previously part of CoA. The following Azure resources are deployed using the above methods:

 

  • Batch account - The Azure Batch account is used by TES to spin up the virtual machines that run each task in a workflow. 
  • Storage account - The Azure Storage account is mounted on the containers using blobfuse, which enables Azure Block Blobs to be mounted as a local file system available to the containers. By default, it includes the following Blob containers - configuration, cromwell-executions, cromwell-workflow-logs, inputs, outputs, and workflows.
  • Azure Container Registry - Used to build, store, and manage container images and artifacts in a private registry for all types of container deployments.
  • Azure Kubernetes Service - Runs the Cromwell, TES, and TriggerService containers as AKS pods. Blobfuse is used to mount the default storage account as a local file system available to the containers.
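
The requirement that no component is publicly reachable can be expressed as a simple check over an inventory of the deployed resources. The sketch below assumes a hypothetical inventory format (a list of dictionaries with name, public_ip, and public_network_access keys) that you would populate from your own tooling; it does not query Azure itself.

```python
def find_public_exposure(resources):
    """Return a human-readable finding for every publicly exposed resource."""
    findings = []
    for res in resources:
        # A non-empty public_ip means the resource is reachable from outside.
        if res.get("public_ip"):
            findings.append(f"{res['name']}: has public IP {res['public_ip']}")
        # Only evaluate the access setting for resources that report one.
        if "public_network_access" in res and res["public_network_access"] != "Disabled":
            findings.append(
                f"{res['name']}: public network access is {res['public_network_access']}"
            )
    return findings


# Hypothetical inventory; names and values are illustrative only.
inventory = [
    {"name": "coa-batch", "public_network_access": "Disabled"},
    {"name": "coastorage", "public_network_access": "Disabled"},
    {"name": "coaregistry", "public_network_access": "Disabled"},
    {"name": "coa-aks", "public_ip": None},
    {"name": "temp-jumpbox", "public_ip": "203.0.113.10"},
]

for finding in find_public_exposure(inventory):
    print(finding)
```

In this example the only finding is the temporary jumpbox used by the deployer, which is a reminder to remove it once the deployment is complete.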

 

Figure: Private Architecture - CoA

 

For instructions on VM-based deployments and releases prior to 4.0, please see the following:

 

This architecture serves as a key pattern for FMC’s genomics workflows running on the Azure cloud platform.

  

Acknowledgments: 

 

We would like to acknowledge the contributions of Sang-Ic Kim from FMC Corporation, who helped test and deploy this solution on their Azure tenant.
