Introduction:
In recent years, the field of genomics has seen a rapid increase in the amount of data generated by DNA sequencing technologies. As a result, many researchers are turning to cloud-based infrastructure to store, analyze, and share their genomic data. However, the sensitive nature of genomic data, which includes personal health information and proprietary data, creates significant security concerns. Therefore, a highly secure and locked-down genomics infrastructure in the cloud is necessary to protect sensitive data and to maintain compliance with data privacy and security regulations (HIPAA, FedRAMP). This type of infrastructure can provide a safe and scalable platform for researchers to perform their analyses while protecting the privacy of individuals whose data is being used.
We worked with FMC Corporation, an agricultural sciences company, to enable such an environment. They have a tightly controlled, secure Azure environment, and every deployment must pass through rigorous security and infosec reviews. As part of their partnership with Microsoft, they are in the process of migrating all their genomic workflows and solutions to Microsoft Azure.
Both parties worked on the deployment methodology for VM-enabled Cromwell on Azure. Cromwell is a workflow management system for scientific workflows, orchestrating the computing tasks needed for genomics analysis. Originally developed by the Broad Institute, Cromwell is widely used to scale genome analysis pipelines. Cromwell supports running scripts at various scales, including your local machine, a local computing cluster, and on the cloud.
Cromwell on Azure configures all of the Azure resources needed to run workflows through Cromwell on the Azure cloud and uses the GA4GH TES (Task Execution Service) backend for orchestrating the tasks on Azure Batch. The installation sets up a VM host to run the Cromwell server and uses Azure Batch to spin up virtual machines that run each task in a workflow. Cromwell workflows can be written using either the WDL or the CWL scripting language.
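To make the role of TES more concrete, here is a minimal sketch of the kind of task message a workflow engine hands to a TES backend. The field names (`name`, `resources`, `executors`, `image`, `command`, `cpu_cores`, `ram_gb`, `disk_gb`) follow the GA4GH TES task schema; the helper `build_tes_task`, the task name, the container image, and the resource values are hypothetical and chosen only for illustration. The sketch only builds and prints the JSON body and does not contact any TES endpoint.

```python
import json


def build_tes_task(name, image, command, cpu_cores=1, ram_gb=2.0, disk_gb=10.0):
    """Return a dict shaped like a GA4GH TES task message.

    Hypothetical helper for illustration; only a small subset of the
    TES task fields is populated.
    """
    return {
        "name": name,
        # Resources the backend should request for the task's compute node.
        "resources": {
            "cpu_cores": cpu_cores,
            "ram_gb": ram_gb,
            "disk_gb": disk_gb,
        },
        # Each executor is one container invocation; they run in order.
        "executors": [
            {
                "image": image,
                "command": command,
            }
        ],
    }


# Example values are assumptions, not taken from a real workflow.
task = build_tes_task(
    name="hello-task",
    image="ubuntu:22.04",
    command=["echo", "hello from a TES task"],
)
print(json.dumps(task, indent=2))
```

In a real deployment Cromwell generates these messages itself from the workflow definition, so users normally never write them by hand.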
Architecture - CoA
Improvements:
To improve security and meet FMC’s requirements for running workloads in secure environments, we teamed up to deploy Cromwell on Azure in their subscription.
Access for the user has been available since release 3.2 of Cromwell on Azure. As of release 4.0 of Cromwell on Azure, the following major changes have been made:
- Cosmos DB (used by TES) and MySQL (used by Cromwell) were both replaced with PostgreSQL to unify deployments on a single database platform
- By default, the deployer provisions an AKS cluster rather than a VM to run the containers. To use a VM, please use a prior version of CoA.
Below we describe how to set up a resource group with a storage account, virtual network, and Azure Container Registry to run Cromwell on Azure with private networking. This architecture ensures that no components have public IP addresses. All traffic is routed through the virtual network or private endpoints. Overall, the following steps are needed to deploy a fully privatized Cromwell on Azure environment:
- Provision a virtual network and subnets for CoA resources.
- Provision a VM to run the deployer. Since the deployment will not have direct internet access, we need to create a temporary jumpbox on the virtual network with a public IP address so the deployer can have SSH access to CoA.
- Create a Storage Account (and Cosmos DB for releases prior to 4.0) to be used by CoA without public access, and establish private endpoints with the virtual network created.
- Provision Azure Container Registry.
- Deploy CoA, which deploys the rest of the resources.
For full instructions, please visit Setting up private networking for Cromwell on Azure.
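As an illustration of the first step above, the sketch below carves a virtual network address space into non-overlapping subnets using Python's standard `ipaddress` module. The address range `10.1.0.0/16`, the `/24` subnet size, the subnet names, and the helper `plan_subnets` are all assumptions made for this example; the actual ranges and names should come from the linked instructions and your own network policy.

```python
import ipaddress


def plan_subnets(vnet_cidr, names, new_prefix=24):
    """Assign one subnet of size /new_prefix to each name, in order.

    Hypothetical planning helper; it does not create any Azure resources.
    """
    vnet = ipaddress.ip_network(vnet_cidr)
    # subnets() yields consecutive, non-overlapping blocks inside the vnet.
    planned = dict(zip(names, vnet.subnets(new_prefix=new_prefix)))
    if len(planned) != len(names):
        raise ValueError("duplicate names or not enough address space")
    return planned


# Assumed address space and subnet names, for illustration only.
plan = plan_subnets(
    "10.1.0.0/16",
    ["vm-subnet", "aks-subnet", "postgres-subnet", "batch-subnet"],
)
for name, net in plan.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")
```

Planning the ranges up front makes it easier to check that each subnet is large enough for its workload (Azure Batch pools in particular can need many addresses) before any resources are provisioned.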
These instructions allow users to deploy Cromwell on Azure with AKS configurations and provide a guide to deploying a private TES instance which was previously part of CoA. The following are the Azure resources that are deployed using the above methods:
- Batch account - The Azure Batch account is used by TES to spin up the virtual machines that run each task in a workflow.
- Storage account - The Azure Storage account is mounted on the containers using blobfuse, which enables Azure Block Blobs to be mounted as a local file system available to the containers. By default, it includes the following Blob containers - configuration, cromwell-executions, cromwell-workflow-logs, inputs, outputs, and workflows.
- Application Insights - This contains logs from TES and the Trigger Service to enable debugging.
- Azure Database for PostgreSQL - Flexible Server - one database is created for Cromwell and another database is created for TES, which includes information and metadata about each TES task that is run as part of a workflow.
- Azure Container Registry - build, store, and manage container images and artifacts in a private registry for all types of container deployments.
- Azure Kubernetes Service - Runs the Cromwell, TES, and TriggerService containers as AKS pods. Blobfuse is used to mount the default storage account as a local file system available to the containers.
Private Architecture - CoA
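Because the storage account is mounted as a local file system, files in it are addressed with ordinary paths. The sketch below assumes a path convention of the form `/storageaccountname/containername/blobname` and splits such a path into its parts; the helper `split_mounted_path`, the account name, and the file name are hypothetical, and the convention itself should be checked against the Cromwell on Azure documentation for your release.

```python
from pathlib import PurePosixPath


def split_mounted_path(path):
    """Split '/account/container/blob...' into (account, container, blob).

    Hypothetical helper; the path layout is an assumption of this sketch.
    """
    parts = PurePosixPath(path).parts
    # For an absolute POSIX path, parts[0] is the root "/".
    if len(parts) < 4 or parts[0] != "/":
        raise ValueError(f"expected /account/container/blob, got {path!r}")
    account, container = parts[1], parts[2]
    blob = "/".join(parts[3:])
    return account, container, blob


# Example path is an assumption, for illustration only.
account, container, blob = split_mounted_path(
    "/mystorageaccount/inputs/sample1/reads.fastq.gz"
)
print(account, container, blob)
```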
For VM-based deployments and releases prior to 4.0, please see the following:
- VM: Host VM - runs Ubuntu 18.04 LTS and Docker Compose with three containers (Cromwell, TES, TriggerService). Blobfuse is used to mount the default storage account as a local file system available to these containers.
This architecture acts as a key pattern for FMC’s genomics workflows running on the Azure cloud platform.
Acknowledgments:
We would like to acknowledge the contributions of Sang-Ic Kim from FMC Corporation, who helped to test and deploy this solution on their Azure tenant.