Introduction:
In recent years, the field of genomics has seen a rapid increase in the amount of data generated by DNA sequencing technologies. As a result, many researchers are turning to cloud-based infrastructure to store, analyze, and share their genomic data. However, the sensitive nature of genomic data, which includes personal health information and proprietary data, creates significant security concerns. Therefore, a highly secure and locked-down genomics infrastructure in the cloud is necessary to protect sensitive data and to maintain compliance with data privacy and security regulations (HIPAA, FedRAMP). This type of infrastructure can provide a safe and scalable platform for researchers to perform their analyses while protecting the privacy of the individuals whose data is being used.
We worked with FMC Corporation, an agricultural sciences company, to enable such an environment. They have a tightly controlled, secure Azure environment, and every deployment must pass rigorous security and infosec reviews. As part of their partnership with Microsoft, they are in the process of migrating all their genomic workflows and solutions to Microsoft Azure.
Both parties worked on the deployment methodology for VM-enabled Cromwell on Azure. Cromwell is a workflow management system for scientific workflows that orchestrates the computing tasks needed for genomics analysis. Originally developed by the Broad Institute, Cromwell is widely used to scale genome analysis pipelines. Cromwell supports running scripts at various scales: on your local machine, on a local computing cluster, and in the cloud.
Cromwell on Azure configures all of the Azure resources needed to run workflows through Cromwell on the Azure cloud and uses the GA4GH TES (Task Execution Service) backend to orchestrate the tasks on Azure Batch. The installation sets up a VM host to run the Cromwell server and uses Azure Batch to spin up virtual machines that run each task in a workflow. Cromwell workflows can be written using either the WDL or the CWL scripting language.
Architecture - CoA
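To make the role of the TES backend more concrete, here is a minimal sketch of the kind of task document the GA4GH TES specification defines: a name, one or more executors (a container image plus a command), and resource requests. In Cromwell on Azure, Cromwell generates these tasks itself, so users do not normally write them by hand. The helper name `build_tes_task`, the task name, the image, and the resource values are all hypothetical and chosen only for illustration.

```python
import json


def build_tes_task(name, image, command, cpu_cores=1, ram_gb=2.0, disk_gb=10.0):
    """Build a minimal GA4GH TES task document as a plain dict.

    Field names (name, executors, image, command, resources, cpu_cores,
    ram_gb, disk_gb) follow the TES task schema; the values are illustrative.
    """
    return {
        "name": name,
        "executors": [
            # Each executor runs one command inside one container image.
            {"image": image, "command": list(command)}
        ],
        "resources": {
            "cpu_cores": cpu_cores,
            "ram_gb": ram_gb,
            "disk_gb": disk_gb,
        },
    }


# Hypothetical task: names and values are assumptions, not taken from the article.
task = build_tes_task(
    name="hello-task",
    image="ubuntu:22.04",
    command=["echo", "hello"],
)
print(json.dumps(task, indent=2))
```

A TES client would send a document like this to the service's task-creation endpoint, and the TES implementation (here, backed by Azure Batch) would schedule a virtual machine to run the executor.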
Improvements:
To improve security and meet FMC’s requirements for running workloads in secure environments, we teamed up to deploy Cromwell on Azure in their subscription.
Access for the user has been available since release 3.2 of Cromwell on Azure. Since release 4.0 of Cromwell on Azure became operational, the following major changes have been made:
Below we describe how to set up a resource group with a storage account, virtual network, and Azure Container Registry to run Cromwell on Azure with private networking. This architecture ensures that no components have public IP addresses. All traffic is routed through the virtual network or private endpoints. Overall, the following steps are needed to deploy a fully privatized Cromwell on Azure environment:
For full instructions, please visit Setting up private networking for Cromwell on Azure.
These instructions allow users to deploy Cromwell on Azure with AKS configurations and provide a guide to deploying a private TES instance which was previously part of CoA. The following are the Azure resources that are deployed using the above methods:
Private Architecture - CoA
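As a rough illustration of the "no public IP addresses" property of this architecture, the sketch below checks a planned address inventory and reports any component whose address falls outside the private ranges. It uses only the Python standard library. The component names and addresses are hypothetical; they are not taken from FMC's deployment or from the Cromwell on Azure installer.

```python
import ipaddress


def find_public_addresses(components):
    """Return the names of components whose address is not in a private range."""
    return [
        name
        for name, address in components.items()
        if not ipaddress.ip_address(address).is_private
    ]


# Hypothetical inventory: names and addresses are assumptions for illustration.
planned = {
    "cromwell-host": "10.1.0.4",
    "storage-private-endpoint": "10.1.1.5",
    "container-registry-private-endpoint": "10.1.1.6",
}

exposed = find_public_addresses(planned)
print(exposed)
```

A check like this is only a planning aid. In a real deployment, the authoritative view of which resources are publicly reachable comes from the Azure resource configuration itself.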
For VM-based deployments and for releases prior to 4.0, please see the following instructions:
This architecture acts as a key pattern for FMC’s genomics workflows running on the Azure cloud platform.
Acknowledgments:
We would like to acknowledge the contributions of Sang-Ic Kim from FMC Corporation, who helped test and deploy this solution on their Azure tenant.