COVID Variant Analysis on Azure using Nextflow
Published Jan 27, 2022

This is the first post in a three-part series on running Nextflow on Azure.

Part 1: Basics (This blog)

Part 2: Getting started with Nextflow & HAVoC (coming soon)

Part 3: Data analysis & visualization (coming soon)

 

Imagine you are a public health official for your state. As the COVID virus spreads through the community, you need to know not only the number of cases but also which COVID variants are affecting your community. You are working with a lab that can quickly turn around case positivity results and is adding the capacity to sequence a portion of the samples. We are going to show an example of using Nextflow on Azure to run a pipeline called HAVoC, which the lab could use to detect the variants of COVID spreading in your community. This is not the only solution, but one example of how the problem could be solved.

 

Nextflow is a workflow framework that can be used for genomics analysis. Genomics pipelines are composed of many interdependent tasks: the output of one task is used as the input of a subsequent downstream task, and each task usually runs a process that manipulates the data.
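As a minimal sketch of this model (the process names, commands, and sample labels here are hypothetical and not part of any real pipeline), a Nextflow DSL2 script chains two tasks by feeding the output channel of one process into the next:

```nextflow
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

// First task: write a per-sample file (hypothetical example process).
process writeGreeting {
    input:
    val sample

    output:
    path 'greeting.txt'

    script:
    """
    echo "Processing sample ${sample}" > greeting.txt
    """
}

// Downstream task: consumes the file produced by writeGreeting.
process toUpper {
    input:
    path infile

    output:
    stdout

    script:
    """
    tr '[:lower:]' '[:upper:]' < ${infile}
    """
}

workflow {
    // Each value in the channel flows through the pipeline
    // independently, so multiple samples run concurrently.
    samples = Channel.of('sample_a', 'sample_b', 'sample_c')
    toUpper(writeGreeting(samples)).view()
}
```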

 

HAVoC is a bioinformatics pipeline built by a team at the University of Helsinki that can be used to analyze variations in the SARS-CoV-2 genome and assign a variant lineage. You could analyze a sample and identify which of the known variants is spreading in your community.

 

Bioinformatics pipelines are compute intensive. The HAVoC pipeline, for example, is made up of more than ten interdependent tasks. Each task takes an input that might come from a previous step, runs a dedicated process, and generates an output that might be passed to a downstream task or stored for further analysis. Additionally, these pipelines often need to run on multiple samples concurrently. Using the power of Azure, we can dynamically assign the right type and amount of compute for each task, scaling up and down as needed, and free the resources as soon as they are no longer needed. This flexibility and scalability make such pipelines ideal for running on the cloud.
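To illustrate how a pipeline can request a different type and amount of compute per task, a nextflow.config can declare per-process resource directives. The process name ALIGN_READS below is a hypothetical stand-in for a compute-heavy step, and the numbers are illustrative:

```nextflow
// nextflow.config -- per-task resource requests (names and sizes are illustrative)
process {
    // defaults for lightweight tasks
    cpus   = 2
    memory = '4 GB'

    // override for a hypothetical compute-heavy step, e.g. read alignment
    withName: 'ALIGN_READS' {
        cpus   = 8
        memory = '16 GB'
    }
}
```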

 

To get started, we will install Nextflow on a local computer; this can be your desktop or a VM that you have access to. We'll call this local computer the client in this blog. The client doesn't have to run on Azure to get started: Nextflow runs on any POSIX-compatible system. Windows users can use WSL (Windows Subsystem for Linux) or a Linux-based VM. The only other dependency is Java, version 8 or greater. Follow the instructions in the "Getting Started" section of the Nextflow documentation.
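For reference, the installation boils down to a few shell commands, following the Nextflow "Getting Started" instructions:

```bash
# confirm Java is available (version 8 or greater)
java -version

# download the Nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash

# make it executable and verify the installation
chmod +x nextflow
./nextflow -version
```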

 

Although you can run Nextflow on your client locally, you are limited to the resources available there. That is a good way to develop and test your pipelines, but for production use you will want to leverage the power of Azure. Fortunately, Nextflow supports this easily: by changing a few attributes in the configuration file, you can use the power and scalability of Azure to run your pipelines.
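As a sketch of what that configuration change looks like (the account names, keys, and region below are placeholders you would replace with your own), nextflow.config switches the executor to Azure Batch and adds an azure scope:

```nextflow
// nextflow.config -- placeholder credentials; do not commit real keys
process {
    executor = 'azurebatch'
}

azure {
    storage {
        accountName = '<your-storage-account>'
        accountKey  = '<your-storage-key>'
    }
    batch {
        location     = 'eastus'
        accountName  = '<your-batch-account>'
        accountKey   = '<your-batch-key>'
        autoPoolMode = true   // let Nextflow create compute pools on demand
    }
}
```

The pipeline is then launched with a work directory on Azure Blob storage, for example nextflow run main.nf -w az://my-container/work (the container name is a placeholder).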

 

In the next blog, we'll look at how to set up a Nextflow environment. We'll show how to execute jobs locally and then, using the HAVoC script, how to run jobs on Azure.
