This is part two of the Nextflow on Azure blog series. In the last post, we introduced the COVID lineage detection example that we’ll be using, introduced Nextflow, and gave a high-level overview of how it works. In this post, we’ll walk through how to set up Nextflow, how to run Nextflow jobs on your local computer, and how to run jobs on Azure.
Nextflow can be installed on any POSIX-compliant operating system, such as Linux or macOS. On Windows, you can run Nextflow through the Windows Subsystem for Linux (WSL). We will show an example of how to install it on WSL. Follow the instructions in the “Getting Started” section to get started.
Nextflow requires Java. To check which version of Java is installed, run the “java -version” command. Your output should resemble the example below, allowing for the version of Java you have installed.
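The check can also be scripted. The snippet below is a small sketch that prints the first line of the “java -version” output, or a hint if Java is not on your PATH. The `check_java` helper name is ours for illustration and is not part of Java or Nextflow.

```shell
# Hypothetical helper: report the Java runtime found on PATH, if any.
check_java() {
    if command -v java >/dev/null 2>&1; then
        # `java -version` writes to stderr, so merge it into stdout.
        java -version 2>&1 | head -n 1
    else
        echo "java not found on PATH; install a JDK before installing Nextflow"
    fi
}

check_java
```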
Once you have Java installed successfully, you can run the command to install Nextflow. If it works, you’ll see output that resembles the screenshots below. I tried to capture various stages so your output might be slightly different.
You can confirm that Nextflow is successfully installed by running the “./nextflow -v” command. Your output should be like the one below.
After validating the installation is complete, run the “hello world” equivalent to test your installation. Sample output is included below.
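Taken together, the install, version check, and test run look roughly like the following. This is a sketch that assumes `curl` is available and that you want the `nextflow` launcher in the current directory. The download URL is the one given in the Nextflow documentation, and `hello` is Nextflow's sample pipeline. The `-x` guards are our addition so that the later steps are skipped if the download did not succeed.

```shell
# Step 1: download the Nextflow launcher into the current directory
# (skipped if a launcher is already present).
if [ ! -x ./nextflow ]; then
    curl -s https://get.nextflow.io | bash \
        || echo "Install failed: check Java and your network connection"
fi

# Steps 2 and 3 only make sense once the launcher exists.
if [ -x ./nextflow ]; then
    ./nextflow -v          # print the installed Nextflow version
    ./nextflow run hello   # run the "hello world" pipeline
else
    echo "nextflow launcher not found in $(pwd)"
fi
```

If you move the launcher to a directory on your PATH, you can drop the leading “./” from the commands.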
If all the steps completed successfully, congratulations: you have Nextflow installed and running in your local compute environment. Let’s shift our focus to configuring Nextflow to run jobs on Azure. In the previous blog, we introduced HAVoC, a tool that can be used for COVID variant analysis. We have created a Nextflow project that runs the HAVoC pipeline on Azure. You can use that repo to follow along.
Running Nextflow on Azure
In their most basic form, Nextflow jobs are defined in a single file with a “.nf” extension. For non-trivial examples, the code can be laid out across several folders. Below is the layout of the HAVoC project we’re going to run with Nextflow.
The “bin” folder and the “main.nf” file determine what Nextflow runs, while the “conf” folder and the “nextflow.config” file determine how the jobs are run.
Let’s start by looking at the “nextflow.config” file.
Plugins: Loading the ‘nf-azure’ plugin is what lets Nextflow run jobs on Azure.
Process: Provides the name of the container that Azure will load to run your Nextflow job. This provides a nice clean way to customize the runtime environment for your job.
Profiles: Lets you define multiple execution environments, so the same job can be run locally or on Azure as needed. We can do most of our dev/test work locally and then push the final run to Azure. Notice that the profiles point to additional configuration files.
To run on Azure, the configuration needs to specify at least the following settings:
Process: Tells Nextflow to use the Azure Batch executor. It’s possible to use Kubernetes as the executor by using Azure Kubernetes Service.
Azure.Storage: Tells Nextflow which Azure Storage account to use for its working directory.
Azure.Batch: Tells Nextflow which Azure Batch account to use for its executor.
With that configuration in place, Nextflow can connect to your Azure resources and use them to execute your job.
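As an illustration of the settings described above, the shell snippet below writes a sample set of config files into a scratch directory so you can inspect them before adapting them. It is a sketch, not the project's actual configuration: the profile names, the “conf/local.config” and “conf/azure.config” file names, and every value in angle brackets are placeholders we made up. The option names themselves (the ‘azurebatch’ executor and the `azure.storage` and `azure.batch` scopes) are taken from the Nextflow Azure documentation.

```shell
# Write the sample files to a scratch directory, not into a real project.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/conf"

# Top-level config: plugin, container image, and profiles.
cat > "$demo_dir/nextflow.config" <<'EOF'
plugins {
    id 'nf-azure'
}

process {
    // Placeholder: the container image that holds the job's tools.
    container = '<registry>/<image>:<tag>'
}

profiles {
    local { includeConfig 'conf/local.config' }
    azure { includeConfig 'conf/azure.config' }
}
EOF

# Local profile: run tasks on the machine that launches Nextflow.
cat > "$demo_dir/conf/local.config" <<'EOF'
process {
    executor = 'local'
}
EOF

# Azure profile: the minimum settings for the Azure Batch executor.
cat > "$demo_dir/conf/azure.config" <<'EOF'
process {
    executor = 'azurebatch'
}

azure {
    storage {
        accountName = '<storage-account-name>'
        accountKey  = '<storage-account-key>'
    }
    batch {
        location     = '<azure-region>'
        accountName  = '<batch-account-name>'
        accountKey   = '<batch-account-key>'
        autoPoolMode = true
    }
}
EOF

echo "Sample config files written to $demo_dir"
```

Keeping the account keys in a separate file such as “conf/azure.config” makes it easier to leave secrets out of source control.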
Let’s now briefly look at what will be run when you run “main.nf”.
Although this script is quite intimidating to look at, it can be boiled down to three sections:
Channel: This takes the input FASTQ file pairs and creates a tuple for each read pair. That tuple is one of the inputs to the process task.
Parameter Check: The next two blocks of code check for the presence of a reference file and an adapter file. Both of these files are inputs to the HAVoC script.
Process: This will be the main action block for Nextflow. In this example we have a single process defined. You’ll notice the process called “runHavoc” can be broken down into three distinct sections. It has an “input” section that defines all the input parameters. It has an “output” section that defines all the output parameters. It also has an execution block. The execution block for this process will use the bash shell to execute the Havoc.sh shell script and pass in all the required parameters.
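To make the three sections concrete, the snippet below writes a stripped-down Nextflow script to a scratch file. It is a sketch under our own assumptions, not the project's “main.nf”: it uses DSL2 syntax, the parameter names and the default read pattern are hypothetical, and the script block only records its inputs where the real pipeline would call the HAVoC shell script.

```shell
# Write the sketch to a scratch file so it does not overwrite a real main.nf.
sketch_file="$(mktemp -d)/sketch.nf"

cat > "$sketch_file" <<'EOF'
nextflow.enable.dsl = 2

// Hypothetical parameter names and defaults, for illustration only.
params.reads   = 'data/*_R{1,2}.fastq.gz'
params.ref     = null
params.nextera = null

process runHavoc {
    input:
    tuple val(sample), path(reads)   // one read pair per task
    path ref                         // reference file
    path nextera                     // adapter file

    output:
    path "${sample}_out"

    script:
    """
    # Placeholder for the call to the HAVoC shell script.
    mkdir ${sample}_out
    echo "${sample} ${reads} ${ref} ${nextera}" > ${sample}_out/inputs.txt
    """
}

workflow {
    // Parameter check: stop early if a required file was not supplied.
    if (!params.ref || !params.nextera) {
        error "Please supply both --ref and --nextera"
    }

    // Channel: one (sample id, [read 1, read 2]) tuple per FASTQ pair.
    read_pairs = Channel.fromFilePairs(params.reads, checkIfExists: true)

    // Process: run the task once for each read pair.
    runHavoc(read_pairs, file(params.ref), file(params.nextera))
}
EOF

echo "Sketch written to $sketch_file"
```

Because the channel emits one tuple per read pair, Nextflow starts one “runHavoc” task per sample, which is what allows the work to be spread across Azure Batch nodes.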
To start the job, run Nextflow from the command line. The command breaks down as follows:
nextflow run main.nf: Instructs Nextflow to run the “main.nf” file that we looked at above.
-w: Instructs Nextflow to use the “nextflow” container in the configured storage account as its working directory.
--nextera: Passes in the Nextera adapter file that the HAVoC script expects as an input.
--ref: Passes in the reference file that the HAVoC script expects as an input.
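Assembled into one line, the pieces above might look like the command printed by the snippet below. Treat it as a sketch: the `build_run_command` helper, the `azure` profile name, the “az://nextflow/work” work directory, and the two file names are all assumptions for illustration, so substitute your own values. The snippet prints the command instead of running it, so you can review it first.

```shell
# Hypothetical helper that prints the launch command for review.
# The default values are placeholders, not the project's real paths.
build_run_command() {
    work_dir="${1:-az://nextflow/work}"   # container/path used as the work directory
    nextera="${2:-adapters.fa}"           # Nextera adapter file
    ref="${3:-reference.fa}"              # reference file
    echo "nextflow run main.nf -profile azure -w $work_dir --nextera $nextera --ref $ref"
}

build_run_command
# prints: nextflow run main.nf -profile azure -w az://nextflow/work --nextera adapters.fa --ref reference.fa
```

Note the difference in dashes: options with a single dash, such as “-w”, are read by Nextflow itself, while options with a double dash are passed to the pipeline as parameters.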
When the job is done running, it will generate and store all the output files in the configured storage account container. The script will output progress/status messages on the console. You can use Batch Explorer or the Azure portal to view the progress/status of the Azure Batch execution.
In the next blog, we’ll look at the output of the process, specifically the “pangolearn_assignments.csv” file. We’ll show how the data can be used to generate reports and visualizations that show trends in variant spread.