RNA sequencing analysis on Azure using Nextflow: low-priority vs. dedicated machines comparison.

Olesya_Melnichenko · Mar 22 2023

This is part two of the blog series on RNA sequencing analysis using Nextflow. In the first part we introduced the configuration files required to run the nf-core/rnaseq pipeline on dedicated virtual machines on Azure. We also presented benchmarking results for two machine series, Ddv4 and Edv4, and discussed the possible cost savings from switching from dedicated to low-cost machines.

In this post we describe the configuration file change needed to run the pipeline on low-cost virtual machines. We used the updated configuration to benchmark the pipeline on Edv4-series low-priority machines. The dedicated vs. low-priority comparison shows that using low-priority virtual machines can reduce the cost of a pipeline run by up to 82% without significant changes in run duration.

Prerequisites

We highly recommend reading the first part of the blog series, since it contains a detailed description of the benchmarking experiment and the data used for it.

Updating configuration file

Azure Batch can automatically scale pools based on parameters that you define. The default scale formula that Nextflow uses is shown below; it defines only the target number of dedicated nodes in the pool:

// Get pool lifetime since creation.
lifespan = time() - time("{{poolCreationTime}}");
interval = TimeInterval_Minute * {{scaleInterval}}; 

// Compute the target nodes based on pending tasks.
// $PendingTasks == The sum of $ActiveTasks and $RunningTasks
$samples = $PendingTasks.GetSamplePercent(interval);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max($PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes/2);
targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));

// For first interval deploy 1 node, for other intervals scale up/down as per tasks.
$TargetDedicatedNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
$NodeDeallocationOption = taskcompletion;
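
To make the branching in this formula easier to follow, here is a minimal Python sketch that mirrors its logic. It is an illustration only, not the code that Azure Batch runs: the function name and its parameters are hypothetical, the sampled metrics are passed in as plain numbers, and any rounding that Batch applies when it sets the node count is not modeled.

```python
def target_dedicated_nodes(
    lifespan_min: float,        # minutes since pool creation
    scale_interval_min: float,  # {{scaleInterval}}
    sample_percent: float,      # $PendingTasks.GetSamplePercent(interval)
    latest_pending: float,      # $PendingTasks.GetSample(1)
    avg_pending: float,         # avg($PendingTasks.GetSample(interval))
    current_target: float,      # current value of $TargetDedicatedNodes
    vm_count: int,              # {{vmCount}}
    max_vm_count: int,          # {{maxVmCount}}
) -> float:
    """Hypothetical re-implementation of the default scale formula's decision."""
    # During the first interval the pool keeps its initial size.
    if lifespan_min < scale_interval_min:
        return vm_count

    # With fewer than 70% of samples available, use only the latest reading;
    # otherwise take the larger of the latest reading and the interval average.
    if sample_percent < 70:
        tasks = max(0, latest_pending)
    else:
        tasks = max(latest_pending, avg_pending)

    # With no pending tasks, halve the current target so the pool shrinks
    # step by step instead of dropping to zero at once.
    target_vms = tasks if tasks > 0 else max(0, current_target / 2)

    # Never exceed the configured maximum pool size.
    return max(0, min(target_vms, max_vm_count))


# Young pool: stays at vmCount.
print(target_dedicated_nodes(2, 5, 100, 50, 40, 2, 2, 20))
# Many pending tasks: capped at maxVmCount.
print(target_dedicated_nodes(30, 5, 100, 50, 40, 2, 2, 20))
# No pending tasks: current target is halved.
print(target_dedicated_nodes(30, 5, 100, 0, 0, 8, 2, 20))
```

The low-priority variant of the formula follows the same steps; only the variable that receives the result changes.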

We can change the type of pool nodes from dedicated to low-priority by modifying the default formula and passing it to the scaleFormula option in the pool configuration. The updated configuration file is shown below:

// Scale formula to use low-priority nodes only.
lowPriorityScaleFormula = '''
    lifespan = time() - time("{{poolCreationTime}}");
    interval = TimeInterval_Minute * {{scaleInterval}};
    $samples = $PendingTasks.GetSamplePercent(interval);
    $tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max($PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
    $targetVMs = $tasks > 0 ? $tasks : max(0, $TargetLowPriorityNodes/2);
    targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));
    $TargetLowPriorityNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
    $TargetDedicatedNodes = 0;
    $NodeDeallocationOption = taskcompletion;
'''

process {
    executor = 'azurebatch'
    queue = 'Standard_E2d_v4'
    withLabel:process_low {queue = 'Standard_E2d_v4'}
    withLabel:process_medium {queue = 'Standard_E8d_v4'}
    withLabel:process_high {queue = 'Standard_E16d_v4'}
    withLabel:process_high_memory {queue = 'Standard_E32d_v4'}
}

azure {
    storage {
        accountName = "<Your storage account name>"
        sasToken = "<Your storage account SAS Token>"
    }
    batch {
        location = "<Your location>"
        accountName = "<Your batch account name>"
        accountKey = "<Your batch account key>"
        autoPoolMode = false
        allowPoolCreation = true
        deletePoolsOnCompletion = true

        pools {
            Standard_E2d_v4 {
                autoScale = true
                vmType = 'Standard_E2d_v4'
                vmCount = 2
                maxVmCount = 20
                scaleFormula = lowPriorityScaleFormula
            }
            Standard_E8d_v4 {
                autoScale = true
                vmType = 'Standard_E8d_v4'
                vmCount = 2
                maxVmCount = 20
                scaleFormula = lowPriorityScaleFormula
            }
            Standard_E16d_v4 {
                autoScale = true
                vmType = 'Standard_E16d_v4'
                vmCount = 2
                maxVmCount = 20
                scaleFormula = lowPriorityScaleFormula
            }
            Standard_E32d_v4 {
                autoScale = true
                vmType = 'Standard_E32d_v4'
                vmCount = 2
                maxVmCount = 10
                scaleFormula = lowPriorityScaleFormula
            }
        }
    }
}

Note that the scaleFormula option accepts any custom formula, so you can also create pools with a mix of dedicated and low-cost machines if needed. To learn more about creating autoscale formulas, see the Azure Batch documentation.

Benchmarking Edv4-series: dedicated vs. low-priority machines

We used the updated configuration file to repeat the benchmarking experiment on Edv4-series low-priority machines. The detailed experiment setup is described in the previous blog post. Sample and reference data, tool and pipeline versions, and pricing data were kept the same.

As we mentioned previously, the tradeoff for using low-cost machines is that they may be preempted at any time. To collect benchmarking data, we performed 30 pipeline runs, 10 runs per alignment option. The number of tasks in the pipeline depends on the alignment option: 36 tasks for STAR-RSEM and HISAT2, and 42 tasks for STAR-Salmon. All 1140 tasks were completed on the first attempt; no preemption events were observed.
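
The totals quoted above follow directly from the per-option task counts. A short Python check, using only the numbers from this paragraph (the variable names are ours):

```python
# Tasks per pipeline run for each alignment option, as reported above.
TASKS_PER_RUN = {"STAR-Salmon": 42, "STAR-RSEM": 36, "HISAT2": 36}
RUNS_PER_OPTION = 10

# Every option was run the same number of times.
total_runs = RUNS_PER_OPTION * len(TASKS_PER_RUN)
total_tasks = RUNS_PER_OPTION * sum(TASKS_PER_RUN.values())

print(total_runs)   # 30
print(total_tasks)  # 1140
```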

Average cost estimates and durations for each alignment option are presented in Figs. 1-3; average changes are in Table 1. The results show that using low-priority virtual machines can reduce the cost of running the pipeline by up to 82% without significant changes in pipeline completion time.

Fig. 1: Average cost estimate and pipeline run duration for nf-core/rnaseq pipeline with STAR-Salmon alignment option; Edv4-series virtual machines.

Fig. 2: Average cost estimate and pipeline run duration for nf-core/rnaseq pipeline with STAR-RSEM alignment option; Edv4-series virtual machines.

Fig. 3: Average cost estimate and pipeline run duration for nf-core/rnaseq pipeline with HISAT2 alignment option; Edv4-series virtual machines.

Table 1: Average change in cost and duration for each alignment option when switching from dedicated to low-priority VMs, Edv4-series.
Alignment option	Cost change, %	Duration change, %
STAR-Salmon	-81	-2.8
STAR-RSEM	-82	1.6
HISAT2	-81	0.8
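
As a rough reading aid, the cost changes in Table 1 can be converted into the share of the dedicated cost that is still paid and into a "times cheaper" factor. The sketch below uses only the rounded percentages from the table, so its results are approximate; the function names are hypothetical.

```python
# Average change in cost (%) from Table 1, dedicated -> low-priority.
COST_CHANGE_PCT = {"STAR-Salmon": -81, "STAR-RSEM": -82, "HISAT2": -81}


def remaining_fraction(change_pct: float) -> float:
    """Fraction of the dedicated cost that is still paid after the change."""
    return 1 + change_pct / 100


def savings_factor(change_pct: float) -> float:
    """Dedicated cost divided by low-priority cost."""
    return 1 / remaining_fraction(change_pct)


for option, change in COST_CHANGE_PCT.items():
    print(
        f"{option}: pays {remaining_fraction(change):.0%} of the dedicated cost, "
        f"about {savings_factor(change):.1f}x cheaper"
    )
```

With these figures, a low-priority run costs a little under one fifth of a dedicated run for all three alignment options.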

Azure Batch currently offers two types of low-cost virtual machines: Spot VMs and low-priority VMs. The type of node you get depends on your Batch account's pool allocation mode, which is set when the account is created. Accounts that use the Batch managed pool allocation mode always get low-priority VMs; accounts that use the user subscription pool allocation mode always get Spot VMs. Please keep in mind that low-priority nodes will be retired in September 2025.

For the benchmarking experiments we used an account with the Batch managed pool allocation mode (the default option) because it is easier to set up. To replicate the experiments on Spot VMs, you need to create a Batch account with the user subscription pool allocation mode; the configuration file can be reused without changes.
