Healthcare and Life Sciences Blog

Simplifying Genomic Task Execution with GA4GH TES: A Guide for Bioinformatics Workflows

Venkat_Malladi
Oct 23, 2024

Introduction

 

The rise of high-throughput sequencing and bioinformatics workflows has led to increasing complexity in managing computational tasks across cloud environments. For many genomics professionals, translating biological questions into scalable, cloud-native processes can be challenging. That's where the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) comes into play.

 

At Microsoft, we’ve been working to simplify how bioinformaticians, developers, and data scientists submit and manage tasks on the cloud through the TES API. In this blog post, we’ll walk you through some practical examples of how you can submit tasks to TES using different tools and languages, such as curl, Nextflow, Python, and C#.

 

These examples will help you understand the flexibility TES offers to developers and scientists, making it easier to build, submit, and monitor your genomics workflows in the cloud.

 

Submitting Tasks to TES: A Brief Overview

 

TES follows a standardized API that abstracts the complexity of managing distributed tasks in the cloud. Whether you're working with small datasets or scaling up to process terabytes of sequencing data, TES provides a unified interface for running tasks on cloud or on-premises environments.

 

Here, we will focus on four ways to submit tasks to TES:

 

  1. Using Curl - A command-line tool for transferring data.
  2. Python Client - For flexible, programmatic access to TES.
  3. C# SDK - For developers working in the .NET ecosystem.
  4. Nextflow Integration - A popular workflow management system for bioinformatics pipelines.

Example 1: Submitting Tasks with Curl

 

The simplest way to interact with TES is via the command line using curl. Below is the curl example provided in the official repository:

Prerequisites

Make sure you install jq if it is not already present. jq is a lightweight and flexible command-line JSON processor.

 

  1. Create the TES Instance File

    You’ll need to define the TES instances in a file named .tes_instances. This file should be in CSV format with two fields/columns: a description of the TES instance and the URL pointing to it.

    You can create the file using the following command:

     

    cat << "EOF" > .tes_instances
    Azure/TES @ YourNode,https://tes.your-node.org/
    EOF

    Important: Make sure to replace the example content with your actual TES instance description and URL. Avoid using commas in the description field.

  2. Create the Secrets File

    Next, create a secrets file (.env) that will store your environment variables such as TES service credentials and Azure storage account information. You can either set these variables in your shell or directly insert the values into the command below:

     

    cat << EOF > .env
    TES_SERVER_USER=$TES_SERVER_USER
    TES_SERVER_PASSWORD=$TES_SERVER_PASSWORD
    TES_OUTPUT_STORAGE_ACCT=$TES_OUTPUT_STORAGE_ACCT
    EOF
    • TES_SERVER_USER: Your TES service username.
    • TES_SERVER_PASSWORD: Your TES service password.
    • TES_OUTPUT_STORAGE_ACCT: Your Azure storage account where the outputs will be saved.

Running the Demo

After setting up the necessary configuration, you’re ready to submit a task using the BWA example. First, download the run-bwa.sh script.

Run the following command to submit the task:

 

 

./run-bwa.sh

 

 

Here's a summary of what it does:

 

  1. Load Environment Variables:

    • It checks for and loads variables from a .env file, such as credentials and storage account information.
  2. TES Task Submission:

    • The script defines a function submit_task() that submits a task payload (described later) to a TES instance via a POST request using curl. It uses basic authentication based on the environment variables loaded earlier.
  3. TES Task State Monitoring:

    • Another function, get_task_state(), fetches the current state of a task by making a GET request to the TES instance using the task ID.
  4. TES Instance URL:

    • The script reads the TES instance URL from a file called .tes_instances, which contains TES instance information. If no instance is found, the script aborts.
  5. Task Payload:

    • The script constructs a JSON payload to describe the task. This payload includes:
      • Inputs: FASTQ files and a reference genome (HG38) to be used for BWA.
      • Outputs: The aligned output BAM file.
      • Executors: It uses the quay.io/biocontainers/bwa container to execute the BWA commands.
      • Resources: The task requests 16 CPU cores and 32 GB of RAM.
  6. Submit the Task:

    • The task payload is submitted to the TES instance, and the script logs the full response for debugging. If the task is submitted successfully, it extracts the task ID.
  7. Monitor Task Status:

    • The script monitors the task's state every 5 seconds until it reaches one of the final states: COMPLETE, EXECUTOR_ERROR, SYSTEM_ERROR, or CANCELLED.
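The submit-and-poll pattern in steps 2, 3, and 7 can be sketched in Python using only the standard library. This is an illustrative sketch rather than the repository's script: the endpoint, credentials, and payload are placeholders you would fill in from your .env and .tes_instances files.

```python
import base64
import json
import time
import urllib.request

TES_URL = "https://tes.your-node.org"  # placeholder: read from .tes_instances
AUTH = base64.b64encode(b"user:password").decode()  # placeholder: from .env

# Terminal states checked by the script (the TES spec spells it CANCELED)
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}

def submit_task(payload: dict) -> str:
    """POST the task payload to /v1/tasks and return the new task ID."""
    req = urllib.request.Request(
        f"{TES_URL}/v1/tasks",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {AUTH}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

def get_task_state(task_id: str) -> str:
    """GET /v1/tasks/{id} and return the task's current state."""
    req = urllib.request.Request(
        f"{TES_URL}/v1/tasks/{task_id}?view=MINIMAL",
        headers={"Authorization": f"Basic {AUTH}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["state"]

def wait_for_task(task_id: str, interval: int = 5) -> str:
    """Poll every `interval` seconds until the task reaches a final state."""
    while (state := get_task_state(task_id)) not in TERMINAL_STATES:
        time.sleep(interval)
    return state
```

The same three functions correspond directly to the submit_task() and get_task_state() shell functions described above, with the 5-second polling loop from step 7.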

After the pipeline completes, all results will be saved in the Azure Blob Storage container outputs/curl.

 

This example demonstrates how to create a simple TES task in JSON format and submit it to a TES server. The task definition includes input files, outputs, resource requirements, and a Docker container image to execute the task.
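For reference, a task document of the shape described above might look like the following sketch. All URLs, file names, and the container tag are placeholders (the actual payload lives in run-bwa.sh); the command is simplified to emit SAM and assumes the reference index files are staged as additional inputs, while the repository script produces the BAM described above.

```json
{
  "name": "BWA alignment (sketch)",
  "inputs": [
    { "url": "https://yourstorageacct.blob.core.windows.net/inputs/sample_R1.fastq.gz", "path": "/data/sample_R1.fastq.gz" },
    { "url": "https://yourstorageacct.blob.core.windows.net/inputs/sample_R2.fastq.gz", "path": "/data/sample_R2.fastq.gz" },
    { "url": "https://yourstorageacct.blob.core.windows.net/inputs/hg38.fa", "path": "/data/hg38.fa" }
  ],
  "outputs": [
    { "url": "https://yourstorageacct.blob.core.windows.net/outputs/curl/sample.sam", "path": "/data/sample.sam" }
  ],
  "executors": [
    {
      "image": "quay.io/biocontainers/bwa:0.7.17--hed695b0_7",
      "command": ["bwa", "mem", "-t", "16", "/data/hg38.fa", "/data/sample_R1.fastq.gz", "/data/sample_R2.fastq.gz"],
      "stdout": "/data/sample.sam"
    }
  ],
  "resources": { "cpu_cores": 16, "ram_gb": 32 }
}
```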

 

Example 2: Submitting Tasks with Python (py-tes)

 

Python is a widely used language in bioinformatics, and the py-tes package simplifies programmatic submission of tasks to TES. Below, we describe the py-tes example.

Prerequisites

To get started with py-tes, you need to install the required dependencies and set up the necessary configuration files. You can use Conda or its faster alternative, Mamba (recommended), to install these dependencies.

 

  1. Install Dependencies

    If you are using Conda or Mamba, you can create the environment with the following command:

     

    conda env create -f environment.yml

    This command will install all the dependencies listed in the environment.yml file.

  2. Create the TES Instance File

    You’ll need to define the TES instances in a file named .tes_instances. This file should be in CSV format with two fields/columns: a description of the TES instance and the URL pointing to it.

    You can create the file using the following command:

     

    cat << "EOF" > .tes_instances
    Azure/TES @ YourNode,https://tes.your-node.org/
    EOF

    Important: Make sure to replace the example content with your actual TES instance description and URL. Avoid using commas in the description field.

  3. Create the Secrets File

    Next, create a secrets file (.env) that will store your environment variables such as TES service credentials and Azure storage account information. You can either set these variables in your shell or directly insert the values into the command below:

     
    cat << EOF > .env
    TES_SERVER_USER=$TES_SERVER_USER
    TES_SERVER_PASSWORD=$TES_SERVER_PASSWORD
    TES_OUTPUT_STORAGE_ACCT=$TES_OUTPUT_STORAGE_ACCT
    EOF
    • TES_SERVER_USER: Your TES service username.
    • TES_SERVER_PASSWORD: Your TES service password.
    • TES_OUTPUT_STORAGE_ACCT: Your Azure storage account where the outputs will be saved.

Running the Demo

After setting up the necessary configuration, you’re ready to submit a task using the BWA example. First, download the run-bwa.py script.

Run the following command to submit the task:

 

 

./run-bwa.py

 

 

This script will read the .tes_instances file to identify the TES instances and submit the task using the credentials and storage account information provided in the .env file.

 

Compared to the curl example, this script does the following:

 

  • Task Submission:

    • The script loops through the available TES instances and submits the task to each instance using the py-tes client.
    • If the task is submitted successfully, the task ID is logged. Otherwise, any error is caught and logged.
  • Helper Functions:

    • csv_to_dict: Reads the .tes_instances file and converts the contents into a dictionary for easy lookup of TES instance URLs.
    • submit_task: Submits a task to the specified TES instance using the py-tes client, with basic authentication (if required).
    • get_task_state: Fetches the current state of a submitted task using its task ID.
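The csv_to_dict helper, for instance, can be as simple as the following pure-Python sketch; the actual helper in the repository may differ, but the input format is the two-column .tes_instances CSV created in the setup step.

```python
import csv
import io

def csv_to_dict(path_or_buffer) -> dict:
    """Map each TES instance description to its URL from a two-column CSV."""
    if isinstance(path_or_buffer, str):
        handle = open(path_or_buffer, newline="")
    else:
        handle = path_or_buffer
    with handle:
        # Each row is "description,url"; skip any blank lines.
        return {row[0]: row[1] for row in csv.reader(handle) if row}

# Example with the file contents from the setup step above:
sample = io.StringIO("Azure/TES @ YourNode,https://tes.your-node.org/\n")
instances = csv_to_dict(sample)
print(instances["Azure/TES @ YourNode"])  # → https://tes.your-node.org/
```

This also makes clear why the description field must not contain commas: each line is split on the first delimiter into exactly two columns.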

After the pipeline completes, all results will be saved in the Azure Blob Storage container outputs/py-tes.

 

This Python example illustrates how to interact with the TES API programmatically, making it easy to define and submit tasks from your scripts or applications.

 

Example 3: Submitting Tasks with C#

 

For developers working in the .NET ecosystem, the C# SDK provides a convenient way to interact with TES. Below, we describe the C# example:

Prerequisites

Before running the TES SDK examples, ensure you meet the following requirements:

  1. .NET 8.0 SDK or higher: You need the .NET SDK to build and execute the C# examples. You can download it from the official .NET site.

  2. Azure CLI: This tool is essential for authenticating and accessing Azure resources such as Blob Storage. You can log in to your Azure account by running the following command, which uses device authentication:

     

    az login
  3. User Secrets: Use User Secrets in .NET to securely store the credentials for the TES service and Azure Blob Storage. You’ll need to configure the following secrets:

    • TesCredentialsPath: Path to TesCredentials.json (created during TES deployment).
    • StorageAccountName: Name of your Azure Blob Storage account (used to store output files).

    Example commands to configure User Secrets:

     

    dotnet user-secrets init
    dotnet user-secrets set "TesCredentialsPath" "path/to/TesCredentials.json"
    dotnet user-secrets set "StorageAccountName" "your_storage_account_name"

Opening and Building the Project

  1. Open the TES Solution: Launch Visual Studio and open the Microsoft.GA4GH.TES.sln file from the TES project. This file is located in the root directory of the TES repository.

  2. Using the TES.SDK.Examples Project: The TES.SDK.Examples project in the solution contains various sample console applications to demonstrate how to interact with the TES API.

Building a Single-File Executable

For ease of deployment, you can package the demo application as a single-file executable. Below are the instructions for both Linux and Windows.

For Linux

Run the following command to publish the demo application as a single-file executable for Linux:

 

 

dotnet publish --configuration Release --output ./publish --self-contained --runtime linux-x64 /p:PublishSingleFile=true

 

 

For Windows

To publish the demo application for Windows:

 

 

dotnet publish --configuration Release --output ./publish --self-contained --runtime win-x64 /p:PublishSingleFile=true

 

 

Note: Replace linux-x64 or win-x64 with your desired platform runtime (e.g., osx-x64 for macOS).

This will create a single-file executable in the ./publish directory.

Running the TES SDK Examples

After building the project, you can run the TES SDK examples by executing the single-file executable you just created. Here are two key examples you can run.

1. Prime Sieve Example

This example submits a TES task that calculates prime numbers in a specified range. Run the following command:

 

 

./Tes.SDK.Examples primesieve [taskCount]

 

 

  • taskCount: (Optional) Number of tasks to run. Each task processes a range of 1,000,000 numbers. If omitted, it defaults to 1.

Example:

 

 

./Tes.SDK.Examples primesieve 10

 

 

This command will submit 10 tasks, each calculating prime numbers in a distinct range.
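The range partitioning described above can be illustrated with a short sketch. The splitting scheme below is an assumption based on the description (contiguous, disjoint blocks of 1,000,000 numbers), not the SDK example's exact code.

```python
CHUNK = 1_000_000  # each task covers a range of 1,000,000 numbers

def task_ranges(task_count: int, chunk: int = CHUNK):
    """Return the (start, end) range each of `task_count` tasks would sieve."""
    return [(i * chunk, (i + 1) * chunk) for i in range(task_count)]

# With taskCount = 10, ten disjoint ranges are produced:
ranges = task_ranges(10)
print(ranges[0])   # → (0, 1000000)
print(ranges[-1])  # → (9000000, 10000000)
```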

2. BWA Mem Example

This example submits a TES task to run the BWA Mem algorithm, which is widely used for aligning sequence reads to a reference genome.

 

 

./Tes.SDK.Examples bwa

 

 

The output from this task (such as the aligned BAM file) will be stored in your configured Azure Blob Storage account.

 

This example shows how to use the C# SDK to submit a task to a TES instance, making it easy for .NET developers to integrate TES into their cloud-native applications.

 

Example 4: Nextflow and TES Integration

 

For bioinformaticians managing complex workflows, Nextflow provides a powerful and flexible way to orchestrate tasks. Nextflow can directly submit tasks to TES, making it easier to scale your workflows across cloud environments.

 

In this Nextflow example, we configure Nextflow to use TES as the execution backend:

Prerequisites

Before you start, ensure you have the following installed and configured:

  1. Java (version 8 or 11): Nextflow relies on Java, so ensure you have a compatible version installed on your machine.

  2. Nextflow: Nextflow can be installed on Linux, macOS, or Windows.

    • For Linux or macOS, run the following commands in your terminal to install Nextflow:

        curl -s https://get.nextflow.io | bash
        mv nextflow /usr/local/bin/

    • For Windows, you’ll need to install Windows Subsystem for Linux (WSL) first. After setting up WSL, follow the Linux installation instructions.

    For detailed steps, refer to the official Nextflow installation guide.

  3. Azure CLI: You’ll need this tool to authenticate and manage Azure resources such as Blob Storage.

    Log in to your Azure account using:

    az login

Configuring Nextflow for TES

To connect Nextflow with your TES instance and Azure storage, you need to create a configuration file (tes.config) containing your TES and Azure credentials. Here's an example of how the file should look:

 

 

process {
  executor = 'tes'
}

azure {
  storage {
    accountName = "<Your storage account name>"
    accountKey  = "<Your storage account key>"
  }
}

tes.endpoint       = "<Your TES endpoint>"
tes.basicUsername  = "<Your TES username>"
tes.basicPassword  = "<Your TES password>"

 

 
  • accountName: The name of your Azure Blob Storage account.
  • accountKey: Your Azure Blob Storage account key.
  • tes.endpoint: The URL of your TES endpoint.
  • tes.basicUsername and tes.basicPassword: Your TES service credentials.

Running the Nextflow Pipeline

To help you get started quickly, we’re using the nf-hello-gatk project, a sample Nextflow pipeline designed to showcase Nextflow’s capabilities with TES and Azure integration.

Use the following command to run the pipeline:

 

 

./nextflow run seqeralabs/nf-hello-gatk -c tes.config -w 'az://work' --outdir 'az://outputs/nextflow' -r main

 

 

Here's what each part of the command does:

  • -c tes.config: Specifies the configuration file with your TES and Azure credentials.
  • -w 'az://work': Defines the Azure Blob Storage container for intermediate workflow files.
  • --outdir 'az://outputs/nextflow': Specifies the output directory in Azure Blob Storage where the results will be saved.
  • -r main: Specifies the branch of the repository to run (in this case, the main branch).

Viewing the Results

Once the pipeline completes, all results will be saved in the Azure Blob Storage container specified by the --outdir flag.

 

By using Nextflow with TES, bioinformaticians can easily parallelize tasks across different compute environments while maintaining full control of task execution through Nextflow’s powerful DSL.

Conclusion

 

The flexibility of the GA4GH TES API allows users to submit and manage tasks in a wide variety of environments—whether you're working from the command line with `curl`, managing workflows with Nextflow, writing scripts in Python, or building applications in C#.

 

Microsoft is committed to providing tools that help simplify the genomic workflow experience across cloud platforms. With TES, you can easily scale your bioinformatics tasks in a consistent and cloud-agnostic manner. Try out these examples in your own workflows and see how TES can help streamline your bioinformatics operations.

 

If you’d like to dive deeper, check out the full documentation and examples on the GA4GH-TES GitHub repository, and stay tuned for new ways to interact with the API.

 

Updated Oct 23, 2024
Version 3.0