Submit Apache Spark and Apache Flink jobs with Azure Logic Apps on HDInsight on AKS

Author(s): Arun Sethia is a Program Manager on the Azure HDInsight Customer Success Engineering (CSE) team.

Co-Author: Sairam is a Product Manager for Azure HDInsight on AKS.

 

Introduction

Azure Logic Apps allows you to create and run automated workflows with little to no code. Each workflow starts with a single trigger, after which you add one or more actions. An action specifies a task to perform. A trigger specifies the condition for running the further steps in that workflow, for example when a blob is added or updated, when an HTTP request is received, or when new data appears in a SQL database table. Workflows can be stateful or stateless, depending on your Azure Logic Apps plan (Standard or Consumption).

 

Using workflows, you can orchestrate complex processes with multiple processing steps, triggers, and interdependencies. These steps can include Apache Spark and Apache Flink jobs as well as integration with other Azure services.

 

This blog focuses on how you can add an action that triggers an Apache Spark or Apache Flink job on HDInsight on AKS from a workflow.

 

Azure Logic Apps - Orchestrate an Apache Spark Job on HDInsight on AKS

 

In our previous blog, we discussed different options for submitting Apache Spark jobs to an HDInsight on AKS cluster. The Azure Logic Apps workflow here uses the Livy Batch Job API to submit the Apache Spark job.

The following diagram shows the interaction between Azure Logic Apps, the Apache Spark cluster on HDInsight on AKS, Microsoft Entra ID (formerly Azure Active Directory), and Azure Key Vault. You can use other cluster shapes, such as Apache Flink or Trino, in the same way through the Azure management endpoints.

[Diagram: Azure Logic Apps workflow interacting with an Apache Spark cluster on HDInsight on AKS, Microsoft Entra ID, and Azure Key Vault]

HDInsight on AKS allows you to access the Apache Spark Livy REST APIs using an OAuth token. This requires a Microsoft Entra service principal, which must be granted access to the HDInsight on AKS cluster (RBAC support is coming soon). The client ID (appId) and secret (password) of this principal can be stored in Azure Key Vault (you can use various design patterns to rotate secrets); a sketch of the lookup follows.
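For illustration, the Key Vault lookup could look like the following minimal Python sketch. In the workflow itself, the built-in Key Vault connector with a managed identity performs this step without code; the vault URL and the secret names spark-client-id and spark-client-secret are placeholders, not names from the original setup.

# Minimal sketch, assuming the service principal's appId and password are
# stored as Key Vault secrets named "spark-client-id" and "spark-client-secret".
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://<your-key-vault>.vault.azure.net"  # placeholder vault URL
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

client_id = client.get_secret("spark-client-id").value          # service principal appId
client_secret = client.get_secret("spark-client-secret").value  # service principal password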

 

Based on your business scenario, you can start (trigger) your workflow; in this example we use the "When an HTTP request is received" trigger. The workflow connects to Key Vault using a system-assigned managed identity (or a user-assigned managed identity) to retrieve the client ID and secret of the service principal created to access the HDInsight on AKS cluster. The workflow then retrieves an OAuth token using the client credentials flow (secret, client ID, and the scope https://hilo.azurehdinsight.net/.default), as sketched below.
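For reference, the token request issued by the workflow's HTTP action is a standard client-credentials call. A minimal Python sketch follows; the tenant ID and credential values are placeholders.

# Sketch of the client-credentials token request the workflow's HTTP action sends.
import requests

tenant_id = "<your-tenant-id>"                    # placeholder
client_id = "<client-id-from-key-vault>"          # appId retrieved in the previous step
client_secret = "<client-secret-from-key-vault>"  # password retrieved in the previous step

token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
resp = requests.post(token_url, data={
    "grant_type": "client_credentials",
    "client_id": client_id,
    "client_secret": client_secret,
    "scope": "https://hilo.azurehdinsight.net/.default",
})
resp.raise_for_status()
access_token = resp.json()["access_token"]  # Bearer token for the Livy API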

 

The Apache Spark Livy REST API on HDInsight on AKS is then invoked with the Bearer token and a Livy Batch (POST /batches) payload, as in the sketch below.
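As a sketch, the same call in Python might look like this. The cluster endpoint, jar path, and class name are hypothetical placeholders; the payload fields follow the standard Livy batch schema.

# Sketch of the Livy batch submission (POST /batches) with the Bearer token.
import requests

access_token = "<token-from-previous-step>"               # Bearer token acquired above
livy_base = "https://<your-spark-cluster-endpoint>/livy"  # placeholder Livy endpoint

payload = {
    "file": "abfs://<container>@<storage-account>.dfs.core.windows.net/jars/spark-job.jar",
    "className": "com.example.SparkJob",  # hypothetical entry class
    "args": [],
    "conf": {"spark.executor.instances": "2"},
}

resp = requests.post(
    f"{livy_base}/batches",
    json=payload,
    headers={"Authorization": f"Bearer {access_token}"},
)
resp.raise_for_status()
print(resp.json())  # Livy returns the batch id and state, e.g. {"id": 0, "state": "starting", ...}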

The final workflow is as follows; the source code and sample payload are available on GitHub.

[Screenshot: the final Apache Spark workflow in the Azure Logic Apps designer]

Azure Logic Apps - Orchestrate an Apache Flink Job on HDInsight on AKS

 

HDInsight on AKS provides user-friendly ARM REST APIs to submit and manage Apache Flink jobs, so you can submit Apache Flink jobs from any Azure service. Using the ARM REST API, you can orchestrate a data pipeline with Azure Data Factory Managed Airflow; similarly, you can use an Azure Logic Apps workflow to manage complex business workflows.

 

The following diagram shows the interaction between Azure Logic Apps, the Apache Flink cluster on HDInsight on AKS, Microsoft Entra ID, and Azure Key Vault.

[Diagram: Azure Logic Apps workflow interacting with an Apache Flink cluster on HDInsight on AKS, Microsoft Entra ID, and Azure Key Vault]

To invoke the ARM REST APIs, we need a Microsoft Entra service principal and must grant it the Contributor role on the specific Apache Flink cluster on HDInsight on AKS. (The resource ID can be retrieved from the portal: go to the cluster page, click JSON View; the value of "id" is the resource ID.)

 

 

az ad sp create-for-rbac -n <sp name> --role Contributor --scopes <Flink Cluster Resource ID>
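The command prints a JSON object; the appId and password values it returns are the client ID and secret referenced below, and tenant is the tenant ID used when requesting the OAuth token.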

 

 

The client ID (appId) and secret (password) of this principal can be stored in Azure Key Vault (you can use various design patterns to rotate secrets).

 

The workflow connects to Key Vault using a system-assigned managed identity (or a user-assigned managed identity) to retrieve the client ID and secret of the service principal created to access the HDInsight on AKS cluster. The workflow then retrieves an OAuth token using the client credentials flow (secret, client ID, and the scope https://management.azure.com/.default), as sketched below.
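Putting the two HTTP actions together, a minimal Python sketch of the token request and the job submission follows. The runJob action path and payload fields follow the HDInsight on AKS Flink job REST API as documented at the time of writing; treat the api-version and field values as assumptions to verify against the current documentation, and note that all IDs, paths, and class names are placeholders.

# Sketch: acquire an ARM token, then invoke the Flink cluster's runJob action.
import requests

tenant_id = "<your-tenant-id>"                    # placeholder
client_id = "<client-id-from-key-vault>"          # service principal appId
client_secret = "<client-secret-from-key-vault>"  # service principal password

# Client-credentials token request with the ARM scope.
token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
resp = requests.post(token_url, data={
    "grant_type": "client_credentials",
    "client_id": client_id,
    "client_secret": client_secret,
    "scope": "https://management.azure.com/.default",
})
resp.raise_for_status()
arm_token = resp.json()["access_token"]

# Flink cluster resource ID (the "id" value from the cluster's JSON view).
cluster_resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.HDInsight/clusterpools/<pool-name>/clusters/<flink-cluster-name>"
)

# runJob action payload; field names per the documented Flink job REST API.
job_payload = {
    "properties": {
        "jobType": "FlinkJob",
        "jobName": "sample-flink-job",         # hypothetical job name
        "action": "NEW",
        "jobJarDirectory": "abfs://<container>@<storage-account>.dfs.core.windows.net/jars",
        "jarName": "flink-job.jar",            # hypothetical jar
        "entryClass": "com.example.FlinkJob",  # hypothetical entry class
    }
}

resp = requests.post(
    f"https://management.azure.com{cluster_resource_id}/runJob"
    "?api-version=2023-06-01-preview",  # verify against current docs
    json=job_payload,
    headers={"Authorization": f"Bearer {arm_token}"},
)
resp.raise_for_status()
print(resp.json())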

 

The final workflow is as follows; the source code and sample payload are available on GitHub.

[Screenshot: the final Apache Flink workflow in the Azure Logic Apps designer]

Summary

The HDInsight on AKS REST APIs let you automate, orchestrate, schedule, and monitor workflows with your choice of framework. Such automation reduces complexity, shortens development cycles, and completes tasks with fewer errors.

 

You can choose what works best for your organization; let us know your feedback, or tell us about other Azure service integrations you would like for automating and orchestrating your workloads on HDInsight on AKS.

