Build a medallion architecture using Apache Flink and Trino with Microsoft Fabric and HDInsight on AKS
Published Oct 12, 2023

Authors: Snehal Sonwane, Service Engineer, Azure Data & AI; Abhishek Jain & Sairam Yeturi, Product Managers, HDInsight

 

In this era of AI, we have many options for building data platforms, and the technology stack is evolving fast.

 

As the data landscape continues to scale, enterprises are increasingly investing in platforms that can scale and sustain the three big V's - Volume, Velocity, and Variety - paving the way for modern data platform architectures that meet enterprise scale.

 

Introduction to HDInsight on AKS

 


 

 

Microsoft's Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. With HDInsight, you can use open-source frameworks such as Apache Spark, Apache Hive, LLAP, Apache Kafka, Hadoop, and more in your Azure environment.

 

With the recent release of HDInsight on AKS, Microsoft has further enhanced the service offering to run on Azure Kubernetes Service (AKS).

 

Well, I took the new version for a spin. Running on performant infrastructure brings the benefits of lower maintenance and management and lets me focus on business logic, and with the intuitive interface, creating a cluster now takes just 6 to 10 minutes!

 

With added features like autoscale, configuration management, and cluster pool setup, this PaaS offering is designed to help pro developers focus on building their applications and worry less about infrastructure or platform issues.

 

The new version introduces two new workloads in addition to the wide analytics spectrum available in the previous version.

 

HDInsight on AKS introduces Apache Flink® and Trino - two of the most requested analytics workloads among customers - which complement the entire stack from ingestion and streaming to query.

 

Let's talk a bit more about these new technologies and what they mean for us.

 

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, performing computations and running stateful streaming applications at in-memory speed and at any scale. Learn more here.

 

Trino is an open-source distributed SQL query engine for federated and interactive analytics against heterogeneous data sources. It can query data at scale (gigabytes to petabytes) from multiple sources to enable enterprise-wide analytics. Learn more here.

 

In addition to these two analytics components, the much-loved Apache Spark is also available on HDInsight on AKS.

 

If security is an important checkbox on your checklist, then you should hear this: the platform is secure by default, with modern authentication and authorization built on managed identities and OAuth. It also supports integration with virtual networks (VNets) and Azure Active Directory (AAD) to make your data platforms secure using Azure-native solutions.

 

Integrations complement the platform and make life easier for developers, so you don't end up spending time forever wiring components together - that is covered across the Azure analytics stack. You will find most of your favorite Azure services, such as ADF, Purview, and Azure Monitor, and last but certainly not least, Microsoft Fabric.

 

We will now walk you through how you can modernize your architecture!

 

Implementing a medallion architecture using HDInsight on AKS with Apache Flink, Trino, and Apache Spark, backed by Microsoft Fabric OneLake

 

[Architecture diagram: medallion architecture with HDInsight on AKS (Flink, Spark, Trino) and Microsoft Fabric OneLake]

 

 

Let's dive into how you can use these workloads together and build an end-to-end enterprise architecture to suit your needs.

 

Let's get started - 

 

Here is our end-to-end scenario:

 

  1. Use Flink to load data into OneLake in Microsoft Fabric
  2. Read and transform the data with Spark on the lakehouse
  3. Store the results in OneLake
  4. Use the Trino CLI to access the transformed data
  5. Visualize it in Power BI

Here are some prerequisites for the demo:

 

  1. ADLS Gen2 storage account
  2. Microsoft Fabric workspace
  3. Power BI Desktop
  4. IntelliJ for development
  5. HDInsight on AKS cluster pool (subscription and resources)
    • You can create the cluster pool and clusters through the portal here. If you are a fan of ARM templates, you can also use one-click deployment templates to spin up your clusters.

  6. A managed identity (MSI) for your clusters and other resources to communicate securely.

 

1. Ingesting data into OneLake in Microsoft Fabric using Flink

 

Let's create a Flink cluster inside the cluster pool. For the ingestion process, we will use the Flink/Delta connector.

 

For the purpose of this demonstration, we are using a data generator function to produce the data; Flink can just as easily read from a variety of sources such as Apache Kafka, Event Hubs, and more.

 

Writing to the Delta sink

 

The Delta sink is used to write the data stream to a Delta table in ADLS Gen2. We build a jar with the required libraries and dependencies and call the Delta sink class when submitting the job via the Flink CLI. The path to the ADLS Gen2 storage account is specified in the Delta sink properties.
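To make this concrete, here is a minimal Scala sketch of such a job using the connector's DeltaSink. The two-column schema (flightId, velocity), the in-memory source standing in for the data generator, the abfss:// path, and the class/jar names are illustrative assumptions rather than the exact code used in this demo, and the Delta/Flink connector and Hadoop ABFS dependencies are assumed to be packaged into the job jar.

import java.util.Arrays

import io.delta.flink.sink.DeltaSink
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.data.{GenericRowData, RowData, StringData}
import org.apache.flink.table.types.logical.{IntType, RowType, VarCharType}
import org.apache.hadoop.conf.Configuration

object FlinkDeltaIngest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Schema of the target Delta table (assumed columns for this sketch).
    val rowType = new RowType(Arrays.asList(
      new RowType.RowField("flightId", new VarCharType(VarCharType.MAX_LENGTH)),
      new RowType.RowField("velocity", new IntType())))

    // Stand-in for the data generator source used in the demo.
    val events = env.fromElements[RowData](
      GenericRowData.of(StringData.fromString("FL-001"), Int.box(450)),
      GenericRowData.of(StringData.fromString("FL-002"), Int.box(510)))

    // Delta sink pointed at the destination folder in ADLS Gen2
    // (replace the placeholder abfss:// path with your container and account).
    val deltaSink = DeltaSink
      .forRowData(
        new Path("abfss://<container>@<storageaccount>.dfs.core.windows.net/flink/FlightsData"),
        new Configuration(),
        rowType)
      .build()

    events.sinkTo(deltaSink)
    env.execute("flink-delta-ingest")
  }
}

// Package the job as a jar and submit it from the cluster, for example:
// bin/flink run -c FlinkDeltaIngest flink-delta-ingest.jar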

 

Before submitting the job, you must create the destination folder in ADLS Gen2. As you can see in the figure below, it is empty at the moment.

 

[Screenshot: the empty destination folder in ADLS Gen2]

 

 

Log in to the SSH node of your Flink cluster and run the job. Once the job is submitted, you can check its status and metrics in the Flink UI.

 

[Screenshots: submitting the Flink job and checking its status and metrics in the Flink UI]

 

As you can see, the data has been loaded into ADLS Gen2.

 

[Screenshot: Delta files loaded into the ADLS Gen2 folder]

 

Now we will create a shortcut to this ADLS Gen2 location in OneLake in the Fabric workspace.

 

Before that, create a lakehouse in your Fabric workspace.

Create a lakehouse - Microsoft Fabric | Microsoft Learn

 

[Screenshots: creating the lakehouse and adding a shortcut in the Fabric workspace]

 

Note: Use the DFS endpoint of your ADLS Gen2 account.

Give a name to your shortcut and point it to the location where the Delta files are being generated.

 

[Screenshots: configuring the shortcut to the ADLS Gen2 location]

 

The data now resides on the raw layer of the medallion architecture.

 

Let's cleanse this data in the silver layer and transform it for our gold layer, to be consumed by end users.

 

Note: I have used the same lakehouse in OneLake to store the raw, cleansed, and transformed data for this demo, but it is good practice to use a separate lakehouse for each layer.

 

2. Reading and transforming data using a Spark cluster

 

To access OneLake from the Spark notebook, grant the appropriate permission to the HDInsight managed identity on the Microsoft Fabric workspace, as documented here.

 

Note: You can also leverage Spark within Fabric; however, I have used a Spark cluster on HDInsight on AKS in this demo.

 

  • Create a Spark cluster on HDInsight on AKS as documented here.
  • Open a Jupyter notebook from the Spark cluster, read data from OneLake, transform the data, and load it into a new table in OneLake (see the sketches in the following steps).
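To illustrate the read step, here is a minimal Scala sketch; the workspace, lakehouse, and shortcut names (FlightsData) are placeholders, the notebook assumes Delta Lake libraries are available on the cluster, and the original notebook may have used a different language.

// Read the raw Delta data exposed through the OneLake shortcut
// (placeholder workspace/lakehouse/shortcut names - adjust to your environment).
val rawPath = "abfss://<workspacename>@onelake.dfs.fabric.microsoft.com/" +
  "<lakehousename>.Lakehouse/Files/FlightsData"

val rawDf = spark.read.format("delta").load(rawPath)
rawDf.printSchema()
rawDf.show(5)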

 


We observed a few null values in our data set, so we cleansed it, making it ready for transformation in the consumption layer.
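A sketch of that cleansing step, with assumed column names and a placeholder path for the silver (cleansed) output:

// Drop rows with nulls in the columns used downstream (assumed names),
// then persist the cleansed data to the silver location in OneLake.
val cleansedDf = rawDf.na.drop(Seq("flightId", "velocity"))

val cleansedPath = "abfss://<workspacename>@onelake.dfs.fabric.microsoft.com/" +
  "<lakehousename>.Lakehouse/Files/FlightsDataCleansed"
cleansedDf.write.format("delta").mode("overwrite").save(cleansedPath)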

 


Let's transform our data now. We have an input data set stored in ADLS Gen2 that will be used to perform aggregations on the data in the cleansed layer of OneLake.
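Reading that input data could look like the following sketch; the ADLS Gen2 path and the Delta format are assumptions.

// Input/reference data set kept in ADLS Gen2 (placeholder path; could also be CSV or Parquet).
val referencePath = "abfss://<container>@<storageaccount>.dfs.core.windows.net/reference/flights"
val referenceDf = spark.read.format("delta").load(referencePath)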

 


We join the two data sets and aggregate the results on the average value of velocity.
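A sketch of that join and aggregation, assuming both data sets share a flightId key and that the velocity column comes from the cleansed data:

import org.apache.spark.sql.functions.avg

// Join the cleansed data with the reference data and compute the average velocity per flight.
val aggregatedDf = cleansedDf
  .join(referenceDf, Seq("flightId"))
  .groupBy("flightId")
  .agg(avg("velocity").alias("avg_velocity"))

aggregatedDf.show(5)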

 


 

3. Storing the aggregated and transformed data into OneLake

 

Let's save our results to a table in OneLake.
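A sketch of that write, targeting the Tables area of the lakehouse so the result surfaces as a table; the workspace and lakehouse names are placeholders, and FlightsData matches the table location registered with Trino later in this post.

// Persist the aggregated result as a Delta table in the lakehouse (gold layer).
aggregatedDf.write
  .format("delta")
  .mode("overwrite")
  .save("abfss://<workspacename>@onelake.dfs.fabric.microsoft.com/" +
        "<lakehousename>.Lakehouse/Tables/FlightsData")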

 


 

The data is loaded as a new table in OneLake, which serves as the gold layer for consumption.

 


 


 

4. Using Trino CLI to access the transformed data

 

Now, we will access this newly transformed table through the Trino CLI.

 

 

Once the Trino cluster is redeployed with the Delta Lake catalog enabled, open the Trino CLI and register a new table in a Trino schema using the Delta table in OneLake:

 

 

 

 

CREATE SCHEMA delta.example;

USE delta.example;

CALL delta.system.register_table(schema_name => 'example', table_name => 'FlightsData', table_location => 'abfss://<workspacename>@msit-onelake.dfs.fabric.microsoft.com/<lakehousename>.Lakehouse/Tables/FlightsData');

SELECT * FROM FlightsData;

 

 

 

 

[Screenshots: query output in the Trino CLI]

5. Visualizing the data in Power BI

 

Let's query this data in Power BI using the Trino connector.

 

Open Power BI Desktop and click Get data. Type Trino in the search bar and click Connect.

 

[Screenshot: the Get data dialog in Power BI Desktop with the Trino connector]

 

Fill in your Trino cluster details (the Trino cluster URL is available in the Overview blade of your Trino cluster in the Azure portal).

 

[Screenshot: Trino connection details in Power BI Desktop]

 

Select the table created in the previous step and load the data.

 

[Screenshot: the table loaded from Trino into Power BI]

 

Well, that was fabulous! 

 

OneLake can truly be used as a single, unified storage layer for all these big data workloads, extending its capabilities to a large set of tools and services.

 

This really makes it a great architecture to bet on, bringing together the best technologies on Azure. Let's get you started:

 

 
