Author: Debananda Ghosh, Global Black Belt - Sr. Specialist, Data & AI, Microsoft.
At the October 2023 Microsoft event ‘Enterprise scale open-source analytics on containers’, Microsoft announced the public preview of HDInsight on AKS. This is the latest evolution of the Microsoft Azure HDInsight PaaS foundation stack. Azure HDInsight is now completely rearchitected on Azure Kubernetes Service (AKS) infrastructure and currently offers Trino, Flink and Spark workloads. HDInsight on AKS provides end-to-end integration with the Azure ecosystem. Using the Spark, Flink and Trino offerings of HDInsight on AKS, we can deploy big data analytics compute in containers. At the same time, enterprises and digital native organizations do not need to manage the containers themselves. We can learn more about HDInsight on AKS here: What is HDInsight on AKS? - Azure HDInsight Preview Documentation | Microsoft Learn.
In this blog we focus on the Trino capabilities of an HDInsight on AKS cluster. We will also discuss why we need to modernize our on-premises Trino deployments and adopt such PaaS capabilities.
Fig 1.1 HDInsight on AKS Trino offering illustration.
We will cover the following topics in this blog.
Overview of Trino
Trino is a tool for querying very large volumes of data with distributed queries. Note that Trino is not an OLTP (online transaction processing) database like MySQL, PostgreSQL or HBase. Nor is it an open-source data lake or lakehouse storage layer like the Hadoop file system or the Delta format. Trino is a tool for executing ad hoc queries against petabyte-scale data management systems like Hive and Delta Lake. However, Trino is not restricted to connecting only to a data lake or lakehouse; it extends its query capability to multiple sources. As per the Trino community definition, it is:
A query engine that runs at ludicrous speed, fast distributed SQL query engine for big data analytics that helps you explore your data universe.
Trino is known primarily for,
To support data federation and decent interactive analytics speed, the current open-source Trino leverages the following architecture components and query execution model.
The following architecture diagram from Trino community blog depicts Trino components and external data connections.
Fig 1.2 OSS Trino Conceptual Architecture.
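To make the distributed execution model a little more concrete, here is a toy Python sketch. It is not Trino code. It assumes a simplified model in which a coordinator divides a table into splits, workers compute partial aggregates in parallel, and the coordinator merges the partial results. The function names (`make_splits`, `worker_partial_sum`, `coordinator_query`) and the sample data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, split_size):
    """Divide the input rows into fixed-size chunks (a stand-in for splits)."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def worker_partial_sum(split):
    """A worker aggregates only its own split: SUM(amount) GROUP BY region."""
    partial = Counter()
    for region, amount in split:
        partial[region] += amount
    return partial


def coordinator_query(rows, workers=3, split_size=2):
    """Plan the splits, fan them out to workers, then merge the partials."""
    splits = make_splits(rows, split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts together
    return dict(total)


# Hypothetical sample rows: (region, amount)
SALES = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]
RESULT = coordinator_query(SALES)
print(RESULT)
```

The real engine plans SQL into stages and tasks and streams data between workers; the sketch only illustrates the scatter-and-merge idea behind a distributed aggregate.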
Trino architecture patterns and challenges.
Trino/Presto deployments are mostly used in organizations for data federation. There can be other purposes as well, such as ad hoc interactive querying and a visualization layer on top of multiple data sources. One example of a Trino architecture pattern for data federation is shown in the following diagram. In this pattern the organization uses a change data capture mechanism, an MQTT broker, custom apps and other middleware to feed data into a cloud storage account. Streaming data is further transformed by Delta, Spark or other processing capabilities and subsequently stored in a cloud storage account (such as Azure Data Lake Storage Gen2 or AWS S3). For near real-time streaming analytics, data is also loaded into a cloud big data real-time database for fast data exploration. Open-source Trino/Presto is used as a data federation layer between cold storage (cloud storage accounts) and the cloud OSS real-time database (hot storage). Business intelligence tools can connect directly to Trino/Presto for visualization. The Trino environment also acts as an ad hoc querying tool for interactive users who want to explore multiple data sources inside the same tool.
Fig 1.3 Current Trino Deployment architecture
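In practice, the federation layer in this pattern comes down to a single SQL statement that joins tables living in different catalogs. The following Python sketch only builds such a statement as a string; it does not connect to a cluster. Trino addresses tables as `catalog.schema.table`, but the catalog, schema, table and column names used here (`hotstore`, `lake`, `device_events`, `device_dim`, `device_id`, `reading`, `site`) are hypothetical.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified Trino table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join(hot, cold, key, columns):
    """Build one SELECT that joins a hot-store table with a cold-store table."""
    select_list = ", ".join(columns)
    return (
        f"SELECT {select_list}\n"
        f"FROM {hot} AS h\n"
        f"JOIN {cold} AS c ON h.{key} = c.{key}"
    )


# Hypothetical catalogs: a real-time database (hot) and a data lake (cold).
HOT = qualified("hotstore", "telemetry", "device_events")
COLD = qualified("lake", "curated", "device_dim")
QUERY = federated_join(HOT, COLD, "device_id",
                       ["h.device_id", "h.reading", "c.site"])
print(QUERY)
```

A BI tool or an interactive user would submit a statement of this shape to Trino, which reads from both sources and performs the join in the engine.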
In the previous architecture, let us zoom into the data federation block, that is, the Trino execution engine. Note that such a Trino OSS deployment is usually on-premises and comes with multiple challenges. Some of those challenges are listed below.
Architecture Modernization benefits.
When we migrate from on-premises Trino to Trino on Azure HDInsight on AKS, we gain the following benefits.
Fast service deployment - Like any Azure first-party service, an Azure HDInsight on AKS Trino cluster is created in a few clicks. To deploy, we follow the simple two-step process described below.
Step 1 - Go to the Azure portal (Home - Microsoft Azure), search for ‘Azure HDInsight on AKS cluster pools (preview)’, and then click ‘+Create’ to create and deploy an ‘Azure HDInsight on AKS cluster pool’.
Step 2 - Click ‘+New cluster’ to create a Trino cluster.
Learn more about portal-based deployment here: Create a Trino cluster - Azure portal - Azure HDInsight Preview Documentation | Microsoft Learn
Enterprise-scale security and VNet support -
HDInsight on AKS Trino provides multi-layer security out of the box as part of its built-in security offering for enterprise customers.
Learn more about enterprise security here: Security in HDInsight on AKS - Azure HDInsight Preview Documentation | Microsoft Learn
Data infra monitoring with Prometheus, Grafana, Workbook, Azure Monitor - Microsoft provides Grafana (a visualization tool) and Prometheus (a cloud-native monitoring tool) as Azure managed service offerings. We can configure a Microsoft managed Grafana instance and Microsoft managed Prometheus, and then enable monitoring of HDInsight on AKS with a simple checkbox, as shown in the following screenshot.
Fig 1.4 One click Microsoft Managed Prometheus, Grafana integration.
Azure Workbooks provides prebuilt and custom-built canvases for data analysis. An HDInsight on AKS Trino cluster supports native Workbook integration, as shown in the following screenshot.
Fig 1.5 HDInsight AKS Trino Workbook gallery
Native integration of Azure HDInsight on AKS with the Azure Monitor service lets us explore and interact with logs and monitor workloads seamlessly. It also makes it much easier to set up alerts based on query thresholds.
Federated SQL monitoring with Trino UI - Trino provides a native UI to monitor the queries triggered in the Trino cluster. To access the Trino query logs, we perform the following two steps.
Step 1 - Go to the HDInsight Trino cluster and click ‘Trino UI’, as shown in the following screenshot.
Fig 1.6 Trino UI Dashboards.
Step 2 - After we click ‘Trino UI’, a Trino UI monitoring dashboard with query details appears, as shown in the following screenshot. The query details on this canvas can be used for monitoring and for a deep dive into individual queries.
Fig 1.7 Trino UI Dashboards
Elasticity - manual scale, auto scale and delegated container management -
HDInsight on AKS supports both manual scaling and auto scale. To leverage cloud elasticity, we go to the portal and either manually drag the ‘Number of Worker nodes’ control or turn on the ‘Auto scale’ toggle in the HDInsight on AKS user interface. The underlying cluster pod management, shown in the next screenshot, takes care of HDInsight on AKS elasticity in a highly performant manner.
Fig 1.8 HDInsight on AKS implementation architecture
Ease of service configuration management - There are multiple Trino configurations that we need for application development and tuning. Some are listed below.
The HDInsight cluster provides an intuitive user interface for Trino service-related configurations. Please refer to the sample Trino configuration here: hdionaksresources.blob.core.windows.net/trino/samples/arm/arm-trino-config-sample.json.
Fig 1.9 Trino configuration Management
For example, we can enable query caching (for better query performance in certain scenarios) by adding new parameters, as shown in the following screenshot.
Fig 1.10 Trino query caching enablement.
Ease of migration - Since the underlying open-source tooling (in this case Trino) is compatible with the on-premises deployment, modernizing Trino on Azure HDInsight will not be a complete disruption for the organization. Although this is not a pure lift-and-shift migration, it does not require a complete rewrite of the application either.
TCO benefits - PaaS benefits such as a choice of SKUs, a subscription-based pay-per-use model, scalability and ease of deployment together bring a cost advantage to this modernization.
After migrating Trino to HDInsight on AKS, our conceptual to-be architecture is depicted in the following diagram.
Fig 1.11 Trino Modernization with Azure HDInsight AKS.
How to connect to an HDI AKS Trino cluster
Once we deploy an HDInsight on AKS Trino cluster, we need to connect to it. Subsequently we need to connect it to multiple data sources to build a data federation framework for further development. Broadly, this is a three-step process.
Step 1 - To begin development, we first fetch the Trino cluster endpoint from the portal, as shown in the following screenshot.
Fig 1.12 Trino Cluster Endpoint.
Step 2 - Next, we connect to the Trino cluster and access the environment. HDInsight on AKS Trino can be accessed via the following mechanisms.
To access the Trino cluster via the CLI, please follow the prerequisite documentation here: Trino CLI - Azure HDInsight Preview Documentation | Microsoft Learn
Once the prerequisites are installed, we can start the Trino CLI from a command prompt and run a sample command like ‘show catalogs’, as shown below.
trino-cli --server <cluster_endpoint>
trino-cli --server debhditrino.xxxxxxxxx.eastus.hdinsightaks.net
Fig 1.13 Trino CLI query
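For scripted use, the same CLI call can be assembled programmatically. The sketch below only builds the argument list and does not run anything. It assumes that the HDInsight on AKS `trino-cli` accepts the open-source Trino CLI's `--execute` option for running a single statement non-interactively; the helper name `trino_cli_command` is hypothetical, and `<cluster_endpoint>` is left as a placeholder.

```python
import shlex


def trino_cli_command(endpoint, statement=None):
    """Build the argv for trino-cli; add --execute only when a statement is given.

    Assumption: --execute behaves as in the open-source Trino CLI.
    """
    argv = ["trino-cli", "--server", endpoint]
    if statement is not None:
        argv += ["--execute", statement]
    return argv


CMD = trino_cli_command("<cluster_endpoint>", "show catalogs")
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(CMD))
```

Passing the list form to `subprocess.run` avoids shell quoting issues with statements that contain spaces or quotes.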
Likewise, we can use DBeaver, one of the most popular tools, with the Trino JDBC connector for ad hoc querying. As a prerequisite, we need to download DBeaver from here: Download | DBeaver Community. Then follow the configuration prerequisite steps here: Trino with DBeaver.
After we set up the Azure Trino JDBC driver, we can test connectivity by running the following query (it checks metadata).
select * from tpcds.information_schema.columns
The following screenshot shows the result of the previous query, which explores the catalog metadata.
Fig 1.14 Trino query using the DBeaver tool.
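A JDBC client such as DBeaver ultimately needs a connection URL. As a small illustration, the helper below assembles one in the open-source Trino JDBC form `jdbc:trino://host:port/catalog/schema`. The default port 443 and the helper name `trino_jdbc_url` are assumptions for this sketch, and the Azure-specific driver and authentication settings described in the linked DBeaver guide are not covered here.

```python
def trino_jdbc_url(host, port=443, catalog=None, schema=None):
    """Assemble a Trino JDBC URL; catalog and schema are optional path parts."""
    if schema and not catalog:
        raise ValueError("a schema can only be given together with a catalog")
    url = f"jdbc:trino://{host}:{port}"
    if catalog:
        url += f"/{catalog}"
    if schema:
        url += f"/{schema}"
    return url


# <cluster_endpoint> is the placeholder for the endpoint copied from the portal.
URL = trino_jdbc_url("<cluster_endpoint>", catalog="tpcds",
                     schema="information_schema")
print(URL)
```

Setting the catalog and schema in the URL makes them the session defaults, so the metadata query above could then refer to `columns` without the `tpcds.information_schema` prefix.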
Step 3 - Next, we use Trino connectors to connect to external data sources. The following connectors are supported today: Trino connectors - Azure HDInsight Preview Documentation | Microsoft Learn
For example, to connect to AWS S3 via an ARM template, detailed documentation is available here: Query data from AWS S3 and with Glue - Azure HDInsight on AKS | Microsoft Learn
Likewise, to configure Delta Lake we can use this document: Configure Delta Lake catalog - Azure HDInsight Preview Documentation | Microsoft Learn
Further reading
To learn more about the other HDInsight on AKS services, see Azure HDInsight on AKS - Azure HDInsight Preview Documentation | Microsoft Learn. The conceptual anatomy of the latest HDInsight on AKS offerings is shown in the following diagram.
Fig 1.15 HDInsight on AKS current offerings
We can refer to the following resources: