Delta Lake on HDInsight

Microsoft

Nov 17, 2022

Introduction

Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. An HDInsight Apache Spark cluster is a parallel processing framework that supports in-memory processing and is based on open-source Apache Spark.

Apache Spark is evolving; its efficiency and ease of use make it a preferred big data tool among data engineers and data scientists. A few essential features are still missing from Spark, and one of them is ACID (Atomicity, Consistency, Isolation, Durability) transactions. Most databases support ACID out of the box, but it is hard to provide a similar level of ACID guarantees at the storage layer (ADLS Gen2).

Delta Lake is a storage layer that brings ACID transactions to Apache Spark and big data workloads, for both streaming and batch operations. Delta Lake stores your data as versioned Parquet files in your cloud storage. Alongside these versions, Delta Lake keeps a transaction log that tracks every commit made to the table or blob store directory, and this log is what provides ACID transactions.
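To make the idea of versioned files plus a commit log concrete, here is a minimal Scala sketch. It assumes a Spark 3.x cluster with a compatible Delta Lake library on the classpath; the table path, application name, and sample rows are hypothetical and are not taken from the example repository.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object DeltaVersionsSketch {
  def main(args: Array[String]): Unit = {
    // The two Delta settings below apply to Spark 3.x; Spark 2.4.x does not need them.
    val spark = SparkSession.builder()
      .appName("delta-versions-sketch") // hypothetical name
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical location; point this at a path in the cluster's storage account.
    val tablePath = "/tmp/delta/events-sketch"

    // First commit (version 0, assuming the path does not already hold a Delta table).
    Seq((1, "created"), (2, "created")).toDF("id", "status")
      .write.format("delta").mode("overwrite").save(tablePath)

    // Second commit: the overwrite adds a new version instead of destroying the old one.
    Seq((1, "updated"), (2, "updated"), (3, "created")).toDF("id", "status")
      .write.format("delta").mode("overwrite").save(tablePath)

    // Reading without options returns the latest version of the table.
    spark.read.format("delta").load(tablePath).show()

    // Time travel: read the table as it was at version 0.
    spark.read.format("delta").option("versionAsOf", 0).load(tablePath).show()

    // The transaction log is exposed as the table history, one row per commit.
    DeltaTable.forPath(spark, tablePath)
      .history()
      .select("version", "timestamp", "operation")
      .show(truncate = false)

    spark.stop()
  }
}
```

After a run, the table directory should contain the Parquet data files plus a `_delta_log` folder holding one JSON commit file per version.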

This blog is not about Delta Lake itself; it covers how you can use Delta Lake with an HDInsight Spark cluster, along with a few code snippets and the required configurations.

Before we jump into the code and required configurations, check the Spark version of your HDI cluster in the Ambari user interface. You need to pick the Delta Lake version that matches your cluster's Spark version. The following table lists the Delta Lake versions and their compatible Apache Spark versions:

HDI Version	Spark Version	Delta Lake Version	API URL
4.0	Spark 2.4.4	< 0.7.0	0.6.1 API Doc
5.0	Spark 3.1.2	1.0.x	1.0.1 API Doc
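The table can also be encoded as a small lookup, which is handy if an application wants to fail fast on an unexpected cluster. This is only an illustrative sketch: the object and function names are hypothetical, and the version values simply mirror the two rows above (using the API doc release as the representative version for each row).

```scala
object DeltaCompatibility {
  // Mirrors the table above: Spark major.minor line -> Delta Lake release listed for it.
  private val deltaBySparkLine: Map[String, String] = Map(
    "2.4" -> "0.6.1", // HDI 4.0: any Delta Lake release below 0.7.0
    "3.1" -> "1.0.1"  // HDI 5.0: the Delta Lake 1.0.x line
  )

  /** Returns the Delta Lake version listed for the given Spark version, if any. */
  def deltaVersionFor(sparkVersion: String): Option[String] = {
    // Keep only major.minor, so "3.1.2" and longer vendor version strings both map to "3.1".
    val line = sparkVersion.split("\\.").take(2).mkString(".")
    deltaBySparkLine.get(line)
  }

  def main(args: Array[String]): Unit = {
    // In a Spark application or spark-shell, pass spark.version instead of these literals.
    Seq("2.4.4", "3.1.2", "3.3.0").foreach { v =>
      val delta = deltaVersionFor(v).getOrElse("not listed in the table")
      println(s"Spark $v -> Delta Lake $delta")
    }
  }
}
```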

HDInsight - Delta Lake Configuration

Before we jump into code and configurations, we need to look at the following extensibility configurations provided by Spark:

spark.sql.extensions: Configures Spark session extensions by providing the name of the extension class.
spark.sql.catalog.spark_catalog: A plugin configuration used to set a custom catalog implementation. You can find the current catalog implementation through the CatalogManager, using spark.sessionState.catalogManager.currentCatalog. Spark 3.x uses SessionCatalog as the default catalog.

To use Delta Lake with Spark 3.x on HDI 5.0, you need to configure the SQL extensions and the Delta Lake catalog with the following values:

Configuration Property	Delta Lake Value	Description
spark.sql.extensions	io.delta.sql.DeltaSparkSessionExtension	An extension for Spark SQL that activates the Delta SQL parser to support Delta SQL grammar.
spark.sql.catalog.spark_catalog	org.apache.spark.sql.delta.catalog.DeltaCatalog	Replaces Spark’s default catalog with Delta Lake’s DeltaCatalog.

The above configurations need to be provided as part of the Spark configuration before any Spark session is created. Apart from these Spark configurations, the Spark application uber jar should include the Delta Lake dependency.
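As a rough illustration of setting these two properties before the session exists, the sketch below builds a session with them and then runs a few Delta SQL statements that rely on the extension and the catalog. It assumes Spark 3.x with a matching Delta Lake library on the classpath; the application name and table name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DeltaSessionSketch {
  def main(args: Array[String]): Unit = {
    // Both properties must be set before the first session is created: getOrCreate()
    // returns an already running session, if there is one, without applying them.
    val spark = SparkSession.builder()
      .appName("delta-on-hdi-sketch") // hypothetical name
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Print the effective values to confirm the session picked them up.
    println(spark.conf.get("spark.sql.extensions"))
    println(spark.conf.get("spark.sql.catalog.spark_catalog"))

    // Hypothetical table name; creating a Delta table by name goes through the Delta catalog.
    spark.sql("CREATE TABLE IF NOT EXISTS delta_sketch_events (id INT, status STRING) USING delta")
    spark.sql("INSERT INTO delta_sketch_events VALUES (1, 'created')")

    // DESCRIBE HISTORY is part of the Delta SQL grammar enabled by the session extension.
    spark.sql("DESCRIBE HISTORY delta_sketch_events")
      .select("version", "operation")
      .show(truncate = false)

    spark.stop()
  }
}
```

The same two properties can instead be passed on the command line with `--conf` when running spark-submit, which keeps them out of the application code.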

When working with Spark 2.4.x on HDI 4.0, we only need to supply the Delta Lake dependency; no additional Spark configurations are required. To avoid class loading conflicts caused by duplicate classes on the cluster classpath, we need to use the maven-shade-plugin to create an uber jar with the Jackson dependencies.

Example Code

You can clone the example code from GitHub; the code is written in Scala. You can run the example code using any one of these options:

Copy the application jar to the Azure Storage blob associated with the cluster, then either:
1. SSH to the headnode and run spark-submit from there, or
2. Submit the application through the Livy API.
Alternatively, use the Azure Toolkit for IntelliJ.
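For the Livy option, the submission is a POST of a small JSON document to the cluster's `/livy/batches` endpoint. The sketch below is a minimal illustration using only the JDK's HTTP classes; the jar path, class name, and object name are hypothetical placeholders, and the cluster name and credentials are read from the command line so that nothing is contacted unless they are supplied.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
import scala.io.Source

object LivyBatchSubmitSketch {
  /** Builds the JSON body for a Livy batch. Values are inserted as-is,
    * so they must not contain quotes or backslashes. */
  def batchPayload(jarPath: String, className: String): String =
    s"""{"file": "$jarPath", "className": "$className"}"""

  /** Posts the payload to the cluster's Livy endpoint; returns (HTTP status, response body). */
  def submit(clusterName: String, user: String, password: String, payload: String): (Int, String) = {
    val url = new URL(s"https://$clusterName.azurehdinsight.net/livy/batches")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    val credentials = Base64.getEncoder
      .encodeToString(s"$user:$password".getBytes(StandardCharsets.UTF_8))
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setRequestProperty("Authorization", s"Basic $credentials")
    conn.setRequestProperty("X-Requested-By", user)

    val out = conn.getOutputStream
    try out.write(payload.getBytes(StandardCharsets.UTF_8)) finally out.close()

    val status = conn.getResponseCode
    // Error responses are delivered on the error stream, which may be null.
    val stream = if (status >= 400) conn.getErrorStream else conn.getInputStream
    val body =
      if (stream == null) ""
      else {
        val source = Source.fromInputStream(stream, "UTF-8")
        try source.mkString finally source.close()
      }
    (status, body)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical jar location and main class; use the values for your own build.
    val payload = batchPayload("wasbs:///example/jars/delta-example.jar", "com.example.DeltaLakeExample")
    println(payload)

    // Expected arguments: <cluster name> <cluster login user> <password>
    if (args.length == 3) {
      val (status, body) = submit(args(0), args(1), args(2), payload)
      println(s"HTTP $status: $body")
    }
  }
}
```

The response body describes the batch that Livy created, including an id that can be used to poll the same endpoint for the job state.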

The example application generates stdout logs and Delta Lake Parquet files with commit logs. The output examples are listed on GitHub.

Summary

Delta Lake is an open-source storage framework that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs. Since the HDInsight Spark cluster is an installation of the Apache Spark library onto an HDInsight Hadoop cluster, users can pick a compatible Delta Lake version to take advantage of Delta Lake on HDInsight.

Updated Nov 18, 2022

Version 2.0

Analytics on Azure Blog