Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. An HDInsight Apache Spark cluster provides a parallel processing framework that supports in-memory processing and is based on open-source Apache Spark.
Apache Spark is evolving rapidly; its efficiency and ease of use make it a preferred tool among big data engineers and data scientists. A few essential features are still missing from Spark, however, one of them being ACID (Atomicity, Consistency, Isolation, Durability) transactions. Most databases support ACID out of the box, but at the storage layer (ADLS Gen2) it is hard to provide a comparable level of ACID guarantees.
Delta Lake is a storage layer that brings ACID transactions to Apache Spark and big data workloads, for both streaming and batch operations. Delta Lake uses versioned Parquet files to store your data in cloud storage. In addition to the data versions, Delta Lake stores a transaction log that keeps track of all the commits made to the table or blob store directory, which is what provides ACID transactions.
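To make the versioning concrete, here is a minimal sketch of how commits and time travel look from Spark (the storage path and data are illustrative, and the session is assumed to already be configured for Delta Lake as described later in this post):

```scala
import org.apache.spark.sql.SparkSession

// Assumes a SparkSession already configured for Delta Lake (see the
// configuration section below); the path is purely illustrative.
val spark: SparkSession = SparkSession.builder().getOrCreate()
val path = "abfs://container@account.dfs.core.windows.net/delta/events"

// Version 0: the first write creates Parquet files plus a _delta_log directory.
spark.range(0, 5).write.format("delta").save(path)

// Version 1: an overwrite is recorded as a new commit in the transaction log.
spark.range(5, 10).write.format("delta").mode("overwrite").save(path)

// Time travel: read an earlier version back from the versioned Parquet files.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```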
This blog is not about Delta Lake itself; rather, we will talk about how you can leverage Delta Lake with an HDInsight Spark cluster, with a few code snippets and the required configurations.
Before we jump into code and the required configurations, check your Spark version from the Ambari user interface of your HDI cluster. You need to pick the right Delta Lake version based on your cluster's Spark version. The following table lists Delta Lake versions and their compatible Apache Spark versions:
| HDI Version | Spark Version | Delta Lake Version | API URL |
|---|---|---|---|
| HDI 4.0 | Spark 2.4.4 | 0.6.x | |
| HDI 5.0 | Spark 3.1.2 | 1.0.x | |
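If you prefer a shell over Ambari, the running Spark version can also be confirmed directly from spark-shell on a cluster head node (a quick check, nothing HDInsight-specific):

```scala
// Inside spark-shell: prints, for example, 2.4.4 or 3.1.2.
println(spark.version)
```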
Before we jump into the code, we need to look at the extensibility configuration properties that Spark provides.
When you want to use Delta Lake with Spark 3.x on HDI 5.0, you need to configure the SQL extensions and the Delta Lake catalog with the following values:
| Configuration Property | Delta Lake Value | Description |
|---|---|---|
| spark.sql.extensions | io.delta.sql.DeltaSparkSessionExtension | An extension for Spark SQL that activates the Delta SQL parser to support Delta SQL grammar. |
| spark.sql.catalog.spark_catalog | org.apache.spark.sql.delta.catalog.DeltaCatalog | Replaces Spark's default catalog with Delta Lake's DeltaCatalog. |
The above configurations must be provided as part of the Spark configuration before any Spark session is created. In addition to these Spark configurations, the Spark application uber jar should include the Delta Lake dependency.
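As a minimal sketch of what this looks like in application code (the application name is illustrative; the two property values are the ones from the table above), the settings go on the SparkSession builder so they are in place before the session is created:

```scala
import org.apache.spark.sql.SparkSession

// Both properties must be set before the session is created, either here
// or via --conf on spark-submit.
val spark = SparkSession.builder()
  .appName("hdi-delta-example")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// With the extension and catalog active, Delta SQL grammar is available.
spark.sql("CREATE TABLE IF NOT EXISTS demo (id BIGINT) USING DELTA")
```

Alternatively, the same two properties can be passed with `--conf` on spark-submit, which keeps the application code unchanged.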
When working with Spark 2.4.x on HDI 4.0, we only need to supply the Delta Lake dependency; no additional Spark configurations are required. To avoid class-loading conflicts caused by duplicate classes on the cluster classpath, we need to use the maven-shade-plugin to create an uber jar with the Jackson dependencies shaded (relocated).
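The GitHub example uses the maven-shade-plugin for this; if your build uses sbt instead, a roughly equivalent relocation can be expressed with the sbt-assembly plugin. This is an illustrative alternative, not what the example project ships with, and the target package name `shaded.jackson` is an assumption:

```scala
// build.sbt fragment (requires the sbt-assembly plugin).
// Relocate the bundled Jackson classes so they cannot clash with the
// Jackson version already present on the HDInsight cluster classpath.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shaded.jackson.@1").inAll
)
```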
You can clone the example code from GitHub; the code is written in Scala. You can run the example code using any of the options described in the repository.
The example application generates stdout logs and Delta Lake Parquet files with commit logs. Example output is listed on GitHub.
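If you want to inspect those commit logs yourself, the table history can be read back through the DeltaTable API (a sketch; the path is illustrative and this is not the exact output of the example application):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Each row of history() corresponds to one commit recorded in the
// table's _delta_log directory.
val history = DeltaTable
  .forPath(spark, "abfs://container@account.dfs.core.windows.net/delta/events")
  .history()
history.select("version", "timestamp", "operation").show(false)
```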
Delta Lake is an open-source storage framework that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs. Since an HDInsight Spark cluster is an installation of the Apache Spark library on an HDInsight Hadoop cluster, users can pick a compatible Delta Lake version and take advantage of Delta Lake on HDInsight.