Ingestion, ETL, and Stream Processing with Azure Databricks
Published Nov 30, 2020

This post is part of a multi-part series titled "Patterns with Azure Databricks".  Each highlighted pattern holds true to the key principles of building a Lakehouse architecture with Azure Databricks:

 

[Figure: Key principles of a Lakehouse architecture with Azure Databricks]

 

 

  1. A Data Lake to store all data, with a curated layer in an open-source format.  The format should support ACID transactions for reliability and should also be optimized for efficient queries.  This is done with Azure Data Lake Storage plus Delta Lake.
  2. A foundational compute layer built on open standards.  The foundational compute layer should support the core use cases for the Data Lake, including data lake curation (ETL and stream processing), data science and ML, and SQL analytics on the data lake.  Azure Databricks provides these capabilities using open standards that ensure rapid innovation, avoid lock-in, and remain future-proof.
  3. Easy integration for additional and/or new use cases.  No single service can do everything.  There will always be new or additional use cases that aren't core to the Lakehouse.  This is where easy integrations between the core Lakehouse services and other Azure data services and tools ensure that any analytics use case can be tackled.

Pattern for Ingestion, ETL, and Stream Processing

Companies need to ingest data in any format, of any size, and at any speed into the cloud in a consistent and repeatable way. Once that data is ingested into the cloud, it needs to be moved into the open, curated data lake, where it can be processed further for high-value use cases such as SQL analytics, BI, reporting, and data science and machine learning.

 

[Figure: Pattern for ingestion, ETL, and stream processing into a curated data lake]

 

The diagram above demonstrates a common pattern used by many companies to ingest and process data of all types, sizes, and speeds into a curated data lake.  Let's look at the three major components of the pattern:

 

  1. There are several great tools in Azure for ingesting raw data from external sources into the cloud.  Azure Data Factory provides the standard for importing data on a schedule or trigger from almost any data source and landing it in its raw format in Azure Data Lake Storage/Blob Storage.  Other services such as Azure IoT Hub and Azure Event Hubs provide fully managed services for real-time ingestion.  Using a mix of Azure Data Factory and Azure IoT/Event Hubs should allow a company to get data of just about any type, size, and speed into Azure (a minimal event-producer sketch follows the diagram below).

    [Figure: Ingesting raw data into Azure]
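
    As a minimal, hedged sketch of the real-time ingestion side of this step, the snippet below sends a small JSON event to Azure Event Hubs using the azure-eventhub Python SDK.  The connection string, hub name, and payload fields are placeholders rather than values from this post.

        import json
        from azure.eventhub import EventHubProducerClient, EventData

        # Connect to the Event Hub (connection string and hub name are placeholders).
        producer = EventHubProducerClient.from_connection_string(
            conn_str="<EVENT_HUBS_CONNECTION_STRING>",
            eventhub_name="<EVENT_HUB_NAME>",
        )

        # Send one small telemetry event as a JSON payload.
        with producer:
            batch = producer.create_batch()
            batch.add(EventData(json.dumps({"deviceId": "sensor-01", "temperature": 21.5})))
            producer.send_batch(batch)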

  2. After landing the raw data in Azure, companies typically move it into the raw, or Bronze, layer of the curated data lake.  This usually means taking the data in its raw, source format and converting it to the open, transactional Delta Lake format, where it can be more efficiently and reliably queried and processed.  Ingesting the data into the Bronze curated layer can be done in a number of ways, including the following (hedged code sketches for the first three options appear after the list):
     

     

    [Figure: Ingesting raw data into the Bronze layer in Delta Lake format]

     
    1. Basic, open Apache Spark APIs in Azure Databricks for reading streaming events from Event/IoT Hubs and then writing those events or raw files to the Delta Lake format.
    2. The COPY INTO command to easily copy data from a source file/directory directly into Delta Lake.
    3. The Azure Databricks Auto Loader to efficiently grab files as they arrive in the data lake and write them to the Delta Lake format.
    4. The Azure Data Factory Copy Activity, which can copy data from any of its supported source formats into the Delta Lake format.
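
    For option 1, the sketch below uses Spark Structured Streaming with the open-source azure-eventhubs-spark connector to read events from Event Hubs and append them to a Bronze path in Delta Lake format.  The connection string and storage paths are placeholders, the connector library is assumed to be installed on the cluster, and spark/sc are provided by the Databricks notebook environment.

        # Read a stream of raw events from Azure Event Hubs.
        conn_str = "<EVENT_HUBS_CONNECTION_STRING>"  # placeholder
        eh_conf = {
            # The connector expects an encrypted connection string.
            "eventhubs.connectionString":
                sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
        }

        raw_events = (
            spark.readStream
                 .format("eventhubs")
                 .options(**eh_conf)
                 .load()
        )

        # Append the raw events to the Bronze layer in Delta Lake format.
        (raw_events.writeStream
                   .format("delta")
                   .option("checkpointLocation", "/mnt/datalake/bronze/events/_checkpoint")
                   .start("/mnt/datalake/bronze/events"))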
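
    For option 2, COPY INTO loads files from a landing folder into a Bronze Delta table with a single SQL command.  The table name and path below are hypothetical, and the target Delta table is assumed to already exist.

        # Load new JSON files from the landing folder into an existing Bronze Delta table.
        spark.sql("""
          COPY INTO bronze.sales_orders
          FROM '/mnt/datalake/landing/sales_orders/'
          FILEFORMAT = JSON
        """)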
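
    For option 3, Auto Loader (the cloudFiles source) incrementally picks up new files as they arrive in the landing folder and appends them to the Bronze layer.  The schema, paths, and column names are hypothetical.

        from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

        # Expected shape of the incoming JSON files (hypothetical).
        schema = StructType([
            StructField("order_id", StringType()),
            StructField("amount", DoubleType()),
            StructField("order_ts", TimestampType()),
        ])

        # Incrementally discover and read new files as they land.
        incoming = (
            spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .schema(schema)
                 .load("/mnt/datalake/landing/orders/")
        )

        # Append the new records to the Bronze Delta Lake path.
        (incoming.writeStream
                 .format("delta")
                 .option("checkpointLocation", "/mnt/datalake/bronze/orders/_checkpoint")
                 .start("/mnt/datalake/bronze/orders"))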
       

       

  3. After the raw data has been ingested to the Bronze layer, companies perform additional ETL and stream processing tasks to filter, clean, transform, join, and aggregate the data into more curated Silver and Gold datasets.  Using Azure Databricks as the foundational service for these processing tasks provides companies with a single, consistent compute engine (the Delta Engine) built on open standards, with support for programming languages they are already familiar with (SQL, Python, R, Scala).  It also provides them with repeatable DevOps processes and ephemeral compute clusters sized to their individual workloads.  A minimal Bronze-to-Silver-to-Gold sketch follows the diagram below.

    [Figure: ETL and stream processing from Bronze to Silver and Gold datasets]
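
    As a minimal, hedged sketch of this step, the batch job below cleans Bronze order data into a Silver table and aggregates it into a Gold table.  The paths and column names are hypothetical and would differ for each workload.

        from pyspark.sql import functions as F

        # Read the raw Bronze data.
        bronze = spark.read.format("delta").load("/mnt/datalake/bronze/orders")

        # Silver: drop bad records, deduplicate, and derive a clean date column.
        silver = (
            bronze.filter(F.col("order_id").isNotNull())
                  .dropDuplicates(["order_id"])
                  .withColumn("order_date", F.to_date("order_ts"))
        )
        silver.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/orders")

        # Gold: aggregate Silver into a business-level daily revenue table.
        gold = (
            silver.groupBy("order_date")
                  .agg(F.sum("amount").alias("daily_revenue"),
                       F.countDistinct("order_id").alias("order_count"))
        )
        gold.write.format("delta").mode("overwrite").save("/mnt/datalake/gold/daily_revenue")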

     

The ingestion, ETL, and stream processing pattern discussed above has been used successfully by many companies across many different industries and verticals.  It also holds true to the key principles discussed for building a Lakehouse architecture with Azure Databricks: 1) using an open, curated data lake for all data (Delta Lake), 2) using a foundational compute layer built on open standards for the core ETL and stream processing (Azure Databricks), and 3) using easy integrations with other services, like Azure Data Factory and IoT/Event Hubs, that specialize in ingesting data into the cloud.

 

If you are interested in learning more about Azure Databricks, attend an event and check back soon for additional blogs in the "Patterns with Azure Databricks" series.
