As businesses continue to generate massive amounts of data, the need for efficient data management solutions becomes increasingly important. This is where a data lakehouse comes in: a hybrid solution that combines the best features of a data lake and a data warehouse.
This article is Part 1 in the series. Here we will explore how to build a data lakehouse using Azure Data Explorer (ADX), where data flows from an Azure SQL DB via Change Data Capture (CDC) through Azure Data Factory, and events flow in from Azure Event Hubs.
In Part 2 we will show you how to deploy this solution using Bicep, a powerful infrastructure as code (IaC) tool from Microsoft. With this guide, you'll be able to create a data lakehouse that can handle large volumes of data and provide valuable insights for your business.
For the actual code and deployment how-to guide, see Part 2:
Part 2 - Building a Data Lakehouse using Azure Data Explorer - the deployment
The term Data Lakehouse was coined by Databricks in a 2021 article. It describes an open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management, data mutability, and performance of data warehouses.
A medallion architecture (also coined by Databricks) is a data design pattern used to logically organize data.
The goal is to incrementally improve the structure and quality of data as it flows through each layer of the architecture.
Medallion architectures are sometimes also referred to as "multi-hop" architectures.
Creating a multi-layer data platform allows companies to improve data quality across the layers while serving their business needs.
Unstructured and raw data is ingested using scalable pipelines and refined to output the highest-quality enriched data.
It contains three basic layers:
Bronze layer: raw data, ingested from the source systems as-is.
Silver layer: cleansed, filtered, and enriched data.
Gold layer: aggregated, business-level data ready for analytics and reporting.
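In ADX, the layers are typically just groups of tables in a single database; this article uses a table-name prefix per layer (bronze*, silver*, gold*). As a minimal sketch, assuming that naming convention, you can list the layer tables with:
.show tables
| where TableName startswith "bronze" or TableName startswith "silver" or TableName startswith "gold"
| project TableName, Folder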
CDC (Change Data Capture) is the process of capturing changes made to data in a database and delivering them to a downstream system. It is commonly used for data replication.
We will use CDC to capture changes made to an Azure SQL database and store those changes in Azure Data Explorer.
Azure Data Explorer is a fast and highly scalable data service for analytics.
The Azure Data Explorer analytics runtime is optimized for efficient analytics using powerful indexing technology to automatically index free-text and semi-structured data.
Azure Data Explorer is a fully managed, high-performance, big data analytics platform that makes it easy to analyze high volumes of data in near real time. The Azure Data Explorer toolbox gives you an end-to-end solution for data ingestion, query, visualization, and management.
By analyzing structured, semi-structured, and unstructured data across time series, and by using Machine Learning, Azure Data Explorer makes it simple to extract key insights, spot patterns and trends, and create forecasting models. Azure Data Explorer is scalable, secure, robust, and enterprise-ready, and is useful for log analytics, time series analytics, IoT, and general-purpose exploratory analytics.
The reason to use ADX for a Business Analytics Platform is that we get the best of both worlds: Data lakes and Data warehouses.
| | Data Lake | Data Warehouse | Azure Data Explorer |
| --- | --- | --- | --- |
| Suitable for raw data | Y | N | Y |
| Suitable for BI | N | Y | Y |
| Long data retention | Y | N | Y |
| Suitable for DS | Y | N | Y |
| Cost | Low | High | Low |
We will ingest events from an Event Hub into ADX using one-click ingestion.
CDC data will be written into ADX by Azure Data Factory using periodically triggered ETL pipelines.
To build the Medallion Architecture in Azure Data Explorer, data needs to be transformed and copied between the layers (Bronze->Silver->Gold).
Our example uses two tables from the Azure SQL DB, Products and Order Details, together with click events flowing in from the Event Hub. Each click event is a JSON document of the following form:
{
"date": "2023-04-03 13:58:11",
"clickedProductId": 836,
"browser": "Safari",
"browserVersion": "4.5",
"ip": "202.254.96.230",
"device": "computer",
"source": "facebook"
}
All changes made to the tables in the Azure SQL DB are periodically copied to ADX using Azure Data Factory CDC.
All events in the event hub are ingested into ADX using ADX built-in ingestion.
.create table [bronzeProducts] (ProductID:int,Name:string,ProductNumber:string,Color:string,StandardCost:real,ListPrice:real,Size:string,Weight:string,ProductCategoryID:long,ProductModelID:long,rowguid:guid,ModifiedDate:datetime, OrderDate:datetime)
.create table [bronzeOrders] (SalesOrderID:long,SalesOrderDetailID:long,OrderQty:long,ProductID:int,UnitPrice:real,UnitPriceDiscount:real,LineTotal:real,ModifiedDate:datetime)
.create table [bronzeClicks] (['date']:datetime,ProductID:int,browser:string,browserVersion:string,ip:string,device:string,source:string)
Note that the mapping renames the incoming clickedProductId field to the ProductID column:
.create table bronzeClicks ingestion json mapping "bronzeClicks_mapping"
'[{"column":"date","path":"$[\'date\']","datatype":"","transform":null},{"column":"ProductID","path":"$[\'clickedProductId\']","datatype":"","transform":null},{"column":"browser","path":"$[\'browser\']","datatype":"","transform":null},{"column":"browserVersion","path":"$[\'browserVersion\']","datatype":"","transform":null},{"column":"ip","path":"$[\'ip\']","datatype":"","transform":null},{"column":"device","path":"$[\'device\']","datatype":"","transform":null},{"column":"source","path":"$[\'source\']","datatype":"","transform":null}]'
.create table [silverProducts] (ProductID:int,Name:string,ProductNumber:string,Color:string,StandardCost:real,ListPrice:real,Size:string,Weight:string,ProductCategoryID:long,ProductModelID:long,rowguid:guid,ModifiedDate:datetime,OrderDate:datetime, WeightCategory:string)
.create table [silverOrders] (SalesOrderID:long,SalesOrderDetailID:long,OrderQty:long,ProductID:int,UnitPrice:real,UnitPriceDiscount:real,LineTotal:real,ModifiedDate:datetime, UnitFinalPrice:real, PurchaseDate:datetime)
.create table [silverClicks] (['date']:datetime,ProductID:int,browser:string,browserVersion:string,ip:string,device:string,source:string)
For the sake of the example, we will create a silver layer table for the products and add a column called WeightCategory with four possible values: S, M, L, and U (unknown).
The weight category will be calculated as follows:
S: weight <= 100
M: 100 < weight <= 1000
L: weight > 1000
U: weight unknown
We will use an update policy to copy the data between the bronze and silver layers. First, we create a function that performs the transformation, and then we attach it to the silver table as an update policy.
.create function
with (docstring = 'Ingest raw products data and calculate product weight category', folder='ingestprojection')
ParseProducts ()
{
bronzeProducts
| extend WeightCategory = case(todouble(Weight) <= 100, "S",
todouble(Weight) <= 1000 and todouble(Weight) > 100 , "M",
todouble(Weight) > 1000, "L",
"U")
}
.alter table
silverProducts
policy update @'[{"Source": "bronzeProducts", "Query": "ParseProducts", "IsEnabled" : true, "IsTransactional": true }]'
For the sake of the example, we will create a silver layer table for the orders and add two columns:
UnitFinalPrice: the final unit price after discount.
PurchaseDate: the purchase date without the time component.
.create function
with (docstring = 'Ingest raw orders data and calculate the final unit price', folder='ingestprojection')
ParseOrders ()
{
bronzeOrders
| extend UnitFinalPrice = round(UnitPrice * (1 - UnitPriceDiscount), 2)
// ModifiedDate is a datetime, so truncate it to the start of the day
| extend PurchaseDate = startofday(ModifiedDate)
}
.alter table
silverOrders
policy update @'[{"Source": "bronzeOrders", "Query": "ParseOrders", "IsEnabled" : true, "IsTransactional": true }]'
For the clicks table we will copy it “as-is” to the silver layer.
.create function
with (docstring = 'Ingest raw clicks', folder='ingestprojection')
ParseClicks ()
{
bronzeClicks
}
.alter table
silverClicks
policy update @'[{"Source": "bronzeClicks", "Query": "ParseClicks", "IsEnabled" : true, "IsTransactional": true }]'
CDC data includes all the changes from the operational database, so every update shows up as a new row in ADX. To get the current state of an item, we create materialized views that keep only the latest row per key; for orders the key is the order line (SalesOrderID plus SalesOrderDetailID), and for products it is the ProductID. We will create materialized views showing the latest state of the orders and products tables.
.create materialized-view silverOrdersLatest on table silverOrders
{
silverOrders
| summarize arg_max(ModifiedDate, *) by SalesOrderID, SalesOrderDetailID
}
.create materialized-view silverProductsLatest on table silverProducts
{
silverProducts
| summarize arg_max(ModifiedDate, *) by ProductID
}
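To see the deduplication in action, pick one product and compare its full CDC history with what the view returns (the product ID here is just an illustrative value):
// Full change history for one product, newest first
silverProducts
| where ProductID == 836
| project ProductID, ListPrice, WeightCategory, ModifiedDate
| order by ModifiedDate desc
silverProductsLatest returns only the top row of this history.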
In the gold layer we build daily aggregations on top of the latest views. The two queries below, one for daily sales and one for daily clicks, can be saved as stored functions and used in ADX dashboards or Power BI.
silverOrdersLatest
| join kind=inner silverProductsLatest on $left.ProductID == $right.ProductID
| summarize DailySales = sum(LineTotal) by PurchaseDate, ProductID, Name
silverClicks
| join kind=inner silverProductsLatest on $left.ProductID == $right.ProductID
| summarize DailyClicks = count() by Day = startofday(['date']), ProductID, browser, browserVersion, device, source
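For a quick look inside the ADX web UI, the sales aggregation can also be rendered directly as a chart; in Power BI you would instead connect to the cluster and use these queries as the dataset:
silverOrdersLatest
| join kind=inner silverProductsLatest on $left.ProductID == $right.ProductID
| summarize DailySales = sum(LineTotal) by PurchaseDate
| render timechart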
That’s it, we are done.
I hope you enjoyed reading this.
For the actual deployment how-to guide, see Part 2:
Part 2 - Building a Data Lakehouse using Azure Data Explorer - the deployment
Thanks
Denise