This post was authored by Tahir Fayyaz, a Senior Partner Solutions Architect at Databricks.
Microsoft’s Power BI is a very popular business intelligence platform for creating and sharing visualizations, reports, and dashboards on top of the Azure Databricks Lakehouse using Delta Lake and Databricks SQL, a record-breaking and extremely fast SQL processing engine that can be used just like other cloud data warehouses.
To get the optimal performance from Power BI it is recommended to use a star schema data model and to make use of user-defined aggregated tables. However, as you build out your facts, dimensions, and aggregation tables and views in Delta Lake, ready to be used by the Power BI data model, it can become complicated to manage all the pipelines, dependencies, and data quality as you need to consider the following:
- How to easily develop and manage the data model’s transformation code.
- How to run and scale data pipelines for the model as data volumes grow.
- How to keep all the Delta Lake tables updated as new data arrives.
- How to view the lineage for all tables as the model gets more complex.
- How to actively stop data quality issues that result in incorrect reports.
Delta Live Tables
Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to building reliable data pipelines. DLT automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data.
To help with all of these challenges you can use DLT to develop, model, and manage the transformations, pipelines, and Delta Lake tables that will be used by Databricks SQL and Power BI.
Using Delta Live Tables offers the following benefits:
- Declarative APIs to easily build your transformations and aggregations using SQL or Python
- Support for streaming to provide fresh and up-to-date data by only processing new data
- Have confidence in your data with built-in data quality testing and monitoring
- Automatically generated lineage graph between all tables
- Automated management and auto-scaling of the clusters to run the pipeline
Developing your data model in Databricks and Power BI
In this guide, we will show you how to develop a Delta Live Tables pipeline to create, transform and update your Delta Lake tables and then build the matching data model in Power BI by connecting with a Databricks SQL endpoint.
All the code for this demo is available in the Azure Databricks Essentials retail demo repo and it makes use of a TPCH dataset.
Develop the DLT pipeline
To develop the DLT pipeline we have four Databricks notebooks structured in the following way to help you easily develop and share all of your ingestion, transformation and aggregation logic:
- 1-create-landing-tables
- 2-create-fact-tables
- 3-create-dimension-tables
- 4-create-aggregation-tables
If you are new to DLT you can follow the quick start tutorial to get familiar. You can then clone the repo into your Databricks Workspace to get started in developing and deploying the DLT pipeline.
The landing tables make use of Auto Loader as the data source which allows you to only ingest new files landing in storage. The following example shows how the orders landing table is created:
create or refresh streaming live table orders_landing
comment "The landing orders dataset, ingested from /tmp/tahirfayyaz using auto loader."
as select * from cloud_files("/tmp/tahirfayyaz/orders/", "json", map(
"cloudFiles.schemaHints", """
o_orderkey BIGINT,
o_custkey BIGINT,
o_orderstatus STRING,
o_totalprice FLOAT,
o_orderdate DATE,
o_orderpriority STRING,
o_clerk STRING,
o_shippriority INT,
o_comment STRING
"""
))
All downstream tables can then refer to upstream tables using the stream(live.table_name) or live.table_name syntax.
The joined header and line items fact table makes use of stream(live.table_name) to only incrementally process new records in the upstream landing tables. If data is being streamed then the table being written to must also be created as a streaming live table.
create or refresh streaming live table orders_lineitems_fct(
)
comment "Orders and line items joined as fact table"
tblproperties(pipelines.autoOptimize.zOrderCols = "ol_orderdatekey, l_partkey")
as select
o_orderkey as ol_orderkey,
o_custkey as ol_custkey,
o_orderstatus as ol_orderstatus,
-- o_totalprice,
o_orderdate as ol_orderdate,
-- powerbi needs date as yyyymmdd int
cast(date_format(o_orderdate,"yyyyMMdd") as int) as ol_orderdatekey,
o_orderpriority as ol_orderpriority,
o_clerk as ol_clerk,
o_shippriority as ol_shippriority,
o_comment as ol_comment,
l_partkey as ol_partkey,
l_suppkey as ol_suppkey,
l_linenumber as ol_linenumber,
l_quantity as ol_quantity,
l_extendedprice as ol_extendedprice,
l_discount as ol_discount,
l_tax as ol_tax,
l_returnflag as ol_returnflag,
l_linestatus as ol_linestatus,
l_shipdate as ol_shipdate,
l_commitdate as ol_commitdate,
l_receiptdate as ol_receiptdate,
l_shipinstruct as ol_shipinstruct,
l_shipmode as ol_shipmode,
l_comment as ol_line_comment
from stream(live.orders_landing) o
inner join stream(live.lineitem_landing) l
on o.o_orderkey = l.l_orderkey
The dimension and aggregated tables use live.table_name as they use all of the data from the upstream fact and dimension tables and therefore are fully recomputed each time the pipeline runs.
create live table orders_brand_aggregated (
)
comment "Orders Brand Aggregated Table"
as select
a.ol_orderdatekey as ol_orderdatekey,
a.ol_orderdate as orderdate,
b.p_brand as brand,
b.p_type as type,
sum(a.ol_extendedprice) total_price,
sum(a.ol_quantity) total_quantity
from live.orders_lineitems_fct as a
inner join live.part_dim as b
on a.ol_partkey = b.p_partkey
group by 1, 2, 3, 4
This shows the flexibility of DLT where you can mix streaming (incremental) or full recomputes of data based on your requirements.
Define Data Quality Constraints
Data quality constraints in DLT are a great way to ensure only correct data appears in your reports.
We have defined two types of constraints in our pipeline. One to retain but report on the invalid records and another one to drop the invalid records if there is a data quality issue.
Retain
create live table part_dim(
constraint valid_partkey expect (p_partkey is not null)
)
Drop
create live table orders_brand_aggregated (
constraint valid_total_price expect (total_price > 0) on violation drop row
)
There is also a third type that will FAIL if there invalid records that we have not used in our pipeline but this shows how flexible you can be with your data quality constraints.
Deploy the Delta Live Tables pipeline
To deploy the pipeline go to Workflows and click on the Delta Live Tables table. Click on Create Pipeline and then choose the notebooks used to develop the model. Enter the schema Target as retail_tpch_dev and this is where all the Delta Lake tables will be created.
Once the DLT pipeline is created you can start the initial run and after the first run, you will see the status of the pipeline and end-to-end lineage for all the tables along with data quality metrics based on the data quality constraints you defined.
Create Views on Delta Lake Tables for production
When deploying your data model to production and giving access to end-users it is a best practice to create views on the tables that will be used in Power BI as it provides a layer of isolation if any of the underlying databases, tables or columns change to avoid any issues for the production views. You can also use views to create a better logical structure for end-users or even introduce row and column level security and filters.
When creating the views do not use select * as you should define what columns are available in the production tables to avoid any issues with the final data model in Power BI if any of the underlying columns change.
Using a separate Databricks notebook create a database (schema) called retail_tpch_prod and create views for all of the fact, dimension, and aggregation tables.
For example, to create the database and aggregation views, run the following SQL:
create database retail_tpch_prod;
create or replace view retail_tpch_prod.orders_brand_aggregated
as select
ol_orderdatekey,
orderdate,
brand,
type,
total_price,
total_quantity
from tpch_retail_dev.orders_brand_aggregated;
Connect Databricks SQL and Power BI
Now that the data is curated in Delta Lake and ready to be accessed by Power BI you will need to create a Databricks SQL endpoint that will be used to query and process the data and you will need to connect Power BI to the Databricks SQL endpoint using Direct Query mode.
Once connected select all the dimensions, facts, and aggregation views that will be used in your Power BI data model.
Manage Star Schema Data Model in Power BI
In Power BI you can now configure your star schema data model by defining all the table relationships and user-defined aggregations.
User-defined aggregations can be set up using a groupBy column technique, column relationships, or a combination of both.
Your data model is now ready for creating reports and for you to publish the dataset to the Power BI service ready for other users to use.
Delta Live Tables for all your data model management
This example shows how you can use DLT to make data model management easier for your data engineers and analytics engineers. It also allows you to easily deploy your data pipelines and guarantee that data will be accurate and up to date for the end-users in Power BI. You can read more about all of the benefits in the Announcing General Availability of Databricks’ Delta Live Tables (DLT) post.
To get started with DLT you can follow the quick start guide and then use the code in the Azure Databricks Essentials retail demo repo to recreate the demo shared above.
For further optimizations, you can read our guide on Implementing a Star Schema in Databricks With Delta Lake and The Semantic Lakehouse with Azure Databricks and Power BI.