Metadata Driven Pipelines for Microsoft Fabric
Published Aug 04 2023

Metadata-driven pipelines in Azure Data Factory, Synapse Pipelines, and now Microsoft Fabric give you the ability to ingest and transform data with less code, lower maintenance, and greater scalability than writing code or a separate pipeline for every data source that needs to be ingested and transformed. The key lies in identifying the data loading and transformation pattern(s) for your data sources and destinations, then building the framework to support each pattern.

 

In this blog post, I will provide an overview of a metadata-driven pipeline in Microsoft Fabric that follows the medallion architecture (Bronze, Silver, Gold). The intent is not to provide a full tutorial on building metadata-driven pipelines or on Microsoft Fabric; rather, it is to show you some new features of Fabric and give you some ideas for implementing metadata-driven pipelines in Fabric.

 

Update 10/30/2023: Read this blog first, then check out this GitHub Repo to recreate in your own environment!

 

Metadata driven architecture for Fabric Modern Data Warehouse

 

jehayes_0-1691177546670.png

The goal for this solution is to build a Star Schema in a Microsoft Fabric Lakehouse with Delta Tables, a Power BI Direct Lake Dataset and related reports for end user consumption. The solution contains full or incremental loads to the Bronze Lakehouse, leverages SQL Views as the Silver Layer, then performs full or incremental loads to the Gold Lakehouse.

 

Below are more details on each numbered part in the architecture diagram:

 

1 - Define pipeline configuration tables

Two configuration tables are defined, one for each type of data load: the first drives loads from the source SQL database to the Bronze Fabric Lakehouse, and the second drives loads from the Bronze Fabric Lakehouse to the Gold Lakehouse. Each table contains a row per source/destination combination, with fields such as source table name, source schema, date key, start date, and load type (full or incremental). The tables also contain fields for pipeline run results, such as the number of rows inserted and updated, the load status, and the max table transaction date, which are updated after each table is loaded.
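As a rough illustration, here is what the Source to Bronze configuration table might look like as DDL. The table and column names are hypothetical (the actual schema is shown in the screenshots below), and for this sketch the table is assumed to be a Delta table in a Fabric Lakehouse created from a notebook; it could just as well live in an Azure SQL database or a Fabric Warehouse, wherever your pipelines can query it.

# Hypothetical Source to Bronze configuration table, created from a Fabric notebook.
# All names are illustrative; adjust them to match your own schema and storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS etl_config_source_to_bronze (
        SourceSchema        STRING,     -- schema of the source table, e.g. 'Sales'
        SourceTableName     STRING,     -- table to ingest, e.g. 'Invoices'
        KeyColumn           STRING,     -- primary key used by the Delta merge
        DateColumn          STRING,     -- column used for incremental (watermark) filtering
        StartDate           TIMESTAMP,  -- lower bound for the next incremental load
        LoadType            STRING,     -- 'full' or 'incremental'
        -- run-result fields, updated after each load
        RowsRead            BIGINT,
        RowsInserted        BIGINT,
        RowsUpdated         BIGINT,
        LoadStatus          STRING,
        MaxTransactionDate  TIMESTAMP
    )
""")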

 

The table below shows the configuration table for loading from a Source SQL database to the Bronze Lakehouse:

 

jehayes_0-1691177718511.png

 

 

The table below shows the configuration table for loading from the Bronze Lakehouse to the Gold Lakehouse:

jehayes_1-1691177748680.png

 

2 - Get Configuration details for tables to load from Source to Bronze Lakehouse

Below is what our final Orchestrator pipeline will look like, with the relevant steps from the architecture diagram above indicated:

jehayes_0-1691177819658.png

 

 

The orchestrator pipeline contains a Lookup activity on the Source to Bronze configuration table to get the list of tables to load from the source to the Bronze Lakehouse.

jehayes_1-1691177819688.png
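The query behind this Lookup is simply a select over the configuration table, for example (using the illustrative names from the earlier sketch):

# Hypothetical query for the Source to Bronze Lookup activity; in the ForEach activity that
# follows, the returned rows are typically referenced with an expression such as
# @activity('Lookup1').output.value.
lookup_query = """
    SELECT SourceSchema, SourceTableName, KeyColumn, DateColumn, StartDate, LoadType
    FROM   etl_config_source_to_bronze
"""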

 

3 - Call child pipeline to load data from Source to Bronze Lakehouse

For each table returned by the Lookup activity, call a child pipeline to load the data from the source to the Bronze Lakehouse, passing in the configuration detail from the lookup.

 

For Each activity:

jehayes_0-1691177900809.png

 

 

Child pipeline to load from Source to Bronze Lakehouse:

jehayes_1-1691177900821.png

 

4 - Copy Data from Source to Bronze Lakehouse

This pipeline includes a step that sets a variable called datepredicate. A date-based selection predicate is needed for incremental loads from the source, or when you want to load just a subset of the data. Building it up front simplifies the construction of the SQL source query string in the subsequent Copy Data activity.

jehayes_0-1691178073251.png
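As a rough sketch of the logic, the Python below builds the same kind of predicate and source query from a single configuration row. The field names are the illustrative ones used earlier, and in the actual pipeline this is expressed with Set Variable and Copy Data expressions rather than Python.

# Hypothetical sketch: build the date predicate and the SQL source query from one configuration row.
def build_source_query(cfg: dict) -> str:
    base_query = f"SELECT * FROM {cfg['SourceSchema']}.{cfg['SourceTableName']}"
    if cfg["LoadType"] == "incremental":
        # Select only rows at or after the stored watermark (StartDate).
        date_predicate = f" WHERE {cfg['DateColumn']} >= '{cfg['StartDate']}'"
        return base_query + date_predicate
    return base_query  # full load: no predicate

print(build_source_query({
    "SourceSchema": "Sales", "SourceTableName": "Invoices",
    "KeyColumn": "InvoiceID", "DateColumn": "LastEditedWhen",
    "StartDate": "2023-08-01 00:00:00", "LoadType": "incremental",
}))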

 

If the load type setting from the configuration table is a full load, do a Copy Data Activity from the Source to Bronze Lakehouse Delta Lake Table.

 

Full load Copy Data Source settings:

jehayes_1-1691178073286.png

 

Full Load Copy Data Destination Settings:

jehayes_0-1691178377639.png

If the load type setting from the configuration table is an incremental load, do a Copy Data Activity from the Source to the Bronze Lakehouse, but set the destination as a Parquet file.

 

Incremental load Copy Data source settings:

jehayes_0-1691178410971.png

 

Incremental load Copy Data destination settings:

jehayes_1-1691178423102.png

 

5 - Call Notebook for incremental load merge

For incremental loads only, call a Spark Notebook to merge the incremental data to the Bronze Delta Lake table.

jehayes_0-1691178464374.png

 

Create or Merge to Deltalake Notebook code below:

from delta.tables import *
from pyspark.sql.functions import *

lakehousePath = "abfss://yourpathhere"
tableName = "Invoices"
tableKey = "InvoiceID"
tableKey2 = None
dateColumn = "LastEditedWhen"
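# Note: in the pipeline, the values above would typically be supplied as notebook parameters
# (one notebook call per table); they are hard-coded here for illustration.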

deltaTablePath = f"{lakehousePath}/Tables/{tableName}" 
parquetFilePath = f"{lakehousePath}/Files/incremental/{tableName}/{tableName}.parquet"

df2 = spark.read.parquet(parquetFilePath)

if tableKey2 is None:
    mergeKeyExpr = f"t.{tableKey} = s.{tableKey}"
else:
    mergeKeyExpr = f"t.{tableKey} = s.{tableKey} AND t.{tableKey2} = s.{tableKey2}"  

# Check if the table already exists; if it does, do an upsert and return how many rows were inserted and updated; if it does not, create it and return how many rows were inserted
if DeltaTable.isDeltaTable(spark,deltaTablePath):
    deltaTable = DeltaTable.forPath(spark,deltaTablePath)
    deltaTable.alias("t").merge(
        df2.alias("s"),
        mergeKeyExpr
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
    history = deltaTable.history(1).select("operationMetrics")
    operationMetrics = history.collect()[0]["operationMetrics"]
    numInserted = operationMetrics["numTargetRowsInserted"]
    numUpdated = operationMetrics["numTargetRowsUpdated"]
else:
    df2.write.format("delta").save(deltaTablePath)
    deltaTable = DeltaTable.forPath(spark,deltaTablePath)
    history = deltaTable.history(1).select("operationMetrics")
    operationMetrics = history.collect()[0]["operationMetrics"]
    numInserted = operationMetrics["numOutputRows"]  # the initial write reports numOutputRows rather than merge metrics
    numUpdated = 0

#Get the latest date loaded into the table - this will be used for watermarking; return the max date, the number of rows inserted and number updated

deltaTablePath = f"{lakehousePath}/Tables/{tableName}"
df3 = spark.read.format("delta").load(deltaTablePath)
maxdate = df3.agg(max(dateColumn)).collect()[0][0]
# print(maxdate)
maxdate_str = maxdate.strftime("%Y-%m-%d %H:%M:%S")

result = "maxdate="+maxdate_str +  "|numInserted="+str(numInserted)+  "|numUpdated="+str(numUpdated)
# result = {"maxdate": maxdate_str, "numInserted": numInserted, "numUpdated": numUpdated}
mssparkutils.notebook.exit(str(result))

Return the number of rows inserted, updated and max date from the notebook results and store them in pipeline variables.
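If you need to work with that exit value elsewhere, the pipe-delimited string is easy to parse; a quick sketch with sample values is below (in the pipeline itself, this is done with string expressions over the Notebook activity output).

# Sketch: parse the exit string produced by the notebook above, e.g.
# "maxdate=2023-08-01 12:34:56|numInserted=120|numUpdated=35"
def parse_notebook_result(result: str) -> dict:
    parts = dict(item.split("=", 1) for item in result.split("|"))
    return {
        "maxdate": parts["maxdate"],
        "numInserted": int(parts["numInserted"]),
        "numUpdated": int(parts["numUpdated"]),
    }

print(parse_notebook_result("maxdate=2023-08-01 12:34:56|numInserted=120|numUpdated=35"))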

 

6 - Save pipeline run results to configuration table

For each table loaded, update the configuration table with the load details, such as the number of rows read, inserted, and updated, taken from the variables or the Copy Data output. What is especially critical for incremental loads is to update the start date configuration with the max transaction date loaded, which is returned by the Create or Merge to Deltalake notebook. On the next run, this value is used to retrieve records from the source that are greater than or equal to the max datetime of the data loaded in this run.

jehayes_0-1691178593072.png
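As an illustration, the update might look like the statement below, using the same hypothetical table and column names as the earlier sketches and assuming the configuration table is a Lakehouse Delta table updated from a notebook. If your configuration table lives in a SQL database instead, the equivalent T-SQL UPDATE would run from a stored procedure or script called by the pipeline.

# Hypothetical update of a configuration row after a successful incremental load.
run_metrics = {"rowsRead": 155, "numInserted": 120, "numUpdated": 35, "maxdate": "2023-08-01 12:34:56"}

spark.sql(f"""
    UPDATE etl_config_source_to_bronze
    SET RowsRead           = {run_metrics['rowsRead']},
        RowsInserted       = {run_metrics['numInserted']},
        RowsUpdated        = {run_metrics['numUpdated']},
        LoadStatus         = 'Succeeded',
        MaxTransactionDate = '{run_metrics['maxdate']}',
        StartDate          = '{run_metrics['maxdate']}'   -- new watermark for the next run
    WHERE SourceSchema = 'Sales' AND SourceTableName = 'Invoices'
""")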

 

7 - Leverage SQL Views over Bronze Lakehouse tables for Silver layer

SQL views are defined over the Bronze Lakehouse Delta tables. These views will be the source for loading the Gold Lakehouse tables from the Bronze Lakehouse tables.


While SQL views are supported in the Lakehouse SQL Endpoint, they are created and accessible only via that SQL Endpoint, which means they are not available to a Data Factory Copy Data activity; the Copy Data activity only uses the Lakehouse endpoint at this time. However, views in a Fabric Data Warehouse are accessible to a Copy Data activity, so the views are created in a Fabric Data Warehouse. All Data Warehouse views reference Lakehouse Bronze tables, and there is no data movement between Bronze and Silver:

jehayes_0-1691178642500.png
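For example, a Silver view might look something like the sketch below: a Warehouse view that uses cross-database (three-part) naming to read a Bronze Lakehouse table directly. The Lakehouse, view, and column names are hypothetical, and you would run this DDL against the Warehouse, for example from its SQL query editor.

# Hypothetical Silver-layer view defined in the Fabric Data Warehouse.
# It references the Bronze Lakehouse table via three-part naming, so no data is copied between Bronze and Silver.
silver_view_ddl = """
CREATE VIEW dbo.vw_Silver_Invoices AS
SELECT InvoiceID,
       CustomerID,
       InvoiceDate,
       LastEditedWhen
FROM   BronzeLakehouse.dbo.Invoices
"""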

 

8 - Get configuration details to load tables from Silver Views/Bronze Lakehouse to Gold Lakehouse

With the tables loaded in the Bronze Lakehouse and the views defined in the Silver layer, we can transform and load the tables to the Gold Lakehouse. In the orchestrator pipeline, do a Lookup on the configuration table to get the details for each Gold table load:

jehayes_0-1691178692223.png

9 - Call child pipeline to load data from Silver Views/Bronze Lakehouse to Gold Lakehouse

For each table configuration returned from the Lookup activity, call a child pipeline to load the data from Bronze Lakehouse to Gold Lakehouse, passing in the configuration detail from the lookup.

jehayes_0-1691178731611.png

Child pipeline to load from Bronze Lakehouse to Gold Lakehouse:

jehayes_1-1691178740902.png

The pipeline to load from Bronze to Gold has a similar pattern to the pipeline that loads from Source to Bronze, except our source is Fabric Data Warehouse views in the Silver layer that reference tables in the Bronze Lakehouse.

 

If the load type setting from the configuration table is a full load, do a Copy Data Activity from the Silver View to Gold Lakehouse Delta Lake Table.


Full load Copy Data Source settings:

jehayes_0-1691178777478.png

 

Full Load Copy Data Destination Settings:

jehayes_1-1691178789530.png

 

If the load type setting from the configuration table is an incremental load, set the datepredicate variable to build the selection predicate for the incremental load query.

 

jehayes_0-1691178833879.png

 

 

Then do a Copy Data Activity from the Bronze Lakehouse to the Gold Lakehouse, but set the destination as a Parquet file.

 

Incremental load Copy Data source settings:

jehayes_1-1691178833902.png

 

Incremental load Copy Data destination settings:

jehayes_2-1691178849749.png

 

10 - Call Notebook for incremental load merge, Gold Lakehouse Delta tables

For incremental loads only, call a Spark Notebook to merge the incremental data to the Gold Delta Lake table. This is the same notebook that was called in the Load Source to Bronze Pipeline for incremental loads in step 5.

jehayes_0-1691178893706.png
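Because the merge logic is driven entirely by a handful of inputs (lakehouse path, table name, key column(s), and date column), the same notebook can serve both the Bronze and the Gold merges. The sketch below shows the idea as a parameterized call with mssparkutils; the parameter names and values are assumptions, and in the pipeline these are supplied as base parameters on the Notebook activity instead.

# Hypothetical illustration of reusing the same merge notebook with Gold-layer parameters.
gold_merge_params = {
    "lakehousePath": "abfss://yourGoldLakehousePathHere",
    "tableName": "FactInvoices",
    "tableKey": "InvoiceID",
    "dateColumn": "LastEditedWhen",
}

result = mssparkutils.notebook.run("Create or Merge to Deltalake", 600, gold_merge_params)
print(result)  # e.g. "maxdate=...|numInserted=...|numUpdated=..."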

 

 

 

Return the number of rows inserted, updated and max date from the notebook results and store them in variables.

11 - Save pipeline run results to configuration table

Like step 6 for the Load Source to Bronze pipeline, update the Gold configuration table with the load details such as the number of rows read, inserted and updated. Again, it is critical to update the start date configuration with the max transaction date loaded for incremental loads.

 

12 - Create Fabric Dataset

Now that your Star Schema is created in your Gold Lakehouse, you can create a Dataset to be consumed by Power BI reports:

jehayes_0-1691178936755.png

 

Note the blue dashed line on the top edge of each table. Hover over a table and you can see that the Storage Mode is Direct Lake, which means Power BI reports connect directly to the Delta Lake tables without importing the data into a Power BI in-memory dataset and without going through a SQL endpoint as DirectQuery does. Here you can create relationships between your fact and dimension tables, create measures, define hierarchies, etc., just like in a Power BI dataset created in Power BI Desktop.

 

13 - Create Power BI Reports

Finally, create a Power BI Report on top of your dataset:

jehayes_1-1691178965391.png

 

In this post, I illustrated different features of Microsoft Fabric for building metadata-driven pipelines for your data workloads. Microsoft Fabric offers a one-stop shop for building a Modern Data Warehouse in a Lakehouse with Delta Lake tables. Power BI Direct Lake connectivity to the Fabric Lakehouse Delta Lake tables offers the performance of Power BI Import storage mode with the data freshness of DirectQuery, getting critical data to your end users without the overhead of importing and scheduling data refreshes and without the performance lags of DirectQuery.

 

In a couple of weeks, I will post another pattern for loading data into the Fabric Data Warehouse rather than the Fabric Lakehouse, along with reasons you may want to do so. Stay tuned!

Update 8/22/2023 - And here it is! Metadata Driven Pipelines for Microsoft Fabric – Part 2, Data Warehouse Style - Microsoft Community ...

 

Additional resources:

https://learn.microsoft.com/en-us/fabric/get-started/

https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview

https://learn.microsoft.com/en-us/fabric/data-factory/

https://learn.microsoft.com/en-us/power-bi/enterprise/directlake-overview

 

 
