We’re thrilled to announce the release of a highly anticipated feature in Microsoft Purview: lineage tracking for Azure Databricks Unity Catalog. This marks a significant milestone in our ongoing efforts to enhance data governance and visibility across cloud environments.
By leveraging this new functionality, users can now track data flow across their Azure Databricks notebooks, improving the ability to audit, monitor, and manage data movement. With data increasingly flowing through complex, cloud-native platforms like Azure Databricks, having clear, end-to-end visibility is crucial for compliance, troubleshooting, and operational excellence.
What Is Data Lineage?
Data lineage refers to the ability to track the origins, movements, and transformations of data as it flows across different systems and processes. It helps organizations answer key questions like:
- Where does this data come from?
- How is the data transformed and used?
- Which processes or users have modified the data?
In the context of Azure Databricks Unity Catalog, lineage shows how data flows through notebooks, allowing users to see which sources fed into their analyses and where the processed data is stored. By providing this visibility, data lineage helps improve transparency, making it easier to understand the lifecycle of data, diagnose errors, and ensure compliance with data governance policies.
Microsoft Purview can capture lineage at both the Unity Catalog table/view level and the column level.
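To make the idea concrete, table-level lineage can be pictured as a list of source-to-target edges, and "Where does this data come from?" becomes a walk backwards along those edges. The snippet below is a minimal sketch of that idea only. The table names are hypothetical, the edges are hard-coded rather than read from Databricks or Purview, and this is not how either product implements lineage internally.

```python
from collections import defaultdict, deque

# Hypothetical table-level lineage edges: (source_table, target_table).
# In a real workspace these would come from lineage metadata, not a literal list.
EDGES = [
    ("main.raw.orders", "main.silver.orders_clean"),
    ("main.raw.customers", "main.silver.customers_clean"),
    ("main.silver.orders_clean", "main.gold.revenue_by_customer"),
    ("main.silver.customers_clean", "main.gold.revenue_by_customer"),
]


def upstream_tables(edges, table):
    """Return every table that feeds `table`, directly or indirectly."""
    # Index the edges by target so each table can look up its direct sources.
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)

    # Breadth-first walk from the table back toward its original sources.
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    for name in sorted(upstream_tables(EDGES, "main.gold.revenue_by_customer")):
        print(name)
```

Column-level lineage follows the same pattern, with each edge carrying a column name alongside the table name.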
What Are the Prerequisites for Enabling Lineage?
In addition to standard prerequisites for Azure Databricks Unity Catalog scans in Microsoft Purview (such as an active Azure subscription, Purview setup, and integration runtime), the following are key requirements specifically for fetching lineage:
- Enable System Schema: The system.access schema must be enabled in Unity Catalog, as lineage data is stored in system tables.
- User Privileges: The scanning account needs SELECT privileges on the following system tables:
  - system.access.table_lineage
  - system.access.column_lineage
These permissions are essential for Purview to retrieve lineage from Azure Databricks.
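As a rough illustration of the privileges above, the following sketch builds the SQL GRANT statements an administrator might run for the scanning account. The principal name is hypothetical. The USE CATALOG and USE SCHEMA grants are an assumption based on how Unity Catalog privileges usually work, since only the SELECT privileges are listed above. The helper only produces strings; running them in a Databricks SQL editor or notebook is left to you.

```python
# The two system tables named in the prerequisites above.
LINEAGE_TABLES = (
    "system.access.table_lineage",
    "system.access.column_lineage",
)


def lineage_grant_statements(principal):
    """Build GRANT statements that let `principal` read the lineage system tables."""
    # Backtick-quote the principal so names containing '@' or '-' stay valid SQL.
    quoted = "`" + principal.replace("`", "``") + "`"
    statements = [
        # Assumption: the account also needs access to the parent catalog and schema.
        f"GRANT USE CATALOG ON CATALOG system TO {quoted}",
        f"GRANT USE SCHEMA ON SCHEMA system.access TO {quoted}",
    ]
    # The SELECT privileges called out in the prerequisites.
    statements += [
        f"GRANT SELECT ON TABLE {table} TO {quoted}" for table in LINEAGE_TABLES
    ]
    return statements


if __name__ == "__main__":
    # Hypothetical scanning account; replace with your own user or service principal.
    for statement in lineage_grant_statements("purview-scanner@contoso.com"):
        print(statement + ";")
```

Generating the statements from one list keeps the two lineage tables in a single place, which is convenient if the grants need to be reviewed before they are applied.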
How Do You Fetch Lineage During Scans?
To enable lineage during the scan setup in Microsoft Purview, follow the standard steps for configuring an Azure Databricks scan (register the source, configure the integration runtime, etc.). The critical action required for lineage is:
- Toggle Lineage Extraction: When configuring the scan, ensure that Lineage Extraction is set to On. This will enable Microsoft Purview to fetch the lineage of the scanned Azure Databricks assets, including the flow of data through notebooks.
Then go ahead, run your scan and go grab a cup of coffee while Microsoft Purview does its magic!
Example: Comparing Lineage Views in Azure Databricks and Microsoft Purview
After enabling lineage and running a scan, all catalogs from Azure Databricks Unity Catalog will begin to appear in the Microsoft Purview Data Map. This means you’ll see a unified view of data sources across both systems, allowing for easy tracking of data flow and transformations.
Azure Databricks lineage: Shows lineage for datasets and transformations within your notebooks, highlighting dependencies.
Microsoft Purview lineage: Displays lineage across catalogs in a visual, end-to-end data flow.
These visual comparisons give you a clear understanding of how each platform captures and displays data lineage, making it easier to manage and trace your data flows.
What’s Next for Azure Databricks Lineage?
Currently, only Azure Databricks notebook lineage is available, but we’re not stopping there!
Microsoft is actively working with Azure Databricks to bring lineage for jobs and pipelines, ensuring comprehensive tracking of data across your Azure Databricks environment. We continue to push the boundaries of data governance, making it easier for organizations to get full visibility into their data processes.
Stay tuned for future updates as we expand this functionality, bringing you even more insights and control!