Trusted data leads to trusted business insights. Ensuring trust in data goes hand-in-hand with making data easily discoverable. One of the ways to do this is by providing data consumers insight into the data's lineage - where data came from and what transformations it has undergone.
Data lineage in Azure Purview helps organizations to understand the data supply chain, from raw data in hybrid data stores, to business insights in Power BI. Azure Purview's turnkey integrations with Azure Data Factory, Power BI, Azure Data Share and other Azure Data Services automatically push lineage to Purview Data Map.
Azure Purview also supports Apache Atlas Lineage APIs that can be used to access and update custom lineage in Purview Data Map. Hook & Bridge support from Apache Atlas can also be used to easily push lineage from the Hadoop ecosystem.
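As a rough illustration of what custom lineage looks like in the Apache Atlas model, the sketch below builds the JSON body for an Atlas v2 bulk entity request (`POST /api/atlas/v2/entity/bulk`). In Atlas, lineage is expressed as a `Process` entity whose `inputs` and `outputs` attributes reference `DataSet` entities. All names and qualified names here are hypothetical, the referenced datasets are assumed to already exist in the Data Map, and authentication and the account-specific endpoint are left out on purpose.

```python
import json


def dataset_ref(type_name, qualified_name):
    """Reference an existing entity by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_payload(name, qualified_name, inputs, outputs):
    """Build a request body containing one Process entity.

    The Process entity links its input datasets to its output datasets,
    which is how Atlas derives a lineage edge between them.
    """
    return {
        "entities": [
            {
                "typeName": "Process",
                "attributes": {
                    "name": name,
                    "qualifiedName": qualified_name,
                    "inputs": inputs,
                    "outputs": outputs,
                },
            }
        ]
    }


# Hypothetical custom ETL job that reads a raw file and writes a curated table.
payload = build_process_payload(
    name="nightly_sales_load",
    qualified_name="custom://etl/nightly_sales_load",
    inputs=[dataset_ref("DataSet", "custom://raw/sales.csv")],
    outputs=[dataset_ref("DataSet", "custom://curated/sales")],
)

print(json.dumps(payload, indent=2))
```

The printed JSON is what a client would send as the request body; how the request is authenticated and which host it targets depends on the Purview account and should be taken from the product documentation.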
Figure 1: Data lineage can be collected from various data systems
Azure Purview can stitch lineage across on-prem, multi-cloud and other platforms
An enterprise data estate contains data systems that perform extraction, transformation/load, reporting, ML (machine learning) and so on. The goal of the lineage feature in Purview is to capture the data linkage at each data transformation to help answer technical and business questions.
For instance, Purview’s lineage functionality helps capture data movement and transformation stages such as those described below.
- Data Factory copies data from an on-prem/raw zone to a landing zone in the cloud.
- Data processing systems like Synapse and Databricks process and transform data from the landing zone to the curated zone (staging) using notebooks or job definitions.
- Data warehouse systems then process the data from staging into dimensional models for optimal query performance and aggregation.
- Data analytics and reporting systems consume the datasets and process them through their meta model to create BI (Business Intelligence) dashboards, ML experiments and so on.
Root cause analysis scenarios
Azure Purview can help data asset owners troubleshoot a dataset or report containing incorrect data because of upstream issues. Data owners can use Azure Purview lineage as a central tool to understand upstream process failures and be informed about the reasons for discrepancies in their data sources.
Figure 2: Azure Purview lineage capability showing troubleshooting steps for a possible issue with Power BI report
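The idea behind root cause analysis can be sketched as an upstream walk over a lineage graph. The example below is a toy model, not Purview's API: the asset names, the `UPSTREAM` mapping and the `LAST_RUN_OK` status table are all hypothetical and stand in for what a lineage catalog and a pipeline monitor would provide.

```python
from collections import deque

# Toy lineage: each asset maps to the assets it is directly derived from.
UPSTREAM = {
    "powerbi:sales_report": ["dw:fact_sales"],
    "dw:fact_sales": ["curated:sales"],
    "curated:sales": ["landing:sales_raw"],
    "landing:sales_raw": ["onprem:sales_db"],
    "onprem:sales_db": [],
}

# Hypothetical status of the last process run that produced each asset.
LAST_RUN_OK = {
    "dw:fact_sales": True,
    "curated:sales": False,
    "landing:sales_raw": True,
    "onprem:sales_db": True,
}


def upstream_assets(asset, upstream):
    """Return every ancestor of `asset`, nearest first (breadth-first)."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


ancestors = upstream_assets("powerbi:sales_report", UPSTREAM)
# Ancestors whose last producing run failed are the first places to look.
suspects = [a for a in ancestors if not LAST_RUN_OK.get(a, True)]

print(ancestors)
print(suspects)
```

Walking breadth-first lists the nearest upstream assets first, which matches how an owner would typically investigate: start with the direct source of the report and move outward.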
Impact analysis scenarios
Data producers can use Azure Purview lineage to evaluate the downstream impact of changes made to their datasets. Lineage can be used as a central platform to identify all the consumers of a dataset and understand the impact of any changes on dependent datasets and reports. For instance, data engineers can evaluate the downstream impact of deprecating a column in a table or changing the data type of a column. They can use Purview lineage to understand the number of data assets potentially impacted by schema changes to an upstream table. Column-level lineage points precisely to the specific data assets that are impacted.
Figure 3: Azure Purview lineage capability showing the impact analysis for an upstream change
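Impact analysis is the same walk in the opposite direction, and column-level lineage narrows it to the columns actually derived from the one being changed. The sketch below is a toy model with hypothetical table and column names; it does not call Purview and only illustrates the traversal.

```python
from collections import deque

# Toy column-level lineage: (asset, column) -> columns derived directly from it.
COLUMN_LINEAGE = {
    ("staging.orders", "amount"): [("dw.fact_sales", "revenue")],
    ("staging.orders", "order_date"): [("dw.fact_sales", "date_key")],
    ("dw.fact_sales", "revenue"): [("powerbi.sales_report", "Total Revenue")],
    ("dw.fact_sales", "date_key"): [("powerbi.sales_report", "Month")],
}


def impacted_columns(source, lineage):
    """Return every (asset, column) reachable downstream of `source`."""
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Which columns and assets are affected if staging.orders.amount changes type?
impacted = impacted_columns(("staging.orders", "amount"), COLUMN_LINEAGE)
impacted_assets = sorted({asset for asset, _ in impacted})

print(sorted(impacted))
print(impacted_assets)
```

Because the walk follows column edges rather than table edges, the `Month` field of the report is not flagged here even though it lives in an impacted asset, which is the precision that column-level lineage adds.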
Azure Purview can currently connect with Azure Data Factory, Azure Data Share and Power BI to collect lineage. In the coming months, many more data systems, such as Synapse Analytics, Teradata and SQL Server, will be able to connect with Azure Purview for lineage collection.
Call to Action
We look forward to hearing how Azure Purview has helped you troubleshoot and perform impact analysis on your data pipelines with the native lineage experiences.
- Create an Azure Purview account now and start understanding your data supply chain from raw data to business insights with free scanning for all your SQL Server on-premises and Power BI online.
- Start by connecting a Data Factory or Data Share account to push lineage.
- Scan a Power BI tenant to see lineage in Purview. Use managed identity (MSI) authentication to set up a scan of a Power BI tenant.
- Learn more in the lineage user guide.