Lineage From Data Sources in the Data Lake to Visualization in the Reports
See Everything under Microsoft Purview & Azure Table Storage
@Rajeshdadi, @Tushar_Pardeshi , @ashishmodi, @InnovatorsClub
A Data Lineage solution answers questions such as where a piece of data came from, how it was transformed and where it is consumed. It shows the complete data journey from source to consumption, which is of real value to the business.
When we look at any typical enterprise-level big data solution, especially on the cloud, multiple services are involved in data ingestion, processing, transformation and visualization. New entities and measures are derived at each stage, which makes the end-to-end journey of the data hard to understand.
This is where a Data Lineage solution comes in handy. Data lineage is the process of understanding, recording and visualizing how data flows and transforms as it travels from data sources to consumption. It not only helps with data understanding and transparency, but also proves useful for impact analysis, debugging, diagnostics, governance and audit readiness. Data Lineage is truly an underpinning for Data Maturity in an organization and acts as a catalyst to democratize the data and its understanding.
We have created a three-part solution for the end-to-end lineage of your data by collecting information from multiple Azure Services involved in a big data project: Azure Data Lake Storage (ADLS), Azure Databricks (ADB), Azure Synapse Analytics, Azure Analysis Services (AAS) and Power BI (Datasets & Reports).
First, for AAS and Power BI, we built a custom tool called ‘TomPo’ to extract the static lineage. This lineage is available independently as a Power BI dashboard and is also integrated with Purview; either or both options can be configured. It enriches Azure Purview by showing the data model design, relationships, report pages, visual types, roles and memberships of Power BI/AAS.
Second, for Azure Synapse Spark notebooks, we built another custom tool called ‘SparkLin’ to extract runtime lineage. This lineage is available in Microsoft Purview and also in a relational structure that can be queried with SQL. It adds Synapse Spark notebook lineage to Azure Purview and Table Storage.
Third, for a few components such as ADLS, lineage can be scanned as static metadata directly from Azure Purview.
Finally, integrating all under Azure Purview gives a holistic solution for data governance and compliance, showing the end-to-end data journey to any user.
We cover the details in the sections that follow.
You can view this short demo of the complete solution to get a better understanding.
Integrated end-to-end lineage fed into Azure Purview
In the remainder of this blog, we cover SparkLin, the Spark data lineage tool. Lineage from visualization tools is covered in a separate blog on TomPo, listed under Additional Resources at the end.
SparkLin captures and parses lineage information from the internal Spark logical execution plans of Spark notebooks, in order to provide a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline.
SparkLin is language agnostic and works on both Azure Databricks and Synapse. To extract runtime lineage from Azure Synapse Spark notebooks, we integrated OpenLineage into the solution. SparkLin is tightly coupled with Azure Purview for visualizing the lineage. It also supports querying lineage using SQL, so the results can be reused in other areas.
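To make the idea of parsed lineage more concrete, here is a minimal Python sketch of turning one OpenLineage-style run event into source-to-target edges. It is not SparkLin's parser, whose internals are not shown in this blog. It only assumes the event follows the general OpenLineage run event shape, with a `job` plus `inputs` and `outputs` that each carry a `namespace` and a `name`. The sample event, storage paths, job name and the helper names `dataset_id` and `lineage_edges` are all hypothetical.

```python
import json
from typing import List, Tuple

# Hypothetical event in the general shape of an OpenLineage run event.
# All values (run id, job name, storage paths) are made up for illustration.
SAMPLE_EVENT = json.loads("""
{
  "eventType": "COMPLETE",
  "eventTime": "2023-01-01T00:00:00Z",
  "run": {"runId": "00000000-0000-0000-0000-000000000001"},
  "job": {"namespace": "synapse-demo", "name": "hr_notebook.transform"},
  "inputs": [
    {"namespace": "abfss://raw@demo.dfs.core.windows.net", "name": "/employees"},
    {"namespace": "abfss://raw@demo.dfs.core.windows.net", "name": "/exits"}
  ],
  "outputs": [
    {"namespace": "abfss://curated@demo.dfs.core.windows.net", "name": "/headcount"}
  ]
}
""")


def dataset_id(dataset: dict) -> str:
    """Join a dataset's namespace and name into a single identifier."""
    return dataset["namespace"].rstrip("/") + "/" + dataset["name"].lstrip("/")


def lineage_edges(event: dict) -> List[Tuple[str, str, str]]:
    """Return one (source, job, target) triple per input/output pair."""
    job = event["job"]["name"]
    sources = [dataset_id(d) for d in event.get("inputs", [])]
    targets = [dataset_id(d) for d in event.get("outputs", [])]
    return [(s, job, t) for s in sources for t in targets]


if __name__ == "__main__":
    for source, job, target in lineage_edges(SAMPLE_EVENT):
        print(f"{source} --[{job}]--> {target}")
```

Flattening each event into triples like these is one simple way to get lineage into a form that can be stored as rows and visualized as a graph.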
Let us dive into the specifics of extracting runtime lineage from Azure Synapse Spark notebooks and publishing it into Azure Purview using the SparkLin parser:
The architecture shows how we enable moderators to extract lineage from the compute layer and from the model/visualization layer.
Results can be viewed graphically in Azure Purview or queried via SQL from Azure Table storage.
Lineage results are stored in Azure Table storage, where they can be queried and used to derive further analytics.
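As an illustration of the kind of analytics that stored lineage rows make possible, the Python sketch below traces every upstream dataset of a given target. It assumes the rows have already been fetched from the table and arrive as plain dictionaries. `PartitionKey` and `RowKey` are the standard Azure Table storage keys, but the `source` and `target` column names, the sample rows and the `upstream` helper are hypothetical and do not reflect SparkLin's actual table schema.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set

# Hypothetical lineage rows, shaped like entities returned by a table query.
# The "source" and "target" columns are assumptions made for this sketch.
ROWS = [
    {"PartitionKey": "hr_notebook", "RowKey": "1",
     "source": "raw/employees", "target": "curated/headcount"},
    {"PartitionKey": "hr_notebook", "RowKey": "2",
     "source": "raw/exits", "target": "curated/headcount"},
    {"PartitionKey": "report_notebook", "RowKey": "1",
     "source": "curated/headcount", "target": "reports/attrition"},
    {"PartitionKey": "org_notebook", "RowKey": "1",
     "source": "raw/orgs", "target": "curated/org_tree"},
]


def upstream(rows: Iterable[dict], dataset: str) -> List[str]:
    """Return all direct and indirect sources of `dataset`, sorted by name."""
    # Index the rows as target -> set of direct sources.
    parents: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        parents[row["target"]].add(row["source"])

    # Breadth-first walk from the dataset back towards its origins.
    seen: Set[str] = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


if __name__ == "__main__":
    print(upstream(ROWS, "reports/attrition"))
```

The same walk in the opposite direction (source to targets) would give a basic impact analysis: everything downstream that is affected when a source changes.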
SparkLin is not a direct release of the Azure Purview product and will not be officially supported. It is a solution developed by the HR Data Insights team within Microsoft’s Digital Employee Experience organization. It provides detailed Spark lineage integrated into Purview, which can help Microsoft customers and other users understand their data better and run their business with more efficiency and awareness (as given in the use case scenarios above). The solution is used successfully in our programs and helps with various scenarios.
The solution works only on metadata and can help in data governance and compliance.
Onboarding is easy: it takes just a few configurations in the Synapse Spark pool environment, plus the code scripts from our GitHub repository. For detailed onboarding steps, refer to GitHub.
Code Repository:
Details of TomPo-Lineage from PowerBI and Azure Analysis Services
Details on OpenLineage