End-to-End Data Lineage from Spark Big data Environment
Published Jan 24 2023

Lineage From Data Sources in the Data Lake to Visualization in the Reports

See Everything under Microsoft Purview & Azure Table Storage

 

InnovatorsClub_0-1674490602979.png

@Rajeshdadi, @Tushar_Pardeshi, @ashishmodi, @InnovatorsClub

 

A Data Lineage solution answers questions such as where data originates, how it is transformed, and where it is consumed. It shows the complete data journey from source to consumption and provides enriching value to businesses.

 

When we look at any typical enterprise-level big data solution, especially on the cloud, multiple services are involved to handle data ingestion, processing, transformation, and visualization. New entities and measures are derived at each stage, which makes understanding the end-to-end journey of data complex.

 

This is where a Data Lineage solution comes in handy. Data lineage is the process of understanding, recording, and visualizing how data flows and transforms as it travels from data sources to consumption. It not only helps with data understanding and transparency, but it also proves useful for impact analysis, debugging, diagnostics, governance, and audit readiness. Data lineage is truly an underpinning of data maturity in an organization and acts as a catalyst for democratizing data and its understanding.

 

Data Lineage - A Complete Solution

We have created a three-part solution for the end-to-end lineage of your data by collecting information from the multiple Azure services involved in a big data project: Azure Data Lake Storage (ADLS), Azure Databricks (ADB), Azure Synapse Analytics, Azure Analysis Services (AAS), and Power BI (datasets and reports).

 

First, for AAS and Power BI, we built a custom tool called ‘TomPo’ to extract static lineage. Lineage from TomPo is available independently as a Power BI dashboard and is also integrated with Purview; you can configure either option or both. It enriches Azure Purview by showing the data model design, relationships, report pages, visual types, and the roles and memberships of Power BI/AAS.

 

Second, for Azure Synapse Spark notebooks, we built another custom tool called ‘SparkLin’ to extract runtime lineage. Lineage from SparkLin is available in Microsoft Purview and also in a relational structure that can be queried with SQL. It adds the ability to publish Synapse Spark notebook lineage into Azure Purview and Azure Table Storage.

 

Third, for a few components such as ADLS, lineage can be scanned as static metadata directly from Azure Purview.

 

Finally, integrating everything under Azure Purview gives a holistic solution for data governance and compliance, showing the end-to-end data journey to any user.

 

We will cover the details in the sections that follow.

You can view this short demo of the complete solution to get a better understanding.

 

 

High-level Architecture:

InnovatorsClub_0-1674487416495.png

 

 

How the Results Should Look:

Integrated end-to-end lineage fed into Azure Purview

InnovatorsClub_1-1674487416510.png

 

How Are We Using the Complete Solution?

Use case scenarios:

  1. Data scientists can visualize the latest data assets and their transformation journey.
  2. Enables data democratization as a service.
  3. Data stewards and developers can quickly debug and perform impact analysis across the whole data chain.
  4. Informs users with automatic alerts about failures and their impact on reporting.
  5. Helps reverse engineer application code and reporting, which is immensely useful in migration/modernization programs.

 

In the remainder of this blog, we cover SparkLin, the Spark data lineage component. Lineage from the visualization tools is covered in a separate blog on TomPo, linked under Additional Resources at the end.

 

SparkLin – Synapse Spark Data Lineage:

SparkLin captures and parses lineage information from the internal logical execution plans of Spark notebooks to provide a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline.
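
For intuition, here is a small illustration (not part of SparkLin) of the plan information Spark already exposes: the analyzed logical plan of a DataFrame names the input relations, derived columns, and join conditions that lineage is built from. The paths and column names below are made up for the example.

```python
# Illustration only: inspect the logical plans Spark produces for a query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

orders = spark.read.format("delta").load("/lake/raw/orders")        # illustrative paths
customers = spark.read.format("delta").load("/lake/raw/customers")

summary = (
    orders.join(customers, "customer_id")   # join condition ends up in the plan
          .groupBy("region")
          .agg({"amount": "sum"})           # derived measure: sum(amount)
)

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan;
# the analyzed plan names the input relations and derived columns that lineage uses.
summary.explain(extended=True)
```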

 

SparkLin is language agnostic and works on both Azure Databricks and Azure Synapse. To extract runtime lineage from Azure Synapse Spark notebooks, we integrated OpenLineage into the solution. SparkLin is tightly coupled with Azure Purview for visualizing the lineage, and it also supports querying lineage with SQL, which can be used in other areas.

 

Capabilities of SparkLin:

  • The SparkLin Parser helps you understand forward and backward lineage for any entity.
  • Captures lineage information in Azure Table Storage for further analytics.
  • Logs error messages and run telemetry for later verification.
  • Provides a configurable retry/restart mechanism for failed notebooks.
  • Shows detailed information on column transformations, join conditions, and input and output tables/columns.

 

High-level Architecture: 

InnovatorsClub_2-1674487416526.png

 

Operational Aspects of Solution:

Let us deep dive into the specifics of extracting runtime lineage from Azure Synapse Spark notebooks and publishing it into Azure Purview using the SparkLin Parser.

The architecture depicts how we enable teams to extract lineage from the compute layer as well as from the model/visualization perspective.

  • Azure Synapse clusters are configured to initialize the OpenLineage Spark listener with an endpoint that receives the data.
  • Spark operations emit data in a standard JSON format to the endpoint configured on the cluster.
  • The endpoint, provided by an Azure HTTP/Blob trigger Function app, filters the incoming data and writes it to Azure Blob storage (a minimal sketch of such a function follows this list).
  • An event subscription for blob creation is created on the storage account, using the Event Grid blob trigger function app.
  • As soon as a blob lands in the container, the Event Grid blob function app calls the SparkLin Parser, the core component that parses the events and extracts the data into a format compatible with the Atlas APIs and Purview.
  • Data assets are scanned into a Purview collection. Lineage data is synchronized with existing Purview metadata and uploaded to Purview using the standard Apache Atlas APIs.
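
To make the middle of that flow concrete, below is a minimal sketch, not the SparkLin code, of an HTTP-triggered Azure Function sitting behind the OpenLineage endpoint: it keeps only terminal run events and lands them in Blob storage for the Event Grid/parser step. The container name, connection string, and filtering rule are illustrative assumptions.

```python
import json
import uuid

import azure.functions as func
from azure.storage.blob import BlobServiceClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    # OpenLineage run event: eventType, eventTime, run, job, inputs, outputs
    event = req.get_json()

    # START/RUNNING events can be dropped; COMPLETE/FAIL carry the final inputs/outputs.
    if event.get("eventType") not in ("COMPLETE", "FAIL"):
        return func.HttpResponse(status_code=204)

    blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    container = blob_service.get_container_client("lineage-events")   # illustrative container
    container.upload_blob(name=f"{uuid.uuid4()}.json", data=json.dumps(event))
    return func.HttpResponse(status_code=200)
```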

 

SparkLin Results: 

Results can be viewed graphically in Azure Purview or queried via SQL from Azure Table Storage.

 

Visualizing the Synapse Lineage in Microsoft Purview:

The rectangular boxes below are “Entities” and the oval shapes are “Processes”.

InnovatorsClub_3-1674487416530.png

 

Highlighted below are the columns:

InnovatorsClub_4-1674487416539.png

 

 

Detailed information on derived columns, temporary tables, Delta tables, and join conditions used in the process:

InnovatorsClub_5-1674487416551.png

 

Fetching Lineage via SQL

Lineage results are stored in Azure Table Storage, which can be queried using the snippet below and used further to derive analytics.

InnovatorsClub_6-1674487416555.png
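
The screenshot above shows the SQL snippet used in the solution. As an alternative illustration, here is a hedged sketch of pulling lineage rows programmatically with the azure-data-tables SDK and an OData filter; the table name and property names are assumptions, not the actual SparkLin schema.

```python
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.get_table_client("SparkLineage")  # hypothetical table name

# OData filter: every lineage row that feeds a given output table.
rows = table.query_entities("OutputTable eq 'curated.sales_summary'")
for row in rows:
    print(row["InputTable"], "->", row["OutputTable"], "|", row.get("JoinCondition"))
```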

 

Security:

  • SparkLin parses only the internal Spark logical execution plan; no actual data is involved.
  • The HTTP trigger function app is authorized, so users/applications can access it only with host/function keys.
  • SparkLin pushes the lineage output securely to Azure Table Storage.
  • The Purview instance is accessed with a service principal (SPN) for the lineage push, with secrets/keys maintained in Key Vault (a hedged sketch follows this list).
  • User access to lineage can be controlled through Azure Purview role assignments.
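
As referenced in the list above, here is a hedged sketch, assuming the pyapacheatlas library, of how an SPN held in Key Vault could be used for the lineage push to Purview over the Atlas APIs. The vault, secret, and account names are placeholders rather than values from the SparkLin repo.

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

# Pull the SPN credentials from Key Vault instead of hard-coding them.
secrets = SecretClient("https://<vault-name>.vault.azure.net", DefaultAzureCredential())

auth = ServicePrincipalAuthentication(
    tenant_id=secrets.get_secret("purview-tenant-id").value,
    client_id=secrets.get_secret("purview-client-id").value,
    client_secret=secrets.get_secret("purview-client-secret").value,
)
purview = PurviewClient(account_name="<purview-account>", authentication=auth)

# purview.upload_entities(batch=parsed_entities)  # entities emitted by the SparkLin Parser
```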

 

What More Should You Know?

SparkLin is not part of the Azure Purview product release and is not officially supported. It is a solution developed by the HR Data Insights team within Microsoft’s Digital Employee Experience organization. It provides detailed Spark lineage integrated into Purview, which can help Microsoft customers and other users understand their data better and run their business with more efficiency and awareness (as in the use case scenarios above). The solution is being used successfully in our programs and is helping with various scenarios.

The solution works only on metadata and can help in data governance and compliance.

 

What Does It Take to Onboard?

Onboarding is easy, with just a few configurations in the Synapse Spark pool environment and code scripts taken from our GitHub repositories. Following are the steps:

  • Upload the provided openlineage-spark jar into the Synapse Spark pool packages.
  • Update the Spark environment configurations in the Synapse Spark pool (a sketch of these settings follows the list).
  • Create the Azure Function app and add the SparkLin-related functions.
  • Create an Event Grid subscription for the Blob storage account.
  • Create the Purview collection where all lineage assets will reside.
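
The sketch below illustrates the Spark settings behind the first two steps. In Synapse these key/value pairs go into the Spark pool’s Apache Spark configuration alongside the uploaded openlineage-spark jar; the exact property names vary with the openlineage-spark version, and the endpoint URL and namespace are placeholders.

```python
openlineage_conf = {
    # Register the listener class shipped in the uploaded jar.
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    # Post run events over HTTP to the Azure Function endpoint from the architecture.
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "https://<function-app>.azurewebsites.net/api/<function>",
    # Namespace under which the notebook jobs are grouped in the lineage store.
    "spark.openlineage.namespace": "<synapse-workspace-name>",
}
```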

Access Required:

  • Storage Blob Data Contributor on the storage account.
  • Data Curator/Data Reader to publish the custom Atlas core assets in Purview.
  • Data Source Admin and Collection Admin to perform the Delta Lake scan in the respective Purview collection.

For detailed onboarding steps, refer to GitHub.

 

Additional Resources:

Code Repository:

  • For Synapse Spark Lineage Repo: SparkLin
  • For AAS/PowerBI Lineage Repo: TomPo
  • For ADB Lineage Repo: ADBSetup

Details of TomPo: Lineage from Power BI and Azure Analysis Services

Details on OpenLineage

 
