Introduction.
In the post Microsoft Fabric: Integration with ADO Repos and Deployment Pipelines - A Power BI Case Study, we outlined key best practices for using the seamless integration between Fabric and Git via Azure DevOps repositories, together with Fabric Deployment Pipelines, both features intended to improve collaborative development and agile application publishing in the Azure cloud.
The quality and value delivered by any data analysis application depend on the quality of the data it consumes, drawn from the widest possible range of reliable, trustworthy data sources.
Fabric Data Pipelines serve as the backbone of data integration and orchestration, allowing organizations to streamline the flow of data across disparate systems, applications, and services.
By moving and manipulating data, Fabric Data Pipelines help ensure data consistency, accuracy, and timeliness, ultimately supporting informed decision-making and driving business value.
In this post we first delve into the integration of Fabric Data Pipelines and Azure DevOps Repos, aimed at improving collaborative development and source code control. We then address the key benefits of Fabric's content-based strategy for continuous deployment, and recommend including data pipelines as part of the content to be deployed and shared.
The role of Data Pipelines in Fabric.
Figure 1 outlines the stages involved in building a data analytics solution.
Figure 1. Fabric Data Pipelines are a way to ingest and transform data into a Fabric solution.
There are many options in Fabric for ingesting and transforming data before building the semantic model of a report or lakehouse.
At the time of writing, Fabric lists the following items as eligible for source code control [Overview of Fabric Git integration - Microsoft Fabric | Microsoft Learn]:
- Data pipelines
- Lakehouse
- Notebooks
- Paginated reports
- Reports (except reports connected to semantic models hosted in Azure Analysis Services, SQL Server Analysis Services or reports exported by Power BI Desktop that depend on semantic models hosted in MyWorkspace)
- Semantic models (except push datasets, live connections, model v1, and semantic models created from the Data warehouse/lakehouse.)
The primary goal of a Data Pipeline, as an effective way to ingest data in Fabric, is to facilitate the efficient and reliable movement of data from various sources to designated destinations, while also enabling transformations and processing tasks along the way.
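To make this concrete, the sketch below (our own illustration, not an official sample) triggers a data pipeline run on demand through the Fabric REST API's job scheduler; the workspace ID, pipeline item ID, and access token are placeholders you must supply.

```python
import requests

# Hedged sketch: start a Fabric data pipeline run on demand via the
# Fabric REST API job scheduler. All IDs and the token are placeholders.
WORKSPACE_ID = "<workspace-guid>"             # placeholder
PIPELINE_ID = "<data-pipeline-item-guid>"     # placeholder
TOKEN = "<azure-ad-access-token>"             # e.g., acquired with MSAL

url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ID}/jobs/instances?jobType=Pipeline"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# A 202 response means the run was accepted; the Location header points
# to the job instance that can be polled for status.
print(resp.status_code, resp.headers.get("Location"))
```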
Why use source control for Fabric Data Pipelines?
Developers routinely have to evolve a data pipeline's logic, for example to handle incremental loads or updates, and they sometimes need to recover a previous version, whether to fix an error or to reuse earlier work.
Source control, also known as version control, is a foundational aspect of collaborative software development, providing a systematic approach to managing changes to code and configurations throughout the development lifecycle. For Fabric Data Pipelines, which play a crucial role in orchestrating data workflows and transformations, integrating source control becomes paramount for ensuring transparency, reproducibility, and reliability in data processing.
Source control is essential for managing Fabric’s data pipelines for several reasons:
- It allows you to keep track of changes, revert to previous versions, and understand the evolution of your data pipeline over time.
- It lets multiple team members work on different parts of the pipeline simultaneously without overwriting each other’s work.
- It ensures that any data analysis or transformation can be reproduced, which is critical for debugging and auditing purposes.
- In case of personnel changes, it provides continuity, allowing new team members to understand the pipeline’s history and current state.
Next, we present a step-by-step guide to using source control and version management for a data pipeline in Fabric.
1. Integrate your workspace with Git, as described in [Microsoft Fabric: Integration with ADO Repos and Deployment Pipelines - A Power BI Case Study], [Overview of Fabric Git integration - Microsoft Fabric | Microsoft Learn]. If you prefer to automate this step, a minimal API sketch follows.
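The Fabric Git REST API exposes a connect endpoint that mirrors the workspace settings dialog. The sketch below uses placeholder values throughout; verify the request shape against the Git integration documentation linked above before relying on it.

```python
import requests

# Hedged sketch: connect a workspace to an ADO repo via the Fabric Git
# REST API. Every value below is a placeholder for your own environment.
WORKSPACE_ID = "<workspace-guid>"
TOKEN = "<azure-ad-access-token>"

payload = {
    "gitProviderDetails": {
        "gitProviderType": "AzureDevOps",
        "organizationName": "<ado-organization>",
        "projectName": "<ado-project>",
        "repositoryName": "<repo-name>",
        "branchName": "main",
        "directoryName": "/",
    }
}
resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/git/connect",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```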
2. Create a data pipeline in your workspace. To create a new data pipeline in Fabric, refer to [Module 1 - Create a pipeline with Data Factory - Microsoft Fabric | Microsoft Learn] and [Activity overview - Microsoft Fabric | Microsoft Learn]; an API-based alternative is sketched after Figure 2.
Figure 2 shows three pipelines created in a workspace named Workspace Dev 1, along with the workspace’s settings for the integration with an ADO repository (more details at Microsoft Fabric: Integration with ADO Repos and Deployment Pipelines - A Power BI Case Study - Microsoft Community Hub).
Figure 2. Workspace integrated with Git via an ADO repo of a project.
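Pipelines like these can also be created programmatically through the Fabric Items REST API. The following minimal sketch (placeholder IDs, hypothetical display name) creates an empty data pipeline item:

```python
import requests

# Hedged sketch: create an empty data pipeline item with the Fabric
# Items REST API instead of the UI. IDs and names are placeholders.
WORKSPACE_ID = "<workspace-guid>"
TOKEN = "<azure-ad-access-token>"

resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/items",
    json={"displayName": "pipeline_example", "type": "DataPipeline"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # the returned item id is needed by later API calls
```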
3. Sync content with the ADO Repo.
The next figure shows all content synced after committing changes from the Fabric UI.
If you add new data pipelines or update the content of existing ones, the affected items are marked as “Uncommitted”. Every time you want to sync, select the “Source Control” button and commit the changes.
In the ADO repo you will see the three pipelines created in Workspace Dev 1.
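The same commit can be issued programmatically through the Fabric Git REST API. A minimal sketch, assuming placeholder IDs and a valid Azure AD token:

```python
import requests

# Hedged sketch: commit all pending workspace changes to the connected
# ADO repo, the API counterpart of "Source Control" > Commit in the UI.
WORKSPACE_ID = "<workspace-guid>"
TOKEN = "<azure-ad-access-token>"

resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/git/commitToGit",
    json={"mode": "All", "comment": "Sync data pipelines from Workspace Dev 1"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```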
To retrieve a specific pipeline version from the ADO repo, select the corresponding commit under Azure DevOps/Repos/Commits and then choose Browse files.
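The same file can be pulled programmatically with the Azure DevOps Git Items REST API. In the sketch below, the repo path is a hypothetical example of where Fabric serializes a pipeline definition; adjust it to match what you see in your own repo:

```python
import requests
from requests.auth import HTTPBasicAuth

# Hedged sketch: fetch a pipeline's definition as it existed at a given
# commit, via the Azure DevOps Git Items REST API (PAT authentication).
ORG, PROJECT, REPO = "<ado-organization>", "<ado-project>", "<repo-name>"
COMMIT_ID = "<commit-sha>"
PAT = "<azure-devops-personal-access-token>"
# Hypothetical path; Fabric stores each pipeline in its own folder.
PATH = "/pipeline_example.DataPipeline/pipeline-content.json"

resp = requests.get(
    f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/git/repositories/{REPO}/items",
    params={
        "path": PATH,
        "versionDescriptor.versionType": "commit",
        "versionDescriptor.version": COMMIT_ID,
        "api-version": "7.0",
    },
    auth=HTTPBasicAuth("", PAT),  # empty username, PAT as password
)
resp.raise_for_status()
print(resp.text)  # the pipeline JSON at that commit
```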
You can edit some content outside Fabric, and when you go back to Fabric, you will notice that it has changed since the last commit:
Using the Source Control button, you can sync content again.
Press “Update All” and the pipeline in Fabric is synced with the repo’s content.
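Under the hood, “Update All” corresponds to the Git API's update-from-Git operation. A hedged sketch, with field names taken from the Fabric Git REST reference (verify before use):

```python
import requests

# Hedged sketch: read the workspace's Git status, then apply the remote
# commit to the workspace (the API counterpart of "Update All").
WORKSPACE_ID = "<workspace-guid>"
TOKEN = "<azure-ad-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
BASE = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/git"

status = requests.get(f"{BASE}/status", headers=HEADERS).json()

resp = requests.post(
    f"{BASE}/updateFromGit",
    json={
        "remoteCommitHash": status["remoteCommitHash"],
        "workspaceHead": status["workspaceHead"],
        # Prefer the repo's version when an item changed on both sides.
        "conflictResolution": {
            "conflictResolutionType": "Workspace",
            "conflictResolutionPolicy": "PreferRemote",
        },
        "options": {"allowOverrideItems": True},
    },
    headers=HEADERS,
)
resp.raise_for_status()
```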
4. To see details of previous runs of a data pipeline in Microsoft Fabric, you can monitor the pipeline runs [How to monitor pipeline runs - Microsoft Fabric | Microsoft Learn]. Here we list the steps to follow, illustrate them with images, and close with an API-based alternative.
- Navigate to your workspace, hover over the pipeline, and click the three dots to the right of its name to bring up a list of options. Select View run history to see all recent runs and their statuses.
The following picture illustrates the recent run history of a data pipeline.
- Select “Go to Monitor” to see the status of activities across all the workspaces for which you have permissions within Microsoft Fabric.
- Open the pipeline you want to fix, and then select Update pipeline:
- When you select Update pipeline, you are taken back to the pipeline canvas for editing, where you can change any mapping, delete activities, and so on. You can save the pipeline, validate it, and run it again.
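Run history is also exposed programmatically through the Fabric job scheduler API; the following minimal sketch (placeholder IDs) lists a pipeline's recent job instances.

```python
import requests

# Hedged sketch: list recent runs (job instances) of a data pipeline,
# the API counterpart of "View run history" in the workspace UI.
WORKSPACE_ID = "<workspace-guid>"
PIPELINE_ID = "<data-pipeline-item-guid>"
TOKEN = "<azure-ad-access-token>"

resp = requests.get(
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ID}/jobs/instances",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for run in resp.json().get("value", []):
    print(run.get("id"), run.get("status"), run.get("startTimeUtc"))
```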
Deployment of data pipelines.
You can define a deployment pipeline on the workspace that contains the most recent, updated items and deploy all of them to the TEST Workspace. If you want to learn more about Fabric Deployment Pipelines, refer to Microsoft Fabric: Integration with ADO Repos and Deployment Pipelines - A Power BI Case Study.
You can add data pipelines to any workspace in Fabric, and sharing common data pipeline code goes a long way toward ensuring reproducible results in your analysis.
Therefore, this type of content can, and should, be included in a deployment pipeline.
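Deployment can itself be scripted. The sketch below uses the Power BI REST API's deploy-all operation, which also drives Fabric deployment pipelines; the pipeline ID is a placeholder and the stage order assumes the default Development > Test > Production layout.

```python
import requests

# Hedged sketch: deploy all supported content from the Development stage
# to the next stage (Test) of a deployment pipeline.
DEPLOYMENT_PIPELINE_ID = "<deployment-pipeline-guid>"
TOKEN = "<azure-ad-access-token>"

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/pipelines/{DEPLOYMENT_PIPELINE_ID}/deployAll",
    json={
        "sourceStageOrder": 0,  # 0 = Development in the default layout
        "options": {
            "allowCreateArtifact": True,
            "allowOverwriteArtifact": True,
        },
    },
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```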
Sharing data pipeline code between a DEV WORKSPACE and a TEST WORKSPACE greatly reduces the potential for errors, by helping to guarantee that the transformed data used for model training is the same as the transformed data the models will use in production.
A good practice mentioned in Best practices for lifecycle management in Fabric - Microsoft Fabric | Microsoft Learn is to use different databases in each stage: build separate databases for development and testing, to protect production data and to avoid overloading the development database with the full volume of production data.
For now, data pipelines cannot be managed through deployment rules. You can learn more about deployment rules in Create deployment rules for Fabric's Application lifecycle management (ALM) - Microsoft Fabric | Microsoft Learn.
However, you can edit the deployed data pipeline inside the Test Workspace to change the source (as long as it has the same data structure or file format), run the edited pipeline, and refresh the data to obtain the desired results. Proceed similarly with the deployed data pipeline inside the Production Workspace: edit it, run it, and refresh the data. One way to script this edit is sketched after the figure.
The next figure shows the data source and destination to be configured in a data pipeline.
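One scripted approach, sketched below under loudly stated assumptions, is to download the deployed item's definition, rewrite the source connection in the JSON, and upload it back with the Fabric Items REST API. The part name and the exact fields to change are hypothetical; inspect your own pipeline's exported definition to find the right ones, and note that these endpoints may respond with 202 as long-running operations.

```python
import base64
import json
import requests

# Hedged sketch: repoint a deployed pipeline's data source in the Test
# workspace by rewriting its definition. Part name and connection fields
# are hypothetical; adapt them to your pipeline's actual JSON.
WORKSPACE_ID = "<test-workspace-guid>"
PIPELINE_ID = "<deployed-pipeline-item-guid>"
TOKEN = "<azure-ad-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
BASE = (f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
        f"/items/{PIPELINE_ID}")

# 1. Download the current definition (base64-encoded parts).
definition = requests.post(f"{BASE}/getDefinition", headers=HEADERS).json()

for part in definition["definition"]["parts"]:
    if part["path"] == "pipeline-content.json":  # hypothetical part name
        content = json.loads(base64.b64decode(part["payload"]))
        # ... edit the source connection here (same structure/format) ...
        part["payload"] = base64.b64encode(
            json.dumps(content).encode()
        ).decode()

# 2. Upload the edited definition back to the Test workspace.
resp = requests.post(f"{BASE}/updateDefinition",
                     json=definition, headers=HEADERS)
resp.raise_for_status()
```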
Summary.
Fabric Data Pipelines serve as the backbone of data integration and orchestration. Source control is essential for managing them for several reasons; among the most significant are the ability to recover previous versions for reuse or error correction, to share code between developers, and to trace the evolution of a data pipeline over time.
We have provided a step-by-step guide to bringing data pipelines under source control by means of Fabric-Git integration, describing how to retrieve a specific data pipeline's code from the commit history and how to update the data pipeline inside Fabric.
Data pipelines should be included in the content shared through deployment pipelines, to help ensure data consistency and security from the development stages through to production.
You can find more information here:
Microsoft Fabric Life Cycle Management ALM in Fabric by taik18 - YouTube
How to monitor pipeline runs - Microsoft Fabric | Microsoft Learn
Git integration and deployment for data pipelines - Microsoft Fabric | Microsoft Learn
Datasets - Refresh Dataset - REST API (Power BI REST APIs) | Microsoft Learn
Data Factory in Microsoft Fabric documentation - Microsoft Fabric | Microsoft Learn