Part 2 - Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process

Introduction

To see a complete introduction to this blog series, including links to all the other parts, please follow the link below:

Part 1 - Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process

 

Part 2 of the blog series will focus on:

  1. Configure Azure Data Factory Source Control
  2. Construct Azure Data Factory Data Pipeline
  3. Publishing Concept for Azure Data Factory
  4. Configure Deployed Azure Resources

Configure Azure Data Factory Source Control

After you have created the Azure resources, launch Azure Data Factory Studio for the development instance, adf-demo-dev-eastus-001. In Azure Data Factory Studio, navigate to the Manage tab, then under Source control select Git configuration and click the blue Configure button.

 

Joe_Fitzgerald_0-1685714338236.png

 

From here you will have to select one of the two supported repository types: Azure DevOps Git or GitHub. In my case, I’m using Azure DevOps Git. You will also have to select the Azure Active Directory tenant that is associated with your Azure DevOps organization.

 

Joe_Fitzgerald_1-1685714338245.png

 

Select the appropriate settings for your DevOps organization name, project name, and repository name. For the collaboration branch, select the main branch, and for the publish branch, use adf_publish. The adf_publish branch is required and is used with the ADF “publish concept”, which is discussed further in the section “Publishing Concept for Azure Data Factory”. For the root folder, I use “/azuredatafactory/src” to isolate the data factory code to a single folder.

 

Joe_Fitzgerald_2-1685714338247.png
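
For reference, this Git configuration is stored on the factory resource itself as a repoConfiguration block. The sketch below is only illustrative: the organization, project, and repository names are placeholders, while the collaboration branch and root folder match the values chosen above.

```json
{
  "name": "adf-demo-dev-eastus-001",
  "properties": {
    "repoConfiguration": {
      "type": "FactoryVSTSConfiguration",
      "accountName": "my-devops-org",
      "projectName": "my-project",
      "repositoryName": "my-repo",
      "collaborationBranch": "main",
      "rootFolder": "/azuredatafactory/src"
    }
  }
}
```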

 

After the configuration is successful, you will see the following folder structure in your Azure DevOps repo.

 

Joe_Fitzgerald_3-1685714338248.png

 

The last step for the Git configuration is to disable the Publish button in Azure Data Factory Studio. To do this, edit the Git configuration and select the checkbox to disable the Publish button. This is very important because you do not want anyone to update the live data factory instance manually; the live data factory should only ever be updated via the CI/CD process.

 

Joe_Fitzgerald_4-1685714338254.png

 

Select the Update repository settings button when prompted.

 

Joe_Fitzgerald_5-1685714338255.png

 

Construct Azure Data Factory Data Pipeline

After the Git configuration is complete, the next step is to create linked services in the data factory. In Azure Data Factory Studio, navigate to the Manage tab and then to Linked services. Here we will create linked services to the Azure Key Vault and the storage account.

 

Joe_Fitzgerald_6-1685714407043.png

 

First create a linked service for Azure Key Vault.

 

Joe_Fitzgerald_7-1685714407045.png

 

Select the correct settings and Test the Connection:

 

Joe_Fitzgerald_8-1685714407053.png
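
When saved to the Git repository, the Key Vault linked service is a small JSON file along the lines of the sketch below. The linked service name “LS_AKV_Demo” and the vault URL are placeholders chosen for illustration.

```json
{
  "name": "LS_AKV_Demo",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://kv-adf-demo-dev-001.vault.azure.net/"
    }
  }
}
```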

 

Next, we will create the Linked Service for the storage account using the Azure Key Vault linked service that we just created.

 

Joe_Fitzgerald_9-1685714407055.png

 

In this particular case, we will use the settings shown in the image below. For the Authentication type, use Account key together with Azure Key Vault. Keep in mind that the data factory service principal was granted Get and List permissions on the key vault, so when you select the AKV linked service, the data factory is able to retrieve the available secret names. In this case, select the secret “storageaccount-adf-demo” that we created along with the key vault.

 

Joe_Fitzgerald_10-1685714407059.png
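
The resulting linked service JSON looks roughly like the sketch below: the connection string is resolved at runtime from the key vault secret. The Key Vault linked service reference name (“LS_AKV_Demo”) is the placeholder used in the previous sketch.

```json
{
  "name": "LS_ABLB_storage_Demo",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "LS_AKV_Demo",
          "type": "LinkedServiceReference"
        },
        "secretName": "storageaccount-adf-demo"
      }
    }
  }
}
```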

 

After the linked services are created, the next step is to create datasets in the data factory. The datasets will be used as the source and the destination when copying files between folders in the storage container. In Azure Data Factory Studio, navigate to the Author tab and then select the ellipsis next to Datasets to create a new dataset.

 

Joe_Fitzgerald_11-1685714407061.png

 

For the new dataset, select Azure Blob Storage as the data store and Binary as the format type. I set the name of the first dataset to “DS_Storage_Input_Folder” and selected the linked service we created earlier, “LS_ABLB_storage_Demo”. For the file path, use “adftutorial” for the container and “input” for the folder name.

 

Joe_Fitzgerald_12-1685714407063.png

 

Joe_Fitzgerald_13-1685714407065.png

 

Joe_Fitzgerald_14-1685714407067.png
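
If you open the JSON view of the dataset (or the file saved in the repo), it should look roughly like this sketch. The second dataset will be identical except for its name and a folderPath of “output”.

```json
{
  "name": "DS_Storage_Input_Folder",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_ABLB_storage_Demo",
      "type": "LinkedServiceReference"
    },
    "type": "Binary",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "adftutorial",
        "folderPath": "input"
      }
    }
  }
}
```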

 

Create a second dataset, “DS_Storage_Output_Folder”, for the output folder using the settings in the image below:

 

Joe_Fitzgerald_15-1685714407068.png

 

Finally, we are now ready to create the data factory pipeline that will process any files in the storage container’s “input” folder and move them to the “output” folder. In Azure Data Factory Studio, select the Author tab and then the ellipsis next to Pipelines to create a new pipeline.

 

Joe_Fitzgerald_16-1685714407072.png

 

I named the data factory pipeline PL_Process_Input_Files. I then added a Copy data activity to the pipeline and named it Copy Files.

 

Joe_Fitzgerald_17-1685714407085.png

 

To configure the Copy data activity, select the Source tab and choose “DS_Storage_Input_Folder” as the Source dataset.

 

Joe_Fitzgerald_18-1685714407090.png

 

Next, select the Sink tab and choose “DS_Storage_Output_Folder” as the Sink dataset.

 

Joe_Fitzgerald_19-1685714407094.png
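
Putting the pieces together, the pipeline JSON stored in the repo will look something like the sketch below (the store settings are simplified for illustration):

```json
{
  "name": "PL_Process_Input_Files",
  "properties": {
    "activities": [
      {
        "name": "Copy Files",
        "type": "Copy",
        "inputs": [
          { "referenceName": "DS_Storage_Input_Folder", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "DS_Storage_Output_Folder", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": {
            "type": "BinarySource",
            "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": true }
          },
          "sink": {
            "type": "BinarySink",
            "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
          }
        }
      }
    ]
  }
}
```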

 

You can now test the data factory pipeline by clicking on the Debug button:

 

Joe_Fitzgerald_20-1685714407110.png

 

You can confirm that the files in the “input” folder have been copied to the “output” folder.

 

Joe_Fitzgerald_21-1685714407116.png

 

At this point in the development cycle, the developer would save the completed work to the Git repository. In this case we are still saving the updates directly to the “main” branch; best practice would be to work in a feature branch and merge into “main” through a pull request with a code review. To save the updates, select “Save all”.

 

Joe_Fitzgerald_22-1685714407118.png

 

After saving the data factory, your Git repo folder/file structure should now look like this:

 

Joe_Fitzgerald_23-1685714407121.png

 

Publishing Concept for Azure Data Factory

To grasp the concept behind publishing a data factory, it's essential to understand the two modes in which an instance of a data factory operates. Let's draw a comparison to someone writing C# code in Visual Studio and compiling it into DLLs for deployment to an Azure App Service.

 

In Azure Data Factory, the first mode resembles a development environment, equipped with an Integrated Development Environment (IDE), where you can "program" the data factory. In this environment, known as Azure Data Factory Studio, you can add linked services, datasets, data pipelines, and more. Additionally, you have the ability to debug the data factory by running selected activities from within the Data Factory Studio. If the data factory has source control configured, saving any updates will store the JSON representation in the Git repository.

 

The second mode is referred to as the live mode. This is where the actual instance of the data factory runs and is ready to execute activities on demand. This is where the concept of "publishing" the data factory comes into play. Clicking the "Publish" button in the Data Factory Studio triggers a process that takes the JSON files representing the data factory and generates ARM templates, along with an ARM template parameters file. These ARM templates represent the entities created in the data factory, such as linked services, datasets, and data pipelines.
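
To make this concrete, the generated parameters file (ARMTemplateParametersForFactory.json) contains an entry for each overridable property, such as the factory name and environment-specific linked service values. A heavily trimmed, illustrative example, using the placeholder vault URL from earlier, might look like this:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": {
      "value": "adf-demo-dev-eastus-001"
    },
    "LS_AKV_Demo_properties_typeProperties_baseUrl": {
      "value": "https://kv-adf-demo-dev-001.vault.azure.net/"
    }
  }
}
```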

 

If the Git configuration is not connected, the publishing process deploys these ARM templates directly to the live mode of the data factory in Azure. If the Git configuration is connected and the “Publish” button is not disabled, the ARM templates are saved to the adf_publish branch in the Git repository.

 

Previously, the adf_publish branch played a role in facilitating the publishing process. However, with the introduction of the new deployment approach that no longer relies on the manual Publish button, the adf_publish branch is no longer necessary. It is best to have the Git configuration connected only in the development Azure Data Factory and to keep the “Publish” button disabled.

 

Configure Deployed Azure Resources

To deploy resources to the Azure cloud from Azure DevOps, an App Registration is required. The app registration has an associated identity in Azure Active Directory known as a Service Principal. First, we need to create the Service Principal in Azure Active Directory that will have permission to deploy the ADF ARM template. In the Azure portal, navigate to Azure Active Directory and then to App registrations. Click Register an application and give it a descriptive name such as “ADF Deployment App Registration”.

Joe_Fitzgerald_0-1685715530723.png

 

After the app registration is created, you will need to create a Client Secret that will be used when creating the Service Connection in Azure DevOps. Be sure to copy the Value created for the secret, because it is only displayed once.

Joe_Fitzgerald_1-1685715530728.png

 

Next, you need to grant the Contributor role to the App Registration on the resource group where the resources are located. Navigate to the resource group, click “Access control (IAM)”, and then click “Add role assignment”. Select the Contributor role and select Next.

Joe_Fitzgerald_2-1685715530730.png

 

For Members, click on “+ Select members” then search for and select the App Registration that you previously created.

 

Joe_Fitzgerald_3-1685715530732.png

 

Finally click on the “Review and assign” button.

 

Joe_Fitzgerald_4-1685715530736.png
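
As an alternative to clicking through the portal, the same role assignment can be deployed as a small ARM template scoped to the resource group. The sketch below is one way to do it, assuming you pass in the Object ID of the App Registration's service principal as principalId; the GUID is the built-in Contributor role definition ID.

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "principalId": { "type": "string" }
  },
  "resources": [
    {
      "type": "Microsoft.Authorization/roleAssignments",
      "apiVersion": "2022-04-01",
      "name": "[guid(resourceGroup().id, parameters('principalId'), 'Contributor')]",
      "properties": {
        "roleDefinitionId": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b24988ac-6180-42a0-ab88-20f7382dd24c')]",
        "principalId": "[parameters('principalId')]",
        "principalType": "ServicePrincipal"
      }
    }
  ]
}
```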

 

Navigate back to the Project Settings of your Azure DevOps portal to create a new Service Connection. Select Azure Resource Manager as the connection type and Service principal (manual) as the authentication method.

Joe_Fitzgerald_5-1685715530737.png

 

Joe_Fitzgerald_6-1685715530739.png

 

 

The remaining settings are as follows:

  - Environment => Azure Cloud
  - Scope Level => Subscription
  - Subscription ID => Subscription ID of the subscription where the Azure resources are located
  - Subscription Name => Subscription name where the Azure resources are located
  - Service Principal ID => Application (client) ID of the App Registration
  - Credential => Service Principal key
  - Service Principal Key => Client Secret created for the App Registration
  - Tenant ID => Azure Active Directory Tenant ID

 

Joe_Fitzgerald_7-1685715530742.png

 

For simplicity’s sake, I will select the “Grant access permission to all pipelines” checkbox, and then click “Verify and save” to create the Service Connection. If this checkbox is not selected, then the first time an Azure DevOps pipeline runs using the Service Connection, you will be required to explicitly grant that pipeline permission to use the service connection.

Joe_Fitzgerald_8-1685715530743.png

 

Conclusion

In Part 2 of this series, we learned how to configure source control for the data factory, and then we created a sample data factory pipeline that uses linked services to Azure Key Vault and Azure Storage. Use of the linked services will become important later in this blog series to demonstrate how to use an ARM Template parameters file. We then discussed the “publishing concept” for Azure Data Factory and how it comes into play when creating the ARM templates and parameters file, including what it means for configuring source control and which Git branch it is associated with. Finally, to configure our Azure resources, we created an app registration in Azure Active Directory and used it with an Azure DevOps Service Connection for deploying the ADF ARM template. In Part 3, we will describe how to create the Azure DevOps pipeline using YAML to create the ADF ARM templates and how to use them for deployments.

 
