
Part 3 - Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process

Joe_Fitzgerald
Aug 04, 2023

Introduction

To see a complete introduction to this blog series, including links to all the other parts, please follow the link below:

Part 1 - Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process

 

Part 3 of the blog series will focus on:

  1. The YAML Pipeline Structure
  2. The Publish Process
  3. ARM Template Parameterization
  4. ADF ARM Template Deployment

Access to all of the files is available in GitHub.

 

The YAML Pipeline Structure

In this section, we will use Azure DevOps Pipelines to create the YAML pipeline for publishing data factory artifacts and then deploying those artifacts to a specific environment (dev, staging, production). A later part of this blog series will describe how to accomplish the same publish and deployment using GitHub workflows and actions.

 

The YAML pipeline structure consists of user-defined variables and two kinds of stages: one to publish artifacts and one to deploy them. To completely understand the publishing concept, refer to Part 2 of this blog series, under the section called “Publishing Concept for Azure Data Factory”. The main takeaway is that an instance of Azure Data Factory runs in “live mode” (also called “data factory mode”) and at the same time can have Git configured so that branches of ADF JSON files can be used for development. The publishing process creates an ARM Template file that can then be used for deploying the ADF. In our example, we have a stage for deploying the artifacts to each environment (dev, staging, production).

 

Defining variables provides a convenient way to include data in multiple parts of the pipeline. As a reminder, do not set secret variables in your YAML file. Instead, set secret variables in the pipeline settings UI, set them in variable groups, or use the Azure Key Vault task to retrieve secrets.

 

For our example, we include the following variables and stages:
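Because the original screenshot is not reproduced here, the sketch below shows the general shape of that skeleton; the trigger, variable names, and placeholder values are illustrative rather than the exact contents of our pipeline:

```yaml
trigger:
  branches:
    include:
      - main

variables:
  # Non-secret values only; keep secrets in variable groups or Key Vault
  workingDir: '$(Build.Repository.LocalPath)'
  subscriptionId: '<subscription-id>'    # placeholder
  resourceGroupName: '<resource-group>'  # placeholder

stages:
  # Stage 1: Publish Artifacts (the seven publish steps are described below)
  - stage: Build
    displayName: 'Publish ADF Artifacts'
    jobs:
      - job: Publish
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - script: echo "Publish steps are shown in the next section"

  # Stages 2..n: Deploy Artifacts, one stage per environment (dev shown here)
  - stage: DeployDev
    displayName: 'Deploy Artifacts to Dev'
    dependsOn: Build
    jobs:
      - deployment: DeployADF
        environment: 'dev'
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deployment steps are shown later in this post"
```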


The Publish Process

In the YAML pipeline, the publishing of the ARM Template and accompanying files is accomplished using a Node Package Manager (npm) script that consolidates the JSON files representing the ADF. This automated publishing of the ARM Template is well documented and makes use of the Azure Data Factory npm package utility. In this section we will succinctly describe the requirements for using the npm package and the steps that need to be included in the pipeline for publishing the deployment artifacts.

The first step in working with npm is the creation of the package.json file at the root of the source folder for the data factory. The package.json file lists the packages that your project depends on and tells npm what package to run with the associated commands. The Azure Data Factory npm package utility requires the package.json file to be located in the target repository. The layout below shows the file location in our repository:
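Since the screenshot is not reproduced here, the sketch below shows a layout like ours; the “adf” folder name is an assumption, while the subfolders are the standard ones ADF creates when Git is configured:

```
adf/
├── dataset/
├── linkedService/
├── pipeline/
├── trigger/
└── package.json   <- at the root of the data factory source folder
```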


The package.json file should contain the JSON below to include the npm package for the Azure Data Factory utilities that will be installed on the build agent. Line 3 states that whenever we run the “publish” command, npm should use the Azure Data Factory utilities package to create the ARM Template for the data factory.
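The snippet below matches this description; it uses the documented @microsoft/azure-data-factory-utilities package, though the version pin shown is illustrative:

```json
{
  "scripts": {
    "publish": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
  },
  "dependencies": {
    "@microsoft/azure-data-factory-utilities": "^1.0.0"
  }
}
```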


There are seven steps to the publishing process, all of which are included in the “build” stage of the YAML pipeline. The seven steps are:

  1. Install Node on the build agent.
  2. Install Node packages defined in package.json.
  3. Validate the creation of ADF artifacts.
  4. Generate the ARM Template and accompanying files from the data factory JSON source code.
  5. Copy the generated ARM Template and accompanying files to the artifacts staging directory on the build agent.
  6. Copy ARM Template parameter files for each environment to the artifacts staging directory on the build agent.
  7. Publish the pipeline artifacts for use in next stages.


The first two steps install Node on the build agent and then install the Node packages defined in package.json:
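A sketch of these two steps using the standard NodeTool and Npm tasks; the Node version and the working directory are assumptions to adjust for your repository:

```yaml
steps:
  # Step 1: install Node.js on the build agent
  - task: NodeTool@0
    displayName: 'Install Node.js'
    inputs:
      versionSpec: '18.x'

  # Step 2: install the packages listed in package.json
  - task: Npm@1
    displayName: 'Install npm packages'
    inputs:
      command: 'install'
      workingDir: '$(Build.Repository.LocalPath)/adf'  # folder containing package.json
      verbose: true
```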


The next two steps validate that the JSON represents a legitimate data factory and create the ARM artifacts for publishing in the following steps. This consolidates items like datasets, linked services, and pipelines into the single (or linked) ARM Template JSON files:
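Sketches of the validate and export steps, continuing the same steps list. They invoke the “publish” script from package.json; the subscription ID, resource group, factory name, and folder paths are placeholders:

```yaml
  # Step 3: validate that the ADF JSON source represents a deployable factory
  - task: Npm@1
    displayName: 'Validate ADF artifacts'
    inputs:
      command: 'custom'
      workingDir: '$(Build.Repository.LocalPath)/adf'
      customCommand: 'run publish validate $(Build.Repository.LocalPath)/adf /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataFactory/factories/<factory-name>'

  # Step 4: generate the ARM Template and accompanying files into an "ArmTemplate" folder
  - task: Npm@1
    displayName: 'Generate ARM Template'
    inputs:
      command: 'custom'
      workingDir: '$(Build.Repository.LocalPath)/adf'
      customCommand: 'run publish export $(Build.Repository.LocalPath)/adf /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataFactory/factories/<factory-name> "ArmTemplate"'
```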


The final three steps copy the generated ARM Template with its accompanying files, along with the ARM Template parameter files for each environment, into the “ARMTemplateOutput” target folder. Finally, the generated artifacts are published to Azure Pipelines artifacts for use in the deployment stages.
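A sketch of the final three steps, continuing the same steps list; the source folder names are assumptions that follow the conventions above, while “ARMTemplateOutput” is the target folder named in this post:

```yaml
  # Step 5: copy the generated ARM Template and accompanying files
  - task: CopyFiles@2
    displayName: 'Copy generated ARM Template files'
    inputs:
      SourceFolder: '$(Build.Repository.LocalPath)/adf/ArmTemplate'
      Contents: '**'
      TargetFolder: '$(Build.ArtifactStagingDirectory)/ARMTemplateOutput'

  # Step 6: copy the per-environment ARM Template parameter files
  - task: CopyFiles@2
    displayName: 'Copy environment parameter files'
    inputs:
      SourceFolder: '$(Build.Repository.LocalPath)/adf/environment-params'
      Contents: 'ARMTemplateParams-*.json'
      TargetFolder: '$(Build.ArtifactStagingDirectory)/ARMTemplateOutput'

  # Step 7: publish the artifacts for use in the deployment stages
  - task: PublishPipelineArtifact@1
    displayName: 'Publish pipeline artifacts'
    inputs:
      targetPath: '$(Build.ArtifactStagingDirectory)/ARMTemplateOutput'
      artifact: 'ARMTemplateOutput'
```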


The output generated by the above steps will include files like the following:
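In place of the original screenshot, here is a representative listing (the exact set of generated files varies with the factory's contents):

```
ARMTemplateOutput/
├── ARMTemplateForFactory.json             <- generated by the npm utility
├── ARMTemplateParametersForFactory.json   <- generated by the npm utility
├── ARMTemplateParams-Dev.json             <- created manually
├── ARMTemplateParams-Staging.json         <- created manually
└── ARMTemplateParams-Prod.json            <- created manually
```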


In the listing above, the files produced by the npm data factory utility are marked as generated, while the remaining files are the ARM Template parameter files that were created manually for each specific environment (dev, staging, prod).

 

ARM Template Parameterization

Before we begin discussing the deployment stages of the Azure pipeline and how to use the generated output files, there is a concept surrounding ARM Template Parameterization that we need to mention. The documentation for this concept explains scenarios when you might want to override the Resource Manager parameter configuration.

For the purposes of this article, we are leveraging the default capabilities involving the generated ARM Template parameter file, “ARMTemplateParametersForFactory.json”, to create our specific ARM Template parameter files for each environment. In this specific case, the ARM Template parameter file for the dev environment contains the following:
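A minimal sketch of what ARMTemplateParams-Dev.json might contain, assuming only the factory name differs per environment (the factory name is a placeholder):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": {
      "value": "<factory-name>-dev"
    }
  }
}
```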


In our sample code the ARM Template parameter files for each environment are saved in the following folder structure:
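The layout in our sample looks roughly like this (the “environment-params” folder name is illustrative):

```
adf/
├── package.json
└── environment-params/
    ├── ARMTemplateParams-Dev.json
    ├── ARMTemplateParams-Staging.json
    └── ARMTemplateParams-Prod.json
```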


ADF ARM Template Deployment

The stage and steps shown below automate the deployment of a data factory to multiple environments. In this case, we are deploying to the “dev” environment using a deployment job. Some important things to point out: the dependsOn keyword means this stage will not run until the build stage has completed successfully; the environment keyword names a target environment used to record the deployment history, and you can optionally take advantage of the Approvals and Gates feature in Azure DevOps to control the workflow of the deployment pipeline; and the strategy keyword defines how the data factory will be rolled out. In this case, we are using the runOnce deployment strategy.

Only two steps are required for the deployment: the task that checks out the code from the Git repository and the Azure Resource Manager Template Deploy task. The important inputs to point out are csmFile and csmParametersFile. The csmFile parameter will remain the same for each environment (dev, staging, prod); however, the csmParametersFile parameter will have to be updated to use the ARM Template parameters file that was manually created for each environment. In the case of the dev environment, the file is called ARMTemplateParams-Dev.json.
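A sketch of the dev deployment stage under these assumptions; the service connection, subscription, resource group, and region are placeholders, and the artifact path assumes the “ARMTemplateOutput” artifact published by the build stage (deployment jobs download pipeline artifacts to $(Pipeline.Workspace) automatically):

```yaml
  - stage: DeployDev
    displayName: 'Deploy Artifacts to Dev'
    dependsOn: Build   # runs only after the build stage succeeds
    jobs:
      - deployment: DeployADF
        displayName: 'Deploy Data Factory'
        environment: 'dev'   # records deployment history; can be gated with Approvals
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                # Deployment jobs do not check out source by default
                - checkout: self

                - task: AzureResourceManagerTemplateDeployment@3
                  displayName: 'Deploy ADF ARM Template'
                  inputs:
                    deploymentScope: 'Resource Group'
                    azureResourceManagerConnection: '<service-connection-name>'
                    subscriptionId: '<subscription-id>'
                    action: 'Create Or Update Resource Group'
                    resourceGroupName: '<resource-group-dev>'
                    location: '<region>'
                    templateLocation: 'Linked artifact'
                    csmFile: '$(Pipeline.Workspace)/ARMTemplateOutput/ARMTemplateForFactory.json'
                    csmParametersFile: '$(Pipeline.Workspace)/ARMTemplateOutput/ARMTemplateParams-Dev.json'
                    deploymentMode: 'Incremental'
```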


You can see the completed YAML pipeline in the GitHub repo.

 

Conclusion

In Part 3 of this series, we learned that the YAML pipeline structure consists of user-defined variables and stages to publish artifacts and deploy artifacts. We briefly reviewed the ADF publishing concept from Part 2 of this blog series. Then we discussed the requirements for using the Azure Data Factory npm package utility and the steps that need to be included in the pipeline for publishing the deployment artifacts, including the all-important package.json file. We then described the seven steps of the publishing process, all of which are included in the “build” stage of the YAML pipeline, and covered the artifacts generated by the publishing process. Among these generated artifacts are an ARM Template and an ARM Template parameters file that are used to deploy the data factory; the latter is also the basis for generating your custom ARM Template parameters file for each environment. Finally, we discussed the deployment stage, which checks out the files from the Git repository and uses the Azure Resource Manager Template Deployment task to deploy the Azure Data Factory.
