
Part 4 - Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process

j_folberth
Aug 14, 2023

Background 

This post is the next in the series Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process. It also overlaps with, and is included in, the series on YAML Pipelines. All code snippets and final templates can be found on my GitHub repository TheYAMLPipelineOne. For the actual data factory, we will leverage my adf_pipelines_yaml_ci_cd repository.

 

Introduction 

After reading parts 1-3 of Unlock the Power of Azure Data Factory, one may be left wondering how to take what was provided and convert it to enterprise scale. Terminology and expectations are key, so let's outline what we would like to see from an enterprise-scale deployment:

  • Write once, reuse across projects. 
  • Individual components can be reused. 
  • Limited manual intervention. 
  • Easily updated. 
  • Centralized definition. 

Depending on where your organization is in its pipeline and DevOps maturity, this may sound daunting. Have no fear, as we will walk you through how to achieve this with YAML templates for Azure Data Factory. At the end of this piece, you should be well equipped to create a working pipeline for Azure Data Factory in a matter of minutes.

 

Set Up 

To assist in the goals outlined above for enterprise-scale deployments, I recommend having a separate repository for your YAML templates that resides outside of your Data Factory repository. This helps check the boxes for a centralized definition, writing once and reusing across projects, easy updates, and reusable individual components. For more context on this, check out my post on Azure DevOps Pipelines: Practices for Scaling Templates.

 

Our individual Data Factories will each have a dedicated CI/CD pipeline which will reference the separate repository we are putting the YAML templates in. This can be achieved natively in Azure DevOps. Furthermore, it is not unheard of for larger scale organizations to have a “DevOps Team” or a team responsible for pipeline deployments. If this is the case in your organization, you can think of this other team as “owning” the centralized repository. 
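
For illustration, referencing the template repository from a pipeline in the Data Factory repository might look like the following sketch (the project and repository path shown here is hypothetical); the `templates` alias is what the pipeline at the end of this post refers to with `@templates`: 

resources: 
  repositories: 
  - repository: templates                  # alias referenced later as @templates 
    type: git                              # Azure Repos Git repository 
    name: MyProject/TheYAMLPipelineOne     # hypothetical <project>/<repository> path 
    ref: refs/heads/main                   # branch of the template repository to use 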

 

Templating Up  

For those who have read my posts on this topic, either on the Microsoft Health and Life Sciences blog or my personal blog, this shouldn't be a new concept. For those unaware, "Templating Up" is the process by which we outline the individual build/deploy steps as tasks.

 

This process will give us the individual tasks required for a build/deployment and visually show us which tasks can be repeated. An additional benefit of breaking this down in such a manner is that we are left with task templates that can be reused outside our given project. More on that in a minute.

 

Recapping from the previous Part 3 post, here are the Microsoft Azure DevOps tasks and associated documentation that the process will require:

  1. Install Node - In order to leverage the Node Package Manager (npm), Node.js needs to be installed on the agent. 
  2. Install NPM - Now that Node.js is installed, it's time to run npm install so we can execute our package. 
  3. Run NPM package - This is where the "magic" happens and the creation of our ARM template will occur. The nuance here is that the package requires the resource ID of a data factory. If adhering to a true CI/CD lifecycle, this should be the deployed DEV instance, since that is what we want to promote to future environments. 
  4. Publish Pipeline Artifacts - This task will take the output from the npm package execution, as well as code in the repository, and create a pipeline artifact. This is key, as this artifact is what we will use for our deployment stages. 

The steps required for deployment: 

  1. Stop Azure Data Factory Triggers – This is a PowerShell script created by the deploy process which will ensure our triggers are not executing during deployment. 
  2. Azure Resource Manager (ARM) Template Deployment – The ARM template published as part of the build process will now be deployed to an environment. We will need to provide the opportunity to include a parameter file as well as override parameters if needed. 
  3. Start Azure Data Factory Triggers – After a successful deployment we will want to start the Azure Data Factory triggers with the same script we used to stop the ADF triggers. 

Now, perhaps the most important step in templating up is identifying which steps are unique to Azure Data Factory. The answer, as surprising as it might sound, is that none of these are specific to Data Factory. Installing node, npm, executing npm, publishing pipeline artifacts, executing PowerShell, and running ARM deployments are all platform-agnostic tools. Let's keep that in mind as we start to template this out.

 

Build Task

When we are creating reusable task templates the goal is that the template will perform exactly one task. Even if that task is as simple as 5 lines of code, the template should be one task, as it is the lowest foundational block of a YAML Pipeline. By scoping and limiting it to one task, we maximize the chance that this same task is reused across multiple pipelines.

node_install_task.yml 

 

 

parameters: 
- name: versionSpec 
  type: string 
  default: '16.x' 
steps: 
- task: NodeTool@0 
  inputs: 
    versionSpec: ${{ parameters.versionSpec }} 
  displayName: 'Installing Node Version ${{ parameters.versionSpec }}'

 

 

Defaulting the parameter to a current version can save developers a parameter definition, yet still provides the ability to override it in future implementations.
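
For example, a calling pipeline that needs a newer Node.js version only has to supply the override (the version shown here is hypothetical): 

steps: 
- template: ../tasks/node_install_task.yml 
  parameters: 
    versionSpec: '18.x'   # hypothetical override; omit to fall back to the 16.x default 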

npm_install_task.yml 

 

 

parameters: 
- name: verbose 
  type: boolean 
  default: false 
- name: packageDir 
  type: string 
  default: '$(Build.Repository.LocalPath)' 
steps: 
- task: Npm@1 
  inputs: 
    command: 'install' 
    verbose: ${{ parameters.verbose }} 
    workingDir: ${{ parameters.packageDir }} 
  displayName: 'Install npm package' 

 

 

One thing I try to do with these tasks is to provide all available inputs as parameters and set the defaults there. In this case we are doing it with the 'verbose' parameter. Again, this task can easily be reused by any pipeline that will require an npm install.
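
To illustrate that reuse claim, a hypothetical JavaScript build pipeline could call the exact same template and simply point it at its own package.json (the folder path below is made up): 

steps: 
- template: ../tasks/npm_install_task.yml 
  parameters: 
    packageDir: '$(Build.Repository.LocalPath)/src/webapp'   # hypothetical JavaScript project folder 
    verbose: true                                            # override the default when more logging is needed 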

npm_custom_command_task.yml 

 

 

parameters: 
- name: customCommand 
  type: string 
  default: '' 
- name: packageDir 
  type: string 
  default: '' 
- name: displayName 
  type: string 
  default: '' 
steps: 
- task: Npm@1 
  displayName: ${{ parameters.displayName }} 
  inputs: 
    command: 'custom' 
    customCommand: ${{ parameters.customCommand }} 
    workingDir: ${{ parameters.packageDir }} 

 

 

 

This task will require the location of the package.json that would have been created as part of setting up your repository for Data Factory CI/CD.
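
For a refresher, the contents of the package.json typically look like the following (based on Microsoft's documentation for the Azure Data Factory npm utilities; the exact version number may differ in your repository): 

{ 
    "scripts": { 
        "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index" 
    }, 
    "dependencies": { 
        "@microsoft/azure-data-factory-utilities": "^1.0.0" 
    } 
} 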

 

If one is astute and quick, one will notice this feels very similar to the npm install task directly above. Both tasks leverage the Npm@1 task. The difference here is that when specifying a command with the value 'custom', the 'customCommand' property immediately becomes required.

 

Thus, our task template will require different inputs, and this is what delineates the need for a second task template.

ado_publish_pipeline_task.yml 

 

 

parameters: 
- name: targetPath 
  type: string 
  default: '$(Build.ArtifactStagingDirectory)' 
- name: artifactName 
  type: string 
  default: 'drop' 
 
steps: 
- task: PublishPipelineArtifact@1 
  displayName: 'Publish Pipeline Artifact ${{ parameters.artifactName }} ' 
  inputs: 
    targetPath: ${{ parameters.targetPath }} 
    artifact:  ${{ parameters.artifactName }} 

 

 

 

This task is required to attach the compiled artifact to the pipeline, which lets the pipeline reuse the artifact in future stages. This task is fundamental for any artifact-based deployment in Azure DevOps.

 

Task Summary  

Looking back on these tasks I want to emphasize something. These tasks are 100% agnostic of Data Factory. That means we’ve just created tasks that can be reused from JavaScript builds which leverage npm all the way to infrastructure deployments that will need the publish pipeline artifact task. 

 

Build Job 

Since this is just a build process, we need a job template that will call each one of these tasks. Something to consider with Azure DevOps jobs is that, by default, they will run in parallel. This job will act as the orchestrator of these tasks. To ensure optimal reusability we have to define the various inputs as parameters.
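
As a minimal sketch of that default behavior (unrelated to Data Factory), two jobs in the same stage run in parallel unless a dependency is declared: 

jobs: 
- job: first 
  steps: 
  - script: echo "runs on its own agent" 
- job: second 
  dependsOn: first   # without this, 'second' would be scheduled in parallel with 'first' 
  steps: 
  - script: echo "runs only after 'first' completes" 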

adf_build_job.yml 

 

 

parameters: 
- name: packageDir 
  type: string 
  default: '' 
- name: dataFactoryResourceID 
  type: string 
  default: '' 
- name: regionAbrv 
  type: string 
  default: '' 
- name: serviceName 
  type: string 
  default: '' 
- name: environmentName 
  type: string 
  default: '' 
- name: adfDir 
  type: string 
  default: '' 
jobs: 
- job: 'adf_${{ parameters.serviceName }}_${{ parameters.environmentName }}_${{ parameters.regionAbrv }}_build' 
  steps: 
    - template: ../tasks/node_install_task.yml 
    - template: ../tasks/npm_install_task.yml 
      parameters: 
        packageDir: ${{ parameters.packageDir }} 
    - template: ../tasks/npm_custom_command_task.yml 
      parameters: 
        packageDir: ${{ parameters.packageDir }} 
        displayName: 'Run ADF NPM Utility' 
        customCommand: 'run build export ${{ parameters.adfDir }} ${{ parameters.dataFactoryResourceID }}' 
    - template: ../tasks/ado_publish_pipeline_task.yml 
      parameters: 
        targetPath: ${{ parameters.packageDir }} 
        artifactName: 'ADFTemplates' 

 

 

 

For this to work across data factories, the biggest pieces to parameterize will be the working directory the package.json is located in and the resource ID of the Data Factory which the utility will run against. Effectively, the Data Factory resource ID will belong to the Data Factory in your lowest environment. This ties back to the concept covered in Part 2, where we promote an application or package from the lowest environment through to our production environment. By defining these as parameters we are enabling this job to be used across multiple data factories.

 

Build Stage 

For building the artifacts that we will use across environments we require only one stage. This stage's purpose is first to ensure our Data Factory changes will compile correctly, and second to produce the reusable ARM templates, associated parameters, and necessary scripts to be leveraged across future deployment stages (Azure environments). This stage will only require one job, the job to build and publish the Data Factory which we outlined above.

adf_build_stage.yml 

 

 

parameters: 
- name: serviceName 
  type: string 
  default: 'SampleApp' 
- name: packageDir 
  type: string 
  default: '$(Build.Repository.LocalPath)/adf_scripts' 
- name: adfDir 
  type: string 
  default: '$(Build.Repository.LocalPath)/adf' 
- name: baseEnv 
  default: 'dev' 
- name: baseRegion 
  default: 'eus' 
stages: 
- stage: '${{ parameters.serviceName }}_build' 
  variables: 
  - template: ../variables/azure_global_variables.yml 
  - template: ../variables/azure_${{ parameters.baseEnv }}_variables.yml 
  jobs: 
    - template: ../jobs/adf_build_job.yml 
      parameters: 
        environmentName: ${{ parameters.baseEnv }} 
        dataFactoryResourceID: '/subscriptions/${{ variables.azureSubscriptionID }}/resourceGroups/${{ variables.resourceGroupAbrv }}-${{ parameters.serviceName }}-${{ parameters.baseEnv }}-${{ parameters.baseRegion }}/providers/Microsoft.DataFactory/factories/${{ variables.dataFactoryAbrv }}-${{ parameters.serviceName }}-${{ parameters.baseEnv }}-${{ parameters.baseRegion }}' 
        serviceName: ${{ parameters.serviceName }} 
        regionAbrv: ${{ parameters.baseRegion }} 
        packageDir: ${{ parameters.packageDir }} 
        adfDir: ${{ parameters.adfDir }} 

 

 

 

This stage will require some arguments, which in this case are being treated as defaulted parameters. The reason a defaulted parameter is leveraged vs. a variable is that a defaulted parameter still gives any calling pipeline the ability to override these values while keeping them optional. It is extremely helpful if all Data Factories follow a consistent folder pattern. In this case 'adf' is the folder Data Factory is mapped to for source control and 'adf_scripts' is where the package.json will live, as well as the parameter files for the various environments.
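
Assuming that pattern, the Data Factory repository layout might look roughly like this (folder and file names beyond package.json and the parameter files are hypothetical): 

adf/                                   # folder Data Factory is mapped to for source control 
  pipeline/ 
  dataset/ 
  linkedService/ 
adf_scripts/ 
  package.json                         # used by the npm build/export tasks 
  parameters/ 
    dev.eus.parameters.json            # per environment/region ARM parameter files 
    tst.eus.parameters.json 
yaml/ 
  adf_pipelines_template.yml           # the pipeline shown at the end of this post 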

 

  • serviceName – Name that will be used for the UI description and the environment-agnostic name of the data factory. 
  • packageDir – Where the package.json file required for the npm tasks is located. 
  • adfDir – The directory in the repo which Azure Data Factory is mapped to. 
  • baseEnv – The npm package requires a running instance of Data Factory; this points to which environment to use. 
  • baseRegion – The npm package requires a running instance of Data Factory; this points to which region to use. 

ARM Template Parameters 

Hopefully you have followed the steps outlined in Part 3 on how to create and store a separate parameter file for each environment. 

 

Deployment Stage 

Templating a deployment stage has its own art. A build stage, by definition, is structured so that it will always run once and generate an artifact. A deployment stage template will run more than once. Ultimately, we want to define the steps once and run against multiple environments.  

 

A seasoned professional with pipeline experience may chime in here and point out that there are certain steps (load testing, for example) that will only run in a test environment and not a dev or production environment. They are correct, and I want to acknowledge that. There is a way to accommodate this; however, I will not be covering it here.

 

To recap we will want our deployment stage to execute a job with the following tasks: 

  1. Run the PrePostDeploymentScript.ps1 with the parameters required to stop the data factory triggers. 
  2. Deploy the ARM template. 
  3. Run the PrePostDeploymentScript.ps1 with the parameters required to start the data factory triggers. 

Variables 

Unlike the build template, we are going to want to leverage variable template files that are targeted to a specific environment. An in-depth review of how to use variable templates was covered in a previous post in the YAML Pipeline series.

 

The abbreviated version is that across all Azure pipelines there will be variables scoped to a specific environment. These would be items such as the service connection name, a subscription ID, or a shared key vault. These variables can be stored in a variable template file in our YAML template repository and loaded as part of the individual deployment stage.
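
As a hypothetical sketch, the environment and global variable template files referenced by the stages in this post could look like this (the variable names are the ones used later; the values are placeholders): 

# azure_dev_variables.yml 
variables: 
  azureServiceConnectionName: 'svc-conn-dev'                    # placeholder service connection name 
  azureSubscriptionID: '00000000-0000-0000-0000-000000000000'   # placeholder subscription ID 

# azure_global_variables.yml 
variables: 
  resourceGroupAbrv: 'rg' 
  dataFactoryAbrv: 'adf' 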

 

Deployment Tasks 

Same as with the build tasks, these steps will need to be scoped to the individual task level to optimize reuse for processes outside of Data Factory.

azpwsh_file_execute_task.yml 

 

 

parameters: 
- name: azureSubscriptionName 
  type: string 
- name: scriptPath 
  type: string 
- name: ScriptArguments 
  type: string 
  default: '' 
- name: errorActionPreference 
  type: string 
  default: 'stop' 
- name: FailOnStandardError 
  type: boolean 
  default: false 
- name: azurePowerShellVersion 
  type: string 
  default: 'OtherVersion'   # 'LatestVersion' or 'OtherVersion'; 'OtherVersion' uses preferredAzurePowerShellVersion below 
- name: preferredAzurePowerShellVersion 
  type: string 
  default: '3.1.0' 
- name: pwsh 
  type: boolean 
  default: false 
- name: workingDirectory 
  type: string 
- name: displayName 
  type: string 
  default: 'Running Custom Azure PowerShell script from file' 
 
steps: 
- task: AzurePowerShell@5 
  displayName: ${{ parameters.displayName }} 
  inputs: 
    scriptType: 'FilePath' 
    ConnectedServiceNameARM: ${{ parameters.azureSubscriptionName }} 
    scriptPath: ${{ parameters.scriptPath  }} 
    ScriptArguments: ${{ parameters.ScriptArguments }} 
    errorActionPreference: ${{ parameters.errorActionPreference }} 
    FailOnStandardError: ${{ parameters.FailOnStandardError }} 
    azurePowerShellVersion: ${{ parameters.azurePowerShellVersion }} 
    preferredAzurePowerShellVersion: ${{ parameters.preferredAzurePowerShellVersion }} 
    pwsh: ${{ parameters.pwsh }} 
    workingDirectory: ${{ parameters.workingDirectory }} 

 

 

The art in this task is realizing we want to make it generic enough for Data Factory to use the same task to execute the PrePostDeploymentScript.ps1 to disable/re-enable triggers. While we are at it, we also want to make sure this task can execute any PowerShell script we provide it.
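
Because nothing in the template is Data Factory specific, the same task could run any other PowerShell script. A hypothetical example outside of Data Factory (the service connection and script path are made up): 

# (inside a job's steps section) 
- template: ../tasks/azpwsh_file_execute_task.yml 
  parameters: 
    azureSubscriptionName: 'svc-conn-dev'                                   # hypothetical service connection 
    scriptPath: '$(Build.SourcesDirectory)/scripts/Set-ResourceTags.ps1'    # hypothetical script in the repo 
    ScriptArguments: '-Environment dev' 
    workingDirectory: '$(Build.SourcesDirectory)' 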

 

ado_ARM_deployment_task.yml 

 

 

parameters: 
- name: deploymentScope 
  type: string 
  default: 'Resource Group' 
- name: azureResourceManagerConnection 
  type: string 
  default: '' 
- name: action 
  type: string 
  default: 'Create Or Update Resource Group' 
- name: resourceGroupName 
  type: string 
  default: '' 
- name: location 
  type: string 
  default: eastus 
- name: csmFile 
  type: string 
  default: '' 
- name: overrideParameters 
  type: string 
  default: '' 
- name: csmParametersFile 
  type: string 
  default: '' 
- name: deploymentMode 
  type: string 
  default: 'Incremental' 
 
 
steps: 
- task: AzureResourceManagerTemplateDeployment@3 
  inputs: 
    deploymentScope: ${{ parameters.deploymentScope }} 
    azureResourceManagerConnection: ${{ parameters.azureResourceManagerConnection }} 
    action: ${{ parameters.action  }} 
    resourceGroupName: ${{ parameters.resourceGroupName }} 
    location: ${{ parameters.location }} 
    csmFile: '$(Agent.BuildDirectory)/${{ parameters.csmFile }}' 
    csmParametersFile: '$(Agent.BuildDirectory)/${{ parameters.csmParametersFile }}' 
    overrideParameters: ${{ parameters.overrideParameters }} 
    deploymentMode: ${{ parameters.deploymentMode }} 

 

 

 

This ARM template deployment task should be able to handle any ARM deployment in your environment. It is not tied to just Data Factory, as it will accept the ARM template, parameters file, and any override parameters. Additionally, for those unaware, AzureResourceManagerTemplateDeployment@3 supports Bicep deployments if Azure CLI 2.20.0 or later is available on the agent.
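
For example, the same template could deploy a hypothetical storage account ARM template that was published as a pipeline artifact, completely outside of any Data Factory pipeline (all values below are placeholders): 

# (inside a deployment job's steps section) 
- template: ../tasks/ado_ARM_deployment_task.yml 
  parameters: 
    azureResourceManagerConnection: 'svc-conn-dev'               # hypothetical service connection 
    resourceGroupName: 'rg-storage-dev-eus'                      # hypothetical resource group 
    csmFile: 'drop/storageAccount.json'                          # hypothetical artifact path, relative to $(Agent.BuildDirectory) 
    csmParametersFile: 'drop/storageAccount.dev.parameters.json' 
    overrideParameters: '-storageAccountName stdeveus001'        # hypothetical override 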

 

Deployment Job

Now that the tasks have been created, we will need to orchestrate them in a job. Our job will need to load the specific environment variables required for our deployment. 

adf_deploy_env_job.yml 

 

 

parameters: 
- name: environmentName 
  type: string 
- name: serviceName 
  type: string 
- name: regionAbrv 
  type: string 
- name: location 
  type: string 
  default: 'eastus' 
- name: templateFile 
  type: string 
- name: templateParametersFile 
  type: string 
- name: overrideParameters 
  type: string 
  default: '' 
- name: artifactName 
  type: string 
  default: 'ADFTemplates' 
- name: stopStartTriggersScriptName 
  type: string 
  default: 'PrePostDeploymentScript.ps1' 
- name: workingDirectory 
  type: string 
  default: '../' 
 
jobs: 
- deployment: '${{ parameters.serviceName }}_infrastructure_${{ parameters.environmentName }}_${{ parameters.regionAbrv }}' 
  environment: ${{ parameters.environmentName }} 
  variables:  
  - template: ../variables/azure_${{parameters.environmentName}}_variables.yml 
  - template: ../variables/azure_global_variables.yml 
  - name: deploymentName 
    value: '${{ parameters.serviceName }}_infrastructure_${{ parameters.environmentName }}_${{ parameters.regionAbrv }}' 
  - name: resourceGroupName 
    value: '${{ variables.resourceGroupAbrv }}-${{ parameters.serviceName }}-${{ parameters.environmentName }}-${{ parameters.regionAbrv }}' 
  - name: dataFactoryName 
    value: '${{ variables.dataFactoryAbrv }}-${{ parameters.serviceName }}-${{ parameters.environmentName }}-${{ parameters.regionAbrv }}' 
  - name: powerShellScriptPath 
    value: '../${{ parameters.artifactName }}/${{ parameters.stopStartTriggersScriptName }}' 
  - name: ARMTemplatePath 
    value: '${{ parameters.artifactName }}/${{ parameters.templateFile }}' 
     
  strategy: 
    runOnce: 
        deploy: 
            steps: 
            - template: ../tasks/azpwsh_file_execute_task.yml 
              parameters: 
                azureSubscriptionName: ${{ variables.azureServiceConnectionName }} 
                scriptPath: ${{ variables.powerShellScriptPath }} 
                ScriptArguments: '-armTemplate "${{ variables.ARMTemplatePath }}" -ResourceGroupName ${{ variables.resourceGroupName }} -DataFactoryName ${{ variables.dataFactoryName }} -predeployment $true -deleteDeployment $false' 
                displayName: 'Stop ADF Triggers' 
                workingDirectory: ${{ parameters.workingDirectory }} 
            - template: ../tasks/ado_ARM_deployment_task.yml 
              parameters: 
               azureResourceManagerConnection: ${{ variables.azureServiceConnectionName }} 
               resourceGroupName: ${{ variables.resourceGroupName }} 
               location: ${{ parameters.location }} 
               csmFile: ${{ variables.ARMTemplatePath }} 
               csmParametersFile: '${{ parameters.artifactName }}/parameters/${{ parameters.environmentName }}.${{ parameters.regionAbrv }}.${{ parameters.templateParametersFile }}.json' 
               overrideParameters: ${{ parameters.overrideParameters }} 
            - template: ../tasks/azpwsh_file_execute_task.yml 
              parameters: 
                azureSubscriptionName: ${{ variables.azureServiceConnectionName }} 
                scriptPath: ${{ variables.powerShellScriptPath }} 
                ScriptArguments: '-armTemplate "${{ variables.ARMTemplatePath }}" -ResourceGroupName ${{ variables.resourceGroupName }} -DataFactoryName ${{ variables.dataFactoryName }} -predeployment $false -deleteDeployment $true' 
                displayName: 'Start ADF Triggers' 
                workingDirectory: ${{ parameters.workingDirectory }} 

 

 

If you notice, the parameters for this job are a combination of the parameters required for each task as well as what's required to load the variable template file for the specified environment.

 

Deployment Stage

The deploy stage template should call the deploy job template. To help consolidate, one thing I like to do in the stage template is make it flexible enough that the template can deploy to one or many environments. To achieve this, we pass in an Azure DevOps object containing the list of environments and regions we want to deploy to. There is a more detailed article on how to go about this.

One note I am going to call out here: I have included an option to load a job template that deploys Azure Data Factory via linked ARM templates. This will be covered in a follow-up post; for now that section can be ignored and/or omitted. 

adf_deploy_stage.yml 

 

 

parameters: 
- name: environmentObjects 
  type: object 
  default: 
    - environmentName: 'dev' 
      regionAbrvs: ['cus'] 
- name: environmentName 
  type: string 
  default: '' 
- name: templateParametersFile 
  type: string 
  default: 'parameters' 
- name: serviceName 
  type: string 
  default: '' 
- name: linkedTemplates 
  type: boolean 
  default: false 
 
stages: 
 
  - ${{ each environmentObject in parameters.environmentObjects }} : 
    - ${{ each regionAbrv in environmentObject.regionAbrvs }} : 
        - stage: '${{ parameters.serviceName }}_${{ environmentObject.environmentName}}_${{regionAbrv}}_adf_deploy' 
          variables: 
            - name: templateFile 
              ${{ if eq(parameters.linkedTemplates, false)}} : 
                value: 'ARMTemplateForFactory.json' 
              ${{ else }} : 
                value: 'linkedTemplates/ArmTemplate_master.json' 
          jobs: 
          - ${{ if eq(parameters.linkedTemplates, false)}} : 
            - template: ../jobs/adf_deploy_env_job.yml 
              parameters: 
                environmentName: ${{ environmentObject.environmentName }} 
                templateFile: ${{ variables.templateFile }} 
                templateParametersFile: ${{ parameters.templateParametersFile }} 
                serviceName: ${{ parameters.serviceName}} 
                regionAbrv: ${{ regionAbrv }} 
          - ${{ else }} : 
            - template: ../jobs/adf_linked_template_deploy_env_job.yml 
              parameters: 
                environmentName: ${{ environmentObject.environmentName }} 
                templateFile: ${{ variables.templateFile }} 
                templateParametersFile: ${{ parameters.templateParametersFile }} 
                serviceName: ${{ parameters.serviceName}} 
                regionAbrv: ${{ regionAbrv }} 

 

 

Pipeline 

At this point we have two stage templates (build and deploy) that will load all the necessary jobs and tasks we require. This pipeline will be stored in a YAML folder in the repository that Data Factory is connected to. In this case it is stored in my repo adf_pipelines_yaml_ci_cd. 

 

adf_pipelines_template.yml 

 

 

parameters: 
- name: environmentObjects 
  type: object 
  default: 
    - environmentName: 'dev' 
      regionAbrvs: ['eus'] 
      locations: ['eastus'] 
    - environmentName: 'tst' 
      regionAbrvs: ['eus'] 
      locations: ['eastus'] 
- name: serviceName 
  type: string 
  default: 'adfdemo' 
 
stages: 
- template: stages/adf_build_stage.yml@templates 
  parameters: 
    serviceName: ${{ parameters.serviceName }} 
- ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main')}}: 
  - template: stages/adf_deploy_stage.yml@templates 
    parameters: 
      environmentObjects: ${{ parameters.environmentObjects }} 
      serviceName: ${{ parameters.serviceName }} 

 

 


If one is astute, you will notice that I am using trunk-based development. This means that every PR will run the CI, which will generate a build artifact and confirm the Data Factory template is still valid. Any commit to the main branch will trigger a deployment. The `Build.SourceBranch` variable determines whether the deployment stage is loaded.

 

Conclusion

That's it for part 4. At this stage we've covered how to break down our Azure DevOps pipeline for Data Factory into reusable components. We've also created a template with which we can define our deployment environments once and scale out as many times as needed. We also introduced one methodology for YAML templates that accomplishes:

  • Write once, reuse across projects. 
  • Individual components can be reused. 
  • Limited manual intervention. 
  • Easily updated. 
  • Centralized definition. 

If you are interested in this topic, feel free to read more on YAML Pipelines or on CI/CD for Data Factory.
