Blog Post

FastTrack for Azure
6 MIN READ

Can we make data engineering CI/CD more like code? (Yes we can!)

yodobrin
Microsoft
Aug 10, 2022

Preface

As a CTO, your organization has a development team that uses GitHub or Azure DevOps (or any similar tool) for new releases, bug fixes, and testing. Another team, the data engineering team, works closely with the developers. Yet you cannot achieve an aligned approach to software development across the two: the data team lead struggles, your DevOps team is trying, but they still need to perform manual fixes on every new Synapse workspace.

 

The suggested development process when using Synapse calls for the following steps:

  • Only development workspaces are connected to Git.
  • Developers should use their own feature branches.
  • Push to a common collaboration branch using a pull request.
  • After review and merge are completed, 'Publish' needs to be executed (today this is only available through the UI).
    • The 'Publish' reads all committed assets and creates an ARM template, together with a parameter file.
  • For deployment, one can use the marketplace task (a sketch of a typical configuration follows this list):
    • Provide it with the template and parameter file.
    • Point it to the required subscription / resource group / workspace.
    • Post deployment, access the newly provisioned workspace and modify all scripts with specific values.
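
For reference, here is a minimal YAML sketch of how that marketplace task ('Synapse workspace deployment') is typically wired up. The service connection, resource group, and workspace names are placeholders, and the exact input names can vary between versions of the extension, so treat this as an illustration rather than a drop-in snippet:

pool:
  vmImage: ubuntu-latest

steps:
# Deploy the 'Publish'-generated template and parameter file into the target workspace.
# Input names are approximate; check the extension's documentation for your installed version.
- task: Synapse workspace deployment@2
  displayName: 'Deploy Synapse workspace artifacts'
  inputs:
    TemplateFile: '$(Pipeline.Workspace)/drop/TemplateForWorkspace.json'
    ParametersFile: '$(Pipeline.Workspace)/drop/TemplateParametersForWorkspace.json'
    azureSubscription: 'my-azure-service-connection'   # placeholder service connection
    ResourceGroupName: 'rg-synapse-dev'                # placeholder resource group
    TargetWorkspaceName: 'synapse-dev-workspace'       # placeholder workspace name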

 

Why? What?

Data engineering teams are asked to align with other teams' software life cycle approach, using tools to automatically create and refresh Synapse workspaces. However, even once the linked services are parameterized, there can be additional places where parameters are required: for example, when a SQL script points to a specific storage account, container, or dates that must be specific to an environment, or when a Spark notebook needs to read from a location that is, again, environment specific.

Synapse does not (for now) provide the ability to inject parameters per user requirement, nor does it allow global parameters to be used. This results in the need for an operations team to make manual changes after deployment: providing environment-specific values, editing SQL scripts and notebooks, and potentially touching other assets.
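
To make the problem concrete, here is a small, hypothetical example of a Spark notebook cell as it often looks today, and of what we would like to write instead. The storage account, container, and path are made up, and the parameters('...') convention in the second snippet is the one the rest of this post works towards:

# Today: environment-specific values are hard-coded in the notebook cell
df = spark.read.parquet("abfss://bronze@mydevstorage.dfs.core.windows.net/sales/2022/08/")

# What we would like to write once, and have CI/CD substitute per environment
# (the path will look invalid until the values are replaced)
df = spark.read.parquet(
    "abfss://parameters('container')@parameters('storage_account').dfs.core.windows.net/sales/parameters('load_date')/"
)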

 

So what's new?

The notion is that we can tap into the deployment process and leverage the ARM template parameter concept. The deployment task reads the two files generated by the 'Publish' step, right? That is what we covered earlier, but let's challenge this approach: what if we could somehow modify the generated template and use our own parameter file? Would that allow us to inject parameters?

 

So, I started with a simple attempt: I manually modified the generated file and entered parameters('sql_password') somewhere in it (to be specific, in the 'query' value of a SQL script). Then I also modified the parameter file and added this new parameter, and used the Azure DevOps task to populate its value during deployment. The result: the injected value became part of my provisioned workspace, hurray!
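
Roughly, the two manual edits looked like this (a simplified, hypothetical illustration rather than the exact generated files). In TemplateForWorkspace.json, the SQL script's query value now carries a literal parameter reference:

"query": "CREATE LOGIN etl_user WITH PASSWORD = 'parameters('sql_password')'"

And in the parameter file, the matching parameter is declared so that the deployment task can supply a value for it:

"sql_password": {
    "value": "-"
}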

This taught me that we can tap into the deployment process. 

 

You're probably saying: "Hey, show me the automation! You just swapped one manual process for another, right?" Wrong.

 

How? It's time for automation

The ability to replace parameters exists; now it's time to figure out how to do it. I wrote these two user stories to help me focus on the solution:

  • As a developer, I want the ability to inject parameters into my SQL files or notebooks, so that I can write them once and have the values replaced with the proper ones for each environment.
  • As a DevOps engineer, I want the ability to configure only the parameters a developer tells me about, provide the proper values for them in Azure DevOps, and provision a workspace, so that I can focus on automation.

The problem has two main aspects to resolve. The first is how to inject parameters into the auto-generated ARM template file. The second is how to create a dynamic parameter file and use it in Azure DevOps.

 

The first is rather simple to accomplish, and that is due to the way the ARM template embeds SQL or notebook code: the entire content is copied into a single JSON attribute, so whatever you type in the script ends up, verbatim, in the template.
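
For example, one might write something along these lines in a SQL script. This is an illustrative sketch only: the storage account and container names are placeholders, and the script will look broken until the parameter values are substituted at deployment time:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://parameters('storage_account').blob.core.windows.net/parameters('container')/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];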

 

The parameter references would then appear, as-is, in the generated file. Quick and easy. It isn't the most elegant solution, but it works and it is super simple.

Note: the reason I think this is a valid approach is that you cannot keep real passwords or SAS tokens in your GitHub repo anyway, so a step that removes these values before committing will ensure these secrets never become part of your repo.

 

Let's talk about the second part:

  • Parameter inclusion automation
  • The release pipeline

For the parameter automation, yherziger suggested we use a custom GitHub action, triggered by a push to the published branch. It scans the ARM template file, builds a list of parameters by searching for the parameters('<name>') pattern, and then verifies that each of them exists in the generated parameter file. Any missing parameters are added, the merged result is written to a separate parameter file, and that file is committed back to the same branch. During testing we found out:

  • The script and YAML file must be located in the branch where the generated ARM template exists.
  • A dynamic value must be added to the auto-generated parameter file so that it is always different from the previously committed version; this is why the script writes a timestamp parameter, ensuring the commit step always has something to commit.

And now for the last part of the solution: the Azure DevOps release. I used a CI pipeline, since it can filter on specific folders as its trigger, and pointed it at the newly created folder and file under /injected_params/.
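
The trigger section might look roughly like this (a sketch; the branch name matches the sample, and the path filter should point at wherever the GitHub action writes the merged parameter file):

trigger:
  branches:
    include:
    - workspace_publish
  paths:
    include:
    - injected_params/*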

 

The pipeline only creates an artifact for the release phase, and this is what it looks like:

 

 

 

pool:
  name: Azure Pipelines
steps:
- task: DownloadPipelineArtifact@2
  displayName: 'Download Pipeline Artifact'

- task: PublishPipelineArtifact@1
  displayName: 'Publish Pipeline Artifact'

 

 

The release pipeline would use these artifacts for deployment. Its trigger is straightforward: the artifact produced by the CI pipeline kicks off the release, and the release stage then runs the Synapse deployment task.

The configuration of the Synapse task was reviewed earlier. Once it is configured, be sure to execute the CI pipeline at least once, so that you have a reference for what the artifact will look like.
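
If you prefer to see the deployment step expressed as YAML, it might look roughly like this. The artifact paths, parameter names, and variable names are placeholders, and the exact task inputs depend on the version of the 'Synapse workspace deployment' extension you have installed, so treat this as a sketch rather than a working definition:

steps:
# Deploy using the 'Publish'-generated template plus the auto-generated, merged parameter file.
# Paths below are placeholders; they depend on how the artifacts are linked to the release.
- task: Synapse workspace deployment@2
  displayName: 'Deploy Synapse artifacts with injected parameters'
  inputs:
    TemplateFile: '$(Pipeline.Workspace)/publish/TemplateForWorkspace.json'
    ParametersFile: '$(Pipeline.Workspace)/ci-artifact/MedalionParams.json'
    azureSubscription: 'my-azure-service-connection'
    ResourceGroupName: 'rg-synapse-test'
    TargetWorkspaceName: 'synapse-test-workspace'
    # Supply real, environment-specific values for the injected parameters here,
    # for example from pipeline variables or a Key Vault-backed variable group:
    OverrideArmParameters: '-sql_password $(sqlPassword) -storage_account $(storageAccount)'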

 

Out of Scope

Secrets management in Azure DevOps - you should refer to this document for further details.

Limitations

  • Your SQL scripts or notebooks might look invalid before the parameters are replaced.
  • Parameter references are wrapped with single quotes (' ').

Full solution

The sample exists in Azure Samples.

 

GitHub Action (code)

 

 

name: Auto-inject ADF parameters

on:
  push:
    branches: ["workspace_publish"]

env:
  TEMPLATE_FILE: ./medalionsynapse12/TemplateForWorkspace.json
  PARAM_FILE: ./medalionsynapse12/TemplateParametersForWorkspace.json
  OUTPUT_FILE: ./injected_params/MedalionParams.json

jobs:
  inject-params:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Configure git client
        run: |
          git config user.name "GitHub Param Auto-Inject Action"
          git config user.email "autoinjectaction@github.com"
          mkdir -p injected_params
      - name: Run parameters merge script
        run: |
          python3 ./scripts/autoinject_params.py \
            --template-file $TEMPLATE_FILE \
            --param-file $PARAM_FILE \
            --output-file $OUTPUT_FILE
      - name: Commit changes
        run: |
          git add $OUTPUT_FILE
          git commit -m "Auto-inject parameters"
          git push origin workspace_publish 

 

 

Python code

 

 

#!/usr/bin/env python3

import argparse
import json
import re
import time


def merge_parameters(template_file_path, param_file_path):
    with open(param_file_path, "r") as file:
        # The parameter file needs to be parsed in its entirety,
        # so we aren't streaming it:
        param_file = json.loads(file.read())
    existing_params = param_file.get("parameters", {})
    merged_params = {**existing_params}
    pattern = re.compile(r"parameters\('(\w+)'\)")
    with open(template_file_path, "r") as file:
        # Since the ADF pipeline's template file can be quite big,
        # it's better to avoid loading all of it into memory. Instead,
        # trading for a slightly worse big-O:
        for buff in file:
            for param_name in pattern.findall(buff):
                merged_params[param_name] = existing_params.get(param_name, {"value": "-"})
    # Finally, write a timestamp parameter:
    merged_params["AUTOINJECT_PARAMS_RAN_AT"] = {"value": str(int(time.time()))}
    return {
        **param_file,
        "parameters": merged_params
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Add parameters to ARM template")
    parser.add_argument("-t", "--template-file", type=str,
                        help="Path to input ARM template file")
    parser.add_argument("-p", "--param-file", type=str,
                        help="Path to input ARM parameters file")
    parser.add_argument("-o", "--output-file", type=str,
                        help="Path to source ARM template file. If empty, write to stdout", default=None)
    args = parser.parse_args()
    res = merge_parameters(
        template_file_path=args.template_file,
        param_file_path=args.param_file
    )
    if args.output_file:
        with open(args.output_file, "w") as output_file:
            for line in json.dumps(res, indent="\t").split("\n"):
                output_file.write(f"{line}\n")
    else:
        print(json.dumps(res, indent=4))

 

 

Updated Aug 10, 2022
Version 1.0