Developing custom activities in Data Factory / Synapse Analytics
Microsoft FastTrack for Azure

Introduction


One of the key advantages of using Data Factory or Synapse Analytics for data movement is that, chances are, they already implement most - if not all - of the data operations you need. The list of pipeline activities, data sources, and sinks is comprehensive and constantly growing.

But if you're working on more specific use cases, you may come across a process that the solution does not have a pipeline activity for. At that point, you may also notice that there is an activity called "Custom". So how can we use that in our pipelines?

In this article, we'll be handling the following situation: imagine we'd like to run Python scripts from a pipeline. We'll do that by using a Custom activity, while properly managing its environment and source code using Azure DevOps. The steps below were done in Synapse Analytics, but the same process is supported in Data Factory.
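
For illustration, suppose the script we'd like to run looks something like this - a minimal, hypothetical main.py, where the Pandas logic is just a stand-in for your real workload:

# main.py - a placeholder for the custom workload we want to run from a pipeline
import pandas as pd

def main():
    # Placeholder logic: any Python code could go here
    df = pd.DataFrame({"category": ["a", "b", "a"], "amount": [1, 2, 3]})
    print(df.groupby("category")["amount"].sum())

if __name__ == "__main__":
    main()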

Custom activity implementation - Azure Batch


The Custom activity works by scheduling tasks on a service called Azure Batch, which executes the custom workload. The following diagram provides an overview of how the service works.
 
[Diagram: Azure Batch service overview - an application service provides code to Azure Storage and schedules jobs and tasks, which run on Batch worker nodes]

 



In our example, Synapse Analytics and Azure DevOps will share the role of the Application Service, providing the code to Azure Storage and scheduling jobs and tasks in Azure Batch. This allows us to implement custom activity logic in Azure Batch, and trigger it from within our pipelines.
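
To make that concrete, here is a rough sketch of the kind of task submission the Custom activity performs on our behalf, using the azure-batch Python SDK. This is purely illustrative - Synapse creates and manages the job and task itself - and the account name, URL, key, and IDs below are placeholders:

import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder credentials - in practice, Synapse authenticates via the linked service
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(credentials, batch_url="https://mybatchaccount.eastus.batch.azure.com")

# Schedule a task that runs our script
client.task.add(
    job_id="adfv2-custom-activity-pool",  # illustrative; the service creates a job named after the pool
    task=batchmodels.TaskAddParameter(
        id="run-main-py",
        command_line="python3 main.py",
    ),
)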

Solution architecture


The following diagram shows what our final solution using Synapse and Azure Batch looks like:

[Diagram: solution architecture - Azure DevOps deploys code to Blob Storage and manages Azure Batch, while Synapse Analytics triggers and monitors Batch tasks]

 


We're using Azure DevOps both to deploy the source code to Blob Storage and to manage the Python environments in Azure Batch. This ensures we can easily source-control and reuse the custom activities we're developing.
Synapse Analytics can then trigger tasks in Azure Batch and monitor them to perform custom operations as part of a pipeline.

Environment operation overview


Before we go through the setup instructions, let's look at what the operation of this environment looks like.
 
1. First, we develop our Python scripts and source control them with Azure DevOps.
2. Committing to the main branch triggers a Build pipeline, which stores the code artifacts for future deployment.
3. When we're ready to deploy, we trigger a Release pipeline, which publishes the artifacts to Blob Storage, where they are accessible to Azure Batch.
4. If there were changes to the environment - like an updated requirements.txt or additional setup - we also trigger a Release pipeline to refresh the worker nodes, so they pick up the new changes.
5. Back in Synapse, we add a Custom activity, specifying the command to run the new Python script (for example, `python3 main.py`).
6. Next, we publish and run the pipeline by creating a trigger.
7. Finally, we check the activity's execution logs in Azure Batch - or fetch them programmatically, as in the sketch below.
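
If you'd rather fetch those logs programmatically than browse the portal, a sketch along these lines works with the azure-batch SDK - again, the account details, job ID, and task ID are placeholders:

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(credentials, batch_url="https://mybatchaccount.eastus.batch.azure.com")

# Each task writes stdout.txt and stderr.txt to its directory on the node
stream = client.file.get_from_task(
    job_id="adfv2-custom-activity-pool",
    task_id="<task-id>",
    file_path="stdout.txt",
)
print(b"".join(stream).decode())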
 
As you can see, development and deployment of custom activities are made much simpler by leveraging Azure DevOps pipelines. The development team doesn't need to manage either Azure Batch or the code in Blob Storage - everything is automated for them.

Environment setup


Below are the steps taken to set up this environment. If you're interested in testing this out in your own subscription, you may use the resources in this repository to follow along with the instructions.
 
Please note this is intended as learning material, and not for production environments. You may use the included configuration files and instructions as a starting point, but they are provided without warranty of any kind.
 
1. Create an Azure Synapse workspace and Storage Account (if you haven't already);  

2. Create an Azure Batch account

 

- Choose a subscription, resource group, name and region;

- Click `Select a storage account` and select Synapse's default storage. Optionally, you may set up a new storage account instead;


3. Create a node pool.

 

- Use the `ubuntuserver` image;

- Set the dedicated node count to 1;

- Enable the start task and set the command to `sh setup.sh`, running it with elevated (admin) permissions, since our setup script installs packages with apt. This will run the setup script whenever nodes start;

- Add the container with your source code to the start task's resource files. If you'd rather script this step, see the sketch below this list.
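
If you prefer scripting pool creation over clicking through the portal, a sketch like the following does the equivalent with the azure-batch SDK - the pool name, VM size, image version, and container name are assumptions to adapt to your environment:

import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(credentials, batch_url="https://mybatchaccount.eastus.batch.azure.com")

client.pool.add(batchmodels.PoolAddParameter(
    id="custom-activity-pool",
    vm_size="Standard_A1_v2",
    target_dedicated_nodes=1,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver", sku="18.04-lts"),
        node_agent_sku_id="batch.node.ubuntu 18.04",
    ),
    start_task=batchmodels.StartTask(
        command_line="sh setup.sh",
        wait_for_success=True,
        # setup.sh uses apt, so the start task needs to run elevated
        user_identity=batchmodels.UserIdentity(
            auto_user=batchmodels.AutoUserSpecification(
                elevation_level=batchmodels.ElevationLevel.admin)),
        # Pull the source code container from the linked (auto) storage account
        resource_files=[batchmodels.ResourceFile(
            auto_storage_container_name="source-code")],
    ),
))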


 

4. Set up a project in Azure DevOps;
5. Set up a repository for your source code;

6. Create an environment setup script (`setup.sh`, referenced by the start task above);

 

- Paste in the following code:

# Install pip (and Python 3, if missing) on the node
apt update
apt install python3-pip -y

# Install the project's dependencies
pip3 install -r requirements.txt

- This will install Python, pip, and the packages listed in requirements.txt.


7. Create your requirements.txt file;

 

For this example, we're adding Pandas as a dependency - the file contains the single line `pandas`.


8. Import the build pipeline in Azure DevOps;

 

- Select the Azure Pipelines agent pool, with Windows 2019 as the agent specification;

- Point the pipeline's source to the repository you created;

- Save and queue the pipeline, and wait for it to complete before continuing;


9. Import the source code publish pipeline;

 

- Configure the agent pool and specification as in the previous step;

- Delete the existing build artifact and replace it with your new pipeline's build;

- In the Azure Blob File Copy task, configure a service principal so the pipeline can deploy the source code to Blob Storage;

- Configure the pipeline variables;

- Generate a new release;


10. Import the worker pool refresh pipeline;

 

The build artifact doesn't need to be configured for this pipeline; apart from that, apply the same configuration as in the previous step;

 
11. Create an Azure Batch linked service in Synapse, pointing it to your Batch account, pool, and storage account.

After following these steps, you are ready to start developing and using custom activities in your Synapse workspace! Refer back to the Environment operation overview section to see how you can use this setup.

Next steps


Here are a few extra steps you might be interested in testing out:

- Updating the setup.sh script to set up a different environment in Azure Batch - like installing Node.js or custom ODBC drivers;
- Changing the command in Synapse to do something else - like running AzCopy or other CLI commands;
- Modifying the pipelines to deploy to a Test environment before moving to production;
 

Conclusion


This basic setup will allow you to implement custom activities for Synapse - or Azure Data Factory, for that matter - with minimal changes.

Let me know in the comments what custom activities you've implemented using this template!
 
 