Triggering machine learning (AzureML) pipelines from Azure Data Factory (ADF) or Synapse
Published Feb 09 2022 07:09 AM
Microsoft

A pattern I see with the customers I interact with is to load data from a source system into a storage account and then trigger an Azure Machine Learning (AzureML) pipeline to perform training or inference, depending on their use case.

ADF orchestrating data movement and AzureML triggering

Within Azure, the go-to services for orchestrating data movement are Azure Data Factory and Synapse. In this article, I will be talking about Azure Data Factory (ADF) pipelines, but the same pipeline can be created within Synapse if you are already working there.

 

I see customers adopt two types of pipelines: the wait-for-it one and the fire-and-forget one.

 

The wait-for-it data factory pipeline

The wait-for-it pattern

This type of flow waits for the Azure Machine Learning pipeline to complete and then does something more, e.g., copies the data into an SQL database to make the inferences available for a downstream system.

 

Even if there is nothing else to do after the machine learning pipeline, some folks still want to wait for the AzureML pipeline to finish so that they can handle potential failures.

See the following blog posts on how you can do error handling within the Synapse and ADF pipelines:

To implement such a pipeline, you can use the AzureMLExecutePipeline step in Azure Data Factory and Azure Synapse pipelines. The JSON definition for the activity looks like the following:

 

{
  "name": "Execute pipeline and wait",
  "type": "AzureMLExecutePipeline",
  "dependsOn": [],
  "policy": {
    "timeout": "7.00:00:00",
    "retry": 0,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
  },
  "userProperties": [],
  "typeProperties": {
    "experimentName": "experiment-triggered-from-synapse",
    "mlPipelineId": "ab07****-ab07-****-ab07-ab07*****2fb"
  },
  "linkedServiceName": {
    "referenceName": "link_to_your_azureml_workspace",
    "type": "LinkedServiceReference"
  }
}

 

Note that the default timeout is seven days, which is hopefully enough time for your machine learning pipeline to complete.
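If the step after the AzureML activity should run only on success, with a separate step reacting to failure, ADF's activity dependencies can express that. The sketch below builds the `dependsOn` entries in Python and prints them as JSON. The helper `depends_on` and the downstream activity names (`Copy inferences to SQL`, `Notify on failure`) are hypothetical, and the downstream activities are left without `typeProperties`, so treat this as an outline rather than a deployable pipeline.

```python
import json

# Name of the AzureMLExecutePipeline activity shown above.
ML_ACTIVITY = "Execute pipeline and wait"

ALLOWED_CONDITIONS = {"Succeeded", "Failed", "Skipped", "Completed"}


def depends_on(activity_name, condition):
    """Build one ADF dependsOn entry for the given dependency condition."""
    if condition not in ALLOWED_CONDITIONS:
        raise ValueError(f"unknown dependency condition: {condition}")
    return {"activity": activity_name, "dependencyConditions": [condition]}


# Hypothetical downstream activities; typeProperties are omitted on purpose.
copy_step = {
    "name": "Copy inferences to SQL",
    "type": "Copy",
    "dependsOn": [depends_on(ML_ACTIVITY, "Succeeded")],
}
failure_step = {
    "name": "Notify on failure",
    "type": "WebActivity",
    "dependsOn": [depends_on(ML_ACTIVITY, "Failed")],
}

print(json.dumps([copy_step, failure_step], indent=2))
```

With this shape, the copy runs only when the machine learning pipeline succeeds, and the failure branch runs only when it fails or times out.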

 

The fire-and-forget data factory pipeline

Although the AzureMLExecutePipeline step is a fantastic component, it has some limitations. For example, you cannot pass tags, even though the REST API supports them. Another barrier, according to some customers, is that it waits for the pipeline to complete, which you may not want. In these cases, you can fall back to the AzureML REST API and trigger the pipeline through the standard web activity component.

The fire-and-forget pattern

In this case, you trigger the pipeline through its REST endpoint. You can find the endpoint from the CLI, the SDK or the web UI as seen below:

Getting the REST endpoint

The format is something like the following:

 

https://<region>.api.azureml.ms/pipelines/v1.0/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/PipelineRuns/PipelineSubmit/ab07****-ab07-****-ab07-ab07*****2fb
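If you keep the region, subscription, resource group, workspace and pipeline ID as separate settings, you can assemble the endpoint from the format above. This is a minimal sketch: the function name `pipeline_submit_url` and every value in the example call are placeholders of mine, and copying the endpoint from the CLI, SDK or web UI remains the safer source of truth.

```python
from urllib.parse import quote


def pipeline_submit_url(region, subscription_id, resource_group, workspace, pipeline_id):
    """Assemble the PipelineSubmit endpoint following the format shown above."""

    def enc(value):
        # Percent-encode each path segment so unusual names cannot break the URL.
        return quote(value, safe="")

    return (
        f"https://{region}.api.azureml.ms/pipelines/v1.0"
        f"/subscriptions/{enc(subscription_id)}"
        f"/resourceGroups/{enc(resource_group)}"
        f"/providers/Microsoft.MachineLearningServices"
        f"/workspaces/{enc(workspace)}"
        f"/PipelineRuns/PipelineSubmit/{enc(pipeline_id)}"
    )


# Placeholder values for illustration only.
url = pipeline_submit_url(
    region="westeurope",
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="my-resource-group",
    workspace="my-workspace",
    pipeline_id="11111111-1111-1111-1111-111111111111",
)
print(url)
```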

 

To authenticate, you should use the Managed Identity of the Azure Data Factory or Synapse workspace. Following the principle of least privilege, grant that managed identity only the permissions it needs to trigger the pipeline. For more details on creating a role with the bare minimum permissions, look at the managing roles in your workspace article.

 

The final web activity configuration should be the following:

Web activity configuration

In the request’s body, the only required parameter is ExperimentName, plus any parameters your pipeline itself requires. In my case, I also wanted to pass some tags to the execution, using the following payload:

 

{
  "ExperimentName": "triggered-by-adf-web-activity",
  "ParameterAssignments": {"country": "Greece", "groupBy":"city"},
  "RunSource": "ADF",
  "Tags": {
     "key":"value"
   }
}

 

The activity’s JSON is the following:

 

{
  "name": "Trigger AzureML pipeline and move on",
  "type": "WebActivity",
  "dependsOn": [],
  "policy": {
    "timeout": "7.00:00:00",
    "retry": 0,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
  },
  "userProperties": [],
  "typeProperties": {
    "url": "https://<region>.api.azureml.ms/pipelines/v1.0/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/PipelineRuns/PipelineSubmit/ab07****-ab07-****-ab07-ab07*****2fb",
    "method": "POST",
    "body": {
      "ExperimentName": "triggered-by-adf-web-activity",
      "ParameterAssignments": {
        "country": "Greece",
        "groupBy": "city"
      },
      "RunSource": "ADF",
      "Tags": {
        "key": "value"
      }
    },
    "authentication": {
      "type": "MSI",
      "resource": "https://ml.azure.com"
    }
  }
}
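Before wiring the call into ADF, it can help to reproduce the same request from a script. The sketch below only builds the POST request with the Python standard library and does not send it. The function name `build_submit_request`, the placeholder URL and the `<token>` value are my own assumptions; obtaining a real Azure AD bearer token for the `https://ml.azure.com` resource is left out.

```python
import json
import urllib.request


def build_submit_request(url, bearer_token, experiment_name,
                         parameters=None, tags=None, run_source=None):
    """Build (but do not send) the POST request that submits the pipeline."""
    body = {"ExperimentName": experiment_name}  # the only required field
    if parameters:
        body["ParameterAssignments"] = parameters
    if run_source:
        body["RunSource"] = run_source
    if tags:
        body["Tags"] = tags
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_submit_request(
    url="https://example.invalid/PipelineSubmit/placeholder-id",  # placeholder endpoint
    bearer_token="<token>",  # placeholder; acquire a real token separately
    experiment_name="triggered-by-adf-web-activity",
    parameters={"country": "Greece", "groupBy": "city"},
    tags={"key": "value"},
    run_source="ADF",
)
print(request.get_method(), json.loads(request.data))
# To actually send it: urllib.request.urlopen(request)
```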

 

If you execute the pipeline, you will notice that the ADF pipeline finishes immediately after the HTTP request is completed. The output of the web activity contains useful information like PipelineRunId and RunUrl.

Within your AzureML studio, you should be able to see the pipeline running. In the execution’s properties, you should see the following (note the Run source and the Tags):

Pipeline executing while ADF is finished

Of course, the fire-and-forget approach is risky because there is no error handling within Azure Data Factory. Most of the customers that adopted this approach either use Azure Event Grid to capture the run status changed events and retry failed jobs, or use Azure Monitor to monitor the execution of those pipelines.
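As a rough illustration of the Event Grid option, the sketch below filters incoming events for failed runs that carry a given tag and returns the run ID a retry routine would need. The event type string and the `runStatus`, `runTags` and `runId` field names reflect my understanding of the Event Grid schema for Azure Machine Learning, so verify them against the current documentation. The function name `failed_run_to_retry` and the sample event are hypothetical.

```python
RUN_STATUS_CHANGED = "Microsoft.MachineLearningServices.RunStatusChanged"


def failed_run_to_retry(event, required_tags=None):
    """Return the run id of a failed run that matches the tags, else None."""
    if event.get("eventType") != RUN_STATUS_CHANGED:
        return None
    data = event.get("data") or {}
    if data.get("runStatus") != "Failed":
        return None
    run_tags = data.get("runTags") or {}
    for key, value in (required_tags or {}).items():
        if run_tags.get(key) != value:
            return None  # not one of the runs we are responsible for
    return data.get("runId")


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "eventType": RUN_STATUS_CHANGED,
    "data": {
        "runId": "sample-run-id",
        "runStatus": "Failed",
        "runTags": {"key": "value"},
    },
}

print(failed_run_to_retry(sample_event, required_tags={"key": "value"}))
```

Filtering on the tags passed in the web activity's payload is one way to make sure the handler only retries runs that were triggered from ADF.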

 

Let us know in the comments if you are using another pattern you would like to share!

 

References

Check out the dedicated tutorial in our excellent docs on triggering Azure Machine Learning pipelines (with instructions on how to do the same thing in Logic Apps), as well as the one on invoking published ML pipelines.

Last update: Feb 09 2022 07:08 AM