AND logic - Many Pre-processing Jobs - Part 1
Published Mar 28 2022 07:12 PM

Quite often, an ETL pipeline has multiple upstream sources: you need to copy a handful of data streams into a central place before kicking off the next stage of processing. You want to express these dependencies: an activity should wait for all its predecessors to finish before starting.

 

There are two ways to express this logic: (1) Inline and (2) ExecutePipeline, each with its own strengths and shortcomings. ExecutePipeline is the preferred way if you also want to add error handling with a common error handling job. We will discuss error handling in Part 2 of the series.

 

Express Multi-dependencies Inline

ADF pipelines naturally support logical AND conditions: you can connect many activities to a single activity to express its upstream dependencies. For instance, in this sample pipeline, Upstream1, Upstream2, and Upstream3 will kick off in parallel, and PostProcess will block until all upstream activities succeed. If any of the 3 upstream activities fails, the pipeline will fail, and PostProcess will never execute.

 

Multi Dependency 01 Inline.png
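
Behind the arrows, each dependency is stored in the pipeline JSON as a dependsOn entry on the downstream activity. The following is a minimal sketch in Python that builds such a definition as a dictionary. The pipeline name InlineDependencies and the use of Wait activities as stand-ins for the real upstream jobs are assumptions for illustration, not part of the sample above.

```python
import json


def depends_on(*names, condition="Succeeded"):
    """Build a dependsOn list with one entry per upstream activity."""
    return [{"activity": n, "dependencyConditions": [condition]} for n in names]


def placeholder(name, depends=()):
    # Assumption: a Wait activity stands in for the real work (Copy, Notebook, ...).
    return {
        "name": name,
        "type": "Wait",
        "dependsOn": list(depends),
        "typeProperties": {"waitTimeInSeconds": 1},
    }


upstream_names = ["Upstream1", "Upstream2", "Upstream3"]

inline_pipeline = {
    "name": "InlineDependencies",  # hypothetical pipeline name
    "properties": {
        # The three upstream activities have no dependencies, so they start in parallel.
        "activities": [placeholder(n) for n in upstream_names]
        # PostProcess lists all three, so it runs only after every one has succeeded.
        + [placeholder("PostProcess", depends_on(*upstream_names))]
    },
}

print(json.dumps(inline_pipeline, indent=2))
```

Multiple entries in dependsOn are combined with AND, which is what makes PostProcess wait for all three upstream activities.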

The upsides of this approach are:

  • simplicity: you can specify dependencies with a couple of arrows, and ADF will enforce the dependencies for you
  • individual error handler: each upstream job may require different error handling logic. This approach allows you to define individual error handling paths

When using error handling paths with the inline approach, please be aware of the limit on the maximum number of activities per pipeline, since every error handler counts towards it.

Express Multi-dependencies with Execute Pipeline

Alternatively, you may want to move all upstream activities into a separate pipeline, and use ExecutePipeline to stitch the two pipelines together.

Multi Dependency 02 Upstream.png

Multi Dependency 03 Execute Pipeline.png

 

In the Upstream pipeline, Upstream1, Upstream2, and Upstream3 will kick off in parallel. The pipeline will succeed if all upstream activities succeed. If any of the 3 upstream activities fails, the pipeline will fail. You may use a Fail activity to surface detailed error messages from the Upstream pipeline.

 

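In the main pipeline, please ensure Wait on completion is selected on the ExecutePipeline activity. Otherwise the activity completes as soon as the Upstream pipeline has been triggered, without waiting for its result. With the option selected, PostProcess will block until the Upstream pipeline succeeds.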
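
The two pipelines can be sketched as JSON definitions, built here as Python dictionaries. This is an illustrative sketch under assumptions: the names Main and RunUpstream are hypothetical, and Wait activities stand in for the real jobs. The waitOnCompletion property corresponds to the wait option on the ExecutePipeline activity.

```python
import json


def wait_activity(name, depends=()):
    # Assumption: a Wait activity is a placeholder for the real job.
    return {
        "name": name,
        "type": "Wait",
        "dependsOn": list(depends),
        "typeProperties": {"waitTimeInSeconds": 1},
    }


# Child pipeline: three independent activities that start in parallel.
upstream_pipeline = {
    "name": "Upstream",
    "properties": {
        "activities": [wait_activity(n) for n in ("Upstream1", "Upstream2", "Upstream3")]
    },
}

# Parent pipeline: run the child, wait for it, then run PostProcess.
main_pipeline = {
    "name": "Main",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "RunUpstream",  # hypothetical activity name
                "type": "ExecutePipeline",
                "dependsOn": [],
                "typeProperties": {
                    "pipeline": {
                        "referenceName": upstream_pipeline["name"],
                        "type": "PipelineReference",
                    },
                    # Without this, the activity does not wait for the child's result.
                    "waitOnCompletion": True,
                },
            },
            wait_activity(
                "PostProcess",
                [{"activity": "RunUpstream", "dependencyConditions": ["Succeeded"]}],
            ),
        ]
    },
}

print(json.dumps(main_pipeline, indent=2))
```

PostProcess now has a single dependency, the ExecutePipeline activity, so adding a fourth data source only changes the Upstream pipeline.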

 

The upsides of this approach are:

  • modularity: you can break the dependency graph into two parts and modify them separately (for instance, add another data source without touching the post-processing steps)
  • common error handler: if a shared error handling path is preferred, an Upon Failure path can be added to the ExecutePipeline activity to catch all errors in one place. Please utilize the error payload from ExecutePipeline for error logging. We will discuss this in detail in Part 2
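
As a rough sketch of the common error handler, an Upon Failure path is a dependency with the Failed condition on the ExecutePipeline activity. The Python snippet below builds such a definition as a dictionary. The names Main, RunUpstream, HandleError, and errorMessage are hypothetical, and using a Set Variable activity with the expression @activity('RunUpstream').error.message to capture the error is only one possible choice; please verify it against your own pipeline before relying on it.

```python
import json

# ExecutePipeline activity that runs the Upstream pipeline and waits for it.
run_upstream = {
    "name": "RunUpstream",  # hypothetical activity name
    "type": "ExecutePipeline",
    "dependsOn": [],
    "typeProperties": {
        "pipeline": {"referenceName": "Upstream", "type": "PipelineReference"},
        "waitOnCompletion": True,
    },
}

# Shared error handler: runs only when RunUpstream fails (the Upon Failure path).
handle_error = {
    "name": "HandleError",  # hypothetical activity name
    "type": "SetVariable",
    "dependsOn": [{"activity": "RunUpstream", "dependencyConditions": ["Failed"]}],
    "typeProperties": {
        "variableName": "errorMessage",
        # Assumption: read the error message from the failed activity's output.
        "value": {
            "value": "@activity('RunUpstream').error.message",
            "type": "Expression",
        },
    },
}

main_pipeline = {
    "name": "Main",  # hypothetical pipeline name
    "properties": {
        "variables": {"errorMessage": {"type": "String"}},
        "activities": [run_upstream, handle_error],
    },
}

print(json.dumps(main_pipeline, indent=2))
```

Because all three upstream activities sit behind one ExecutePipeline activity, a single handler covers a failure in any of them.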
Last update: Mar 28 2022 09:05 PM