<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Azure Data Factory Blog articles</title>
    <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/bg-p/AzureDataFactoryBlog</link>
    <description>Azure Data Factory Blog articles</description>
    <pubDate>Sat, 14 Mar 2026 17:44:36 GMT</pubDate>
    <dc:creator>AzureDataFactoryBlog</dc:creator>
    <dc:date>2026-03-14T17:44:36Z</dc:date>
    <item>
      <title>Continued region expansion: Azure Data Factory is generally available in Mexico Central</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-is-generally/ba-p/4420747</link>
      <description>&lt;P&gt;&lt;A href="https://azure.microsoft.com/en-us/services/data-factory/" data-bi-an="body" data-bi-tn="undefined" target="_blank"&gt;Azure Data Factory&lt;/A&gt;&amp;nbsp;is now available in &lt;STRONG&gt;Mexico Central&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can now provision Data Factory in the new region in order to co-locate your Extract-Transform-Load logic with your data lake and compute.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
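&lt;P&gt;For illustration (not part of the announcement itself), the short Python sketch below shows one way to provision a factory in the new region with the azure-mgmt-datafactory package. The subscription ID, resource group, and factory name are placeholders, and the region short name is assumed to be "mexicocentral".&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholders - replace with your own values.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "my-resource-group"
FACTORY_NAME = "my-adf-mexico-central"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create (or update) a Data Factory instance in the newly available region.
factory = client.factories.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    Factory(location="mexicocentral"),
)
print(factory.provisioning_state)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;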
&lt;P&gt;See the full set of&amp;nbsp;&lt;A href="https://azure.microsoft.com/en-us/global-infrastructure/services/?cdn=disable&amp;amp;products=data-factory" data-bi-an="body" data-bi-tn="undefined" target="_blank"&gt;Azure Data Factory supported regions&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 05 Jun 2025 01:43:07 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-is-generally/ba-p/4420747</guid>
      <dc:creator>Chunhua</dc:creator>
      <dc:date>2025-06-05T01:43:07Z</dc:date>
    </item>
    <item>
      <title>Announcing the new Databricks Job activity in ADF!</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/announcing-the-new-databricks-job-activity-in-adf/ba-p/4410939</link>
      <description>&lt;P&gt;We’re excited to announce that Azure Data Factory now supports the orchestration of Databricks Jobs!&lt;/P&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/databricks/jobs/" target="_blank" rel="noopener"&gt;Databrick Jobs&lt;/A&gt; allow you to schedule and orchestrate a task or multiple tasks in a workflow in your Databricks workspace. Since any operation in Databricks can be a task, this means you can now run anything in Databricks via ADF, such as serverless jobs, SQL tasks, Delta Live Tables, batch inferencing with model serving endpoints, or&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/databricks/jobs/powerbi" target="_blank"&gt; automatically publishing and refreshing semantic models&lt;/A&gt; in the Power BI service.&lt;/P&gt;
&lt;P&gt;And with this new update, you’ll be able to trigger these workflows from your Azure Data Factory pipelines.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-hidden" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;colgroup&gt;&lt;col style="width: 50.0491%" /&gt;&lt;col style="width: 49.9019%" /&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-style-none"&gt;&lt;img /&gt;&lt;/td&gt;&lt;td class="lia-border-style-none"&gt;
&lt;P&gt;To make use of this new activity, you’ll find a new Databricks activity under the &lt;STRONG&gt;Databricks &lt;/STRONG&gt;activity group called &lt;STRONG&gt;Job&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;Once you’ve added the &lt;STRONG&gt;Job activity (Preview) &lt;/STRONG&gt;to your pipeline canvas, you can connect to your Databricks workspace and configure the settings to select your Databricks job, allowing you to run the Job from your pipeline.&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We also know that allowing parameterization in your pipelines is important as it allows you to create generic reusable pipeline models.&lt;/P&gt;
&lt;P&gt;ADF continues to provide support for these patterns and is excited to extend this capability to the new Databricks Job activity.&lt;/P&gt;
&lt;P&gt;Under the settings of your Job activity, you’ll also be able to configure and set parameters to send to your Databricks job, allowing maximum flexibility and power for your orchestration jobs.&lt;/P&gt;
&lt;img /&gt;
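&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For illustration only, the snippet below sketches (as a Python dictionary) roughly what a parameterized Job activity could look like inside a pipeline definition. Property names such as "databricksJobId" and "jobParameters" are assumptions for readability rather than the documented schema of the new activity; refer to the linked documentation for the exact shape.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Hypothetical sketch of a Databricks Job activity with parameters.
# Property names below are illustrative only, not the official schema.
databricks_job_activity = {
    "name": "RunNightlyDatabricksJob",
    "type": "DatabricksJob",                      # assumed activity type name
    "linkedServiceName": {
        "referenceName": "DatabricksWorkspaceLS", # your Databricks linked service
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "databricksJobId": "123456789",           # the Databricks Job to trigger
        "jobParameters": {                        # values forwarded to the Job
            "run_date": "@pipeline().parameters.runDate",
            "environment": "prod",
        },
    },
}
print(databricks_job_activity["typeProperties"]["jobParameters"])&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;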
&lt;P&gt;To learn more, read &lt;A href="https://learn.microsoft.com/en-us/fabric/data-factory/azure-databricks-activity" target="_blank"&gt;Azure Databricks activity - Microsoft Fabric | Microsoft Learn&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Have any questions or feedback? Leave a comment below!&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 14 May 2025 15:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/announcing-the-new-databricks-job-activity-in-adf/ba-p/4410939</guid>
      <dc:creator>Noelle_Li</dc:creator>
      <dc:date>2025-05-14T15:00:00Z</dc:date>
    </item>
    <item>
      <title>Integrate Microsoft Fabric with SAP data with USB4SAP [in live and cached mode]</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/integrate-microsoft-fabric-with-sap-data-with-usb4sap-in-live/ba-p/4152300</link>
      <description>&lt;H2&gt;Microsoft Fabric integration with SAP using USB4SAP&lt;/H2&gt;&lt;P&gt;With USB4SAP, Fabric users can access SAP data. This data can be used to refresh PowerBI semantic models in live and cached mode. USB4SAP provides deep integration into your SAP system (for raw table data, as well as modeled information like reports, queries, CDS, BW extractors, etc.) and SAP table data extraction with delta / CDC capabilities (ADF connector) without the need for SLT or Change Pointers activation.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Specifically for integration with customers' SAP systems, you can leverage the USB4SAP connector for:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;PowerBI live and cached mode&lt;/LI&gt;&lt;LI&gt;Onelake based integration&lt;/LI&gt;&lt;LI&gt;REST based synchronous API integration&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;It supports no-code, native SAP security based access to the following SAP objects (HANA or non-HANA based):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Tables (with Change data capture)&lt;/LI&gt;&lt;LI&gt;Views&lt;/LI&gt;&lt;LI&gt;CDS&lt;/LI&gt;&lt;LI&gt;Reports&lt;/LI&gt;&lt;LI&gt;TCodes&lt;/LI&gt;&lt;LI&gt;BW Extractors&lt;/LI&gt;&lt;LI&gt;ABAP queries&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The following modes of Change Data Capture are supported:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Tables &amp;amp; views:&lt;OL&gt;&lt;LI&gt;Time-series based [i.e., date &amp;amp; time of the record create, update, delete]&lt;/LI&gt;&lt;LI&gt;Document &amp;amp; item number series based&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;Reports / Queries / TCodes:&lt;OL&gt;&lt;LI&gt;Time-series based using variants on the selection screen.&amp;nbsp;&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Conceptual architecture&lt;/H2&gt;&lt;P&gt;The following are the key components of the conceptual architecture for MS Fabric integration with SAP systems.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Customers' SAP systems (ERP, S4HANA, BW, CRM, SRM, APO, Solman, etc.) are organizational systems of record&lt;/LI&gt;&lt;LI&gt;Data transmission is REST over HTTPS (unless specified otherwise, where RFC / OData may also be used)&lt;/LI&gt;&lt;LI&gt;Data &amp;amp; information storage in any cloud (e.g., Microsoft Azure) or on-premises repository&lt;/LI&gt;&lt;LI&gt;Information security uses a SAS key over HTTPS&lt;/LI&gt;&lt;LI&gt;The synthesis layer is a combination of tools like PowerAutomate / Logic Apps, etc.&lt;/LI&gt;&lt;LI&gt;PowerBI / PowerPlatform / MS Excel and other apps are supported using REST / PowerQuery&lt;/LI&gt;&lt;LI&gt;CX-Portal layer [optional] in MS SharePoint or other customer Portal solutions&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Application architecture&lt;/H2&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following are the application architectures for live and cached connections from Fabric PowerBI to backend SAP systems. 
Data Factory templates are also available to accelerate use of Ecoservity's connectors and integration patterns within a pipeline.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;PowerQuery Connector Method:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Fabric live connection to SAP: Live query to SAP leverages the following mechanisms&lt;/LI&gt;&lt;/UL&gt;&lt;UL&gt;&lt;LI&gt;PowerQuery module within PowerBI&lt;/LI&gt;&lt;LI&gt;REST API [over HTTPS] connectivity to SAP [based on SICF or Gateway] for Power Platform apps&lt;/LI&gt;&lt;LI&gt;A video guide is available here: &lt;A href="https://youtu.be/vmJVoNSBdpM" target="_blank" rel="noopener"&gt;https://youtu.be/vmJVoNSBdpM&lt;/A&gt;.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;img /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Following is the link to the Azure Marketplace listing for this solution (free trial available):&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;A href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/ecoservity.peopleatwork4pbi?tab=Overview" target="_blank" rel="noopener"&gt;https://azuremarketplace.microsoft.com/en-us/marketplace/apps/ecoservity.peopleatwork4pbi?tab=Overview&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Cached Method:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Microsoft Fabric cached connection to SAP: Cached query to SAP leverages the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;PowerQuery module within Fabric PowerBI&lt;/LI&gt;&lt;LI&gt;REST API [over HTTPS] connectivity to SAP [based on SICF or Gateway], with SAS-key [over HTTPS] based security&lt;/LI&gt;&lt;LI&gt;Onelake data creation with support for CSV, JSON and Parquet&lt;/LI&gt;&lt;LI&gt;A video guide is available here: &lt;A href="https://www.youtube.com/playlist?list=PLTum8dvrbVA05nV3hsr8rMPjqGHc2oOAq" target="_blank" rel="noopener"&gt;https://www.youtube.com/playlist?list=PLTum8dvrbVA05nV3hsr8rMPjqGHc2oOAq&lt;/A&gt;&lt;img /&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;Following is the link to the Azure Marketplace listing for this solution (free trial available):&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;A href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/ecoservity.usb4sap_azure_data_factory?tab=Overview" target="_blank" rel="noopener"&gt;https://azuremarketplace.microsoft.com/en-us/marketplace/apps/ecoservity.usb4sap_azure_data_factory?tab=Overview&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;REST Method:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;REST API based connection to SAP leverages the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;PowerQuery module within Fabric PowerBI&lt;/LI&gt;&lt;LI&gt;REST API [over HTTPS] connectivity to SAP [based on SICF or Gateway], with SAS-key [over HTTPS] based security&lt;/LI&gt;&lt;LI&gt;Onelake data creation with support for CSV and Parquet&lt;/LI&gt;&lt;/UL&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;&lt;img /&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Factory Template Method:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In collaboration with Microsoft, Ecoservity has developed a set of Data Factory templates that make it faster and easier to integrate SAP into the Fabric ecosystem. 
These templates use Data Factory's REST data source and data sink to read data from and write data to SAP.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE width="956"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="455"&gt;&lt;img /&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="501"&gt;&lt;img /&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following screenshots show a Data Factory template that copies data from an SAP semantic model via REST.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;img /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Then, the data syncs to Fabric Onelake:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;img /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Conclusion:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In this blog, we reviewed alternative methods of using Ecoservity's USB4SAP product in conjunction with Data Factory to load SAP business data for PowerBI reports and the data lake. You can adopt live and cached modes.&amp;nbsp; Templates make it easy for end users to adopt the solution in a pipeline. The Ecoservity product is available in the Azure Marketplace. You can try it out as an alternative to the existing connectors available in Data Factory.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2024 17:36:30 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/integrate-microsoft-fabric-with-sap-data-with-usb4sap-in-live/ba-p/4152300</guid>
      <dc:creator>Sunil_Sabat</dc:creator>
      <dc:date>2024-06-19T17:36:30Z</dc:date>
    </item>
    <item>
      <title>Optimizing ETL Workflows: A Guide to Azure Integration and Authentication with Batch and Storage</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/optimizing-etl-workflows-a-guide-to-azure-integration-and/ba-p/4138729</link>
      <description>&lt;H2&gt;&lt;FONT size="5"&gt;Introduction&lt;/FONT&gt;&lt;/H2&gt;
&lt;P&gt;&lt;FONT size="3"&gt;When it comes to building a robust foundation for ETL (Extract, Transform, Load) pipelines, the trio of Azure Data Factory or Azure Synapse Analytics, Azure Batch, and Azure Storage is indispensable. These tools enable efficient data movement, transformation, and processing across diverse data sources, thereby helping us achieve our strategic goals.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="3"&gt;This document provides a comprehensive guide on how to authenticate Azure Batch with SAMI and Azure Storage with Synapse SAMI. This enables user-driven connectivity to storage, facilitating data extraction. Furthermore, it allows the use of custom activities, such as High-Performance Computing (HPC), to process the extracted data.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="3"&gt;The key enabler of these functionalities is the Synapse Pipeline. Serving as the primary orchestrator, the Synapse Pipeline is adept at integrating various Azure resources in a secure manner. Its capabilities can be extended to Azure Data Factory (ADF), providing a broader scope of data management and transformation.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="3"&gt;Through this guide, you will gain insights into leveraging these powerful Azure services to optimize your data processing workflows.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;FONT size="5"&gt;Services Overview&lt;/FONT&gt;&lt;/H2&gt;
&lt;P&gt;&lt;FONT size="3"&gt;During this procedure we will use different services, below you have more details about each of them.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;Azure Synapse Analytics / Data Factory&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Azure Synapse Analytics&amp;nbsp;is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation:&lt;/FONT&gt;&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/synapse-analytics/overview-what-is" target="_blank" rel="noopener"&gt;What is Azure Synapse Analytics? - Azure Synapse Analytics | Microsoft Learn&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/introduction" target="_blank" rel="noopener"&gt;&lt;FONT size="3"&gt;Introduction to Azure Data Factory - Azure Data Factory | Microsoft Learn&lt;/FONT&gt;&lt;/A&gt;&lt;BR /&gt;&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;H4&gt;Azure Batch&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Azure Batch&amp;nbsp;is a powerful platform service designed for running large-scale parallel and high-performance computing (HPC) applications in the cloud.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation: &lt;/FONT&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/batch/batch-technical-overview" target="_blank" rel="noopener"&gt;&lt;FONT size="3"&gt;Azure Batch runs large parallel jobs in the cloud - Azure Batch | Microsoft Learn&lt;/FONT&gt;&lt;/A&gt;&lt;BR /&gt;&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;FONT size="4"&gt;Azure Storage&lt;/FONT&gt;&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Azure Storage&amp;nbsp;provides scalable and secure storage services for various data types, including services like&amp;nbsp;&lt;STRONG&gt;Azure Blob storage&lt;/STRONG&gt;,&amp;nbsp;&lt;STRONG&gt;Azure Table storage&lt;/STRONG&gt;, and&amp;nbsp;&lt;STRONG&gt;Azure Queue storage&lt;/STRONG&gt;.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation: &lt;/FONT&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/storage/common/storage-introduction" target="_blank" rel="noopener"&gt;&lt;FONT size="3"&gt;Introduction to Azure Storage - Cloud storage on Azure | Microsoft Learn&lt;/FONT&gt;&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Managed Identities&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Azure Managed Identities are a feature of Azure Active Directory that automatically manages credentials for applications to use when connecting to resources that support Azure AD authentication. They eliminate the need for developers to manage secrets, credentials, certificates, and keys.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;There are two types of managed identities:&lt;/FONT&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;System-assigned: Tied to your application.&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;User-assigned: A standalone Azure resource that can be assigned to your app&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation:&amp;nbsp;&lt;/FONT&gt;&lt;A href="https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview" target="_blank" rel="noopener"&gt;&lt;FONT size="3"&gt;Managed identities for Azure resources - Managed identities for Azure resources | Microsoft Learn&lt;/FONT&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H6&gt;&amp;nbsp;&lt;/H6&gt;
&lt;H2&gt;&lt;FONT size="5"&gt;Scenario&lt;/FONT&gt;&lt;/H2&gt;
&lt;P&gt;&lt;FONT size="3"&gt;Run an ADF / Synapse Pipeline that pulls a script located in a Storage Account and execute it into the Batch nodes using User Assigned Managed Identities (UAMI) for Authentication to Storage and System Assigned Managed Identity (SAMI) to authenticate with Batch.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT size="5"&gt;Prerequisites&lt;/FONT&gt;&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;&lt;EM&gt;ADF / Synapse Workspace&lt;/EM&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation: &lt;A href="https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-workspace" target="_blank" rel="noopener"&gt;Quickstart: create a Synapse workspace - Azure Synapse Analytics | Microsoft Learn&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;&lt;EM&gt;UA Managed Identity&lt;/EM&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation: &lt;A href="https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/how-manage-user-assigned-managed-identities?pivots=identity-mi-methods-azp" target="_blank" rel="noopener"&gt;Manage user-assigned managed identities - Managed identities for Azure resources | Microsoft Learn&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Blog Documentation: &lt;A href="https://techcommunity.microsoft.com/t5/azure-data-factory-blog/support-for-user-assigned-managed-identity-in-azure-data-factory/ba-p/2841013" target="_blank" rel="noopener"&gt;https://techcommunity.microsoft.com/t5/azure-data-factory-blog/support-for-user-assigned-managed-identity-in-azure-data-factory/ba-p/2841013&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;&lt;EM&gt;Storage Account&lt;/EM&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation: &lt;/FONT&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal" target="_blank" rel="noopener"&gt;&lt;FONT size="3"&gt;Create a storage account - Azure Storage | Microsoft Learn&lt;/FONT&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;H6&gt;&amp;nbsp;&lt;/H6&gt;
&lt;H4&gt;&lt;FONT size="5"&gt;Procedure Overview&lt;/FONT&gt;&lt;/H4&gt;
&lt;P&gt;&lt;FONT size="3"&gt;During this procedure we will walk through step by step to complete the following actions:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Create UAMI Credentials&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Create Linked Services for Storage and Batch Accounts&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Add UAMI and SAMI to Storage and Batch Accounts&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Create, Configure and Execute an ADF / Synapse Pipeline&lt;/FONT&gt;&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;We will refer to ADF (Portal, Workspace, Pipelines, Jobs, Linked Services) as Synapse during all the exercise and examples to avoid redundancy.&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Debugging&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;FONT size="5"&gt;Procedure&lt;/FONT&gt;&lt;/H4&gt;
&lt;H5&gt;&lt;FONT size="4"&gt;Create UAMI Credentials&lt;/FONT&gt;&lt;/H5&gt;
&lt;P&gt;&lt;FONT size="3"&gt;1. In your Synapse Portal, go to Manage -&amp;gt; Credentials -&amp;gt; New and fill in the details and click&amp;nbsp;Create.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H5&gt;&lt;FONT size="4"&gt;Create Linked Services Connections for Storage and Batch&lt;/FONT&gt;&lt;/H5&gt;
&lt;P&gt;2. In your Synapse Portal, go to Manage -&amp;gt; Linked Services -&amp;gt; New -&amp;gt; Azure Blob Storage -&amp;gt; Continue and complete the form (a JSON-style sketch of the resulting definition is shown below):&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;a. Authentication Type: UAMI&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;b. Azure Subscription: Choose your subscription&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;c. Storage Account name: Choose the account where the script to be used is stored&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;d. Credentials: Choose the credential created in Step #1&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;e. Click on Create&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
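&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For reference, the sketch below shows approximately what the resulting Azure Blob Storage linked service looks like when exported as JSON (expressed here as a Python dictionary). Names such as "LS_BlobStorage" and "UAMI_Credential" are placeholders, and the property layout is an approximation of the UI output rather than an authoritative schema.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import json

# Approximate shape of a Blob Storage linked service that authenticates
# with a user-assigned managed identity via the workspace credential from Step #1.
blob_linked_service = {
    "name": "LS_BlobStorage",                       # placeholder name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "serviceEndpoint": "https://yourstorageaccount.blob.core.windows.net/",
            "credential": {
                "referenceName": "UAMI_Credential", # credential created in Step #1
                "type": "CredentialReference",
            },
        },
    },
}
print(json.dumps(blob_linked_service, indent=2))&lt;/LI-CODE&gt;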
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="3"&gt;3. In Azure Portal go to your Batch Account -&amp;gt; Keys and Copy the Batch Account name &amp;amp; Account Endpoint to be used in next step, also copy the Pool Name to be used for this example.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="3"&gt;4. In your Synapse Portal, go to Manage -&amp;gt; Linked Services -&amp;gt; New -&amp;gt; Azure Batch -&amp;gt; Continue and fill in the information&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;a. Authentication Method: SAMI (Copy the Managed Identity Name to be used later)&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;b. Account Name, Batch URL and Pool Name: Paste on here the values copied from Step#3&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;c. Storage linked service Name: Choose the one created from Step#2&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT size="3"&gt;5. Publish all your changes&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H5&gt;&lt;FONT size="4"&gt;Adding UAMI RBAC Roles to Storage Account&lt;/FONT&gt;&lt;/H5&gt;
&lt;P&gt;&lt;FONT size="3"&gt;6. In the Azure Portal, go to your Storage Account -&amp;gt; Access Control (IAM)&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;a. Click on Add Option and then on Add role assignment and search for "Storage Blob Data Contributor", then click on Next.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;b. Choose Managed Identity and select your UAMI click on Select and then click Next, Next and Review + assign.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H5&gt;&lt;FONT size="4"&gt;Adding SAMI RBAC Roles to Batch Account&lt;/FONT&gt;&lt;/H5&gt;
&lt;P&gt;&lt;FONT size="3"&gt;7. In the Azure Portal, go to your Batch Account -&amp;gt; Access Control (IAM)&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;a. Click on Add Option and then on Add role assignment&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;b. Click on "Privileged administrator roles" tab and then choose the Contributor role and click Next.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;c. Choose Managed Identity and under Managed Identity lookup for "Synapse workspace" and then choose the SAMI same as it is added into the step 4a., then click on Select and Next, Next and Review and Assign.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H5&gt;&lt;FONT size="4"&gt;Adding UAMI to Batch Pool&lt;/FONT&gt;&lt;/H5&gt;
&lt;P&gt;&lt;FONT size="3"&gt;If you need to create a new Batch Pool, you can follow the following procedure:&lt;/FONT&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Documentation: &lt;A href="https://learn.microsoft.com/en-us/azure/batch/managed-identity-pools" target="_blank" rel="noopener"&gt;Configure managed identities in Batch pools - Azure Batch | Microsoft Learn&lt;/A&gt;&lt;/FONT&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;FONT size="3"&gt;Make sure to select the UAMI configured into the Step 1&lt;/FONT&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;FONT size="3"&gt;8. If you already have a Batch Pool created follow the next steps:&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;a. Into the Azure Portal go to your Batch Account -&amp;gt; Pools -&amp;gt; Choose your Pool -&amp;gt; Go to Identity&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;&lt;FONT size="3"&gt;b. Click on Add then choose the necessary UAMI (on this example it was selected the one used by the Synapse Linked Services for Storage and another one used for other integrations) and click on Add.&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT size="3"&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="3"&gt;&lt;EM&gt;&lt;STRONG&gt;Important&lt;/STRONG&gt;&lt;/EM&gt;: &lt;EM&gt;In case your Batch Pool use multiples UAMI's (as example to connect with Key Vault or other services), you have first to remove the existing one and then add all of them together.&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;c. Then, it is required to Scale in and Scale out the Pool to apply the changes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H5&gt;&lt;FONT size="4"&gt;Setting up the Pipeline&lt;/FONT&gt;&lt;/H5&gt;
&lt;P&gt;9. In your Synapse Portal, go to Integrate -&amp;gt; Add New Resource -&amp;gt; Pipeline&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;10. In the Activities panel, under Batch Services, drag and drop the Custom activity onto the pipeline canvas&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;11. In the Azure Batch tab of the Custom activity, select the Azure Batch linked service created in Step 4 and test the connection (if you receive a connection error, please go to Troubleshooting scenario 1).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;12. Then go to the Settings tab and add your script. For this example, we will use a PowerShell script previously uploaded to a Storage Blob container and send the output to a txt file (a JSON-style sketch of the resulting Custom activity definition appears at the end of this step).&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;a. Command: your script details&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;b. Resource linked service: The Storage linked service configured previously in Step #2&lt;/P&gt;
&lt;P class="lia-indent-padding-left-30px"&gt;c. Browse Storage: look up the container where your script was uploaded&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;d. Publish your Changes and perform a Debug&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
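&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As a point of reference, the sketch below shows roughly how the Custom activity configured in step 12 is represented in the pipeline JSON (expressed here as a Python dictionary). The linked service names, folder path, and script name are placeholders for the values from Steps #2 and #4.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Approximate JSON shape of the Custom activity configured above.
custom_activity = {
    "name": "RunScriptOnBatch",
    "type": "Custom",
    "linkedServiceName": {
        "referenceName": "LS_AzureBatch",       # Batch linked service (Step #4)
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        # Command executed on the Batch node (the script itself writes its output to a txt file).
        "command": "powershell -File myscript.ps1",
        "resourceLinkedService": {
            "referenceName": "LS_BlobStorage",  # Storage linked service (Step #2)
            "type": "LinkedServiceReference",
        },
        "folderPath": "scripts",                # container/folder holding the script
    },
}&lt;/LI-CODE&gt;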
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H5&gt;&lt;FONT size="4"&gt;Debugging&lt;/FONT&gt;&lt;/H5&gt;
&lt;P&gt;13. Check the Synapse job logs and outputs&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;a. Copy the Activity Run ID&lt;BR /&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;b. Then, in the Azure Portal, go to your Storage Account -&amp;gt; Containers -&amp;gt; adfjobs -&amp;gt; select the folder with the activityID -&amp;gt; output.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;c. Here you will find two files, "stderr.txt" and "stdout.txt"; both contain information about the errors or the outputs of the commands executed during the task execution (see the sketch below for retrieving these files programmatically).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
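&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you prefer to pull these files programmatically instead of browsing the portal, the sketch below uses the azure-storage-blob Python package to locate the folder for a given Activity Run ID in the adfjobs container and print both log files. The storage account URL and run ID are placeholders.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://yourstorageaccount.blob.core.windows.net"  # placeholder
ACTIVITY_RUN_ID = "00000000-0000-0000-0000-000000000000"          # copied in step 13a

service = BlobServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
container = service.get_container_client("adfjobs")

# The job folder name contains the activity run ID; find it and print the logs.
for blob in container.list_blobs():
    if ACTIVITY_RUN_ID in blob.name and blob.name.endswith(".txt"):
        print("=====", blob.name, "=====")
        print(container.download_blob(blob.name).readall().decode("utf-8", "replace"))&lt;/LI-CODE&gt;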
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;14. Check the Batch logs and outputs. There are different ways to get the Batch logs:&lt;BR /&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;a. Via Nodes: In the Azure Portal, go to your Batch Account -&amp;gt; Pools -&amp;gt; Choose your Pool -&amp;gt; Nodes -&amp;gt; then in the folder details go to the folder for this Synapse execution -&amp;gt; job-x -&amp;gt; look up the activityID &lt;BR /&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;b. Via Jobs: In the Azure Portal, go to your Batch Account -&amp;gt; Jobs -&amp;gt; Select the job named adfv2-yourPoolName -&amp;gt; click on the Task with the same ID as the ActivityID of the Synapse Pipeline from step 13a.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;STRONG&gt;What we have learned&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;During this walkthrough procedure we have learned about and implemented the following:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Authentication&lt;/STRONG&gt;: Utilizing User Assigned Managed Identities (UAMI) and System Assigned Managed Identity (SAMI) for secure connections.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Linked Services&lt;/STRONG&gt;: Creation and configuration of linked services for Azure Storage and Azure Batch accounts.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Pipeline Execution&lt;/STRONG&gt;: Steps to create, configure, and execute an ADF/Synapse Pipeline, emphasizing the use of Synapse as a unified term to avoid redundancy.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Debugging&lt;/STRONG&gt;: Detailed instructions for creating credentials, adding RBAC roles, and setting up pipelines, along with troubleshooting tips.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Logs Analysis&lt;/STRONG&gt;: How to access and analyze Synapse Jobs logs and Azure Batch logs for troubleshooting.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Error Handling&lt;/STRONG&gt;: Understanding the significance of ‘stderr.txt’ and ‘stdout.txt’ files in identifying and resolving errors during task execution.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;If you have any questions or feedback, please leave a comment below!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 27 May 2024 17:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/optimizing-etl-workflows-a-guide-to-azure-integration-and/ba-p/4138729</guid>
      <dc:creator>Josedobla</dc:creator>
      <dc:date>2024-05-27T17:00:00Z</dc:date>
    </item>
    <item>
      <title>Data Factory Increases Maximum Activities Per Pipeline to 80</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/data-factory-increases-maximum-activities-per-pipeline-to-80/ba-p/4096418</link>
      <description>&lt;P&gt;Data Factory pipeline developers create exciting and interesting data integration and ETL workflows for their data analytics projects. Because Data Factory is a platform service that is shared across ADF, Synapse, and Fabric, we had been limiting the number of activities in a single pipeline to 40 as a way to avoid resource exhaustion.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, just this week, we have doubled the&amp;nbsp;&lt;A href="https://learn.microsoft.com/azure/azure-resource-manager/management/azure-subscription-service-limits#azure-data-factory-limits" target="_blank" rel="noopener"&gt;limit on number of activities&lt;/A&gt;&amp;nbsp;you may define in a pipeline, from 40 to 80. With more freedom to develop, we want to empower you to create more powerful, versatile, and resilient data pipelines for all your business needs. We are excited to see what you come up with, harnessing the power of 40 more activities per pipeline!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2 id="toc-hId--572883774"&gt;What's the limit about &amp;amp; why did we raise it?&lt;/H2&gt;
&lt;P&gt;To ensure the resiliency and reliability of data pipelines, Data Factory places a limit on maximum number of activities that a pipeline may define. For the longest time, the limit has always been 40 activities per pipeline. Today, we are doubling it to 80, with future plans to raise it even further for our developers. The limit applies to the number of activities&amp;nbsp;&lt;EM&gt;defined,&amp;nbsp;&lt;/EM&gt;not actually run. For instance, in the following example with&amp;nbsp;&lt;A href="https://learn.microsoft.com/azure/data-factory/tutorial-pipeline-failure-error-handling#conditional-paths" target="_blank" rel="noopener"&gt;conditional branching&lt;/A&gt;, there are 3 activities defined, even though, realistically speaking, in any pipeline run, only 2 will actually run.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We understand that our customers want to build resilient and useful data pipelines for their business needs, and sometimes, the 40 activities limit may come in the way of development. Hence, we are doubling the ceiling limit and giving you 40 more activities in a pipeline.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;When to add more activities?&lt;/H2&gt;
&lt;P&gt;We strongly encourage customers to use the additional 40 activities to build&amp;nbsp;&lt;A href="https://review.learn.microsoft.com/azure/data-factory/tutorial-pipeline-failure-error-handling?branch=pr-en-us-268211" target="_blank" rel="noopener"&gt;error handling capabilities&lt;/A&gt;. For instance, send an email to your on-call alias when a Copy activity fails, and otherwise proceed (a sketch of the dependency conditions behind these patterns appears below).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Or build a try-catch block that attempts to move the data if it's ready or move on otherwise.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
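&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Both patterns rely on activity dependency conditions in the pipeline definition. The minimal sketch below (a Python dictionary mirroring the pipeline JSON) shows a Copy activity followed by a notification activity that only runs when the copy fails; the activity names are placeholders and the notification details are omitted.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Minimal error-handling pattern: run "Notify on-call" only if "Copy data" fails.
activities = [
    {
        "name": "Copy data",
        "type": "Copy",
        # source and sink settings omitted for brevity
    },
    {
        "name": "Notify on-call",
        "type": "WebActivity",
        "dependsOn": [
            {
                "activity": "Copy data",
                "dependencyConditions": ["Failed"],  # also: Succeeded, Completed, Skipped
            }
        ],
        # typeProperties (URL, method, body) omitted for brevity
    },
]&lt;/LI-CODE&gt;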
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Build for Resilience and Retries!&lt;/H3&gt;
&lt;P&gt;We do not, however, encourage you to build a sequential pipeline with 80 activities one after another. Please be aware that data pipelines, just like any other piece of software, can sometimes encounter failures. For instance, when the connection to your SQL server is being throttled and a copy activity cannot complete in time.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In those cases, you need to retry and restart the pipeline. Please bear this in mind as you develop your pipeline: keep the actual steps within a pipeline to a reasonable number. Production engineers will thank you for keeping their lives simple.&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;H3&gt;Final Thoughts&lt;/H3&gt;
&lt;P&gt;With the power of data pipelines, we want you to be able to build and deliver business impact for your end users. We are excited to see what you come up with, now with the power of 40 more activities!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Have any questions or feedback? Leave a comment below!&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
</description>
      <pubDate>Fri, 29 Mar 2024 17:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/data-factory-increases-maximum-activities-per-pipeline-to-80/ba-p/4096418</guid>
      <dc:creator>Noelle_Li</dc:creator>
      <dc:date>2024-03-29T17:00:00Z</dc:date>
    </item>
    <item>
      <title>Action Required: Switch from Memory Optimized Data Flows in Azure Data Factory to General Purpose</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/action-required-switch-from-memory-optimized-data-flows-in-azure/ba-p/4096314</link>
      <description>&lt;P&gt;Azure Data Factory Memory Optimized Data Flows will be fully retired on April 1, 2027. Going forward, all ADF Data Flows will use the General Purpose SKU, which provides performance superior to the current Memory Optimized SKU at the General Purpose price.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;How does this affect me?&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;Beginning April 1, 2024, the creation of new Azure Data Factory Memory Optimized Data Flows will be discontinued until it is fully retired on April 1, 2027.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Existing pipelines can continue to use existing Memory Optimized data flows, but you will not be able to create new Azure Integration Runtimes using Memory Optimized. You will be able to use General Purpose ADF Data Flows, which will provide better performance at a lower price.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Required action&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;To avoid disruptions, we recommend the following actions:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;When creating new data flows, create a new Azure Integration Runtime using General Purpose instead of Memory Optimized.&lt;/LI&gt;
&lt;LI&gt;Then assign General Purpose IRs to existing and new data flows instead of using Memory Optimized (a sketch of such an integration runtime definition follows this list).&lt;/LI&gt;
&lt;/OL&gt;
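&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For reference, the sketch below shows approximately what a General Purpose Azure Integration Runtime definition looks like in JSON (expressed as a Python dictionary). The name, core count, and time to live are placeholders; the key point is that computeType is "General" rather than "MemoryOptimized".&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Approximate shape of an Azure IR for data flows using the General Purpose SKU.
general_purpose_ir = {
    "name": "GeneralPurposeDataFlowIR",        # placeholder name
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",  # instead of "MemoryOptimized"
                    "coreCount": 8,
                    "timeToLive": 10,          # minutes
                },
            },
        },
    },
}&lt;/LI-CODE&gt;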
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Help and Support&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;If you have questions, get answers from community experts in Microsoft Q&amp;amp;A or email our team.&lt;/P&gt;
&lt;P&gt;If you have a support plan and require technical support, please create a support request.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Under &lt;EM&gt;Issue type&lt;/EM&gt;, select Technical.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Under &lt;EM&gt;Subscription&lt;/EM&gt;, select your subscription.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Under &lt;EM&gt;Service&lt;/EM&gt;, select My services, then select Data Factory.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Under &lt;EM&gt;Summary&lt;/EM&gt;, type a description of your issue.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Under &lt;EM&gt;Problem type&lt;/EM&gt;, select Mapping Data Flow&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Let us know in the comments if you have any questions or feedback!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2024 23:57:50 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/action-required-switch-from-memory-optimized-data-flows-in-azure/ba-p/4096314</guid>
      <dc:creator>Noelle_Li</dc:creator>
      <dc:date>2024-03-25T23:57:50Z</dc:date>
    </item>
    <item>
      <title>Continued region expansion: Azure Data Factory is generally available in two more regions</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-is-generally/ba-p/4029391</link>
      <description>&lt;P&gt;&lt;A href="https://azure.microsoft.com/en-us/services/data-factory/" target="_self" data-bi-an="body" data-bi-tn="undefined"&gt;Azure Data Factory&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is now available in two new regions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Israel Central&lt;/LI&gt;
&lt;LI&gt;Italy North&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN&gt;You can now provision Data Factory&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;in the new regions in order to co-locate your Extract-Transform-Load logic with your data lake and compute.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;See the full set of&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://azure.microsoft.com/en-us/global-infrastructure/services/?cdn=disable&amp;amp;products=data-factory" target="_self" data-bi-an="body" data-bi-tn="undefined"&gt;Azure Data Factory supported regions&lt;/A&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2024 18:00:16 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-is-generally/ba-p/4029391</guid>
      <dc:creator>Chunhua</dc:creator>
      <dc:date>2024-01-17T18:00:16Z</dc:date>
    </item>
    <item>
      <title>Continued region expansion: Azure Data Factory is generally available in Poland Central</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-is-generally/ba-p/3965769</link>
      <description>&lt;P style="margin-left: .375in; margin-top: 9pt; margin-bottom: 9pt; font-family: 'Segoe UI'; font-size: 12.0pt; color: #4c4c51;"&gt;&lt;SPAN&gt;Azure Data Factory is now available in &lt;STRONG&gt;Poland Central&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="margin-left: .375in; margin-top: 9pt; margin-bottom: 9pt; font-family: 'Segoe UI'; font-size: 12.0pt; color: #4c4c51;"&gt;&lt;SPAN&gt;You can now provision Data Factory&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;in the new region in order to co-locate your Extract-Transform-Load logic with your data lake and compute.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="margin-left: .375in; margin-top: 9pt; margin-bottom: 9pt; font-family: 'Segoe UI'; font-size: 12.0pt;"&gt;&lt;SPAN&gt;See the full set of&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://azure.microsoft.com/en-us/global-infrastructure/services/?cdn=disable&amp;amp;products=data-factory" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;Azure Data Factory supported regions&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 27 Oct 2023 19:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-is-generally/ba-p/3965769</guid>
      <dc:creator>Chunhua</dc:creator>
      <dc:date>2023-10-27T19:00:00Z</dc:date>
    </item>
    <item>
      <title>General Availability of Time to Live (TTL) for Managed Virtual Network in Azure Data Factory</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/general-availability-of-time-to-live-ttl-for-managed-virtual/ba-p/3922218</link>
      <description>&lt;P&gt;In the fast-paced world of data integration, where seamless and secure data movement is paramount, Azure Data Factory (ADF) stands as a trusted orchestrator of data workflows. Today, we are thrilled to announce a significant enhancement to ADF's capabilities - the General Availability of ADF Managed Virtual Network Time to Live (TTL).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H1&gt;What is Managed Virtual Network TTL?&lt;/H1&gt;
&lt;P&gt;Before we delve into the benefits and use cases, let's understand what Managed Virtual Network TTL is all about.&lt;/P&gt;
&lt;P&gt;Time to Live (TTL) is a crucial enhancement for Azure integration runtimes within a Managed Virtual Network. It allows you to specify a TTL value and Data Integration Unit (DIU) numbers required for various data integration activities. The TTL feature helps to manage compute resources more effectively, reduce startup times, and optimize overall performance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H1&gt;Key Benefits of Managed Virtual Network TTL&lt;/H1&gt;
&lt;P&gt;Now, let's explore the key benefits of Managed Virtual Network TTL and why it's a game-changer for your data integration workflows:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt; Improved Performance&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;One of the challenges in Managed Virtual Network is managing the startup time of compute resources, especially when dealing with multiple copy activities or complex pipelines. Managed Virtual Network TTL addresses this by keeping computes alive for a certain period after their execution completes. If a new copy activity starts during the TTL time, it will reuse existing computes, significantly reducing startup time and enhancing overall performance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;&lt;STRONG&gt; Compute Size Flexibility&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;With Managed Virtual Network TTL, you have the flexibility to select from pre-defined compute sizes or customize the compute size based on your specific requirements and real-time needs. This customization ensures that your compute resources are optimally sized for the tasks at hand.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H1&gt;Pipeline and External Activity&lt;/H1&gt;
&lt;P&gt;Time to Live (TTL) isn't just limited to copy activities; you can also tailor the compute size and TTL duration for pipeline and external activities, ensuring your data integration processes are finely tuned to your specific requirements.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
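&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For reference, the sketch below shows approximately how TTL and compute settings can appear in a managed virtual network integration runtime definition (expressed as a Python dictionary). The property names follow the ADF integration runtime schema to the best of our knowledge, and the specific values are placeholders; check the linked documentation for the authoritative shape.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Approximate Azure IR definition with TTL for copy, pipeline, and external activities.
managed_vnet_ir = {
    "name": "ManagedVnetIRwithTTL",                      # placeholder name
    "properties": {
        "type": "Managed",
        "managedVirtualNetwork": {
            "type": "ManagedVirtualNetworkReference",
            "referenceName": "default",
        },
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "copyComputeScaleProperties": {
                    "dataIntegrationUnit": 16,           # DIUs kept warm for copy activities
                    "timeToLive": 30,                    # minutes
                },
                "pipelineExternalComputeScaleProperties": {
                    "timeToLive": 60,                    # minutes, for pipeline and external activities
                },
            },
        },
    },
}&lt;/LI-CODE&gt;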
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H1&gt;Monitoring Your Managed Virtual Network&lt;/H1&gt;
&lt;P&gt;Azure Data Factory's Managed Virtual Network TTL feature brings a new level of control and efficiency to your data integration workflows. By allowing you to manage compute resources effectively and reduce startup times, it optimizes performance. However, to ensure that your data integration processes are running smoothly within this secure environment, you need visibility and monitoring. In Azure Data Factory, we also provide some new metrics to help you identify the issues and bottlenecks.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Learn more about monitoring: &lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/monitor-managed-virtual-network-integration-runtime" target="_blank" rel="noopener"&gt;Monitor an integration runtime within a managed virtual network - Azure Data Factory | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H1&gt;Embrace ADF Managed Virtual Network TTL&lt;/H1&gt;
&lt;P&gt;We are excited to bring you this enhancement to Azure Data Factory, and we look forward to seeing how it transforms your data integration processes. &lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/managed-virtual-network-private-endpoint" target="_blank" rel="noopener"&gt;Get started with Managed Virtual Network TTL today&lt;/A&gt; and unlock a new level of efficiency and security in your data workflows.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 11 Oct 2023 04:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/general-availability-of-time-to-live-ttl-for-managed-virtual/ba-p/3922218</guid>
      <dc:creator>lrtoyou1223</dc:creator>
      <dc:date>2023-10-11T04:00:00Z</dc:date>
    </item>
    <item>
      <title>Metadata Driven Pipelines for Dynamic Full and Incremental Processing in Azure SQL</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/metadata-driven-pipelines-for-dynamic-full-and-incremental/ba-p/3925362</link>
      <description>&lt;P&gt;Developing ETLs/ELTs can be a complex process when you add in business logic, large amounts of data, and the high volume of table data that needs to be moved from source to target. This is especially true in analytical workloads involving Azure SQL when there is a need to either fully reload a table or incrementally update a table. In order to handle the logic to incrementally update a table or fully reload a table in Azure SQL (or Azure Synapse), we will need to create the following assets:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Metadata table in Azure SQL&amp;nbsp;
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;This will contain the configurations needed to load each table end to end&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Metadata driven pipelines
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;Parent and child pipeline templates that will orchestrate and execute the ETL/ELT end to end&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Custom SQL logic for incremental processing
&lt;UL class="lia-list-style-type-circle"&gt;
&lt;LI&gt;Dynamic SQL to perform the delete and insert based on criteria the user provides in the metadata table&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;*This article uses Azure SQL DB as the source and sink databases. However, Azure SQL MI, On-Prem SQL, and Synapse Dedicated Pools (along with Synapse Pipelines) will also be compatible with this solution. As a source, you can use databases like MySQL, Oracle, and others. You will just need to adjust the query syntax/connections to match the desired source.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;STRONG&gt;Scenario&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P&gt;There is a need to load SQL tables from a SQL Server source on a daily frequency or multiple times a day. The requirements are to land the data first in ADLS Gen 2, and then finally load the tables into Azure SQL DB with the correct processing (incremental or full) while using a dynamic pipeline strategy to limit the number of objects used.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;Metadata Table&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;
&lt;P&gt;The first setup required in our dynamic ETL is a metadata table (sometimes called a "config" table) on the destination SQL Server environment. This table contains all of the information that needs to be passed into the ADF pipelines to determine the source query, ADLS Gen 2 storage location and metadata, processing metadata, staging metadata, and other metadata critical to performing the ETL. An example metadata table design and a sample are below.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Metadata table definition&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;CREATE TABLE [meta].[ADLS_Metadata](
	[FileName] [varchar](100) NULL,
	[StorageAccount] [varchar](100) NULL,
	[StorageContainer] [varchar](100) NULL,
	[ContainerDirectoryPath] [varchar](100) NULL,
	[LoadType] [varchar](25) NULL,
	[LoadIndicator] [varchar](25) NULL,
	[SourceSchema] [varchar](25) NULL,
	[SourceTable] [varchar](100) NULL,
	[StagingSchema] [varchar](25) NULL,
	[StagingTable] [varchar](100) NULL,
	[TargetSchema] [varchar](25) NULL,
	[TargetTable] [varchar](100) NULL,
	[ColumnKey] [varchar](500) NULL,
	[WaterfallColumn] [varchar](100) NULL,
	[TableColumns] [varchar](1000) NULL
) ON [PRIMARY]
GO&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Sample output of metadata table&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The ETL will be facilitated entirely from this metadata table. Any tables that are not included in this table will not be processed by our ETL pipelines. To add new tables or workloads, simply insert them into the metadata table and they will be available when the pipeline is triggered, &lt;STRONG&gt;without&amp;nbsp;&lt;/STRONG&gt;needing to alter the ADF pipelines (a minimal sketch of such an insert follows).&amp;nbsp;Whether the data is going to the same storage container or a different one, different databases, etc., the metadata table allows you to dictate the where, what, and how of your ETL from one central location. This is a simple metadata table example, but you can make it as robust as you desire by adding test flags, different load frequency indicators, and many others.&lt;/P&gt;
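&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As a minimal sketch of that workflow, the Python snippet below registers a new table in the metadata table using pyodbc. The connection string and column values are placeholders, and only a subset of the columns from the table definition above is populated.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import pyodbc

# Placeholder connection string - point it at the destination Azure SQL DB.
CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:yourserver.database.windows.net,1433;"
    "Database=yourdb;Authentication=ActiveDirectoryInteractive;"
)

# Register a new table so the pipelines pick it up on the next trigger.
row = {
    "FileName": "SalesOrderHeader.parquet",
    "StorageAccount": "yourstorageaccount",
    "StorageContainer": "raw",
    "ContainerDirectoryPath": "sales/orders",
    "LoadType": "Incremental",
    "SourceSchema": "SalesLT",
    "SourceTable": "SalesOrderHeader",
    "TargetSchema": "dbo",
    "TargetTable": "SalesOrderHeader",
    "ColumnKey": "SalesOrderID",
    "WaterfallColumn": "ModifiedDate",
}

columns = ", ".join(row)
placeholders = ", ".join("?" for _ in row)

with pyodbc.connect(CONN_STR) as conn:
    conn.execute(
        f"INSERT INTO meta.ADLS_Metadata ({columns}) VALUES ({placeholders})",
        list(row.values()),
    )
    conn.commit()&lt;/LI-CODE&gt;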
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;Metadata Driven Pipelines&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;Now that the metadata table is constructed, time to build the dynamic ADF pipelines to orchestrate and execute the ETL.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here are the ADF objects needed to execute the ETL for 'N' number of tables; each is shown in the steps below. It is important to note the power of dynamic metadata driven pipelines: in this scenario they are able to execute an enterprise-level ETL with only 3 pipelines, 2-3 linked services, and 2-3 datasets.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Linked Services:&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Source SQL Server*
&lt;UL&gt;
&lt;LI&gt;Authenticated with system-assigned managed identity.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Sink SQL Server*
&lt;UL&gt;
&lt;LI&gt;Authenticated with system-assigned managed identity.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;ADLS Gen 2 Storage
&lt;UL&gt;
&lt;LI&gt;Authenticated with system-assigned managed identity.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;*&lt;EM&gt;IF the source SQL Server and sink are the same service with the same authentication and integration runtime then you only need one linked service. Ex. both are Azure SQL DBs with the same authentication. However, if the authentication differs or they are different services (Azure SQL DB vs Azure SQL MI) then create one dynamic linked service for each.&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Datasets:&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Source SQL Server Dataset*&lt;/LI&gt;
&lt;LI&gt;Sink SQL Server Dataset*&lt;/LI&gt;
&lt;LI&gt;ADLS Gen 2 Storage Dataset&lt;/LI&gt;
&lt;LI&gt;*&lt;EM&gt;One dataset per linked service. Separate datasets may not be needed if you have one dynamic linked service.&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Pipelines:&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Main Orchestration Pipeline&lt;/LI&gt;
&lt;LI&gt;Full Load Processing Pipeline&lt;/LI&gt;
&lt;LI&gt;Incremental Processing Pipeline&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT size="5"&gt;Linked Service Creation&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;In this scenario, the source and sink SQL environments&amp;nbsp;are both Azure SQL DB with the same authentication, so there will be only one linked service created with parameters to handle the dynamic use. Feel free to use your own naming conventions for the objects and parameters, just be sure they are generic and descriptive. Ex. not using "parameter1" or "linkedService1".&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt; The generic name of the linked service will be "AzureSQLDB". The domain name and database name are referenced from the parameters that we created in the linked service to pass this connection information at runtime from the pipelines. Default values are available and will be used if there are not values passed through the pipeline.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The same pattern is used for the ADLS Gen 2 linked service. This linked service uses the generic name "ADLSGen2" and only a storage account parameter. No path is specified here, so the linked service can be used for all containers and paths with the same authentication method and Integration Runtime. The path and file will be optional parameters on the dataset that references this linked service.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT size="5"&gt;Dataset Creation&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;The dataset will created and using the linked services that were created above. There needs to be the parameters that are used in the linked service as well as additional parameters. The parameter names will align with the metadata table column names to provide ease of use.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;For the SQL environment, there needs to be the parameters "serverName" and "databaseName" which come from the linked service. Then adding the parameters "schemaName" and "tableName" to have the ability to query/use all tables in a server or database using that linked service. Create the parameters first on the "Parameters" tab and then use the 'add dynamic content' to place the reference to the parameters that were just created. These parameters will be exposed/prompted when referenced in the pipelines that are created later in this article.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For ADLS Gen 2 storage, there will be a dataset for each file type and compression. In this scenario, the data will be stored as Parquet files with snappy compression. The same concept as above is used for the parameters: the "storageAccountName" parameter required by the linked service is created on the dataset, and then the dataset-specific parameters "storageContainer", "containerDirectoryPath", and "fileName" identify all possible containers, paths, and files.&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT size="5"&gt;Pipeline and ETL Creation&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;With the metadata table, linked services, and datasets created, it is time to build out the metadata driven pipelines. The walkthrough below is split up into the 3 different pipelines, the main orchestration (parent) pipeline, full processing pipeline (child), and incremental processing pipeline (child). These pipelines are organized into folders for ease of access/formatting. The folders are virtual and offer no functionality&amp;nbsp;other than organization in the UI. The folders are "Orchestration" which houses the main parent pipeline and "ETL" which contains the children pipelines that perform the processing.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Main Orchestration pipeline:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;The main orchestration pipeline in this example is called "adventureWorks_Main". This pipeline will have a trigger associated with it and will control the execution of the whole ETL. This is the design of the pipeline in the UI, and each activity will be described.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;On the parent pipeline, it is critical to have pipeline parameters to allow this process to be dynamic. These pipeline parameters will be used throughout the activities, and passed to the children pipelines. They will look familiar as they will be used in the dynamic datasets, linked services, and querying the metadata table.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;sourceServerName&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Source server connection to be passed through the parameterized dataset to the parameterized linked service&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;sourceDatabaseName&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Source database connection to be passed through the parameterized dataset to the parameterized linked service&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;targetServerName&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Target server connection to be passed through the parameterized dataset to the parameterized linked service.&amp;nbsp;Typically where your metadata table lives as well.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;targetDatabaseName&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;Target database connection to be passed through the parameterized dataset to the parameterized linked service. Typically where your metadata table lives as well.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;loadIndicator&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;This is a frequency/use indicator. 'Daily' is an example that signifies the table is loaded daily. It is used as a filter, so you can place 'Test' or some other value to control which tables, frequencies, or uses to execute.
&lt;UL&gt;
&lt;LI&gt;Ex. 'Testing Only', 'Monthly', 'Hourly'. The frequency would correspond with a trigger frequency as well.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;waterfallLookbackDays&lt;/STRONG&gt;
&lt;UL&gt;
&lt;LI&gt;The number of days to look back when incrementally processing. Used only to find the changed rows in incremental data sources, and requires a reliable date stamp that corresponds to tracked inserts and updates.
&lt;UL&gt;
&lt;LI&gt;Ex. rows that have been updated within the last 120 days (-120)&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;There are many ways to incrementally process, and the waterfall column/columns approach is just the one used in this example (see the sketch of the resulting filter after this list). You would just need to adjust the parameters, syntax, and dynamic script to fit your criteria.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
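&lt;P&gt;As a minimal sketch of how the waterfall lookback is applied (assuming, for illustration only, a waterfall column named ModifiedDate and a lookback of -120 days), the dynamically built source query effectively produces a filter like this:&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;-- Illustrative only: the column name comes from the metadata table and the lookback value
-- comes from the waterfallLookbackDays pipeline parameter at runtime.
SELECT *
FROM SalesLT.SalesOrderHeader
WHERE CONVERT(DATE, ModifiedDate) &amp;gt;= DATEADD(DAY, -120, GETDATE())&lt;/LI-CODE&gt;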
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The full reload path and the incremental reload path have the same activities and pattern; however, they differ in 2 ways.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Look up query. Specifically, the WHERE clause&lt;/LI&gt;
&lt;LI&gt;The parameters passed to the 'Execute Pipeline' activity within the 'ForEach Loop'&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;&lt;STRONG&gt;Full Reload Pattern/Queries:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;Use a Lookup activity to extract the rows based on the SQL query criteria, then pass that result set to the ForEach loop and iterate over each table to perform the loading in the full load processing pipeline.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Lookup Activity - Full Reload:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Inside the look up activity - "Full Reload - Lookup Metadata".&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;The parameters defined on the dataset appear as properties within the activity that uses the dataset. The lookup activity queries the metadata table, which lives in the target server, using the pipeline parameters. The "schemaName" and "tableName" parameters are not needed since the lookup activity is performing a query, so placing an "x" value allows the pipeline to validate even though they are not used.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Dynamic content for "serverName" parameter&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@pipeline().parameters.targetServerName&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Dynamic content for the query. The pipeline parameter for "loadIndicator" is used in the SQL query as well as a hard coded filter for "Full" load types.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT LoadType, TargetTable 
FROM 
	meta.ADLS_Metadata
WHERE 
	LoadIndicator = '@{pipeline().parameters.loadIndicator}'
	AND LoadType = 'Full'&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Sample query output:&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;For Each Loop - Full Reload:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The next step is to iterate through the output in the for each loop. Use the settings tab to define the items from the full reload lookup.&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@activity('Full Reload - Lookup Metadata').output.value&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Inside the for each loop, an execute pipeline activity is used to call the full load processing pipeline (child). The child pipeline has parameters that must be supplied when it is executed. These are the same pipeline parameters passed from the parent pipeline, EXCEPT for a new parameter called 'targetTable'. The 'targetTable' value comes from the item currently being iterated from the output of the lookup activity.&amp;nbsp;&lt;/P&gt;
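&lt;P&gt;Since the lookup returns a TargetTable column, the dynamic content for the child pipeline's 'targetTable' parameter inside the ForEach loop would typically look like this:&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@item().TargetTable&lt;/LI-CODE&gt;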
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Full Reload - Processing Pipeline (child)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;The processing pipeline called "adventureWorks_FullLoad" is executed from the parent pipeline with the pipeline parameters being passed from parent to child. Because this pipeline is called within a for each loop, each table that is being iterated will be have their own execution from this pipeline. The overall purpose and design of this pipeline is:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Extract source data&lt;/LI&gt;
&lt;LI&gt;Load source data to ADLS Gen 2 storage&lt;/LI&gt;
&lt;LI&gt;Full reload of data from ADLS Gen 2 to Azure SQL DB&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Lookup Activity:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This will be the same dataset and configuration for the use of pipeline parameters as the parent pipeline lookup activity with the only difference being the query that is being passed through.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Query used in dynamic content. This returns all the columns associated with the row. Only one row should be returned; if there are multiple due to the addition of different testing scenarios/frequencies, refine the filtering logic for the appropriate context.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT * 
FROM 
	meta.ADLS_Metadata
WHERE 
	TargetTable = '@{pipeline().parameters.targetTable}'
	AND LoadIndicator = '@{pipeline().parameters.loadIndicator}'&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Copy data to ADLS Gen 2:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The metadata gathered from the lookup activity is used to extract the source table data and load it to the location in ADLS Gen 2 specified by the metadata.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the 'Source' - The pipeline parameters for the source server and source database are used for "serverName" and "databaseName". Then the output from the lookup provides the "schemaName" and "tableName". This time, a table is used instead of a query.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Example of activity output used in a parameter.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@activity('Full Load - Lookup Metadata').output.firstRow.SourceSchema&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the 'Sink' - all the parameters are populated from the lookup activity reading from the metadata table. This creates a path for each file, and each file is overwritten with every execution. If you wish to retain historical copies of the loads, you can add an archive step to move the files from this location and/or add custom logic for date suffixes in the expression builder, as sketched below.&amp;nbsp;&lt;/P&gt;
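&lt;P&gt;For example, a hypothetical expression for the sink 'fileName' parameter that appends a date suffix (assuming the FileName metadata value ends in '.parquet'; adjust to your own naming convention) could look like this:&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@concat(replace(activity('Full Load - Lookup Metadata').output.firstRow.FileName, '.parquet', ''), '_', formatDateTime(utcnow(), 'yyyyMMdd'), '.parquet')&lt;/LI-CODE&gt;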
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Example of storage container with loaded files&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Copy Data to Azure SQL DB:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Once the data has been landed into ADLS Gen 2 as parquet files, it is time to load the files into the Azure SQL DB using another Copy activity.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the 'Source' - this will be the same configuration as the 'Sink' of the previous copy activity, using the output from the lookup activity which contains the metadata table result.&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the 'Sink' - The dataset parameters will be populated with the pipeline parameters for the "targetServerName" and "targetDatabaseName". The "schemaName" and "tableName" are populated from the lookup activity output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There is a 'Pre-copy script' that truncates the table if it already exists. If the table does not exist, the script will not try to truncate it, and selecting the 'Table option' of 'Auto create table' will handle any new tables. The write behavior will be 'Insert' since this is a full reload.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If schema drift is present, one solution is to replace the truncate with a drop so the table is recreated on each execution, as sketched below the pre-copy script. That approach has other risks associated with it that need to be considered.&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Pre-copy script using pipeline parameters and lookup activity output parameters&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '@{activity('Full Load - Lookup Metadata').output.firstRow.TargetSchema}' AND TABLE_NAME =  '@{activity('Full Load - Lookup Metadata').output.firstRow.TargetTable}' )
BEGIN 
TRUNCATE TABLE [@{activity('Full Load - Lookup Metadata').output.firstRow.TargetSchema}].[@{activity('Full Load - Lookup Metadata').output.firstRow.TargetTable}] 
END
&lt;/LI-CODE&gt;
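&lt;P&gt;A minimal sketch of the drop-based variant mentioned above, using the same lookup output references as the truncate version, might look like this (use with caution, since dropping the table also removes its indexes, constraints, and permissions):&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '@{activity('Full Load - Lookup Metadata').output.firstRow.TargetSchema}' AND TABLE_NAME =  '@{activity('Full Load - Lookup Metadata').output.firstRow.TargetTable}' )
BEGIN
DROP TABLE [@{activity('Full Load - Lookup Metadata').output.firstRow.TargetSchema}].[@{activity('Full Load - Lookup Metadata').output.firstRow.TargetTable}]
END
&lt;/LI-CODE&gt;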
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;Incremental Load Pattern/Queries:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="4"&gt;The incremental processing load is going to be very similar the full reload processing method, with differences being in the filtering of lookup activities, additional parameters, and different methods to perform the loading inside the processing pipeline.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Lookup Activity - Incremental Load&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This lookup activity uses the exact same dataset configuration as the full reload version, with the minor change of 'Incremental' being hard coded as the WHERE clause 'LoadType' filter instead of 'Full'. The parameters still use the pipeline parameters to connect to the metadata table.&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT LoadType, TargetTable 
FROM 
	meta.ADLS_Metadata
WHERE 
	LoadIndicator = '@{pipeline().parameters.loadIndicator}'
	AND LoadType = 'Incremental'&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;For Each Loop - Incremental Load&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The for each loop will use the output from the incremental reload lookup activity in the 'Items'.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@activity('Incremental Reload - Lookup Metadata').output.value&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Inside the for each loop, there is an execute pipeline activity which calls the incremental reload processing pipeline (child). There is one extra parameter that was not used in the full reload processing pipeline execution - 'waterfallLookbackDays' (a pipeline parameter), passed through as shown below.&lt;/P&gt;
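&lt;P&gt;Mirroring the other parameter mappings, the dynamic content for the extra 'waterfallLookbackDays' parameter simply passes the parent pipeline parameter through:&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@pipeline().parameters.waterfallLookbackDays&lt;/LI-CODE&gt;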
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="5"&gt;&lt;STRONG&gt;Incremental Reload - Processing Pipeline (child)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;The processing pipeline called "adventureWorks_IncrementalLoad" is executed from the parent pipeline with the pipeline parameters being passed from parent to child. Because this pipeline is called within a for each loop, each table that is being iterated will be have their own execution from this pipeline. The overall purpose and design of this pipeline is (differs slightly from the full reload):&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Extract source data&lt;/LI&gt;
&lt;LI&gt;Load source data to ADLS Gen 2 storage&lt;/LI&gt;
&lt;LI&gt;Full reload of data from ADLS Gen 2 to a&amp;nbsp;&lt;EM&gt;staging table&lt;/EM&gt; in Azure SQL DB&lt;/LI&gt;
&lt;LI&gt;Dynamic delete and insert from staging table into production table within Azure SQL DB&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Lookup Activity:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This will be the same as the full reload version of the query and the dataset configuration.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;SELECT * 
FROM 
	meta.ADLS_Metadata
WHERE 
	TargetTable = '@{pipeline().parameters.targetTable}'
	AND LoadIndicator = '@{pipeline().parameters.loadIndicator}'&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Copy Data to ADLS Gen 2:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This activity is where the main differences between the full reload and the incremental load start.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the 'Source' - instead of the table, a query will be used. This query utilizes the 'WaterfallColumn' value from the metadata table lookup and the 'waterfallLookbackDays' pipeline parameter to filter the results to only that time period. This allows the query to be built dynamically for each table and return a subset of the source table, regardless of which source or table is being processed.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;Query using dynamic content&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;@concat(
    'SELECT * 
     FROM ', '[', activity('Incremental Load - Lookup Metadata').output.firstRow.SourceSchema, '].[', activity('Incremental Load - Lookup Metadata').output.firstRow.SourceTable, '] ',
     'WHERE ', 'CONVERT(DATE, ', activity('Incremental Load - Lookup Metadata').output.firstRow.WaterfallColumn, ') &amp;gt;= DATEADD(DAY,', pipeline().parameters.waterfallLookbackDays, ', GETDATE())'  )&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;If the source is something other than SQL Server, you can adjust the dynamic query to match the syntax of the source environment, such as Oracle, MySQL, etc.&amp;nbsp;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the 'Sink' - the configuration will be the same as the full reload. The dataset parameters will come from the lookup activity output on the same dataset.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@activity('Incremental Load - Lookup Metadata').output.firstRow.StorageAccount&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Copy Data from ADLS Gen 2 to Stage Table in Azure SQL DB:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The next step is to load the data from ADLS Gen 2 into a staging table to prepare for the incremental processing. The staging table allows temporary data to be stored, leverages the full compute power of the Azure SQL DB, and provides more control over the processing.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In both 'Source' and 'Sink' - the configurations are the same as in the full reload version, with the only difference being that the parameters point to the staging table referenced in the metadata table instead of the final version of the table. This table is in a different schema and has a 'STAGE_' prefix on the table name. In the 'Sink', the process of truncating the table before the copy, fully loading it, and/or auto creating tables that don't exist is the same.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@activity('Incremental Load - Lookup Metadata').output.firstRow.FileName&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Dataset parameter example for 'tableName' pointing to the 'StagingTable'&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@activity('Incremental Load - Lookup Metadata').output.firstRow.StagingTable&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Pre-copy script - referencing the staging tables&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '@{activity('Incremental Load - Lookup Metadata').output.firstRow.StagingSchema}' AND TABLE_NAME =  '@{activity('Incremental Load - Lookup Metadata').output.firstRow.StagingTable}' )
BEGIN
TRUNCATE TABLE [@{activity('Incremental Load - Lookup Metadata').output.firstRow.StagingSchema}].[@{activity('Incremental Load - Lookup Metadata').output.firstRow.StagingTable}]
END
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Dynamic Delete and Insert Script&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This step is what actually performs the incremental processing of the tables. It deletes from the production table any rows that also exist in the staging table (these are the rows whose data has changed and needs to be updated or inserted), and then inserts the staging table rows into the production table. The script is provided and explained below.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is a 'Script' activity with the linked service parameters pointing to the location of the target tables and metadata table. There is also an input script parameter 'FileName' that uses the 'FileName' value from the lookup activity. The option 'NonQuery' is selected since this script performs data modification statements and does not return a result set. If a result set were being returned, 'Query' would be selected instead.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;FileName parameter dynamic content&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@activity('Incremental Load - Lookup Metadata').output.firstRow.FileName&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Dynamic Delete and Insert SQL Script&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;--DECLARE @FileName VARCHAR (500) -- Manual Runs inside procedure will toggle this for troublshooting
DECLARE @TargetTable VARCHAR (500)
DECLARE @StagingTable VARCHAR (500)
DECLARE @WhereClause VARCHAR(MAX) 
DECLARE @StagingSchema VARCHAR (50)
DECLARE @TargetSchema VARCHAR (50)
DECLARE @FullStagingTableName VARCHAR (500)
DECLARE @FullTargetTableName VARCHAR (500)
DECLARE @TargetTableColumnList NVARCHAR(MAX)
DECLARE @DeleteStatementSQL NVARCHAR (MAX)
DECLARE @InsertStatementSQL NVARCHAR (MAX)
DECLARE @StatisticsUpdateSQL NVARCHAR (MAX)


--SET @FileName = 'SalesOrderHeader.parquet' -- Manual runs inside the procedure will toggle this for troubleshooting
SET @TargetTable = (SELECT TargetTable FROM meta.ADLS_Metadata WHERE FileName = @FileName)
SET @TargetSchema = (SELECT TargetSchema FROM meta.ADLS_Metadata WHERE FileName = @FileName)
SET @StagingTable = (SELECT StagingTable FROM meta.ADLS_Metadata WHERE FileName = @FileName)
SET @StagingSchema = (SELECT StagingSchema FROM meta.ADLS_Metadata WHERE FileName = @FileName)
SET @FullStagingTableName = CONCAT(@StagingSchema, '.', @StagingTable)
SET @FullTargetTableName = CONCAT(@TargetSchema, '.', @TargetTable)
SET @TargetTableColumnList = (	SELECT 
									ColumnList = STRING_AGG('[' + col.NAME + ']', ',' )
								FROM
									sys.tables tab
										LEFT JOIN 
									sys.schemas sch
										ON tab.schema_id = sch.schema_id
										LEFT JOIN 
									sys.columns col
										ON tab.object_id = col.object_id
								WHERE 
									sch.name = @TargetSchema
									AND tab.name = @TargetTable
									AND col.is_identity = 0
							)
 ;

WITH PrimaryKeyList AS (
						SELECT 
							ColumnKey = RTRIM(LTRIM(Value)),
							RowNumber = ROW_NUMBER () OVER (ORDER BY value ASC)

						FROM
							meta.ADLS_Metadata
								CROSS APPLY 
							STRING_SPLIT( ColumnKey, ',')
						WHERE 
							FileName = @FileName
						)
 
 /******* Section for single primary key OR Keys that do not need to be concated to be uniquely identified *********************/
        SELECT
            @WhereClause =   STRING_AGG(CASE 
                                            WHEN E.ColumnKey IS NOT NULL THEN CONCAT( Beg.ColumnKey,' IN (SELECT ', Beg.ColumnKey, ' FROM ', @FullStagingTableName, ') AND')
                                            ELSE CONCAT( Beg.ColumnKey,' IN (SELECT ', Beg.ColumnKey, ' FROM ', @FullStagingTableName, ')' )
                                        END, ' ')
        FROM 
            PrimaryKeyList Beg
                LEFT JOIN
            PrimaryKeyList E
                ON Beg.Rownumber = E.Rownumber - 1 
                ;
/***************************************************************************************************************************************/

/************************* Section used to concat a composite key and create the unique identifier during the load process if it does not exist in the source tables *******************
SELECT
    @WhereClause = CONCAT(	'CONCAT(', 
							STRING_AGG(CASE 
											WHEN E.ColumnKey IS NOT NULL THEN  Beg.ColumnKey
											ELSE CONCAT(Beg.ColumnKey, ') ')
										END, ', '
										),
							'IN (SELECT CONCAT(', 
							STRING_AGG(CASE 
											WHEN E.ColumnKey IS NOT NULL THEN  Beg.ColumnKey
											ELSE CONCAT(Beg.ColumnKey, ') ')
										END, ', '
										),
							'FROM ', @FullStagingTableName, ')'
						)
FROM 
    PrimaryKeyList Beg
        LEFT JOIN
    PrimaryKeyList E
        ON Beg.Rownumber = E.Rownumber - 1 
        ;
 
 *********************************************************************************************************************************************************/
 
SELECT
    @DeleteStatementSQL = CONCAT('DELETE FROM ', @FullTargetTableName, ' WHERE ', @WhereClause) ;
 
SELECT 
    @InsertStatementSQL = CONCAT('INSERT INTO ', @FullTargetTableName, ' (', @TargetTableColumnList, ') ', ' SELECT ', @TargetTableColumnList, ' FROM ', @FullStagingTableName)
 
--SELECT 
--	@StatisticsUpdateSQL = CONCAT('UPDATE STATISTICS ', @FullTargetTableName) 

--PRINT @DeleteStatementSQL
--PRINT @InsertStatementSQL
--PRINT @StatisticsUpdateSQL
 
EXECUTE sp_executesql @DeleteStatementSQL ; 

EXECUTE sp_executesql @InsertStatementSQL ;

--EXECUTE sp_executesql @StatisticsUpdateSQL ; Used in Dedicated SQL Pool to update statistics once tables have been loaded&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;See examples of the different steps of the script below for the table 'SalesOrderHeader'&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Metadata table results for the staging table, target table, and the key columns for the target table 'SalesOrderHeader'. You will notice that this table has multiple key columns that together uniquely identify a record. The script handles both multiple key columns and a single key column, in a method shown later.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First, the variables are built. One important variable is&amp;nbsp;@TargetTableColumnList, which compiles a comma-separated list of the target table columns from the system tables. You will not need to maintain the column list yourself, since the script compiles it from the system tables and excludes identity columns, as these are not updated or inserted. If identity values do need to be loaded, logic can be added to turn identity insert on and off in the script, as sketched below.&amp;nbsp;&lt;/P&gt;
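&lt;P&gt;A minimal sketch of that optional identity-insert handling (an assumed variation on top of the script above, not part of it) could wrap the generated insert statement like this:&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;-- Hypothetical variation: preserve identity values by toggling IDENTITY_INSERT around the insert.
-- @FullTargetTableName, @TargetTableColumnList, and @FullStagingTableName are built earlier in the script.
-- Note: in this variation the column list must also include the identity column (remove the is_identity filter).
SELECT
    @InsertStatementSQL = CONCAT('SET IDENTITY_INSERT ', @FullTargetTableName, ' ON; ',
                                 'INSERT INTO ', @FullTargetTableName, ' (', @TargetTableColumnList, ') ',
                                 'SELECT ', @TargetTableColumnList, ' FROM ', @FullStagingTableName, '; ',
                                 'SET IDENTITY_INSERT ', @FullTargetTableName, ' OFF;') ;&lt;/LI-CODE&gt;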
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The next step is to build the WHERE clause of our delete statement. This is done by using the column keys and splitting them out into different predicates. Executing the code down to the&amp;nbsp;@WhereClause creation will produce this output.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@WhereClause = rowguid IN (SELECT rowguid FROM stage.STAGE_SalesOrderHeader) AND SalesOrderNumber IN (SELECT SalesOrderNumber FROM stage.STAGE_SalesOrderHeader) AND SalesOrderID IN (SELECT SalesOrderID FROM stage.STAGE_SalesOrderHeader)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There is a commented-out section for handling composite keys that cannot be evaluated with each column key in its own predicate. In that scenario, the key values are concatenated instead. The commented-out composite key section produces the result below for the same table/key combination.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@WhereClause = CONCAT(rowguid, SalesOrderNumber, SalesOrderID) IN (SELECT CONCAT(rowguid, SalesOrderNumber, SalesOrderID) FROM stage.STAGE_SalesOrderHeader)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Next, the delete and insert statements are created using the dynamic SQL in the script and previous steps. Here are the outputs from our example.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Delete statement&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;DELETE FROM salesLT.SalesOrderHeader WHERE rowguid IN (SELECT rowguid FROM stage.STAGE_SalesOrderHeader) AND SalesOrderNumber IN (SELECT SalesOrderNumber FROM stage.STAGE_SalesOrderHeader) AND SalesOrderID IN (SELECT SalesOrderID FROM stage.STAGE_SalesOrderHeader)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Insert statement&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="sql"&gt;INSERT INTO salesLT.SalesOrderHeader ([SalesOrderID],[RevisionNumber],[OrderDate],[DueDate],[ShipDate],[Status],[OnlineOrderFlag],[SalesOrderNumber],[PurchaseOrderNumber],[AccountNumber],[CustomerID],[ShipToAddressID],[BillToAddressID],[ShipMethod],[CreditCardApprovalCode],[SubTotal],[TaxAmt],[Freight],[TotalDue],[Comment],[rowguid],[ModifiedDate])  SELECT [SalesOrderID],[RevisionNumber],[OrderDate],[DueDate],[ShipDate],[Status],[OnlineOrderFlag],[SalesOrderNumber],[PurchaseOrderNumber],[AccountNumber],[CustomerID],[ShipToAddressID],[BillToAddressID],[ShipMethod],[CreditCardApprovalCode],[SubTotal],[TaxAmt],[Freight],[TotalDue],[Comment],[rowguid],[ModifiedDate] FROM stage.STAGE_SalesOrderHeader&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finally, those statements are passed into&amp;nbsp;&lt;STRONG&gt;sp_executesql&amp;nbsp;&lt;/STRONG&gt;to be executed.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size="7"&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;Summary&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;The template and scripts allow you to build a dynamic, metadata driven ETL process at enterprise scale with as few as 3 pipelines facilitating 'N' number of tables. This metadata driven approach is highly flexible and scalable, allowing you to build upon this solution and tailor it to your exact needs. Even if your requirements or change tracking logic are more complex than waterfall columns or composite keys, you can still add that logic into this process to handle your ETL needs.&lt;/P&gt;
      <pubDate>Thu, 28 Sep 2023 14:35:51 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/metadata-driven-pipelines-for-dynamic-full-and-incremental/ba-p/3925362</guid>
      <dc:creator>Marc_Bushong</dc:creator>
      <dc:date>2023-09-28T14:35:51Z</dc:date>
    </item>
    <item>
      <title>Integer Type Available for Pipeline Variables</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/integer-type-available-for-pipeline-variables/ba-p/3902472</link>
      <description>&lt;P&gt;Today, we are announcing the support for Integer type for pipeline variables. This feature is quite self-explanatory: you can define a pipeline variable as integer, and use all the arithmetic functions with it, without converting it back to string type anymore.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This significantly simplifies the workflow if you are using an&amp;nbsp;&lt;SPAN&gt;iterator within an&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;STRONG&gt;Until&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;or &lt;/SPAN&gt;&lt;STRONG&gt;ForEach&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;activity.&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN&gt;Please note that in a &lt;/SPAN&gt;&lt;SPAN&gt;&lt;STRONG&gt;Set variable&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;activity, you can't reference the variable being set in the &lt;/SPAN&gt;&lt;EM&gt;value&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;field, i.e.,&amp;nbsp;no self-referencing. To work around this limitation, set a temporary variable and then create a second &lt;/SPAN&gt;&lt;SPAN&gt;&lt;STRONG&gt;Set variable&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;activity. The second &lt;STRONG&gt;Set variable&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;activity sets the value of the iterator to the temporary variable.&lt;/SPAN&gt;&lt;/P&gt;
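&lt;P&gt;As a minimal sketch of that workaround (the variable names "counter" and "tempCounter" are examples only), the first &lt;STRONG&gt;Set variable&lt;/STRONG&gt; activity assigns the incremented value to the temporary variable:&lt;/P&gt;
&lt;LI-CODE lang="applescript"&gt;@add(variables('counter'), 1)&lt;/LI-CODE&gt;
&lt;P&gt;and the second &lt;STRONG&gt;Set variable&lt;/STRONG&gt; activity then assigns &lt;EM&gt;@variables('tempCounter')&lt;/EM&gt; back to "counter".&lt;/P&gt;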
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;img /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Please be aware that variables are scoped at the pipeline level. This means that they're&lt;STRONG&gt; not thread safe&lt;/STRONG&gt; and may cause unexpected and undesired behavior if they're used along with parallel iteration. Particularly, please be very careful when the value is also being modified within that ForEach activity.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;We hope that you found this helpful! Let us know in the comments if you have any questions or feedback!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2023 21:10:30 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/integer-type-available-for-pipeline-variables/ba-p/3902472</guid>
      <dc:creator>ChenyeCharlieZhu</dc:creator>
      <dc:date>2023-08-17T21:10:30Z</dc:date>
    </item>
    <item>
      <title>Documentation search now embedded in Azure Data Factory</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/documentation-search-now-embedded-in-azure-data-factory/ba-p/3873890</link>
      <description>&lt;P&gt;In the Data Factory team, we are always looking for ways to make the life of the data engineer as easy as possible!&amp;nbsp; To help with easily and quickly finding answers to your questions in Azure Data Factory (ADF), we've incorporated documentation search to our ADF search bar.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;Without ever leaving your ADF design environment, we'll bring the related documentation to your searches so you can quickly find the answers you are looking for!&lt;/P&gt;
      <pubDate>Thu, 13 Jul 2023 21:40:57 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/documentation-search-now-embedded-in-azure-data-factory/ba-p/3873890</guid>
      <dc:creator>Mark Kromer</dc:creator>
      <dc:date>2023-07-13T21:40:57Z</dc:date>
    </item>
    <item>
      <title>Comment Out Part of Pipeline</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/comment-out-part-of-pipeline/ba-p/3868069</link>
      <description>&lt;P&gt;To kick start the second half of 2023, ADF team has brought you major improvements in pipeline development and authoring experience. Specifically, we now allow you to comment out part of your pipeline, without deleting the definition.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Introducing &lt;STRONG&gt;Deactivating and Reactivating Activities&lt;/STRONG&gt;. Deactivate one or more activities from a pipeline, and we skip them during validation and during pipeline run. And you may choose to reactivate these activities at a later time.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;H2&gt;Behaviors&amp;nbsp;&amp;nbsp;&lt;/H2&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;An inactive activity behaves differently in a pipeline. &lt;/SPAN&gt;&lt;/DIV&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;SPAN&gt;On canvas, the inactive activity is grayed out, with an &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Inactive&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt; sign placed next to the activity type.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;On canvas, a status sign (Succeeded, Failed or Skipped) is placed on the box, to visualize the &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Mark activity as&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;setting.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;The activity is excluded from pipeline validation. Hence, you don't need to provide all required fields for an inactive activity.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;During debug run and pipeline run, the activity won't actually execute. Instead, it runs a placeholder line item, with the reserved status &lt;/SPAN&gt;&lt;EM&gt;&lt;STRONG&gt;Inactive&lt;/STRONG&gt;&lt;/EM&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;The branching option is controlled by &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Mark activity as&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;option. In other words:&lt;/SPAN&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;if you mark the activity as &lt;/SPAN&gt;&lt;EM&gt;Succeeded&lt;/EM&gt;&lt;SPAN&gt;, the &lt;/SPAN&gt;&lt;EM&gt;UponSuccess&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;or &lt;/SPAN&gt;&lt;EM&gt;UponCompletion&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;branch runs.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt; if you mark the activity as &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Failed&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;, the &lt;/SPAN&gt;&lt;EM&gt;&lt;SPAN&gt;UponFailure&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;or &lt;/SPAN&gt;&lt;EM&gt;&lt;SPAN&gt;UponCompletion&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;branch runs&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;if you mark the activity as &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Skipped&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;, the &lt;/SPAN&gt;&lt;EM&gt;UponSkip&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;branch runs&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;An inactive activity never actually runs. This means the activity won't have an error field, or its typical output fields. Any references to missing fields will throw errors downstream.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;SPAN&gt;&lt;img /&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Set Up&lt;/H2&gt;
&lt;P&gt;There are 2 ways to deactivate an activity.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;First, you may deactivate a single activity from its &lt;/SPAN&gt;&lt;STRONG&gt;General&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;tab. &lt;/SPAN&gt;&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;Select the activity you want to deactivate&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Under &lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN&gt;General&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN&gt; tab, select &lt;/SPAN&gt;&lt;EM&gt;&lt;SPAN&gt;Inactive&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;for &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Activity state&lt;/EM&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Pick a state for &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Mark activity as&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;. Choose from &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Succeeded&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Failed&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;or &lt;/SPAN&gt;&lt;EM&gt;&lt;SPAN&gt;Skipped&lt;/SPAN&gt;&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;&lt;SPAN&gt;&lt;img /&gt;&lt;/SPAN&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Alternatively, you&amp;nbsp;&lt;/SPAN&gt;can deactivate multiple activities with right click.&lt;/P&gt;
&lt;DIV&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;Press down &lt;/SPAN&gt;&lt;EM&gt;Ctrl&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;key to multi-select. Using your mouse, left click on all activities you want to&amp;nbsp;&lt;/SPAN&gt;deactivate.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Right click to bring up the drop down&amp;nbsp;&lt;/SPAN&gt;menu.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Select &lt;/SPAN&gt;&lt;EM&gt;&lt;SPAN&gt;Deactivate&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;to deactivate them&amp;nbsp;&lt;/SPAN&gt;all.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;To fine tune the settings for&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;Mark activity as&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt;, go to &lt;/SPAN&gt;&lt;STRONG&gt;General&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;tab of the activity, and make appropriate&amp;nbsp;&lt;/SPAN&gt;changes.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In both cases, you do need to deploy the changes to deactivate the parts during pipeline run.&lt;/P&gt;
&lt;H2&gt;Use Cases&lt;/H2&gt;
&lt;/DIV&gt;
&lt;P&gt;Deactivation is a powerful tool for pipeline developers. It allows developers to "comment out" part of the pipeline without permanently deleting the activities. It shines in the following scenarios:&lt;/P&gt;
&lt;DIV&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;STRONG&gt;When developing a pipeline&lt;/STRONG&gt;, a developer can add placeholder inactive activities before filling in all the required fields. For instance, I need a Copy activity from SQL Server to the data warehouse, but I haven't set up all the connections yet. So I use an &lt;/SPAN&gt;&lt;SPAN&gt;&lt;EM&gt;inactive&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN&gt; copy activity as the placeholder for the iterative development process.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;STRONG&gt;After deployment&lt;/STRONG&gt;, developer can comment out certain activities that are constantly causing troubles to avoid costly retries. For instance, my on-premises SQL server is having network connection issues, and I know my copy activities fail for certain. I may want to deactivate the copy activity, to avoid retry requests from flooding the brittle system.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Fri, 07 Jul 2023 20:46:44 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/comment-out-part-of-pipeline/ba-p/3868069</guid>
      <dc:creator>ChenyeCharlieZhu</dc:creator>
      <dc:date>2023-07-07T20:46:44Z</dc:date>
    </item>
    <item>
      <title>Continued region expansion: Azure Data Factory just became generally available in Sweden Central</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-just-became/ba-p/3857249</link>
      <description>&lt;DIV&gt;&lt;SPAN&gt;Azure Data Factory is now available in &lt;STRONG&gt;Sweden Central&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;SPAN&gt;&lt;SPAN&gt;You can now provision Data Factory in the new region in order to co-locate your Extract-Transform-Load logic,&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;if you are utilizing the region for storing and managing your modern data warehouse.
&lt;P style="margin: 0in; font-family: Calibri; font-size: 10.5pt; color: #242424;" lang="en-US"&gt;&amp;nbsp;&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;See the full set of&amp;nbsp;&lt;A title="Azure Data Factory supported regions" href="https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?products=data-factory" target="_self"&gt;Azure Data Factory supported regions&lt;/A&gt;.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 30 Jun 2023 05:37:06 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/continued-region-expansion-azure-data-factory-just-became/ba-p/3857249</guid>
      <dc:creator>Chunhua</dc:creator>
      <dc:date>2023-06-30T05:37:06Z</dc:date>
    </item>
    <item>
      <title>Securing outbound traffic with Azure Data Factory's outbound network rules</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/securing-outbound-traffic-with-azure-data-factory-s-outbound/ba-p/3844032</link>
      <description>&lt;P&gt;Data security is paramount in today's digital world. With an increasing number of cyber threats, organizations are always on the lookout for robust solutions to enhance their security posture. In this blog, we delve into a critical feature provided by Azure Data Factory – Outbound Rules – that allows users to control and restrict outbound traffic to specific Fully Qualified Domain Names (FQDN).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Understanding Outbound Allow listing in Azure Data Factory&lt;/H2&gt;
&lt;P&gt;Outbound allow listing of FQDN is a network security practice that allows organizations to control outbound traffic from their networks to specific, approved domain names. Outbound rules in Azure Data Factory apply to pipeline activities, such as Copy, Dataflows, Web, Webhook, and Azure Function activities and authoring scenarios like data preview and test connection.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: &lt;/EM&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;This feature is in &lt;U&gt;Preview&lt;/U&gt;. &lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;SSIS Integration runtime and Managed Airflow Integration runtime currently do not support the outbound rules. &lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;This feature is independent of Managed VNet and applies to all supported activities running on SHIR, Azure IR (including AutoResolve IR), and Azure IR in Managed VNet. However, we suggest using Managed VNet for higher levels of compute isolation in conjunction with outbound allowlist capability to prevent data exfiltration.&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These rules help organizations create a secure and exfiltration-proof data integration solution. What's more, &lt;A href="https://learn.microsoft.com/en-us/azure/governance/policy/overview" target="_blank" rel="noopener"&gt;Azure Policy&lt;/A&gt; enforces these rules, thereby boosting governance.&lt;/P&gt;
&lt;P&gt;As it uses Azure Policy, these outbound rules can be enforced at different management levels based on the organization’s needs.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/governance/management-groups/overview" target="_blank" rel="noopener"&gt;Management Group&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Subscription&lt;/LI&gt;
&lt;LI&gt;Resource Group&lt;/LI&gt;
&lt;LI&gt;Resource (UI within Data Factory for this assignment is coming soon, but you can use REST API/ SDK to achieve this today)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: While in preview, the compliance for this policy is not reported&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Steps to enable Azure Policy for outbound rules&lt;/H2&gt;
&lt;OL&gt;
&lt;LI&gt;Assign the outbound &lt;A href="https://portal.azure.com/#view/Microsoft_Azure_Policy/PolicyDetailBlade/definitionId/%2Fproviders%2FMicrosoft.Authorization%2FpolicyDefinitions%2F3d02a511-74e5-4dab-a5fd-878704d4a61a" target="_blank" rel="noopener"&gt;Policy&lt;/A&gt; with the desired scope.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;Configure the parameters of the policy specifying the allowed domain names, then create the policy.&lt;BR /&gt;Note: Regex is not supported, hence the domains should be exactly the same as those used in the linked services. To update the outbound URL list, update the policy parameter.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="3"&gt;
&lt;LI&gt;Enable the feature in ADF studio. &lt;BR /&gt;&lt;img /&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The Outbound Rules feature in Azure Data Factory allows organizations to exercise granular control over outbound traffic, thereby strengthening network security during data integration. By integrating with Azure Policy, this feature also improves overall governance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Resources:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/configure-outbound-allow-list-azure-policy" target="_blank" rel="noopener"&gt;Outbound allow list documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/governance/policy/overview" target="_blank" rel="noopener"&gt;Azure Policy Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://portal.azure.com/#view/Microsoft_Azure_Policy/PolicyDetailBlade/definitionId/%2Fproviders%2FMicrosoft.Authorization%2FpolicyDefinitions%2F3d02a511-74e5-4dab-a5fd-878704d4a61a" target="_blank" rel="noopener"&gt;New Azure Policy for outbound allow listing in ADF&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN&gt;If you have any questions or feedback, please post them in the comments below.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 12 Jun 2023 19:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/securing-outbound-traffic-with-azure-data-factory-s-outbound/ba-p/3844032</guid>
      <dc:creator>Abhishek Narain</dc:creator>
      <dc:date>2023-06-12T19:00:00Z</dc:date>
    </item>
    <item>
      <title>Introducing optional Source settings for DelimitedText and JSON sources for top-level CDC resource</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/introducing-optional-source-settings-for-delimitedtext-and-json/ba-p/3824274</link>
      <description>&lt;P&gt;In January, we announced the &lt;A href="https://techcommunity.microsoft.com/t5/azure-data-factory-blog/announcing-the-public-preview-of-a-new-top-level-cdc-resource-in/ba-p/3720519" target="_blank" rel="noopener"&gt;public preview of top-level CDC resource&lt;/A&gt; in ADF and followed up with &lt;A href="https://techcommunity.microsoft.com/t5/azure-data-factory-blog/process-your-data-in-seconds-with-new-adf-real-time-cdc/ba-p/3759131" target="_blank" rel="noopener"&gt;real-time latency support&lt;/A&gt; for top-level CDC resource around March. Based on your feedback, we are now enabling additional source configurations for Delimited Text sources and JSON sources, which can be set optionally within a top-level CDC resource in ADF.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Previously, Delimited Text sources supported only Comma as the column delimiter. Now, when you select a Delimited Text source for your CDC resource, you can set advanced source configurations, including Compression type, Encoding, Column delimiter, Row delimiter, Quote character, Escape character, and First row as header.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For JSON sources, the document type was previously fixed to Document per line; we now support Single document, Document per line, and Array of documents. Additional settings for Unquoted column name, Has comments, Single quoted, and Backslash escaped have been added as well.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;NOTE:&lt;/STRONG&gt;&lt;/U&gt; These source settings are optional; if not manually edited, they are set to their default values.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
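&lt;P&gt;For context, the Delimited Text options above mirror the format settings available on a standard ADF DelimitedText dataset definition. The snippet below is an illustrative sketch of those settings in dataset JSON, not the CDC resource’s own configuration (which is set through the UI); all values are examples.&lt;/P&gt;
&lt;PRE&gt;{
  "type": "DelimitedText",
  "typeProperties": {
    "columnDelimiter": ";",
    "rowDelimiter": "\n",
    "quoteChar": "\"",
    "escapeChar": "\\",
    "firstRowAsHeader": true,
    "encodingName": "UTF-8",
    "compressionCodec": "gzip"
  }
}&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;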
&lt;P&gt;As we continue to add more features within top-level ADF CDC, we hope these optional source settings for DelimitedText and JSON sources help. Please continue to share your feedback in the comments!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 02 Jun 2023 13:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/introducing-optional-source-settings-for-delimitedtext-and-json/ba-p/3824274</guid>
      <dc:creator>Krishnakumar_Rukmangathan</dc:creator>
      <dc:date>2023-06-02T13:00:00Z</dc:date>
    </item>
    <item>
      <title>Unroll multiple arrays in a single Flatten step in ADF</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/unroll-multiple-arrays-in-a-single-flatten-step-in-adf/ba-p/3802457</link>
      <description>&lt;P&gt;You can now easily unroll multiple arrays inside a single Flatten transformation in Azure Data Factory and Azure Synapse Analytics using a data pipeline with a Mapping Data Flow.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;ADF and Synapse data flows provide a &lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten" target="_self"&gt;Flatten transformation&lt;/A&gt; to make it easy to unroll an array as part of your data transformation pipelines. We've updated the Flatten transformation to allow multiple arrays to be unrolled in a single transformation step, which makes your ETL jobs simpler with fewer transformation steps.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There is now a plus (+) next to the Unroll Array property where you can add more arrays to your list to unroll. You can also use ADF's meta functions like name and type to find arrays to unroll in your data using patterns. The resulting data will be joined together as a single result set as shown below.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
      <pubDate>Fri, 21 Apr 2023 19:48:31 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/unroll-multiple-arrays-in-a-single-flatten-step-in-adf/ba-p/3802457</guid>
      <dc:creator>Mark Kromer</dc:creator>
      <dc:date>2023-04-21T19:48:31Z</dc:date>
    </item>
    <item>
      <title>ADF private DNS zone overrides ARM DNS resolution causing ‘Not found’ error.</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/adf-private-dns-zone-overrides-arm-dns-resolution-causing-not/ba-p/3799421</link>
      <description>&lt;P&gt;##Steps to Migrate:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Navigate to the existing Private DNS zone privatelink.adf.azure.com
&lt;OL type="a"&gt;
&lt;LI&gt;Go to portal.azure.com&lt;/LI&gt;
&lt;LI&gt;Type ‘private DNS zones’ in the search bar and click on the option&lt;/LI&gt;
&lt;LI&gt;Click on the privatelink.adf.azure.com private zone&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Get the private IP of the existing private endpoint and delete the private zone&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; In the overview blade (default) you’ll see a table with the DNS records&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Look for the one with the name ‘adf’ with Type ‘A’ and write down the IP under ‘Value’ for the next steps.&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Click on ‘Virtual network links on the left panel, write down all the Virtual Networks for the next steps and then delete all the virtual network links&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Go back go ‘overview’ and delete a private zone&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="3"&gt;
&lt;LI&gt;Create a new Private DNS zone with the name ‘privatelink.adf.azure.com’
&lt;OL type="a"&gt;
&lt;LI&gt;In the main Private DNS zones page, click on ‘Add’ on the toolbar&lt;/LI&gt;
&lt;LI&gt;Select the subscription and resource group and add ‘privatelink.adf.azure.com’ as the name&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;OL start="4"&gt;
&lt;LI&gt;&amp;nbsp;Add Virtual network links and DNS ‘A’ record&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;/LI&gt;
&lt;/OL&gt;
&lt;UL&gt;
&lt;LI&gt;In the privatelink.adf.azure.com private zone, click on ‘Virtual network links’ on the left panel and then add a network link for each of the virtual networks from step 2c&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;OL start="5"&gt;
&lt;LI&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Add DNS ‘A’ record&lt;/LI&gt;
&lt;/OL&gt;
&lt;UL&gt;
&lt;LI&gt;Go back to the overview panel and click on ‘+ Record set’, type ‘adf’ as the name, TTL: 10, TTL unit: seconds, and enter the IP from step 2b&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
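&lt;P&gt;If you would rather script the recreation than use the portal, the following is a minimal sketch of steps 3 to 5 using the Az.PrivateDns PowerShell module. The resource group name, virtual network ID, and IP address are placeholders; substitute the IP you noted in step 2b and the virtual networks from step 2c.&lt;/P&gt;
&lt;PRE&gt;# Minimal sketch (Az.PrivateDns module); names, IDs, and the IP below are placeholders.
$rg   = 'my-networking-rg'
$zone = 'privatelink.adf.azure.com'

# Step 3: recreate the private DNS zone
New-AzPrivateDnsZone -ResourceGroupName $rg -Name $zone

# Step 4: re-link each virtual network recorded in step 2c
New-AzPrivateDnsVirtualNetworkLink -ResourceGroupName $rg -ZoneName $zone `
    -Name 'adf-vnet-link' `
    -VirtualNetworkId '/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-rg/providers/Microsoft.Network/virtualNetworks/my-vnet'

# Step 5: add the 'adf' A record pointing at the private endpoint IP from step 2b
New-AzPrivateDnsRecordSet -ResourceGroupName $rg -ZoneName $zone -Name 'adf' `
    -RecordType A -Ttl 10 `
    -PrivateDnsRecords (New-AzPrivateDnsRecordConfig -Ipv4Address '10.0.0.5')&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;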
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
      <pubDate>Wed, 19 Apr 2023 19:36:38 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/adf-private-dns-zone-overrides-arm-dns-resolution-causing-not/ba-p/3799421</guid>
      <dc:creator>Sachin215</dc:creator>
      <dc:date>2023-04-19T19:36:38Z</dc:date>
    </item>
    <item>
      <title>Trigger ADF pipeline using Storage event trigger over private network.</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/trigger-adf-pipeline-using-storage-event-trigger-over-private/ba-p/3799326</link>
      <description>&lt;P&gt;Project Technology: Azure function, ADF, Azure Synapse, ADLS&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Issue description: The customer has a strict regulatory compliance requirement that requires blocking all outbound (public endpoint) connections. As a result, most of our products could not deliver the expected results because they depend on public endpoints.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Summary:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;The customer was not able to download PowerShell modules from the PowerShell Gallery in the Azure function due to the outbound restrictions. We suggested manually downloading the modules and uploading them to the Azure function via VS Code, but that did not work either.&lt;/LI&gt;
&lt;LI&gt;As a result, we used the PowerShell command line from the user's desktop instead of VS Code.&lt;/LI&gt;
&lt;LI&gt;When we then tried to access ADF from the Azure function, it failed.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="4"&gt;
&lt;LI&gt;To validate the access token, we ran the MSI validator, which returned the error below.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL start="5"&gt;
&lt;LI&gt;The reason for all these errors was that Az PowerShell was trying to connect to management.azure.com to obtain the OAuth2 token. Per the bank’s regulations, access to any public endpoint was not allowed, which stalled the project.&lt;/LI&gt;
&lt;LI&gt;To work around this and access the storage behind the firewall/private endpoints/Private Link, we proposed using a managed identity with the REST API, which let the function obtain a bearer token without public endpoint access; the REST API then used that bearer token to access Azure Storage.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Code if customer is using system assigned managed identity. &lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;# Request a token for the storage account from the App Service managed identity endpoint (system-assigned identity)
$resourceURI = "https://functeststorageacc01.queue.core.windows.net/"
$tokenAuthURI = $env:IDENTITY_ENDPOINT + "?resource=$resourceURI&amp;amp;api-version=2019-08-01"
$tokenResponse = Invoke-RestMethod -Method Get -Headers @{"X-IDENTITY-HEADER" = "$env:IDENTITY_HEADER"} -Uri $tokenAuthURI
$accessToken = $tokenResponse.access_token&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Code if customer is using user assigned managed identity. &lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;# Request a token using a user-assigned managed identity (pass its client ID explicitly)
$resourceURI = "https://functeststorageacc01.queue.core.windows.net/"
$tokenAuthURI = $env:IDENTITY_ENDPOINT + "?resource=$resourceURI&amp;amp;api-version=2019-08-01&amp;amp;client_id=$env:AZURE_CLIENT_ID"
$tokenResponse = Invoke-RestMethod -Method Get -Headers @{"X-IDENTITY-HEADER" = "$env:IDENTITY_HEADER"} -Uri $tokenAuthURI
$accessToken = $tokenResponse.access_token&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Code to use bearer token to access Azure storage.&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;# Build the request headers with the bearer token and the storage service version
$version = "2017-11-09"
$header = @{
    Authorization  = "Bearer $accessToken"
    'x-ms-version' = $version
}

# Azure Storage requires TLS 1.2
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

# Put a test message on the queue via the Queue service REST API
$QueueMessage = "This is test message#1"
$body = "&amp;lt;QueueMessage&amp;gt;&amp;lt;MessageText&amp;gt;$QueueMessage&amp;lt;/MessageText&amp;gt;&amp;lt;/QueueMessage&amp;gt;"
$item = Invoke-RestMethod -Method POST -Uri "https://storazaarfdevbtgt00003.queue.core.windows.net/test2/messages" -Headers $header -Body $body -ContentType "application/json"&lt;/PRE&gt;
&lt;OL start="7"&gt;
&lt;LI&gt;Even with the bearer token and the REST API, we still could not trigger the ADF pipeline, because any ADF operation via the REST API requires access to the Azure management plane (management.azure.com), which is not allowed in the bank’s environment.&lt;/LI&gt;
&lt;LI&gt;We therefore proposed triggering the ADF pipeline through a storage event trigger using managed private endpoints, so ADF could read the storage over private endpoints without going out to public endpoints.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;9. After completing the steps above, we were able to trigger ADF using the storage event trigger.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Co-Author: Umesh Panwar (Apps &amp;amp; Infra CSA)&lt;/P&gt;</description>
      <pubDate>Mon, 24 Apr 2023 13:30:19 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/trigger-adf-pipeline-using-storage-event-trigger-over-private/ba-p/3799326</guid>
      <dc:creator>Sachin215</dc:creator>
      <dc:date>2023-04-24T13:30:19Z</dc:date>
    </item>
    <item>
      <title>Pipeline Logic 3: Error Handling and Try Catch</title>
      <link>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/pipeline-logic-3-error-handling-and-try-catch/ba-p/3787601</link>
      <description>&lt;H2 id="toc-hId-531300142"&gt;&lt;SPAN&gt;Series Overview&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;Orchestration allows conditional logic and enables users to take different paths based upon the outcome of a previous activity.&amp;nbsp;Building upon the concept of conditional paths, ADF and Synapse pipelines allow users to build&amp;nbsp;&lt;STRONG&gt;versatile&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;resilient&amp;nbsp;&lt;/STRONG&gt;workflows that can handle unexpected errors and run smoothly in auto-pilot mode.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This is an ongoing series that gradually levels up and helps you build even more complicated logic to handle more scenarios. We will walk through examples for some common use cases, and help you to build functional and useful workflows.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Please review the earlier installments in the series:&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://techcommunity.microsoft.com/t5/azure-data-factory-blog/pipeline-logic-1-error-handling-and-best-effort-step/ba-p/3712168" target="_blank" rel="noopener"&gt;Part 1: Error Handling and Best Effort Step&lt;/A&gt;&amp;nbsp;and &lt;A href="https://techcommunity.microsoft.com/t5/azure-data-factory-blog/pipeline-logic-2-or-at-least-1-activity-succeeded-or-failed/ba-p/3712193" target="_self"&gt;Part 2: OR&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Error Handling and Try Catch&lt;/H3&gt;
&lt;P&gt;Error handling is a very common scenario in data engineering pipelines. From time to time, activities will fail, but we don't want to fail the whole pipeline due to a single activity failure.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We call this logic: &lt;STRONG&gt;Try-Catch,&lt;/STRONG&gt; and we have streamlined the implementation for this common use case.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;img /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Add the first activity&lt;/LI&gt;
&lt;LI&gt;Add the error-handling activity on the UponFailure path&lt;/LI&gt;
&lt;LI&gt;Add the second activity, but don't connect it to the first activity&lt;/LI&gt;
&lt;LI&gt;Connect both the UponFailure and UponSkip paths from the error-handling activity to the second activity (see the JSON sketch below)&lt;/LI&gt;
&lt;/UL&gt;
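&lt;P&gt;For readers working directly with pipeline JSON, the wiring above maps to activity dependency conditions. The snippet below is an illustrative sketch only: activity names and types are placeholders, typeProperties are omitted, and the JSON the authoring UI generates for your pipeline may differ.&lt;/P&gt;
&lt;PRE&gt;{
  "name": "TryCatchPipeline",
  "properties": {
    "activities": [
      { "name": "TryActivity", "type": "Copy" },
      {
        "name": "HandleError",
        "type": "WebActivity",
        "dependsOn": [
          { "activity": "TryActivity", "dependencyConditions": [ "Failed" ] }
        ]
      },
      {
        "name": "SecondActivity",
        "type": "Copy",
        "dependsOn": [
          { "activity": "HandleError", "dependencyConditions": [ "Failed", "Skipped" ] }
        ]
      }
    ]
  }
}&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;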
&lt;P&gt;To learn more, read&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/tutorial-pipeline-failure-error-handling#try-catch-block" target="_blank"&gt;Pipeline failure and error message - Azure Data Factory | Microsoft Learn&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We hope that you have found this blog to be helpful! If you have any questions or feedback, please post them in the comments below.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Apr 2023 17:20:16 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/azure-data-factory-blog/pipeline-logic-3-error-handling-and-try-catch/ba-p/3787601</guid>
      <dc:creator>ChenyeCharlieZhu</dc:creator>
      <dc:date>2023-04-05T17:20:16Z</dc:date>
    </item>
  </channel>
</rss>

