Azure Data Factory is Azure's cloud ETL service for scale-out serverless data integration and data transformation. Customers often use Data Factory to perform bulk-load or incremental-load activities. Two of the most commonly used activities for these purposes are the Copy activity, which copies data among data stores located on-premises and in the cloud, and the ForEach activity, which defines a repeating control flow in your pipeline. The ForEach activity iterates over a collection and executes the specified activities in a loop, similar to the foreach looping structure in programming languages. One of the most common scenarios is copying data from multiple folders and mapping it to multiple databases, schemas, and tables.
This blog is inspired by the post "Build metadata driven pipelines in Azure Data Factory" published by our FastTrack engineer Olga Molocenco. In this series we will also capture the Copy data activity's metadata in the same control table. Developers often need this to debug errors during execution, instead of tracing back through the activities in the GUI to validate what went wrong.
Note: This approach captures only the Copy data activity errors.
It will not capture anything if the initial Lookup activity itself fails.
About this solution template
This lab was built to address the following scenario:
Copy data from an ADLS Gen2 account containing multiple folders recursively to multiple databases within an Azure SQL server. The mapping is retrieved via a Lookup activity from a metadata control table that also resides within the same Azure SQL server. The Copy data activity is executed within a ForEach loop, and at the end of each loop a Script activity updates the metadata in the same metadata control table.
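To give a sense of what that final Script activity does, the statement it issues could be something along the lines of the sketch below. The column and parameter names (CopyStatus, RowsCopied, SourceFolder) are placeholders invented for illustration; the actual control table definition and pipeline expressions are covered later in this series.
-- Hypothetical sketch: mark the current iteration's row with the Copy data activity outcome
UPDATE dbo.MetadataControlTable
SET    CopyStatus  = @CopyStatus,    -- e.g. 'Succeeded' or 'Failed', taken from the Copy data activity output
       RowsCopied  = @RowsCopied,    -- rows written, also taken from the activity output
       LastRunDate = GETDATE()
WHERE  SourceFolder = @SourceFolder; -- identifies the folder processed in this loop iteration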
Note: You may be tempted to recreate the resources using your own naming conventions & datatypes.
Please do not attempt to do the same when setting up the lab for the very first time.
This may result in unexpected formatting or execution errors which may also cause validation to fail.
The same applies to the naming conventions used for the pipeline and its sub-activities.
Once you thoroughly understand the lab's components and their interdependencies, feel free to adapt it for your own purposes and modify its content.
Environment Setup
1. ADLS Gen2 Account -> (I named my account “incrementalsource”)
Create a storage account with hierarchical namespace enabled (you may choose your own name) with the following sub-folders (use the same naming convention for the subfolders and files). This will serve as the source from which the data will be copied.
Create a directory within each folder with the same name, for example:
The text files will be organized as follows within each of these containers
Download the text files from here
2. Azure SQL Server -> (I named my server “incrementalsql”)
Create an Azure SQL server (you may choose your own name) which will contain the HR, finance, and ITops databases along with an incrementalmetadata database (use the same naming convention for the databases).
Any basic/serverless tier databases will do for this lab with minimum storage.
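If you prefer T-SQL over the portal for creating the databases, a minimal sketch run against the master database of the logical server could look like this; the Basic edition shown here is just one option that meets the basic/serverless requirement:
-- Run against the master database of the Azure SQL logical server
CREATE DATABASE [HR] ( EDITION = 'Basic' );
GO
CREATE DATABASE [finance] ( EDITION = 'Basic' );
GO
CREATE DATABASE [ITops] ( EDITION = 'Basic' );
GO
CREATE DATABASE [incrementalmetadata] ( EDITION = 'Basic' );
GO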
Use the CREATE SCHEMA script to create the hradmin and general schemas in the HR database and the inventory schema in the ITops database (use the same schema names).
Syntax :
CREATE SCHEMA [schema_name]
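For example, the schema creation for this lab boils down to running the following statements against the databases noted in the comments (each CREATE SCHEMA must be in its own batch, hence the GO separators):
-- Run against the HR database
CREATE SCHEMA [hradmin];
GO
CREATE SCHEMA [general];
GO
-- Run against the ITops database
CREATE SCHEMA [inventory];
GO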
Download the create table scripts from here. Ensure you are running them against the right databases.
3. Inserting Metadata Table rows
By default, the Copy activity maps source data to the sink by column names in a case-sensitive manner. If the sink doesn't exist, for example when writing to file(s), the source field names will be persisted as the sink names. If the sink already exists, it must contain all the columns being copied from the source. This default mapping supports flexible schemas and schema drift from source to sink from execution to execution: all the data returned by the source data store can be copied to the sink.
As you may have noticed, some column names in the text files (SourceFiles) are different from what is defined in the referenced target tables (TargetTables). This was done on purpose: your source location may contain columns that are not declared with the same name in the target tables. If you have a static source and a static sink, this mapping is easy via the GUI. But what we are attempting here is to run a loop where the source and sink keep changing, as well as the column names and datatypes. We have therefore created parameterized JSON scripts, which are stored inside the metadata table under the TargetJson column. These take care of the mapping when the pipeline is executed: the Azure Data Factory Copy activity will perform schema and data type mapping from the source data to the sink data.
How we created these custom scripts?
You can essentially create this content by mapping a single static source to a static sink. When you select the Import Schema option, it populates the columns and datatypes for you, as shown below.
Once this is done, click on the View source code button.
The entire section that appears within the translator parameter, as shown highlighted below, is the content used to create these customized scripts for the lab.
The resultant file may look something like this
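Purely as an illustration (the actual files used in the lab are linked in the next line), a trimmed-down translator mapping held in a T-SQL variable might look like the sketch below; the EmpID/EmployeeID column names are made up for this example:
-- Hypothetical example: a trimmed-down TabularTranslator mapping stored as a string
DECLARE @TargetJson nvarchar(max) = N'{
  "type": "TabularTranslator",
  "mappings": [
    { "source": { "name": "EmpID",   "type": "String" }, "sink": { "name": "EmployeeID", "type": "Int32"  } },
    { "source": { "name": "EmpName", "type": "String" }, "sink": { "name": "FullName",   "type": "String" } }
  ]
}';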
The JSON files which were used as the values for the TargetJson column are available here.
We will now insert data into the MetadataControlTable using this script.
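For orientation only, a row inserted by such a script could take roughly the shape below. The column names (SourceFiles, TargetTables, TargetJson) and the sample values are assumptions for illustration; use the column names and values from the downloaded script.
-- Illustrative only: assumed column names and sample values
INSERT INTO dbo.MetadataControlTable (SourceFiles, TargetTables, TargetJson)
VALUES
(
    'hr/employee',        -- hypothetical source folder in the ADLS Gen2 account
    'hradmin.employee',   -- hypothetical target schema.table in the HR database
    N'{"type":"TabularTranslator","mappings":[{"source":{"name":"EmpID","type":"String"},"sink":{"name":"EmployeeID","type":"Int32"}}]}'
);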
With these custom scripts in place, our environment is now ready for setting up the Azure Data Factory environment. This will be covered in the next article. Finally, in the last article of this series, we will show how to create the pipeline with the activities running in loops.
References:
- https://techcommunity.microsoft.com/t5/fasttrack-for-azure/build-metadata-driven-pipelines-in-azure-data-factory/ba-p/3443328
- https://www.mssqltips.com/sqlservertip/6320/logging-azure-data-factory-pipeline-audit-data/
- https://learn.microsoft.com/en-us/azure/data-factory/solution-template-bulk-copy-from-files-to-database
- https://learn.microsoft.com/en-us/azure/data-factory/monitor-schema-logs-events
- https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions
- https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
- https://learn.microsoft.com/en-us/azure/data-factory/parameters-data-flow
Don't forget to share a like if this helps.
Credit: Thanks to Gary Lee, Nathan Widdup, and Paul Burpo for reviews and guidance.
FastTrack for Azure: Move to Azure efficiently with customized guidance from Azure engineering.