How to Handle Evolving Database Schemas in your ETL with Azure Data Factory
Published Apr 13 2020
Microsoft

When transforming data in ADF with Azure SQL Database, you have the choice of enforcing a validated schema or allowing schema drift. If you have a scenario where your source or sink database schemas will change over time, you may wish to build your data flows in a way that accommodates flexible schema patterns. The features we'll use in this post to demonstrate this technique are schema drift, column patterns, late binding, and auto-mapping.

 

 

Late Binding

Start with a new data flow and add an Azure SQL Database source dataset. Make sure the dataset does not import a schema and that your source has no projection. This is key to allowing your flows to use "late binding" for database schemas that change.
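To make the contrast concrete, here is a rough sketch of the data flow script behind each style of source. The stream and column names (EarlyBoundSource, FlexibleSource, ProductID, ProductName) are hypothetical, and connector-specific options are omitted; the important difference is that the late-bound source has no output() projection block:

    source(output(
            ProductID as integer,
            ProductName as string
        ),
        format: 'table') ~> EarlyBoundSource

    source(allowSchemaDrift: true,
        validateSchema: false,
        format: 'table') ~> FlexibleSource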

 

Schema Drift

On both the Source and Sink transformations, turn "Allow schema drift" on so that ADF lets any columns flow through, regardless of their "early binding" definitions in the dataset or projection. In this scenario, we're not including any schema or projection, so schema drift is needed. If you have a scenario that requires hardened schemas, set "Validate schema", import the schema in the dataset, and define a source projection.
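In the underlying data flow script these settings are the allowSchemaDrift and validateSchema options, and they need to appear on the sink as well as the source. A sketch, continuing with the hypothetical FlexibleSource stream from above and again omitting connector-specific options:

    FlexibleSource sink(allowSchemaDrift: true,
        validateSchema: false,
        format: 'table') ~> FlexibleSink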

 

Column Patterns

When you are working with flexible schemas and schema drift, avoid naming specific columns in your data flow metadata. When transforming data and writing Derived Column expressions, use "column patterns": you match columns by name, data type, ordinal position, or combinations of those characteristics, and transform whatever matches.
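For example, a single Derived Column pattern could trim every string column that arrives, whatever it happens to be named. The "trim every string" rule is purely illustrative; in the pattern expression builder, $0 resolves to the matched column's name and $$ to its value, so a sketch looks like this:

    Matching condition:        type == 'string'
    Column name expression:    $0
    Value expression:          trim($$)

When you do need to reach one specific drifted column, you can still reference it by name in an expression, for instance toString(byName('ProductName')), where ProductName is just a placeholder column name.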

 

[Image: flex4.png]

 

Auto-Mapping

On the Sink transformation, map incoming to outgoing fields using "auto-mapping". Setting specific, hardened mappings is "early binding"; here you wish to remain flexible with "late binding".
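The difference is visible in the sink's data flow script: explicit, hardened mappings appear as a mapColumn() block. A sketch with hypothetical stream and column names:

    FlexibleSource sink(validateSchema: true,
        mapColumn(
            ProductID,
            ProductName
        ),
        format: 'table') ~> EarlyBoundSink

With auto-mapping turned on, the mapColumn() block is simply not emitted and the mapping is resolved by column name at run time:

    FlexibleSource sink(allowSchemaDrift: true,
        validateSchema: false,
        format: 'table') ~> AutoMappedSink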

 

One last note: in the video above, I demonstrate using the Alter Row transformation to update existing database rows. When using fully schema-less datasets, you must enter the primary key fields in your sink using dynamic content. Simply refer to the target database key columns as string column names:
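As a sketch, assume a hypothetical target table keyed on ProductID. The Alter Row policy and the sink's key list in data flow script might look roughly like this; the keys are given purely as string column names rather than mapped fields, and a composite key would simply add more entries to the array:

    FlexibleSource alterRow(updateIf(true())) ~> MarkAsUpdate
    MarkAsUpdate sink(allowSchemaDrift: true,
        validateSchema: false,
        updateable: true,
        insertable: false,
        keys: ['ProductID'],
        format: 'table') ~> UpdateSink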

 

[Image: flex3.png]

 

11 Comments
Brass Contributor

One addition, Mark: this is all focused on Data Flows. For the Copy activity, as long as neither the dataset nor the copy activity has a schema defined, the Copy activity will do auto-mapping based on column name. Simple and effective.

 

One interesting note: the auto-mapping is case-sensitive. ADF will even warn you if there was no match but there would have been one with different casing. I like that message: there is a problem, and at the same time ADF flags the likely cause.

Microsoft

Note that my example is ETL. I am transforming and updating rows, not just simple mapping. That's why this needs to be done in a Data Flow.

Copper Contributor

I would like to see the same demo, but with the data flow creating the target table's columns via schema drift. This one only demonstrates a target table whose schema already exists.

Microsoft

@craigwhite All you have to do is set "Recreate table" on the sink, and ADF will create the target table with your incoming columns.
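In the generated data flow script, that table action shows up as a sink option, roughly like the sketch below (the stream names are hypothetical and connector-specific options are omitted):

    IncomingColumns sink(allowSchemaDrift: true,
        validateSchema: false,
        recreate: true,
        format: 'table') ~> RecreatedTableSink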

Copper Contributor

@Mark Kromer Will recreating the table delete all existing data? I was hoping that if a new column (or columns) was created on the source table, the new column(s) would be added to the target table without data disruption (a table alter). Is this the case?

Microsoft

@craigwhite  It is a drop/recreate of the table, not an alter table

Copper Contributor

Is there a way to alter the table rather than recreate it?

Copper Contributor

@Mark Kromer:

 

Hi Mark, we have a set of 500 CSV files that we have uploaded to an Azure storage container. These CSV files use 4 different schemas, with a few columns in common and a few that differ. Our objective is to build a pipeline that loads all these files into a single SQL database by mapping the relevant columns for each schema. We have tried using schema drift, but it is not giving the required results. Any ideas on how we can do this, please?

 

Thanks.

Copper Contributor

Hi @Mark Kromer,

A few comments above, you wrote:
"It is a drop/recreate of the table, not an alter table."

 

Is there a way to alter the table and add the missing columns? This would be a very useful feature, because we have sources whose structure can change, and we do not want to change the target tables manually every time. In principle: if columns are missing, create them automatically without losing data.

Copper Contributor

Hi Mark, 

 

Great video! Thanks for it. Is it possible to do an Aggregate transformation in this case, when we don't have a schema input? I have multiple source tables passed in through a "tableName" parameter. I would like to union them and then group by. I tried to use the Aggregate, but it does not show any column names for the group-by part. I tried to use "name == myColumnName", but it does not work there.

 

Thank you.

Copper Contributor

@Mark Kromer Dear Mark, thanks for putting up this great piece. As you mentioned, it provides a way to "recreate" a table on the destination side. Don't you think it will be a huge effort on the backend if we have to recreate a big table, say 1 TB in size, just to accommodate a new column? Even with parallel processing, this sounds like a huge effort. Many growing industries have lots of new fields coming in as needs evolve, and if the table gets re-created each time, that sounds like a mammoth effort. Is there a better way of handling this?

 

Thanks in advance! Best Regards - Chandan
