As organizations continue to collect and store large volumes of data in their data lakes, managing this data effectively becomes increasingly important. One key aspect of this is implementing Slow Change Dimension type 2, which allows organizations to track historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. In this blog post we will address the following scenario: a customer wants to implement Slow Change Dimension type 2 on top of their data lake.
For this example, we will use Serverless SQL Pool to demonstrate how this can be done. Additionally, in the next post, we will explore how the same approach can be used with Spark.
Slow Change Dimension type 2: This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Check out this wiki to learn more, https://en.wikipedia.org/wiki/Slowly_changing_dimension
When implementing Slow Change Dimension type 2, there are various approaches that can be used depending on how the data is handled at the source. For example, if the source includes row version information or columns with flags for deleted or updated records, a different approach may be used compared to scenarios where this information is not available. In this particular scenario, we will follow a specific approach to address the challenges that are present.
Considerations for this Scenario:
Solution:
Consider as in the example below the folder sourcefiles_folder with all files in parquet format which were sent from the source to be compared to a destination. My destination is the external table: SCD_DimDepartmentGroup.
The file has the following columns:
Column |
Datatype |
|
DepartmentGroupKey |
int |
Not Null |
ParentDepartmentGroupKey |
int |
NULL |
DepartmentGroupName |
nvarchar |
Not Null |
Here are the current values for this table - SCD_DimDepartmentGroup as Fig. 1, shows:
The last surrogate key is of the value of 8:
Code - First Load and Dimension Creation:
CREATE EXTERNAL TABLE SCD_DimDepartmentGroup
WITH (
LOCATION = '/SCD_DimDepartmentGroup',
DATA_SOURCE = SCD_serveless_dim,
FILE_FORMAT = Parquet_file
)
AS
SELECT ROW_NUMBER () OVER (ORDER BY DepartmentGroupKey) ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,1 ID_valid
,0 ID_Deleted
,Curr_date as From_date
,Null as End_date
FROM
OPENROWSET(
BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder/',
FORMAT = 'PARQUET'
) AS [SCD_DimDepartmentGroup]
The implementation of the SCD is quite simple is basically filtering, persisting the data filtered, and comparing.
Current values on the table:
TableA_SCD_DimDepartmentGroup_NEW - In this example, this step will basically insert all the values that exist on the files and do not exist on the destination into an External Table.
The column DepartmentGroupKey defines if the row is new or not. You will need a column or columns to understand if this is considered new information or not in your dimension.
Please note three columns were added to this example:
Code - First, let's create the data source that will be used across all the external tables:
CREATE EXTERNAL DATA SOURCE SCD_serveless_dim
WITH (
LOCATION = 'https://Storage.blob.core.windows.net/Container/SCD/transformation_folder/',
CREDENTIAL = [MSI]
)
Now, let's create the external table for Step 1:
Please note that I am using the maximum value of the surrogate key column ID_Surr from the main table SCD_DimDepartmentGroup and adding 1 to it to generate new surrogate values for rows that require them. If some values repeat at this point, it is okay because I will reuse this information with the Department Key to create a new surrogate key in the last step (Step 5), thus ensuring a unique value for each row. I am using the max+1 approach to ensure that existing surrogate keys are not affected.
DECLARE @ID_Surr AS INT
SELECT @ID_Surr = MAX (ID_Surr)+1 FROM SCD_DimDepartmentGroup
CREATE EXTERNAL TABLE TableA_SCD_DimDepartmentGroup_NEW
WITH (
LOCATION = '/TableA_SCD_DimDepartmentGroup_NEW',
DATA_SOURCE = SCD_serveless_dim,
FILE_FORMAT = Parquet_file
)
AS
SELECT @ID_Surr as ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,1 ID_valid
,0 ID_Deleted
,Getdate() as Curr_date
FROM
OPENROWSET(
BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder/',
FORMAT = 'PARQUET' ---DELTA can be used here
) AS [SCD_DimDepartmentGroup_Silver]
WHERE NOT EXISTS (
SELECT 1
FROM SCD_DimDepartmentGroup
WHERE SCD_DimDepartmentGroup_Silver.[DepartmentGroupKey] = SCD_DimDepartmentGroup.DepartmentGroupKey
)
Results in Fig. 2 - Step 1:
TableB_SCD_DimDepartmentGroup_OLD will be the next external table to be created by inserting all values that exist on the file exported from the source and do exist on the destination. Those value also should not exist on the external table created in Step 1 (A).
In this scenario, the data that will be versioned is not a new row (Step 1); it is rows that already exist in the destination, but something has changed. The DepartmentGroupKey is the key that will always remain the same, so if there is a new key, it means a new row, and the comparison will happen for any other column. Additionally, in this scenario, it is possible that the source may send the same information again without any changes. Therefore, comparing the data is necessary to ensure that the row sent is indeed a row to be versioned.
Please note for the rows to be versioned:
Code:
Please note I re-used my surrogate key column ID_Surr of the main table SCD_DimDepartmentGroup for the values that will be versioned.
CREATE EXTERNAL TABLE TableB_SCD_DimDepartmentGroup_OLD
WITH (
LOCATION = '/TableB_SCD_DimDepartmentGroup_OLD',
DATA_SOURCE = SCD_serveless_dim,
FILE_FORMAT = Parquet_file
)
AS
CREATE EXTERNAL TABLE TableB_SCD_DimDepartmentGroup_OLD
WITH (
LOCATION = '/TableB_SCD_DimDepartmentGroup_OLD',
DATA_SOURCE = SCD_serveless_dim,
FILE_FORMAT = Parquet_file
)
AS
SELECT ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,0 ID_valid
,1 ID_Deleted
,From_date
FROM [SCD_DimDepartmentGroup]
WHERE NOT EXISTS ( SELECT 1
FROM
OPENROWSET(
BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder',
FORMAT = 'PARQUET'
) AS [SCD_DimDepartmentGroup_Silver]
WHERE SCD_DimDepartmentGroup_Silver.[DepartmentGroupKey] = SCD_DimDepartmentGroup.DepartmentGroupKey
AND (ISNULL(SCD_DimDepartmentGroup_Silver.[ParentDepartmentGroupKey], 1) = ISNULL(SCD_DimDepartmentGroup.[ParentDepartmentGroupKey], 1)
AND SCD_DimDepartmentGroup_Silver.[DepartmentGroupName] = SCD_DimDepartmentGroup.[DepartmentGroupName])
)
AND NOT EXISTS
(SELECT 1 FROM TableA_SCD_DimDepartmentGroup_NEW
WHERE TableA_SCD_DimDepartmentGroup_NEW.[DepartmentGroupKey] = SCD_DimDepartmentGroup.DepartmentGroupKey)
Results are Fig. 3 - Step 2 :
TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS - The next external table will get the new rows from the source file. The filter will be the rows that we already know something changed and needs to be versioned from Step 2.
Code:
Please note that I am using the maximum value of the surrogate key column ID_Surr from the main table SCD_DimDepartmentGroup and adding 1 to it to generate new surrogate values for rows that require them. If some values repeat at this point, it is okay because I will reuse this information with the Department Key to create a new surrogate key in the last step (Step 5), thus ensuring a unique value for each row. I am using the max+1 approach to ensure that existing surrogate keys are not affected.
DECLARE @ID_Surr AS INT
SELECT @ID_Surr = MAX (ID_Surr)+1 FROM SCD_DimDepartmentGroup
CREATE EXTERNAL TABLE TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS
WITH (
LOCATION = '/TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS',
DATA_SOURCE = SCD_serveless_dim,
FILE_FORMAT = Parquet_file
)
AS
SELECT @ID_Surr as ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,1 ID_valid
,0 ID_Deleted
,getdate() Curr_date
FROM
OPENROWSET(
BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder/',
FORMAT = 'PARQUET'
) AS [SCD_DimDepartmentGroup_Silver]
WHERE EXISTS (SELECT 1 FROM TableB_SCD_DimDepartmentGroup_OLD
WHERE SCD_DimDepartmentGroup_Silver.[DepartmentGroupKey] = TableB_SCD_DimDepartmentGroup_OLD.DepartmentGroupKey)
The results are in Fig. 4 - Step 3:
For all those external tables that were created on the last steps.
CREATE EXTERNAL TABLE UNION_SCD_DimDepartmentGroup
WITH (
LOCATION = '/UNION_SCD_DimDepartmentGroup',
DATA_SOURCE = SCD_serveless_dim,
FILE_FORMAT = Parquet_file
)
AS
SELECT ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,ID_valid
,ID_Deleted
,getdate() as From_date
,Null as End_date
FROM TableA_SCD_DimDepartmentGroup_NEW
UNION ALL
SELECT ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,ID_valid
,ID_Deleted
,From_date as From_date
,Getdate() as End_date
FROM TableB_SCD_DimDepartmentGroup_OLD
UNION ALL
SELECT ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,ID_valid
,ID_Deleted
,getdate() as From_date
,Null as End_date
FROM TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS
Results are - Fig. 5 - Union :
Here we have the new rows, new versioned rows consolidated on the external table.
Next, we need to transfer this to our main table, by recreating an external table with the new information and dropping the old one.
1) I need the data from the main table excluding the rows that were versioned on this round
2) I need the data from the external tables Union.
3) I need to keep the historical changes of the old/main table
4) On top of that I will recreate the surrogate key. But for this, we will order per ID_Surr and DepartmentGroupKey. So, the new surrogate key values will start from the max surrogate+1 as we defined in the first Steps.
SELECT ROW_NUMBER () OVER (ORDER BY ID_Surr, DepartmentGroupKey) ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,ID_valid
,ID_Deleted
,From_date
,End_date
FROM (
SELECT ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,ID_valid
,ID_Deleted
,From_date
,End_date
FROM SCD_DimDepartmentGroup
WHERE NOT EXISTS
( SELECT 1 FROM TableB_SCD_DimDepartmentGroup_OLD
WHERE SCD_DimDepartmentGroup.DepartmentGroupKey = TableB_SCD_DimDepartmentGroup_OLD.DepartmentGroupKey
)
UNION
SELECT ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
, ID_valid
,ID_Deleted
,From_date
,End_date
FROM UNION_SCD_DimDepartmentGroup
UNION
SELECT ID_Surr
,[DepartmentGroupKey]
,[ParentDepartmentGroupKey]
,[DepartmentGroupName]
,ID_valid
,ID_Deleted
,From_date
,End_date
FROM SCD_DimDepartmentGroup
WHERE SCD_DimDepartmentGroup.ID_valid = 0
)NEW_SCD
Results are in Fig. 6 - New Table. Please note in green there are the rows that are versioned, and in blue are the new rows:
This new table should replace the current SCD_DimDepartmentGroup, by recreating a new external table and dropping this old one.
Please review the considerations for the scenario to understand what was relevant for this implementation. As I mentioned earlier, there are different approaches that can be used to create a Slow Change Dimension, depending on how the data is handled at the source. In this post, I have provided examples of how to read new data, version the old data that has changed, and recreate new external tables with the updated information for a Slow Change Dimension type 2.
In the next post, I will demonstrate how to achieve the same using Spark. Stay tuned for more!
Liliam, UK.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.