Advanced Data Vault 2.0 Models on Synapse Data Warehouse in Microsoft Fabric

Naveed-Hussain · ‎Jan 24 2024

This Article is Authored By Michael Olschimke, co-founder and CEO at Scalefree International GmbH and Co-authored with Dmytro Polishchuk Senior BI Consultant from Scalefree
The Technical Review is done by Ian Clarke and Naveed Hussain – GBBs (Cloud Scale Analytics) for EMEA at Microsoft

The last article in this blog series discussed the basic entity types in Data Vault 2.0: hubs, links and satellites. While it would be theoretically possible to limit a model to just these three basic entity types, the resulting Data Vault model would be inefficient: it would most likely consume too much storage, be less efficient due to the many joins, and require a number of grain shifts during information delivery. This is due to certain characteristics in the data that require special treatment.

For these characteristics, Data Vault 2.0 provides special entity types that deal with the specialities. This article focuses on two of them: the non-historized link, which is used to capture transactions and events, and the multi-active satellite, which is used to model multiple active descriptions for the same parent hub or link in the same load.

Modeling Transactional Data

A standard link, as discussed in the previous article of this blog series, is designed for capturing relationships between business objects, identified by their business keys. This is a typical scenario in enterprise (and personal) data. The relationship between employees and their corporate cars illustrates this concept as the relationship between the two exists for a given period.

The same is true for facts, transactions, and (real-time) messages and the like: at the time of transaction, business keys are in a relationship with each other. And the (past) relationship remains part of the business history. But in theory, this history is not changing - unlike the relationship between employees and their corporate cars which will eventually change when a new car is assigned, the employee leaves the firm, and so on.

And there is another difference: the way such transactional data is analyzed is different. Typically, the business user would like to create fact entities for aggregation purposes and, for example, count the number of transactions, and calculate their overall revenue, all in regard to defined dimensions (per country, per day, etc.).

To deal with this scenario in the enterprise data, a non-historized link is used. This link is a simple variation of the standard link with the goal to capture the source transactions and events.

A standard link is a distinct list of relationships that have existed. This presents a problem when dealing with transactions and events due tof repeating patterns. Typically people have their behavior and purchase, for example milk, localized only in two or three different grocery stores: one close to work and two closer to home. 80 % of the milk is purchased in those three grocery stores. This results in the same business keys involved in this transaction: customer number, store number, and product number would be repeated for each transaction.

The standard link would only capture the relationship between these business keys once, when the relationship occurs for the first time. After that, the relationship is already known and since the standard link is a distinct list, it would not capture the relationship anymore:

In a standard link, the granularity is defined by the number of hub references in the link. This would not be sufficient to capture multiple transactions that involve the same business keys (e.g., the same customer, store, and product) in the same target link.

To resolve the issue, an additional element of the alternate key on the non-historized link is required. Typically a transaction or event ID, such as the call ID or a basket ID in the retail industry, which also can be modeled as its own hub. In combination with the hub references, the combined business key should be unique per transaction. This way, it is now possible to capture one record in the target link per source record from the operational system.

If a transaction needs to be canceled or corrected, a reverse or adjusting transaction has to be created in the source system and loaded into the link. This is often done in such source systems, but there are also options (so called technical counter transactions) if the source system just updates or deletes a transaction.

The table below shows a typical structure of a non-historized link which consists of business keys of the referenced hubs (customer and salesperson) and the transaction identifier (e.g. invoice number):

Invoice_HashKey	LoadDateTime	RecordSource	Customer_HashKey	SalesPerson_HashKey	InvoiceNumber
dc1375c3…	1/1/2022 15:00	SAP	42ff8cfb…	566afa40…	00321654987
b36d416d…	2/15/2022 10:00	SALESFORCE	a8452cf2…	f71f7769…	00321654098
ce35a0fe…	2/16/2022 16:00	SAP	42ff8cfb…	de76fdd0…	03321654698
495e4975…	3/20/2022 17:00	SALESFORCE	566afa40…	3280395d…	00321111098

A non-historized link is often used in conjunction with a non-historized satellite - a special entity type that contains descriptive attributes for every corresponding record in a non-historized link. A primary key of a non-historized satellite is a link hash key. A non-historized link and a non-historized satellite have a 1:1 relationship to each other and are joined via hash key only. No delta check is performed during the loading process and therefore a non-historized satellite contains no HashDiff column. LoadDateTime is also not part of the primary key of the non-historized satellite. However, for partitioning purposes it is often included in the primary key for indexing reasons. The LoadDateTime is, as usual, part of the table in any case.

The table below shows the typical structure of a non-historized satellite which consists of a hash key of the parent non-historized link, system attributes, such as the LoadDateTime and RecordSource and the associated descriptive attributes:

Invoice_HashKey	LoadDateTime	RecordSource	InvoiceIssueDate	Description	TotalAmount
dc1375c3…	1/1/2022 15:00	SAP	12/31/2021	Monthly invoice for Dec-2021	456.59
b36d416d…	2/15/2022 10:00	SALESFORCE	2/14/2022	Equipment delivery	355.48
ce35a0fe…	2/16/2022 16:00	SAP	2/15/2022	Consulting	100.00
495e4975…	3/20/2022 17:00	SALESFORCE	3/18/2022	Equipment installation	499.99

The separation of the incoming granularity (in the non-historized link) and the descriptive attributes (in the non-historized satellite) is one option to model these transactions and events. The described approach places all descriptive attributes in the non-historized satellite. As a rule, there are several attributes associated with a transaction that should always be presented to business users for analysis. For example, Invoice Amount and Invoice Date. It means that in this case a non-historized satellite will be joined to a non-historized link in every query.

It is recommended that a 1:1 relationship is maintained between the non-historized link and its non-historized satellite to improve the join performance: for every link record, there should be exactly one record in the satellite describing the non-historized transaction or event.

This leads to the question why a separation is required at all? Why not just denormalize both entities into one? And that is exactly the other option where all descriptive attributes are kept in the non-historized link itself to avoid a join to the satellites in every query:

In this case, however, the width of the link table can do harm to your performance optimizations. If the data within a row becomes too wide, it will cut down the number of records per database page in row-oriented database management systems. As a result, the performance of loading the link structure will decrease. On the other hand, joining the data during query time is no longer required and therefore the query performance on a row-based storage would increase.

As a counter-argument, column oriented database management systems perform more efficiently with wide tables, since such DBMS store data for different columns on separate pages or sections. That allows fetching specific column values without the need to read the entire row of data. Such columstore tables will generally outperform clustered index or heap tables, making them more often than not the best choice for large tables.

Data in the Warehouse in Microsoft Fabric is stored in the parquet file format^[1] which is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle large volumes of complex data. With its column-based format files organized by column, rather than by row, it saves storage space and speeds up analytics queries. It has highly efficient data compression and decompression as well as increased data throughput and performance.

Thus, the physical implementation for non-historized links and non-historized satellites may vary based on the physical platform chosen for implementation.

Depending on the query patterns of the data consumers and the physical platform chosen for implementation, data modelers may want to keep only part of attributes in the non-historized link directly. Such attributes usually are the most commonly used ones such as Invoice Date, Total Invoice Amount, etc. This avoids a join (to a non-historized satellite) for a significant number of queries.

In a Microsoft Fabric workspace, a Synapse Data Warehouse is one of the two distinct data warehousing experiences that supports ACID transactions, DDL, and DML queries. The second experience is a SQL Endpoint of the Lakehouse, which is read-only. In the Lakehouse, data can only be modified through the "Lake" view of the Lakehouse using Spark.

How to Implement on Microsoft Fabric

To implement the non-historized link and satellite in Synapse Data Warehouse, the following DDL statements are used:

CREATE TABLE [RAW_VAULT].[INVOICE_NH_LINK]

(                [Invoice_HashKey]                  VARCHAR(32)         NOT NULL

,                [LoadDateTime]                     DATETIME2(6)       NOT NULL

,                [RecourdSource]                    VARCHAR(100)      NOT NULL

,                [Customer_HashKey]                 VARCHAR(32)         NOT NULL

,                [SalesPerson_HashKey]                 VARCHAR(32)         NOT NULL

,                [InvoiceNumber]                    VARCHAR(100)               NULL

);



ALTER TABLE [RAW_VAULT].[INVOICE_NH_LINK]

ADD CONSTRAINT PK_INVOICE_NH_LINK PRIMARY KEY NONCLUSTERED

                                  ( [Invoice_HashKey] )

                 NOT ENFORCED;



ALTER TABLE [RAW_VAULT].[INVOICE_NH_LINK]

 ADD CONSTRAINT UNIQUE_INVOICE_NH_LINK UNIQUE NONCLUSTERED

(                [Customer_HashKey]

,                [SalesPerson_HashKey]

,                [InvoiceNumber]

                 )

    NOT ENFORCED;

The first statement creates the table. The second and third statements add the primary key and unique constraints to the entity. Note that these constraints are not enforced in the Fabric Warehouse and therefore serve as documentation for the user. The SQL optimizer also uses these constraints to create a better execution plan. The table is identified by the Invoice_Hashkey. Customer_HashKey and SalePerson_HashKey identify the incoming granularity together with the InvoiceNumber and thus represent the alternate key constraint on the table, which is represented as a unique constraint in the above DDL statement. RecordSource column is used for debugging.

The non-historized satellite is created with the next DDL statement:

CREATE TABLE [RAW_VAULT].[INVOICE_NH_SAT]

(                [Invoice_HashKey]                  VARCHAR(32)         NOT NULL

,                [LoadDateTime]                     DATETIME2(6)       NOT NULL

,                [RecourdSource]                    VARCHAR(100)      NOT NULL

,                [InvoiceIssueDate]                 DATE                                 NULL

,                [Description]                           VARCHAR(5000)             NULL

,                [TotalAmount]                        FLOAT                               NULL

)



ALTER TABLE [RAW_VAULT].[INVOICE_NH_SAT]

ADD CONSTRAINT PK_INVOICE_NH_SAT PRIMARY KEY NONCLUSTERED

                                  ( [Invoice_HashKey] )

                 NOT ENFORCED;

Here, the primary key of the entity is the Invoice_HashKey again. As in the parent non-historized link the RecordSource is used for debugging. The rest of the attributes are descriptive and describe the incoming granularity.

Additional information about the value of non-historized links can be found on the Scalefree Blog.

Modeling Multi-Active Data

While non-historized links and satellites deal with the granularity of incoming transactions, messages and events, multi-active satellites solve a different problem. Typically, a satellite can only capture one descriptive delta per parent (identified by the hash key) and load (identified by the load date). For example, a source system might deliver multiple phone numbers per contact and with no given upper limit on the number of phone numbers. Therefore, prioritizing the phone numbers is not a sufficient solution.

A Multi-Active Satellite is only used when the source system has multiple active records at the same time or the delivery mechanism does not provide enough metadata, for example, XML.

One of the first options to consider when having multiple active records is the possibility to model a multi-active attribute as a separate “weak” hub and then to use a link to track relationships between the two hubs, illustrated on the diagram below.

But we usually don’t recommend this option for such cases. As a rule, multi-active attributes are not regarded as business objects as in the case with multiple active telephone numbers.

The more commonly used option is to use a multi-active satellite defining an additional key column (for example, Type Code, Sequence, etc.) to uniquely identify a record.

Usually we recommend our clients use a generated sequence of integer values for such purpose, managed by the DWH platform and/or ETL process during the staging process, and load it into the Raw Vault along with additional key columns coming from the source system as regular attributes. The generated sequence will ensure uniqueness of a record without relying on external key columns from the source system. For example, the same phone type can be assigned more than once for different phone numbers associated with one client in one delivery (e.g. two mobile numbers). Type code or other key fields coming from a source are a better concept for the Business Vault where they can be handled and standardized in accordance with the business requirements.

Refer to the table below to see the typical structure of a multi-active satellite which consists of a hash key of the associated hub, additional sequence key field PhoneSequence, and the associated attributes - PhoneType and PhoneNumber.

Customer_HashKey	HashDiff	LoadDateTime	RecordSource	Sequence	Type	PhoneNumber
42ff8cfb…	dc1375c3…	1/1/2022 15:00	SALESFORCE	1	Mobile	+1-410-555-0198
42ff8cfb…	dc1375c3…	1/1/2022 15:00	SALESFORCE	2	Work	+1-410-555-0891
42ff8cfb…	dc1375c3…	1/1/2022 15:00	SALESFORCE	3	Fax	+1-410-665-0700
42ff8cfb…	dc1375c3…	1/1/2022 15:00	SALESFORCE	4	Other	+1-410-665-0702
a8452cf2…	b36d416d…	2/15/2022 10:00	SALESFORCE	1	Mobile	+1202-555-799
a8452cf2…	b36d416d…	2/15/2022 10:00	SALESFORCE	2	Fax	+1202-555-298
a8452cf2…	b36d416d…	2/15/2022 10:00	SALESFORCE	3	Other	+1202-555-976
42ff8cfb…	ce35a0fe…	2/16/2022 16:00	SALESFORCE	1	Work	+1-613-555-0130
42ff8cfb…	ce35a0fe…	2/16/2022 16:00	SALESFORCE	2	Other	+1-613-555-0183
566afa40…	495e4975…	3/20/2022 17:00	SALESFORCE	1	Mobile	+61 1900 654 321
566afa40…	495e4975…	3/20/2022 17:00	SALESFORCE	2	Mobile	+61 1900 456 123
566afa40…	495e4975…	3/20/2022 17:00	SALESFORCE	3	Work	+61 1900 963 499
566afa40…	495e4975…	3/20/2022 17:00	SALESFORCE	4	Work	+61 1900 654 500

The DDL statement for in Synapse Data Warehouse in Fabric to create a multi-active satellite is shown below:

CREATE TABLE [RAW_VAULT].[CUSTOMER_PHONE_MA_SAT]

(                [Customer_HashKey]                 VARCHAR(32)         NOT NULL

,                [HashDiff]                                 VARCHAR(32)         NOT NULL

,                [LoadDateTime]                     DATETIME2(6)       NOT NULL

,                [RecourdSource]                    VARCHAR(100)      NOT NULL

,                [Sequence]                              INT                            NOT NULL

,                [Type]                                       VARCHAR(100)               NULL

,                [PhoneNumber]                     VARCHAR(100)                NULL

)



ALTER TABLE [RAW_VAULT].[CUSTOMER_PHONE_MA_SAT]

 ADD CONSTRAINT PK_CUSTOMER_PHONE_MA_SAT PRIMARY KEY NONCLUSTERED

 (               [Customer_HashKey]

 ,               [LoadDateTime]

 ,               [Sequence]

)

    NOT ENFORCED;

The first statement creates the table. The table is identified by the Customer_HashKey and LoadDateTime. Thus, the second statement adds the unenforced primary key constraint to the table. The Scalefree Blog provides additional information about multi-active satellites.

Additional Special Entity Types

In addition to the presented non-historized link and its satellite and the multi-active satellite, there are more special entity types in Data Vault. The following table from the Scalefree Wiki presents the types:

A detailed discussion of the entity types is beyond the scope of this blog article. Most of them are presented in the book Building a Scalable Data Warehouse with Data Vault 2.0 by Dan Linstedt and Michael Olschimke or discussed in our trainings.

Conclusion

This article introduced special entity types as an addition to the basic entity types. These special entity types serve specific purposes and typically deal with specific scenarios in the incoming data set or define how the data is being used down-stream. On the other hand, they are only variations of the basic entity types and don’t introduce drastically new entities.

In our next article of this series we are going to put these entity types into work when we provide a more comprehensive modeling example.

About the Authors

Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big-Data consulting firm in Europe, empowering clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing individuals from the industry, taught classes in academia and publishes on these topics on a regular basis.

Dmytro Polishchuk is Senior BI Consultant for Scalefree. With his more than 6 years experience in Business Intelligence, he possesses extensive knowledge in the areas of ETL development and data warehousing, specifically with various Microsoft Azure DWH/BI solutions and Power BI. Dmytro is Microsoft Certified Azure Data Engineer Associate and Certified Data Vault 2.0 Practitioner.

^[1] What is data warehousing in Microsoft Fabric?: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-warehousing#open-format-for-seamless-en...

<<< Back to Blog Series Title Page

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs