Modelling Microsoft Dynamics 365 Data Using Data Vault 2.0

 

The previous articles in this series covered individual aspects of Data Vault 2.0 on the Microsoft Azure platform. We discussed several architectures, such as the standard or base architecture, the real-time architecture, and one for self-service BI. We have also looked at individual entities, their purpose, and how to model them.

 

What is missing now is the big picture: we have the tools, but how do we use them in a real-world example? How do we get started on an actual project? To answer these questions, we have teamed up with consultants from Scalefree, a consulting firm specializing in Big Data projects with Data Vault 2.0, and we use some of their process patterns in this article to get you started.

 

Introduction

For this article, we have extracted data from the Microsoft Dynamics API. The goal of this article is to come up with the Raw Data Vault model, including important hubs, links, and satellites. We will take the standard Sales process of Microsoft Dynamics and use its business objects for this purpose. Those business objects are:

 

  • Lead
  • Account
  • Contact
  • Opportunity
  • Opportunity Product
  • Quote
  • Product

 

The data model we are using is a selection from the Microsoft Common Data Model: we picked all objects that play a role in the standard sales process.

 

The sales process can be described very roughly as follows:

 

Lead -> Account / Contact -> Opportunity -> Quote -> Opportunity Close

 

This Raw Data Vault could later be used to create reports, dashboards, and other data-centric solutions.

 

Note that, while more complete than the previous articles, this one still does not describe the full capabilities of Data Vault 2.0. Data Vault 2.0 shows its full strength when data from multiple data sources must be integrated. This is not the case in this article, as we only model data from Microsoft Dynamics CRM to keep things simple. However, you can easily extend the model with additional data in the next iteration. Data Vault 2.0 has been designed for the agile development of data analytical platforms: step by step, you would add more and more data from additional (or the same) data sources and integrate the data into the platform.

 

A Team Effort

Building a data analytics platform is not a job only for data modelers. Instead, a data analytics platform is a complex technical system, and building it requires expertise from technical and business areas. Therefore, it is a team effort, and many roles are involved, including data engineers, architects, testers, DevOps engineers, and more. But in addition to the typical data roles, it is highly recommended to include two other types of roles: one focused on business value and the other focused on the data sources.

 

The business role represents the user and can explain the requirements. They know the business processes and how the business value, typically a report or dashboard, is used later on. They can also explain data quality issues and know what step in the business process causes them.

 

On the other side of the team, there are the data source specialists. They know the source system very well from a technical perspective, know the purpose of each data attribute, can explain data quality issues from a source system perspective, and know how the data relates to each other.

 

For the same reasons, we teamed up with the CRM experts at Scalefree to write this article and create a more comprehensive Data Vault 2.0 model for the Microsoft Dynamics 365 CRM solution.

 

Getting started with the Data Vault project is always a challenge. Typically, teams face two questions (among others):

 

  1. How to start the overall project?
  2. How to start with the modeling aspects?

 

The next two sections present our answers to these questions.

 

How to start the overall project?

At Scalefree, our practice is to initiate a client engagement with a series of sprints. The first sprint, called “sprint minus one”, is all about resources and is performed by a senior Scalefree role. Does the project have access to the right resources, including team members, business users, software and hardware, and meeting space? Will the team be able to start working as soon as they join the project? Are user credentials set up?

 

Once the readiness of the project has been established, the next sprint, called “sprint zero”, is used to set up the infrastructure and the initial information requirement for the first real sprint (“sprint one”). The goal is to set up the architecture for the data analytics platform, including all the required layers. As a basis, one of the Data Vault 2.0 architectures from our previous articles can be used. However, in most cases, we do not apply these architectures directly: the architecture often requires adjustments to the circumstances of the client, for example, pre-existing components, the tool stack, etc. With that in mind, we honestly don’t know if the architecture draft actually works in a project: many variables are unknown, some tools are new and sometimes even untested, and the team might lack skills or experience.

 

The worst-case scenario would be to start developing against this architecture blueprint only to realize, two years from now, that the architecture choice was a bad one or doesn’t work at all. Instead of wasting two years of budget, we follow a “fail hard, fail fast” strategy: if the architecture blueprint doesn’t work, you want to fail as soon as possible to limit the wasted budget. And that is the main goal of sprint zero: set up the architecture and establish a data flow from an actual data source to the actual target. This data flow doesn’t have to use Data Vault models; we are more concerned with the data flow itself. More important than the Data Vault model is to use the sprint to test assumptions and unknowns.

 

For example, in one of our projects, the client wanted to use binary hash keys in the information marts. It was unclear if the dashboard application was able to join on these binary hash keys. Instead of waiting for this to test in the future, we used sprint zero to test the assumption that it would work. If not, we could still adjust our architecture and implementation techniques for the later sprints.
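As an illustration of how small such an early test can be, here is a minimal, hypothetical Python sketch: it computes binary MD5 hash keys for two tiny in-memory tables and joins them on those binary keys. The table contents and names are made up; the point is only to show the kind of assumption that sprint zero is meant to validate quickly.

```python
import hashlib

def hash_key(*business_keys: str) -> bytes:
    """Compute a binary (16-byte) MD5 hash key from one or more business keys.
    Keys are trimmed, upper-cased, and concatenated with a delimiter before hashing."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).digest()

# Two tiny sample tables keyed by the binary hash key (illustrative data only).
hub_customer = {hash_key(ck): ck for ck in ["C-1001", "C-1002", "C-1003"]}
sat_customer = {hash_key("C-1001"): {"name": "Contoso"},
                hash_key("C-1002"): {"name": "Fabrikam"}}

# "Join" on the binary hash key, as a dashboard tool would have to do.
joined = {ck: sat_customer[hk] for hk, ck in hub_customer.items() if hk in sat_customer}
print(joined)  # {'C-1001': {'name': 'Contoso'}, 'C-1002': {'name': 'Fabrikam'}}
```

If a test like this fails against the real dashboard tool, the hash key format (or the tool) can still be changed cheaply at this stage.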

 

The goal of sprint one, and of most subsequent sprints, is to deliver business value. However, it is not the purpose of sprint one to define that business value; this should be done in a previous sprint. Since this is sprint one, sprint zero must define the business value in the form of a first information requirement.

 

Most subsequent sprints should deliver business value or at least some progress. There might be some exceptions when technical debt piles up over time. In this case, a technical sprint might focus on reducing the technical debt and not delivering any business value. However, this is the exception, not the rule.

 

How to get started with the Data Vault model?

While the Data Vault model is not a concern of sprint zero, it certainly is a concern in subsequent sprints. Many less experienced Data Vault practitioners struggle to get started with their first Data Vault model, even after proper training. A good start is always to define the client’s business model and focus on the major concepts. For example, the following diagram could represent the business model of a customer-centric organization:

[Diagram: conceptual model of a customer-centric organization]

 

We call this diagram a “conceptual model” or ontology. It defines the business objects and their relationships. Typically, it is defined in a meeting with business users when we ask them early in the project to explain their business.

 

The next piece of knowledge to extract from the business is the business keys, which are used to identify the business objects in the conceptual model. When asking business users from various departments, you will often get different answers. This is because different departments often use different operational systems, with different business key definitions, to identify the same business objects. Therefore, expect multiple business keys per business object, depending on who you ask. Add them to the conceptual model:

[Diagram: conceptual model annotated with business key candidates]

 

 

We have added the business keys to the above diagram using braces. However, we did not try to find the best business key for a given business object. Do not judge the business keys yet: email addresses might not be a good choice given privacy requirements, but add them to the concept nevertheless. These business key candidates give us options, so the more actual candidates, the better. With that in mind, make sure to invite all relevant business users so that as many data sources as possible are captured.

 

This model will be of great use when analyzing specific source systems, such as Microsoft Dynamics CRM. The basic idea of the business key, as discussed in article 7 of this series, is to be shared across source systems, so it can be used to integrate the separate data sets.

 

Identify the Concept and Type of Data

Once the organizational context of the model is known, it is time to analyze the actual dataset. The first step is to identify the concept and type of data:

 

  • Is this dataset related to one of the concepts in the conceptual diagram?
  • Does this dataset represent a business object (maybe in addition to the ones on the conceptual diagram) or transactional data?
  • What about reference data?
  • Is multi-active data (refer to article 8 of this series) involved?

 

These are some of the questions that should be asked during this phase.

 

The relationship between the raw data from the data source and the conceptual model defines to a good extent how to model the source dataset in the Raw Data Vault.

 

For example, if the source dataset is directly related to one of the business objects in the conceptual model and represents master data, we would often model the source dataset as a combination of a hub and satellite. This is the easiest way to capture changes to descriptive data attributes and forms the basis for many queries to produce dimension entities in star or snowflake schemas. Foreign keys in such data sets are often indicators for relationships with other hubs.
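To make this concrete, the following is a minimal sketch (not the article’s actual loading implementation) of how a single master-data record could be split into a hub row and a satellite row, with a hash diff to detect changes between loads. Column names such as account_id are placeholders.

```python
import hashlib
from datetime import datetime, timezone

def md5_hex(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def to_hub_and_satellite(record: dict, business_key_column: str, record_source: str):
    """Split one master-data record into a hub row and a satellite row (illustrative sketch)."""
    load_date = datetime.now(timezone.utc)
    business_key = str(record[business_key_column]).strip().upper()
    hub_hash_key = md5_hex(business_key)

    hub_row = {"hash_key": hub_hash_key, "business_key": business_key,
               "load_date": load_date, "record_source": record_source}

    # Everything except the business key is descriptive and goes to the satellite.
    descriptive = {k: v for k, v in record.items() if k != business_key_column}
    hash_diff = md5_hex("||".join(f"{k}={descriptive[k]}" for k in sorted(descriptive)))
    sat_row = {"hash_key": hub_hash_key, "load_date": load_date,
               "record_source": record_source, "hash_diff": hash_diff, **descriptive}
    return hub_row, sat_row

hub, sat = to_hub_and_satellite(
    {"account_id": "A-42", "name": "Contoso Ltd.", "city": "Seattle"},
    business_key_column="account_id", record_source="dynamics.account")
```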

 

On the other hand, there is data that represents transactions, messages, and events. Such data is often modeled as non-historized links or dependent child links, meaning data that is processed more as a stream of individual records than as changes to such records. This explains the name “non-historized,” but it doesn’t mean that we cannot capture changes to transactions, events, and messages if they occur. This is possible but out of the scope of this article.

 

Focus on the Business Keys

Business key identification is an important activity when designing the Data Vault model. In an ideal world, our consultants could analyze every single data source to identify the business key that is shared the most. However, none of our customers are willing to set up the required budget, and they are right not to. Instead, we use the conceptual model to extend our limited view beyond the specific data source we’re trying to integrate. The business key candidates provided by the business users serve as an indicator of how widely individual attributes of the specific data source are shared.

 

Once the concept of the dataset has been recognized, the available business key candidates are identified: of the business keys mentioned in the conceptual model, which business keys actually exist in the dataset to be ingested? If none of these candidates exist, what other options exist to uniquely identify the records in the dataset?

 

Ultimately, a business key must meet some important criteria. The most important ones are uniqueness across two dimensions:

 

  1. Uniquely identify the business object across the enterprise
  2. Uniquely identify the business object over time

 

Both aspects are important in data analytical platforms. The first dimension matters because the stated goal is to build an enterprise-wide data platform. A key that is local to a tenant and collides with another tenant’s local key (e.g., business key 42 identifies different customers at different tenants) makes it impossible to identify the business object uniquely across tenants. This can easily be resolved by extending the business key with additional attributes, such as the tenant number. The combination of the local business key and the tenant ID uniquely identifies the customer.
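A minimal sketch of that extension, assuming MD5-based hash keys: the tenant number becomes part of the composite business key, so the same local key at two tenants no longer produces the same hash key.

```python
import hashlib

def composite_hash_key(*parts: str) -> str:
    """Hash a composite business key; every part participates in the hash (illustrative)."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Business key 42 exists at two tenants, but the composite keys no longer collide.
key_tenant_a = composite_hash_key("TENANT-A", "42")
key_tenant_b = composite_hash_key("TENANT-B", "42")
assert key_tenant_a != key_tenant_b
```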

 

The second dimension is time. In some cases, business keys are reused: for example, when a bank account number is too short, many banks reuse the account numbers of closed accounts after a while. This works in operational systems, as these systems often keep no history. However, data analytical platforms should provide historical data, and in this context we need to uniquely identify a bank account over time. The resolution is, again, to extend the business key. Typically, some other attribute can be found to create a unique business key, such as the opening year of the account or, in the case of aviation, the flight date. The combination of the bank account number and the opening year identifies the bank account uniquely over time.

 

Identify Relationships

A source dataset might include a number of business keys. Consider retail transactions: a customer (identified by a customer number) walks into a store (identified by a store number) and purchases a product (identified by a product number). The transaction table includes these three business keys. Just having them included in the same source entity is an indicator of a relationship between these business keys. Since these business keys are loaded into individual hubs, a link is required to establish the relationship.

 

Another important indicator of the need for links is foreign keys. If a foreign key refers to another source dataset that is providing the data for another business object, it indicates that a relationship between the two business objects exists.

 

Define and Split Satellites if necessary

Once business keys are identified, the remaining attributes in the source dataset are typically of descriptive character. Such descriptive attributes are loaded into satellite entities describing the hub or link. But what if multiple derived hubs and links are involved in the source dataset? In this case, it helps to answer a simple question on every data attribute: “If this is a descriptive attribute, what is it describing?” The answer can only be one of the hubs or links. Then, add the attribute to the satellite that describes the respective parent.

 

In addition, there might be reasons to split the satellites:

 

  1. As a general rule, load a satellite only from a single data source
  • Split a satellite by rate of change if no table compression is available (less important with commercial database engines, such as Microsoft Synapse or Fabric)
  3. Separate attributes by their privacy class (non-personal vs personal data attributes) to prepare a physical delete
  4. Separate attributes by their security class (to implement column-level and row-level security)
  5. Separate attributes for business reasons

 

The last reason should only be applied if the advantages exceed the disadvantages as they might have to be joined later on.
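As a hedged illustration of reason 3 above, the following sketch classifies descriptive attributes by an assumed privacy class and routes them to either a personal or a non-personal satellite payload. The classification itself is hypothetical; in a real project, it must come from the business and the data protection officer.

```python
# Hypothetical privacy classification of lead attributes (illustrative only).
PRIVACY_CLASS = {
    "first_name": "personal", "last_name": "personal", "email": "personal",
    "lead_source": "non_personal", "industry": "non_personal", "rating": "non_personal",
}

def split_by_privacy(record: dict) -> tuple[dict, dict]:
    """Route descriptive attributes to a personal or a non-personal satellite payload.
    Unclassified attributes are ignored here; a real load would have to handle them."""
    personal = {k: v for k, v in record.items() if PRIVACY_CLASS.get(k) == "personal"}
    non_personal = {k: v for k, v in record.items() if PRIVACY_CLASS.get(k) == "non_personal"}
    return personal, non_personal

personal_payload, non_personal_payload = split_by_privacy(
    {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com",
     "lead_source": "Web", "industry": "Software", "rating": "Hot"})
```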

 

The Lead Model

The remainder of this article describes the Raw Data Vault models derived from the Microsoft Dynamics CRM data. Each model has been derived from a source dataset. While we describe them individually, they are part of an integrated model.

 

The first three source entities are relatively straightforward to turn into a Raw Data Vault model. Leads, accounts, and contacts are typically considered business objects and operational master data. Therefore, the base Data Vault entities to capture the data in the Raw Data Vault are a hub and a satellite. For the lead object, the identifier is the Lead ID in the data source; in the dataset we analyzed, the Lead ID is unique across all records. The center of the diagram shows the hub for leads and its business key:

 

[Diagram: Raw Data Vault model for the lead business object]

 

Two satellites are used to capture the descriptive data: one for personal data and the other one for non-personal data. This is required to perform physical deletes of personal data when the lead exercises the right to be forgotten, as defined in GDPR. In addition, the satellite named “EffSat Lead” is an effectivity satellite that captures the deleted timestamp to indicate hard-deleted records from the source system, which are soft-deleted in the Raw Data Vault. To recap: the personal data satellite is used to support physical deletes of personal data in the data analytics platform, while the effectivity satellite supports soft-deleting records where they have been removed from the source system.
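The core logic of such an effectivity satellite can be sketched as a comparison between the business keys currently active in the Raw Data Vault and the keys present in the latest full extract from the source; keys that have disappeared receive a deleted timestamp instead of being physically removed. This is a simplified sketch under the assumption of full extracts, not the loading pattern verbatim.

```python
from datetime import datetime, timezone

def detect_hard_deletes(active_keys_in_vault: set[str], keys_in_latest_extract: set[str]):
    """Return effectivity-satellite rows that soft-delete leads missing from the latest full extract."""
    load_date = datetime.now(timezone.utc)
    deleted_keys = active_keys_in_vault - keys_in_latest_extract
    return [{"lead_id": key, "load_date": load_date,
             "deleted_flag": True, "deleted_date": load_date}
            for key in sorted(deleted_keys)]

# Lead L-3 was removed in the source system, so it is soft-deleted in the Raw Data Vault.
rows = detect_hard_deletes({"L-1", "L-2", "L-3"}, {"L-1", "L-2"})
```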

 

There are two links in the center of the diagram: a hierarchical link (HLink) to the left of the Hub Lead and a same-as-link (SAL) to the right. Both links are standard links, but they serve a specific purpose: the hierarchical link is used to capture parent-child hierarchies, as found in the source data with the Parent Link attribute. The attached effectivity satellite indicates if this relationship is still valid or outdated because the relationship has been changed.

 

The same-as-link serves a different purpose: many data sources contain duplicate records. This is especially true for CRM systems, but certainly not limited to them; we encounter duplicate records everywhere, not only within one system but also across multiple systems. The same-as-link is used to flag the duplicates and map them to their designated master records. In the case of Dynamics CRM, the source dataset provides a Master Lead attribute that can be used to map the records. In many other cases, however, the data source does not provide such data, or it might not be sufficient; in one of the next articles of this series, we discuss how to implement a same-as-link in the Business Vault using business logic, such as fuzzy matching or Soundex algorithms. Because these mappings can change over time, an effectivity satellite is used to capture the deleted date of outdated mappings.
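Because the Dynamics dataset already carries a Master Lead reference, the same-as-link rows for the Raw Data Vault can be derived without business logic: whenever a record points to a master record other than itself, one link row connects the duplicate to its master. A small, hypothetical sketch (column names assumed):

```python
def same_as_link_rows(leads: list[dict]) -> list[dict]:
    """Emit same-as-link rows mapping duplicate leads to their designated master lead (sketch)."""
    return [{"duplicate_lead_id": lead["lead_id"], "master_lead_id": lead["master_lead_id"]}
            for lead in leads
            if lead.get("master_lead_id") and lead["master_lead_id"] != lead["lead_id"]]

rows = same_as_link_rows([
    {"lead_id": "L-1", "master_lead_id": None},   # a master record itself
    {"lead_id": "L-2", "master_lead_id": "L-1"},  # duplicate of L-1
])
```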

 

There are three referenced hubs at the bottom of the diagram: hubs for account, customer, and contact. They are introduced because of foreign keys in the source dataset, which also result in respective links and effectivity satellites to capture changes on the foreign keys. Since all these related entities play multiple roles in the lead dataset (one is the parent, and one is a direct relationship), there are multiple links, including their effectivity satellites.

 

The Contact Model

The contact model is centered on a hub for contact, with the business key Contact ID, similar to the lead model:

 

[Diagram: Raw Data Vault model for the contact business object]

 

Again, this is due to the nature of the contact business object. The source data contains personal data, and therefore the descriptive data is split into two satellites on the hub, separating personal from non-personal data. The source dataset contains a similar hierarchy and master indicator, which lead to the hierarchical link and the same-as-link for the same reasons.

 

The biggest difference is the lower number of foreign keys in the source dataset, which leads to fewer links.

 

The Account Model

The Raw Data Vault model for accounts is very similar to the lead and contact model, due to the similar nature of the business object. Again, this is a clear business object and operational master data. Such cases are often modeled as hubs and satellites. Therefore, the center of the diagram shows the hub for an account:

[Diagram: Raw Data Vault model for the account business object]

 

 

However, GDPR doesn’t play much of a role in this case because we assume that the account object doesn’t hold any personal data. Corporate data is not affected by the right to be forgotten. Therefore, the descriptive attributes are not split into different satellites.

 

There are also fewer foreign keys: only the primary contact of the account and the originating lead. The Data Vault design follows the same principles as the lead model with regard to the referenced hub, the link, and the effectivity satellite.

 

The Product and Quote Model

It is easy to argue that a Product is also a business object. Therefore, the Raw Data Vault model is again centered on a hub:

[Diagram: Raw Data Vault model for the product business object]

 

 

The source data set provides a product hierarchy which is captured by the hierarchical link to the left of the product hub. A link called “DLink Opportunity Line Item” is greyed out because it is discussed in the next section.

 

It could be argued that the quote object in the data source represents either a business object or facts. How to model the quote data depends largely on this decision. Based on the conceptual model from the beginning of this article, we decided to consider the quote a business object and therefore model it as a hub with satellites:

[Diagram: Raw Data Vault model for the quote business object]

 

 

The design largely follows the same principles as the previous models. Since quotes might include personal data, we decided to split the attributes into two satellites for personal and non-personal data. The quote dataset has a number of foreign keys, including customer, account, contact, and opportunity, which is discussed in the next section. Every foreign key is implemented as a link with an effectivity satellite to capture soft deletes.

 

The Opportunity Model

For the opportunity dataset in Dynamics CRM, it was decided that this data represents factual data. We consider this data non-historized, as it represents typical business transactions.

 

In the following diagram, the opportunity is captured as two links and a hub:

[Diagram: Raw Data Vault model for the opportunity dataset]

 

 

The opportunity is identified by the Opportunity ID. This business key is captured by the Hub Opportunity. There is also an effectivity satellite to keep track of deletes in the source data as soft-deletes in the data warehouse.

 

But most of the actual data is captured by the two links to the left: “NHLink Opportunity” and “DLink Opportunity Line Item.” In Microsoft Dynamics, an opportunity can have multiple line items. We expect that the user would like to analyze opportunities on both levels (the opportunity level and the line item level) as facts. To capture such data efficiently, we typically use a non-historized link (here: NHLink Opportunity) or dependent child link (DLink Opportunity Line Item).

 

Both link entities have some similarities: they should capture the data in the same granularity as they arrive at the data analytics platform. For every opportunity, there should be one record in the non-historized link, and for every line item, there should be one record in the dependent-child link.

 

As the name suggests, the dependent child link contains a dependent child, the Line Item Number. This is required because the opportunity could have the same product multiple times on different line items, for example, to provide different discount levels or line item descriptions. To be able to capture each individual line item, the granularity of the link is extended by the dependent child, so each line item can be described individually by the attached satellites.
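A minimal sketch of how the dependent child extends the grain: because the line item number participates in the link hash key, the same opportunity and product on two different line items produce two distinct link rows. The key format is assumed, not prescribed.

```python
import hashlib

def link_hash_key(*parts: str) -> str:
    """Hash key for a link; the dependent child is included as one of the parts (illustrative)."""
    return hashlib.md5("||".join(p.strip().upper() for p in parts).encode("utf-8")).hexdigest()

# Same opportunity and product on two different line items yields two distinct link rows.
row_1 = link_hash_key("OPP-100", "PROD-7", "1")   # line item 1
row_2 = link_hash_key("OPP-100", "PROD-7", "2")   # line item 2
assert row_1 != row_2
```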

 

The same is true for the non-historized link on the opportunity level. There might be repeating patterns of the same data, for example, retail transactions, phone call records, etc., that require a similar model to capture individual descriptions. In this specific example, there is no need for a transaction ID, which would typically be added to the non-historized link in a manner similar to the dependent child. This is because the opportunity hub already provides a unique identifier for each opportunity: the alternate key of the hub, referenced in the non-historized link, is already unique per opportunity.

 

With this in mind, one could argue that a standard link would be sufficient to capture the granularity of the opportunities. This is correct, but using a non-historized link in this example is helpful for power users and data scientists as they are looking for these link entities to deliver facts.

 

The Customer Model

There is one interesting aspect in this dataset: both leads and contacts have a customer reference in the source data. However, there is no customer dataset. Instead, the customer referenced by such a relationship is either a contact or an account (an individual or an organization).

 

A bad practice would be to implement some conditional logic in the loading process of the Raw Data Vault to distinguish between such cases. However, conditions can change over time or break due to errors. Therefore, the recommendation is not to apply any business rules or conditions when loading the Raw Data Vault.

 

Instead, we use a simple trick: the data source considers the customer to be a generalized business object with no descriptive data of its own. A generalized hub is used to capture this perspective from the data source. The business keys are the customer IDs from the lead and account tables. Similar references exist for opportunities.
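Loading such a generalized hub therefore stays free of conditional logic: the customer IDs found in each source table are simply unioned into one hub, and the record source documents where each key was first seen. A hypothetical sketch:

```python
def load_customer_hub(existing_keys: set[str], staged_sources: dict[str, list[str]]):
    """Union customer IDs from several staged tables into one generalized customer hub (sketch).
    No business rule decides here whether a key belongs to a contact or an account."""
    new_rows, seen = [], set(existing_keys)
    for record_source, customer_ids in staged_sources.items():
        for customer_id in customer_ids:
            key = customer_id.strip().upper()
            if key not in seen:
                seen.add(key)
                new_rows.append({"customer_id": key, "record_source": record_source})
    return new_rows

rows = load_customer_hub(set(), {"dynamics.lead": ["C-1", "C-2"],
                                 "dynamics.account": ["C-2", "C-3"]})
```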

 

The following diagram illustrates the hub customer:

[Diagram: Raw Data Vault model with the generalized customer hub]

 

 

Note that the only new entities in the overall model are the customer hub (with the customer ID as the business key) and the effectivity satellite on the hub. All other entities in the above diagram have been introduced in the previous sections already.

 

Therefore, the hub customer contains both lead and contact identifiers. Business rules are required to distinguish between them. We leave this to one of the next articles when we implement them in the Business Vault.

 

Conclusion

In this article, we discussed how to get started with a Data Vault project and how to begin creating the Data Vault model. We also demonstrated how to model an actual dataset from Microsoft Dynamics CRM to provide a real-world example with many of the real decisions to be made: one could argue that quote data could be either a business object or fact data, and the same is true for opportunities. Depending on the client’s context and business practices, opportunities could be seen as business objects or as facts.

 

In the case of the opportunities, it was decided to model this as fact data. This is largely due to the business requirements to analyze opportunities in a star schema.

 

However, the same could be argued for the quote object. So, what is the best decision? To be honest: we don’t know. It depends on many factors, such as the source metadata, the source data, business preferences and expectations for the information mart model, and more. The good news is that both models are valid and can serve any business expectation; one will simply be more efficient than the other. And since both are valid models, the model can always be refactored if the modeling decision turns out to need improvement.

 

We left an unsolved problem in this article: the issue of the generalized customer hub that contains business keys for both leads and contacts. We will demonstrate how to deal with the customer hub and relate its records to either the contact or lead hub, depending on the business key. Since this requires business logic, we solve it using Business Vault entities in the next article.

 

About the Authors

Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big-Data consulting firm in Europe, empowering clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing individuals from the industry, taught classes in academia and publishes on these topics on a regular basis.

 

Markus Lewandowski works as a senior consultant at Scalefree International GmbH, a Big Data consulting firm in Europe. His focus is on designing and implementing CRM solutions for a diverse range of clients, specializing in enterprise-level integration. Additionally, he shares his expertise by teaching a CRM solutions class at a local university.

 

Bibhush Nepal works as a technical solutions specialist in Business Intelligence and Data Warehousing at Scalefree International GmbH, a Big-Data consulting firm in Europe. His focus mainly lies in assisting the Internal Data Warehousing team to develop and continuously improve their Internal Data Warehouse and other related projects. He is currently pursuing his Master’s Degree in Data Analytics.

 

