A Modern Data Analytics Platform on Azure with Data Vault 2.0
Published Jun 30 2023
Microsoft

 

Introduction

Business users in a data-driven organisation need ever more ready-to-consume data and information to support their decision-making. However, many of today’s systems fail to deliver it on time, at the right quality, or in the right quantity.

 

This article starts a blog series on the Data Vault 2.0 concept, a new approach to building scalable business intelligence solutions and, more broadly, a modern data analytics platform.

 

Data Vault 2.0 contains all the components needed to realise an enterprise vision for data warehousing and information delivery. It rests on three pillars: methodology, architecture and modelling, which together provide everything required to create a modern analytics solution.

 

Defining a Modern Data Analytics Platform

We define a modern data analytics platform by three basic characteristics. First, the solution is not limited to a single source system or type of data. Instead, most of our clients extract and load data from multiple sources into the platform, whether internal systems or external sources, for example purchased data. The data from each source can be delivered in various loading cycles: we have built solutions that consumed data from the source in independent nightly batches, CDC loads, near real-time or actual real-time. And the data may be structured, semi-structured or unstructured. To add context, structured data often originates from relational database applications, while semi-structured data, such as JSON or XML documents and messages, is loaded from real-time feeds, web services and REST APIs, or semi-structured applications and their databases. Unstructured data might be images, PDFs, and video files or streams.
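As a minimal illustration of landing semi-structured data, the sketch below pulls a JSON payload from a hypothetical REST endpoint and writes it into an Azure Data Lake Storage Gen2 container. The endpoint, storage account and paths are placeholders, not part of any specific client solution.

```python
# Minimal sketch: land a JSON payload from a REST API into ADLS Gen2.
# The endpoint, storage account, container, and paths are placeholders.
import json
from datetime import datetime, timezone

import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
FILE_SYSTEM = "raw"                              # landing zone container
SOURCE_URL = "https://example.com/api/orders"    # hypothetical source API


def land_json_payload() -> str:
    payload = requests.get(SOURCE_URL, timeout=30).json()

    service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
    file_system = service.get_file_system_client(FILE_SYSTEM)

    # Partition the landing path by load timestamp so every delivery stays reconstructable.
    load_ts = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S")
    path = f"orders/{load_ts}/orders.json"

    file_system.get_file_client(path).upload_data(json.dumps(payload), overwrite=True)
    return path


if __name__ == "__main__":
    print(f"Landed payload at {land_json_payload()}")
```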

 

Our clients need this data, directly or indirectly, as input to their decision-making. Therefore, all of it should be loaded into the data analytics platform and made available to decision-makers in one form or another.

 

The next characteristic is that the platform turns the raw data from the sources into actionable information, which is then consumed by the information user. The raw data is therefore not consumed directly but is pre-processed with business logic to turn it into useful information. A typical challenge in our projects is that end users cannot agree on what the information should look like or how to transform the raw data into it, meaning which business logic should be applied.

 

But the goal is to serve all these business users, regardless of their definition of the truth: we strongly believe that there is (sadly) no single version of the truth, as legacy data warehousing assumed, but many different truths depending on who consumes the information. Every information user wants to apply their own business perspective to the raw data. The raw data, therefore, represents the facts, stored in a single point of facts.
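To make the idea of coexisting truths concrete, the sketch below defines two views over the same hypothetical raw sales table, each applying a different business definition of revenue. All table and column names are illustrative; the point is that the raw data itself is never modified.

```python
# Sketch: two "truths" derived from the same raw facts (hypothetical names).
# Each view encodes one business definition of revenue; the raw table stays untouched.
REVENUE_GROSS_VIEW = """
CREATE VIEW biz.v_revenue_gross AS
SELECT order_id,
       SUM(quantity * unit_price) AS revenue
FROM raw.sales
GROUP BY order_id;
"""

REVENUE_NET_VIEW = """
CREATE VIEW biz.v_revenue_net AS
SELECT order_id,
       SUM(quantity * unit_price) - SUM(discount) - SUM(returned_amount) AS revenue
FROM raw.sales
GROUP BY order_id;
"""

# Both views can be deployed side by side; each information user consumes
# the definition that matches their business perspective.
for ddl in (REVENUE_GROSS_VIEW, REVENUE_NET_VIEW):
    print(ddl)
```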

 

The third characteristic we see in our engagements is that the data analytics platform is often distributed. Instead of building many siloed data warehouse solutions, our clients require an enterprise-wide effort, as above, with all data and all information made available. However, the solution might be distributed across different environments. And there are plenty of reasons to distribute: technical, legal, and organisational are the most common ones across our clients.

 

Our clients integrate the data lake with the relational database, keeping unstructured data on Azure Data Lake Storage and structured data in Synapse Analytics (this follows the hybrid architecture, which we discuss in our next article). This technical reasoning allows each kind of data to be stored and processed where it fits best. Some clients have legal requirements, for example that certain data (e.g., consumers’ financial data) remains in a certain jurisdiction. Organisational reasons drive other clients to build the solution across different environments, physically separating compliance data from the rest of the data analytics platform.
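As a hedged sketch of this hybrid flow, the code below issues a T-SQL COPY statement against a Synapse dedicated SQL pool to load Parquet files that have already landed in the lake. The connection string, schema, table and storage path are placeholders.

```python
# Sketch: load structured files from ADLS Gen2 into a Synapse dedicated SQL pool.
# Connection string, schema, table, and file path are placeholders.
import pyodbc

CONNECTION_STRING = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;Database=<sqlpool>;"
    "Authentication=ActiveDirectoryInteractive;"
)

COPY_STATEMENT = """
COPY INTO staging.sales
FROM 'https://<storage-account>.dfs.core.windows.net/raw/sales/2023/06/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
"""

# Executes the COPY statement; the dedicated SQL pool pulls the files in parallel.
with pyodbc.connect(CONNECTION_STRING, autocommit=True) as connection:
    connection.execute(COPY_STATEMENT)
```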

 

Brief History of Data Vault 2.0

Some of these characteristics, and the requirements we will discuss later in this article, might seem new to the industry, but they are decades old. Dan Linstedt, the inventor of Data Vault and co-founder of Scalefree, faced them while working for a US government department, where he was tasked with building a decentralised data analytics platform that would extract and load data from hundreds of source systems and deliver the information resulting from transformations to information users with varying and potentially contradicting business rules and information requirements.

 

That’s when he invented Data Vault modelling (or Data Vault 1.0), which later became Data Vault 2.0 through the addition of other aspects required to build enterprise data warehouse solutions or, as we call them in this series, data analytics platforms. The additional aspects covered by Data Vault 2.0 include the Data Vault 2.0 architecture, the Data Vault 2.0 methodology, the Data Vault 2.0 implementation practices and, yes, the Data Vault 2.0 model.

 

Value of Data Vault 2.0

These aspects, known as the Pillars of Data Vault 2.0, are used to implement data analytics platforms that bring several features to the business.

 

Dan’s early clients in the US government space required fully auditable solutions. Data Vault has been designed for such environments, and we have used it to implement data-driven solutions for banks, insurance companies and other clients with high auditability requirements. The concepts have enabled us to build solutions with full data lineage at the attribute level, proving which source data was used to produce a given information artifact. Our clients are able to reconstruct any delivery that the platform received and to reproduce any report that was ever shipped out to an end user. This requires that the unmodified data still exists, contradicting legacy approaches where data is made to conform. It also requires that the unmodified business logic of the report still exists, which means business logic must be versioned; this can be done via version control or, more conveniently, by keeping all these versions available in the data analytics platform for consumption.
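In Data Vault 2.0, this traceability is carried by standard columns on every entity, in particular a load date and a record source. The sketch below shows a hypothetical satellite with these audit columns; the descriptive attributes are purely illustrative.

```python
# Sketch: a Data Vault 2.0 satellite with the standard audit columns (hypothetical names).
# load_date and record_source make every row traceable to a specific delivery.
SAT_CUSTOMER_DDL = """
CREATE TABLE rdv.sat_customer (
    hk_customer      CHAR(32)      NOT NULL,  -- hash key of the parent hub
    load_date        DATETIME2     NOT NULL,  -- when the platform received the record
    record_source    VARCHAR(100)  NOT NULL,  -- which source delivered it
    hash_diff        CHAR(32)      NOT NULL,  -- change detection over the payload
    customer_name    NVARCHAR(200),           -- illustrative descriptive attributes
    customer_segment NVARCHAR(50),
    PRIMARY KEY NONCLUSTERED (hk_customer, load_date) NOT ENFORCED
);
"""
print(SAT_CUSTOMER_DDL)
```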

 

The early clients also had very high security and privacy requirements. Cell-level security, the combination of row-level and column-level security, is a must-have in these environments to apply usage restrictions to the data and information via a dynamic Access Control List (ACL). Multiple security classifications are supported and used by most of our clients today.
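Row-level restrictions of this kind can be expressed in T-SQL with a security policy and a filter predicate. The sketch below is a minimal, hypothetical illustration: the clearance table, schema names and classification column are assumptions, not taken from the article.

```python
# Sketch: row-level security via a T-SQL security policy (hypothetical names).
# A clearance table maps users to the highest classification they may see.
PREDICATE_FUNCTION = """
CREATE FUNCTION sec.fn_filter_by_clearance (@classification INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    FROM sec.user_clearance AS uc
    WHERE uc.user_name = USER_NAME()
      AND uc.max_classification >= @classification;
"""

SECURITY_POLICY = """
CREATE SECURITY POLICY sec.customer_policy
ADD FILTER PREDICATE sec.fn_filter_by_clearance(security_classification)
ON rdv.sat_customer
WITH (STATE = ON);
"""

# Each statement is executed as its own batch against the SQL pool.
for statement in (PREDICATE_FUNCTION, SECURITY_POLICY):
    print(statement)
```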

 

The deletion or reduction of PII data, for example of citizens, is a fundamental requirement in many government projects. Therefore, Data Vault supports the separation of attributes by privacy classification. Most of our industry clients use only two classes: personal data and non-personal data. Based on this classification and the subsequent separation, it is possible in the Data Vault 2.0 model to delete the data independently by privacy class and therefore implement a physical delete. For environments where the separation by record is not feasible, there are solutions that support a logical delete as well.
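One common way to realise this separation is to split the descriptive attributes of a business key into a personal-data satellite and a non-personal-data satellite. The hypothetical DDL below sketches that split, so a physical delete of a data subject only touches the PII satellite while the non-personal history stays fully auditable.

```python
# Sketch: satellite split by privacy classification (hypothetical tables and columns).
SPLIT_DDL = """
CREATE TABLE rdv.sat_customer_pii (
    hk_customer   CHAR(32)     NOT NULL,
    load_date     DATETIME2    NOT NULL,
    record_source VARCHAR(100) NOT NULL,
    full_name     NVARCHAR(200),
    date_of_birth DATE,
    email_address NVARCHAR(200)
);

CREATE TABLE rdv.sat_customer_nonpii (
    hk_customer      CHAR(32)     NOT NULL,
    load_date        DATETIME2    NOT NULL,
    record_source    VARCHAR(100) NOT NULL,
    customer_segment NVARCHAR(50),
    loyalty_tier     NVARCHAR(20)
);
"""

# Physical delete of one data subject, identified by the hub's hash key;
# rows in sat_customer_nonpii are left untouched.
DELETE_PII = "DELETE FROM rdv.sat_customer_pii WHERE hk_customer = ?;"

print(SPLIT_DDL)
print(DELETE_PII)
```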

 

To be clear: these patterns stem from early requirements in the 1990s and are, at the time of writing this article, more than 30 years old. But they are still valid for challenges stemming from HIPAA and GDPR regulations and are used to implement GDPR-compliant data platforms on relational databases such as Synapse SQL Pools or on data lakes.

 

Data Vault 2.0 has also been designed for the agile delivery of data-driven solutions. We use it to implement the data analytics platform sprint by sprint, thus avoiding the risky big-bang approach that often fails in our domain. The focus of each sprint is placed on business value, and those who have attended our training know how much I personally stress this point. In our domain, business value is defined as something that can be used by the business user, often a report, a dashboard, or at least some parts of it, such as KPIs. Additional targets we have produced in our internal and client projects are described in more detail on the Scalefree Blog.

 

The ability of Data Vault 2.0 to go agile largely depends on the adaptability of the Data Vault 2.0 model. It is easy and straightforward to extend an existing Data Vault 2.0 model with additional source data, additional business logic and additional target information models. So, instead of adding all the data from a source system at once, we focus on the business value, as described before, and ask a simple question: which data is required for this business value? This, and only this, data is added to the Data Vault 2.0 model, the required business logic is added, and the target information model, often a star schema or snowflake schema, is derived to build the report. This is also known as the tracer bullet approach in Agile software development because it shoots the functionality through all layers of the data analytics platform.
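A tracer-bullet increment in Data Vault 2.0 terms typically adds just the hub, link and satellite entities the targeted report needs and derives the information model as views on top. The sketch below shows a hypothetical hub plus a dimension view derived from it, reusing the hypothetical non-PII satellite from the earlier sketch; all names are illustrative.

```python
# Sketch of one tracer-bullet increment (hypothetical names): only the hub and
# satellite needed for the targeted report are added, and the star schema is
# derived as a view so the raw data stays untouched.
HUB_CUSTOMER_DDL = """
CREATE TABLE rdv.hub_customer (
    hk_customer   CHAR(32)     NOT NULL,  -- hash of the business key
    customer_no   VARCHAR(50)  NOT NULL,  -- the business key itself
    load_date     DATETIME2    NOT NULL,
    record_source VARCHAR(100) NOT NULL
);
"""

DIM_CUSTOMER_VIEW = """
CREATE VIEW mart.dim_customer AS
SELECT h.hk_customer,
       h.customer_no,
       s.customer_segment
FROM rdv.hub_customer AS h
JOIN (
    SELECT hk_customer,
           customer_segment,
           ROW_NUMBER() OVER (PARTITION BY hk_customer ORDER BY load_date DESC) AS rn
    FROM rdv.sat_customer_nonpii
) AS s
  ON s.hk_customer = h.hk_customer
 AND s.rn = 1;   -- current satellite row only
"""

print(HUB_CUSTOMER_DDL)
print(DIM_CUSTOMER_VIEW)
```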

 

But the adaptability goes further: what if the structure of the source system, or the data itself, changes? What if business logic needs to be modified because of changing business requirements? The Data Vault 2.0 model can easily absorb these changes, even in environments where existing code cannot be modified and only the creation of new tables or views is permitted, with no altering allowed. Data Vault supports a zero-change impact on existing code, but be aware that this leads to additional entities, so it is only an option for clients with such requirements.

 

Many of our clients also require the application of one or sometimes multiple business timelines in a multi-temporal solution. We therefore teach the patterns to achieve this, and they have led to great success in our projects. The most sophisticated solutions we have built required the parallel application of business timelines or a combination of different business timelines in a combined business perspective. The final solution allowed the user to switch between different business timelines in their dashboard by selecting a dimension member. With that selection, a different business perspective was applied to the information in an instant, with high performance even on high volumes of data.

 

Our international clients also often require the implementation of multi-lingual platforms that cover not only the labels in a dashboard but also the data and information itself. The same applies to multi-currency support, where the business user would like to convert all foreign currencies into a leading currency or, better, into any currency selected by the user in the dashboard. These clients also often require multi-tenancy, where the data from different organisations, production plants, countries, etc. (whatever they define as tenants) is captured by the platform and can be reported either per tenant or across tenants. Data Vault 2.0 supports all of the above at high speed, regardless of the data volume.

 

Regarding data volume, the Data Vault 2.0 concept has been designed for very large data warehouse (VLDW) environments, better known today as Big Data solutions. These systems can process any volume of data at any speed. Well-implemented solutions scale linearly with data volume or speed: to process twice the volume of data, our clients only need twice the resources (not four, eight or sixteen times), whatever the future data volumes and velocities. It is also possible to deploy the data analytics platform on massively parallel processing (MPP) platforms, on premises or in the Azure cloud. For example, Azure Synapse Dedicated SQL Pool is an MPP platform that executes on many compute nodes in parallel, like many other services in the Azure cloud. Data Vault 2.0 works well with these technologies because it has been designed for them. However, that doesn’t mean you have to use MPP platforms. Data Vault 2.0 also scales down to single-node systems such as Azure SQL DB or Microsoft SQL Server on premises.
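On a Synapse dedicated SQL pool, this kind of scaling relies on spreading the large Data Vault tables across the compute nodes. A hypothetical satellite created with hash distribution on its hash key might look like the sketch below; the table and columns are illustrative.

```python
# Sketch: hash-distribute a large Data Vault table across the MPP distributions
# of a Synapse dedicated SQL pool (table and column names are hypothetical).
DISTRIBUTED_DDL = """
CREATE TABLE rdv.sat_sales_transaction (
    hk_sales      CHAR(32)     NOT NULL,
    load_date     DATETIME2    NOT NULL,
    record_source VARCHAR(100) NOT NULL,
    amount        DECIMAL(18,2),
    currency_code CHAR(3)
)
WITH (
    DISTRIBUTION = HASH(hk_sales),   -- co-locates all rows of one hash key on one distribution
    CLUSTERED COLUMNSTORE INDEX      -- typical storage choice for large, fact-like tables
);
"""
print(DISTRIBUTED_DDL)
```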

 

Our clients also face a good amount of data variety. It’s not uncommon for clients to process hundreds or thousands of source tables for their reporting needs. While the concept supports the addition of thousands of tables over time, doing this manually is cumbersome. Instead, our clients rely on automation tools that take advantage of the fact that the Data Vault 2.0 implementation is pattern-based. All entities follow similar patterns, and our clients automate the data warehouse using defined metadata and automation templates. It is possible to automate 100% of the data lake and 100% of the raw data processing in the so-called Raw Data Vault (to be discussed later in this series).
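Because every Raw Data Vault entity follows the same pattern, the loading code can be generated from metadata. The toy generator below renders one hub-loading statement per metadata record; the template, column names and staging tables are illustrative and not taken from any particular automation tool.

```python
# Toy sketch of metadata-driven generation: one template, many hubs.
# The metadata records and the template are illustrative only.
HUB_LOAD_TEMPLATE = """
INSERT INTO rdv.{hub} ({hash_key}, {business_key}, load_date, record_source)
SELECT DISTINCT stg.{hash_key}, stg.{business_key}, stg.load_date, stg.record_source
FROM stg.{staging_table} AS stg
WHERE NOT EXISTS (
    SELECT 1 FROM rdv.{hub} AS h WHERE h.{hash_key} = stg.{hash_key}
);
"""

HUB_METADATA = [
    {"hub": "hub_customer", "hash_key": "hk_customer",
     "business_key": "customer_no", "staging_table": "crm_customer"},
    {"hub": "hub_product", "hash_key": "hk_product",
     "business_key": "product_no", "staging_table": "erp_product"},
]


def generate_hub_loads(metadata):
    """Render one idempotent hub-load statement per metadata record."""
    return [HUB_LOAD_TEMPLATE.format(**record) for record in metadata]


if __name__ == "__main__":
    for statement in generate_hub_loads(HUB_METADATA):
        print(statement)
```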

Conclusion and Outlook

With all these values in mind, Data Vault 2.0 sounds like the “silver bullet” that will solve all your data processing requirements when building a data analytics platform. But is it? Yes and no. There is a good article by one of the authors over at VaultSpeed that discusses this in more detail.

 

One myth is that Data Vault is only good for large enterprise solutions or solutions with very high data volumes, but not for small solutions. We disagree with this assumption because of the scalability of the Data Vault 2.0 concept: many of our client solutions start small, with additional requirements added later as needs grow. Scalability also refers to functionality; with Data Vault, it is easy to add features to the data analytics platform later, such as real-time processing, security or data lakes. It does, however, require an extensible architecture that keeps these potential future extensions in mind.

 

We will start discussing the Data Vault 2.0 architecture in the next article and cover the remaining Data Vault 2.0 pillars and the potential for automating the development of the data analytics platform in subsequent articles. If you can’t wait, check out the book “Building a Scalable Data Warehouse with Data Vault 2.0” by Dan Linstedt and Michael Olschimke, which is based on on-premises Microsoft SQL Server. It gives you the patterns and practices in detail; you can think of this series as a small update to the book.

 

About the Authors

  • Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big-Data consulting firm in Europe, empowering clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing individuals from the industry, taught classes in academia, and publishes on these topics on a regular basis.

 

  • Marc Winkelmann works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply and facility management sectors. In 2020 he became a Data Vault 2.0 instructor for Scalefree.

 

  • Jonas De Keuster is VP Product Marketing at VaultSpeed, a best-in-class data warehouse automation solution to speed up the process of data integration, building on the Data Vault 2.0 methodology. Jonas has close to 10 years of experience as a DWH consultant in various industries like banking, insurance, healthcare, and HR services. This background allows him to help maintain VaultSpeed’s product-market fit and engage in conversations with members of the data industry.

 

