Real-Time Processing with Data Vault 2.0 on Azure

 

Introduction

In the last article, we discussed the Data Vault 2.0 architecture with regard to batch processing. However, Data Vault 2.0 is not limited to batch processing: it is possible to load data at any speed, in batches, via CDC, in near real-time, or in actual real-time.

 

This article builds on the previous article but extends the architecture with real-time processing.
We discuss several options for ingesting real-time data and how we, at Scalefree, typically implement a real-time Data Vault 2.0 architecture on the Azure cloud.

 

Traditional Batch Loading

Traditionally, most data warehouse systems were loaded in nightly batches. A very common scenario was that the data from the data sources had to be delivered by a certain time in the night (e.g., 2 a.m.) to be transformed and loaded into the “core data warehouse” layer. The loading of this layer had to be completed by 8 a.m., when the office doors opened and the first employees wanted to retrieve information from the reporting solution. This timeframe defines the boundaries of the limited “loading window” for the data warehouse.

 

One problem with this method is that usable information is only delivered at a defined point in time; this will be discussed in a later article in this series.

 

This article focuses on data ingestion, where a clear trend is visible in the industry: loading data more often. The issue with nightly batches is that they lead to a peak of data processing overnight. To make sure that all data can be processed during the loading window at night, organizations must size their infrastructure to deal with the expected maximum peak of required computing power.

 

When using real-time approaches, the loading window is extended to 24 hours. The argument is often brought forward that real-time processing might impact information delivery. First, if that really is a problem, today’s cloud resources can be scaled up and down dynamically; second, ingestion and information delivery are often separated in a well-designed architecture anyway. In practice, the objection does not hold: the overnight peak is gone and, instead, smaller subsets of data are processed occasionally throughout the day, each of which is smaller than the former overnight peak.

 

Real-Time vs. Near Real-Time

The traditional batch loading pattern is not outdated; many data sources today will continue to be loaded in batches. However, more and more data sources are loaded more often. Clients with a high volume of data, for example telecommunication providers, load some of their data, especially the call records, incrementally at least every hour. This makes sense, as these call records are created incrementally in the business.

 

From our perspective, there is no need to limit the loading to once per hour, especially for this incremental data. Therefore, many organizations load their data even more often. Once they load mini-batches at least every fifteen minutes, we typically call this near real-time. In this scenario, the data is not loaded immediately but within a short period of time, no longer than the aforementioned fifteen minutes; in the meantime, the data is stored in a cache.
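
To make this concrete, the following is a minimal sketch of such a mini-batch cache in plain Python; it is purely illustrative, and all names (such as load_into_data_platform) are hypothetical rather than part of any product discussed here. Incoming records are collected in a cache and flushed as a mini-batch once the configured interval has elapsed:

```python
import time

class MicroBatchLoader:
    """Buffers incoming records and flushes them as a mini-batch
    once the configured interval has elapsed (near real-time)."""

    def __init__(self, flush_interval_seconds=15 * 60):
        self.flush_interval = flush_interval_seconds
        self.cache = []                          # the cache mentioned above
        self.last_flush = time.monotonic()

    def ingest(self, record):
        self.cache.append(record)
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.cache:
            load_into_data_platform(self.cache)  # hypothetical mini-batch load
            self.cache = []
        self.last_flush = time.monotonic()

def load_into_data_platform(batch):
    print(f"loading a mini-batch of {len(batch)} records")  # placeholder for the actual load

# usage: ingest records as they arrive; force a final flush when shutting down
loader = MicroBatchLoader()
loader.ingest({"caller": "+49 511 87989341", "duration_sec": 720})
loader.flush()
```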

 

Actual real-time or message streaming goes further: why wait for certain records to accumulate? Instead, in actual real-time, every single message, for example about a phone call, is loaded directly into the data analytics platform without any cache. Caching implies that the data is later pulled out of the cache; in actual real-time, the data is always pushed.

 

This doesn’t mean that the data is available immediately in the presentation layer (e.g., the dashboard); processing the messages might take some time. The acceptable processing delay is typically defined by the consequences of missing a deadline; we distinguish between hard, soft, and firm real-time:

  • Hard real-time systems fail if even one message deadline is missed (think of a nuclear power plant and the message about a failed cooling system arriving late…).
  • Soft real-time systems can cope with missed deadlines, but the longer a message takes, the lower the quality of the overall system.
  • Firm real-time systems discard late-arriving messages, but the overall system will not fail due to some messages arriving late.

The requirements are typically set by the information consumer, who will also define the consequences of missing deadlines. But the distinction between near real-time and actual real-time has two dimensions: a user might define that the dashboard should deliver information in actual real-time or every X minutes. However, if the source is not delivering its data in real-time, the information requirement cannot be met. On the other hand, if a source delivers data in actual real-time, the information consumer might still define the delivery of daily reports, ignoring the new real-time data until the next defined reporting date.

 

Later in this article, we will discuss how to decouple the real-time data ingestion from information delivery to allow information delivery to follow its own pace.

 

We should discuss one more definition before we conclude this section: message queues vs. message topics. The difference is relatively simple. Message queues distribute workload among multiple worker roles (consumers) attached to the queue: once a message arrives in the queue, only one of the worker roles will receive the message for processing. Therefore, message queues are a great tool for scaling out; when a peak of messages arrives, just add more worker roles to process the messages in the queue. Message topics are different; they are like a radio station. Every message received from the publisher is sent to every subscriber. Therefore, they are a great tool for information delivery when all dashboards and subscribing apps should receive the same information.
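
The distinction can be illustrated with a small, technology-neutral Python sketch (not tied to any specific Azure service): the queue hands each message to exactly one of its worker roles, while the topic fans every message out to all subscribers.

```python
import itertools

class MessageQueue:
    """Competing consumers: each message goes to exactly one worker role."""
    def __init__(self, workers):
        self._workers = itertools.cycle(workers)   # simple round-robin dispatch

    def publish(self, message):
        next(self._workers)(message)               # only one worker processes the message

class MessageTopic:
    """Publish/subscribe: every message is delivered to every subscriber."""
    def __init__(self, subscribers):
        self._subscribers = subscribers

    def publish(self, message):
        for subscriber in self._subscribers:       # all subscribers receive the message
            subscriber(message)

# usage: scale out a queue by adding workers; a topic behaves like a radio station
queue = MessageQueue([lambda m: print("worker 1:", m), lambda m: print("worker 2:", m)])
topic = MessageTopic([lambda m: print("dashboard:", m), lambda m: print("mobile app:", m)])
queue.publish("call record 42")   # handled by exactly one worker
topic.publish("KPI update")       # received by every subscriber
```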

 

Real-Time Integration

There are many blueprints available for real-time architecture. We want to present and discuss two architectures often seen in the industry: the lambda architecture and the Data Vault 2.0 architecture.

 

The lambda architecture, shown in the diagram below, distinguishes between a speed layer and a batch layer:

[Figure: The lambda architecture with speed layer, batch layer, and serving layer]

 

The speed layer processes the real-time messages with the appropriate tools. The requirement here focuses on speed rather than reliability: by definition, the results don’t have to be complete, and the focus is on throughput and performance rather than accuracy.

 

Accuracy and completeness are provided by the batch layer, which is used to process high volumes of data in regular batches. The problem with this layer is that there are gaps on the timeline between the batches; these gaps are filled by the speed layer as much as possible.

 

A serving layer integrates the data from both the speed layer and the batch layer for presentation purposes.

 

While the lambda architecture, first described by Nathan Marz, has been well received by the industry, we see some issues from a Data Vault 2.0 perspective. For example, the architecture only implements a single layer in each of the data flows, transforming the incoming data on-the-fly into the serving layer. To be more precise, redundant transformations are done in parallel in both the speed and batch layers and, at least in pure theory, there is no defined layer for capturing the raw, unmodified data, for example for auditing purposes. Microsoft advises its clients to land data in its raw form in a data lake from both the hot and cold paths. This is a much better approach than historizing the message queue, a strategy some organizations attempt.

 

Another issue is the assumption that the batch layer provides the real-time data again, just in batches. At least among our clients and their source systems, this is often not the case. We need a solution that can deal with unreliable sources under high auditing requirements: real-time data might or might not be delivered again in the more batch-oriented loads “overnight.” We have discussed the lambda architecture in additional detail on our YouTube channel.

 

The alternative architecture often found in the industry is the kappa architecture. However, this one assumes that batches can be loaded via real-time feeds. Yes, it works technically, but the message queue was designed for delivering messages, not batches of data. This architecture has also been misused by many clients who spent a lot of time and effort exporting their data to message queues and topics, when the solution would have been fairly simple: just dump the data into the lake (for now).
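
For illustration, “just dump the data on the lake” can be as simple as the following hedged sketch using the azure-storage-file-datalake package for Data Lake Storage Gen2; the connection string, container name, and target path are placeholders you would replace with your own.

```python
from azure.storage.filedatalake import DataLakeServiceClient

CONNECTION_STR = "<storage-account-connection-string>"   # placeholder

def dump_to_lake(payload: bytes, target_path: str) -> None:
    """Land a raw source extract in the data lake without any transformation."""
    service = DataLakeServiceClient.from_connection_string(CONNECTION_STR)
    filesystem = service.get_file_system_client("raw")          # landing zone container (assumed name)
    file_client = filesystem.get_file_client(target_path)
    file_client.upload_data(payload, overwrite=True)            # dump the data as-is

# usage: land a source extract under a date-partitioned path
dump_to_lake(b'{"caller": "+49 511 87989341"}', "crm/2023/06/30/extract.json")
```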

 

At first glance, the Data Vault 2.0 architecture is similar to the lambda architecture:

[Figure: The Data Vault 2.0 architecture extended with real-time message streaming]

 

 

The foundation for this architecture was shown in the last article of this blog series; the above diagram adds the details required for real-time processing.

 

The batch-driven part via the data lake remains unmodified. The real-time part, labeled “message streaming,” is an extension to the existing architecture; it captures the real-time data and integrates it with the batch-driven architecture at multiple points in the data flow:

  1. In the real-time flow, multiple layers are implemented. First, the message is delivered by the publisher to a message queue. Typically, this is not done directly but via a WebSocket, a REST interface, or something similar.

 

  2. The messages are then picked up and processed by a Raw Data Vault (RDV) loader. This worker role is responsible for ingesting the incoming raw data into the Raw Data Vault. In real-time systems, this is done directly into the Raw Data Vault layer - there is no need to store the data in the data lake per se. However, all messages are forked into the data lake for multiple reasons: first, to capture the data in its original format, especially when the message queue contains many different data flows that are not yet all captured by the Raw Data Vault. The data lake will capture all data easily, without further analysis required. Once the data is needed to produce new business value, the messages are analyzed and loaded into the Raw Data Vault continuously.

The other reason to store the incoming message flow in the data lake is for the data scientists, who might want to ignore the Data Vault at the beginning of their journey.

 

  3. Once the Raw Data Vault loader has inserted the records into the Raw Data Vault, the message is forwarded to another message queue for the next layer.

 

  4. Those messages are then picked up by the Business Vault (BV) processor and used to trigger the application of business logic in the worker role.

 

The business logic can be based solely on the incoming message. However, in many cases, the Business Vault processor will integrate data from the Raw Data Vault and the Business Vault on disk, for example when enriching real-time messages with data from the batch-oriented flow. The combined data can then be used to calculate any complex business logic in real-time.

 

  5. Once the message processing is done, the worker role decides whether the result is relevant for the subscribing application, such as a dashboard or app. If so, a message is sent to the subsequent message topic to inform the dashboards. This message is typically generated by the Business Vault processor as a result of the incoming raw message flow and the applied business logic and, in many cases, only informs the dashboard that something relevant has happened (a sketch of this flow follows after this list).

 

In those cases, the dashboard has to query an information mart for more detailed information to present to its users. This way, any limitation regarding message size is accounted for. If an information message is rather small, the detailed information can also be sent directly to the dashboard via the message topic.
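
The following minimal sketch illustrates steps 2 to 5 in plain Python, with in-memory stand-ins for the message queues, the data lake, the Raw Data Vault, the Business Vault, and the message topic. The business rule and all names are hypothetical and only meant to show the flow, not the actual Azure implementation described later in this article.

```python
from queue import Queue

incoming_queue, rdv_to_bv_queue = Queue(), Queue()   # message queues between the layers
dashboard_topic = []                                 # stand-in for the message topic
data_lake, raw_data_vault, business_vault = [], [], []

def raw_data_vault_loader():
    """Steps 2 and 3: ingest the raw message, fork it to the data lake, forward it."""
    message = incoming_queue.get()
    data_lake.append(message)                # capture the original, unmodified message
    raw_data_vault.append(message)           # insert into the Raw Data Vault structures
    rdv_to_bv_queue.put(message)             # forward to the next layer

def business_vault_processor():
    """Steps 4 and 5: apply business logic and notify subscribers only if relevant."""
    message = rdv_to_bv_queue.get()
    result = {"caller": message["caller"], "long_call": message["duration_sec"] > 600}
    business_vault.append(result)            # optionally materialize the result
    if result["long_call"]:                  # only relevant results reach the topic
        dashboard_topic.append({"event": "long_call_detected", "caller": result["caller"]})

# usage: push one raw message through the flow
incoming_queue.put({"caller": "+49 511 87989341", "duration_sec": 720})
raw_data_vault_loader()
business_vault_processor()
print(dashboard_topic)
```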

 

This architecture, which is also discussed in this video, follows the idea of pushing messages all the way from the publisher downstream to the subscriber. The messages are absorbed by loading them into the Raw Data Vault and by forking them off into the data lake, but the main process is the push inside the message streaming area. If the real-time messages were loaded via the data lake (instead of being pushed forward via the message queues and the worker roles), the real-time flow would turn into a near real-time flow; the data lake would act as a cache.

 

The major difference from the lambda architecture is that the Data Vault 2.0 architecture integrates at various levels. First of all, it integrates the raw data from the real-time feed and the batch feed; the lambda architecture integrates only the resulting information. That is done in the Data Vault 2.0 architecture as well, but in addition to the raw data integration, which follows the same principles as the batch feed described in the previous article.

 

The Value of Virtualization

Due to virtualization, the information marts are real-time in nature: real-time data is loaded by the RDV loader into the Raw Data Vault in real-time, and results from business logic can be written by the Business Vault processor in real-time into materialized tables in the Business Vault. The view just presents the results as a dimensional model or any other target model. When performance issues arise from these virtualized target models, consider introducing materialized bridge and PIT tables instead of materializing the views, a concept we will introduce in a later article in this blog series. Another approach is to denormalize some dimensions into the underlying bridge table.

 

If the performance of a relational database is not sufficient for the real-time flow, combine different technologies to meet your objectives: keep some data in a relational database and other data in Azure Cache for Redis, and combine this with Power BI’s capability to query Azure Cache for Redis directly.

 

The presented approach works as long as the standard principles of Data Vault 2.0 are followed: transform data and apply business rules only after the Raw Data Vault. Because business rules introduce data dependencies, this approach removes them from ingestion and allows data to be ingested and integrated in parallel. Based on these principles, every layer in the architecture could be distributed, from the data lake to the Raw Data Vault, the Business Vault, and the information mart layer.

 

Real-Time Data Vault on Azure

Microsoft Azure provides many commercial-grade components to set up the architecture described before. The following diagram shows the components of a typical Data Vault 2.0 architecture on Azure:

[Figure: The Data Vault 2.0 real-time architecture implemented with Azure services]

 

The above diagram shows how our Scalefree consultants typically set up the real-time architecture. It follows the same layered approach as the conceptual Data Vault 2.0 architecture:

  1. Data sources deliver data either in batches or real-time (or both). Depending on the delivery cycle, the data is either directly loaded into the Azure Data Lake (batches) or first accepted by the Event Hub (real-time).

 

  2. Event Hubs is a commonly used service in the Azure cloud and is used to continuously accept the data in real-time. Event Hubs provides an option to capture the messages directly into the Azure Data Lake,[1] which implements the recommended fork of the real-time data for later use by data scientists (if they don’t want to work directly with the message stream on the Event Hub).

 

  3. The Raw Data Vault loader (RDV loader) is based on Stream Analytics and is primarily used for separating the business keys from the relationships (between business keys) and from the descriptive data according to the target model in the Raw Data Vault (hubs capture the business keys, links capture the relationships and the event granularity, satellites capture the descriptive data). Stream Analytics can also be used for calculating hash keys for the target model, if required. While the standard Stream Analytics functions don’t allow the calculation of MD5 or SHA-1 hash values, this can be implemented using UDFs (the logic is sketched after this list). The Azure service works well for this task, as the real-time analytics service is designed for mission-critical workloads and supports the execution of SQL statements on the message stream in real-time.

 

  4. Once the data is captured by the Raw Data Vault, the message is forwarded to the Business Vault processor via one of the supported outputs. This component will deal with the application of transformation rules and other business rules in real-time. It will also produce the target message structure for consumption by the (dashboard) application. The Azure cloud provides several options for this, including Synapse, Spark, and Databricks.[2]

 

  5. The results from the Business Vault processor can be loaded into physical tables in the Business Vault on Synapse, but can also just be delivered in real-time (without materialization in the database). Once the optional materialization is done, the target message is generated and sent to the real-time information mart layer, implemented by a streaming dataset.

 

  6. The streaming dataset is then consumed by Power BI. The dashboard service will store the messages in a temporary cache for presentation purposes; however, this cache will expire quickly. On the other hand, the Synapse database has all data available for other uses, including strategic, long-term reporting. The purpose of the real-time stream is to be displayed only for a short period of time: most real-time dashboards show only the last five minutes, but that depends on the user requirements. Long-term dashboards, with visualizations covering more than one hour, should be loaded from the Synapse database.

Additional information about the use of real-time message streams, and alternatives to the suggested streaming datasets, can be found in the Power BI documentation.
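
As referenced in step 3, the RDV loader splits each message into business keys, relationships, and descriptive data, and calculates hash keys where required. The following sketch shows that logic in plain Python, with hashlib standing in for a Stream Analytics UDF; the call-record attributes and target entity names are hypothetical.

```python
import hashlib

def hash_key(*business_keys):
    """Data Vault style hash key: concatenate normalized business keys and hash them."""
    normalized = "|".join(str(key).strip().upper() for key in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()  # SHA-1 works the same way

def split_call_record(message):
    """Split one call-record message into hub, link, and satellite payloads."""
    caller_hk = hash_key(message["caller"])
    callee_hk = hash_key(message["callee"])
    call_hk = hash_key(message["caller"], message["callee"], message["call_start"])
    return {
        "hub_phone_number": [
            {"phone_number_hk": caller_hk, "phone_number": message["caller"]},
            {"phone_number_hk": callee_hk, "phone_number": message["callee"]},
        ],
        "link_call": {"call_hk": call_hk, "caller_hk": caller_hk, "callee_hk": callee_hk,
                      "call_start": message["call_start"]},
        "sat_call": {"call_hk": call_hk, "duration_sec": message["duration_sec"]},
    }

# usage: split one incoming message into its Raw Data Vault targets
print(split_call_record({"caller": "+49 511 87989341", "callee": "+49 30 1234567",
                         "call_start": "2023-06-30T03:28:00Z", "duration_sec": 720}))
```

In Stream Analytics itself, the hash calculation would be delegated to a UDF, as mentioned in step 3.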

 

Information Delivery in Real-Time

The architecture presented in this article can be used to implement various use cases: from real-time dashboarding and reporting to information delivery into mobile apps, data cleansing, and business process automation. The Scalefree blog describes them in more detail.

 

The following screenshots show the sourcing of the raw data and the KPI definition in Stream Analytics:

 

[Figure: Stream Analytics input configuration sourcing the raw real-time data from the Event Hub]

 

 

The first screenshot shows the sourcing of the real-time data for ingestion into the Business Vault processor. The Event Hub is located in an Event Hubs namespace, which might contain multiple Event Hubs. The consumer group defines the publish-and-subscribe mechanism, involving an event publisher and one or more event consumers.

 

The design allows projects to have both the Business Vault processor attached to the Event Hub as a consumer and the data lake as another, parallel target. Both have their own consumer group, where the current position of the processed messages is maintained. If one of the consumers were not able to process incoming messages immediately, for example due to an overloaded compute node, the slow-down and delay of that consumer would not affect the other consumer.
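
A hedged sketch of this parallel consumption, using the azure-eventhub Python SDK: the connection string, Event Hub name, and the consumer group names rdv-loader and datalake-fork are assumptions for illustration. In the architecture above, the two consumers are actually Stream Analytics and the built-in Event Hubs capture, so this sketch only demonstrates how each consumer group maintains its own position independently.

```python
from azure.eventhub import EventHubConsumerClient

CONNECTION_STR = "<event-hubs-namespace-connection-string>"  # placeholder
EVENTHUB_NAME = "call-records"                               # placeholder

def on_event(partition_context, event):
    # each consumer group tracks its own position, so a slow consumer in one
    # group does not delay the consumers of the other group
    print(partition_context.consumer_group,
          partition_context.partition_id, event.body_as_str())

def consume(consumer_group):
    client = EventHubConsumerClient.from_connection_string(
        CONNECTION_STR,
        consumer_group=consumer_group,
        eventhub_name=EVENTHUB_NAME,
    )
    with client:
        # blocks and reads from the beginning of the stream; checkpointing would
        # normally be added via a checkpoint store
        client.receive(on_event=on_event, starting_position="-1")

# run each consumer in its own process, for example:
# consume("rdv-loader")     # the path towards the Raw Data Vault
# consume("datalake-fork")  # the parallel capture into the data lake
```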

 

The next screenshot shows the definition of the KPI to be sent to the dashboard:

 

[Figure: Stream Analytics query defining the KPI to be sent to the dashboard]

 

 

In this case, the business logic is implemented directly in the outgoing message that should be sent to the dashboarding application, Power BI. The Stream Analytics job preview allows the modification of the query statement against the real-time message stream and provides a preview of the resulting message stream. This message stream will then be ingested in real-time by the next worker role, Stream Analytics application, or dashboarding tool. In this case, the number of calls occurring in a specific time range, grouped by the caller, is calculated and will be consumed by Power BI.

 

In Power BI, a visualization is created to present the results from the Stream Analytics streaming dataset in real-time:

 

 

[Figure: Power BI real-time visualization based on the streaming dataset]

Because the underlying data source is a real-time message feed, Power BI acts as a consumer of the results from Stream Analytics via the streaming dataset. Changes to the results, caused by new messages arriving on the source Event Hub, will trigger refreshes of the visualization.

 

The underlying streaming dataset provides some transient history, enough to present a line chart, for example for the last hour. Other than that, the streaming dataset will not materialize its data for long-term queries (there is no underlying database). However, if long-term analysis is required, the business logic should feed data into the Business Vault first for later re-use in both the real-time dashboard and the long-term, strategic dashboard.
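
For completeness: besides the Stream Analytics output used here, a custom worker role could also push rows into a Power BI streaming dataset through the dataset’s REST push URL. The following is a minimal, hedged sketch with the requests library; it assumes a streaming dataset created with the “API” option, whose push URL (including its key) has been copied from Power BI, and hypothetical column names matching the KPI above.

```python
from datetime import datetime, timezone
import requests

# placeholder: copy the push URL (with key) from the streaming dataset's API info in Power BI
PUSH_URL = "https://api.powerbi.com/beta/<workspace-id>/datasets/<dataset-id>/rows?key=<key>"

def push_kpi(caller: str, call_count: int) -> None:
    """Push one KPI row into the Power BI streaming dataset."""
    rows = [{
        "caller": caller,
        "call_count": call_count,
        "window_end": datetime.now(timezone.utc).isoformat(),
    }]
    response = requests.post(PUSH_URL, json=rows)
    response.raise_for_status()   # a 200 response means the rows were accepted

# usage: push the call count for one caller and time window
push_kpi("+49 511 87989341", 3)
```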

 

Outlook

In this article, we have discussed the general real-time architecture for Data Vault 2.0 and adapted it to the Azure cloud using typical components available in this offering. However, your implementation of the Data Vault 2.0 real-time architecture is not limited to the services we selected in this article. Other options are available, and the actual implementation depends on the tool stack you select based on your actual requirements. Additional information can be found in one of our Scalefree webinars on real-time processing.

 

We will cover the Databricks solution, as well as another architecture variation based on Data Vault 2.0’s managed self-service BI concept, in one of the following articles.

 

About the Authors

  • Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big Data consulting firm in Europe that empowers clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing professionals in the industry, taught classes in academia, and publishes on these topics on a regular basis.
  • Ole Bause works at Scalefree in the area of Business Intelligence, Data Engineering, and Enterprise Data Warehousing with Data Vault 2.0. He is a certified Data Vault 2.0 Practitioner and Azure Data Engineer Associate. Data warehouse automation is also one of his core areas of competence.

<<< Back to Blog Series Title Page

 

[1] Configure Event Hubs to capture data to Data Lake Storage Gen1

[2] Stream processing with Azure Databricks
