Data Science with Azure Synapse and Data Vault 2.0
Published Jun 30 2023

 

The use of managed self-service BI with Data Vault 2.0 is demonstrated. The architecture is described, and its processes are explained and compared to “classical” data science approaches (e.g., sandboxing).

 

Introduction

In our previous articles of this blog series, we introduced Data Vault 2.0 as a concept for designing data analytics platforms, along with both the Data Vault 2.0 reference architecture and the real-time architecture. These architectures have one thing in common: they are set up and further developed by the data warehouse team.

 

In this article, we discuss another extension to the reference architecture: the integration of custom solutions for power users and data scientists. While the data warehouse team provides the infrastructure and many capabilities, the development of these solutions is driven by the power users and data scientists themselves. This concept, referred to as “managed self-service BI,” is used to transform an organization into a data-driven one. The goal is to make data and information widely available to any decision-maker. And there are plenty of them: in our opinion, every employee, from the C-level down to the lowest level, needs to make decisions - otherwise one would have to question why their job has not been automated yet. Too often, this decision-making is based on gut feeling. The goal of managed self-service BI is to turn these decisions into rational decisions made by educated, i.e., knowledgeable, decision-makers.

 

Reasons for Self-Service and Data Science

Managed self-service BI supports two major use cases. The first is the case of power users who want to extend the data analytics platform with additional, custom-built solutions. These solutions should use artifacts from the controlled area, including data artifacts from the Raw Data Vault, business rules from the Business Vault, and information artifacts from one of the information marts. But power users can also extend the solution with additional data not yet integrated by the data warehouse team. They can create their own information marts, called user marts, and create their own dashboards.

 

We are going to discuss how they do it, because what we want to avoid is power users creating their solutions in the IT-controlled areas of the data analytics platform. Instead, they get their own area, known as the user space. This is where they can add additional data, extend the solutions provided by the data warehouse team, and create their user marts.

 

The second use case is data science: here, the final solution is often not built on the data analytics platform at all, or the platform is only used for ad-hoc solutions. The artifacts provided by the Data Vault 2.0 architecture can be used to augment data flows, both with additional raw data that is easily available from the data analytics platform and with results from the available business logic. By using the business logic from the Business Vault, data scientists can leverage code that already exists, thus increasing the re-usability of the business logic and reducing inconsistencies in the final solutions.

 

Self-Service BI

The common approach to the use cases described in the previous section is to use technology that allows power users and data scientists to access the data sources directly, circumventing the centralized data warehouse, and to build their own solutions.

 

The result? Because every user creates their own solutions, they have to reinvent the same business logic and other transformation logic over and over again. There might be shared documentation on how to implement certain solutions, e.g., for the de-duplication of customer records, but users might misinterpret the documentation or develop code that deviates from it. These deviations can have many reasons: development errors, over-simplifications and workarounds, and design issues, just to name a few. Providing re-usable code libraries might be a solution to some extent, but only as long as power users can agree on a common toolset. We have seen this once in a while in our projects with mid-size and enterprise clients, but typically only for a limited period of time.

 

Another issue is that power users have to deal with regulatory requirements such as GDPR when extracting data directly from the source system. Because self-service largely follows an unmanaged approach, power users and data scientists have to reduce or delete personal data when required, secure the local datasets when sharing them with their peers, and maintain auditability. For most users, this presents a burden, especially when more casual power users, who are not used to such practices, just need quick ad-hoc reporting.

 

Managed Self-Service BI Architecture

Another issue we have with the industry definition of self-service BI is the distinction between power users and casual users. From our experience, this is too black and white. Instead, we see a lot of gray area between power users and casual users.

 

The most “casual” users we ever encountered were board members of a large enterprise who received all BI reports in print, just like their emails. No judgement at all here; they knew how to do their business, but their business was certainly not IT. Somewhat more advanced users know how to read their emails in their email application and how to open a static PDF report. More advanced users know how to log in to the BI dashboarding software, and even more advanced users know how to drill down the pre-configured dimensions on a report. Sophisticated users know how to customize the dashboard by adding additional dimensions, removing others, and configuring different aggregations on measures.

 

But are they power users? It depends on your perspective. If you are in the data analytics space, you might consider those with SQL capabilities to be power users - those who log in to the database and query the information marts by joining dimensions to facts and aggregating using SQL functions. Or those users who are not satisfied with the pre-defined dimensional models in the information mart and instead access the Data Vault model to build their own solutions.
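For illustration, a query like the following is what such a SQL-capable user might run against an information mart. This is only a sketch; the schema, table, and column names (mart.fact_sales, mart.dim_customer) are hypothetical:

```sql
-- A typical power user query against an information mart:
-- join a fact table to a dimension and aggregate a measure.
-- All schema, table, and column names are illustrative.
SELECT
    d.customer_region,
    SUM(f.sales_amount) AS total_sales
FROM mart.fact_sales AS f
JOIN mart.dim_customer AS d
  ON d.customer_sk = f.customer_sk
GROUP BY d.customer_region
ORDER BY total_sales DESC;
```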

 

Data Vault 2.0 was a journey for us when we learned it, and it is also a journey for every BI user. As they experience the limitations of the predefined solutions, they start digging deeper into the architecture, learning as they dig. So where do you draw the line between casual users and power users? The reality is, there is none - it’s a gray area, and it depends on your perspective whom you consider a casual or a power user.

 

With that in mind, the architecture should serve its users. Managed self-service BI is a concept based on the Data Vault 2.0 architecture that extends it with a user space:

 

[Figure: The Data Vault 2.0 reference architecture extended by the user space]

 

The user space, also described in a Scalefree blog article, is not limited to the relational database. Every environment of your data analytics platform - for example, the relational database, the data lake, the real-time feed, etc. - deserves a user space, because power users might want to process binary data that should not be loaded into a relational database.

 

While the IT-driven area above the user space is controlled by the data warehouse team, the user space is managed by the power users who build their own solutions in this area. It is similar to the sandboxing of other self-service approaches to some extent, but without any time limitation: users build these solutions out of a specific need. And we - well, you, dear reader - don’t have enough resources to build all the solutions they might require. Even if your organization had unlimited financial resources, there would not be enough experienced and skilled developers available on the market to build all their solutions. That’s why they have to build these solutions themselves, and why you should not take them away after 90 days. In 90 days, you will still lack the resources to build all their solutions.

 

The user space acts like a green field where power users can do whatever they want. Well, for starters, we would advise them to follow some best practices: for example, an upstream model that follows Data Vault notation using hubs, links, and satellites, and a downstream model for information delivery that often follows dimensional modelling with facts and dimensions. We call the latter user marts, but they are part of the user space.
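To make these best practices concrete, here is a minimal sketch of such a user-space model in T-SQL. All schema, table, and column names (user_space, hub_product, sat_product_scoring, dim_product) are our own illustrative choices, not prescribed by Data Vault 2.0:

```sql
-- Upstream: a Data Vault style model (hub + satellite) in the user space.
CREATE TABLE user_space.hub_product (
    hub_product_hk CHAR(32)      NOT NULL,  -- hash key of the business key
    product_bk     NVARCHAR(50)  NOT NULL,  -- business key
    load_date      DATETIME2     NOT NULL,
    record_source  NVARCHAR(100) NOT NULL
);

CREATE TABLE user_space.sat_product_scoring (
    hub_product_hk CHAR(32)      NOT NULL,  -- references the hub's hash key
    load_date      DATETIME2     NOT NULL,
    score          DECIMAL(5,2)  NULL,      -- user-defined descriptive attribute
    record_source  NVARCHAR(100) NOT NULL
);
GO

-- Downstream: a dimensional user mart for information delivery.
CREATE VIEW user_space.dim_product AS
SELECT h.product_bk, s.score
FROM user_space.hub_product AS h
JOIN user_space.sat_product_scoring AS s
  ON s.hub_product_hk = h.hub_product_hk;
```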

 

But in reality, power users simply “create entities” in their user space. Let’s take a look at a “randomly picked” (my) user space from our internal data analytics platform at Scalefree:

[Figure: Screenshot of the author's user space on Scalefree's internal data analytics platform]

 

The user of this user space mostly uses it to upload some data, wrangle it using SQL tools, and offload it to the next application - it is used to leverage some SQL capabilities. Such users exist as well, and in this particular case, the user space has apparent issues:

  • “Delete me” tables that apparently should be removed already (or later)
  • “Employee” records that contain highly personal data
  • Sensitive opportunity and call records
  • Data from senior tests conducted at Scalefree
  • Outdated lead records from 2018

There are no Data Vault entities such as hubs, links, or satellites. There are reference tables (indicated by an “r”), but they are due to be deleted. The good news is that the user space is secured, and the only person with access is me, the owner. But it is a mess and needs a bit of management.

 

“Managing” Self-Service BI

The issue we have with pure “self-service BI” is that it lacks this management. Users can do whatever they want to do and have the capabilities for. The “green field” from above is unlimited, except by time - the one limit that at least helps to reduce the mess on a regular basis.

 

The user space is this green field, but we already discussed some best practices (using Data Vault and dimensional models in the user space).

 

But what about some fences?

 

What about a fence for data and information security, where power users cannot share any data with just anybody but must follow a data security guideline? What about a privacy guideline that forbids storing personal data in the user space? What about a fence to maintain auditability?

 

In the beginning, these fences are just documents on a central wiki - but how can adherence to these guidelines be checked? One of our clients used to have an environment where users were sending spreadsheets back and forth by email. There was a guideline stipulating that a user could only share data they had access to with employees holding the same security clearance level as themselves. Based on manual spot checks, the organization confirmed that users were largely following these guidelines. However, both the application of the guidelines and the checks were based on manual, error-prone activities.

 

The advantage of the user space on the data analytics platform is that it is possible to monitor the power users’ adherence to the guidelines. For example, retrieving the GRANTs on the entities in the user space tells you who can access the solutions in the user spaces. Try this with your email server: scanning all emails for spreadsheets and their actual content, the recipients, senders, required and actual security levels, etc. Technically it might be possible, but it’s a much tougher task than querying the information schema of the database.
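As a sketch of such a check - assuming a SQL Server-family database (such as an Azure Synapse dedicated SQL pool) and a user-space schema that is simply named user_space - the following query lists who holds which permission on which user-space object:

```sql
-- List all object-level grants in the (hypothetical) user_space schema
-- so the data warehouse team can see who can access which solution.
SELECT
    o.name        AS object_name,
    pr.name       AS grantee,
    pe.permission_name,
    pe.state_desc AS grant_state          -- GRANT, DENY, GRANT_WITH_GRANT_OPTION
FROM sys.database_permissions AS pe
JOIN sys.objects AS o
  ON o.object_id = pe.major_id
JOIN sys.database_principals AS pr
  ON pr.principal_id = pe.grantee_principal_id
WHERE pe.class = 1                        -- object or column permissions
  AND SCHEMA_NAME(o.schema_id) = 'user_space'
ORDER BY o.name, pr.name;
```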

 

At Scalefree, we use “smart” business keys, where contacts have a prefix of “C-”, leads a prefix of “L-”, and employees a prefix of “E-”. An employee business key might be “E-0048,” for example. Based on these patterns, it is possible to identify tables in the user spaces that refer to contacts, leads, and employees, and to inform users when a consumer has requested their right to be forgotten according to GDPR. Depending on the service level agreement between the data warehouse team and the power users, the team might either just inform the power users and leave it to them to remove or reduce the record (the most common practice) or delete the record automatically (less common).
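As a sketch of such a pattern scan, again in T-SQL and with purely illustrative naming conventions (business key columns ending in “_bk”, a user_space schema): the first query finds candidate tables, the second checks one of them for the data subject’s key:

```sql
-- Step 1: find user-space columns that may hold smart business keys.
-- The column naming convention ('..._bk') is an assumption for this sketch.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'user_space'
  AND COLUMN_NAME LIKE '%[_]bk';

-- Step 2: for each candidate found above, check whether it contains the
-- key of the person to be forgotten, e.g. employee 'E-0048'
-- (the table and column names here are hypothetical).
SELECT COUNT(*) AS affected_rows
FROM user_space.employee_snapshot         -- hypothetical table from step 1
WHERE employee_bk = 'E-0048';             -- 'E-' prefix marks employee keys
```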

 

It might sound advantageous to create many fences (i.e., guidelines) on how to use the user space. But every fence limits the potential applications of the user spaces: the more guidelines and regulations are put in place, the fewer solutions can be built using the user space on the data analytics platform, and the more users will go around it and find alternative solutions. And one popular alternative we all want to prevent as much as possible is the use of spreadsheets for data analytics and everything that entails (attaching them to emails, for example).

 

But the “managing” in managed self-service BI is not limited to the user space, which is a technical solution. More aspects require management: for example, a book of standards should help power users identify preferred tools that are supported by the data analytics team. Infrastructure needs to be managed. Data needs to be managed: as a power user, I would expect that I don’t need to reduce or delete personal data from the platform, as it is maintained by the data analytics team. This requires data governance, including the availability of data and API catalogs.

 

There should also be support processes in place for the power users. This can happen on demand, for example by providing them a support hotline, or proactively through documentation and training - but also through monitoring. Typically, most of our clients monitor the user spaces to check the power users’ adherence to the guidelines, but also their resource consumption (storage, compute), security (number of grants), stability (how many CREATE, ALTER, and DROP statements were executed lately), and impact on the organization.
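For the stability metric, one possible sketch - assuming a database engine that supports SQL Server-style DDL triggers, with all object names being our own - is to log every table and view DDL statement and count the user-space activity later:

```sql
-- Hypothetical log table for DDL activity.
CREATE TABLE dbo.user_space_ddl_log (
    event_time  DATETIME2     NOT NULL,
    login_name  NVARCHAR(128) NOT NULL,
    event_type  NVARCHAR(100) NOT NULL,   -- e.g. CREATE_TABLE, DROP_VIEW
    schema_name NVARCHAR(128) NULL
);
GO

-- Database-level DDL trigger that records every table/view DDL event.
CREATE TRIGGER trg_log_user_space_ddl
ON DATABASE
FOR DDL_TABLE_EVENTS, DDL_VIEW_EVENTS
AS
BEGIN
    DECLARE @e XML = EVENTDATA();
    INSERT INTO dbo.user_space_ddl_log (event_time, login_name, event_type, schema_name)
    VALUES (
        SYSDATETIME(),
        ORIGINAL_LOGIN(),
        @e.value('(/EVENT_INSTANCE/EventType)[1]', 'nvarchar(100)'),
        @e.value('(/EVENT_INSTANCE/SchemaName)[1]', 'nvarchar(128)')
    );
END;
```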

 

Managed self-service BI should only be used for solutions with a relatively low impact on the organization: these solutions should be used by individual power users, and maybe their teams or their direct manager. But these solutions often increase their impact over time - users keep sharing the information with their peers, and all of a sudden, 500+ users, or the CFO, are relying on the power user’s solution from the user space.

 

Industrialization

Monitoring helps us to identify those solutions that exceed the thresholds we defined for impact, stability, security violations, or resource consumption. Once a solution hits one of the defined thresholds, it deserves additional review and maybe industrialization.

 

Industrialization is an alternative to the time limitation of the traditional sandboxing approach: instead of just shutting down the solution after it exceeds the time limit, the goal of the industrialization effort is to analyze the solution in the user space, redesign it if necessary, and re-implement it in the IT-controlled area of the data analytics platform. In effect, that is another development sprint that follows the standards for design, implementation, testing, and documentation of new data-driven solutions. The user space can serve as a basis for requirements gathering or, in very good cases, the solution is based on the same standards, including Data Vault models, and can easily be lifted. The closer the power user develops to the development standards, the easier the industrialization of the solution becomes.

 

For example, if a power user is no longer executing DDL statements, such as CREATE TABLE or ALTER VIEW, on their user space, that is an indicator that the solution is stable and maybe just done. Typically, most power users would actively try to hand over the solution to the data analytics team because, while building these solutions is fun, maintaining them is not. And that’s when the data analytics team comes to mind (for a good reason). In many other cases, though, the solution is somehow maintained with more or less success. Once a solution is stable, it can be argued that the time of experimentation is over, and the solution should be rebuilt using standard development techniques in the IT-controlled area.
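Building on a DDL log like the one sketched above, such a stability check could be as simple as counting recent DDL statements per user-space schema; zero statements in the last 90 days would suggest that the solution has settled:

```sql
-- Count DDL statements per schema over the last 90 days, using the
-- hypothetical dbo.user_space_ddl_log table from the earlier sketch.
SELECT
    schema_name,
    COUNT(*) AS ddl_statements_last_90_days
FROM dbo.user_space_ddl_log
WHERE event_time >= DATEADD(DAY, -90, SYSDATETIME())
GROUP BY schema_name
ORDER BY ddl_statements_last_90_days DESC;
```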

 

Prototyping new Data

We used this pattern with a client in the media industry that was facing decreasing revenues due to changing media consumption patterns among consumers. In one of the client meetings, a business user literally told us the reason why revenues drop over time - he said, “These reports are extremely sad, not just because of the shrinking revenue, but because we also know the reasons - our newspaper subscribers are just dying. Young people don’t read newspapers anymore.” We don’t know if all these facts are true, but that was the business environment the client was in.

 

But their situation also presented business opportunities. They had to make a choice: either accept their fate and eventually perish into history, or accept the challenge to survive. They went for the challenge and tried to come up with the “next big thing” for their readers: a local wiki, a dating app, etc. - anything that could relate to their typical audience and the next generation. The business user who complained about the shrinking business also told us that the approach was quite simple: try out a business prototype and generate a user base first. Once they had a user base, they could try to generate revenue from it.

 

But that required adjusting the business model of the prototype. What if the growth rate of the user base did not meet their expectations? By modifying the business model and its prototype, they could fine-tune the growth. The primary issue for the business user was that, while they would get the data from the web servers for analytical purposes, it “did not fit into the spreadsheet application” anymore. By chance, the internal data warehouse was built using Data Vault 2.0. The data analytics team provided the business users with a user space where they could upload the data, join it with existing data and information artifacts (for example, for benchmarking), and use their insights to modify the business prototype.

 

Their issue was that they were not yet at the point where they could define the business rules for transforming the incoming dataset into useful information, because they did not even know what the data would look like - the data simply described the (changing) business prototype. The client eventually used managed self-service BI for prototyping, and when one out of ten of these business prototypes eventually generated revenue, the solution was industrialized.

 

Extending Existing Standard Solutions

Another interesting example is found quite often: the IT-driven part of the data architecture provides data artifacts that are loaded from data flows already ingested with standard processes. It further provides existing business rules and information assets.

 

But what if a power user doesn’t agree with the business rules applied? Or wants to extend the information marts with additional dimensions? It is actually easy to extend existing standard solutions and thereby customize them:

 

[Figure: Extending an existing information mart with entities from the user space]

For example, the power user could create a new computed satellite in the user space, based on some satellite from the Raw Data Vault or Business Vault, and attach it to an existing hub (or link) - often, but not necessarily, the same parent as the original satellite’s. This user-defined satellite only exists in the user space but can now be used by the power user to create a custom dimension. For this purpose, the power user could copy the DDL statement for a dimension view in an information mart and customize it to use the satellite data from the user space. Since a power user cannot create anything in the IT-driven area of the architecture, they have to create these dimensions in the user space as well. Additional information about the combination of the user space with the Data Vault 2.0 model can be found on the Scalefree blog.
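A minimal sketch of this pattern in T-SQL follows; the schema and object names (rdv for the Raw Data Vault, user_space for the user space, and all tables and columns) are illustrative assumptions, not part of any standard model:

```sql
-- A computed satellite in the user space, derived from a Raw Data Vault
-- satellite and attached to the existing customer hub via its hash key.
CREATE VIEW user_space.csat_customer_segment AS
SELECT
    s.hub_customer_hk,
    s.load_date,
    CASE WHEN s.annual_revenue >= 1000000 THEN 'enterprise'
         ELSE 'smb'
    END AS customer_segment               -- the user's own business rule
FROM rdv.sat_customer AS s;
GO

-- A custom dimension in the user space: a copy of the standard dimension's
-- DDL, customized to include the user-defined satellite's attribute.
CREATE VIEW user_space.dim_customer_custom AS
SELECT
    h.customer_bk,
    c.customer_segment
FROM rdv.hub_customer AS h
JOIN user_space.csat_customer_segment AS c
  ON c.hub_customer_hk = h.hub_customer_hk;
```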

 

Once the dimension is ready, it can be consumed by a custom-built dashboard, replacing the standard dimension with the modified one.

 

The only problem is that it is hard to distinguish between standard reports, built by the data analytics team, and power user reports (believe me, I could build a report in a spreadsheet application, with fake data, that looks exactly like a report from your BI platform). To help casual users, who just pull some report, distinguish between these two types of reports with their different quality levels, we typically advise our clients at Scalefree to add a footnote to the standard reports. This footnote states something like “this is a standard report, built by the data analytics team, budgeted by xyz” to help casual users identify these reports. Power users might add any footnote they want, but certainly not “built by the data analytics team.” If they do, you’ve got a problem in your organization - but not a technical one.

 

Conclusion

Managed self-service BI is a pattern that works well in organizations that are driven by the business. There are many more successful use cases where this concept has played a key role in the adoption of the platform and the agility of the data analytics team in providing data (and information) to decision-makers. With the security and privacy features of Data Vault 2.0, it helps these organizations make data widely available in time, within the limits set forth by the security team. It’s the tool to empower traditional organizations to become data-driven.

 

About the Author

Michael Olschimke is co-founder and CEO of Scalefree International GmbH, a big-data consulting firm in Europe that empowers clients across all industries to take advantage of Data Vault 2.0 and similar big-data solutions. Michael has trained thousands of data warehousing professionals in the industry, taught classes in academia, and publishes on these topics on a regular basis.

 

<<< Back to Blog Series Title Page
