Having discussed the value of Data Vault 2.0 and the associated architectures in the previous articles of this blog series, this article will focus on the organization and successful execution of Data Vault 2.0 projects using Azure DevOps. It will also discuss the differences between standard Scrum, as used in agile software development, and the Data Vault 2.0 methodology, which is based on Scrum but also includes aspects from other methodologies. Other functions of Azure DevOps, for example the deployment of the data analytics platform, will be discussed in subsequent articles of this ongoing blog series.
This article should be especially useful for Scrum Masters with experience in software development projects who are joining a Data Vault 2.0 project. It discusses the similarities and differences between both types of projects and helps the reader avoid typical pitfalls that can increase risk if left unaccounted for.
One of the advantages of the Data Vault 2.0 model is its extensibility: new entities can be added to the model step by step. Most of our clients at Scalefree start with a small proof of concept (PoC) before moving towards their first Data Vault 2.0 model. In subsequent iterations, this initial model is then extended by adding further source data sets to the Raw Data Vault and transformation logic to the Business Vault.
That’s why Data Vault 2.0 is often used with Agile methodologies such as Scrum. One of our clients, Berenberg, selected Data Vault 2.0 because it was in line with their strategy to introduce Agile development methodologies at the bank. Additionally, it allowed them to align the development cycle of the data warehouse with the development cycle of the operational software development.
However, Data Vault 2.0 projects differ from traditional software development in some respects. For example, data warehouse architectures, such as the Data Vault 2.0 reference architecture presented in a previous article of this blog series, often follow a multi-layer design. This is not uncommon in traditional software development but requires additional provisions. In software development, a concept called the “tracer bullet approach” is used to “shoot” software features one by one through the layers of the software system. This allows the team to focus on the business value to be produced instead of on technical layers. The same concept is used in the Data Vault 2.0 methodology, as we will discuss later in this article. In addition, we will present the typical roles in a Data Vault 2.0 project and the tools we use on the Azure cloud to support our projects.
As already mentioned, the Data Vault 2.0 methodology is based on Scrum. For example, it uses time-boxed sprints to develop the data analytics platform in multiple iterations over time. As in Scrum, the length of the sprint is defined by the organization and is often aligned with the sprints of the software development organization. We recommend this alignment to clients because, in one unsynchronized case, a source attribute we were ingesting during a sprint was removed from the operational source system mid-sprint by a deployed feature. The reason: the operational source team never knew that we depended on this attribute, because we had not yet had a chance to deploy our functionality and declare our dependencies.
Data Vault 2.0 teams also use the typical Scrum ceremonies, such as sprint plannings, sprint reviews and retrospectives, and daily sprint meetings. To better synchronize with other Agile teams, including the software development teams of the operational source systems, a Scrum of Scrums is often used.
Successful teams quickly establish a focus on the business value to be produced. This could be a new report or dashboard, or at least a new measure to be calculated. It could also be the implementation and integration of data cleansing functionality that feeds cleansed data back to the operational system. In other words, there must be a reason why the business is setting up a budget for an analytical data platform, and it is certainly not to “have a fancy data lake.” Teams are advised to distinguish between functional and non-functional requirements and to focus every sprint on implementing some functional requirements. In addition, each sprint should implement some non-functional requirements to build out the technical platform.
Teams often make the mistake of building the technical platform first without implementing any business value.
This leads to a number of issues: first, the business becomes curious about the return on its investment. Again, the business is focused on reports, dashboards or at least some KPIs. It also cares about the implementation of non-functional requirements: for example, it expects the solution to adhere to the law and implement requirements regarding GDPR. However, the business does not primarily set up the project budget in order to have a “fancy, GDPR-compliant data lake”; that is only a secondary reason.
The other issue is that the implementation of the data flows for new functional requirements often influences the technical implementation. There is a high probability that an upfront implementation of non-functional requirements will not be sufficient for the actual data flows and will require refactoring. Think Agile: nobody expects all non-functional requirements to be in place before any business value is delivered. If some critical requirements exist, such as security, simply limit the number of users who can access the initial platform and broaden the user base as security features are added to the platform.
A typical Data Vault team consists of five to nine team members, just like a typical Scrum team. However, besides the product owner (a role that is the same in both methodologies), the role definitions are more closely aligned with traditional data warehouse development than with agile software development. Scrum defines three major roles in a team: the product owner, the developers and the Scrum master. This works well in software development, but in data warehousing there are highly specialized tools that require different skill sets among the developers. Therefore, the Data Vault team consists of automation engineers, data modelers, testers, dashboard designers, etc., instead of a generalized developer role. This doesn’t mean that a team member cannot learn new skills; it just establishes the focus of the team member as it relates to a specific tool and set of tasks.
However, it also reflects the dependencies within a team due to the layered nature of an analytical data platform. Before the dashboard can be built, the upstream layers, such as the Raw Data Vault, the Business Vault and the information mart, must be built. If a team member were fixed in the dashboarding position, that member would have to wait until the end of the sprint to start developing the dashboard. To avoid this inefficiency, we encourage team members to actively find work and extend their skills. For example, in one of our projects we had exactly this situation; the dashboard designer was interested in the orchestration of data warehousing tasks and worked on this non-functional artifact to prevent idle time.
Another difference is the use of a project manager role instead of a Scrum master. In Scrum, the Scrum master is not managing the project but merely dealing with obstacles in the project, organizing meetings, etc. This is true for the project manager of a Data Vault team, but the project manager role is broader and the title should reflect this. For example, the project manager is responsible for effort estimation using function point analysis, which works great in Data Vault 2.0 projects. The role is also responsible for the maintenance of sprint plans.
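Before turning to the sprint plans themselves, here is a minimal sketch of what a function point count could look like. It uses the standard IFPUG average complexity weights rather than any Data Vault 2.0-specific adaptation, and the component counts are purely illustrative, not taken from a real project.

```python
# Illustrative unadjusted function point count using standard IFPUG average weights.
# The counts below are made up for demonstration; a real Data Vault 2.0 project
# would count its artifacts (staging loads, hubs, links, satellites, marts, reports)
# according to its own, calibrated counting rules.

AVERAGE_WEIGHTS = {
    "external_inputs": 4,      # e.g. source feeds loaded into the data lake
    "external_outputs": 5,     # e.g. reports or dashboards delivered
    "external_inquiries": 4,   # e.g. ad-hoc query interfaces
    "internal_files": 10,      # e.g. Raw Data Vault / Business Vault entities
    "external_interfaces": 7,  # e.g. reference data maintained elsewhere
}

def unadjusted_function_points(counts: dict) -> int:
    """Sum each component count multiplied by its average complexity weight."""
    return sum(AVERAGE_WEIGHTS[component] * count for component, count in counts.items())

if __name__ == "__main__":
    sprint_scope = {
        "external_inputs": 3,
        "external_outputs": 1,
        "external_inquiries": 0,
        "internal_files": 5,
        "external_interfaces": 1,
    }
    # 3*4 + 1*5 + 0*4 + 5*10 + 1*7 = 74 unadjusted function points
    print(unadjusted_function_points(sprint_scope))
```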
These sprint plans are used to establish a pattern during the sprint. It is always surprising how some data warehouse projects are managed: when a new business requirement is formulated, everybody in the team scrambles to find out how to build the requested report, where the data comes from and how to transform it. Imagine this process applied to a production process, such as car manufacturing. A new customer order arrives and everybody on the production line tries to find out which machines to use and where the necessary components are located. Meanwhile, missing components are ordered and dumped into a big, unorganized pile for later use, somebody figures out how to apply the paint, and so on. This mess is how many data warehouse and Big Data projects function.
But car makers actually use blueprints and organized production lines to mass-produce their products. This is where the sprint plan comes into play: it defines the activities required to deliver the requested information, including the sequence and dependencies of these activities. The sprint plan acts as the blueprint for the information delivery process. Typical projects have more than one sprint plan, because there are often different sprint types: for example, one for data provisioning, that is, the addition of source systems to the data lake; and one for the implementation of business value, which adds the source data to the Raw Data Vault, adds the transformation rules to the Business Vault, and creates (or extends) the information mart and the dashboard. There could also be a sprint type for the deployment of new artifacts if the deployment is not part of the development sprint. For every sprint type, a sprint plan exists that defines the activities. The table below shows a typical sprint plan for the development of new information assets:
| ID | WBS | Task Name | Duration (hrs) | Predecessors |
| --- | --- | --- | --- | --- |
| 1 | 1 | Scope and Estimate Functionality | | |
| 2 | 1.1 | Review information requirements for completeness | 0.5 | |
| 3 | 1.2 | Review documented business rules and KPIs | 0.5 | 1.1 |
| 4 | 1.3 | Prepare a rough design for FPA estimation | 1.0 | 1.2 |
| 5 | 1.4 | Perform FPA estimation | 1.0 | 1.3 |
| 6 | 1.5 | Scope down functionality to be produced in sprint | 0.5 | 1.4 |
| 7 | 2 | Identify Source Data | | |
| 8 | 2.1 | Maintain source to requirements matrix | 4.0 | |
| 9 | 2.2 | Perform pre-analysis | 2.0 | 2.1 |
| 10 | 3 | Set up metadata in automation tool | | |
| 11 | 3.1 | Define the source and Raw Data Vault metadata | 8.0 | 2.1 |
| 12 | 3.2 | Create additional or modify existing automation templates | 2.0 | |
| 13 | 4 | Prepare Information Delivery | | |
| 14 | 4.1 | Define and set up required Business Vault entities in detail | 4.0 | 1.2 |
| 15 | 4.2 | Develop business rules according to documentation | 4.0 | 4.1 |
| 16 | 4.3 | Develop regression test cases for Business Vault | 4.0 | 4.1 |
| 17 | 4.4 | Derive information mart according to information requirements | 2.0 | 3.1, 4.2 |
| 18 | 4.5 | Develop regression test cases for information mart | 2.0 | 4.4 |
| 19 | 5 | Create Information Artifacts | | |
| 20 | 5.1 | Create or extend existing dashboard / report | 4.0 | 4.4 |
| 21 | 6 | Finalize Documentation | | |
| 22 | 6.1 | Maintain source to target matrix | 2.0 | 4.4 |
| 23 | 6.2 | Record actual effort | 1.0 | 6.1 |
| 24 | 6.3 | Update information requirements to reflect current implementation | 1.0 | 4.4 |
| 25 | 7 | Prepare Deployment | | |
| 26 | 7.1 | Prepare deployment package | 4.0 | 4.4 |
| 27 | 7.2 | Deploy to UAT environment | 2.0 | 7.1 |
| 28 | 7.3 | Perform user-acceptance test (UAT) | 4.0 | 7.2 |
| 29 | 7.4 | Deploy to production environment | 2.0 | 7.3 |
During a retrospective meeting, the goal is to improve the development process. This often translates into improvements and changes to the sprint plan: by adjusting the set of required activities, the blueprint of the sprint is improved, which should lead to better process quality. Most clients improve the process regarding efficiency, productivity (to produce more function points per sprint), quality (to avoid producing technical debt that has to be fixed later), automation, duration (to reduce the sprint duration over time) and other aspects.
Another aspect that is typically optimized over time is the deployment of new functionality into production. We often observe that projects hesitate to deploy new features due to a variety of pains (dependencies, missing documentation, testing, just to name a few). Instead of avoiding the pain, we advise them to deploy on a constant basis. Teams either get used to the pain or, as is most common, fix the issues to reduce or stop the pain. The problem is that if teams don’t deliver on a constant basis, they will not feel the pain and therefore will not fix the issues.
In Scrum, software requirements are defined and refined by user stories. A user story defines the functionality from the business user’s perspective and should be focused on the business value. While this works in software development, it presents an issue in data warehousing. Here, detailed requirements are a necessity, and teams often make the mistake of turning the user stories into technical ones, creating one user story for each technical artifact to be produced: one for the data lake (“ingest Dynamics CRM into the data lake”), one for the Raw Data Vault (“create hubs, links, satellites for the account CRM object”), etc. However, this goes against our recommendation to focus the user stories on business value.
Another approach has proved more valuable for us in our projects: information requirements. They are more closely related to traditional software requirements, where all necessary detail is described but focused on one functional requirement. We still use a user story as an introduction to the information requirement. The complete document contains sections that together provide all the necessary details to develop the functionality, as described in the following paragraphs.
The graphical mockup is either an Excel spreadsheet with artificial information, a Visio diagram or a screenshot from a legacy report. Alternatively, an information model, such as a dimensional star or snowflake model could be used.
In the next section, all required details of the mockup are explained in full: if a detail is relevant, the business user should be able to describe it. This section is followed by a section with KPI descriptions, often referring to KPI tree definitions with all required instructions to calculate the measures. This is followed by a list of data sources, often described in more detail in an external document or Wiki. The last section in our template is a detailed list of test cases for the user acceptance test (UAT).
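As a sketch only, the section structure described above could also be captured as a small data structure, for instance when information requirements are managed as files next to the code. The field names and the example values are our own illustration, not a fixed Data Vault 2.0 artifact.

```python
from dataclasses import dataclass, field

@dataclass
class InformationRequirement:
    """One functional requirement, introduced by a user story and described in full detail."""
    title: str
    user_story: str            # introductory user story ("As a ... I want ... so that ...")
    graphical_mockup: str      # path to an Excel mockup, Visio diagram or legacy-report screenshot
    mockup_details: list = field(default_factory=list)   # every relevant detail of the mockup, explained
    kpi_definitions: list = field(default_factory=list)  # calculation rules, often referring to a KPI tree
    data_sources: list = field(default_factory=list)     # links to source documentation or wiki pages
    uat_test_cases: list = field(default_factory=list)   # acceptance criteria for the UAT

# Hypothetical example content:
requirement = InformationRequirement(
    title="Monthly revenue by account manager",
    user_story="As a sales lead, I want revenue per account manager so that I can plan quarterly targets.",
    graphical_mockup="mockups/revenue_dashboard.xlsx",
    kpi_definitions=["Revenue = sum of invoiced amounts, net of returns, per calendar month"],
    data_sources=["Dynamics CRM: account, opportunity", "ERP: invoice line items"],
    uat_test_cases=["Totals reconcile with the legacy monthly revenue report for the last closed month"],
)
```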
There are multiple possible variations: first, the structure is not fixed; our recommendation is that organizations adjust it to their needs and to their experience from previous sprints. They should use it as a starting point and adapt it during the first set of sprints. The structure could also be printed on a user story card: the front could carry the introductory user story and the details could go on the back. Later in this article, we will set up this structure in Azure DevOps. Additional details about information requirements can be found on the Scalefree blog.
Microsoft’s Azure DevOps Services provides various services to development teams, including project management, requirements management, version control, automated builds, tests and release management. It allows project management using different methodologies, including CMMI, Scrum and other Agile approaches.
This article focuses on two different aspects of Azure DevOps that we typically customize in our Data Vault 2.0 projects: the Scrum / Kanban board for the user stories and the requirements management.
To establish the continuous delivery of information assets, we customize Azure DevOps to support sprint types and sprint plans with a focus on the business value. A card on the Scrum board should focus on business value and document the functionality to be implemented, using the information requirement template, across all layers of the data analytics platform. This requires working with backlog items, process templates (for the information requirement structure) and sub-tasks. All necessary details are in the sub-sections and sub-tasks of the work item (the back of the card). Each work item has the same, or at least a similar, structure, and the tasks to load data through the layers are most of the time, if not always, the same.
The first step is to modify the process template in the organizational settings. A short-cut to the settings can be found in the context menu of a work item:
In the customization settings of the process template, the attributes can be created to reflect the structure of the information requirement:
As already discussed, the structure is not fixed: every organization should come up with its own best practice for describing information requirements. The goal is for the template to become stable after a number of sprints, once the pattern has been established based on experience from the early sprints.
Next, the sub-tasks should be configured. The best approach is to create a backlog item as a template and prepare it with all the required sub-tasks from the above sprint plan example. The template is then left on the Scrum board for later use when new work items should be created:
The hierarchy of the sprint plan is flattened into the sub-task name. This template is then copied into a new work item, including all the sub-tasks:
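The copy step can also be scripted against the Azure DevOps REST API instead of using the board UI. The following minimal sketch creates a new backlog item and attaches child tasks taken from the sprint plan; organization, project, titles and the work item type names are placeholders for illustration, and error handling is omitted.

```python
import base64
import requests  # third-party: pip install requests

ORGANIZATION = "my-org"      # placeholder
PROJECT = "my-project"       # placeholder
PAT = "..."                  # personal access token with work item write scope
AUTH_HEADER = "Basic " + base64.b64encode(f":{PAT}".encode()).decode()
BASE_URL = f"https://dev.azure.com/{ORGANIZATION}/{PROJECT}/_apis/wit/workitems"

def create_work_item(work_item_type: str, title: str, parent_url: str = "") -> dict:
    """Create a work item via a JSON Patch document and link it to a parent if one is given."""
    patch = [{"op": "add", "path": "/fields/System.Title", "value": title}]
    if parent_url:
        # Hierarchy-Reverse points from the child to its parent work item.
        patch.append({
            "op": "add",
            "path": "/relations/-",
            "value": {"rel": "System.LinkTypes.Hierarchy-Reverse", "url": parent_url},
        })
    response = requests.post(
        f"{BASE_URL}/${work_item_type}?api-version=7.0",
        json=patch,
        headers={"Authorization": AUTH_HEADER, "Content-Type": "application/json-patch+json"},
    )
    response.raise_for_status()
    return response.json()

# Create the backlog item for the information requirement ...
backlog_item = create_work_item("Product Backlog Item", "Monthly revenue by account manager")

# ... and attach a few of the sub-tasks defined in the sprint plan template.
for task_name in [
    "1.1 Review information requirements for completeness",
    "1.2 Review documented business rules and KPIs",
    "2.1 Maintain source to requirements matrix",
]:
    create_work_item("Task", task_name, parent_url=backlog_item["url"])
```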
An alternative to this approach is the free plug-in 1-Click Child-Links, available on the Visual Studio Marketplace.
Teams can still modify the structure of their sub-tasks in the copy of the work item by adding, removing or changing the existing ones. Again, over time, a pattern should emerge and might influence the defined sub-tasks in the template work item. Note that multiple sprint types require different sprint plans and, therefore, different work item templates (one for each sprint type). Each sprint plan is then independently optimized over time.
These changes often stem from the retrospective meeting at the end of the sprint. They are conducted in the Data Vault 2.0 methodology, just like in Scrum, to improve the development process. Typically, the team discusses what went well, what didn’t go well and how the process should be improved. Improvements often affect the sprint plan because the team decides what activities should be started or stopped. If an activity is important to the team, it should be on the sprint plan to make sure it gets performed as defined.
As the work item is implemented, the card is moved between the panes of the Scrum board:
Instead of having many technical cards with no business value, the Scrum board is now more organized, shows direct business value and typically has far fewer cards. In theory, the team should focus on producing only one piece of business value, so one would expect only one card, one information requirement, on the Scrum board. In reality, the team will also work on some smaller tasks and non-functional requirements, which are represented by their own cards and often do not follow the standard sprint plan template discussed in this section.
The goal of this article was to explain the major differences between the Data Vault 2.0 methodology and the Scrum methodology, especially how we use them in our projects. There are additional differences that deal with the mass production of information, total quality management, effort estimation and other patterns. These topics are typically discussed in more detail in our trainings and the book Building a Scalable Data Warehouse with Data Vault 2.0 by Dan Linstedt and Michael Olschimke.
In the upcoming second part, we are going to demonstrate the more technical aspects of modeling and loading a data analytical platform on the Azure cloud using Data Vault 2.0. For this reason, we are moving the series over to the Microsoft TechCommunity which also supports commenting!
Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big-Data consulting firm in Europe, empowering clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing individuals from the industry, taught classes in academia, and publishes on these topics on a regular basis.