machine learning

12 Topics

Architecting the Next-Generation Customer Tiering System
Authors Sailing Ni*, Joy Yu*, Peng Yang*, Richard Sie*, Yifei Wang* *These authors contributed equally. Affiliation Master of Science in Business Analytics (MSBA), UCLA Anderson School of Management, Los Angeles, California 90095, United States (Conducted December 2025) Acknowledgment This research was conducted as part of a Microsoft-sponsored Capstone Project, led by Juhi Singh and Bonnie Ao from the Microsoft MCAPS AI Transformation Office. Microsoft’s global B2B software business classifies customers into four tiers to guide coverage, investment, and sales strategy. However, the legacy tiering framework mixes historical rules with manual heuristics, causing several issues: Tiers do not consistently reflect customer potential or revenue importance. Statistical coherence and business KPIs (TPA, TCI, SFI) are not optimized or enforced. Tier distributions are imbalanced due to legacy ±1 movement and capacity rules. Sales coverage planning depends on a tier structure not grounded in data. To address these limitations, we, UCLA Anderson MSBA class of Dec'25, designed a next-generation KPI-driven tiering architecture. Our objective was to move from a heuristic, static system toward a scalable, transparent, and business-aligned framework. Our redesigned tiering system follows five complementary analytical layers, each addressing a specific gap in the legacy process: Natural Segmentation (Unsupervised Baseline): Identify the intrinsic structure of the customer base using clustering to understand how customers naturally group Pure KPI-Based Tiering (Upper-Bound Benchmark): Show what tiers would look like if aligned only to business KPIs, quantifying the maximum potential lift and exposing trade-offs. Hybrid KPI-Aware Segmentation (Our Contribution): Integrate clustering geometry with KPI optimization and business constraints to produce a realistic, interpretable, and deployable tiering system. Dynamic Tiering (Longitudinal Diagnostics): Analyze historical patterns to understand how companies evolve over time, separating structural tier drift from noise. Optimization & Resource Allocation (Proof of Concept): Demonstrate how the new tiers could feed into downstream coverage and whitespace prioritization through MIP- and heuristic-based approaches. Together, these components answer a core strategic question: “How should Microsoft tier its global customer base so that investment, coverage, and growth strategy directly reflect business value?” Our final architecture transforms tiering from a static classification exercise into a KPI-driven, interpretable, and operationally grounded decision framework suitable for Microsoft’s future AI and data strategy. Solution Architecture Diagram 1. Success Metrics Definition Before designing any segmentation system, the first step is to establish success metrics that define what “good” looks like. Without these metrics, models can easily produce clusters that are statistically neat but misaligned with business needs. A clear KPI framework ensures that every model—regardless of method or complexity—is evaluated consistently on both analytical quality and real business impact. We define success across two complementary dimensions: 1.1 Alignment & Segmentation Quality: These metrics evaluate whether the segmentation meaningfully separates customers based on business potential. 1.1.1 Tier Potential Alignment (TPA) Measures how well assigned tiers follow the rank order of PI_acct, our composite indicator of future growth potential. Implemented as a Spearman rank correlation, TPA tests whether higher-potential accounts systematically land in higher tiers. Step 1 - Formula for PI_acct (Potential Index per Account) Step 2 - Formula for TPA (Tier Potential Alignment) 𝜌_𝑠 = Spearman rank correlation Tier Rank = ordinal tier number (Tier A = highest → Tier D = lowest) Interpretation: TPA=1 ⇒ Perfect alignment (higher potential → higher tier) TPA=0 ⇒ No statistical relationship TPA<0 ⇒ Misalignment (tiers contradict potential) 1.1.2 Tier Compactness Index (TCI) Measures how homogeneous each tier is. Low within-tier variance on PI_acct or Revenue indicates that customers grouped together truly share similar characteristics, improving interpretability and resource planning. (1) Potential-based Compactness - TCI_PI (2) Revenue-based Compactness - TCI_REV TCI=1 ⇒ tiers are tight and well-separated TCI=0 ⇒ tiers are random or overlapping TCI<0 ⇒ within-tier variance exceeds total variance (poor grouping) 1.2 Business Impact These metrics test whether the segmentation supports strategic goals, not just statistical structure. 1.2.1 Strategic Focus Index (SFI) Quantifies how much revenue comes from the company’s most strategically important tiers. High SFI means segmentation helps focus investments—sales coverage, specialist time, programs—on the customers that matter most. Under the Tier Policy framework, the definition of “strategic” automatically adapts to the number of tiers K - for example, taking the top L tiers (e.g., top 2) or top x % of tiers ranked by mean potential or revenue. High SFI: strong emphasis on top strategic segments (potentially efficient but watch concentration risk). Moderate SFI: balanced focus across tiers. Low SFI: diffuse portfolio, limited emphasis on priority segment 2. Static Segmentation 2.1 Pure Unsupervised Clustering 2.1.1 Model Conclusions at a Glance Across all unsupervised models evaluated—Ward, Weighted Ward, K-Medoids, K-Means / K-Means++, and HDBSCAN — only the Ward model (K=4, Policy v2) provides a segmentation that is simultaneously: statistically coherent, business-aligned (high SFI), geometrically stable (clean Silhouette), and operationally interpretable. All alternative models either distort cluster geometry, collapse SFI, or produce unstable/illogical tier structures. Final Recommendation: Use Ward (K=4, Policy v2) as the natural segmentation baseline. 2.1.2 High-Level Algorithm Comparison Table 1. Algorithm Comparison Model Algorithm Summary Strengths Weaknesses Business Use Ward Variance-minimizing hierarchical merges Best balance of TPA/TCI/SFI; stable geometry Sensitive to correlated features Primary model for segmentation Weighted Ward Distance reweighted by PI + revenue Higher TPA Silhouette collapse; unstable Not recommended K-Medoids Medoid-based dissimilarity minimization Robust to outliers Cluster compression; weak SFI Diagnostic only K-Means / K-Means++ Squared-distance minimization Fast baseline SFI collapse; over-tight clusters Numeric benchmark only HDBSCAN Density-based clustering with noise Good for anomaly detection TPA collapse; noisy tiers; broken PI ordering Not suitable for tiering 2.1.3 Modeling Results Table 2. Unsupervised Clustering Model Results Metric FY26 Baseline (Legacy A+B) Ward K=4 (Policy v2) Weighted Ward2-B (α=4, β=0.8, s=0.7, K=5) Unweighted Ward (Policy v2, K=4) Unweighted Ward (Policy v2, K=3) K-Medoids B4 Behavior-only (K=3) K-means K=4 (Policy v2) K-means++ K=4 (Policy v2) HDBSCAN (baseline settings) TPA 0.260 0.260 0.860 0.260 0.300 0.520 0.310 0.310 0.040 TCI_PI 0.222 0.461 0.772 0.461 0.405 0.173 0.476 0.476 0.004 TCI_REV 0.469 0.801 0.640 0.801 0.672 0.002 0.831 0.831 0.062 SFI 0.807 0.868 0.817 0.868 0.960 0.656 0.332 0.332 0.719 Silhouette nan 0.560 0.145 0.560 0.604 0.466 0.523 0.523 0.186 Ward (K=4, Policy v2) remains the strongest performer: SFI ≈ 0.87, Silhouette ≈ 0.56, stable geometric structure. Weighted Ward raises TPA/PI slightly but Silhouette collapses (~0.15) → structural instability → not viable. K-Medoids consistently compresses clusters; TPA/TCI gain is offset by TCI_REV collapse and low SFI. K-Means / K-Means++ tighten numeric clusters but SFI drops to ~0.33 → tiers lose strategic meaning. HDBSCAN generates large noisy segments; TPA = 0.044, TCI_PI = 0.004, Silhouette = 0.186, and Tier A/B contain negative PI → fundamentally unsuitable. Conclusion: Only Ward (K=4) produces segmentation with both statistical integrity and business relevance. 2.1.4 Implications, Limitations, Next Steps Implications Our current unsupervised segmentation delivers statistically coherent and operationally usable tiers, but several structural findings emerged: Unsupervised methods reveal the data’s natural shape, not business priorities: Ward/K-means/HDBSCAN can discover separations in the feature space but cannot move clusters toward preferred PI or revenue patterns. Cluster outcomes cannot guarantee business-desired constraints. For example: If Tier A’s PI mean is too low, the model cannot raise it. If Tier C becomes too large, clustering cannot rebalance it. If the business wants stronger SFI, clustering alone cannot optimize that objective Some business-critical metrics are only evaluated after clustering, not optimized within clustering: Tier size distributions, average PI per tier, and revenue share are structurally important but not part of the unsupervised objective. Hence, Unsupervised clustering provides a statistically coherent view of the data’s natural structure, but it cannot guarantee business-preferred tier outcomes. The models cannot enforce hard constraints (e.g., desired A/B/C distribution, monotonic PI means, revenue share targets), nor can they adjust tiers when PI is too low or clusters become imbalanced. Additionally, key tier-level KPIs—such as average PI per tier, tier size stability, and revenue distribution—are only evaluated after clustering rather than optimized during it, limiting their influence on the final tier design. To overcome these structural limitations, the next stage of the system must incorporate semi-supervised guidance and policy-based optimization, where business KPIs directly shape tier boundaries and ranking. Future iterations will expand the policy beyond PI and revenue to include behavioral and market signals and bring tier-level metrics into the objective function to better align the segmentation with real-world operational priorities. 2.2 Semi-supervised KPI-Driven Learning Composite Score — KPI-Driven Objective for Tiering To guide our semi-supervised and hybrid methods, we define a Composite Score that unifies Microsoft’s key business KPIs into a single optimization target. It ensures that all modeling layers—Pure KPI-Based Tiering and Hybrid KPI-Aware Segmentation—optimize toward the same business priorities. Unsupervised clustering cannot optimize business outcomes. A composite objective is needed to consistently evaluate and improve tiering performance across: Potential uplift (TPA) Stability of tier structure (SFI) Within-tier improvement (TCI_PI) Revenue scale (TCI_REV) To align tiering with business priorities, we summarize four key KPIs—TPA, SFI, TCI_PI, and TCI_REV—into one normalized measure: Composite Score = 0.35×TPA + 0.35×SFI + 0.30×(TCI_PI + TCI_REV) This score provides a single benchmark for comparing methods and serves as the optimization target in our semi-supervised and hybrid approaches. How It Is Used Benchmarking: Compare all methods on a unified scale. Optimization: Serves as the objective in constrained local search (Method 3). Rule Learning: Guides the decision-tree logic extracted after optimization. Why It Matters The Composite Score centers the analysis around a single question: “Which tiering structure creates the strongest balance of growth potential, stability, and revenue impact?” 2.3 Pure KPI-Based Tiering 2.3.1 Model Conclusions at a Glance Pure KPI-based tiering shows what the tiers would look like if Microsoft prioritized business KPIs above all else. It achieves the largest KPI improvements, but causes major distribution shifts and violates movement rules, making it operationally unrealistic. Final takeaway: Pure KPI tiering is a valuable benchmark for understanding KPI potential, but cannot be operationalized. 2.3.2 High-Level Algorithm Summary Table 3. Methods of KPI-Based Tiering Method Algorithm Summary Strengths Weaknesses Business Use New_Tier_Direct (PI ranking only) Rank accounts by PI/KPI score and assign tiers directly Highest KPI gains; preserves overall tier distribution Moves ~20–40% companies; violates ±1 rule; disrupts continuity KPI upper-bound benchmark Tier_PI_Constrained (PI ranking + ±1 rule) Same as above but restrict movement to adjacent tiers KPI lift + respects movement constraint Still moves ~20–40%; breaks tier distribution (Tier C inflation) Diagnostic only 2.3.3 Modeling Results Table 4. Modeling Results for KPI-Based Tiering KPI FY26 Baseline New_Tier_Direct Tier_PI_Constrained Composite Score 0.5804 0.8105 0.763 TPA 0.2590 0.8300 0.721 TCI_PI 0.2220 0.5360 0.492 TCI_REV 0.4690 0.3970 0.452 SFI 0.8070 0.6860 0.650 New_Tier_Direct Composite Score: 0.5804 → 0.8105 TPA increases sharply (0.259 → 0.830) Violates ±1 rule; major reassignments (~20%–40%) Tier_PI_Constrained Respects ±1 movement KPI still improves (Composite 0.763) But tier distribution collapses (Tier C over-expands) Still ~20–40% movement → not feasible Hence: No PI-only method balances KPI lift with operational feasibility. 2.3.4 Limitations & Next Steps Pure KPI tiering cannot simultaneously: preserve tier distribution, respect ±1 movement rule, and deliver consistent KPI improvements. This creates the need for a hybrid model that combines clustering structure with KPI-aligned tier ordering. 2.4 Hybrid KPI-Aware Segmentation (Our Contribution) 2.4.1 Model Conclusions at a Glance Our hybrid method blends clustering geometry with KPI-driven optimization, achieving a practical balance between: statistical structure, business constraints, and KPI improvement. Final Recommendation: This is the segmentation framework we recommend Microsoft to adopt. ➔ It produces the most deployable segmentation by balancing KPI lift with stability and interpretability. ➔ Delivers meaningful KPI improvement while changing only ~5% of accounts, compared to Model B’s 20–40%. 2.4.2 High-Level Algorithm Summary Table 5. Algorithm Comparison Component Purpose Strengths Notes Constrained Local Search Optimize composite KPI score starting from FY26 tiers KPI uplift with strict constraints Only small movements allowed (~5%) Tier Movement Constraint (+1/–1) Ensure realistic transitions Guarantees business rules; keeps structure stable Limits improvement ceiling Decision Tree Learn interpretable rules from optimized tiers Deployable, explainable, reusable Accuracy ~80%; tunable with weighting Closed Loop Optimization Improve both rules and allocation iteratively Stable + interpretable Future extension 2.4.3 Modeling Results Table 6. Modeling Results for Hybrid Segmentation KPI FY26 Baseline New_Tier_Direct Tier_PI_Constrained ImprovedTier Composite Score 0.5804 0.8105 0.763 0.6512 TPA 0.2590 0.8300 0.721 0.2990 TCI_PI 0.2220 0.5360 0.492 0.3450 TCI_REV 0.4690 0.3970 0.452 0.5250 SFI 0.8070 0.6860 0.650 0.8160 Interpretation of Hybrid Model (Improved Tier) Composite Score: 0.5804 → 0.6512 TPA improvement (0.259 → 0.299) TCI_PI and TCI_REV both rise SFI improves compared to constrained PI method Only ~5% of companies move tiers, versus Method 2’s 20–40% This makes Method 3 the only method that simultaneously satisfies: KPI improvement original tier distribution ±1 movement rule low operational disruption interpretability (via decision tree) 2.4.4 Conclusion Model C offers a pragmatic middle ground: KPI lift close to pure PI tiering, operational impact close to clustering, and full interpretability. For Microsoft, this hybrid framework is the most realistic and sustainable segmentation approach 3. Dynamic Tier Progression 3.1 Model Conclusions at a Glance Our benchmarking shows that CatBoost and XGBoost consistently deliver the strongest overall performance, achieving the highest macro-F1 (~0.76) across all tested methods. However, despite these gains, the underlying business pattern remains dominant: tier changes are extremely rare (≈5.4%), and Microsoft’s one-step movement rule severely limits model learnability. Dynamic tiering is far more valuable as a diagnostic signal generator than a strict forecasting engine. While models cannot reliably predict future tier transitions, they can surface atypical account patterns, signals of risk, and emerging opportunities that support earlier sales intervention and more proactive account planning. 3.2 Models To predict future model upgrades and downgrades, we tested the following models: Table 7. Models Used for Dynamic Prediction Model Strengths Weaknesses When to Use MLR Simple; interpretable; fast baseline Weak on imbalanced data When transparency and explainability are needed Neural Network Captures nonlinear patterns; stronger recall than MLR Requires tuning; sensitive to imbalance data When exploring richer behavioral signals CatBoost (baseline, weighted, oversampled) Strongest overall balance; robust with categorical data; best macro-F1 Still limited by rarity of tier changes; weighted/oversampled versions risk overfitting Default diagnostic model for surfacing atypical account patterns XGBoost (baseline, weighted) High performance; scalable; production-ready Limited by structural imbalance; weighted versions increase false positives When deploying a stable scoring layer to sales teams Performance was then measured using accuracy, but more importantly, macro recall, precision, and F1, since upgrades and downgrades are much rarer and require balanced evaluation. 3.3 Model Results Across all models, overall accuracy appears high (0.95–0.97), but this metric is dominated by the fact that Tier transitions are extremely rare — only 808 of 15,000 cases (5.4%) moved tiers, while 95% stayed unchanged. According to macro metrics such as recall, precision, and F1, every model struggles to reliably detect upgrades and downgrades. CatBoost and XGBoost deliver the strongest balanced results, achieving the highest macro F1 scores (~0.76). However, even these advanced methods only capture half or fewer of the true upgrade and downgrade events. This reinforces that the challenge is not algorithmic performance, but the underlying business pattern: tier movements are infrequent, policy-driven, and weakly connected to observable account features. Table 8. Results for Dynamic Prediction Model Accuracy Macro Recall Precision F1 Score MLR 0.95 0.36 0.70 0.37 Neural Network 0.95 0.58 0.71 0.63 CatBoost 0.97 0.94 0.67 0.76 CatBoost (Weighted) 0.82 0.49 0.82 0.54 CatBoost (Oversampling) 0.69 0.42 0.75 0.42 XGBoost 0.97 0.93 0.67 0.76 XGBoost (Weighted) 0.97 0.85 0.70 0.76 3.4 Dynamic Tiering Implications Based on the results, our dynamic tiering will have the following implications to Microsoft: Tier changes are not reliably forecastable under current rules. Year-over-year stability is so dominant that even strong ML models cannot surface consistent upgrade or downgrade signals. This suggests that transitions are driven more by sales judgment and tier policy than by measurable account behavior. The dynamic model is still valuable: just not as a predictor of future tiers. Rather than serving as a forecasting engine, this pipeline should be viewed as a diagnostic tool that helps identify accounts with unusual patterns, emerging risks, or outlier behavior worth reviewing. Dynamic progression complements, rather than replaces, the core segmentation. It provides an additional layer of insight alongside clustering and KPI-optimized segmentation, helping Microsoft maintain both structural clarity (static segmentation) and forward-looking awareness (dynamic progression). 4. Optimization in Practice To understand how segmentation could support downstream coverage planning, we developed a small optimization proof-of-concept using Microsoft’s seller–tier capacity guidelines (e.g., max accounts per role × tier, geo-entity restrictions, in-person vs remote coverage rules). 4.1 What We Explored Using our final hybrid segmentation (Method 3), we tested a simplified workflow: Formulate a coverage optimization problem ○ Assign sellers to accounts under constraints such as: role × tier capacity limits, single-geo assignment, ±1 tier movement rules, domain restrictions for Tier C/D. ○ This naturally forms a mixed-integer optimization problem (MIP). Prototype with standard optimization tools ○ Linear and integer programming formulations using Gurobi, OR-Tools, and Pyomo. ○ Heuristic solvers (e.g., local search, greedy reallocation, hill climbing) as faster alternatives. Simulate coverage scenarios ○ Estimate changes in workload balance and whitespace prioritization under different seller–tier mixes. ○ Validate feasibility of the optimization with respect to Microsoft’s operational rules. 4.2 What We Learned Due to limited operational metrics (detailed whitespace values, upgrade probabilities, territory boundaries) and time constraints, we did not build a fully deployable engine. However, the PoC confirmed that: The segmentation integrates cleanly into a prescriptive segmentation → optimization → coverage pipeline. A full solver could feasibly allocate sellers under realistic business constraints. Gurobi-style MIP formulations and simulation-based heuristics are both valid paths for future development. In short: the optimization layer is technically viable and aligns naturally with our segmentation design, but its full implementation exceeds the scope of this capstone. 5. AI & LLM Integration To make segmentation accessible to a broad set of stakeholders like sales leaders, strategists, and business analysts, we built a conversational tiering assistant powered by LLM-based interpretation of strategic priorities. The assistant allows users to describe their intended segmentation direction in natural language, which the system translates into numerical weights and a refreshed set of tier assignments. 5.1 LLM Workflow Architecture The following flowchart demonstrates how the LLM work: Users communicate their goals using intuitive, high-level language (e.g. “prioritize runway growth”, “reward high-potential emerging accounts”). Front end collects the user’s tiering preference through a chat interface. The frontend sends this prompt to our cloud FastAPI service on Render. The LLM interprets the prompt and infers the relative strategic weights and which clustering method to use (KPI-based or Hybrid Approach). The server applies these weights in the tiering code to generate updated tiers based on the selected approach. The server returns a refreshed CSV with new tier assignments which can be exported through the chat interface. 5.2 Why LLMs Matter LLMs enhanced the project in three ways: Interpretation Layer: Helps business users articulate strategy in plain English and convert it to quantifiable modeling inputs. Explainability Layer: Surfaces cluster drivers, feature differences, and trade-offs across segments in natural language. Acceleration Layer: Enables real-time exploration of “what-if” tiering scenarios without engineering support. This integration transforms segmentation from a static analytical artifact into a dynamic, interactive decision-support tool, aligned with how Microsoft teams actually work. 5.3 Backend Architecture and LLM Integration Pipeline The conversational tiering system is supported by a cloud-based backend designed to translate natural-language instructions into structured model parameters. The service is deployed on Render and implemented with FastAPI, providing a lightweight, high-performance gateway for managing requests, validating inputs, and coordinating LLM interactions. FastAPI as the Orchestration Layer - User instructions are submitted through the chat interface and delivered to a FastAPI endpoint as JSON. FastAPI validates this payload using Pydantic, ensuring the request is well-formed before any processing occurs. The framework manages routing, serialization, and error handling, isolating request management from the downstream LLM and computation layers. LLM Invocation Through the OpenAI API - Once a validated prompt is received, the backend invokes the OpenAI API using a structured system prompt engineered to enforce strict JSON output. The LLM returns four normalized weights reflecting the user’s strategic intent, along with metadata used to determine whether the user explicitly prefers a KPI-based method or the default Hybrid approach. If no method is specified, the system automatically defaults to Hybrid. Low-temperature decoding is used to minimize stochastic variation and ensure repeatability across identical user prompts. All OpenAI keys are securely stored as Render environment variables. Schema Enforcement and Robust Parsing -To maintain reliability, the backend enforces strict schema validation on LLM responses. The service checks both JSON structure and numeric constraints, ensuring values fall within valid ranges and sum to one. If parsing fails or constraints are violated, the backend automatically reissues a constrained correction prompt. This design prevents malformed outputs and guards against conversational drift. Render Hosting and Operational Considerations - The backend runs in a stateless containerized environment on Render, which handles service orchestration, HTTPS termination, and environment-variable management. Data required for computation is loaded into memory at startup to reduce latency, and the lightweight tiering pipeline ensures that the system remains responsive even under shared compute resources. Response Assembly and Delivery - After LLM interpretation and schema validation, the backend applies the resulting weights and streams the recalculated results back to the user as a downloadable CSV. FastAPI’s Streaming Response enables direct transmission from memory without temporary filesystem storage, supporting rapid interactive workflows. Together, these components form a tightly integrated, cloud-native pipeline: FastAPI handles orchestration, the LLM provides semantic interpretation, Render ensures secure and reliable hosting, and the default Hybrid method ensures consistent behavior unless the user explicitly requests the KPI approach. DEMO: Microsoft x UCLA Anderson MSBA - AI-Driven KPI Segmentation Project (LLM demo) 6. Conclusion Our work delivers a strategic, KPI-driven tiering architecture that resolves the limitations of Microsoft’s legacy system and sets a scalable foundation for future segmentation and coverage strategy. Across all analyses, five differentiators stand out: Clear separation of natural structure vs. business intent: We diagnose where the legacy system diverges from true customer potential and revenue—establishing the analytical ground truth Microsoft never previously had. A precise map of strategic trade-offs: By comparing unsupervised, KPI-only, and hybrid approaches, we reveal the operational and business implications behind every tiering philosophy—making the framework decision-ready for leadership. A business-aligned segmentation ready for deployment: Our hybrid KPI-aware model uniquely satisfies KPI lift, distribution stability, ±1 movement rules, and interpretability—providing a reliable go-forward tiering backbone. A future-proof architecture that extends beyond static tiers: Dynamic progression modeling and optimization PoC show how tiering can evolve into forecasting, prioritization, whitespace planning, and resource optimization. A blueprint for Microsoft’s next-generation tiering ecosystem: The system integrates data science, business KPIs, optimization, and LLM interpretability into one cohesive workflow—positioning Microsoft for an AI-enabled tiering strategy. In essence, this work transforms customer tiering into a strategic, explainable, and scalable system—ready to support Microsoft’s growth ambitions and future AI initiatives.
BonnieAo
Dec 04, 2025 Place Analytics on Azure Blog
769Views
2likes
0Comments
Data Vault 2.0 Warehouse Automation on Azure
This is the series of 'Blog Articles' on the topic "Data Vault 2.0 on Azure" where we start from 'What?' and then slowly dwell into 'How To?' implement DV 2.0 on Azure Data Platform Technologies.
Naveed-Hussain
Oct 10, 2025 Place Analytics on Azure Blog
9.9KViews
0likes
1Comment
Defining the Raw Data Vault with Artificial Intelligence
This Article is Authored By Michael Olschimke, co-founder and CEO at Scalefree International GmbH. The Technical Review is done by Ian Clarke, Naveed Hussain – GBBs (Cloud Scale Analytics) for EMEA at Microsoft The Data Vault concept is used across the industry to build robust and agile data solutions. Traditionally, the definition (and subsequent modelling) of the Raw Data Vault, which captures the unmodified raw data, is done manually. This work demands significant human intervention and expertise. However, with the advent of artificial intelligence (AI), we are witnessing a paradigm shift in how we approach this foundational task. This article explores the transformative potential of leveraging AI to define the Raw Data Vault, demonstrating how intelligent automation can enhance efficiency, accuracy, and scalability, ultimately unlocking new levels of insight and agility for organizations. Note that this article describes a solution to AI-generated Raw Data Vault models. However, the solution is not limited to Data Vault, but allows the definition of any data-driven, schema-on-read model to integrate independent data sets in an enterprise environment. We discuss this towards the end of this article. Metadata-Driven Data Warehouse Automation In the early days of Data Vault, all engineering was done manually: an engineer would analyse the data sources and their datasets, come up with a Raw Data Vault model in an E/R tool or Microsoft Visio, and then develop both the DDL code (CREATE TABLE) and the ELT / ETL code (INSERT INTO statements). However, Data Vault follows many patterns. Hubs look very similar (the difference lies in the business keys) and are loaded similarly. We discussed these patterns in previous articles of this series, for example, when covering the Data Vault model and implementation. In most projects where Data Vault entities are created and loaded manually, a data engineer eventually develops the idea of creating a metadata-driven Data Vault generator due to these existing patterns. The effort to build a generator is too considerable, and most projects are better off using an off-the-shelf solution such as Vaultspeed. These tools come with a metadata repository and a user interface for setting up the metadata and code templates required to generate the Raw Data Vault (and often subsequent layers). We have discussed Vaultspeed in previous articles of this series. By applying the code templates to the metadata defined by the user, the actual code for the physical model is generated for a data platform, such as Microsoft Fabric. The code templates define the appearance of hubs, links, and satellites, as well as how they are loaded. The metadata defines which hubs, links, and satellites should exist to capture the incoming data set consistently. Manual development often introduces mistakes and errors that result in deviations in code quality. By generating the data platform code, deviations from the defined templates are not possible (without manual intervention), thus raising the overall quality. But the major driver for most project teams is to increase productivity. Instead of manually developing code, they generate the code. Metadata-driven generation of the Raw Data Vault is standard practice in today's projects. Today’s project tasks have therefore changed: while engineers still need to analyse the source data sets and develop a Raw Data Vault model, they no longer create the code (DDL/ELT). Instead, they set up the metadata that represents the Raw Data Vault model in the tool of their choice. Each data warehouse automation tool comes with its specific features, limitations, and metadata formats. The data engineer/modeler must understand how to transfer the Raw Data Vault model into the data warehouse automation tool by correctly setting up the metadata. This is also true for Vaultspeed; the data modeler can set up the metadata either through the user interface or via the SDK. This is the most labour-intensive task concerning the Raw Data Vault layer. It also requires experts who not only know Data Vault modelling but also know (or can analyse) the source systems' data and understand the selected data warehouse automation solution. Additionally, Data Vault is not equal to Data Vault in many cases, as it allows for a very flexible interpretation of how to model a Data Vault, which also leads to quality issues. But what if the organization has no access to such experts? What if budgets are limited, time is of the essence, or there are no available experts in sufficient numbers in the field? As Data Vault experts, we can debate the value of Data Vault as much as we want, but if there are no experts capable of modeling it, the debate will remain inconclusive. And what if this problem is only getting worse? In the past, a few dozen source tables might have been sufficient to be processed by the data platform. Today, several hundred source tables could be considered a medium-sized data platform. Tomorrow, there will be thousands of source tables. The reason? There is not only an exponential growth in the volume of data to be produced and processed, but it also comes with an exponential growth in the complexity of data shape. The source of this exponential growth in data shape comes from more complex source databases, APIs that produce and deliver semi-structured JSON data, and, ultimately, more complex business processes and an increasing amount of generated and available data that needs to be analysed for meaningful business results. Generating the Data Vault using Artificial Intelligence Increasingly, this data is generated using artificial intelligence (AI) and still requires integration, transformation, and analysis. The issue is that the number of data engineers, data modelers, and data scientists is not growing exponentially. Universities around the world only produce a limited number of these roles, and some of us would like to retire one day. Based on our experience, the increase in these roles is linear at best. Even if you argue for exponential growth in these roles, it is evident that there is no debate about a growing gap between the increasing data volume and the people who should analyse it. This gap cannot be closed by humans in the future. Even in a world where all kids want to become and eventually work in a data role. Sorry for all the pilots, police officers, nurses, doctors, etc., there is no way for you to retire without the whole economy imploding. Therefore, the only way to close the gap is through the use of artificial intelligence. It is not about reducing the data roles. It's about making them efficient so that they can deal with the growing data shape (and not just the volume). For a long time, it was common sense in the industry that, if an artificial intelligence could generate or define the Raw Data Vault, it would be an assisting technology. The AI would make recommendations, for example, such as which hubs or links to model and which business keys to use. The human data modeler would make the final decision, with input from the AI. But what if the AI made the final decision? What would it look like? What if one could attach data sources to the AI platform and the AI would analyze the source datasets, come up with a Raw Data Vault model, and load that model into Vaultspeed or another data warehouse automation tool, know the source system’s data, know Data Vault modelling, and understand the selected data warehouse automation? These questions were posed by Michael Olschimke, a Data Vault and AI expert, when initially considering the challenge. He researched the distribution of neural networks on massively parallel processing (MPP) clusters to classify unstructured data at Santa Clara University in Silicon Valley. This prior AI research, combined with the knowledge he accumulated in the Data Vault, enabled him to build a solution that later became known as Flow.BI. Flow.BI as a Generative AI to Define the Raw Data Vault The solution is simple, at least from the outside: attach a few data sources, let the AI do the rest. Flow.BI supports several data sources already, including Microsoft SQL Server and derivatives, such as Synapse and Fabric, as long as a JDBC driver is available, Flow.BI should eventually be able to analyze the data source. And the AI doesn’t care if the data originates from a CRM system, such as Microsoft Dynamics, or an e-commerce platform; it's just data. There are no provisions in the code to deal with specific datasets, at least for now. The goal of Flow.BI is to produce a valid, that is, consistent and integrated, enterprise data model. Typically, this follows a Data Vault design, but it's not limited to that (we’ll discuss this later in the article). This is achieved by following a strict data-driven approach that imitates the human data modeler. Flow.BI needs data to make decisions, just like its human counterpart. Source entities with no data will be ignored. It only requires some metadata, such as the available entities and their columns. Datatypes are nice-to-have; primary keys and foreign keys would improve the target model, just like entity and column descriptions. But they are not required to define a valid Raw Data Vault model. Humans write this text, and as such, we like to influence the result of the modelling exercise. Flow.BI is appreciating this by offering many options for the human data modeler to influence the engine. Some of them will be discussed in this article, but there are many more already available and more to come. Flow.BI’s user interface is kept as lean and straightforward as possible: the solution is designed so that the AI should take the lead and model the whole Raw Data Vault. The UI’s purpose is to interact with human data modelers, allowing them to influence the results. That’s what many screens are related to - and the configuration of the security system. A client can have multiple instances, which result in independent Data Vault models. This is particularly useful when dealing with independent data platforms, such as those used by HR, the compliance department, or specific business use cases, or when creating the raw data foundation for data products within a data mesh. In this case, a Flow.BI instance equals a data product. But don’t underestimate the complexity of Flow.BI: The frontend is used to manage a large number of compute clusters that implement scalable agents to work on defining the Raw Data Vault. The platform is implementing full separation of data and processing, not only by client but also by instance. Mapping Raw Data to Organizational Ontology The very first step in the process is to identify the concepts in the attached datasets. For this purpose, there is a concept classifier that analyses the data and recognizes datasets and their classified concepts that it has seen in the past. A common requirement of clients is that they would like to leverage their organizational requirements in this process. While Flow.BI doesn’t know a client’s ontology; it is possible to override (and in some cases, complete) the concept classifications and refer to concepts from the organizational ontology. By doing so, Flow.BI will integrate the source system’s raw data into the organization's ontology. It will not create a logical Data Vault, which is where the Data Vault model reflects the desired business, but instead model the raw data as the business uses it, and therefore follow the data-driven Data Vault modeling principles that Michael Olschimke has taught to thousands of students over the years at Scalefree. Flow.BI also allows the definition of a multi-tenant Data Vault model, where source systems either provide multi-tenant data or are assigned to a specific tenant. In both cases, the integrated enterprise data model will be extended to allow queries across multiple tenants or within a single tenant, depending on the information consumer’s needs. Ensuring Security and Privacy Flow.BI was designed with security and privacy in mind. From a design perspective, this has two aspects: Security and privacy in the service itself, to protect client solutions and related assets Security and privacy are integral to the defined model, allowing for the effective utilization of Data Vault’s capabilities in addressing security and privacy requirements, such as satellite splits. While Flow.BI is using a shared architecture; all data and metadata storage and processing are separated by client and instance. However, this is often not sufficient for clients as they hesitate to share their highly sensitive data with a third party. For this reason, Flow.BI allows two critical features: Local data storage: instead of storing client data on Flow.BI infrastructure, the client provides an Azure Data Lake Storage to be used for storing the data. Local data processing: A Docker container can be deployed into the client’s infrastructure to access the client's data sources, extract the data, and process it. When using both options, only metadata, such as entity and column names, constraints, and descriptions, are shared with Flow.BI. No data is transferred from the client’s infrastructure to Flow.BI. The metadata is secured on Flow.BI’s premises as if it were actual data: row-level security separates the metadata by instance, and roles and permissions are defined per client who can access the metadata and what they can do with it. But security and privacy are not limited to the service itself. The defined model also utilizes the security and privacy features of Data Vault. For example, it enables the classification of source columns based on security and privacy. The user can set up security and privacy classes and apply them to the influence screen for both. By doing so, the column classifications are used when defining the Raw Data Vault and can later be used to implement a satellite split in the physical model (if necessary). An upcoming release will include an AI model for classifying columns based on privacy, utilizing data and metadata to automate this task. Tackling Multilingual Challenges A common challenge for clients is navigating multilingual data environments. Many data sources use English entity and column names, but there are systems using metadata in a different language. Also, the assumption that the data platform should use English metadata is not always correct. Especially in government clients, the use of the official language is mandatory. Both options, translating the source metadata to English (the default within Flow.BI) and translating the defined target model into any target language, are supported by Flow.BI’s translations tab on the influence screen: The tab utilizes an AI translator to fully automatically translate the incoming table names, column names, and concept names. However, the user can step in and override the translation to improve it to their needs. All strings of the source metadata and the defined model are passed through the translation module. It is also possible to reuse existing translations for a growing list of popular data sources. This feature enables readable names for satellites and their attributes (as well as hubs and links), resulting in a significantly improved user experience for the defined Raw Data Vault. Generating the Physical Model You should have noticed by now that we consistently discuss the defined Raw Data Vault model. Flow.BI is not generating the physical model, that is, the CREATE TABLE and INSERT INTO statements for the Raw Data Vault. Instead, it “just” defines the hubs, links, and satellites required for capturing all incoming data from the attached data sources, including business key selection, satellite splits, and special entity types, such as non-historized links and their satellites, multi-active satellites, hierarchical links, effectivity satellites, and reference tables. Video on Generating Physical Models This logical model (not to be confused with “logical Data Vault modelling”) is then provided to our growing number of ISV partner solutions that will consume our defined model, set up the required metadata in their tool, and generate the physical model. As a result, Flow.BI acts as a team member that analyses your organizational data sources and their data, knows how to model the Raw Data Vault, and how to set up metadata in the tool of your choice. The metadata is provided by Flow.BI can be used to model the landing zone/staging area (either on a data lake or a relational database such as Microsoft Fabric) and the Raw Data Vault in a data-driven Data Vault architecture, which is the recommended practice. With this in mind, Flow.BI is not a competition to Vaultspeed or your other existing data warehouse automation solution, but a valid extension that integrates with your existing tool stack. This makes it much easier to justify the introduction of Flow.BI to the project. Going Beyond Data Vault Flow.BI is not limited to the definition of Data Vault models. While it has been designed with the Data Vault concepts in mind, a customizable expert system is used to define the Data Vault model. Although the expert system is not yet publicly available, it has already been implemented and is in use for every model generation. This expert system enables the implementation of alternative data models, provided they adhere to data-driven, schema-on-read principles. Data Vault is such an example, but many others are possible, as well: Customized Data Vault models Inmon-style enterprise models in third-normal form (3NF, if no business logic is required Kimball-style analytical models with facts and dimensions, again without business logic Semi-structured JSON and XML document collections Key-value stores “One Big Table (OBT)” models “Many Big Related Table (MBRT)” models Okay, we’ve just invented the MBRT model as we're writing the article, but you get the idea: many large, fully denormalized tables with foreign–key relationships between each other. If you've developed your data-driven model, please get in touch with us. About the Authors Michael Olschimke is co-founder and CEO of Flow.BI, a generative AI that defines integrated enterprise data models, such as (but not limited to) Data Vault. Michael has trained thousands of industry data warehousing professionals, taught academic classes, and published regularly on topics around data platforms, data engineering, and Data Vault. He has over two decades of experience in information technology, with a specialization in business intelligence topics, artificial intelligence and data platforms. <<< Back to Blog Series Title Page
Naveed-Hussain
Oct 10, 2025 Place Analytics on Azure Blog
757Views
0likes
0Comments
Azure Stream Analytics Virtual Network Integration Goes GA!
We are thrilled to announce that the highly anticipated capability of running your Azure Stream Analytics (ASA) job in an Azure Virtual Network (VNET) is now generally available (GA)! This feature, which has been in public preview, is set to revolutionize how you secure and manage your ASA jobs by leveraging the power of virtual networks. What Does This Mean for You? With VNET integration, you can now lock down access to your ASA jobs within your virtual network infrastructure. This provides enhanced security through network isolation, ensuring that your data remains protected and accessible only within your private network. By deploying a containerized instance of your ASA job inside your VNET, you can privately access your resources using: Private Endpoints: These allow you to connect your VNET-injected ASA job to your data sources privately via Azure Private Link. This means that your data traffic remains within the Azure backbone network, reducing exposure to the public internet and enhancing security. Service Endpoints: These enable you to connect your data sources directly to your VNET-injected ASA job. This simplifies the network architecture by providing direct connectivity. Service Tags: These allow you to manage network security by defining rules that allow or deny traffic to Azure Stream Analytics. This helps in maintaining a secure environment by controlling which services can communicate with your ASA jobs. Overall, VNET integration enhances the security of your ASA jobs by leveraging Azure's robust networking features. Expanded Regional Availability We are also excited to announce that this capability is now available in additional regions! Along with the existing regions (West US, Central Canada, East US, East US 2, Central US, West Europe, and North Europe), you can now enable VNET integration in the following regions: Australia East France Central North-Central US Southeast Asia Brazil South Japan East UK South Central India These regions were added in response to customer feedback. If you have suggestions for additional regions, please complete this form: https://forms.office.com/r/NFKdb3W6ti?origin=lprLink This expansion ensures that more customers around the globe can benefit from the enhanced security and network isolation provided by VNET integration. Getting Started To get started with VNET integration for your ASA jobs, follow these steps: Set Up Your VNET: Create or use an existing Azure Virtual Network. Create a Subnet: Add a dedicated subnet for your ASA job within the VNET. Set Up Azure NAT Gateway or disable outbound connectivity: Enhance security and reliability by setting up an Azure NAT Gateway or disable default outbound connectivity. Associate a Storage Account: Ensure you have a General Purpose V2 (GPV2) Storage account linked to your ASA job. Configure Your ASA Job: Azure Portal: Go to Networking and select "Run this job in virtual network." Follow the prompts to configure and save. Visual Studio Code: In the 'JobConfig.json' file, set up the 'VirtualNetworkConfiguration' to reference the subnet. Check Permissions: Make sure you have the necessary Role-based access control permissions on the subnet or higher. For detailed instructions and requirements, refer to the official documentation Run your Stream Analytics in Azure virtual network - Azure Stream Analytics | Microsoft Learn. Join the Revolution Stay tuned for more updates and exciting features as we continue to innovate and improve Azure Stream Analytics. Our other Ignite releases include Azure Stream Analytics Kafka Connectors is Now Generally Available! If you have any questions or need assistance, feel free to reach out to us at askasa@microsoft.com. Happy streaming!
Anasheh_Boisvert
Nov 19, 2024 Place Analytics on Azure Blog
486Views
0likes
0Comments
AI/ML ModelOps is a Journey. Get Ready with SAS® Viya® Platform on Azure
Do you want easy answers to following set of questions? Then this article is for you. How many AI/ML models do we have? Where are they stored/inventoried? When was each model updated? By whom? How? Who manages our models? Are the right models being used in production? How do we know? What effort is needed to deploy models? Who’s responsible? Are there documented processes? How long does a model take to be deployed? How old is the data it was trained on? Is the data clean and trustworthy? How are models performing? How do we compare different models for the same use case over time? Does IT work with our analytics teams to create development environments that make it possible to create models that can be easily deployed?
ravisha
Mar 04, 2024 Place Analytics on Azure Blog
6.8KViews
2likes
0Comments
Run your Azure Stream Analytics job inside your Azure Virtual Network (Public Preview)
This article describes the Azure Stream Analytics virtual network integration feature and how to set it up.
Anasheh_Boisvert
Jul 31, 2023 Place Analytics on Azure Blog
3.9KViews
2likes
0Comments
Azure Stream Analytics has Launched a New Competitive Pricing Model!
ASA is proud to offer best-in-class price to performance ratio with a new pricing model.
Anasheh_Boisvert
Jul 21, 2023 Place Analytics on Azure Blog
7.4KViews
2likes
3Comments
Agile Data Vault 2.0 Projects with Azure DevOps
Having discussed the value of Data Vault 2.0 and the associated architectures in the previous articles of this blog series, this article will focus on the organization and successful execution of Data Vault 2.0 projects using Azure DevOps. It will also discuss the differences between standard Scrum, as used in agile software development, and the Data Vault 2.0 methodology, which is based on Scrum but also includes aspects from other methodologies. Other functions of Azure DevOps, for example the deployment of the data analytics platform, will be discussed in subsequent articles of this ongoing blog series.
Naveed-Hussain
Jul 19, 2023 Place Analytics on Azure Blog
7.2KViews
1like
0Comments
Data Science with Azure Synapse and Data Vault 2.0
The use of a Managed Self-Service BI with Data Vault 2.0 is demonstrated. The architecture is described, processes explained and compared to “classical” data science approaches (e.g., sandboxing).
Naveed-Hussain
Jul 19, 2023 Place Analytics on Azure Blog
4.5KViews
0likes
0Comments
Private Preview of Kafka Input and Output with Azure Stream Analytics
Azure Stream Analytics now allows you to directly connect to Kafka clusters to ingest and output data. The Kafka adapters by Azure Stream Analytics are managed by Microsoft's Azure Stream Analytics team, allowing it to meet business compliance standards without managing extra infrastructure. The Kafka adapters are backward compatible and support versions starting from version 0.10 with the latest client release.
ebenezer_nkrumah
May 23, 2023 Place Analytics on Azure Blog
4KViews
2likes
0Comments