Microsoft Fabric

Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive—your cloud architecture team can choose the integration strategy that best aligns with your organization's governance model, workload requirements, and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.

Azure Databricks & Fabric Disaster Recovery: The Better Together Story
Authors: Amudha Palani (amudhapalani), Eric Kwashie (ekwashie), Peter Lo (PeterLo), and Rafia Aqil (Rafia_Aqil)

Disaster recovery (DR) is a critical component of any cloud-native data analytics platform, ensuring business continuity even during rare regional outages caused by natural disasters, infrastructure failures, or other disruptions.

Identify Business-Critical Workloads

Before designing any disaster recovery strategy, organizations must first identify which workloads are truly business-critical and require regional redundancy. Not all Databricks or Fabric processes need full DR protection; instead, customers should evaluate the operational impact of downtime, data freshness requirements, regulatory obligations, SLAs, and dependencies across upstream and downstream systems. By classifying workloads into tiers and aligning DR investments accordingly, customers ensure they protect what matters most without over-engineering the platform.

Azure Databricks

Azure Databricks requires a customer-driven approach to disaster recovery, where organizations are responsible for replicating workspaces, data, infrastructure components, and security configurations across regions.

Full System Failover (Active-Passive) Strategy

A comprehensive approach that replicates all dependent services to the secondary region. Implementation requirements include:

Infrastructure Components:
- Replicate Azure services (ADLS, Key Vault, SQL databases) using Terraform
- Deploy network infrastructure (subnets) in the secondary region
- Establish data synchronization mechanisms

Data Replication Strategy:
- Use Deep Clone for Delta tables rather than geo-redundant storage
- Implement periodic synchronization jobs using Delta's incremental replication
- Measure data transfer results using time travel syntax

Workspace Asset Synchronization:
- Co-deploy cluster configurations, notebooks, jobs, and permissions using CI/CD
- Utilize Terraform and SCIM for identity and access management
- Keep job concurrency at zero in the secondary region to prevent execution

Fully Redundant (Active-Active) Strategy

The most sophisticated approach, in which all transactions are processed in multiple regions simultaneously. While providing maximum resilience, this strategy:
- Requires complex data synchronization between regions
- Incurs the highest operational costs due to duplicate processing
- Is typically needed only for mission-critical workloads with zero tolerance for downtime
- Can be implemented as partial active-active, processing most workloads in the primary region with a subset in the secondary

Enabling Disaster Recovery

- Create a secondary workspace in a paired region.
- Use CI/CD to keep workspace assets synchronized continuously:

| Requirement | Approach | Tools |
| --- | --- | --- |
| Cluster Configurations | Co-deploy to both regions as code | Terraform |
| Code (Notebooks, Libraries, SQL) | Co-deploy with CI/CD pipelines | Git, Azure DevOps, GitHub Actions |
| Jobs | Co-deploy with CI/CD, set concurrency to zero in secondary | Databricks Asset Bundles, Terraform |
| Permissions (Users, Groups, ACLs) | Use IdP/SCIM and infrastructure as code | Terraform, SCIM |
| Secrets | Co-deploy using secret management | Terraform, Azure Key Vault |
| Table Metadata | Co-deploy with CI/CD workflows | Git, Terraform |
| Cloud Services (ADLS, Network) | Co-deploy infrastructure | Terraform |

- Update your orchestrator (ADF, Fabric pipelines, etc.) to include a simple region toggle to reroute job execution.
- Replicate all dependent services (Key Vault, storage accounts, SQL DB).
- Implement Delta "Deep Clone" synchronization jobs to keep datasets continuously aligned between regions (a minimal sketch follows below).
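To make the last step concrete, here is a minimal sketch of what a scheduled Deep Clone synchronization notebook might look like. The catalog and table names are illustrative assumptions; the target catalog is assumed to be backed by storage in the paired region, and Unity Catalog permissions are assumed to be in place.

```python
# Minimal sketch of a Deep Clone sync notebook (illustrative names only).
# Runs on a schedule in the primary workspace; DEEP CLONE is incremental on
# re-runs, so only new or changed files are copied to the DR copy.

critical_tables = ["sales.orders", "sales.customers", "finance.transactions"]  # hypothetical list

for table in critical_tables:
    source = f"primary_catalog.{table}"   # hypothetical catalog in the primary region
    target = f"dr_catalog.{table}"        # hypothetical catalog backed by DR-region storage
    spark.sql(f"CREATE OR REPLACE TABLE {target} DEEP CLONE {source}")

    # Basic row-count check to log alongside the sync run.
    src_count = spark.table(source).count()
    tgt_count = spark.table(target).count()
    print(f"{table}: source_rows={src_count}, target_rows={tgt_count}")
```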
For an Active-Active setup, additionally:
- Introduce an application-level "Sync Tool" that redirects data ingestion and compute execution.
- Enable parallel processing in both regions for selected or all workloads.
- Use bi-directional synchronization for Delta data to maintain consistency across regions.
- For performance and cost control, run most workloads in the primary region and only a subset in the secondary to keep it warm.

Implement a Three-Pillar DR Design

Primary Workspace: your production Databricks environment running normal operations.
Secondary Workspace: a standby Databricks workspace in a different (paired) Azure region that remains ready to take over if the primary fails.

This architecture ensures business continuity while optimizing costs by keeping the secondary workspace dormant until needed. The DR solution is built on three fundamental pillars that work together to provide comprehensive protection:

1. Infrastructure Provisioning (Terraform)

The infrastructure layer creates and manages all Azure resources required for disaster recovery using Infrastructure as Code (Terraform). What It Creates:
- Secondary Resource Group: a dedicated resource group in your paired DR region (e.g., if the primary is in East US, the secondary might be in West US 2)
- Secondary Databricks Workspace: a standby Databricks workspace with the same SKU as your primary, ready to receive failover traffic
- DR Storage Account: an ADLS Gen2 storage account that serves as the backup destination for your critical data
- Monitoring Infrastructure: an Azure Monitor Log Analytics workspace and alert action groups to track DR health
- Protection Locks: management locks to prevent accidental deletion of critical DR resources

Key Design Principle: the Terraform configuration references your existing primary workspace without modifying it. It only creates new resources in the secondary region, ensuring your production environment remains untouched during setup.

2. Data Synchronization (Delta Notebooks)

The data synchronization layer ensures your critical data is continuously backed up to the secondary region. The solution uses a Databricks notebook that runs in your primary workspace on a scheduled basis. This notebook:
- Connects to Backup Storage: uses Unity Catalog with Azure Managed Identity for secure, credential-free authentication to the secondary storage account
- Identifies Critical Tables: reads from a configuration list you define (sales data, customer data, inventory, financial transactions, etc.)
- Performs Deep Clone: uses Delta Lake's native CLONE functionality to create exact copies of your tables in the backup storage
- Tracks Sync Status: logs each synchronization operation, tracks row counts, and reports on data freshness

Authentication Flow: the synchronization process leverages Unity Catalog's managed identity capabilities. An existing Access Connector for Unity Catalog is granted "Storage Blob Data Contributor" permissions on the backup storage. Storage credentials are created in Databricks that reference this Access Connector. The notebook uses these credentials transparently—no storage keys or secrets are required.

What Gets Synced: you define which tables are critical to your business operations. The notebook creates backup copies including full table data and schema, table partitioning structure, and Delta transaction logs for point-in-time recovery.

3. Failover Automation (Python Scripts)

The failover automation layer orchestrates the switch from primary to secondary workspace when disaster strikes (a simplified sketch follows below).
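For the third pillar, a heavily simplified failover sketch is shown below. It assumes the secondary workspace's jobs were co-deployed with their schedules paused and uses the Databricks Jobs 2.1 REST API to un-pause them; a production script would also reroute the orchestrator, validate data freshness, handle API pagination, and log every action. The workspace URL and token handling are placeholders.

```python
# Simplified failover sketch (illustrative): un-pause job schedules in the
# secondary Databricks workspace so it can take over processing.
import requests

SECONDARY_HOST = "https://adb-0000000000000000.00.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<token-from-key-vault>"                                        # never hard-code secrets
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

response = requests.get(f"{SECONDARY_HOST}/api/2.1/jobs/list", headers=HEADERS)
jobs = response.json().get("jobs", [])

for job in jobs:
    schedule = job.get("settings", {}).get("schedule")
    if not schedule:
        continue  # ignore jobs without a schedule
    schedule["pause_status"] = "UNPAUSED"  # activate the standby schedule
    requests.post(
        f"{SECONDARY_HOST}/api/2.1/jobs/update",
        headers=HEADERS,
        json={"job_id": job["job_id"], "new_settings": {"schedule": schedule}},
    )
    print(f"Un-paused job {job['job_id']} in the secondary workspace")
```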
Microsoft Fabric

Microsoft Fabric provides built-in disaster recovery capabilities designed to keep analytics and Power BI experiences available during regional outages. Fabric simplifies continuity for reporting workloads, while still requiring customer planning for deeper data and workload replication.

Power BI Business Continuity

Power BI, now integrated into Fabric, provides automatic disaster recovery as a default offering:
- No opt-in required: DR capabilities are automatically included.
- Azure storage geo-redundant replication: ensures backup instances exist in other regions.
- Read-only access during disasters: semantic models, reports, and dashboards remain accessible.
- Always supported: BCDR for Power BI remains active regardless of the OneLake DR setting.

Microsoft Fabric

Fabric's cross-region DR uses a shared responsibility model between Microsoft and customers.

Microsoft's Responsibilities:
- Ensure baseline infrastructure and platform services availability
- Maintain Azure regional pairings for geo-redundancy
- Provide DR capabilities for Power BI by default

Customer Responsibilities:
- Enable disaster recovery settings for capacities
- Set up secondary capacity and workspaces in paired regions
- Replicate data and configurations

Enabling Disaster Recovery

Organizations can enable BCDR through the Admin portal under Capacity settings:
1. Navigate to Admin portal → Capacity settings
2. Select the appropriate Fabric capacity
3. Access the Disaster Recovery configuration
4. Enable the disaster recovery toggle

Critical Timing Considerations:
- 30-day minimum activation period: once enabled, the setting remains active for at least 30 days and cannot be reverted.
- 72-hour activation window: initial enablement can take up to 72 hours to become fully effective.

Azure Databricks & Microsoft Fabric DR Considerations

Building a resilient analytics platform requires understanding how disaster recovery responsibilities differ between Azure Databricks and Microsoft Fabric. While both platforms operate within Azure's regional architecture, their DR models, failover behaviors, and customer responsibilities are fundamentally different.

Recovery Procedures

| Procedure | Databricks | Fabric |
| --- | --- | --- |
| Failover | Stop workloads, update routing, resume in secondary region. | Microsoft initiates failover; customers restore services in DR capacity. |
| Restore to Primary | Stop secondary workloads, replicate data/code back, test, resume production. | Recreate workspaces and items in new capacity; restore Lakehouse and Warehouse data. |
| Asset Syncing | Use CI/CD and Terraform to sync clusters, jobs, notebooks, permissions. | Use Git integration and pipelines to sync notebooks and pipelines; manually restore Lakehouses. |

Business Considerations

| Consideration | Databricks | Fabric |
| --- | --- | --- |
| Control | Customers manage DR strategy, failover timing, and asset replication. | Microsoft manages failover; customers restore services post-failover. |
| Regional Dependencies | Must ensure the secondary region has sufficient capacity and services. | DR only available in Azure regions with Fabric support and paired regions. |
| Power BI Continuity | Not applicable. | Power BI offers built-in BCDR with read-only access to semantic models and reports. |
| Activation Timeline | Immediate upon configuration. | DR setting takes up to 72 hours to activate; 30-day wait before changes allowed. |

Tableau to Power BI Migration: Semantic Layer-First Approach for Cloud Architects
Authors: Lavanya Sreedhar (LavanyaSreedhar), Peter Lo (PeterLo), Aryan Anmol (aryananmol), and Rafia Aqil (Rafia_Aqil)

In this guide, we provide practical guidance for migrating from Tableau to Power BI, with a focus on technical best practices and architecture. By unifying business intelligence on the Microsoft Fabric platform, enterprises gain closer integration with Microsoft 365 (Teams, Copilot, Excel). For cloud solution architects and BI developers, a successful migration is not just about rebuilding dashboards in a new tool. It requires thoughtful architectural planning and a shift to a more model-centric approach to BI.

Why Semantic Layer-First Architecture Matters

The Traditional Migration Challenge

Most Tableau to Power BI migrations follow a dashboard-centric approach: teams attempt to replicate existing Tableau workbooks, calculated fields, and LOD (Level of Detail) expressions directly into Power BI reports. While this may seem efficient initially, it creates significant downstream challenges:
- Duplicated logic: each report embeds its own calculations and business rules, leading to conflicting KPIs across the organization
- Maintenance overhead: changes to business logic require updating dozens or hundreds of individual reports
- Governance gaps: without centralized definitions, semantic drift occurs—different teams calculate "Revenue" or "Active Customer" differently
- Scalability issues: as data volumes grow, report-level transformations become performance bottlenecks

The Semantic Layer-First Alternative

Microsoft's recommended approach centers on semantic models (formerly called datasets)—centralized, governed data models that separate business logic from visualization. The payoff is substantial: when data evolves or business rules change, you update the semantic model once, and all dependent reports automatically reflect the changes—no manual redesign required.

Understanding Migration Complexity: Simple to Very Complex Dashboards

Not all Tableau dashboards are created equal. The migration strategy should align with dashboard complexity, and the semantic layer approach becomes increasingly valuable as complexity grows.

Follow a Step-by-Step Migration Strategy

Migrating from Tableau to Power BI is not a one-click effort – it requires a mix of automated and manual refactoring, plus a sound change management plan. Below are key strategies and best practices for a successful migration:

- Audit your Tableau estate: start by taking inventory of all existing Tableau workbooks, data sources, and dashboards. Determine what needs to be migrated (focus on high-value, widely used reports first) and identify any redundant or obsolete content that can be retired rather than converted.
- Conduct a proof-of-concept (PoC): before migrating everything, pick a representative complex dashboard (or a subset of your data) and perform a pilot migration. This will help you validate that Power BI can connect to your data (e.g. setting up the Power BI gateways for on-premises sources), test performance (Import vs DirectQuery modes), and experiment with replicating key visuals or calculations. Use the PoC to uncover any surprises early – for example, test that any Level of Detail expressions or table calculations in Tableau can be re-created in DAX. The lessons learned here should inform your overall project plan.
- Use a phased migration approach: plan to run Tableau and Power BI in parallel for some period, rather than switching everything at once.
Migrate in waves – for example, by business unit or subject area – and incorporate user feedback as you go. This phased approach reduces risk and allows your team to improve the process with each iteration. It also gives end users time to adjust gradually.
- Migrate high-impact dashboards first: prioritize the migration of key reports and dashboards that are critical to the business or have the most usage. Delivering these early wins will not only surface any technical challenges to solve but will also help demonstrate the value of Power BI's capabilities to stakeholders. Early success builds buy-in and momentum for the rest of the migration.
- Reimagine (don't just replicate) the experience: it's rarely possible – or desirable – to exactly re-create every Tableau visualization pixel-for-pixel in Power BI. Embrace the opportunity to focus on business questions and improve user experience with Power BI's features. For example, rather than replicating a complex Tableau workaround, you might implement a cleaner solution in Power BI using native features (like bookmarks, drilldowns, or simpler navigation between pages). Engage business users and subject matter experts during this redesign to ensure the new reports meet their needs.
- Enable dataset reusability: one major benefit of the Power BI approach is the ability to create shared datasets and dataflows. As you migrate, look for opportunities to create central semantic models (datasets) that can serve multiple reports. For instance, if several Tableau workbooks are all using similar data about sales, you can create one central Sales dataset in Power BI. Report creators across the organization can then build different Power BI reports on that single dataset without duplicating data or logic. This reduces maintenance and promotes a "build once, reuse often" strategy.
- Provide training and support: expect a learning curve for teams moving to Power BI – especially those who are very fluent in Tableau. Plan for user upskilling and training programs. Establish a support community or office hours where new users can ask questions and get help. If possible, identify Power BI champions or recruit a Power BI Center of Excellence (COE) team who can guide others. During the transition, ensure there are subject matter experts (SMEs) available to address questions and validate that the new reports are correct.
- Manage change and expectations: it's important to communicate why the organization is moving to Power BI (e.g. benefits like deeper integration, lower TCO, better governance) to get buy-in from end users. Some users may be resistant to change, especially if they've invested a lot of time in mastering Tableau. Prepare to handle varying responses – emphasize the personal benefits (like improved performance, new capabilities, or career growth with popular skills) to encourage adoption. Also, involve influential business users early and gather their feedback, so they feel ownership in the new solution.
- Establish governance from Day 1: don't wait until after migration to think about governance. Use this chance to set up Power BI governance aligned to best practices. Decide on important aspects such as workspace naming conventions, who can create or publish content, how you'll monitor usage and costs, and how to manage data access and security (for example, designing a strategy for RLS, and deciding when to use per-user datasets vs. organizational semantic models).
Good governance will ensure your shiny new Power BI environment doesn't sprawl into chaos over time.
- Allow time for adjustment and iteration: finally, be patient and iterative. Depending on the scale of your organization and the number of Tableau assets, a full migration can take months or even a year or more. Plan realistic transition periods where both systems might coexist. Continuously refine your approach with each wave of migration. Power BI's frequent update cadence (monthly releases) means new features may emerge even during your project – stay updated, as new capabilities could simplify your migration (for example, the introduction of field parameters or Copilot might let you modernize certain Tableau features more easily).

Phase 1: Assessment and Planning

1. Audit Your Tableau Estate
- Inventory all workbooks, data sources, and calculated fields
- Identify high-traffic dashboards (prioritize for early migration)
- Categorize by complexity (Simple/Medium/Complex/Very Complex)

2. Design Your Semantic Architecture
- Map Tableau data sources to Power BI data sources (DirectQuery, Import, or Direct Lake)
- Plan a star schema for fact/dimension tables
- Identify shared calculations that should live in semantic models vs. report-specific logic

3. Choose Storage Modes

| Source Type | Recommended Mode | Rationale |
| --- | --- | --- |
| Databricks Delta Lake | Direct Lake | Real-time analytics, no refresh lag |
| Azure SQL Database | DirectQuery or Import | Based on data volume and refresh SLAs |
| On-Premises SQL Server | Import (via Gateway) | Network latency considerations |
| Excel/CSV files | Import | Small reference data |

Phase 2: Build the Semantic Layer

1. Create Star Schema Data Models

Tableau often relies on flat, denormalized datasets. Power BI performs best with star schemas:
- Fact tables: transactional data (sales, orders, events) with foreign keys to dimensions
- Dimension tables: descriptive attributes (customers, products, dates) with primary keys
- Relationships: one-to-many from dimension to fact, leveraging bidirectional filtering sparingly

2. Migrate Calculations to DAX Measures

Convert Tableau calculated fields to DAX measures in the semantic model:

```
-- Example DAX, defined as a measure:
Total Revenue =
SUMX(
    'Sales',
    'Sales'[Quantity] * 'Sales'[Unit Price]
)
```

2.1 Use Copilot to Accelerate DAX Development

Leverage Copilot in Power BI Desktop to generate and validate DAX:
- Describe the calculation in natural language
- Copilot suggests DAX syntax
- Review, test, and refine

Phase 3: Understanding Migration Complexity: Simple to Very Complex Dashboards

1. Dashboard Conversion Best Practices
- Think in "pages" not "sheets": Power BI reports combine multiple visuals per page; group related visuals logically
- Use slicers for interactivity: replace Tableau filters with Power BI slicers and the filter pane
- Leverage bookmarks for navigation: create dynamic report experiences with show/hide containers

Simple Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Microsoft Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Single custom SQL | Power Query for data shaping and ETL. | OneLake Shortcuts for unified data access. | Use star schema for optimized performance; push logic into the semantic layer rather than visuals. |
| Calculations | Basic IF/ELSE, SUM | Data Analysis Expressions (DAX) for measures and calculated columns. | Copilot for Power BI to assist with DAX creation. Fabric IQ for natural language queries. | Centralize calculations in semantic models for consistency and governance. |

Medium Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Multiple custom SQL (up to 3) | Connect live to databases (Azure Databricks) with DirectQuery in Power BI; connect with cloud data sources (Power BI data sources). | OneLake Shortcuts for unified access without Databricks compute cost. Semantic models can combine multiple sources. | Optimize with star schema; prefer OneLake Shortcuts for performance; avoid heavy transformations in visuals. |
| Calculations | Nested IFs, CASE | Data Analysis Expressions (DAX) for measures and calculated columns. | Copilot for Power BI to assist with DAX creation. Fabric Data Agent for conversational BI. Fabric IQ for natural language queries. | Centralize logic in semantic models; use Copilot for automation and validation; keep calculations reusable. |
| Reporting | Tooltip format in Bar and Map visuals; Select All/Clear option for Single Select dropdown | Standard tooltips offer help tooltips, text, and background formatting; dynamic tooltips let you create a tooltip page and reuse it in multiple visuals (Create report tooltip pages in Power BI - Microsoft Learn). Use a Clear All Slicers button: disable Single Select, add the button, customize it, and use it. | | The customization is much better than the out-of-the-box tooltips. |

Complex Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Multiple sources; create relationships using more than one column | Composite Models in Power BI (DirectQuery + Import) for combining multiple sources, plus connections to various cloud services; Dataflows for pre-processing. Power BI allows a relationship between two tables based on only one active column. | OneLake Shortcuts for unified access without Azure Databricks compute cost; Microsoft Fabric Dataflows Gen2 offers multiple ways to ingest, transform, and load data efficiently. | Consolidate sources into semantic models; use Direct Lake for performance; plan and design the data model to comply with the star schema supported by Power BI. |
| Relationships | | DAX USERELATIONSHIP for activating a relationship in Power BI for a specific calculation. | | |
| Calculations | LOD, window functions | Data Analysis Expressions (DAX) for measures and calculated columns; Copilot to assist with complex DAX. Change how visuals interact in a Power BI report. | Fabric IQ Ontology for semantic alignment; Q&A visual for natural language; Fabric Data Agent for conversational BI. | Centralize calculations in the semantic layer; use variables in DAX for readability and performance. |

Very Complex Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Multi-source, Excel, SQL | Composite Models in Power BI (DirectQuery + Import) for combining multiple sources, plus connections to various cloud services; Dataflows for pre-processing. | OneLake Shortcuts for unified access; built-in connector support (Connector overview); Mirroring for real-time sync. | Combine multiple sources into well-structured semantic models for consistency and optimized performance. |
| Calculations | Predictive logic | Data Analysis Expressions (DAX) for measures and calculated columns. | Fabric AutoML, ML models, AI Insights, Python/R, notebook-based ML (Spark/Scikit-Learn), Fabric AI Functions, Fabric IQ Ontology, Fabric Data Agent for conversational BI. | Centralize logic in semantic models; leverage Copilot for automation and parameter-driven workflows. Prepare for Copilot and Q&A integration. |

2. Tableau Feature Equivalents

| Tableau Feature | Power BI Equivalent | Microsoft Learn Link |
| --- | --- | --- |
| Calculated Fields | DAX Measures | DAX Documentation |
| Parameters | Field Parameters / Bookmarks | Use report readers to change visuals |
| Actions | Drillthrough / Bookmarks | Drillthrough |
| Tableau Prep | Power Query / Dataflows | Differences between Dataflow Gen1 and Dataflow Gen2 |
| Tableau Server | Power BI Service | What is Power BI? Overview of Components and Benefits |

Phase 4: Governance and Deployment

Workspace Planning (Dev / Test / Prod Separation)

A proper workspace strategy is essential for governed deployments in Fabric and Power BI. Fabric supports separate Development, Test, and Production stages using Deployment Pipelines, enabling controlled promotion of semantic models, reports, dataflows, notebooks, lakehouses, and other items. You can assign each workspace to a pipeline stage (Dev → Test → Prod) to ensure safe lifecycle management.

Sensitivity Labeling (Microsoft Purview Information Protection)

Sensitivity labels allow governed classification and protection of data across Fabric items. Sensitivity labels can be applied directly to Fabric items (semantic models, reports, dataflows, etc.) through the item's header flyout or the item settings. Labels from Microsoft Purview Information Protection enforce data access rules and help organizations meet compliance requirements.

Endorsement & Certification (Promoted, Certified, Master Data)

Endorsement improves discoverability and trust in shared organizational content.
- Promoted: item creators mark content as recommended for broader use.
- Certified: administrators or authorized reviewers validate that content meets organizational quality standards.
- Master Data: indicates authoritative single-source-of-truth items such as semantic models or lakehouses.
All Fabric items except dashboards can be promoted or certified; data-containing items can be designated as Master Data.

Monitoring & Capacity Planning

Determine the appropriate size for Fabric capacity when migrating from Tableau to Power BI. The Fabric SKU Estimator can generate a SKU recommendation (estimate) for your capacity requirements. Ensuring performance and cost efficiency requires ongoing monitoring of your Fabric capacity. Microsoft recommends evaluating workloads using Fabric Capacity Metrics and planning SKU sizes based on real usage. Fabric uses bursting and smoothing to handle spikes while enforcing capacity limits. Monitoring helps identify high compute usage, background refreshes, and interactive workloads to optimize performance.

Fabric Data Source Connections (OneLake + Manage Connections)

Microsoft Fabric is designed as an end-to-end analytics platform that integrates data from many different source systems into a unified environment powered by OneLake, Data Factory, Real-Time Analytics, Dataflows, Lakehouses, Warehouses, and Mirrored Databases.

The Strategic Advantage: Semantic Layer + Fabric IQ

The semantic layer-first approach sets the foundation for the next evolution in enterprise analytics.
Fabric IQ (announced at Ignite 2025) is Microsoft's semantic intelligence platform that auto-elevates semantic models into ontologies—structured knowledge graphs that power AI agents, Copilot experiences, and cross-domain data reasoning. What this means for your migration:
- Semantic models you build today become the foundation for AI-driven analytics tomorrow
- Data Agents can reason across multiple semantic models, answering questions that span domains
- Business users transition from "report consumers" to "data explorers" via natural language interfaces

Conclusion: Build for the Future, Not Just for Today

Migrating from Tableau to Power BI is more than a technology swap—it's an opportunity to re-architect your analytics strategy for the cloud-native, AI-powered era. The semantic layer-first approach requires upfront investment in data modeling, DAX expertise, and Fabric platform adoption. But the payoff is transformative:
- Consistency: a single source of truth for all business metrics
- Scalability: semantic models that serve hundreds of reports and thousands of users
- Agility: changes to business logic propagate instantly across the enterprise
- Future-readiness: a foundation for Fabric IQ, Data Agents, and AI-driven insights

Start your migration with the end in mind: the goal is not just to convert dashboards, but to build a modern, governed, AI-ready analytics platform that scales with your business.

Addressing Key Migration Concerns

(1) Why a semantic-layered model approach is better than recreating Tableau dashboards

A semantic-layered modeling approach is the optimal strategy for migration and is significantly more effective than attempting to recreate Tableau dashboards exactly as they exist. Recreating dashboards one-for-one carries each workbook's embedded logic forward; by contrast, Power BI and Fabric encourage a semantic model–first architecture, where all business rules, relationships, calculations, and transformations are centralized in a governed model that serves many dashboards. This approach not only provides consistency and reuse across the enterprise but also ensures that report authors build on a single certified version of the truth.

(2) How a semantic-layered model approach reduces the constant redesign caused by changing data needs

A semantic-layered modeling approach directly addresses the concern about constant changes and frequent redesigns of dashboards when data evolves. With a semantic layer, changes are absorbed in the model layer—so the logic is updated once and flows automatically into all dependent reports. Combined with Fabric features like OneLake shortcuts, Direct Lake mode, and centralized governance, the semantic layer drastically reduces breakage, minimizes rework, and ensures scalability as data continues to grow and shift.

Additional Resources
- Direct Lake in Microsoft Fabric
- Create Fabric Data Agents
- OneLake Shortcuts
- Write DAX queries with Copilot - DAX

How to Join Microsoft Tech Community & Leina Future Data & AI Hub (Step-by-Step Guide) - Arabic
This session provides a step-by-step walkthrough on how to officially join the Microsoft Tech Community and become a member of the Leina Future Data & AI Hub – Microsoft User Group. The event is designed to help learners, trainers, and data professionals understand:
• How to create and verify a Microsoft Tech Community profile
• How to join an official Microsoft User Group
• How community membership supports learning, networking, and professional growth
• How participants can access recorded sessions, resources, and future events

This session supports community enablement and knowledge sharing within the Microsoft ecosystem, with a focus on Data, AI, Power BI, and Microsoft Fabric learners.

Target Audience:
• Students and fresh graduates
• Data analysts and Excel/Power BI learners
• Trainers and professionals interested in Microsoft communities

Session Format:
• Recorded session
• Practical walkthrough
• Community-focused learning

Speaker / Host: Dr. Leina Nazar Abdelgalil, Microsoft Certified Trainer (MCT), Founder – Leina Future Data & AI Hub

Overload to Optimal: Tuning Microsoft Fabric Capacity
Co-Authored by: Daya Ram, Sr. Cloud Solutions Architect, and Rafia Aqil, Cloud Solutions Architect

Optimizing Microsoft Fabric capacity is both a performance and cost exercise. By diagnosing workloads, tuning cluster and Spark settings, and applying data best practices, teams can reduce run times, avoid throttling, and lower total cost of ownership—without compromising SLAs. Use Fabric's built-in observability (Monitoring Hub, Capacity Metrics, Spark UI) to identify hot spots and then apply cluster- and data-level remediations.

Capacity Planning

For capacity planning and sizing guidance, see Plan your capacity size. Selecting the wrong SKU can lead to two major issues:
- Over-provisioning: paying for resources you don't need.
- Under-provisioning: struggling with performance bottlenecks and failed jobs.

To simplify this process, Microsoft provides the Fabric SKU Estimator, a powerful tool designed to help organizations accurately size their capacity based on real-world usage patterns. Run the SKU Estimator before onboarding new workloads or scaling existing ones. Combine its recommendations with monitoring tools like Fabric Capacity Metrics to validate performance and adjust as needed.

Options to Diagnose Capacity Issues

1) Monitoring Hub — Start with the Story of the Run

What to use it for: browse Spark activity across applications (notebooks, Spark Job Definitions, and pipelines). Quickly surface long-running or anomalous runs; view read/write bytes, idle time, core allocation, and utilization.

How to use it:
- From the Fabric portal, open Monitoring (Monitor Hub).
- Select a Notebook or Spark Job Definition run and choose Historical Runs.
- Inspect the Run Duration chart; click on a run to see read/write bytes, idle time, core allocation, overall utilization, and other Spark metrics.

What to look for: use the application detail monitoring guide to review and monitor your application.

2) Capacity Metrics App — Measure the Whole Environment

What to use it for: review capacity-wide utilization and system events (overloads, queueing); compare utilization across time windows and identify sustained peaks.

How to use it:
- Open the Microsoft Fabric Capacity Metrics app for your capacity.
- Review the Compute page (ribbon charts, utilization trends) and the System events tab to see overload or throttling windows.
- Use the Timepoint page to drill into a 30-second interval and see which operations consumed the most compute.

What to look for: use the troubleshooting guide (Monitor and identify capacity usage) to pinpoint top CU-consuming items.

3) Spark UI — Diagnose at a Deeper Level

Why it matters: Spark UI exposes skew, shuffle, memory pressure, and long stages. Use it after Monitoring Hub/Capacity Metrics to pinpoint the problematic job.

Key tabs to inspect:
- Stages: uneven task durations (data skew), heavy shuffle read/write, large input/output volumes.
- Executors: storage memory, task time (GC), shuffle metrics. High GC or frequent spills indicate memory tuning is needed.
- Storage: which RDDs/cached tables occupy memory; any disk spill.
- Jobs: long-running jobs and gaps in the timeline (driver compilation, non-Spark code, driver overload).

What to look for: data skew, memory pressure, and high or low shuffle volumes. Adjust Apache Spark settings (e.g., spark.ms.autotune.enabled, spark.task.cpus, and spark.sql.shuffle.partitions), set via environment Spark properties or session config; a minimal sketch follows below.
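As a starting point, the settings above can be applied per session directly in a Fabric notebook. The values below are illustrative and should be validated against Spark UI and the Capacity Metrics app before being promoted to environment-level Spark properties.

```python
# Illustrative session-level tuning in a Fabric notebook; `spark` is the
# session object provided by the notebook runtime.

spark.conf.set("spark.ms.autotune.enabled", "true")     # let Autotune adjust per-query settings
spark.conf.set("spark.sql.shuffle.partitions", "200")   # example value; size to data volume and cores
spark.conf.set("spark.task.cpus", "1")                  # cores reserved per task

print(spark.conf.get("spark.sql.shuffle.partitions"))
```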
Remediation and Optimization Suggestions

A) Cluster & Workspace Settings

Runtime & Native Execution Engine (NEE): use Fabric Runtime 1.3 (Spark 3.5, Delta 3.2) and enable the Native Execution Engine to boost performance; enable it at the environment level under Spark compute → Acceleration.

Starter Pools vs. Custom Pools:
- Starter Pool: prehydrated, medium-size pools; fast session starts, good for dev/quick runs.
- Custom Pools: size nodes, enable autoscale, and use dynamic executors. Create via workspace Spark settings (requires a capacity admin to enable workspace customization).

High Concurrency Session Sharing: enable High Concurrency to share Spark sessions across notebooks (and pipelines) to reduce session startup latency and cost; use session tags in pipelines to group notebooks.

Autotune for Spark: enable Autotune (spark.ms.autotune.enabled = true) to auto-adjust per query:
- spark.sql.shuffle.partitions
- spark.sql.autoBroadcastJoinThreshold
- spark.sql.files.maxPartitionBytes
Autotune is disabled by default and is in preview; enable it per environment or session.

B) Data-Level Best Practices

Microsoft Fabric offers several approaches to maintain optimal file sizes in Delta tables; review the documentation here: Table Compaction - Microsoft Fabric.

Intelligent Cache: enabled by default (Runtime 1.1/1.2) for Spark pools; caches frequently read files at the node level for Delta/Parquet/CSV, improving subsequent read performance and TCO.

OPTIMIZE & Z-Order: run OPTIMIZE regularly to rewrite files and improve file layout (see the maintenance sketch at the end of this section).

V-Order: V-Order (disabled by default in new workspaces) can accelerate reads for read-heavy workloads; enable via spark.sql.parquet.vorder.default = true.

Vacuum: run VACUUM to remove unreferenced files (stale data); the default retention is 7 days. Align retention across OneLake to control storage costs and maintain time travel.

Collaboration & Next Steps

Engage the data engineering team to define an optimization playbook. Start with reviewing capacity sizing guidance and cluster-level optimizations (runtime/NEE, pools, concurrency, Autotune), and then target data improvements (Z-order, compaction, caching, query refactors).
- Triage: Monitor Hub → Capacity Metrics → Spark UI to map workloads and identify high-impact jobs and workloads causing throttling.
- Schedule: operationalize maintenance: OPTIMIZE (full or selective) during off-peak windows; enable Auto Compaction for micro-batch/streaming writes; add VACUUM to your cadence with agreed retention. Add regular code review sessions to ensure consistent performance patterns.
- Fix: adjust pool sizing or concurrency; enable Autotune; tune shuffle partitions; refactor problematic queries; re-run compaction.
- Verify: re-run the job and confirm the change, i.e. reduced run time, lower shuffle, improved utilization.
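The maintenance sketch referenced above could look like the following; the table and column names are assumptions, and the retention window should match your agreed time-travel requirements.

```python
# Illustrative Delta maintenance routine for a Fabric lakehouse table.
table = "sales_lakehouse.orders"   # hypothetical table name

# Enable V-Order for writes in this session (benefits read-heavy workloads).
spark.conf.set("spark.sql.parquet.vorder.default", "true")

# Compact small files and co-locate rows on a commonly filtered column.
spark.sql(f"OPTIMIZE {table} ZORDER BY (customer_id)")   # hypothetical column

# Remove unreferenced files older than the retention window (7 days = 168 hours).
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
```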
From Bronze to Gold: Data Quality Strategies for ETL in Microsoft Fabric

Introduction

Data fuels analytics, machine learning, and AI, but only if it's trustworthy. Most organizations struggle with inconsistent schemas, nulls, data drift, or unexpected upstream changes that silently break dashboards, models, and business logic. Microsoft Fabric provides a unified analytics platform with OneLake, pipelines, notebooks, and governance capabilities. When combined with Great Expectations, an open-source data quality framework, Fabric becomes a powerful environment for enforcing data quality at scale. In this article, we explore how to implement enterprise-ready, parameterized data validation inside Fabric notebooks using Great Expectations, including row-count drift detection, schema checks, primary-key uniqueness, and time-series batch validation.

A quick reminder: ETL (Extract, Transform, Load) is the process of pulling raw data from source systems, applying business logic and quality validations, and delivering clean, curated datasets for analytics and AI. While ETL spans the full Medallion architecture, this guide focuses specifically on data quality checks in the Bronze layer using the NYC Taxi sample dataset.

🔗 Full implementation is available in my GitHub repository: sallydabbahmsft/Data-Quality-Checks-in-Microsoft-Fabric: Data Quality Checks in Microsoft Fabric

Why Data Quality Matters More Than Ever

AI and analytics initiatives fail not because of model quality but because the underlying data is inaccurate, incomplete, or inconsistent. Organizations adopting Microsoft Fabric often ask:
- How can we validate data as it lands in Bronze?
- How do we detect schema changes before they break downstream pipelines?
- How do we prevent silent failures, anomalies, and drift?
- How do we standardize data quality checks across multiple tables and pipelines?
Great Expectations provides a unified, testable, automation-friendly way to answer these questions.

Great Expectations in Fabric

Great Expectations (GX) is an open-source library for:
✔ Declarative data quality rules ("expectations")
✔ Automated validation during ETL
✔ Rich documentation and reporting
✔ Batch-based validation for time-series or large datasets
✔ Integration with Python, Spark, SQL, and cloud data platforms

Fabric notebooks now support Great Expectations natively (via PySpark), enabling engineering teams to:
- Build reusable DQ suites
- Parameterize expectations by pipeline
- Validate full datasets or daily partitions
- Integrate validation into Fabric pipelines and alerting

Data Quality Across the Medallion Architecture

This solution follows the Medallion architecture, moving data through the Bronze, Silver, and Gold layers while enforcing data quality checks at every stage.

📘 P.S. Fabric also supports this via built-in Medallion task flows: Task flows overview - Microsoft Fabric | Microsoft Learn

🥉 Bronze Layer: Ingestion & Validation
Ingest raw source data into Bronze without transformations. Run foundational DQ checks to ensure structural integrity. Bronze DQ answers: ➡ Did the data arrive correctly?

🥈 Silver Layer: Transformation & Validation
Clean, standardize, and enrich Bronze data. Validate business rules, schema consistency, reference values, and more. Silver DQ answers: ➡ Is the data accurate and logically correct?

🥇 Gold Layer: Enrichment & Consumption
Produce curated, analytics-ready datasets. Validate metrics, aggregates, and business KPIs. Gold DQ answers: ➡ Can executives trust the numbers?
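Before walking through the layer-by-layer checks, here is a minimal illustration of the Bronze ingestion step described above, landing raw files unchanged in a Bronze Delta table. The folder and table names are assumptions and may differ from the repository's setup.

```python
# Illustrative Bronze ingestion: land raw source data without transformations.
raw_df = (
    spark.read
    .option("header", "true")
    .csv("Files/landing/nyc_taxi_green/")   # hypothetical landing folder in the lakehouse
)

# Append to a Bronze Delta table as-is; all cleansing happens later, in Silver.
raw_df.write.mode("append").format("delta").saveAsTable("Bronze.nyc_taxi_green")
```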
Recommended Data Quality Validations

Bronze Layer (Raw Ingestion)
- Ingestion Volume & Row Drift – validate total row count and detect unexpected volume drops or spikes.
- Schema & Data Type Compliance – ensure the table structure and column data types match the expected schema.
- Null / Empty Column Checks – identify missing or empty values in required fields.
- Primary Key Uniqueness – detect duplicate records based on the defined composite or natural key.

Silver Layer (Cleaned & Standardized Data)
- Reference & Domain Value Validation – confirm that values match valid categories, lookups, or reference datasets.
- Business Rule Enforcement – validate logic constraints (e.g., StartDate <= EndDate, percentages within range).
- Anomaly / Outlier Detection – identify unusual patterns or values that deviate from historical behavior.
- Post-Standardization Deduplication – ensure standardized and enriched records no longer contain duplicates.

Gold Layer (Curated, Business-Ready Data)
- Metric & Aggregation Consistency – validate totals, ratios, rollups, and other aggregated metrics.
- KPI Threshold Monitoring – trigger alerts when KPIs exceed defined thresholds.
- Data / Feature Drift Detection (for ML) – monitor changes in distributions across time.
- Cross-System Consistency Checks – compare business metrics across internal systems to ensure alignment.

Implementing Data Quality with Great Expectations in Fabric

Step 1 - Read data from the Lakehouse (parameterized):

```python
lakehouse_name = "Bronze"
table_name = "NYC Taxi - Green"

query = f"SELECT * FROM {lakehouse_name}.`{table_name}`"
df = spark.sql(query)
```

Step 2 - Create and register a suite:

```python
import great_expectations as gx

context = gx.get_context()
suite = context.suites.add(
    gx.ExpectationSuite(name="nyc_bronze_suite")
)
```

Step 3 - Add Bronze layer expectations (reusable function):

```python
import great_expectations as gx


def add_bronze_expectations(
    suite: gx.ExpectationSuite,
    primary_key_columns: list[str],
    required_columns: list[str],
    expected_schema: list[str],
    expected_row_count: int | None = None,
    max_row_drift_pct: float = 0.2,
) -> gx.ExpectationSuite:

    # 1. Ingestion count & row drift
    if expected_row_count is not None:
        min_rows = int(expected_row_count * (1 - max_row_drift_pct))
        max_rows = int(expected_row_count * (1 + max_row_drift_pct))
        row_count_expectation = gx.expectations.ExpectTableRowCountToBeBetween(
            min_value=min_rows,
            max_value=max_rows,
        )
        suite.add_expectation(expectation=row_count_expectation)

    # 2. Schema compliance
    schema_expectation = gx.expectations.ExpectTableColumnsToMatchSet(
        column_set=expected_schema,
        exact_match=True,
    )
    suite.add_expectation(expectation=schema_expectation)

    # 3. Required columns: NOT NULL
    for col in required_columns:
        not_null_expectation = gx.expectations.ExpectColumnValuesToNotBeNull(
            column=col
        )
        suite.add_expectation(expectation=not_null_expectation)

    # 4. Primary key uniqueness (if provided)
    if primary_key_columns:
        unique_pk_expectation = gx.expectations.ExpectCompoundColumnsToBeUnique(
            column_list=primary_key_columns
        )
        suite.add_expectation(expectation=unique_pk_expectation)

    return suite
```

Step 4 - Attach a data asset & batch definition:

```python
data_source = context.data_sources.add_spark(name="bronze_datasource")
data_asset = data_source.add_dataframe_asset(name="nyc_bronze_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("full_bronze_batch")
```

Step 5 - Run validation:

```python
validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=suite,
    name="Bronze_DQ_Validation"
)

results = validation_definition.run(
    batch_parameters={"dataframe": df}
)
print(results)
```

Optional: Time-Series Batch Validation (Daily Slices)

Fabric does not yet support add_batch_definition_timeseries, so the notebook implements custom logic to validate each day independently:

```python
from pyspark.sql import functions as F

dates_df = df.select(F.to_date("lpepPickupDatetime").alias("dt")).distinct()
dates = [row.dt for row in dates_df.collect()]  # materialize the distinct dates to iterate over

for d in dates:
    df_day = df.filter(F.to_date("lpepPickupDatetime") == d)
    results = validation_definition.run(batch_parameters={"dataframe": df_day})
```

This enables:
- Daily anomaly detection
- Partition-level completeness checks
- Early schema drift detection

Automating DQ with Fabric Pipelines

Fabric pipelines can orchestrate your data quality workflow:
- Trigger the notebook after ingestion
- Pass parameters (table, layer, suite name)
- Persist DQ results to a Lakehouse or Log Analytics
- Configure alerts in Fabric Monitor

Production workflow:
1. Run the notebook
2. Check validation results
3. If failures exist: raise an incident, fail the pipeline, and notify the on-call engineer

This creates a closed loop of ingestion → validation → monitoring → alerting. A minimal sketch of the "persist results and fail the pipeline" step appears at the end of this article.

An example of a DQ pipeline:

Results:

How Enterprises Benefit

By standardizing data quality rules across all domains, organizations ensure consistent expectations and uniform validation practices, and improved observability makes data quality issues visible and actionable, enabling teams to detect and resolve failures early. This, in turn, enhances overall reliability, ensuring downstream transformations and Power BI reports operate on clean, trustworthy data. Ultimately, stronger data quality directly contributes to AI readiness: high-quality, well-validated data produces significantly better analytics and machine learning outcomes.

Conclusion

Great Expectations + Microsoft Fabric creates a scalable, modular, enterprise-ready approach for ensuring data quality across the entire medallion architecture. Whether you're validating raw ingested data, transformed datasets, or business-ready tables, the approach demonstrated here enables consistency, observability, and automation across all pipelines. With Fabric's unified compute, orchestration, and monitoring, teams can now integrate DQ as a first-class citizen, not an afterthought.

Links:
- Implement medallion lakehouse architecture in Fabric - Microsoft Fabric | Microsoft Learn
- GX Expectations Gallery • Great Expectations
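To close the loop described in the production workflow above, a minimal, illustrative wrap-up step might persist a summary of the run and fail the notebook so the pipeline can branch to alerting. The log table name is an assumption, and `results` is the validation result object returned by Great Expectations in Step 5.

```python
# Illustrative wrap-up: persist a run summary and fail the pipeline on DQ failures.
from datetime import datetime, timezone

summary = [(
    "Bronze",                                # layer
    "nyc_bronze_suite",                      # expectation suite
    bool(results.success),                   # overall pass/fail from the GX run
    datetime.now(timezone.utc).isoformat(),  # run timestamp
)]

spark.createDataFrame(
    summary, "layer string, suite string, success boolean, run_ts string"
).write.mode("append").format("delta").saveAsTable("Bronze.dq_run_log")  # hypothetical log table

if not results.success:
    # Raising an exception fails the notebook activity, letting the Fabric
    # pipeline route to an alert or incident step.
    raise Exception("Data quality validation failed for suite 'nyc_bronze_suite'")
```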
Unlock AI-Ready Insights: Empowering Nonprofits with Microsoft Fabric

Disconnected data slows impact—but it doesn't have to. Nonprofit data solutions in Microsoft Fabric unify your data, accelerate insights, and create an AI-ready foundation for smarter decisions. Imagine moving from raw donor data to actionable insights in weeks, not months. With Fabric, nonprofits gain:
- Unified fundraising data in one place
- Instant insights through ready-made dashboards
- AI-powered possibilities for predictive analytics and smarter strategies

This isn't just technology—it's a game-changer for fundraising, volunteer management, and program delivery. Built with nonprofit needs in mind, Fabric helps you reduce complexity and unlock mission outcomes faster. Ready to see how? Read the full article on Microsoft for Nonprofits LinkedIn: https://www.linkedin.com/posts/microsoft-for-nonprofits_disconnected-data-slows-impact-but-it-doesn-activity-7402016549773017088-8cH9

Defining the Raw Data Vault with Artificial Intelligence
This article is authored by Michael Olschimke, co-founder and CEO at Scalefree International GmbH. The technical review is done by Ian Clarke and Naveed Hussain – GBBs (Cloud Scale Analytics) for EMEA at Microsoft.

The Data Vault concept is used across the industry to build robust and agile data solutions. Traditionally, the definition (and subsequent modelling) of the Raw Data Vault, which captures the unmodified raw data, is done manually. This work demands significant human intervention and expertise. However, with the advent of artificial intelligence (AI), we are witnessing a paradigm shift in how we approach this foundational task. This article explores the transformative potential of leveraging AI to define the Raw Data Vault, demonstrating how intelligent automation can enhance efficiency, accuracy, and scalability, ultimately unlocking new levels of insight and agility for organizations.

Note that this article describes a solution for AI-generated Raw Data Vault models. However, the solution is not limited to Data Vault, but allows the definition of any data-driven, schema-on-read model to integrate independent data sets in an enterprise environment. We discuss this towards the end of this article.

Metadata-Driven Data Warehouse Automation

In the early days of Data Vault, all engineering was done manually: an engineer would analyse the data sources and their datasets, come up with a Raw Data Vault model in an E/R tool or Microsoft Visio, and then develop both the DDL code (CREATE TABLE) and the ELT/ETL code (INSERT INTO statements). However, Data Vault follows many patterns. Hubs look very similar (the difference lies in the business keys) and are loaded similarly. We discussed these patterns in previous articles of this series, for example, when covering the Data Vault model and implementation.

In most projects where Data Vault entities are created and loaded manually, a data engineer eventually develops the idea of creating a metadata-driven Data Vault generator because of these existing patterns. The effort to build a generator is considerable, and most projects are better off using an off-the-shelf solution such as Vaultspeed. These tools come with a metadata repository and a user interface for setting up the metadata and code templates required to generate the Raw Data Vault (and often subsequent layers). We have discussed Vaultspeed in previous articles of this series.

By applying the code templates to the metadata defined by the user, the actual code for the physical model is generated for a data platform, such as Microsoft Fabric. The code templates define the appearance of hubs, links, and satellites, as well as how they are loaded. The metadata defines which hubs, links, and satellites should exist to capture the incoming data set consistently.

Manual development often introduces mistakes and errors that result in deviations in code quality. By generating the data platform code, deviations from the defined templates are not possible (without manual intervention), thus raising the overall quality. But the major driver for most project teams is to increase productivity. Instead of manually developing code, they generate the code. Metadata-driven generation of the Raw Data Vault is standard practice in today's projects.

Today's project tasks have therefore changed: while engineers still need to analyse the source data sets and develop a Raw Data Vault model, they no longer create the code (DDL/ELT).
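As a simple illustration of the pattern described above (and explicitly not how Vaultspeed or any specific tool implements it), the sketch below expands a hub code template with user-defined metadata; the template, schema, and business keys are all assumptions.

```python
# Illustrative metadata-driven generation of hub DDL (not any specific tool's template).
hub_template = """CREATE TABLE {schema}.hub_{name} (
    hub_{name}_hk  CHAR(32)      NOT NULL,  -- hash key over the business key
    {business_key} VARCHAR(100)  NOT NULL,  -- business key from the source system
    load_date      TIMESTAMP     NOT NULL,
    record_source  VARCHAR(100)  NOT NULL
);"""

hub_metadata = [
    {"schema": "raw_vault", "name": "customer", "business_key": "customer_number"},
    {"schema": "raw_vault", "name": "product",  "business_key": "product_code"},
]

# Applying the same template to every metadata entry yields uniform, repeatable code.
for hub in hub_metadata:
    print(hub_template.format(**hub))
```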
Instead, they set up the metadata that represents the Raw Data Vault model in the tool of their choice. Each data warehouse automation tool comes with its specific features, limitations, and metadata formats. The data engineer/modeler must understand how to transfer the Raw Data Vault model into the data warehouse automation tool by correctly setting up the metadata. This is also true for Vaultspeed; the data modeler can set up the metadata either through the user interface or via the SDK. This is the most labour-intensive task concerning the Raw Data Vault layer. It also requires experts who not only know Data Vault modelling but also know (or can analyse) the source systems' data and understand the selected data warehouse automation solution. Additionally, Data Vault is not equal to Data Vault in many cases, as it allows for a very flexible interpretation of how to model a Data Vault, which also leads to quality issues.

But what if the organization has no access to such experts? What if budgets are limited, time is of the essence, or there are no available experts in sufficient numbers in the field? As Data Vault experts, we can debate the value of Data Vault as much as we want, but if there are no experts capable of modeling it, the debate will remain inconclusive.

And what if this problem is only getting worse? In the past, a few dozen source tables might have been sufficient to be processed by the data platform. Today, several hundred source tables could be considered a medium-sized data platform. Tomorrow, there will be thousands of source tables. The reason? There is not only an exponential growth in the volume of data to be produced and processed, but it also comes with an exponential growth in the complexity of data shape. The source of this exponential growth in data shape comes from more complex source databases, APIs that produce and deliver semi-structured JSON data, and, ultimately, more complex business processes and an increasing amount of generated and available data that needs to be analysed for meaningful business results.

Generating the Data Vault using Artificial Intelligence

Increasingly, this data is generated using artificial intelligence (AI) and still requires integration, transformation, and analysis. The issue is that the number of data engineers, data modelers, and data scientists is not growing exponentially. Universities around the world only produce a limited number of these roles, and some of us would like to retire one day. Based on our experience, the increase in these roles is linear at best. Even if you argue for exponential growth in these roles, the growing gap between the increasing data volume and the people who should analyse it is beyond debate. This gap cannot be closed by humans in the future, even in a world where all kids want to work in a data role. Sorry for all the pilots, police officers, nurses, doctors, etc., but there is no way for you to retire without the whole economy imploding. Therefore, the only way to close the gap is through the use of artificial intelligence. It is not about reducing the data roles. It's about making them efficient so that they can deal with the growing data shape (and not just the volume).

For a long time, it was common sense in the industry that, if an artificial intelligence could generate or define the Raw Data Vault, it would be an assisting technology.
The AI would make recommendations, for example which hubs or links to model and which business keys to use. The human data modeler would make the final decision, with input from the AI. But what if the AI made the final decision? What would that look like? What if one could attach data sources to the AI platform, and the AI would analyze the source datasets, come up with a Raw Data Vault model, and load that model into Vaultspeed or another data warehouse automation tool; in other words, if it knew the source system's data, knew Data Vault modelling, and understood the selected data warehouse automation tool?

These questions were posed by Michael Olschimke, a Data Vault and AI expert, when initially considering the challenge. He researched the distribution of neural networks on massively parallel processing (MPP) clusters to classify unstructured data at Santa Clara University in Silicon Valley. This prior AI research, combined with the knowledge he accumulated in Data Vault, enabled him to build a solution that later became known as Flow.BI.

Flow.BI as a Generative AI to Define the Raw Data Vault

The solution is simple, at least from the outside: attach a few data sources and let the AI do the rest. Flow.BI supports several data sources already, including Microsoft SQL Server and derivatives such as Synapse and Fabric; as long as a JDBC driver is available, Flow.BI should eventually be able to analyze the data source. And the AI doesn't care if the data originates from a CRM system, such as Microsoft Dynamics, or an e-commerce platform; it's just data. There are no provisions in the code to deal with specific datasets, at least for now.

The goal of Flow.BI is to produce a valid, that is, consistent and integrated, enterprise data model. Typically, this follows a Data Vault design, but it's not limited to that (we'll discuss this later in the article). This is achieved by following a strict data-driven approach that imitates the human data modeler. Flow.BI needs data to make decisions, just like its human counterpart. Source entities with no data will be ignored. It only requires some metadata, such as the available entities and their columns. Data types are nice-to-have; primary keys and foreign keys would improve the target model, just like entity and column descriptions. But they are not required to define a valid Raw Data Vault model.

Because humans like to influence the result of the modelling exercise, Flow.BI accommodates this by offering many options for the human data modeler to influence the engine. Some of them will be discussed in this article, but there are many more already available and more to come. Flow.BI's user interface is kept as lean and straightforward as possible: the solution is designed so that the AI should take the lead and model the whole Raw Data Vault. The UI's purpose is to interact with human data modelers, allowing them to influence the results. That's what many screens are related to, along with the configuration of the security system.

A client can have multiple instances, which result in independent Data Vault models. This is particularly useful when dealing with independent data platforms, such as those used by HR, the compliance department, or specific business use cases, or when creating the raw data foundation for data products within a data mesh. In this case, a Flow.BI instance equals a data product.
As the humans writing this text, we naturally like to influence the result of the modelling exercise. Flow.BI accommodates this by offering many options for the human data modeler to influence the engine. Some of them are discussed in this article, but there are many more already available and more to come.

Flow.BI's user interface is kept as lean and straightforward as possible: the solution is designed so that the AI takes the lead and models the whole Raw Data Vault. The UI's purpose is to interact with human data modelers and allow them to influence the results; most screens serve exactly that purpose, along with the configuration of the security system.

A client can have multiple instances, each resulting in an independent Data Vault model. This is particularly useful when dealing with independent data platforms, such as those used by HR, the compliance department, or specific business use cases, or when creating the raw data foundation for data products within a data mesh. In that case, a Flow.BI instance equals a data product.

But don't underestimate the complexity behind Flow.BI: the frontend manages a large number of compute clusters that implement scalable agents working on the definition of the Raw Data Vault. The platform implements full separation of data and processing, not only by client but also by instance.

Mapping Raw Data to Organizational Ontology

The very first step in the process is to identify the concepts in the attached datasets. For this purpose, a concept classifier analyses the data and recognizes datasets, and the concepts they represent, based on what it has seen in the past. A common client requirement is to bring the organization's own ontology into this process. While Flow.BI doesn't know a client's ontology, it is possible to override (and in some cases complete) the concept classifications and refer to concepts from the organizational ontology. By doing so, Flow.BI integrates the source systems' raw data into the organization's ontology. It does not create a logical Data Vault, where the model reflects the desired business, but instead models the raw data as the business actually uses it, following the data-driven Data Vault modelling principles that Michael Olschimke has taught to thousands of students over the years at Scalefree.

Flow.BI also allows the definition of a multi-tenant Data Vault model, where source systems either provide multi-tenant data or are assigned to a specific tenant. In both cases, the integrated enterprise data model is extended to allow queries across multiple tenants or within a single tenant, depending on the information consumer's needs.

Ensuring Security and Privacy

Flow.BI was designed with security and privacy in mind. From a design perspective, this has two aspects:

Security and privacy in the service itself, to protect client solutions and related assets

Security and privacy in the defined model, so that Data Vault's own capabilities for addressing security and privacy requirements, such as satellite splits, can be used effectively

While Flow.BI uses a shared architecture, all data and metadata storage and processing are separated by client and instance. However, this is often not sufficient for clients who hesitate to share their highly sensitive data with a third party. For this reason, Flow.BI offers two critical features:

Local data storage: instead of storing client data on Flow.BI infrastructure, the client provides an Azure Data Lake Storage account to be used for storing the data.

Local data processing: a Docker container can be deployed into the client's infrastructure to access the client's data sources, extract the data, and process it.

When both options are used, only metadata, such as entity and column names, constraints, and descriptions, is shared with Flow.BI; no data is transferred from the client's infrastructure to Flow.BI. The metadata is secured on Flow.BI's premises as if it were actual data: row-level security separates the metadata by instance, and roles and permissions define per client who can access the metadata and what they can do with it.

But security and privacy are not limited to the service itself. The defined model also utilizes the security and privacy features of Data Vault. For example, it enables the classification of source columns based on security and privacy: the user can set up security and privacy classes and apply both kinds of classifications on the influence screen. By doing so, the column classifications are used when defining the Raw Data Vault and can later be used to implement a satellite split in the physical model (if necessary).
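To illustrate what such a classification-driven satellite split can look like further downstream, the sketch below partitions a source entity's attributes into one satellite per privacy class. It is a simplified, hypothetical example and does not represent Flow.BI's internal logic; the entity and column names are made up.

```python
# Hypothetical sketch: partition a source entity's attributes into one satellite
# per privacy class, so sensitive columns can be secured (or deleted) independently.
from collections import defaultdict

def split_satellites(entity: str, column_classes: dict[str, str]) -> dict[str, list[str]]:
    """column_classes maps column name -> privacy class, e.g. 'public' or 'pii'."""
    satellites: dict[str, list[str]] = defaultdict(list)
    for column, privacy_class in column_classes.items():
        satellites[f"sat_{entity}_{privacy_class}"].append(column)
    return dict(satellites)

# Example classification as it might be captured on an influence screen.
customer_columns = {
    "first_name": "pii",
    "last_name": "pii",
    "email": "pii",
    "customer_segment": "public",
    "created_at": "public",
}
print(split_satellites("customer", customer_columns))
# {'sat_customer_pii': ['first_name', 'last_name', 'email'],
#  'sat_customer_public': ['customer_segment', 'created_at']}
```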
An upcoming release will include an AI model for classifying columns based on privacy, using both data and metadata to automate this task.

Tackling Multilingual Challenges

A common challenge for clients is navigating multilingual data environments. Many data sources use English entity and column names, but some systems carry metadata in a different language. And the assumption that the data platform should use English metadata is not always correct either: especially for government clients, the use of the official language is often mandatory.

Both options, translating the source metadata to English (the default within Flow.BI) and translating the defined target model into any target language, are supported by the translations tab on Flow.BI's influence screen. The tab uses an AI translator to automatically translate incoming table names, column names, and concept names, and the user can step in and override a translation to improve it. All strings of the source metadata and the defined model are passed through the translation module, and existing translations can be reused for a growing list of popular data sources. This feature enables readable names for satellites and their attributes (as well as hubs and links), resulting in a significantly better user experience for the defined Raw Data Vault.

Generating the Physical Model

You will have noticed by now that we consistently speak of the defined Raw Data Vault model. Flow.BI does not generate the physical model, that is, the CREATE TABLE and INSERT INTO statements for the Raw Data Vault. Instead, it "just" defines the hubs, links, and satellites required to capture all incoming data from the attached data sources, including business key selection, satellite splits, and special entity types such as non-historized links and their satellites, multi-active satellites, hierarchical links, effectivity satellites, and reference tables.

Video on Generating Physical Models

This logical model (not to be confused with "logical Data Vault modelling") is then provided to our growing number of ISV partner solutions, which consume the defined model, set up the required metadata in their tool, and generate the physical model. As a result, Flow.BI acts as a team member that analyses your organizational data sources and their data, knows how to model the Raw Data Vault, and knows how to set up the metadata in the tool of your choice. The metadata provided by Flow.BI can be used to model the landing zone/staging area (either on a data lake or a relational database such as Microsoft Fabric) and the Raw Data Vault in a data-driven Data Vault architecture, which is the recommended practice.

With this in mind, Flow.BI is not a competitor to Vaultspeed or your other existing data warehouse automation solution, but a valuable extension that integrates with your existing tool stack. This makes it much easier to justify introducing Flow.BI to the project.
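As a rough illustration of that hand-over, the sketch below shows, in highly simplified form, what a downstream automation tool does with such a logical definition: it renders physical DDL from it. The template, column names, and datatypes are hypothetical and do not represent the actual output of Vaultspeed or any other partner tool.

```python
# Hypothetical sketch of the downstream step performed by a data warehouse
# automation tool: render physical DDL for a hub from its logical definition.
def hub_ddl(schema: str, hub_name: str, business_keys: list[str]) -> str:
    key_columns = ",\n    ".join(f"{bk} VARCHAR(255) NOT NULL" for bk in business_keys)
    return (
        f"CREATE TABLE {schema}.{hub_name} (\n"
        f"    {hub_name}_hashkey CHAR(32) NOT NULL,\n"   # hash of the business key(s)
        f"    load_datetime DATETIME2(6) NOT NULL,\n"    # technical load timestamp
        f"    record_source VARCHAR(255) NOT NULL,\n"    # originating source system
        f"    {key_columns}\n"
        f");"
    )

print(hub_ddl("raw_vault", "hub_customer", ["customer_number"]))
```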
Going Beyond Data Vault

Flow.BI is not limited to the definition of Data Vault models. While it has been designed with Data Vault concepts in mind, a customizable expert system is used to define the Data Vault model. Although the expert system is not yet publicly available, it has already been implemented and is used for every model generation.

This expert system enables the implementation of alternative data models, provided they adhere to data-driven, schema-on-read principles. Data Vault is one such example, but many others are possible as well:

Customized Data Vault models

Inmon-style enterprise models in third normal form (3NF), if no business logic is required

Kimball-style analytical models with facts and dimensions, again without business logic

Semi-structured JSON and XML document collections

Key-value stores

“One Big Table (OBT)” models

“Many Big Related Table (MBRT)” models

Okay, we’ve just invented the MBRT model while writing this article, but you get the idea: many large, fully denormalized tables with foreign-key relationships between each other. If you've developed your own data-driven model, please get in touch with us.

About the Authors

Michael Olschimke is co-founder and CEO of Flow.BI, a generative AI that defines integrated enterprise data models, such as (but not limited to) Data Vault. Michael has trained thousands of industry data warehousing professionals, taught academic classes, and published regularly on topics around data platforms, data engineering, and Data Vault. He has over two decades of experience in information technology, with a specialization in business intelligence, artificial intelligence, and data platforms.

<<< Back to Blog Series Title Page

The Future of AI: How Lovable.dev and Azure OpenAI Accelerate Apps that Change Lives
Discover how Charles Elwood, a Microsoft AI MVP and TEDx Speaker, leverages Lovable.dev and Azure OpenAI to create impactful AI solutions. From automating expense reports to restoring voices, translating gestures to speech, and visualizing public health data, Charles's innovations are transforming lives and democratizing technology. Follow his journey to learn more about AI for good.