Defining the Raw Data Vault with Artificial Intelligence
This Article is Authored By Michael Olschimke, co-founder and CEO at Scalefree International GmbH. The Technical Review is done by Ian Clarke and Naveed Hussain – GBBs (Cloud Scale Analytics) for EMEA at Microsoft.
The Data Vault concept is used across the industry to build robust and agile data solutions. Traditionally, the definition (and subsequent modelling) of the Raw Data Vault, which captures the unmodified raw data, is done manually. This work demands significant human intervention and expertise. However, with the advent of artificial intelligence (AI), we are witnessing a paradigm shift in how we approach this foundational task. This article explores the transformative potential of leveraging AI to define the Raw Data Vault, demonstrating how intelligent automation can enhance efficiency, accuracy, and scalability, ultimately unlocking new levels of insight and agility for organizations. Note that this article describes a solution for AI-generated Raw Data Vault models. However, the solution is not limited to Data Vault; it allows the definition of any data-driven, schema-on-read model to integrate independent data sets in an enterprise environment. We discuss this towards the end of this article.
Metadata-Driven Data Warehouse Automation
In the early days of Data Vault, all engineering was done manually: an engineer would analyse the data sources and their datasets, come up with a Raw Data Vault model in an E/R tool or Microsoft Visio, and then develop both the DDL code (CREATE TABLE) and the ELT/ETL code (INSERT INTO statements). However, Data Vault follows many patterns. Hubs look very similar (the difference lies in the business keys) and are loaded similarly. We discussed these patterns in previous articles of this series, for example, when covering the Data Vault model and implementation. In most projects where Data Vault entities are created and loaded manually, a data engineer eventually comes up with the idea of building a metadata-driven Data Vault generator because of these recurring patterns. However, the effort to build such a generator in-house is considerable, and most projects are better off using an off-the-shelf solution such as VaultSpeed. These tools come with a metadata repository and a user interface for setting up the metadata and code templates required to generate the Raw Data Vault (and often subsequent layers). We have discussed VaultSpeed in previous articles of this series. By applying the code templates to the metadata defined by the user, the actual code for the physical model is generated for a data platform, such as Microsoft Fabric. The code templates define the appearance of hubs, links, and satellites, as well as how they are loaded. The metadata defines which hubs, links, and satellites should exist to capture the incoming data set consistently. Manual development often introduces mistakes and errors that result in deviations in code quality. By generating the data platform code, deviations from the defined templates are not possible (without manual intervention), thus raising the overall quality. But the major driver for most project teams is to increase productivity: instead of manually developing code, they generate it. Metadata-driven generation of the Raw Data Vault is standard practice in today's projects. Today's project tasks have therefore changed: while engineers still need to analyse the source data sets and develop a Raw Data Vault model, they no longer create the code (DDL/ELT).
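To make the template-plus-metadata idea concrete, here is a minimal sketch of how a hub's DDL could be generated from user-defined metadata. The template, naming conventions, and example hub are invented for illustration and are deliberately simpler than what commercial automation tools generate.

```python
# Minimal sketch: generating hub DDL from metadata (illustrative only).
# The metadata structure and naming conventions below are assumptions,
# not the format used by any specific automation tool.

HUB_TEMPLATE = """CREATE TABLE {schema}.HUB_{name} (
    HK_{name} VARBINARY(32) NOT NULL,      -- hash key over the business key(s)
    {business_key_columns},
    LOAD_DATE DATETIME2 NOT NULL,          -- when the key was first seen
    RECORD_SOURCE VARCHAR(100) NOT NULL,   -- originating source system
    CONSTRAINT PK_HUB_{name} PRIMARY KEY NONCLUSTERED (HK_{name}) NOT ENFORCED
);"""

def generate_hub_ddl(schema: str, hub_metadata: dict) -> str:
    """Render the hub template with the metadata supplied by the modeler."""
    bk_columns = ",\n    ".join(
        f"{col} {dtype} NOT NULL" for col, dtype in hub_metadata["business_keys"].items()
    )
    return HUB_TEMPLATE.format(
        schema=schema,
        name=hub_metadata["name"].upper(),
        business_key_columns=bk_columns,
    )

# Example metadata for a hypothetical customer hub.
customer_hub = {"name": "customer", "business_keys": {"CUSTOMER_NUMBER": "VARCHAR(20)"}}
print(generate_hub_ddl("raw_vault", customer_hub))
```

The same template-driven approach applies to links and satellites; only the templates and the metadata entries differ.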
Instead, they set up the metadata that represents the Raw Data Vault model in the tool of their choice. Each data warehouse automation tool comes with its specific features, limitations, and metadata formats. The data engineer/modeler must understand how to transfer the Raw Data Vault model into the data warehouse automation tool by correctly setting up the metadata. This is also true for VaultSpeed; the data modeler can set up the metadata either through the user interface or via the SDK. This is the most labour-intensive task concerning the Raw Data Vault layer. It also requires experts who not only know Data Vault modelling but also know (or can analyse) the source systems' data and understand the selected data warehouse automation solution. Additionally, one Data Vault is often not equal to another, as the approach allows for a very flexible interpretation of how to model a Data Vault, which also leads to quality issues. But what if the organization has no access to such experts? What if budgets are limited, time is of the essence, or there are not enough available experts in the field? As Data Vault experts, we can debate the value of Data Vault as much as we want, but if there are no experts capable of modeling it, the debate will remain inconclusive. And what if this problem is only getting worse? In the past, a few dozen source tables might have been sufficient to be processed by the data platform. Today, several hundred source tables could be considered a medium-sized data platform. Tomorrow, there will be thousands of source tables. The reason? Not only is the volume of data being produced and processed growing exponentially, but this growth is accompanied by an exponential growth in the complexity of the data's shape. This growing complexity of data shape comes from more complex source databases, APIs that produce and deliver semi-structured JSON data, and, ultimately, more complex business processes and an increasing amount of generated and available data that needs to be analysed for meaningful business results.
Generating the Data Vault using Artificial Intelligence
Increasingly, this data is generated using artificial intelligence (AI) and still requires integration, transformation, and analysis. The issue is that the number of data engineers, data modelers, and data scientists is not growing exponentially. Universities around the world only produce a limited number of these roles, and some of us would like to retire one day. Based on our experience, the increase in these roles is linear at best. Even if you argue for exponential growth in these roles, there is no debating the growing gap between the increasing data volume and the people who should analyse it. This gap cannot be closed by humans in the future, even in a world where every kid wants to end up working in a data role. Sorry to all the pilots, police officers, nurses, doctors, and so on: there would be no way for you to retire without the whole economy imploding. Therefore, the only way to close the gap is through the use of artificial intelligence. It is not about reducing the data roles; it is about making them more efficient so that they can deal with the growing data shape (and not just the volume). For a long time, it was common sense in the industry that, if an artificial intelligence could generate or define the Raw Data Vault, it would be an assisting technology.
The AI would make recommendations, for example, which hubs or links to model and which business keys to use. The human data modeler would make the final decision, with input from the AI. But what if the AI made the final decision? What would that look like? What if one could attach data sources to the AI platform, and the AI would analyze the source datasets, come up with a Raw Data Vault model, and load that model into VaultSpeed or another data warehouse automation tool? An AI, in other words, that knows the source systems' data, knows Data Vault modelling, and understands the selected data warehouse automation tool. These questions were posed by Michael Olschimke, a Data Vault and AI expert, when initially considering the challenge. At Santa Clara University in Silicon Valley, he researched the distribution of neural networks on massively parallel processing (MPP) clusters to classify unstructured data. This prior AI research, combined with the knowledge he accumulated in Data Vault, enabled him to build a solution that later became known as Flow.BI.
Flow.BI as a Generative AI to Define the Raw Data Vault
The solution is simple, at least from the outside: attach a few data sources, let the AI do the rest. Flow.BI already supports several data sources, including Microsoft SQL Server and derivatives such as Synapse and Fabric; as long as a JDBC driver is available, Flow.BI should eventually be able to analyze the data source. And the AI doesn't care if the data originates from a CRM system, such as Microsoft Dynamics, or an e-commerce platform; it's just data. There are no provisions in the code to deal with specific datasets, at least for now. The goal of Flow.BI is to produce a valid, that is, consistent and integrated, enterprise data model. Typically, this follows a Data Vault design, but it's not limited to that (we'll discuss this later in the article). This is achieved by following a strict data-driven approach that imitates the human data modeler. Flow.BI needs data to make decisions, just like its human counterpart. Source entities with no data will be ignored. It only requires some metadata, such as the available entities and their columns. Datatypes are nice-to-have; primary keys and foreign keys would improve the target model, just like entity and column descriptions. But they are not required to define a valid Raw Data Vault model. Humans write this text, and as such, we like to influence the result of the modelling exercise. Flow.BI accommodates this by offering many options for the human data modeler to influence the engine. Some of them will be discussed in this article, but there are many more already available and more to come. Flow.BI's user interface is kept as lean and straightforward as possible: the solution is designed so that the AI should take the lead and model the whole Raw Data Vault. The UI's purpose is to interact with human data modelers, allowing them to influence the results. That is what most screens relate to, along with the configuration of the security system. A client can have multiple instances, which result in independent Data Vault models. This is particularly useful when dealing with independent data platforms, such as those used by HR, the compliance department, or specific business use cases, or when creating the raw data foundation for data products within a data mesh. In this case, a Flow.BI instance equals a data product.
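To illustrate how little input such a data-driven engine needs, the sketch below shows the kind of minimal source metadata described above: entity and column names are mandatory, while datatypes, keys, and descriptions are optional enrichments. The structure and field names are purely hypothetical for illustration; they are not Flow.BI's actual ingestion format.

```python
# Hypothetical example of the minimal source metadata a data-driven modeling engine consumes.
# Only entity and column names are required; everything else is optional enrichment.
source_metadata = {
    "source_system": "crm_prod",  # assumed source identifier
    "entities": [
        {
            "name": "customer",
            "columns": ["customer_number", "first_name", "last_name", "email"],
            # Optional hints that improve, but are not required for, the defined model:
            "datatypes": {"customer_number": "varchar(20)"},
            "primary_key": ["customer_number"],
            "description": "Customer master data maintained by the CRM team",
        },
        {
            "name": "sales_order",
            "columns": ["order_number", "customer_number", "order_date", "amount"],
            # No keys or datatypes supplied; structure must be inferred from the data itself.
        },
    ],
}
```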
But don't underestimate the complexity of Flow.BI: the frontend is used to manage a large number of compute clusters that implement scalable agents to work on defining the Raw Data Vault. The platform implements full separation of data and processing, not only by client but also by instance.
Mapping Raw Data to Organizational Ontology
The very first step in the process is to identify the concepts in the attached datasets. For this purpose, there is a concept classifier that analyses the data and recognizes datasets and their classified concepts that it has seen in the past. A common requirement of clients is that they would like to leverage their organizational ontology in this process. While Flow.BI doesn't know a client's ontology, it is possible to override (and in some cases, complete) the concept classifications and refer to concepts from the organizational ontology. By doing so, Flow.BI will integrate the source system's raw data into the organization's ontology. It will not create a logical Data Vault, where the Data Vault model reflects the desired business, but instead model the raw data as the business uses it, and therefore follow the data-driven Data Vault modeling principles that Michael Olschimke has taught to thousands of students over the years at Scalefree. Flow.BI also allows the definition of a multi-tenant Data Vault model, where source systems either provide multi-tenant data or are assigned to a specific tenant. In both cases, the integrated enterprise data model will be extended to allow queries across multiple tenants or within a single tenant, depending on the information consumer's needs.
Ensuring Security and Privacy
Flow.BI was designed with security and privacy in mind. From a design perspective, this has two aspects:
Security and privacy in the service itself, to protect client solutions and related assets.
Security and privacy as an integral part of the defined model, allowing for the effective utilization of Data Vault's capabilities in addressing security and privacy requirements, such as satellite splits.
While Flow.BI uses a shared architecture, all data and metadata storage and processing are separated by client and instance. However, this is often not sufficient for clients, as they hesitate to share their highly sensitive data with a third party. For this reason, Flow.BI offers two critical features:
Local data storage: instead of storing client data on Flow.BI infrastructure, the client provides an Azure Data Lake Storage account to be used for storing the data.
Local data processing: a Docker container can be deployed into the client's infrastructure to access the client's data sources, extract the data, and process it.
When using both options, only metadata, such as entity and column names, constraints, and descriptions, is shared with Flow.BI. No data is transferred from the client's infrastructure to Flow.BI. The metadata is secured on Flow.BI's premises as if it were actual data: row-level security separates the metadata by instance, and roles and permissions defined per client determine who can access the metadata and what they can do with it. But security and privacy are not limited to the service itself. The defined model also utilizes the security and privacy features of Data Vault. For example, it enables the classification of source columns based on security and privacy. The user can set up both security and privacy classes and apply them on the influence screen.
By doing so, the column classifications are used when defining the Raw Data Vault and can later be used to implement a satellite split in the physical model (if necessary). An upcoming release will include an AI model for classifying columns based on privacy, utilizing data and metadata to automate this task.
Tackling Multilingual Challenges
A common challenge for clients is navigating multilingual data environments. Many data sources use English entity and column names, but there are systems using metadata in a different language. Also, the assumption that the data platform should use English metadata is not always correct. Especially for government clients, the use of the official language is often mandatory. Both options, translating the source metadata to English (the default within Flow.BI) and translating the defined target model into any target language, are supported by Flow.BI's translations tab on the influence screen. The tab utilizes an AI translator to fully automatically translate the incoming table names, column names, and concept names. However, the user can step in and override the translation to adapt it to their needs. All strings of the source metadata and the defined model are passed through the translation module. It is also possible to reuse existing translations for a growing list of popular data sources. This feature enables readable names for satellites and their attributes (as well as hubs and links), resulting in a significantly improved user experience for the defined Raw Data Vault.
Generating the Physical Model
You should have noticed by now that we consistently discuss the defined Raw Data Vault model. Flow.BI does not generate the physical model, that is, the CREATE TABLE and INSERT INTO statements for the Raw Data Vault. Instead, it "just" defines the hubs, links, and satellites required for capturing all incoming data from the attached data sources, including business key selection, satellite splits, and special entity types, such as non-historized links and their satellites, multi-active satellites, hierarchical links, effectivity satellites, and reference tables.
[Video: Generating Physical Models]
This logical model (not to be confused with "logical Data Vault modelling") is then provided to our growing number of ISV partner solutions, which consume our defined model, set up the required metadata in their tool, and generate the physical model. As a result, Flow.BI acts as a team member that analyses your organizational data sources and their data, knows how to model the Raw Data Vault, and knows how to set up the metadata in the tool of your choice. The metadata provided by Flow.BI can be used to model the landing zone/staging area (either on a data lake or a relational database such as Microsoft Fabric) and the Raw Data Vault in a data-driven Data Vault architecture, which is the recommended practice. With this in mind, Flow.BI is not a competitor to VaultSpeed or your other existing data warehouse automation solution, but a valid extension that integrates with your existing tool stack. This makes it much easier to justify the introduction of Flow.BI to the project.
Going Beyond Data Vault
Flow.BI is not limited to the definition of Data Vault models. While it has been designed with the Data Vault concepts in mind, a customizable expert system is used to define the Data Vault model. Although the expert system is not yet publicly available, it has already been implemented and is in use for every model generation.
This expert system enables the implementation of alternative data models, provided they adhere to data-driven, schema-on-read principles. Data Vault is one such example, but many others are possible as well:
Customized Data Vault models
Inmon-style enterprise models in third normal form (3NF), if no business logic is required
Kimball-style analytical models with facts and dimensions, again without business logic
Semi-structured JSON and XML document collections
Key-value stores
"One Big Table (OBT)" models
"Many Big Related Tables (MBRT)" models
Okay, we've just invented the MBRT model as we're writing this article, but you get the idea: many large, fully denormalized tables with foreign-key relationships between each other. If you've developed your own data-driven model, please get in touch with us.
About the Authors
Michael Olschimke is co-founder and CEO of Flow.BI, a generative AI that defines integrated enterprise data models, such as (but not limited to) Data Vault. Michael has trained thousands of industry data warehousing professionals, taught academic classes, and published regularly on topics around data platforms, data engineering, and Data Vault. He has over two decades of experience in information technology, with a specialization in business intelligence, artificial intelligence, and data platforms.
Explore Microsoft Fabric Data Agent & Azure AI Foundry for agentic solutions
Contributors for this blogpost: Jeet J & Ritaja S
Context & Objective
Over the past year, Gen AI apps have expanded significantly across enterprises. The agentic AI era is here, and the Microsoft ecosystem helps enable end-to-end acceleration of agentic AI apps in production. In this blog, we'll cover how both low-code business analysts and pro-code developers can use the Microsoft stack to build reusable agentic apps for their organizations. Professionals in the Microsoft ecosystem are starting to build advanced agentic generative AI solutions using Microsoft AI Services and Azure AI Foundry, which supports both open source and industry models. Combined with the advancements in Microsoft Fabric, these tools enable robust, industry-specific applications. This blog post explains how to develop multi-agent solutions for various industries using Azure AI Foundry, Copilot Studio, and Fabric. Disclaimer: This blogpost is for educational purposes only and walks through the process of using the relevant services without a ton of custom code; teams must follow engineering best practices—including development, automated deployment, testing, security, and responsible AI—before any production deployment.
What to expect
In-Focus: Our goal is to help the reader explore specific industry use cases and understand the concept of building multi-agent solutions. In our case, we will focus on the insurance and financial services use case, use Fabric Notebooks to create sample (fake) datasets, utilize a simple click-through workflow to build and configure three agents (both on Fabric and Azure AI Foundry), tie them together, and offer the solution via Teams or Microsoft Copilot using the new M365 Agents Toolkit.
Out-of-Focus: This blog post will not cover the fundamentals of Azure AI Services, Azure AI Foundry, or the various components of Microsoft Fabric. It also won't cover the different ways (low-code or pro-code) to build agents, orchestration frameworks (Semantic Kernel, LangChain, AutoGen, etc.) for orchestrating the agents, or hosting options (Azure App Service – Web App, Azure Kubernetes Service, Azure Container Apps, Azure Functions). Please review the pointers listed towards the end to gain a holistic understanding of building and deploying mission-critical generative AI solutions.
[Figure: Logical architecture of the multi-agent solution utilizing Microsoft Fabric Data Agent and Azure AI Foundry Agent]
Fabric & Azure AI Foundry – Pro-code Agentic path
Prerequisites
a. Access to an Azure tenant & subscription.
b. Work with your Azure tenant administrator to have the appropriate Azure roles and capacity to provision Azure resources (services) and deploy AI models in certain regions.
c. A paid F2 or higher Fabric capacity resource. Note that the Fabric compute capacity can be paused and resumed; pause it in case you wish to save costs after your learning.
d. Access to the Fabric Admin Portal (Power BI) to enable these settings:
> Fabric data agent tenant settings is enabled.
> Copilot tenant switch is enabled.
> Optional: Cross-geo processing for AI is enabled (depends on your region and data sovereignty requirements).
> Optional: Cross-geo storing for AI is enabled (depends on your region).
e. At least one of these: Fabric Data Warehouse, Fabric Lakehouse, one or more Power BI semantic models, or a KQL database with data. This blog post will cover how to create sample datasets using Fabric Notebooks.
f. Power BI semantic models via XMLA endpoints tenant switch is enabled for Power BI semantic model data sources.
Walkthrough/Set-up
1. One-time setup for the Fabric Workspace and all agents
a) Visit https://app.powerbi.com. Click the "New workspace" button to create a new workspace, give it a name, and ensure that it is tied/associated to the Fabric capacity you have provisioned.
b) Click Workspace settings of the newly created workspace.
c) Review the information in License info. If the workspace isn't associated with your newly created Fabric capacity, please do the proper association (link the Fabric capacity to the workspace) and wait for 5-10 mins.
2. Create an Insurance Agent
a) Create a new Lakehouse in your Fabric Workspace. Change the name to InsuranceLakehouse.
b) Create a new Fabric Notebook, assign a name, and associate the Insurance Lakehouse with it.
c) Add the following PySpark (Python) code snippets to the Notebook (a hedged sketch of what these cells could contain follows after this walkthrough): i) Faker library for the Fabric Notebook, ii) Insurance sample dataset in the Fabric Notebook, iii) Run both cells to generate the sample insurance dataset.
d) Create a new Fabric data agent, give it a name, and add the data source (InsuranceLakehouse).
i) Ensure that the Insurance Lakehouse is added as the data source in the Insurance Fabric data agent.
ii) Click the AI instructions button first to paste the sample instructions, and finally the Publish button.
iii) Paste the sample instructions in the field: A churn is represented by the number 1. Calculate churn rate as: total number of churns (TC) / total count of churn entries (TT). When asked to calculate churn for each policy type, TT should be the total count of churn entries of that policy type, e.g., Life or Legal.
iv) Make sure to hit the Publish button.
v) Capture two values from the URL and store them in a secure/private place. We will use them to configure the knowledge source in the Azure AI Foundry Agent. https://app.powerbi.com/groups/<WORKSPACE-ID>/aiskills/<ARTIFACT-ID>?experience=power-bi
e) Create an Azure AI Agent on Azure AI Foundry and use the Fabric data agent as the knowledge source.
i) Visit https://ai.azure.com
ii) Create a new Azure AI Project and deploy the gpt-4o-mini model (as an example model) in a region where the model is available.
iii) Create a new Azure AI Foundry Agent by clicking the "New agent" button. Give it a name (e.g., AgentforInsurance).
iv) Paste the sample instructions in the Azure AI Agent as follows: Use the Fabric data agent tool to answer questions related to Insurance data. The Fabric data agent as a tool has access to insurance data tables: Claims (amount, date, status), Customer (age, address, occupation, etc.), Policy (premium amount, policy type: life insurance, auto insurance, etc.).
v) On the right-hand pane, click the "+Add" button next to Knowledge.
vi) Choose an existing Fabric data agent connection or click new Connection.
vii) In the next dialog, plug in the values of the Workspace and Artifact IDs you captured above, ensure that "is secret" is checked, give the connection a name, and hit the Connect button.
viii) Add the Code Interpreter as a tool in the Azure AI Foundry agent by clicking +Add next to Actions and selecting Code Interpreter.
ix) Test the agent by clicking the "Try in playground" button.
x) To test the agent, you can try out these sample questions:
What is the churn rate across my different insurance policy types?
What's the month over month claims change in % for each insurance type?
Show me a graph view of the month over month claims change in % for each insurance type for the year 2025 only.
Based on the month over month claims change for the year 2025, can you show the forecast for the next 3 months in a graph?
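The original post shows the notebook cells and portal configuration as screenshots only, so two hedged sketches follow. The first is a minimal example of what steps 2 c) i) and ii) could look like: installing Faker and generating a small fake insurance dataset as lakehouse tables. The table names, columns, and churn encoding are assumptions chosen to line up with the sample questions above, not the exact dataset from the original walkthrough (the `spark` session is predefined in Fabric notebooks).

```python
# In a separate notebook cell first: %pip install faker
# Then generate a small fake insurance dataset and save it as lakehouse tables.
import random
from datetime import date
from faker import Faker

fake = Faker()
random.seed(42)

customers, policies, claims = [], [], []
policy_types = ["Life", "Auto", "Legal", "Home"]

for cust_id in range(1, 201):
    customers.append((cust_id, fake.name(), random.randint(18, 85), fake.job(),
                      fake.address().replace("\n", ", ")))
    for _ in range(random.randint(1, 2)):
        policy_id = len(policies) + 1
        policies.append((policy_id, cust_id, random.choice(policy_types),
                         round(random.uniform(200, 3000), 2),
                         random.choice([0, 1])))  # churn flag: 1 = churned
        for _ in range(random.randint(0, 3)):
            claims.append((len(claims) + 1, policy_id,
                           round(random.uniform(100, 20000), 2),
                           fake.date_between(date(2024, 1, 1), date(2025, 6, 30)),
                           random.choice(["Open", "Approved", "Rejected"])))

spark.createDataFrame(customers, ["CustomerID", "Name", "Age", "Occupation", "Address"]) \
     .write.mode("overwrite").saveAsTable("Customer")
spark.createDataFrame(policies, ["PolicyID", "CustomerID", "PolicyType", "PremiumAmount", "Churn"]) \
     .write.mode("overwrite").saveAsTable("Policy")
spark.createDataFrame(claims, ["ClaimID", "PolicyID", "Amount", "ClaimDate", "Status"]) \
     .write.mode("overwrite").saveAsTable("Claims")
```

The second sketch shows how step e) could be done programmatically instead of through the portal. It assumes the preview azure-ai-projects / azure-ai-agents Python SDK and its FabricTool helper; module paths and method names have shifted across preview versions, so treat this as an outline of the approach rather than the exact API. The endpoint, connection id, and model deployment name are placeholders.

```python
# Hypothetical sketch: create an Azure AI Foundry agent that uses a Fabric data agent
# as its knowledge source. The SDK surface is in preview and may differ in your version.
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.agents.models import FabricTool  # module path may vary by SDK version

project_client = AIProjectClient(
    endpoint=os.environ["PROJECT_ENDPOINT"],  # placeholder: your Foundry project endpoint
    credential=DefaultAzureCredential(),
)

# The Fabric connection (configured with the Workspace and Artifact IDs captured above)
# is referenced here by its connection id.
fabric_tool = FabricTool(connection_id=os.environ["FABRIC_CONNECTION_ID"])

with project_client:
    agent = project_client.agents.create_agent(
        model="gpt-4o-mini",  # example model deployment name
        name="AgentforInsurance",
        instructions=("Use the Fabric data agent tool to answer questions related to insurance data. "
                      "It has access to the Claims, Customer, and Policy tables."),
        tools=fabric_tool.definitions,
    )
    print(f"Created agent: {agent.id}")
```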
f) Exposing the Fabric data agent to end users: we will explore this in the Copilot Studio section.
Fabric & Copilot Studio – Low-code agentic path
Prerequisites
For Copilot Studio, you have three options to work with: a Copilot Studio trial, a Copilot Studio license, or Copilot Studio pay-as-you-go (connected to your Azure billing). Follow the steps here to set it up: Copilot Studio licensing - Microsoft Copilot Studio | Microsoft Learn. Once you have Copilot Studio set up, navigate to https://copilotstudio.microsoft.com/ and start creating a new agent – click on "Create" and then "New agent".
Walkthrough/Set-up
Follow the steps from the "Fabric & Azure AI Foundry" section to create the Fabric Lakehouse and Fabric data agent. You could create the new agent by describing it step by step in natural language, but for this exercise we will use the "skip to configure" button. Give the agent a name and add a helpful description (suggestion: "Agent that has access to Insurance data and helps with data driven insights from Policy, Claims and Customer information"). Then add the agent instruction prompt: "Answer questions related to Insurance data, you have access to insurance data agent, use the agent to gather insights from insurance Lakehouse containing customer, policy and claim information." Finally, click on "Create". You should now have a setup like the one shown below. Next, we want to add another agent for this agent to use – in this case, this will be our Fabric data agent. Click on "Agents", then click on "Add" and add an agent. From the screen, click on Microsoft Fabric. If you haven't set up a connection to Fabric from Copilot Studio, you will be prompted to create a new connection; follow the prompts to sign in with your user and add a connection. Once that is done, click "Next" to continue. From the Fabric screen, select the appropriate data agent and click "Next". On the next screen, name the agent appropriately, use a friendly description such as "Agent that answers questions from the insurance lakehouse knowledge, has access to claims, policies and customer information", and finally click on "Add Agent". In the "Tools" section, click refresh to make sure the tool description populates. Finally, go back to the overview and start testing the agent from the side Test Panel. Click on the Activity map to see the sequence of events. Type in the following question: "What's the month over month claims change in % for auto insurance?" You can see that the Fabric data agent is called by the Copilot agent in this scenario to answer your question. Now let's prepare to surface this through Teams. You will need to publish the agent to a channel (in this case, we will use the Teams channel). First, navigate to channels. Click on the Teams and M365 Copilot channel and click add. After adding the channel, a pop-up will ask you if you are ready to publish the agent. To view the app in Teams, you need to make sure that you have set up the proper policies in Teams. Follow this tutorial: Connect and configure an agent for Teams and Microsoft 365 Copilot - Microsoft Copilot Studio | Microsoft Learn. Now your app is available across Teams. Below is an example of how to use it from Teams – make sure you click on "Allow" for the Fabric data agent connection.
Fabric & AI Foundry & Copilot Studio – the end to end
We saw how Fabric data agents can be created and utilized in Copilot Studio in a multi-agent setup. In the future, pro-code and low-code agentic developers are expected to work together to create agentic apps, instead of in silos.
So, how do we solve the challenge of connecting all the components together in the same technology stack? Let's say a pro-code developer has created a custom agent in AI Foundry. Meanwhile, a low-code business user has added business context to create another agent that requires access to the agent in AI Foundry. You'll be pleased to know that Copilot Studio and Azure AI Foundry are becoming more integrated to enable complex, custom scenarios: Copilot Studio will soon release the integration to help with this.
Summary: We demonstrated how one can build a Gen AI solution that allows seamless integration between Azure AI Foundry agents and Fabric data agents. We look forward to seeing what innovative solutions you can build by learning and working closely with your Microsoft contacts or your SI partner. This may include, but is not limited to:
Utilizing a real industry domain to illustrate the concepts of building a simple multi-agent solution.
Showcasing the value of combining the Fabric data agent and the Azure AI agent.
Demonstrating how one can publish the conceptual solution to Teams or Copilot using the new M365 Agents Toolkit.
Note that this blog post focused only on the Fabric data agent and the Azure AI Foundry Agent Service, but production-ready solutions will need to consider Azure Monitor (for monitoring and observability) and Microsoft Purview (for data governance).
Pointers to Other Learning Resources
Ways to build AI agents: Build agents, your way | Microsoft Developer
Components of Microsoft Fabric: What is Microsoft Fabric - Microsoft Fabric | Microsoft Learn
Info on the Microsoft Fabric data agent: Create a Fabric data agent (preview) - Microsoft Fabric | Microsoft Learn
Fabric data agent tutorial: Fabric data agent scenario (preview) - Microsoft Fabric | Microsoft Learn
New in Fabric data agent: Data source instructions for smarter, more accurate AI responses | Microsoft Fabric Blog | Microsoft Fabric
Info on Azure AI Services, Models and Azure AI Foundry:
Azure AI Foundry documentation | Microsoft Learn
What are Azure AI services? - Azure AI services | Microsoft Learn
How to use Azure AI services in Azure AI Foundry portal - Azure AI Services | Microsoft Learn
What is Azure AI Foundry Agent Service? - Azure AI Foundry | Microsoft Learn
Explore Azure AI Foundry Models - Azure AI Foundry | Microsoft Learn
What is Azure OpenAI in Azure AI Foundry Models? | Microsoft Learn
Explore model leaderboards in Azure AI Foundry portal - Azure AI Foundry | Microsoft Learn & Benchmark models in the model leaderboard of Azure AI Foundry portal - Azure AI Foundry | Microsoft Learn
How to use model router for Azure AI Foundry (preview) | Microsoft Learn
Observability in Generative AI with Azure AI Foundry - Azure AI Foundry | Microsoft Learn
Trustworthy AI for Azure AI Foundry - Azure AI Foundry | Microsoft Learn
Cost Management for Models: Plan to manage costs for Azure AI Foundry Models | Microsoft Learn
Provisioned Throughput Offering: Provisioned throughput for Azure AI Foundry Models | Microsoft Learn
Extend Azure AI Foundry Agent with Microsoft Fabric: Expand Azure AI Agent with New Knowledge Tools: Microsoft Fabric and Tripadvisor | Microsoft Community Hub & How to use the data agents in Microsoft Fabric with Azure AI Foundry Agent Service - Azure AI Foundry | Microsoft Learn
General guidance on when to adopt, extend and build Copilot experiences: Adopt, extend and build Copilot experiences across the Microsoft Cloud | Microsoft Learn
M365 Agents Toolkit: Choose the right agent solution to support your use case | Microsoft Learn & Microsoft 365 Agents Toolkit Overview - Microsoft 365 Developer | Microsoft Learn
GitHub and video to offer an Azure AI Agent inside Teams or Copilot via the M365 Agents Toolkit: OfficeDev/microsoft-365-agents-toolkit: Developer tools for building Teams apps & Deploying your Azure AI Foundry agent to Microsoft 365 Copilot, Microsoft Teams, and beyond
Happy Learning!
Contributors: Jeet J & Ritaja S
Special thanks to reviewers: Joanne W, Amir J & Noah A
Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive—your cloud architecture team can choose the integration strategy that best aligns with your organization’s governance model, workload requirements and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.
External Data Sharing With Microsoft Fabric
The demand for data for external analytics consumption is growing rapidly. There are many options to share data externally, and the field is very dynamic. One of the most frictionless and easy-to-onboard options for external data sharing, which we will explore here, is Microsoft Fabric. External data sharing allows users to share data from their tenant with users in another Microsoft Fabric tenant.
Automating Data Vault processes on Microsoft Fabric with VaultSpeed
This Article is Authored By Jonas De Keuster from VaultSpeed and Co-authored with Michael Olschimke, co-founder and CEO at Scalefree International GmbH & Trung Ta is a senior BI consultant at Scalefree International GmbH. The Technical Review is done by Ian Clarke, Naveed Hussain – GBBs (Cloud Scale Analytics) for EMEA at Microsoft Businesses often struggle to align their understanding of processes and products across disparate systems in corporate operations. In our previous blogs in this series, we explored the advantages of Data Vault as a methodology and why it is increasingly recognized due to its automation-friendly approach to modern data warehousing. Data Vault’s modular structure, scalability, and flexibility address the challenges of integrating diverse and evolving data sources. However, the key to successfully implementing a Data Vault lies in automation. Data Vault’s pattern-based modeling - organized around hubs, links, and satellites - provides a standardized framework well-suited to integrate data from horizontally scattered operational source systems. Automation tools like VaultSpeed enhance this methodology by simplifying the generation of Data Vault structures, streamlining workflows, and enabling rapid delivery of analytics-ready data solutions. By leveraging the strengths of Data Vault and VaultSpeed’s automation capabilities, organizations can overcome inefficiencies in traditional ETL processes, enabling scalable and adaptable data integration. Examples of such operational systems include Microsoft Dynamics 365 for CRM and ERP, SAP for enterprise resource planning, or Salesforce for customer data. Attempts to harmonize this complexity historically relied on pre-built industry data models. However, these models often fell short, requiring significant customization and failing to accommodate unique business processes. Different approaches to Data Integration Industry data models offer a standardized way to structure data, providing a head start for organizations with well-aligned business processes. They work well in stable, regulated environments where consistency is key. However, for organizations dealing with diverse sources and fast-changing requirements, Data Vault offers greater flexibility. Its modular, scalable approach supports evolving data landscapes without the need to reshape existing models. Both approaches aim to streamline integration. Data Vault simply offers more adaptability when complexity and change are the norm. So it depends on the use cases when it comes to choosing the right approach. Tackling data complexity with automation Integrating data from horizontally distributed sources is one of the biggest challenges data engineers face. VaultSpeed addresses this by connecting the physical metadata from source systems with the business's conceptual data model and creating a "town plan" for building a Data Vault model. This "town plan" for Data Vault model construction serves as the bedrock for automating various data pipeline stages. By aligning data's technical and business perspectives, VaultSpeed enables the automated generation of logical and physical data models. This automation streamlines the design process and ensures consistency between the data's conceptual understanding and physical implementation. Furthermore, VaultSpeed's automation extends to the generation of transformation code. This code converts data from its source format into the structure defined by the Data Vault model. 
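As an illustration of the kind of transformation code such a generator can emit, here is a minimal sketch of rendering a hub loading statement from harvested metadata. The naming conventions, staging table, and hash function are assumptions for illustration; they are simpler than, and not identical to, the code VaultSpeed actually generates.

```python
# Minimal sketch: rendering an idempotent hub load from metadata (illustrative only).
# The HK_ prefix, staging schema, and HASHBYTES hash function are assumptions;
# hash syntax in particular varies by target platform.

HUB_LOAD_TEMPLATE = """INSERT INTO raw_vault.HUB_{name} (HK_{name}, {bk_list}, LOAD_DATE, RECORD_SOURCE)
SELECT DISTINCT
    HASHBYTES('SHA2_256', {bk_concat}) AS HK_{name},
    {bk_list},
    SYSUTCDATETIME() AS LOAD_DATE,
    '{record_source}' AS RECORD_SOURCE
FROM staging.{staging_table} AS stg
WHERE NOT EXISTS (
    SELECT 1 FROM raw_vault.HUB_{name} AS hub
    WHERE hub.HK_{name} = HASHBYTES('SHA2_256', {bk_concat})
);"""

def generate_hub_load(name, business_keys, staging_table, record_source):
    """Render a hub loading statement from the harvested metadata."""
    bk_list = ", ".join(business_keys)
    bk_concat = " + '||' + ".join(f"CAST(stg.{bk} AS VARCHAR(200))" for bk in business_keys)
    return HUB_LOAD_TEMPLATE.format(
        name=name.upper(),
        bk_list=bk_list,
        bk_concat=bk_concat,
        staging_table=staging_table,
        record_source=record_source,
    )

print(generate_hub_load("customer", ["customer_number"], "crm_customer", "CRM.PROD"))
```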
Automating this process reduces the potential for errors and accelerates the development of the data integration pipeline. In addition to data models and transformation code, VaultSpeed also automates workflow orchestration. This involves defining and managing the tasks required to extract, transform, and load data into the Data Vault. By automating this orchestration, VaultSpeed ensures that the data integration process is executed reliably and efficiently. How VaultSpeed automates integration The following section will examine the detailed steps involved in the VaultSpeed workflow. We will examine how it combines metadata-driven and data-driven modeling approaches to streamline data integration and automate various data pipeline stages. Harvest metadata: VaultSpeed collects metadata from source systems such as OneLake, AzureSQL, SAP, and Dynamics 365, capturing schema details, relationships, and dependencies. Align with conceptual models: Using a business’s conceptual data model as a guiding framework, VaultSpeed ensures that physical source metadata is mapped consistently to the target Data Vault structure. Generate logical and physical models: VaultSpeed leverages its metadata repository and automation templates to produce fully defined logical and physical Data Vault models, including hubs, links, and satellites. Automate code creation: Once the models are defined, VaultSpeed generates the necessary transformation and workflow code using templates with embedded standards and conventions for Data Vault implementation. This ensures seamless data ingestion, integration, and consistent population of the Data Vault model. By automating these steps, VaultSpeed eliminates the manual effort required for traditional data modeling and integration, reducing errors and addressing the inefficiencies of data integration using traditional ETL. Due to the model driven approach, the code is always in sync with the data model. Unified integration with Microsoft Fabric Microsoft Fabric offers a robust data ingestion, storage, and analytics ecosystem. VaultSpeed seamlessly embeds within this ecosystem to ensure an efficient and automated data pipeline. Here’s how the process works: Ingestion (Extract and Load): Tools like ADF, Fivetran, or OneLake replication bring data from various sources into Fabric. These tools handle the extraction and replication of raw data from operational systems. Microsoft Fabric also supports mirrored databases, enabling real-time data replication from sources like CosmosDB, Azure SQL, or application data into the Fabric environment. This ensures data remains synchronized across the ecosystem, providing a consistent foundation for downstream modeling and analytics. Data Repository or Platform: Microsoft Fabric is the data platform providing the infrastructure for storing, managing, and securing the ingested data. Fabric uniquely supports warehouse and lakehouse experiences, bringing them together under a unified data architecture. This means organizations can combine structured, transactional data with unstructured or semi-structured data in a single platform, eliminating silos and enabling broader analytics use cases. Modeling and Transformation: VaultSpeed takes over at this stage, leveraging its advanced automation to model and transform data into a Data Vault structure. This includes creating hubs, links, and satellites while ensuring alignment with business taxonomies. Unlike traditional ETL tools, VaultSpeed is not involved in the runtime execution of transformations. 
Instead, it generates code that runs within Microsoft Fabric. This approach ensures better performance, reduces vendor lock-in, and enhances security since no data flows through VaultSpeed itself. By focusing exclusively on model-driven automation, VaultSpeed enables organizations to maintain full control over their data processing while benefiting from automation efficiencies. Additionally, Fabric's VertiPaq engine manages the transformation workloads automatically, ensuring optimal performance without requiring extensive manual tuning, a key capability in a Data Vault context where performance is critical for handling large volumes of data and complex transformations. This simplifies operations for data engineers and ensures that query performance remains efficient, even as data volumes and complexity grow. Consume: The integrated data layer within Microsoft Fabric serves multiple consumption paths. While tools like Power BI enable actionable insights through analytics dashboards, the same data foundation can also drive AI use cases, such as machine learning models or intelligent applications. By connecting ingestion tools, a unified data platform, and analytics or AI solutions, VaultSpeed ensures a streamlined and integrated workflow that maximizes the value of the Microsoft Fabric ecosystem. Loading at multiple speeds: real-time Data Vaults with Fabric Loading data into a Data Vault often requires balancing traditional batch processes with the demands of real-time ingestion within a unified model. Microsoft Fabric’s event-driven tools, such as Data Activator, empower organizations to process data streams in real-time while supporting traditional batch loads. VaultSpeed complements these capabilities by ensuring that both modes of ingestion feed seamlessly into the same Data Vault model, eliminating the need for separate architectures like the Lambda pattern. Key capabilities for real time Data Vault include: Event-driven updates: Automatically trigger incremental loads into the Data Vault when changes occur in CosmosDB, OneLake, or other sources. Automated workflow orchestration: VaultSpeed’s Flow Management Control (FMC) automates the entire data ingestion, transformation, and loading workflow. This includes handling delta detection, incremental updates, and batch processes, ensuring optimal efficiency regardless of the speed of data arrival. FMC integrates natively with Azure Data Factory (ADF) for seamless orchestration within the Microsoft ecosystem. For more complex or distributed workflows, FMC also supports Apache Airflow, enabling flexibility in managing a wide range of data pipelines. Seamless integration: Maintain synchronized pipelines for historical and real-time data within the Fabric environment. The FMC intelligently manages multiple data streams, dynamically adjusting to workload demands to support high-volume batch loads and real-time event-driven updates. These capabilities ensure analytics dashboards reflect the latest data, delivering immediate value to decision-makers. Automating the gold layer and delivering data products at scale Power BI is a cornerstone of Microsoft Fabric, and VaultSpeed makes it easier for data modelers to connect the dots. By automating the creation of the gold layer, VaultSpeed enables seamless integration between Data Vaults and Power BI. Benefits for data teams: Automated gold layer: VaultSpeed automates the creation of the gold layer, including templates for star schemas, One Big Table (OBT), and other analytics-ready structures. 
These automated templates allow businesses to generate consistent and scalable presentation layers without manual intervention. Accelerated time to insight: By reducing manual preparation steps, VaultSpeed enables teams to deliver dashboards and reports quickly, ensuring faster access to actionable insights. Deliver data products: The ability to automate and standardize star schemas and other presentation models empowers organizations to deliver analytics-ready data products at scale, efficiently meeting the needs of multiple business domains. Improved data governance: VaultSpeed’s lineage tracking ensures compliance and transparency, providing full traceability from raw data to the presentation layer. No-code automation: Eliminate the need for custom scripting, freeing up time to focus on innovation and higher-value tasks. Conclusion Integrating VaultSpeed and Microsoft Fabric redefines how data modelers and engineers approach Data Vault 2.0. This partnership unlocks the full potential of modern data ecosystems by automating workflows, enabling real-time insights, and streamlining analytics. If you’re ready to transform your data workflows, VaultSpeed and Microsoft Fabric provide the tools you need to succeed. The following article will focus on the DataOps part of automation. Further reading Automating common understanding: Integrating different data source views into one comprehensive perspective Why Data Vault is the best model for data warehouse automation: Read the eBook The Elephant in the Fridge by John Giles: A great reference on conceptual data modeling for Data Vault About VaultSpeed VaultSpeed empowers enterprises to deliver data products at scale through advanced automation for modern data ecosystems, including data lakehouse, data mesh, and fabric architectures. The no-code platform eliminates nearly all traditional ETL tasks, delivering significant improvements in automation across areas like data modeling, engineering, testing, and deployment. With seamless integration to platforms like Microsoft Fabric or Databricks, VaultSpeed enables organizations to automate the entire software development lifecycle for data products, accelerating delivery from design to deployment. VaultSpeed addresses inefficiencies in traditional data processes, transforming how data engineers and business users collaborate to build flexible, scalable data foundations for AI and analytics. About the Authors Jonas De Keuster is VP Product at VaultSpeed. He had close to 10 years of experience as a DWH consultant in various industries like banking, insurance, healthcare, and HR services, before joining the data automation vendor. This background allows him to help understand current customer needs and engage in conversations with members of the data industry. Michael Olschimke is co-founder and CEO of Scalefree International GmbH, a European Big Data consulting firm. The firm empowers clients across all industries to use Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of industry data warehousing professionals, taught academic classes, and published regularly on these topics. Trung Ta is a senior BI consultant at Scalefree International GmbH. With over 7 years of experience in data warehousing and BI, he has advised Scalefree’s clients in different industries (banking, insurance, government, etc.) and of various sizes in establishing and maintaining their data architectures. 
Trung’s expertise lies within Data Vault 2.0 architecture, modeling, and implementation, specifically focusing on data automation tools.
Fabric Data Agents: Unlocking the Power of Agents as a Steppingstone for a Modern Data Platform
What Are Fabric Data Agents? Fabric Data Agents are intelligent, AI-powered assistants embedded within Microsoft Fabric, a unified data platform that integrates data ingestion, processing, transformation, and analytics. These agents act as intermediaries between users and data, enabling seamless interaction through natural language queries in the form of Q&A applications. Whether it's retrieving insights, analyzing trends, or generating visualizations, Fabric Data Agents simplify complex data tasks, making advanced analytics accessible to everyone—from data scientists to business analysts to executive teams. How Do They Work? At the center of Fabric Data Agents is OneLake, a unified and governed data lake that joins data from various sources, including on-premises systems, cloud platforms, and third-party databases. OneLake ensures that all data is stored in a common, open format, simplifying data management and enabling agents to access a comprehensive view of the organization's data. Through Fabric’s Data Ingestion capabilities, such as Fabric Data Factory, OneLake Shortcuts, and Fabric Database Mirroring, Fabric Data Agents are designed to connect with over 200 data sources, ensuring seamless integration across an organization's data estate. This connectivity allows them to pull data from diverse systems and provide a unified analytics experience. Here's how Fabric Data Agents work: Natural Language Processing: Using advanced NLP techniques, Fabric Data Agents enable users to interact with data through conversational queries. For example, users can ask questions like, "What are the top-performing investment portfolios this quarter?" and receive precise answers, grounded on enterprise data. AI-powered Insights: The agents process queries, reason over data, and deliver actionable insights, using Azure OpenAI models, all while maintaining data security and compliance. Customization: Fabric data agents are highly customizable. Users can provide custom instructions and examples to tailor their behavior to specific scenarios. Fabric Data Agents allow users to provide example SQL queries, which can be used to influence the agent’s behavior. They also can integrate with Azure AI Agent Service or Microsoft Copilot Studio, where organizations can tailor agents to specific use cases, such as risk assessment or fraud detection. Security and Compliance: Fabric Data Agents are built with enterprise-grade security features, including inheriting Identity Passthrough/On-Behalf-Of (OBO) authentication. This ensures that business users only access data they are authorized to view, keeping strict compliance with regulations like GDPR and CCPA across geographies and user roles. Integration with Azure: Fabric Data Agents are deeply integrated with Azure services, such as Azure AI Agent Service and Azure OpenAI Service. Practically, organizations can publish Fabric Data Agents to custom Copilots using these services and use the APIs in various custom AI applications. This integration ensures scalability, high availability, and performance and exceptional customer experience. Why Should Financial Services Companies Use Fabric Data Agents? The financial services industry faces unique challenges, including stringent regulatory requirements, the need for real-time decision-making, and empowering users to interact with an AI application in a Q&A fashion over enterprise data. 
Fabric Data Agents address these challenges head-on through:
Enhanced Efficiency: Automate repetitive tasks, freeing up valuable time for employees to focus on strategic initiatives.
Improved Compliance: Use robust data governance features to ensure compliance with regulations like GDPR and CCPA.
Data-Driven Decisions: Gain deeper insights into customer behavior, market trends, and operational performance.
Scalability: Seamlessly scale analytics capabilities to meet the demands of a growing organization, without investing heavily in building custom AI applications, which require deep expertise.
Integration with Azure: Fabric Data Agents are natively designed to integrate across Microsoft’s ecosystem, providing a comprehensive end-to-end solution for a modern data platform.
How different are Fabric Data Agents from Copilot Studio Agents?
Fabric Data Agents and Copilot Studio Agents serve distinct purposes within Microsoft's ecosystem: Fabric Data Agents are tailored for data science workflows. They integrate AI capabilities to interact with organizational data, providing analytics insights. They focus on data processing and analysis using the medallion architecture (bronze, silver, and gold layers) and support integration with the Lakehouse, Data Warehouse, KQL databases, and semantic models. Copilot Studio Agents, on the other hand, are customizable AI-powered assistants designed for specific tasks. Built within Copilot Studio, they can connect to various enterprise data sources like OneLake, AI Search, SharePoint, OneDrive, and Dynamics 365. These agents are versatile, enabling businesses to automate workflows, analyze data, and provide contextual responses by using APIs and built-in connectors.
What are the technical requirements for using Fabric Data Agents?
A paid F64 or higher Fabric capacity resource.
Fabric data agent tenant settings is enabled.
Copilot tenant switch is enabled.
Cross-geo processing for AI is enabled.
Cross-geo storing for AI is enabled.
At least one of these: Fabric Data Warehouse, Fabric Lakehouse, one or more Power BI semantic models, or a KQL database with data.
Power BI semantic models via XMLA endpoints tenant switch is enabled for Power BI semantic model data sources.
Final Thoughts
In a data-driven world, Fabric Data Agents are poised to redefine how financial services organizations operate and innovate. By simplifying complex data processes, enabling actionable insights, and fostering collaboration across teams, these intelligent agents empower organizations to unlock the true potential of their data. Paired with the robust capabilities of Microsoft Fabric and Azure, financial institutions can confidently navigate industry challenges, drive growth, and deliver superior customer experiences. Adopting Fabric Data Agents is not just an upgrade—it's a transformative step towards building a resilient and future-ready business. The time to embrace the data revolution is now. Learn how to create Fabric Data Agents.
Delivering Information with Azure Synapse and Data Vault 2.0
Data Vault has been designed to integrate data from multiple data sources, creatively destruct the data into its fundamental components, and store and organize it so that any target structure can be derived quickly. This article focused on generating information models, often dimensional models, using virtual entities. They are used in the data architecture to deliver information. After all, dimensional models are easier for dashboarding solutions to consume, and business users know how to use dimensions and facts to aggregate their measures. However, PIT and bridge tables are usually needed to maintain the desired performance level. They also simplify the implementation of dimension and fact entities and, for those reasons, are frequently found in Data Vault-based data platforms. This article completes the information delivery. The following articles will focus on the automation aspects of Data Vault modeling and implementation.
Creating an AI-Driven Chatbot to Inquire Insights into Business Data
Introduction
In the fast-paced digital era, the ability to extract meaningful insights from vast datasets is paramount for businesses striving for a competitive edge. Microsoft Dynamics 365 Finance and Operations (D365 F&O) is a robust ERP platform, generating substantial business data. To unlock the full potential of this data, integrating it with advanced analytics and AI tools such as Azure OpenAI, Azure Synapse Workspace, or Fabric Workspace is essential. This blog will guide you through the process of creating a chatbot to inquire insights using Azure OpenAI with Azure Synapse Workspace or Fabric Workspace.
Architecture
Natural Language Processing (NLP): Enables customers to inquire about business data such as order statuses, item details, and personalized order information using natural language.
Seamless Data Integration: Real-time data fetching from D365 F&O for accurate and up-to-date information.
Contextual and Personalized Responses: AI provides detailed, context-rich responses to customer queries, improving engagement and satisfaction.
Scalability and Efficiency: Handles multiple concurrent inquiries, reducing the burden on customer service teams and improving operational efficiency.
Understanding the Components
Microsoft Dynamics 365 Finance and Operations (D365 F&O): D365 F&O is a comprehensive ERP solution designed to help businesses streamline their operations, manage finances, and control supply chain activities. It generates and stores vast amounts of transactional data essential for deriving actionable insights.
Dataverse: Dataverse is a cloud-based data storage solution that allows you to securely store and manage data used by business applications. It provides a scalable and reliable platform for data integration and analytics, enabling businesses to derive actionable insights from their data.
Azure Synapse Analytics: Azure Synapse Analytics is an integrated analytics service that brings together big data and data warehousing. It allows users to query data on their terms, deploying either serverless or provisioned resources at scale. The service provides a unified experience to ingest, prepare, manage, and serve data for instant business intelligence and machine learning requirements.
Fabric Workspace: Fabric Workspace provides a collaborative platform for data scientists, analysts, and business users to work together on data projects. It facilitates the seamless integration of various data sources and advanced analytics tools to drive innovative solutions.
Azure SQL Database: Azure SQL Database is a cloud-based relational database service built on Microsoft SQL Server technologies. It offers a range of deployment options, including single databases, elastic pools, and managed instances, allowing you to choose the best fit for your application needs. Azure SQL Database provides high availability, scalability, and security features, making it an ideal choice for modern applications. Data from Dynamics 365 Finance and Operations (F&O) is copied to an Azure SQL Database using a flow that involves Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF).
Azure OpenAI: Azure OpenAI enables developers to build and deploy intelligent applications using powerful AI models. By integrating OpenAI’s capabilities with Azure’s infrastructure, businesses can create sophisticated solutions that leverage natural language processing, machine learning, and advanced analytics.
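Before the step-by-step guide, here is a minimal sketch of how the Azure SQL Database and Azure OpenAI components described above might be wired together in such a chatbot: the user's question is answered by querying the replicated D365 F&O data and passing the result to the chat model as grounding context. The connection string, deployment name, table, and column names are placeholders (the replicated schema will differ per environment), and a production solution would add proper retrieval logic, error handling, and security hardening.

```python
# Minimal sketch: ground an Azure OpenAI chat response on D365 F&O data replicated to
# Azure SQL Database. Connection details, deployment name, and the schema are placeholders.
import os
import pyodbc
from openai import AzureOpenAI

def fetch_order_rows(order_number: str) -> list:
    """Query the replicated sales order data for a single order (hypothetical schema)."""
    with pyodbc.connect(os.environ["AZURE_SQL_CONNECTION_STRING"]) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT SalesOrderNumber, OrderStatus, ShippingDate "
            "FROM dbo.SalesOrderHeaders WHERE SalesOrderNumber = ?",
            order_number,
        )
        return cursor.fetchall()

def answer_question(question: str, order_number: str) -> str:
    """Send the user's question plus the retrieved rows to the chat model as context."""
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )
    context = "\n".join(str(tuple(row)) for row in fetch_order_rows(order_number))
    response = client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],  # e.g. a gpt-4o deployment name
        messages=[
            {"role": "system", "content": "Answer using only the order data provided."},
            {"role": "user", "content": f"Order data:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(answer_question("What is the status of my order?", "SO-001234"))
```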
Step-by-Step Guide to Creating the Chatbot

Step 1: Export Data from D365 F&O
To begin, export the necessary data from your D365 F&O instance. This data will serve as the foundation for your analytics and AI operations. Ensure the exported data is in a format compatible with Azure Synapse or Fabric Workspace.

Step 2: Ingest Data into Azure Synapse Workspace or Fabric Workspace
Next, ingest the exported data into Azure Synapse Workspace or Fabric Workspace. Utilize the workspace's capabilities to prepare, manage, and optimize the data for further analysis. This step involves setting up data pipelines, cleaning the data, and transforming it into a suitable format for processing.

Step 3: Set Up Azure OpenAI
With your data ready, set up Azure OpenAI in your environment. This involves provisioning the necessary resources, configuring the OpenAI service, and integrating it with your Azure infrastructure. Ensure you have the appropriate permissions and access controls in place.

Step 4: Develop the Chatbot
Develop the chatbot using Azure OpenAI's capabilities. Design the chatbot to interact with users naturally, allowing them to ask for insights and receive valuable information based on the data from D365 F&O. Utilize natural language processing to enhance the chatbot's ability to understand and respond to user queries effectively.

Step 5: Integrate the Chatbot with Azure Synapse or Fabric Workspace
Integrate the developed chatbot with Azure Synapse Workspace or Fabric Workspace. This integration enables the chatbot to access and analyze the ingested data, providing users with real-time insights. Set up the necessary APIs and data connections to facilitate seamless communication between the chatbot and the workspace.

Step 6: Test and Refine the Chatbot
Thoroughly test the chatbot to ensure it functions as expected. Address any issues or bugs, and refine the chatbot's responses and capabilities. This step is crucial to ensure the chatbot delivers accurate and valuable insights to users.

Best Practices for Data Access

Data Security
Data security is paramount when exporting sensitive business information. Implement the following best practices:
- Ensure that all data transfers are encrypted using secure protocols.
- Use role-based access control to restrict access to the exported data.
- Regularly audit and monitor data export activities to detect any unauthorized access or anomalies.

Data Transformation
Transforming data before accessing it can enhance its usability for analysis:
- Use Synapse data flows to clean and normalize the data.
- Apply business logic to enrich the data with additional context.
- Aggregate and summarize data to improve query performance.

Monitoring and Maintenance
Regular monitoring and maintenance ensure the smooth operation of your data export solution:
- Set up alerts and notifications for any failures or performance issues in the data pipelines.
- Regularly review and optimize the data export and transformation processes.
- Keep your Azure Synapse environment up to date with the latest features and enhancements.

Benefits of Integrating AI and Advanced Analytics

Enhanced Decision-Making
By leveraging AI and advanced analytics, businesses can make data-driven decisions. The chatbot provides timely insights, enabling stakeholders to act quickly and efficiently.

Improved Customer Experience
A chatbot enhances customer interactions by providing instant responses and personalized information. This leads to higher satisfaction and engagement levels.
Operational Efficiency
Integrating AI tools with business data streamlines operations, reduces manual effort, and increases overall efficiency. Businesses can optimize processes and resource allocation effectively.

Scalability
The chatbot can handle multiple concurrent inquiries, scaling as the business grows without requiring proportional increases in customer service resources.

Conclusion

Creating a chatbot to inquire about insights using Azure OpenAI with Azure Synapse Workspace or Fabric Workspace represents a significant advancement in how businesses can leverage their data. By following the steps outlined in this guide, organizations can develop sophisticated AI-driven solutions that enhance decision-making, improve customer experiences, and drive operational efficiency. Embrace the power of AI and advanced analytics to transform your business and unlock new opportunities for growth.

Decision Guide for Selecting an Analytical Data Store in Microsoft Fabric
Learn how to select an analytical data store in Microsoft Fabric based on your workload's data volumes, data type requirements, compute engine preferences, data ingestion patterns, data transformation needs, query patterns, and other factors.

Efficient Log Management with Microsoft Fabric
Introduction

In the era of digital transformation, managing and analyzing log files in real time is essential for maintaining application health, security, and performance. There are many third-party solutions in this area that collect, process, store, and analyze this data source and act upon it. But as your systems scale, those solutions can become very costly: their cost model typically grows with the amount of ingested data rather than with the real resource utilization or the customer value delivered.

This blog post explores a robust architecture that leverages the Microsoft Fabric SaaS platform, focused on its Real-Time Intelligence capabilities, for efficient log file collection, processing, and analysis. The use cases range from simple troubleshooting of application errors to more advanced scenarios such as trend detection, for example spotting slowly degrading performance (an average user session for specific activities lasting longer than expected), and more proactive monitoring, where log-based KPIs are defined and watched for alert generation.

Regarding cost, since Fabric provides a complete separation between compute and storage, you can grow your data without necessarily growing your compute costs, and you still pay only for the resources that are used, in a pay-as-you-go model.

Architecture Overview

The proposed architecture integrates Microsoft Fabric's Real-Time Intelligence (Real-Time hub) with your source log files to create a seamless, near real-time log collection solution. It is based on Microsoft Fabric, a SaaS solution that unifies several best-of-breed Microsoft analytical experiences. Fabric is a modern data/AI platform based on unified, open data formats (Parquet/Delta), allowing both classical data lake experiences, with traditional Lakehouse/warehouse SQL analytics, and real-time intelligence on semi-structured data, all on a lake-centric SaaS platform. Fabric's open foundation with built-in governance enables you to connect to various clouds and tools while maintaining data trust.

(Figure: high-level overview of Real-Time Intelligence inside Fabric)

Log Events - Fabric-Based Architecture

Looking in more detail at a solution for log collection, processing, storage, and analysis, we propose the following architecture; let's discuss it in more detail.

General notes: since Fabric is a SaaS solution, all the components can be used without deploying any infrastructure in advance; with a click of a button and a few simple configurations you can customize the relevant components for this solution. The main components used in this solution are Data Pipeline, OneLake, and Eventhouse.

Our data source for this example is taken from this public Git repo: https://github.com/logpai/loghub/tree/master/Spark. The files were stored inside an S3 bucket to demonstrate how easily the data pipeline integrates with external data sources. A typical log file looks like this:

16/07/26 12:00:30 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 59219.
16/07/26 12:00:30 INFO spark.SparkEnv: Registering MapOutputTracker
16/07/26 12:00:30 INFO spark.SparkEnv: Registering BlockManagerMaster
16/07/26 12:00:30 INFO storage.DiskBlockManager: Created local directory at /opt/hdfs/nodemanager/usercache/curi/appcache/application_1460011102909_0176/blockmgr-5ea750cb-dd00-4593-8b55-4fec98723714
16/07/26 12:00:30 INFO storage.MemoryStore: MemoryStore started with capacity 2.4 GB

Components

Data Pipeline

The first challenge to solve is how to bring the log files from your systems into Fabric; this is the log collection phase. Many solutions exist for this phase, each with its pros and cons. In Fabric, the standard way to bring data in is the ADF Copy activity, which in its Fabric SaaS version is called Data Pipeline. Data Pipeline is a low-code/no-code tool for managing and automating the movement and transformation of data within Microsoft Fabric: a serverless ETL tool with more than 100 connectors enabling integration with a wide variety of data sources, including databases, cloud services, file systems, and more. In addition, it supports an on-premises agent called the self-hosted integration runtime. This agent, installed on a VM, acts as a bridge that allows the pipeline to run against a local VM while securing the connection from the on-premises network to the cloud.

Let's describe our solution's data pipeline in more detail. Bear in mind that ADF is very flexible and supports reading at scale from a wide range of data sources and file stores across all major cloud vendors' blob storage: S3, GCS, Oracle Cloud, file systems, FTP/SFTP, and so on. Even if your files are generated outside Azure, this is not an issue at all.

(Figure: visualization of the Fabric data pipeline)

Log Collection

ADF Copy activity: inside the data pipeline we create a Copy activity with the following basic configuration.

Source: mapped to your data sources. This can be Azure Blob Storage with a container holding the log files, or another cloud object store such as S3 or GCS. Log files are generally retrieved from a specific container/folder and fetched based on a prefix/suffix in the file name. To support an incremental load process, the activity can be configured to delete the source files it reads, so that once the files are successfully transferred to their target they are automatically removed from the source and the next pipeline iteration does not process the same files again.

Sink: a OneLake/Lakehouse folder. We create a Lakehouse ahead of time, an abstract data container that can hold and manage your data at scale, structured or unstructured, and then select it from the list of connectors (look for OneLake/Lakehouse).

Log shippers: this is an optional component. Sometimes the ETL is not allowed to access your on-premises VNet; in this case, tools like Fluentd, Filebeat, or the OpenTelemetry Collector are used to forward your application logs to the main entry point of the system, Azure Blob Storage.

AzCopy CLI: if you don't wish to invest in expensive tools and all you need is to copy your data to Azure Storage in a scalable and secure manner, you might consider creating your own log shipping solution based on the free AzCopy tool together with some basic scripting around it for scheduling. AzCopy is a command-line utility designed for high-performance uploading, downloading, and copying of data to and from Microsoft Azure Blob and File storage.
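As an illustration of the "basic scripting around AzCopy" idea, here is a minimal sketch in Python. The local log folder, storage account, container, and SAS token are placeholder assumptions; in practice you would schedule this with cron or Windows Task Scheduler and keep the SAS token in a secret store rather than in code.

import subprocess
import sys
from datetime import datetime, timezone

# Placeholder assumptions: adjust the local log folder, storage account,
# container, and SAS token to your own environment.
LOCAL_LOG_DIR = "/var/log/myapp"
DESTINATION = (
    "https://<storage-account>.blob.core.windows.net/"
    "logs-landing/spark/?<SAS-token>"
)

def ship_logs() -> int:
    """Upload new .log files to the Blob landing container using AzCopy."""
    cmd = [
        "azcopy", "copy",
        f"{LOCAL_LOG_DIR}/*.log",
        DESTINATION,
        "--overwrite=false",   # skip files that were already shipped
        "--log-level=WARNING",
    ]
    print(f"{datetime.now(timezone.utc).isoformat()} running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(ship_logs())

Because the script only wraps the AzCopy command, reruns stay cheap: files that already exist at the destination are skipped rather than re-uploaded.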
(Figure: first Fabric activity - copy from the source bucket to the Lakehouse)

Log Preparation

When log files land in Azure Blob Storage, Eventstream can be used to trigger the data pipeline that handles the data preparation and loading phases. So what is the main purpose of the data preparation phase? After the log files land in storage, and before they are loaded into the real-time logs database (the KQL database), it may be necessary to transform the data with some basic manipulations. The reasons vary; a few examples:

Bad data formats: log files sometimes contain problematic characters, such as new lines inside a single record (a stack trace with new lines as part of the message field, for example).
Metadata enrichment: sometimes the log file name carries meaningful data, for example the originating process or server name, and this metadata would be lost once only the file content is loaded into the database.
Regulation restrictions: logs sometimes contain private data (PII) such as names, credit card numbers, or social security numbers that must be removed, hashed, or encrypted before loading into the database.

In our case we run a PySpark notebook that reads the files from a OneLake folder, fixes the new-lines-inside-a-record issue, and creates new files in another OneLake folder. We call this notebook with a base parameter called log_path that defines the location of the log files to read from OneLake.

(Figure: second Fabric activity - running the notebook)

Log Loading

The last step inside the data pipeline, after the transformation phase, calls the Copy data activity again, but this time with a different source and sink:

Source: the Lakehouse folder (the previous notebook's output).
Sink: a specific Eventhouse table created ahead of time; it is basically an empty table (logsraw).

(Figure: last Fabric activity - loading into the Eventhouse)

In summary, the log collection and preparation stage is broken into three data pipeline activities:

Copy activity: read the log files from the source. This is the first step of the log ingestion pipeline, running inside our orchestrator, Data Pipeline.
Run Notebook activity: transform the log files by executing a single notebook or a chain of notebooks.
Copy activity: load the log files into the destination database, the KQL database inside the Eventhouse, into the logs table called logsraw, created ahead of time.

Inside the Eventhouse

We need to create a KQL database with a table to hold the raw ingested log records. A KQL database is a scalable and efficient storage solution for log files, optimized for high-volume data ingestion and retrieval. Eventhouses and KQL databases operate on a fully managed Kusto engine. With an Eventhouse or KQL database, you can expect compute to be available for your analytics within 5 to 10 seconds, and the compute resources grow with your data analytics needs.

Log Ingestion to the KQL Database with an Update Policy

We can separate the ETL transformation logic into what happens to the data before it reaches the Eventhouse KQL database and what happens after.
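For the "before" side, the following is a minimal sketch of the preparation notebook described in the Log Preparation step above. It assumes that every Spark log record starts with a "yy/MM/dd HH:mm:ss" timestamp and uses hard-coded input and output folders in place of the log_path base parameter; both are assumptions for illustration, not the exact notebook used in the pipeline.

import re
from pyspark.sql import SparkSession

# In a Fabric notebook a SparkSession is already available as `spark`;
# this line only makes the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Assumed Lakehouse folders standing in for the notebook's log_path parameter.
raw_path = "Files/logs/raw"
prepared_path = "Files/logs/prepared"

# A record starts with a "yy/MM/dd HH:mm:ss" timestamp; anything else
# (a stack-trace line, for example) is a continuation of the previous record.
record_start = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")

def merge_continuation_lines(content: str):
    merged = []
    for line in content.splitlines():
        if record_start.match(line) or not merged:
            merged.append(line)
        else:
            # Fold the wrapped line back into its parent record.
            merged[-1] = merged[-1] + " " + line.strip()
    return merged

# wholeTextFiles keeps each file together, so multi-line records are not
# split across partitions while we merge them.
cleaned = (
    spark.sparkContext
    .wholeTextFiles(raw_path)
    .flatMap(lambda path_and_content: merge_continuation_lines(path_and_content[1]))
    .map(lambda line: (line,))
)

# Write one cleaned record per line for the final Copy activity to load into the Eventhouse.
cleaned.toDF(["value"]).write.mode("overwrite").text(prepared_path)

Reading each file as a whole keeps a record and its continuation lines together, which is what makes the merge straightforward before the cleaned files are written back to OneLake.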
Before the data reaches the database, the only transformation we apply is the notebook that merges broken lines, as described above. This cannot easily be done as part of the database ingestion logic, simply because when we try to load files in which a field of a record spans several lines, the convention is broken and the ingestion process creates a separate table record for each line of the exception stack trace.

On the other hand, we may need to define basic transformation rules such as date formatting, type conversion (string to number), parsing and extracting interesting values from a string using a regular expression, or creating JSON (dynamic type) from a hierarchical string (an XML or JSON string, for example). For these transformations we can use what is called an update policy, a simple ETL step defined inside the KQL database, as explained here. During this step we create from the logsraw staging table a new table called logsparsed, which will be the destination table for the log queries.

These are the KQL tables defined to hold the log files:

.create table logsraw (timestamp:string, log_level:string, module:string, message:string)

.create table logsparsed (formattedDatetime:datetime, log_level:string, log_module:string, message:string)

This is the update policy that automatically transforms data from the staging table logsraw into the destination table logsparsed (the function's output columns are named to match the destination table's schema):

.create-or-alter function parse_lograw() {
    logsraw
    | project formattedDatetime = todatetime(strcat("20", substring(timestamp, 0, 2), "-", substring(timestamp, 3, 2), "-", substring(timestamp, 6, 2), "T", substring(timestamp, 9, 8))),
              log_level,
              log_module = module,
              message
}

.alter table logsparsed policy update @'[{"IsEnabled": true, "Source": "logsraw", "Query": "parse_lograw()", "IsTransactional": true, "PropagateIngestionProperties": true}]'

Since we don't need to retain the data in the staging table (logsraw), we can define a retention policy with a TTL of 0:

.alter-merge table logsraw policy retention softdelete = 0sec recoverability = disabled

Query Log Files

After the data is ingested and transformed, it lands in a schematized logs table, logsparsed. In general we have some common fields mapped to their own columns, such as log level (INFO/ERROR/DEBUG), log module, and log timestamp (a datetime-typed column), plus the log message, which can be either a simple error string or a complex JSON-formatted string; in the latter case it is usually preferable to convert it to the dynamic type, which brings additional benefits such as simplified query logic and reduced data processing (avoiding expensive joins).

Examples of typical log queries:

Troubleshooting - look for an error in a specific datetime range:

logsparsed
| where message contains "Exception"
    and formattedDatetime between (datetime(2016-07-26T12:10:00) .. datetime(2016-07-26T12:20:00))

Statistics - basic statistics, such as the min/max timestamp of log events:

logsparsed
| summarize minTimestamp=min(formattedDatetime), maxTimestamp=max(formattedDatetime)

Exception statistics - check the distribution of exceptions:

logsparsed
| extend exceptionType = case(message contains "java.io.IOException", "IOException",
                              message contains "java.lang.IllegalStateException", "IllegalStateException",
                              message contains "org.apache.spark.rpc.RpcTimeoutException", "RpcTimeoutException",
                              message contains "org.apache.spark.SparkException", "SparkException",
                              message contains "Exception", "Other Exceptions",
                              "No Exception")
| where exceptionType != "No Exception"
| summarize count() by exceptionType

Log module statistics - check the distribution of log modules:

logsparsed
| summarize count() by log_module
| order by count_ desc
| take 10

Realtime Dashboards

After querying the logs, you can visualize the query results in Real-Time Dashboards; all that is required is to select the query and click Pin to Dashboard. After adding the queries as tiles, this is a typical dashboard we can easily build:

(Figure: example Real-Time Dashboard built from the log queries)

Real-Time Dashboards can be configured to refresh automatically, in which case the user can easily configure how often the queries and visuals refresh; at the extreme it can be as low as Continuous. There are many more capabilities in the Real-Time Dashboard, such as data exploration, alerting using Data Activator, and conditional formatting (changing item colors based on KPI thresholds), and this framework and its capabilities are heavily invested in and keep growing.

What about AI Integration?

Machine learning models: Kusto supports time-series analysis out of the box, allowing, for example, anomaly detection and clustering: https://learn.microsoft.com/en-us/fabric/real-time-intelligence/dashboard-explore-data. If that is not enough for you, you can always mirror the data of your KQL tables into OneLake in Delta Parquet format by selecting OneLake availability. This configuration creates another copy of your data in the open Delta Parquet format, available for any Spark/Python/SparkML/SQL analytics and for whatever machine learning exploration, training, and serving you wish to pursue (a short Spark sketch of this follows the conclusion below). Bear in mind that there is no additional storage cost to turning on OneLake availability.

Conclusion

A well-designed real-time intelligence solution for log file management using Microsoft Fabric and Eventhouse can significantly enhance an organization's ability to monitor, analyze, and respond to log events. By leveraging modern technologies and best practices, organizations can gain valuable insights and maintain robust system performance and security.
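To close, here is a minimal sketch of the kind of exploration that the OneLake availability option mentioned above enables. The abfss path is a placeholder for the mirrored table's actual OneLake location, and the hourly error count is just one example of a feature that could feed an anomaly-detection or forecasting model.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder path: replace with the OneLake location of the mirrored logsparsed table.
mirrored_path = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<eventhouse-item>/Tables/logsparsed"
)

# Read the mirrored KQL table as Delta and compute hourly error counts.
logs = spark.read.format("delta").load(mirrored_path)

hourly_errors = (
    logs.where(F.col("log_level") == "ERROR")
        .groupBy(F.window("formattedDatetime", "1 hour").alias("hour"))
        .agg(F.count("*").alias("error_count"))
        .orderBy("hour")
)

hourly_errors.show(truncate=False)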