Microsoft Fabric
52 TopicsDecision Guide for Selecting an Analytical Data Store in Microsoft Fabric
Learn how to select an analytical data store in Microsoft Fabric based on your workload's data volumes, data type requirements, compute engine preferences, data ingestion patterns, data transformation needs, query patterns, and other factors.5.9KViews6likes4CommentsEfficient Log Management with Microsoft Fabric
Introduction In the era of digital transformation, managing and analyzing log files in real-time is essential for maintaining application health, security, and performance. There are many 3rd party solutions in this area allowing collecting / processing storing, analyzing and acting upon this data source. But sometimes as your systems scale, those solution can become very costly, their cost model increases based on the amount of ingested data and not according to the real resources utilization or customer value This blog post explores a robust architecture leveraging Microsoft Fabric SaaS platform focused on its Realtime Intelligence capabilities for efficient log files collection processing and analysis. The use cases can vary from simple application errors troubleshooting, to more advanced use case such as application trends detection: detecting slowly degrading performance issues: like average user session in the app for specific activities last more than expected to more proactive monitoring using log based KPIs definition and monitoring those APIS for alerts generation Regarding cost , since Fabric provides a complete separation between compute and storage you can grow your data without necessarily growing your compute costs and you still pay for the resources that re used in a pay as you go model. Architecture Overview The proposed architecture integrates Microsoft Fabric’s Real time intelligence (Realtime Hub) with your source log files to create a seamless, near real-time log collection solution It is based on Microsoft Fabric: a SAAS solution which is a unified suite integrating several best of breed Microsoft analytical experiences. Fabric is a modern data/ai platform based on unified and open data formats (parquet/delta) allowing both classical data lakes experiences using both traditional Lakehouse/warehouse SQL analytics as well as real-time intelligence on semi structured data , all in on a lake-centric SaaS platform. Fabric's open foundation with built-in governance enables you to connect to various clouds and tools while maintaining data trust. This is High level Overview of Realtime Intelligence inside Fabric Log events - Fabric based Architecture When looking in more details a solution for log collection processing storage and analysis we propose the following architecture Now let's discuss it in more details: General notes: Since Fabric is a SAAS solution, all the components can be used without deploying any infrastructure in advance, just by a click of a button and very simple configurations you can customize the relevant components for this solution The main components used in this solution are Data Pipeline Onelake and Eventhouse Our data source for this example is taken from this public git repo: https://github.com/logpai/loghub/tree/master/Spark The files were taken and stored inside an S3 bucket to simulate the easiness of the data pipeline integration to external data sources. A typical log file looks like this : 16/07/26 12:00:30 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 59219. 16/07/26 12:00:30 INFO spark.SparkEnv: Registering MapOutputTracker 16/07/26 12:00:30 INFO spark.SparkEnv: Registering BlockManagerMaster 16/07/26 12:00:30 INFO storage.DiskBlockManager: Created local directory at /opt/hdfs/nodemanager/usercache/curi/appcache/application_1460011102909_0176/blockmgr-5ea750cb-dd00-4593-8b55-4fec98723714 16/07/26 12:00:30 INFO storage.MemoryStore: MemoryStore started with capacity 2.4 GB Components Data Pipeline First challenge to solve is how to bring the log files from your system into Fabric this is the Log collection phase: many solutions exist for this phase each with its pros and cons In Fabric the standard approach to bring data in is by use of Copy Activity in ADF or in its Fabric SAAS version is now called Data Pipeline: Data pipeline is a low code / no code tool allowing to manage and automate the process of moving and transforming data within Microsoft Fabric, a serverless ETL tool with more than 100 connectors enabling integration with a wide variety of data sources, including databases, cloud services, file systems, and more. In addition, it supports an on prem agent called self-hosted integration runtime, this agent that you install on a VM, acts as a bridge allowing to run your pipeline on a local VM and securing your connection from on prem network to the cloud Let’s describe in more details our solution data pipeline: Bear in mind ADF is very flexible and supports reading at scale from a wide range of data sources / files integrated as well to all major cloud vendors from blob storage retrieval : like S3, GCS, Oracle Cloud, File systems, FTP/SFTP etc so that even if your files are generated externally to Azure this is not an issue at all. Visualization of Fabric Data Pipeline Log Collection ADF Copy Activity: Inside Data pipeline we will create an Activity called Copy Activity with the following basic config Source: mapped to your data sources: it can be azure blob storage with container containing the log files, other cloud object storage like S3 or GCS , log files will be retrieved in general from a specific container/folder and are fetched based on some prefix/suffix in the file name. To support incremental load process we can configure it to delete the source files that it reads so that once the files are successfully transferred to their target they will be automatically deleted from their source . On the next iteration pipeline will not have to process the same files again. Sink: Onelake/Lakehouse folder: we create ahead of time a Lakehouse which is an abstract data container allowing to hold and manage at scale your data either structured or unstructured, we will then select it from the list of connectors (look for Onelake/Lakehouse) Log Shippers: This is an optional component, sometimes it is not allowed for the ETL to access your OnPrem Vnet , in this case tools like Fluentd , Filebeat , Open Telemetry collector used to forward your application collected logs to the main entry point of the system: the Azure Blob Storage. Azcopy CLI: if you don’t wish to invest into expensive tools and all you need to copy your data in a scale/secure manner to Azure Storage, you might consider create your own log shipping solution based on the free Azcopy tool together withs some basic scripting around it for scheduling: Azcopy is a command-line utility designed for high-performance uploading, downloading, and copying data to and from Microsoft Azure Blob and File storage. Fabric first Activity : Copy from Source Bucket to Lakehouse Log Preparation Upon log files landing in the azure blob storage, EventStream can be used to trigger the Data Pipeline that will handle the data preparation and loading phase. So what is Data preparation phase’s main purpose? After the log files land in the storage and before they are loaded to the realtime logs database the KQL Database , it might be necessary to transform the data with some basic manipulations . The reasons for that might be different A Few examples Bad data formats: for example, sometimes logs files contain problematic characters like new lines inside a row (stack trace error message with new lines as part of the message field of the record) Metadata enrichment: sometimes the log file names contain in their name some meaningful data : for example file name describes the originating process name / server name , so this metadata can be lost once the file content is loaded into database Regulation restrictions: sometimes logs contain private data like names, credit card numbers, social security number etc called PII that must be removed , hashed or encrypted before the load to database In our case we will be running a pyspark notebook who reads the files from Onelake folder, fixes the new lines inside a row issue, and create new files in another Onelake folder, we call this notebook with a base parameter called log_path that defines the log files location on the Onelake to read from Fabric second Activity : Running the Notebook Log Loading Inside Data pipeline , the last step, after the transformation phase, we call again the Copy data activity but this time source and sink are differen: Source: Lakehouse folder (previous notebook output) Sink: Evenhouse specific Table (created ahead of time): it is basically an empty table (lograw) Visualization of Fabric last Activity : Loading to EventHouse In summary for this stage the log collection and preparation: we broke this into 3 data pipeline activities: Copy Activity: Read the log files from source: This is the first step of the log ingestion pipeline it is running inside our orchestrator Data pipeline. Run Notebook Activity : Transform the log files : this is the execution of a single or chain of notebooks Copy Activity : Load the log files into destination datatbase : KQL inside Evenhouse : the logs database table called lograw, it is a specific table created ahead of time inside EventHouse Database Inside The Eventhouse We needed to create a KQL database with a table to hold the raw ingested log records KQL datbase is a scalable and efficient storage solution for log files, optimized for high-volume data ingestion and retrieval. Eventhouses and KQL databases operate on a fully managed Kusto engine. With an Eventhouse or KQL database, you can expect available compute for your analytics within 5 to 10 seconds. The compute resources grow with your data analytic needs. Log Ingestion to KQL Database with Update Policy We can separate the ETL transformation logic of what happens to the data before, it reaches the Eventhouse KQL database and after that. Before it reached the database , the only transformation we did was calling during the data pipeline a notebook to handle the new lines merge logic, This cannot be easily done as part of the database ingestion logic simply because when we try to load the files with new lines as part of a field of a record , it breaks the convention and what happens is that the ingestion process creates separate table records for each new line of the exceptions stacktrace. On the other hand, we might need to define basic transformation rules: such as date formatting, type conversion (string to numbers) , parse and extract some interesting value from a String based on regular exception, create JSON (dynamic type) of a hierarchical string (XML / JSON string etc) for all these transformations we can work with what is called an update policy we can define a simple ETL logic inside KQL database as explained here During this step we create from logsraw staging table a new table called logparsed , that will be our destination final table for the log queries. Those are the KQL Tables defined to hold the log files .create table logsraw ( timestamp:string , log_level:string, module:string, message:string) .create table logsparsed ( formattedDatetime:datetime , log_level:string, log_module:string, message:string) This is the update policy that automatically converts data from, the staging table logsraw to the destination table logparsed .create-or-alter function parse_lograw() { logsraw | project formattedtime = todatetime(strcat("20", substring(timestamp, 0, 2), "-", substring(timestamp, 3, 2), "-", substring(timestamp, 6, 2), "T", substring(timestamp, 9, 8))), log_level, logmodule=module, message } .alter table logsparsed policy update @'[{ "IsEnabled": true, "Source": "logsraw", "Query": "parse_lograw()", "IsTransactional": true, "PropagateIngestionProperties": true}]' Since we don't need to retain the data in the staging table (lograw) we can define a retention policy of 0 TTL like this : .alter-merge table logsraw policy retention softdelete = 0sec recoverability = disabled Query Log files After data is ingested and transformed it lands in a basic logs table that is schematized : logparsed, in general we have some common fields that are mapped to their own columns like : log level (INFO/ ERROR/ DEBUG) , log category , log timestamp (a datetime typed column) and log message which can be in general either a simple error string or a complex JSON formatted string in which case it is usually preferred to be converted to dynamic type that will bring additional benefits like simplified query logic, and reduced data processing (to avoid expensive joins) Example for Typical Log Queries Category Purpose KQL Query Troubleshooting Looking for an error at specific datetime range logsparsed | where message contains "Exception" and formattedDatetime between ( datetime(2016-07-26T12:10:00) .. datetime(2016-07-26T12:20:00)) Statistics Basic statistics Min/Max timestamp of log events logsparsed | summarize minTimestamp=min(formattedDatetime), maxTimestamp=max(formattedDatetime) Exceptions Stats Check Exceptions Distributions logsparsed | extend exceptionType = case(message contains "java.io.IOException","IOException", message contains "java.lang.IllegalStateException","IllegalStateException", message contains "org.apache.spark.rpc.RpcTimeoutException", "RpcTimeoutException", message contains "org.apache.spark.SparkException","SparkException", message contains "Exception","Other Exceptions", "No Exception") | where exceptionType != "No Exception" | summarize count() by exceptionType Log Module Stats Check Modules Distribution logsparsed | summarize count() by log_module | order by count_ desc | take 10 Realtime Dashboards After querying the logs, it is possible to visualize the query results in Realtime dashboards, for that all what’s required Select the query Click on Pin to Dashboard After adding the queries to tiles inside the dashboard this is a typical dashboard we can easily build: Realtime dashboards can be configured to be refreshed in Realtime like illustrated here: in which case user can very easily configure how often to refresh the queries and visualization : at the extreme case it can be as low as Continuus There are many more capabilities implemented in the Real-Time Dashboard, like data exploration Alerting using Data Activator , conditional formatting (change items colors based on KPIs threshold) and this framework and capabilities are heavily invested and keep growing. What about AI Integration ? Machine Learning Models: Kusto supports out of the box time series analysis allowing for example anomaly detection: https://learn.microsoft.com/en-us/fabric/real-time-intelligence/dashboard-explore-data and clustering but if it’s not enough for you, you can always mirror the data of your KQL tables into Onelake delta parquet format by selecting OneLake availability This configuration will create another copy of your data in open format delta parquet : you have it available for any Spark/Python/SparkML/SQL analytics for whatever machine learning exploration and ML modeling you wish to explore train and serve This is illustrated here : Bear in mind , there is no additional storage cost to turn on OneLake availability Conclusion A well-designed real-time intelligence solution for log file management using Microsoft Fabric and EventHouse can significantly enhance an organization’s ability to monitor, analyze, and respond to log events. By leveraging modern technologies and best practices, organizations can gain valuable insights and maintain robust system performance and security.639Views0likes0CommentsUnleash the power of data and generative AI with Microsoft Cloud for Manufacturing
Microsoft Cloud for Manufacturing uses data and generative AI to enhance operational efficiency and safety in the manufacturing sector. Learn more about the manufacturing data solutions in Microsoft Fabric, Factory Operations Agent in Azure AI, and the new Factory Safety Agent which help manufacturers unify and standardize data, gain real-time insights, and improve decision-making processes.689Views2likes0CommentsWhat’s Included with Microsoft’s Granted Offerings for Nonprofits?
Are you a nonprofit looking to boost your impact with cutting-edge technology? Microsoft is here to help! From free software licenses to guided technical documentation and support, this program offers a range of resources designed to empower your organization. In this blog, we’ll dive into the incredible tools and grants available to nonprofits through Microsoft, showing you how to make the most of these generous offerings. Whether you’re managing projects or just trying to simplify your day-to-day tasks, there’s something here for everyone. Let’s explore what’s possible!352Views0likes0CommentsMGDC for SharePoint FAQ: How to flatten datasets for SQL or Fabric
When you get your data from Microsoft Graph Data Connect (MGDC), you will typically get that data as a collection of JSON objects in an Azure Data Lake Storage (ADLS) Gen2 storage account. For those handling large datasets, it might be useful to move the data to a SQL Server or to OneLake (lakehouse). In those cases, you might need to flatten the datasets. This post describes how to do that. If you’re not familiar with MGDC for SharePoint, start with https://aka.ms/SharePointData. 1. Flattening Most of the MGDC for SharePoint datasets come with nested objects. That means that a certain object has other objects inside it. For instance, if you have a SharePoint Groups object, it might have multiple Group Members inside. If you have a SharePoint Permissions object, you could have many Permissions Recipients (also known as Sharees). For each SharePoint File object, you will have a single Author object inside. When you convert the datasets from JSON to other formats, it is possible that these other formats require (or perform better) if you don’t have any objects inside objects. To overcome that, you can turn those child objects into properties of the parent object. For instance, instead of having the File object with an Author object inside, you can have multiple author-related columns. For instance, you could have Author.Name and Author.Email as properties of the flattened File object. 2. Nested Objects You can get the full list of SharePoint datasets in MGDC at https://aka.ms/SharePointDatasets. Here is a table with a list of objects and their nested objects: Object How many? Primary Key Nested Object How many? Add to Primary Key Sites 1 per Site Id RootWeb 1 per Site Sites 1 per Site Id StorageMetrics 1 per Site Sites 1 per Site Id SensitivityLabelInfo 1 per Site Sites 1 per Site Id Owner 1 per Site Sites 1 per Site Id SecondaryContact 1 per Site Groups 1 per Group SiteId + GroupId Owner 1 per Group Groups 1 per Group SiteId + GroupId Members 1 per Member COALESCE(AADObjectId, Email, Name) Permissions 1 per Permission SiteId + ScopeId + RoleDefintion + LinkId SharedWithCount 1 per Recipient Type Type Permissions 1 per Permission SiteId + ScopeId + RoleDefintion + LinkId SharedWith 1 per Recipient or Sharee COALESCE(AADObjectId, Email, Name) Files 1 per File SiteId + WebId + ListId + ItemId Author 1 per File Files 1 per File SiteId + WebId + ListId + ItemId ModifiedBy 1 per File When you flatten a dataset and there is an object with multiple objects inside (like Group Members or Permission Recipients), the number of rows will increase. You also need to add to primary key to keep it unique. Also note that the File Actions, Sync Health and Sync Errors datasets do not have any nested objects. 3. One Object per Parent When the nested object has only one instance, things are simple. As we described for the Author nested object inside the File object, you promote the properties of the nested object to be properties of the parent object. This is because the Author is defined as the user that initially created the file. There is always one and only one Author. This can happen even happen multiple times for the same object. The File also has a ModifiedBy property. That is the single user that last changed the file. In that case, there is also only one ModifiedBy per File. The Site object also includes several properties in this style, like RootWeb, StorageMetrics, SensitivityLabelInfo, Owner and SecondaryContact. Note that, in the context of the Site object, there is only one owner. Actually two, but that second one is tracked in a separate object called SecondaryContact which is effectively the secondary owner. 4. Multiple Objects per Parent The SharePoint Permissions dataset has a special condition that might create trouble for flattening. There are two sets of nested objects with multiple objects each: SharedWith and SharedWithCount. SharedWith has the list of Recipients and SharedWithCount has a list of Recipient Types. If you just let the tools flatten it, you will end up a cross join of the two. As an example, if you have 4 recipients in an object and 2 types of recipients (internal users and external users, for instance) you will end up with 20 objects in the flattened dataset instead of the expected 10 objects (one per recipient). To avoid this, in this specific condition, I would recommend just excluding the SharedWithCount column from the object before flattening. 5. Conclusion I hope this clarifies how you can flatten the MGDC for SharePoint datasets, particularly SharePoint Permissions dataset. For further details about the MGDC for SharePoint, https://aka.ms/SharePointData.Optimizing fleet management with Microsoft connected fleets reference architecture
Discover how Microsoft and its partners are revolutionizing fleet management with cutting-edge technology. In this blog we explore the integration of real-time analytics, telematics, and business applications to create efficient, safe, and cost-effective fleet operations. Learn about the innovative solutions from industry leaders like Accenture, Bosch, TomTom, Connected Cars, Annata, HERE Technologies, DSA Daten-und Systemtechnik GmbH, and more. Dive into the future of fleet management with composable, modular, and flexible solutions that adapt to the fast-moving and interconnected world.2.8KViews2likes0CommentsSimplifying Migration to Fabric Real-Time Intelligence for Power BI Real Time Reports
Power BI with real-time streaming has been the preferred solution for users to visualize streaming data. Real-Time streaming in PowerBI is being retired. We recommend users to start planning the migration of their data processing pipeline to Fabric Real-Time Intelligence.2.5KViews2likes0Comments