Delta Lake
41 TopicsEfficient Log Management with Microsoft Fabric
Introduction In the era of digital transformation, managing and analyzing log files in real-time is essential for maintaining application health, security, and performance. There are many 3rd party solutions in this area allowing collecting / processing storing, analyzing and acting upon this data source. But sometimes as your systems scale, those solution can become very costly, their cost model increases based on the amount of ingested data and not according to the real resources utilization or customer value This blog post explores a robust architecture leveraging Microsoft Fabric SaaS platform focused on its Realtime Intelligence capabilities for efficient log files collection processing and analysis. The use cases can vary from simple application errors troubleshooting, to more advanced use case such as application trends detection: detecting slowly degrading performance issues: like average user session in the app for specific activities last more than expected to more proactive monitoring using log based KPIs definition and monitoring those APIS for alerts generation Regarding cost , since Fabric provides a complete separation between compute and storage you can grow your data without necessarily growing your compute costs and you still pay for the resources that re used in a pay as you go model. Architecture Overview The proposed architecture integrates Microsoft Fabric’s Real time intelligence (Realtime Hub) with your source log files to create a seamless, near real-time log collection solution It is based on Microsoft Fabric: a SAAS solution which is a unified suite integrating several best of breed Microsoft analytical experiences. Fabric is a modern data/ai platform based on unified and open data formats (parquet/delta) allowing both classical data lakes experiences using both traditional Lakehouse/warehouse SQL analytics as well as real-time intelligence on semi structured data , all in on a lake-centric SaaS platform. Fabric's open foundation with built-in governance enables you to connect to various clouds and tools while maintaining data trust. This is High level Overview of Realtime Intelligence inside Fabric Log events - Fabric based Architecture When looking in more details a solution for log collection processing storage and analysis we propose the following architecture Now let's discuss it in more details: General notes: Since Fabric is a SAAS solution, all the components can be used without deploying any infrastructure in advance, just by a click of a button and very simple configurations you can customize the relevant components for this solution The main components used in this solution are Data Pipeline Onelake and Eventhouse Our data source for this example is taken from this public git repo: https://github.com/logpai/loghub/tree/master/Spark The files were taken and stored inside an S3 bucket to simulate the easiness of the data pipeline integration to external data sources. A typical log file looks like this : 16/07/26 12:00:30 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 59219. 16/07/26 12:00:30 INFO spark.SparkEnv: Registering MapOutputTracker 16/07/26 12:00:30 INFO spark.SparkEnv: Registering BlockManagerMaster 16/07/26 12:00:30 INFO storage.DiskBlockManager: Created local directory at /opt/hdfs/nodemanager/usercache/curi/appcache/application_1460011102909_0176/blockmgr-5ea750cb-dd00-4593-8b55-4fec98723714 16/07/26 12:00:30 INFO storage.MemoryStore: MemoryStore started with capacity 2.4 GB Components Data Pipeline First challenge to solve is how to bring the log files from your system into Fabric this is the Log collection phase: many solutions exist for this phase each with its pros and cons In Fabric the standard approach to bring data in is by use of Copy Activity in ADF or in its Fabric SAAS version is now called Data Pipeline: Data pipeline is a low code / no code tool allowing to manage and automate the process of moving and transforming data within Microsoft Fabric, a serverless ETL tool with more than 100 connectors enabling integration with a wide variety of data sources, including databases, cloud services, file systems, and more. In addition, it supports an on prem agent called self-hosted integration runtime, this agent that you install on a VM, acts as a bridge allowing to run your pipeline on a local VM and securing your connection from on prem network to the cloud Let’s describe in more details our solution data pipeline: Bear in mind ADF is very flexible and supports reading at scale from a wide range of data sources / files integrated as well to all major cloud vendors from blob storage retrieval : like S3, GCS, Oracle Cloud, File systems, FTP/SFTP etc so that even if your files are generated externally to Azure this is not an issue at all. Visualization of Fabric Data Pipeline Log Collection ADF Copy Activity: Inside Data pipeline we will create an Activity called Copy Activity with the following basic config Source: mapped to your data sources: it can be azure blob storage with container containing the log files, other cloud object storage like S3 or GCS , log files will be retrieved in general from a specific container/folder and are fetched based on some prefix/suffix in the file name. To support incremental load process we can configure it to delete the source files that it reads so that once the files are successfully transferred to their target they will be automatically deleted from their source . On the next iteration pipeline will not have to process the same files again. Sink: Onelake/Lakehouse folder: we create ahead of time a Lakehouse which is an abstract data container allowing to hold and manage at scale your data either structured or unstructured, we will then select it from the list of connectors (look for Onelake/Lakehouse) Log Shippers: This is an optional component, sometimes it is not allowed for the ETL to access your OnPrem Vnet , in this case tools like Fluentd , Filebeat , Open Telemetry collector used to forward your application collected logs to the main entry point of the system: the Azure Blob Storage. Azcopy CLI: if you don’t wish to invest into expensive tools and all you need to copy your data in a scale/secure manner to Azure Storage, you might consider create your own log shipping solution based on the free Azcopy tool together withs some basic scripting around it for scheduling: Azcopy is a command-line utility designed for high-performance uploading, downloading, and copying data to and from Microsoft Azure Blob and File storage. Fabric first Activity : Copy from Source Bucket to Lakehouse Log Preparation Upon log files landing in the azure blob storage, EventStream can be used to trigger the Data Pipeline that will handle the data preparation and loading phase. So what is Data preparation phase’s main purpose? After the log files land in the storage and before they are loaded to the realtime logs database the KQL Database , it might be necessary to transform the data with some basic manipulations . The reasons for that might be different A Few examples Bad data formats: for example, sometimes logs files contain problematic characters like new lines inside a row (stack trace error message with new lines as part of the message field of the record) Metadata enrichment: sometimes the log file names contain in their name some meaningful data : for example file name describes the originating process name / server name , so this metadata can be lost once the file content is loaded into database Regulation restrictions: sometimes logs contain private data like names, credit card numbers, social security number etc called PII that must be removed , hashed or encrypted before the load to database In our case we will be running a pyspark notebook who reads the files from Onelake folder, fixes the new lines inside a row issue, and create new files in another Onelake folder, we call this notebook with a base parameter calledlog_path that defines the log files location on the Onelake to read from Fabric second Activity : Running the Notebook Log Loading Inside Data pipeline , the last step, after the transformation phase, we call again the Copy data activity but this time source and sink are differen: Source: Lakehouse folder (previous notebook output) Sink: Evenhouse specific Table (created ahead of time): it is basically an empty table (lograw) Visualization of Fabric last Activity : Loading to EventHouse In summary for this stage the log collection and preparation: we broke this into 3 data pipeline activities: Copy Activity: Read the log files from source: This is the first step of the log ingestion pipeline it is running inside our orchestrator Data pipeline. Run Notebook Activity : Transform the log files : this is the execution of a single or chain of notebooks Copy Activity : Load the log files into destination datatbase : KQL inside Evenhouse : the logs database table called lograw, it is a specific table created ahead of time inside EventHouse Database Inside The Eventhouse We needed to create aKQL database with a table to hold the raw ingested log records KQL datbase is a scalable and efficient storage solution for log files, optimized for high-volume data ingestion and retrieval. Eventhouses and KQL databases operate on a fully managed Kusto engine. With an Eventhouse or KQL database, you can expect available compute for your analytics within 5 to 10 seconds. The compute resources grow with your data analytic needs. Log Ingestion to KQL Database with Update Policy We can separate the ETL transformation logic of what happens to the data before, it reaches the Eventhouse KQL database and after that. Before it reached the database , the only transformation we did was calling during the data pipeline a notebook to handle the new lines merge logic, This cannot be easily done as part of the database ingestion logic simply because when we try to load the files with new lines as part of a field of a record , it breaks the convention and what happens is that the ingestion process creates separate table records for each new line of the exceptions stacktrace. On the other hand, we might need to define basic transformation rules: such as date formatting, type conversion (string to numbers) , parse and extract some interesting value from a String based on regular exception, create JSON (dynamic type) of a hierarchical string (XML / JSON string etc) for all these transformations we can work with what is called an update policy we can define a simple ETL logic inside KQL database as explained here During this step we create from logsraw staging table a new table calledlogparsed , that will be our destination final table for the log queries. Those are the KQL Tables defined to hold the log files .create table logsraw ( timestamp:string , log_level:string, module:string, message:string) .create table logsparsed ( formattedDatetime:datetime , log_level:string, log_module:string, message:string) This is the update policy that automatically converts data from, the staging table logsraw to the destination table logparsed .create-or-alter function parse_lograw() { logsraw | project formattedtime = todatetime(strcat("20", substring(timestamp, 0, 2), "-", substring(timestamp, 3, 2), "-", substring(timestamp, 6, 2), "T", substring(timestamp, 9, 8))), log_level, logmodule=module, message } .alter table logsparsed policy update @'[{ "IsEnabled": true, "Source": "logsraw", "Query": "parse_lograw()", "IsTransactional": true, "PropagateIngestionProperties": true}]' Since we don't need to retain the data in the staging table (lograw) we can define a retention policy of 0 TTL like this : .alter-merge table logsraw policy retention softdelete = 0sec recoverability = disabled Query Log files After data is ingested and transformed it lands in a basic logs table that is schematized : logparsed, in general we have some common fields that are mapped to their own columns like : log level (INFO/ ERROR/ DEBUG) , log category , log timestamp (a datetime typed column) and log message which can be in general either a simple error string or a complex JSON formatted string in which case it is usually preferred to be converted to dynamic type that will bring additional benefits like simplified query logic, and reduced data processing (to avoid expensive joins) Example for Typical Log Queries Category Purpose KQL Query Troubleshooting Looking for an error at specific datetime range logsparsed | where message contains "Exception" and formattedDatetime between ( datetime(2016-07-26T12:10:00) .. datetime(2016-07-26T12:20:00)) Statistics Basic statistics Min/Max timestamp of log events logsparsed | summarize minTimestamp=min(formattedDatetime), maxTimestamp=max(formattedDatetime) Exceptions Stats Check Exceptions Distributions logsparsed | extend exceptionType = case(message contains "java.io.IOException","IOException", message contains "java.lang.IllegalStateException","IllegalStateException", message contains "org.apache.spark.rpc.RpcTimeoutException", "RpcTimeoutException", message contains "org.apache.spark.SparkException","SparkException", message contains "Exception","Other Exceptions", "No Exception") | where exceptionType != "No Exception" | summarize count() by exceptionType Log Module Stats Check Modules Distribution logsparsed | summarize count() by log_module | order by count_ desc | take 10 Realtime Dashboards After querying the logs, it is possible to visualize the query results in Realtime dashboards, for that all what’s required Select the query Click on Pin to Dashboard After adding the queries to tiles inside the dashboard this is a typical dashboard we can easily build: Realtime dashboards can be configured to be refreshed in Realtime like illustrated here: in which case user can very easily configure how often to refresh the queries and visualization : at the extreme case it can be as low as Continuus There are many more capabilities implemented in the Real-Time Dashboard, likedata exploration Alerting using Data Activator , conditional formatting (change items colors based on KPIs threshold) and this framework and capabilities are heavily invested and keep growing. What about AI Integration ? Machine Learning Models: Kusto supports out of the box time series analysis allowing for example anomaly detection: https://learn.microsoft.com/en-us/fabric/real-time-intelligence/dashboard-explore-data and clustering but if it’s not enough for you, you can always mirror the data of your KQL tables into Onelake delta parquet format by selecting OneLake availability This configuration will create another copy of your data in open format delta parquet : you have it available for any Spark/Python/SparkML/SQL analytics for whatever machine learning exploration and ML modeling you wish to explore train and serve This is illustrated here : Bear in mind , there is no additional storage cost to turn on OneLake availability Conclusion A well-designed real-time intelligence solution for log file management using Microsoft Fabric and EventHouse can significantly enhance an organization’s ability to monitor, analyze, and respond to log events. By leveraging modern technologies and best practices, organizations can gain valuable insights and maintain robust system performance and security.234Views0likes0CommentsEnhanced autoscale capabilities in HDInsight clusters
HDInsight now has enhanced capabilities which include improved latency, and feedback loop alongside support for recommissioning nodemanagers in case of load-aware autoscale which improves cluster utilization massively and lowers the total cost of ownership significantly.4.9KViews2likes1CommentImplementing Business Logic using Data Vault 2.0 on Azure Fabric
This Article isAuthored ByMichael Olschimke, co-founder and CEO atScalefreeInternational GmbHandCo-authored with Kilian GrünhagenSeniorBI ConsultantfromScalefree TheTechnical Review is done byIan ClarkeandNaveed Hussain – GBBs (Cloud Scale Analytics) for EMEA at Microsoft Business logic serves an important role in the data-driven data platform. There is a business expectation of the information to be delivered. This expectation can be defined by two characteristics: the user expects the information in a certain structure (often a dimensional model) and they expect certain content, for example, all currency amounts to be in Euros. But there is a gap between these expectations and the actual data from the data sources. In the Data Vault 2.0 architecture, the Business Vault is used to bridge this gap and focuses on implementing the business logic to transform the data to meet the content expectations of the business users. Introduction The previous articles focused on modelling and implementing the Raw Data Vault, where the raw data is captured by integrating the data on shared business keys and their relationships and versioning all changes to descriptive data. In the Raw Data Vault layer, only so-called hard rules are applied. Hard rules don’t change the meaning of the content, they only change the structure of the incoming data set. The creative destruction of the data into business keys (stored in hubs), relationships between business keys (stored in links), and descriptive data (stored in satellites) is a prime example. But also the data type alignment to match the data types of the (often relational) data model is a good example: changing the data type of a whole number from a CSV string into an integer is not changing the content, but only the structure of the data. This is important to ensure auditability when the original data deliveries must be reproduced and to create multiple business perspectives when information users cannot agree on shared business logic or the definition of their concepts (“Who is a customer?”). In both cases, the unmodified raw data is required. How to Model the Business Vault The business logic to transform the raw data into useful information, for example by cleansing it, recomputing foreign currency amounts, or standardizing addresses, is implemented in the Business Vault. This sparsely modelled layer sits right between the Raw Data Vault with its unmodified raw data and the information mart where the final information in the expected structure and with the expected content is delivered to the presentation layer. “Sparsely modelled” refers to the fact that, believe it or not, some of your data is good enough for reporting. There is no need to cleanse, standardize, or otherwise transform the data because it is exactly what the information user expects. In this case, the dimensional entities in the information mart are directly derived from the Raw Data Vault entities. However, if business logic needs to be applied, it's done in the Business Vault. The entities are typically modelled in the same way as in the Raw Data Vault, so one can expect hubs, links, and many satellites in the Business Vault, including special entity types such as multi-active satellites, non-historized links, etc. For example, in the above diagram, there are the invoice hub and its two satellites originating from the CRM and ERP system. In addition to these Raw Data Vault entities, a computed (or business) satellite invoice_lroc_bsat with one calculated attribute for the invoice amount has been added. But in either case, with or without additional Business Vault entities, it also means that the final information is not done yet, as the desired target structure (e.g., a dimensional model) is not created yet. This model will be derived from the information mart. To do so, the dimensional modeler can combine (and often has to combine) entities from the Raw Data Vault and the Business Vault. Implementing Business Logic on Fabric In many cases, the business logic is relatively simple and can be implemented in SQL. In such cases, an SQL view is the preferred choice and is used to implement the Business Vault entity. If the business logic becomes too complex or the performance of the view is not as desired, an external script might be used as an alternative. For example, a Python script could retrieve data from the Raw Data Vault (but also from the Business Vault) and write the results into a table in the Business Vault. This external Python script is considered to be part of the Business Vault as long as the data platform team has it under its own version control. Keep in mind that there are other options, such as PIT tables, to improve the performance of virtualized entities in the Data Vault architecture. Note that there are actually two options to implement business logic in the Data Vault architecture: besides the discussed option in the Business Vault, it is also possible to implement business rules directly in the information marts, for example in dimension or fact entities. However, when doing so, the business logic is not re-usable. If the same business logic should be used for dimensions in multiple information marts, the business logic must be replicated. If the business logic is implemented in the Business Vault instead, the entities can be reused by multiple information marts. The Business Vault is often implemented in a cascading fashion: a Business Vault entity is not limited to a Raw Data Vault entity as its only data source. Instead, a Business Vault entity can source from multiple entities, both from the Raw Data Vault and other Business Vault entities. By doing so, the overall implementation logic is cascading across multiple Business Vault entities, which is a typical scenario. In some cases, developers try to avoid this, but end up with a Business Vault entity with complex implementation logic. From an auditing perspective, there is one more requirement: It should be possible to truncate a materialized Business Vault entity and rebuild it by applying the same, unmodified business logic to the same, unmodified source data. The results in the Business Vault entity must be the same. If this is not the case, either the source data has been modified or the business logic. Cleansing Dirty Data using Computed Satellites A typical entity type in the Business Vault is the computed satellite, also known as the business satellite. To be short: it’s just a satellite in the Business Vault. The only difference to its counterpart in the Raw Data Vault is that it contains computed results, not just raw data. This makes sense as descriptive data is stored in a satellite in the Raw Data Vault and subject to the application of business logic, for example to cleanse the data, standardize addresses and phone numbers and otherwise increase the value of the data. For example, if the Raw Data Vault satellite captures raw data from the data source, it might be erroneous (e.g. on the city name): This data is captured as it is in the Raw Data Vault of the Data Vault architecture, completely unmodified. The goal of the Raw Data Vault is to capture the good, the bad, and the ugly data, and no judgment is made about these categories. As discussed at the beginning of this article, data cleansing is part of the Business Vault, not of the Raw Data Vault. The next diagram shows the relationship between the computed satellite with the computed city attribute, the Raw Data Vault satellite, and their shared hub: The computed satellite is attached to the same hub, as it still describes the same store, just with cleansed data. In this case, the business logic is simple: data is cleansed by joining into a mapping table for the city name based on the raw data. For each city in the Raw Data Vault satellite, there is a mapping in the reference data for mapping the original city name to the cleansed city name: CREATE VIEW [dv_core].[store_address_crm_lroc_bsat] AS ( SELECT hk_store_hub ,load_datetime ,record_source ,hd_store_address_crm_lroc_sat ,address_street ,postal_code , CASE WHEN store.city != cities.CityName AND cities.ZipCode IS NOT NULL THEN cities.CityName ELSE store.city END AS city ,country FROM [dv_core].[store_address_crm_lroc0_sat] store LEFT JOIN [MS_BLOG_DWH].[dv_core].[city_ref_sat] cities on store.postal_code = cities.ZipCode); It is not uncommon for a lot of business logic to be implemented as simply as the above code. This is achieved by providing a materialized mapping table between dirty data and cleansed data as a reference hub and satellite. Once the computed satellite is deployed, the downstream developer for the information mart can now choose between the cleansed address data or the original address data to be used for the dimension entity by joining the appropriate satellite entity. Dealing with Duplicate Records Another often-used entity in the Business Vault is the Same-As-Link (SAL). The name of the SAL stems from the sentence, “This business key identifies the same business object as the other business key.” So, the link relates two business keys in a duplicate-to-master relationship. One business key identifies the master key to be used for reporting, and the other identifies the duplicate key. If the data contains multiple duplicates, multiple duplicate business keys might be mapped to the same master key. For example, the following table shows a source data set with duplicate records: There are different variations of Michael Olschimke, and because the operational system did not recognize that all of them refer to the same actual customer, the operational system assigned separate business keys to each record. The business key customer_id is captured by the hub customer_hub, while the name is captured by a satellite, not shown in the following diagram: Based on the descriptive data in the satellite, SQL Server’s SOUNDEX function (supported in Fabric Warehouse) can be used to calculate the similarity of two strings, based on the pronunciation of the text. The matches where the similarity is above a certain threshold are considered as duplicates and added to the same-as-link (SAL). That way, duplicates are marked and the mapping can be used later to retrieve a deduplicated dimension. The following code shows the virtual implementation of the same-as-link: CREATE VIEW [dv_core].[customer_sal] AS WITH crm_data AS ( SELECT crm.hk_customer_hub , crm.load_datetime , hub.record_source , hub.customer_id , crm.name , SOUNDEX(crm.name) as soundex_name , crm.email FROM dv_core.customer_hub hub LEFT JOIN dv_core.customer_crm_lroc_sat crm ON hub.hk_customer_hub = crm.hk_customer_hub AND crm.is_current = 0 ) , shop_data AS ( SELECT shop.hk_customer_hub , shop.load_datetime , hub.record_source , hub.customer_id , shop.name , SOUNDEX(shop.name) as soundex_name , shop.email FROM dv_core.customer_hub hub LEFT JOIN dv_core.customer_shop_lroc_sat shop ON hub.hk_customer_hub = shop.hk_customer_hub AND shop.is_current = 0 ) SELECT hk_customer_hub , load_datetime , record_source , hk_master , hk_duplicate FROM( SELECT crm.hk_customer_hub AS hk_customer_hub, LEAST(crm.load_datetime, shop.load_datetime) AS load_datetime, 'https://wiki.scalefree.com/business_rules/unique_customers' AS record_source, crm.hk_customer_hub AS hk_master, shop.hk_customer_hub AS hk_duplicate, DIFFERENCE(crm.name, shop.name) AS similarity_score FROM crm_data crm LEFT JOIN shop_data shop ON crm.soundex_name = shop.soundex_name )level1 WHERE level1.similarity_score = 4; Once the same-as-link is created, it can be joined with the hub to reduce the duplicates to the master record based on the SoundEx function. The actual business logic is implemented in the view - the users who query this model don’t necessarily need to know how to apply SoundEx - they just use the results by joining the link to the customer hub. Concluding the Value of the Business Vault By implementing the business logic in Business Vault entities, the business logic can be used by multiple information marts but also data scientists and other power users. The Business Vault model presents the result of the business logic, while the source code implements it in many cases. However, in other cases, the business logic could also be implemented in external scripts, such as Python. In this case, the Business Vault entity would be a physical table that is loaded from the external script. There are some cases, where it makes more sense to implement the business logic in the dimensional entity - for example in the dimension view. However, in such cases, the code will not be reused by multiple information marts. To use the same logic in multiple marts, the logic must be replicated. We will discuss the information marts and how we derive dimensional models from a Data Vault model in the next article of this series. <<< Back to Blog Series Title Page282Views0likes0CommentsExternal Data Sharing With Microsoft Fabric
The demands and growth of data for external analytics consumption is rapidly growing. There are many options to share data externally and the field is very dynamic. One of the most frictionless and easy onboarding steps for external data sharing we will explore is with Microsoft Fabric. This external data allows users to share data from their tenant with users in another Microsoft Fabric tenant.5.1KViews3likes1CommentImplementing Data Vault 2.0 on Fabric Data Warehouse
In the previous articles of this series, we have discussed how to model Data Vault on Microsoft Fabric. Our initial focus was on the basic entity types including hubs, links, and satellites; advanced entity types, such as non-historized links and multi-active satellites and the third article was modeling a more complete model, including a typical modeling process, for Microsoft Dynamics CRM data.6.9KViews0likes1CommentPart 1: Migrate Azure Analysis Services to Power BI Premium using Azure Databricks - Why
The migration from Azure Analysis Services to Power BI Premium and the move to Azure Databricks SQL as the underlying data source for Power BI leads to great performance, enhanced features, and cost savings.2.8KViews0likes0CommentsRevolutionizing Data Intelligence: Azure Databricks Updates
Data Intelligence Platform in Azure Databricks is revolutionizing the Data and AI landscape. This fully managed service, which is built onLakehouse architecture supported by Delta Lake, and is integrated with Microsoft Azure cloud capabilities, streamlines data, analytics, and AI initiatives by removing infrastructure concerns.The close partnership between Databricks and Microsoft enhances this integration, enabling users to focus on their data and AI goals and makes Azure the optimal public cloud for Databricks.3.6KViews2likes0CommentsAnnouncing Mosaic AI Vector Search General Availability in Azure Databricks
Today, at Microsoft Build, we are thrilled to announce the general availability of Mosaic AI Vector Search in Azure Databricks. Vector Search is a serverless vector database that helps customers build high-quality Generative AI applications using Retrieval Augmented Generation (RAG). With its native integration in Azure Databricks, Vector Search supports automatic data synchronization from source to index, eliminating complex and costly pipeline maintenance. It also leverages the same security and data governance tools organizations have already built for peace of mind.3.7KViews2likes0Comments