Introduction
In the era of digital transformation, managing and analyzing log files in real time is essential for maintaining application health, security, and performance. There are many third-party solutions in this area for collecting, processing, storing, analyzing, and acting upon this data source. But as your systems scale, those solutions can become very costly: their pricing models typically grow with the amount of ingested data rather than with actual resource utilization or customer value.
This blog post explores a robust architecture leveraging the Microsoft Fabric SaaS platform, focused on its Real-Time Intelligence capabilities, for efficient log file collection, processing, and analysis.
The use cases vary from simple troubleshooting of application errors, to more advanced scenarios such as detecting application trends (for example, slowly degrading performance, where the average user session for specific activities lasts longer than expected), to more proactive monitoring based on log-derived KPIs that are tracked to generate alerts.
Regarding cost, since Fabric provides complete separation between compute and storage, you can grow your data without necessarily growing your compute costs, and you pay only for the resources that are actually used, in a pay-as-you-go model.
Architecture Overview
The proposed architecture integrates Microsoft Fabric's Real-Time Intelligence (Real-Time Hub) with your source log files to create a seamless, near real-time log collection solution.
It is based on Microsoft Fabric: a SaaS solution that is a unified suite integrating several best-of-breed Microsoft analytical experiences. Fabric is a modern data/AI platform built on unified and open data formats (Parquet/Delta), enabling both classic Lakehouse/warehouse SQL analytics and real-time intelligence on semi-structured data, all on a lake-centric SaaS platform. Fabric's open foundation with built-in governance lets you connect to various clouds and tools while maintaining data trust.
This is a high-level overview of Real-Time Intelligence inside Fabric:
Log Events - Fabric-Based Architecture
Looking in more detail at a solution for log collection, processing, storage, and analysis, we propose the following architecture.
Now let's discuss it in more detail:
General notes:
Since Fabric is a SaaS solution, all the components can be used without deploying any infrastructure in advance; with a click of a button and some very simple configuration you can customize the relevant components for this solution.
The main components used in this solution are Data Pipeline, OneLake, and Eventhouse.
Our data source for this example is taken from this public git repo:
https://github.com/logpai/loghub/tree/master/Spark
The files were stored in an S3 bucket to demonstrate how easily the data pipeline integrates with external data sources.
A typical log file looks like this:
16/07/26 12:00:30 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 59219.
16/07/26 12:00:30 INFO spark.SparkEnv: Registering MapOutputTracker
16/07/26 12:00:30 INFO spark.SparkEnv: Registering BlockManagerMaster
16/07/26 12:00:30 INFO storage.DiskBlockManager: Created local directory at /opt/hdfs/nodemanager/usercache/curi/appcache/application_1460011102909_0176/blockmgr-5ea750cb-dd00-4593-8b55-4fec98723714
16/07/26 12:00:30 INFO storage.MemoryStore: MemoryStore started with capacity 2.4 GB
Components
Data Pipeline
The first challenge to solve is how to bring the log files from your system into Fabric; this is the log collection phase. Many solutions exist for this phase, each with its pros and cons.
In Fabric, the standard approach to bring data in is the Copy activity of Azure Data Factory (ADF), which in its Fabric SaaS version is now called a Data Pipeline:
A Data Pipeline is a low-code/no-code tool for managing and automating the process of moving and transforming data within Microsoft Fabric: a serverless ETL tool with more than 100 connectors, enabling integration with a wide variety of data sources, including databases, cloud services, file systems, and more.
In addition, it supports an on-premises agent called the self-hosted integration runtime. This agent, which you install on a VM, acts as a bridge, allowing your pipeline to run on a local VM and securing the connection from your on-premises network to the cloud.
Let's describe our solution's data pipeline in more detail:
Bear in mind that ADF is very flexible and supports reading at scale from a wide range of data sources and file stores, with integrations to all major cloud vendors' blob storage (S3, GCS, Oracle Cloud, file systems, FTP/SFTP, and so on), so even if your files are generated outside of Azure this is not an issue at all.
Visualization of Fabric Data Pipeline
- Log Collection
- ADF Copy Activity: inside the Data Pipeline we create a Copy activity with the following basic configuration:
- Source: mapped to your data source. It can be Azure Blob Storage (a container holding the log files) or another cloud object store such as S3 or GCS. Log files are generally retrieved from a specific container/folder and fetched based on a prefix/suffix in the file name. To support an incremental load process, we can configure the activity to delete the source files it reads, so that once the files are successfully transferred to their target they are automatically removed from the source and the next pipeline iteration does not have to process the same files again.
- Sink: OneLake/Lakehouse folder. We create a Lakehouse ahead of time (an abstract data container that lets you hold and manage structured and unstructured data at scale) and then select it from the list of connectors (look for OneLake/Lakehouse).
- Log Shippers: this is an optional component. Sometimes the ETL is not allowed to access your on-premises VNet; in that case, tools like Fluentd, Filebeat, or the OpenTelemetry Collector can be used to forward your application logs to the main entry point of the system: Azure Blob Storage.
- AzCopy CLI: if you don't wish to invest in expensive tools and all you need is to copy your data to Azure Storage at scale and securely, you might consider building your own log-shipping solution around the free AzCopy tool, together with some basic scripting for scheduling. AzCopy is a command-line utility designed for high-performance uploading, downloading, and copying of data to and from Microsoft Azure Blob and File storage.
Fabric first Activity: Copy from Source Bucket to Lakehouse
- Log Preparation
Upon log files landing in Azure Blob Storage, an Eventstream can be used to trigger the Data Pipeline that handles the data preparation and loading phase.
So what is the main purpose of the data preparation phase?
After the log files land in storage, and before they are loaded into the real-time logs database (the KQL database), it might be necessary to transform the data with some basic manipulations. The reasons for this vary.
A few examples:
- Bad data formats: for example, log files sometimes contain problematic characters such as new lines inside a single record (a stack trace error message with new lines embedded in the message field).
- Metadata enrichment: sometimes log file names carry meaningful data, for example the originating process or server name; this metadata can be lost once the file content is loaded into the database.
- Regulation restrictions: logs sometimes contain private data such as names, credit card numbers, or social security numbers (PII) that must be removed, hashed, or encrypted before loading into the database.
In our case we run a PySpark notebook that reads the files from a OneLake folder, fixes the new-lines-inside-a-row issue, and creates new files in another OneLake folder. We call this notebook with a base parameter called log_path that defines the location on OneLake of the log files to read.
Fabric second Activity: Running the Notebook
- Log Loading
Inside the Data Pipeline, the last step after the transformation phase calls the Copy data activity again, but this time the source and sink are different:
- Source: Lakehouse folder (previous notebook output)
- Sink: a specific Eventhouse table created ahead of time; it is basically an empty table (logsraw)
Visualization of Fabric last Activity: Loading into the Eventhouse
In summary, we broke this stage of log collection and preparation into three Data Pipeline activities:
- Copy activity: read the log files from the source. This is the first step of the log ingestion pipeline, running inside our orchestrator, the Data Pipeline.
- Run Notebook activity: transform the log files. This is the execution of a single notebook or a chain of notebooks.
- Copy activity: load the log files into the destination database, a KQL database inside the Eventhouse. The destination is the logs table called logsraw, a specific table created ahead of time inside the Eventhouse database.
Inside The Eventhouse
We need to create a KQL database with a table to hold the raw ingested log records.
A KQL database is a scalable and efficient storage solution for log files, optimized for high-volume data ingestion and retrieval. Eventhouses and KQL databases operate on a fully managed Kusto engine. With an Eventhouse or KQL database, you can expect compute to be available for your analytics within 5 to 10 seconds, and the compute resources grow with your data analytics needs.
Log Ingestion to KQL Database with Update Policy
We can separate the ETL transformation logic into what happens to the data before it reaches the Eventhouse KQL database and what happens after.
Before the data reaches the database, the only transformation we apply is the notebook called from the data pipeline to handle the new-line merge logic.
This cannot easily be done as part of the database ingestion logic, simply because when we try to load files that contain new lines inside a record field, the ingestion process breaks the convention and creates a separate table record for each line of the exception stack trace.
On the other hand, we might need to define basic transformation rules such as date formatting, type conversion (string to number), parsing and extracting interesting values from a string with regular expressions, or creating JSON (dynamic type) from a hierarchical string (an XML or JSON string, for example). For all these transformations we can work with what is called an update policy, which lets us define simple ETL logic inside the KQL database, as explained here.
During this step we create, from the logsraw staging table, a new table called logsparsed that will be the final destination table for log queries.
These are the KQL tables defined to hold the log data:
.create table logsraw ( timestamp:string , log_level:string, module:string, message:string)

.create table logsparsed ( formattedDatetime:datetime , log_level:string, log_module:string, message:string)
This is the update policy that automatically transforms data from the staging table logsraw into the destination table logsparsed (note that the projected column names must match the target table's schema):
.create-or-alter function parse_lograw()
{
    logsraw
    | project formattedDatetime = todatetime(strcat("20", substring(timestamp, 0, 2), "-", substring(timestamp, 3, 2), "-", substring(timestamp, 6, 2), "T", substring(timestamp, 9, 8))),
              log_level,
              log_module = module,
              message
}

.alter table logsparsed policy update @'[{ "IsEnabled": true, "Source": "logsraw", "Query": "parse_lograw()", "IsTransactional": true, "PropagateIngestionProperties": true}]'
Since we don't need to retain the data in the staging table (logsraw), we can define a retention policy with a TTL of 0, like this:
.alter-merge table logsraw policy retention softdelete = 0sec recoverability = disabled
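One way to sanity-check the update policy (a quick test sketch, not part of the production flow described above) is to ingest a single sample record inline into the staging table and confirm that a parsed row appears in logsparsed; run each statement separately:

.ingest inline into table logsraw <|
16/07/26 12:00:30,INFO,util.Utils,Successfully started service 'sparkDriverActorSystem' on port 59219.

// the transactional update policy should have populated the destination table
logsparsed
| take 10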
Query Log Files
After the data is ingested and transformed, it lands in a schematized logs table: logsparsed. In general there are some common fields mapped to their own columns, such as the log level (INFO/ERROR/DEBUG), the log category, the log timestamp (a datetime-typed column), and the log message, which can be either a simple error string or a complex JSON-formatted string. In the latter case it is usually preferable to convert it to the dynamic type, which brings additional benefits such as simplified query logic and reduced data processing (avoiding expensive joins).
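For example, if the message field holds a JSON payload, a hedged sketch of such a query could look like this (the nested field names errorCode and durationMs are hypothetical, used purely for illustration):

logsparsed
| where log_level == "ERROR"
| extend payload = parse_json(message)             // parse the JSON string into a dynamic value
| where tostring(payload.errorCode) == "TIMEOUT"   // filter on a nested (hypothetical) field
| project formattedDatetime, log_module, payload.errorCode, payload.durationMs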
Examples of Typical Log Queries
| Category | Purpose | KQL Query |
| --- | --- | --- |
| Troubleshooting | Looking for an error at a specific datetime range | logsparsed \| … |
| Statistics | Basic statistics: min/max timestamp of log events | logsparsed \| … |
| Exceptions Stats | Check exceptions distribution | logsparsed \| … |
| Log Module Stats | Check modules distribution | logsparsed \| summarize count() by log_module \| order by count_ desc \| take 10 |
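As illustrative sketches for the first three categories (assuming the logsparsed schema defined earlier; the time range, filter values, and regular expression are assumptions for illustration, not the original queries), such queries could look like:

// Troubleshooting: errors within a specific datetime range
logsparsed
| where log_level == "ERROR"
| where formattedDatetime between (datetime(2016-07-26 12:00:00) .. datetime(2016-07-26 13:00:00))
| project formattedDatetime, log_module, message

// Statistics: min/max timestamp of the ingested log events
logsparsed
| summarize min(formattedDatetime), max(formattedDatetime)

// Exceptions stats: rough distribution of exception types mentioned in messages
logsparsed
| where message contains "Exception"
| extend exception_type = extract(@"([A-Za-z0-9_.]+Exception)", 1, message)
| summarize count() by exception_type
| order by count_ desc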
Real-Time Dashboards
After querying the logs, it is possible to visualize the query results in Real-Time Dashboards. All that is required is to:
- Select the query
- Click on Pin to Dashboard
After adding the queries as tiles inside the dashboard, this is a typical dashboard we can easily build:
Real-Time Dashboards can be configured to refresh automatically, as illustrated here; the user can very easily configure how often to refresh the queries and visualizations, and at the extreme it can be as low as Continuous.
There are many more capabilities implemented in the Real-Time Dashboard, such as data exploration, alerting using Data Activator, and conditional formatting (changing item colors based on KPI thresholds), and this framework keeps growing under heavy investment.
What about AI Integration?
Machine Learning Models: Kusto supports time-series analysis out of the box, allowing, for example, anomaly detection (https://learn.microsoft.com/en-us/fabric/real-time-intelligence/dashboard-explore-data) and clustering. If that is not enough for you, you can always mirror the data of your KQL tables into OneLake in Delta Parquet format by selecting OneLake availability.
This configuration creates another copy of your data in the open Delta Parquet format, making it available for any Spark/Python/SparkML/SQL analytics and for whatever machine learning exploration, training, and serving you wish to pursue.
This is illustrated here:
Bear in mind, there is no additional storage cost to turn on OneLake availability.
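As a minimal sketch of the native path (assuming the logsparsed table defined earlier; the one-hour step and the 1.5 anomaly threshold are illustrative assumptions), an hourly error-count series with built-in anomaly detection could look like this:

// build an hourly time series of error counts and flag anomalous spikes
logsparsed
| where log_level == "ERROR"
| make-series ErrorCount = count() default = 0 on formattedDatetime step 1h
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(ErrorCount, 1.5)
| render anomalychart with (anomalycolumns = Anomalies)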
Conclusion
A well-designed real-time intelligence solution for log file management using Microsoft Fabric and EventHouse can significantly enhance an organization’s ability to monitor, analyze, and respond to log events. By leveraging modern technologies and best practices, organizations can gain valuable insights and maintain robust system performance and security.