Log Clustering in Azure Data Explorer
Published Apr 27 2023 by Microsoft

Azure Data Explorer (ADX) is commonly used to monitor cloud services, applications, and IoT devices. As part of the monitoring workflow, the service/device emits log records containing various metrics and textual strings reporting its state, activity, operational warnings/errors, etc. These log records are parsed and stored in ADX for further analysis such as anomaly detection, investigation of resource consumption, etc. In general, the log records contain semi-structured data: there are structured fields like timestamp, device identification, and metric name/value pairs, as well as free-text ones for status or error strings.

ADX has a rich set of capabilities for anomaly detection and investigation (see Time series anomaly detection & forecasting in Azure Data Explorer and Machine learning capability in Azure Data Explorer). Specifically, for diagnosing anomalies and root cause analysis there are powerful clustering plugins: autocluster() or basket() for finding patterns in a single record set, and diffpatterns() for exposing differentiating patterns between two record sets. But these plugins cluster records based on common values across multiple columns; they are not applicable for parsing free-text column(s) to extract tokens and cluster the records based on common sets of these tokens. For that task ADX has either the reduce operator or the diffpatterns_text() plugin; both are quite powerful for specific scenarios, but not flexible enough to cover the large diversity of textual log records.
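For instance, the built-in reduce operator can be applied directly to a text column. A minimal sketch, using the HDFS sample table and column that appear later in this post:

```kusto
// Built-in text reduction: groups log lines by common patterns,
// replacing variable tokens with '*' placeholders.
HDFS_log_100k
| reduce by data
```

The reduce operator needs no model or plugin, but offers little control over how tokens are generalized, which is the gap the new functions address.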

To fill this gap, we are happy to introduce a set of powerful log clustering functions. These functions are based on a state-of-the-art log clustering algorithm, combining ML, statistical analysis, and the user’s domain knowledge. First, let’s see an example of clustering HDFS log lines with the new log_reduce_fl() function.

 

Here is a short snippet of HDFS log lines:

 

data

081110 215858 15485 INFO dfs.DataNode$PacketResponder: Received block blk_5080254298708411681 of size 67108864 from /10.251.43.21

081110 215858 15494 INFO dfs.DataNode$DataXceiver: Receiving block blk_-7037346755429293022 src: /10.251.43.21:45933 dest: /10.251.43.21:50010

081110 215858 15496 INFO dfs.DataNode$PacketResponder: PacketResponder 2 for block blk_-7746692545918257727 terminating

081110 215858 15496 INFO dfs.DataNode$PacketResponder: Received block blk_-7746692545918257727 of size 67108864 from /10.251.107.227

081110 215858 15511 INFO dfs.DataNode$DataXceiver: Receiving block blk_-8578644687709935034 src: /10.251.107.227:39600 dest: /10.251.107.227:50010

081110 215858 15514 INFO dfs.DataNode$DataXceiver: Receiving block blk_722881101738646364 src: /10.251.75.79:58213 dest: /10.251.75.79:50010

081110 215858 15517 INFO dfs.DataNode$PacketResponder: PacketResponder 2 for block blk_-7110736255599716271 terminating

081110 215858 15517 INFO dfs.DataNode$PacketResponder: Received block blk_-7110736255599716271 of size 67108864 from /10.251.42.246

081110 215858 15533 INFO dfs.DataNode$DataXceiver: Receiving block blk_7257432994295824826 src: /10.251.26.8:41803 dest: /10.251.26.8:50010

081110 215858 15533 INFO dfs.DataNode$DataXceiver: Receiving block blk_-7771332301119265281 src: /10.251.43.210:34258 dest: /10.251.43.210:50010

 

Cluster the 100K log lines:

 

HDFS_log_100k
| invoke log_reduce_fl(reduce_col="data")

 

| Count | LogReduce | Example |
| --- | --- | --- |
| 55356 | 081110 <NUM> <NUM> INFO dfs.FSNamesystem: BLOCK* <*>: <*> <*> <*> <*> <*> <*> <IP> | 081110 220623 26 INFO dfs.FSNamesystem: BLOCK* NameSystem.delete: blk_1239016582509138045 is added to invalidSet of 10.251.123.195:50010 |
| 10278 | 081110 <NUM> <NUM> INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: blockMap updated: <IP> is added to blk_<NUM> size <NUM> | 081110 215858 27 INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.250.11.85:50010 is added to blk_5080254298708411681 size 67108864 |
| 10256 | 081110 <NUM> <NUM> INFO dfs.DataNode$PacketResponder: Received block blk_<NUM> of size <NUM> from <IP> | 081110 215858 15485 INFO dfs.DataNode$PacketResponder: Received block blk_5080254298708411681 of size 67108864 from /10.251.43.21 |
| 10256 | 081110 <NUM> <NUM> INFO dfs.DataNode$PacketResponder: PacketResponder <NUM> for block blk_<NUM> terminating | 081110 215858 15496 INFO dfs.DataNode$PacketResponder: PacketResponder 2 for block blk_-7746692545918257727 terminating |
| 9140 | 081110 <NUM> <NUM> INFO dfs.DataNode$DataXceiver: Receiving block blk_<NUM> src: <IP> dest: <IP> | 081110 215858 15494 INFO dfs.DataNode$DataXceiver: Receiving block blk_-7037346755429293022 src: /10.251.43.21:45933 dest: /10.251.43.21:50010 |
| 3047 | 081110 <NUM> <NUM> INFO dfs.FSNamesystem: BLOCK* NameSystem.allocateBlock: /user/root/rand3/_temporary/_task_<NUM>_<NUM>_m_<NUM>_<NUM>/part-<NUM>. <*> | 081110 215858 26 INFO dfs.FSNamesystem: BLOCK* NameSystem.allocateBlock: /user/root/rand3/_temporary/_task_200811101024_0005_m_001805_0/part-01805. blk_-7037346755429293022 |
| 1402 | 081110 <NUM> <NUM> INFO <*>: <*> block blk_<NUM> <*> <*> | 081110 215957 15556 INFO dfs.DataNode$DataTransfer: 10.250.15.198:50010:Transmitted block blk_-3782569120714539446 to /10.251.203.129:50010 |
| 177 | 081110 <NUM> <NUM> INFO dfs.DataBlockScanner: Verification succeeded for <*> | 081110 215859 13 INFO dfs.DataBlockScanner: Verification succeeded for blk_-7244926816084627474 |
| 36 | 081110 <NUM> <NUM> INFO dfs.DataNode$BlockReceiver: Receiving empty packet for block <*> | 081110 215924 15636 INFO dfs.DataNode$BlockReceiver: Receiving empty packet for block blk_3991288654265301939 |
| 12 | 081110 <NUM> <NUM> INFO dfs.FSNamesystem: BLOCK* <*> <*> <*> <*> <*> <*> <*> <IP> | 081110 215953 19 INFO dfs.FSNamesystem: BLOCK* ask 10.250.15.198:50010 to replicate blk_-3782569120714539446 to datanode(s) 10.251.203.129:50010 |
| 12 | 081110 <NUM> <NUM> INFO dfs.DataNode: <IP> Starting thread to transfer block blk_<NUM> to <IP> | 081110 215955 18 INFO dfs.DataNode: 10.250.15.198:50010 Starting thread to transfer block blk_-3782569120714539446 to 10.251.203.129:50010 |
| 12 | 081110 <NUM> <NUM> INFO dfs.DataNode$DataXceiver: Received block blk_<NUM> src: <IP> dest: <IP> of size <NUM> | 081110 215957 15226 INFO dfs.DataNode$DataXceiver: Received block blk_-3782569120714539446 src: /10.250.15.198:51013 dest: /10.250.15.198:50010 of size 14474705 |
| 6 | 081110 <NUM> <NUM> <*> dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: <*> <*> <*> <*> for blk_<NUM> <*> <*> size <NUM> | 081110 215924 27 WARN dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_2522553781740514003 on 10.251.202.134:50010 size 67108864 |
| 6 | 081110 <NUM> <NUM> INFO dfs.DataNode$DataXceiver: <*> <*> <*> <*> <*>: <*> <*> <*> <*> <*> | 081110 215936 15714 INFO dfs.DataNode$DataXceiver: writeBlock blk_720939897861061328 received exception java.io.IOException: Could not read from stream |
| 3 | 081110 <NUM> <NUM> INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: <*> <*> <*> for blk_<NUM> <*> <*> size <NUM> <*> <*> <*> <*> <*> <*> <*> <*>. | 081110 220635 28 INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_-81196479666306310 on 10.250.17.177:50010 size 53457811 But it does not belong to any file. |
| 1 | 081110 <NUM> <NUM> <*> <*>: <*> <*> <*> <*> <*> <*> <*>. <*> <*> <*> <*> <*>. | 081110 220631 19 WARN dfs.FSDataset: Unexpected error trying to delete block blk_-2012154052725261337. BlockInfo not found in volumeMap. |

 

We can see that the top pattern accounts for 55.3% of the log records, followed by 4 patterns, each accounting for ~10% of the records, while the last 8 patterns together account for less than 0.1% of the records.
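As a rough intuition for what the <NUM> and <IP> placeholders denote, variable tokens can be masked with regular expressions. A minimal KQL sketch using the built-in replace_regex() function — an illustration only, not the actual algorithm, which also learns the <*> wildcards statistically:

```kusto
HDFS_log_100k
// Mask IP addresses (optionally prefixed with '/' and suffixed with a port) first,
// so their octets are not caught by the generic number mask below.
| extend masked = replace_regex(data, @'/?\d+\.\d+\.\d+\.\d+(:\d+)?', '<IP>')
| extend masked = replace_regex(masked, @'\b\d+\b', '<NUM>')
| summarize Count = count() by masked
| top 10 by Count
```

Even this naive masking collapses the 100K lines into a handful of templates; the log_reduce functions go further by generalizing arbitrary variable tokens, not just numbers and IPs.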

 

Overall, there are 5 text clustering functions:

 

| Function | Description |
| --- | --- |
| log_reduce_fl() | Finds common patterns in textual logs, outputs a summary table |
| log_reduce_full_fl() | Finds common patterns in textual logs, outputs a full table |
| log_reduce_train_fl() | Finds common patterns in textual logs, outputs a model |
| log_reduce_predict_fl() | Applies a trained model to find common patterns in textual logs, outputs a summary table |
| log_reduce_predict_full_fl() | Applies a trained model to find common patterns in textual logs, outputs a full table |

 

The first two can be used for ad-hoc analysis, encapsulating both the training and the scoring phases. The other three support novelty-detection scenarios: log_reduce_train_fl() trains a model that builds a list of known patterns, and the two predict functions apply that model to new log records, classifying each one as either one of the known patterns or a new, anomalous one.
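A novelty-detection flow might look like the following sketch. The model table stored_models, the model name, and the new_hdfs_logs table are illustrative; check the Functions library docs for the exact parameter names and output schema:

```kusto
// Train: build a patterns model from a baseline period and persist it to a table.
.set-or-append stored_models <|
    HDFS_log_100k
    | invoke log_reduce_train_fl(reduce_col="data", model_name="hdfs_baseline")

// Predict: classify new log lines against the known patterns; lines that match
// none of them surface under a new (anomalous) pattern.
new_hdfs_logs
| invoke log_reduce_predict_fl(models_tbl=stored_models, model_name="hdfs_baseline", reduce_col="data")
```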

 

These functions are published as part of the Functions library. They all use a common text clustering algorithm that is currently implemented in Python, and thus require enabling the ADX inline python() plugin.
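Enabling the plugin requires cluster admin permissions and can also be done from the cluster's Configurations page in the Azure portal; the equivalent management commands are roughly:

```kusto
// Enable the inline Python plugin (requires AllDatabasesAdmin permissions)
.enable plugin python

// Verify the plugin is now enabled
.show plugins
| where PluginName == "python"
```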

For further information on the algorithm and on controlling its various parameters, have a look at the docs linked above. Feel free to contact us with questions, thoughts, and any other feedback!
