Azure Data Explorer Blog

The new SCAN Operator: Process Mining in Azure Data Explorer

adieldar
Microsoft
May 23, 2021

Introduction

Azure Data Explorer (ADX) is commonly used to store and analyze event logs of business processes. These logs are generated everywhere: applications emit events capturing user actions, IoT devices send sensor data, cloud services generate monitoring logs, and so on. Once the data is stored in ADX, it can be analyzed for insights. Kusto Query Language (KQL) has an extensive set of operators and functions for various types of analysis, including relational operators (filtering, joining, aggregations), time series analysis (anomaly detection, forecasting), advanced plugins (root cause analysis, user analytics), and even extensibility for embedding external Python/R packages. Today we are delighted to introduce the scan operator, adding native process mining and temporal analytics to ADX!

 

Process Mining

Process mining is an emerging domain, dealing with the analysis of event logs to discover the underlying process models, validate them, and potentially improve them. In that context, an event contains at least a process instance id, a timestamp, and an action. A process instance id (a.k.a. case id) might be the identifier(s) of a specific user, device, session, or any other natural partitioning of the event log; a minimal sketch of such a log follows the examples below. A process model describes a sequence of actions, commonly with time constraints, to achieve a specific task. Some examples of process models:

  • Users log in to a shopping web site, browse products, add them to the cart, check out, pay, and log out.
  • A patient is admitted to a clinic, tested for pulse and blood pressure by the triage team, checked by a specialist, diagnosed, and dismissed.
  • A washing machine cycle: starting, weighing the laundry, filling water, heating, washing, rinsing, spinning, and finishing.
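
For concreteness, here is a minimal sketch of an event log with this structure, expressed as a KQL datatable. The table, its column names, and its values are purely hypothetical:

datatable(CaseId:string, Timestamp:datetime, Action:string)    //  hypothetical event log
[
    'user-1', datetime(2021-05-01 10:00:00), 'Login',
    'user-1', datetime(2021-05-01 10:02:10), 'AddToCart',
    'user-1', datetime(2021-05-01 10:05:47), 'CheckOut',
    'user-1', datetime(2021-05-01 10:06:03), 'Logout',
    'user-2', datetime(2021-05-01 10:01:30), 'Login',
    'user-2', datetime(2021-05-01 10:03:15), 'Logout',
]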

There are three classes of process mining techniques:

  1. Discovery – automatically extracting process model(s) out of the raw event logs.
  2. Conformance checking – comparing an existing process model with the actual event log to detect and analyze discrepancies.
  3. Performance mining – enhancing the existing process to improve performance, e.g., reducing the time between steps or improving retention.

 

The ‘scan’ operator

The scan operator is arguably the most advanced analytics operator in ADX. It extends ADX native analytics to include process mining, user analytics, recursive calculations, and more. The user defines a process model in KQL, then feeds in a huge event log table. ADX scans it and extracts the instances (each one an event sequence) conforming to that process model, filtering out those that deviate from it. The model definition syntax lets the user define sequential states and forward transition rules from the first to the last state. These states and rules are translated to a set of simple state machines that define the scan logic. Let’s review a few examples to understand the syntax, as well as its power and versatility for a wide range of scenarios.
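
As a warm-up before the full examples, here is a minimal sketch of the syntax: a scan with a single step whose condition is always true, computing a cumulative sum over a sequence of numbers. Referencing s1.cumsum inside the step reads the state of the previous matching record:

range x from 1 to 5 step 1
| scan declare (cumsum:long=0) with
(
    //  every record matches; add the current x to the accumulated state
    step s1: true => cumsum = x + s1.cumsum;
)

The output column cumsum holds 1, 3, 6, 10, 15.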

 

Searching for ‘rallies’ of top stocks

Here we analyze historical prices of top tech stocks from 2019 until today. This period is interesting: from the start of 2020 until today the market has been impacted by the Covid-19 pandemic, yet the overall trend of these stocks is positive. In the query below we look for the longest sequences of increasing price for each stock. The query is built of these steps:

  1. Pre-processing:
    • For each stock, order the prices by time and calculate the daily change.
  2. Extracting sequences using the ‘scan’ operator:
    • A sequence starts when the price increases.
    • The sequence continues as long as the price increases or doesn’t change.
    • The sequence ends when the price decreases.
  3. Post-processing:
    • For each sequence, drop the last event, as it contains the price decrease.
    • Summarize the length and delta per sequence.
    • For each stock, extract the sequence with the biggest percentage increase.


TopStocksCleaned
| where Date >= datetime(2019-01-01) and Ticker in('AAPL', 'AMZN', 'GOOG', 'MSFT')
| partition by Ticker
(
    order by Date asc 
    | extend pDate=prev(Date), pAdjClose=prev(AdjClose)
    | extend delta = AdjClose - pAdjClose
    | scan with_match_id=m_id declare(down:bool=false, step:string) with 
    (
        //  if the state of s1 is empty require a price increase, else continue as long as the price doesn't decrease
        step s1: delta >= 0 and (delta > 0 or isnotnull(s1.delta)) => step = 's1';
        //  exit the 'rally' when the price decreases; checking s2.down also forces a single match
        step s2: delta < 0 and s2.down == false => down = true, step = 's2';
    )
)
| where step == 's1'   //  select only records with price increase
| summarize (start_date, start_AdjClose, start_delta)=arg_min(pDate, pAdjClose, delta), end_date=max(Date), trading_days=count(), total_delta=sum(delta) by Ticker, m_id
| extend delta_pct = total_delta*100.0/start_AdjClose
| summarize arg_max(delta_pct, *) by Ticker
| project Ticker, start_date, end_date, trading_days, delta_pct, start_AdjClose, total_delta
| order by delta_pct


Ticker | start_date | end_date   | trading_days | delta_pct | start_AdjClose | total_delta
AAPL   | 2020-07-29 | 2020-08-07 | 7            | 20.752    | 93.75          | 19.455
AMZN   | 2020-04-13 | 2020-04-21 | 6            | 18.461    | 2040           | 376.610
MSFT   | 2020-02-28 | 2020-03-03 | 2            | 14.034    | 152.410        | 21.389
GOOG   | 2021-01-28 | 2021-02-03 | 4            | 12.422    | 1843.939       | 229.060
 

We can see that Apple had the longest and biggest rally: 7 consecutive trading days with over 20% increase! Let’s chart all the stocks:


TopStocksCleaned
| where Date >= datetime(2020-01-01) and Ticker in('AAPL', 'AMZN', 'GOOG', 'MSFT')
| project Ticker, Date, AdjClose
| evaluate pivot(Ticker, any(AdjClose))
| render timechart with(ysplit=panels)


In the rendered timecharts we can clearly see the biggest increases (highlighted in yellow).

 

Analyzing users’ mitigations for failing KQL queries

In this example we analyze sequences of users’ queries following a failed one, trying to extract interesting insights. The following query is quite complex; its steps are:

  1. Pre-processing:
    • For each user, order the queries by time.
    • Filter out consecutive queries less than one second apart – these are likely automated queries.
  2. Extracting sequences using the ‘scan’ operator:
    • Start a sequence with a query that did not complete successfully.
    • Continue the sequence until a successful query or a timeout.
    • Each step updates the sequence length and the number of retries (of exactly the same query).
  3. Post-processing:
    • Aggregate sequences by user and a retries-only flag.
    • Final aggregation by sequence length, final state, and retries flag.


let Kusteam = toscalar((KustoAll | summarize make_list(Email)));
let etime = datetime(2021-05-20);
let stime = etime - 14d;
let max_gap = 5m;           //  max time gap between user's consecutive queries to be grouped in the same sequence
let min_gap = 1s;           //  min time gap between user's consecutive queries, ignoring automated queries
//
QueryCompletion_v2
| where User in(Kusteam)
| where Timestamp between (stime..etime)
| partition hint.strategy=native by User (
    order by Timestamp asc
    | extend nTimestamp = next(Timestamp)
    | where nTimestamp - Timestamp >= min_gap   //  filter automated queries
    | scan with_match_id=m_id declare(step:string='empty', timeout:boolean=false, retries:long=0, SequenceLen:long=0) with 
    (
        step s1: State != 'Completed' => step = 'not completed', SequenceLen = s1.SequenceLen + 1, retries = s1.retries + iff(s1.Text == Text, 1, 0);
        step s2: s2.step == 'empty' and (State == 'Completed' or Timestamp - s1.Timestamp >= max_gap) =>
                 step = 'end', timeout = Timestamp - s1.Timestamp >= max_gap, SequenceLen = s1.SequenceLen + 1, retries = s1.retries + iff(s1.Text == Text, 1, 0);
    )
)
| where not(timeout) // these records are not related to the past sequence
| extend RetriesOnly = (retries == SequenceLen - 1), FinalState = State
| summarize arg_max(SequenceLen, FinalState) by User, m_id, RetriesOnly
| summarize SequenceNum = count() by SequenceLen, FinalState, RetriesOnly
| where SequenceLen > 1 and FinalState != 'Cancelled'
| order by FinalState asc, SequenceLen asc, RetriesOnly asc


Results:

SequenceLen | FinalState | RetriesOnly | SequenceNum
2           | Completed  | False       | 306
2           | Completed  | True        | 20
3           | Completed  | False       | 66
3           | Completed  | True        | 2
4           | Completed  | False       | 25
4           | Completed  | True        | 1
5           | Completed  | False       | 11
6           | Completed  | False       | 4
2           | Failed     | False       | 8
2           | Failed     | True        | 20
3           | Failed     | False       | 2
3           | Failed     | True        | 7
4           | Failed     | False       | 2
4           | Failed     | True        | 1
5           | Failed     | False       | 2
5           | Failed     | True        | 32
6           | Failed     | False       | 13
7           | Failed     | False       | 3
11          | Failed     | False       | 1
 

Let’s first look at sequences of 2 events, i.e. a failed query followed by a successful one or by a timeout. These sequences account for 354/526=67% of all sequences. We can see that 326/526=62% of the failures were resolved by a single follow-up query. Out of these, 306/326=94% completed successfully after modifying the query, while only 20/326=6% completed just by retrying the same query. Looking at sequences of 3 events, an additional 68/526=12.9% of the failed queries were resolved, but retrying the same query twice is much less effective: only 2/68=2.9% succeeded. The last thing to note is that as the sequences become longer, the chance of successful completion decreases: only a few sequences of 6 queries end successfully, and longer ones just continued to fail until the users abandoned them.
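
To double-check these percentages, the aggregated counts above can be recreated inline and each row's share of all 526 sequences computed. A minimal sketch (the datatable below is a manual copy of the results, collapsing the RetriesOnly split for brevity; it is not a saved table):

let Results = datatable(SequenceLen:long, FinalState:string, SequenceNum:long)
[
    2, 'Completed', 326,    //  306 with a modified query + 20 pure retries
    3, 'Completed', 68,
    4, 'Completed', 26,
    5, 'Completed', 11,
    6, 'Completed', 4,
    2, 'Failed', 28,
    3, 'Failed', 9,
    4, 'Failed', 3,
    5, 'Failed', 34,
    6, 'Failed', 13,
    7, 'Failed', 3,
    11, 'Failed', 1,
];
let Total = toscalar(Results | summarize sum(SequenceNum));     //  526
Results
| extend Pct = round(100.0 * SequenceNum / Total, 1)            //  e.g. 62.0 for the first row
| order by FinalState asc, SequenceLen asc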

 

Summary

In this short blog we have seen only 2 examples of ‘scan’; however, the scope and power of this operator go far beyond these specific use cases. It is very effective for a wide range of scenarios: from simple fill-forward of table columns (sketched below) or calculation of cumulative sums, through analyzing the events preceding a failure for root cause analysis, up to funnel analysis, rectifying obstacles in UX workflows, improving performance, and beyond. Have a look at the scan doc, where you can learn how to use it and review more interesting examples. Note that the scan operator is new and currently in preview; still, you are very welcome to try it and share your feedback with us!
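
As a parting taste of the simpler scenarios, here is a minimal fill-forward sketch (the table and its values are hypothetical): a single always-true step carries the last non-null value forward:

datatable(Timestamp:datetime, Value:real)       //  hypothetical series with gaps
[
    datetime(2021-05-01 00:00:00), 1.5,
    datetime(2021-05-01 00:01:00), real(null),
    datetime(2021-05-01 00:02:00), real(null),
    datetime(2021-05-01 00:03:00), 7.5,
]
| scan declare (filled:real) with
(
    //  keep the current value if present, else reuse the previous state
    step s1: true => filled = iff(isnull(Value), s1.filled, Value);
)

The filled column comes out as 1.5, 1.5, 1.5, 7.5.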

 

Updated May 23, 2021
Version 1.0
  • steffen_zeidler
    Copper Contributor

    With the following test I would like to better understand the scan operator.

     

    range i from 0 to 4 step 1
    | scan with_match_id=matchId declare (s: long, n: long=0) with 
    (
        step s1: i % 2 == 0 => s=1, n=s1.n + 1;
        step s2: true => s=2, n=s2.n + 1;
        step s3: true => s=3, n=s3.n + 1;
    )
    i | s | n | matchId
    0 | 1 | 1 | 0
    1 | 2 | 1 | 0
    2 | 3 | 1 | 0
    2 | 1 | 1 | 1
    3 | 3 | 2 | 0
    3 | 2 | 1 | 1
    4 | 3 | 3 | 1
    4 | 1 | 1 | 2

     

    Why is s3 of match 0 still active after s1 starts match 1?
    Why is s3.n equal to 3 (expected: 1) at i=4, match 1?

     

  • steffen_zeidler Scan can have multiple active sequences – s3 is still active when s1 starts a new match, and then i=4 matches both the active s1 and the active s3 (which causes s3.n to increase to 3). See the last part of the scan operator doc (Matching logic) for a complete description of the logic implemented by scan.

  • steffen_zeidler
    Copper Contributor

    Thank you for the answer. Now it is clear that several steps can be active at the same time.
    However, I could not resolve the second question.
    In the test sample above, following the rules in the Matching logic documentation,
    record i=4 satisfies the condition of s3 using the state of the previous step s2:
    1. The state of s_k (s3) is deleted.
    2. The state of s_(k-1) (s2) becomes the state of s_k (s3), and the state of s_(k-1) (s2) becomes empty.
    But if the state of s3 is deleted, then s3.n should be reset to 0 and not retain the old value 2.

  • wernerzirkel
    Brass Contributor

    I am trying to use the scan operator to find time-based sequences A -> B where a certain error C occurred in between. I think that is a very common pattern, but I do not know how to solve it. What is the right approach?