Azure Sentinel correlation rules: the join KQL operator

Microsoft

Dec 05, 2019

In the SIEM world, rules are often called correlation rules. While this is not always the case, and therefore I prefer the term detection rules, it conveys the importance of correlation for SIEM. What makes SIEM different from specialized security controls is the ability to correlate between events from multiple sources.

Last time, I presented the make_list() KQL aggregation function to implement a correlation. It corresponds to the use of an explicit state machine for correlation in traditional SIEMs using "Active Lists" or "reference sets." Therefore, the Azure Sentinel version avoids the state machine and is much simpler to build and maintain.

In this post, I will describe implicit correlation rules and implementing them using the KQL operator join.

This post of part of a series of blog posts on writing rules in Azure Sentinel:

Azure Sentinel correlation rules using lists
Azure Sentinel correlation rules using the join operator (this post)
Implementing Lookups in Azure Sentinel
Approximate, partial, and combined lookups
Handling sliding windows in Azure Sentinel rules

Implicit correlation

Traditional SIEMs offer an additional option: implicit correlation. Such a correlation rule relates two or more events and, if matching, would trigger an alert. Behind the scenes, the SIEM would maintain a state: once one of the events is received, let's call it the anchoring event, a context is created to "remember" this occurrence. In the ArcSight Jargon, this is called a "partial match." This context would be retained for the period, or "window," stated for the correlation.

Those implicit correlation rules are easier to create and maintain than explicit correlations but are very costly in memory to keep all those open contexts. Pick a too common anchoring event (failed login, for example) and a long window (say an hour), and your SIEM would saturate. This is why explicit correlation using lists is often preferred.

The account added and deleted use case example

To illustrate a correlation that often uses implicit correction rules, I will use the account created and deleted within a day scenario. This well-known attack technique utilizes creating an account, presumably with elevated permissions, abusing it, and then deleting it to cover tracks. You can find many SIEM implementations of a rule to detect such a pattern (ArcSight, Splunk). Since this use case requires matching an account create to an account delete event, it is an excellent example of a correlation rule.

Selecting the events to correlate

Let's start! Before we use "join" in correlating, we need to select the events to correlate. The first part of the query does just that and names the relevant account creation and account deletion events account_created and account_deleted, respectively.

let timeframe = 10m;
let lookback = 1d;
let account_created =
SecurityEvent 
| where TimeGenerated > ago(lookback+timeframe)
| where EventID == "4720" // A user account was created
| where AccountType =~ "User"
| project creationTime = TimeGenerated, CreateEventID = EventID, Activity, Computer, 
  TargetUserName, UserPrincipalName, AccountUsedToCreate = SubjectUserName, TargetSid, 
  SubjectUserSid;
let account_deleted =
SecurityEvent
| where TimeGenerated > ago(timeframe) 
| where EventID == "4726" // A user account was deleted 
| where AccountType == "User"
| project deletionTime = TimeGenerated, DeleteEventID = EventID, Activity, Computer, 
  TargetUserName, UserPrincipalName, AccountUsedToDelete = SubjectUserName, TargetSid, 
  SubjectUserSid;

The "where" clauses select the relevant events, while the "project" clauses prepare the relevant events fields needed for the correlation. The "project" clauses are not strictly needed but help readability: since the two selected event sets come from the same table, the field names would be the same. The "project" clause assigns different, easy to understand names to the fields, designating them as belonging to the add or delete events, making the join statement below much more readable.

Of interest is the time selection part (bolded) of the where clauses that use different periods.

The anchoring event, in this case, the delete event, is checked for in the last 10 minutes:

where TimeGenerated > ago(timeframe)

"Timeframe" should match the alert rule scheduling.

The reference event, in this case, the add event, can be queried over a longer period to match the use case goal, one day in this case:

where TimeGenerated > ago(lookback+timeframe)

The timeframe is added to the lookback period to implement a sliding window. It ensures that alerts that occur across time frames are not missed. For example, if the anchoring event happened 9 minutes before the rule schedule and the reference event 24 hours and 5 minutes, an alert is due but would not trigger without this time selection fix. It creates a risk of duplicate alerting, which is handled later on as part of the join part of the query.

This time framing ensures the accuracy (i.e., no duplicates or omissions) and timeliness of the alert, regardless of the correlation window length.

Correlating using Join

account_created | join kind= inner (account_deleted) on Computer, TargetUserName
| where deletionTime - creationTime < lookback
| where tolong(deletionTime - creationTime) >= 0
| project TimeDelta = deletionTime - creationTime, creationTime, CreateEventID, Computer, TargetUserName, UserPrincipalName, AccountUsedToCreate, 
deletionTime, DeleteEventID, AccountUsedToDelete
| extend timestamp = creationTime, AccountCustomEntity = AccountUsedToCreate, HostCustomEntity = Computer

The "where" clause serves to implement the time framing correctly:

The purple condition ensures that no more than lookback time separated the two events, compensating for the fact that we query the reference event for slightly more than 24 hours.
The blue condition ensures that user deletion happened after the user addition.

Both of these are required for rule accuracy but actually do not enhance security. Arguably, 24.05 hours between the events are still a reason to be concerned. Likewise, deletion and then addition might be just as suspicious as the reverse order of events.

If the join and additional conditions return results, an alert is triggered. The last two green lines prepare the data for investigation and assign entities, which will be discussed in a future post.

Now that you learned about the Sentinel way of writing a correlation rule, you may want to go back and compare it to the ArcSight and Splunk implementations. My take is that while the explicit handling of the sliding time window takes getting used to, in general, the rule is very transparent and explicit, allowing for easier development and maintenance. Being query-based, it also eliminates the large memory requirements that implicit correlations put.

Want to learn more about writing Azure Sentinel alert rules? Here are some resources to explore: