Handling ingestion delay in Azure Sentinel scheduled alert rules
Published Jan 12 2021 06:13 AM
Microsoft

 

At Azure Sentinel, we take pride in our ability to ingest data from a variety of sources (learn more).

However, data ingestion time may vary for different data sources under different circumstances.

 

In this blog post, we will address the delay challenge: understanding the impact of ingestion delay and how to handle it.

 

Why is the delay significant?

When ingesting data into Sentinel we might encounter delays for a variety of reasons.

These delays might affect our scheduled queries in the following way:

 

When writing a custom detection rule, we set the “Run query every” and the “Lookup data from the last” parameters.

For our example, let's assume we run our query every five minutes and look up data in a five-minute "look-back" window.

 

 

[Image: scheduled rule settings with "Run query every" and "Lookup data from the last" both set to 5 minutes]

 

Ideally (when there is no delay), this detection will not miss any events.

 

So, how would it look?

[Diagram: with no ingestion delay, the event arrives as it is generated and is contained in its look-back window]

 

The event arrives as it is generated, and the window contains the event as we wanted.

Now, assume there is some delay for our data source.

For this example, let’s say the event was ingested 2 minutes after it was generated (delay of 2 minutes).

Example #1:

 

[Diagram: Example #1, the event is generated inside the first look-back window but ingested 2 minutes later]

 

 

As we can see, the event is generated within the first "look-back" window, but at the time of the first run it has not yet been ingested into the workspace.

No big deal – after all, we’ll catch it on the second run, right?

Wrong! By the next time the query is scheduled to run, the event has been ingested; however, it will be filtered out by the TimeGenerated filter (since it was generated more than five minutes earlier), and the rule won't fire an alert.
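
To make this concrete, here is a minimal sketch of the kind of naive rule query described above, using the CommonSecurityLog table that appears later in this post (any table behaves the same way):

// Naive scheduled rule: filter only on the 5-minute "look-back" window.
// An event ingested with a delay is missed on the run where it is not yet
// in the workspace, and filtered out by TimeGenerated on the next run.
CommonSecurityLog
| where TimeGenerated >= ago(5m)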

So, what can I do?

 

How to handle the delay:

To solve the issue, we first need to know the delay for our data type.

In our example it was easy since we knew it was 2 minutes.

In general, we can figure it out by using the Kusto function "ingestion_time()" and calculating the difference between "TimeGenerated" and the ingestion time; we will talk more about this later.

Once we have that number, we can address our problem.

 

Your basic intuition is probably telling you: "We need to increase the window size."

You are correct: this is indeed a part of the solution.

Since our "look-back" window is 5 minutes and our delay is 2 minutes, we will set the "look-back" window to 7 minutes.
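
As a minimal sketch (assuming the same CommonSecurityLog example), widening only the time filter looks like this:

// "Look-back" widened from 5 minutes to 7 minutes (5-minute rule window + 2-minute delay)
CommonSecurityLog
| where TimeGenerated >= ago(7m)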

[Diagram: the look-back window extended from 5 minutes to 7 minutes]

We can now see that our missed event is contained in the 7-minute "look-back" window. Problem solved.

[Diagram: the delayed event now contained in the 7-minute look-back window]

 

As you have probably realized by now, this can create duplication, since the "look-back" windows now overlap.

For a different event, the following diagram applies.

Example #2:

[Diagram: Example #2, an event whose TimeGenerated falls inside two overlapping look-back windows]

Since the event's "TimeGenerated" falls in both windows, it will fire two alerts, meaning we need to find a way to solve the duplication.
Going back to our original problem (Example #1), we missed the event because our data wasn't ingested yet when our scheduled query ran.
Then we extended the "look-back" to include the event, but caused duplication.
So, we would like to associate each event with the window we extended to contain it.
Meaning: ingestion_time() > ago(5m) (original rule "look-back" = 5 min).
This associates the event from the previous example with the first window.

[Diagram: the ingestion-time restriction trimming the extra 2 minutes from the extended look-back window]

As we can see, the ingestion-time restriction trims the additional 2 minutes that were added to the look-back.

And for our earlier example (Example #1), we can see that the event is captured by the second run's look-back window.

[Diagram: the delayed event from Example #1 captured by the second run's look-back window]

 

This query sums it all up:

let ingestion_delay = 2min;
let rule_look_back = 5min;
CommonSecurityLog
// widen the scanned window so delayed events are still included
| where TimeGenerated >= ago(ingestion_delay + rule_look_back)
// keep only events ingested since the previous run, so each event is alerted on exactly once
| where ingestion_time() > ago(rule_look_back)

Now that we know how to handle an ingestion delay that is already known to us, let's see how we can determine the ingestion delay ourselves.

Note: when joining multiple tables, each has its own delay, and we need to take each of them into consideration, as sketched below.
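
For example, here is a hedged sketch of how per-table delay constants could be applied on both sides of a join; the table names, delay values, and join key below are illustrative assumptions, not measured numbers:

// Illustrative only: replace each delay constant with the delay you measured for that table
let rule_look_back = 5min;
let csl_delay = 2min;      // assumed delay for CommonSecurityLog
let signin_delay = 10min;  // assumed delay for SigninLogs
CommonSecurityLog
| where TimeGenerated >= ago(csl_delay + rule_look_back)
| where ingestion_time() > ago(rule_look_back)
| join kind=inner (
    SigninLogs
    | where TimeGenerated >= ago(signin_delay + rule_look_back)
    | where ingestion_time() > ago(rule_look_back)
) on $left.SourceUserName == $right.UserPrincipalName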

 

Calculating ingestion delay:

Ingestion delay may be caused by a variety of reasons and may change according to the source of the data.

Azure Sentinel collects logs from multiple sources, which means that we may see different delays.

This blog post provides a dashboard showing latency and delays for the different data types flowing into the workspace.

In the following example, we calculated the 95th and 99th percentiles of delay by DeviceVendor and DeviceProduct (for the CommonSecurityLog table).

 

Note: when joining multiple data types, we need to understand what kind of changes to apply to the "look-back".

[Image: query results showing delay percentiles per DeviceVendor and DeviceProduct]

 

The query:

CommonSecurityLog
// per-event delay: ingestion time minus the time the event was generated
| extend delay = ingestion_time() - TimeGenerated
// 95th and 99th percentile of delay for each source
| summarize percentiles(delay, 95, 99) by DeviceVendor, DeviceProduct

Now that we have this data, we know what kind of change to apply to the "look-back" based on the delay we measured.

Breaking the results down by source emphasizes that different data types come with different delays.
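
As an illustration (not part of the rule logic shown above), the measured percentile can be folded directly into a suggested per-source "look-back"; the 95th percentile and the 5-minute rule_look_back below are assumptions to adjust to your own rules:

// Illustrative helper: suggested "look-back" = rule look-back + 95th percentile delay per source
let rule_look_back = 5min;
CommonSecurityLog
| extend delay = ingestion_time() - TimeGenerated
| summarize p95_delay = percentile(delay, 95) by DeviceVendor, DeviceProduct
| extend suggested_look_back = rule_look_back + p95_delay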

 

Important to know: Azure Sentinel scheduled alert rules are delayed by 5 minutes.

This allows data types with a smaller delay to be ingested on time for the scheduled run.

For more information, please visit our documentation.

6 Comments
Brass Contributor

@romarsia This is a very interesting article, thank you for the effort.

 

A few doubts:

 

1. 

Azure Sentinel scheduled alert rules are delayed by 5min.

This allows data types with a smaller delay to be ingested on time for the scheduled run.

For more information, please visit our documentation.

 

--> Does this 5-minute delay mean that if I set the alert to run every 5 minutes and look up data from the past 10 minutes, it actually looks up data from the last 15 minutes?

If so, does this not bring about duplication, as discussed above?

Could you please share the link to the documentation that is mentioned above that discusses this in detail?

 

 

2. 

 

 

let ingestion_delay= 2min;
let rule_look_back = 5min;
CommonSecurityLog
| where TimeGenerated >= ago(ingestion_delay + rule_look_back)
| where ingestion_time() > (rule_look_back)

 

 

 

In this query,

ingestion_time() returns a date time object (like 1/14/2021, 2:26:07.477 PM) and rule_look_back is in minutes,

 

Ex:

where ingestion_time() > 5min
 
When I tested this, the comparison operation shows an output set of rows but does not filter correctly, probably due to a mismatch in data types.
 
3.
So do we need to design all our analytics queries with the above query conditions in mind?
Also, when setting the rule logic, does the rule_look_back set in the query take precedence over the "Lookup data from the last" field set in the UI?
 
4. 
Instead of applying such modifications per rule, would it not solve these issues if the Azure Sentinel platform considered ingestion time for alert generation rather than TimeGenerated? Would that not remove the need to keep track of the delay and to modify the query per table type?
 
Copper Contributor


Very good content, @romarsia .

Thanks.

I have a doubt.

 

TimeGenerated is the time that the event was generated at the source and is reported in the source payload, right?

 

If for some reason my source does not send the time the log was created, what will the TimeGenerated time be?

 

If I'm not mistaken, for the MMA collector, if I don't have TimeGenerated specified in the log, it will be set to _TimeReceived, which is the time the log arrived at the installed agent. Is that it?

 

What would it be like for firewall logs, for example?

Microsoft

@luizao_lf  you are correct about the MMA agent.

TimeGenerated should represent the time the event was created.

How TimeGenerated is set depends on the source and the pipeline used for ingestion.

Copper Contributor

If the event is from Windows, for example, and you do not have TimeGenerated at the source, the value of TimeGenerated will be the moment the log arrived at the MMA agent.

 

But what about a firewall that does not have an MMA agent? If the time does not come from the source, what value will be sent in the TimeGenerated field? Will it be the value from when it arrives at the CEF concentrator? Or would it be something else?

Steel Contributor

@romarsia 

Thanks for the in-detail information.

 

I was just wondering, is Line 1 below actually required, when the "real" filtering is all done with Line 2?

| where TimeGenerated >= ago(ingestion_delay + rule_look_back)
| where ingestion_time() > (rule_look_back) 

 

Regarding monitoring the latest logs arriving in the last x minutes/hours, there does not seem to be any use for Line 1,

as long as the "Lookup data from the last" setting in the rule accounts for any possible ingestion delay.

 

Also, I think line 2 is missing an "ago".

| where ingestion_time() > ago(rule_look_back)

 

Microsoft

I think you want to use: ago(rule_look_back)

| where TimeGenerated >= ago(ingestion_delay + rule_look_back)
| where ingestion_time() > ago(rule_look_back)

 
