SOLVED

Alerting when data are missing

Copper Contributor

Hello,

I would like an incident to be raised if there is a gap of, say, 1 hour or more in ingested data for key built-in Sentinel data connectors or custom integrations.

 

For CommonSecurityLog, which holds our CEF data, I was thinking of something like the query below, which shows when data was last received.

 

Would something similar be applicable to data connectors? How do you monitor data ingestion? Our management expects us to follow up with data source owners if logs are delayed. Thank you.

 

// Vendors to exclude from the check
let Sources = dynamic(["Incapsula", "Cyber-Ark", "ArcSight"]);
CommonSecurityLog
| where isnotempty(DeviceVendor) and DeviceVendor !in (Sources)
// The {selected...} placeholders are template parameters (for example, from a workbook)
| where (DeviceVendor == '{selectedDeviceVendor}' or '{selectedDeviceVendor}' == "All") and (DeviceProduct == '{selectedDeviceProduct}' or '{selectedDeviceProduct}' == "All")
// Time of the most recent event per vendor and product
| summarize LastLogReceived = max(TimeGenerated) by DeviceVendor, DeviceProduct
// Seconds since the most recent event
| extend Heartbeat = datetime_diff('second', now(), LastLogReceived)
// Flag anything silent for more than 3600 seconds (1 hour), reported in whole hours
| extend HeartBeatMessage = iff(Heartbeat > 3600, strcat("Not active since ", Heartbeat / 3600, " hours ago"), "Active Logs Received")
| project DeviceVendor, DeviceProduct, Heartbeat, HeartBeatMessage

 

3 Replies
Thanks, I saw the post for data connectors, but is there a recommended way to monitor custom ingestion? We had an incident and missed some data from a CEF source. I was hoping Microsoft would have some formal advice, as even a 15 minute loss of CEF data could be a serious problem from a compliance perspective. We have already been unable to investigate several incidents.
best response confirmed by T150732D (Copper Contributor)
Solution
This gets better with ASIM (although not all network vendors are covered yet): https://docs.microsoft.com/en-us/azure/sentinel/network-normalization-schema

1. You can use the methods above to see if the whole CommonSecurityLog table received anything within a time period such as 15 minutes. This tends to work if you only have one or two sending systems, and it only catches a full failure, not the case where one of two sending hosts has failed. It can still be a good rule to have (maybe run the same test for Syslog as well as CEF, if you have that). A one hour threshold might be a good, safe value to reduce false positives, but it does mean the whole solution could have been down for up to 59 minutes.

2. What you will also need, and I think this is the preferred approach, is to monitor each sending device. This is where ASIM helps identify what those devices are. For the product you are using, you need to find something like the sending computer name or IP; this is often in the AdditionalExtensions column, and you need to parse it out to get the device name or IP.
It can take some work to find the sending source rather than the CEF server receiving the data (which is often the Computer column), and this often varies per CEF vendor.

You then need to check each of these devices or IPs for ingestion delays longer than 15 minutes.

3. Another check could be to reference the Heartbeat table for the agents and check when each agent last sent a heartbeat. You might union the results with CEF, in case an agent is reported as down but is actually still working (unlikely, but it could happen).
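
As a rough sketch of points 1 and 2, the two queries below could be a starting point for scheduled analytics rules. The 1 hour threshold and 7 day lookback are assumed values, not recommendations. Grouping by DeviceVendor, DeviceProduct and Computer is only a stand-in for the real sending device, which, as noted above, often has to be parsed out per vendor.

```kql
// Run each query on its own.

// Point 1: whole-table check. summarize without a "by" clause still returns
// one row on empty input, so this yields a row only when nothing arrived.
let threshold = 1h;   // assumed alert threshold
CommonSecurityLog
| where TimeGenerated > ago(threshold)
| summarize EventCount = count()
| where EventCount == 0

// Point 2: per-source check. Sources seen during the lookback period
// that have been silent for longer than the threshold.
let lookback = 7d;    // assumed: how far back to learn which sources normally report
let threshold = 1h;   // assumed alert threshold
CommonSecurityLog
| where TimeGenerated > ago(lookback)
| where isnotempty(DeviceVendor)
| summarize LastLogReceived = max(TimeGenerated) by DeviceVendor, DeviceProduct, Computer
| extend SilentFor = now() - LastLogReceived
| where SilentFor > threshold
| project DeviceVendor, DeviceProduct, Computer, LastLogReceived, SilentFor
| order by SilentFor desc
```

A source that has been silent for longer than the lookback period drops out of the second query entirely, so the lookback should be comfortably longer than the longest outage you want to keep reporting.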

Summary:
Unfortunately this is complex, and until full normalization (ASIM) or SentinelHealth is supported for all tables, there are gaps.
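
For point 3, a minimal sketch of the agent heartbeat check might look like the query below. It assumes that the collector names in the Heartbeat table match the Computer column in CommonSecurityLog, which may not hold if one table uses an FQDN and the other a short name. The 1 day lookback and 15 minute threshold are also assumptions.

```kql
let lookback = 1d;     // assumed: period used to find CEF collectors
let threshold = 15m;   // assumed: maximum acceptable heartbeat age
CommonSecurityLog
| where TimeGenerated > ago(lookback)
| distinct Computer                      // collectors that forwarded CEF recently
| join kind=leftouter (
    Heartbeat
    | where TimeGenerated > ago(lookback)
    | summarize LastHeartbeat = max(TimeGenerated) by Computer
  ) on Computer
// Keep collectors with no heartbeat in the lookback, or only a stale one
| where isnull(LastHeartbeat) or LastHeartbeat < ago(threshold)
| project Computer, LastHeartbeat
```

Collectors with no heartbeat at all during the lookback period appear with an empty LastHeartbeat, which is why the left outer join is used instead of querying Heartbeat alone.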