Azure "conditional" searches in Log Analytics


Hi:

 

This is my situation. I have a scheduled task that writes a keepalive event to the event log every 15 minutes on several Windows servers.

 

I can track these events (or any other event pointing to a problem) with something like this:

 

Event
| where EventLog == "System"
| where Source == "MyEvtOrigin"                                             
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m             
| summarize events_count=count() by Computer, EventID
 
With this I can get the servers that are running in the same time window:
 
Heartbeat
| where OSType == 'Windows'
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m 
| summarize arg_max(TimeGenerated, *) by SourceComputerId // servers that are up
| top 500000 by Computer asc
 
How can I detect the case where events_count would be zero (meaning my task is no longer running) in order to trigger an alert, but only for those servers that are running?
 
When I try to do a join like this:
 
 
Heartbeat
| where OSType == 'Windows'
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m 
| summarize arg_max(TimeGenerated, *) by SourceComputerId 
| top 500000 by Computer asc
| join kind= inner (
Event
| where EventLog == "System"
| where Source == "MyEvtOrigin"
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m
| summarize events_count=count() by Computer, EventID
| where events_count < 1 
| sort by TimeGenerated asc nulls last
) on Computer
| summarize arg_max(TimeGenerated, *)by Computer, EventID ,events_count, SourceComputerId
| top 500000 by Computer asc
 
The resulting alert is triggered both when the job fails and when the server is down.
 
 
3 Replies

Hi,

The query would be:

let LiveServers = Heartbeat
| where OSType == 'Windows'
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m 
| distinct Computer;
Event
| where Computer in (LiveServers)
| where EventLog == "System"
| where Source == "MyEvtOrigin"                                             
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m             
| summarize events_count=count() by Computer, EventID

Event does not have SourceComputerId, so we use Computer as the unique name. We put the names of all servers from the first query into a table and then use that table to filter the second query. I think this will work for your case. In the first query I would not use such a precise scope for TimeGenerated; I would rather use ago(20m), for example. If you create an alert out of this, I would remove the TimeGenerated scope completely, as the time frame is set in the alert properties. Also have a look at this post on alerting on more than one column:

https://cloudadministrator.net/2018/06/08/aggregate-on-more-than-one-column-for-azure-log-search-ale...
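As a rough sketch of the time-scope suggestion above, the same query could be written with ago(20m) in the Heartbeat part and no TimeGenerated filter on Event. This assumes the query runs inside an alert rule whose own time range limits the Event rows; when running it interactively, the time range would have to be set in the portal's time picker instead.

```kusto
// Sketch only: assumes the alert rule's time range scopes the Event table.
let LiveServers = Heartbeat
| where OSType == 'Windows'
| where TimeGenerated > ago(20m)        // from 20 minutes ago until now
| distinct Computer;
Event
| where Computer in (LiveServers)       // keep only servers that sent a heartbeat
| where EventLog == "System"
| where Source == "MyEvtOrigin"
| summarize events_count=count() by Computer, EventID
```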

 

Thanks for your swift answer.

 

It works, but only partially.

I have learnt that there is another problem, related to the use of

events_count=count()

My task runs every 15 minutes, so in an hour you get 4 events, which is perfect. But it is meant to be a kind of "keep alive": if I disable the task, "events_count=count()" does not return a 0 value for that particular server; the row simply does not appear, so the alert is never triggered. Is there any way to capture 0 results in "events_count=count()"?

 

If I understand correctly, you will need to reverse the logic then:

let LiveServers = Event
| where EventLog == "System"
| where Source == "MyEvtOrigin"                                             
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m             
| summarize events_count=count() by Computer, EventID
| distinct Computer;
Heartbeat
| where TimeGenerated > now()-20m and TimeGenerated < now()-5m
| where OSType == 'Windows'
| where Computer notin (LiveServers) 

 

The logic for the above query is:

- Find all computers that have logged my keepalive event in a certain period and put them into a table

- Find all Windows computers that are producing heartbeat events and keep only those that are not in the above table
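If an explicit events_count of 0 is wanted in the result (as asked earlier in the thread), one possible alternative is a left outer join from the live servers to the event counts. This is only a sketch: the 20-minute window is an assumption, and it relies on Computer holding the same name in both Heartbeat and Event.

```kusto
// Sketch only: the 20m window is an assumption; adjust it to the task schedule.
Heartbeat
| where OSType == 'Windows'
| where TimeGenerated > ago(20m)
| distinct Computer                                  // servers that are up
| join kind=leftouter (
    Event
    | where EventLog == "System"
    | where Source == "MyEvtOrigin"
    | where TimeGenerated > ago(20m)
    | summarize events_count = count() by Computer
  ) on Computer
| extend events_count = coalesce(events_count, 0)    // no matching row means no events
| where events_count == 0
| project Computer, events_count
```

Servers with no heartbeat never enter the left side of the join, so a server that is down should not appear in the result; only running servers without keepalive events are returned.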

 

You will have to figure out the timings on your own. I usually restrict time only from a time in the past until now. Especially for alerts as there you specify the time frame in the alert properties.