Nov 21 2019
06:32 AM
- last edited on
Apr 08 2022
10:12 AM
by
TechCommunityAP
Nov 21 2019
06:32 AM
- last edited on
Apr 08 2022
10:12 AM
by
TechCommunityAP
Hi Team,
I am trying to write a KQL query to catch if any single heartbeat missed.
Like we could see in my below screenshot, this server is sending heartbeat after every minute interval.
And now there is gap in heartbeat when i stopped the scx service, so now i want to track if any single heartbeat will miss then i should have an alert notification.
Nov 22 2019 09:45 AM
Solution
personally I prefer the example query of
// Availability rate
// Calculate the availability rate of each connected computer
Heartbeat
// bin_at is used to set the time grain to 1 hour, starting exactly 24 hours ago
| summarize heartbeatPerHour = count() by bin_at(TimeGenerated, 1h, ago(24h)), Computer
| extend availablePerHour = iff(heartbeatPerHour > 0, true, false)
| summarize totalAvailableHours = countif(availablePerHour == true) by Computer
| extend availabilityRate = totalAvailableHours*100.0/24
Heartbeats are expected to be missed (pauses, glitches, load etc...) and the data will catch-up - so you may get false positives.
You can use a date_diff to compare
Go to Log Analytics and Run Query
Heartbeat
| where TimeGenerated >= ago(1h)
| where Computer == "hardening-demo"
| project Computer, TimeGenerated
| order by TimeGenerated desc
| project n = TimeGenerated, nminus = prev(TimeGenerated), TimeGenerated, Computer
| where isnotempty(nminus)
// show time NOW vs time n -1 row
| extend second = datetime_diff('second',nminus, n)
| where second >= 60
Results for seconds below 60 (mainly 9 and 51 for the demo data) - just remove the last line of the above query to see this
n | nminus | TimeGenerated | Computer | second |
---|---|---|---|---|
2019-11-22T17:42:37.88Z | 2019-11-22T17:42:46.523Z | 2019-11-22T17:42:37.88Z | hardening-demo | 9 |
2019-11-22T17:41:46.52Z | 2019-11-22T17:42:37.88Z | 2019-11-22T17:41:46.52Z | hardening-demo | 51 |
2019-11-22T17:41:37.877Z | 2019-11-22T17:41:46.52Z | 2019-11-22T17:41:37.877Z | hardening-demo | 9 |
2019-11-22T17:40:46.52Z | 2019-11-22T17:41:37.877Z | 2019-11-22T17:40:46.52Z | hardening-demo | 51 |
2019-11-22T17:40:37.873Z | 2019-11-22T17:40:46.52Z | 2019-11-22T17:40:37.873Z | hardening-demo | 9 |
Jan 28 2020 10:46 AM
@CliveWatson Just to add to this conversation, I've come up with a slightly different way of doing this--would love feedback:
let current = now();
let ostype = 'Windows';
let computername = '';
let environment = 'Non-Azure';
let threshold = 600;
Heartbeat
| where TimeGenerated >= ago(1h)
// --for a specific computer:
| where Computer contains computername
// --for a specific computer group:
//| where Computer in (group)
// --for a specific OS type:
| where OSType contains ostype
// --for on-prem or Azure VMs:
| where ComputerEnvironment contains environment
| project Computer, TimeGenerated, current
| order by TimeGenerated desc
| project nminus = prev(TimeGenerated), current, Computer
| where isnotempty(nminus)
| extend ['LastHeartbeat (in seconds)'] = datetime_diff('second', current, nminus)
| summarize arg_max(nminus, *) by Computer
| where ['LastHeartbeat (in seconds)'] >= threshold
| project Computer, QueryTime = current, LastTimeStamp = nminus, ['LastHeartbeat (in seconds)']
Jan 29 2020 10:23 AM
Looks good @Scott Allison , I would just swap contains to has as per best practise https://docs.microsoft.com/en-us/azure/kusto/query/best-practices
Jan 29 2020 12:02 PM
@CliveWatson Thanks! I've seen weirdness with has versus contains. I haven't noted what that weirdness is, but if I run across it again, I'll be sure to share.
Feb 03 2020 04:37 AM
@CliveWatson - here's a perfect example of why the HAS operator isn't useful for many operations:
This query returns the expected results every time:
Heartbeat
| where Computer contains 'abc'
| distinct Computer
For example, this would return:
SERVERABC1
SERVERABC2
COMPUTERABC24
When I replace CONTAINS with HAS, I get 0 results. So in 99% of my use cases, HAS doesn't work at all.
Feb 03 2020 05:07 AM
That is the behavior I'd expect
From the docs:
Prefer has
operator over contains
when looking for full tokens. has
is more performant as it doesn't have to look-up for substrings.
What does that mean in practice:
1. This query example will fail (as its not a substring). Computers named: aks-nodepool1.nnnnnnnnn
Go to Log Analytics and run query
Computer |
---|
aks-nodepool1-25494468-4 |
aks-nodepool1-25494468-1 |
Feb 03 2020 11:12 AM
@CliveWatson Definitely makes sense. Today, I don't have but a few use cases to use HAS (querying Event Logs or Syslog comes to mind). Your explanation clears things up for me.
Appreciate it!
May 19 2022 08:29 AM - edited May 20 2022 02:45 PM
Data source: Azure
Below query is based on Events which are registered and cleared after a while.
Heartbeat
// list records for last 30 days:
| where TimeGenerated > ago(30d)
| summarize LastCall = arg_max(TimeGenerated,*) by Computer
// retrieve machines that have not sent a heartbeat in the last 4 hours:
| where LastCall < ago(4h)
| project Computer, LastCall, timestamp = format_datetime(LastCall, 'MM-dd-yyyy hh:mm'),
startofday = format_datetime(startofday(now()), 'MM-dd-yyyy')
,VMUUID, SourceComputerId, ComputerIP, ResourceId
| sort by LastCall
The above KustoQL query runs every day, so today it may return 10 Computers (10 events), tomorrow 7 computers (7 events), etc.
Want to keep records of all events (today, tomorrow, etc.) for 31 days.
How to use Power BI or Power BI Dataflow to achieve above ,
otherwise how do I get data from (Log Analytics workspaces) into Azure SQL Database and apply incremental refresh (so todays records don't overwrite yesterdays records) ?
May 19 2022 10:48 AM
You can store the data for longer (at a cost) by increasing retention. You can also see the others days from within the query
Heartbeat
| where TimeGenerated > startofday(ago(30d))
| summarize count(), LastCall = max(TimeGenerated) by Computer, bin(TimeGenerated,1d)
| where LastCall < ago(1m)
| render columnchart
May 20 2022 06:09 AM - edited May 20 2022 06:11 AM
If I run your query, I get 4 rows returned for today. So it's like a snapshot for today, when query was run. How can I get a snapshot for everyday over the last 30 days where VM downtime is 4 hours or longer, time for getting snapshot is 9am daily ?
Nov 22 2019 09:45 AM
Solution
personally I prefer the example query of
// Availability rate
// Calculate the availability rate of each connected computer
Heartbeat
// bin_at is used to set the time grain to 1 hour, starting exactly 24 hours ago
| summarize heartbeatPerHour = count() by bin_at(TimeGenerated, 1h, ago(24h)), Computer
| extend availablePerHour = iff(heartbeatPerHour > 0, true, false)
| summarize totalAvailableHours = countif(availablePerHour == true) by Computer
| extend availabilityRate = totalAvailableHours*100.0/24
Heartbeats are expected to be missed (pauses, glitches, load etc...) and the data will catch-up - so you may get false positives.
You can use a date_diff to compare
Go to Log Analytics and Run Query
Heartbeat
| where TimeGenerated >= ago(1h)
| where Computer == "hardening-demo"
| project Computer, TimeGenerated
| order by TimeGenerated desc
| project n = TimeGenerated, nminus = prev(TimeGenerated), TimeGenerated, Computer
| where isnotempty(nminus)
// show time NOW vs time n -1 row
| extend second = datetime_diff('second',nminus, n)
| where second >= 60
Results for seconds below 60 (mainly 9 and 51 for the demo data) - just remove the last line of the above query to see this
n | nminus | TimeGenerated | Computer | second |
---|---|---|---|---|
2019-11-22T17:42:37.88Z | 2019-11-22T17:42:46.523Z | 2019-11-22T17:42:37.88Z | hardening-demo | 9 |
2019-11-22T17:41:46.52Z | 2019-11-22T17:42:37.88Z | 2019-11-22T17:41:46.52Z | hardening-demo | 51 |
2019-11-22T17:41:37.877Z | 2019-11-22T17:41:46.52Z | 2019-11-22T17:41:37.877Z | hardening-demo | 9 |
2019-11-22T17:40:46.52Z | 2019-11-22T17:41:37.877Z | 2019-11-22T17:40:46.52Z | hardening-demo | 51 |
2019-11-22T17:40:37.873Z | 2019-11-22T17:40:46.52Z | 2019-11-22T17:40:37.873Z | hardening-demo | 9 |