
Dante Nahuel Ciai
Jan 15, 2018

Availability on OMS

Hi everyone. I'm trying to find a way of getting Availability of servers on OMS, but I can't find any... By Availability I mean the % of uptime of a given server during a certain period of time. ...
  • Noa Kuperberg
    Feb 13, 2018

    Sure. I tweaked it a bit to match what you asked for:

    let start_time=startofday(datetime("2017-01-01"));
    let end_time=endofday(datetime("2017-01-31"));
    Heartbeat
    | where TimeGenerated > start_time and TimeGenerated < end_time
    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
    | summarize total_available_hours=countif(available_per_hour==true) by Computer 
    | extend total_number_of_buckets=round((end_time-start_time)/1h)
    | extend availability_rate=total_available_hours*100/total_number_of_buckets

    The first 2 lines define variables, set to the start and end time you mentioned.

     

    Next, we use these variables to limit the query to that time range: 

    | where TimeGenerated > start_time and TimeGenerated < end_time

    Then we count the heartbeats reported from each computer, in buckets (bins) of 1 hour, starting at the start time you define: 

    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer

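    Now we can see how many heartbeats were reported by each computer each hour. If a computer reported no heartbeats in a given hour, that hour has no row for it at all (summarize only returns bins that contain data), and we understand the computer was probably offline at that time.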

    We use a new column to mark if a computer was available or not each hour: 

    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)

    and then count the number of hours each computer was indeed "alive": 

    | summarize total_available_hours=countif(available_per_hour==true) by Computer

    Note that this way we give a little leeway for missing heartbeat reports each hour. Instead of expecting a report every 5 or 10 minutes, we only mark a computer as "unavailable" if we didn't get any report from it during a full hour.
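
    To make the hourly counting concrete, here is a small sketch that runs the same steps on made-up data instead of the Heartbeat table. The computer names (srv-a, srv-b), the timestamps and the 4-hour window are all assumptions chosen for illustration, not real results.

    ```kusto
    // Synthetic heartbeats for two hypothetical computers; not real data.
    let start_time=datetime(2017-01-01 00:00:00);
    let end_time=datetime(2017-01-01 04:00:00);
    datatable(TimeGenerated:datetime, Computer:string)
    [
        datetime(2017-01-01 00:10:00), "srv-a",
        datetime(2017-01-01 01:10:00), "srv-a",
        datetime(2017-01-01 02:10:00), "srv-a",
        datetime(2017-01-01 03:10:00), "srv-a",
        datetime(2017-01-01 00:10:00), "srv-b",   // two heartbeats in the same hour...
        datetime(2017-01-01 00:40:00), "srv-b",   // ...still count as one available hour
        datetime(2017-01-01 03:10:00), "srv-b"    // nothing from srv-b between 01:00 and 03:00
    ]
    | where TimeGenerated > start_time and TimeGenerated < end_time
    // One row per computer per hour that has at least one heartbeat
    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
    // Count the hours in which each computer reported at all
    | summarize total_available_hours=countif(available_per_hour==true) by Computer
    ```

    With this data srv-a should come out with 4 available hours and srv-b with 2. The two silent hours for srv-b never show up as rows with a count of 0; they are simply missing from the hourly summary.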

     

    At this point we get a number for each computer, something like this: 

     

    So we know each computer was alive for 11 hours in the selected time range. But what does that mean? How many hours were there altogether? Is this 11 out of 11 hours (100% availability) or 11 out of 110 hours (only 10% availability)?

    Here's how we can calculate the total number of hours in the selected time range: 

    | extend total_number_of_buckets=round((end_time-start_time)/1h)

    I admit it might not be the best calculation of buckets. There is probably a better way, but I can't think of it now.
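
    The bucket count can be checked on its own with a print statement, without touching the Heartbeat table. This is only a sanity check of the arithmetic for the January range defined at the top; the column names (hours_in_range, buckets, buckets_plus_one) are made up for this sketch.

    ```kusto
    let start_time=startofday(datetime("2017-01-01"));
    let end_time=endofday(datetime("2017-01-31"));
    print hours_in_range=(end_time-start_time)/1h,             // just under 744, since endofday stops a tick before midnight
          buckets=round((end_time-start_time)/1h),              // rounds up to the full 31 days * 24 hours
          buckets_plus_one=round((end_time-start_time)/1h)+1    // one more than the hourly bins that can hold data
    ```

    For this range the rounded value should be 744, which matches the number of 1-hour bins between the start of January 1 and the end of January 31. Adding 1 to it would give 745, so a computer that reported every single hour could never reach 100%.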

    Finally, we calculate the ratio between available hours and total hours:

    | extend availability_rate=total_available_hours*100/total_number_of_buckets
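
    As a quick check of that last step, the same formula can be evaluated for a single hypothetical value. The figure of 372 available hours is an assumption picked for illustration (half of the January range), not something taken from real data.

    ```kusto
    let start_time=startofday(datetime("2017-01-01"));
    let end_time=endofday(datetime("2017-01-31"));
    // Hypothetical: a computer that reported in 372 of the hourly buckets
    print total_available_hours=372
    | extend total_number_of_buckets=round((end_time-start_time)/1h)
    | extend availability_rate=total_available_hours*100/total_number_of_buckets
    ```

    This should give an availability_rate of 50. The bucket count comes from dividing two timespans, so it is a real number and the final division keeps its fractional part. If you convert the bucket count to a whole number first, expect integer division and a truncated percentage.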

    and get this:

     

    HTH,

    Noa
