Forum Discussion

Dante Nahuel Ciai's avatar
Dante Nahuel Ciai
Brass Contributor
Jan 15, 2018
Solved

Availability on OMS

Hi everyone.

I'm trying to find a way of getting Availability of servers on OMS, but I can't find any...

By Availability I mean the % of uptime of a given server during a certain period of time.

So, if a server was up 98 of a total 100 hours, the availability for that period is 98%.

I'm looking to do that in OMS, but I'm not sure it's possible.

Thanks in advance.

  • Noa Kuperberg's avatar
    Noa Kuperberg
    Feb 13, 2018

    Sure. I tweaked it a bit to match what you ask for: 

    let start_time=startofday(datetime("2017-01-01"));
    let end_time=endofday(datetime("2017-01-31"));
    Heartbeat
    | where TimeGenerated > start_time and TimeGenerated < end_time
    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
    | summarize total_available_hours=countif(available_per_hour==true) by Computer 
    | extend total_number_of_buckets=round((end_time-start_time)/1h)
    | extend availability_rate=total_available_hours*100/total_number_of_buckets

    The first 2 lines define variables, set to the start and end time you mentioned.

     

    Next, we use these variables to limit the query to that time range: 

    | where TimeGenerated > start_time and TimeGenerated < end_time

    Then we count the heartbeats reported from each computer, in buckets (bins) of 1 hour, starting at the start time you define: 

    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer

    Now we can see how many heartbeats were reported by each computer each hour. If the number is  0 we understand the computer was probably offline at that time.

    We use a new column to mark if a computer was available or not each hour: 

    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)

    and then count the number of hours each computer was indeed "alive": 

    | summarize total_available_hours=countif(available_per_hour==true) by Computer

    Note that this way we give a little leeway for missing heartbeat reports each hour. Instead of expecting a report every 5 or 10 minutes, we only mark a computer as "unavailable" if we didn't get any report from it during a full hour.

     

    At this point we get a number for each computer, something like this: 

     

    So we know each computer was alive 11 hours in the select time range. But what does it mean? how many hours were there altogether? is this 11 out of 11 hours (100% availability) or out of 110 hours (only 10% availability)?

    Here's how we can calculate the total number of hours in the selected time range: 

    | extend total_number_of_buckets=round((end_time-start_time)/1h)+1

    I admit it might not be the best calculation of buckets.. there is probably a better way but I can't think of it now..

     finally we calculate the ratio between available hours and total hours:

    | extend availability_rate=total_available_hours*100/total_number_of_buckets

    and get this:

     

    HTH,

    Noa

36 Replies

  • Micah Castorina's avatar
    Micah Castorina
    Copper Contributor

    Does anyone have a way to restrict the script to pull heartbeats Mon-Fri and 7am-6pm? I keep getting errors. This is what I have:

    let start_time=startofday(datetime("2018-06-1 07:30:00"));
    let end_time=endofday(datetime("2018-06-30 18:00:00"));
    Heartbeat
    | where TimeGenerated > start_time and TimeGenerated < end_time
    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
    | summarize total_available_hours=countif(available_per_hour==true) by Computer
    | extend total_number_of_buckets=round((end_time-start_time)/1h)
    | extend availability_rate=total_available_hours*100/total_number_of_buckets
    • MaheshG's avatar
      MaheshG
      Copper Contributor
      I have used the query, there are some differences which is not reflecting for us in the log analytics.

      We are doing the Monthly patches for the Virtual Machines.So when i tried this query it is showing the availability rate as 100% percentage.
      But we are rebooting the server after the patching activity, the Availability report percentage should differ but it is showing 100% percent for all the servers.

      Can anyone help us on this ? how can we get the exact report ?
      • Noa Kuperberg's avatar
        Noa Kuperberg
        Icon for Microsoft rankMicrosoft

        MaheshG 

        Hi! I've update the above query to reflect your week days and hours (Mon-Fri, 07:00-17:59).

        Also, the above query considers every hour in which there was even 1 heartbeat as "up time" (available), so this is probably the part you want to tweak. The resolution here depends on your agent. If it reports a heartbeat every 5 minutes, you can do this:

         

         

        let start_time=startofday(datetime("2019-08-01 07:00:00"));
        let end_time=endofday(datetime("2019-08-30 18:00:00"));
        Heartbeat
        | where TimeGenerated >= start_time and TimeGenerated <= end_time
        | where dayofweek(TimeGenerated) >= 1d and dayofweek(TimeGenerated) <= 5d  // Monday-Friday
        | where hourofday(TimeGenerated) >= 7 and hourofday(TimeGenerated) <=17   // 7:00-17:59
        | summarize heartbeat_per_5_minutes=count() by bin_at(TimeGenerated, 5m, start_time), Computer
        | extend available_per_5_min=iff(heartbeat_per_5_minutes>0, true, false)
        | summarize total_available_buckets=countif(available_per_5_min==true)
        , total_unavailable_buckets=countif(available_per_5_min==false) by Computer
        | extend total_number_of_buckets=round(total_available_buckets+total_unavailable_buckets)
        | extend availability_rate=total_available_buckets*100/total_number_of_buckets

         

         

        Note than anyway, if the reboot was quick and the agent sends a heartbeat every 5 minutes, it might go unnoticed.

         

        HTH,

        Noa

  • jeffrey fowels's avatar
    jeffrey fowels
    Copper Contributor

    let start_time=startofday(ago(30d));
    let end_time=startofday(now());
    Heartbeat
    | where TimeGenerated > start_time and TimeGenerated < end_time
    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
    | summarize total_available_hours=countif(available_per_hour==true) by Computer
    | extend total_number_of_buckets=round((end_time-start_time)/1h)
    | extend availability_rate=total_available_hours*100/total_number_of_buckets
    | order by availability_rate desc

  • Stanislav is right, it's possible :)

    Here's an example that calculates the availability rate of each computer, starting at midnight.

    let midnight=startofday(now());
    Heartbeat
    | where TimeGenerated>midnight
    | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, midnight), Computer
    | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
    | summarize total_available_hours=countif(available_per_hour==true) by Computer 
    | extend number_of_buckets=hourofday(now())+1
    | extend availability_rate=total_available_hours*100/number_of_buckets

    Run it on our playground and tweak it as makes sense to you.

    • ScottAllison's avatar
      ScottAllison
      Iron Contributor

      Love this query. I'm having trouble modifying it to meet my needs.

       

      In addition to what this query provides, I'd also like to show the last TimeGenerated for each Computer. I can't seem to get the logic to work correctly. Any help is appreciated.

    • Dante Nahuel Ciai's avatar
      Dante Nahuel Ciai
      Brass Contributor

      Noa, your script is amazing, however i'm struggling to understand it and tweak it to my needs (30 fixed days, for example from 1st to 31 of january)

      Could you gimme a hand to understand it?

       

      let midnight=startofday(now()) ; #First part. I need to change this to between((2018-01-01) .. (2017-01-31)); am I correct?
      Heartbeat
      | where TimeGenerated>midnight
      | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, midnight) #im not sure i understand why do you use bin_at instead of just bin, Computer
      | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
      | summarize total_available_hours=countif(available_per_hour==true) by Computer 
      | extend number_of_buckets=hourofday(now())+1
      | extend availability_rate=total_available_hours*100/number_of_buckets

       

      • Noa Kuperberg's avatar
        Noa Kuperberg
        Icon for Microsoft rankMicrosoft

        Sure. I tweaked it a bit to match what you ask for: 

        let start_time=startofday(datetime("2017-01-01"));
        let end_time=endofday(datetime("2017-01-31"));
        Heartbeat
        | where TimeGenerated > start_time and TimeGenerated < end_time
        | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
        | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
        | summarize total_available_hours=countif(available_per_hour==true) by Computer 
        | extend total_number_of_buckets=round((end_time-start_time)/1h)
        | extend availability_rate=total_available_hours*100/total_number_of_buckets

        The first 2 lines define variables, set to the start and end time you mentioned.

         

        Next, we use these variables to limit the query to that time range: 

        | where TimeGenerated > start_time and TimeGenerated < end_time

        Then we count the heartbeats reported from each computer, in buckets (bins) of 1 hour, starting at the start time you define: 

        | summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer

        Now we can see how many heartbeats were reported by each computer each hour. If the number is  0 we understand the computer was probably offline at that time.

        We use a new column to mark if a computer was available or not each hour: 

        | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)

        and then count the number of hours each computer was indeed "alive": 

        | summarize total_available_hours=countif(available_per_hour==true) by Computer

        Note that this way we give a little leeway for missing heartbeat reports each hour. Instead of expecting a report every 5 or 10 minutes, we only mark a computer as "unavailable" if we didn't get any report from it during a full hour.

         

        At this point we get a number for each computer, something like this: 

         

        So we know each computer was alive 11 hours in the select time range. But what does it mean? how many hours were there altogether? is this 11 out of 11 hours (100% availability) or out of 110 hours (only 10% availability)?

        Here's how we can calculate the total number of hours in the selected time range: 

        | extend total_number_of_buckets=round((end_time-start_time)/1h)+1

        I admit it might not be the best calculation of buckets.. there is probably a better way but I can't think of it now..

         finally we calculate the ratio between available hours and total hours:

        | extend availability_rate=total_available_hours*100/total_number_of_buckets

        and get this:

         

        HTH,

        Noa

  • Hi There is no out of the box solution for this. Probably one way to calculate this is by using Heartbeat events. For Windows machines they are logged every 1 minute and for Linux every 5 minutes. You could probably calculate what should be the sum of the events for each machine and also the maximum sum for specific time frame. From those two values you can calculate the percentage. The query language is pretty rich so this should be possible.

Resources