SOLVED

Help understanding Processor counters

Hi all

I'm trying to create a good query for Log Analytics to measure CPU average usage and peaks in order to determine whether the VM is under/over utilized.

For a long time I've been using this query:

 

Perf
| where CounterName == "% Processor Time" and TimeGenerated > ago(30d)
| summarize avgCPU = avg(CounterValue) by Computer
| where avgCPU < 30

 

 

However, I started researching a bit more so I could add peaks and that messed things up quite a bit.

 

First of all, I discovered that if I added "Processor" as a filter on ObjectName, the average differed greatly from the result without that filter.

On closer inspection I noticed that with the "Processor" filter the query returns only the instance named "_Total", whereas without it the query returns every process instance as well, including _Total and _Idle.
So, which one is more accurate for determining the average CPU utilization of a VM across all cores at any given time?
Also, if the answer is not to filter on the Processor ObjectName, why does the query return the _Idle and _Total instances as well? Isn't that bad for calculating the average? Shouldn't I exclude these two?


The second issue arose when I included the max value of the counter to try to capture CPU peaks.
So, when using this query:

 

Perf
| where CounterName == "% Processor Time" and TimeGenerated > ago(1d)
| summarize avgCPU = avg(CounterValue), maxCPU = max(CounterValue) by Computer
| where avgCPU < 30

 

I got values that were way over 100. Since this should be a measure of total CPU on a scale where 100% is full utilization, I think this is wrong, but I'm not sure why.
For some VMs I get values such as 1000 or 450, which makes no sense.
Can you help me understand why?
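
A likely explanation, offered as an assumption about this workspace rather than something confirmed in the thread: on Windows, "% Processor Time" is collected under both the Processor object and the Process object. Per-process values are measured relative to a single core, so on a multi-core VM they can exceed 100 (up to roughly 100 times the number of logical cores), which would account for readings such as 450 or 1000. A minimal diagnostic sketch, using only the standard Perf table columns, that shows which objects and instances feed the unfiltered aggregate:

```kusto
// Break the unfiltered counter down by object and instance to see
// which series push the maximum above 100.
Perf
| where TimeGenerated > ago(1d)
| where CounterName == "% Processor Time"
| summarize samples = count(), avgValue = avg(CounterValue), maxValue = max(CounterValue)
    by Computer, ObjectName, InstanceName
| order by maxValue desc
```

If rows with ObjectName "Process" top the list, the unfiltered average and maximum are mixing per-process series with the per-VM total.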

Thanks in advance.

4 Replies
best response confirmed by Stanislav Zhelyazkov (MVP)
Solution

Hi @Dante Nahuel Ciai

The right query will be:

Perf
| where CounterName =~ '% Processor Time' and ObjectName =~ 'Processor' and InstanceName =~ '_Total' 
| summarize AggregatedValue = avg(CounterValue) by _ResourceId

or if you have on-premises VMs

Perf
| where CounterName =~ '% Processor Time' and ObjectName =~ 'Processor' and InstanceName =~ '_Total' 
| summarize AggregatedValue = avg(CounterValue) by Computer

Basically, you only need the _Total values for the counter. Besides the average, you can also use percentile(). I am not sure how well max() will work for you, because a VM could hit 100% CPU for a single second and sit as low as 1% the rest of the time. Overall, it depends on your logic and the kind of analysis you want to do.
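
As a sketch of the percentile suggestion, the screening query from the question could be combined with the filters above. The 30-day window and the avgCPU < 30 threshold come from the question; the column names avgCPU and p95CPU are illustrative only:

```kusto
// Average and 95th percentile of total CPU per machine,
// keeping only machines whose average stays under 30%.
Perf
| where TimeGenerated > ago(30d)
| where CounterName =~ '% Processor Time' and ObjectName =~ 'Processor' and InstanceName =~ '_Total'
| summarize avgCPU = avg(CounterValue), p95CPU = percentile(CounterValue, 95) by Computer
| where avgCPU < 30
| order by p95CPU asc
```

A machine with a low avgCPU and a low p95CPU is a stronger candidate for downsizing than one whose average is low but whose p95CPU is near 100.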

@Stanislav Zhelyazkov Thank you for the answer. I came up with the same thing over the weekend. I removed max() and went for the 95th percentile instead, and I check that value. If I understood the counter correctly, it means that for 95% of the sampled time the counter value is below that value.
So, if I run
percentiles(CPU, 5, 50, 95) and get
0.5, 20, 100

it means that 5% of the time the CPU is below 0.5%, 50% of the time it is below 20%, and 95% of the time it is below 100%.
Is that correct?

Also, I could combine bin() with max(), correct?
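
For reference, a sketch of what that call might look like against the Perf table. CPU in the post is shorthand; here it is assumed to mean CounterValue filtered to the _Total processor instance, and the 7-day window is arbitrary:

```kusto
// 5th, 50th and 95th percentiles of total CPU per machine.
Perf
| where TimeGenerated > ago(7d)
| where CounterName =~ '% Processor Time' and ObjectName =~ 'Processor' and InstanceName =~ '_Total'
| summarize percentiles(CounterValue, 5, 50, 95) by Computer
```

Kusto approximates percentiles, so the returned values are estimates and not exact order statistics.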

@Dante Nahuel Ciai Not sure I can explain it better than the Kusto article or Wikipedia (https://en.wikipedia.org/wiki/Percentile#The_Nearest_Rank_method), but I can give you an example where this is used a lot: measuring latency for websites, where the average is not so important. There you use a percentile because you want 95% of the customers not to experience high latency. Overall, your explanation is correct. You can use bin(), which slices the data into time bins, but the result really depends on the bin size. Overall, I do not think max is suitable for processor time. For example, let's say the CPU goes to 100% for one second every hour. If you slice your data into 1-hour bins and calculate the maximum, you will see a maximum of 100% every hour, but does that give you any insight into whether your VM is performing poorly?
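
To illustrate the point about bins, a sketch that computes the hourly maximum and the hourly 95th percentile side by side. The 1-hour bin matches the example above; the 1-day window and the column names are assumptions:

```kusto
// Hourly max vs. hourly 95th percentile of total CPU per machine.
Perf
| where TimeGenerated > ago(1d)
| where CounterName =~ '% Processor Time' and ObjectName =~ 'Processor' and InstanceName =~ '_Total'
| summarize maxCPU = max(CounterValue), p95CPU = percentile(CounterValue, 95)
    by Computer, bin(TimeGenerated, 1h)
| order by Computer asc, TimeGenerated asc
```

A brief spike raises maxCPU for its entire bin, while p95CPU stays close to the typical load unless the high usage is sustained.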

Hi Dante, there are a couple of workbooks that we ship in Azure Monitor that may help. Azure Portal>Monitor>Workbooks>Virtual Machine ... the one named Performance Analysis uses data collected by AzMon for VMs in the InsightsMetrics table. The one named Perf Counters offers a similar view but uses the Perf table. You can choose the counter of interest and multiple ways of aggregating the data (e.g. avg, P80, P95) as well as which aggregation to use for the trend line. Cheers, Scott
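
For readers who want to query the InsightsMetrics table directly instead of going through the workbook, a rough sketch follows. It assumes VM insights is enabled and that CPU utilization is stored with Namespace "Processor" and Name "UtilizationPercentage"; both values should be verified against the workspace before relying on them:

```kusto
// Average and 95th percentile CPU utilization from VM insights data.
// Namespace and Name values are assumptions, not taken from this thread.
InsightsMetrics
| where TimeGenerated > ago(1d)
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize avgCPU = avg(Val), p95CPU = percentile(Val, 95) by Computer
```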