Oct 08 2020 09:28 AM
I'm trying to create a good query for Log Analytics to measure CPU average usage and peaks in order to determine whether the VM is under/over utilized.
For a long time I've been using this query:
Perf | where CounterName == "% Processor Time" and TimeGenerated > ago(30d) | summarize avgCPU = avg(CounterValue) by Computer | where avgCPU < 30
However, I started researching a bit more so I could add peaks, and that complicated things quite a bit.
First of all, I discovered that if I added "Processor" as a filter on "ObjectName", the average would vary greatly from the result given without that filter.
On closer inspection, I noticed that filtering on Processor returns only the instance named "_Total", whereas without that filter the query returns all processes, including _Total and _Idle.
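One way to see that breakdown is to group the samples by object and instance. This is a minimal sketch, assuming the standard Perf columns (ObjectName, InstanceName, CounterValue) and a one-day window chosen only for illustration:

```kusto
// List every object/instance pair that reports "% Processor Time",
// with its sample count, average and maximum value.
Perf
| where TimeGenerated > ago(1d)
| where CounterName == "% Processor Time"
| summarize Samples = count(), avgValue = avg(CounterValue), maxValue = max(CounterValue)
    by ObjectName, InstanceName
| order by ObjectName asc, maxValue desc
```

If rows appear for an object other than Processor (for example Process), their values are mixed into any average that does not filter on ObjectName. On Windows, the Process object's "% Processor Time" is not capped at 100 on multi-core machines, which is one plausible source of very large values.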
So, which one would be more accurate to determine the average utilization of the CPU of a VM, for all the cores at any given time?
Also, if the answer is not to use the Processor ObjectName as a filter, why is the query returning the _Idle and _Total processes as well? Isn't this bad for calculating the average? Shouldn't I exclude these two?
The second issue arose when I included the max value for the counter, trying to get CPU peaks.
So, when using this query:
Perf | where CounterName == "% Processor Time" and TimeGenerated > ago(1d) | summarize avgCPU = avg(CounterValue), maxCPU = max(CounterValue) by Computer | where avgCPU < 30
I got values that were way over 100. Since this should be a percentage of total CPU with a ceiling of 100%, I think this is wrong, but I'm not sure why.
For some VMs I get values such as 1000 or 450, which make no sense.
Can you help me understand why?
Thanks in advance.
Oct 12 2020 06:17 AM Solution
The right query will be:
Perf | where CounterName =~ '% Processor Time' and ObjectName =~ 'Processor' and InstanceName =~ '_Total' | summarize AggregatedValue = avg(CounterValue) by _ResourceId
or, if you have on-premises VMs:
Perf | where CounterName =~ '% Processor Time' and ObjectName =~ 'Processor' and InstanceName =~ '_Total' | summarize AggregatedValue = avg(CounterValue) by Computer
Basically, you only need the _Total values for the counter. Besides the average, you can also use percentile(). I am not sure how useful max() will be for you, as a VM could have hit 100% CPU for a single second and stayed as low as 1% the rest of the time. Overall, it depends on your logic and the kind of analysis you want to do.
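As a sketch of how the average and percentile() could be combined in one query: the 30-day window and the avgCPU < 30 threshold are taken from the question, while the column names p50CPU and p95CPU are only illustrative.

```kusto
// Average plus median and 95th percentile of total CPU per computer,
// restricted to the _Total instance of the Processor object.
Perf
| where TimeGenerated > ago(30d)
| where ObjectName =~ "Processor"
    and CounterName =~ "% Processor Time"
    and InstanceName =~ "_Total"
| summarize avgCPU = avg(CounterValue),
            p50CPU = percentile(CounterValue, 50),
            p95CPU = percentile(CounterValue, 95)
    by Computer
| where avgCPU < 30          // threshold from the question; adjust to your own policy
| order by p95CPU asc
```

A computer with a low avgCPU and a low p95CPU is a stronger candidate for being underutilized than one where only the average is low.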
Oct 12 2020 07:19 AM
@Stanislav Zhelyazkov Thank you for the answer. I came up with the same thing over the weekend. I removed max() and went for the 95th percentile instead, and then I check that value. If I understood the counter correctly, it means that for 95% of the sampled time the counter value is below that value.
So, if I run percentiles(CPU, 5, 50, 95) and I get 0.5, 20 and 100, it means that 5% of the time the CPU is below 0.5%, 50% of the time it is below 20%, and 95% of the time it is below 100%. Is that correct?
Also, I could use bin to use max(), correct?
Oct 12 2020 07:49 AM
@Dante Nahuel Ciai Not sure if I can explain it better than the Kusto article or Wikipedia (https://en.wikipedia.org/wiki/Percentile#The_Nearest_Rank_method), but I can give you an example where this is used a lot. It is used for measuring website latency, where the average is not so important. There you use a percentile instead, because you want 95% of your customers not to experience high latency. Overall, your explanation is also correct. You can use bin(), which will slice the data into time bins, but the result really depends on the bin size. Overall, I do not think max is suitable for processor time. For example, let's say that every hour the CPU goes to 100% for a second. If you slice your data into bins of 1 hour and calculate the maximum, you will get a maximum of 100% for every hour, but does that give you any insight into whether your VM is performing badly?
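To make the bin() point concrete, here is a minimal sketch that computes the average, 95th percentile and maximum side by side per hour. The 1-hour bin and 1-day window are assumptions for illustration, not recommendations.

```kusto
// Hourly average, 95th percentile and maximum of total CPU per computer.
Perf
| where TimeGenerated > ago(1d)
| where ObjectName =~ "Processor"
    and CounterName =~ "% Processor Time"
    and InstanceName =~ "_Total"
| summarize avgCPU = avg(CounterValue),
            p95CPU = percentile(CounterValue, 95),
            maxCPU = max(CounterValue)
    by Computer, bin(TimeGenerated, 1h)
| order by Computer asc, TimeGenerated asc
```

In hours where maxCPU is near 100 but p95CPU stays low, the peak was probably momentary, which is the case described above where max() alone adds little insight.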
Oct 20 2020 09:45 PM