Core Infrastructure and Security Blog

Augmenting Azure Advisor Cost Recommendations for Automated Continuous Optimization – Part 3

hspinto (Microsoft)
Jul 27, 2020

This is the third post of a series dedicated to the implementation of automated Continuous Optimization with Azure Advisor Cost recommendations. For context, please read the introductory post for an overview of the solution and the second post for the details and deployment of its main components.

 

Introduction

 

If you read the previous posts and deployed the Azure Optimization Engine solution, by now you have a Log Analytics workspace containing Advisor Cost recommendations as well as Virtual Machine properties and performance metrics, collected on a daily basis. We have all the historical data needed to augment Advisor recommendations and help us validate and, ultimately, automate VM right-size remediations. As a bonus, we now have a change history of our Virtual Machine assets and Advisor recommendations, which can also be helpful for other purposes.

 

So, what else is needed? Well, we first need to generate the augmented Advisor recommendations, by adding performance metrics and VM properties to each recommendation, and then store them in a repository that can be easily consumed and manipulated both by visualization tools and by automated remediation tasks. Finally, we visualize these and further recommendations with a simple Power BI report.

 

Anatomy of a recommendation

 

There isn’t much to invent here, as the Azure Advisor recommendation schema fits our purpose very well. We just need to add some other relevant fields to this schema:

 

  • Fit score – each recommendation type will have its own algorithm to compute the fit score. For example, for VM right-size recommendations, we’ll calculate it based on the VM metrics and whether the target SKU meets the storage and networking requirements.
  • Details URL – a link to a web page where we can see the actual justification for the recommendation (e.g., the results of a Log Analytics query chart showing the performance history of a VM).
  • Additional information – a JSON-formatted value containing recommendation-specific details (e.g., current and target SKUs, estimated savings, etc.).
  • Tags – if the target resource contains tags, we’ll just add them to the recommendation, as this may be helpful for reporting purposes.
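
As a rough illustration of the extended schema, the sketch below models an augmented recommendation as a Python dataclass. The class name, the field names and the subset of Advisor fields shown are assumptions made for this example only; they are not the actual AOE column names.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AugmentedRecommendation:
    # Illustrative subset of the fields coming from the Advisor schema
    instance_id: str
    category: str
    impact: str
    description: str
    # Fields added by the augmentation step (names are hypothetical)
    fit_score: float = -1.0  # -1 means the fit score was not computed
    details_url: str = ""
    additional_info: dict = field(default_factory=dict)  # e.g. current/target SKU
    tags: dict = field(default_factory=dict)  # copied from the target resource

    def to_json(self) -> str:
        """Serialize the recommendation so it can be exported or stored."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the recommendation-specific details in a free-form `additional_info` value means new recommendation types can be added without changing the common schema.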

 

Generating augmented Advisor recommendations

 

Having all the data we need in the same Log Analytics repository makes things really easy. We just need to build a query that joins Advisor recommendations with VM performance and properties and then automate a periodic export of the results for additional processing (see sample results below). As Advisor right-size recommendations consider only the last seven days of VM performance, we just have to run the query once per week.

 

Advisor Cost recommendations augmented with Log Analytics data collected from ARG and Guest OS sources

 

 

For each exported recommendation, we’ll then execute a simple fit score algorithm that decreases the recommendation fit whenever a performance criterion is not met. We check the following criteria, weighted by relative importance, comparing the recommended target SKU against the observed performance metrics:

 

  • [Very high importance] Does it support the current data disks count?
  • [Very high] Does it support the current network interfaces count?
  • [High] Does it support the percentile(n) un-cached IOPS observed for all disks in the respective VM?
  • [High] Does it support the percentile(n) un-cached disk throughput?
  • [Medium] Is the VM below a given percentile(n) processor and memory usage percentage?
  • [Medium] Is the VM below a given percentile(n) network bandwidth usage?

 

The fit score ranges from 0 (lowest) to 5 (highest). If we don’t have performance metrics for a VM, the fit score is still decreased, though to a lesser degree. If we are processing a non-right-size recommendation, we still include it in the report, but the fit score is not computed (it remains at the -1 default value).
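
To make the scoring idea more concrete, here is a minimal Python sketch of a penalty-based fit score. It is not the AOE implementation: the penalty values per importance level, the `MISSING_FACTOR` reduction for missing metrics and the function and variable names are all assumptions chosen for illustration. Only the 0 to 5 range, the -1 default and the general "decrease when a criterion is not met" behavior come from the description above.

```python
from typing import Optional, Sequence, Tuple

MAX_SCORE = 5.0
MIN_SCORE = 0.0
NOT_COMPUTED = -1.0  # default for non-right-size recommendations

# Assumed penalties per importance level (hypothetical values)
PENALTY = {"very_high": 5.0, "high": 2.0, "medium": 1.0}
# Assumed fraction of the penalty applied when there is no data for a criterion
MISSING_FACTOR = 0.5

# (importance, met) where met is True, False, or None when no data is available
Criterion = Tuple[str, Optional[bool]]


def fit_score(criteria: Sequence[Criterion], is_right_size: bool = True) -> float:
    """Start from the maximum score and subtract a penalty per unmet criterion."""
    if not is_right_size:
        return NOT_COMPUTED
    score = MAX_SCORE
    for importance, met in criteria:
        if met is None:
            # No metrics: still decrease the score, but by a smaller amount
            score -= PENALTY[importance] * MISSING_FACTOR
        elif not met:
            score -= PENALTY[importance]
    # Keep the result inside the 0..5 range
    return max(MIN_SCORE, min(MAX_SCORE, score))
```

With these assumed penalties, a VM that fails one medium criterion and has no data for one high criterion would score 3.0, while failing any very high criterion (data disks or network interfaces count) would bring the score down to 0.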

 

Bonus recommendation: orphaned disks

 

The power of this solution is that having such valuable historical data in our repository and adding other sources to it will allow us to generate our own custom recommendations as well. One recommendation that easily comes out of the data we have been collecting is a report of orphaned disks – for example, disks that belonged to a VM that has since been deleted (see sample query below). But you can easily think of others, even beyond cost optimization.

 

A very simple query for generating orphaned disks recommendations
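
The underlying logic can also be sketched outside Log Analytics. The Python example below assumes a hypothetical inventory where each disk record carries the ID of the VM it was last attached to (`last_owner_vm_id`) and compares it against the set of VMs that currently exist. The record layout and the function name are assumptions for illustration; the actual solution works with a Log Analytics query.

```python
from typing import Dict, Iterable, List, Set


def find_orphaned_disks(
    disks: Iterable[Dict[str, str]],
    existing_vm_ids: Set[str],
) -> List[str]:
    """Return the IDs of disks whose last known owner VM no longer exists."""
    # Azure resource IDs are case-insensitive, so compare in lower case
    existing = {vm_id.lower() for vm_id in existing_vm_ids}
    orphaned = []
    for disk in disks:
        owner = disk.get("last_owner_vm_id", "")
        # Disks with no recorded owner are skipped in this sketch
        if owner and owner.lower() not in existing:
            orphaned.append(disk["disk_id"])
    return sorted(orphaned)
```

Whether disks that were never attached to a VM should also be reported is a design choice; this sketch leaves them out and flags only disks whose former owner is gone.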

 

 

Azure Optimization Engine reporting

 

Now that we have an automated process that generates and augments optimization recommendations, the next step is to add visualizations to it. For this purpose, there is nothing better than Power BI. To make things easier, we have, in the meantime, ingested our recommendations into an Azure SQL Database, where we can better manage and query the data. We use it as the data source for our Power BI report, which offers many perspectives (see sample screenshots below).

The overview page gives us a high-level understanding of the recommendations’ relative distribution. We can also quickly see how many right-size recommended target SKUs are supported by the workload characteristics. In the example below, we have many “unknowns”, since only a few VMs were sending performance metrics to the Log Analytics workspace.

 

Azure Optimization Engine overview dashboard

 

 

In the exploration page, we can investigate all the available recommendations, using many types of filters and ordering criteria.

 

Azure Optimization Engine recommendations exploration page

 

 

After selecting a specific recommendation, we can drill through it and navigate to the Details or History pages.

 

Drilling through a specific recommendation, to see details or history

 

 

In the Details page, we can analyze all the data that was used to generate and validate the recommendation. Interestingly, the Azure Advisor API recently started including additional details about the threshold values that were observed for each performance criterion. These can be cross-checked against the metrics we are collecting with the Log Analytics agent.

 

Recommendation details page

 

 

 

In the History page, we can observe how the fit score has evolved over time for a specific recommendation. If the fit score has been stable at high levels for the past weeks, then the recommendation can likely be implemented without risks.

 

Recommendation history page
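
The "stable at high levels" heuristic described for the History page could be expressed as a small check like the one below. This is only a sketch: the function name, the `min_score` threshold of 4.0 and the `min_weeks` window of 4 are assumptions, and each entry in `history` is assumed to be one weekly fit score, ordered from oldest to newest.

```python
from typing import Sequence


def is_stable_high(
    history: Sequence[float],
    min_score: float = 4.0,
    min_weeks: int = 4,
) -> bool:
    """True when the most recent `min_weeks` fit scores are all at or above `min_score`."""
    recent = list(history)[-min_weeks:]
    if len(recent) < min_weeks:
        # Not enough history yet to call the recommendation stable
        return False
    return all(score >= min_score for score in recent)
```

A check like this treats a fit score of -1 (not computed) as not stable, since it falls below any reasonable threshold.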

 

 

 

Each recommendation includes a details URL that opens an Azure Portal web page with additional information not available in the report. If we have performance data in Log Analytics for that instance, we can even open a CPU/memory chart with the performance history.

 

Recommendation justification (VM guest metrics) in Log Analytics

 

 

Deploying the solution and next steps

 

Everything described so far in these posts is available for you to deploy and test, in the Azure Optimization Engine repository. There you can find deployment and usage instructions and, if you have suggestions for improvements or for new types of recommendations, please open a feature request issue or… why not be brave and contribute to the project? 😉

 

The final post of this series will discuss how we can automate continuous optimization with the help of all the historical data we have been collecting and also how the AOE can be extended with additional recommendations (not limited to cost optimization).

 

Thank you for following along! See you next time! 😉

Updated Dec 07, 2020
Version 5.0
  • atsky3000, the best way to check AOE permissions is to open the AOE Automation Account's Identity blade and then check the "Azure role assignments" option. From there, you can switch across all the tenant subscriptions and see whether it has the Reader role in the corresponding scope. In the example below, the Reader role was granted at a higher scope (Management Group) and thus covers all subscriptions below that scope. 

     

  • atsky3000 (Copper Contributor)

    Thank you for your previous reply.

    I have another question. For one of our partners, I have once again deployed the solution, and it is gathering data from a single subscription only. I added Reader rights for the AOE Enterprise App (managed identity) account on all the other subscriptions, but the workbooks still can't seem to list them.

    Any idea how to troubleshoot this?

     

    PS re-ran the runbooks and I still get only one subscription.

     

    Thank you in advance.

  • atsky3000, thank you for the feedback. Your assumptions are correct. CSP subscriptions are indeed partially supported. The note in the GitHub project's FAQ states that some Workbooks that depend on accessing billing details may have limitations for CSP customers. This is the case for the Benefits Usage, Benefits Simulation, Reservations Usage, Reservations Potential and Savings Plans Usage workbooks, which depend on AOE having access to the Pricesheet, Reservations and Savings Plans APIs in the consumption agreement scope in the customer's tenant. But all the other features of AOE - including Recommendations - should work for CSP customers.

  • atsky3000 (Copper Contributor)

    Hi, hspinto 

     

    First of all, thank you for the effort in creating this.

    I have rolled out a couple of deployments in different tenants/environments and everything looks to be running smoothly.

    Except for the fact that I am trying to wrap my head around how to get this working with CSP-type subscriptions.

     

    Basically, we are the CSP and we want to set up the cost recommendations. I saw in the github notes that CSP subscriptions should be supported.

    The first thing that I encountered is that the - Other category can't be inputted:

     

    Upon checking the Setup-BenefitsUsageDependencies script I see that there is no section for that.

    But that is just FYI.

     

    As far as I can see the MCA option should be ok too.

    When I input our Billing account and profile IDs the error is thrown right away:

     

    Note that I tried both, doing this with global admin on the tenant and also with my delegated admin account on the CSP level with admin agent rights.

     

    I suppose that is most likely because I am running this from the end-customer tenant and granting Billing reader permissions to the automation account can't be done just like that.

     

    My assumption is that for the CSP the only supported scenario is to use this from the CSP tenant itself - not the end customer.

    Maybe you have some other insights or successful cases?

    Otherwise, in CSP cases this is just not a practical feature, unfortunately.

     

     

  • AnkitGarg, the query can be found in the respective recommendations runbook. However, since it is built dynamically, it may be easier for you to try out the hard-coded example below:

     

     

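    // Look-back windows for Advisor recommendations and performance data, and the bin size used for disk metrics
    let advisorInterval = 7d;
    let perfInterval = 7d;
    let perfTimeGrain = 1h;
    // Percentile used to aggregate each performance metric
    let cpuPercentileValue = 99;
    let memoryPercentileValue = 99;
    let networkPercentileValue = 99;
    let diskPercentileValue = 99;
    // Advisor recommendation type ID used to identify right-size recommendations
    let rightSizeRecommendationId = 'e10b1381-5f0a-47ff-8c7b-37bd13d7c974';
    // Billing window: the 30 days ending at the most recent consumption record
    let billingInterval = 30d;
    let etime = todatetime(toscalar(AzureOptimizationConsumptionV1_CL | summarize max(UsageDate_t))); 
    let stime = etime-billingInterval; 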
    let RightSizeInstanceIds = materialize(AzureOptimizationAdvisorV1_CL 
    | where todatetime(TimeGenerated) > ago(advisorInterval) and Category == 'Cost' and RecommendationTypeId_g == rightSizeRecommendationId
    | distinct InstanceId_s);
    let LinuxMemoryPerf = Perf 
    | where TimeGenerated > ago(perfInterval) and _ResourceId in (RightSizeInstanceIds) 
    | where CounterName == '% Used Memory' 
    | summarize hint.strategy=shuffle PMemoryPercentage = percentile(CounterValue, memoryPercentileValue) by _ResourceId;
    let WindowsMemoryPerf = Perf 
    | where TimeGenerated > ago(perfInterval) and _ResourceId in (RightSizeInstanceIds) 
    | where CounterName == 'Available MBytes' 
    | project TimeGenerated, MemoryAvailableMBs = CounterValue, _ResourceId;
    let MemoryPerf = AzureOptimizationVMsV1_CL 
    | where TimeGenerated > ago(1d)
    | distinct InstanceId_s, MemoryMB_s
    | join kind=inner hint.strategy=broadcast (
    	WindowsMemoryPerf
    ) on $left.InstanceId_s == $right._ResourceId
    | extend MemoryPercentage = todouble(toint(MemoryMB_s) - toint(MemoryAvailableMBs)) / todouble(MemoryMB_s) * 100 
    | summarize hint.strategy=shuffle PMemoryPercentage = percentile(MemoryPercentage, memoryPercentileValue) by _ResourceId
    | union LinuxMemoryPerf;
    let ProcessorPerf = Perf 
    | where TimeGenerated > ago(perfInterval) and _ResourceId in (RightSizeInstanceIds) 
    | where ObjectName == 'Processor' and CounterName == '% Processor Time' and InstanceName == '_Total' 
    | summarize hint.strategy=shuffle PCPUPercentage = percentile(CounterValue, cpuPercentileValue) by _ResourceId;
    let WindowsNetworkPerf = Perf 
    | where TimeGenerated > ago(perfInterval) and _ResourceId in (RightSizeInstanceIds) 
    | where CounterName == 'Bytes Total/sec' 
    | summarize hint.strategy=shuffle PCounter = percentile(CounterValue, networkPercentileValue) by InstanceName, _ResourceId
    | summarize PNetwork = sum(PCounter) by _ResourceId;
    let DiskPerf = Perf
    | where TimeGenerated > ago(perfInterval) and _ResourceId in (RightSizeInstanceIds) 
    | where CounterName in ('Disk Reads/sec', 'Disk Writes/sec', 'Disk Read Bytes/sec', 'Disk Write Bytes/sec') and InstanceName !in ('_Total', 'D:', '/mnt/resource', '/mnt')
    | summarize hint.strategy=shuffle PCounter = percentile(CounterValue, diskPercentileValue) by bin(TimeGenerated, perfTimeGrain), CounterName, InstanceName, _ResourceId
    | summarize SumPCounter = sum(PCounter) by CounterName, TimeGenerated, _ResourceId
    | summarize MaxPReadIOPS = maxif(SumPCounter, CounterName == 'Disk Reads/sec'), 
                MaxPWriteIOPS = maxif(SumPCounter, CounterName == 'Disk Writes/sec'), 
                MaxPReadMiBps = (maxif(SumPCounter, CounterName == 'Disk Read Bytes/sec') / 1024 / 1024), 
                MaxPWriteMiBps = (maxif(SumPCounter, CounterName == 'Disk Write Bytes/sec') / 1024 / 1024) by _ResourceId;
    AzureOptimizationAdvisorV1_CL 
    | where todatetime(TimeGenerated) > ago(advisorInterval) and Category == 'Cost'
    | extend InstanceName_s = iif(isnotempty(InstanceName_s),InstanceName_s,InstanceName_g)
    | distinct InstanceId_s, InstanceName_s, Description_s, SubscriptionGuid_g, TenantGuid_g, ResourceGroup, Cloud_s, AdditionalInfo_s, RecommendationText_s, ImpactedArea_s, Impact_s, RecommendationTypeId_g
    | join kind=leftouter (
        AzureOptimizationConsumptionV1_CL
        | where UsageDate_t between (stime..etime)
        | extend VMConsumedQuantity = iif(InstanceId_s contains 'virtualmachines' and MeterCategory_s == 'Virtual Machines', todouble(Quantity_s), 0.0)
        | extend VMPrice = iif(InstanceId_s contains 'virtualmachines' and MeterCategory_s == 'Virtual Machines', todouble(UnitPrice_s), 0.0)
        | extend FinalCost = iif(InstanceId_s contains 'virtualmachines', VMPrice * VMConsumedQuantity, todouble(Cost_s))
        | summarize Last30DaysCost = sum(FinalCost), Last30DaysQuantity = sum(VMConsumedQuantity) by InstanceId_s
    ) on InstanceId_s
    | join kind=leftouter (
        AzureOptimizationVMsV1_CL 
        | where TimeGenerated > ago(1d) 
        | distinct InstanceId_s, NicCount_s, DataDiskCount_s, Tags_s
    ) on InstanceId_s 
    | where RecommendationTypeId_g != rightSizeRecommendationId or (RecommendationTypeId_g == rightSizeRecommendationId and toint(NicCount_s) >= 0 and toint(DataDiskCount_s) >= 0)
    | join kind=leftouter hint.strategy=broadcast ( MemoryPerf ) on $left.InstanceId_s == $right._ResourceId
    | join kind=leftouter hint.strategy=broadcast ( ProcessorPerf ) on $left.InstanceId_s == $right._ResourceId
    | join kind=leftouter hint.strategy=broadcast ( WindowsNetworkPerf ) on $left.InstanceId_s == $right._ResourceId
    | join kind=leftouter hint.strategy=broadcast ( DiskPerf ) on $left.InstanceId_s == $right._ResourceId
    | extend MaxPIOPS = MaxPReadIOPS + MaxPWriteIOPS, MaxPMiBps = MaxPReadMiBps + MaxPWriteMiBps
    | extend PNetworkMbps = PNetwork * 8 / 1000 / 1000
    | distinct Last30DaysCost, Last30DaysQuantity, InstanceId_s, InstanceName_s, Description_s, SubscriptionGuid_g, TenantGuid_g, ResourceGroup, Cloud_s, AdditionalInfo_s, RecommendationText_s, ImpactedArea_s, Impact_s, RecommendationTypeId_g, NicCount_s, DataDiskCount_s, PMemoryPercentage, PCPUPercentage, PNetworkMbps, MaxPIOPS, MaxPMiBps, Tags_s
    | join kind=leftouter ( 
        AzureOptimizationResourceContainersV1_CL
        | where TimeGenerated > ago(1d)
        | where ContainerType_s =~ 'microsoft.resources/subscriptions' 
        | project SubscriptionGuid_g, SubscriptionName = ContainerName_s 
    ) on SubscriptionGuid_g