Blog Post

Core Infrastructure and Security Blog
5 MIN READ

Augmenting Azure Advisor Cost Recommendations for Automated Continuous Optimization – Part 1

hspinto's avatar
hspinto
Icon for Microsoft rankMicrosoft
May 10, 2020

Here is Hélder Pinto, Customer Engineer at Microsoft, presenting the first post of a series dedicated to the implementation of automated Continuous Optimization with Azure Advisor Cost recommendations.

 

Introduction

 

We can define Continuous Optimization (CO) in the context of Microsoft Azure and IT environments in general as an iterative process aiming at constantly assessing optimization opportunities in our infrastructure and implement required changes. This is done routinely in many organizations – finding and implementing performance, security, or cost optimization opportunities can be a full-time job in some large enterprise scenarios.

 

Azure Advisor is a great, free governance tool that every Azure customer should look at often and use as a source of CO. In Azure Advisor, you can find recommendations for several categories: Performance, High Availability, Security (sourced from Azure Security Center), Operational Excellence, and Cost (coming from Azure Cost Management). Each recommendation comes with a justification, impact level and details on how to optimize the impacted resources. See more details here.

 

Example of Advisor Cost recommendations

 

 

 

The problem 

 

All Advisor recommendations are actionable and, in many cases, automatable – ultimately, CO should be as automatic as CI/CD. For example, when Advisor finds Virtual Machines missing endpoint protection, we can automate the deployment of the corresponding VM extension.

 

When it comes to Cost recommendations, there are also some scenarios where optimization automation is easy to implement: deleting orphaned Public IPs or moving snapshots to Standard Storage are some examples. However, when customers look at Virtual Machine right-size recommendations – one of the recommendations with higher cost optimization impact –, very few feel comfortable enough to act accordingly, even more if asked to automate the action. These are some questions many customers ask:

 

  • What was the rationale behind this recommendation? Which actual metrics and thresholds were considered to make the decision?”
  • “How confident can I be to execute the recommended downsize and not have a performance issue afterwards?
  • How can I be sure the recommended size can cope with the current amount of data disks or network interfaces?

As of today, the documentation about Cost recommendations states that Advisor “considers virtual machines for shut-down when P95th of the max of max value of CPU utilization is less than 3% and network utilization is less than 2% over a 7 day period. Virtual machines are considered for right size when it is possible to fit the current load in a smaller SKU (within the same SKU family*) or a smaller number # of instance such that the current load doesn’t go over 80% utilization when non-user facing workloads and not above 40% when user-facing workload.

 

We have an idea of how the recommendation was generated, but customers need more details, such as the actual VM metrics** that supported the recommendation and projections of the smaller SKU against the current capacity requirements. Furthermore, the algorithm is not yet considering storage I/O metrics nor disk or NIC count properties.

 

Reviewing and validating recommendations one-by-one requires thus a significant effort, the more for large-scale customers with hundreds of right-size recommendations. Ideally, any detail accompanying a recommendation should be provided in a machine-understandable manner, so that automated actions can be implemented on top.

 

For the reasons above, some customers I work with have asked guidance on how to get better informed cost recommendations so that they can automate remediation or, at least, filter out recommendations that haven’t a strong enough ground, from the customer perspective, to be further investigated. This post gives an overview of a possible solution for this requirement and the remaining articles in the series will provide you with the actual technical details of the solution.

 

A solution architecture

 

Better informing Advisor Cost right-size recommendations and enabling remediation automation means:

 

  1. Augmenting Advisor recommendations with Virtual Machine performance metrics and properties:
    • Processor (including per core), memory, disk (IOPS and throughput), and network usage metrics
    • OS Type
    • Current SKU memory
    • Data disk and network interface count
  2. Calculating a fit score based on customer-defined performance thresholds and platform-defined target SKU limits:
    • Customer-provided CPU, memory, or disk usage thresholds
    • Azure platform VM SKU limits (max. data disk count, max. NIC count, max. IOPS, etc.)
  3. Providing augmented recommendations both in human- and machine-accessible ways:
    • Visualize latest recommendations in a dashboard
    • Provide historical perspective, i.e., for how long a recommendation has been done
    • Access via query API or CSV exports

These requirements imply the usage of multiple sources of information beyond Azure Advisor itself. The architecture depicted below summarizes the tools and services that can possibly be used to build such a solution. The solution architecture is a typical ETL and analytics pattern implemented on top of Azure Automation and Log Analytics.

 

Solution architecture, from data collection to recommendations generation and visualization

 

 

Dimension Data Collection Runbooks periodically collect data from Azure Advisor (Cost recommendations) and Azure Resource Graph (Virtual Machine and Managed Disk properties) and dump to an Azure Storage Account container a selection of properties as CSV raw data . The Azure Storage Account repository has a long-term retention and can be used as well as a source for replaying transformation and load operations. We are using here only a couple of data sources, but others could easily be plugged in (Azure Billing, Azure Monitor metrics, etc.).

 

Raw data files are ingested into the Log Analytics workspace by the Log Analytics Ingestion Runbook, where Virtual Machines also send performance metrics thanks to the Log Analytics agents. With all the data together in the same repository and with the power of Log Analytics, it is possible to augment Azure Advisor recommendations with very useful information and even generate new recommendation types out of Azure Resource Graph and performance counters .

 

Recommendation Generation Runbooks have recommendation type-specific logic that queries the Log Analytics workspace as well as other sources of information (e.g., the Compute ARM provider) to augment or generate a recommendation. In this scenario, Advisor Cost recommendations are merged with VM properties and performance metrics aggregates. Other scenarios could easily arise from the available data, such as recommendations for deleting orphaned disks. Again, all these runbooks dump the augmented recommendations to an Azure Storage Account container.

 

Finally, the SQL Server Injection Runbook periodically parses the raw recommendation files and sends them as new rows to an Azure SQL database containing all the recommendations history. The result of the process is visible in a Power BI dashboard – see sample view below.

 

Example of an augmented Advisor Cost recommendations dashboard

 

 

 

A little note on some architectural decisions. First, the Power BI report could connect directly to the Azure Storage Account instead of using an intermediary SQL Database. However, using a SQL interface enables better querying and data transformation capabilities and makes integration with third-party reporting solutions easier. Secondly, one can argue that, when trying to save on Azure costs, we are adding other costs (Automation, Storage Account, SQL Database and Log Analytics). The fact is that almost all components are very cheap and the amount of ingested data in Log Analytics for VM metrics can be optimized to collect only the required counters with a large collection interval. Moreover, if an organization was already using Log Analytics for VM metrics, we are just leveraging this data with no additional costs.

 

In the upcoming posts in this series (part 2, part 3 and part 4), we will look at the implementation details of each step of the process and understand how this solution can be used to automate right-size remediations with high confidence. Stay tuned!

 

* Actually, in some cases, Advisor may suggest B-family target SKUs. Otherwise, it remains in the same family.

** Since late 2020, Advisor provides details about the the observed metrics (at the host level) the recommendation was based on.

Updated Feb 24, 2021
Version 10.0

20 Comments

  • akumar1911, if the role is not granted, some of the Azure AD objects and RBAC assignment extraction jobs will fail and won't provide the data that is needed for the following recommendations:

    • Service Principal credentials/certificates without expiration date
    • Service Principal credentials/certificates expired or about to expire

    Additionally, the Identities and Roles workbook will be partially broken (no data for Azure AD objects and roles).

    Otherwise, the AOE will continue to run normally for all the remaining recommendations/features.

  • hspinto During the deployment , AOE Service Principal needs Global Reader permission on AAD. What recommendations it provides utilising this role and if we don't assign the Service Principal Global reader role, what could be the implication on the AOE?

  • martinfelts have you assigned the needed permissions to the AOE Run As Account across all the 86 subscriptions? By default, the Azure Automation Run As Account is created with Reader role only over the respective subscription. However, you can widen the scope of its recommendations just by granting the same Reader role to other subscriptions or, even simpler, to a top-level Management Group. Assigning those permissions at Management Group level is recommended. Otherwise, can you PM me so that we discuss this offline?

  • martinfelts's avatar
    martinfelts
    Copper Contributor

    I have deployed this solution in my environment. I currently have 86 subscriptions, I used a pre-existing log analytics workspace and I can see the workbook jobs running and it appears that they are attempting to collect data but the json files are empty, the csv files are empty and my database is empty and I cannot seem to figure out what is causing this. Any help would be greatly appreciated.

  • Thanks for the feedback, Andy Ball. Since these articles were posted last year, some more detail was added to Azure Advisor right-size recommendations, namely the host metrics that drove the recommendation (of course, this is still not as ideal as having guest OS metrics covering as well disk I/O). I know that other features implemented in the AOE solution are in the Advisor roadmap, but I do not have information to share about planned roll out.

  • Andy Ball's avatar
    Andy Ball
    Copper Contributor

    Hi hspinto  

     

    I have this exact requirement for a large enterprise customer and i was just sketching out and HLD and event started to code a quick prototype and then stumbled upon your solution . Just reading through and looks super awesome 🙂

     

    Are there any thoughts / plans on making this an official solution as part of Azure Advisor at some point ? 

     

    cheers

    Andy. 

  • DmitryB, the new version has just been released. Please, give it a try. Feedback is welcome! If you find any issue, please file a new GitHub issue in the repo.

  • DmitryB's avatar
    DmitryB
    Copper Contributor

    Great, thank you for the swift reply. Looking forward for the new release. Amazing work, keep on.

  • DmitryB, by default, AOE's Azure Automation Run As Account has a role only over the subscription where it was deployed to. However, you can widen the scope of its recommendations just by granting the same Reader role to other subscriptions or, even simpler, to a top-level Management Group.

     

    If you need to collect performance metrics from other Log Analytics workspaces in the same tenant, you'll have to wait for the new release, which will be published next week.


     
  • DmitryB's avatar
    DmitryB
    Copper Contributor

    Thank you great article. Wonder if there is an instruction how to bind AOE to another subscription within the same tenant to collect performance metrics and built recommendations.