Core Infrastructure and Security Blog

Automating Continuous Optimization with the Azure Optimization Engine

hspinto
Microsoft
Nov 09, 2020

Hello, dear readers! Here is Hélder Pinto again, writing the last post of a series dedicated to the implementation of automated Continuous Optimization with Azure Advisor recommendations. For context on the solution described here, please read the introductory post for an overview, the second post for the details and deployment of the main solution components, and the third post to see how the Azure Optimization Engine generates recommendations and reports on them.

 

Introduction

 

If you didn’t have time to read the full post series about the Azure Optimization Engine, let me quickly recap. The Azure Optimization Engine (AOE) is an extensible solution designed to generate custom optimization recommendations for your Azure environment. Think of it as a fully customizable Azure Advisor. It leverages Azure Resource Graph, Log Analytics, Automation, and, of course, Azure Advisor itself, to build a rich repository of custom optimization opportunities. The first recommendation use case covered by AOE was augmenting Azure Advisor Cost recommendations, particularly Virtual Machine right-sizing, with VM metrics and properties that enable better-informed right-size decisions. Other recommendations can be easily added or augmented with AOE, not only for cost optimization but also for security, high availability, and other Well-Architected Framework pillars.

In this last post, I will show you how we can use AOE to automate the remediation of optimization opportunities – the ultimate goal of the engine – and how to extend it with new custom recommendations.

 

Do you really want me to automate right-size recommendations?!

 

The customer pain that sparked this series was the need to remediate dozens or hundreds of VM right-size recommendations without easy access to the information required to make well-informed decisions. An Azure administrator can easily spend many hours investigating, and go through many interactions with colleagues, before deciding to downsize a VM. Often the task becomes unfeasible, and Azure inefficiencies can last indefinitely.

 

What if customers could simply automate the remediation of those recommendations? It may at first seem a reckless and naive option, but let’s face it: if we were highly confident that the recommendation was feasible and the environment we were touching was not critical, wouldn’t we prefer to automate?

 

Automate non-critical VM downsize when fit score is high for several weeks in a row

 

With the help of AOE, we now have a database of recommendations with all the details we need to make an automated decision (see the previous post for the full details):

 

  • A fit score that is based on several factors:
    • Is the number of data disks and network interfaces supported by the size recommended by Azure Advisor?
    • Is the performance observed in my VM supported in the target size, according to my own memory, processor, and I/O thresholds and aggregations?
  • Historical data that allows me to check whether a right-size recommendation has been made for long enough to be taken seriously. A recommendation that just popped up this week is less trustworthy than one that Advisor has been repeating for the last 4 weeks.
  • VM properties that describe how critical it is (e.g., Azure subscription, resource group, or specific tags)

 

Based on these details, we can write a remediation runbook that simply queries the recommendations database for VMs that have been recommended for right-sizing for the past X weeks with a fit score of at least Y. The T-SQL query could be this one:

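-- The $-prefixed names are placeholders that the runbook fills in before running the query.
-- Every non-aggregated column in the SELECT list must also appear in GROUP BY.
SELECT InstanceId, InstanceName, Tags, COUNT(InstanceId) AS WeeksRecommended
        FROM [dbo].[Recommendations]
        WHERE RecommendationSubTypeId = '$rightSizeRecommendationId'
            AND FitScore >= $minFitScore
            AND GeneratedDate >= GETDATE()-(7*$minWeeksInARow)
        GROUP BY InstanceId, InstanceName, Tags
        HAVING COUNT(InstanceId) >= $minWeeksInARow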

 

 

 

 

 

 

 

Additionally, we could filter the VMs to remediate to include only those that had a specific tag value. For a scenario where $minWeeksInARow=4 and $minFitScore=4.5 and tag environment=dev, these would be the automated remediation results:

 

Recommendation   Fit Score   Weeks in a Row   Env. tag   Action
we1-prd-dc01     4.6         6                prod       None
we1-dev-app02    4.7         4                dev        Downsize
we1-dev-sql03    4.3         5                dev        None
we1-dev-app03    4.8         2                dev        None
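The results above follow from checking three conditions per VM. The short Python sketch below reproduces them and also reports which condition ruled a VM out. It is only an illustration of the decision rule, under the assumption that each VM is summarized by its fit score, its number of weeks in a row, and its environment tag; the function and key names are hypothetical and are not taken from the AOE runbook.

```python
def remediation_decision(vm, min_fit_score, min_weeks_in_a_row, required_env):
    """Return (action, reasons), where reasons lists every condition the VM fails."""
    reasons = []
    if vm["fit_score"] < min_fit_score:
        reasons.append("fit score below minimum")
    if vm["weeks_in_a_row"] < min_weeks_in_a_row:
        reasons.append("not recommended for enough weeks")
    if vm["env"] != required_env:
        reasons.append("environment tag does not match")
    return ("None" if reasons else "Downsize", reasons)


# The four VMs from the scenario above.
vms = [
    {"name": "we1-prd-dc01", "fit_score": 4.6, "weeks_in_a_row": 6, "env": "prod"},
    {"name": "we1-dev-app02", "fit_score": 4.7, "weeks_in_a_row": 4, "env": "dev"},
    {"name": "we1-dev-sql03", "fit_score": 4.3, "weeks_in_a_row": 5, "env": "dev"},
    {"name": "we1-dev-app03", "fit_score": 4.8, "weeks_in_a_row": 2, "env": "dev"},
]

decisions = {
    vm["name"]: remediation_decision(vm, min_fit_score=4.5, min_weeks_in_a_row=4, required_env="dev")
    for vm in vms
}

for name, (action, reasons) in decisions.items():
    print(name, action, "; ".join(reasons))
```

Only we1-dev-app02 passes all three checks. Each of the other VMs fails exactly one of them, which is why a single high fit score or a long recommendation history is not enough on its own.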

 

The AOE includes a Remediate-AdvisorRightSizeFiltered runbook that implements exactly the algorithm above. After deploying the solution, you just have to define values for the following Automation variables and then schedule the runbook for the desired time and frequency. Happy rightsizing!

 

  • AzureOptimization_RemediateRightSizeMinWeeksInARow – the minimum number of consecutive weeks a recommendation must be made for the same VM.
  • AzureOptimization_RemediateRightSizeMinFitScore – the minimum fit score a recommendation must have each week.
  • AzureOptimization_RemediateRightSizeTagsFilter – an optional VM tags JSON filter that will be used to select VMs that will be subject to remediation. Example: [ { "tagName": "a", "tagValue": "b" }, { "tagName": "c", "tagValue": "d" } ]
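
As an illustration of how a tags filter in this JSON shape could be evaluated, here is a minimal Python sketch. It rests on assumptions that the post does not state: that every tagName/tagValue pair must match (AND semantics), that the comparison is case-sensitive, and that an empty filter means no tag restriction. The function name is hypothetical, and the actual runbook may behave differently.

```python
import json


def matches_tags_filter(vm_tags, tags_filter_json):
    """Return True when the VM's tags satisfy every pair in the JSON filter."""
    # Assumption: an empty or missing filter places no restriction on tags.
    if not tags_filter_json:
        return True
    tags_filter = json.loads(tags_filter_json)
    # Assumption: all pairs must match exactly (AND semantics, case-sensitive).
    return all(vm_tags.get(f["tagName"]) == f["tagValue"] for f in tags_filter)


tags_filter = '[ { "tagName": "a", "tagValue": "b" }, { "tagName": "c", "tagValue": "d" } ]'

full_match = matches_tags_filter({"a": "b", "c": "d", "owner": "x"}, tags_filter)
partial_match = matches_tags_filter({"a": "b"}, tags_filter)
no_filter = matches_tags_filter({"a": "b"}, "")

print(full_match, partial_match, no_filter)
```

Under these assumptions, a VM with extra tags still matches as long as the required pairs are present, while a VM missing any required pair is left untouched.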

 

How to generate my own custom recommendations?

 

OK, now you have customized right-size recommendations, but you probably want more. You want to identify other cost-saving opportunities that may be specific to the environments you manage and that Advisor does not cover yet, such as underutilized App Service Plans or SQL Databases, ever-growing Storage Accounts, VMs stopped but not deallocated, etc. In the previous post, you saw that writing a recommendation runbook for orphaned disks was really easy.

 

In this post, I want to show you that the AOE is not meant only for Cost optimization but can be used for other Well-Architected Framework pillars – High Availability, Performance, Operational Excellence and Security. I’ve recently added to the AOE a recommendation for the High Availability pillar, identifying VMs with unmanaged disks. This new recommendation does not need additional data sources, as the Virtual Machine data already being exported from Azure Resource Graph is enough to identify VMs in this situation.

 

If you want to generate your own custom recommendations, first make sure you are collecting the required data with the Data Collection runbooks – follow the pattern of the existing runbooks, which dump the data as CSV into a Storage Account and then rely on the data source-agnostic Log Analytics ingestion runbook. Once the required data is in Log Analytics, you can write a new recommendation runbook that runs a weekly query for optimization opportunities. Looking at the Recommend-VMsWithUnmanagedDisksToBlobStorage runbook, you’ll identify the recommendation generation pattern:

 

  1. Collect AOE generic and recommendation-specific variables (Log Analytics workspace, Storage Account container, etc.)
  2. Authenticate against the Azure environment
  3. Obtain a context reference to the Storage Account where the recommendations file is going to be stored.
  4. Execute the recommendation query against Log Analytics.
  5. Build the recommendation objects – pay attention to the properties, as all of them are required for the recommendation to show up correctly in the Power BI report. You will need to create a new GUID for your new recommendation type.
  6. Finally, export the recommendations file to blob storage – use a file name specific to the recommendation type.
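
Steps 5 and 6 can be sketched as follows. This is a rough Python illustration of the pattern, not the runbook itself (which is not reproduced here): the property names, the category label, the file naming scheme, the input row shape, and the placeholder GUID are all hypothetical, and the upload to blob storage is left out. The real recommendation objects need whatever properties the Power BI report expects, so check an existing runbook for the authoritative list.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical placeholder. Generate a real GUID once, for example with
# str(uuid.uuid4()), and keep it fixed so every run reports the same type.
RECOMMENDATION_TYPE_ID = "11111111-2222-3333-4444-555555555555"
uuid.UUID(RECOMMENDATION_TYPE_ID)  # raises ValueError if the GUID is malformed


def build_recommendations(vm_rows, generated_at):
    """Build one recommendation object per VM that still uses unmanaged disks."""
    recommendations = []
    for vm in vm_rows:
        if vm["uses_managed_disks"]:
            continue  # nothing to recommend for this VM
        recommendations.append({
            "recommendation_type_id": RECOMMENDATION_TYPE_ID,
            "generated_date": generated_at.isoformat(),
            "category": "HighAvailability",
            "instance_id": vm["instance_id"],
            "instance_name": vm["instance_name"],
            "tags": vm["tags"],
        })
    return recommendations


def export_file_name(recommendation_name, generated_at):
    # A name specific to the recommendation type, with the run date appended.
    return f"{recommendation_name}-{generated_at:%Y%m%d}.json"


generated_at = datetime(2020, 11, 9, tzinfo=timezone.utc)
vm_rows = [  # hypothetical query results
    {"instance_id": "/vm/1", "instance_name": "vm1", "tags": {"env": "dev"}, "uses_managed_disks": False},
    {"instance_id": "/vm/2", "instance_name": "vm2", "tags": {}, "uses_managed_disks": True},
]

recommendations = build_recommendations(vm_rows, generated_at)
file_name = export_file_name("vmsunmanageddisks", generated_at)
payload = json.dumps(recommendations, indent=2)  # content that would be uploaded
print(file_name, len(recommendations))
```

Keeping the type GUID constant is the important design choice here: if it changed on every run, the weekly results for the same recommendation type could not be grouped together.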

 

Don’t forget to link the runbook to the AzureOptimization_RecommendationsWeekly schedule, and that’s all you need to do. On the next scheduled recommendations generation run, you’ll have your new recommendations flowing into the Power BI report!

 

Thank you for following this series! 😉

 

Disclaimer

The Azure Optimization Engine is not supported under any Microsoft standard support program or service. The scripts are provided AS IS without warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.

Updated Jul 18, 2024
Version 4.0
  • perrymahlmann

    I am having a problem: one of my runbooks, Export-AADObjectsToBlobStorage, gets stopped as per Microsoft policy.

     

     

    The job has been stopped because it reached the fair share limit of job execution more than 3 hours. For long-running jobs, it's recommended to use a Hybrid Runbook Worker. Hybrid Runbook Workers don't have a limitation on how long a runbook can execute. Refer https://docs.microsoft.com/en-us/azure/automation/automation-runbook-execution#fair-share for more details.

  • Yes, the Export-AADObjectsToBlobStorage runbook collects object metadata from your whole Azure AD. It may take a while to complete if your AAD has a large number of users and groups. For runbooks that run for more than 3 hours (or crash frequently, e.g., due to memory constraints), it is recommended to implement Hybrid Workers.

     

    If you are not interested in getting users/groups metadata, you can filter them out by updating the AzureOptimization_AADObjectsFilter automation variable.