Microsoft Tech Community
Augmenting Azure Advisor Cost Recommendations for Automated Continuous Optimization – Part 1
Published May 10 2020 04:00 PM
Microsoft

Here is Hélder Pinto, Customer Engineer at Microsoft, presenting the first post of a series dedicated to the implementation of automated Continuous Optimization with Azure Advisor Cost recommendations.

 

Introduction

 

We can define Continuous Optimization (CO), in the context of Microsoft Azure and IT environments in general, as an iterative process that constantly assesses optimization opportunities in our infrastructure and implements the required changes. This is done routinely in many organizations: finding and implementing performance, security, or cost optimization opportunities can be a full-time job in some large enterprise scenarios.

 

Azure Advisor is a great, free governance tool that every Azure customer should check often and use as a source of CO. In Azure Advisor, you can find recommendations in several categories: Performance, High Availability, Security (sourced from Azure Security Center), Operational Excellence, and Cost (coming from Azure Cost Management). Each recommendation comes with a justification, an impact level, and details on how to optimize the impacted resources. See the Azure Advisor documentation for more details.

 

Example of Advisor Cost recommendations

 

 

 

The problem 

 

All Advisor recommendations are actionable and, in many cases, automatable – ultimately, CO should be as automatic as CI/CD. For example, when Advisor finds Virtual Machines missing endpoint protection, we can automate the deployment of the corresponding VM extension.

 

When it comes to Cost recommendations, there are also scenarios where optimization is easy to automate, such as deleting orphaned Public IPs or moving snapshots to Standard Storage. However, when customers look at Virtual Machine right-size recommendations, which are among the recommendations with the highest cost optimization impact, very few feel comfortable enough to act on them, even less so when asked to automate the action. These are some of the questions many customers ask:

 

  • What was the rationale behind this recommendation? Which actual metrics and thresholds were considered to make the decision?
  • How confident can I be that executing the recommended downsize will not cause a performance issue afterwards?
  • How can I be sure the recommended size can cope with the current number of data disks or network interfaces?

As of today, the documentation about Cost recommendations states that Advisor “considers virtual machines for shut-down when P95th of the max of max value of CPU utilization is less than 3% and network utilization is less than 2% over a 7 day period. Virtual machines are considered for right size when it is possible to fit the current load in a smaller SKU (within the same SKU family*) or a smaller number # of instance such that the current load doesn’t go over 80% utilization when non-user facing workloads and not above 40% when user-facing workload.”
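To make the quoted thresholds concrete, here is a minimal Python sketch of how such checks could look. It is one possible reading of the documentation excerpt above, not Advisor's actual algorithm: the function names, the nearest-rank percentile method, the use of the maximum for network utilization, and the linear CPU projection across core counts are all assumptions made for illustration.

```python
def p95(samples):
    """95th percentile using the nearest-rank method (an assumed choice)."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_shutdown_candidate(cpu_max_pct, network_pct,
                          cpu_limit=3.0, network_limit=2.0):
    """Apply the documented shut-down thresholds to 7 days of samples.

    cpu_max_pct: per-interval maximum CPU utilization samples (percent)
    network_pct: per-interval network utilization samples (percent)
    Assumption: network utilization is judged on its maximum value.
    """
    return p95(cpu_max_pct) < cpu_limit and max(network_pct) < network_limit


def project_cpu(observed_pct, current_cores, target_cores):
    """Assumed linear projection of CPU utilization onto a smaller SKU."""
    return observed_pct * current_cores / target_cores


def fits_target(projected_pct, user_facing):
    """Right-size check: at most 40% for user-facing workloads, 80% otherwise."""
    ceiling = 40.0 if user_facing else 80.0
    return projected_pct <= ceiling


if __name__ == "__main__":
    # Illustrative values only: a 4-core VM at 30% P95 CPU, halved to 2 cores.
    projected = project_cpu(30.0, current_cores=4, target_cores=2)
    print(projected, fits_target(projected, user_facing=False))
```

Even this simplified version shows why the same recommendation can be acceptable for a batch workload and unacceptable for a user-facing one: the projected utilization is identical, only the ceiling differs.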

 

We have an idea of how the recommendation was generated, but customers need more details, such as the actual VM metrics** that supported the recommendation and projections of the smaller SKU against the current capacity requirements. Furthermore, the algorithm does not yet consider storage I/O metrics, nor disk or NIC count properties.

 

Reviewing and validating recommendations one by one thus requires a significant effort, even more so for large-scale customers with hundreds of right-size recommendations. Ideally, any detail accompanying a recommendation should be provided in a machine-readable form, so that automated actions can be implemented on top of it.

 

For the reasons above, some customers I work with have asked for guidance on how to get better-informed cost recommendations, so that they can automate remediation or, at least, filter out recommendations that, from the customer perspective, do not have strong enough grounds to be investigated further. This post gives an overview of a possible solution for this requirement, and the remaining articles in the series will provide the actual technical details of the solution.

 

A solution architecture

 

Better informing Advisor Cost right-size recommendations and enabling remediation automation means:

 

  1. Augmenting Advisor recommendations with Virtual Machine performance metrics and properties:
    • Processor (including per core), memory, disk (IOPS and throughput), and network usage metrics
    • OS Type
    • Current SKU memory
    • Data disk and network interface count
  2. Calculating a fit score based on customer-defined performance thresholds and platform-defined target SKU limits:
    • Customer-provided CPU, memory, or disk usage thresholds
    • Azure platform VM SKU limits (max. data disk count, max. NIC count, max. IOPS, etc.)
  3. Providing augmented recommendations both in human- and machine-accessible ways:
    • Visualize latest recommendations in a dashboard
    • Provide historical perspective, i.e., for how long a recommendation has been made
    • Access via query API or CSV exports
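As an illustration of the second requirement, the following Python sketch computes a fit score from customer-defined thresholds and target SKU limits. The scoring scheme (a 1 to 5 scale, one point lost per exceeded usage threshold, minimum score when a disk or NIC count limit is violated), the class and field names, and the linear projection of usage onto the target SKU are hypothetical choices for this example, not the scheme used by the solution described in this series.

```python
from dataclasses import dataclass


@dataclass
class SkuLimits:
    """Platform limits of a candidate target SKU (supplied by the caller)."""
    cores: int
    memory_gb: float
    max_data_disks: int
    max_nics: int
    max_iops: int


@dataclass
class VmObservation:
    """Observed properties and usage aggregates of the current VM."""
    cores: int
    memory_gb: float
    cpu_p95_pct: float
    memory_p95_pct: float
    disk_iops_p95: float
    data_disks: int
    nics: int


def fit_score(vm, target, cpu_threshold_pct=80.0, memory_threshold_pct=80.0):
    """Return (score, failed_checks) for resizing vm to the target SKU."""
    failed = []
    # Hard limits: the resize is not feasible if these are exceeded.
    if vm.data_disks > target.max_data_disks:
        failed.append("data disk count")
    if vm.nics > target.max_nics:
        failed.append("NIC count")
    if failed:
        return 1.0, failed

    # Soft limits: project current usage onto the target SKU capacity.
    projected_cpu = vm.cpu_p95_pct * vm.cores / target.cores
    projected_memory = vm.memory_p95_pct * vm.memory_gb / target.memory_gb
    if projected_cpu > cpu_threshold_pct:
        failed.append("CPU")
    if projected_memory > memory_threshold_pct:
        failed.append("memory")
    if vm.disk_iops_p95 > target.max_iops:
        failed.append("disk IOPS")
    return max(1.0, 5.0 - len(failed)), failed


if __name__ == "__main__":
    # Illustrative numbers, not the limits of any real SKU.
    vm = VmObservation(4, 16, 20.0, 30.0, 500, data_disks=2, nics=1)
    target = SkuLimits(2, 8, max_data_disks=4, max_nics=2, max_iops=3200)
    print(fit_score(vm, target))
```

Returning the list of failed checks alongside the score keeps the result machine-readable, so an automation step can filter on the score while a human reviewer can still see why a recommendation was penalized.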

These requirements imply the usage of multiple sources of information beyond Azure Advisor itself. The architecture depicted below summarizes the tools and services that can possibly be used to build such a solution. The solution architecture is a typical ETL and analytics pattern implemented on top of Azure Automation and Log Analytics.

 

Solution architecture, from data collection to recommendations generation and visualization

 

 

Dimension Data Collection Runbooks periodically collect data from Azure Advisor (Cost recommendations) and Azure Resource Graph (Virtual Machine and Managed Disk properties) and dump a selection of properties, as raw CSV data, to an Azure Storage Account container. The Azure Storage Account repository has long-term retention and can also be used as a source for replaying transformation and load operations. We are using only a couple of data sources here, but others could easily be plugged in (Azure Billing, Azure Monitor metrics, etc.).
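The "selection of properties as raw CSV data" step can be sketched in a few lines of Python. This is only an illustration of the idea and not the runbooks' code: the property names, the blob naming convention, and the function names are assumptions, and the upload to the Storage Account itself is left out.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical selection of VM properties kept in the raw export.
SELECTED_PROPERTIES = ["resourceId", "vmSize", "osType", "dataDiskCount", "nicCount"]


def to_csv(records, timestamp):
    """Serialize only the selected properties, stamped with collection time."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["timestamp"] + SELECTED_PROPERTIES,
        extrasaction="ignore",   # silently drop properties we did not select
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow({"timestamp": timestamp, **record})
    return buffer.getvalue()


def blob_name(source, collected_at):
    """Date-prefixed name so that past exports can be replayed in order."""
    return f"{collected_at:%Y%m%d}-{source}.csv"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    rows = [{"resourceId": "/subscriptions/0000/vm1", "vmSize": "Size_A",
             "osType": "Linux", "dataDiskCount": 2, "nicCount": 1,
             "tags": "ignored"}]
    print(blob_name("argvm", now))
    print(to_csv(rows, now.strftime("%Y-%m-%dT%H:%M:%SZ")))
```

Stamping every row with the collection time and keeping one file per collection date is what makes the raw repository replayable: a later transformation or load step can be re-run against any past export.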

 

Raw data files are ingested into the Log Analytics workspace by the Log Analytics Ingestion Runbook. Virtual Machines also send their performance metrics to the same workspace through the Log Analytics agents. With all the data together in the same repository, and with the power of Log Analytics, it is possible to augment Azure Advisor recommendations with very useful information and even to generate new recommendation types out of Azure Resource Graph data and performance counters.

 

Recommendation Generation Runbooks have recommendation type-specific logic that queries the Log Analytics workspace as well as other sources of information (e.g., the Compute ARM provider) to augment or generate a recommendation. In this scenario, Advisor Cost recommendations are merged with VM properties and performance metrics aggregates. Other scenarios could easily arise from the available data, such as recommendations for deleting orphaned disks. Again, all these runbooks dump the augmented recommendations to an Azure Storage Account container.
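The merge step described above can be pictured as a join on the resource ID. The Python sketch below assumes three already-collected lists of dictionaries with a shared "resourceId" key; the field names and the "hasFullContext" flag are hypothetical. Keys are lower-cased before matching because Azure resource IDs are case-insensitive and different sources may return them with different casing.

```python
def merge_recommendations(recommendations, vm_properties, perf_aggregates):
    """Attach VM properties and metric aggregates to each recommendation."""
    props_by_id = {p["resourceId"].lower(): p for p in vm_properties}
    perf_by_id = {m["resourceId"].lower(): m for m in perf_aggregates}

    augmented = []
    for rec in recommendations:
        key = rec["resourceId"].lower()
        merged = dict(rec)  # copy, so the input list is left untouched
        merged["properties"] = props_by_id.get(key)
        merged["metrics"] = perf_by_id.get(key)
        # Flag recommendations that cannot be validated for lack of data.
        merged["hasFullContext"] = (
            merged["properties"] is not None and merged["metrics"] is not None
        )
        augmented.append(merged)
    return augmented


if __name__ == "__main__":
    recs = [{"resourceId": "/subscriptions/0000/VM1", "targetSku": "Size_B"}]
    props = [{"resourceId": "/subscriptions/0000/vm1", "dataDiskCount": 2}]
    perf = [{"resourceId": "/subscriptions/0000/vm1", "cpuP95": 12.5}]
    print(merge_recommendations(recs, props, perf))
```

Keeping recommendations that lack properties or metrics, instead of dropping them, lets a later step decide whether they should be excluded from automation or simply given a lower confidence.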

 

Finally, the SQL Server Injection Runbook periodically parses the raw recommendation files and sends them as new rows to an Azure SQL database containing all the recommendations history. The result of the process is visible in a Power BI dashboard – see sample view below.
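To illustrate why a recommendations history table is useful, here is a small Python sketch that uses an in-memory SQLite database as a stand-in for the Azure SQL database. The table layout, column names, and functions are assumptions for this example, and `INSERT OR IGNORE` is SQLite syntax (an Azure SQL implementation would need an equivalent such as `MERGE`).

```python
import sqlite3


def create_history_db():
    """In-memory stand-in for the recommendations history database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE recommendations (
               generated_date TEXT,
               resource_id TEXT,
               recommendation_type TEXT,
               fit_score REAL,
               PRIMARY KEY (generated_date, resource_id, recommendation_type)
           )"""
    )
    return conn


def load(conn, rows):
    """Append rows; replaying the same raw file does not create duplicates."""
    conn.executemany(
        "INSERT OR IGNORE INTO recommendations VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def recommendation_age(conn):
    """First seen, last seen and number of occurrences per recommendation."""
    return conn.execute(
        """SELECT resource_id, recommendation_type,
                  MIN(generated_date), MAX(generated_date), COUNT(*)
           FROM recommendations
           GROUP BY resource_id, recommendation_type
           ORDER BY resource_id"""
    ).fetchall()


if __name__ == "__main__":
    db = create_history_db()
    load(db, [("2020-05-01", "vm1", "right-size", 4.0),
              ("2020-05-08", "vm1", "right-size", 4.0)])
    print(recommendation_age(db))
```

A recommendation that keeps reappearing week after week with a high fit score is a much stronger candidate for automated remediation than one seen for the first time.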

 

Example of an augmented Advisor Cost recommendations dashboard

 

 

 

A brief note on some architectural decisions. First, the Power BI report could connect directly to the Azure Storage Account instead of using an intermediary SQL Database. However, using a SQL interface enables better querying and data transformation capabilities and makes integration with third-party reporting solutions easier. Second, one can argue that, when trying to save on Azure costs, we are adding other costs (Automation, Storage Account, SQL Database and Log Analytics). In practice, almost all components are very cheap, and the amount of data ingested into Log Analytics for VM metrics can be optimized by collecting only the required counters at a large collection interval. Moreover, if an organization is already using Log Analytics for VM metrics, we are just leveraging this data at no additional cost.
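The effect of collecting fewer counters at a larger interval is simple arithmetic, sketched below in Python. The counter counts and intervals are illustrative assumptions, and no pricing is implied: the sketch only compares the number of samples ingested per VM per day.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(counter_instances, interval_seconds):
    """Number of performance samples one VM sends per day."""
    return counter_instances * (SECONDS_PER_DAY // interval_seconds)


def reduction_factor(baseline, optimized):
    """How many times fewer samples the optimized configuration ingests."""
    return baseline / optimized


if __name__ == "__main__":
    # Illustrative: 20 counter instances every 10 s vs. 6 every 60 s.
    baseline = samples_per_day(counter_instances=20, interval_seconds=10)
    optimized = samples_per_day(counter_instances=6, interval_seconds=60)
    print(baseline, optimized, reduction_factor(baseline, optimized))
    # prints: 172800 8640 20.0
```

Since ingestion volume scales with both the number of counters and the sampling frequency, trimming either one reduces the Log Analytics footprint proportionally.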

 

In the upcoming posts in this series (part 2, part 3 and part 4), we will look at the implementation details of each step of the process and understand how this solution can be used to automate right-size remediations with high confidence. Stay tuned!

 

* Actually, in some cases, Advisor may suggest B-family target SKUs. Otherwise, it remains in the same family.

** Since late 2020, Advisor provides details about the observed metrics (at the host level) the recommendation was based on.

20 Comments
Copper Contributor

Thank you, great article. I wonder if there are instructions on how to bind AOE to another subscription within the same tenant to collect performance metrics and build recommendations.

Microsoft

@DmitryB, by default, AOE's Azure Automation Run As Account has a role only over the subscription where it was deployed to. However, you can widen the scope of its recommendations just by granting the same Reader role to other subscriptions or, even simpler, to a top-level Management Group.

 

If you need to collect performance metrics from other Log Analytics workspaces in the same tenant, you'll have to wait for the new release, which will be published next week.


 
Copper Contributor

Great, thank you for the swift reply. Looking forward to the new release. Amazing work, keep it up.

Microsoft

@DmitryB, the new version has just been released. Please, give it a try. Feedback is welcome! If you find any issue, please file a new GitHub issue in the repo.

Copper Contributor

Hi @hspinto  

 

I have this exact requirement for a large enterprise customer. I was just sketching out an HLD and had even started to code a quick prototype when I stumbled upon your solution. Just reading through it, and it looks super awesome :)

 

Are there any thoughts / plans on making this an official solution as part of Azure Advisor at some point ? 

 

cheers

Andy. 

Microsoft

Thanks for the feedback, @Andy Ball. Since these articles were posted last year, some more detail has been added to Azure Advisor right-size recommendations, namely the host metrics that drove the recommendation (of course, this is still not as ideal as having guest OS metrics that also cover disk I/O). I know that other features implemented in the AOE solution are on the Advisor roadmap, but I do not have information to share about the planned rollout.

Copper Contributor

I have deployed this solution in my environment. I currently have 86 subscriptions, I used a pre-existing log analytics workspace and I can see the workbook jobs running and it appears that they are attempting to collect data but the json files are empty, the csv files are empty and my database is empty and I cannot seem to figure out what is causing this. Any help would be greatly appreciated.

Microsoft

@martinfelts have you assigned the needed permissions to the AOE Run As Account across all the 86 subscriptions? By default, the Azure Automation Run As Account is created with Reader role only over the respective subscription. However, you can widen the scope of its recommendations just by granting the same Reader role to other subscriptions or, even simpler, to a top-level Management Group. Assigning those permissions at Management Group level is recommended. Otherwise, can you PM me so that we discuss this offline?

Copper Contributor

@hspinto During the deployment, the AOE Service Principal needs the Global Reader permission on AAD. Which recommendations does it provide using this role, and if we don't assign the Global Reader role to the Service Principal, what would be the implications for AOE?

Microsoft

@akumar1911, if the role is not granted, some of the Azure AD objects and RBAC assignment extraction jobs will fail and won't provide the data that is needed for the following recommendations:

  • Service Principal credentials/certificates without expiration date
  • Service Principal credentials/certificates expired or about to expire

Additionally, the Identities and Roles workbook will be partially broken (no data for Azure AD objects and roles).

Otherwise, the AOE will continue to run normally for all the remaining recommendations/features.

Copper Contributor

@hspinto thank you very much for confirming :)

Copper Contributor

Hi @hspinto, I'm currently deploying this tool in my environment and, for some reason, one of the runbooks is failing. Can you shed any light on this, please?

 

daryl316_0-1654592949529.png

 

Microsoft

@daryl316, can you share the detailed log, to see at which point the runbook failed? Have you assigned the Azure AD Directory Reader role to the Automation managed identity?

Copper Contributor

Thanks for your feedback @hspinto, please see the screenshot below for the logs. Also, confirming that the automation managed identity has been assigned the Global Reader role in Azure AD.

 

daryl316_0-1654639140314.png

 

Microsoft

@daryl316, the job is iterating over the several subscriptions the Managed Identity has Reader access to. It is able to successfully collect RBAC assignments for the first one, but it seems that when it tries to run Get-AzRoleAssignment -IncludeClassicAdministrators against the second subscription in the array, it gets a NotFound error. I never saw that happening. It seems the second subscription does not have RBAC. Are you able to test Get-AzRoleAssignment -IncludeClassicAdministrators in the context of all your subscriptions, to try to identify which one is failing and why? You can use the Azure Cloud Shell for that. Use lines 80 and 82 from the runbook as a guidance.

Copper Contributor

Hi @hspinto, that makes more sense now. For some reason, when accessing one of the subscriptions, we get an error like this: "The current subscription does not allow you to perform any actions on Azure resources. Use a different subscription." Even an owner of the subscription is unable to access it. It appears to be a legacy subscription that is no longer in use but can only be removed by raising a ticket with Microsoft. This could be the reason why the runbook is failing. As a workaround, we edited the runbook, changed $ErrorActionPreference = "Stop" to "SilentlyContinue", and ran it again. Once the changes were made and the two runbooks re-published, we got a successful response and confirmed that the data was exported successfully.

Microsoft

@daryl316 , I advise you to make the following change in line 82

 

$assignments = Get-AzRoleAssignment -IncludeClassicAdministrators -ErrorAction Continue

 

And also bring back ErrorActionPreference to "Stop", otherwise you'll never know when something goes wrong in other parts of the runbook. 

Copper Contributor

Hi @hspinto, thanks a lot for your input. We reverted our changes and followed your recommendation, and it seems we got a better result. We will continue to monitor and circle back if there are any more issues, but it looks like we're good on this. Thanks again for your help!! :)

Copper Contributor

Hi @hspinto, I am using a Hybrid Worker configuration and a private link to connect to the SQL database. All the export runbooks are working fine, but the recommend runbooks are not able to generate any data; the JSON file is created in the storage account, but without any data. When I tried to troubleshoot and ran it in test mode, it generated the error below. I tried updating the Az.OperationalInsights module to 3.0.0, which is the current version, but no luck.

 
 
Logging in to Azure with ManagedIdentity...

Environments                                                                                                            
------------                                                                                                            
{[AzureChinaCloud, AzureChinaCloud], [AzureCloud, AzureCloud], [AzureGermanCloud, AzureGermanCloud], [AzureUSGovernme...
Finding tables where recommendations will be generated from...
Will run query against tables AzureOptimizationAADObjectsV1_CL

Name               : xxxxxx
Account            : xxxxxxxxxx
Environment        : AzureCloud
Subscription       : xxxxxxxxxxxxxx
Tenant             : xxxxxxxxxxxxxxxxx
TokenCache         : 
VersionProfile     : 
ExtendedProperties : {}
Query failed. Debug the following query in the AOE Log Analytics workspace:
    let expiryInterval = 30d;
    let AppsAndKeys = materialize (AzureOptimizationAADObjectsV1_CL
    | where TimeGenerated > ago(1d)
    | where ObjectType_s in ('Application','ServicePrincipal')
    | where ObjectSubType_s != 'ManagedIdentity'
    | where Keys_s startswith '['
    | extend Keys = parse_json(Keys_s)
    | project-away Keys_s
    | mv-expand Keys
    | evaluate bag_unpack(Keys)
    | union ( 
        AzureOptimizationAADObjectsV1_CL
        | where TimeGenerated > ago(1d)
        | where ObjectType_s in ('Application','ServicePrincipal')
        | where ObjectSubType_s != 'ManagedIdentity'
        | where isnotempty(Keys_s) and Keys_s !startswith '['
        | extend Keys = parse_json(Keys_s)
        | project-away Keys_s
        | evaluate bag_unpack(Keys)
    )
    );
    let ExpirationInRisk = AppsAndKeys
    | where EndDate < now()+expiryInterval
    | project ApplicationId_g, KeyId, RiskDate = EndDate;
    let NotInRisk = AppsAndKeys
    | where EndDate > now()+expiryInterval
    | project ApplicationId_g, KeyId, ComfortDate = EndDate;
    let ApplicationsInRisk = ExpirationInRisk
    | join kind=leftouter ( NotInRisk ) on ApplicationId_g
    | where isempty(ComfortDate)
    | summarize ExpiresOn = max(RiskDate) by ApplicationId_g;
    AppsAndKeys
    | join kind=inner (ApplicationsInRisk) on ApplicationId_g
    | summarize ExpiresOn = max(EndDate) by ApplicationId_g, ObjectType_s, DisplayName_s, Cloud_s, KeyType, TenantGuid_g
    | order by ExpiresOn desc
The term 'Invoke-AzOperationalInsightsQuery' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.

 

 

 

Microsoft

@akumar1979, you must first ensure all Az modules are installed in the Hybrid Worker machine, including Az.OperationalInsights and Az.ResourceGraph. Then, you'll need to reschedule all the runbooks that will interact with the Private Endpoint so that they run on the hybrid worker. If the PE is enabled only in the SQL Database, then you should reschedule all the Ingest-* and Recommend-* runbooks. If you don't have a PE on the Storage Account, the Export-* runbooks can and should run normally in the Azure sandbox.

Last update: Feb 24 2021 09:21 AM