This is the second post of a series dedicated to the implementation of automated Continuous Optimization with Azure Advisor Cost recommendations. For context on the solution described in this and the following posts, please read the introductory post.
As we saw in the previous post, if we want to better inform decisions on top of Azure Advisor right-size recommendations, we need to combine these with additional data coming from other sources:
Now let’s look at each of these data sources in detail and start building our Azure Optimization Engine pipeline.
I am not going into the details of configuring the Log Analytics agent in your Azure Virtual Machines, as you have very good guidance in the official documentation. For collecting performance metrics with the Log Analytics agent, you have several options, but this series covers the Perf solution option, as it provides more control over the metrics and collection intervals, and we also want to optimize Azure consumption in the optimization tool itself 😉 So, besides having all your VMs onboarded to Log Analytics, go to the workspace's Advanced Settings > Data blade and configure at least the following counters (if your VMs were already being monitored by Log Analytics, you just need to check whether all the needed counters are there):
Disk counters need to be separated into read and write dimensions because of the impact of read/write host caching settings: when looking at disk performance, we must consider cached vs. non-cached virtual machine disk throughput/IOPS limits. For the same reason, we need to collect metrics for all disk instances, because host caching is defined per disk. Network throughput is collected as a total, because network bandwidth limits are independent of network adapter and direction. Processor metrics are collected for all CPU instances, because overall percentages can be misleading (e.g., 50% total CPU usage may be the result of 100% usage in one core and 0% in another). Each performance counter instance collected at a 60-second interval consumes about 430 KB per computer per day in Log Analytics. In a scenario with 4 logical disks and 4 CPU cores, each computer would generate 27 performance counter instances (20 for logical disk, 1 for memory, 1 for network adapter and 5 for processor). If all performance counters were collected at the same 60-second frequency, each computer would generate ~11 MB of data per day. Of course, you can adjust the collection interval for some counters if you want your solution to be cost-savvy (see example below).
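By the way, if you'd rather script the workspace configuration than click through the portal, here is a minimal sketch using the Az.OperationalInsights module. The resource group and workspace names are placeholders, and the specific counters and intervals below are just illustrative; align them with the counter list above and with your cost goals.

# Minimal sketch, assuming the Az.OperationalInsights module is installed and you are
# already signed in with Connect-AzAccount. Names and counter choices are placeholders.

# Example: collect processor usage for every CPU instance at a 60-second interval
New-AzOperationalInsightsWindowsPerformanceCounterDataSource -ResourceGroupName "rg-monitoring" `
    -WorkspaceName "la-workspace" -Name "Processor-PctProcessorTime" `
    -ObjectName "Processor" -InstanceName "*" -CounterName "% Processor Time" -IntervalSeconds 60

# Example of a cost-savvy option: a less critical counter collected at a larger interval
New-AzOperationalInsightsWindowsPerformanceCounterDataSource -ResourceGroupName "rg-monitoring" `
    -WorkspaceName "la-workspace" -Name "LogicalDisk-FreeSpace" `
    -ObjectName "LogicalDisk" -InstanceName "*" -CounterName "% Free Space" -IntervalSeconds 300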
Collecting VM and disk properties with Azure Resource Graph (ARG) is super easy. The VM queries are straightforward and self-explanatory:
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| extend dataDiskCount = array_length(properties.storageProfile.dataDisks), nicCount = array_length(properties.networkProfile.networkInterfaces)
| order by id asc
… for ARM VMs and …
resources
| where type =~ 'Microsoft.ClassicCompute/virtualMachines'
| extend dataDiskCount = iif(isnotnull(properties.storageProfile.dataDisks), array_length(properties.storageProfile.dataDisks), 0), nicCount = iif(isnotnull(properties.networkProfile.virtualNetwork.networkInterfaces), array_length(properties.networkProfile.virtualNetwork.networkInterfaces) + 1, 1)
| order by id asc
… for Classic VMs.
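You can test these queries outside the portal with the Az.ResourceGraph PowerShell module. A quick sketch, assuming the module is installed and you are already authenticated:

# Runs the ARM VM query above with Search-AzGraph
# (assumes Install-Module Az.ResourceGraph and Connect-AzAccount were already done).
$armVmsQuery = @"
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| extend dataDiskCount = array_length(properties.storageProfile.dataDisks), nicCount = array_length(properties.networkProfile.networkInterfaces)
| order by id asc
"@

$armVms = Search-AzGraph -Query $armVmsQuery
$armVms | Select-Object name, resourceGroup, dataDiskCount, nicCount | Format-Table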
For Managed Disks, the query is more complicated, because we want to distinguish between OS and Data disks:
resources
| where type =~ 'Microsoft.Compute/disks'
| extend DiskId = tolower(id), OwnerVmId = tolower(managedBy)
| join kind=leftouter (
resources
| where type =~ 'Microsoft.Compute/virtualMachines' and array_length(properties.storageProfile.dataDisks) > 0
| extend OwnerVmId = tolower(id)
| mv-expand DataDisks = properties.storageProfile.dataDisks
| extend DiskId = tolower(DataDisks.managedDisk.id), diskCaching = tostring(DataDisks.caching), diskType = 'Data'
| project DiskId, OwnerVmId, diskCaching, diskType
| union (
resources
| where type =~ 'Microsoft.Compute/virtualMachines'
| extend OwnerVmId = tolower(id)
| extend DiskId = tolower(properties.storageProfile.osDisk.managedDisk.id), diskCaching = tostring(properties.storageProfile.osDisk.caching), diskType = 'OS'
| project DiskId, OwnerVmId, diskCaching, diskType
)
) on OwnerVmId, DiskId
| project-away OwnerVmId, DiskId, OwnerVmId1, DiskId1
| order by id asc
No support for Classic VM disks, though, as they are unmanaged resources stored as page blobs in Azure Storage containers. Contributors are welcome!
For larger environments, we'll need to implement pagination, as ARG only returns the first 1000 rows of each query (the ARG runbook scripts below will show you how).
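Just to give you an idea of the pattern, here is a rough pagination sketch with Search-AzGraph. The -First/-Skip approach is one way of doing it, and the runbooks in the repository remain the authoritative implementation; the query and names below are simplified for brevity.

# Pagination sketch: fetch ARG results in pages of up to 1000 rows until a page comes back
# smaller than the page size. Ordering by id keeps the paging stable between calls.
# (The query is a trimmed-down version of the Managed Disks query above.)
$disksQuery = @"
resources
| where type =~ 'Microsoft.Compute/disks'
| order by id asc
"@

$pageSize = 1000
$skip = 0
$allDisks = @()

do {
    if ($skip -eq 0) {
        # first page does not need -Skip
        $page = Search-AzGraph -Query $disksQuery -First $pageSize
    }
    else {
        $page = Search-AzGraph -Query $disksQuery -First $pageSize -Skip $skip
    }
    $allDisks += $page
    $skip += $pageSize
} while ($page -and $page.Count -eq $pageSize)

Write-Output "Collected $($allDisks.Count) disks."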
Now that we have some grounding on the data we need to collect, we can finally start deploying the Azure Optimization Engine (AOE) solution! We are just looking at data collection for the moment – no augmented recommendations yet – but we will deploy here all the necessary foundations for the complete solution to be presented in the upcoming posts. In the links below, you'll be directed to the AOE repository, where you'll sooner or later find the complete working solution.
The AOE solution is made of the following building blocks:
So, to deploy the AOE, you just need to run the Deploy-AzureAutomationEngine script in an elevated prompt and authenticate to Azure with a user account that has Owner permissions over the chosen subscription and enough privileges to register Azure AD applications (see details). You'll be asked for several details about your deployment options, including whether you want to reuse an existing Log Analytics workspace or start with a new one.
The deployment will take some minutes to complete, and you'll then be asked to enter a password for the Automation account's Run As certificate. A couple of minutes more and the script will hopefully finish successfully. In the event of an error, you can re-deploy with the same parameters, as the process is idempotent. You can check the Automation Account schedules created by the deployment (see picture below), which will trigger within a 1-2 hour timeframe.
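If you prefer PowerShell over the portal to check those schedules, something like this should do; the resource group and Automation Account names are placeholders for whatever you chose during deployment.

# Lists the schedules created by the AOE deployment; replace the placeholder names
# with the resource group and Automation Account you chose during deployment.
Get-AzAutomationSchedule -ResourceGroupName "rg-aoe" -AutomationAccountName "aoe-automation" |
    Select-Object Name, StartTime, Frequency, Interval | Format-Table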
A few minutes after all the ingestion runbooks run, you'll be able to query those tables in Log Analytics. We'll make use of these records to generate our recommendations.
Once successfully deployed, and assuming you have your VMs onboarded to Log Analytics and collecting all the required performance counters, we have everything needed to start augmenting Advisor recommendations and even generating custom ones! Let it run for a few weeks and stay tuned – in the next post, we'll discuss how AOE produces the actual recommendations and we'll finally see some light!