Hello, dear readers! Here is Hélder Pinto again, this time writing about a topic that came out of my experience with one of my customers. They decided to stop using the Azure Diagnostics extension in their virtual machine estate but faced a massive challenge: how do you remove the extension from thousands of VMs, make sure the diagnostics data is also removed from Azure Storage and, by the way, save more than 10K euros per month? Let’s see how we did it.
Introduction
The Azure Diagnostics extension is an agent that collects monitoring data from the guest operating system of Azure virtual machines. With this extension, you can collect guest metrics and many types of logs and then send them to Azure Storage (the default sink), Azure Monitor Metrics, or even Azure Event Hubs (to be ingested by a third-party sink). No matter which additional sinks you configure, the Azure Diagnostics extension always collects data into an Azure Storage account, using mostly Table storage*.
If, for some reason, you decide to stop collecting logs and metrics with the Azure Diagnostics extension, doing it could be as simple as uninstalling the extension from your VMs. But wait! You’ll likely also need to get rid of all the data the agent collected over time. What if the Storage Accounts used by the extension are shared with other services? Will you still be able to identify those Storage Accounts after the extension is removed? The mission is not as simple as it seemed! 🙂 Let’s see below how we can do it effectively (and efficiently!).
Azure Diagnostics extension and data cleanup guide
If you need to remove the Azure Diagnostics extension at scale from your Azure virtual machine estate and then clean up the data it generated (at least the largest share of it, which lives in Azure Storage Tables), here is a complete procedure, with scripts, that will help you achieve that goal.
The procedure is divided into three steps:
- Assess which Azure Storage accounts are being used as a sink for the Diagnostics extension - carefully keep the generated CSV, because we will need this list for the last step.
- Uninstall at scale the Diagnostics extension from your virtual machines.
- Remove at scale the Azure Storage tables that were generated by the Diagnostics extension - we will use here the list extracted in the first step.
Requirements
- Az PowerShell modules
- Az.ResourceGraph module
- The user executing the scripts should have the Contributor role in the Azure subscriptions. If virtual machines have resource locks, then the user must have the Owner role.
Step 1 - Extract the list of Storage Accounts containing Azure Diagnostics data
In a PowerShell prompt, run the Export-VmDiagnosticsStorageAccounts.ps1 script:
.\Export-VmDiagnosticsStorageAccounts.ps1 [-Cloud <AzureCloud | AzureChinaCloud | AzureGermanCloud | AzureUSGovernment>]
This will generate a CSV file containing a list of all the Storage Accounts that are being used by the Azure Diagnostics extensions (see sample content below). Save this file, as we will need it for the last step.
The magic behind this script is an Azure Resource Graph (I LOOOVE this service) query that quickly returns what you need:
resources
| where type =~ 'microsoft.compute/virtualmachines/extensions' and tostring(properties.type) in ('LinuxDiagnostic', 'IaaSDiagnostics')
| extend storageAccountName = iif(isempty(tostring(properties.settings.StorageAccount)),tostring(properties.settings.storageAccount),tostring(properties.settings.StorageAccount))
| project id, storageAccountName
| join kind=inner (
resources
| where type =~ 'microsoft.storage/storageAccounts'
| project storageAccountName = name, resourceGroup, subscriptionId
) on storageAccountName
| summarize count() by storageAccountName, resourceGroup, subscriptionId
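If you want to try the query yourself outside the provided script, the following is a minimal sketch of how it could be run and saved to CSV. It is not the code of Export-VmDiagnosticsStorageAccounts.ps1: the function name, the `OutputPath` parameter, and the `extensionCount` column are hypothetical, and it assumes Az.ResourceGraph 0.10.0 or later, where `Search-AzGraph` exposes the result rows in a `Data` property.

```powershell
# Sketch only: assumes Az.ResourceGraph 0.10.0 or later, where Search-AzGraph
# returns an object exposing the result rows in its .Data property.
function Export-DiagnosticsStorageAccountList {
    param(
        [Parameter(Mandatory)] [string] $Query,       # the Resource Graph query shown above
        [Parameter(Mandatory)] [string] $OutputPath   # hypothetical CSV destination
    )

    # -First accepts up to 1000 rows per call; larger result sets would
    # need additional handling that this sketch leaves out.
    $response = Search-AzGraph -Query $Query -First 1000

    # 'count_' is the default column name produced by 'summarize count()'.
    $response.Data |
        Select-Object storageAccountName, resourceGroup, subscriptionId,
            @{ Name = 'extensionCount'; Expression = { $_.count_ } } |
        Export-Csv -Path $OutputPath -NoTypeInformation
}
```

The column layout here is only illustrative and may not match what the step 3 script expects, so use the CSV produced by the actual script for the cleanup.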
Step 2 - Uninstall at scale the Diagnostics extension from virtual machines
In the same PowerShell prompt, run the Uninstall-VmDiagnosticsExtensionAtScale.ps1 script. The script is prepared to deal with the following scenarios:
- Deallocated virtual machines - it will start them, remove the extension, and shut them down again (only VMs with the extension will be started).
- Virtual machines that have a resource lock - it will remove the lock, remove the extension, and re-add the exact same lock - this requires you to have the Owner role for those virtual machines.
- Target a specific resource group or subscription.
- Make a dry run of the process with the Simulate switch.
Here is the full script syntax:
.\Uninstall-VmDiagnosticsExtensionAtScale.ps1 [-Cloud <AzureCloud | AzureChinaCloud | AzureGermanCloud | AzureUSGovernment>] [-TargetSubscriptionId <subscription Id>] [-TargetResourceGroup <resource group name>] [-RemoveLocks] [-StartVMs] [-Simulate]
Some examples:
.\Uninstall-VmDiagnosticsExtensionAtScale.ps1 -RemoveLocks -StartVMs -Simulate - this will simulate an execution, starting deallocated VMs and removing resource locks before uninstalling the extension (of course, VMs won't be started nor locks removed)
.\Uninstall-VmDiagnosticsExtensionAtScale.ps1 -TargetSubscriptionId aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee -StartVMs - this will uninstall the extension only for the aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee subscription, starting deallocated VMs if needed.
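To picture how the scenarios listed earlier combine for a single VM, here is a small sketch of the decision logic. It is an illustration under assumptions, not the script's actual implementation: the function name `Get-UninstallPlan`, the step labels, the ordering of steps, and the choice to skip stopped or locked VMs when the matching switch is absent are all hypothetical.

```powershell
# Illustrative only: the real script's internals are not shown in this article.
function Get-UninstallPlan {
    param(
        [Parameter(Mandatory)] [string] $PowerState,  # e.g. 'PowerState/running' or 'PowerState/deallocated'
        [bool]   $HasLock = $false,
        [switch] $StartVMs,
        [switch] $RemoveLocks
    )

    $isRunning = $PowerState -eq 'PowerState/running'

    # Assumption: without the matching switch, a stopped or locked VM is skipped.
    if (-not $isRunning -and -not $StartVMs) { return @('Skip: VM not running') }
    if ($HasLock -and -not $RemoveLocks)     { return @('Skip: VM has a resource lock') }

    # Build the ordered list of actions, restoring the original state at the end.
    $steps = @()
    if ($HasLock)        { $steps += 'Remove lock' }
    if (-not $isRunning) { $steps += 'Start VM' }
    $steps += 'Remove extension'
    if (-not $isRunning) { $steps += 'Deallocate VM' }
    if ($HasLock)        { $steps += 'Re-create lock' }
    return $steps
}
```

For example, `Get-UninstallPlan -PowerState 'PowerState/deallocated' -HasLock $true -StartVMs -RemoveLocks` yields the steps Remove lock, Start VM, Remove extension, Deallocate VM, Re-create lock, in that order.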
The script gets the list of VMs to uninstall the extension from on the fly and completes quickly, because the uninstallations run asynchronously. At the end, you get a CSV file containing the result of each uninstallation attempt, e.g., whether the VM was running, whether it had resource locks, and whether the extension was uninstalled. Allow at least 30 minutes for the process to finish. After this period, you can run the following query in Resource Graph Explorer to check how successful the process was (it lists the Diagnostics extensions that are still installed, so fewer rows is better):
resources
| where type =~ 'microsoft.compute/virtualmachines/extensions' and tostring(properties.type) in ('LinuxDiagnostic', 'IaaSDiagnostics')
| project id, name
| extend vmId = substring(id, 0, indexof(id, '/extensions/'))
| join kind=inner (
resources
| where type =~ 'microsoft.compute/virtualmachines'
| project vmId = id, vmName = name, resourceGroup, subscriptionId, powerState = tostring(properties.extended.instanceView.powerState.code)
) on vmId
| project-away vmId, vmId1
| order by id asc
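If the query above still returns many rows, grouping them by VM power state can hint at why (for instance, deallocated VMs that were never started). The sketch below assumes the rows were already retrieved, for example with `Search-AzGraph`, and that each row carries the `powerState` column projected by the query. The function name is hypothetical.

```powershell
# Sketch: summarize the remaining Diagnostics extensions by VM power state.
# $Rows is expected to hold the rows returned by the verification query above.
function Get-RemainingExtensionSummary {
    param([Parameter(Mandatory)] [AllowEmptyCollection()] [object[]] $Rows)

    $Rows |
        Group-Object -Property powerState |
        Sort-Object -Property Count -Descending |
        Select-Object @{ Name = 'powerState'; Expression = { $_.Name } }, Count
}

# Possible usage (assumes Az.ResourceGraph 0.10.0 or later and a $verificationQuery
# variable holding the query text):
# $rows = (Search-AzGraph -Query $verificationQuery -First 1000).Data
# Get-RemainingExtensionSummary -Rows $rows
```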
And here is a sample output of the CSV file generated by the Uninstall-VmDiagnosticsExtensionAtScale.ps1 script:
If, for some reason, some extensions are not removed successfully, refer to the troubleshooting documentation. Nevertheless, you can proceed with no fear to the final step - removing Azure Storage Tables. Those zombie Diagnostics extensions will recreate the tables and continue writing into them, but at least you'll have reduced your problem to a fraction of its original size. After fixing the extension issues, you can repeat steps 2 and 3.
Step 3 - Remove the Azure Storage Tables used by the Diagnostics extension
In this final step, you'll use the CSV generated in step 1 to trigger the removal of all the Azure Storage Tables that are fed by the Diagnostics extension. The Remove-VmDiagnosticsTables.ps1 script is very simple to use. If needed, you can target a specific subscription instead of the whole tenant.
.\Remove-VmDiagnosticsTables.ps1 -StorageAccountsCsvPath <path to the storage account list CSV generated in step 1> [-Cloud <AzureCloud | AzureChinaCloud | AzureGermanCloud | AzureUSGovernment>] [-TargetSubscriptionId <subscription Id>]
The script removes only the Storage Tables used by the Azure Diagnostics extension, leaving untouched all the remaining data that exist in the Storage Account, such as blobs or other tables used by other applications.
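The idea of deleting only the Diagnostics tables can be sketched as a name filter. The patterns below are an assumption based on the table names the Windows and Linux Diagnostics extensions commonly create (for example `WADMetrics*`, `WADPerformanceCountersTable`, `LinuxSyslogVer2v0`). They are not taken from Remove-VmDiagnosticsTables.ps1, and the function and variable names are hypothetical, so check them against your own accounts before deleting anything.

```powershell
# Assumed naming patterns for tables created by the Windows (WAD*) and Linux
# (Linux*Ver2v0) Diagnostics extensions. Verify them against your own accounts.
$DiagnosticsTablePatterns = @('WADMetrics*', 'WAD*Table', 'Linux*Ver2v0')

# Returns only the table names that match one of the assumed patterns.
function Select-DiagnosticsTableName {
    param([Parameter(Mandatory)] [AllowEmptyCollection()] [string[]] $TableName)

    foreach ($name in $TableName) {
        foreach ($pattern in $DiagnosticsTablePatterns) {
            # Emit the name once and stop checking further patterns for it.
            if ($name -like $pattern) { $name; break }
        }
    }
}

# Possible usage against one storage account (requires Az.Storage; $rg and
# $account are placeholders for values taken from the step 1 CSV):
# $ctx    = (Get-AzStorageAccount -ResourceGroupName $rg -Name $account).Context
# $tables = Get-AzStorageTable -Context $ctx
# if ($tables) {
#     Select-DiagnosticsTableName -TableName @($tables.Name) |
#         ForEach-Object { Remove-AzStorageTable -Name $_ -Context $ctx -Force }
# }
```

Keeping the filter separate from the deletion makes it easy to review the list of matching tables first, which matters when the Storage Account is shared with other applications.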
The next day, you'll likely notice a drop in your Azure Storage Table costs. Happy cleanup!
* Metrics and logs stored in Azure Tables do not have a retention mechanism, so your data (and Storage costs) keep growing over time.