[Guest perspective: The following article is a guest post by Andrew Luckwell. The views and opinions expressed in this article are those of the author. We thank Andrew for sharing his perspective.]
Azure resources can be configured in many ways, including ways which affect their performance, security, reliability, available features and ultimately cost.
The challenge is, all these resources and configurations are completely available to us by default. As long as someone has permission, they can create any resource and configuration they like. This implicit “anything goes” gives our technical teams the freedom to decide what’s best. Like a kid in a toy shop, they will naturally favour the biggest, fastest and coolest toys.
The immediate risk, of course, is building beyond business requirements. Too much SKU, too much resilience, too much performance and too high cost.
Left unchecked, we risk increasingly challenging, long-term issues:
- Over-delivering will quickly become the norm.
- Excessive resource configurations will become the habitual default in all environments.
- Teams will become misaligned with wider business requirements.
- Teams will become used to working in a frictionless environment, and challenge any restrictions.
- FinOps teams will be stuck in endless cost optimisation work.
You may already be feeling the pain. Trapped in a cycle of repetitive, reactive cost optimisation work, seeing the same repeat offenders and looking for a way out. To break (or prevent) the cycle, a new approach is needed. We must switch priorities from detection and removal, to prevention and control. We must keep waste out. We must avoid over-provisioning.
We can achieve this with governance.
What is governance
Governance is a collection of rules, processes and tools that control how an organization consumes IT resources. It ensures our teams deploy resources that align to certain business goals, like security, cost, resource management and compliance.
Governance rules are like rules for a boardgame. They define how the game should be played, no matter who is playing the game.
This is important. It aligns everyone to our organization's rules regardless of role, position, seniority and authority. It helps ensure people play by the rules rather than their rules. Try playing Monopoly with no rules. What’s going to happen? I will pass go, and I will collect 200 dollars.
For Microsoft Azure, and the cloud in general, governance is centered around controlling how resources can and cannot be configured.
- Storage Accounts should be configured like this.
- Virtual Machines must be configured like that.
- Disks can’t be configured with this.
It's as much about keeping wrong configurations out, as the right configurations in. When we enforce configurations that meet our goals and restrict those that don’t, we drastically increase our chance of success.
Why governance matters for FinOps
Almost all over-provisioning and waste can be traced back to how a resource is configured. From SKU, to size, redundancy and additional features, if it’s not needed it’s being wasted.
That’s all over-provisioning and waste is: resources, properties and values that we don’t need.
- Too much SKU, like Premium SSD Disks when Standard HDD/SSD is fine.
- Too much redundancy, like Storage Accounts with GRS when LRS is fine.
- Too many features, like App Gateways with WAF but it’s disabled.
Have a think for a moment. What over-provisioning have you seen in the past? Was it one or two resource properties causing the problems?
Whatever you’ve seen, with governance we can stop it happening again.
When we control how resources get configured, we can control over-provisioning and waste, too. We can determine configurations we don’t need through our optimization efforts, and then create rules that define the configurations we do need:
“We don’t need Premium SSD disks.” becomes “Disks must be Standard HDD/SSD.”
“We don’t need Storage Accounts with GRS.” becomes “Storage Accounts must use LRS.”
“We don’t need WAF-enabled Application Gateways.” becomes “Application Gateways should be Standard SKUs.”
These rules effectively remove the option to build beyond requirements. They will help teams avoid building too much/too big, stay within their means, hold them a bit more accountable and protect us from future overspend.
Detection becomes Prevention. Removal becomes Control.
Over time, we will:
- Help our teams deliver just enough.
- Raise and improve awareness of over-configurations and waste.
- Help keep waste out once it’s found.
- Reduce the chances of over-provisioning in future.
- Steadily reduce the need for ongoing Cost Optimisation efforts.
- Free up time for other FinOps stuff.
This is why governance is a natural evolution from cost optimization, and why it’s critical for FinOps teams who want to be more proactive and spend less time cleaning up after tech teams.
How can we natively govern Microsoft Azure?
In Microsoft Azure, we can use the native governance service Azure Policy to help control our environments. We can embed our governance rules into Azure itself and have Azure Policy do the heavy lifting of checking, reporting and enforcing.
Azure Policy has many useful features:
- Supports over 74000 resource properties, including all that generate costs.
- Can audit resources, deny deployments and even auto-resolve resources as they come into Azure.
- Provides easy reporting of compliance issues, saving time on manual checks.
- Checks every deployment from every source. From Portal to Terraform, it’s got you covered.
- Supports different scopes from Resource Groups to Management Groups, allowing policies to be used at any scale.
- Supports parameters, making policies re-usable and quick to modify when responding to change in requirements.
- Exemptions can be used on resources we want to ignore for now.
- Supports different enforcement modes, for safe rollout of new policies.
- It comes at no additional cost. Free!
These features make Azure Policy an extremely flexible and powerful tool that can help control resources, properties and values at any scale.
We can:
- Create Policies for almost any cost-impacting value. SKUs, Redundancy Tiers, Instance Sizes, you name it…
- Use different effects based on how ‘strict’ the rule should be. For example, we can use Deny (resource creation) for resources missing “Must have” attributes, and Audit to check if resources are still compliant with “Should have” attributes.
- Use a combination of effects, enforcement modes and exemptions to control the rollout of new policies.
- Reuse the Policies on multiple environments (like development versus production), with different values and effects depending on the environment's needs.
- Quickly change the values when needed. When requirements change, the parameters can be modified with little effort.
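To make the parameter and effect ideas above a little more concrete, here is a minimal sketch in Python that builds the body of a custom policy definition as a plain dictionary. The display name, the parameter names (`allowedSkus`, `effect`) and the per-environment values are illustrative assumptions, not taken from any built-in policy. The alias `Microsoft.Storage/storageAccounts/sku.name` is the one we would expect to use here, but confirm it in your own tenant before relying on it.

```python
import json


def allowed_storage_skus_definition():
    """Return the 'properties' of an example custom policy definition."""
    return {
        "displayName": "Example: Storage Accounts must use allowed SKUs",
        "mode": "Indexed",
        "parameters": {
            # Both the values and the effect are parameters, so a single
            # definition can be reused with different settings per environment.
            "allowedSkus": {"type": "Array"},
            "effect": {
                "type": "String",
                "allowedValues": ["Audit", "Deny", "Disabled"],
                "defaultValue": "Audit",
            },
        },
        "policyRule": {
            "if": {
                "allOf": [
                    {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
                    {
                        "field": "Microsoft.Storage/storageAccounts/sku.name",
                        "notIn": "[parameters('allowedSkus')]",
                    },
                ]
            },
            "then": {"effect": "[parameters('effect')]"},
        },
    }


# Hypothetical settings: strict in development, softer in production.
ENVIRONMENT_SETTINGS = {
    "development": {"allowedSkus": ["Standard_LRS"], "effect": "Deny"},
    "production": {
        "allowedSkus": ["Standard_LRS", "Standard_ZRS", "Standard_GRS"],
        "effect": "Audit",
    },
}


def assignment_parameters(environment):
    """Wrap plain values in the {"value": ...} shape used by assignments."""
    settings = ENVIRONMENT_SETTINGS[environment]
    return {name: {"value": value} for name, value in settings.items()}


print(json.dumps(allowed_storage_skus_definition(), indent=2))
print(json.dumps(assignment_parameters("development"), indent=2))
```

The same definition is assigned twice, once per environment, and only the parameter values differ. When requirements change, only `ENVIRONMENT_SETTINGS` needs to be touched.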
How to avoid unwanted Friction
A common concern with governance is that it will create friction, interrupt work and slow teams down. This is a valid concern, but Azure Policy’s features allow for a controlled and safe rollout. With a good plan there is no need to worry.
Consider the following:
- Start with Audit-Only policies and non-production environments.
- Start with simpler resources and regular/top offenders.
- Test policies in sandboxes before using them in live environments.
- Use the ‘Do not Enforce’ mode when first assigning Deny policies. This treats them as Audit-only, allowing review before being enforced.
- Always parameterize Effects and Values, for quick modification when needed.
- Use exemptions when there are sensitive resources that are best to ignore for now.
- Work with your teams and agree to a fair and balanced approach. Governance is for everyone and should include everyone where possible.
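As a rough illustration of the staged rollout described above, the sketch below builds the request bodies for a policy assignment in two stages (observe first, enforce later) and for an exemption. The resource IDs, display names and stage names are hypothetical placeholders. The `enforcementMode` values (`DoNotEnforce`, `Default`) and the exemption fields shown are the ones we understand Azure Policy to use, so check them against the current documentation before use.

```python
import json

# Hypothetical IDs, used only for illustration.
SUBSCRIPTION = "/subscriptions/00000000-0000-0000-0000-000000000000"
DEFINITION_ID = (
    SUBSCRIPTION
    + "/providers/Microsoft.Authorization/policyDefinitions/example-allowed-disk-skus"
)
ASSIGNMENT_ID = (
    SUBSCRIPTION
    + "/providers/Microsoft.Authorization/policyAssignments/example-allowed-disk-skus"
)

# Rollout stage -> enforcement mode on the assignment.
ROLLOUT_STAGES = {
    "observe": "DoNotEnforce",  # Deny is reported on but does not block anything
    "enforce": "Default",       # Deny now blocks non-compliant deployments
}


def build_assignment(stage, effect="Deny"):
    """Build an assignment body for the given rollout stage."""
    return {
        "properties": {
            "displayName": f"Example: allowed disk SKUs ({stage})",
            "policyDefinitionId": DEFINITION_ID,
            "enforcementMode": ROLLOUT_STAGES[stage],
            # Assumes the definition exposes a parameter named 'effect'.
            "parameters": {"effect": {"value": effect}},
        }
    }


def build_exemption(reason, expires_on):
    """Build an exemption body for a resource we want to ignore for now."""
    return {
        "properties": {
            "policyAssignmentId": ASSIGNMENT_ID,
            "exemptionCategory": "Waiver",
            "displayName": "Example: temporary exemption",
            "description": reason,
            "expiresOn": expires_on,  # ISO 8601 timestamp
        }
    }


print(json.dumps(build_assignment("observe"), indent=2))
print(json.dumps(build_exemption("Sensitive workload, review later",
                                 "2030-01-01T00:00:00Z"), indent=2))
```

Moving from observing to enforcing is then a one-word change on the assignment, which keeps the rollout easy to reverse if a team is impacted unexpectedly.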
The biggest challenge of all may be breaking habits formed over years of freedom in the Cloud. It’s natural to resist change, especially when it takes away our freedom.
Remember, it’s friction where it’s needed, interruption where it’s needed, slowdown where it’s needed.
The key to getting teams onboard is delivering the right message. Why are we doing this? How will they benefit? How does it help them? How could they be impacted if you do nothing?
This needs to be more than “To meet our FinOps goals”. That’s your goal, not theirs. They won’t care.
Try something like:
We keep seeing over-utilization and waste and are spending an additional ‘X amount’ of time and money trying to remove it. This is now impacting our ability to invest properly into our IT teams, affecting other departments and impacting our overall growth.
If we can get over-spend reduced and under control, we can re-invest where you need it; tooling, people, training and anything else that makes your lives better.
We want to implement governance rules and policies that will prevent issues reoccurring. With your insights and support we can achieve this faster, avoid unwanted impact, and can re-invest back into our IT teams once done.
Sound good to you?!
This is far more compelling and gives them reason to get onboard and help out.
How to start your FinOps governance journey
Making the jump from workload optimization into governance might initially sound challenging, but it’s actually pretty straightforward.
Consider the typical workload optimization cycle:
- Discover potential misconfiguration, optimization and waste cleanup opportunities.
- Compare to actual business requirements.
- Optimize workload to meet those business requirements.
A governance practice extends this to the following:
- Identify potential misconfigurations, optimization and waste cleanup opportunities.
- Compare to actual business requirements.
- Optimize workload to meet those business requirements.
- Create an Azure Policy based on how the resource should have originally been configured, and how it should remain in future.
That's it, one extra step.
Most of the hard work has already happened in steps 1-3, in the workload optimization we’ve already been doing. Step 4 simply turns the optimization into a rule that says “This resource must be like this from now on”, preventing it happening again.
Let's do it again with a real resource, an Azure Disk:
- Identify Premium SSD Disks in non-production environment.
- Compare to business requirements, which confirms Standard HDD is fine.
- Change Disk SKU from Premium to Standard HDD.
- Create Azure Policy that only allows Disks with Standard HDD in the environment and denies other SKUs.
Done. No more Premium SSDs in this environment again. Prevention and Control.
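The fourth step for the disk example could look something like the sketch below: a policy rule, built as a Python dictionary, that denies any managed disk whose SKU is not in an allowed list. This is a simplified illustration rather than a tested definition. The alias `Microsoft.Compute/disks/sku.name` and the SKU name `Standard_LRS` (Standard HDD) are what we would expect to use, and the `would_be_denied` helper is a hypothetical local approximation for reasoning about the rule, not how Azure evaluates it.

```python
import json

# Standard HDD only. In practice this list would be a policy parameter.
ALLOWED_DISK_SKUS = ["Standard_LRS"]

DISK_POLICY_RULE = {
    "if": {
        "allOf": [
            {"field": "type", "equals": "Microsoft.Compute/disks"},
            {"field": "Microsoft.Compute/disks/sku.name", "notIn": ALLOWED_DISK_SKUS},
        ]
    },
    "then": {"effect": "deny"},
}


def would_be_denied(resource):
    """Local approximation of the rule above, for illustration only."""
    is_disk = resource["type"].lower() == "microsoft.compute/disks"
    return is_disk and resource["sku"] not in ALLOWED_DISK_SKUS


# Hypothetical sample resources.
samples = [
    {"name": "disk-premium", "type": "Microsoft.Compute/disks", "sku": "Premium_LRS"},
    {"name": "disk-hdd", "type": "Microsoft.Compute/disks", "sku": "Standard_LRS"},
]

print(json.dumps(DISK_POLICY_RULE, indent=2))
for sample in samples:
    print(sample["name"], "denied" if would_be_denied(sample) else "allowed")
```

The values are hard-coded here to keep the example short. Following the earlier advice to parameterize effects and values, a real definition would take both the allowed SKUs and the effect as parameters.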
The real work lies in being able to understand and identify how resources become over-provisioned and wasteful. Until then we will struggle to optimize, let alone govern.
The Wasteful Eight
There are so many resources and properties available. Understanding all the ways they can create waste can be challenging.
Fortunately, we can group resource properties into eight main categories, which make our efforts a bit easier. Let's look at the Wasteful Eight:
Category | Examples |
Over-provisioned SKUs | - Disks with Premium SSD instead of Standard HDD/SSD. |
Too much redundancy | - Storage Accounts configured with GRS, when LRS is fine. - Recovery Services Vaults with GRS, when LRS is fine. - SQL Databases with Zone Redundancy enabled. |
Too large / too many instances. | - Azure VMs with too many CPUs. - SQL Databases with too many vCores/DTUs. - Disks which are over 1024GB. |
Supports auto-scaling/serverless, but aren’t using it. | - Application Gateway doesn’t have auto-scaling enabled. - App Service Plans without Auto-Scaling. - SQL Databases using fixed provisioning, instead of Serverless or Elastic Pools |
Too many backups. | - Backups that are too frequent. - Backups with too long retention periods. - Non-prod backups with similar retentions as Prod. |
Too much logging. | - Logging enabled in non-prod. - Log retentions too long. - Logging to Log Analytics instead of Storage Accounts. - Log Analytics not using cheaper table plans. |
Extra features that are disabled, or not being used. | - Application Gateway with WAF SKU, but the WAF is disabled. - Azure Firewall with Premium SKU, but IDPS is disabled. - Storage Accounts with SFTP enabled but not used. |
Orphaned/Unused. | - Unattached Disks - Empty App Service Plans - Unattached NAT Gateways |
All resources will fall somewhere in the above categories. A single resource can be found in most of them.
For example, an Application Gateway can:
- Have an over-provisioned/unused SKU (WAF vs Standard).
- Have auto-scaling disabled.
- Have too many instances.
- Have excessive logging enabled.
- Have the WAF SKU, but the WAF is for some reason disabled.
- Be orphaned, by having no backend VMs.
Breaking down any resource like this will uncover most of its cost-impacting properties and give us a good idea of what to focus on. A few outliers are inevitable, but the vast majority will be covered.
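The breakdown above can be sketched as a small checklist in code. The example below is purely illustrative: `AppGatewaySnapshot` and its field names are a hypothetical, simplified view of an Application Gateway that we populate ourselves, not an Azure SDK type, and the thresholds are assumptions for a non-production environment.

```python
from dataclasses import dataclass


@dataclass
class AppGatewaySnapshot:
    """Hypothetical, simplified view of one Application Gateway."""
    name: str
    sku_tier: str           # e.g. "Standard_v2" or "WAF_v2"
    waf_enabled: bool
    autoscale_enabled: bool
    instance_count: int
    logging_enabled: bool
    backend_targets: int
    production: bool


def find_waste(gw, needs_waf=False, max_instances=2):
    """Return the waste categories that a gateway appears to fall into."""
    findings = []
    has_waf_sku = gw.sku_tier.upper().startswith("WAF")
    if has_waf_sku and not needs_waf:
        findings.append("Over-provisioned SKU (WAF vs Standard)")
    if not gw.autoscale_enabled:
        findings.append("Supports auto-scaling but isn't using it")
    if gw.instance_count > max_instances:
        findings.append("Too many instances")
    if gw.logging_enabled and not gw.production:
        findings.append("Too much logging")
    if has_waf_sku and not gw.waf_enabled:
        findings.append("Extra feature paid for but disabled (WAF)")
    if gw.backend_targets == 0:
        findings.append("Orphaned/unused")
    return findings


# Hypothetical non-production gateway that trips every check.
dev_gateway = AppGatewaySnapshot(
    name="agw-dev",
    sku_tier="WAF_v2",
    waf_enabled=False,
    autoscale_enabled=False,
    instance_count=4,
    logging_enabled=True,
    backend_targets=0,
    production=False,
)

for finding in find_waste(dev_gateway):
    print(f"{dev_gateway.name}: {finding}")
```

Each check maps to one of the categories, so every finding is also a candidate governance rule.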
Let's explore the Application Gateway examples further, the reasons why each item is wasteful and the subsequent Policies we might consider in a non-production environment. I’ve also included some links to respective Azure Policy definitions available in GitHub (test before use!).
Discovery | Reason | Governance Rule/Policy | Allowed Values and effects if applicable |
Application Gateway has WAF SKU but doesn’t need it. | We use another firewall product. | Allowed Application Gateway SKUs | Standard Deny |
Application Gateway isn’t configured with Auto-Scaling, creating inefficient use of instances. | Auto-Scaling improves efficiency by scaling up and down as demand changes. Manual scaling is inefficient. | Application Gateway should be configured with Auto-Scaling. | Deny |
Application Gateway min/max instance counts are higher than needed. | Setting Min/Max instance thresholds avoids them being too high. Particularly the min count, which might not need more than 1 instance. | Allowed App Gateway Min/Max instance counts | Min Count: 1 Max Count: 2 Deny |
Non-Prod Application Gateways have logging enabled, when it’s not needed. | We don't have usage that needs to be logged in non-production environments. | Non-Prod Application Gateways should avoid logging | Deny |
Application Gateway has WAF but it’s disabled. | A disabled WAF is doing nothing yet still paid for. Either use it, or change the Tier to Standard to reduce costs. | Application Gateway WAF is disabled. | Audit |
Application Gateway has no Backend Pool resources. | Indicates an orphaned/unused App Gateway. It should be removed. | Application Gateway has empty Backend Pool and appears Orphaned | Audit |
Now this might seem a bit over the top. Do we really need to be controlling our App Gateway min/max scaling counts? It depends. If you have a genuine problem with too many instances then yes, you probably should.
The point is, you can if you need to. This simply demonstrates how powerful governance and Azure Policy can be at controlling how resources are used.
A more likely starting point will be things like SKUs, Sizes, Redundancy Tiers and Logging. These are the high risk, high impact areas you’ve probably seen before and want to avoid again.
Once you exhaust those, it's time to jump into Cost Management and explore your most expensive resources and services. Explore the Billing Meters to see how each resource's costs are broken down. This is where your money is going and where your governance rules will have the biggest impact.
Where to find Azure Policies
If you want to use Azure Policy, you're going to need some Policy Definitions. A Definition is your governance rule defined in Azure. It tells Azure what configurations you do and don't want, and how to deal with problems.
It's recommended that you start with some of the built-in policies first, before creating your own. These are provided and maintained by Microsoft, and are available inside Azure Policy ready to be applied. Fortunately, there are plenty of policies to choose from: built-in, community provided, Azure Landing Zone related and a few of my own:
Azure Built-in Policy Repo: https://github.com/Azure/azure-policy
Azure Community Policy Repo: https://github.com/Azure/Community-Policy
Azure Landing Zones Policies: https://github.com/Azure/Enterprise-Scale/blob/main/docs/wiki/ALZ-Policies.md
My stuff: https://github.com/aluckwell/Azure-Cost-Governance
Making the search even easier is the AzAdvertizer. This handy tool brings thousands of policies into a single location, with easy search and filter functionality to help find useful ones. It even includes 'Deploy to Azure' links for quick deployment.
AzAdvertizer: https://www.azadvertizer.net/azpolicyadvertizer_all.html
Of the thousands of policies in AzAdvertizer, the list below is a great starting point for FinOps. These are all built-in, ready to go and will help you get familiar with how Azure Policy works:
Policy Name | Use Case | Link |
Not Allowed Resource Types | Block the creation of resources you don't need. Helps control when resource types can/can't be used. | https://www.azadvertizer.net/azpolicyadvertizer/6c112d4e-5bc7-47ae-a041-ea2d9dccd749.html |
Allowed virtual machine size SKUs | Allow the use of specific VM SKUs and Sizes and block SKUs that are too big or not fit for our use-case. | https://www.azadvertizer.net/azpolicyadvertizer/cccc23c7-8427-4f53-ad12-b6a63eb452b3.html |
Allowed App Services Plan SKUs | Allow the use of specific App Service Plan SKUs. Block SKUs that are too big or not fit for our use-case. | https://www.azadvertizer.net/azpolicyadvertizer/27e36ba1-7f72-4a8e-b981-ef06d5c78c1a.html |
[Preview]: Do not allow creation of Recovery Services vaults of chosen storage redundancy. | Avoid Recovery Services Vaults with too much redundancy. If you don't need GRS, block it. | https://www.azadvertizer.net/azpolicyadvertizer/8f09fda1-91a2-4e14-96a2-67c6281158f7.html |
Storage accounts should be limited by allowed SKUs | Avoid too much redundancy and performance when it's not needed. | https://www.azadvertizer.net/azpolicyadvertizer/7433c107-6db4-4ad1-b57a-a76dce0154a1.html |
Configure Azure Defender for Servers to be disabled for resources (resource level) with the selected tag | Disable Defender for Servers on Virtual Machines if they don't need it. Help control the rollout of Defender for Servers, avoiding machines that don't need it. | https://www.azadvertizer.net/azpolicyadvertizer/080fedce-9d4a-4d07-abf0-9f036afbc9c8.html |
Unused App Service plans driving cost should be avoided | Highlight when App Service Plans are 'Orphaned'. Either put them to use or get them deleted ASAP. |
New policies are always being added, and existing policies improved (see the Versioning). Check back occasionally for changes and new additions that might be useful.
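Built-in policies are referenced by a definition ID, which appears to be the GUID at the end of each AzAdvertizer link in the table. As a hedged sketch, the Python below turns the GUID for "Allowed virtual machine size SKUs" into a full definition ID and builds an assignment body for it. The parameter name `listOfAllowedSKUs` is what we believe this built-in expects, and the VM sizes are only example values, so confirm both against the definition itself before assigning it.

```python
import json

# Built-in definitions live at tenant level under this prefix.
BUILT_IN_PREFIX = "/providers/Microsoft.Authorization/policyDefinitions/"

# GUID taken from the AzAdvertizer link for "Allowed virtual machine size SKUs".
ALLOWED_VM_SIZES_GUID = "cccc23c7-8427-4f53-ad12-b6a63eb452b3"


def built_in_definition_id(guid):
    """Build the full resource ID of a built-in policy definition."""
    return BUILT_IN_PREFIX + guid


def allowed_vm_sizes_assignment(sizes):
    """Build an assignment body that restricts VMs to the given sizes."""
    return {
        "properties": {
            "displayName": "Example: allowed VM sizes for non-production",
            "policyDefinitionId": built_in_definition_id(ALLOWED_VM_SIZES_GUID),
            # Assumed parameter name; verify it in the built-in definition.
            "parameters": {"listOfAllowedSKUs": {"value": list(sizes)}},
        }
    }


# Example sizes only; pick the ones your workloads actually need.
body = allowed_vm_sizes_assignment(["Standard_B2s", "Standard_D2s_v5"])
print(json.dumps(body, indent=2))
```

Starting from a built-in like this means there is no definition to write or maintain, only a list of values to agree with the teams.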
When you get the itch to create your own, I'd suggest watching the following videos to understand the nuts and bolts of Azure Policy, and then onto Microsoft Learn for further reading.
https://www.youtube.com/watch?v=4wGns611G4w
https://www.youtube.com/watch?v=fhIn_kHz4hk
https://learn.microsoft.com/azure/governance/policy/overview
Good luck!