This article contains a few recommendations for reducing the total cost of ownership (TCO) of your Azure Kubernetes Service (AKS) cluster.
Recommendations
To minimize the number of unused cores, you can use the following general guidelines to improve the density of your workloads and reduce the number of VMs to the bare minimum.
- Use the cluster autoscaler, Kubernetes Event-Driven Autoscaler (KEDA), and Horizontal Pod Autoscaler (HPA) to scale in and scale out the number of pods and the number of nodes based on traffic conditions.
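As a minimal sketch of the Horizontal Pod Autoscaler mentioned above, the manifest below scales a deployment on CPU utilization; the deployment name `web` and the thresholds are hypothetical values you would replace with your own:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa            # hypothetical name
spec:
  scaleTargetRef:          # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70%
```

The HPA requires CPU requests to be set on the target pods, which is one more reason to configure requests and limits properly.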
- Make sure to properly set requests and limits for your pods to avoid assigning too much CPU and memory to user-defined workloads and to improve application density. You can observe the average and maximum CPU and memory consumption using Prometheus or Container Insights, and configure appropriate limits and quotas for your pods in the YAML manifests, Helm charts, and Kustomize manifests of your deployments. For more information, see Best practices for application developers to manage resources in Azure Kubernetes Service (AKS). Third-party tools like Densify can gather granular container data from frameworks like Prometheus, learn activity patterns, and apply policies to suggest requests and limits for each pod container, optimizing overall density.
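As a sketch of the requests-and-limits configuration described above, the deployment below sets both values on its container; the name, image, and sizes are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: myregistry.azurecr.io/api:1.0   # hypothetical image
        resources:
          requests:          # what the scheduler uses for bin packing
            cpu: 100m
            memory: 128Mi
          limits:            # hard caps enforced at runtime
            cpu: 250m
            memory: 256Mi
```

Keeping requests close to observed usage is what lets the scheduler pack more pods onto each node.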
- Use ResourceQuota objects to set quotas for the total amount of memory and CPU that can be used by all pods running in a given namespace. Quotas prevent or reduce the likelihood of the noisy neighbor issue, improve application density, and reduce the number of agent nodes, and hence the total cost of ownership. Likewise, use LimitRange objects to configure default CPU and memory requests for pods running in a namespace. Azure Policy integrates with AKS through built-in policies to apply at-scale enforcements and safeguards on your cluster in a centralized, consistent manner. Follow the documentation to enable the Azure Policy add-on on your cluster and apply the Ensure CPU and memory resource limits policy, which ensures that CPU and memory resource limits are defined on containers in an Azure Kubernetes Service cluster.
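The ResourceQuota and LimitRange objects described above can be sketched as follows; the namespace name and the quota values are hypothetical:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a          # hypothetical namespace
spec:
  hard:                      # caps the namespace as a whole
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                 # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
```

Note that once a ResourceQuota covers CPU and memory, pods without requests are rejected, so pairing it with a LimitRange that supplies defaults is a common pattern.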
- Use the Vertical Pod Autoscaler (VPA), based on the open-source Kubernetes version, to analyze and set the CPU and memory resources required by your pods. Instead of running tests to calculate the optimal CPU and memory requests and limits for the containers in your pods, you can configure vertical pod autoscaling to provide recommended values that you apply to your pods manually, or to update the values automatically. When configured, the VPA automatically sets resource requests and limits on containers per workload based on past usage. This ensures pods are scheduled onto nodes with the required CPU and memory resources and can fix the slack issue (the gap between requested and used CPU) explained in the article Kubernetes Resource Use and Management in Production. Also, see the articles on Kubernetes autoscaling and the Vertical Pod Autoscaler.
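A minimal VPA manifest for the recommendation-only workflow described above might look like this; the target deployment name is hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical deployment
  updatePolicy:
    updateMode: "Off"        # recommendation-only; use "Auto" to apply values automatically
```

With `updateMode: "Off"`, the VPA only publishes recommendations (visible via `kubectl describe vpa web-vpa`) that you can apply to your manifests manually.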
- Choose the right VM size for the node pools of your AKS cluster based on the needs in terms of CPU and memory of your workloads. Azure offers many different instance types that match a wide range of use cases, with entirely different CPU, memory, storage, and networking capacity combinations.
- Create multiple node pools with different VM sizes for particular purposes and workloads, and use Kubernetes node labels, node selectors, pod and node affinity and anti-affinity, taints and tolerations, and pod topology spread constraints to place applications on specific node pools. This approach avoids noisy neighbor issues and improves pod density: you can keep node resources available for the workloads that require them and prevent other workloads from being scheduled on those nodes. Using different VM sizes for different node pools can also help optimize costs. For more information, see Use multiple node pools in Azure Kubernetes Service (AKS).
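As a sketch of the placement techniques above, assuming a dedicated node pool whose nodes carry a hypothetical `workload=batch` label and a matching `workload=batch:NoSchedule` taint, a pod targets that pool like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                       # hypothetical name
spec:
  nodeSelector:
    workload: batch                        # only schedule on the labeled pool
  tolerations:
  - key: workload                          # tolerate the pool's taint
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: worker
    image: myregistry.azurecr.io/worker:1.0   # hypothetical image
```

The taint keeps other workloads off the pool, while the nodeSelector keeps this workload on it; together they give the dedicated-pool behavior described above.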
- The larger the VM SKU, the more vCores per node, and the higher the chance of unused cores. Even when the Kubernetes scheduler and cluster autoscaler do a good job of consolidating pods onto a set of nodes, some fraction of vCores will always go unused.
- The higher the number of node pools, the higher the chance of unused vCores, as each node pool scales separately.
Here are some more general recommendations to reduce the TCO of an AKS cluster in addition to the previous guidelines and considerations:
- Review the Cost optimization section of the Azure Well-Architected Framework for AKS.
- Use Azure Advisor to monitor and release unused resources. Find and release any resource not used by your AKS cluster, such as public IPs, managed disks, etc. For more information, see Find and delete unattached Azure managed and unmanaged disks.
- Use Microsoft Cost Management budgets and reviews to keep track of expenditures.
- Use Azure Reservations to reduce the cost of the agent nodes. Azure Reservations help you save money by committing to one-year or three-year plans for multiple products. Committing allows you to get a discount on the resources you use. Reservations can significantly reduce your resource costs by up to 72% from pay-as-you-go prices. Reservations provide a billing discount and don't affect the runtime state of your resources. After you purchase a reservation, the discount automatically applies to matching resources. You can purchase reservations from the Azure portal, APIs, PowerShell, and Azure CLI.
- As an alternative to Azure Reservations, you can use Azure Savings Plans to save money by committing to a fixed hourly spend on the VMs used by your AKS clusters for one-year or three-year terms. A savings plan can significantly reduce your resource costs, by up to 65% from pay-as-you-go prices. Discount rates per meter vary by commitment term (one year or three years), not commitment amount. With Azure Reservations, you commit to a specific virtual machine type in a particular Azure region, for example, a D2v4 VM series in West Europe for one year. With an Azure savings plan, you commit to spending a fixed hourly amount collectively on all compute resources, for example, $5.00/hour on compute services for one year. Reservations apply only to the identified compute service and region combination, whereas savings plan benefits apply to all usage from participating compute services across the globe, up to the hourly commitment. Opt for a reservation for highly stable workloads that run continuously and where you expect no changes to the VM series or region, because Azure Reservations provide the greatest savings. Consider a compute savings plan for dynamic workloads where you need to run different-sized virtual machines or frequently change data center regions; savings plans provide more flexibility and automatic optimization than reservations.
- Add one or more spot node pools to your AKS cluster. A spot node pool is backed by a spot Virtual Machine Scale Set (VMSS). Using spot VMs for the nodes of your AKS cluster allows you to take advantage of unutilized capacity in Azure at significant cost savings. The amount of available unutilized capacity varies based on many factors, including node size, region, and time of day. When you deploy a spot node pool, Azure allocates the spot nodes if there's capacity available, but there's no SLA for spot nodes: the spot scale set that backs the spot node pool is deployed in a single fault domain and offers no high availability guarantees, and when Azure needs the capacity back, the Azure infrastructure evicts the spot nodes. When you create a spot node pool, you can define the maximum price you want to pay per hour and enable the cluster autoscaler, which is recommended for spot node pools. Based on the workloads running in your cluster, the cluster autoscaler scales the number of nodes in the node pool out and in; for spot node pools, it scales out after an eviction if additional nodes are still needed. For more information, see Add a spot node pool to an Azure Kubernetes Service (AKS) cluster.
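AKS taints spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so only pods that tolerate eviction land there. A sketch of an interruptible workload that tolerates the taint and prefers spot nodes (the pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: interruptible-job                        # hypothetical name
spec:
  tolerations:
  - key: kubernetes.azure.com/scalesetpriority   # taint AKS applies to spot nodes
    operator: Equal
    value: spot
    effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: kubernetes.azure.com/scalesetpriority
            operator: In
            values:
            - spot                               # prefer, but don't require, spot nodes
  containers:
  - name: worker
    image: myregistry.azurecr.io/worker:1.0      # hypothetical image
```

Using `preferred` rather than `required` node affinity lets the pod fall back to regular nodes when spot capacity is evicted.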
- System node pools must contain at least one node, while user node pools may contain zero or more nodes. Hence, you can configure a user node pool to automatically scale from 0 to N nodes. Using the Horizontal Pod Autoscaler based on CPU and memory, or Kubernetes Event-driven Autoscaling (KEDA) based on the metrics of an external system such as Apache Kafka, RabbitMQ, or Azure Service Bus, you can configure your workloads to scale out and scale in accordingly.
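The scale-to-zero pattern above can be sketched with a KEDA ScaledObject driven by an Azure Service Bus queue; the deployment, queue, and TriggerAuthentication names below are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: orders-processor       # hypothetical deployment to scale
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: orders          # hypothetical queue
      messageCount: "50"         # target messages per replica
    authenticationRef:
      name: servicebus-auth      # hypothetical TriggerAuthentication
```

When all replicas scale to zero and no other pods remain on the pool, the cluster autoscaler can remove its nodes entirely.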
- Your AKS workloads may not need to run continuously, for example, a development cluster with node pools running specific workloads. To optimize costs, you can completely turn off an AKS cluster or stop one or more node pools, saving on compute costs. For more information, see Stop and start an Azure Kubernetes Service (AKS) cluster and Start and stop a node pool on Azure Kubernetes Service (AKS).
- Deploy and manage containerized applications with Azure Kubernetes Service (AKS) running on Ampere Altra Arm-based processors. For more information, see Azure Virtual Machines with Ampere Altra Arm-based processors.
- Migrate application workloads written for the full .NET Framework, which require running in Windows containers, to .NET Standard. Migrated workloads run in Linux containers with a smaller container image footprint and hence provide better density, decreasing the number of agent nodes required to host and run the applications.
- For multitenant solutions, physical isolation is more costly and adds management overhead. Logical isolation requires more Kubernetes experience and increases the surface area for changes and security threats, but it shares the cost of the cluster across the tenants.
- You can use open-source tools like Kubecost to monitor and govern AKS cluster costs. Cost allocation can be scoped to a deployment, service, label, pod, and namespace, giving you flexibility in how you charge back or show back costs to cluster users. For more information, see Cost governance with Kubecost. In a multitenant scenario, you can track and associate Azure costs to individual tenants based on their actual usage. Third-party solutions, such as Kubecost, can help you calculate and break down costs across different teams and tenants. Alternatives to Kubecost include PerfectScale, which reduces Kubernetes costs while improving performance and resilience with data-driven intelligence built for continuous optimization, and OpenCost, whose models give teams visibility into current and historical Kubernetes spending and resource allocation.
- If you use an Azure Log Analytics workspace and Azure Monitor Container Insights to monitor the health and performance of your AKS workloads, you can use the ContainerLogV2 schema for container logs and configure this table to use the Basic Logs data plan, which provides a low-cost way to ingest and retain logs for troubleshooting, debugging, auditing, and compliance.
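Assuming the Container Insights agent configuration ConfigMap (`container-azm-ms-agentconfig` in `kube-system`), a sketch of enabling the ContainerLogV2 schema might look like this; verify the setting names against the current Container Insights documentation before applying:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig   # ConfigMap read by the Container Insights agent
  namespace: kube-system
data:
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.schema]
        # switch container log ingestion to the ContainerLogV2 table
        containerlog_schema_version = "v2"
```

Switching the ContainerLogV2 table to the Basic Logs plan is a separate step performed on the Log Analytics workspace, not in the cluster.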
- As documented in Use Azure tags in Azure Kubernetes Service (AKS), you can use Azure tags on an AKS cluster to associate its related resources to a given workload or tenant. For some resources, such as a managed data disk created via a persistent volume claim or an Azure Public IP created by a public Kubernetes service, you can also use Kubernetes manifests to set Azure tags. Azure tags are a helpful mechanism to track resource usage and chargeback their costs to separate tenants or business units within an organization.
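As a sketch of setting Azure tags from a Kubernetes manifest, the Azure Disk CSI driver accepts a `tags` parameter in a StorageClass, so managed disks created via persistent volume claims carry the tags; the class name and tag values below are hypothetical, and the exact parameter syntax should be checked against the driver's documentation:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-tagged               # hypothetical name
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
  tags: costcenter=finance,tenant=contoso   # hypothetical Azure tags applied to created disks
reclaimPolicy: Delete
allowVolumeExpansion: true
```

Disks provisioned through this class then show up in Azure cost analysis filtered by those tags, enabling the chargeback scenario described above.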
Next Steps
In addition to this guide, also check the following resources:
- Cost Optimization section under Microsoft Azure Well-Architected Framework for Azure Kubernetes Service (AKS)
- Cost Management section under Baseline architecture for an AKS cluster
- AKS Container Insights logging level and associated costs
- Azure Kubernetes Service (AKS) – Cost Optimization Techniques
- The Azure FinOps Guide
Conclusions
The cost optimization pillar provides principles and recommendations for balancing business goals with budget justification to create a cost-effective workload while avoiding capital-intensive solutions. Cost optimization is about ways to reduce unnecessary expenses and improve operational efficiencies, and it's a critical part of any AKS project.
Don't hesitate to write a comment below if you want to suggest additional recommendations to reduce the total cost of ownership of an AKS cluster. I'll include your observations in this article. Thanks!
Updated Apr 18, 2023, Version 15.0
paolosalvatori, Microsoft
FastTrack for Azure