
FastTrack for Azure
8 MIN READ

How to reduce the total cost of ownership (TCO) of your Azure Kubernetes Service (AKS) cluster

paolosalvatori
Jan 02, 2023

This article contains a few recommendations for reducing the total cost of ownership (TCO) of your Azure Kubernetes Service (AKS) cluster.

 

Recommendations

To minimize the number of unused cores, you can use the following general guidelines to improve the density of your workloads and reduce the number of VMs to the bare minimum.

 

Here are some general recommendations to reduce the TCO of an AKS cluster:

 

  • Review the Cost optimization section of the Azure Well-Architected Framework for AKS.
  • Use Azure Advisor to monitor and release unused resources. Find and release any resources your AKS cluster no longer uses, such as public IP addresses and managed disks. For more information, see Find and delete unattached Azure managed and unmanaged disks.
  • Use Microsoft Cost Management budgets and reviews to keep track of expenditures.
  • Use Azure Reservations to reduce the cost of the agent nodes. Azure Reservations help you save money by committing to one-year or three-year plans for multiple products, in exchange for a discount on the resources you use. Reservations can reduce your resource costs by up to 72% compared to pay-as-you-go prices. They provide a billing discount only and don't affect the runtime state of your resources: after you purchase a reservation, the discount automatically applies to matching resources. You can purchase reservations from the Azure portal, APIs, PowerShell, and the Azure CLI.
  • As an alternative to Azure Reservations, you can use Azure Savings Plans to save money by committing to a fixed hourly spend on the VMs used by your AKS clusters for a one-year or three-year term. A savings plan can reduce your resource costs by up to 65% compared to pay-as-you-go prices. Discount rates per meter vary by commitment term (one or three years), not by commitment amount. With Azure Reservations, you commit to a specific virtual machine type in a particular Azure region, for example, a D2v4 VM series in West Europe for one year. With an Azure savings plan, you commit to spending a fixed hourly amount collectively on all compute resources, for example, $5.00/hour on compute services for one year. Reservations only apply to the identified compute service and region combination, whereas savings plan benefits apply to all usage from participating compute services across the globe, up to the hourly commitment. Opt for a reservation for highly stable workloads that run continuously and where you expect no changes to the VM series or region, because Azure Reservations provide the greatest savings. Consider a compute savings plan for dynamic workloads where you need to run different-sized virtual machines or frequently change regions: savings plans provide more flexibility and automatic optimization than reservations.

  • Add one or more spot node pools to your AKS cluster. A spot node pool is backed by a spot Virtual Machine Scale Set (VMSS). Using spot VMs for nodes in your AKS cluster allows you to take advantage of unutilized capacity in Azure at significant cost savings. The amount of available unutilized capacity varies based on many factors, including node size, region, and time of day. When you deploy a spot node pool, Azure allocates the spot nodes if there's capacity available, but there's no SLA for spot nodes: the spot scale set that backs the spot node pool is deployed in a single fault domain and offers no high-availability guarantees, and when Azure needs the capacity back, the Azure infrastructure evicts spot nodes. When you create a spot node pool, you can define the maximum price you want to pay per hour and enable the cluster autoscaler, which is recommended for use with spot node pools. Based on the workloads running in your cluster, the cluster autoscaler scales the number of nodes in the node pool out and in; for spot node pools, it scales out the number of nodes after an eviction if additional nodes are still needed. For more information, see Add a spot node pool to an Azure Kubernetes Service (AKS) cluster.
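As a sketch, the following Azure CLI command adds a spot node pool with the cluster autoscaler enabled; the resource group, cluster, and pool names are hypothetical, and the command assumes an existing AKS cluster and an authenticated `az` session:

```shell
# Add a spot node pool to an existing AKS cluster (names are placeholders).
# --spot-max-price -1 means you pay at most the current price for standard VMs,
# so nodes are never evicted for price reasons (only when Azure reclaims capacity).
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```

Note that spot nodes carry the `kubernetes.azure.com/scalesetpriority=spot:NoSchedule` taint, so only pods with a matching toleration are scheduled on them.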
  • System node pools must contain at least one node, while user node pools may contain zero or more nodes. Hence, you can configure a user node pool to automatically scale from 0 to N nodes. In addition, using a horizontal pod autoscaler based on CPU and memory, or on the metrics of an external system such as Apache Kafka, RabbitMQ, or Azure Service Bus, you can configure your workloads to scale out and in with Kubernetes Event-driven Autoscaling (KEDA).
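A minimal sketch of both ideas, assuming hypothetical resource names, KEDA already installed in the cluster, and a Service Bus connection string exposed to the workload as the `SERVICEBUS_CONNECTION` environment variable:

```shell
# Let a user node pool scale down to zero nodes when idle (names are placeholders).
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --update-cluster-autoscaler \
  --min-count 0 \
  --max-count 10

# Scale a deployment on Azure Service Bus queue depth with a KEDA ScaledObject.
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-scaler
spec:
  scaleTargetRef:
    name: orders-processor        # deployment to scale (hypothetical)
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "10"        # target messages per replica
        connectionFromEnv: SERVICEBUS_CONNECTION
EOF
```

When the queue is empty, KEDA scales the deployment to zero replicas; with no pods left, the cluster autoscaler can then drain the user node pool down to zero nodes.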
  • Your AKS workloads may not need to run continuously, for example, a development cluster with node pools running specific workloads. To optimize your costs, you can completely turn off an AKS cluster or stop one or more node pools in your AKS cluster, allowing you to save on compute costs. For more information, see the AKS documentation on stopping and starting a cluster and on starting and stopping node pools.
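For example, a development cluster could be stopped outside business hours with the following Azure CLI commands (the resource group, cluster, and node pool names are placeholders):

```shell
# Stop and restart the whole cluster (control plane and agent nodes).
az aks stop --resource-group myResourceGroup --name myAKSCluster
az aks start --resource-group myResourceGroup --name myAKSCluster

# Or stop a single user node pool while keeping the rest of the cluster running.
az aks nodepool stop --resource-group myResourceGroup --cluster-name myAKSCluster --nodepool-name userpool
az aks nodepool start --resource-group myResourceGroup --cluster-name myAKSCluster --nodepool-name userpool
```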
  • Deploy and manage containerized applications with Azure Kubernetes Service (AKS) running on Ampere Altra Arm-based processors. For more information, see Azure Virtual Machines with Ampere Altra Arm-based processors.
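As a sketch, an Arm64 node pool can be added with a command like the following; the VM size shown is one of the Ampere Altra-based Dpsv5/Dpdsv5 sizes and may not be available in every region, and all names are placeholders:

```shell
# Add an Arm64 node pool backed by Ampere Altra processors
# (check regional availability of the chosen VM size first).
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name armpool \
  --node-count 3 \
  --node-vm-size Standard_D4pds_v5
```

Container images must be built for `linux/arm64` (for example with multi-arch manifests) to run on this pool.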
  • Migrate application workloads written for the full .NET Framework, which require Windows containers, to .NET Standard or modern .NET. Migrated workloads run in Linux containers with a smaller container-image footprint and hence provide better density, which decreases the number of agent nodes required to host and run applications.
  • For multitenant solutions, physical isolation is more costly and adds management overhead. Logical isolation requires more Kubernetes experience and increases the surface area for changes and security threats, but it shares the cost of the cluster across tenants.
  • You can use open-source tools like Kubecost to monitor and govern the cost of an AKS cluster. Cost allocation can be scoped to a deployment, service, label, pod, or namespace, which gives you flexibility in how you charge back or show back costs to cluster users. For more information, see Cost governance with Kubecost. In a multitenant scenario, you can track and associate Azure costs with individual tenants based on their actual usage; third-party solutions such as Kubecost can help you calculate and break down costs across different teams and tenants. Alternative solutions to Kubecost include PerfectScale, which reduces Kubernetes costs while improving performance and resilience with data-driven intelligence built for continuous optimization, and OpenCost, whose models give teams visibility into current and historical Kubernetes spending and resource allocation.
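As a sketch, Kubecost can be installed into a cluster with Helm; the release name and namespace below are arbitrary choices:

```shell
# Install (or upgrade) Kubecost's cost-analyzer chart into its own namespace.
helm upgrade --install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost \
  --create-namespace
```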
  • If you use an Azure Log Analytics workspace and Azure Monitor Container Insights to monitor the health and performance of your AKS workloads, you can use the ContainerLogV2 schema for container logs and configure this table to use the Basic Log data plan, which provides a low-cost way to ingest and retain logs for troubleshooting, debugging, auditing, and compliance.
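Assuming the ContainerLogV2 schema is already enabled for the cluster, the table's pricing plan can be switched with a command like the following (the resource group and workspace names are placeholders):

```shell
# Switch the ContainerLogV2 table to the Basic log data plan.
az monitor log-analytics workspace table update \
  --resource-group myResourceGroup \
  --workspace-name myWorkspace \
  --name ContainerLogV2 \
  --plan Basic
```

Keep in mind that tables on the Basic plan have a reduced retention period and limited query capabilities compared to the Analytics plan.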

  • As documented in Use Azure tags in Azure Kubernetes Service (AKS), you can use Azure tags on an AKS cluster to associate its related resources to a given workload or tenant. For some resources, such as a managed data disk created via a persistent volume claim or an Azure Public IP created by a public Kubernetes service, you can also use Kubernetes manifests to set Azure tags. Azure tags are a helpful mechanism to track resource usage and chargeback their costs to separate tenants or business units within an organization.
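The two cases mentioned above can be sketched with Kubernetes manifests like the following; the names and tag values are hypothetical, and the manifests assume the AKS-managed cloud provider and the Azure Disk CSI driver:

```shell
# Tag the Azure public IP created for a LoadBalancer service, and the managed
# disks created through a custom storage class (tag values are examples).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  annotations:
    service.beta.kubernetes.io/azure-pip-tags: costcenter=marketing,env=dev
spec:
  type: LoadBalancer
  ports:
    - port: 80
  selector:
    app: sample-app
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-tagged
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
  tags: costcenter=marketing,env=dev
EOF
```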

Next Steps
In addition to this guide, review the AKS documentation and the Cost optimization section of the Azure Well-Architected Framework.

Conclusions

The cost optimization pillar provides principles and recommendations for balancing business goals with budget justification to create a cost-effective workload while avoiding capital-intensive solutions. Cost optimization is about ways to reduce unnecessary expenses and improve operational efficiencies, and it's a critical part of any AKS project.

Don't hesitate to write a comment below if you want to suggest additional recommendations to reduce the total cost of ownership of an AKS cluster. I'll include your observations in this article. Thanks!

Updated Apr 18, 2023
Version 15.0
  • Thanks lopyeg for the advice, links, and sample. In my opinion, Vertical Pod Autoscaler (VPA) is a great way to discover, and eventually automatically set, resource requests and limits per container based on past usage. As an alternative to checking for the presence of requests and limits in the definition of deployments or pods, you could create a policy, using OPA or Kyverno, to check whether a deployment or pod has an associated VPA and, if not, deny its creation.
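As a sketch, a VPA can run in recommendation-only mode so it surfaces suggested requests without evicting pods; the deployment and VPA names below are hypothetical, and the manifest assumes the VPA components are installed in the cluster:

```shell
# Create a VerticalPodAutoscaler that only produces recommendations.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: orders-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-processor
  updatePolicy:
    updateMode: "Off"   # surface recommendations only; don't evict pods
EOF

# Inspect the recommended requests once the VPA has gathered usage data.
kubectl describe vpa orders-vpa
```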

lopyeg (Copper Contributor)

    Thank you paolosalvatori. You covered a huge scope of activities 

    I have implemented almost all of them. From my experience, not surprisingly, setting resource limits was the hardest, because it required collaboration with developers.
    Still, in many cases developers are not interested in "ops" tasks.
    The AKS Azure Policy add-on is a great helper, but it works at the deployment stage, which is sometimes too late.
    What I did, based on the same Open Policy Agent syntax used in Gatekeeper, was set up CI checks (with the conftest tool: https://github.com/open-policy-agent/conftest) to fail an Azure DevOps build if there is no resources block in the k8s manifests.

    like:

    ```rego
    deny[msg] {
      input.kind == "Deployment"
      container := input.spec.template.spec.containers[_]
      not container.resources.limits.memory
      msg = "Containers must provide limits for memory. More details https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"
    }

    deny[msg] {
      input.kind == "Deployment"
      container := input.spec.template.spec.containers[_]
      not container.resources.limits.cpu
      msg = "Containers must provide limits for cpu. More details https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"
    }
    ```