According to the Flexera State of the Cloud Report 2022, the top cloud initiative for 2022 across all organizations is once again “Optimizing existing use of cloud (cost savings)”, for the sixth year in a row.
As more and more organizations continue to expand their cloud usage, it is imperative to have the right cloud computing strategy in place to accelerate time to market while ensuring cost effectiveness to maximize ROI.
To help customers maximize cost savings, we have published actionable architecture guidance and best practices in the Microsoft Azure Well-Architected Framework, which includes a guide on Cost Optimization for AKS. Another article, Baseline architecture for an Azure Kubernetes Service (AKS) cluster, provides a recommended baseline for most AKS clusters, including some best practices for handling cost.
In the sections below, we highlight some of those recommendations.
Sizing your cluster - Use cluster pre-set configuration
Choosing the right VM SKU, number of nodes, number of availability zones, and so on takes a lot of thought. Cluster pre-set configuration provides a set of recommendations for different environments while highlighting the impact on cost. For example, a mission-critical production environment would require a higher-spec VM SKU with redundancy across Azure Availability Zones (AZs) and Azure Monitor turned on, while a Dev/Test cluster can run light with unnecessary features turned off. Below are some screenshots from the AKS cluster creation screen, showing how you can make use of cluster pre-set configuration.
Sizing your workloads - Set resource requests and limits
Use pod requests and limits to manage the compute resources within an AKS cluster. Requests tell the Kubernetes scheduler how much CPU and memory to reserve for a pod when placing it on a node, while limits cap what its containers may consume at runtime. AKS also integrates with Azure Policy to provide centralized, at-scale enforcement of built-in policies, for example applying default CPU requests and memory resource limits so that CPU and memory resource limits are defined on cluster containers.
Because it is often difficult to choose the right values for requests and limits, use the Vertical Pod Autoscaler (currently in preview) to set resource requests and limits on containers automatically per workload, based on past usage. This ensures that pods are scheduled onto nodes that have the required CPU and memory resources.
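As a rough illustration of where requests and limits live in a pod specification, the sketch below builds a minimal Pod manifest as a Python dictionary and prints it as JSON. The pod name, image, and resource values are hypothetical placeholders chosen for the example, not recommendations; the right numbers depend on your workload.

```python
import json


def build_pod_manifest(name, image, cpu_request, memory_request,
                       cpu_limit, memory_limit):
    """Return a minimal Pod manifest (as a dict) with requests and limits set."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # What the scheduler reserves when placing the pod.
                        "requests": {"cpu": cpu_request, "memory": memory_request},
                        # The ceiling the container may consume at runtime.
                        "limits": {"cpu": cpu_limit, "memory": memory_limit},
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    manifest = build_pod_manifest(
        name="demo-app",
        image="nginx:1.25",
        cpu_request="250m",
        memory_request="128Mi",
        cpu_limit="500m",
        memory_limit="256Mi",
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML.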
AKS Cluster Start/Stop
For AKS clusters that do not need to run all of the time, for example a Dev/Test environment that is not in use or a production cluster that performs temporary work (e.g. ML training), you can completely turn off the cluster by using cluster start/stop. Doing this shuts down all the node pools (both system and user), so you save on compute cost while still preserving your objects and cluster state for when you start the cluster again. See the screenshot for stopping your cluster below.
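To get a feel for what stopping an idle cluster can save, here is a small back-of-the-envelope sketch. It models node compute cost only, and the node count, hourly rate, and usage hours are hypothetical inputs rather than real Azure prices.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_stop_savings(node_count, hourly_rate_per_node, running_hours_per_week):
    """Compute cost avoided per week by stopping the cluster when it is idle."""
    if not 0 <= running_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("running_hours_per_week must be between 0 and 168")
    stopped_hours = HOURS_PER_WEEK - running_hours_per_week
    return node_count * hourly_rate_per_node * stopped_hours


if __name__ == "__main__":
    # Hypothetical Dev/Test cluster: 3 nodes at 0.10 per node-hour,
    # needed 10 hours a day on 5 weekdays (50 hours a week).
    savings = weekly_stop_savings(3, 0.10, 50)
    print(f"Avoided compute cost per week: {savings:.2f}")
```

With these assumed inputs the cluster is stopped for 118 of the 168 hours in a week, so most of the weekly node cost is avoided.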
Running Dev/Test workloads or POC with AKS
AKS comes with a free tier that does not charge for the control plane, which means you get a fully managed Kubernetes control plane at no cost. This is great if you want to do a quick POC on AKS without having to pay for the control plane.
For production environments, you should always upgrade to AKS with Uptime SLA, which guarantees 99.95% availability for clusters that use Availability Zones (99.9% for clusters that do not).
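To put the 99.95% figure in perspective, the short sketch below converts an availability percentage into the downtime it still permits. The 30-day month is an assumption made for the arithmetic, not part of any SLA definition.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    minutes = allowed_downtime_minutes(99.95)
    print(f"99.95% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Over an assumed 30-day month, 99.95% availability leaves room for roughly 21.6 minutes of downtime.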
Explore using Azure Spot Virtual Machines
Azure Spot Virtual Machines give you the ability to consume unused capacity in Azure at deep discounts (up to 90% compared to pay-as-you-go prices). They are suitable only for workloads that can handle interruptions, such as batch processing jobs and Dev/Test environments. To use Spot VMs in AKS, you need to add a secondary Spot node pool to the cluster. Give it a try using these steps!
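As a simple illustration of how a secondary Spot node pool changes the bill, the sketch below compares a cluster running entirely on pay-as-you-go nodes with one that moves some nodes to Spot. The hourly rate and node counts are hypothetical, and the 90% discount is the upper bound quoted above; actual Spot discounts vary.

```python
def blended_hourly_cost(regular_nodes, spot_nodes, payg_hourly_rate, spot_discount):
    """Hourly compute cost of a cluster mixing pay-as-you-go and Spot nodes."""
    if not 0 <= spot_discount <= 1:
        raise ValueError("spot_discount must be a fraction between 0 and 1")
    regular_cost = regular_nodes * payg_hourly_rate
    spot_cost = spot_nodes * payg_hourly_rate * (1 - spot_discount)
    return regular_cost + spot_cost


if __name__ == "__main__":
    rate = 0.10  # hypothetical pay-as-you-go price per node-hour
    all_regular = blended_hourly_cost(6, 0, rate, 0.0)
    mixed = blended_hourly_cost(2, 4, rate, 0.90)  # best-case discount
    print(f"All pay-as-you-go: {all_regular:.2f} per hour")
    print(f"2 regular + 4 Spot: {mixed:.2f} per hour")
```

Because Spot nodes can be evicted, a calculation like this only makes sense for the interruptible part of the workload.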
Leverage Azure Reserved Virtual Machines
For workloads running on AKS that have a very consistent usage pattern and long-term requirements, you can purchase Azure Reserved Virtual Machine Instances, which allow you to save up to 72% compared to pay-as-you-go prices with 1-year or 3-year commitments. Below is a snapshot of the Azure pricing calculator, showing the compute VM price difference between pay-as-you-go, 1-year reserved, and 3-year reserved for the D2 v3 VM SKU.
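Since a reservation is paid for whether or not the VM is running, it helps to check how consistently a node is actually used before committing. The sketch below compares the two pricing models for a given utilization. The hourly price is hypothetical and the 72% discount is the upper bound mentioned above, so treat the output as an illustration rather than a quote.

```python
HOURS_PER_YEAR = 365 * 24


def yearly_costs(payg_hourly_rate, reserved_discount, utilization):
    """Return (pay-as-you-go cost, reserved cost) for one VM over a year.

    utilization is the fraction of the year the VM actually runs (0 to 1).
    Pay-as-you-go is billed only for running hours; the reservation is
    billed for every hour of the term.
    """
    payg_cost = payg_hourly_rate * HOURS_PER_YEAR * utilization
    reserved_cost = payg_hourly_rate * (1 - reserved_discount) * HOURS_PER_YEAR
    return payg_cost, reserved_cost


def break_even_utilization(reserved_discount):
    """Utilization above which the reservation is the cheaper option."""
    return 1 - reserved_discount


if __name__ == "__main__":
    payg, reserved = yearly_costs(0.10, 0.72, utilization=1.0)
    print(f"Pay-as-you-go: {payg:.2f}, reserved: {reserved:.2f}")
    print(f"Break-even utilization: {break_even_utilization(0.72):.0%}")
```

Under these assumptions, the deeper the discount, the lower the utilization at which a reservation starts to pay off, which is why reservations suit the steady part of a cluster and Spot or autoscaling suits the rest.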
Optimize node resource usage in AKS
One of the biggest benefits of cloud computing is elasticity: the ability to expand or shrink compute resources (nodes) according to demand. There are several scaling options in AKS, summarized in this article. Using these techniques correctly ensures that there is always enough capacity to meet demand without over-provisioning, which leaves idle resources that go to waste. In addition, AKS supports the KEDA add-on, which provides event-driven autoscaling based on a rich catalogue of 50+ KEDA scalers. When it comes to logging and monitoring AKS clusters using Azure Monitor Container Insights, make sure that you follow the recommendations mentioned in this article.
Finally, you should also check out the best practices for storage and backups in AKS to choose the storage type and backup approach for your cluster (nodes) that best fit your application needs while remaining cost efficient.
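To make the scaling behaviour more concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler, one of the scaling options available in AKS: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The replica bounds and metric values in the example are hypothetical, and the real autoscaler applies further rules (such as stabilization windows) that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified Horizontal Pod Autoscaler replica calculation."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical workload: 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
    # The same workload when average CPU drops to 30%.
    print(desired_replicas(4, current_metric=30, target_metric=60))
```

Scaling pods down in this way is what frees up nodes so that node-level scaling can then remove them and stop the associated compute charges.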
Improve operational efficiencies in AKS
Managing multiple clusters can add a lot of day-to-day operational overhead for engineers. Consider features such as AKS auto-upgrade and AKS node auto-repair to help smooth day-2 operations. For more information, refer to the set of best practices for AKS operators.
What’s next?
Review the AKS Cost Optimization design checklist and list of recommendations on the Microsoft Azure Well-Architected Framework page
Follow the self-paced learning module on Microsoft Learn, Optimize costs on Azure Kubernetes Service
Check out the Advanced Azure Kubernetes Service (AKS) microservice architecture use case on the Azure Architecture Center
If you are running AKS on-premises or at the edge, Azure Hybrid Benefit can also be used to further reduce the cost of running containerized applications in those locations. Please refer to this link for more info.