While it's possible to run Kubernetes nodes in separate on-demand or spot node pools, we can optimize application cost without compromising reliability by placing the pods unevenly across spot and OnDemand VMs using Kubernetes topology spread constraints (https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/). With a baseline number of pods deployed in the OnDemand node pool for reliability, we can scale out on the spot node pool based on load at a lower cost.
Kubernetes Topology Spread
In this post, we will go through a step-by-step approach to deploying an application spread unevenly across spot and OnDemand VMs.
Prerequisites
- Azure Subscription with permissions to create the required resources
- Azure CLI
- kubectl CLI
1. Create a Resource Group and an AKS Cluster
Create a resource group in your preferred Azure location using the Azure CLI as shown below.
az group create --name CostOptimizedK8sRG --location westeurope --tags 'Reason=Blog'
Let's create an AKS cluster using one of the following commands.
az aks create -g CostOptimizedK8sRG -n CostOptimizedCluster --auto-upgrade-channel node-image --enable-managed-identity --enable-msi-auth-for-monitoring --enable-cluster-autoscaler --min-count 1 --max-count 5 --kubernetes-version 1.26.0 --ssh-key-value ~/.ssh/id_rsa.pub --tags 'Reason=Blog' --uptime-sla -z 1 2 3
OR
az aks create -g CostOptimizedK8sRG -n CostOptimizedCluster --auto-upgrade-channel node-image --enable-managed-identity --enable-msi-auth-for-monitoring --enable-cluster-autoscaler --min-count 1 --max-count 5 --generate-ssh-keys --kubernetes-version 1.26.0 --tags 'Reason=Blog' --uptime-sla -z 1 2 3
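Whichever variant you choose, it can help to confirm the cluster finished provisioning before adding node pools; a minimal check, assuming the resource group and cluster names used above:
az aks show -g CostOptimizedK8sRG -n CostOptimizedCluster --query provisioningState -o tsv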
2. Create two node pools using spot and OnDemand VMs
az aks nodepool add -g CostOptimizedK8sRG --cluster-name CostOptimizedCluster -n appspotpool -e -k 1.26.0 --labels 'deploy=spot' --min-count 3 --max-count 5 --max-pods 10 --mode User --os-sku Ubuntu --os-type Linux --priority Spot --spot-max-price -1 --tags 'Reason=Blog' -z 1 2 3
az aks nodepool add -g CostOptimizedK8sRG --cluster-name CostOptimizedCluster -n appondempool -e -k 1.26.0 --labels 'deploy=ondemand' --min-count 3 --max-count 5 --max-pods 10 --mode User --os-sku Ubuntu --os-type Linux --priority Regular --tags 'Reason=Blog' -z 1 2 3
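Optionally, list the node pools to confirm the priorities and labels before moving on. A quick check; the scaleSetPriority and nodeLabels field names are what recent CLI versions return, so adjust the query if yours differs:
az aks nodepool list -g CostOptimizedK8sRG --cluster-name CostOptimizedCluster --query '[].{name:name, priority:scaleSetPriority, labels:nodeLabels}'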
Fetch the cluster credentials and verify that the nodes in both node pools are spread across the availability zones.
az aks get-credentials -g CostOptimizedK8sRG -n CostOptimizedCluster
kubectl get nodes -o custom-columns='Name:.metadata.name,Zone:.metadata.labels.topology\.kubernetes\.io/zone'
Name Zone
aks-appondempool-79574777-vmss000000 westeurope-1
aks-appondempool-79574777-vmss000001 westeurope-2
aks-appondempool-79574777-vmss000002 westeurope-3
aks-appspotpool-41295273-vmss000000 westeurope-1
aks-appspotpool-41295273-vmss000001 westeurope-2
aks-appspotpool-41295273-vmss000002 westeurope-3
aks-nodepool1-17327460-vmss000000 westeurope-1
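Since the next steps rely on them, it's also worth confirming that the spot nodes carry the deploy label and the kubernetes.azure.com/scalesetpriority taint that AKS applies automatically to spot node pools:
kubectl get nodes -l deploy=spot -o custom-columns='NAME:.metadata.name,DEPLOY:.metadata.labels.deploy,TAINTS:.spec.taints[*].key'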
3. Deploy a sample application
Deploy the Azure voting sample application manifest (azure-vote.yaml) to the cluster.
kubectl apply -f azure-vote.yaml
deployment.apps/azure-vote-back created
service/azure-vote-back created
deployment.apps/azure-vote-front created
service/azure-vote-front created
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
azure-vote-back-65c595548d-249xw 1/1 Running 0 90s 10.244.9.4 aks-appondempool-79574777-vmss000002 <none> <none>
azure-vote-front-d99b7676c-2nvg2 1/1 Running 0 90s 10.244.11.4 aks-appondempool-79574777-vmss000000 <none> <none>
4. Update the application deployment using topology spread constraints
Notice that both pods landed on the OnDemand node pool: AKS taints spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so we first add a matching toleration to the azure-vote-front deployment.
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
Next, we add node affinity. Using requiredDuringSchedulingIgnoredDuringExecution we ensure that the pods are placed only on nodes that have the deploy label with a value of either spot or ondemand, while with preferredDuringSchedulingIgnoredDuringExecution we add weights so that spot nodes are preferred over OnDemand nodes for pod placement.
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: deploy
operator: In
values:
- spot
- ondemand
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 99
preference:
matchExpressions:
- key: deploy
operator: In
values:
- spot
- weight: 1
preference:
matchExpressions:
- key: deploy
operator: In
values:
- ondemand
Finally, we add topologySpreadConstraints with two label selectors. The first uses the deploy label as the topology key, a maxSkew of 3, and DoNotSchedule for whenUnsatisfiable, which means the pod counts in the two topology domains (spot and ondemand in our case) can differ by at most 3; with 9 replicas, that guarantees at least 3 pods in each domain. Because nodes with the spot value for the deploy label have the higher weight in the node affinity preference, the scheduler will most likely place more pods on the spot node pool than on the OnDemand node pool. The second selector uses topology.kubernetes.io/zone as the topology key to evenly distribute the pods across availability zones; since we use ScheduleAnyway for whenUnsatisfiable, the scheduler won't enforce this distribution but will attempt it where possible.
topologySpreadConstraints:
- labelSelector:
matchLabels:
app: azure-vote-front
maxSkew: 3
topologyKey: deploy
whenUnsatisfiable: DoNotSchedule
- labelSelector:
matchLabels:
app: azure-vote-front
maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
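For reference, here is a minimal sketch of how these pieces fit together in the azure-vote-front Deployment spec. The scheduling fields are the ones shown above and the replica count of 9 matches the prose; the container image, port and REDIS environment variable are assumptions based on the public Azure voting sample and may differ from your azure-vote.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: azure-vote-front
spec:
  replicas: 9
  selector:
    matchLabels:
      app: azure-vote-front
  template:
    metadata:
      labels:
        app: azure-vote-front
    spec:
      # Allow scheduling onto spot nodes, which AKS taints automatically
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      # Restrict pods to the two application node pools, preferring spot
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: deploy
                    operator: In
                    values:
                      - spot
                      - ondemand
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 99
              preference:
                matchExpressions:
                  - key: deploy
                    operator: In
                    values:
                      - spot
            - weight: 1
              preference:
                matchExpressions:
                  - key: deploy
                    operator: In
                    values:
                      - ondemand
      # Spread unevenly across pools (maxSkew 3) and evenly across zones
      topologySpreadConstraints:
        - maxSkew: 3
          topologyKey: deploy
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: azure-vote-front
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: azure-vote-front
      containers:
        - name: azure-vote-front
          image: mcr.microsoft.com/azuredocs/azure-vote-front:v1  # assumed from the public sample
          ports:
            - containerPort: 80
          env:
            - name: REDIS  # assumed from the public sample
              value: azure-vote-back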
Apply the updated manifest and check where the front-end pods land.
kubectl apply -f azure-vote.yaml
deployment.apps/azure-vote-back unchanged
service/azure-vote-back unchanged
deployment.apps/azure-vote-front configured
service/azure-vote-front unchanged
kubectl get pods -o wide -l=app=azure-vote-front
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
azure-vote-front-97b44f89b-627md 1/1 Running 0 4m37s 10.244.9.8 aks-appondempool-79574777-vmss000002 <none> <none>
azure-vote-front-97b44f89b-66878 1/1 Running 0 100s 10.244.6.6 aks-appspotpool-41295273-vmss000001 <none> <none>
azure-vote-front-97b44f89b-68tn6 1/1 Running 0 100s 10.244.8.6 aks-appspotpool-41295273-vmss000000 <none> <none>
azure-vote-front-97b44f89b-79gz6 1/1 Running 0 100s 10.244.10.7 aks-appondempool-79574777-vmss000001 <none> <none>
azure-vote-front-97b44f89b-7kjzz 1/1 Running 0 100s 10.244.9.9 aks-appondempool-79574777-vmss000002 <none> <none>
azure-vote-front-97b44f89b-gvlww 1/1 Running 0 100s 10.244.8.4 aks-appspotpool-41295273-vmss000000 <none> <none>
azure-vote-front-97b44f89b-jwwgk 1/1 Running 0 100s 10.244.8.5 aks-appspotpool-41295273-vmss000000 <none> <none>
azure-vote-front-97b44f89b-mf84z 1/1 Running 0 100s 10.244.7.4 aks-appspotpool-41295273-vmss000002 <none> <none>
azure-vote-front-97b44f89b-p8sxw 1/1 Running 0 100s 10.244.6.5 aks-appspotpool-41295273-vmss000001 <none> <none>
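The front-end pods now land mostly on the spot pool, with a baseline on the OnDemand pool. A quick way to summarize the spread per node (and, by node name, per pool); the sort/uniq pipeline assumes a POSIX shell:
kubectl get pods -l app=azure-vote-front -o custom-columns='NODE:.spec.nodeName' --no-headers | sort | uniq -c
In the run above, this shows 6 pods on the spot node pool and 3 on the OnDemand node pool.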
Conclusion
The maxSkew setting in topology spread constraints is, as the name suggests, only the maximum skew allowed, so it does not guarantee that the largest possible number of pods ends up in a single topology domain. However, this approach is a good starting point for achieving cost-optimized placement of pods in a cluster with multiple node pools.