In Azure Kubernetes Service (AKS), the concept of pod spread is important to ensure that pods are distributed efficiently across nodes in a cluster. This helps to optimize resource utilization, increase application performance, and maintain high availability.
This article outlines a decision-making process for estimating the number of pods running on an AKS cluster. We will look at pod distribution across designated node pools, distribution based on pod-to-pod dependencies, and distribution where pod or node affinities are not specified. We then explore the impact of pod spread on scaling using replicas and the role of the Horizontal Pod Autoscaler (HPA). We close with a test run of all the above scenarios.
We will assume an AKS cluster exists with system and user node pools. The example used in this article considers two departments, Customer and Counsellor, represented by separate node pools and illustrated in the figure below.
Each node pool houses two distinct applications, Webserver and Redis. Each application has its own deployment and service within the cluster. For example, Customer-Webserver and Customer-Redis represent the Customer Web Server and Redis applications.
Furthermore, every deployment in the cluster has its own HPA definition. The HPA is designed to automatically adjust the number of pods based on CPU utilization. This feature enables the system to dynamically scale up or down the resources required by the application.
Apply this step if there is a need to assign pods to particular nodes based on resource requirements or dependencies between pods and nodes.
For example, pods for the Customer department should be assigned to the Customer node pool, while pods for the Counsellor department should be assigned to the Counsellor node pool.
To label the node pools, run the following commands:
az aks nodepool update -g aks01 --cluster-name aks01 --name customer --labels dept=customer --no-wait
az aks nodepool update -g aks01 --cluster-name aks01 --name counsellor --labels dept=counsellor --no-wait
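To confirm the labels have been applied to the nodes in each pool, you can list the nodes with a dept label column (the -L flag adds a column for the given label key):
kubectl get nodes -L dept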
To ensure that pods are scheduled on the appropriate nodes, add a nodeAffinity rule to each deployment that matches the dept label on the target node pool.
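The excerpt below is taken from the customer-redis.yml spec listed at the end of this article; the Counsellor deployments would use the value counsellor instead:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: dept
          operator: In
          values:
          - customer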
Apply this step if there are inter-pod dependencies. This is necessary when one pod must launch before another, or when pods must be co-located on the same node for low-latency communication.
For example, suppose the Redis pod starts first and has no dependencies, while the Webserver pod has an affinity with the Redis pod. In this case, the Redis pod must be running before the Webserver pod can be scheduled.
In the Webserver pod spec excerpt below, the requiredDuringSchedulingIgnoredDuringExecution field specifies that the pod should only be scheduled on nodes running at least one pod with the label app=customer-redis. In other words, the Webserver pod is scheduled only if a Redis pod already exists.
The topologyKey field is set to the kubernetes.io/hostname topology key, which scopes the rule to individual nodes: the Webserver pod is scheduled onto a node that hosts at least one pod with the app=customer-redis label.
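Here is the relevant excerpt from the customer-webserver.yml spec (listed in full at the end of this article):
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - customer-redis
      topologyKey: "kubernetes.io/hostname"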
Pod dependencies are defined using Pod Affinity rules, which in this case ensure that an instance of Redis is running before the Webserver pod gets scheduled. More details about Pod Affinity and scheduling can be found in the Kubernetes documentation.
Apply this step if there are no dependencies between pods or nodes. This involves setting up a pod topology spread when neither podAffinity nor nodeAffinity is specified. This helps distribute pods evenly across all available nodes, ensuring optimal resource utilization and better application performance.
In the example below, the topologySpreadConstraints field defines constraints that the scheduler uses to spread pods across the available nodes. In this case, the constraint is defined with a maxSkew of 1, indicating that the pod count difference between any two nodes must not exceed 1.
The topologyKey field specifies the key used, kubernetes.io/hostname, which groups the nodes by hostname. The whenUnsatisfiable field specifies what the scheduler should do if the constraint cannot be satisfied; here, ScheduleAnyway tells the scheduler to schedule the pod even when the constraint cannot be met.
Finally, the labelSelector specifies the labels, either app: customer-redis or app: customer-webserver, used to select the pods that the constraint applies to.
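This is the topologySpreadConstraints block that appears, commented out, in the spec files at the end of this article. To use topology spread instead of affinity, uncomment it and remove the affinity rules:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: customer-redis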
Apply this step if you need to determine the number of replicas required for deploying an application/pod. This helps in identifying the total number of pods per node.
To perform this step, you can use the Kubernetes instance calculator tool, which takes as input the CPU/memory requests and limits required by the application. Details on the tool, along with a sample data fill, can be found in its GitHub repository. The tool allows you to order instance types by efficiency or cost.
Below is an excerpt from the Redis deployment file showing the desired replica count for the pods in the deployment.
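The full spec appears in customer-redis.yml at the end of this article:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-redis
spec:
  selector:
    matchLabels:
      app: customer-redis
  replicas: 15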
At the time of initial deployment, there are 30 pods distributed evenly across 3 nodes, with half of the pods running Redis and the other half running the Webserver.
> kubectl get deploy
NAME READY UP-TO-DATE AVAILABLE
customer-redis 15/15 15 15
customer-webserver 15/15 15 15
Apply this step if you need to automatically scale pods in response to changes in CPU utilization, in which case the HPA feature would be ideal.
The replica count from the previous section was calculated to be 15. The HPA has a minimum replica setting, which will be set to match the number of replicas in the deployment, i.e., 15.
The example below is for the Customer-Webserver application; the same settings also apply to the Customer-Redis deployment.
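This is the HorizontalPodAutoscaler definition from customer-webserver.yml (listed in full at the end of this article):
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: customer-webserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customer-webserver
  minReplicas: 15
  maxReplicas: 25
  targetCPUUtilizationPercentage: 50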
To summarize, the minReplicas field is set to 15, which matches the number of replicas specified in the deployment. This means that the HPA will not scale down the number of pods below the initial deployment value of 15.
The maxReplicas field is set to 25, which means that the HPA will not scale up the number of pods beyond 25 even if the CPU utilization is high.
The targetCPUUtilizationPercentage field is set to 50, which indicates that the HPA should aim for an average CPU utilization of 50% across the deployment's pods. This value is a tradeoff between cost and optimal performance. A lower value reduces the risk of auto-scale lag but requires a higher number of pods (~4x) to manage the same workload, which could lead to increased costs.
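For reference, the HPA computes the desired replica count from the ratio of observed to target utilization, per the standard algorithm described in the Kubernetes documentation:
desiredReplicas = ceil(currentReplicas * currentCPUUtilization / targetCPUUtilization)
For example, if 15 pods are running at 75% average CPU against a 50% target, the HPA scales to ceil(15 * 75 / 50) = 23 pods.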
To scale based on incoming HTTP traffic, consider KEDA with the HTTP add-on. This add-on allows scale-to-zero of a deployment using HTTP request queue metrics, which can further optimize the cost and performance of an AKS cluster.
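As a rough illustration only (the HTTPScaledObject schema has changed across add-on versions, so treat the field names below as assumptions and check the KEDA HTTP add-on documentation for your version), a scaled object for the webserver might look something like:
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: customer-webserver
spec:
  host: customer.example.com      # hypothetical hostname for routing
  scaleTargetRef:
    deployment: customer-webserver
    service: customer-webserver
    port: 80
  replicas:
    min: 0                        # scale to zero when idle
    max: 25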
This section illustrates the above concepts. It involves setting up the AKS cluster with one system and two user node pools, blocking the system pool from scheduling workload pods, applying labels to the workload node pools, and applying the YAML spec files for Redis and the web server.
The instructions also include scaling the deployments up to 50 pods, which triggers the addition of two new nodes. However, attempting to scale beyond 50 pods results in new pods going into a Pending state and eventually being removed, as this exceeds both the HPA maximum replica count and the capacity of the nodes.
Set up the AKS cluster with the node pool layout seen below. The Customer (user) node pool has cluster autoscaler settings Min=3 and Max=5.
# add NoSchedule to system nodepool
az aks nodepool update -g aks01 --cluster-name aks01 --name agentpool --node-taints CriticalAddonsOnly=true:NoSchedule --no-wait
# validate nodeTaints is set to "CriticalAddonsOnly=true:NoSchedule"
az aks nodepool show --resource-group aks01 --cluster-name aks01 --name agentpool
az aks nodepool update -g aks01 --cluster-name aks01 --name customer --labels dept=customer --no-wait
az aks nodepool update -g aks01 --cluster-name aks01 --name counsellor --labels dept=counsellor --no-wait
$ kubectl apply -f customer-redis.yaml
deployment.apps/customer-redis created
service/customer-redis created
horizontalpodautoscaler.autoscaling/customer-redis created
$ kubectl apply -f customer-webserver.yaml
deployment.apps/customer-webserver created
service/customer-webserver created
horizontalpodautoscaler.autoscaling/customer-webserver created
$ kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE
customer-redis 15/15 15 15
customer-webserver 15/15 15 15
$ kubectl get pods -o wide --sort-by=".spec.nodeName"
<displays Pods distributed across Nodes as seen in earlier figure>
$ kubectl scale deployment.apps/customer-redis --replicas 50
deployment.apps/customer-redis scaled
$ kubectl scale deployment.apps/customer-webserver --replicas 50
deployment.apps/customer-webserver scaled
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
aks-agentpool-27905097-vmss000000 220m 5% 2010Mi 15%
aks-counsellor-27905097-vmss000000 118m 6% 1256Mi 58%
aks-customer-27905097-vmss000000 115m 6% 1350Mi 62%
aks-customer-27905097-vmss000001 112m 5% 1483Mi 68%
aks-customer-27905097-vmss000002 109m 5% 1385Mi 64%
aks-customer-27905097-vmss000003 277m 14% 1130Mi 52%
aks-customer-27905097-vmss000004 223m 11% 1114Mi 51%
$ kubectl get deploy
NAME READY UP-TO-DATE AVAILABLE
customer-redis 25/25 25 25
customer-webserver 25/25 25 25
$ kubectl scale deployment.apps/customer-redis --replicas 51
deployment.apps/customer-redis scaled
$ kubectl scale deployment.apps/customer-webserver --replicas 51
deployment.apps/customer-webserver scaled
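To observe the Pending pods described above after scaling past the HPA maximum, you can filter on pod phase:
$ kubectl get pods --field-selector=status.phase=Pending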
customer-redis.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-redis
spec:
  selector:
    matchLabels:
      app: customer-redis
  replicas: 15
  template:
    metadata:
      labels:
        app: customer-redis
    spec:
      # topologySpreadConstraints:
      # - maxSkew: 1
      #   topologyKey: kubernetes.io/hostname
      #   whenUnsatisfiable: ScheduleAnyway
      #   labelSelector:
      #     matchLabels:
      #       app: customer-redis
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: dept
                operator: In
                values:
                - customer
      containers:
      - name: customer-redis
        image: redis:3.2-alpine
        ports:
        - containerPort: 6379   # Redis listens on 6379
        resources:
          limits:
            memory: 128Mi
            cpu: 100m
          requests:
            memory: 128Mi
            cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: customer-redis
  labels:
    app: customer-redis
spec:
  ports:
  - port: 6379
  selector:
    app: customer-redis
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: customer-redis
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customer-redis
  minReplicas: 15
  maxReplicas: 25   # matches the value discussed above
  targetCPUUtilizationPercentage: 50
customer-webserver.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-webserver
spec:
  selector:
    matchLabels:
      app: customer-webserver
  replicas: 15
  template:
    metadata:
      labels:
        app: customer-webserver
    spec:
      # topologySpreadConstraints:
      # - maxSkew: 1
      #   topologyKey: kubernetes.io/hostname
      #   whenUnsatisfiable: ScheduleAnyway
      #   labelSelector:
      #     matchLabels:
      #       app: customer-webserver
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: dept
                operator: In
                values:
                - customer
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - customer-redis
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: customer-webserver
        image: nginx:1.16-alpine
        ports:
        - containerPort: 80
        resources:
          limits:
            memory: 128Mi
            cpu: 100m
          requests:
            memory: 128Mi
            cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: customer-webserver
  labels:
    app: customer-webserver
spec:
  ports:
  - port: 80
  selector:
    app: customer-webserver
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: customer-webserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customer-webserver
  minReplicas: 15
  maxReplicas: 25   # matches the value discussed above
  targetCPUUtilizationPercentage: 50
Pod spread is an important aspect of AKS cluster management that can help optimize resource utilization and improve application performance. By assigning pods to specific node pools, setting up pod-to-pod dependencies, and defining a pod topology spread, you can ensure that applications run efficiently and smoothly. As illustrated in the examples above, node and pod affinity rules, as well as topology spread constraints, can help distribute pods across nodes in a way that balances workload and avoids performance bottlenecks. Ultimately, the key to effective pod spread is understanding your application's requirements and designing your cluster's architecture accordingly.
The sample scripts are not supported by any Microsoft standard support program or service. The sample scripts are provided AS IS without a warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.