Core Infrastructure and Security Blog

Decision Flow to Estimate Pod Spread on AKS

Microsoft

Mar 20, 2023

Introduction

In Azure Kubernetes Service (AKS), the concept of pod spread is important to ensure that pods are distributed efficiently across nodes in a cluster. This helps to optimize resource utilization, increase application performance, and maintain high availability.

This article outlines a decision-making process for estimating the number of Pods running on an AKS cluster. We will look at pod distribution across designated node pools, distribution based on pod-to-pod dependencies and distribution where pod or node affinities are not specified. Finally, we explore the impact of pod spread on scaling using replicas and the role of the Horizontal Pod Autoscaler (HPA). We will close with a test run of all the above scenarios.

Prerequisites

We will assume an AKS cluster exists with system and user node pools. The example used in this article considers two departments, Customer and Counsellor, represented by separate node pools and illustrated in the figure below.

Each node pool houses two distinct applications, Webserver and Redis. Each application has its own deployment and service within the cluster. For example, Customer-Webserver and Customer-Redis represent the Customer Web Server and Redis applications.

Furthermore, every deployment in the cluster has its own HPA definition. The HPA is designed to automatically adjust the number of pods based on CPU utilization. This feature enables the system to dynamically scale up or down the resources required by the application.

Step 1: Set up Pod spread based on designated Node pools.

Apply this step if pods must be assigned to particular nodes because of resource requirements or dependencies between pods and nodes.

For example, pods for the Customer department should be assigned to the Customer node pool, while pods for the Counsellor department should be assigned to the Counsellor node pool.

To steer pods to specific node pools, first label the pools:

- Apply a label to each node pool. For example, the Customer node pool may be labeled dept=customer, while the Counsellor node pool may be labeled dept=counsellor.
- These labels are then applied to the nodes within each node pool, allowing pods to be scheduled on the appropriate nodes.

az aks nodepool update -g aks01 --cluster-name aks01 --name customer --labels dept=customer --no-wait 
az aks nodepool update -g aks01 --cluster-name aks01 --name counsellor --labels dept=counsellor --no-wait

To ensure that pods are scheduled on the appropriate nodes:

- Use node affinity to constrain pods to nodes that carry specific labels. Adding nodeAffinity rules to the pod specification lets the scheduler place the pod only on nodes whose labels match the rule's expressions, and therefore in the correct node pool.
- The deployment specs in the Test files section use a required node affinity rule on the dept label: dept=customer for Customer pods. Counsellor pods would use dept=counsellor in the same way. This ensures that pods are scheduled on the nodes of their own department.
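As a minimal sketch, the node affinity rule described above looks like the fragment below. It is taken from the pod template of the Customer deployments in the Test files section and is not a complete manifest on its own. The inline comments are added here for illustration.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec), not a full manifest.
affinity:
  nodeAffinity:
    # Hard requirement at scheduling time; pods that are already running
    # are not evicted if the node labels change later.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: dept
          operator: In
          values:
          - customer   # use "counsellor" here for Counsellor pods
```

With this rule in place, a pod stays in Pending if no node carries the matching dept label.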

Step 2: Set up Pod spread based on Pod-to-Pod dependencies.

Apply this step if there are inter-pod dependencies, for example when one pod must be in place before another is scheduled, or when pods have to be co-located on the same node for low latency.

In this example, the Redis pod has no dependency and starts first, while the Webserver pod has an affinity with the Redis pod. A Redis pod must therefore already be placed on a node before a Webserver pod can be scheduled there.

In the Webserver pod specification (see the Test files section), the requiredDuringSchedulingIgnoredDuringExecution field under podAffinity states that the pod may only be scheduled where at least one pod with the label app=customer-redis already exists. A Webserver pod is scheduled only if such a Redis pod exists; otherwise it stays in Pending.

The topologyKey field is set to kubernetes.io/hostname, which makes the individual node the unit of co-location: the Webserver pod must land on a node that already has at least one pod with the app=customer-redis label.

Pod dependencies are defined using Pod Affinity rules, which in this case ensure that an instance of Redis is running before the Webserver pod gets scheduled. More details about Pod Affinity and scheduling in Kubernetes can be found at this link.
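The pod affinity rule described in this step can be sketched as the fragment below. It mirrors the podAffinity stanza of the customer-webserver deployment in the Test files section and is a fragment of the pod template rather than a complete manifest. The comments are added for illustration.

```yaml
# Fragment of the Webserver pod template (spec.template.spec), not a full manifest.
affinity:
  podAffinity:
    # Hard requirement: only schedule where a matching pod already exists.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - customer-redis
      # Co-location is evaluated per node, because every node has its own hostname.
      topologyKey: "kubernetes.io/hostname"
```

In the test files this stanza sits next to the nodeAffinity rule under the same affinity key, so both conditions must hold for a Webserver pod to be scheduled.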

Step 3: Set up Pod spread when there is no Pod or Node affinity.

Apply this step if there are no dependencies between pods or nodes. This involves setting up Pod topology spread constraints when neither podAffinity nor nodeAffinity is specified. The constraints distribute pods evenly across all available nodes, which helps resource utilization and application performance.

The topologySpreadConstraints field defines the constraints that the scheduler uses to spread pods across the available nodes. In this example, the constraint has a maxSkew of 1, meaning that the number of matching pods on any two nodes should not differ by more than 1.

The topologyKey, kubernetes.io/hostname, groups nodes by hostname, so each node is its own topology domain. The whenUnsatisfiable field specifies what the scheduler should do if the constraint cannot be satisfied. Here it is set to ScheduleAnyway, which means the scheduler still places the pod even if the constraint cannot be met, while giving preference to nodes that reduce the skew.

Finally, the labelSelector specifies which pods are counted when computing the skew: those labeled app: customer-redis or app: customer-webserver, depending on the deployment.
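A minimal sketch of the constraint described above is shown below. It corresponds to the block that is commented out in the customer-redis deployment in the Test files section, and it is a fragment of the pod template rather than a complete manifest. The comments are added for illustration.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec), not a full manifest.
topologySpreadConstraints:
- maxSkew: 1                           # matching-pod counts on any two nodes differ by at most 1
  topologyKey: kubernetes.io/hostname  # each node is its own topology domain
  whenUnsatisfiable: ScheduleAnyway    # soft constraint: schedule even if the skew cannot be met
  labelSelector:
    matchLabels:
      app: customer-redis              # use customer-webserver in the Webserver deployment
```

Setting whenUnsatisfiable to DoNotSchedule instead would turn the constraint into a hard requirement, leaving pods in Pending when the skew cannot be satisfied.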

Step 4: Set up pod spread based on Replica count.

Apply this step if you need to determine the number of replicas required to deploy an application. This also identifies the total number of pods per node.

To perform this step, you can use the Kubernetes instance calculator tool, which takes as input the CPU and memory requests and limits required by the application. Details on the tool, along with example data, can be found in the GitHub link. The tool allows you to order the instance types by efficiency or cost.

Tips on using this tool are listed below.

- Get the Pod Requests and Pod Limits values for CPU and memory based on the application requirements. The Vertical Pod Autoscaler in AKS can provide an estimate of the application's resource requests and limits.
- Make any necessary adjustments based on other considerations, then enter the data into the tool to get the Max Pod Count.
- Consider the following example, which uses the B2s instance type.
  - The tool calculates a total of 10 pods per node based on the input resource settings.
  - Divide this by the number of apps to be deployed (in this case, 2) to get 5 pods per node for each app.
  - Deploying the 2 apps across 3 nodes (assuming similar resource needs) gives a total of 30 pods (10 x 3).
  - This results in 15 replicas (5 x 3) for each application deployment. Enter this value in the deployment specification file.

The Redis deployment file in the Test files section sets this replica count in its replicas field.
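As an illustration, the part of the Redis deployment that carries the replica count can be sketched as follows. The fragment is trimmed from the customer-redis deployment in the Test files section; the pod template is left out, so it is not deployable as written.

```yaml
# Trimmed from the customer-redis Deployment in the Test files section.
# The pod template (spec.template) is omitted here for brevity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-redis
spec:
  replicas: 15   # 5 pods per node for this app x 3 nodes
  selector:
    matchLabels:
      app: customer-redis
```

The Webserver deployment uses the same replicas value, which gives the 30 pods in total.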

At the time of initial deployment, there are 30 pods that are distributed evenly across 3 nodes, with half of the pods running Redis and the other half running the Webserver.

$ kubectl get deploy
NAME               READY   UP-TO-DATE   AVAILABLE
customer-redis     15/15   15           15
customer-webserver 15/15   15           15

Step 5: Set up pod spread by scaling with Horizontal Pod Autoscaler (HPA).

Apply this step if you need to scale pods automatically in response to changes in CPU utilization, which is what the HPA is designed for.

The previous step arrived at a replica count of 15. The HPA has a minimum replica setting, which is set to match the number of replicas in the deployment, that is, 15.

The following sizing applies to the Customer-Webserver and Customer-Redis applications.

The node pool autoscaler has its minimum node count set to 3 and its maximum node count set to 5.

- Assuming each node can run up to 10 pods, the total number of pods that can run in the node pool is 50 (10 pods/node x 5 nodes).
- Since 2 applications share the node pool, the maximum number of replicas that can be set for each application through HPA is the total number of pods (50) divided by the number of applications (2), which gives a maximum replica count of 25 for each deployment.

The HPA definition for the Webserver deployment, found in the Test files section, applies in the same way to the Redis deployment.
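A sketch of the HPA definition for the Webserver deployment is shown below. It follows the structure of the HPA in the Test files section, with minReplicas and maxReplicas set to the values derived in this step. The comments are added for illustration.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: customer-webserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customer-webserver
  minReplicas: 15   # matches the Deployment's initial replica count
  maxReplicas: 25   # 50 pods of node pool capacity / 2 applications
  targetCPUUtilizationPercentage: 50
```

For the Redis deployment, only the two name fields change to customer-redis.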

To summarize, the minReplicas field is set to 15, which matches the number of replicas specified in the deployment. This means that the HPA will not scale the number of pods below the initial deployment value of 15.

The maxReplicas field is set to 25, which means that the HPA will not scale the number of pods beyond 25 even if CPU utilization is high.

The targetCPUUtilizationPercentage field is set to 50, which tells the HPA to aim for an average CPU utilization of 50% of the pods' CPU request across the deployment. This value is a tradeoff between cost and performance. A lower value reduces the risk of auto-scale lag but requires a higher number of pods (~4x) to manage the same workload, which could lead to increased costs.
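To make the effect of the target value concrete, the scaling rule documented for the Kubernetes HPA can be written as below. The utilization figure in the worked line (80%) is an assumed, illustrative number, not a measurement from this cluster.

```latex
\[
\text{desiredReplicas} =
\left\lceil \text{currentReplicas} \times
\frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil
\]
% Illustrative values (assumed): 15 replicas averaging 80% CPU against a 50% target
\[
\left\lceil 15 \times \frac{80}{50} \right\rceil = \lceil 24 \rceil = 24
\]
```

The result is then kept within the minReplicas and maxReplicas bounds, so in this setup it can never fall below 15 or exceed 25. The HPA also applies a small tolerance around the target before it acts, so minor deviations do not trigger scaling.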

To scale based on incoming HTTP traffic, consider KEDA with the HTTP add-on. This add-on allows a deployment to scale to zero based on HTTP request queue metrics, which can further optimize the cost and performance of an AKS cluster.

Test Run

This section illustrates the above concepts with a test run. It involves setting up an AKS cluster with one system node pool and 2 user node pools, blocking workload pods from being scheduled on the system pool, applying labels to the workload node pools, and applying the YAML spec files for Redis and the web server.

The instructions also include scaling up to a total of 50 pods, which triggers the addition of two new nodes. Attempting to scale beyond 50 pods results in new pods going into the Pending state and eventually being removed, because the request exceeds both the HPA maximum and the capacity of the nodes.

Prerequisites

Set up the AKS cluster with the node pool layout as seen below. The Customer (user) node pool has Min=3 and Max=5.

Block workload pods from being scheduled on the system pool.

# add NoSchedule to system nodepool
az aks nodepool update -g aks01 --cluster-name aks01 --name agentpool --node-taints CriticalAddonsOnly=true:NoSchedule --no-wait

# validate nodeTaints is set to "CriticalAddonsOnly=true:NoSchedule"
az aks nodepool show --resource-group aks01 --cluster-name aks01 --name agentpool

Apply labels to workload node pools.

az aks nodepool update -g aks01 --cluster-name aks01 --name customer --labels dept=customer --no-wait 
az aks nodepool update -g aks01 --cluster-name aks01 --name counsellor --labels dept=counsellor --no-wait

Apply the YAML spec files for customer-redis and customer-webserver. Files can be found in the Test files section below.

$ kubectl apply -f customer-redis.yaml
deployment.apps/customer-redis created
service/customer-redis created
horizontalpodautoscaler.autoscaling/customer-redis created

$ kubectl apply -f customer-webserver.yaml
deployment.apps/customer-webserver created
service/customer-webserver created
horizontalpodautoscaler.autoscaling/customer-webserver created

Confirm the 30-pod deployment to 3 nodes.

$ kubectl get deployment
NAME               READY   UP-TO-DATE   AVAILABLE
customer-redis     15/15   15           15      
customer-webserver 15/15   15           15      

$ kubectl get pods -o wide --sort-by=".spec.nodeName"
<displays Pods distributed across Nodes as seen in earlier figure>

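Scale up to a total of 50 pods. The commands below request 50 replicas for each deployment; as the output shows, each deployment settles at 25 replicas, 50 pods in total. Since this exceeds the capacity of 3 nodes, the node pool autoscaler (not the HPA, which only scales pods) adds 2 additional nodes. Once the nodes are up, the pods are scaled to a total count of 50.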

$ kubectl scale deployment.apps/customer-redis --replicas 50
deployment.apps/customer-redis scaled

$ kubectl scale deployment.apps/customer-webserver --replicas 50
deployment.apps/customer-webserver scaled

$ kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
aks-agentpool-27905097-vmss000000    220m         5%     2010Mi          15%
aks-counsellor-27905097-vmss000000   118m         6%     1256Mi          58%
aks-customer-27905097-vmss000000     115m         6%     1350Mi          62%
aks-customer-27905097-vmss000001     112m         5%     1483Mi          68%
aks-customer-27905097-vmss000002     109m         5%     1385Mi          64%
aks-customer-27905097-vmss000003     277m         14%    1130Mi          52%
aks-customer-27905097-vmss000004     223m         11%    1114Mi          51%

$ kubectl get deploy
NAME               READY   UP-TO-DATE   AVAILABLE
customer-redis     25/25   25           25      
customer-webserver 25/25   25           25

Scale beyond 50. New pods now go to Pending (and are eventually removed) because the request exceeds both the HPA maximum and the capacity of the nodes.

$ kubectl scale deployment.apps/customer-redis --replicas 51
deployment.apps/customer-redis scaled

$ kubectl scale deployment.apps/customer-webserver --replicas 51
deployment.apps/customer-webserver scaled

Test files

customer-redis.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-redis
spec:
  selector:
    matchLabels:
      app: customer-redis
  replicas: 15
  template:
    metadata:
      labels:
        app: customer-redis
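    spec:
      # Step 3 alternative: uncomment to spread pods evenly across nodes
      # when no pod affinity or node affinity is specified.
      # topologySpreadConstraints:
      # - maxSkew: 1
      #   topologyKey: kubernetes.io/hostname
      #   whenUnsatisfiable: ScheduleAnyway
      #   labelSelector:
      #     matchLabels:
      #       app: customer-redis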
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: dept
                operator: In
                values:
                - customer                   
      containers:
      - name: customer-redis
        image: redis:3.2-alpine
        ports:
        - containerPort: 6379  # Redis listens on 6379 by default
        resources:
          limits:
            memory: 128Mi
            cpu: 100m
          requests:
            memory: 128Mi
            cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: customer-redis
  labels:
    app: customer-redis
spec:
  ports:
  - port: 6379
  selector:
    app: customer-redis
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: customer-redis
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customer-redis
  minReplicas: 15
  maxReplicas: 25
  targetCPUUtilizationPercentage: 50

customer-webserver.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-webserver
spec:
  selector:
    matchLabels:
      app: customer-webserver
  replicas: 15
  template:
    metadata:
      labels:
        app: customer-webserver
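    spec:
      # Step 3 alternative: uncomment to spread pods evenly across nodes
      # when no pod affinity or node affinity is specified.
      # topologySpreadConstraints:
      # - maxSkew: 1
      #   topologyKey: kubernetes.io/hostname
      #   whenUnsatisfiable: ScheduleAnyway
      #   labelSelector:
      #     matchLabels:
      #       app: customer-webserver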
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: dept
                operator: In
                values:
                - customer
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - customer-redis
            topologyKey: "kubernetes.io/hostname"                      
      containers:
      - name: customer-webserver
        image: nginx:1.16-alpine
        ports:
        - containerPort: 80
        resources:
          limits:
            memory: 128Mi
            cpu: 100m
          requests:
            memory: 128Mi  
            cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: customer-webserver
  labels:
    app: customer-webserver
spec:
  ports:
  - port: 80
  selector:
    app: customer-webserver
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: customer-webserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customer-webserver
  minReplicas: 15
  maxReplicas: 25
  targetCPUUtilizationPercentage: 50

Conclusion

Pod spread is an important aspect of AKS cluster management that can help optimize resource utilization and improve application performance. By assigning pods to specific node pools, setting up Pod-to-Pod dependencies, and defining Pod topology spread, one can ensure that applications run efficiently and smoothly. As illustrated through the examples, node and pod affinity rules, together with topology spread constraints, help distribute pods across nodes in a way that balances workload and avoids performance bottlenecks. Ultimately, the key to effective Pod spread is understanding your application's requirements and designing your cluster's architecture accordingly.

Disclaimer

The sample scripts are not supported by any Microsoft standard support program or service. The sample scripts are provided AS IS without a warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.

Updated Mar 18, 2023

Version 1.0

JojiVarghese
