ITOps Talk Blog

Kubernetes Operations: Prioritize Workload in Overcommitted Clusters

Microsoft

Nov 20, 2018

One of the benefits of adopting a system like Kubernetes is its support for burstable and scalable workloads. Horizontal application scaling involves adding or removing instances of an application to match demand. The Kubernetes Horizontal Pod Autoscaler automates pod scaling based on demand. This is cool; however, it can lead to unpredictable load on the cluster, which may put the cluster into an overcommitted state. Fortunately, if the goal is to squeeze every bit of CPU and memory from a cluster, overcommitment may be not only OK but desirable.

The following image represents a three-node cluster that runs three applications. Pink is the most critical. Red is burstable and durable, which means that if we need to stop a few instances of red, things will be OK. Blue is non-critical. The image also depicts a cluster in a fully maxed-out state: there are no more resources available for additional workload.

Imagine now that a scale-out operation is needed on the pink application. This puts the cluster in an overcommitted state with critical workload waiting to be scheduled. How can Kubernetes facilitate this critical request in an overcommitted state? One option is to use Pod Priority and Preemption, which allows a priority value to be attached to a scheduling request. In the event of overcommitment, priority is evaluated, and lower-priority workload is evicted (preemption) to make room for the higher-priority workload.
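To make the idea concrete, here is a deliberately simplified sketch of how a victim might be chosen. It is not the real scheduler algorithm, which also considers which node to use and how well the pending pod fits. The pod names, the priority numbers, and the `pick_victim` helper are all hypothetical and only mirror the pink, red, and blue applications described above.

```shell
# Toy model of preemption: given the running pods and the priority of an
# incoming pod, pick the lowest-priority pod that ranks below it.
# pick_victim and the pod list are illustrative, not a kubectl feature.

# Reads "<pod-name> <priority>" lines on stdin, one pod per line.
pick_victim() {
  incoming_priority="$1"
  # Sort ascending by priority, then print the first pod that ranks lower.
  sort -k2,2n | awk -v p="$incoming_priority" '$2 < p { print $1; exit }'
}

running_pods="pink-1 1000000
red-1 1000
blue-1 0"

# A new pink pod (priority 1000000) arrives on a full cluster.
victim=$(printf '%s\n' "$running_pods" | pick_victim 1000000)
echo "preempt: ${victim:-none}"
```

In this toy model the non-critical blue pod is selected first. An incoming pod with priority 0 would find no victim at all, which corresponds to a pod that simply stays Pending.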

Pod Priority and Preemption tutorial

In this article, we will walk through an end-to-end demonstration of using Pod Priority and Preemption to ensure that critical workload gets priority access to cluster resources.

In order to complete this tutorial, you need a Kubernetes cluster that consists of three nodes. I've included steps for deploying an appropriately sized Azure Kubernetes cluster. If you need an Azure Subscription or would like to read up on additional operational practices for Azure Kubernetes Service, see the following links.

Create an Azure Kubernetes Service Cluster

First things first, ensure you have an appropriately sized Kubernetes cluster for this tutorial (three nodes).

Create a resource group.

az group create --name AKSOperationsDemos --location eastus

Create the cluster. The Azure CLI defaults, which include a three-node pool, are suitable for this demo.

az aks create --resource-group AKSOperationsDemos --name AKSOperationsDemos --kubernetes-version 1.11.3

Connect to the cluster as cluster admin.

az aks get-credentials --resource-group AKSOperationsDemos --name AKSOperationsDemos --admin
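Before continuing, it can help to confirm that the cluster really has three schedulable nodes. The following is a minimal sketch; `count_ready_nodes` is a hypothetical helper name, and it assumes the default `kubectl get nodes` column layout, where the second column is the node status.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# count_ready_nodes is a hypothetical helper wrapping `kubectl get nodes`.
count_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

After pasting the function into your shell, run `count_ready_nodes`. For the cluster used in this tutorial the expected result is 3.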

Create a priority class for critical workload

Create a PriorityClass with a value of 1000000. Pods that reference this class are scheduled ahead of lower-priority pods and can preempt them when the cluster runs out of resources.

To do so, create a file named pc.yml and copy in the following YAML.

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false

Create the priority class.

kubectl create -f pc.yml
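To check that the class was stored as intended, you can read its value back. This is a small sketch rather than a required step; `priority_class_value` is a hypothetical helper name, and the jsonpath expression reads the top-level `value` field of the PriorityClass object.

```shell
# Print the numeric value stored in a PriorityClass.
# priority_class_value is a hypothetical helper; pass the class name as $1.
priority_class_value() {
  kubectl get priorityclass "$1" -o jsonpath='{.value}'
}
```

Running `priority_class_value high-priority` should print 1000000 if the class was created from the file above.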

Consume all CPU cores

Run some workload to consume all CPU cores in the cluster. In the following example, a deployment consisting of three replicas is started with a CPU request of one core each. This will effectively consume the available CPU resources of the cluster.

Create a file named slam-cpu.yml and copy in the following YAML.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: consume-cpu
spec:
  replicas: 3
  selector:
    matchLabels:
      app: consume-cpu
  template:
    metadata:
      labels:
        app: consume-cpu
    spec:
      containers:
      - name: nepetersv1
        image: neilpeterson/aks-helloworld:v1
        resources:
          requests:
            cpu: 1
            memory: 128Mi
          limits:
            cpu: 1
            memory: 128Mi

Run the deployment.

kubectl create -f slam-cpu.yml
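The scheduler compares CPU requests against each node's allocatable CPU, not against the raw core count, so it can be useful to look at that number directly. The sketch below is an optional check; `node_cpu_allocatable` is a hypothetical helper name, and the actual figures depend on the VM size your cluster uses.

```shell
# Print allocatable CPU per node, the budget that CPU requests are
# scheduled against. node_cpu_allocatable is a hypothetical helper.
node_cpu_allocatable() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu'
}
```

If each node reports less than two full cores, a single pod requesting one core leaves too little room for a second one-core pod on that node, which is the state this tutorial relies on.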

Start low-priority workload

Now start another pod without specifying a priority class.

Create a file named pod-no-priority.yml and copy in the following YAML.

apiVersion: v1
kind: Pod
metadata:
  name: pod-no-priority
spec:
  containers:
  - name: pod-no-priority
    image: neilpeterson/aks-helloworld:v1
    resources:
      requests:
        cpu: 1
        memory: 128Mi
      limits:
        cpu: 1
        memory: 256Mi

Run the pod.

kubectl create -f pod-no-priority.yml

At this point, the new pod cannot be scheduled because the cluster has no CPU left to satisfy its request. To see this, list the pods on the cluster and note that pod-no-priority is in a Pending state.

kubectl get pods

NAME                           READY     STATUS              RESTARTS   AGE
consume-cpu-6c8d576684-gf5sk   0/1       ContainerCreating   0          52s
consume-cpu-6c8d576684-mtvmn   0/1       ContainerCreating   0          52s
consume-cpu-6c8d576684-pnkff   0/1       ContainerCreating   0          52s
pod-no-priority                0/1       Pending             0          10s
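On a busier cluster it is easier to filter for the stuck pods than to scan the whole list. Here is a minimal sketch; `pending_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` column layout, where the third column is the pod status.

```shell
# List the names of pods whose STATUS column reads "Pending".
# pending_pods is a hypothetical helper that filters `kubectl get pods`.
pending_pods() {
  kubectl get pods --no-headers | awk '$3 == "Pending" { print $1 }'
}
```

At this stage of the tutorial, `pending_pods` should list only pod-no-priority.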

Return a list of events for the pod to see the actual issue.

kubectl describe pod pod-no-priority

In the output, you should see that the pod cannot be scheduled due to insufficient CPU.

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  1s (x18 over 53s)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

Run high priority workload

Finally, run another pod, this time assigning the high-priority class to it.

Create a file named pod-priority.yml and copy in the following YAML. Note that the pod spec references the priority class created in a previous step.

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-priority
spec:
  containers:
  - name: pod-with-priority
    image: neilpeterson/nepetersv1
    resources:
      requests:
        cpu: 1
        memory: 128Mi
      limits:
        cpu: 1
        memory: 256Mi
  priorityClassName: high-priority

Run the pod.

kubectl create -f pod-priority.yml

Now return a list of pods. If you are quick, you may catch one of the lower-priority pods being terminated.

kubectl get pods

NAME                           READY     STATUS        RESTARTS   AGE
consume-cpu-6c8d576684-gf5sk   1/1       Running       0          7m
consume-cpu-6c8d576684-mtvmn   1/1       Running       0          7m
consume-cpu-6c8d576684-p7tqx   0/1       Pending       0          3s
consume-cpu-6c8d576684-pnkff   1/1       Terminating   0          7m
pod-no-priority                0/1       Pending       0          6m
pod-with-priority              0/1       Pending       0          3s

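Once the lower-priority pod has been terminated, the pod with priority is started in its place. The replacement replica that the deployment created (consume-cpu-6c8d576684-p7tqx) and pod-no-priority both remain Pending, because no CPU is left for them.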

kubectl get pods

NAME                           READY     STATUS    RESTARTS   AGE
consume-cpu-6c8d576684-gf5sk   1/1       Running   0          8m
consume-cpu-6c8d576684-mtvmn   1/1       Running   0          8m
consume-cpu-6c8d576684-p7tqx   0/1       Pending   0          1m
pod-no-priority                0/1       Pending   0          8m
pod-with-priority              1/1       Running   0          1m
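To see why the scheduler treated the pods differently, you can print the priority that Kubernetes resolved for each one. This is a sketch under the assumption that the priority admission step has populated `.spec.priority` on the pods; `pod_priorities` is a hypothetical helper name.

```shell
# Show each pod's priority class next to its resolved numeric priority.
# pod_priorities is a hypothetical helper; .spec.priority is filled in by
# Kubernetes from the PriorityClass named in .spec.priorityClassName.
pod_priorities() {
  kubectl get pods \
    -o custom-columns='NAME:.metadata.name,CLASS:.spec.priorityClassName,PRIORITY:.spec.priority'
}
```

Running `pod_priorities` should show 1000000 for pod-with-priority, while the pods created without a class are expected to show a much lower value (typically 0).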

Very cool indeed. Feel free to contact me on Twitter (@nepeters) or comment below for discussion on the topic.

Updated Mar 23, 2019

Neil Peterson
