Introduction
At last year's KubeCon North America, Microsoft announced the adoption of Karpenter in Azure Kubernetes Service (AKS) as an alternative to the Cluster Autoscaler (CA), referred to as Node Autoprovisioning (NAP). While Cluster Autoscaler has been the default node scaler in AKS/Kubernetes, there have been significant challenges that led to the adoption of Karpenter. This post delves into these challenges and explores how Karpenter addresses them.
Challenges with Cluster Autoscaler
Here is the node autoscaling flow chart for Cluster Autoscaler:
1. Limited to VMSS Groups: Cluster Autoscaler can only operate with Virtual Machine Scale Sets (VMSS) in AKS. Each VMSS consists of a specific group of VM instances with a specific VM SKU, hardware, and CPU:Memory ratio (e.g., Standard_D4s_v5 with 4 CPUs and 16 GB RAM).
2. Node Pool Constraints: When deploying new pods, if the existing node capacity is exhausted, CA attempts to spin up a new node of the same VMSS SKU type. If that instance is unavailable, pods remain in a pending state.
3. Scalability Limitations: CA can only scale up based on specific node pool SKU VMSS availability. It cannot leverage the capacity of other VM SKUs even if they have available resources.
Introducing Karpenter (Node Autoprovisioning)
Karpenter is an efficient node autoscaler for Kubernetes clusters, designed to optimize performance and cost. It can scale up and down worker nodes faster than Cluster Autoscaler and can launch appropriate individual nodes without creating traditional node groups in AKS.
Key Features of Karpenter:
- Efficiency: Faster scaling of Kubernetes nodes.
- Flexibility: Launches individual nodes without needing VMSS.
- Cost Optimization: Reduces overall costs and helps with patching of node images and Kubernetes versions.
- NodePool YAML-based configuration that defines which types of nodes Karpenter can provision.
Handling Disruptions
The Disruption Controller is responsible for terminating and replacing nodes in a Kubernetes cluster. It uses the following automated mechanisms and settings to decide which nodes to disrupt:
- Expiration: Karpenter marks nodes as expired and disrupts them after they have lived for a set amount of time. This parameter acts as a TTL for Kubernetes nodes.
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 300s
- ConsolidateAfter: Configures the disruption interval, i.e. the amount of time Karpenter should wait before considering the next disruption cycle.
- Consolidation: Actively reduces cluster cost by analyzing nodes. The consolidation policy has two modes:
  a) WhenEmpty: Karpenter will only disrupt nodes with no workload pods.
  b) WhenUnderutilized: Karpenter will attempt to remove or replace nodes when they are underutilized.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ondemand
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
Enable NAP (Karpenter) on AKS
There are a few prerequisites to enable NAP on AKS (sample commands follow this list):
- Install the Azure CLI with the preview (aks-preview) extension at a version greater than 0.5.17.
- Register the NAP preview feature called "NodeAutoProvisioningPreview".
- AKS network configuration set to Cilium + Overlay.
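A minimal sketch of the first two prerequisites using the Azure CLI (the exact minimum extension version may differ from what is stated above):
# Install or refresh the aks-preview CLI extension
az extension add --name aks-preview
az extension update --name aks-preview
# Register the NAP preview feature and propagate the registration
az feature register --namespace "Microsoft.ContainerService" --name "NodeAutoProvisioningPreview"
az feature show --namespace "Microsoft.ContainerService" --name "NodeAutoProvisioningPreview" --query properties.state
az provider register --namespace Microsoft.ContainerService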
Enable NAP on an existing AKS cluster
Make sure the existing AKS cluster uses the 'Azure' network plugin with Cilium as the network dataplane. The key flag in this command is '--node-provisioning-mode Auto', which sets NAP as the default node autoscaler.
az aks update --name aksclustername --resource-group rgname --node-provisioning-mode Auto
Deploy NAP with a new AKS cluster
az aks create --name aksclustername --resource-group rgname --node-provisioning-mode Auto --network-plugin azure --network-plugin-mode overlay --network-dataplane cilium
Verify Karpenter Enablement:
kubectl api-resources | grep -e aksnodeclasses -e nodeclaims -e nodepools
aksnodeclasses aksnc,aksncs karpenter.azure.com/v1alpha2 false AKSNodeClass
nodeclaims karpenter.sh/v1beta1 false NodeClaim
nodepools karpenter.sh/v1beta1 false NodePool
Disabling Cluster-Autoscaler
To switch from Cluster-Autoscaler to Karpenter, disable Cluster-Autoscaler on your AKS cluster:
az aks update --name aksclustername --resource-group aksrg --disable-cluster-autoscaler
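If the Cluster Autoscaler was enabled on individual node pools rather than cluster-wide, it may also need to be disabled per node pool; a hedged example (the pool name nodepool1 is an assumption):
# Disable the autoscaler on a single node pool
az aks nodepool update --resource-group aksrg --cluster-name aksclustername --name nodepool1 --disable-cluster-autoscaler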
Deploying a Sample Application
To see Node-Autoprovisioning in action, deploy a sample application:
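For reference, a trimmed-down sketch of one of the vote app deployments is shown below. The image and the karpenter-demo-ns namespace match the commands later in this post; the replica count and resource requests are assumptions added so that pending pods give Karpenter something to size nodes against, and the full manifests live in the repo referenced later.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: azure-vote-front
  namespace: karpenter-demo-ns
spec:
  replicas: 2
  selector:
    matchLabels:
      app: azure-vote-front
  template:
    metadata:
      labels:
        app: azure-vote-front
    spec:
      containers:
        - name: azure-vote-front
          image: mcr.microsoft.com/azuredocs/azure-vote-front:v1
          resources:
            requests:           # requests drive Karpenter's scheduling simulation
              cpu: 250m
              memory: 256Mi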
osama [ ~ ]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-default-h2jxh Ready agent 35m v1.27.9
aks-nodepool1-41633911-vmss000000 Ready agent 3d19h v1.27.9
Scale the replicas of the vote application to trigger scale-out events:
osama [ ~ ]$ kubectl scale deployment azure-vote-front --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-front scaled
osama [ ~ ]$ kubectl scale deployment azure-vote-back --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-back scaled
Verify node autoscaling by reading the events emitted by Karpenter with the kubectl command below:
kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp' | tail -n 10
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
default 50m Normal Unconsolidatable nodeclaim/default-95f54 SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
default 50m Normal Unconsolidatable node/aks-default-95f54 SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
default 38m Normal DisruptionBlocked nodepool/default No allowed disruptions due to blocking budget
default 5m33s Normal Unconsolidatable nodeclaim/default-h2jxh Can't remove without creating 2 candidates
default 5m33s Normal Unconsolidatable node/aks-default-h2jxh Can't remove without creating 2 candidates
default 2m12s Normal DisruptionBlocked nodepool/system-surge No allowed disruptions due to blocking budget
karpenter-demo-ns 63s Normal Nominated pod/azure-vote-front-6855444955-bnq7p Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns 63s Normal Nominated pod/azure-vote-front-6855444955-gbwk6 Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns 63s Normal Nominated pod/azure-vote-front-6855444955-l2bgj Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns 63s Normal Nominated pod/azure-vote-front-6855444955-nvc56 Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns 63s Normal Nominated pod/azure-vote-front-6855444955-22glj Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns 63s Normal Nominated pod/azure-vote-front-6855444955-sxdl6 Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns 63s Normal Nominated pod/azure-vote-front-6855444955-t69w4 Pod should schedule on: nodeclaim/default-mrh7w
Customise Karpenter Config
Karpenter leverages a new Kubernetes resource kind, the NodePool, which can be customised in several ways:
- Customise NodePools: Specify a VM series, a VM family, or even a specific CPU:Memory ratio.
- Select nodes based on feature sets such as GPU enablement or accelerated networking.
- Define the CPU architecture, either ARM or AMD, based on the capability required by a specific workload.
- Architect your nodes for resiliency by configuring zone topology (see the requirement sketch after this list).
- Limit the total CPU and memory that a NodePool can provision across its nodes.
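As an illustration of the architecture and zone options above, here is a hedged sketch of requirement entries that could be appended to a NodePool's spec.template.spec.requirements (the zone values assume an East US cluster with three availability zones):
- key: topology.kubernetes.io/zone     # pin provisioning to specific availability zones
  operator: In
  values:
    - eastus-1
    - eastus-2
    - eastus-3
- key: kubernetes.io/arch              # allow ARM64 SKUs for workloads built for ARM
  operator: In
  values:
    - arm64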
Here is the default NodePool YAML for Karpenter (NAP), which contains configuration for node SKU types and capacity, limits on total CPU and memory, and a weight in case of multiple NodePools:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 10s
  template:
    spec:
      nodeClassRef:
        name: default
      # Requirements that constrain the parameters of provisioned nodes.
      # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - E
            - D
        - key: karpenter.azure.com/sku-name
          operator: In
          values:
            - Standard_E2s_v5
            - Standard_D4s_v3
  limits:
    cpu: "1000"
    memory: 1000Gi
  weight: 100
Using Spot Nodes with Karpenter
- Add a toleration to the sample AKS vote application for the spot taint ("kubernetes.azure.com/scalesetpriority=spot:NoSchedule"), which is applied by default to spot nodes provisioned in an AKS cluster.
- Please refer to my GitHub repo for the application YAML and a sample NodePool config.
spec:
  nodeSelector:
    "kubernetes.io/os": linux
  tolerations:
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  containers:
    - name: azure-vote-front
      image: mcr.microsoft.com/azuredocs/azure-vote-front:v1
- Scale down your application replicas to allow Karpenter to evict the existing on-demand nodes and replace them with spot nodes:
osama [ ~/karpenter ]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-nodepool1-41633911-vmss000000 Ready agent 3d21h v1.27.9
aks-nodepool1-41633911-vmss00000b Ready agent 24m v1.27.9
osama [ ~/karpenter ]$ kubectl get pods -n karpenter-demo-ns -o wide
No resources found in karpenter-demo-ns namespace.
osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-back --replicas=10 -n karpenter-demo-ns
deployment.apps/azure-vote-back scaled
osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-front --replicas=10 -n karpenter-demo-ns
deployment.apps/azure-vote-front scaled
- Deploy and scale the vote application replicas so that Karpenter spins up spot nodes based on the NodePool configuration and schedules the pods once their tolerations are validated against the spot taint.
- Karpenter spins up new spot nodes and nominates them for scheduling the sample vote app:
osama [ ~/karpenter ]$ kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-pz8sp Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-ckdcq Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-v9nqj Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-vswvs Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-lnxmp Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-jc2jz Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-hwnbh Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-r7msb Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-96lm9 Pod should schedule on: nodeclaim/default-52gbg
karpenter-demo-ns 104s Normal Nominated pod/azure-vote-back-687ddb67bd-5qcvk Pod should schedule on: nodeclaim/default-52gbg
default 1s Normal DisruptionLaunching nodeclaim/default-bkz6c Launching NodeClaim: Expiration/Replace
default 1s Normal DisruptionWaitingReadiness nodeclaim/default-bkz6c Waiting on readiness to continue disruption
default 1s Normal DisruptionBlocked nodepool/system-surge No allowed disruptions due to blocking budget
default 1s Normal DisruptionWaitingReadiness nodeclaim/default-5vp7x Waiting on readiness to continue disruption
default 1s Normal DisruptionLaunching nodeclaim/default-5vp7x Launching NodeClaim: Expiration/Replace
Configuring Multiple NodePools
- To configure separate NodePools for spot and on-demand capacity, configure the spot NodePool with an E-series VM ("Standard_E2s_v5") and the on-demand NodePool with a D-series VM ("Standard_D4s_v5").
- In a multi-NodePool scenario, each NodePool needs to be configured with a 'weight' attribute; the NodePool with the highest weight is prioritized over the others. Here we have the spot NodePool with weight: 100 and the on-demand NodePool with weight: 60 (a sketch of the on-demand pool follows the spot pool output below).
osama [ ~ ]$ kubectl get nodepool default -o yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
      - nodes: 100%
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - B
        - key: karpenter.azure.com/sku-name
          operator: In
          values:
            - Standard_B2s_v2
  weight: 100
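The matching on-demand NodePool with weight: 60 mentioned above is not shown in the cluster output; a hedged sketch of what it could look like, assuming it reuses the same default AKSNodeClass, is:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ondemand
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - D
        - key: karpenter.azure.com/sku-name
          operator: In
          values:
            - Standard_D4s_v5
  weight: 60      # lower weight than the spot pool, so spot capacity is tried first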
- If we do not specify an explicit SKU name, Karpenter will consider the entire VM series.
- To validate that the sample VoteApp is running on Spot nodes, use the following commands:
- The output should indicate that the nodes are of capacity type "spot":
osama [ ~ ]$ kubectl get pods -n karpenter-demo-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
azure-vote-back-687ddb67bd-w7ghm 1/1 Running 0 63m 10.244.3.11 aks-default-5cr5f <none> <none>
azure-vote-front-6855444955-64558 1/1 Running 0 63m 10.244.3.168 aks-default-5cr5f <none> <none>
osama [ ~ ]$ kubectl describe node aks-default-5cr5f | grep karpenter.sh
karpenter.sh/capacity-type=spot
karpenter.sh/initialized=true
karpenter.sh/nodepool=default
karpenter.sh/registered=true
karpenter.sh/nodepool-hash: 12393960163388511505
karpenter.sh/nodepool-hash-version: v2
Simulating Spot Node Eviction
To test the spot eviction scenario, simulate a spot eviction using the Azure CLI:
osama [ ~ ]$ az vm simulate-eviction --resource-group MC_aks-lab_aks-karpenter_eastus --name aks-default-5cr5f
osama [ ~ ]$ date
Tue May 21 06:20:02 PM IST 2024
- Monitor the availability of your VoteApp using a simple curl command:
while true; do echo "$(date) $(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2>&1 | grep 'HTTP')"; sleep 2; done
- After running the spot simulation, the existing node will be marked for termination, and a new Spot node will be created to schedule the VoteApp pods. Within less than a minute, the VoteApp should start responding with HTTP 200 status codes.
root@MININT-8C81HDE:/home/osamaex while true; do echo "$(date) $(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2>&1 | grep 'HTTP')"; sleep 2; done
Tue May 21 18:20:04 IST 2024 > GET / HTTP/1.1 < HTTP/1.1 200 OK HTTP 200
Tue May 21 18:20:07 IST 2024 > GET / HTTP/1.1 < HTTP/1.1 200 OK HTTP 200
Tue May 21 18:20:09 IST 2024 > GET / HTTP/1.1 < HTTP/1.1 200 OK HTTP 200
Tue May 21 18:20:12 IST 2024 HTTP 000   $Failure-Alert
Tue May 21 18:21:14 IST 2024 > GET / HTTP/1.1 < HTTP/1.1 200 OK $Successful-Response HTTP 200
Tue May 21 18:22:58 IST 2024 > GET / HTTP/1.1 < HTTP/1.1 200 OK HTTP 200
- Check the events logged by Karpenter:
kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp'
- Results of the events logged by Karpenter while replacing the evicted spot node:
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
default 23s Warning FailedDraining node/aks-default-5cr5f Failed to drain node, 10 pods are waiting to be evicted
karpenter-demo-ns 22s Normal Evicted pod/azure-vote-back-687ddb67bd-w7ghm Evicted pod
karpenter-demo-ns 22s Normal Evicted pod/azure-vote-front-6855444955-64558 Evicted pod
karpenter-demo-ns 21s Normal Nominated pod/azure-vote-back-687ddb67bd-tb2pv Pod should schedule on: nodeclaim/default-6zkkl
karpenter-demo-ns 21s Normal Nominated pod/azure-vote-front-6855444955-7wzss Pod should schedule on: nodeclaim/default-6zkkl
- Verify that the pods are running on the new Spot node:
kubectl get pods -n karpenter-demo-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
azure-vote-back-687ddb67bd-tb2pv 1/1 Running 0 18m 10.244.2.103 aks-default-6zkkl <none> <none>
azure-vote-front-6855444955-7wzss 1/1 Running 0 18m 10.244.2.47 aks-default-6zkkl <none> <none>
Save Cost by Utilizing Reserved Instance VMs
- The NodePool configuration allows you to specify different VM series along with multiple VM SKUs. Create a separate NodePool with the highest weight value and specify all Reserved Instance VM SKU families or explicit SKU names using the karpenter.azure.com/sku-family or karpenter.azure.com/sku-name parameter:
spec:
  nodeClassRef:
    name: default
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand
    - key: karpenter.azure.com/sku-family
      operator: In
      values:
        - D
    - key: karpenter.azure.com/sku-name
      operator: In
      values:
        - Standard_D2s_v3
        - Standard_D4s_v3
        - Standard_D8s_v3
        - Standard_D16s_v3
        - Standard_D32s_v3
        - Standard_D64s_v3
        - Standard_D96s_v3
weight: 90
Conclusion
The adoption of Karpenter in AKS signifies a major advancement in node scaling efficiency, flexibility, and cost optimization. By addressing the limitations of the Cluster Autoscaler and introducing dynamic, rapid provisioning of nodes, Karpenter provides a robust solution for managing Kubernetes clusters. Its flexibility in handling different VM types, faster scaling capabilities, and cost optimization make it a valuable addition to Kubernetes cluster management. By leveraging Karpenter, organizations can achieve more responsive and cost-effective Kubernetes deployments.