
HA Configuration in AKS: System and Worker Nodes Across Multiple Availability Zones

khanshaban
Microsoft
May 09, 2025

In the context of Azure Kubernetes Service (AKS), high availability (HA) is critical to ensure your application remains resilient and operational during infrastructure failures. A key strategy for achieving HA is to distribute your system nodes (control plane) and worker nodes across multiple Availability Zones (AZs) within a region.

This blog walks through not only the what but also the why and how of setting up three system nodes and three worker nodes in HA mode, ensuring your AKS cluster maintains uptime even if one or more AZs experience an outage.

šŸ“Š Diagram: AKS High Availability Configuration

A typical HA setup includes:

  • Control plane nodes (system-managed) spread across AZs
  • Worker nodes (user-managed) manually deployed in multiple AZs
  • At least three zones to tolerate a full AZ failure

šŸŽÆ Why Three System and Three Worker Nodes?

  • System nodes: A quorum-based control plane ensures failover and keeps etcd consistent.
  • Worker nodes: Avoids a single point of failure and lets the Kubernetes scheduler distribute pods across zones for maximum uptime (see the sketch after this list).
  • Minimum of three zones: Two of the three AZs are enough to maintain control-plane quorum, so losing a single zone won't bring down the cluster or its workloads.
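
Zone-spread nodes alone don't guarantee that a deployment's replicas actually land in different zones; the scheduler needs a hint. Below is a minimal sketch using standard Kubernetes topology spread constraints (the deployment name, labels, image, and replica count are illustrative, not part of the original setup):

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-ha              # illustrative workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-ha
  template:
    metadata:
      labels:
        app: web-ha
    spec:
      # Keep the per-zone replica counts within 1 of each other
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-ha
      containers:
      - name: web
        image: nginx:1.25   # example image only
        ports:
        - containerPort: 80
EOF

With three replicas and three zones, this keeps roughly one replica per zone under normal conditions, so a zone outage takes out at most one replica.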

šŸ” When to Use This Architecture

This HA configuration is ideal for:

  • Production workloads requiring 99.99% uptime
  • Critical applications in sectors such as finance, healthcare, and retail
  • Scenarios where regulatory compliance or business continuity is essential

āš™ļø System Nodes in HA Mode

When you create an AKS cluster with Availability Zone support enabled, Azure automatically provisions the control plane across three zones, if supported in the region.
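
Before creating the cluster, it's worth a quick check that the VM size you plan to use is actually zone-enabled in your target region. One way to do this with the Azure CLI (a sanity check, assuming the --size and --zone filters available in recent CLI versions):

az vm list-skus --location eastus2 --size Standard_DS2_v2 --zone --output table

If the size reports all three zones for the region, the --zones 1 2 3 flag used below will work as expected.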

šŸ”§ Configuration Example

To create a zone-redundant cluster (whether through the Azure CLI, Terraform, or ARM/Bicep), you must:

  • Choose a region with at least three availability zones (e.g., East US 2, West Europe)
  • Specify the target zones; with the Azure CLI, that is the --zones parameter shown below
az aks create \
--resource-group myRG \
--name myAKSCluster \
--location eastus2 \
--zones 1 2 3 \
--node-count 3 \
--node-vm-size Standard_DS2_v2 \
--generate-ssh-keys \
--enable-managed-identity \
--enable-node-public-ip
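
Once the cluster is created, you can confirm that the initial nodes landed in separate zones. The commands below assume the resource group and cluster names from the example above; topology.kubernetes.io/zone is the standard well-known label that AKS applies to each node:

az aks get-credentials --resource-group myRG --name myAKSCluster
kubectl get nodes --label-columns topology.kubernetes.io/zone

Each node should report a zone value such as eastus2-1, eastus2-2, or eastus2-3.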

🧱 Worker Nodes in HA Mode

Unlike the control plane, worker nodes are spread across availability zones through the node pools you define. A common pattern, used here, is to create a separate user node pool for each zone and pin it to that zone at creation time.

šŸ“Œ Step-by-Step CLI to Add HA Worker Node Pools
āž¤ Add Worker Node Pool in Zone 1
az aks nodepool add \ 
--resource-group myRG \ 
--cluster-name myAKSCluster \ 
--name workerpool1 \ 
--node-count 1 \ 
--zones 1 \ 
--node-vm-size Standard_DS2_v2 \ 
--mode User
āž¤ Add Worker Node Pool in Zone 2
az aks nodepool add \ 
--resource-group myRG \ 
--cluster-name myAKSCluster \
--name workerpool2 \ 
--node-count 1 \ 
--zones 2 \ 
--node-vm-size Standard_DS2_v2 \ 
--mode User
āž¤ Add Worker Node Pool in Zone 3
az aks nodepool add \ 
--resource-group myRG \ 
--cluster-name myAKSCluster \ 
--name workerpool3 \ 
--node-count 1 \ 
--zones 3 \ 
--node-vm-size Standard_DS2_v2 \
--mode User
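
After adding all three pools, a quick listing confirms which zone each pool is pinned to. This is a sketch assuming the availabilityZones property exposed on the node pool resource:

az aks nodepool list \
--resource-group myRG \
--cluster-name myAKSCluster \
--query "[].{name:name, mode:mode, zones:availabilityZones}" \
--output json

You should see workerpool1, workerpool2, and workerpool3 each reporting a single, distinct zone alongside the zone-spanning system pool.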

šŸ“ˆ Autoscaling for Cost Optimization and Elasticity

To make your HA setup more cost-efficient and responsive to demand, enable Cluster Autoscaler for node pools. This allows the system to automatically adjust the number of nodes based on real-time workload requirements—scaling up during peak usage and scaling down during idle periods—thereby optimizing resource utilization and reducing unnecessary costs.

šŸ”§ Enable Autoscaler on Worker Pools
āž¤ Worker pool: 1
az aks nodepool update \ 
--resource-group myRG \ 
--cluster-name myAKSCluster \ 
--name workerpool1 \ 
--enable-cluster-autoscaler \ 
--min-count 1 \ 
--max-count 5
āž¤ Worker pool: 2
az aks nodepool update \ 
--resource-group myRG \ 
--cluster-name myAKSCluster \ 
--name workerpool2 \ 
--enable-cluster-autoscaler \ 
--min-count 1 \ 
--max-count 5
āž¤ Worker pool: 3
az aks nodepool update \ 
--resource-group myRG \ 
--cluster-name myAKSCluster \ 
--name workerpool3 \ 
--enable-cluster-autoscaler \ 
--min-count 1 \ 
--max-count 5
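
If the initial limits turn out to be too tight or too generous, the autoscaler settings can be changed on an existing pool rather than recreating it. A sketch of adjusting and, if needed, removing the autoscaler on workerpool1 (the counts are illustrative):

# Raise the floor and ceiling on a pool that already has the autoscaler enabled
az aks nodepool update \
--resource-group myRG \
--cluster-name myAKSCluster \
--name workerpool1 \
--update-cluster-autoscaler \
--min-count 2 \
--max-count 10

# Switch back to a fixed node count if autoscaling is no longer wanted
az aks nodepool update \
--resource-group myRG \
--cluster-name myAKSCluster \
--name workerpool1 \
--disable-cluster-autoscaler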

āœ… Benefits of Autoscaling:

  • Cost control: Scale down during low usage periods.
  • Resiliency: Automatically add nodes if workloads exceed capacity.
  • HA synergy: If one zone loses capacity, the autoscaler can add nodes in the remaining zones so rescheduled pods still have somewhere to run.

āœ… Conclusion

Implementing a high availability architecture in Azure Kubernetes Service (AKS) is essential for ensuring that both control and data planes can survive infrastructure failures without affecting application uptime. By distributing system nodes (managed by Azure) and worker nodes (user-managed) across multiple availability zones:

  • You minimize the risk of downtime caused by zone-level outages.
  • You increase fault tolerance by ensuring application workloads are not pinned to a single location.
  • You align with best practices for mission-critical, production-grade AKS deployments.

Whether you're hosting microservices, APIs, or stateful apps, this approach ensures that your Kubernetes environment is prepared for failure—without failing your users.

 

Updated May 09, 2025
Version 1.0