Modern cloud‑native applications deployed on Azure Kubernetes Service must continue operating during infrastructure failures such as node crashes or availability zone outages. While Azure Kubernetes Service provides built‑in resiliency capabilities, validating workload behavior through structured high availability (HA) testing is critical before production deployment. This article presents a practical high availability testing strategy for Azure Kubernetes Service–based applications deployed in a single Azure region with Availability Zones, enabling engineering teams to identify potential single points of failure and evaluate workload resiliency across compute and platform layers.
During migration testing engagements for enterprise workloads moving to Azure Kubernetes Service (AKS), we frequently validate high availability behavior under simulated infrastructure failures within a single Azure region using Availability Zones.
In one such workload migration scenario involving a multi‑replica API service, infrastructure‑level HA configuration alone did not guarantee runtime resiliency during node‑ or zone‑level disruptions. Validating resiliency therefore required controlled failure simulations at the pod, node, and availability zone levels to observe application behavior under real traffic conditions.
A properly designed HA architecture helps maintain service availability during component failures, recover workloads automatically, and reduce single points of failure.
Microsoft’s HA guidance for Azure Kubernetes Service is broadly based on four key principles:
- Redundancy
- Monitoring
- Recovery
- State Management (Checkpointing)
## Basic HA Architecture
Key HA components include:
- Multi-zone node pools
- Replica pods
- Load balancing
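The components above can be sketched as a deployment manifest. This is a hypothetical fragment: the names, image, and replica count are placeholders to adapt per workload. The `topologySpreadConstraints` block asks the scheduler to keep replicas evenly spread across availability zones:

```yaml
# Hypothetical deployment spreading replicas across availability zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        # Keep the per-zone replica count within 1 of every other zone.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: <registry>/api:latest
          ports:
            - containerPort: 8080
```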
## Why High Availability Testing is Critical
Configuring HA at the infrastructure level does not automatically guarantee application resilience. In real‑world production environments:
- Misconfigured health probes can disrupt failover
- Pods may be unintentionally scheduled in a single zone
- External dependencies such as databases or APIs may fail silently
Therefore, HA testing validates the runtime behavior of the system under failure conditions rather than relying solely on architectural assumptions.
## Types of High Availability Testing in Azure Kubernetes Service
### 1. Pod-Level Failure Testing
Simulate:
- Pod crashes
- Container failures
Expected Outcome:
- Pod is automatically restarted
- Traffic is rerouted through Kubernetes Service endpoints
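A minimal pod-level drill can be run with kubectl; the pod name, namespace, and label selector below are placeholders:

```shell
# Delete one replica and let the ReplicaSet recreate it.
kubectl delete pod api-6d4c9f7b8-x2kqz -n demo

# Watch the deployment return to its desired replica count.
kubectl get pods -n demo -l app=api -w

# Confirm the Service still has ready endpoints during the restart.
kubectl get endpoints api -n demo
```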
### 2. Node-Level Failure Testing
Simulate:
- Node failure or shutdown
Expected Outcome:
- Pods are rescheduled to healthy nodes
- Application remains available if replica configuration exists
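A node-level drill can cordon and drain one node and confirm rescheduling; the AKS VMSS-style node name below is a placeholder:

```shell
# Mark the node unschedulable, then evict its pods.
kubectl cordon aks-nodepool1-12345678-vmss000001
kubectl drain aks-nodepool1-12345678-vmss000001 \
  --ignore-daemonsets --delete-emptydir-data

# Verify the evicted replicas were rescheduled on healthy nodes.
kubectl get pods -o wide -l app=api

# Return the node to service after the test.
kubectl uncordon aks-nodepool1-12345678-vmss000001
```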
### 3. Availability Zone Failure Testing
Simulate:
- Complete outage of an availability zone
Example approach:
- Drain nodes within a specific zone
- Inject failures using chaos engineering tools
Expected Outcome:
- Traffic is served from workloads running in other zones
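One way to approximate a zone outage, assuming the standard `topology.kubernetes.io/zone` node label (the zone value is a placeholder), is to drain every node in that zone:

```shell
# Drain all nodes in one availability zone to simulate a zone outage.
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=eastus2-1 -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# Confirm traffic-serving replicas now run only in the remaining zones.
kubectl get pods -o wide -l app=api

# Restore the zone after the test.
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=eastus2-1 -o name); do
  kubectl uncordon "$node"
done
```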
### 4. Network Failure Testing
Simulate:
- Network latency
- Packet loss
- DNS failures
Recommended Tool:
- Azure Chaos Studio
### 5. Dependency Failure Testing
Simulate:
- Database unavailability
- External API failures
Expected Outcome:
- Retry mechanisms are triggered
- Circuit breaker patterns activate as designed
## Test Execution Approach
During each test scenario:
a. Monitor pod scheduling using `kubectl get pods -o wide`
b. Capture Kubernetes events using `kubectl get events`
c. Validate node health using `kubectl get nodes`
d. Confirm replica redistribution across available nodes/zones
e. Observe workload recovery behavior and readiness probe response
These validation steps help confirm whether application replicas remain available during infrastructure disruptions.
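The checks above can be combined into a small observation loop run for the duration of each scenario; the `app=api` label selector is a placeholder:

```shell
# Record pod placement and phase every five seconds during a scenario.
while true; do
  date
  kubectl get pods -l app=api \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PHASE:.status.phase'
  sleep 5
done
```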
## Testing Context
This HA validation workflow was executed as part of pre‑production migration testing for customer workloads transitioning from legacy hosting environments to Azure Kubernetes Service.
The objective was to ensure that application availability remained within SLA thresholds during infrastructure‑level disruptions simulated within a single Azure region using availability zones.
## Expected Outcomes
Successful HA testing should validate:
a. Automatic pod recovery upon failure
b. Replica redistribution across available nodes
c. Service availability during node disruptions
d. Traffic continuity during pod rescheduling
e. Workload resiliency across availability zones
## Chaos Engineering for HA Validation
Chaos Engineering involves intentionally injecting failures into the system to validate resilience under unpredictable runtime conditions.
To simulate network latency and node‑level disruptions during live traffic testing, we used Azure Chaos Studio to inject controlled failures within one availability zone while monitoring workload behavior through Azure Monitor.
This helped us validate whether replica pods deployed across other zones continued serving traffic without prolonged request failures during failover events.
## End‑to‑End HA Test Workflow in Azure Kubernetes Service
### Objective
Validate that an API service deployed on Azure Kubernetes Service maintains high availability and minimal downtime during failure scenarios across availability zones.
### Prerequisites
- API deployed with multiple replicas (≥2)
- Pods distributed across multiple availability zones
- Azure Load Balancer or Ingress configured
- Liveness and readiness probes enabled
- Monitoring via Azure Monitor or Prometheus
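Probe configuration is a common failure point, so it is worth sketching. This is a hypothetical container fragment; the paths, port, and thresholds are placeholders to tune per workload:

```yaml
# Hypothetical probe configuration for the API container.
livenessProbe:
  httpGet:
    path: /healthz    # restart the container if this stops responding
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready      # remove the pod from Service endpoints when not ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```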
### Step 1 – Multi‑Zone Deployment Validation
Ensure that pods are distributed across zones with no single point of failure.
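Assuming the standard zone topology label, the spread can be verified as follows; the label selector is a placeholder, and the dots in the label key are escaped for `custom-columns`:

```shell
# Map each node to its availability zone.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'

# Confirm replicas are spread across nodes in different zones.
kubectl get pods -o wide -l app=api
```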
### Step 2 – Baseline Traffic Validation
Generate continuous traffic using a load testing tool such as Apache JMeter.
Expected Outcome:
- Load balancer distributes traffic across pods in different zones
- Requests are served successfully from all zones
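As a lightweight stand-in for a full JMeter test plan, a simple curl loop can provide continuous baseline traffic with a running failure count; the URL is a placeholder for the load balancer or ingress address:

```shell
# Send one request per second and count failures during the test window.
total=0; failed=0
while true; do
  total=$((total + 1))
  curl -fsS --max-time 2 "http://<external-ip>/healthz" > /dev/null \
    || failed=$((failed + 1))
  echo "requests=${total} failed=${failed}"
  sleep 1
done
```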
### Step 3 – Failure Injection
Inject failures during live traffic:
- `kubectl delete pod <pod-name>`
- `kubectl drain <node-name> --ignore-daemonsets`
During live traffic execution, we simulated pod and node failure within a single availability zone using kubectl drain and pod deletion commands.
Following failure injection, we monitored:
- Replica redistribution across alternate zones
- Endpoint updates within Kubernetes Service
- Readiness probe behavior during replacement pod scheduling
We observed temporary latency spikes during pod rescheduling, while requests continued to be served from workloads running in alternate zones.
### Step 4 – Failover Validation
Verify failover behavior using the following validation commands:
- `kubectl get pods -o wide`
- `kubectl get endpoints`
- `kubectl describe pod <pod-name>`
- `kubectl get events --sort-by=.metadata.creationTimestamp`
Confirm that:
- Traffic is no longer routed to unhealthy pods
- Replacement pods are scheduled on healthy nodes in alternate availability zones
- Requests are redirected to healthy workloads
- No prolonged request failures are observed during pod rescheduling
### Step 5 – Self‑Healing Validation
Confirm:
- Azure Kubernetes Service restores desired replica count
- Scheduler provisions replacement pods in available zones
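These checks can be scripted; the deployment name `api` below is a placeholder:

```shell
# Compare desired vs. ready replicas for the deployment.
kubectl get deployment api \
  -o jsonpath='{.spec.replicas} {.status.readyReplicas}{"\n"}'

# Block until the rollout reports the deployment fully available again.
kubectl rollout status deployment/api --timeout=5m
```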
### Step 6 – Traffic Rebalancing
Validate:
- Traffic redistribution across recovered pods
- Restoration of steady‑state multi‑zone deployment
### Step 7 – Observability and Metrics Analysis
Monitor the following metrics during the test:
- Recovery Time Objective (RTO)
- Error Rate (%)
- Latency (P95 / P99)
- Throughput
- Pod Scheduling Time
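Error rate and availability percentage can be derived directly from the request counts captured by the load tool; the counts below are illustrative only:

```shell
# Derive error rate and availability from total and failed request counts.
total=12000
failed=36
error_rate=$(awk -v t="$total" -v f="$failed" 'BEGIN { printf "%.2f", (f / t) * 100 }')
availability=$(awk -v t="$total" -v f="$failed" 'BEGIN { printf "%.3f", ((t - f) / t) * 100 }')
echo "Error rate: ${error_rate}%"      # Error rate: 0.30%
echo "Availability: ${availability}%"  # Availability: 99.700%
```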
### Step 8 – Availability Validation
Expected Outcome:
- Zero or minimal request failures
- Temporary latency spikes during failover
- Continuous system availability within SLA thresholds
## Key Metrics to Track
- Recovery Time
- Error Rate
- Latency
- Pod Restart Count
- Availability Percentage
## Common Mistakes in HA Testing
Avoid the following anti‑patterns:
- Single replica deployments
- Missing readiness probes
- Pods scheduled within a single node or zone
- Ignoring database HA
- Not testing real failure scenarios
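Several of these anti‑patterns can be mitigated with a PodDisruptionBudget, which prevents voluntary disruptions such as node drains from taking down every replica at once. This is a hypothetical manifest; the name, selector, and threshold are placeholders:

```yaml
# Keep at least two API replicas available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```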
## High Availability vs Disaster Recovery
| Aspect | High Availability | Disaster Recovery |
|---|---|---|
| Scope | Within Region | Across Regions |
| Goal | Minimize Downtime | Recover from Disaster |
| Example | Multi‑Zone Azure Kubernetes Service | Multi‑Region Azure Kubernetes Service |
Important Note: High Availability leverages availability zones, while Disaster Recovery requires multi‑region deployment strategies.
## Resources
- Azure services that support availability zones – learn which Azure services provide availability zone support, including zonal and zone-redundant options, and the requirements some services have.
- Azure region pairs and nonpaired regions – learn about Azure region pairs and regions without a pair.
- List of Azure regions – find Azure regions, their physical locations, geographies, availability zone support, and corresponding paired regions where applicable.
- Reliability guides for Azure services – see reliability guides for Azure products and services, covering transient fault handling, availability zones, and multi-region support.
- What are Azure regions? – learn about Azure regions and how to use them to design resilient solutions.
- Reliability in Azure Virtual Machines – learn about resiliency in Azure Virtual Machines, including resilience to transient faults, availability zone failures, region-wide failures, and service maintenance, plus backup options and SLA details.
- Availability options for Azure Virtual Machines
## Next Steps
Based on our migration testing outcomes, integrating HA validation into pre‑production test cycles enabled early identification of zone‑level scheduling issues and readiness probe misconfigurations, improving workload resiliency before production cutover.