Microsoft Developer Community Blog

High Availability Testing for Azure Kubernetes Service in a Single Region with Availability Zones

jiteshhp
May 04, 2026

Modern cloud‑native applications deployed on Azure Kubernetes Service must continue operating during infrastructure failures such as node crashes or availability zone outages. While Azure Kubernetes Service provides built‑in resiliency capabilities, validating workload behavior through structured high availability (HA) testing is critical before production deployment. This article presents a practical high availability testing strategy for Azure Kubernetes Service–based applications deployed in a single Azure region with Availability Zones, enabling engineering teams to identify potential single points of failure and evaluate workload resiliency across compute and platform layers.

During migration testing engagements for enterprise workloads moving to Azure Kubernetes Service (AKS), we frequently validate high availability behavior under simulated infrastructure failures within a single Azure region using Availability Zones.

In one such workload migration scenario involving a multi‑replica API service, infrastructure‑level HA configuration alone did not guarantee runtime resiliency during node or zone‑level disruptions. This required us to perform controlled failure simulations at pod, node, and availability zone levels to observe application behavior under real traffic conditions.

A properly designed HA architecture helps maintain service availability during component failures, recover workloads automatically, and reduce single points of failure.

Microsoft’s HA guidance for Azure Kubernetes Service is broadly based on four key principles:

  • Redundancy
  • Monitoring
  • Recovery
  • State Management (Checkpointing)

Basic HA Architecture:

Key HA Components:

  • Multi-zone node pools
  • Replica pods
  • Load balancing
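
For reference, a zone‑spanning user node pool can be added with the Azure CLI. The sketch below assumes placeholder resource group and cluster names and a region that supports three availability zones:

  # add a user node pool whose nodes are spread across zones 1, 2 and 3
  az aks nodepool add \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --name userpool \
    --node-count 3 \
    --zones 1 2 3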

Why High Availability Testing is Critical

Configuring HA at the infrastructure level does not automatically guarantee application resilience. In real‑world production environments:

  • Misconfigured health probes can disrupt failover
  • Pods may be unintentionally scheduled in a single zone
  • External dependencies such as databases or APIs may fail silently

Therefore, HA testing validates the runtime behavior of the system under failure conditions rather than relying solely on architectural assumptions. 

Types of High Availability Testing in Azure Kubernetes Service

1. Pod-Level Failure Testing

Simulate:

  • Pod crashes
  • Container failures

Expected Outcome:

  • Pod is automatically restarted
  • Traffic is rerouted through Kubernetes Service endpoints 
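
A simple way to exercise this scenario is to delete a running pod while watching its controller create a replacement (the pod name and namespace below are placeholders):

  kubectl delete pod <pod-name> -n <namespace>
  # watch the ReplicaSet schedule a replacement pod
  kubectl get pods -n <namespace> -w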

2. Node-Level Failure Testing

Simulate:

  • Node failure or shutdown

Expected Outcome:

  • Pods are rescheduled to healthy nodes
  • Application remains available, provided multiple replicas are configured
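
One way to simulate a node outage without shutting down the underlying VM is to cordon and drain the node so its pods are evicted and rescheduled elsewhere; a minimal sketch:

  # mark the node unschedulable, then evict its workloads
  kubectl cordon <node-name>
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  # confirm pods land on the remaining healthy nodes
  kubectl get pods -o wide -w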

3. Availability Zone Failure Testing

Simulate:

  • Complete outage of an availability zone

Example approach:

  • Drain nodes within a specific zone
  • Inject failures using chaos engineering tools

Expected Outcome:

  • Traffic is served from workloads running in other zones 
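
Draining every node in one zone approximates the loss of that zone's compute capacity (it does not reproduce a full zone outage, but it is a useful first test). The zone label value below is an example and depends on your region:

  # evict workloads from all nodes labeled with the target zone
  for node in $(kubectl get nodes -l topology.kubernetes.io/zone=<region>-1 \
      -o jsonpath='{.items[*].metadata.name}'); do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  done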

4. Network Failure Testing

Simulate:

  • Network latency
  • Packet loss
  • DNS failures

Recommended Tool:

  • Azure Chaos Studio 
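
Chaos Studio's AKS service‑direct faults run on Chaos Mesh installed in the cluster, so the injected fault is ultimately described by a Chaos Mesh manifest. A rough sketch of a latency fault is shown below; the namespace and label selector are placeholders:

  apiVersion: chaos-mesh.org/v1alpha1
  kind: NetworkChaos
  metadata:
    name: api-latency
    namespace: chaos-testing
  spec:
    action: delay          # inject artificial latency
    mode: all              # apply to every matching pod
    selector:
      namespaces:
        - <app-namespace>
      labelSelectors:
        app: <api-label>
    delay:
      latency: "200ms"
      jitter: "50ms"
    duration: "5m"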

5. Dependency Failure Testing

Simulate:

  • Database unavailability
  • External API failures

Expected Outcome:

  • Retry mechanisms are triggered
  • Circuit breaker patterns activate as designed 
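
A low‑tech way to simulate an unreachable external dependency is an egress‑blocking NetworkPolicy applied to the API pods. This sketch assumes the cluster has a network policy engine enabled (for example Azure CNI with network policy, or Calico); it permits only DNS and blocks all other outbound traffic, including the database:

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: simulate-dependency-outage
    namespace: <app-namespace>
  spec:
    podSelector:
      matchLabels:
        app: <api-label>
    policyTypes:
      - Egress
    egress:
      # allow DNS lookups only; everything else (databases, external APIs) is blocked
      - to:
          - namespaceSelector: {}
        ports:
          - protocol: UDP
            port: 53
          - protocol: TCP
            port: 53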

Test Execution Approach:

During each test scenario:

  a. Monitor pod scheduling using: kubectl get pods -o wide

  b. Capture Kubernetes events using: kubectl get events

  c. Validate node health using: kubectl get nodes

  d. Confirm replica redistribution across available nodes/zones

  e. Observe workload recovery behavior and readiness probe response

These validation steps help confirm whether application replicas remain available during infrastructure disruptions.

Testing Context

This HA validation workflow was executed as part of pre‑production migration testing for customer workloads transitioning from legacy hosting environments to Azure Kubernetes Service.

The objective was to ensure that application availability remained within SLA thresholds during infrastructure‑level disruptions simulated within a single Azure region using availability zones.

Expected Outcomes:

Successful HA testing should validate:

  a. Automatic pod recovery upon failure

  b. Replica redistribution across available nodes

  c. Service availability during node disruptions

  d. Traffic continuity during pod rescheduling

  e. Workload resiliency across availability zones

Chaos Engineering for HA Validation

Chaos Engineering involves intentionally injecting failures into the system to validate resilience under unpredictable runtime conditions.

To simulate network latency and node‑level disruptions during live traffic testing, we used Azure Chaos Studio to inject controlled failures within one availability zone while monitoring workload behavior through Azure Monitor.

This helped us validate whether replica pods deployed across other zones continued serving traffic without prolonged request failures during failover events.

End‑to‑End HA Test Workflow in Azure Kubernetes Service

Objective

Validate that an API service deployed on Azure Kubernetes Service maintains high availability and minimal downtime during failure scenarios across availability zones. 

Pre‑Requisites

  • API deployed with multiple replicas (≥2)
  • Pods distributed across multiple availability zones
  • Azure Load Balancer or Ingress configured
  • Liveness and readiness probes enabled
  • Monitoring via Azure Monitor or Prometheus 
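
A minimal Deployment that satisfies these pre‑requisites might look like the sketch below; the image, port, and probe path are placeholders, and the topology spread constraint is one common way to keep replicas spread across zones:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: api-service
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: api-service
    template:
      metadata:
        labels:
          app: api-service
      spec:
        # keep replicas evenly spread across availability zones
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: api-service
        containers:
          - name: api
            image: <registry>/<api-image>:<tag>
            ports:
              - containerPort: 8080
            readinessProbe:
              httpGet:
                path: /healthz
                port: 8080
              periodSeconds: 10
            livenessProbe:
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 20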

Step 1 – Multi‑Zone Deployment Validation

Ensure that pods are distributed across zones with no single point of failure. 
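
Zone distribution can be verified by listing the zone label on each node and correlating it with pod placement:

  # show which zone each node belongs to
  kubectl get nodes -L topology.kubernetes.io/zone
  # confirm pods are spread across nodes in different zones
  kubectl get pods -o wide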

Step 2 – Baseline Traffic Validation

Generate continuous traffic using tools such as:

  • JMeter

Expected Outcome:

  • Load balancer distributes traffic across pods in different zones
  • Requests are served successfully from all zones 
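
For lightweight smoke validation, a shell loop that records a status code per request can stand in for a full JMeter plan; the service endpoint below is a placeholder:

  # issue one request per second and log timestamp + HTTP status
  while true; do
    code=$(curl -s -o /dev/null -w "%{http_code}" http://<service-endpoint>/health)
    echo "$(date -u +%H:%M:%S) $code" >> baseline-traffic.log
    sleep 1
  done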

Step 3 – Failure Injection

Inject failure during live traffic:

  kubectl delete pod <pod-name>

  kubectl drain <node-name> --ignore-daemonsets

During live traffic execution, we simulated pod and node failure within a single availability zone using kubectl drain and pod deletion commands.

Following failure injection, we monitored:

- Replica redistribution across alternate zones

- Endpoint updates within Kubernetes Service

- Readiness probe behavior during replacement pod scheduling

We observed temporary latency spikes during pod rescheduling, while requests continued to be served from workloads running in alternate zones.
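
While the drain is in progress, rescheduling and endpoint churn can be watched live in separate terminals (the Service name is a placeholder):

  # watch replacement pods being scheduled onto other nodes/zones
  kubectl get pods -o wide -w
  # watch unhealthy pod IPs being removed from the Service endpoints
  kubectl get endpoints <service-name> -w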

Step 4 – Failover Validation

Verify failover behavior using the following validation commands:

  kubectl get pods -o wide

  kubectl get endpoints

  kubectl describe pod <pod-name>

  kubectl get events --sort-by=.metadata.creationTimestamp

Confirm that:

  - Traffic is no longer routed to unhealthy pods

  - Replacement pods are scheduled on healthy nodes in alternate availability zones

  - Requests are redirected to healthy workloads

  - No prolonged request failures are observed during pod rescheduling

Step 5 – Self‑Healing Validation

Confirm:

  • Azure Kubernetes Service restores desired replica count
  • Scheduler provisions replacement pods in available zones 
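
Once self‑healing has been confirmed, drained nodes can be returned to the scheduler and the restored replica count verified (node and deployment names are placeholders):

  kubectl uncordon <node-name>
  # should report the full desired replica count once recovery completes
  kubectl get deployment <deployment-name> -o jsonpath='{.status.readyReplicas}'
  kubectl get pods -o wide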

Step 6 – Traffic Rebalancing

Validate:

  • Traffic redistribution across recovered pods
  • Restoration of steady‑state multi‑zone deployment 

Step 7 – Observability and Metrics Analysis

Monitor the following metrics during the test:

  • Recovery Time Objective (RTO)
  • Error Rate (%)
  • Latency (P95 / P99)
  • Throughput
  • Pod Scheduling Time 
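
If Prometheus is used for monitoring, these metrics map to straightforward queries. The HTTP metric and label names below are assumptions based on a typical instrumented service, while the restart metric comes from kube-state-metrics:

  # P95 request latency over a 5-minute window
  histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

  # Error rate (%) of 5xx responses
  100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

  # Container restarts observed during the test window
  increase(kube_pod_container_status_restarts_total[30m])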

Step 8 – Availability Validation

Expected Outcome:

  • Zero or minimal request failures
  • Temporary latency spikes during failover
  • Continuous system availability within SLA thresholds 

Key Metrics to Track

  • Recovery Time
  • Error Rate
  • Latency
  • Pod Restart Count
  • Availability Percentage 

Common Mistakes in HA Testing

Avoid the following anti‑patterns:

  • Single replica deployments
  • Missing readiness probes
  • Pods scheduled within a single node or zone
  • Ignoring database HA
  • Not testing real failure scenarios 

High Availability vs Disaster Recovery

Aspect   | High Availability                     | Disaster Recovery
Scope    | Within a region                       | Across regions
Goal     | Minimize downtime                     | Recover from disaster
Example  | Multi‑zone Azure Kubernetes Service   | Multi‑region Azure Kubernetes Service

Important Note: High Availability leverages availability zones, while Disaster Recovery requires multi‑region deployment strategies.

Next Steps: Based on these migration testing outcomes, integrating HA validation into pre‑production test cycles enabled early identification of zone‑level scheduling issues and readiness probe misconfigurations, improving workload resiliency prior to production cutover.

 
