
Microsoft Mission Critical Blog

Hardening Spring Boot Health Probes on AKS: How to Prevent Restart Storms Before They Start

AndreasSemmelmann
Feb 02, 2026

Spring Boot workloads on AKS can become unstable when health probes are wired too aggressively. A brief control‑plane latency spike or transient platform degradation can cascade into probe failures—and escalate into mass pod restarts across microservices. This post shows how a few small, proven changes in probe design dramatically reduce the blast radius, improve resiliency, and eliminate restart storms in real production clusters.

Overview

Transient platform degradations can turn into outages if health probes are overly strict or wired to the wrong endpoint. In this case study, multiple incidents were reported in which many Spring Boot pods restarted in a short time window on Azure Kubernetes Service (AKS), contributing to visible downtime.

A key lesson is that probing the consolidated /actuator/health endpoint can amplify blast radius: if any health contributor degrades, the overall endpoint can report unhealthy. For example, a database contributor reporting DOWN turns /actuator/health into an HTTP 503 even though the JVM and web server are perfectly alive. When that endpoint is used for liveness, kubelet can restart pods at scale and create a feedback loop (mass restarts → node pressure → additional failures).

The remediation combined two changes:

  1. separating liveness and readiness onto the dedicated Actuator probe endpoints (/actuator/health/liveness and /actuator/health/readiness), and 
  2. tuning probe thresholds (especially timeoutSeconds) to tolerate brief latency spikes.

This article targets platform engineers and SREs and provides a baseline configuration, a troubleshooting checklist, and a simple validation approach.

Environment (for reproducibility)

This scenario was observed on AKS (Kubernetes 1.30.3) with Spring Boot 2.3.x and an NGINX Ingress Controller deployed as a separate workload. Node OS image and JDK details are not required for the probe wiring and threshold tuning discussed here.

The Challenge

On AKS, even a short-lived control plane latency spike can ripple into workload behavior if kubelet health checks are configured too aggressively. Probes are meant to protect reliability, but when they are wired to the wrong signal they can turn a brief degradation into a restart loop.

  • Problem statement: Probe design amplified a transient AKS control plane degradation into a mass pod restart event.
  • Business impact: Visible downtime and unstable service behavior due to restart storms across multiple microservices.
  • Who’s affected: SREs, platform engineers, and application teams operating Spring Boot workloads on Kubernetes (especially AKS).

 

What Happened?

We saw a familiar pattern: many Spring Boot pods restarted within a short window, and probes started failing across a large part of the fleet at the same time.

Incident timeline

On 2025-03-25, downtime was reported across multiple Spring Boot-based microservices due to widespread pod restarts, and the event was associated with elevated Kubernetes API server connectivity/latency issues on the Linux node pool. A similar pattern was reported again on 2025-06-02: probes failed for many pods in a short window, restarts followed, and the system needed ~15 minutes to stabilize while CPU/memory pressure was elevated.

Why this failure mode is common

What made the situation worse was the probe design: liveness and readiness were both wired to the same composite health endpoint (/actuator/health) and the liveness timeout was very strict. Under transient latency, that combination can turn “brief slowness” into “restart many pods”, and restarts add even more pressure to nodes and the cluster.

The Solution

We kept the solution intentionally simple and AKS-focused: reduce the blast radius during transient cluster/platform slowness, and prevent kubelet from turning short probe timeouts into mass restarts.

Concretely, we did two things:

  1. moved readiness/liveness to the dedicated Actuator probe endpoints, and
  2. increased probe timeouts/thresholds to tolerate brief latency spikes.

1) Use dedicated Actuator probe endpoints

Spring Boot Actuator exposes health endpoints under /actuator/health. The consolidated endpoint is intentionally broad (it reflects multiple health contributors). For Kubernetes probes, it is usually better to use dedicated readiness/liveness endpoints so a transient dependency issue can stop traffic without forcing restarts.

In this case:

  • readiness moved to /actuator/health/readiness (so Kubernetes stops routing traffic to an instance that is not ready),
  • liveness moved to /actuator/health/liveness (so short slowness does not trigger restarts), and
  • /actuator/health stayed in place for human-facing checks and dashboards.

This article focuses on Kubernetes probe wiring and thresholds. The exact Spring Boot Actuator configuration (application.yml, environment variables, and the enabled health groups/contributors) is application-specific and does not change the core recommendation: use the dedicated probe endpoints for readiness/liveness and tune probe thresholds for transient latency.

If you want a minimal Spring Boot baseline for these endpoints (Spring Boot 2.3+), it typically looks like this:

management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness
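
If readiness should also reflect a critical dependency, Spring Boot's health groups can include additional contributors. The sketch below is illustrative only: the db contributor name is an example, and the right contributors depend on what your application actually registers.

management:
  endpoint:
    health:
      probes:
        enabled: true
      group:
        readiness:
          include: readinessState,db   # example: gate traffic on the datasource as well
        liveness:
          include: livenessState       # keep liveness minimal so dependency issues never force restarts

Keeping the liveness group limited to livenessState is the important part: a slow or unavailable dependency can then only take an instance out of rotation, never restart it.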

2) Tune probe thresholds to match reality

If the platform experiences brief latency spikes, timeoutSeconds: 1 is often too aggressive for liveness. Keep in mind that kubelet restarts a container only after failureThreshold consecutive failures, so the effective tolerance window is roughly periodSeconds × failureThreshold. Raising timeoutSeconds, periodSeconds, and failureThreshold for liveness buys time during transient slowness, while readiness can stay comparatively tight so traffic is gated quickly.

 

Implementation (Step-by-Step)

The YAML snippets below illustrate the probe configurations used before and after the remediation.

Step 1 — Baseline probe behavior (before)

Scenario 1 (before): readiness + liveness wired to /actuator/health

 

 

Baseline/original readiness probe (as captured):

readinessProbe:
  httpGet:
    path: /actuator/health         # Spring Boot health endpoint for readiness
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3

Baseline/original liveness probe (as captured; failureThreshold not specified):

livenessProbe:
  httpGet:
    path: /actuator/health         # composite Spring Boot health endpoint used for liveness
    port: 8080
  initialDelaySeconds: 40
  periodSeconds: 15
  timeoutSeconds: 1                # any response slower than 1 s counts as a probe failure
  # failureThreshold not specified (Kubernetes default: 3)

Step 2 — Separate liveness and readiness endpoints (after)

Scenario 2 (after): readiness gates traffic, liveness avoids restart loops

 

 

Remediated readiness probe:

readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # dedicated readiness endpoint
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 5
  timeoutSeconds: 5                    # tolerates brief latency spikes
  failureThreshold: 3                  # stops traffic after ~15 s of sustained failure

Remediated liveness probe:

livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # dedicated liveness endpoint
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 30                   # each attempt may take up to 30 s before counting as a failure
  failureThreshold: 5                  # restart only after ~150 s (5 × 30 s) of sustained failure

Step 3 — Add a startup probe (recommended)

Use a startup probe to prevent liveness/readiness from flapping while the JVM warms up (classloading, DB migrations, cache priming). The values below are a safe starting point for many Spring Boot services; tune them based on observed startup time.

startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  # Allows up to 5 minutes for cold start: 30 * 10s = 300s
  failureThreshold: 30
  periodSeconds: 10
  timeoutSeconds: 5
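
If it helps to see where these stanzas live, the sketch below combines all three probes in a minimal container spec. The Deployment name, labels, and image are placeholders; the probe values are the remediated ones from above, with initialDelaySeconds omitted because the startup probe already covers warm-up.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service                               # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
        - name: app
          image: example.azurecr.io/demo-service:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            failureThreshold: 30                   # up to 30 × 10 s = 300 s for cold start
            periodSeconds: 10
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 30
            timeoutSeconds: 30
            failureThreshold: 5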

Architecture / Dataflow

The causal chain is straightforward: a transient platform issue surfaces as slower health responses, which then interacts with probe thresholds to decide whether traffic is removed or containers restart. In short: transient platform degradation → slower /actuator/health responses → probe timeouts → readiness failure (pod removed from Service endpoints, recoverable) or liveness failure (kubelet restarts the container) → restarts add node pressure → further probe failures.

Validation (How to Prove It Worked)

Validation is a simple before/after check: after the change, probe failures and restarts should drop, and short AKS/platform slowness should lead to traffic being gated (readiness) instead of mass restarts (liveness). If you can capture sanitized metrics, focus on restart rate, probe failures, ingress 5xx, recovery time, and (when available) control plane latency.
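
One low-effort spot check is to capture restart counts before and after the change (the namespace is a placeholder, and the column expression assumes single-container pods):

kubectl get pods -n <ns> -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'

Run it during the next transient slowdown: after the remediation, restart counts should stay flat while some pods may briefly drop out of readiness.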

Troubleshooting Checklist (How to Diagnose)

Use this when you see synchronized restarts across many pods.

  1. Confirm the restart pattern. Start by watching pods and checking placement.
    1. kubectl get pods -n <ns> -w
    2. kubectl get pods -n <ns> -o wide
  2. Check events and probe failures. You want to see whether kubelet is killing containers due to probe timeouts.
    1. kubectl get events -n <ns> --sort-by=.lastTimestamp
    2. kubectl describe pod <pod> -n <ns>
  3. Identify restart reasons. Look for CrashLoopBackOff, OOMKilled, and repeated probe failure events (a filtered event query for these follows the checklist).
  4. Validate Actuator endpoint behavior from inside the pod. This confirms which endpoint flips and how fast it responds.
    1. kubectl exec -n <ns> <pod> -- curl -sS -m 5 http://127.0.0.1:8080/actuator/health
    2. kubectl exec -n <ns> <pod> -- curl -sS -m 5 http://127.0.0.1:8080/actuator/health/readiness
    3. kubectl exec -n <ns> <pod> -- curl -sS -m 5 http://127.0.0.1:8080/actuator/health/liveness
  5. Correlate with AKS/platform signals. If available, correlate probe failures with control plane latency signals.
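
If the event stream is noisy, it can help to filter directly on probe-failure events (the Unhealthy reason is emitted by kubelet for both readiness and liveness failures; adjust the namespace):

kubectl get events -n <ns> --field-selector reason=Unhealthy --sort-by=.lastTimestamp

The event messages state which probe failed and why (timeout, connection refused, or an HTTP error status), which usually pinpoints whether thresholds or endpoint wiring is the problem.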

Security Notes (Don’t Create a New Exposure)

Treat Actuator as an internal-only surface. Probes need access, but that does not mean the internet does.

Controls that typically work well for this pattern:

  • Avoid routing Actuator endpoints through an internet-facing ingress.
  • If ingress is unavoidable, use internal exposure and strict allowlists.
  • Keep Actuator exposure minimal (only the health endpoints needed for probes); a minimal configuration sketch follows this list.
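
As an illustration, limiting exposure and optionally separating management traffic can look like this (the property names are standard Spring Boot; the port value is an example, and if you move the management port, the probes must target it as well):

management:
  server:
    port: 8081                   # optional: keep Actuator off the application/ingress port (example value)
  endpoints:
    web:
      exposure:
        include: health          # expose only the health endpoint; its liveness/readiness sub-paths remain available
  endpoint:
    health:
      probes:
        enabled: true

With this layout, the ingress only ever sees the application port, while kubelet probes target the management port inside the cluster.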

Discussion & Feedback

If you’ve run similar AKS incidents, I’d love to compare notes:

  • Have you seen probe failures cascade into mass restarts?
  • Do you wire liveness to a “full health” endpoint today, and why?
  • What timeout and failure threshold values have proven reliable in production?

Resources

If you want to go deeper, these references cover the probe mechanics and the Spring Boot side of the health model:

  • Kubernetes documentation: Configure Liveness, Readiness and Startup Probes
  • Spring Boot reference documentation: Actuator health groups and Kubernetes probes

⚠️ Microsoft Support Statement

This article represents field experiences and community best practices. For official Microsoft support and SLA-backed guidance:

Production issues: For production-impacting problems, contact Microsoft Support.

🔒 Customer Privacy Notice

This article describes real-world scenarios from customer engagements. All customer-specific information has been anonymized:

Company names are replaced with industry categories, exact metrics are generalized where necessary, and infrastructure details are sanitized.

🤝 Community Contribution

We welcome corrections, improvements, and additional real-world examples. If you spot an issue or have a better probe hardening pattern, share it via comments or reach out.

🤖 AI Tools Disclosure

Parts of this article were created with assistance from AI tools to improve clarity and structure. The content was reviewed and validated before publication.

Updated Feb 05, 2026
Version 2.0