Azure Governance and Management Blog

Why Even Stateless AKS Clusters Might Need Backup

Microsoft

May 15, 2025

When we talk about backup solutions for Kubernetes clusters, the need is usually most obvious for stateful workloads: those managing persistent data such as databases, file systems, or applications with long-lived storage. For such workloads, backups are non-negotiable, acting as the safety net for data recovery, disaster protection, and business continuity.

But what about stateless AKS clusters?

At first glance, stateless workloads seem like perfect candidates to not back up. After all, by design:

They don’t rely on persistent volumes or databases
Their deployments can be recreated from container images, Helm charts, or Git repositories
The infrastructure is defined declaratively using infrastructure-as-code (IaC) or GitOps pipelines
CI/CD systems can tear down and recreate entire environments on demand

In such idealized environments, backup appears redundant. Any failure or disaster can be handled by simply re-running the pipeline or restoring manifests from Git. If everything is version-controlled, reproducible, and ephemeral, why invest in backup infrastructure and introduce the complexity and cost of a backup solution?

This line of thinking is valid in theory. In fully mature DevOps setups with no manual drift, strict Git hygiene, no compliance overhead, and zero state stored in-cluster, the cost and complexity of backups might outweigh their limited benefits.

So why back up a stateless cluster at all? The answer lies in the real-world operational gaps, compliance requirements, and reliability expectations that engineering teams encounter, even in seemingly “stateless” environments. In this article, we'll explore the key scenarios where backing up stateless AKS clusters adds value, even in cloud-native, DevOps-driven organizations.

When Stateless Doesn't Mean Disposable

In organizations with well-established DevOps practices, GitOps and infrastructure-as-code pipelines manage everything declaratively. The entire cluster state, from workloads to configurations, is reproducible from Git. In such environments, backup may seem redundant.

However, not every team is operating at peak DevOps maturity. Even in mature teams, gaps and exceptions exist. Let’s look at situations where backup for stateless clusters becomes essential.

It’s important to note that most Kubernetes backup solutions, including Azure Backup for AKS, capture the desired state by backing up YAML manifests, Custom Resources, and selected cluster metadata. They typically do not back up the full etcd store, which is Kubernetes’ actual source of truth. As a result, a restore usually reflects what was intended to run, not necessarily what was actually running at the moment of failure. This distinction underscores the importance of regular, automated backups: the more recent the backup, the smaller the gap between configuration intent and operational reality.

1. Absence or Inconsistency in GitOps Practices

While GitOps aims to be the single source of truth, reality often diverges from this ideal:

Manual changes are introduced in the heat of production incidents.
Drift occurs between the declared and actual cluster state.
Some components are not version-controlled (e.g., in-cluster secrets, ad hoc cron jobs, Helm releases with local values).

Consider this as an example:

A retail engineering team hot-fixed a frontend deployment by patching environment variables directly in the cluster during a Black Friday outage. They never committed the changes to Git, so when the service was later redeployed from Git, it stopped working. A backup had captured the exact deployment spec from the time of the hotfix, environment variables included, allowing the team to restore the working state without guesswork.

In such cases, a backup acts as a point-in-time capture of the actual cluster state, allowing teams to recover or audit what truly ran in the environment, not just what was intended according to the Git repositories.
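The drift in this example can be made visible by comparing the environment variables declared in Git with those in the backed-up Deployment spec. Below is a minimal Python sketch, assuming both manifests are already loaded as dictionaries (for instance, parsed from YAML). The function names and sample values are hypothetical and are not part of any Azure Backup or Kubernetes API.

```python
from typing import Dict, Tuple


def container_env(deployment: dict) -> Dict[Tuple[str, str], str]:
    """Map (container name, variable name) -> value for a Deployment manifest."""
    env = {}
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for var in container.get("env", []):
            # Variables sourced via valueFrom have no literal "value" key.
            env[(container["name"], var["name"])] = var.get("value", "<valueFrom>")
    return env


def env_drift(declared: dict, actual: dict) -> dict:
    """Report env vars that differ between the Git copy and the cluster copy."""
    want, have = container_env(declared), container_env(actual)
    return {
        "only_in_cluster": sorted(k for k in have if k not in want),
        "only_in_git": sorted(k for k in want if k not in have),
        "changed": sorted(k for k in want if k in have and want[k] != have[k]),
    }


def sample_deployment(env: list) -> dict:
    """Hypothetical helper that builds a one-container Deployment for the demo."""
    return {"spec": {"template": {"spec": {"containers": [{"name": "frontend", "env": env}]}}}}


# Illustrative data only: the Git copy versus the hot-fixed copy found in a backup.
git_copy = sample_deployment([{"name": "API_TIMEOUT", "value": "5"}])
backup_copy = sample_deployment([
    {"name": "API_TIMEOUT", "value": "30"},
    {"name": "CACHE_ENABLED", "value": "false"},
])

report = env_drift(git_copy, backup_copy)
print(report)
```

A report like this tells the team which settings to commit back to Git after the incident, so the next redeployment does not silently undo the hotfix.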

2. Compliance and Regulatory Requirements

Industries with strict governance, such as finance, healthcare, and government, often require:

Auditability of production configurations
Retention of infrastructure state for post-mortem or regulatory reviews
Proof of controls around cluster state

Even for stateless workloads, backups can satisfy these compliance demands, especially when you need to demonstrate that critical workloads and configurations can be recovered exactly as they were.

Consider this as an example:

A fintech company undergoing a routine audit had to produce a record of its production deployments from the previous quarter. Their GitOps pipeline had undergone changes, and some manifests were overwritten. Fortunately, they were able to generate historical records from AKS backup snapshots, satisfying audit requirements and avoiding potential penalties.

3. Forensic Analysis and Post-Incident Review

When investigating a production outage or security incident, engineers often need to understand:

What was deployed?
What ConfigMaps or Secrets were present?
Were any unusual workloads running?

Consider this as an example:

After a failed release, a security team launched an investigation and discovered that a crypto-mining container had been injected as a sidecar. It wasn’t in source control, and log retention had expired. A namespace-level backup taken a few hours before the incident provided a complete snapshot, enabling them to analyze the rogue container’s configuration and timeline.

Backups offer a forensic lens into the cluster state at a specific point in time. This is invaluable for root cause analysis and may reveal insights that are no longer visible due to pod churn or log retention limits.
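One way to use such a snapshot is to scan the backed-up pod specs for container images that do not come from a trusted registry. The following is a minimal Python sketch, assuming the pod specs from the backup have been exported and parsed into dictionaries. The registry name, pod names, and the `unexpected_containers` function are hypothetical illustrations, not features of a specific backup tool.

```python
from typing import Iterable, List, Tuple


def unexpected_containers(
    pods: Iterable[dict], allowed_prefixes: Tuple[str, ...]
) -> List[Tuple[str, str, str]]:
    """Return (pod, container, image) for images outside the allowed registries."""
    findings = []
    for pod in pods:
        pod_name = pod["metadata"]["name"]
        spec = pod["spec"]
        # Check init containers too, since injected workloads may hide there.
        for container in spec.get("initContainers", []) + spec.get("containers", []):
            image = container["image"]
            if not image.startswith(allowed_prefixes):
                findings.append((pod_name, container["name"], image))
    return findings


# Illustrative snapshot contents: one legitimate container and one rogue sidecar.
snapshot_pods = [
    {
        "metadata": {"name": "checkout-7d9f"},
        "spec": {
            "containers": [
                {"name": "app", "image": "contoso.azurecr.io/checkout:1.4.2"},
                {"name": "metrics-helper", "image": "docker.io/unknown/miner:latest"},
            ]
        },
    },
]

findings = unexpected_containers(snapshot_pods, ("contoso.azurecr.io/",))
for pod_name, container_name, image in findings:
    print(f"{pod_name}: container {container_name!r} uses untrusted image {image}")
```

Running the same check against several snapshots taken at different times can help narrow down when the unexpected container first appeared.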

4. Accelerated Recovery and Operational Simplicity

Even stateless applications can take time to recover in the event of:

Cluster-level failures
Region outages
Misconfigured redeployments

Compared with rehydrating from scratch, restoring a cluster or a namespace from a backup snapshot can significantly reduce the recovery time objective (RTO), especially for complex environments with multiple microservices, RBAC settings, custom resources, and interdependencies.

Consider this as an example:

An e-commerce platform hosted in a single AKS cluster experienced a regional outage. Though all apps were stateless and versioned in Git, restoring the entire namespace from a recent backup, complete with network policies, secrets, and service bindings, allowed them to relaunch in a new region within minutes, far faster than a redeployment pipeline alone would have allowed.

5. Preserving the Actual Source of Truth

In many organizations, Git is not always the full source of truth:

Onboarding of legacy applications with partial IaC coverage
Teams using Helm or Kustomize without consistent repo structures
Clusters with long-lived manual tweaks

In these cases, the cluster itself becomes the de facto source of truth. A backup ensures you can retain and reproduce what worked, even if the corresponding manifests don’t exist or are outdated.

Consider this as an example:

A SaaS provider relied on Helm charts with custom values set locally by different teams. During a platform upgrade, they lost track of multiple override configurations not committed to Git. The backup captured the running state, including those local Helm values, allowing for an exact rollback and informing future IaC improvements.
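To turn a recovered configuration into something that can be committed to Git, it helps to list exactly which values differ from the chart defaults. Here is a minimal Python sketch, assuming the chart defaults and the values recovered from the backup are both available as nested dictionaries. The `value_overrides` function and the sample values are hypothetical and do not use the Helm API.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel for keys that the chart defaults do not define


def value_overrides(
    defaults: Dict[str, Any], effective: Dict[str, Any], prefix: str = ""
) -> Dict[str, Any]:
    """Return dotted-path -> value for each setting that differs from the default."""
    overrides: Dict[str, Any] = {}
    for key, value in effective.items():
        path = f"{prefix}{key}"
        default = defaults.get(key, _MISSING)
        if isinstance(value, dict) and isinstance(default, dict):
            # Recurse so that only the changed leaf values are reported.
            overrides.update(value_overrides(default, value, prefix=f"{path}."))
        elif value != default:
            overrides[path] = value
    return overrides


# Illustrative data only: chart defaults versus values recovered from a backup.
chart_defaults = {
    "replicaCount": 2,
    "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
}
recovered_values = {
    "replicaCount": 6,
    "resources": {"limits": {"cpu": "500m", "memory": "1Gi"}},
    "featureFlags": {"newCheckout": True},
}

overrides = value_overrides(chart_defaults, recovered_values)
for path, value in sorted(overrides.items()):
    print(f"{path} = {value!r}")
```

The resulting list is a natural starting point for a values file that each team can review and commit, closing the gap that made the backup necessary in the first place.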

Conclusion

Stateless does not mean disposable, and it certainly does not mean irrelevant to backup planning. While there are cases where backup for stateless AKS clusters can be safely skipped, many real-world environments face drift, compliance demands, or operational complexity that make backups not just helpful, but necessary.

If your organization isn’t yet at peak GitOps hygiene, or operates in a regulated industry, then backing up even your stateless workloads is a wise investment. Not because they hold data, but because they hold context, and context is critical for resilience, auditability, and operational excellence.

To get started with protecting both stateless and stateful workloads, explore Azure Backup for AKS and see how it fits into your Kubernetes resilience strategy.

Updated May 15, 2025

Version 1.0

management

operational excellence

well-architected

rajats2210
