
Microsoft Mission Critical Blog

Accelerating AKS Upgrades with Fleet Manager: Finding the Right Balance

manandak
Microsoft
Feb 25, 2026

Introduction

Upgrading Azure Kubernetes Service (AKS) clusters at scale can be time-consuming, especially when managing multiple environments and clusters. Azure Fleet Manager provides powerful controls to orchestrate these upgrades efficiently. However, with this flexibility come important design considerations and trade-offs that platform teams must understand.

Disclaimer: This article draws on publicly available documentation as of February 2026 and is intended to provide insight into how Fleet Manager manages AKS upgrades, along with the key factors to consider when defining an effective upgrade strategy. The views expressed in this article are those of the author and do not necessarily reflect the official policy or position of Microsoft. The author is a Microsoft employee.

At the heart of AKS Fleet Manager upgrades are three foundational concepts: update runs, update stages, and update groups.

  • Update run: An update run represents an update being applied to a collection of AKS clusters, consisting of the update goal and sequence. The update goal describes the desired updates (for example, upgrading to Kubernetes version 1.28.3). The update sequence describes the exact order to apply the update to multiple member clusters, expressed using stages and groups. If unspecified, all the member clusters are updated one by one sequentially. An update run can be stopped and started. 
  • Update stage: Update runs are divided into stages, which are applied sequentially. For example, a first update stage might update test environment member clusters, and a second update stage would then later update production environment member clusters. A wait time can be specified to delay between the application of subsequent update stages. 
  • Update group: Each update stage contains one or more update groups, which are used to select the member clusters to be updated. Within an update stage, updates are applied to all the different update groups in parallel; within an update group, member clusters update sequentially. Each member cluster of the fleet can only be part of one update group. 
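
The relationship between stages and groups described above can be sketched in a few lines of Python. This is an illustrative model only, not the Fleet Manager API: the class names, the `validate` helper, and the `execution_waves` helper are hypothetical, and the "wave" view assumes every cluster takes roughly the same time to upgrade.

```python
from dataclasses import dataclass


@dataclass
class UpdateGroup:
    name: str
    clusters: list[str]  # member clusters, updated one after another


@dataclass
class UpdateStage:
    name: str
    groups: list[UpdateGroup]  # groups in a stage are updated in parallel
    wait_after_seconds: int = 0  # optional delay before the next stage


def validate(stages: list[UpdateStage]) -> None:
    """Raise if a member cluster appears in more than one update group."""
    seen: dict[str, str] = {}
    for stage in stages:
        for group in stage.groups:
            for cluster in group.clusters:
                if cluster in seen:
                    raise ValueError(
                        f"{cluster} is in both {seen[cluster]} and {group.name}"
                    )
                seen[cluster] = group.name


def execution_waves(stages: list[UpdateStage]):
    """Per stage, list the clusters that update at the same time.

    Wave i of a stage holds the i-th cluster of every group in that stage
    (a simplification that assumes equal upgrade time per cluster).
    """
    plan = []
    for stage in stages:
        depth = max((len(g.clusters) for g in stage.groups), default=0)
        waves = [
            [g.clusters[i] for g in stage.groups if i < len(g.clusters)]
            for i in range(depth)
        ]
        plan.append((stage.name, waves))
    return plan


# Hypothetical fleet: one test cluster, then three production clusters.
stages = [
    UpdateStage("test", [UpdateGroup("canary", ["t1"])], wait_after_seconds=3600),
    UpdateStage("prod", [UpdateGroup("eu", ["p1", "p2"]), UpdateGroup("us", ["p3"])]),
]
validate(stages)
print(execution_waves(stages))
# [('test', [['t1']]), ('prod', [['p1', 'p3'], ['p2']])]
```

In the production stage, `p1` and `p3` start together because they sit in different groups, while `p2` waits for `p1` because they share a group.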

Image source: Azure portal > AKS Fleet Manager > Upgrade Groups Explanation

The Approach 

To reduce the overall time required to complete AKS upgrades across all clusters, there are two primary levers available: 

  1. Reduce the number of update stages, since stages are upgraded sequentially. 
  2. Increase the number of update groups within a stage, as update groups are upgraded in parallel. Each update stage can contain up to 50 update groups, allowing as many as 50 AKS clusters to be upgraded concurrently. 

While both approaches can significantly speed up the upgrade process, each introduces its own risks that must be carefully considered. 
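
To get a feel for how much each lever matters, a rough back-of-the-envelope estimate can help. The sketch below assumes a fixed, hypothetical upgrade time per cluster (30 minutes) and uses a made-up helper, `estimate_run_minutes`; real upgrade times vary with node count, workloads, and surge settings.

```python
MAX_GROUPS_PER_STAGE = 50  # per-stage limit quoted in this article


def estimate_run_minutes(stages, minutes_per_cluster=30):
    """Estimate the duration of an update run.

    stages: ordered list of (group_sizes, wait_after_minutes), where
    group_sizes holds the number of clusters in each update group.
    """
    total = 0
    for group_sizes, wait_after in stages:
        if len(group_sizes) > MAX_GROUPS_PER_STAGE:
            raise ValueError("a stage can hold at most 50 update groups")
        # Groups run in parallel and clusters within a group run one by one,
        # so the largest group determines how long the stage takes.
        total += max(group_sizes, default=0) * minutes_per_cluster
        total += wait_after
    return total


# One test cluster, a 60-minute wait, then 12 production clusters.
narrow = [([1], 60), ([6, 6], 0)]    # production split into 2 groups of 6
wide = [([1], 60), ([1] * 12, 0)]    # production split into 12 groups of 1

print(estimate_run_minutes(narrow))  # 270
print(estimate_run_minutes(wide))    # 120
```

Under these assumptions, widening the production stage from two groups to twelve cuts the estimate from 270 to 120 minutes without touching the test stage or its wait time.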

 

Finding the Right Balance

  • Reducing Update Stages: Speed at the Cost of Safety 

Reducing the number of update stages typically means grouping AKS clusters from multiple environments (such as dev, test, and production) into one or two stages. Although this can shorten the overall upgrade timeline, it is not recommended.

This approach severely limits the time available to validate application behavior in lower environments before rolling changes into higher-risk environments like production. Microsoft best practices explicitly recommend keeping the first update stage small, with a minimal number of update groups. This helps contain the blast radius if a regression is introduced. 

It’s also important to note that AKS does not currently support rollback after an upgrade. If a regression occurs, the only remediation option is to provision a new AKS cluster running the previous version, which can be both time-consuming and operationally expensive. 

  • Increasing Update Groups: Parallelism with Capacity Risks 

An alternative, and generally safer, approach is to increase the number of update groups starting from the second update stage onward. This allows more clusters to be upgraded in parallel, reducing the overall upgrade duration while still preserving a controlled validation phase.

However, parallel upgrades come with their own challenges. Running multiple AKS upgrades simultaneously increases the risk of failures due to capacity constraints in an Availability Zone, particularly when node pools rely on VM SKUs with limited availability. The risk grows even further when node pools are configured with a higher Max Surge value, as more nodes are created concurrently during upgrades. 
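
The capacity pressure can be approximated before a run is started. The following is a rough upper-bound sketch, not an AKS calculation: it assumes one node pool per cluster, a percentage-based Max Surge that rounds up to a whole node, and that all clusters compete for the same VM SKU in the same zone. The function names are hypothetical.

```python
def surge_nodes(node_count: int, max_surge_percent: int) -> int:
    """Extra nodes created for one node pool, rounding the percentage up."""
    return (node_count * max_surge_percent + 99) // 100


def peak_extra_nodes(groups, max_surge_percent: int) -> int:
    """Worst-case number of surge nodes requested at the same time in a stage.

    groups: list of update groups; each group is a list of node-pool sizes,
    one entry per cluster. Clusters in a group upgrade one at a time, so a
    group contributes at most the surge of its largest cluster.
    """
    return sum(
        max((surge_nodes(n, max_surge_percent) for n in group), default=0)
        for group in groups
    )


# Hypothetical stage: three groups with node pools of different sizes.
stage = [[10, 20], [30], [15]]

print(peak_extra_nodes(stage, 10))  # 7
print(peak_extra_nodes(stage, 33))  # 22
```

With the same three groups, raising Max Surge from 10% to 33% roughly triples the number of nodes that may be requested at once, which is where a constrained SKU or zone is most likely to cause an upgrade failure.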

At the time of writing this blog, there is one important limitation to be aware of: 
If even a single AKS cluster upgrade fails, the entire Fleet upgrade run is halted. There is an open feature request to introduce a configurable safe-failure threshold, which would allow the upgrade process to continue even if a limited number of cluster upgrades fail: 

👉 https://github.com/Azure/AKS/issues/5338 
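
The practical effect of this limitation is easiest to see at stage level. The sketch below uses a hypothetical helper, `blocked_by_failure`, and only models what the text above states: once a cluster fails, the run halts, so later stages are never started. What happens to clusters already in flight in the same stage is not modelled here.

```python
def blocked_by_failure(stages, failed_cluster):
    """Find where a run halts and which later-stage clusters are not reached.

    stages: ordered list of (stage_name, [cluster names]).
    Returns (halted_stage_name, clusters_in_later_stages), or (None, [])
    if the cluster is not part of the run.
    """
    for index, (name, clusters) in enumerate(stages):
        if failed_cluster in clusters:
            not_reached = [c for _, later in stages[index + 1:] for c in later]
            return name, not_reached
    return None, []


# Hypothetical run with one test stage and two production stages.
run = [
    ("test", ["t1"]),
    ("prod-1", ["p1", "p2"]),
    ("prod-2", ["p3"]),
]

print(blocked_by_failure(run, "p1"))  # ('prod-1', ['p3'])
```

A check like this can help decide which clusters to place late in the sequence, since anything behind a failure-prone cluster waits until the failure is resolved and the run is started again.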

 

Conclusion: Designing a Thoughtful Upgrade Strategy 

While Azure Fleet Manager makes it possible to significantly reduce the overall duration of AKS upgrades, doing so safely requires thoughtful planning. The key is to strike the right balance between:

  • Reducing overall upgrade duration by increasing parallelism, and 
  • Minimizing risk and disruption by preserving adequate validation stages and respecting capacity constraints. 

Successful AKS upgrade strategies are rarely one-size-fits-all. They require close collaboration, environmental awareness, and a clear understanding of both platform limitations and operational risk. With the right design, Fleet Manager can be a powerful enabler for fast, safe, and scalable AKS upgrades. 

For some additional resources, check out the following:   

https://learn.microsoft.com/en-us/azure/kubernetes-fleet/overview 

https://learn.microsoft.com/en-us/azure/kubernetes-fleet/concepts-update-orchestration 

Updated Feb 25, 2026
Version 1.0