Azure Migration and Modernization Blog

Autonomous Self-Healing for Azure VMware Solution Private Clouds

RohanB
Mar 27, 2026

Overview

Azure VMware Solution operates a global fleet of production private clouds, each running a full VMware NSX and vCenter Server control plane. Consider what happens when a VMware NSX Manager cluster loses quorum. NSX can surface multiple related alarms, but the documented impact is more specific than a single simultaneous cascade: management and control-plane updates stop, cluster health may degrade, and some Edge or transport-node symptoms can follow, while existing Tier-0 dynamic routing generally remains operational. In short, multiple symptoms may share one upstream fault, and each must be verified against cluster status, service health, storage, Compute Manager state, and transport-node connectivity. Without a model that encodes the directional dependencies between those layers, the alarm set is structurally indistinguishable from multiple independent simultaneous failures. An operator who responds to each alarm independently extends the outage by re-traversing the same propagation path with each action.

At production scale, NSX fault propagation consistently outpaces manual triage. The Azure VMware Solution Private Cloud Autonomous Self-Healing system is a closed-loop control architecture built to address this class of failure directly. The system correlates control-plane signals causally using a live runtime dependency graph, enforces a full policy gate stack before any automated action, acquires scoped mutual exclusion before execution begins, and verifies recovery independently before closing any incident. This article describes the architecture of the system and the design decisions that shaped it.

Architectural Components

Azure VMware Solution is a VMware-validated, first-party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It lets customers reuse their existing investments in VMware skills and tools so they can focus on developing and running their VMware-based workloads on Azure.

The diagram below describes the architectural components of the Azure VMware Solution.

 

Figure 1 – Azure VMware Solution Architectural Components

Each Azure VMware Solution architectural component has the following function:

  • Azure Subscription: Used to provide controlled access, budget and quota management for the Azure VMware Solution.
  • Azure Region: Physical locations around the world where we group data centers into Availability Zones (AZs) and then group AZs into regions.
  • Azure Resource Group: Container used to place Azure services and resources into logical groups.
  • Azure VMware Solution Private Cloud: Uses VMware software, including vCenter Server, NSX software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported.
  • Azure VMware Solution Resource Cluster: Uses VMware software, including vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads by scaling out the Azure VMware Solution private cloud. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported.
  • VMware HCX: Provides mobility, migration, and network extension services.
  • VMware Site Recovery: Provides Disaster Recovery automation and storage replication services with VMware vSphere Replication. Third-party Disaster Recovery solutions Zerto DR and JetStream DR are also supported.
  • Dedicated Microsoft Enterprise Edge (D-MSEE): Router that provides connectivity between Azure cloud and the Azure VMware Solution private cloud instance.
  • Azure Virtual Network (VNet): Private network used to connect Azure services and resources together.
  • Azure Route Server: Enables network appliances to exchange dynamic route information with Azure networks.
  • Azure Virtual Network Gateway: Cross-premises gateway for connecting Azure services and resources to other private networks using IPsec VPN, ExpressRoute, and VNet-to-VNet.
  • Azure ExpressRoute: Provides high-speed private connections between Azure data centers and on-premises or colocation infrastructure.
  • Azure Virtual WAN (vWAN): Aggregates networking, security, and routing functions together into a single unified Wide Area Network (WAN).

What Autonomous Self-Healing Delivers

Table I describes the five system-guaranteed correctness properties introduced by Autonomous Self-Healing. None of these properties existed within the Azure VMware Solution control-plane incident response path as system-enforced behaviors prior to this system.

Table I – System-Guaranteed Properties Introduced by Autonomous Self-Healing

| Capability | What Autonomous Self-Healing Does | Prior State |
|---|---|---|
| Bounded, verifiable recovery time | Measures time from the first correlated signal to verified stable recovery. | Incidents were closed when the action completed, not when recovery was verified. |
| Signal integrity at ingestion | Normalizes events, deduplicates sources, and suppresses flapping before correlation. | No normalization pipeline existed. Engineers received the raw alarm stream and established cause through pattern recognition. |
| Policy-gated execution | Checks freeze windows, risk budgets, blast radius, rate limits, and approvals atomically before execution. | No single atomic gate stack consistently enforced limits or approvals. |
| Append-only incident evidence | Stores signals, topology, decisions, workflow trace, and verification in a structured record. | Evidence was scattered across separate logs and was non-trivial to replay. |
| Progressive trust model | Supports notify-only mode so operators can inspect detections and proposed actions before enablement. | Automation was binary: there was no mechanism to observe system behavior before granting execution authority. |
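
To make the policy-gated execution property in Table I more concrete, here is a minimal Python sketch of a gate stack. It is an illustration only, not the service's implementation: the `GateInputs` fields, the `evaluate_gates` function, and every threshold are hypothetical names chosen for this example. "Atomic" is approximated by evaluating all gates against one immutable snapshot, so no gate can see different state from another.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class GateInputs:
    """Immutable snapshot of everything the gates read (all fields hypothetical)."""
    now: datetime
    freeze_windows: tuple          # ((start, end), ...) datetime pairs
    risk_budget_remaining: int
    action_risk_cost: int
    blast_radius: int
    max_blast_radius: int
    actions_last_hour: int
    rate_limit_per_hour: int
    approval_required: bool
    approval_granted: bool


def evaluate_gates(g: GateInputs) -> tuple[bool, list[str]]:
    """Evaluate every gate on the same snapshot; return (authorized, failed gate names)."""
    checks = {
        "freeze_window": not any(start <= g.now < end for start, end in g.freeze_windows),
        "risk_budget": g.action_risk_cost <= g.risk_budget_remaining,
        "blast_radius": g.blast_radius <= g.max_blast_radius,
        "rate_limit": g.actions_last_hour < g.rate_limit_per_hour,
        "approval": g.approval_granted or not g.approval_required,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)


t = datetime(2026, 3, 27, 12, 0, tzinfo=timezone.utc)
baseline = GateInputs(
    now=t, freeze_windows=(), risk_budget_remaining=5, action_risk_cost=2,
    blast_radius=3, max_blast_radius=10, actions_last_hour=0,
    rate_limit_per_hour=4, approval_required=False, approval_granted=False,
)
in_freeze = replace(
    baseline, freeze_windows=((t - timedelta(hours=1), t + timedelta(hours=1)),)
)

print(evaluate_gates(baseline))   # (True, [])
print(evaluate_gates(in_freeze))  # (False, ['freeze_window'])
```

Returning the list of failed gates, rather than a bare boolean, is one way to make a refusal explainable when the incident is escalated to an operator.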

Design Principles

Autonomous Self-Healing introduces the following seven design elements to Azure VMware Solution private cloud control-plane operations:

  • Three-plane separation (detection, decisioning, execution) isolating failure surfaces across the control loop
  • Live runtime dependency graph updated continuously from VMware NSX and vCenter Server event streams, replacing static rule sets that drift from actual topology
  • Three-input causal correlation model (evidence strength, temporal ordering, dependency directionality) distinguishing causal chains from coincident co-occurrence
  • Pre-execution blast-radius computation as a gate input, enabling proportional gate enforcement before any action is taken
  • Phase boundary model (stabilization, execution, verification) converting event-driven oscillation into a damped feedback loop with hysteresis
  • Execution contract structure (trigger, gate declaration, step specification, verification contract) enforcing scope validity and topology currency as system constraints
  • Unified append-only ledger producing identical records across automated and human-led resolution paths, enabling governance review and postmortem replay
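
As an illustration of the three-input causal correlation model listed above, the sketch below scores a candidate cause-and-effect pair. The article does not say how the three inputs are combined, so the multiplicative formula, the `max_lag_s` cutoff, and all names here are assumptions made purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateLink:
    """A hypothesised 'cause -> effect' pair between two alarming components."""
    cause: str
    effect: str
    evidence_strength: float       # 0.0 to 1.0, how strong the cause signal is
    cause_time: float              # seconds since an arbitrary epoch
    effect_time: float
    effect_depends_on_cause: bool  # looked up in the dependency graph


def causal_score(link: CandidateLink, max_lag_s: float = 120.0) -> float:
    """Combine the three inputs into one score (assumed formula, not the real model)."""
    # Dependency directionality: no edge from effect to cause means co-occurrence only.
    if not link.effect_depends_on_cause:
        return 0.0
    # Temporal ordering: the cause must come first and within a bounded lag.
    lag = link.effect_time - link.cause_time
    if lag < 0 or lag > max_lag_s:
        return 0.0
    temporal_weight = 1.0 - lag / max_lag_s
    # Evidence strength scales whatever survives the two structural checks.
    return link.evidence_strength * temporal_weight


plausible = CandidateLink("nsx-manager-cluster", "edge-cluster", 0.5, 0.0, 30.0, True)
reversed_order = CandidateLink("nsx-manager-cluster", "edge-cluster", 0.5, 30.0, 0.0, True)
unrelated = CandidateLink("vcenter", "edge-cluster", 0.9, 0.0, 30.0, False)

print(causal_score(plausible))       # 0.375
print(causal_score(reversed_order))  # 0.0
print(causal_score(unrelated))       # 0.0
```

The point of the sketch is that strong evidence alone never produces a causal link: a pair with the wrong direction or the wrong ordering scores zero regardless of signal strength.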

For in-scope failures, the result is bounded, auditable recovery time — at any hour, without operator involvement. For in-scope failures where automated remediation cannot be authorized, the result is a deterministic evidence bundle replacing engineer recollection with a structured, replayable handoff.

Architecture: Detection, Decisioning, and Execution

Autonomous Self-Healing separates detection, decisioning, and execution into distinct planes with single, testable contracts between them. Coupling these functions — the simpler approach — shares a failure surface across all three: a bug in the execution engine can corrupt evidence the correlation model depends on; a spike in alarm volume can starve the gate evaluator; a misconfigured policy gate can block signal normalization. Separation eliminates these cross-contamination failure modes.

Detection plane: Transforms raw VMware NSX and vCenter Server alarm streams into stable, discrete incident candidates. The pipeline normalizes event formats across sources, collapses redundant signals, and applies a dwell window to filter transient state changes. Candidates crossing the plane boundary are confirmed, stable units — the only form the correlation model can process correctly.
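
The detection steps above (normalize, collapse duplicates, apply a dwell window) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Event` shape, the lowercase normalization, and the `stable_candidates` helper are hypothetical and do not reflect the actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # which collector reported it (hypothetical field)
    entity: str    # component the alarm is about
    alarm: str     # alarm type
    active: bool   # True = raised, False = cleared
    ts: float      # seconds since an arbitrary epoch


def stable_candidates(events, dwell_s, now):
    """Return (entity, alarm) keys continuously active for at least dwell_s seconds."""
    active_since = {}
    for ev in sorted(events, key=lambda e: e.ts):
        # Normalize the key; dropping the source collapses duplicate reports.
        key = (ev.entity.lower(), ev.alarm.lower())
        if ev.active:
            # Keep the first raise time; later duplicates do not restart the timer.
            active_since.setdefault(key, ev.ts)
        else:
            # A clear resets the dwell timer, which suppresses flapping alarms.
            active_since.pop(key, None)
    return sorted(k for k, since in active_since.items() if now - since >= dwell_s)


events = [
    Event("nsx-api", "NSX-Manager-Cluster", "quorum_lost", True, 0.0),
    Event("syslog", "nsx-manager-cluster", "QUORUM_LOST", True, 2.0),  # duplicate
    Event("nsx-api", "edge-01", "tunnel_down", True, 10.0),
    Event("nsx-api", "edge-01", "tunnel_down", False, 15.0),           # flap
    Event("nsx-api", "edge-01", "tunnel_down", True, 50.0),
]

print(stable_candidates(events, dwell_s=30.0, now=60.0))
# [('nsx-manager-cluster', 'quorum_lost')]
```

In this example the quorum alarm is reported by two sources but yields one candidate, while the flapping Edge alarm has only been active for 10 seconds at evaluation time and is held back.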

Decisioning plane: Runs causal correlation against the live private cloud dependency graph before gate evaluation, producing a ranked root-cause hypothesis with confidence scores and a computed blast-radius estimate. The plane produces exactly one of two outputs: a gated authorization to execute, or an escalation with a complete evidence bundle.
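
A rough sketch of the decisioning step follows, assuming an acyclic dependency graph. The topology in `DEPENDENTS`, the definition of blast radius as the count of downstream components, and the `decide` function are all hypothetical simplifications. Confidence scoring is omitted to keep the example short.

```python
from collections import deque

# DEPENDENTS[x] lists components that depend directly on x (illustrative topology only).
DEPENDENTS = {
    "vcenter": ["compute-manager-link"],
    "nsx-manager-cluster": ["compute-manager-link", "edge-cluster", "transport-nodes"],
    "compute-manager-link": [],
    "edge-cluster": [],
    "transport-nodes": [],
}


def downstream(node, dependents):
    """Breadth-first walk returning every component reachable downstream of node."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def decide(alarming, dependents, max_blast_radius):
    """Return an authorization or an escalation, never anything else."""
    # A root candidate is an alarming node not downstream of another alarming node.
    roots = sorted(
        n for n in alarming
        if not any(n in downstream(o, dependents) for o in alarming if o != n)
    )
    if len(roots) != 1:
        return {"outcome": "escalate", "roots": roots}
    blast = len(downstream(roots[0], dependents))
    outcome = "authorize" if blast <= max_blast_radius else "escalate"
    return {"outcome": outcome, "roots": roots, "blast_radius": blast}


cascade = {"nsx-manager-cluster", "edge-cluster", "compute-manager-link"}
print(decide(cascade, DEPENDENTS, max_blast_radius=5))
# {'outcome': 'authorize', 'roots': ['nsx-manager-cluster'], 'blast_radius': 3}

independent = {"vcenter", "edge-cluster"}
print(decide(independent, DEPENDENTS, max_blast_radius=5))
# {'outcome': 'escalate', 'roots': ['edge-cluster', 'vcenter']}
```

Three alarms that share one upstream fault collapse to a single root, while two alarms with no dependency path between them are treated as ambiguous and escalated.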

Execution plane: Acquires a fencing token scoped to the smallest viable failure domain, runs a versioned idempotent checkpointed playbook, and closes the incident only after independent post-condition verification confirms stable recovery across a stability dwell window. Every state transition appends to the incident ledger.
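
The execution behavior described above can be illustrated with a small in-memory sketch of fencing tokens and checkpointed steps. `LeaseManager`, `run_playbook`, and the step names are hypothetical. For brevity, `acquire` always supersedes the previous holder, as if the earlier lease had expired, whereas a real lease service would also enforce exclusivity while a lease is live.

```python
class StaleTokenError(RuntimeError):
    """Raised when a holder tries to act with a superseded fencing token."""


class LeaseManager:
    """In-memory stand-in for a lease service issuing increasing fencing tokens."""

    def __init__(self):
        self._next_token = 1
        self._latest = {}  # scope -> most recently issued token

    def acquire(self, scope):
        token = self._next_token
        self._next_token += 1
        self._latest[scope] = token
        return token

    def is_current(self, scope, token):
        return self._latest.get(scope) == token


def run_playbook(steps, scope, token, leases, checkpoint, ledger):
    """Run steps in order, skipping checkpointed ones and fencing each action."""
    for name, action in steps:
        if name in checkpoint:
            continue  # already completed in an earlier attempt, so do not repeat it
        if not leases.is_current(scope, token):
            raise StaleTokenError(f"token {token} superseded for scope {scope}")
        action()
        checkpoint.add(name)
        ledger.append((scope, token, name))  # every transition is recorded


calls = []
steps = [
    ("restart-service", lambda: calls.append("restart-service")),
    ("rejoin-node", lambda: calls.append("rejoin-node")),
]
leases, checkpoint, ledger = LeaseManager(), set(), []
scope = "nsx-manager-cluster"

first = leases.acquire(scope)
run_playbook(steps, scope, first, leases, checkpoint, ledger)
run_playbook(steps, scope, first, leases, checkpoint, ledger)  # retry is a no-op
print(calls)  # ['restart-service', 'rejoin-node']

second = leases.acquire(scope)  # a newer holder takes over the scope
try:
    run_playbook(steps, scope, first, leases, set(), ledger)
except StaleTokenError as err:
    print("rejected:", err)
```

The retry shows idempotence through checkpoints, and the final call shows the fence: once a newer token exists for the scope, the older holder is refused before it can run any step.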

 

Figure 2 – Detection, Decisioning, and Execution: Autonomous Self-Healing Control Loop

Incident Ledger

Autonomous Self-Healing produces a structured, append-only ledger for every incident regardless of resolution path. Five categories are captured in sequence: raw and normalized signals with suppression outcomes; a topology snapshot at detection time; the full decision record including correlation results, root-cause ranking, blast-radius estimate, and gate evaluation trace; the workflow trace with step metadata and lease identifiers; and the verification outcome with post-condition results and stability dwell disposition. Automated and human-led paths produce the same record structure — a governance requirement, not a design preference. The reconstruction is deterministic: given the same ledger, two reviewers reconstruct the same incident timeline.
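
A minimal sketch of an append-only ledger with deterministic replay is shown below. The five category names mirror the description above, but the `IncidentLedger` class and the SHA-256 hash chain used for tamper evidence are assumptions added for illustration. The article does not state how append-only behavior is enforced.

```python
import hashlib
import json

CATEGORIES = ("signals", "topology", "decision", "workflow", "verification")


def _digest(category, payload, prev):
    """Hash an entry together with the previous digest to chain the records."""
    body = json.dumps({"category": category, "payload": payload, "prev": prev},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class IncidentLedger:
    """Append-only record; there is deliberately no update or delete method."""

    def __init__(self):
        self._entries = []

    def append(self, category, payload):
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        prev = self._entries[-1]["digest"] if self._entries else ""
        self._entries.append({
            "category": category, "payload": payload,
            "prev": prev, "digest": _digest(category, payload, prev),
        })

    def timeline(self):
        """Deterministic reconstruction: same entries always give the same timeline."""
        return [(seq, e["category"], e["payload"])
                for seq, e in enumerate(self._entries)]

    def verify(self):
        """Recompute the chain; any edited entry breaks every later digest."""
        prev = ""
        for e in self._entries:
            if e["prev"] != prev or e["digest"] != _digest(e["category"], e["payload"], prev):
                return False
            prev = e["digest"]
        return True


def record_incident(resolved_by):
    ledger = IncidentLedger()
    ledger.append("signals", {"alarm": "quorum_lost"})
    ledger.append("decision", {"root": "nsx-manager-cluster"})
    ledger.append("verification", {"stable": True, "resolved_by": resolved_by})
    return ledger


automated, replayed = record_incident("automation"), record_incident("automation")
print(automated.timeline() == replayed.timeline())  # True
print(automated.verify())                           # True
```

Because both the automated and human-led paths would call the same `append` method with the same categories, the record structure is identical by construction, which is what makes a shared replay tool possible.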

 

Figure 3 – The Incident Ledger: Audit, Replay, and Governance

Summary

Autonomous Self-Healing handles a defined subset of NSX and vCenter control-plane failures within an Azure VMware Solution private cloud. The system does not handle data-plane failures, storage faults, hypervisor crashes, hardware failures, or control-plane failures outside its modeled dependency graph. It does not run arbitrary scripts, bypass RBAC controls, or override tenant isolation boundaries. Bounded scope is the source of trustworthiness within the system — a system attempting to remediate everything carries failure modes proportional to its reach. When Autonomous Self-Healing cannot act, the evidence bundle it produces provides a complete, structured handoff for operator response.

To learn more about Azure VMware Solution:

Author Bio

Rohan Bhosle is a Principal Software Engineering Manager within Microsoft Azure with more than 19 years of experience leading deeply technical work in hyperscale cloud networking, distributed control planes, and large-scale AI infrastructure. Their background includes SDN, multi-tenant isolation and policy enforcement, datacenter and cloud network architecture, traffic engineering, routing, load balancing, telemetry, and large-scale reliability engineering and operations. Their work also includes networking infrastructure for hyperscale AI/GPU clusters, with a focus on the performance, resilience, and operational rigor needed to support next-generation AI systems.

Updated Mar 27, 2026
Version 1.0