Azure Compute Blog

AI-Powered Downtime Investigation for Azure VMs: Automating Root Cause Analysis

Jon_Andoni_Baranda
Apr 22, 2026

A solution that automates incident investigation for VM downtime events. Powered by the Model Context Protocol (MCP), it delivers structured root cause analysis directly to the incident ticket, without manual intervention.

Co-authors: Jie Su, Abhinav Dua, Mukthar Ahmed, Dhruv Joshi

In a previous post, we shared how Azure Automated VM Recovery works to minimize virtual machine downtime through a three-stage approach: Detection, Diagnosis, and Mitigation. This post goes one layer deeper into how our team is using AI to transform incident investigation, one of the most time-consuming parts of that process.

When an alert fires for a recovery event taking longer than expected, a DRI (Designated Responsible Individual) is notified and a ticket is opened. From there, the DRI must manually dig through logs across multiple sources, build Kusto queries from scratch, and correlate timestamps across systems to identify where time was lost. Historically, that investigation took 30 to 60 minutes per incident. On top of that, an engineering manager or TPM had to review the incident, understand the failure, and route it to the right engineer, often resulting in multiple handoffs before the right owner was found. Across a platform the size of Microsoft Azure, that time adds up. That is the problem we set out to solve.

How do we use AI for long-duration downtime investigation?

Model Context Protocol (MCP) is a standardized protocol that connects AI models to external tools; in our case, Kusto databases, log analyzers, and incident metadata extractors. Rather than generating text about what might be wrong, the AI actually runs real queries against live telemetry. Critically, this is not a chatbot. There is no interface for a DRI to interact with. When an incident fires, the system triggers automatically, runs the full investigation pipeline, and attaches a structured analysis report directly to the ticket. By the time a DRI opens the alert, the work is already done.
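Conceptually, the host registers a set of named tools and executes whatever calls the model emits against live telemetry. The sketch below illustrates that dispatch pattern with plain Python classes; the names (`ToolRegistry`, `query_vm_health`) and the stubbed result are illustrative, not the real MCP SDK or Azure telemetry.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[[dict], dict]

class ToolRegistry:
    """Minimal sketch of an MCP-style tool registry (illustrative only)."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, args: dict) -> dict:
        # The model emits a tool call; the host runs the real query
        # and hands structured results back to the model.
        return self._tools[name].handler(args)

def query_vm_health(args: dict) -> dict:
    # Stand-in for a real Kusto query against VM availability tables.
    return {"node": args["node_id"], "events": ["VMDown", "VMRecovered"]}

registry = ToolRegistry()
registry.register(Tool("query_vm_health", "Fetch VM state transitions", query_vm_health))
result = registry.call("query_vm_health", {"node_id": "node-42"})
```

The key property is that the model's output is a tool invocation, not free text: every claim in the final report traces back to a query the host actually ran.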

The real intelligence in this system goes beyond incident analysis. It comes from encoded domain knowledge about what "normal" looks like: expected recovery timelines for different error categories, log patterns that indicate specific failure modes, and the precise meaning of each phase in the healing workflow. The system knows, for example, how to distinguish a diagnostics bottleneck from a node isolation bottleneck, and what it signals when a particular isolation step runs longer than expected. This is knowledge that took our team years to accumulate, now automatically applied to every incident. Ultimately, the goal is not to replace the DRI but to eliminate the manual investigation work so they can focus on what matters most: making the right call. The system surfaces the analysis; a human always makes the final decision.
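Encoded domain knowledge of this kind can be as simple as per-category expectations for each phase, checked against observed durations. The categories and minute values below are hypothetical placeholders, not Azure's real thresholds:

```python
# Expected upper bounds (minutes) per recovery phase, by error category.
# Values are illustrative, not real Azure thresholds.
EXPECTED_PHASE_MINUTES = {
    "diagnostics": {"TTD": 2, "TTDiag": 10, "TTM": 15},
    "node_isolation": {"TTD": 2, "TTDiag": 5, "TTM": 10, "TTI": 20},
}

def flag_bottlenecks(category: str, observed: dict) -> list[str]:
    """Return the phases that ran longer than expected for this category."""
    expected = EXPECTED_PHASE_MINUTES[category]
    return [phase for phase, limit in expected.items()
            if observed.get(phase, 0) > limit]

# An isolation step at 45 minutes against a 20-minute expectation
# stands out immediately.
outliers = flag_bottlenecks("node_isolation",
                            {"TTD": 1, "TTDiag": 4, "TTM": 9, "TTI": 45})
```

With expectations written down this way, "a particular isolation step ran longer than expected" becomes a mechanical comparison rather than tribal knowledge.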

How the System Works

The investigation pipeline follows a six-step reasoning chain that mirrors how our best engineers approach manual triage.

Step 1 (Parse and Identify): The system extracts the key metadata from the incident ticket: the affected node identifier, the container identifier, the timestamp when the VM went down, and the total duration of the outage. These parameters become the inputs for everything that follows.
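A minimal sketch of this extraction step, assuming a simple key-value ticket body (the field labels and sample values are hypothetical; real tickets will differ):

```python
import re
from datetime import datetime

# Hypothetical ticket body; real incident tickets have a different shape.
TICKET = """
Node: node-7f3a
Container: ctr-0012
DowntimeStart: 2026-04-20T14:03:00Z
DurationMinutes: 95
"""

def parse_incident(text: str) -> dict:
    """Pull the four inputs the pipeline needs out of a ticket body."""
    fields = dict(re.findall(r"^(\w+):\s*(\S+)$", text, re.MULTILINE))
    return {
        "node_id": fields["Node"],
        "container_id": fields["Container"],
        "down_at": datetime.fromisoformat(
            fields["DowntimeStart"].replace("Z", "+00:00")),
        "duration_min": int(fields["DurationMinutes"]),
    }

meta = parse_incident(TICKET)
```

These four values parameterize every query in the steps that follow.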

Step 2 (Query VM Health Events): Using the extracted metadata, the AI runs automated queries against VM availability tables, retrieving the sequence of state transitions the virtual machine experienced during the incident window.

Step 3 (Check Host Health): The AI then queries host-level health event tables, examining node state changes to understand what the underlying host was doing during the same period. This establishes whether the issue originated at the VM level or at the node level.

Step 4 (Correlate Repair Service Logs): With both the VM and host picture in hand, the AI cross-references repair service logs to trace when our repair orchestration service was triggered, what actions it took, and how long each step took.
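Steps 2 through 4 all follow the same pattern: a parameterized Kusto (KQL) query scoped to the affected node and the incident window. The sketch below shows that pattern; the table and column names are illustrative placeholders, not the real Azure telemetry schema:

```python
def window_query(table: str, node_id: str, start: str, end: str) -> str:
    """Build a KQL query scoped to one node and one time window.
    Table/column names are illustrative, not the real schema."""
    return (
        f"{table}\n"
        f"| where NodeId == '{node_id}'\n"
        f"| where TimeGenerated between (datetime({start}) .. datetime({end}))\n"
        f"| order by TimeGenerated asc"
    )

window = ("2026-04-20T14:00:00Z", "2026-04-20T16:00:00Z")
queries = {
    "vm_health":   window_query("VmAvailabilityEvents", "node-7f3a", *window),  # Step 2
    "host_health": window_query("NodeHealthEvents", "node-7f3a", *window),      # Step 3
    "repair_logs": window_query("RepairServiceLogs", "node-7f3a", *window),     # Step 4
}
```

Because all three sources are queried over the same node and window, their results line up on a shared time axis, which is what makes the correlation in Step 5 possible.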

Step 5 (Build the Timeline): The AI assembles all of the retrieved data into a chronological, end-to-end timeline of the recovery event. This timeline maps directly to the three phases we track: Time to Detect (TTD), Time to Diagnose (TTDiag), and Time to Mitigate (TTM), as well as Time to Isolate (TTI) when service healing is involved.
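Once the correlated events are in hand, the phase breakdown is just differences between adjacent milestones. A sketch with hypothetical event names and timestamps:

```python
from datetime import datetime

# Illustrative milestones recovered from the correlated logs.
events = {
    "vm_down":        datetime(2026, 4, 20, 14, 3),
    "fault_detected": datetime(2026, 4, 20, 14, 5),
    "diag_complete":  datetime(2026, 4, 20, 14, 35),
    "vm_restored":    datetime(2026, 4, 20, 15, 38),
}

def phase_minutes(ev: dict) -> dict:
    """Map milestone timestamps onto the tracked phases, in minutes."""
    mins = lambda a, b: (ev[b] - ev[a]).total_seconds() / 60
    return {
        "TTD": mins("vm_down", "fault_detected"),
        "TTDiag": mins("fault_detected", "diag_complete"),
        "TTM": mins("diag_complete", "vm_restored"),
    }

timeline = phase_minutes(events)
```

In this made-up example the 95-minute outage splits into 2 minutes of detection, 30 of diagnosis, and 63 of mitigation, which points Step 6 straight at the mitigation phase.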

Step 6 (Root Cause and Report): Finally, the AI analyzes the timeline, identifies which phase contained the largest gap, determines what operation caused the bottleneck, and generates a structured investigation report that is automatically attached to the incident ticket.
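The final step reduces to picking the dominant phase and emitting a structured summary. The report fields below are illustrative, not the actual schema attached to tickets:

```python
def build_report(incident_id: str, phases: dict) -> dict:
    """Identify the bottleneck phase and package a structured report.
    Field names are illustrative, not the real ticket schema."""
    bottleneck = max(phases, key=phases.get)
    total = sum(phases.values())
    return {
        "incident": incident_id,
        "phases_min": phases,
        "bottleneck_phase": bottleneck,
        "bottleneck_share": round(phases[bottleneck] / total, 2),
    }

report = build_report("INC-12345", {"TTD": 2, "TTDiag": 30, "TTM": 63})
```

A DRI opening the ticket sees immediately that, in this hypothetical case, mitigation consumed about two thirds of the outage, so that is where the deeper investigation belongs.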

Results and conclusion

The results are measurable across three dimensions. On speed, the investigation pipeline now completes in under 5 minutes, down from 30 to 60 minutes manually, a roughly 90% reduction in investigation time that shaves 50% off total triage time. On consistency, 100% of qualifying incidents receive the same thorough analysis regardless of who is on call, with the full phase breakdown (TTD, TTDiag, TTM, and TTI) applied every time. On ownership, the generated report gives managers and TPMs immediate context to assign the incident to the right engineer from the start, eliminating the back-and-forth handoffs that previously delayed remediation and saving Engineering Managers and TPMs 10 to 20 minutes of manual work per incident.

By encoding our team's best practices into an automated pipeline, we turned a slow, inconsistent manual process into something fast, thorough, and always available. MCP offers a practical path for any engineering team to make the knowledge of their most experienced engineers universally accessible, not as documentation, but as an automated system that applies it to every incident, every time. We will continue to share updates as this evolves and would love to hear from teams working on similar problems.

Updated Apr 21, 2026
Version 1.0