
Azure Compute Blog

Revolutionizing Reliability: Introducing the Azure Failure Prediction and Detection (AFPD) system

andrewb710
Oct 31, 2025

Azure’s shift-left solution for preventing workload downtime and raising the bar for Azure fleet health

Blog authored by: Ayberk Ozturk, Andrew Boyd, Otis Smith, Sameer Hussain, Joao Madureira, Ronit Sharma, Isha Bhatia, Halley Ding, Parvaneh Alavi, Jelena Ilic, Kevin Meehan, Steven Li, Arhatha Bramhanand, Nathan Ernst, Abhishek Sanghai, Adam Wilson, Blake Wheaton, Dhruv Matta, Olubusola Femi-Fowode, Shweta Patil, and Tajinder Pal Singh Ahluwalia

Introduction 

As part of the journey to consistently improve Azure reliability and platform stability, we launched Azure Failure Prediction & Detection (AFPD), Azure’s premier shift-left reliability solution. AFPD became operational in 2024, unifying failure prediction, detection, mitigation, and remediation services into a single end-to-end system with the goal of preventing Azure Compute customer workload interruptions and repairing nodes at scale. AFPD builds upon previous reliability solutions such as Project Narya, adding new best practices and fleet health management capabilities on top of pre-existing failure prediction and mitigation capabilities. The end-to-end AFPD system has further reduced the overall number of reboots by over 36% and allows for a proactive approach to maintaining the cloud. This system operates for all Azure Compute General Purpose, Specialized Compute, and High Performance Computing (HPC)/Artificial Intelligence (AI) workloads, as well as select Azure Storage scenarios. For a deeper dive, you can read the whitepaper here, which won the Best Paper Award at the 2025 IEEE Cloud Summit!

How does AFPD advance Azure’s failure prediction and mitigation capabilities for customers?

Project Narya and several other existing Azure failure prediction and mitigation services effectively predicted and mitigated a broad range of hardware failures, leveraging A/B testing and Multi-Armed Bandit models to improve mitigation techniques over time. Building on the strengths of these systems, AFPD unifies prediction, detection, mitigation, notification, and remediation in a single end-to-end solution with a standard set of performance metrics. AFPD expands coverage to more hardware and software scenarios, uses new Contextual Bandit models alongside A/B testing and Multi-Armed Bandit methods, and introduces scaled node repair for better fleet health management. This proactive strategy improves reliability, reduces customer downtime, and enhances overall platform stability for a smoother Azure experience.
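
To make the bandit idea concrete, here is a minimal sketch of an epsilon-greedy contextual bandit that picks a mitigation action per failure context. It is a toy illustration, not AFPD's model: the class, the context labels, the action names, and the reward values are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Toy contextual bandit: keeps a running mean reward per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, action) -> times tried
        self.means = defaultdict(float)   # (context, action) -> mean reward

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.means[(context, a)])

    def update(self, context, action, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        key = (context, action)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


# Hypothetical action and context names, not AFPD's real ones.
bandit = EpsilonGreedyContextualBandit(
    ["live_migrate", "notify_and_wait"], epsilon=0.0
)
bandit.update("ssd_wearout", "live_migrate", 1.0)
bandit.update("ssd_wearout", "notify_and_wait", 0.2)
bandit.update("memory_errors", "notify_and_wait", 0.9)
bandit.update("memory_errors", "live_migrate", 0.4)
```

With epsilon set to zero the policy always exploits what it has learned. A nonzero epsilon makes it occasionally try the other action, which is how a bandit keeps learning as conditions change.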

How does AFPD work?

AFPD’s ability to identify impending failure events and prevent downtime impact to customer workloads can be broken down into three phases: 1) failure prediction and detection, 2) failure mitigation, and 3) remediation.

Phase 1: Failure Prediction and Detection

Like Narya, AFPD rules and models monitor telemetry signals indicating node and component behavior from across the fleet. Currently, AFPD primarily detects and predicts failures for key hardware components such as SSDs, along with select software scenarios. When these capabilities notice patterns in telemetry that indicate an ongoing failure or a likely future failure, the node in question is tagged and a request is sent to repair the node.
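
As a rough illustration of the rule-based side of this phase, the sketch below tags a node when its telemetry crosses simple thresholds and builds a repair request for it. The signal names (`ssd_reallocated_sectors`, `ssd_media_errors`) and the thresholds are assumptions made up for this example; AFPD's actual rules and models are not described in this post.

```python
from dataclasses import dataclass


@dataclass
class NodeTelemetry:
    node_id: str
    ssd_reallocated_sectors: int   # hypothetical signal
    ssd_media_errors: int          # hypothetical signal


def classify(sample, media_error_limit=1, realloc_limit=100):
    """Return 'detected', 'predicted', or None for one telemetry sample."""
    if sample.ssd_media_errors >= media_error_limit:
        return "detected"      # failure appears to be happening now
    if sample.ssd_reallocated_sectors >= realloc_limit:
        return "predicted"     # degradation suggests a likely future failure
    return None


def repair_requests(samples):
    """Tag unhealthy nodes and build one repair request per tagged node."""
    requests = []
    for s in samples:
        tag = classify(s)
        if tag is not None:
            requests.append({"node_id": s.node_id, "tag": tag})
    return requests


fleet = [
    NodeTelemetry("node-a", ssd_reallocated_sectors=3, ssd_media_errors=0),
    NodeTelemetry("node-b", ssd_reallocated_sectors=250, ssd_media_errors=0),
    NodeTelemetry("node-c", ssd_reallocated_sectors=10, ssd_media_errors=4),
]
pending = repair_requests(fleet)
```

Here `node-b` would be tagged as a predicted failure and `node-c` as a detected one, while `node-a` stays in service.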

Phase 2: Mitigation

Once a repair request is sent, the mitigation service optimizes for both preventing customer impact and efficient node repair. Using optimized mitigation actions as suggested by a Contextual Bandit model or A/B testing and AFPD best practices, the mitigation service will take the following actions:

  • Mark the node “Unallocatable” to prevent new customer workloads from landing on the node
  • If eligible, Live Migrate customer workloads running on the node to a new, healthy node; if not eligible, send a notification informing the customer that they need to redeploy (see the "How can customers consume AFPD notifications?" section below)
  • Once customer migration actions have been taken, the node is ready to be swiftly removed from production and sent to be remediated
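
The three steps above can be sketched as a small function. This is a simplified model, not the mitigation service itself: the `Node` and `Workload` types and the `live_migration_eligible` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    vm_id: str
    live_migration_eligible: bool   # hypothetical flag


@dataclass
class Node:
    node_id: str
    allocatable: bool = True
    workloads: list = field(default_factory=list)


def mitigate(node, healthy_node):
    """Apply the mitigation steps; return VMs whose owners must redeploy."""
    # Step 1: stop new workloads from landing on the faulty node.
    node.allocatable = False

    # Step 2: live migrate eligible workloads, collect the rest for notification.
    needs_redeploy = []
    for w in list(node.workloads):
        if w.live_migration_eligible:
            node.workloads.remove(w)
            healthy_node.workloads.append(w)
        else:
            needs_redeploy.append(w.vm_id)
    return needs_redeploy


def ready_for_repair(node):
    # Step 3: the node can leave production once it is empty and unallocatable.
    return not node.allocatable and not node.workloads


faulty = Node("node-1", workloads=[Workload("vm-1", True), Workload("vm-2", False)])
spare = Node("node-2")
to_notify = mitigate(faulty, spare)
```

In this run `vm-1` moves to the healthy node, `vm-2` ends up on the notification list, and the faulty node only becomes ready for repair after `vm-2` has been redeployed.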

Phase 3: Remediation

Once customer workloads are evacuated from the node via Live Migration or customer redeploy, the node is assigned a fault code with relevant fault details. When the node is then sent for repair, technicians can consult the targeted diagnostic information generated beforehand by the subject matter expert prediction or detection capability that flagged the node, allowing for quicker testing to confirm the problem and a swifter repair. Additionally, intelligent spare remediation capabilities proactively place spares so that once the affected node comes out of production, a spare part is ready and waiting to expedite repair. Once the node receives the correct repair and goes through additional diagnostics, it is quickly returned to production for customer use.

How can customers consume AFPD notifications?

To view and use AFPD notifications, we recommend leveraging both Flash Health events and Scheduled Events (SE). Flash Health events provide near real-time information on ongoing and historical availability disruptions. Scheduled Events offer proactive notifications prior to any impact on VM availability, including those detected by AFPD.

Navigating AFPD Events*

*The notification experience shown below is available for Azure public cloud customers in Compute and HPC/AI. Notifications may appear slightly different in Azure non-public offerings. 

When AFPD detects potential issues, you’ll see detailed event notifications in the Azure portal under your Resource Health blade. These notifications provide context on the type of impact, timing, and recommended actions.

For example, if an unplanned degraded event is detected for the host, and the platform can automatically migrate the workload, the following notification is shown:

Fig 1 - AFPD Event in Resource Health – note the recommended step to wait for migration completion

 

The notification will include:

  • Event details: Description of the issue (e.g., “The Physical Host in which your VM is running is potentially degraded”).
  • Deadline: A “before” timestamp indicating when the impact may occur (e.g., Redeploy before 8/21/2025 6:34 PM UTC).
  • Recommended steps: Based on your workload type, you may be asked to...
    • Wait for a completion notification, as the platform will attempt to automatically migrate your workload or,
    • Redeploy the VM to a different host to avoid disruption.
  • Additional resources: Links to documentation

Alternatively, if the VM cannot be moved automatically, the following event will clearly state that redeployment is the only action required, and may include a deadline:

 

Fig 2 - AFPD Event in Resource Health – note the recommended step to redeploy the VM with a specific deadline

 

  • Notification text: “Please redeploy your VM before 8/21/2025 9:49 PM UTC to a different host server to avoid unexpected disruptions.”
  • Recommended steps: Redeploy the VM to a different host server as soon as possible.

Consume AFPD annotations through Project Flash endpoints*

* All below endpoints are available for Azure public cloud customers in Compute and HPC/AI. In Azure non-public offerings, the ARG and Event Grid endpoints are not yet available.

We’re making it easier than ever to stay on top of AFPD events and respond quickly to minimize disruption. AFPD events are delivered as part of the resource health annotations, which can be conveniently accessed through different endpoints of Project Flash.

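For example, an Azure Resource Graph (ARG) query along these lines returns the resource health annotations whose impact type is Degraded:

// Resource health data lives in the healthresources table
healthresources
| where type =~ 'microsoft.resourcehealth/resourceannotations'
// Parse the properties bag so individual fields can be filtered
| extend temp = parse_json(properties)
| where temp.impactType == "Degraded"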
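
If you want to run this query from automation rather than the portal, one option is the Azure Resource Graph REST API. The sketch below only builds the request and does not send it. The helper name `build_arg_request` is hypothetical, the `api-version` value is an assumption to check against the current REST reference, and a real call needs a valid access token.

```python
import json

# Assumed api-version; verify against the current Resource Graph REST reference.
ARG_URL = (
    "https://management.azure.com/providers/"
    "Microsoft.ResourceGraph/resources?api-version=2021-03-01"
)

DEGRADED_ANNOTATIONS_QUERY = """
healthresources
| where type =~ 'microsoft.resourcehealth/resourceannotations'
| extend temp = parse_json(properties)
| where temp.impactType == "Degraded"
""".strip()


def build_arg_request(subscription_ids, bearer_token):
    """Return (url, headers, body) for a Resource Graph query POST."""
    headers = {
        "Authorization": f"Bearer {bearer_token}",
        "Content-Type": "application/json",
    }
    body = json.dumps(
        {"subscriptions": list(subscription_ids), "query": DEGRADED_ANNOTATIONS_QUERY}
    )
    return ARG_URL, headers, body


# Placeholder values; substitute a real subscription ID and access token.
url, headers, body = build_arg_request(["<subscription-id>"], "<token>")
```

The three returned values can be handed to any HTTP client as a POST request.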

Consume AFPD events through Scheduled Events

We send events from AFPD through Scheduled Events so you can have a single source for all planned and unplanned availability impacts to your VMs. For automated resiliency, you can retrieve Scheduled Events from within the VM through the Instance Metadata Service (IMDS) endpoint. Then you can proactively migrate your workload away from at-risk resources before the impact happens, reducing the downstream impact to your customers.

If you are a current user of Scheduled Events, you’re already receiving notifications from AFPD. You just need to confirm that your workload is configured to handle NotBefore times up to 7 days in the future. For new users, you can get started using Scheduled Events today with our code samples.
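
As a starting point, the sketch below polls the Scheduled Events endpoint from inside a VM and works out how much time is left before each event may start. It is a minimal example rather than the official sample: the helper names are hypothetical, the `api-version` shown is one we assume is available to you, and the sample payload is hand-written for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# IMDS Scheduled Events endpoint; only reachable from inside an Azure VM.
IMDS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_scheduled_events(timeout=5):
    """Query IMDS for the current Scheduled Events document."""
    req = urllib.request.Request(IMDS_URL, headers={"Metadata": "true"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def time_until(event, now):
    """Time left before the event may start; zero if NotBefore is blank or past."""
    not_before = event.get("NotBefore", "")
    if not not_before:
        return timedelta(0)
    return max(parsedate_to_datetime(not_before) - now, timedelta(0))


def pending_events(document, now):
    """Return (EventId, time remaining) pairs, soonest first."""
    # Do not assume NotBefore is minutes away; it may be up to 7 days ahead.
    pairs = [(e["EventId"], time_until(e, now)) for e in document.get("Events", [])]
    return sorted(pairs, key=lambda pair: pair[1])


# Hand-written sample payload for illustration, not real platform output.
sample = {
    "DocumentIncarnation": 1,
    "Events": [
        {"EventId": "evt-1", "EventType": "Redeploy",
         "NotBefore": "Thu, 21 Aug 2025 18:34:00 GMT"},
    ],
}
now = datetime(2025, 8, 15, 18, 34, tzinfo=timezone.utc)
upcoming = pending_events(sample, now)
```

On a real VM you would pass the result of `fetch_scheduled_events()` and the current UTC time to `pending_events`, then use the remaining time to decide when to drain or redeploy the workload.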

VM Watch for enhanced diagnostics and AFPD performance on L-Series

VM Watch is a lightweight, in-VM watchdog service that emits near real-time health signals from guest VMs, enabling Azure to detect regressions and initiate proactive recovery. Enabling VM Watch for L-Series workloads is especially important for improving AFPD predictions and detections.

For onboarding, customers can enable VM watch through the Application Health Extension using tools like Azure CLI, PowerShell, ARM templates, or Azure Policy. The onboarding process includes specifying a CohortId (typically a string name that identifies the customer, if they choose to identify themselves) to group and track VM watch instances across a fleet. Once enabled, VM watch begins emitting signals immediately, thanks to its built-in suite of default tests that require minimal configuration. For L-Series VMs, which often run data-intensive workloads, VM watch can be particularly valuable in detecting issues like network connectivity failures, DNS resolution problems, and disk I/O anomalies. To onboard to VM watch, follow the steps in the linked page. VM watch can also be configured to suit your specific requirements, including viewing signals through a pre-configured Event Hub.

Updated Oct 31, 2025
Version 1.0