
Startups at Microsoft

Your Azure VM went down and nobody knew why. Here's how to fix that.

rmmartins, Microsoft
Apr 22, 2026

If you've ever had a production VM go unhealthy on Azure and found yourself scrambling to figure out what happened, you're not alone. I work with startups running production workloads on Azure, and this is one of the most common patterns I see: something goes wrong, the team opens a support ticket, and then everyone waits for a root cause while the CTO asks "how do we make sure we know about this before our customers do next time?"

The good news: Azure already gives you the tools to answer both questions. Most teams just haven't set them up yet.

Scope note: This post covers platform health and maintenance signals for Azure VMs. We're not covering guest OS metrics, application telemetry, or Azure Monitor/VM Insights here. If you don't have a dedicated SRE team, these are the highest-leverage Azure-native checks to set up first.

Let's get into it.

Step 1: Figure out what actually happened (Resource Health)

Before you open a support ticket, check Resource Health. It's the fastest way to determine whether your VM went down because of something Azure did (platform event) or something on your side (user-initiated or config issue).

In the Azure portal, open your VM and select the Resource Health blade. You'll see:

  • Current status: Available, Unavailable, Degraded, or Unknown
  • Health history: 30 days of state transitions with annotations explaining each one
  • Root cause: For platform-initiated outages on VMs, Azure automatically publishes root cause details within 72 hours, directly in this blade

The annotations often tell you what kind of event occurred: live migration, host reboot, planned maintenance, degraded hardware, etc. In many cases, you get this information without filing a support ticket.

If your VM was affected by a live migration, the annotation will show it was a platform-initiated event. Live migration is a memory-preserving operation that causes a brief pause, typically no more than 5 seconds (docs). But if your application is sensitive to even short freezes, or if you're seeing them frequently, that's worth investigating further.
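If you'd rather check this from a script than from the portal, the same status is exposed through the Resource Health REST API. The sketch below only builds the request URL and summarizes a response body. It does not make the call, which needs an Azure Resource Manager bearer token. The api-version value, the property names (availabilityState, reasonType, summary), and the helper names are my assumptions based on the REST reference, so verify them against the current docs before relying on this.

```python
from urllib.parse import quote

ARM_BASE = "https://management.azure.com"
# Assumption: replace with an api-version listed in the Resource Health REST reference.
API_VERSION = "2020-05-01"


def availability_status_url(resource_id: str) -> str:
    """Build the 'current availability status' URL for a single resource ID."""
    path = "/" + resource_id.strip("/")
    return (
        f"{ARM_BASE}{quote(path)}/providers/Microsoft.ResourceHealth"
        f"/availabilityStatuses/current?api-version={API_VERSION}"
    )


def summarize_status(body: dict) -> str:
    """Turn a response body into a one-line summary (property names are assumed)."""
    props = body.get("properties", {})
    state = props.get("availabilityState", "Unknown")
    reason = props.get("reasonType") or "no reason given"
    summary = props.get("summary", "")
    return f"{state} ({reason}): {summary}".strip()
```

A GET against that URL with an Authorization: Bearer header should return the body that summarize_status expects.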

Docs: Resource Health overview

Step 2: Get notified when it happens (Service Health + Resource Health Alerts)

Checking the portal after an incident is fine. Getting an alert when the incident happens is better.

Service Health Alerts

These notify you about service issues, planned maintenance, health advisories, and security advisories for the Azure services and regions you're actually using. Service Health is best for subscription-level and region-level awareness. If there's a regional maintenance wave driving elevated live migrations, this is how you'd know about it proactively.

Set them up to notify your ops channel via email, SMS, webhook (Slack, PagerDuty, Teams), or automation via Logic Apps or Azure Functions.

Docs: Create Service Health alerts | PagerDuty integration

Resource Health Alerts

These fire when a specific resource (or all resources in a resource group) changes health status. The alert includes health-change details such as status, cause type (platform vs. user-initiated), and descriptive event text, so you get more than a generic "VM is unhealthy" notification.

This is the "never be surprised again" alert. If you only set up one thing from this post, make it this.
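If you route these alerts to a webhook, a small formatter can turn the payload into a one-line message for your ops channel. This is a minimal sketch that assumes the action group uses the common alert schema, with the health details under data.alertContext.properties. The field names (previousHealthStatus, currentHealthStatus, cause, title) reflect my understanding of that schema, and the sample values in the usage note are illustrative only.

```python
def describe_health_alert(payload: dict) -> str:
    """Summarize a Resource Health alert payload (common alert schema assumed)."""
    data = payload.get("data", {})
    essentials = data.get("essentials", {})
    props = data.get("alertContext", {}).get("properties", {})

    # alertTargetIDs holds full resource IDs; keep only the resource name.
    targets = essentials.get("alertTargetIDs") or ["unknown resource"]
    resource = targets[0].rsplit("/", 1)[-1]

    previous = props.get("previousHealthStatus", "Unknown")
    current = props.get("currentHealthStatus", "Unknown")
    cause = props.get("cause", "Unknown")
    title = props.get("title") or "no description"
    return f"{resource}: {previous} -> {current} (cause: {cause}) - {title}"
```

For a payload whose cause is PlatformInitiated, this would produce something like "my-production-vm: Available -> Unavailable (cause: PlatformInitiated) - ...", which is enough to tell at a glance whether Azure or your own team triggered the change.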

Docs: Create Resource Health alerts

Step 3: See it coming (Scheduled Events API)

This is the part most teams don't know about, and it's the most powerful tool for handling live migrations gracefully.

Azure exposes an Instance Metadata Service (IMDS) endpoint on every VM that gives your application advance notice of upcoming maintenance events. Live migrations show up as EventType: "Freeze". In typical cases, you get up to ~15 minutes between the event appearing and Azure proceeding with the operation, though exact timing varies and some failures (like hardware issues) can bypass the advance notification entirely.

Note: Most Azure VM families support live migration, but G, L, N, and H series VMs do not. If you run GPU or HPC workloads on these SKUs, you won't see Freeze events. You'll still get Reboot or Redeploy events for other maintenance types.

The endpoint is available from inside the VM at:

http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01

Here's an example response when a live migration is scheduled:

{
  "DocumentIncarnation": 1,
  "Events": [
    {
      "EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
      "EventType": "Freeze",
      "ResourceType": "VirtualMachine",
      "Resources": ["my-production-vm"],
      "EventStatus": "Scheduled",
      "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT",
      "Description": "Virtual machine is being paused for a memory-preserving Live Migration operation.",
      "EventSource": "Platform",
      "DurationInSeconds": 5
    }
  ]
}
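Given a response like this, the remaining lead time is simply the gap between now and NotBefore. Here is a small sketch of that calculation. The helper names are mine, and it assumes NotBefore uses the RFC 1123 style date shown in the example and may be blank once an event has started.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_until(not_before: str, now: datetime) -> float:
    """Seconds between `now` and an event's NotBefore (negative if already past)."""
    deadline = parsedate_to_datetime(not_before)
    if deadline.tzinfo is None:
        # Defensive: treat a timestamp without a zone as UTC.
        deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def pending_freezes(document: dict, now: datetime) -> list:
    """Return (EventId, seconds_remaining) for each scheduled Freeze event."""
    result = []
    for event in document.get("Events", []):
        if event.get("EventType") != "Freeze":
            continue
        if event.get("EventStatus") != "Scheduled":
            continue
        not_before = event.get("NotBefore")
        if not_before:  # skip events with a blank NotBefore
            result.append((event["EventId"], seconds_until(not_before, now)))
    return result
```

Pass datetime.now(timezone.utc) as now in production. Taking it as a parameter keeps the function easy to test.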

You can poll this endpoint and use the lead time to:

  • Drain connections so active users aren't affected
  • Checkpoint application state to recover faster
  • Remove the VM from your load balancer temporarily
  • Log the event so you have a record of migration frequency

Here's a simple polling script in Python:

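import json
import time

import requests

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents"
HEADERS = {"Metadata": "true"}
PARAMS = {"api-version": "2020-07-01"}

def get_scheduled_events():
    """Fetch the current Scheduled Events document from IMDS."""
    response = requests.get(ENDPOINT, headers=HEADERS, params=PARAMS, timeout=5)
    response.raise_for_status()
    return response.json()

def handle_events(data):
    for event in data.get("Events", []):
        print(f"[{event['EventType']}] {event.get('Description', 'No description')}")
        print(f"  Status: {event['EventStatus']}, Not Before: {event['NotBefore']}")
        print(f"  Duration: {event['DurationInSeconds']}s, Source: {event['EventSource']}")
        # Your graceful drain/checkpoint logic here

def approve_event(event_id):
    """Acknowledge the event so Azure can proceed without waiting for NotBefore."""
    payload = json.dumps({"StartRequests": [{"EventId": event_id}]})
    response = requests.post(ENDPOINT, headers=HEADERS, params=PARAMS, data=payload, timeout=5)
    response.raise_for_status()

# Poll frequently - the official docs recommend every 1 second for production.
# Adjust based on your workload sensitivity.
last_incarnation = None
while True:
    try:
        data = get_scheduled_events()
        # DocumentIncarnation changes when the event list changes, so only
        # handle documents we haven't already seen.
        if data.get("DocumentIncarnation") != last_incarnation:
            last_incarnation = data.get("DocumentIncarnation")
            handle_events(data)
    except requests.RequestException as exc:
        # Keep polling through transient IMDS errors instead of crashing.
        print(f"Scheduled Events request failed: {exc}")
    time.sleep(1)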

Or a quick check in Bash:

curl -s -H "Metadata:true" --noproxy "*" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | jq .

Event approval: Once your application has drained connections or checkpointed state, it can approve the event by POSTing back with the EventId. This tells Azure your app is ready, and the platform can proceed without waiting for the full timeout. If you don't explicitly approve, Azure proceeds when the NotBefore time is reached.
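The drain-then-approve flow can be sketched like this. It's a minimal illustration, not a production handler. drain and send_approval are hypothetical callables you would supply: drain returns True once your app is ready, and send_approval performs the POST (for example, by wrapping the request made in approve_event above). Only events still in the Scheduled state are approved, and each approval is sent separately using the same StartRequests body as the script above.

```python
import json


def build_approval(event_id: str) -> str:
    """JSON body that acknowledges a single event."""
    return json.dumps({"StartRequests": [{"EventId": event_id}]})


def drain_then_approve(document: dict, drain, send_approval) -> list:
    """Drain for each scheduled event, then approve it. Returns approved EventIds."""
    approved = []
    for event in document.get("Events", []):
        if event.get("EventStatus") != "Scheduled":
            continue  # nothing to approve once an event has started
        if drain(event):  # approve only after the app reports it is ready
            send_approval(build_approval(event["EventId"]))
            approved.append(event["EventId"])
    return approved
```

If drain returns False, the event is left unapproved and Azure still proceeds at NotBefore, so a failed drain never blocks maintenance.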

If you're seeing elevated frequency of live migrations, this data lets you quantify the pattern (how often, what times, what durations) and bring hard numbers to a support conversation instead of "it feels like it's happening a lot."
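To turn logged events into those numbers, something like the following sketch could work. It assumes you've stored each event dict the poller saw (a hypothetical records list), deduplicates by EventId since the same event shows up on many polls, and ignores negative DurationInSeconds values, which I understand to mean "unknown."

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def summarize_events(records) -> dict:
    """Aggregate logged Scheduled Events into counts, hours (UTC), and max duration."""
    unique = {}
    for record in records:
        unique[record["EventId"]] = record  # last sighting of each event wins

    by_type = Counter(e["EventType"] for e in unique.values())
    by_hour = Counter(
        parsedate_to_datetime(e["NotBefore"]).hour
        for e in unique.values()
        if e.get("NotBefore")
    )
    durations = [
        e["DurationInSeconds"]
        for e in unique.values()
        if e.get("DurationInSeconds", -1) >= 0
    ]
    return {
        "total": len(unique),
        "by_type": dict(by_type),
        "by_hour_utc": dict(by_hour),
        "max_duration_s": max(durations) if durations else None,
    }
```

Running this over a week or a month of logs gives you a per-type count and a time-of-day histogram to attach to a support ticket.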

Docs: Scheduled Events for VMs

Step 4: Check your overall posture (Azure Advisor)

While you're at it, check Azure Advisor's Reliability recommendations for your VMs. It flags things like:

  • VMs not deployed in availability zones
  • Deprecated VM images that need updating
  • Missing backup configurations
  • Other resiliency gaps that make you more susceptible to availability issues

Advisor won't explain a past incident, but it can help prevent the next one.

Docs: Azure Advisor Reliability recommendations

A quick note on resilience

These tools improve your visibility and response time, but they don't eliminate downtime by themselves. If a VM is truly critical, pair this monitoring with basic resilience patterns: multiple instances behind a load balancer, availability zones, health probes, regular backups, and cross-region recovery where needed. Monitoring tells you what's happening. Architecture determines whether it matters.

The setup checklist

Quick wins (15 minutes)

# | What | Why | Time
1 | Check Resource Health on your production VMs | See if there are past events you didn't know about | 2 min
2 | Create a Service Health alert for your regions/services | Get notified about platform issues proactively | 3 min
3 | Create Resource Health alerts for your VM resource groups | Get notified when any VM changes health state | 3 min
4 | Review Azure Advisor Reliability tab | Fix any posture gaps | 2 min

Advanced hardening (1+ hours depending on your app)

# | What | Why
5 | Deploy the Scheduled Events polling script on critical VMs | Get advance notice of live migrations and maintenance
6 | Implement drain/checkpoint logic tied to Scheduled Events | Handle maintenance gracefully with minimal user impact
7 | Wire event approvals into your automation | Let Azure proceed as soon as your app is ready instead of waiting for NotBefore

Wrapping up

The pattern I keep seeing is teams treating Azure VM monitoring as something they'll get to "later." Then an incident happens, the RCA takes longer than anyone wants, and everyone wishes they had visibility sooner.

The tools are already there. Resource Health tells you what happened. Service Health and Resource Health alerts tell you when it's happening. Scheduled Events tells you before it happens. And Advisor helps you make sure your setup is resilient in the first place.

Fifteen minutes of setup for the quick wins, and you're in a fundamentally better place than most teams running VMs on Azure today.

Updated Apr 22, 2026
Version 6.0