If you've ever had a production VM go unhealthy on Azure and found yourself scrambling to figure out what happened, you're not alone. I work with startups running production workloads on Azure, and this is one of the most common patterns I see: something goes wrong, the team opens a support ticket, and then everyone waits for a root cause while the CTO asks "how do we make sure we know about this before our customers do next time?"
The good news: Azure already gives you the tools to answer both questions. Most teams just haven't set them up yet.
Scope note: This post covers platform health and maintenance signals for Azure VMs. We're not covering guest OS metrics, application telemetry, or Azure Monitor/VM Insights here. If you don't have a dedicated SRE team, these are the highest-leverage Azure-native checks to set up first.
Let's get into it.
Step 1: Figure out what actually happened (Resource Health)
Before you open a support ticket, check Resource Health. It's the fastest way to determine whether your VM went down because of something Azure did (platform event) or something on your side (user-initiated or config issue).
Open your VM in the Azure portal and select the Resource Health blade. You'll see:
- Current status: Available, Unavailable, Degraded, or Unknown
- Health history: 30 days of state transitions with annotations explaining each one
- Root cause: For platform-initiated outages on VMs, Azure automatically publishes root cause details within 72 hours, directly in this blade
The annotations often tell you what kind of event occurred: live migration, host reboot, planned maintenance, degraded hardware, etc. In many cases, you get this information without filing a support ticket.
If your VM was affected by a live migration, the annotation will show it was a platform-initiated event. Live migration is a memory-preserving operation that causes a brief pause, typically no more than 5 seconds according to the Azure docs. But if your application is sensitive to even short freezes, or if you're seeing them frequently, that's worth investigating further.
Docs: Resource Health overview
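If you'd rather tally health history in a script than eyeball it in the portal, here's a minimal sketch. It assumes you've already exported history records shaped like the Resource Health REST API's availability status objects. The field names (`properties.availabilityState`, `properties.reasonType`) and the sample records are assumptions for illustration, so check them against a real response from your subscription.

```python
from collections import Counter

# Hypothetical records, shaped like Resource Health availability statuses.
# The field names are an assumption; verify them against a real API response.
SAMPLE_HISTORY = [
    {"properties": {"availabilityState": "Unavailable", "reasonType": "Unplanned"}},
    {"properties": {"availabilityState": "Available", "reasonType": ""}},
    {"properties": {"availabilityState": "Unavailable", "reasonType": "UserInitiated"}},
    {"properties": {"availabilityState": "Degraded", "reasonType": "Unplanned"}},
]


def summarize_history(records):
    """Count non-Available states, grouped by (state, reason)."""
    counts = Counter()
    for record in records:
        props = record.get("properties", {})
        state = props.get("availabilityState", "Unknown")
        if state == "Available":
            continue  # healthy records are not interesting for triage
        reason = props.get("reasonType") or "Unspecified"
        counts[(state, reason)] += 1
    return counts


if __name__ == "__main__":
    for (state, reason), n in sorted(summarize_history(SAMPLE_HISTORY).items()):
        print(f"{state:<12} {reason:<14} {n}")
```

A breakdown like this makes it easier to see at a glance whether your unhealthy periods skew toward platform events or toward changes made on your side.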
Step 2: Get notified when it happens (Service Health + Resource Health Alerts)
Checking the portal after an incident is fine. Getting an alert when the incident happens is better.
Service Health Alerts
These notify you about service issues, planned maintenance, health advisories, and security advisories for the Azure services and regions you're actually using. Service Health is best for subscription-level and region-level awareness. If there's a regional maintenance wave driving elevated live migrations, this is how you'd know about it proactively.
Set them up to notify your ops channel via email, SMS, webhook (Slack, PagerDuty, Teams), or automation via Logic Apps or Azure Functions.
Docs: Create Service Health alerts | PagerDuty integration
Resource Health Alerts
These fire when a specific resource (or all resources in a resource group) changes health status. The alert includes health-change details such as status, cause type (platform vs. user-initiated), and descriptive event text, so you get more than a generic "VM is unhealthy" notification.
This is the "never be surprised again" alert. If you only set up one thing from this post, make it this.
Docs: Create Resource Health alerts
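If you route these alerts to a webhook, you'll probably want to boil the payload down to one readable line before posting it to a chat channel. The sketch below assumes the activity-log-style webhook payload, with health details under `data.context.activityLog.properties`. The common alert schema nests things differently, so treat the paths and the sample payload here as assumptions and adjust them to what your action group actually sends.

```python
# Hypothetical payload, modeled on the activity log alert webhook format.
SAMPLE_ALERT = {
    "data": {
        "context": {
            "activityLog": {
                "resourceId": "/subscriptions/<subscription-id>/resourceGroups/my-rg"
                              "/providers/Microsoft.Compute/virtualMachines/my-production-vm",
                "properties": {
                    "title": "Virtual Machine health status changed to unavailable",
                    "currentHealthStatus": "Unavailable",
                    "previousHealthStatus": "Available",
                    "cause": "PlatformInitiated",
                },
            }
        }
    }
}


def format_health_alert(payload):
    """Build a one-line summary from a Resource Health alert payload."""
    log = payload.get("data", {}).get("context", {}).get("activityLog", {})
    props = log.get("properties", {})
    # The VM name is the last segment of the resource ID.
    resource = log.get("resourceId", "unknown-resource").rsplit("/", 1)[-1]
    previous = props.get("previousHealthStatus", "Unknown")
    current = props.get("currentHealthStatus", "Unknown")
    cause = props.get("cause", "Unknown")
    summary = f"{resource}: {previous} -> {current} (cause: {cause})"
    title = props.get("title", "")
    return f"{summary} | {title}" if title else summary


if __name__ == "__main__":
    print(format_health_alert(SAMPLE_ALERT))
```

Putting the cause in the first line is deliberate: whoever is on call can tell immediately whether to look at Azure or at their own recent changes.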
Step 3: See it coming (Scheduled Events API)
This is the part most teams don't know about, and it's the most powerful tool for handling live migrations gracefully.
Azure exposes an Instance Metadata Service (IMDS) endpoint on every VM that gives your application advance notice of upcoming maintenance events. Live migrations show up as EventType: "Freeze". In typical cases, you get up to ~15 minutes between the event appearing and Azure proceeding with the operation, though exact timing varies and some failures (like hardware issues) can bypass the advance notification entirely.
Note: Most Azure VM families support live migration, but G, L, N, and H series VMs do not. If you run GPU or HPC workloads on these SKUs, you won't see Freeze events. You'll still get Reboot or Redeploy events for other maintenance types.
The endpoint is available from inside the VM at:
http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01
Here's an example response when a live migration is scheduled:
{
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
            "EventType": "Freeze",
            "ResourceType": "VirtualMachine",
            "Resources": ["my-production-vm"],
            "EventStatus": "Scheduled",
            "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT",
            "Description": "Virtual machine is being paused for a memory-preserving Live Migration operation.",
            "EventSource": "Platform",
            "DurationInSeconds": 5
        }
    ]
}
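To turn NotBefore into something your code can act on, you can parse it and compute how much lead time is left. Here's a minimal sketch using only the standard library. The helper name `seconds_until` is made up for this post, and the `now` parameter exists only so the example is reproducible.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_until(event, now=None):
    """Return seconds until the event's NotBefore, or None if it is blank."""
    not_before = event.get("NotBefore")
    if not not_before:
        # A blank NotBefore means the event has already started.
        return None
    now = now or datetime.now(timezone.utc)
    return (parsedate_to_datetime(not_before) - now).total_seconds()


event = {"EventType": "Freeze", "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT"}
fixed_now = datetime(2026, 4, 22, 19, 2, 47, tzinfo=timezone.utc)
print(seconds_until(event, now=fixed_now))  # 900.0
```

A negative result means NotBefore has already passed, so the operation can begin at any moment.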
You can poll this endpoint and use the lead time to:
- Drain connections so active users aren't affected
- Checkpoint application state to recover faster
- Remove the VM from your load balancer temporarily
- Log the event so you have a record of migration frequency
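One way to take a VM out of rotation is to have your health probe endpoint report unhealthy while a disruptive event is pending. The sketch below shows just the decision logic, assuming your load balancer's HTTP probe treats a non-200 response as unhealthy. The function names are hypothetical, and the list of event types comes from the ones mentioned in this post.

```python
# Event types from this post that pause or restart the VM.
DISRUPTIVE_TYPES = {"Freeze", "Reboot", "Redeploy"}
# Statuses that mean the event is still ahead of us or in progress.
ACTIVE_STATUSES = {"Scheduled", "Started"}


def should_serve_traffic(document):
    """Return False while any disruptive event is scheduled or running."""
    for event in document.get("Events", []):
        if (event.get("EventType") in DISRUPTIVE_TYPES
                and event.get("EventStatus") in ACTIVE_STATUSES):
            return False
    return True


def probe_status_code(document):
    """HTTP status your health probe endpoint could return."""
    return 200 if should_serve_traffic(document) else 503


if __name__ == "__main__":
    pending = {"Events": [{"EventType": "Freeze", "EventStatus": "Scheduled"}]}
    print(probe_status_code(pending))         # 503
    print(probe_status_code({"Events": []}))  # 200
```

In practice you may not want to leave rotation for the entire notice window. Gating on the time remaining until NotBefore keeps the VM serving until shortly before the pause.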
Here's a simple polling script in Python:
import json
import time

import requests

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents"
HEADERS = {"Metadata": "true"}
PARAMS = {"api-version": "2020-07-01"}


def get_scheduled_events():
    """Fetch the current Scheduled Events document from IMDS."""
    response = requests.get(ENDPOINT, headers=HEADERS, params=PARAMS, timeout=5)
    response.raise_for_status()
    return response.json()


def handle_events(data):
    """Print each pending event; hook your drain/checkpoint logic in here."""
    for event in data.get("Events", []):
        print(f"[{event['EventType']}] {event.get('Description', 'No description')}")
        print(f"  Status: {event['EventStatus']}, Not Before: {event['NotBefore']}")
        print(f"  Duration: {event.get('DurationInSeconds')}s, Source: {event.get('EventSource')}")
        # Your graceful drain/checkpoint logic here


def approve_event(event_id):
    """Acknowledge the event so Azure can proceed without waiting for NotBefore."""
    payload = json.dumps({"StartRequests": [{"EventId": event_id}]})
    response = requests.post(ENDPOINT, headers=HEADERS, params=PARAMS, data=payload, timeout=5)
    response.raise_for_status()


# Poll frequently - the official docs recommend every 1 second for production.
# Adjust based on your workload sensitivity.
if __name__ == "__main__":
    while True:
        try:
            handle_events(get_scheduled_events())
        except requests.RequestException as exc:
            # The first call can be slow while the service is enabled; just retry.
            print(f"Could not reach IMDS: {exc}")
        time.sleep(1)
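A loop like the one above handles every pending event on every poll, which gets noisy at one request per second. DocumentIncarnation increases whenever the events array changes, so you can use it to skip documents you've already processed. Here's a minimal sketch of that idea. The `EventWatcher` class is a hypothetical helper, not part of any Azure SDK.

```python
class EventWatcher:
    """Reports each (EventId, EventStatus) pair once, using DocumentIncarnation."""

    def __init__(self):
        self.last_incarnation = None
        self.seen = set()

    def changes(self, document):
        """Return events that are new or whose status changed since the last call."""
        incarnation = document.get("DocumentIncarnation")
        if incarnation == self.last_incarnation:
            return []  # same incarnation means the same event information
        self.last_incarnation = incarnation
        fresh = []
        for event in document.get("Events", []):
            key = (event["EventId"], event.get("EventStatus"))
            if key not in self.seen:
                self.seen.add(key)
                fresh.append(event)
        return fresh


if __name__ == "__main__":
    watcher = EventWatcher()
    doc = {"DocumentIncarnation": 1,
           "Events": [{"EventId": "a", "EventStatus": "Scheduled"}]}
    print(len(watcher.changes(doc)))  # 1
    print(len(watcher.changes(doc)))  # 0
```

With this in place, your drain or logging logic runs once per state change instead of once per poll.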
Or a quick check in Bash:
curl -s -H "Metadata:true" --noproxy "*" \
"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | jq .
Event approval: Once your application has drained connections or checkpointed state, it can approve the event by POSTing back with the EventId. This tells Azure your app is ready, and the platform can proceed without waiting for the full timeout. If you don't explicitly approve, Azure proceeds when the NotBefore time is reached.
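As a rough sketch of how approval might be wired up, the helper below builds the POST body only for events that are still Scheduled and that your own code has finished draining for. The function name and the `drained_ids` set are hypothetical. The StartRequests shape matches the payload used in the polling script.

```python
import json


def build_approval_body(events, drained_ids):
    """Return the JSON body approving drained, still-Scheduled events, or None."""
    start_requests = [
        {"EventId": event["EventId"]}
        for event in events
        if event.get("EventStatus") == "Scheduled" and event["EventId"] in drained_ids
    ]
    if not start_requests:
        return None  # nothing to approve yet
    return json.dumps({"StartRequests": start_requests})


if __name__ == "__main__":
    events = [
        {"EventId": "a", "EventStatus": "Scheduled"},
        {"EventId": "b", "EventStatus": "Started"},
        {"EventId": "c", "EventStatus": "Scheduled"},
    ]
    # Only "a" has finished draining, so only "a" is approved.
    print(build_approval_body(events, drained_ids={"a", "b"}))
```

Keeping the payload construction separate from the HTTP call makes the approval rule easy to unit test without touching IMDS.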
If you're seeing elevated frequency of live migrations, this data lets you quantify the pattern (how often, what times, what durations) and bring hard numbers to a support conversation instead of "it feels like it's happening a lot."
Docs: Scheduled Events for VMs
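If you log each event you see, a few lines of Python can turn that log into the numbers you'd bring to support. This sketch assumes you've stored the raw event dictionaries from the endpoint. The `summarize_events` name and the sample log are made up for illustration, and negative durations are treated as "unknown" on the assumption that the platform uses them that way.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def summarize_events(logged_events):
    """Summarize logged Scheduled Events by type, hour of day (UTC), and duration."""
    unique = {}
    for event in logged_events:
        # Polling sees the same event many times; keep the first sighting.
        unique.setdefault(event["EventId"], event)
    events = list(unique.values())
    by_type = Counter(e["EventType"] for e in events)
    by_hour = Counter(
        parsedate_to_datetime(e["NotBefore"]).hour for e in events if e.get("NotBefore")
    )
    # Negative durations are treated as unknown and left out of the average.
    durations = [e["DurationInSeconds"] for e in events
                 if e.get("DurationInSeconds", -1) >= 0]
    average = sum(durations) / len(durations) if durations else None
    return {"total": len(events), "by_type": dict(by_type),
            "by_hour_utc": dict(by_hour), "avg_duration_s": average}


SAMPLE_LOG = [
    {"EventId": "a", "EventType": "Freeze", "DurationInSeconds": 5,
     "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT"},
    {"EventId": "a", "EventType": "Freeze", "DurationInSeconds": 5,
     "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT"},
    {"EventId": "b", "EventType": "Freeze", "DurationInSeconds": 9,
     "NotBefore": "Thu, 23 Apr 2026 03:05:00 GMT"},
    {"EventId": "c", "EventType": "Reboot", "DurationInSeconds": -1,
     "NotBefore": "Fri, 24 Apr 2026 19:40:00 GMT"},
]

if __name__ == "__main__":
    print(summarize_events(SAMPLE_LOG))
```

Counts by hour are a quick way to spot whether events cluster around a particular window, which is useful context for a support conversation.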
Step 4: Check your overall posture (Azure Advisor)
While you're at it, check Azure Advisor's Reliability recommendations for your VMs. It flags things like:
- VMs not deployed in availability zones
- Deprecated VM images that need updating
- Missing backup configurations
- Other resiliency gaps that make you more susceptible to availability issues
Advisor won't explain a past incident, but it can help prevent the next one.
Docs: Azure Advisor Reliability recommendations
A quick note on resilience
These tools improve your visibility and response time, but they don't eliminate downtime by themselves. If a VM is truly critical, pair this monitoring with basic resilience patterns: multiple instances behind a load balancer, availability zones, health probes, regular backups, and cross-region recovery where needed. Monitoring tells you what's happening. Architecture determines whether it matters.
The setup checklist
Quick wins (15 minutes)
| # | What | Why | Time |
|---|---|---|---|
| 1 | Check Resource Health on your production VMs | See if there are past events you didn't know about | 2 min |
| 2 | Create a Service Health alert for your regions/services | Get notified about platform issues proactively | 3 min |
| 3 | Create Resource Health alerts for your VM resource groups | Get notified when any VM changes health state | 3 min |
| 4 | Review Azure Advisor Reliability tab | Fix any posture gaps | 2 min |
Advanced hardening (1+ hours depending on your app)
| # | What | Why |
|---|---|---|
| 5 | Deploy the Scheduled Events polling script on critical VMs | Get advance notice of live migrations and maintenance |
| 6 | Implement drain/checkpoint logic tied to Scheduled Events | Gracefully handle maintenance with minimal user impact |
| 7 | Wire event approvals into your automation | Let Azure proceed as soon as your app is ready instead of waiting for NotBefore |
Wrapping up
The pattern I keep seeing is teams treating Azure VM monitoring as something they'll get to "later." Then an incident happens, the RCA takes longer than anyone wants, and everyone wishes they had visibility sooner.
The tools are already there. Resource Health tells you what happened. Service Health and Resource Health alerts tell you when it's happening. Scheduled Events tells you before it happens. And Advisor helps you make sure your setup is resilient in the first place.
Fifteen minutes of setup for the quick wins, and you're in a fundamentally better place than most teams running VMs on Azure today.