
Startups at Microsoft

A practical guide to Azure VM SKU capacity monitoring

rmmartins (Microsoft)
May 22, 2025

Look, Azure capacity hiccups can really derail your day. You don’t get any warning—no “Heads up, we’re almost out of your preferred VM SKU”—you just try to create a VM and boom: error.

After one of those “oh-crap” moments hit some of my customers, I built a simple monitor that alerts you before you slam into that capacity wall—so you’re never blindsided again.

Thought I’d share it here—maybe save you from the same headache.

What this thing does

This solution isn't fancy, but it works. Here's what it'll do for you:

  1. Checks if your favorite VM types are actually available in your regions
  2. Shows exactly WHY something's unavailable if there's a problem
  3. Suggests similar VM types you could use instead (lifesaver!)
  4. Logs everything to Azure Log Analytics so you can track trends
  5. Works right from your terminal - no fancy setup needed

How it's put together

It's pretty simple really - just two main Python scripts:

  1. The Monitoring Script: Checks VM availability using Azure's API
  2. Log Analytics Setup: Stores your data for later analysis (optional, but super useful)

The flow is simple: the monitoring script queries Azure's Resource SKUs API, prints the availability results to your terminal, and (optionally) ships the same data to a Log Analytics workspace through a Data Collection Endpoint.

Before you start

You'll need a few things:

1. Azure CLI installed and working on your machine

# If you haven't logged in yet
az login

2. Azure permissions if you're doing the Log Analytics part:

# Get your username first
USER_PRINCIPAL=$(az ad signed-in-user show --query userPrincipalName -o tsv)
echo "Looks like you're logged in as: $USER_PRINCIPAL"

# Create a resource group - you can change the name if you want
az group create --name vm-sku-monitor-rg --location eastus2

# Give yourself the right permissions
az role assignment create \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

# Double-check it worked
az role assignment list \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

Azure can be kinda slow with permissions sometimes. If you get weird 403 errors later, maybe grab a coffee and try again in 10-15 mins.
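If you'd rather script the wait than keep retrying by hand, a small polling loop does the trick. This is my own sketch, not part of the original tooling; it just re-runs the same az role assignment list verification command until it returns a non-empty result:

```python
import subprocess
import time

def wait_for_role_assignment(check_cmd: str, attempts: int = 6, delay_s: int = 150) -> bool:
    """Re-run an az verification command until it returns a non-empty JSON list.
    Defaults give roughly 15 minutes of polling (6 tries, ~2.5 minutes apart)."""
    for attempt in range(attempts):
        out = subprocess.run(check_cmd, shell=True,
                             capture_output=True, text=True).stdout.strip()
        if out and out != "[]":
            return True  # the role assignment has propagated
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False

# Example usage (same command as the verification step above):
# cmd = ('az role assignment list --assignee "$USER_PRINCIPAL" '
#        '--role "Monitoring Metrics Publisher" --scope "<your-scope>"')
# print(wait_for_role_assignment(cmd))
```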

3. Python environment setup:

# Set up a virtual environment - don't skip this step!
# I learned this the hard way when I borked my system Python...
python3 -m venv venv

# Activate it
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install what we need
pip install azure-identity azure-mgmt-compute azure-mgmt-subscription azure-monitor-ingestion rich

Let's build this thing

1. The VM Capacity Checking Script

The star of the show is the monitoring script itself. It does the heavy lifting: checking VM availability, explaining what's happening, and logging the data for later analysis. I'll call it monitor_vm_sku_capacity.py:
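I won't paste the full script here, but the core of it is a single availability check against the Resource SKUs API. Here's a condensed, self-contained sketch of that logic (the evaluate_sku helper and the mocked demo entry are my own simplification; the real script adds output formatting, alternative-SKU suggestions, and Log Analytics logging):

```python
"""Condensed sketch of the availability check at the heart of
monitor_vm_sku_capacity.py."""
from types import SimpleNamespace
from typing import Any, List, Optional, Tuple

def evaluate_sku(skus: List[Any], region: str,
                 sku_name: str) -> Tuple[bool, Optional[str], List[str]]:
    """Given resource-SKU entries, decide whether sku_name is usable in region.
    Returns (is_available, reason, supported_zones)."""
    entry = next((s for s in skus
                  if s.name.lower() == sku_name.lower()
                  and region.lower() in [l.lower() for l in s.locations]), None)
    if entry is None:
        return False, "NotFound", []
    zones: List[str] = []
    for info in entry.location_info or []:
        if info.location.lower() == region.lower():
            zones = info.zones or []
            break
    restricted = [r for r in entry.restrictions or []
                  if region.lower() in
                  [l.lower() for l in (r.restriction_info.locations or [])]]
    if restricted:
        return False, restricted[0].reason_code, zones
    return True, None, zones

# Quick self-check with a mocked SKU entry (no Azure call needed):
demo = SimpleNamespace(
    name="Standard_D16ds_v5", locations=["eastus2"],
    location_info=[SimpleNamespace(location="eastus2", zones=["1", "2", "3"])],
    restrictions=[])
print(evaluate_sku([demo], "eastus2", "Standard_D16ds_v5"))
# -> (True, None, ['1', '2', '3'])

# Live usage (requires `az login` and the packages from the prerequisites):
# from azure.identity import DefaultAzureCredential
# from azure.mgmt.compute import ComputeManagementClient
# compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
# print(evaluate_sku(list(compute.resource_skus.list()), "eastus2", "Standard_D16ds_v5"))
```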


2. Log Analytics Setup Script

Now for the script that sets up all the Log Analytics stuff. This part is optional, but really helpful if you want to track capacity trends over time: setup_log_analytics.py
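Again, a condensed sketch rather than the full script: this version only creates the workspace via the Azure CLI and writes the config.json the monitor consumes. The workspace name is a placeholder of mine, and creating the Data Collection Endpoint and Rule (whose values go into endpoint and rule_id) is left out here:

```python
"""Condensed sketch of setup_log_analytics.py: workspace creation plus
config.json generation. DCE/DCR creation is omitted; fill in endpoint and
rule_id once you have them."""
import json
import subprocess
from typing import Any, Dict

def build_config(region: str, sku: str,
                 endpoint: str = "", rule_id: str = "") -> Dict[str, Any]:
    """Assemble the config.json structure the monitoring script expects."""
    return {
        "region": region,
        "target_sku": sku,
        "check_zones": True,
        "log_analytics": {
            "enabled": bool(endpoint and rule_id),
            "endpoint": endpoint,
            "rule_id": rule_id,
            "stream_name": "Custom-VMSKUCapacity_CL",
        },
    }

def create_workspace(rg: str, name: str, location: str) -> None:
    """Shell out to the Azure CLI (assumes you've already run `az login`)."""
    subprocess.run(
        ["az", "monitor", "log-analytics", "workspace", "create",
         "--resource-group", rg, "--workspace-name", name,
         "--location", location],
        check=True)

if __name__ == "__main__":
    # Uncomment to actually create the workspace (name is a placeholder):
    # create_workspace("vm-sku-monitor-rg", "vm-sku-monitor-ws", "eastus2")
    with open("config.json", "w") as f:
        json.dump(build_config("eastus2", "Standard_D16ds_v5"), f, indent=2)
    print("Wrote config.json")
```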

Setting default region and VM SKU

You've got a few options to set your preferred region and VM SKU:

1. Edit script defaults: Open monitor_vm_sku_capacity.py and look for:

parser.add_argument('--region', type=str, default='eastus2',  # Change this!
                    help='Azure region to check (default: eastus2)')
parser.add_argument('--sku', type=str, default='Standard_D16ds_v5',  # And this!
                    help='VM SKU to check (default: Standard_D16ds_v5)')

2. Specify on command line:

python monitor_vm_sku_capacity.py --region westus2 --sku Standard_D8ds_v5

3. Edit config file: After running the setup script, it creates a config.json with these values:

{
  "region": "eastus2",
  "target_sku": "Standard_D16ds_v5",
  "check_zones": true,
  ...
}

Finding Available Regions and SKUs

If you're wondering which regions and SKUs to monitor, here's how to get that info:

Using Azure CLI

# List all regions
az account list-locations --query "[].name" -o tsv

# List all VM SKUs in a region 
az vm list-skus --location eastus2 --resource-type virtualMachines --query "[].name" -o tsv  

# Get detailed info about a specific SKU
az vm list-skus --location eastus2 --size Standard_D16ds_v5 -o table

Using Azure Portal

Just go to the VM creation page in the portal and click "See all sizes" - you'll get a nice visual list of all available options. I sometimes just take a screenshot of this for reference.

Using this tool

So here's how you use this thing. I tried to make it as simple as possible:

1. Set up Log Analytics first (optional but recommended):

python setup_log_analytics.py

This builds all the Log Analytics stuff and spits out a config file you can use in the next step. The default options should work fine for most people, but you can customize if needed.

2. Run the monitoring script:

python monitor_vm_sku_capacity.py --config config.json

If you don't want to mess with Log Analytics, you can just run it directly:

python monitor_vm_sku_capacity.py --region eastus2 --sku Standard_D16ds_v5

The output will look something like this (way prettier if you have the rich package installed):

================================================================================
AZURE VM SKU CAPACITY MONITOR - 2024-05-20 14:32:45
================================================================================

Status:       AVAILABLE
SKU:          Standard_D16ds_v5
Region:       eastus2
Subscription: My Azure Subscription (12345678-1234-1234-1234-123456789012)

Available Zones:
  - 1
  - 2
  - 3

VM SKU Specifications:
  vCPUs: 16
  MemoryGB: 64
  MaxDataDiskCount: 32
  PremiumIO: True
  AcceleratedNetworkingEnabled: True

Or if the VM is unavailable:

================================================================================
AZURE VM SKU CAPACITY MONITOR - 2024-05-20 14:32:45
================================================================================

Status:       NOT AVAILABLE
SKU:          Standard_D16ds_v5
Region:       eastus2
Subscription: My Azure Subscription (12345678-1234-1234-1234-123456789012)
Details:      SKU Standard_D16ds_v5 is not available in region eastus2

Available Zones:
  None

Restrictions:
  Type:           Zone
  Reason:         NotAvailableForSubscription
  Affected Values: eastus2

VM SKU Specifications:
  vCPUs: 16
  MemoryGB: 64
  MaxDataDiskCount: 32
  PremiumIO: True
  AcceleratedNetworkingEnabled: True

Alternative SKUs:
  - Standard_D16as_v5 (vCPUs: 16, Memory: 64 GB, Family: standardDasv5Family, Similarity: 100%)
  - Standard_D16s_v5 (vCPUs: 16, Memory: 64 GB, Family: standardDsv5Family, Similarity: 100%)
  - Standard_D16s_v4 (vCPUs: 16, Memory: 64 GB, Family: standardDsv4Family, Similarity: 100%)
  - Standard_F16s_v2 (vCPUs: 16, Memory: 32 GB, Family: standardFSv2Family, Similarity: 80%)
  - Standard_E16s_v5 (vCPUs: 16, Memory: 128 GB, Family: standardEsv5Family, Similarity: 80%)
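The exact ranking logic isn't reproduced in this post, but you can approximate those similarity percentages by comparing vCPU count and memory. The weights below are my own illustration, not the script's actual formula, though they happen to reproduce the 100%/80% figures shown above:

```python
from typing import Dict

def score_similarity(target: Dict[str, float], candidate: Dict[str, float]) -> int:
    """Rough similarity score: 60 points for matching vCPUs, 40 for matching
    memory, 20 for memory within a factor of two. Illustrative weights only."""
    score = 0
    if candidate["vcpus"] == target["vcpus"]:
        score += 60
    if candidate["memory_gb"] == target["memory_gb"]:
        score += 40
    else:
        ratio = candidate["memory_gb"] / target["memory_gb"]
        if 0.5 <= ratio <= 2.0:
            score += 20  # partial credit for close-enough memory
    return score

target = {"vcpus": 16, "memory_gb": 64}
alternatives = {
    "Standard_D16as_v5": {"vcpus": 16, "memory_gb": 64},
    "Standard_F16s_v2": {"vcpus": 16, "memory_gb": 32},
    "Standard_E16s_v5": {"vcpus": 16, "memory_gb": 128},
}
for name, spec in alternatives.items():
    print(name, score_similarity(target, spec))
# Standard_D16as_v5 100
# Standard_F16s_v2 80
# Standard_E16s_v5 80
```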

Setting up scheduled checks

I don't like missing things, so I set mine up to run every hour using cron:

# Open crontab editor
crontab -e

# Add this line to run it every hour
0 * * * * cd /path/to/scripts && source venv/bin/activate && python monitor_vm_sku_capacity.py --config config.json >> vm_sku_monitor.log 2>&1
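If cron isn't an option (say you're on Windows), a tiny Python loop can stand in for it. This is my own sketch, not part of the original scripts:

```python
import subprocess
import time

def seconds_until_next_hour(now: float) -> float:
    """Seconds remaining until the next whole hour (epoch-based)."""
    return 3600 - (now % 3600)

def run_forever() -> None:
    """Fire the monitor at the top of every hour, mirroring the cron entry.
    Blocks indefinitely; stop with Ctrl+C."""
    while True:
        time.sleep(seconds_until_next_hour(time.time()))
        subprocess.run(
            ["python", "monitor_vm_sku_capacity.py", "--config", "config.json"],
            check=False)

# Call run_forever() to start the loop.
```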

Checking your data in Log Analytics

If you set up Log Analytics, you can run all sorts of cool queries:

// Basic query - see everything
VMSKUCapacity_CL
| order by TimeGenerated desc

// Find when capacity changed
VMSKUCapacity_CL
| where sku_name == "Standard_D16ds_v5" and region == "eastus2"
| project TimeGenerated, is_available
| order by TimeGenerated desc


// Simple dashboard: latest status per SKU/region
VMSKUCapacity_CL
| summarize arg_max(TimeGenerated, is_available) by sku_name, region
| extend Status = iff(is_available == true, "Available", "Not Available")
| project sku_name, region, Status, LastChecked = TimeGenerated

You can set up alerts too. That way Azure tells YOU when capacity changes, instead of you finding out during a failed deployment!
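As a starting point for an alert rule, a query like this fires only when availability flips between runs (column names match the table above; wiring it into an Azure Monitor alert is left to the portal):

```kusto
// Fire when availability changes from the previous check
VMSKUCapacity_CL
| where sku_name == "Standard_D16ds_v5" and region == "eastus2"
| order by TimeGenerated asc
| extend PrevAvailable = prev(is_available)
| where isnotempty(PrevAvailable) and is_available != PrevAvailable
| project TimeGenerated, sku_name, region, is_available
```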

Troubleshooting

Some common problems I've run into:

  1. "Could not automatically detect subscription ID":
    • Make sure you're logged in with az login
    • Or just provide it explicitly with --subscription-id
  2. Log Analytics permission errors:
    • Make sure you ran the permission commands from the prerequisites section
    • Azure's permissions can be weirdly slow - wait 10-15 minutes and try again
  3. Python environment issues:
    • Always use a virtual environment! I learned this one the hard way when I messed up my system Python
    • Make sure all the packages are installed with pip install azure-identity azure-mgmt-compute azure-mgmt-subscription azure-monitor-ingestion rich

Next Steps

  1. Create a dashboard to visualize VM SKU availability over time
  2. Set up alerts to notify you when specific SKUs become available
  3. Integrate with your CI/CD pipeline to automatically select available SKUs
  4. For a serverless, fully managed option, create an Azure Function version of the monitoring script
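The CI/CD idea (step 3) can be sketched as a small helper that walks your preference list and returns the first SKU that passes an availability check. The function and callback here are illustrative; in a real pipeline the callback would wrap the monitoring script:

```python
from typing import Callable, List, Optional

def pick_first_available(preferred: List[str],
                         is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first SKU in preference order that the callback reports
    as available, or None if none of them are."""
    for sku in preferred:
        if is_available(sku):
            return sku
    return None

# In a pipeline, is_available could shell out to monitor_vm_sku_capacity.py
# and parse its status line. Here we fake the first SKU being restricted:
print(pick_first_available(
    ["Standard_D16ds_v5", "Standard_D16as_v5", "Standard_D16s_v5"],
    lambda sku: sku != "Standard_D16ds_v5"))
# Standard_D16as_v5
```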

Advanced: Bulk-Deploy Feasibility Check

Want to know up front whether you can spin up N VMs of SKU X in region Y?
We combine two checks:

  1. Hardware-level: Resource SKUs API (is the SKU unrestricted?)
  2. Subscription-level: Usage API (enough free vCPU cores for N instances?)

Prerequisites already covered above:

az login
USER_PRINCIPAL=$(az ad signed-in-user show --query userPrincipalName -o tsv)

az group create --name vm-sku-monitor-rg --location eastus2

az role assignment create \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

python3 -m venv venv && source venv/bin/activate

pip install azure-identity azure-mgmt-compute azure-mgmt-subscription rich


File: monitor_vm_sku_capacity_bulk.py

#!/usr/bin/env python
"""
Azure VM SKU Capacity & Quota Monitor (with Zone support)

Checks:
  1) Whether your target SKU is available in a region or zone
  2) Whether your subscription has enough free vCPU quota to deploy N VMs
Optionally logs results into Azure Log Analytics.
"""

import argparse
import datetime
import json
import logging
import subprocess
from typing import List, Tuple, Dict, Any

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.subscription import SubscriptionClient

# Rich for prettier tables
try:
    from rich.console import Console
    from rich.table import Table
    from rich import box
    RICH_AVAILABLE = True
except ImportError:
    RICH_AVAILABLE = False

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger("vm_sku_capacity_monitor")


def parse_arguments():
    p = argparse.ArgumentParser(
        description="Azure VM SKU Capacity & Quota Monitor (with zone support)"
    )
    p.add_argument("--region",        type=str,   default="eastus2",
                   help="Azure region to check")
    p.add_argument("--sku",           type=str,   default="Standard_D16ds_v5",
                   help="VM SKU to check")
    p.add_argument("--zone",          type=str,   default=None,
                   help="(Optional) Availability zone to check (e.g. '1')")
    p.add_argument("--count",         type=int,   default=1,
                   help="Number of VMs you plan to deploy")
    p.add_argument("--log-analytics", action="store_true",
                   help="Enable logging to Azure Log Analytics")
    p.add_argument("--endpoint",      type=str,
                   help="Data Collection Endpoint URI")
    p.add_argument("--rule-id",       type=str,
                   help="Data Collection Rule ID")
    p.add_argument("--stream-name",   type=str, default="Custom-VMSKUCapacity_CL",
                   help="Log Analytics stream name")
    p.add_argument("--debug",         action="store_true",
                   help="Enable debug logging")
    p.add_argument("--config",        type=str,
                   help="Path to JSON config file")
    p.add_argument("--subscription-id", type=str,
                   help="Azure Subscription ID")
    return p.parse_args()


def load_configuration(args) -> Dict[str, Any]:
    cfg = {
        "region": args.region,
        "zone": args.zone,
        "target_sku": args.sku,
        "desired_count": args.count,
        "subscription_id": args.subscription_id,
        "log_analytics": {
            "enabled": args.log_analytics,
            "endpoint": args.endpoint,
            "rule_id": args.rule_id,
            "stream_name": args.stream_name
        }
    }
    if args.config:
        try:
            with open(args.config) as f:
                j = json.load(f)
                # merge known keys
                for k in ("region","zone","target_sku","desired_count","subscription_id"):
                    if k in j: cfg[k] = j[k]
                cfg["log_analytics"].update(j.get("log_analytics", {}))
                logger.info(f"Loaded configuration from {args.config}")
        except Exception as e:
            logger.error(f"Failed loading config {args.config}: {e}")
    # CLI args override file
    if args.region:     cfg["region"] = args.region
    if args.zone:       cfg["zone"] = args.zone
    if args.sku:        cfg["target_sku"] = args.sku
    if args.count:      cfg["desired_count"] = args.count
    if args.subscription_id:
        cfg["subscription_id"] = args.subscription_id
    return cfg


def get_subscription_id(explicit: str) -> str:
    if explicit:
        return explicit
    # Try Azure CLI
    try:
        out = subprocess.run(
            "az account show --query id -o tsv",
            shell=True, check=True,
            stdout=subprocess.PIPE, text=True
        ).stdout.strip()
        if out:
            return out
    except (subprocess.CalledProcessError, OSError):
        pass
    # Fallback: Azure SDK
    cred = DefaultAzureCredential()
    subs = list(SubscriptionClient(cred).subscriptions.list())
    return subs[0].subscription_id if subs else None


def check_sku_availability(
    compute: ComputeManagementClient,
    region: str, sku: str, zone: str = None
) -> Tuple[bool, str, List[str], Dict[str, Any]]:
    """
    Returns:
      is_available (bool),
      reason (str or None),
      supported_zones (list of str),
      capabilities (dict of name→value)
    """
    skus = list(compute.resource_skus.list())
    entry = next(
        (s for s in skus
         if s.name.lower() == sku.lower()
         and region.lower() in [loc.lower() for loc in s.locations]),
        None
    )
    if not entry:
        return False, "NotFound", [], {}

    # Find all zones where this SKU is sold in that region
    supported_zones = []
    for loc_info in entry.location_info or []:
        if loc_info.location.lower() == region.lower():
            supported_zones = loc_info.zones or []
            break

    # Determine restrictions
    if zone:
        # 1) If SKU doesn’t support the requested zone
        if zone not in supported_zones:
            return False, "UnsupportedZone", supported_zones, {}
        # 2) Check zone-level restrictionInfo.zones
        restricted = [
            r for r in entry.restrictions
            if r.restriction_info.zones and zone in r.restriction_info.zones
        ]
    else:
        # Region-level check
        restricted = [
            r for r in entry.restrictions
            if region.lower() in [l.lower() for l in (r.restriction_info.locations or [])]
        ]

    is_avail = len(restricted) == 0
    reason   = restricted[0].reason_code if restricted else None

    # Pull out SKU capabilities (vCPUs, MemoryGB, etc.)
    caps = {c.name: c.value for c in entry.capabilities or []}

    return is_avail, reason, supported_zones, caps


def check_quota(
    compute: ComputeManagementClient,
    region: str, vcpus_needed: int, count: int
) -> Tuple[int,int,bool]:
    usage = list(compute.usage.list(location=region))
    core = next((u for u in usage if u.name.value.lower()=="cores"), None)
    free = (core.limit - core.current_value) if core else 0
    required = vcpus_needed * count
    return free, required, free >= required


def display(rdata: Dict[str, Any]):
    if RICH_AVAILABLE:
        c = Console()
        c.print(f"\n[bold underline]SKU Capacity & Quota (Zone) Check "
                f"({datetime.datetime.now():%Y-%m-%d %H:%M:%S})[/]\n")

        # Availability table
        t1 = Table(box=box.SIMPLE)
        t1.add_column("SKU"); t1.add_column("Region"); t1.add_column("Zone")
        t1.add_column("Available"); t1.add_column("Reason")
        t1.add_row(
            rdata["target_sku"], rdata["region"],
            rdata["zone"] or "-",
            "✅" if rdata["is_available"] else "❌",
            rdata["reason"] or "-"
        )
        c.print(t1)

        # Supported zones
        t0 = Table(box=box.SIMPLE)
        t0.add_column("Supported Zones")
        t0.add_row(", ".join(rdata["supported_zones"]) or "None")
        c.print(t0)

        # Quota table
        t2 = Table(box=box.SIMPLE)
        t2.add_column("Desired VMs", justify="right")
        t2.add_column("vCPUs/VM",   justify="right")
        t2.add_column("Free Cores", justify="right")
        t2.add_column("Needs Cores",justify="right")
        t2.add_column("Quota OK?",  justify="center")
        t2.add_row(
            str(rdata["desired_count"]),
            str(rdata["vcpus"]),
            str(rdata["free_cores"]),
            str(rdata["required_cores"]),
            "✅" if rdata["quota_ok"] else "❌"
        )
        c.print(t2)

    else:
        print(f"\nSKU {rdata['target_sku']} in {rdata['region']} "
              f"zone {rdata['zone'] or '-'}: "
              f"Available={rdata['is_available']} (Reason={rdata['reason']})")
        print("Supported zones:", ", ".join(rdata["supported_zones"]) or "None")
        print(f"Quota: need {rdata['required_cores']} cores, "
              f"have {rdata['free_cores']} → OK={rdata['quota_ok']}")


def main():
    args = parse_arguments()
    if args.debug:
        logger.setLevel(logging.DEBUG)

    cfg = load_configuration(args)
    cfg["subscription_id"] = get_subscription_id(cfg.get("subscription_id"))
    logger.info(f"Checking SKU {cfg['target_sku']} x{cfg['desired_count']} "
                f"in {cfg['region']} zone {cfg['zone']}")

    cred = DefaultAzureCredential()
    compute = ComputeManagementClient(cred, cfg["subscription_id"])

    # 1) SKU + zone availability
    is_avail, reason, zones, caps = check_sku_availability(
        compute, cfg["region"], cfg["target_sku"], cfg["zone"]
    )
    vcpus = int(caps.get("vCPUs", 0))

    # 2) Subscription quota check
    free, required, ok = check_quota(
        compute, cfg["region"], vcpus, cfg["desired_count"]
    )

    result = {
        "target_sku":      cfg["target_sku"],
        "region":          cfg["region"],
        "zone":            cfg["zone"],
        "supported_zones": zones,
        "desired_count":   cfg["desired_count"],
        "is_available":    is_avail,
        "reason":          reason,
        "vcpus":           vcpus,
        "free_cores":      free,
        "required_cores":  required,
        "quota_ok":        ok
    }

    display(result)

    # (Optional) send to Log Analytics…
    # [omitted for brevity]


if __name__ == "__main__":
    main()


Run the bulk-deploy checker (region-level check)

python monitor_vm_sku_capacity_bulk.py \
  --region centralus \
  --sku Standard_B2s_v2 \
  --count 10 

(Optionally add --log-analytics --endpoint <DCE-URI> --rule-id <DCR-ID> to send the results to Log Analytics.)

Example output

SKU Capacity & Quota (Zone) Check (2025-06-20 12:49:58)


  SKU               Region      Zone   Available   Reason
 ─────────────────────────────────────────────────────────
  Standard_B2s_v2   centralus   -      ✅          -


  Supported Zones
 ─────────────────
  1, 3, 2


  Desired VMs   vCPUs/VM   Free Cores   Needs Cores   Quota OK?
 ───────────────────────────────────────────────────────────────
           10          2          100            20      ✅

Run the bulk-deploy checker (zone-level check)

python monitor_vm_sku_capacity_bulk.py \
  --region centralus \
  --zone 2 \
  --sku Standard_B2s_v2 \
  --count 10 

Example output

SKU Capacity & Quota (Zone) Check (2025-06-20 12:42:22)


  SKU               Region      Zone   Available   Reason
 ─────────────────────────────────────────────────────────
  Standard_B2s_v2   centralus   2      ✅          -


  Supported Zones
 ─────────────────
  1, 3, 2


  Desired VMs   vCPUs/VM   Free Cores   Needs Cores   Quota OK?
 ───────────────────────────────────────────────────────────────
           10          2          100            20      ✅

 

Final Thoughts

This monitoring solution has proven to be a valuable asset for Azure infrastructure management. Organizations using this tool can present data-driven insights on VM availability patterns during planning sessions, enabling more informed decisions about infrastructure scaling strategies.

This solution effectively reduces unplanned downtime and deployment failures by providing proactive notifications about resource constraints before they impact production systems. 

Happy monitoring!

Updated Jan 05, 2026
Version 10.0

4 Comments

  • Hello rmmartins. This article is conflating Azure capacity and subscription-level quota. Having quota will not prevent an Azure capacity issue from ruining someone's day. They are distinctly different things, but one walks away from this article thinking they are the same. 

    The article repeatedly refers to “capacity” problems, but the examples and explanations mix together Azure regional capacity and subscription-level quota as if they’re the same thing. They are not. They are completely different failure modes with completely different causes and entirely different remediation paths.

    Quota errors happen when a subscription doesn’t have enough vCPU allowance.
    Capacity errors happen when Azure doesn’t have physical hardware available in that region.

    There is currently no way for Azure customers to check if there is capacity for a VM SKU, except for attempting a deployment and getting an error.  Only quota can be checked. Please adjust the article accordingly to avoid contributing to this common misconception.

     

    • rmmartins (Microsoft)

      Thanks for the thoughtful feedback. You’re absolutely right that subscription-level quota and physical Azure capacity are distinct failure modes, with different causes and remediation paths.

      The intent of the article was not to suggest they are the same internally, nor that quota guarantees deployability. The goal was to focus on the operator’s perspective. What signals can be observed, monitored, and correlated when VM deployments start to fail at scale.

      From an operational standpoint, both quota exhaustion and regional capacity shortages ultimately surface as deployment failures that block VM creation, and the article deliberately looks at telemetry, allocation errors, and trends that help teams anticipate and reason about capacity risk, not to pre-validate physical capacity availability.

      I agree it would be useful to make this distinction more explicit in the wording to avoid confusion, and I appreciate you calling it out. I’ll look at clarifying that “capacity” in this context refers to the practical ability to successfully deploy a VM SKU, which can be constrained by either quota or physical availability, even though they are separate mechanisms under the hood.

      Thanks again for the constructive discussion.

  • T12341285 (Copper Contributor)

    This looks like a wonderful tool! Running python .\monitor_vm_sku_capacity.py --region eastus --sku Standard_D2s_v3 gives me a result I can validate via the portal. However, the bulk script returns the same result whether I put in a count of 250 or ratchet it down to 1: python .\monitor_vm_sku_capacity_bulk.py --region eastus --sku Standard_D2s_v3 --count 1 still says not available. The results seem to contradict the first script. Am I doing something wrong?

    • rmmartins (Microsoft)

      Hi, thanks for testing both scripts and for the feedback. You’re not doing anything wrong. The behavior you’re seeing happens because the two scripts answer different questions.

      The single-SKU script checks only whether the SKU is available in the region according to the Azure Resource SKUs API. The bulk script uses that same availability check, but it also verifies your subscription’s regional vCPU quota to see whether you have enough free cores to deploy the number of VMs requested in the --count parameter.

      Because of that, lowering the count from 250 to 1 doesn’t change the final result if the root cause is either a subscription-level restriction for that SKU in that region or if your vCPU quota for that VM family is already full. In both cases, the script will still return “not available” regardless of the count.

      The fastest way to confirm which of the two is happening is to look at the “Reason” field near the availability result and the quota section showing “Free Cores”, “Needs Cores” and “Quota OK?”.