Startups at Microsoft articles

From Error Log to Closed Ticket, Without Leaving Your Terminal

lovanartem — Mon, 15 Jun 2026 19:55:53 GMT

You describe the problem. The assistant pulls the context from what you already have, drafts the ticket, files it, tracks the replies, and closes it out, and it asks before it does anything irreversible. No portal tabs. No re-typing the resource ID Azure already knows.
See a quick demo or jump straight to the code: https://github.com/Azure-Samples/azure_support_ticket_mcp

The common challenge

Every team running on Azure eventually opens a support ticket. Here is the catch: the investigation happens in your terminal or editor, but the ticket frequently happens in a browser. Bridging those two worlds is pure overhead, and you pay it at the worst possible moment, mid-incident.

Opening one ticket means stepping through:

Confirm the tenant, pick the subscription.
Find the right support service among hundreds.
Drill down a per-service problem-classification tree.
Set severity, enter contact details, write up the issue.
Re-type the resource ID and error you were just looking at.

And filing is only the start. A support ticket is a conversation: replies to read, follow-up questions to answer, logs to attach, a status to flip when it resolves. Each one is another trip to the portal, another context switch out of the place you actually work.

The value proposition is simple: collapse that entire lifecycle into the place you already work, in plain language, in seconds. Stay in flow. Let the assistant do the mechanical parts and keep the decisions for yourself.

How the lifecycle works

Every ticket follows the same path, and the same safety gate sits in the middle of it. Opening a ticket runs left to right; once it exists, the same conversational interface carries it through the rest of its life.

The second lane is the part most ticket tools skip. Once a ticket is open, support is a two-way conversation, and the server handles that side too: read the full thread of customer and Microsoft replies, get a local summary of where things stand, reply to the support engineer, and attach logs, traces, or screenshots, all without returning to the portal.

That preview-then-confirm gate is the whole trust model. The first call returns exactly what is about to happen; the second call carries it out, and only if nothing changed in between. Reading, summarizing, and triaging are instant, with nothing to approve; creating, replying, attaching, and closing always pause for your yes.
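
One way to picture the gate is as a two-step handshake. The sketch below is an illustration, not the server's actual code: the function names `preview` and `confirm` and the SHA-256 token are assumptions made for this example. The idea is that the confirm step only goes through when the payload is exactly what was previewed.

```python
import hashlib
import json


def _fingerprint(payload: dict) -> str:
    """Stable hash of the payload; key order does not affect the result."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def preview(payload: dict) -> dict:
    """Step 1: return exactly what would be sent, plus a token tied to it."""
    return {"payload": payload, "token": _fingerprint(payload)}


def confirm(payload: dict, token: str, send) -> bool:
    """Step 2: send only if the payload still matches the previewed token."""
    if _fingerprint(payload) != token:
        return False  # something changed in between; a new preview is needed
    send(payload)
    return True


sent = []
draft = {"title": "AKS cluster cannot scale out", "severity": "moderate"}
step1 = preview(draft)

# Approving the unchanged draft goes through.
ok = confirm(draft, step1["token"], sent.append)

# A draft edited after the preview is rejected.
edited = dict(draft, severity="critical")
rejected = confirm(edited, step1["token"], sent.append)
```

Tying the approval to a hash of the payload, rather than to a simple yes/no flag, is what lets the second call detect that something changed after you looked at it.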

The Solution

The Azure Support Ticket MCP is a Model Context Protocol (MCP) server that exposes the Azure support ticket lifecycle as a set of conversational tools. Three things make it more than a thin API wrapper:

It infers context. Give it a resource ID or a portal URL and it reads the subscription, resource group, and service from it, then ranks the right problem classification from your description instead of making you walk the tree.
It is local-first. The Azure support catalog, the services and their problem-classification trees, is cached on disk, so the common path is instant and works even on a flaky connection.
It is safe by design. Every action that changes something is preview-then-confirm: nothing reaches Azure until you approve the exact payload.
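
Azure resource IDs follow a fixed path layout, which is what makes the context inference possible. Here is a minimal sketch of that parsing step. `parse_resource_id` is a hypothetical helper written for this example, not a function from the repository, and the subscription ID is a placeholder.

```python
import re

# Standard ARM resource ID layout:
# /subscriptions/<id>/resourceGroups/<rg>/providers/<namespace>/<type>/<name>
_RESOURCE_ID = re.compile(
    r"/subscriptions/(?P<subscription>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/(?P<namespace>[^/]+)/(?P<type>[^/]+)/(?P<name>[^/?#\s]+)",
    re.IGNORECASE,
)


def parse_resource_id(text: str):
    """Return the ID's parts as a dict, or None if no resource ID is found."""
    match = _RESOURCE_ID.search(text)
    return match.groupdict() if match else None


example = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-prod/providers/Microsoft.ContainerService"
    "/managedClusters/prod-aks"
)
parts = parse_resource_id(example)
```

Because the helper uses `search` rather than a full match, it also finds a resource ID embedded in a longer string, such as a URL or a line of error output.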

Getting started

Install the binary (a single command; the repo README has the current one-liner).
Register it with your MCP-capable assistant.
Start describing problems in plain language.

The fastest way in is to pipe a failure straight from your terminal into a ticket:

# a failed infra provision
copilot -i "ticket this: $(azd up 2>&1)"

# a misbehaving pod
copilot -i "ticket this: $(kubectl describe pod my-pod)"

# a red CI run
copilot -i "ticket this: $(gh run view <run-id> --log-failed)"

From that raw output, the server extracts the resource IDs, error codes, correlation IDs, and HTTP status, takes a first guess at severity, and scrubs obvious secrets on a best-effort basis. You then review and approve the full draft before any ticket is created, so anything the scrub might miss is still in front of you to catch first. Prefer plain English? That works too:

copilot -i "open a ticket — my AKS cluster prod-aks can’t scale out"
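
To make the extraction and scrubbing of piped output more concrete, here is a rough sketch using regular expressions. The patterns, the function names `scrub` and `extract`, and the sample output are all assumptions for illustration; real tool output varies widely, and the server's own rules may look quite different.

```python
import re

# Illustrative patterns only. Each keeps the label and redacts the value.
SECRET_PATTERNS = [
    re.compile(r"(AccountKey=)[^;\s\"]+", re.IGNORECASE),
    re.compile(r"(Bearer\s+)[A-Za-z0-9._\-]+", re.IGNORECASE),
    re.compile(r"(password\s*[=:]\s*)\S+", re.IGNORECASE),
]
HTTP_STATUS = re.compile(
    r"\b(?:HTTP|status(?: code)?)[ :=]+([1-5]\d{2})\b", re.IGNORECASE
)
CORRELATION_ID = re.compile(
    r"correlation\s*id[\"':= ]+"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})",
    re.IGNORECASE,
)


def scrub(text: str) -> str:
    """Best-effort redaction of obvious secrets."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


def extract(text: str) -> dict:
    """Pull out the identifiers a support engineer usually asks for."""
    return {
        "http_status": HTTP_STATUS.findall(text),
        "correlation_ids": CORRELATION_ID.findall(text),
    }


raw = (
    "ERROR: deployment failed with status code 409\n"
    "Correlation ID: 11111111-2222-3333-4444-555555555555\n"
    "conn=DefaultEndpointsProtocol=https;AccountKey=abc123==;"
    "EndpointSuffix=core.windows.net\n"
)
found = extract(raw)
clean = scrub(raw)
```

Pattern-based scrubbing like this only catches secrets that look the way the patterns expect, which is why the human review of the draft still matters.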

In short

The problem: filing and managing Azure support tickets pulls you out of your flow into a multi-step portal process, again and again, over the life of each ticket.
MCP: an open standard that lets AI assistants take real, permissioned actions through tools, not just answer questions.
The solution: an open-source MCP server that runs the whole ticket lifecycle from your terminal, context-aware, local-first, and gated by preview-then-confirm.

Try it, or read the code

It is open source under the MIT license. A short demo, installation, the full capability list, and the design notes are all in the repository — issues, ideas, and pull requests welcome.

https://github.com/Azure-Samples/azure_support_ticket_mcp

The flat-subscription problem

rmmartins — Wed, 22 Apr 2026 19:06:12 GMT

A real design review: management groups, policies, break-glass accounts, and the five things I'd tweak before going to production.

Here's what I see at most startups when they first show up on Azure: one subscription, one Global Admin, everything in the same resource group, and everyone's an Owner.

That works when you have three engineers and one environment. It stops working around the time you have a production workload, a dev environment, shared infrastructure, and an engineer who accidentally deleted the wrong resource group on a Friday afternoon.

The next step is usually "let's create more subscriptions." That's the right instinct. But without management groups and policies tying them together, you end up with four subscriptions, four sets of inconsistent RBAC assignments, no shared tagging strategy, and no audit trail showing who deployed what.

If you're at this stage and want a starting point, the Startup-Scale Landing Zone gives you an opinionated Bicep template with management groups, policies, and RBAC already wired together. This post goes deeper: what happens when a team takes those concepts and customizes them for their own environment.

The design

A startup VP of Engineering sent me their proposed management group hierarchy and asked me to review it before going to production. They'd done their homework: read the Cloud Adoption Framework docs, researched config options, and put together a three-level hierarchy with specific policies and RBAC at each level.

Here's the breakdown:

Tenant Root Group is the automatic top-level MG that Azure creates in every tenant. Be very selective about what you assign here. Anything at this level affects every subscription you'll ever create, including ones that don't exist yet. Some organizations do assign enterprise-wide "must have" policies at root, but for a startup still figuring out its governance posture, keeping root clean and pushing baselines to a company MG one level down gives you more flexibility.

Company MG sits directly below and carries the baseline that applies to everything: required tags on all resources (env, owner, cost-center, app), allowed regions locked to three US regions, Defender for Cloud enabled everywhere, and all diagnostic logs routed to a central Log Analytics workspace. Engineering gets Reader at this level, so everyone can see everything but can't change anything by default.

Three child MGs below that:

Nonprod MG is the relaxed zone. Tags are audited but not denied, so engineers can experiment without being blocked by policy. Public IPs are allowed. Engineering gets Contributor. This is where you iterate fast without filing PIM requests.

Prod MG is the strict zone. Tags are denied if missing. Public IPs are blocked. Encryption at rest is required. VM SKUs are restricted. Engineering gets Reader by default, and Contributor access is available through PIM (just-in-time, time-limited activation). You have to explicitly request write access, and it expires.

Platform MG protects the shared infrastructure that everything depends on. The Terraform state storage account, central Log Analytics workspace, and shared Key Vault all live here. Platform team gets Contributor; everyone else gets Reader. Critical resources are protected from deletion.

Under each MG, the subscriptions:

MG	Subscription	Purpose
Nonprod	dev	Development and testing
Nonprod	devtest (MSDN)	Engineer's personal scratch (MSDN-bound)
Prod	prod	Production workloads
Platform	cloud-infra	Terraform state, Log Analytics, Key Vault, workload identity
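
To see how inheritance plays out in this design, here is a small model of the hierarchy. It is a sketch only: the policy labels are shorthand I made up for the policies described above, not real Azure Policy definition names, and Tenant Root Group is left out because nothing is assigned there.

```python
# Parent of each management group; None marks the top of this sketch.
PARENT = {
    "Company": None,
    "Nonprod": "Company",
    "Prod": "Company",
    "Platform": "Company",
}

# Policies assigned directly at each management group (hypothetical labels).
ASSIGNED = {
    "Company": ["required-tags", "allowed-regions", "defender", "central-logs"],
    "Nonprod": ["audit-tags"],
    "Prod": ["deny-missing-tags", "deny-public-ip", "require-encryption",
             "restrict-vm-skus"],
    "Platform": ["protect-critical-resources"],
}

SUBSCRIPTION_MG = {
    "dev": "Nonprod",
    "devtest": "Nonprod",
    "prod": "Prod",
    "cloud-infra": "Platform",
}


def effective_policies(subscription: str) -> list:
    """Walk up from the subscription's MG, collecting inherited assignments."""
    chain = []
    mg = SUBSCRIPTION_MG[subscription]
    while mg is not None:
        chain.append(mg)
        mg = PARENT[mg]
    policies = []
    for group in reversed(chain):  # top-down, so the baseline comes first
        policies.extend(ASSIGNED[group])
    return policies


prod_policies = effective_policies("prod")
dev_policies = effective_policies("dev")
```

Every subscription picks up the Company baseline plus whatever its own MG adds, which is why moving a subscription between MGs changes its guardrails without touching the subscription itself.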

The parts that nail it

The hierarchy is flat and functional. CAF says keep it three to four levels deep and don't create management groups just for the sake of structure. This design does exactly that: a company MG for baselines, then Nonprod/Prod/Platform for the policy gradient. It's not "the one CAF pattern" (CAF deliberately avoids prescribing a single topology), but it's a clean startup pattern that scales to dozens of subscriptions without restructuring.

Audit in dev, deny in prod. Dev environments that deny everything become unusable. Engineers stop experimenting. Prod environments that only audit become insecure. The split is the right trade-off: visibility without friction in dev, enforcement without exceptions in prod.

The platform subscription for shared services. Centralizing Terraform state, the Log Analytics workspace, and shared Key Vault into a separate subscription (with its own RBAC) means application teams can't accidentally delete the infrastructure that manages their infrastructure. This is the "trust boundary" pattern, and most startups skip it until they learn the hard way.
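
The audit-in-dev, deny-in-prod split described above comes down to the same rule with a different consequence. Here is a minimal sketch of that idea. `evaluate_tags` is a hypothetical function for illustration; Azure Policy evaluates this server-side, and this code only mimics the outcome.

```python
# The required tags from the Company MG baseline.
REQUIRED_TAGS = {"env", "owner", "cost-center", "app"}


def evaluate_tags(tags: dict, effect: str) -> dict:
    """Same rule in both zones; only the consequence of a miss differs."""
    if effect not in ("audit", "deny"):
        raise ValueError(f"unknown effect: {effect}")
    missing = sorted(REQUIRED_TAGS - set(tags))
    if not missing:
        return {"allowed": True, "compliant": True, "missing": []}
    # audit: the deployment proceeds but is flagged as non-compliant
    # deny: the deployment is blocked
    return {"allowed": effect == "audit", "compliant": False, "missing": missing}


resource_tags = {"env": "dev", "owner": "alice"}
in_nonprod = evaluate_tags(resource_tags, "audit")
in_prod = evaluate_tags(resource_tags, "deny")
```

In both cases the gap is recorded. The only question is whether the engineer is blocked while it gets fixed.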

What I'd change before going live

PIM licensing isn't one-seat-fits-all. They mentioned having "1 P2 seat" for PIM. PIM requires an Entra ID P2 (or Governance) license per user who's eligible for activation, plus anyone who approves or reviews PIM access. If four engineers need just-in-time Contributor access to production and one manager approves, that's five P2 licenses (~$9/user/month). Still cheap insurance compared to "everyone has standing Contributor," but budget for it correctly.

Think about SKU restrictions as a trade-off. Their prod MG had "restrict to approved SKUs." An allow-list gives you strict standardization (only pre-approved SKUs work), but every time Azure launches a new VM series, someone has to update it. A deny-list ("block these specific expensive or unnecessary SKUs") is easier to maintain since new SKUs are available by default. The right choice depends on your team: if you need tight control over what runs in prod, keep the allow-list. If you move fast and want less policy maintenance, a deny-list with periodic reviews is simpler.
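
The maintenance difference shows up the moment a new SKU appears. The sketch below uses made-up lists and a hypothetical new SKU name purely to illustrate the behavior; it is not how Azure Policy is configured.

```python
# Example lists; which SKUs belong on them is an assumption for illustration.
ALLOW_LIST = {"Standard_D2s_v5", "Standard_D4s_v5"}
DENY_LIST = {"Standard_M128ms"}


def permitted(sku: str, mode: str) -> bool:
    """Decide whether a SKU may be deployed under the given list mode."""
    if mode == "allow-list":
        return sku in ALLOW_LIST
    if mode == "deny-list":
        return sku not in DENY_LIST
    raise ValueError(f"unknown mode: {mode}")


new_sku = "Standard_Example_v9"  # hypothetical newly launched SKU

blocked_until_reviewed = not permitted(new_sku, "allow-list")
available_by_default = permitted(new_sku, "deny-list")
```

Under the allow-list, the new SKU is unavailable until someone updates the list. Under the deny-list, it is usable immediately, including if it turns out to be expensive, which is what the periodic reviews are for.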

Resource locks beat policy for protecting critical infra. Their Platform MG had "deny deletion of state storage / log workspace" as a policy. Azure Resource Locks (CanNotDelete) are simpler and more visible for this. A lock shows up right on the resource in the portal, so engineers see it immediately. A deny-delete policy is invisible until it blocks you, and the error message doesn't always make it obvious why. Locks are also easier to temporarily remove when you legitimately need to rotate or replace a resource.

Add cost alerts on every subscription from day one. Their design didn't mention budget alerts. Azure Cost Management lets you set budget thresholds per subscription with email and webhook notifications. Set them before any workloads deploy, not after the first surprise bill. Start with 80% and 100% of expected monthly spend. It takes 5 minutes and can save thousands.

Cap the MSDN subscription. Their devtest sub was MSDN-bound, described as "personal scratch." MSDN subscriptions come with a monthly credit ($50-$150 depending on the license tier), but the spending limit can be removed, which means charges hit a valid payment method with no cap. Keep the spending limit ON for scratch subs. If it's been removed, set a budget alert at the credit amount. Also note that some Marketplace and external services may bill separately regardless of the spending limit.
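
The threshold logic behind these budget alerts is simple enough to sketch. The function below is illustrative only (Azure Cost Management evaluates budgets for you), and the dollar figures are assumed numbers, not values from this team's design.

```python
def crossed_thresholds(spend: float, budget: float, thresholds=(80, 100)) -> list:
    """Return the percentage thresholds that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare without dividing, to avoid floating-point edge cases.
    return [t for t in thresholds if spend * 100 >= t * budget]


# Expected monthly spend of $2,000 (an assumed figure).
at_1700 = crossed_thresholds(1700, 2000)  # 85% of budget
at_2000 = crossed_thresholds(2000, 2000)  # 100% of budget

# For a scratch subscription with the spending limit removed,
# the budget is simply the monthly credit.
scratch = crossed_thresholds(150, 150, thresholds=(100,))
```

The 80% alert is the one that gives you time to act; the 100% alert tells you the estimate was wrong.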

The break-glass question

This team was federating their primary domain with Google Workspace as the SAML identity provider (their whole company runs on Google). They asked: "Can I use my .onmicrosoft.com account as a break-glass account while my federated company.com is my daily driver?"

Yes. This is exactly the pattern Microsoft recommends.

Microsoft's security benchmark (PA-5) specifically calls for cloud-only break-glass accounts that bypass external IdP dependencies. If your Google SAML federation goes down (Google outage, misconfigured SAML cert, domain issues), all federated accounts fail to sign in. Cloud-only .onmicrosoft.com accounts authenticate directly against Entra ID with no external dependency.

How to harden them:

Create two break-glass accounts. Microsoft recommends at least two. Store credentials in separate physical locations. One person alone shouldn't be able to access both. Docs: Manage emergency access accounts.

Use phishing-resistant auth. Passkeys (FIDO2 security keys) are the strongest option: phishing-resistant and no dependency on a phone or authenticator app that might be unavailable during an emergency. If you already run PKI, certificate-based auth is another viable option. The key is diversity across your two accounts so a single authentication method failure doesn't lock out both. Docs: Enable FIDO2 security key sign-in.

Exclude at least one account from ALL Conditional Access policies. This is the account that guarantees access if a bad CA policy locks everyone out. Microsoft recommends excluding at least one break-glass account from every CA policy. The second account can optionally have phishing-resistant MFA enforced via CA, giving you a safer fallback for non-federation emergencies.

Assign Global Administrator permanently. Not through PIM. Break-glass accounts need immediate access. PIM activation requires the normal auth flow, which defeats the purpose in an emergency.

Monitor every sign-in. Set up alerts in Azure Monitor or Microsoft Sentinel for any authentication from a break-glass account. If these accounts show activity outside an emergency, investigate immediately.

Test quarterly. Actually sign in with the break-glass accounts on a schedule. Verify the credentials work, the FIDO2 keys work, and the monitoring alert fires. Don't wait for a real emergency to discover something is broken.

The pre-production governance checklist

Before deploying workloads into your new hierarchy, verify:

All subscriptions are nested under the correct MG (not dangling under Tenant Root Group)
Baseline policies applied at the company MG and verified with Get-AzPolicyAssignment
PIM configured with appropriate activation duration (4-8 hours max)
P2 licenses assigned to every user eligible for PIM activation, plus approvers and reviewers
Two break-glass accounts exist, tested, and monitored
At least one break-glass account excluded from all Conditional Access policies
Budget alerts set on every subscription (80% and 100% thresholds)
Resource locks on Terraform state, Log Analytics workspace, and Key Vault
MSDN spending limit verified ON (or budget alert set if removed)
Diagnostic settings routing all activity logs to the central Log Analytics workspace

Where this fits in the governance journey

If you're building Azure governance from zero, here's my recommended reading order:

Demystifying Microsoft Entra ID, Tenants and Azure Subscriptions - understand what tenants, subscriptions, and Entra ID actually are
Azure has three permission systems, and you're probably confusing them - the identity, resource, and billing planes
This post - design your management group hierarchy
Role Structures, Anti-Patterns, and the 10 Governance Principles - RBAC patterns and what not to do
Introducing the Startup-Scale Landing Zone - the full reference architecture

Your Azure VM went down and nobody knew why. Here's how to fix that.

rmmartins — Wed, 22 Apr 2026 15:49:46 GMT

If you've ever had a production VM go unhealthy on Azure and found yourself scrambling to figure out what happened, you're not alone. I work with startups running production workloads on Azure, and this is one of the most common patterns I see: something goes wrong, the team opens a support ticket, and then everyone waits for a root cause while the CTO asks "how do we make sure we know about this before our customers do next time?"

The good news: Azure already gives you the tools to answer both questions. Most teams just haven't set them up yet.

Scope note: This post covers platform health and maintenance signals for Azure VMs. We're not covering guest OS metrics, application telemetry, or Azure Monitor/VM Insights here. If you don't have a dedicated SRE team, these are the highest-leverage Azure-native checks to set up first.

Let's get into it.

Step 1: Figure out what actually happened (Resource Health)

Before you open a support ticket, check Resource Health. It's the fastest way to determine whether your VM went down because of something Azure did (platform event) or something on your side (user-initiated or config issue).

Go to your VM in the Azure portal > Resource Health blade. You'll see:

Current status: Available, Unavailable, Degraded, or Unknown
Health history: 30 days of state transitions with annotations explaining each one
Root cause: For platform-initiated outages on VMs, Azure automatically publishes root cause details within 72 hours, directly in this blade

The annotations often tell you what kind of event occurred: live migration, host reboot, planned maintenance, degraded hardware, etc. In many cases, you get this information without filing a support ticket.

If your VM was affected by a live migration, the annotation will show it was a platform-initiated event. Live migration is a memory-preserving operation that causes a brief pause, typically no more than 5 seconds (docs). But if your application is sensitive to even short freezes, or if you're seeing them frequently, that's worth investigating further.

Docs: Resource Health overview

Step 2: Get notified when it happens (Service Health + Resource Health Alerts)

Checking the portal after an incident is fine. Getting an alert when the incident happens is better.

Service Health Alerts

These notify you about service issues, planned maintenance, health advisories, and security advisories for the Azure services and regions you're actually using. Service Health is best for subscription-level and region-level awareness. If there's a regional maintenance wave driving elevated live migrations, this is how you'd know about it proactively.

Set them up to notify your ops channel via email, SMS, webhook (Slack, PagerDuty, Teams), or automation via Logic Apps or Azure Functions.

Docs: Create Service Health alerts | PagerDuty integration

Resource Health Alerts

These fire when a specific resource (or all resources in a resource group) changes health status. The alert includes health-change details such as status, cause type (platform vs. user-initiated), and descriptive event text, so you get more than a generic "VM is unhealthy" notification.

This is the "never be surprised again" alert. If you only set up one thing from this post, make it this.

Docs: Create Resource Health alerts

Step 3: See it coming (Scheduled Events API)

This is the part most teams don't know about, and it's the most powerful tool for handling live migrations gracefully.

Azure exposes an Instance Metadata Service (IMDS) endpoint on every VM that gives your application advance notice of upcoming maintenance events. Live migrations show up as EventType: "Freeze". In typical cases, you get up to ~15 minutes between the event appearing and Azure proceeding with the operation, though exact timing varies and some failures (like hardware issues) can bypass the advance notification entirely.

Note: Most Azure VM families support live migration, but G, L, N, and H series VMs do not. If you run GPU or HPC workloads on these SKUs, you won't see Freeze events. You'll still get Reboot or Redeploy events for other maintenance types.

The endpoint is available from inside the VM at:

http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01

Here's an example response when a live migration is scheduled:

{
  "DocumentIncarnation": 1,
  "Events": [
    {
      "EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
      "EventType": "Freeze",
      "ResourceType": "VirtualMachine",
      "Resources": ["my-production-vm"],
      "EventStatus": "Scheduled",
      "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT",
      "Description": "Virtual machine is being paused for a memory-preserving Live Migration operation.",
      "EventSource": "Platform",
      "DurationInSeconds": 5
    }
  ]
}
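
The NotBefore field is what tells you how much lead time you have. Here is a small sketch of turning it into a number of seconds. `seconds_until` is a hypothetical helper, and the "now" value is fixed so the example is reproducible. It assumes NotBefore uses the HTTP-date style shown in the response above, and treats a blank value as "no lead time left".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_until(not_before: str, now: datetime) -> float:
    """Lead time before Azure may start the event (negative if already due)."""
    if not not_before:
        return 0.0  # assumption: a blank NotBefore means the event is under way
    deadline = parsedate_to_datetime(not_before)
    return (deadline - now).total_seconds()


event = {"EventType": "Freeze", "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT"}
now = datetime(2026, 4, 22, 19, 5, 47, tzinfo=timezone.utc)
lead = seconds_until(event["NotBefore"], now)  # 12 minutes, in seconds
```

In a real poller you would pass `datetime.now(timezone.utc)` as `now` and use the result to decide how aggressive the drain needs to be.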

You can poll this endpoint and use the lead time to:

Drain connections so active users aren't affected
Checkpoint application state to recover faster
Remove the VM from your load balancer temporarily
Log the event so you have a record of migration frequency

Here's a simple polling script in Python:

import requests
import json
import time

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents"
HEADERS = {"Metadata": "true"}
PARAMS = {"api-version": "2020-07-01"}

def get_scheduled_events():
    response = requests.get(ENDPOINT, headers=HEADERS, params=PARAMS)
    return response.json()

def handle_events(data):
    for event in data.get("Events", []):
        print(f"[{event['EventType']}] {event.get('Description', 'No description')}")
        print(f"  Status: {event['EventStatus']}, Not Before: {event['NotBefore']}")
        print(f"  Duration: {event['DurationInSeconds']}s, Source: {event['EventSource']}")
        # Your graceful drain/checkpoint logic here

def approve_event(event_id):
    """Acknowledge the event so Azure can proceed immediately."""
    payload = json.dumps({"StartRequests": [{"EventId": event_id}]})
    requests.post(ENDPOINT, headers=HEADERS, params=PARAMS, data=payload)

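# Poll frequently - the official docs recommend every 1 second for production.
# Adjust based on your workload sensitivity.
last_incarnation = None
while True:
    data = get_scheduled_events()
    # DocumentIncarnation increases whenever the Events array changes,
    # so only re-process when there is something new.
    if data.get("DocumentIncarnation") != last_incarnation:
        last_incarnation = data.get("DocumentIncarnation")
        handle_events(data)
    time.sleep(1)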

Or a quick check in Bash:

curl -s -H "Metadata:true" --noproxy "*" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | jq .

Event approval: Once your application has drained connections or checkpointed state, it can approve the event by POSTing back with the EventId. This tells Azure your app is ready, and the platform can proceed without waiting for the full timeout. If you don't explicitly approve, Azure proceeds when the NotBefore time is reached.
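
Putting drain and approval together might look like the sketch below. `drain_then_approve` is a hypothetical function, and the drain and approve steps are passed in as callables so the flow can be tried without a VM. In practice the approve callable would be something like the `approve_event` function from the polling script.

```python
def drain_then_approve(events, drain, approve, event_types=("Freeze",)):
    """Drain for each scheduled event of interest, then approve it.

    Returns the list of event IDs that were approved.
    """
    approved = []
    for event in events:
        if event.get("EventType") not in event_types:
            continue
        if event.get("EventStatus") != "Scheduled":
            continue  # already started: too late to drain or approve
        drain(event)               # e.g. leave the load balancer, checkpoint
        approve(event["EventId"])  # tell Azure it can proceed
        approved.append(event["EventId"])
    return approved


# Stand-ins for real drain/approve logic.
drained, approved_ids = [], []
events = [
    {"EventId": "a1", "EventType": "Freeze", "EventStatus": "Scheduled"},
    {"EventId": "b2", "EventType": "Freeze", "EventStatus": "Started"},
    {"EventId": "c3", "EventType": "Reboot", "EventStatus": "Scheduled"},
]
handled = drain_then_approve(events, drained.append, approved_ids.append)
```

Approving only after the drain callable returns is the point of the ordering: an early approval would let Azure proceed while connections are still open.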

If you're seeing elevated frequency of live migrations, this data lets you quantify the pattern (how often, what times, what durations) and bring hard numbers to a support conversation instead of "it feels like it's happening a lot."
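
If you log each event your poller sees, turning that log into numbers is a few lines. This is a sketch under the assumption that you stored the events as dictionaries with the same fields as the API response; `summarize_freezes` and the sample records are made up for illustration.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def summarize_freezes(events) -> dict:
    """Count Freeze events, bucket them by hour of day (UTC), track durations."""
    freezes = [e for e in events if e["EventType"] == "Freeze"]
    by_hour = Counter(parsedate_to_datetime(e["NotBefore"]).hour for e in freezes)
    durations = [e["DurationInSeconds"] for e in freezes]
    return {
        "count": len(freezes),
        "by_hour": dict(by_hour),
        "max_duration": max(durations, default=0),
    }


logged = [
    {"EventType": "Freeze", "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT",
     "DurationInSeconds": 5},
    {"EventType": "Freeze", "NotBefore": "Thu, 23 Apr 2026 19:40:00 GMT",
     "DurationInSeconds": 5},
    {"EventType": "Freeze", "NotBefore": "Fri, 24 Apr 2026 03:05:00 GMT",
     "DurationInSeconds": 9},
    {"EventType": "Reboot", "NotBefore": "Fri, 24 Apr 2026 04:00:00 GMT",
     "DurationInSeconds": -1},
]
summary = summarize_freezes(logged)
```

A summary like "three freezes in three days, two of them in the 19:00 UTC hour" is far easier for a support engineer to act on than a general impression.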

Docs: Scheduled Events for VMs

Step 4: Check your overall posture (Azure Advisor)

While you're at it, check Azure Advisor's Reliability recommendations for your VMs. It flags things like:

VMs not deployed in availability zones
Deprecated VM images that need updating
Missing backup configurations
Other resiliency gaps that make you more susceptible to availability issues

Advisor won't explain a past incident, but it can help prevent the next one.

Docs: Azure Advisor Reliability recommendations

A quick note on resilience

These tools improve your visibility and response time, but they don't eliminate downtime by themselves. If a VM is truly critical, pair this monitoring with basic resilience patterns: multiple instances behind a load balancer, availability zones, health probes, regular backups, and cross-region recovery where needed. Monitoring tells you what's happening. Architecture determines whether it matters.

The setup checklist

Quick wins (15 minutes)

#	What	Why	Time
1	Check Resource Health on your production VMs	See if there are past events you didn't know about	2 min
2	Create a Service Health alert for your regions/services	Get notified about platform issues proactively	3 min
3	Create Resource Health alerts for your VM resource groups	Get notified when any VM changes health state	3 min
4	Review Azure Advisor Reliability tab	Fix any posture gaps	2 min

Advanced hardening (1+ hours depending on your app)

#	What	Why
5	Deploy the Scheduled Events polling script on critical VMs	Get advance notice of live migrations and maintenance
6	Implement drain/checkpoint logic tied to Scheduled Events	Gracefully handle maintenance with zero user impact
7	Wire event approvals into your automation	Control the timing of when Azure proceeds with maintenance

Wrapping up

The pattern I keep seeing is teams treating Azure VM monitoring as something they'll get to "later." Then an incident happens, the RCA takes longer than anyone wants, and everyone wishes they had visibility sooner.

The tools are already there. Resource Health tells you what happened. Service Health and Resource Health alerts tell you when it's happening. Scheduled Events tells you before it happens. And Advisor helps you make sure your setup is resilient in the first place.

Fifteen minutes of setup for the quick wins, and you're in a fundamentally better place than most teams running VMs on Azure today.

Role Structures, Anti-Patterns, and the 10 Governance Principles

rmmartins — Thu, 09 Apr 2026 21:25:00 GMT

Part 3 of 3: The implementation playbook for engineering, finance, and security teams

In Part 1, we established Azure's three-plane model: Entra for identity, RBAC for resources, Commerce for billing. In Part 2, we explored where those planes collide: Marketplace governance, Managed Identity, and ABAC.

Now it's time to get practical. This post covers the patterns that work, the anti-patterns that don't, and the governance principles that every digital-native company should adopt before they're forced to adopt them after an incident.

7 anti-patterns to avoid

These seven anti-patterns appear repeatedly across AI, SaaS, and digital-native customers. Every one of them has caused real incidents — surprise invoices, accidental deletions, compliance failures, or governance breakdowns.

❌ Anti-Pattern 1: Giving engineers billing permissions

What happens: Engineers are given Billing Reader or Billing Contributor roles "so they can see costs." They can now see MACC credits, private offer terms, commercial discounts, and Marketplace purchase history, none of which they need.

Symptoms: Engineers purchasing Marketplace SaaS without oversight. Surprise invoices. Procurement loses visibility into vendor commitments.

Fix: Engineers need Cost Management Reader (RBAC) for usage-based cost visibility. They do not need billing roles. If they need to understand MACC impact, create a reporting process instead of giving them the keys.

❌ Anti-Pattern 2: Giving finance subscription owner access

What happens: Finance teams are given Owner or Contributor roles on subscriptions "so they can track spending." They now have the ability to deploy, modify, and delete production resources.

Symptoms: Massive over-permissioning. Finance can accidentally delete production resources. Audit risk: regulators will flag this.

Fix: Finance roles belong in the Billing plane, not the resource plane. Give finance Billing Reader for credit and invoice visibility. If they also need resource cost data, add Cost Management Reader (RBAC) scoped to the appropriate subscriptions — that's a read-only, resource-plane role.

❌ Anti-Pattern 3: Too many subscription owners

What happens: Every senior engineer, team lead, and sometimes product managers get Owner on subscriptions. The logic: "they need to unblock themselves."

Symptoms: No accountability (when everyone is Owner, nobody is). High blast radius. Hard to trace role assignments when troubleshooting.

Fix: Maximum 2–3 Owners per subscription: Platform Lead, SRE Lead, and optionally the Cloud Architect. Everyone else gets Contributor or scoped roles. Use PIM for emergency elevation.
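
Checking the owner limit is easy to automate once you have an export of role assignments. The sketch below assumes a simple list of records with `subscription`, `role`, and `principal` keys; that shape and the `over_owned` name are my own for this example, so adapt it to whatever your export actually looks like.

```python
from collections import defaultdict


def over_owned(assignments, limit=3) -> dict:
    """Return subscriptions with more than `limit` distinct Owners."""
    owners = defaultdict(set)
    for a in assignments:
        if a["role"] == "Owner":
            owners[a["subscription"]].add(a["principal"])
    return {
        sub: sorted(principals)
        for sub, principals in owners.items()
        if len(principals) > limit
    }


assignments = [
    {"subscription": "prod", "role": "Owner", "principal": "grp-platform-leads"},
    {"subscription": "prod", "role": "Owner", "principal": "alice"},
    {"subscription": "prod", "role": "Owner", "principal": "bob"},
    {"subscription": "prod", "role": "Owner", "principal": "carol"},
    {"subscription": "prod", "role": "Contributor", "principal": "dave"},
    {"subscription": "dev", "role": "Owner", "principal": "grp-platform-leads"},
]
flagged = over_owned(assignments)
```

Note that a group counts as one principal here, so this check says nothing about how many people are inside that group.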

❌ Anti-Pattern 4: Believing Entra Global Admin = Azure Owner

What happens: Leadership assumes Global Admin has universal access: subscriptions, resources, billing. It doesn't. Global Admin controls the identity plane only.

Symptoms: Security teams thinking they can see all resources (they can't). Incorrect governance designs that assume Entra = RBAC.

Fix: Train leadership explicitly: Entra ≠ RBAC ≠ Billing. Three planes, three sets of roles, zero overlap. A Global Admin who needs resource access must be separately granted RBAC roles.

❌ Anti-Pattern 5: Deploying marketplace SaaS without finance

What happens: Engineers purchase Marketplace tools directly because they have billing permissions (see Anti-Pattern 1) or because the org hasn't restricted Marketplace purchases.

Symptoms: Incorrect MACC burn. Licensing duplicates. Vendor lock-in without legal review. Private offer terms not applied.

Fix: Require finance approval for all paid Marketplace purchases. Follow the five-step workflow from Part 2: Engineer requests → Finance reviews → Billing executes → Engineering deploys → Cost monitoring activated.

❌ Anti-Pattern 6: Mixed dev/test/prod in one subscription

What happens: To save time, teams put all environments in one subscription.

Symptoms: Can't isolate production costs. A Contributor on the sub can modify both dev and prod. Can't enforce stricter policies on prod without affecting dev. Compliance teams can't get clean boundaries.

Fix: Separate subscriptions by environment. Pattern: 1 subscription per environment per workload (or at minimum per environment). Use cross-subscription networking via Hub & Spoke or Landing Zones.

❌ Anti-Pattern 7: Not using Azure Policy

What happens: Teams deploy freely with no guardrails. Over time: VMs in unapproved regions, GPU SKUs in non-production, storage accounts without encryption, missing tags, public IP drift.

Symptoms: Inconsistent regions. Wrong VM families. Missing tags make cost attribution impossible. Non-compliant configurations.

Fix: Adopt Azure Policy early, at Management Group scope. Critical policies: allowed locations, allowed VM SKUs, enforce HTTPS, enforce private endpoints, enforce tagging (environment, owner, cost-center).

Recommended role structure

Based on experience with dozens of digital-native customers, here's the role structure that works across the three planes.

Engineering plane (RBAC)

2–3 subscription Owners: Platform Lead, SRE Lead, Cloud Architect
Platform/SRE team as Contributors: deploy and manage infrastructure
Developers as RG-scoped Contributors or Readers: limited to their workload's resource group
Cost Management Reader for budget owners: usage visibility without deployment rights
Azure Policy for guardrails: VM SKUs, regions, encryption, tags
Management Groups for organizational structure

Finance plane (Commerce)

Billing Account Owner = CFO or Finance Director
Billing Contributor = Finance Operations
Billing Reader = FP&A and financial analysts
All Marketplace-paid offers require finance approval
MACC visibility restricted to finance roles

Identity/Security plane (Entra)

2–4 Global Admins (break-glass accounts included)
PIM enforced for all privileged roles, no permanent admin access
Conditional Access for all admin roles (MFA, compliant device, block legacy auth)
Groups used for RBAC assignment, never assign RBAC to individual users
Workload identities (Managed Identity) preferred over service principals

Role mapping templates

Copy these into your onboarding documentation.

Engineering Team

Role	Azure Role	Plane	Allowed actions
Cloud Architect	Owner (2–3 per sub)	RBAC	Govern workloads, assign roles, manage infrastructure
Platform / SRE	Contributor	RBAC	Deploy and manage infrastructure
Developer	Contributor or Reader (RG-scoped)	RBAC	Deploy to specific resource groups
Budget Owner	Cost Management Reader	RBAC	View usage-based cost, manage budgets — not billing

Finance Team

Role	Azure Role	Plane	Allowed actions
Finance Lead	Billing Account Owner	Billing	View and manage credits, invoices, MACC, payment methods
Finance Analyst	Billing Reader	Billing	Read-only billing visibility
FP&A	Billing Reader	Billing	Read-only; no deployments, no resource access

Leadership

Role	Azure Role	Plane	Allowed actions
CTO / VP Engineering	Reader or Cost Mgmt Reader	RBAC	Visibility into platform and resource costs
CFO	Billing Reader	Billing	Visibility into credits, invoices, MACC, commitments

RACI Matrix

Adapted from the Microsoft Cloud Adoption Framework.

Function	Accountable	Responsible	Consulted	Informed
Billing account roles & access	Finance Lead	Finance Ops	Cloud Architect	Engineering
Subscription role assignments	Cloud Architect	Platform / SRE	Finance, Security	Engineering
Cost monitoring & budgets	Finance	Engineering	Leadership	All teams
Marketplace purchases	Finance Lead	Finance Ops	Engineering, Legal	CFO
IaC / Deployment governance	Platform Lead	Engineers	Security	Finance
Policies & guardrails	Security / Cloud Architect	Platform Team	Engineering	Leadership
Identity & access governance	Security Lead	Identity Admin	Cloud Architect	All teams
PIM & Conditional Access	Security Lead	Identity Admin	Platform Lead	Engineering
MACC tracking & credit visibility	Finance Lead	Finance Ops	Cloud Architect	Leadership

Include this template in your onboarding documentation and review it quarterly.

Best Practices

Use Entra Groups for RBAC assignment, never assign directly to users

Benefits: clear separation of identity and resource planes, easy onboarding/offboarding, predictable RBAC inheritance, enables PIM for group-based elevation.

Naming pattern:

grp-sub-<SubscriptionName>-Owner
grp-sub-<SubscriptionName>-Contributor
grp-rg-<WorkloadName>-Reader

Assign the group to the role, not individual users.
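The naming pattern above is easy to generate and validate in code. The following is a minimal sketch, assuming names are built from a scope kind (sub or rg), a scope name, and a role; the helper name rbac_group_name and the rule of stripping non-alphanumeric characters are illustrative choices, not an Azure requirement.

```python
import re

def rbac_group_name(scope_kind: str, scope_name: str, role: str) -> str:
    """Build a group name such as grp-sub-Prod-Owner or grp-rg-Checkout-Reader."""
    if scope_kind not in ("sub", "rg"):
        raise ValueError("scope_kind must be 'sub' or 'rg'")
    # Keep only letters, digits, and hyphens so names stay predictable
    # (an assumption for this sketch).
    clean = re.sub(r"[^A-Za-z0-9-]", "", scope_name)
    if not clean:
        raise ValueError("scope_name has no usable characters")
    return f"grp-{scope_kind}-{clean}-{role}"

names = [
    rbac_group_name("sub", "Prod", "Owner"),
    rbac_group_name("sub", "Prod", "Contributor"),
    rbac_group_name("rg", "Checkout API", "Reader"),
]
print(names)
```

Generating names from one function keeps onboarding scripts and documentation in agreement on the convention.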

Enforce PIM + Conditional Access for all privileged roles

Key CA policies: MFA required for all admins, compliant device requirement, block legacy authentication, block sign-in from high-risk locations, require phishing-resistant MFA.

No permanent admin access. Use time-based elevation for every privileged operation.

Separate subscriptions by environment and workload

Subscriptions are a security boundary. Pattern: 1 subscription per environment per workload. Platform teams get their own subscription. Use Hub & Spoke or Landing Zones for cross-subscription networking.

Keep billing data confidential

Only Billing roles should see credits, commitments, discounts, invoices, and MACC balance. Engineers should never have access to commercial data.

The 10 Principles of Azure Governance

After working with digital natives across AI, SaaS, and infrastructure companies, I can summarize Azure governance into these principles:

#	Principle	Summary
1	Separate identity, resources, and billing. Always.	Never mix roles across planes. An engineer should never hold billing roles. A finance analyst should never hold subscription Owner.
2	Engineering owns the resource plane.	Give them Contributor and Cost Management Reader. Don't burden them with billing or identity administration.
3	Finance owns the billing plane.	Credits, MACC, invoices, private offers. Every Marketplace purchase flows through Finance.
4	Security owns identity and governance.	PIM, Conditional Access, Azure Policy. Identity decisions should not be made by engineering or finance.
5	Keep subscription Owners scarce.	Maximum 2–3 per subscription. Use PIM for emergency elevation. Everyone else gets Contributor or scoped roles.
6	Lock down Marketplace.	Every SaaS purchase approved by Finance. No exceptions. Use the five-step workflow.
7	Use Infrastructure as Code.	Manual deployments don't scale and can't be audited. Use Bicep, Terraform, or Pulumi.
8	Use budgets early.	Set budgets at Management Group, Subscription, and Resource Group levels. Configure alerts to email, Teams, or automation.
9	Use Management Groups from day one.	Every startup that scales beyond a single subscription regrets not using them. Recommended hierarchy: Tenant Root → OrgName → Platform / Production / NonProduction / Sandbox / Shared Services.
10	Build governance before scale.	The companies that scale successfully treat Azure governance as infrastructure, not bureaucracy.

Closing thoughts

Azure's three permission planes aren't a problem to solve; they're a framework to leverage.

The confusion happens when teams try to treat Azure as if it has a single permission system. It doesn't, and it never will, because identity, billing, and resource deployment are fundamentally different domains that must be operated and secured differently.

But when organizations understand these three planes and structure their roles accordingly, something powerful happens:

Engineering moves faster. Clear RBAC scopes mean teams deploy without waiting for approvals they don't need.
Finance gains real oversight. Billing roles provide full commercial visibility without the risk of touching production resources.
Security gets a clean, enforceable boundary model. Entra controls identity; PIM and Conditional Access control elevation; Azure Policy controls the guardrails.
Leadership sees clarity instead of chaos. The right roles in the right planes mean dashboards, reports, and alerts actually reflect what each stakeholder needs.

Good governance doesn't slow down innovation. Bad governance does.

The companies that scale successfully, whether AI-native, SaaS platforms, or global digital-first organizations, are the ones that adopt a clean, intentional model early. They treat Azure governance as infrastructure, not bureaucracy.

The model is simple: Entra for who. RBAC for what. Commerce for how you pay. Start with that, and everything else becomes easier.

This concludes the 3-part series on Azure Governance for Digital Natives. For the full model, start with Part 1: The Three Permission Planes. For collision points and Managed Identity, read Part 2: Marketplace Governance and the Cross-Plane Bridge.

Marketplace governance and the cross-plane bridge

rmmartins — Thu, 09 Apr 2026 21:17:44 GMT

Part 2 of 3: Where resource deployment meets financial authority and how to govern it

In Part 1, we established the foundational model: Azure operates on three completely separate permission planes, Entra (identity), RBAC (resources), and Commerce (billing). A role in one plane grants zero access in the others.

That model is clean in theory. But in practice, the planes collide. And when they do, teams get confused, purchases stall, and governance gaps appear.

This post covers the biggest collision point: Marketplace, where resource deployment meets financial authority. We'll also dig into Managed Identity (the one construct that genuinely bridges two planes), ABAC (advanced conditional governance within the resource plane), and the five-step Marketplace approval workflow every digital-native company should adopt.

Marketplace: Where the resource and billing planes intersect

Marketplace is the most common collision point between Azure's permission planes. Here's why: deploying an Azure resource and purchasing a Marketplace SaaS product feel like the same action from the Portal, but they are governed by completely different permission systems.

Deploying resources ≠ Purchasing SaaS

A Contributor can deploy any native Azure resource: VMs, Storage, AKS, Networking, Databases, Azure OpenAI. These are resource plane operations governed by RBAC.

But purchasing a third-party SaaS product through Marketplace (Datadog, Snowflake, Elastic, Confluent, MongoDB Atlas) is a commercial transaction. It creates a financial obligation between your organization and a vendor. That's the billing plane.

Deploying → RBAC (Resource Plane)
Purchasing → Commerce (Financial Plane)

The marketplace permission model

Action	Requires RBAC?	Requires billing role?
Deploy a VM	✅ Yes	❌ No
Deploy AKS cluster	✅ Yes	❌ No
Deploy Azure OpenAI	✅ Yes	❌ No
Deploy Datadog agent extension	✅ Yes	❌ No
Deploy Confluent cluster (Azure-native)	✅ Yes	❌ No
Purchase Datadog SaaS plan	❌ No	✅ Yes
Purchase Snowflake SaaS	❌ No	✅ Yes
Accept Confluent SaaS contract	❌ No	✅ Yes
View Snowflake private offer	❌ No	✅ Yes
Approve Marketplace private offer	❌ No	✅ Yes

This is why engineers often ask:

"Why can't I buy Snowflake? I'm an Owner."

Because Owner has no financial authority. Owner is the highest role in the resource plane, but Marketplace SaaS purchases are commercial transactions that require billing plane permissions. These are different systems.
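The Owner question becomes easier to reason about when the table is treated as a lookup. Below is a small sketch that models the rule, assuming the action verbs from the table decide the plane; the function names and the verb matching are hypothetical and do not correspond to any Azure API.

```python
RESOURCE_PLANE = "RBAC (resource plane)"
BILLING_PLANE = "Commerce (billing plane)"

def required_plane(action: str) -> str:
    """Classify an action by the verbs used in the permission table above."""
    text = action.lower()
    if text.startswith(("purchase", "accept", "approve")) or "private offer" in text:
        return BILLING_PLANE
    if text.startswith("deploy"):
        return RESOURCE_PLANE
    raise ValueError(f"unclassified action: {action}")

def can_perform(action: str, has_rbac_role: bool, has_billing_role: bool) -> bool:
    """A role in one plane grants nothing in the other."""
    if required_plane(action) == BILLING_PLANE:
        return has_billing_role
    return has_rbac_role

# A subscription Owner who holds no billing role:
print(can_perform("Deploy AKS cluster", True, False))       # True
print(can_perform("Purchase Snowflake SaaS", True, False))  # False
```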

The subtlety: Azure-Native vs. SaaS

Some vendors have both Azure-native integrations and SaaS offerings, which makes this even more confusing:

Datadog agent extension: deploys as an Azure resource → RBAC ✅
Datadog SaaS plan: creates a billing relationship → Commerce ✅
Confluent for Azure: deploys Kafka as an Azure resource → RBAC ✅
Confluent Cloud SaaS contract: financial commitment → Commerce ✅

When an engineer deploys a Datadog agent via the Portal, everything works. When they try to subscribe to the Datadog SaaS plan, they hit a wall. Same vendor, same Portal, different permission plane.

The five-step marketplace purchase workflow

For digital natives operating with financial governance, every Marketplace purchase should follow this workflow:

Step 1: Engineer requests a SaaS or marketplace resource

The request should include: why it's needed, expected cost, impact on MACC, preferred vendor, and alternatives considered.

Step 2: Finance reviews commercial implications

Finance checks: MACC impact (does this purchase count toward the commitment?), budget alignment, available discounts (private offers), vendor validation, and contract terms.

Step 3: Billing role executes the purchase

Billing Account Owner or Contributor completes the transaction in the Portal. This is a billing plane operation.

Step 4: Engineering deploys or configures the resource

SaaS connector setup, private offer entitlement, RBAC for workload integration, data pipelines and integration. This is a resource plane operation.

Step 5: Cost monitoring activated

Alerts configured, budgets set, tagging applied, forecasting enabled.

This five-step workflow is simple, but most digital natives skip it and end up with surprise invoices, unapproved vendor commitments, or MACC burn they didn't plan for.
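One way to keep the workflow from being skipped is to encode it as an ordered checklist in whatever request tooling you use. The sketch below is only an illustration: the step and team labels are made up for this example, and because the text names no owner for cost monitoring, that step accepts any actor (an assumption).

```python
from dataclasses import dataclass, field

# (step, owning team) in the order given above. None means "owner not checked".
STEPS = [
    ("request", "engineering"),
    ("finance_review", "finance"),
    ("purchase", "billing"),
    ("deploy", "engineering"),
    ("cost_monitoring", None),
]

@dataclass
class MarketplaceRequest:
    product: str
    completed: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.completed) == len(STEPS)

    def advance(self, step: str, actor: str) -> None:
        """Record a step, rejecting skipped steps and wrong actors."""
        if self.done:
            raise ValueError("workflow already complete")
        expected_step, owner = STEPS[len(self.completed)]
        if step != expected_step:
            raise ValueError(f"expected '{expected_step}', got '{step}'")
        if owner is not None and actor != owner:
            raise PermissionError(f"'{step}' belongs to {owner}")
        self.completed.append(step)

req = MarketplaceRequest("Datadog SaaS plan")
req.advance("request", "engineering")
req.advance("finance_review", "finance")
req.advance("purchase", "billing")
req.advance("deploy", "engineering")
req.advance("cost_monitoring", "engineering")
print(req.done)  # True
```

Jumping straight from request to purchase raises ValueError, and an engineer attempting the purchase step raises PermissionError, which mirrors the separation between the resource and billing planes.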

The one cross-plane bridge: Managed Identity

If the three-plane model is about separation, Managed Identity is the one construct that genuinely bridges two of those planes.

A Managed Identity is an Entra identity tied to an Azure resource and authorized via RBAC. It lets Azure workloads authenticate to other Azure services without storing credentials in code, environment variables, or configuration files.

The cross-plane flow

Step	Plane	What happens
1. Identity created	Entra (Identity)	A service principal is registered in the directory
2. Access authorized	RBAC (Resource)	Role assignments grant access to specific resources
3. Identity used	Runtime (Resource)	The workload requests a token from Entra and calls the target service

No secrets. No passwords. No key rotation. The identity lifecycle is managed by Azure itself.

AI workload examples

For digital natives building AI applications, Managed Identity is essential:

Scenario	Source	Target	RBAC role needed
App calls Azure OpenAI	App Service / Container App	Azure OpenAI	Cognitive Services OpenAI User
App reads secrets	App Service / Container App	Key Vault	Key Vault Secrets User
App reads/writes blobs	App Service / Container App	Storage Account	Storage Blob Data Contributor
AKS pod calls AOAI	AKS (Workload Identity)	Azure OpenAI	Cognitive Services OpenAI User
AKS pod reads secrets	AKS (Workload Identity)	Key Vault	Key Vault Secrets User
Function processes events	Azure Function	Event Hub	Azure Event Hubs Data Receiver
Pipeline reads training data	ML Workspace	Storage Account	Storage Blob Data Reader

System-Assigned vs. User-Assigned

System-assigned: Tied to a single resource. When the resource is deleted, the identity is deleted. Best for simple scenarios with one resource accessing one or a few target services.

User-assigned: Created as a standalone resource. Can be assigned to multiple resources. Best for shared identity across microservices, AKS Workload Identity, or when the identity must persist independently.

AKS Workload Identity

AKS Workload Identity deserves special mention because it's the most common Managed Identity pattern in digital-native companies running Kubernetes. It works in five steps:

A User-Assigned Managed Identity is created in Azure
A Kubernetes Service Account is annotated with the identity's client ID
A Federated Identity Credential links the K8s service account to the Managed Identity
RBAC role assignments grant the Managed Identity access to target resources
At runtime, the pod uses the service account to get an Entra token via workload identity federation

This is Entra + RBAC + Kubernetes working together: identity plane creates the trust, resource plane authorizes the access, and the workload uses it at runtime.
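The pieces that tie these steps together are a service account annotation and a federated credential whose subject names that service account. The sketch below builds both as plain dictionaries so the relationship is visible. The namespace, service account name, client ID, and issuer URL are placeholders; the real issuer URL comes from your cluster's OIDC configuration.

```python
def federated_subject(namespace: str, service_account: str) -> str:
    """Subject string that the federated credential must match."""
    return f"system:serviceaccount:{namespace}:{service_account}"

def service_account_manifest(namespace: str, name: str, client_id: str) -> dict:
    """Kubernetes ServiceAccount annotated with the managed identity's client ID."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"azure.workload.identity/client-id": client_id},
        },
    }

def federated_credential(issuer_url: str, namespace: str, name: str) -> dict:
    """Fields of the federated identity credential on the managed identity."""
    return {
        "issuer": issuer_url,
        "subject": federated_subject(namespace, name),
        "audiences": ["api://AzureADTokenExchange"],
    }

# Placeholder values for illustration only.
sa = service_account_manifest("inference", "model-api", "00000000-0000-0000-0000-000000000000")
cred = federated_credential("https://example.invalid/oidc-issuer/", "inference", "model-api")
print(cred["subject"])  # system:serviceaccount:inference:model-api
```

If the namespace or service account name in the subject does not match the pod's actual service account, the token exchange fails, which is a common cause of authentication errors with this pattern.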

Key insight: Managed Identity bridges Entra and RBAC, but never touches the third plane (billing). No identity, managed or otherwise, can see MACC credits or approve Marketplace purchases.

Advanced: Attribute-Based Access Control (ABAC)

ABAC extends RBAC with conditions based on resource attributes (tags), principal attributes, and request context. It is not a separate permission system; it's an enhancement to the resource plane.

For example, you can write a role assignment that says: "Allow Contributor access only to resources tagged Environment = Dev" or "Allow read access only to storage blobs under a specific path prefix."
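To make the idea concrete, here is a toy model of how a condition narrows a role assignment. It is a sketch of the concept only: the function and parameter names are invented for illustration, and it does not use or reproduce Azure's actual condition syntax.

```python
from typing import Optional

def abac_allows(
    role_grants_action: bool,
    resource_tags: dict,
    required_tags: dict,
    blob_path: Optional[str] = None,
    path_prefix: Optional[str] = None,
) -> bool:
    """Return True only if the role grants the action AND the condition holds."""
    if not role_grants_action:
        return False  # a condition can narrow a grant, never widen it
    for key, value in required_tags.items():
        if resource_tags.get(key) != value:
            return False
    if path_prefix is not None:
        if blob_path is None or not blob_path.startswith(path_prefix):
            return False
    return True

dev_only = {"Environment": "Dev"}
print(abac_allows(True, {"Environment": "Dev"}, dev_only))    # True
print(abac_allows(True, {"Environment": "Prod"}, dev_only))   # False
print(abac_allows(True, {}, {}, "tenant-a/report.csv", "tenant-a/"))  # True
```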

ABAC is particularly useful for:

Multi-tenant SaaS applications that need tenant isolation at the resource layer
Regulated workloads that require fine-grained access control beyond what standard RBAC scopes provide

What ABAC cannot do: grant billing access, override Entra roles, access MACC, or purchase Marketplace products. It operates entirely within the RBAC resource plane.

For implementation details, see: Azure RBAC Conditions (ABAC)

What's Next → We've now covered the three-plane model (Part 1) and the biggest collision points: Marketplace, Managed Identity, and ABAC. In Part 3, we get tactical: the 7 anti-patterns to avoid, recommended role structures for Engineering, Finance, and Security teams, RACI templates, and the 10 core governance principles every scaling organization should adopt.

Introducing the Startup-Scale Landing Zone: Get Azure right from day one

rmmartins — Mon, 16 Mar 2026 18:25:26 GMT

If you've been following this blog, you may recall the post From Zero to Hero with Azure Landing Zones, where we walked through the full Azure Landing Zone journey, from identity and RBAC to Platform and Application Landing Zones. That guide covered the what and the why. This post introduces the how: a deployable, open-source project that distills those principles into something a startup can actually ship in an afternoon.

The problem: Cloud foundations shouldn't take two months

Every startup building on Azure faces the same fork in the road:

Option A: Follow the Azure Landing Zone (ALZ) guidance. It's comprehensive, battle-tested, and designed for organizations with thousands of users. It's also 100+ modules, a multi-layered management group hierarchy, and months of work to understand, let alone implement. For a 10-person startup, it's like buying a commercial kitchen to make breakfast.

Option B: Skip governance entirely. One subscription, no policies, no budgets, no RBAC strategy. Ship fast now, deal with security debt later. This is what most startups actually do, and it works until the first security questionnaire from an enterprise customer, the first runaway cost incident, or the first az group delete that hits production.

Neither option is right. Startups need a third path: just enough governance to be secure and cost-aware from day one, without the operational overhead that slows them down.

That's exactly what the Startup-Scale Landing Zone (SSLZ) provides.

What is the Startup-Scale Landing Zone?

SSLZ is an opinionated, production-ready Azure infrastructure template that deploys in under one hour using Bicep or Terraform. It's built for teams of 5–50 engineers, typically pre-seed to Series A, who don't have a dedicated platform team but need to get Azure right from the start.

It takes the core principles from the Azure Landing Zone architecture and strips them to the essentials:

One management group, two subscriptions (prod + non-prod). That's it. No six-layer hierarchy.
Security built-in. Defender for Cloud, RBAC groups, NSG deny-all defaults, and policy enforcement, all automated.
Cost controls from day one. Budget alerts at 50%, 80%, and 100%, mandatory tagging, and reservation guidance.
An explicit graduation path. When you outgrow SSLZ, there's a step-by-step guide to migrate to the full ALZ architecture.

Important: SSLZ is not a replacement for Azure Landing Zones. It targets a different profile: very early-stage startups with a single workload, a single region, and no hybrid connectivity. For those teams, the realistic alternative isn't ALZ; it's usually no governance at all.

Architecture: Simplicity as a design principle

The architecture is deliberately minimal:

Tenant Root Group
└── mg-<yourcompany>                 ← Policies applied here
    ├── sub-<yourcompany>-prod       ← Production workloads
    └── sub-<yourcompany>-nonprod    ← Dev, staging, QA

Each subscription gets its own VNet with a standardized subnet layout:

vnet-<co>-prod (10.0.0.0/16)
├── snet-aks      10.0.0.0/20    (4,091 IPs: AKS nodes + pods)
├── snet-app      10.0.16.0/22   (1,019 IPs: App Service / Container Apps)
├── snet-data     10.0.20.0/22   (1,019 IPs: Private Endpoints)
└── snet-shared   10.0.24.0/24   (251 IPs: CI/CD agents, jump boxes)
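The IP counts in this layout follow from the prefix length minus the five addresses Azure reserves in every subnet. A short sketch with Python's standard ipaddress module reproduces the numbers and checks that the subnets fit inside the VNet without overlapping; the variable names are arbitrary.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # Azure reserves five addresses in each subnet

vnet = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "snet-aks": ipaddress.ip_network("10.0.0.0/20"),
    "snet-app": ipaddress.ip_network("10.0.16.0/22"),
    "snet-data": ipaddress.ip_network("10.0.20.0/22"),
    "snet-shared": ipaddress.ip_network("10.0.24.0/24"),
}

# Usable addresses per subnet after Azure's reservation.
usable = {name: net.num_addresses - AZURE_RESERVED_PER_SUBNET
          for name, net in subnets.items()}

# Every subnet must sit inside the VNet, and no two subnets may overlap.
inside_vnet = all(net.subnet_of(vnet) for net in subnets.values())
nets = list(subnets.values())
overlapping = any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])

print(usable)
print(inside_vnet, overlapping)
```

The same check is useful before changing the layout, since an overlapping or out-of-range prefix fails at deployment time rather than at design time.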

No hub network. No Azure Firewall. No VNet peering. Each subscription is a self-contained island.

Why no hub?

A hub-spoke topology costs a minimum of ~$1,500/month. Azure Firewall alone runs $900+/month. For a startup with a single workload in a single region, that's cost and complexity with no return. NSGs provide L3/L4 filtering for free and handle 95% of startup networking use cases. When compliance or hybrid connectivity demands centralized egress control, the graduation guide walks you through adding a hub, without touching existing resources.

Why two subscriptions?

Two subscriptions give you isolation that resource groups can't:

Cost isolation for free: no tagging gymnastics to separate prod from dev spend.
RBAC without custom roles: developers get Contributor on non-prod and Reader on prod.
Blast radius containment: az group delete in dev can't touch production.
Quota isolation: non-prod experiments don't consume prod quotas.

This is a habit that's cheap to form early and expensive to retrofit later. One primary workload per subscription; when you deploy a second independent workload, create a new subscription.

What you get out of the box

Component	What's deployed
Management Groups	Single MG with two subscriptions
Azure Policy	Microsoft Cloud Security Benchmark (audit mode), required tags (environment, team), allowed locations, diagnostic settings
Networking	VNet + 4 subnets per subscription, NSGs with deny-all-inbound default
Monitoring	Log Analytics workspace, Activity Log forwarding, 90-day retention
Security	Defender for Cloud CSPM (free), Defender for Servers P2 (prod), security contact alerts
Cost Management	Budget alerts at 50/80/100% thresholds, tag enforcement via policy
CI/CD	GitHub Actions workflows for both Bicep and Terraform, Workload Identity Federation (no secrets)
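The budget thresholds in the table reduce to a simple comparison of spend against budget. The sketch below shows that logic only; it is not how Azure Cost Management is configured, and the dollar figures are made up for the example.

```python
THRESHOLDS = (50, 80, 100)  # percent of budget, as listed in the table above

def crossed_thresholds(spend: float, budget: float) -> list:
    """Return the alert thresholds that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent = spend / budget * 100
    return [t for t in THRESHOLDS if percent >= t]

# Hypothetical month: $410 spent against a $500 budget (82%).
print(crossed_thresholds(410, 500))  # [50, 80]
```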

Security without friction

The security model avoids compliance theater. Instead of buying Entra ID P2 "to check a box," SSLZ enables Security Defaults, free MFA that blocks 99.9% of identity attacks. Instead of enforcing MCSB in Deny mode on day one (which blocks legitimate deployments and frustrates developers), it starts in Audit mode so you can understand your posture first, then selectively move to Deny as your team matures.

RBAC follows three rules:

Never assign roles to individuals: always use security groups.
Developers don't get Contributor on prod: deployments go through CI/CD.
No Owner at subscription level for non-admins: a compromised account with Owner can grant itself anything.

For CI/CD, SSLZ uses Workload Identity Federation (WIF) instead of client secrets. No credentials to store, rotate, or accidentally commit. Short-lived OIDC tokens scoped to specific repos and branches.

Cost transparency

Every recommendation includes real numbers:

"Azure Firewall: $900+/month. Skip until compliance or hybrid demands it."
"DDoS Protection Standard: $2,944/month. Azure's free basic DDoS + Front Door WAF handles most cases."
"Defender for App Service: ~$15/month. Limited value compared to other plans. Revisit later."
"Standard_D4s_v5 VM: $140/month on-demand → $90/month with 1-year RI. 36% savings."

The documentation also covers the six most common cost mistakes startups make: forgotten dev VMs, over-provisioned databases, ignoring Reserved Instances, premium storage where standard works, not using Spot VMs, and missing Dev/Test pricing. Each mistake comes with a concrete fix and code example.
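The Reserved Instance figure quoted above can be checked with one line of arithmetic. This sketch uses the two prices from the list; actual prices vary by region and over time, so treat the inputs as examples.

```python
def savings_percent(on_demand: float, reserved: float) -> float:
    """Percentage saved by moving from on-demand to reserved pricing."""
    return (on_demand - reserved) / on_demand * 100

monthly_saving = 140 - 90
print(round(savings_percent(140, 90)))  # 36
print(monthly_saving * 12)              # 600 per year for one VM
```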

Starter examples: Three startup archetypes

SSLZ ships with three production-grade example architectures, each with Bicep + Terraform implementations, deployment instructions, and realistic cost estimates:

SaaS Startup (~$330–440/month)

Container Apps + Azure SQL Elastic Pool + Redis + Key Vault. Multi-tenant with shared schema and tenant_id column. Container Apps scale to zero in non-prod. Elastic pools are 50–70% cheaper than individual databases.

AI Startup (~$1,150–1,250/month)

AKS with GPU Spot node pools (60–90% savings) + Azure OpenAI + Blob Storage + Redis for inference caching. Covers model serving framework choices (vLLM vs Triton vs TGI) and GPU node management with taints and KEDA autoscaling.

API-First Startup (~$163–345/month)

App Service with deployment slots (zero-downtime swaps) + API Management (Consumption tier, pay-per-call) + Cosmos DB + Application Insights. Includes API versioning strategy, rate limiting tiers, and Cosmos DB partitioning guidance.

When to graduate

SSLZ is explicit about its limits. You'll outgrow it when 2–3 of these signals appear simultaneously:

Signal	Why it matters
Second independent workload	Each workload gets its own subscription
Engineering team > 50 people	Different teams need different permissions and cost boundaries
Regulatory compliance (SOC2, HIPAA, PCI)	Requires specific controls SSLZ doesn't cover
Multi-region deployment	Needs centralized network management
Hybrid connectivity (VPN, ExpressRoute)	Requires a Connectivity subscription with gateways
5+ subscriptions	Policy and RBAC at scale needs MG hierarchy
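The graduation rule can be written down as a simple count over the signals in the table. The sketch below assumes a minimum of two signals, the lower end of the "2–3" range; the signal identifiers are shorthand invented for this example.

```python
GRADUATION_SIGNALS = {
    "second_workload",
    "team_over_50",
    "regulatory_compliance",
    "multi_region",
    "hybrid_connectivity",
    "five_plus_subscriptions",
}

def should_graduate(observed, minimum: int = 2) -> bool:
    """True when at least `minimum` known signals are present at once."""
    observed = set(observed)
    unknown = observed - GRADUATION_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return len(observed) >= minimum

print(should_graduate({"regulatory_compliance"}))                  # False
print(should_graduate({"regulatory_compliance", "multi_region"}))  # True
```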

The Graduation Guide provides a five-phase migration path to full ALZ: management group hierarchy, hub network + firewall, management subscription, policy hardening, and identity hardening, with risk assessments for each phase. It also includes the cost of the full platform layer ($1,500–3,000/month), so you can make an informed decision about when the investment makes sense.

Quick start: From zero to production-ready in under 1 hour

Prerequisites (5 min)

Azure CLI installed
Two Azure subscriptions (prod + non-prod)
Owner permissions on both subscriptions

git clone https://github.com/ricmmartins/sslz.git
cd sslz
az login
./scripts/validate-prerequisites.sh

Deploy with Bicep (20 min)

cd infra/bicep
cp parameters/prod.bicepparam parameters/prod.local.bicepparam
# Edit prod.local.bicepparam with your values
az deployment sub create \
  --location eastus2 \
  --template-file main.bicep \
  --parameters parameters/prod.local.bicepparam

Or Deploy with Terraform (20 min)

cd infra/terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
terraform init
terraform plan -out=tfplan
terraform apply tfplan

Verify (5 min)

az group list --query "[?contains(name, 'yourcompany')].name" -o tsv
az policy assignment list --query "[].displayName" -o tsv
az security pricing list --query "value[?pricingTier=='Standard'].{Name:name, Tier:pricingTier}" -o table

Design philosophy

Three principles guided every decision in SSLZ:

Opinionated over flexible. "It depends" isn't helpful when you have five engineers and no platform team. SSLZ makes the call (two subscriptions, no hub, deny-all NSGs, MCSB in audit mode) and tells you when to revisit.
Reversible over perfect. Every architectural decision is designed to be easy to change later. Moving subscriptions between management groups is a 10-second operation. Adding a hub VNet requires only a new deployment, not changes to existing resources. Policies can move from Audit to Deny on a schedule. Multi-region is a future add-on, not a prerequisite.
Honest about trade-offs. Instead of claiming "enterprise-grade," SSLZ says: "You'll outgrow this when..." and "Here's exactly what it costs to add the next layer." That transparency is what separates it from frameworks that are either overkill for startups or under-engineered for production.

Get involved

SSLZ is open source under the MIT license. The project welcomes contributions, especially real-world configurations from startup CTOs and platform engineers who've battle-tested the patterns.

GitHub: github.com/ricmmartins/sslz
Documentation site: startupscalelanding.zone
Previous post: From Zero to Hero with Azure Landing Zones

If you're a startup building on Azure, give SSLZ a try. Deploy it, break it, and tell us what your real infrastructure looks like, so the next team doesn't have to figure it out from scratch.

Production-grade API Gateway patterns for Microsoft Foundry

rmmartins — Thu, 29 Jan 2026 19:35:08 GMT

Most startup teams start with the simplest thing that can work. One or two apps call Microsoft Foundry model endpoints directly, traffic is predictable, and “routing” is just a config value in the app.

The gateway pattern becomes necessary when Foundry stops being “an integration” and becomes “a shared platform”. That shift shows up in a few reliable signals:

You do not fully control client code, or updating client configuration is riskier than updating a central routing configuration.
You need blue-green rollouts for model versions or fine-tuned variants without forcing every client to redeploy.
You need server-side retry and circuit breaking semantics to handle throttling and availability without duplicating logic across every app.
You need consistent token governance and usage visibility across multiple apps and consumers.

On Azure, this is commonly implemented with Azure API Management (APIM) using GenAI-aware “AI Gateway” capabilities, and it can be configured from the Foundry portal and applied per project.

What problems a gateway solves

A production gateway in front of Foundry is not about adding a hop. It is about centralizing cross-cutting concerns that otherwise get reimplemented inconsistently:

Stable API surface while deployments and backends evolve.
Consistent auth termination at the gateway, with trust reestablished from the gateway to the model backend (for example with Azure RBAC).
Token-based throttling and quotas for fairness and cost control across consumers.
Operational resiliency via backend pools, priority and weight routing, retry, and circuit breaker behavior that honors throttling signals like Retry-After.
Unified telemetry at the choke point, even when you have multiple underlying instances.
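Token-based throttling, mentioned in the list above, differs from request-based rate limiting in that each call is charged by the tokens it consumes. The following is a minimal in-memory sketch of a fixed-window tokens-per-minute quota per consumer. It illustrates the idea only and is not APIM's implementation; the class name, consumer names, and limit are hypothetical.

```python
from collections import defaultdict

class TokenQuota:
    """Fixed-window tokens-per-minute limit, tracked per consumer."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.used = defaultdict(int)  # (consumer, minute window) -> tokens used

    def try_consume(self, consumer: str, tokens: int, now_seconds: float) -> bool:
        """Charge the call to the consumer's current window if it fits."""
        key = (consumer, int(now_seconds // 60))
        if self.used[key] + tokens > self.limit:
            return False  # a gateway would answer 429 here
        self.used[key] += tokens
        return True

quota = TokenQuota(tokens_per_minute=1000)
print(quota.try_consume("app-a", 600, now_seconds=0))   # True
print(quota.try_consume("app-a", 500, now_seconds=10))  # False, would exceed 1000
print(quota.try_consume("app-b", 500, now_seconds=10))  # True, separate consumer
print(quota.try_consume("app-a", 500, now_seconds=61))  # True, new minute window
```

Keeping the counter at the gateway means one noisy consumer exhausts only its own allowance rather than the shared backend quota.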

Decoupling clients from backend topology

One secondary but important effect of introducing a gateway is that it shifts backend-specific details out of application code. Clients call a stable API owned by your platform team, while routing, credentials, and failover semantics live behind that boundary. This does not make models interchangeable, and it does not eliminate platform dependencies. What it does is contain them. As backend topology evolves, whether that means new deployments, additional subscriptions, or additional regions, those changes become operational updates rather than coordinated application rewrites.

In practice, this means your platform team owns the API contract and operational semantics, while backend providers remain an implementation detail behind that contract.

One simple mental model

Concrete gateway patterns

Choosing the right gateway pattern

The table below summarizes when each pattern is most appropriate, and what trade-offs it introduces.

Pattern	Primary goal	Isolation level	Throughput scaling	Resiliency impact	Operational complexity
Single Foundry, multi-deployment routing	Decouple clients from models and enable safe rollouts	Logical only (same resource boundary)	Limited to single resource quotas	Low to moderate (deployment-level)	Low
Multi-resource, same region, same subscription	Security segmentation, reliability, backend pooling	Resource-level	Not increased for standard tier	Moderate (backend failover)	Medium
Prioritized failover, spillover (PTU → standard)	Cost control and burst protection	Resource-level	Controlled spillover	High (explicit failover semantics)	Medium to high
Multi-subscription, same region	Quota expansion, org boundaries, central AI service	Subscription-level	Scales with number of subscriptions	High	High
Multi-region	Regional resilience, data residency, global access	Region-level	Region-bounded	Very high	High

How to read this table:

If your problem is model lifecycle and client decoupling, start with Pattern 1.
If your problem is reliability and segmentation, Patterns 2 and 3 are the usual next step.
If your problem is quota ceilings or organizational boundaries, Pattern 4 applies.
If your problem is regional resilience or global scale, Pattern 5 becomes unavoidable.

Below are the most common patterns that show up as startups move from “one app calling one deployment” to “multiple products, multiple teams, and production SLOs”.

Pattern 1: Single Foundry resource with multi-deployment routing

When you use it

You run multiple model deployments under one Foundry resource and want to control routing centrally.
You want safer rollouts (blue-green) without forcing client updates.

What it solves

Routing decisions move from clients to a single place.
You can gradually shift traffic between deployments, but you still need safe deployment practices because changing “which model” can be a breaking change from the client’s perspective.
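Gradual traffic shifting between two deployments comes down to weighted selection. The sketch below picks a deployment deterministically from a hash of a request key, so the same key always lands on the same deployment while the weights hold. The deployment names blue and green and the choice of key are assumptions for illustration; in APIM this role is played by weighted backends rather than custom code.

```python
import hashlib

def pick_deployment(request_key: str, weights: dict) -> str:
    """Choose a deployment in proportion to integer weights, stable per key."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable: bucket is always below total")

# Shift 10% of traffic to the new deployment.
weights = {"blue": 90, "green": 10}
sample = [pick_deployment(f"req-{i}", weights) for i in range(10000)]
green_share = sample.count("green") / len(sample)
print(f"green share: {green_share:.1%}")
```

Raising the green weight in steps moves more traffic over without any client change, and setting it back to zero is the rollback.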

Key operational detail

Strongly consider credential termination and reestablishment. Clients authenticate to the gateway. The gateway authenticates to the model backend via Azure RBAC.

Pattern 2: Multi-resource in the same region and same subscription

When you use it

You need security segmentation boundaries (separate keys or Azure RBAC per client).
You want an easier chargeback model.
You want failover for availability issues or operational mistakes, or you want to pair provisioned and standard deployments for spillover.

What it solves

You can treat multiple backends as active-active and load balance across instances.
You centralize retry and circuit-breaker behavior.

Critical constraint

Standard quotas are subscription-level, not instance-level. Load balancing across standard instances in the same subscription does not create additional throughput.
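A quick calculation shows why this constraint matters. The sketch assumes every subscription has the same standard quota for one model in one region; the 100,000 tokens-per-minute figure and the subscription names are hypothetical.

```python
def pooled_standard_quota(instance_subscriptions: list, quota_per_subscription: int) -> int:
    """Total standard throughput behind a gateway.

    Quota is counted once per distinct subscription, however many
    instances that subscription hosts.
    """
    return len(set(instance_subscriptions)) * quota_per_subscription

# Three instances in one subscription: no extra throughput.
print(pooled_standard_quota(["sub-a", "sub-a", "sub-a"], 100_000))  # 100000
# Three instances across three subscriptions: quota scales (Pattern 4).
print(pooled_standard_quota(["sub-a", "sub-b", "sub-c"], 100_000))  # 300000
```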

Pattern 3: Prioritized failover and planned spillover (PTU first, consumption fallback)

This is the pattern you reach for when you want to maximize utilization of dedicated capacity and still survive bursts and outages.

The AI Gateway workshop describes a “Prioritized PTU with Fallback Consumption” approach using APIM backend pools with priority and weight-based routing, combined with circuit breaker rules and retries for 429 and selected 503 cases.

Concrete implementation details from the workshop that are worth copying into your playbook:

Configure backend pool across multiple endpoints.
Add a circuit breaker rule that can trip on throttling (429) and accept Retry-After.
Use APIM policy to authenticate with managed identity and set the backend to the pool, then retry on 429 or specific 503 conditions.

This moves “resiliency logic” out of every client and into one place you can test and iterate.
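To make the routing idea concrete, here is a minimal Python sketch of priority-based selection with a circuit breaker that stays open for the Retry-After interval. It is not the workshop's APIM policy: the `Backend` class, the backend names, and the timing values are all hypothetical, and a real gateway would express this in backend pool and circuit breaker configuration rather than in code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    priority: int              # lower number is tried first (PTU = 1 here)
    open_until: float = 0.0    # circuit is open (backend skipped) until this time

    def available(self, now: float) -> bool:
        return now >= self.open_until

    def trip(self, now: float, retry_after_s: float) -> None:
        # Keep the circuit open for as long as the backend asked us to wait.
        self.open_until = now + retry_after_s


def pick_backend(pool: list[Backend], now: float) -> Backend | None:
    """Return the highest-priority backend whose circuit is closed."""
    candidates = [b for b in pool if b.available(now)]
    return min(candidates, key=lambda b: b.priority, default=None)


# Hypothetical pool: provisioned capacity first, standard as fallback.
pool = [Backend("ptu-primary", priority=1), Backend("standard-fallback", priority=2)]

first = pick_backend(pool, now=0.0)          # PTU is healthy, so it wins
pool[0].trip(now=0.0, retry_after_s=30.0)    # PTU answered 429 with Retry-After: 30
during = pick_backend(pool, now=10.0)        # traffic spills over to standard
after = pick_backend(pool, now=31.0)         # PTU is back in rotation
print(first.name, during.name, after.name)
```

The point of the sketch is the ordering: the fallback only sees traffic while the primary's circuit is open, which is what keeps provisioned capacity fully used.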

Pattern 4: Multi-subscription, same region (quota scaling and centralized service)

When you use it

You need more quota in standard deployments but must constrain models to a single region.
You are building a centralized “Microsoft Foundry as a service” model. Standard quota is subscription-bound, so capacity pooling often implies multiple subscriptions.

Implementation tips from the Azure Architecture Center guide

Prefer subscriptions backed by the same Microsoft Entra tenant for consistency in Azure RBAC and Azure Policy.
Deploy the gateway in the same region as the backends.
Consider a dedicated gateway subscription.
Ensure private endpoints are reachable across subscriptions, including cross-subscription Private Link where supported.

Pattern 5: Multi-region

When you use it

You need a service availability failover strategy (for example cross-region pairs).
You have data residency and compliance requirements.
You face mixed model availability across regions.

The Azure Architecture Center guide calls out that for business-critical architectures that must survive a complete regional outage, a global unified gateway helps eliminate failover logic from client code. It also notes the trade-offs of a single-region gateway deployment doing active-active load balancing across regions, including added latency and egress charges for cross-region calls.

Real-world scenarios this architecture supports

These are representative scenarios drawn from common production environments and directly supported by the gateway patterns and reference implementations.

Scenario A: Containing a runaway application

A company has five internal applications sharing the same Foundry environment. One application ships a prompt regression that suddenly multiplies average request size by 8x.

Without a gateway:

Token consumption spikes globally.
Other apps experience 429s and degraded latency.
Root cause takes time to identify because telemetry is scattered.

With an AI Gateway in front of Foundry:

Token-based limits are enforced per application.
The faulty app is throttled at the gateway.
Other applications continue operating normally.
The gateway telemetry immediately shows which consumer is exhausting the quota.

Outcome:

Incident blast radius is limited to one consumer.
No global outage.
Faster root cause isolation.

Scenario B: Zero-downtime model migration

A startup is migrating from one production deployment to a newer model version.

They deploy the new model alongside the old one and configure the gateway to:

Route 5 percent of traffic to the new deployment.
Keep 95 percent on the old deployment.

They observe:

Error rate.
Latency.
Token growth.

Over several days they progressively shift traffic to 100 percent without requiring any client changes.

Outcome:

No forced redeployments.
No mass client reconfiguration.
Rollback is a gateway configuration change, not an emergency code change.
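The 5/95 split in this scenario can be sketched in a few lines of Python. This is an illustration under assumptions, not how APIM weights traffic: it uses hash-based bucketing so that a given request ID always lands on the same side, and the deployment names and the `route` function are hypothetical.

```python
import hashlib


def route(request_id: str, new_percent: int) -> str:
    """Send roughly new_percent of requests to the new deployment."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable value in 0..99
    return "new-deployment" if bucket < new_percent else "old-deployment"


ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(r, 5) == "new-deployment" for r in ids) / len(ids)
print(f"share routed to new deployment at 5 percent: {share:.3f}")
```

Raising `new_percent` only ever adds requests to the new deployment, so the progressive shift and the rollback are both a single number change.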

Scenario C: Cost-controlled burst handling

A product runs steady baseline traffic on provisioned capacity and experiences unpredictable spikes.

Gateway configuration:

Priority backend pool.
Provisioned deployment as primary.
Standard deployment as secondary.
Circuit breaker honors Retry-After.

Normal operation:

Nearly all traffic hits provisioned throughput.

During spikes:

Overflow is routed to standard tier.
The gateway absorbs throttling behavior and retries.

Outcome:

Provisioned capacity is fully utilized.
Spikes are handled without hard failures.
Clients are unaware that backend routing changed.

Scenario D: Subscription quota pooling

An organization reaches standard tier quota ceilings in a single subscription.

They deploy Foundry resources across multiple subscriptions and place a single gateway in front.

Gateway behavior:

Distributes requests across subscriptions.
Applies unified token governance.
Exposes one API endpoint to all internal teams.

Outcome:

Aggregate usable quota increases.
Organizational boundaries are preserved.
Clients remain unaware of backend topology.
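The arithmetic behind this scenario, and behind the Pattern 2 constraint, is worth spelling out. The sketch below assumes every subscription has the same standard quota for one model in one region; the 100,000 tokens-per-minute figure and the resource names are made up for illustration.

```python
def aggregate_quota_tpm(backends: list, quota_per_subscription_tpm: int) -> int:
    """Usable standard quota grows with distinct subscriptions, not with instances."""
    subscriptions = {b["subscription"] for b in backends}
    return len(subscriptions) * quota_per_subscription_tpm


QUOTA_TPM = 100_000  # hypothetical per-subscription quota for one model and region

same_subscription = [
    {"name": "foundry-a", "subscription": "sub-1"},
    {"name": "foundry-b", "subscription": "sub-1"},
]
multi_subscription = [
    {"name": "foundry-a", "subscription": "sub-1"},
    {"name": "foundry-c", "subscription": "sub-2"},
]

print(aggregate_quota_tpm(same_subscription, QUOTA_TPM))   # two instances, one quota
print(aggregate_quota_tpm(multi_subscription, QUOTA_TPM))  # two subscriptions, two quotas
```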

Operational playbook

This is the part that separates “it works” from “it survives production”.

1. Authentication strategy

Recommended default

Terminate client auth at the gateway.
Reestablish gateway-to-backend authorization via Azure RBAC rather than passing through client secrets.

The AI Gateway workshop provides a concrete example using authentication-managed-identity and setting the Authorization header for the backend call.

Guardrail

If you choose pass-through client credentials, ensure clients cannot bypass the gateway or model restrictions.

2. Token throttling and fairness

You want limits that match how LLMs consume capacity and budget.

APIM GenAI capabilities emphasize controlled token limits and monitoring for cost efficiency.
Foundry AI Gateway governance scenarios explicitly include configuring token limits for models at the project level.

Use token throttling as your primary fairness control, then layer request-rate limits if needed.
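As a rough illustration of per-consumer token fairness, here is a fixed-window limiter in Python. It is a sketch of the idea only: the `TokenLimiter` class, the consumer names, and the 1,000 tokens-per-minute budget are hypothetical, and a production gateway would rely on its built-in token limit policies instead.

```python
from collections import defaultdict


class TokenLimiter:
    """Fixed one-minute window token budget, tracked per consumer."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.used = defaultdict(int)   # (consumer, minute window) -> tokens spent

    def allow(self, consumer: str, tokens: int, now_s: float) -> bool:
        window = (consumer, int(now_s // 60))
        if self.used[window] + tokens > self.limit:
            return False               # the gateway would answer 429 here
        self.used[window] += tokens
        return True


limiter = TokenLimiter(tokens_per_minute=1000)
print(limiter.allow("app-a", 800, now_s=5))    # within budget
print(limiter.allow("app-a", 300, now_s=10))   # would exceed app-a's budget
print(limiter.allow("app-b", 900, now_s=10))   # app-b is unaffected
print(limiter.allow("app-a", 300, now_s=65))   # new window, budget reset
```

Because the budget is keyed by consumer, a runaway application exhausts only its own allowance, which is the containment behavior described in Scenario A.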

3. Failover semantics

Two rules that prevent most “self-inflicted outages”:

Honor Retry-After from the backend when implementing failover and circuit breaker behavior. Do not continuously hit a throttled endpoint returning 429.
Prefer gateway-side retry and circuit breaking to avoid repeated code and to keep one place to test.

The workshop shows a pragmatic retry condition on 429 and selected 503, combined with backend pool routing and a circuit breaker that can trip on 429 while checking Retry-After.
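The first rule is easy to get wrong in hand-written clients, so here is a small Python sketch of a retry loop that waits for the interval the backend requests. It simplifies in two labeled ways: it retries every 503 rather than selected 503 conditions, and it only understands the numeric form of Retry-After. The function names and the fake backend are hypothetical.

```python
import time


def retry_after_seconds(headers: dict, default_s: float = 1.0) -> float:
    """Read Retry-After as seconds; fall back if it is absent or not numeric."""
    try:
        return max(0.0, float(headers.get("Retry-After", "")))
    except ValueError:
        return default_s


def call_with_retry(send, max_attempts: int = 3, sleep=time.sleep):
    """Call send() until it stops returning 429/503 or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        retryable = status in (429, 503)
        if not retryable or attempt == max_attempts:
            return status, body
        # Wait as long as the backend asked before trying again.
        sleep(retry_after_seconds(headers))


# Demo with a fake backend: throttled once, then healthy.
responses = iter([(429, {"Retry-After": "2"}, ""), (200, {}, "ok")])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the loop testable without real delays, which matters if this logic lives in one place that you test and iterate on.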

4. Observability and consumption tracking

A gateway is uniquely positioned to publish telemetry across all consumed models to a single store, which makes unified dashboarding and alerting easier.

APIM’s GenAI positioning highlights token monitoring as part of “cost efficiency”.
The workshop navigation includes model monitoring and consumption tracking as first-class steps in the AI Gateway journey.

Operationally, decide up front what you will dimension your telemetry by (project, tenant, application, environment) and enforce those identifiers at the gateway.
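One way to picture "enforce those identifiers at the gateway" is a check that rejects requests missing any dimension and aggregates token usage by the full set. The Python sketch below assumes the dimensions arrive as request headers; the `x-` header names and the sample values are hypothetical.

```python
from collections import Counter

REQUIRED_DIMENSIONS = ("project", "tenant", "application", "environment")


def missing_dimensions(headers: dict) -> list:
    """Return the dimensions that the caller failed to supply."""
    return [d for d in REQUIRED_DIMENSIONS if not headers.get(f"x-{d}")]


def record(usage: Counter, headers: dict, total_tokens: int) -> bool:
    """Aggregate tokens per dimension tuple; refuse requests with gaps."""
    if missing_dimensions(headers):
        return False   # reject before the request is forwarded
    key = tuple(headers[f"x-{d}"] for d in REQUIRED_DIMENSIONS)
    usage[key] += total_tokens
    return True


usage = Counter()
tagged = {"x-project": "search", "x-tenant": "contoso",
          "x-application": "chat-ui", "x-environment": "prod"}
record(usage, tagged, 1200)
record(usage, tagged, 300)
record(usage, {"x-project": "search"}, 500)   # missing identifiers, not counted
print(usage)
```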

5. APIOps: Treat gateway configuration as code

Even if you configure the first version in the portal, production systems need repeatability:

Use a code-driven workflow for policies and configuration so routing and governance changes are reviewed and promoted like any other production change.
If you adopt a federated model, APIM Workspaces are positioned to help organizations manage APIs more productively and securely.
Keep an eye on the APIM changelog and GenAI feature updates because gateway capabilities are evolving quickly.

When not to add a gateway

The Architecture Center guide is explicit: If controlling client configuration is as easy as controlling gateway routing, the added reliability, security, cost, maintenance, and performance impact might not be worth it.

Also, if you are using a single instance with multiple deployments primarily to simulate identity segmentation, you might be better served by multiple instances with distinct Azure RBAC boundaries instead of pushing that complexity into gateway logic.

Closing thought

A gateway is not a prerequisite for Foundry. It is an operational maturity step.

When Foundry usage becomes multi-tenant, SLO-driven, and quota-sensitive, the gateway stops being “extra architecture” and becomes the place you express your platform intent. Auth boundaries. Token governance. Failover semantics. Telemetry. And a repeatable APIOps process to keep it all sane as the system evolves.

When and why startups add a gateway in front of Microsoft Foundry

rmmartins — Tue, 27 Jan 2026 03:24:31 GMT

Note: This post focuses on when and why startups begin adopting a gateway in front of Microsoft Foundry. In a follow-up article, we’ll go into a technical deep dive, covering design decisions, operational tradeoffs, latency considerations, observability, and patterns used in production-scale environments.

Most teams don’t hit scaling challenges with Microsoft Foundry on day one.

Early on, things are simple. One or two applications call Foundry directly. Traffic is predictable. Model experimentation moves fast. Everything works, and there’s no reason to add extra layers.

Then adoption grows. More applications start calling the same models. Traffic becomes spiky. Teams want better visibility into usage. Questions about rate limits, authentication, and how to evolve models over time begin to surface.

This is usually the moment when teams start asking: “Do we need some kind of control layer in front of Foundry?”

The signals that start to show up

Across many startups, the same patterns tend to emerge as Foundry usage scales:

Multiple clients and services calling the same Foundry endpoints
The need for consistent rate limiting and access control
A desire to evolve models or deployments without touching every client
Limited visibility into who is calling what, and how often

None of these are problems at small scale. But together, they create friction as usage grows.

A pattern we often see working well

A common pattern at this stage is placing a gateway in front of Microsoft Foundry APIs.

Client applications call a single gateway endpoint, where policies such as authentication, rate limits, and routing are applied before requests are forwarded to Foundry model deployments.

Rather than having every application talk directly to Foundry, teams introduce a control layer that sits between clients and Foundry.

On Azure, this is often implemented using API Management with GenAI capabilities.

This gateway does not replace Foundry. Foundry remains the model and AI platform. The gateway simply becomes the entry point for client traffic.

What this enables in practice

When teams introduce a gateway layer, a few things become much easier:

A single, stable API surface for applications, even as models or deployments evolve
Centralized throttling and authentication, instead of per-client logic
Policy-based routing across models or backends without changing clients
Improved request-level observability into usage patterns, latency, and errors

Importantly, this structure lets teams scale without slowing down experimentation. Model teams can continue to iterate, while platform concerns stay centralized.

What this pattern is not

It’s worth calling out what this approach is not:

It’s not required on day one
It’s not mandatory for every startup
It’s not about adding complexity early

Many teams run successfully without a gateway for a long time. This pattern becomes useful when scale, team size, or operational needs make direct integrations harder to manage.

When teams usually consider this

From experience, teams tend to explore this pattern when:

Foundry usage spans multiple applications or teams
Rate limits and quotas need consistent enforcement
There’s a desire to future-proof model or deployment changes
Observability and governance start to matter more

If those conversations are already happening, it’s often a good time to look at a gateway approach.

How this looks on Azure

On Azure, this pattern is commonly implemented using:

Azure API Management as the gateway
AI-aware policies for rate limiting, routing, and governance
Microsoft Foundry as the backend model platform

The architecture stays flexible. Teams can start simple and add capabilities over time as needs evolve.

Closing thoughts

This pattern is less about tooling and more about timing.

Adding a gateway too early can slow teams down. Adding it too late can make change painful. The right moment is usually when Foundry usage starts to feel like a shared platform rather than a single experiment.

For teams approaching that stage, a gateway can provide structure without taking away speed.

Azure has three permission systems, and you're probably confusing them

rmmartins — Wed, 06 May 2026 20:10:11 GMT

Series: Azure Governance for Digital Natives and Startups: This is Part 1 of a 3-part series on Azure governance for digital-native companies scaling on Azure.

New to Azure's identity and subscription model? This post assumes you already know how tenants, subscriptions, and Entra ID fit together. If that's fuzzy, read Demystifying Microsoft Entra ID, Tenants and Azure Subscriptions. That post covers the architecture; this one covers the permissions that live inside it.

Part 1 (this post): The three-plane model: Identity, Resources, and Billing

Part 2: Marketplace, Managed Identity, and where the planes collide

Part 3: Anti-patterns, role structures, and the 10 principles of Azure governance

Azure is a powerful cloud platform, but its governance model is widely misunderstood, especially in fast-moving, engineering-led organizations.

After working with dozens of digital-native customers (AI startups, SaaS platforms, companies scaling from zero to millions in Azure spend), I've seen the same confusion play out over and over. Engineers can't see MACC credits. Finance can't see workloads. Global Admins think they own everything. And Marketplace purchases happen without anyone in Finance knowing.

The root cause is always the same: Azure is governed by three completely separate permission systems, and most teams treat it like one.

If you're a customer moving fast on Azure, you've likely heard these questions:

"Why can't my engineering Owner see MACC credits?"
"Why can't a Billing Contributor deploy a VM?"
"Why doesn't Global Admin let me access subscriptions?"
"Why can a Contributor deploy AKS but not buy Snowflake?"
"Why does Cost Management Reader show cost but not credit balance?"

These questions come up with nearly every customer I work with: AI companies consuming Azure OpenAI at scale, SaaS companies running global AKS footprints, and digital natives under Microsoft Azure Consumption Commitments (MACC).

This guide breaks down the entire model with practical patterns and deep insight into each plane — so these questions are never confusing again.

Why digital natives struggle with this

Before diving into the technical model, it's worth understanding why this causes so much friction in digital-native companies specifically. These problems hit startups and scaling companies harder than traditional enterprises for three reasons:

Speed over governance. Engineering-led companies prioritize shipping over process. Governance is added retroactively, often after something goes wrong.
Flat org structures. Without clear Platform, Finance, and Security functions, the same people end up with roles across multiple planes, creating exactly the kind of role sprawl the three-plane model was designed to prevent.
MACC commitments. Digital natives under MACC have a financial relationship with Azure that most team members don't even know exists. When engineers can't see MACC burn and finance can't see resource usage, nobody has the full picture.

The result is predictable:

Role	What They Expect	What They Actually Get
Engineers	"I'm Owner, I should see everything, including billing"	RBAC gives full resource control but zero billing visibility
Finance	"I need to see what's running so I can forecast"	Billing Reader shows credits and invoices but not workloads
Security	"I'm Global Admin, I have total control"	Entra controls identity but not resources or billing
Procurement	"I need to buy Marketplace software for the team"	Marketplace purchases require billing roles, not RBAC
Leadership	"I want a single dashboard for cost, resources, and credits"	No single role spans all three planes; you need a combination

When these expectations go unaddressed: engineers get billing access "just to see costs" (creating financial risk), Marketplace purchases happen without finance oversight, and Global Admin is treated as the "master key" when it controls only one of three planes.

The fix isn't more permissions. It's the right permissions in the right plane for the right people.

The three-plane model

Everything in Azure governance flows from this single truth:

Plane	Controls	Example Roles	See Billing?	Deploy Resources?	Manage Identity?
Microsoft Entra (Identity)	Users, groups, MFA, PIM, Conditional Access	Global Admin, Groups Admin, PIM Admin	❌ No	❌ No	✅ Yes
Azure RBAC (Resources)	VMs, AKS, Storage, AOAI, networking, policies	Owner, Contributor, Reader	❌ No	✅ Yes	❌ No
Billing / Commerce (Financial)	Invoices, credits, MACC, payments, Marketplace purchases	Billing Owner, Billing Reader	✅ Yes	❌ No	❌ No

Three planes. Zero overlap. A role in one plane grants zero access in the others.

Entra Global Admin can't access subscriptions.
Subscription Owner can't see the MACC balance.
Billing Account Owner can't deploy resources.

This separation is by design. Once your company internalizes it, governance becomes dramatically more predictable.

Plane 1: Microsoft Entra (Identity Plane)

Security, authentication, authorization, administrative boundaries.

Microsoft Entra (formerly Azure AD) is the authoritative identity provider for Azure. It governs identity, authentication, Conditional Access, PIM (Privileged Identity Management), group membership, and tenant-wide administrative policies. Entra is the security boundary for the entire tenant.

💡 Common misunderstanding: "I'm Global Admin, why can't I access subscriptions?"

Because Entra roles do not grant Azure RBAC permissions by default. This behavior is intentional and foundational. A compromised Global Admin cannot delete all subscriptions. A compromised Subscription Owner cannot compromise directory security. Identity and infrastructure operate independently for resiliency.

What Entra roles can do

Create and manage users
Manage MFA & Conditional Access
Approve PIM requests
Manage security settings
Create/assign groups (which can then hold RBAC roles)
Manage enterprise applications, OIDC apps, etc.

What Entra roles cannot do

Action	Allowed?
Deploy resources	❌ No
Access subscriptions	❌ No
View MACC credits	❌ No
Make Marketplace purchases	❌ No
Modify billing profiles	❌ No
Change RBAC roles	❌ No
Access data or storage accounts	❌ No

Most relevant Entra roles for startups

Entra Role	Purpose
Global Administrator	Full directory control (identity, security)
Privileged Role Administrator	Manages privileged role assignments
Groups Administrator	Creates and manages groups (often used for RBAC assignments)
Conditional Access Administrator	Manages CA policies
Authentication Administrator	Controls authentication settings
Security Administrator	Manages security policies and alerts

Key insight: Entra governs identity and security, not cloud resources or billing. Because Entra manages groups, and groups are often used for RBAC assignments, Entra is the root of who can be given access, but not what access they have. This is where many organizations misunderstand the boundary.

Plane 2: Azure RBAC (Resource Plane)

Everything engineering touches: workloads, clusters, deployments, pipelines, resources.

Azure RBAC is the backbone of the Azure operational model. It controls all deployments (IaC, CLI, Portal, API), resource creation and modification, monitoring and diagnostics, Key Vault, Storage, Networking, AKS cluster operations, and Azure OpenAI deployments: in short, everything under Azure Resource Manager (ARM).

RBAC scopes

RBAC can be assigned at: Tenant root → Management group → Subscription → Resource group → Individual resource → Sub-resource (e.g., Key Vault secret).
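Assignments made at a higher scope are inherited by everything beneath it. A small Python sketch of that check, limited to subscription scope and below where the parent scope is a prefix of the child resource ID (management group scopes use a different ID shape and are not covered here); the subscription ID and resource names are placeholders:

```python
def scope_covers(assignment_scope: str, resource_id: str) -> bool:
    """True if a role assigned at assignment_scope applies to resource_id."""
    # ARM resource IDs are compared case-insensitively.
    scope = assignment_scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    # Require a path-segment boundary so "rg-a" does not match "rg-ab".
    return target == scope or target.startswith(scope + "/")


SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"
vault = f"{SUB}/resourceGroups/rg-app/providers/Microsoft.KeyVault/vaults/kv-app"

print(scope_covers(SUB, vault))                              # subscription covers it
print(scope_covers(f"{SUB}/resourceGroups/rg-app", vault))   # resource group covers it
print(scope_covers(f"{SUB}/resourceGroups/rg-ap", vault))    # different group, no match
```

This is why the narrowest workable scope is usually the right choice: a Contributor assignment at the subscription silently applies to every resource group created later.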

RBAC role behavior

Role	Can Deploy?	Can View Usage Cost?	Can View Billing/MACC?
Owner	✅ Yes	✅ Yes	❌ No
Contributor	✅ Yes	✅ Yes	❌ No
Reader	❌ No	✅ Yes (limited)	❌ No
Cost Management Reader	❌ No	✅ Yes	❌ No
User Access Admin	❌ No	❌ No	❌ No

The critical point: RBAC cannot see billing. RBAC cannot view MACC. RBAC cannot read invoices. RBAC cannot approve Marketplace purchases. Even Owner, the highest role in the resource plane, is blind to billing.

Plane 3: Azure Billing/Commerce (Financial Plane)

Governed by the Microsoft Commerce Platform, not Azure Resource Manager.

This plane governs the financial relationship between the customer and Microsoft: billing accounts, invoices, credits (MACC, Azure credits, grants), commitments, discounts, payment methods, invoice sections, Marketplace SaaS purchases, reservations & savings plans, and private offers. Commerce roles live in an entirely different system from RBAC.

Common billing roles

Role	Can see credits?	Can deploy?	Notes
Billing Account Owner	✅ Yes	❌ No	Full financial authority
Billing Contributor	✅ Yes	❌ No	Can update payment methods
Billing Reader	✅ Yes	❌ No	Most finance teams use this
Invoice Section Owner	✅ Yes	❌ No	Scoped financial management

What billing roles can see: MACC balance, credits, invoices, payment history, reservations & savings plans (financial side), and Marketplace purchase capabilities.

What billing roles cannot do: deploy anything, modify RBAC, access resources, see workloads, change policy, or access cost analysis at resource group level.

Billing is where MACC lives. MACC (Microsoft Azure Consumption Commitment) visibility is tied to Billing Account Owner, Billing Account Contributor, and Billing Reader. Even a subscription Owner cannot see MACC burn rate. This single point causes confusion in almost every startup onboarding Azure.

Full comparison matrix

When you need to answer "who can see what?", use this table:

Data type	System	Who can see it
Resource usage cost	ARM (RBAC)	Cost Mgmt Reader, Owner, Contributor
Resource inventory	ARM (RBAC)	Owner, Contributor, Reader
Budgets & cost alerts	ARM (RBAC)	Owner, Contributor, Cost Mgmt Reader
Azure OpenAI cost analysis	ARM (RBAC)	RBAC roles
MACC credit balance	Commerce Platform	Billing roles only
Invoices & payments	Commerce Platform	Billing roles only
Marketplace private offers	Commerce Platform	Billing roles only
Commercial discounts	Commerce Platform	Billing roles only

💡 If your engineering lead says "I can see costs" and your CFO says "I can see costs", they are looking at different data from different systems. Both are right. Neither has the full picture.
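The matrix can be restated as a lookup that answers "what is this person blind to?". The Python sketch below copies a few rows from the table; it models the article's simplified view, not an Azure API, and expanding "Billing roles only" into three named roles is an assumption based on the billing roles listed earlier.

```python
BILLING_ROLES = {"Billing Account Owner", "Billing Account Contributor", "Billing Reader"}

# data type -> (system, roles that can see it), restated from the matrix
VISIBILITY = {
    "Resource usage cost": ("ARM (RBAC)", {"Cost Management Reader", "Owner", "Contributor"}),
    "Resource inventory": ("ARM (RBAC)", {"Owner", "Contributor", "Reader"}),
    "MACC credit balance": ("Commerce Platform", BILLING_ROLES),
    "Invoices & payments": ("Commerce Platform", BILLING_ROLES),
}


def blind_spots(roles_held: set) -> list:
    """Return the data types that none of the held roles can see."""
    return sorted(
        data_type
        for data_type, (_system, allowed) in VISIBILITY.items()
        if not (roles_held & allowed)
    )


print(blind_spots({"Owner"}))                      # engineering lead
print(blind_spots({"Billing Reader"}))             # finance
print(blind_spots({"Owner", "Billing Reader"}))    # combination covers both systems
```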

The #1 source of confusion: Cost Management Reader vs. Billing Reader

This is the single most frequent misunderstanding in Azure governance. These two roles sound similar. They are completely different systems.

Cost Management Reader (RBAC Plane)

Can see: usage-based resource cost, cost by tags, cost by resource, cost forecast, budgets & alerts.

Cannot see: credits, invoices, payments, MACC, private offers, or contract terms.

Billing Reader (Commerce Plane)

Can see: invoices, credits, payments, MACC balance, Marketplace transaction history.

Cannot see: resource-level cost breakdown, cost by tags, subscription usage trends, or resource inventory.

Data type	Where it lives	Who can see it
Resource usage cost	Azure Cost Management (ARM)	Cost Mgmt Reader, Owner, Contributor
Budgets & cost alerts	ARM	Owner, Contributor, Cost Mgmt Reader
MACC credit balance	Commerce Platform	Billing roles only
Invoices	Commerce Platform	Billing roles only
Marketplace private offers	Commerce Platform	Billing roles only
Commercial discounts	Commerce Platform	Billing roles only

Cost visibility (usage-based cost) comes from RBAC. Billing visibility (credits, invoices, MACC) comes from Commerce. These are two completely different datasets. When you understand this distinction, half of the "why can't I see…?" questions answer themselves.

Quick start: where to set this up

Here's exactly where each plane is configured, in the Portal and via CLI.

Microsoft Entra (Identity Plane)

Portal: Azure Portal → Microsoft Entra ID → Roles and administrators

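# List activated Entra directory roles
az rest --method GET --url "https://graph.microsoft.com/v1.0/directoryRoles"

# Add a user to a directory role (use a role id returned by the call above)
az rest --method POST \
  --url 'https://graph.microsoft.com/v1.0/directoryRoles/{role-id}/members/$ref' \
  --headers "Content-Type=application/json" \
  --body '{"@odata.id": "https://graph.microsoft.com/v1.0/directoryObjects/<user-object-id>"}'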

Azure RBAC (Resource Plane)

Portal: Subscription → Access Control (IAM) → Add role assignment

# Assign Contributor at subscription scope
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Contributor" \
  --scope "/subscriptions/{subscription-id}"

# Assign Cost Management Reader at resource group scope
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Cost Management Reader" \
  --scope "/subscriptions/{sub-id}/resourceGroups/{rg-name}"

Azure Billing/Commerce (Financial Plane)

Portal: Azure Portal → Cost Management + Billing → Billing scopes → select billing account → Access control (IAM)

# List billing accounts
az billing account list --output table

# Assign Billing Reader via REST API
az rest --method PUT \
  --url "https://management.azure.com/providers/Microsoft.Billing/billingAccounts/{billing-account-id}/billingRoleAssignments/{id}?api-version=2024-04-01" \
  --body '{"properties":{"principalId":"{user-object-id}","roleDefinitionId":"/providers/Microsoft.Billing/billingAccounts/{billing-account-id}/billingRoleDefinitions/{billing-reader-role-id}"}}'

What's next → This post established the foundation: Azure's three permission planes are separate by design. But the real complexity begins where these planes intersect.

In Part 2, we'll explore Marketplace governance, where resource deployment meets financial authority, along with Managed Identity, the one construct that bridges two planes, and ABAC for advanced conditional governance.

Azure capacity planning: Using quotas, reservations, VMSS Instance Mix, and Compute Fleet

rmmartins — Thu, 06 Nov 2025 16:47:25 GMT

Introduction

Over the past few months, I’ve been helping several digital-native customers navigate capacity constraints while scaling AI and compute-intensive workloads on Azure. Many teams run into the same frustrating message:

“SkuNotAvailable – The requested size is currently not available in the location.”

This post summarizes the strategy I’ve been using to help customers design around these challenges by combining Quota Groups, Capacity Reservations (ODCR), VMSS Instance Mix, and Compute Fleet. These tools don’t create capacity where none exists, but together, when paired with proactive alerts, they form a practical playbook for scaling reliably through regional constraints.

Quota vs. Capacity: What’s the difference?

Concept	What It Is	Who Controls It	Can You Fix It Yourself?
Quota	A logical limit on how many vCPUs or specific VM series you can deploy.	Microsoft (adjustable on request).	✅ Yes, request an increase.
Capacity	The physical availability of hardware in the datacenter.	Azure datacenter (supply and utilization).	❌ No, if no servers exist, no deployment will succeed.

Example: You have 300 vCPUs of quota for the D-series in East US 2. You try to deploy 30 D8as_v5 VMs (240 vCPUs, within quota) and get a failure. You open a support request and find:

Your quota is fine
But the region has no physical capacity for D8as_v5

Even if Microsoft raised your quota to 1,000 vCPUs, the deployment would still fail because quota ≠ capacity.

Quota issue: You’ll see errors like OperationNotAllowed or QuotaExceeded.
Capacity issue: The message will be SkuNotAvailable or AllocationFailed.

If you see a quota error, open the Usage + quotas blade and request an increase. If it’s a capacity error, switching zones, SKUs, or regions, or using VMSS Instance Mix or Compute Fleet is your best next step.
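If your deployment pipeline surfaces the error code, the triage above can be automated. A minimal Python sketch using only the four codes named in this section; treat the mapping as a first guess, since some codes (OperationNotAllowed in particular) can be raised for reasons other than quota, and the `triage` function itself is hypothetical.

```python
QUOTA_ERRORS = {"OperationNotAllowed", "QuotaExceeded"}
CAPACITY_ERRORS = {"SkuNotAvailable", "AllocationFailed"}


def triage(error_code: str) -> str:
    """Map a deployment error code to the kind of problem and a next step."""
    if error_code in QUOTA_ERRORS:
        return "quota: request an increase in Usage + quotas"
    if error_code in CAPACITY_ERRORS:
        return "capacity: try another zone, SKU, or region, or use Instance Mix or Compute Fleet"
    return "unknown: read the full error message"


for code in ("QuotaExceeded", "SkuNotAvailable", "Conflict"):
    print(code, "->", triage(code))
```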

“Quota is a number on paper. Capacity is what’s physically sitting in the racks.”

Strategy 1: Quota management and Quota Groups

Azure applies vCPU quotas by region and VM family (e.g., Dsv5, Esv5). Quota Groups provide a consolidated way to monitor and manage these logical limits across families.

Quota limits are easy to overlook until automation or scale pipelines fail. AI-heavy startups often discover too late that they’ve maxed out their quota family.

Best practices:

Monitor with Quota Group alerts: Use Quota Alerts (preview) to automatically notify you when usage reaches thresholds (e.g., 80%). Alerts integrate with Azure Monitor and Action Groups.
Request increases proactively: Portal path: Subscriptions → Usage + quotas → Request increase. Most CPU SKUs are approved quickly; GPUs can take longer.
Plan by family, not by SKU: If you only check “D8as_v5 usage,” you may miss that the entire D-series family is at its quota limit.
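Planning by family comes down to summing vCPUs across every size in the family and comparing the total to the family limit. A short Python sketch, assuming a 300 vCPU family quota and the 80% alert threshold mentioned above; the VM counts are made up, while the vCPU counts per size (8 for D8as_v5, 16 for D16as_v5) are the published sizes.

```python
VCPUS_PER_SIZE = {"Standard_D8as_v5": 8, "Standard_D16as_v5": 16}


def family_vcpus(vm_counts: dict) -> int:
    """Total vCPUs consumed across all sizes in one VM family."""
    return sum(VCPUS_PER_SIZE[size] * count for size, count in vm_counts.items())


def over_threshold(used: int, limit: int, threshold: float = 0.8) -> bool:
    return used >= threshold * limit


def fits(used: int, limit: int, size: str, count: int) -> bool:
    """Would deploying count more VMs of this size stay within the family quota?"""
    return used + VCPUS_PER_SIZE[size] * count <= limit


running = {"Standard_D8as_v5": 20, "Standard_D16as_v5": 6}   # hypothetical fleet
limit = 300                                                   # hypothetical family quota
used = family_vcpus(running)                                  # 20*8 + 6*16

print(used, over_threshold(used, limit), fits(used, limit, "Standard_D8as_v5", 10))
```

Looking only at the D8as_v5 rows would show 160 of 300 vCPUs in use and hide that the family as a whole is already past the alert threshold.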

Strategy 2: Capacity Reservations (ODCR)

A Capacity Reservation (formally On-Demand Capacity Reservation, ODCR) lets you pre-book physical infrastructure in a specific region, zone, and VM size. You’re reserving capacity, not committing to a term or discount. Azure holds those servers for your subscription, ensuring your workloads can always start.

Capacity Reservation vs. Reserved Instance (RI)

Aspect	Capacity Reservation (ODCR)	Reserved Instance (RI)
Purpose	Guarantees capacity (hardware availability).	Locks in price (discounted rate).
Scope	Specific region, zone, and VM size.	Region and VM family.
Billing	Pay-as-you-go, no term commitment.	1 or 3-year fixed term.
Capacity Guarantee	✅ Yes, hardware is held for you.	❌ No, no guarantee.
Price Benefit	❌ None, PAYG rate.	✅ Up to ~70% discount.
Flexibility	Modify or cancel anytime.	Bound to term.

In short:

ODCR = “Hold my spot in the datacenter.”

RI = “Give me a discount because I’ll keep using it.”

You can use both: ODCR for capacity, RI for savings.

Example: A startup consistently runs 20× D16as_v5 VMs nightly for training. They reserve that capacity (ODCR) and apply RIs for discounts, ensuring predictable performance and cost.

Limitations:

You can’t reserve SKUs already out of stock.
ODCR doesn’t autoscale; it holds your baseline.
Best for core workloads, not ephemeral jobs.

Strategy 3: VMSS Instance Mix

Virtual Machine Scale Set (VMSS) Instance Mix is a feature of VMSS Flex that enables capacity-aware scaling across multiple VM sizes, and even across different purchase options (Standard and Spot). When you define more than one acceptable VM size, Azure automatically chooses whichever has capacity available during scale-out.

Learn more:

VMSS Instance Mix – Overview

Example: Here’s a simplified configuration snippet from an ARM template using Instance Mix. The scale set SKU is set to Mix, and the acceptable sizes are listed under skuProfile:

"sku": { "name": "Mix" },
"properties": {
  "skuProfile": {
    "vmSizes": [
      { "name": "Standard_D8as_v5" },
      { "name": "Standard_E8as_v5" },
      { "name": "Standard_F8s_v2" }
    ]
  }
}

VMSS Instance Mix helps you survive temporary SKU shortages by dynamically selecting the next available size, while Spot Priority Mix lets you blend Spot and Standard instances to reduce cost and improve resilience. This makes it ideal for large-scale app tiers, batch processing, and AI inference.

Limitations:

Works across zones, not regions.
Doesn’t mix Spot + Standard in the same pool.
Doesn’t reserve hardware capacity; it only improves allocation success rates.

Strategy 4: Azure Compute Fleet

Azure Compute Fleet can deploy up to 10,000 VMs across multiple SKUs, zones, and (in preview) regions. You define acceptable SKUs, and Azure picks the ones that have capacity.

Learn more:

Azure Compute Fleet – Overview

Fleet automatically:

Tries alternate SKUs (D8as_v5 → E8as_v5).
Expands to other zones or regions.
Combines Standard and Spot instances.

In short, it automates the “try this, then that” logic, improving your odds of successful deployment.
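That “try this, then that” logic looks roughly like the Python sketch below. It only illustrates the idea and is not how Fleet is implemented: the search order (all locations for one SKU before moving to the next SKU), the `has_capacity` callback, and the availability data are assumptions made for the example.

```python
from itertools import product


def first_available(skus: list, locations: list, has_capacity):
    """Return the first (sku, location) pair with capacity, or None if all are full."""
    for sku, location in product(skus, locations):
        if has_capacity(sku, location):
            return sku, location
    return None


# Hypothetical snapshot: only E8as_v5 in West US 2 has room right now.
available = {("Standard_E8as_v5", "westus2")}

choice = first_available(
    skus=["Standard_D8as_v5", "Standard_E8as_v5"],
    locations=["eastus2", "westus2"],
    has_capacity=lambda sku, loc: (sku, loc) in available,
)
print(choice)
```

When the snapshot is empty the function returns None, which mirrors the limitation that searching smarter cannot help if every option is full.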

Example: A rendering studio needs 2,000 VMs nightly. Fleet dynamically uses D8s_v5, D16s_v5, or E8s_v5 across East US 2 and West US 2, depending on live availability.

Limitations:

Fleet doesn’t create capacity; it just searches smarter. If every zone and region is full, it still fails. Ideal for AI training, batch jobs, rendering, or HPC, not for stateful services.

When to use what

Scenario	Best tool	What it solves
Logical limits before deployment	Quota Groups + Alerts	Prevent hitting soft limits.
Guaranteed baseline	Capacity Reservation (ODCR)	Reserve real hardware.
Managed autoscaling	VMSS Instance Mix	Scale out despite partial shortages.
Large-scale/bursty workloads	Azure Compute Fleet	Try alternate SKUs and regions.
GPU/high-demand SKUs	ODCR + Fleet	Reserve base, burst flexibly.

Real Talk: There’s no magic when a datacenter is full

Let’s be transparent: if a region has no physical servers available, no tool can make capacity appear.

Quota Groups remove logical blockers.
Capacity Reservations secure what you need.
Compute Fleet and VMSS Instance Mix increase the odds of success.

Together, they maximize probability, but none can override a physically full region.

The Azure capacity strategy flow

Final thoughts

For fast-scaling digital-native companies, the right question isn’t “How do I guarantee capacity?” It’s “How do I design for capacity uncertainty?” Start by putting the basics on autopilot:

Configure Quota Group alerts to prevent silent blockers.
Use Capacity Reservations (ODCR) to secure your baseline compute.
Add elasticity through VMSS Instance Mix and, when flexibility allows, Compute Fleet.
Monitor everything with Azure Monitor alerts — from quotas and reservations to scale-out failures and Fleet allocation health.

💡 Pro tip: Combine Quota Group Alerts, Reservation coverage monitoring, and VMSS/Fleet deployment telemetry in Azure Monitor to detect issues early. The faster you know what kind of failure you’re hitting, the faster you can act.

Accept that capacity is finite, but also that visibility is your greatest advantage. Azure gives you multiple levers; success comes from knowing when and how to use each one together.

Over the past few months, I’ve supported multiple customers, from AI platforms to SaaS startups, who faced real capacity challenges in regions like East US 2 and West US 2. This post came directly from those experiences, with one goal: to help others move from reactive firefighting to proactive, layered capacity planning. If your workloads are scaling fast, I hope this guide helps you build not just a plan, but a mindset, for running reliably when the cloud gets crowded.

Azure Monitor 101: The missing guide to understanding monitoring on Azure

rmmartins — Mon, 20 Oct 2025 14:24:09 GMT

Introduction

Monitoring in the cloud is often misunderstood. Some think it’s about checking whether a virtual machine is up; others equate it with dashboards or alerts. In reality, monitoring is about visibility, correlation, and action, and in Azure, that all converges in one platform: Azure Monitor.

This article explains, in practical terms, how Azure Monitor works, the role of Log Analytics, and how to build a foundation for observability across your workloads.

If you’ve read our earlier posts on Service and Resource Health Monitoring, Advanced Alerting Strategies, Azure Workbooks Customization, or Azure Monitor & MELT, this post ties them all together.

What Is Azure Monitor?

Azure Monitor is Microsoft’s unified platform for collecting, analyzing, and acting on telemetry across applications, infrastructure, and networks, whether they run on Azure, hybrid, or multicloud environments.

It helps you answer four questions:

Is my environment healthy?
What’s happening right now?
Why did it happen?
What should I do next?

The Building Blocks

Layer	Description	Examples
1. Data Sources	Where telemetry originates: VMs, AKS, databases, applications, networks.	Activity Logs, Performance Counters, Container Metrics, App Insights telemetry.
2. Data Platform (Log Analytics)	Central workspace where logs are stored and queried using KQL.	Diagnostic Settings → Workspace → Query → Alert/Workbook.
3. Insights & Visualizations	Built-in experiences that interpret raw data.	Azure Monitor for VMs, Containers, Apps, Network.
4. Action & Automation	Responding through alerts, workflows, or ITSM integrations.	Alerts + Action Groups → Teams, Logic Apps, PagerDuty.

Azure Monitor core layers

Metrics vs. Logs

Aspect	Metrics	Logs
Format	Numeric values sampled over time	Text-based records with context
Best for	Performance monitoring and thresholds	Troubleshooting and auditing
Examples	CPU %, latency, requests/sec	Error messages, policy changes
Store	Azure Monitor metrics DB	Log Analytics workspace

Metrics are fast and lightweight; logs are richer and more flexible. Both live under Azure Monitor.

The role of Log Analytics Workspace

If Azure Monitor is the nervous system, Log Analytics is the brain.

Resources send diagnostic and activity data via Diagnostic Settings, agents, or connectors. Once in the workspace, you can query everything using Kusto Query Language (KQL).

AzureActivity
| where OperationNameValue contains "Delete"
| summarize Count = count() by Caller, bin(TimeGenerated, 1d)
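If KQL is new to you, it can help to read the delete-activity query as an ordinary filter-and-count. The sketch below mimics its logic in Python over a few hypothetical records; the field names follow the query, but the sample data and the `deletes_per_caller_per_day` helper are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical AzureActivity-style records, for illustration only.
records = [
    {"Caller": "alice@contoso.com",
     "OperationNameValue": "Microsoft.Compute/virtualMachines/delete",
     "TimeGenerated": datetime(2025, 10, 1, 9, 30)},
    {"Caller": "alice@contoso.com",
     "OperationNameValue": "Microsoft.Storage/storageAccounts/DELETE",
     "TimeGenerated": datetime(2025, 10, 1, 17, 5)},
    {"Caller": "bob@contoso.com",
     "OperationNameValue": "Microsoft.Compute/virtualMachines/write",
     "TimeGenerated": datetime(2025, 10, 1, 11, 0)},
]


def deletes_per_caller_per_day(rows):
    """Count delete operations per (caller, day), like the summarize step."""
    counts = Counter()
    for row in rows:
        # KQL `contains` is case-insensitive, so compare in lowercase.
        if "delete" in row["OperationNameValue"].lower():
            day = row["TimeGenerated"].date()  # plays the role of bin(TimeGenerated, 1d)
            counts[(row["Caller"], day)] += 1
    return counts


result = deletes_per_caller_per_day(records)
for (caller, day), count in sorted(result.items()):
    print(caller, day, count)
```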

You can then:

Create alerts that fire on query results.
Build workbooks for dashboards and storytelling.
Export data to Event Hub, Storage, or SIEM.

Log Analytics as the central data plane

Data flow overview

The MELT Model

To understand observability holistically, adopt the MELT framework: Metrics, Events, Logs, and Traces, explained in detail in Azure Monitor & MELT.

Pillar	Purpose
Metrics	How your system performs
Events	What changed
Logs	Why it happened
Traces	How requests flow through components

From data to action: alerts and automation

Azure Monitor includes:

Metric alerts (near real-time thresholds)
Log alerts (KQL queries on schedule)
Activity Log alerts (platform events)

Use Action Groups to define responses: email, Teams, Logic App, or ticket.

For advanced patterns like dynamic thresholds and suppression, see Advanced Alerting Strategies for Azure Monitoring.

Alerting and automation workflow

Visualization and Workbooks

Workbooks transform data into decisions. Combine KQL queries, parameters, and visuals, all within the Azure portal.

Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), Computer

To go beyond the basics (multi-resource joins, conditional formatting, custom JSON parameters), see Azure Workbooks: Advanced Customization and Data Visualization in Azure.

Example workbook visualization

Health Monitoring and Platform Signals

Azure provides Service Health and Resource Health to help differentiate between Azure-side issues and workload issues. They complement Azure Monitor by tracking platform events and maintenance notifications.

Configuration guidance is available in The Importance of Setting Up Service and Resource Health Monitoring in Azure.

Service Health and Resource Health integration

Best practices for workspaces

Centralize intelligently: aggregate where cross-resource correlation matters.
Control costs: use Data Collection Rules to filter noise.
Manage retention: align with compliance needs.
Secure access: apply RBAC and table-level permissions.
Automate deployment: define diagnostics via Bicep or Terraform.

Quick start checklist

Create a Log Analytics workspace.
Enable Diagnostic Settings for key resources.
Run a basic KQL query to verify data.
Configure a metric alert and action group.
Build a simple workbook to visualize results.

You now have a full feedback loop: data → query → alert → visualize → act.

Next steps & further reading

Together these form a complete learning path, from monitoring basics to full observability.

Conclusion

Azure Monitor is more than a tool; it’s the observability backbone of Azure. Once you understand its layers, the rest of the ecosystem (health alerts, workbooks, advanced rules, and MELT) falls naturally into place.

Start simple. Connect a resource, explore your workspace, and let data guide your next question. That’s when monitoring becomes insight.

Monitoring Azure OpenAI without switching from your existing observability platform

rmmartins — Thu, 09 Oct 2025 20:16:23 GMT

Recently, one of my customers asked me a simple but powerful question:

“We already use Datadog for observability, but the rate-limit metrics we see in the Azure Portal don’t match what we get in Datadog. Why does Azure show higher TPM numbers?”

That question led to a deeper conversation about how Azure measures rate limits for Azure OpenAI.

They weren’t trying to move away from Datadog. In fact, they already have a mature observability stack built on it. They simply wanted to understand and monitor Azure OpenAI usage directly in the portal.

After reviewing the documentation and confirming with the Azure OpenAI engineering team, the answer made sense:

Azure’s Tokens-Per-Minute (TPM) metric is based on an estimated token count derived from the character length of the request, not the exact tokenized count used for billing.
This estimate accounts for the worst-case request scenario (prompt + max_tokens + best_of), so Azure’s TPM can appear “inflated” compared to Datadog, which measures actual tokens consumed after completion.
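To see why an estimate built from prompt length, `max_tokens`, and `best_of` runs higher than actual consumption, consider the rough sketch below. It is not Azure’s formula: the four-characters-per-token ratio, the way the terms are combined, and the sample "actual" numbers are all assumptions made purely for illustration.

```python
import math

CHARS_PER_TOKEN = 4  # Assumption: a common rule of thumb, not Azure's actual ratio.


def estimated_rate_limit_tokens(prompt: str, max_tokens: int, best_of: int = 1) -> int:
    """Worst-case estimate: prompt size plus every completion token that could be generated."""
    prompt_estimate = math.ceil(len(prompt) / CHARS_PER_TOKEN)
    return prompt_estimate + max_tokens * best_of


prompt = "Summarize the incident report in two sentences."
estimate = estimated_rate_limit_tokens(prompt, max_tokens=1000)

# Hypothetical tokens actually consumed: a short prompt plus a short completion.
actual = 12 + 60

print(f"estimated for rate limiting: {estimate}, actually consumed: {actual}")
```

The gap comes almost entirely from `max_tokens`: the estimate reserves the full completion budget up front, while a tool that counts tokens after completion only sees what was generated. Setting `max_tokens` close to what you really need narrows the difference.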

That conversation inspired this post because many customers find themselves in a similar spot: they already have powerful observability tools but still want quick, built-in visibility into Azure OpenAI usage and rate limits without adding new integrations or switching platforms.

The two monitoring paths

When it comes to monitoring Azure OpenAI, there are two main options:

1. The full flow (most powerful, requires Log Analytics): This unlocks correlation, deep queries, and exporting metrics/logs to external tools.

Azure OpenAI Service → Azure Monitor → Log Analytics → KQL, Workbooks, Alerts → integrations like Datadog, Grafana.

2. The lightweight flow (fast, free, no Log Analytics): This is what we’ll explore: simple dashboards and quota-based alerts right in the Azure Portal.

Azure OpenAI Service → Azure Monitor (Metrics) → Portal Workbooks + Alerts.

Metrics available in Azure OpenAI

Azure OpenAI publishes several key metrics natively (no ingestion required). According to the official documentation:

Processed Inference Tokens → tokens consumed (prompt + completion).
Azure OpenAI Requests → total API calls.
Request Errors → failed requests (429s, 5xx).
Availability Rate → percentage of successful calls.
Latency metrics → TTFT (time to first token), TTLB (time to last byte).

You can view these under: AOAI Resource → Monitoring → Metrics.

Azure OpenAI exposes native metrics like tokens, requests, errors, and latency directly in the Azure Portal

Quotas: The other half of the picture

Metrics tell you usage. Quotas tell you capacity. Every deployment has fixed Tokens per Minute (TPM) and Requests per Minute (RPM) limits. You can find these under: AOAI Foundry Portal → Deployments → Select Deployment → Rate Limits.

Example:

GPT-4.1-mini deployment → 250,000 TPM / 250 RPM

These are the values you’ll compare against metrics and use in alerts.

Each deployment has fixed TPM/RPM quotas. Here, GPT-4.1-mini is capped at 250,000 TPM and 250 RPM.

If you prefer a programmatic approach, you can run this command:

az rest --method get \
  --url "https://management.azure.com/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.CognitiveServices/accounts/<accountName>/deployments/<deploymentName>?api-version=2023-05-01" \
  --query "{deployment:name, TPM:properties.rateLimits[?key=='token'].count | [0], RPM:properties.rateLimits[?key=='request'].count | [0]}"

Sample output:

{
  "RPM": 250,
  "TPM": 250000,
  "deployment": "gpt-4.1-mini"
}
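Once you have the limits as JSON, deriving alert thresholds is simple arithmetic. The sketch below parses output shaped like the sample above and computes warning and critical values; the `alert_thresholds` helper and the 80% warning level are illustrative choices, not part of any Azure tooling.

```python
import json


def alert_thresholds(limits: dict, warning_percent: int = 80) -> dict:
    """Derive warning/critical thresholds from a deployment's TPM and RPM limits."""
    return {
        name: {
            # Integer math avoids floating-point rounding surprises.
            "warning": limits[name] * warning_percent // 100,
            "critical": limits[name],
        }
        for name in ("TPM", "RPM")
    }


sample_output = '{"RPM": 250, "TPM": 250000, "deployment": "gpt-4.1-mini"}'
thresholds = alert_thresholds(json.loads(sample_output))
print(thresholds)
```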

Building a lightweight workbook

Even without Log Analytics, you can build a simple workbook to track usage vs quota:

Go to Azure Monitor → Workbooks → + New.
Add a metric visualization for Processed Inference Tokens (Sum).
- Resource Type: Azure AI Foundry
- Azure AI Foundry: Select your instance
- Click to add metric
- Metric: Processed Inference Tokens
- Aggregation: Sum
- Display name: Token Usage vs Quota
Add another metric for Azure OpenAI Requests (Count).
- Metric: Azure OpenAI Requests
- Aggregation: Count
- Display name: Requests per Minute vs Quota
Click Run Metrics.
Save as AOAI Usage vs Capacity.

Workbooks let you visualize token and request usage against your deployment’s fixed quotas

Creating alerts (proactive notification)

From the portal you can also configure alerts directly on metrics:

Go to Azure Monitor → Alerts → + Create → Alert rule.
Scope = your AOAI resource.
Condition step:

Signal name = Processed Inference Tokens.
Threshold type: Static
Value is: Greater than
Unit: Count
Threshold = 200,000 (warning) or 250,000 (critical).

Actions step:

Use Quick Actions → add your email (or Azure mobile push).
Or create an Action Group for Teams/webhook integration.

Details step:

Name = AOAI-TPM-Warning / AOAI-TPM-Critical.
Severity = 2 (Warning) or 0 (Critical).

Review + Create.
Repeat for Azure OpenAI Requests with thresholds of 200 (warning) and 250 (critical).

Alert conditions:

Configure alert conditions directly on metrics. Here, we trigger at 200,000 tokens per minute (80% of quota)

Quick Actions:

Quick Actions let you add email or mobile notifications without creating a full Action Group.

Overview from the Alert:

Give your alert a descriptive name and severity. Here, AOAI-TPM-Warning at Severity 2.

How this helps with 429 errors

One of the most common issues Azure OpenAI customers face is the dreaded “Too Many Requests” (429) error.

Why it happens:

Each deployment enforces hard TPM/RPM quotas.
If you send more tokens or requests than allowed in a minute, the service rejects them with a 429.
You may see headers like x-ms-retry-after-ms telling you how long to wait.

How monitoring helps:

Metrics as early warning: Watching token/request metrics shows when you’re approaching the cap.
Alerts before throttling: Warning alerts at 80% (200k TPM / 200 RPM) give you time to react before 429s hit.
Critical alerts at 100%: Confirm you’ve saturated the quota and need to adjust.

Important note:

Monitoring doesn’t prevent 429s; your app should still implement retry with backoff and consider batching or queuing requests.
But with this setup, you’ll know before the error storm begins, and can respond faster.
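For reference, retry with backoff can be as small as the sketch below. It is a generic pattern, not code from any Azure SDK: `RateLimitError` is a hypothetical stand-in for however your HTTP client surfaces a 429, and `retry_after_ms` represents a server-provided wait hint if one is available.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""

    def __init__(self, retry_after_ms=None):
        super().__init__("429 Too Many Requests")
        self.retry_after_ms = retry_after_ms


def call_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            if err.retry_after_ms is not None:
                # Prefer the server's hint when it provides one.
                delay = err.retry_after_ms / 1000
            else:
                # Otherwise wait 1s, 2s, 4s, ... plus a little jitter.
                delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be swapped out in tests; in production you would call `call_with_backoff(lambda: client_request())` with whatever function issues your request.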

Why this matters

For many companies, time-to-value is more important than building a new monitoring stack.

This approach means:

No Log Analytics ingestion.
No need to replace Datadog or Splunk.
Free visibility into usage vs quota.
Proactive notifications on approaching limits.
Fewer surprises with 429 errors.

And if later you want deeper insights, you can still enable Log Analytics and export into your existing observability platform.

Closing thoughts

This article was inspired by a customer request, but I believe many others will benefit from the same approach. In just a few minutes, you can build a dashboard, set alerts, and gain confidence in your Azure OpenAI usage, all without leaving the Azure Portal.

I’d love to hear from you: how is your team monitoring Azure OpenAI today? Share in the comments; your feedback will help shape what we build next.

Azure routing preference: A hidden lever for performance vs. cost trade-offs

rmmartins — Fri, 05 Sep 2025 20:40:21 GMT

For Digital Native companies, every engineering decision is also a business decision. How you design your cloud architecture affects not just performance but also your burn rate, margins, and ultimately your ability to scale.

One of the most overlooked levers in Azure networking is Routing Preference, a simple setting that determines how your outbound internet traffic leaves Azure. The choice is binary:

Microsoft Global Network (Premium): High-quality, low-latency routing on Microsoft’s backbone (default).
ISP Transit (Internet Routing): Lower-cost routing via local ISPs.

Most startups never change the default, but understanding when to switch can save you serious money without sacrificing customer experience.

Why it matters for digital natives

Bandwidth is one of those quiet COGS items (Cost of Goods Sold, the direct cost of delivering your product to customers) that doesn’t make noise until the bill arrives. If your product depends on moving data, whether streaming, analytics, or SaaS APIs, outbound traffic is part of your unit economics.

Routing Preference is your toggle between performance and cost. Think of it as:

Business class routing (Premium): smoother ride, higher price.
Economy routing (ISP): cheaper seat, gets you there, but less predictable.

Pricing Snapshot

Outbound internet rates vary by region, and they differ sharply between routing options. For example (first 10 TB/month, beyond free 100 GB):

Region	Microsoft Global Network (Premium)	ISP Transit (Internet Routing)
United States	$0.087 / GB	$0.04 / GB
Australia	$0.12 / GB	$0.06 / GB

Azure Bandwidth Pricing
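To turn the per-GB rates into a monthly figure, a quick calculation helps. The sketch below uses only the rates from the table above; it assumes the 100 GB free allowance applies before either rate, treats 10 TB as 10,000 GB, and covers only the first pricing tier. The `monthly_egress_cost` helper is hypothetical, so check the current pricing page before relying on the numbers.

```python
# Rates from the table above (USD per GB, first 10 TB/month tier).
RATES_PER_GB = {
    "United States": {"premium": 0.087, "isp": 0.04},
    "Australia": {"premium": 0.12, "isp": 0.06},
}
FREE_GB = 100           # Assumption: free allowance applies to both routing options.
FIRST_TIER_GB = 10_000  # Assumption: 10 TB treated as 10,000 GB.


def monthly_egress_cost(region: str, routing: str, egress_gb: float) -> float:
    """Estimate monthly outbound cost within the first pricing tier."""
    if egress_gb > FREE_GB + FIRST_TIER_GB:
        raise ValueError("this sketch only covers the first 10 TB tier")
    billable_gb = max(0, egress_gb - FREE_GB)
    return round(billable_gb * RATES_PER_GB[region][routing], 2)


for routing in ("premium", "isp"):
    cost = monthly_egress_cost("United States", routing, 5_100)
    print(f"{routing}: ${cost:.2f}")
```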

Which Azure resources support routing preference?

Routing Preference applies to any Azure resource backed by a Public IP, including:

Virtual Machines (VMs)
Virtual Machine Scale Sets (VMSS)
Azure Kubernetes Service (AKS)
Public Load Balancers (NIC-based backend)
Application Gateway
Azure Firewall
Storage Accounts (Blob, Files, etc.)

Routing preference overview

How to configure it

Public IP Example (CLI)

az network public-ip create \
  --name MyPublicIP \
  --resource-group MyResourceGroup \
  --location eastus \
  --ip-tags 'RoutingPreference=Internet' \
  --sku Standard \
  --allocation-method Static \
  --version IPv4

A digital native playbook

Here’s a quick guide to help you decide:

Scenario	Recommended Routing	Why
Latency-sensitive SaaS APIs	Premium (Global Network)	Predictable performance, better customer experience
Dev/Test environments	ISP Transit	Optimize cost where performance isn’t critical
Bulk log exports, backups	ISP Transit	Cut bandwidth costs significantly
Production workloads with end-users	Premium	Protect SLA and latency for customers

Key takeaways

By default, you’re paying for Premium routing whether you need it or not.
ISP Transit can cut outbound bandwidth costs by roughly half (for example, $0.087 vs. $0.04 per GB in the United States), a significant win for cost-sensitive workloads.
Routing Preference applies to VMs, AKS, Load Balancers, App Gateways, Firewalls, and Storage.
The right choice depends on your growth stage and workload type: optimize for performance where it matters, optimize for cost everywhere else.

Closing

For digital natives, scaling is a balance: you need to delight customers while watching COGS. Routing Preference is a small Azure feature that gives you a big lever on both. Next time you spin up a VM, AKS cluster, or Storage account, don’t just accept the defaults.

Ask: Do I want business class routing or economy? That one decision could save you thousands as you scale.

Azure Quota Alerts (Preview): Still overlooked, but incredibly useful

rmmartins — Fri, 22 Aug 2025 16:24:56 GMT

Quota limits are one of those hidden blockers that can catch digital native companies by surprise. You’re scaling fast, deploying more VMs or GPU nodes, and suddenly: “Quota exceeded.”

Since late 2024, Azure has offered Quota Alerts (Preview), a built-in way to monitor and get notified before you hit subscription limits. It’s not brand new, but many digital native companies still aren’t taking advantage of it.

Why this matters for startups & digital natives

Avoid outages at scale: deployments won’t suddenly fail due to quota ceilings.
Save engineering time: no need for custom monitoring pipelines.
Simple setup: alerts in minutes directly from the portal.
Fits existing workflows: integrates with Action Groups (email, Teams, PagerDuty).

How to create a quota alert in Azure (Preview)

1. Open Quotas in the Portal

Search Quotas in the Azure Portal and go to the blade.

2. Go to Alert rules (Preview)

Click Alert rules (Preview), then + Create Alert Rule (Preview).

3. Configure the Alert

On the Create usage alert rule page:

Subscription → choose the subscription.
Provider → e.g., Compute for vCPUs.
Alert rule name → e.g., Quota Alert – EastUS vCPUs.
Threshold → usage % (e.g., 80%).
Severity → pick according to policy.
Frequency of evaluation → e.g., 15 minutes.
Resource group → select/create one.
Managed Identity → click Create new.
Notify me by → email, Action Group, Teams, etc.
Dimensions → select Location and Quota (e.g., DSv5 Family vCPUs).

Save, and you’re done. You can find more detailed configuration options in the official Microsoft docs.
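The threshold is a percentage of the quota limit, so it is worth knowing how much headroom a given setting leaves you. The sketch below shows the arithmetic; it assumes the alert fires when usage reaches or exceeds the threshold, and the helper names are hypothetical.

```python
def alert_fires(current_usage: int, limit: int, threshold_percent: int = 80) -> bool:
    """Return True when usage has reached the alert threshold."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Assumption: the rule fires at or above the threshold.
    return 100 * current_usage / limit >= threshold_percent


def headroom_before_alert(current_usage: int, limit: int, threshold_percent: int = 80) -> int:
    """How many more units (for example, vCPUs) can be used before the alert fires."""
    # Ceiling division: the smallest whole usage that reaches the threshold.
    trigger_at = -((-limit * threshold_percent) // 100)
    return max(0, trigger_at - current_usage)


# Example: a 100 vCPU quota with 64 vCPUs in use and an 80% threshold.
print(alert_fires(64, 100), headroom_before_alert(64, 100))
```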

4. Assign Permissions to the Managed Identity

When the new Managed Identity is created (e.g., quota-alert-managed-identity), you must give it access to read quota usage.

Go to Managed Identities in the portal.
Select the identity created for the quota alert.
Open Azure role assignments → + Add role assignment (Preview).
Set:
- Scope: Subscription
- Subscription: the subscription where quotas are monitored
- Role: Reader (or any role that includes read access)
Save.

5. Track & Act

Your rules are visible under Quotas → Alert rules (Preview).
Triggered alerts show up under Fired alerts (Preview).

Old vs New Reality

Approach	Custom Scripts & Logs	Quota Alerts (Preview)
Effort to set up	High	Very low
Extra services needed	Log Analytics, Automation	None
Visibility	Manual dashboards	Native alert rules
Best fit	Ops-heavy teams	Startups, lean teams

Takeaway

Quota alerts may have been around since late 2024, but they remain one of the most underrated features in Azure. For startups and digital native companies scaling quickly, they provide peace of mind with almost zero setup.

Don’t wait until your next deployment fails. Set up a quota alert today (start with Regional vCPUs in your main region). It only takes a couple of minutes and could save your launch.

⚡️ Pro tip: You can reuse your Action Groups for quota alerts, keeping all notifications consistent with your existing monitoring strategy.

Azure Support Slack Bot on Azure Container Apps: Production-ready guide

rmmartins — Thu, 04 Sep 2025 17:28:18 GMT

Launch a secure, scalable Slack bot for Azure support tickets in minutes — no secrets in code, no manual admin steps, and fully aligned with modern cloud-native best practices.

This guide walks you through deploying the GitHub sample azure-support-slack-bot on Azure Container Apps, using managed identities, Key Vault, and scale-to-zero architecture that just works, whether you're building from scratch or plugging into your existing DevOps flow.

Here’s what you’ll build:

Zero-admin secrets management with Managed Identity + Key Vault
RBAC-first access to Azure Support APIs
A clean, local-first development workflow (with ngrok support)
Slack integration using manifest-based setup
Observability with App Insights + Log Analytics
Scale from 0 to N replicas, with autoscaling baked in

And yes, all of this, without ever hardcoding a secret or exposing a public endpoint you didn’t intend to.

If you’re running lean and building fast, this bot is a solid foundation. It’s not just a cool demo — it’s a production-ready blueprint for any digital native team that wants to integrate Slack with Azure support in a secure, automated, and developer-friendly way.

1. What you're building

Features:

/azure-support slash command
Auto-scaling from 0→N replicas based on HTTP load
Zero secrets in code or environment variables
Comprehensive logging and Azure RBAC integration

2. Why Azure Container Apps (ACA)?

When you're building for speed, security, and scale without a huge ops team, Azure Container Apps (ACA) hits the sweet spot.

This Slack bot doesn't need a full-blown cluster. It needs event-driven scale, zero-trust security, and built-in automation, and that’s exactly what ACA delivers. Here’s why ACA is a better fit than the usual suspects:

Azure Container Instances (ACI)

Great for quick scripts or batch jobs, but it lacks built-in ingress rules, TLS termination, and autoscaling, and its identity support is more limited.
ACA gives you all that out of the box, with production features and native integration with Key Vault, App Insights, and autoscaling.

Web App for Containers

Web Apps are more suited for classic web hosting. You’ll hit limits with scaling flexibility, networking, and secret management.
ACA gives you Kubernetes-grade scale and observability, without having to think about servers or patching.

Azure Kubernetes Service (AKS)

Powerful, but heavy. You manage clusters, patch nodes, deal with autoscaler configs, ingress controllers, and more.
ACA does the heavy lifting for you, zero node management, zero cluster maintenance.
And here’s the kicker: AKS charges for the VM nodes 24/7, even when idle.
- ACA? Pay-per-request. When there’s no traffic, it scales to zero, and you don’t pay.

Cost Efficiency That Scales With You

For digital native teams, especially startups and growth-stage companies, ACA’s serverless pricing model is a game-changer:

You scale from 0 to N replicas based on actual demand
You only pay when your app is running
No need to over-provision VMs or guess your future traffic

This means you can launch a support bot, an internal API, or a microservice without worrying about burning cash while it's idle.
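Scaling from 0 to N on HTTP load boils down to dividing concurrent requests by a per-replica target and clamping the result. The sketch below is a simplification of that idea, not the platform’s actual scaling algorithm; the `desired_replicas` helper and its default values are assumptions for illustration.

```python
import math


def desired_replicas(
    concurrent_requests: int,
    target_per_replica: int = 10,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Simplified HTTP scale rule: one replica per `target_per_replica` concurrent requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(concurrent_requests / target_per_replica)
    # Clamp to the configured bounds; min_replicas=0 is what allows scale-to-zero.
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 25, 1000):
    print(load, "concurrent requests ->", desired_replicas(load), "replicas")
```

With `min_replicas=0`, an idle bot costs nothing to keep deployed, at the price of a cold start on the first request after a quiet period.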

Built for These Teams

ACA is ideal for:

Platform engineering teams who want secure templates, not snowflake infra
DevOps-light teams who need autoscaling without managing YAML storms
Growth-stage product squads rolling out bots, APIs, or event-driven services fast
Startups who care about velocity, observability, and not hiring a full SRE team

Comparison table:

Platform	Best Fit	Where It Falls Short	Why ACA Wins
ACI	Short-lived scripts & jobs	No ingress, limited identity, lacks autoscaling	ACA supports scale-to-zero, secure access, and managed identity out of the box
Web App	Traditional web hosting	Rigid scaling, fewer network/runtime controls	ACA offers greater flexibility and microservice patterns
AKS	Complex, large-scale distributed apps	Operational overhead, always-on cost	ACA simplifies ops with managed scaling & lower cost
ACA	Cloud-native APIs, internal tools, microservices	—	Built-in identity, ingress, scale-to-zero, lower total cost

ACA is your serverless container platform when you want:

TLS and ingress baked in
GitHub Actions support out of the box
Built-in support for Key Vault, managed identities, and auto-scaling
Production-grade infra, without managing a single VM

If you're moving fast and don’t want to build a platform just to run a bot, ACA is the move.

3. Prerequisites & verification

Required:

Azure subscription with Contributor access
Azure CLI ≥ 2.49.0 with `containerapp` extension
Docker Desktop or equivalent
Slack workspace with app creation permissions
Python 3.8+ for local development

Optional:

ngrok account for stable local testing URLs

Setup verification:

# Verify Azure CLI and login
az --version
az login
az account set --subscription <your-subscription-id>
az extension add --name containerapp --upgrade
# Verify Docker and Python
docker --version
python --version
# Verify current user permissions
currentUserId=$(az ad signed-in-user show --query id -o tsv)
subscriptionId=$(az account show --query id -o tsv)
az role assignment list --assignee $currentUserId --scope "/subscriptions/$subscriptionId" --query "[].roleDefinitionName" -o table

4. Azure permissions setup

The repository requires specific Azure RBAC roles that are often missed:

# Get current subscription and user
subscriptionId=$(az account show --query id -o tsv)
currentUserId=$(az ad signed-in-user show --query id -o tsv)
# Support Request Contributor - Required to create/manage Azure support tickets
az role assignment create \
  --assignee $currentUserId \
  --role "Support Request Contributor" \
  --scope "/subscriptions/$subscriptionId"
# Reader - Required to list and view Azure resources in the bot
az role assignment create \
  --assignee $currentUserId \
  --role "Reader" \
  --scope "/subscriptions/$subscriptionId"

Why these roles are required:

Support Request Contributor: Allows creating and managing Azure support tickets
Reader: Allows the bot to list subscriptions, services, and resources in dropdown menus

5. Local development setup

5.1 Clone and initialize project

git clone https://github.com/Azure-Samples/azure-support-slack-bot.git
cd azure-support-slack-bot
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# Install dependencies
pip install -r requirements.txt
# Create local environment file
cp .env-example .env

5.2 Create Slack app with manifest

Using the provided manifest is crucial for correct configuration:

Visit Slack API Apps
Click "Create New App" → "From a manifest"
Choose YAML and paste the contents of slack_app_manifest.yaml
Click Next → Create
Copy the Signing Secret from Basic Information
Important: Click "Install App" → "Install to Workspace" to generate the Bot User OAuth Token (xoxb-...)
After installation, copy the Bot User OAuth Token from the OAuth & Permissions page
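It helps to know what the Signing Secret is for: Slack signs every request it sends, and the receiving app recomputes that signature to confirm the request really came from Slack. The Bolt framework does this check for you, so the sketch below is only an illustration of the scheme; the function name is hypothetical, while the `v0:` base string, HMAC-SHA256, and five-minute replay window follow Slack’s published request-verification scheme.

```python
import hashlib
import hmac
import time


def is_valid_slack_signature(
    signing_secret: str,
    timestamp: str,   # value of the X-Slack-Request-Timestamp header
    body: str,        # raw request body, exactly as received
    signature: str,   # value of the X-Slack-Signature header
    now: float = None,
    max_age_seconds: int = 300,
) -> bool:
    """Recompute Slack's v0 signature and compare it to the one in the request."""
    now = time.time() if now is None else now
    # Reject old requests to limit replay attacks.
    if abs(now - int(timestamp)) > max_age_seconds:
        return False
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest("v0=" + digest, signature)
```

This is why the secret must never be committed to source control: anyone who holds it can forge requests that pass this check.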

5.3 Local testing with ngrok

# Edit .env with your tokens (local development only)
# SLACK_SIGNING_SECRET=your-signing-secret-here
# SLACK_BOT_TOKEN=xoxb-your-bot-token-here
# Terminal 1: Start the Flask app
python app.py

Expected output:

INFO:azure_support:Azure credentials configured successfully
INFO:azure_support:Preloading subscriptions completed
⚡️ Bolt app is running on port 5000!

# Terminal 2: Create ngrok tunnel
ngrok http 5000

Copy the https forwarding URL (e.g., https://abc123.ngrok-free.app)

5.4 Update Slack app manifest for local testing

This is the critical step that's often missed:

In your Slack app settings, go to App Manifest
Replace ALL instances of YOUR-DOMAIN-NAME with your ngrok domain
Example replacement:

# Before
request_url: https://YOUR-DOMAIN-NAME/slack/events
# After
request_url: https://abc123.ngrok-free.app/slack/events

Click Save Changes
Go to Install App and install it to your workspace
Copy the Bot User OAuth Token (xoxb-...)

5.5 Test local integration

Invite the bot to channels:

/invite azure-support

Test the slash command:

/azure-support

You should see the support request modal open in Slack.

Monitor logs:

# Check your Python terminal for incoming requests
INFO:azure_support:Opened modal for support request

6. Azure infrastructure setup

6.1 Define resource names

# Set consistent naming convention
RG="rg-slack-support-prod"
LOCATION="eastus"
ACR_NAME="acrsupport$RANDOM" # Globally unique
ENV_NAME="aca-slack-env"
APP_NAME="slack-support-app"
KV_NAME="kv-slack-$RANDOM"
UAMI_NAME="id-slack-support"
LAW_NAME="law-slack-support"
# Print the generated names for reference
echo "ACR Name: $ACR_NAME"
echo "Key Vault: $KV_NAME"

6.2 Create resource group

az group create \
  --name $RG \
  --location $LOCATION \
  --tags environment=production project=slack-support

7. Container registry with security

7.1 Create Azure container registry

az acr create \
  --name $ACR_NAME \
  --resource-group $RG \
  --sku Standard \
  --admin-enabled false # Security: No admin credentials

7.2 Build and push image

# Login to ACR
az acr login --name $ACR_NAME
# Build and push
IMAGE_NAME="$ACR_NAME.azurecr.io/azure-support-slack-bot:latest"
docker build -t $IMAGE_NAME .
docker push $IMAGE_NAME
# Verify image
az acr repository show \
  --name $ACR_NAME \
  --repository azure-support-slack-bot

8. Managed Identity and RBAC Setup

8.1 Create User-Assigned Managed Identity

az identity create \
  --name $UAMI_NAME \
  --resource-group $RG \
  --location $LOCATION
# Get identity details
UAMI_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query id -o tsv)
UAMI_PRINCIPAL_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query principalId -o tsv)
UAMI_CLIENT_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query clientId -o tsv)

8.2 Grant ACR pull permissions

ACR_ID=$(az acr show --name $ACR_NAME --resource-group $RG --query id -o tsv)
az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "AcrPull" \
  --scope $ACR_ID
# Wait for role propagation
echo "Waiting 60 seconds for role assignment propagation..."
sleep 60

8.3 Grant Azure support API permissions

# Get subscription ID (in case it's not set from earlier)
subscriptionId=$(az account show --query id -o tsv)

# Support Request Contributor for the managed identity
az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "Support Request Contributor" \
  --scope "/subscriptions/$subscriptionId"

# Reader role for listing Azure resources
az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "Reader" \
  --scope "/subscriptions/$subscriptionId"

# Verify the role assignments were created successfully
echo "Azure Support API permissions granted to managed identity"
az role assignment list \
  --assignee $UAMI_PRINCIPAL_ID \
  --query "[].{Role:roleDefinitionName,Scope:scope}" -o table

9. Azure Key Vault Setup

9.1 Create Key Vault with RBAC

az keyvault create \
  --name $KV_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --enable-rbac-authorization true \
  --retention-days 7

9.2 Grant Key Vault permissions

# Get current user and Key Vault scope
USER_PRINCIPAL_ID=$(az ad signed-in-user show --query id -o tsv)
KV_SCOPE=$(az keyvault show --name $KV_NAME --resource-group $RG --query id -o tsv)

# Grant admin access to current user
az role assignment create \
  --assignee $USER_PRINCIPAL_ID \
  --role "Key Vault Administrator" \
  --scope $KV_SCOPE

# Grant read access to managed identity
az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "Key Vault Secrets User" \
  --scope $KV_SCOPE

# Wait for propagation
sleep 30

9.3 Store secrets

# Store Slack secrets (replace with your actual values)
echo "Enter your Slack Bot Token (xoxb-...):"
read -s SLACK_BOT_TOKEN
echo "Enter your Slack Signing Secret:"
read -s SLACK_SIGNING_SECRET

az keyvault secret set \
  --vault-name $KV_NAME \
  --name "slack-bot-token" \
  --value "$SLACK_BOT_TOKEN"

az keyvault secret set \
  --vault-name $KV_NAME \
  --name "slack-signing-secret" \
  --value "$SLACK_SIGNING_SECRET"

# Verify secrets are stored
az keyvault secret list --vault-name $KV_NAME --query "[].name" -o table

10. Container Apps environment

10.1 Create Log Analytics Workspace

az monitor log-analytics workspace create \
  --workspace-name $LAW_NAME \
  --resource-group $RG \
  --location $LOCATION

# Get workspace details
LAW_CUSTOMER_ID=$(az monitor log-analytics workspace show \
  --workspace-name $LAW_NAME \
  --resource-group $RG \
  --query customerId -o tsv)

LAW_SHARED_KEY=$(az monitor log-analytics workspace get-shared-keys \
  --workspace-name $LAW_NAME \
  --resource-group $RG \
  --query primarySharedKey -o tsv)

10.2 Create Container Apps environment

az containerapp env create \
  --name $ENV_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --logs-workspace-id $LAW_CUSTOMER_ID \
  --logs-workspace-key $LAW_SHARED_KEY

11. Deploy Container App with security

11.1 Create Container App

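# --registry-identity makes the app pull from ACR with the managed identity;
# without it the CLI looks for admin credentials, which are disabled on this registry
az containerapp create \
  --name $APP_NAME \
  --resource-group $RG \
  --environment $ENV_NAME \
  --image $IMAGE_NAME \
  --target-port 5000 \
  --ingress external \
  --registry-server "$ACR_NAME.azurecr.io" \
  --registry-identity $UAMI_ID \
  --user-assigned $UAMI_ID \
  --min-replicas 1 \
  --max-replicas 10 \
  --cpu 0.5 \
  --memory 1Gi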

11.2 Configure Key Vault secret references

# Create secret references to Key Vault
az containerapp secret set \
  --name $APP_NAME \
  --resource-group $RG \
  --secrets \
    "slack-bot-token=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-bot-token,identityref:$UAMI_ID" \
    "slack-signing-secret=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-signing-secret,identityref:$UAMI_ID"

11.3 Configure environment variables

az containerapp update \
  --name $APP_NAME \
  --resource-group $RG \
  --set-env-vars \
    "SLACK_BOT_TOKEN=secretref:slack-bot-token" \
    "SLACK_SIGNING_SECRET=secretref:slack-signing-secret" \
    "AZURE_CLIENT_ID=$UAMI_CLIENT_ID" \
    "PORT=5000"

11.4 Configure scaling rules

az containerapp update \
  --name $APP_NAME \
  --resource-group $RG \
  --scale-rule-name "http-rule" \
  --scale-rule-type "http" \
  --scale-rule-http-concurrency 50 \
  --min-replicas 0 \
  --max-replicas 10
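To get a feel for what a concurrency target of 50 with 0 to 10 replicas means, here is a rough Python model of the arithmetic. It is only an approximation under the assumption that the scaler aims for one replica per 50 concurrent requests; the real scaler samples load over time and applies cool-down periods, and the function name is made up for this illustration.

```python
import math


def desired_replicas(concurrent_requests: int,
                     target_concurrency: int = 50,
                     min_replicas: int = 0,
                     max_replicas: int = 10) -> int:
    """Approximate replica count for an HTTP concurrency scale rule."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    # One replica per `target_concurrency` in-flight requests, rounded up...
    wanted = math.ceil(concurrent_requests / target_concurrency)
    # ...then clamped to the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for load in (0, 1, 120, 1000):
        print(load, "concurrent requests ->", desired_replicas(load), "replicas")
```

With these settings an idle app drops to zero replicas, so the first request after a quiet period has to wait for a cold start.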

12. Production configuration

12.1 Get application URL

APP_FQDN=$(az containerapp show \
  --name $APP_NAME \
  --resource-group $RG \
  --query properties.configuration.ingress.fqdn -o tsv)

APP_URL="https://$APP_FQDN"
echo "Production URL: $APP_URL/slack/events"

12.2 Update Slack app manifest for production

Critical: Replace ngrok URLs with production URLs:

In your Slack app settings, go to App Manifest
Replace all ngrok URLs with your Azure Container Apps URL:

settings:
  event_subscriptions:
    request_url: https://your-app-fqdn.region.azurecontainerapps.io/slack/events
  interactivity:
    is_enabled: true
    request_url: https://your-app-fqdn.region.azurecontainerapps.io/slack/events
    message_menu_options_url: https://your-app-fqdn.region.azurecontainerapps.io/slack/events

Click Save Changes
Reinstall the app

Critical: URL Verification Step

After updating your Slack App Manifest with the production URL, Slack will attempt to verify the new endpoint. This verification process is mandatory and must succeed before your bot will work in production.

What happens during verification:

Slack sends a POST request to your new URL (https://your-app-fqdn.region.azurecontainerapps.io/slack/events)
The request contains a challenge parameter that your Flask app must echo back
If verification fails, Slack will reject the manifest changes

Common verification failures:

Container App not running: Ensure your Azure Container App is deployed and healthy
Wrong URL format: Must end with /slack/events exactly
HTTPS required: Slack only accepts HTTPS endpoints (Container Apps provides this automatically)
Timeout issues: The Container App must respond within Slack's timeout window (about three seconds); with scale-to-zero, a cold start can exceed it on the first request
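The challenge handshake described above is small enough to sketch. Slack Bolt normally answers it for you, so the following is only an illustration of what happens on the wire, assuming the documented url_verification payload shape; handle_slack_event is a hypothetical helper, not a function from the bot's code.

```python
import json
from typing import Tuple


def handle_slack_event(raw_body: bytes) -> Tuple[int, dict]:
    """Return (http_status, json_response) for an incoming Slack request body."""
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return 400, {"error": "invalid_json"}

    # During URL verification Slack sends {"type": "url_verification", "challenge": "..."}
    # and expects the same challenge value echoed back.
    if isinstance(payload, dict) and payload.get("type") == "url_verification":
        return 200, {"challenge": payload.get("challenge", "")}

    # Any other event would be dispatched to the normal handlers here.
    return 200, {}


if __name__ == "__main__":
    print(handle_slack_event(b'{"type": "url_verification", "challenge": "abc123"}'))
```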

12.3 Bot installation and invitation

Required Post-Deployment Steps:

Slack App Manifest updated with production URL
Reinstall the bot in your Slack workspace
Invite the bot to channels: /invite @azure-support
Test with: /azure-support command

13. Testing and validation

13.1 Health and connectivity checks

# Test basic connectivity (note: this will return 404 since the app has no root endpoint handler)
curl -f "$APP_URL/" || echo "Expected 404 - app only handles /slack/events endpoint"

# Check container app status
az containerapp show \
  --name $APP_NAME \
  --resource-group $RG \
  --query properties.provisioningState

# Check logs
az containerapp logs show \
  --name $APP_NAME \
  --resource-group $RG \
  --follow

13.2 Functional testing

Test Slack integration:

/azure-support
Complete workflow:
- Fill out the support ticket modal completely (details below)
- Submit the form
- Verify ticket appears in Azure Portal → Help + Support

13.3 Opening the support request

When you open the support request form, you’ll see a few fields that need your attention:

Subject: Think of this as your headline. Keep it short and clear
Problem Details: Here’s your chance to explain what’s going wrong. Be specific! The more details, the better.
Azure Subscription, Service, Problem Type, and Resource: Select the right options from the dropdown menus. This helps the support team route your ticket to the right experts.

You’ll notice options for advanced diagnostic info. If you’re not sure, just say “Yes” (it’s recommended). Set the severity; if it’s a minor issue, pick “Minimal impact.” Then choose how you’d like to be contacted (email is usually easiest). Make sure your name and email are correct. If you want someone else to get updates, add their email too.

Once you’ve filled everything out, click Submit. You’ll see a confirmation message: your ticket is on its way!

If you chose a Slack channel, you’ll get a message like this:

You’ll also get a link to view your ticket in the Azure portal, along with an email containing all the details you provided.

14. Production observability

14.1 Application Insights Integration

# Create Application Insights
APPINSIGHTS_NAME="ai-slack-support"

az monitor app-insights component create \
  --app $APPINSIGHTS_NAME \
  --location $LOCATION \
  --resource-group $RG \
  --workspace $LAW_NAME

# Get instrumentation key
APPINSIGHTS_KEY=$(az monitor app-insights component show \
  --app $APPINSIGHTS_NAME \
  --resource-group $RG \
  --query instrumentationKey -o tsv)

# Add to container app
az containerapp update \
  --name $APP_NAME \
  --resource-group $RG \
  --set-env-vars \
    "APPLICATIONINSIGHTS_INSTRUMENTATION_KEY=$APPINSIGHTS_KEY"

14.2 Monitoring and alerts

# Create alert for container app failures
az monitor metrics alert create \
  --name "SlackBot-ContainerFailures" \
  --resource-group $RG \
  --scopes "/subscriptions/$subscriptionId/resourceGroups/$RG/providers/Microsoft.App/containerApps/$APP_NAME" \
  --condition "avg Requests < 1" \
  --description "Slack bot container app is not receiving requests" \
  --window-size 5m \
  --evaluation-frequency 1m

# Create alert for Key Vault access failures
az monitor metrics alert create \
  --name "SlackBot-KeyVaultAccess" \
  --resource-group $RG \
  --scopes "/subscriptions/$subscriptionId/resourceGroups/$RG/providers/Microsoft.KeyVault/vaults/$KV_NAME" \
  --condition "total ServiceApiHit < 1" \
  --description "Slack bot unable to access Key Vault secrets" \
  --target-resource-type "Microsoft.KeyVault/vaults" \
  --target-resource-region $LOCATION \
  --window-size 5m \
  --evaluation-frequency 1m

15. Security Hardening Checklist

Authentication & Authorization

User-Assigned Managed Identity for all Azure resources
RBAC-based access (no admin credentials)
Key Vault for all secrets with proper role assignments
Azure Support API permissions (Support Request Contributor + Reader)
Least-privilege permissions

Network Security

HTTPS-only ingress (Container Apps provides TLS termination)
No public admin endpoints
Container registry private access via managed identity

Operational Security

Comprehensive logging with Log Analytics
Health monitoring and alerting
Automated vulnerability scanning (ACR)
Secret rotation capability via Key Vault

Application Security

No secrets in code or in plain-text configuration (environment variables only reference Key Vault-backed secrets)
Slack request signature verification
Input validation and sanitization (built into Slack Bolt framework)
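Slack Bolt performs request signature verification for you, but it helps to know what is being checked. The sketch below follows Slack's documented v0 signing scheme (HMAC-SHA256 over "v0:timestamp:body", compared against the X-Slack-Signature header, with stale timestamps rejected). The function names and the five-minute window constant are choices made for this example, not code from the bot.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 60 * 5  # reject requests older than five minutes (replay protection)


def compute_slack_signature(signing_secret: str, timestamp: str, body: bytes) -> str:
    """Build the expected X-Slack-Signature value for a raw request body."""
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return "v0=" + digest


def is_valid_slack_request(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: Optional[float] = None) -> bool:
    """Check the X-Slack-Request-Timestamp and X-Slack-Signature header values."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > MAX_SKEW_SECONDS:
        return False
    expected = compute_slack_signature(signing_secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

The check must run on the raw request bytes, before any JSON or form parsing, because re-serialized bodies rarely match byte for byte.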

16. Complete deployment script

Before running the one-command deployment script, ensure you've completed sections 3 and 4 above, then verify:

1. You're in the repository root directory

pwd  # Should end with: azure-support-slack-bot
ls   # Should show: Dockerfile, requirements.txt, app.py

2. Docker is ready for building

docker ps # Should not show permission errors

3. You have your Slack tokens ready

Now go from zero to a production Slack bot in one command. Save the following script as deploy-slack-bot.sh:

#!/bin/bash
set -euo pipefail

# Check parameters
if [ $# -ne 3 ]; then
  echo "Usage: $0 <subscription-id> <slack-bot-token> <slack-signing-secret>"
  exit 1
fi

SUBSCRIPTION_ID="$1"
SLACK_BOT_TOKEN="$2"
SLACK_SIGNING_SECRET="$3"

# Configuration - modern naming conventions
RG="rg-slack-support-prod"
LOCATION="eastus"
ACR_NAME="acrsupport$RANDOM"
ENV_NAME="aca-slack-env"
APP_NAME="slack-support-app"
KV_NAME="kv-slack-$RANDOM"
UAMI_NAME="id-slack-support"
LAW_NAME="law-slack-support"

echo "🚀 Deploying secure Azure Support Slack Bot..."

# Set subscription context
az account set --subscription "$SUBSCRIPTION_ID"

# Create resource group
az group create --name $RG --location $LOCATION --tags environment=production project=slack-support

# Create ACR with security defaults
az acr create --name $ACR_NAME --resource-group $RG --sku Standard --admin-enabled false
az acr login --name $ACR_NAME

# Build and push image
IMAGE_NAME="$ACR_NAME.azurecr.io/azure-support-slack-bot:latest"
docker build -t $IMAGE_NAME .
docker push $IMAGE_NAME

# Create managed identity - zero-trust foundation
az identity create --name $UAMI_NAME --resource-group $RG --location $LOCATION
UAMI_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query id -o tsv)
UAMI_PRINCIPAL_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query principalId -o tsv)
UAMI_CLIENT_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query clientId -o tsv)

# Grant ACR pull permissions
ACR_ID=$(az acr show --name $ACR_NAME --resource-group $RG --query id -o tsv)
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "AcrPull" --scope $ACR_ID

# Grant Azure Support API permissions - least privilege
subscriptionId="$SUBSCRIPTION_ID"
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "Support Request Contributor" --scope "/subscriptions/$subscriptionId"
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "Reader" --scope "/subscriptions/$subscriptionId"
echo "Azure Support API permissions granted to managed identity"

# Create Key Vault with RBAC (no access policies)
az keyvault create --name $KV_NAME --resource-group $RG --location $LOCATION --enable-rbac-authorization true --retention-days 7
KV_SCOPE=$(az keyvault show --name $KV_NAME --resource-group $RG --query id -o tsv)

# Grant Key Vault permissions
USER_PRINCIPAL_ID=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --assignee $USER_PRINCIPAL_ID --role "Key Vault Administrator" --scope $KV_SCOPE
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "Key Vault Secrets User" --scope $KV_SCOPE

# Wait for RBAC propagation
sleep 60

# Store secrets securely
az keyvault secret set --vault-name $KV_NAME --name "slack-bot-token" --value "$SLACK_BOT_TOKEN"
az keyvault secret set --vault-name $KV_NAME --name "slack-signing-secret" --value "$SLACK_SIGNING_SECRET"

# Create observability foundation
az monitor log-analytics workspace create --workspace-name $LAW_NAME --resource-group $RG --location $LOCATION
LAW_CUSTOMER_ID=$(az monitor log-analytics workspace show --workspace-name $LAW_NAME --resource-group $RG --query customerId -o tsv)
LAW_SHARED_KEY=$(az monitor log-analytics workspace get-shared-keys --workspace-name $LAW_NAME --resource-group $RG --query primarySharedKey -o tsv)

# Create Container Apps environment
az containerapp env create --name $ENV_NAME --resource-group $RG --location $LOCATION --logs-workspace-id $LAW_CUSTOMER_ID --logs-workspace-key $LAW_SHARED_KEY

# Deploy Container App with scale-to-zero
# (--registry-identity pulls from ACR with the managed identity, since admin credentials are disabled)
az containerapp create \
  --name $APP_NAME \
  --resource-group $RG \
  --environment $ENV_NAME \
  --image $IMAGE_NAME \
  --target-port 5000 \
  --ingress external \
  --registry-server "$ACR_NAME.azurecr.io" \
  --registry-identity $UAMI_ID \
  --user-assigned $UAMI_ID \
  --min-replicas 0 \
  --max-replicas 10 \
  --cpu 0.5 \
  --memory 1Gi

# Configure Key Vault secret references
az containerapp secret set --name $APP_NAME --resource-group $RG --secrets \
  "slack-bot-token=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-bot-token,identityref:$UAMI_ID" \
  "slack-signing-secret=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-signing-secret,identityref:$UAMI_ID"

# Configure environment variables
az containerapp update --name $APP_NAME --resource-group $RG --set-env-vars \
  "SLACK_BOT_TOKEN=secretref:slack-bot-token" \
  "SLACK_SIGNING_SECRET=secretref:slack-signing-secret" \
  "AZURE_CLIENT_ID=$UAMI_CLIENT_ID" \
  "PORT=5000"

# Configure HTTP-based autoscaling
az containerapp update --name $APP_NAME --resource-group $RG \
  --scale-rule-name "http-requests" \
  --scale-rule-type "http" \
  --scale-rule-http-concurrency 50 \
  --min-replicas 0 \
  --max-replicas 10

# Get deployment results
APP_FQDN=$(az containerapp show --name $APP_NAME --resource-group $RG --query properties.configuration.ingress.fqdn -o tsv)

echo ""
echo "🎉 Deployment Complete!"
echo ""
echo "Slack Webhook URL: https://$APP_FQDN/slack/events"
echo " Resource Group: $RG"
echo " Key Vault: $KV_NAME"
echo " ACR: $ACR_NAME"
echo ""
echo " Next Steps:"
echo "1. Update your Slack App Manifest with: https://$APP_FQDN/slack/events"
echo "2. Reinstall the Slack app in your workspace"
echo "3. Invite the bot to channels: /invite @azure-support"
echo "4. Test with: /azure-support"
echo ""
echo "Monitor: az containerapp logs show -n $APP_NAME -g $RG --follow"
echo "Debug: az containerapp show -n $APP_NAME -g $RG --query properties.provisioningState"

Usage:

chmod +x deploy-slack-bot.sh
./deploy-slack-bot.sh "your-subscription-id" "xoxb-your-bot-token" "your-signing-secret"

Cost Expectations

Scale-to-zero architecture = minimal compute costs
Base charges: Key Vault (roughly $0.03 per 10,000 operations), Log Analytics ($2.30/GB ingested)
Container Apps: Only charges when processing requests (true serverless)

Deployment Notes

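Script appends $RANDOM to the registry and Key Vault names, which makes a collision unlikely but does not guarantee global uniqueness; rerun if a name is already taken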
Takes ~8-12 minutes due to RBAC propagation delays
After deployment, update your Slack App Manifest with the production URL

Post-Deployment Steps

Update Slack App Manifest with your Azure Container Apps URL
Reinstall the Slack app (required for URL changes)
Test with /azure-support or the global shortcut

17. Cleanup

When you're ready to remove all resources:

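# Delete resource group (removes all resources)
az group delete --name $RG --yes --no-wait

# Purge the soft-deleted Key Vault so its name can be reused.
# Run this once the deletion above has finished; it only works if purge protection was NOT enabled.
az keyvault purge --name $KV_NAME --location $LOCATION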

18. You’re live, what’s next?

You’ve just deployed a production-grade Slack bot for Azure Support using a modern, secure-by-default architecture — no manual secrets, no patchy scripts, no guesswork.

What you now have is more than a bot — it’s a template for how digital native teams should approach platform automation on Azure:

Zero-trust foundation with managed identities + Key Vault
Dev-first workflows for local testing and CI/CD
Scale-to-zero architecture on Azure Container Apps
Built-in observability with Log Analytics and App Insights
RBAC-controlled access to support APIs — no over-permissioned service principals
End-to-end automation via GitHub Actions

This isn't just a bot — it's a pattern. A way to wire your internal tools to your platform securely, scalably, and with full auditability from day one.

This guide was made for fast-moving teams who prefer CLI over click-ops and automation over tribal knowledge. If you're building platforms, bots, or tools to empower your engineering org, this is a foundation you can trust.

A practical guide to Azure VM SKU eligibility and zonal support monitoring

rmmartins — Tue, 20 Jan 2026 21:43:33 GMT

Important clarification about “capacity”

This guide does not provide real-time, deployable capacity signals for Azure VM SKUs. The solution is based on the Azure ResourceSkus API, which exposes SKU metadata, regional availability, zonal support, and subscription-level restrictions. It can tell you whether a SKU is eligible for your subscription in a region and which zones are supported.

It does not guarantee that capacity is available at deployment time. Azure capacity is dynamic, and allocation failures can still occur even when a SKU appears available and quota is sufficient. This solution is best used to proactively detect SKU restrictions, understand zonal exposure, and build guardrails and alternatives before deployments. For guaranteed capacity, Azure Capacity Reservations or pre-deployment validation are required.

Look, Azure allocation failures can really derail your day. Most of the time, you only find out there’s a problem when a deployment fails: no clear early signal, and no easy way to validate whether a SKU is even usable in a given region or zone for your subscription.

After seeing this happen repeatedly with customers, I built a simple monitor that helps you proactively validate SKU eligibility, restrictions, and zonal support, so you can catch “this SKU won’t work here” scenarios early and design alternatives before you hit deployment time.

Thought I’d share it here. Hopefully it saves you some of the same headaches.

What this thing does

This solution isn't fancy, but it works. Here's what it'll do for you:

Checks whether specific VM SKUs are eligible and exposed in a given region for your subscription
Shows exactly why a SKU can’t be used when there’s a restriction (for example, not available for the subscription or in specific zones)
Shows which availability zones are supported for each SKU in that region
Suggests similar VM SKUs you could consider when a SKU is restricted
Logs all results to Azure Log Analytics so you can track SKU exposure and restriction trends over time
Runs directly from your terminal, no complex setup required

What this solution does not do

This solution does not provide a real-time view of free or remaining Azure capacity. There is currently no public API that exposes live, deploy-time capacity per SKU, per zone, per region. As a result, even if a SKU appears eligible and zonally supported, deployments may still fail due to transient or regional capacity constraints.

If you need allocation certainty, you should consider:

Azure Capacity Reservations
Running validation deployments as a point-in-time signal
Designing for flexibility across SKUs, zones, and regions

How it's put together

It's pretty simple really - just two main Python scripts:

The Monitoring Script: Checks VM SKU eligibility, restrictions, and zonal support using Azure’s ResourceSkus API
Log Analytics Setup: Stores your data for later analysis (optional, but super useful)

Here's a quick diagram:

Before you start

You'll need a few things:

1. Azure CLI installed and working on your machine

# If you haven't logged in yet
az login

2. Azure permissions if you're doing the Log Analytics part:

# Get your username first
USER_PRINCIPAL=$(az ad signed-in-user show --query userPrincipalName -o tsv)
echo "Looks like you're logged in as: $USER_PRINCIPAL"

# Create a resource group - you can change the name if you want
az group create --name vm-sku-monitor-rg --location eastus2

# Give yourself the right permissions
az role assignment create \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

# Double-check it worked
az role assignment list \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

Azure can be kinda slow with permissions sometimes. If you get weird 403 errors later, maybe grab a coffee and try again in 10-15 mins.

3. Python environment setup:

# Set up a virtual environment - don't skip this step!
# I learned this the hard way when I borked my system Python...
python3 -m venv venv

# Activate it
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install what we need
pip install azure-identity azure-mgmt-compute azure-mgmt-subscription azure-monitor-ingestion rich

Let's build this thing

1. The VM Capacity Checking Script

The star of the show is the monitoring script itself. This script does all the heavy lifting: checking SKU eligibility and zonal support, showing you what's happening, and logging the data for later. I'll call it monitor_vm_sku_capacity.py.

The script uses compute_client.resource_skus.list() to evaluate SKU metadata, regional exposure, supported zones, and restriction codes. This API does not surface live allocation capacity.
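To make the eligibility logic concrete without calling Azure, here is a small Python sketch that classifies one SKU from restriction metadata alone. The two dataclasses are simplified stand-ins invented for this example (the SDK's own objects are richer and nest these fields differently), and classify is a hypothetical helper rather than the code in monitor_vm_sku_capacity.py.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Restriction:
    """Simplified stand-in for one SKU restriction record."""
    type: str                      # "Location" or "Zone"
    reason_code: str               # e.g. "NotAvailableForSubscription"
    locations: List[str] = field(default_factory=list)
    zones: List[str] = field(default_factory=list)


@dataclass
class SkuEntry:
    """Simplified stand-in for one SKU as reported for a single region."""
    name: str
    region: str
    zones: List[str] = field(default_factory=list)
    restrictions: List[Restriction] = field(default_factory=list)


def classify(entry: SkuEntry) -> Tuple[str, Optional[str], List[str]]:
    """Return (status, reason, usable_zones) based on metadata only, not live capacity."""
    region = entry.region.lower()
    # A location-level restriction blocks the whole region.
    for r in entry.restrictions:
        if r.type == "Location" and region in (loc.lower() for loc in r.locations):
            return "NOT AVAILABLE", r.reason_code, []

    # Zone-level restrictions remove individual zones.
    blocked = {z for r in entry.restrictions if r.type == "Zone" for z in r.zones}
    usable = [z for z in entry.zones if z not in blocked]
    if entry.zones and not usable:
        reason = next(r.reason_code for r in entry.restrictions if r.type == "Zone")
        return "NOT AVAILABLE", reason, []
    return "AVAILABLE", None, usable
```

Even an "AVAILABLE" result here only means no restriction was found; it says nothing about whether an allocation would succeed right now.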

2. Log Analytics Setup Script

Now for the script that sets up all the Log Analytics stuff. This part is optional, but really helpful if you want to track capacity trends over time: setup_log_analytics.py

Setting default region and VM SKU

You've got a few options to set your preferred region and VM SKU:

1. Edit script defaults: Open monitor_vm_sku_capacity.py and look for:

parser.add_argument('--region', type=str, default='eastus2',  # Change this!
                    help='Azure region to check (default: eastus2)')
parser.add_argument('--sku', type=str, default='Standard_D16ds_v5',  # And this!
                    help='VM SKU to check (default: Standard_D16ds_v5)')

2. Specify on command line:

python monitor_vm_sku_capacity.py --region westus2 --sku Standard_D8ds_v5

3. Edit config file: After running the setup script, it creates a config.json with these values:

{
  "region": "eastus2",
  "target_sku": "Standard_D16ds_v5",
  "check_zones": true,
  ...
}

Finding Available Regions and SKUs

If you're wondering which regions and SKUs to monitor, here's how to get that info:

Using Azure CLI

# List all regions
az account list-locations --query "[].name" -o tsv

# List all VM SKUs in a region
az vm list-skus --location eastus2 --resource-type virtualMachines --query "[].name" -o tsv

# Get detailed info about a specific SKU
az vm list-skus --location eastus2 --size Standard_D16ds_v5 -o table

Using Azure Portal

Just go to the VM creation page in the portal and click "See all sizes" - you'll get a nice visual list of all available options. I sometimes just take a screenshot of this for reference.

Using this tool

So here's how you use this thing. I tried to make it as simple as possible:

1. Set up Log Analytics first (optional but recommended):

python setup_log_analytics.py

This builds all the Log Analytics stuff and spits out a config file you can use in the next step. The default options should work fine for most people, but you can customize if needed.

2. Run the monitoring script:

python monitor_vm_sku_capacity.py --config config.json

If you don't want to mess with Log Analytics, you can just run it directly:

python monitor_vm_sku_capacity.py --region eastus2 --sku Standard_D16ds_v5

The output will look something like this (way prettier if you have the rich package installed):

AVAILABLE means no subscription-level restriction was detected and the SKU is exposed in this region. It does not guarantee deploy-time capacity.

Or if the VM is unavailable:

================================================================================
AZURE VM SKU CAPACITY MONITOR - 2024-05-20 14:32:45
================================================================================
Status: NOT AVAILABLE
SKU: Standard_D16ds_v5
Region: eastus2
Subscription: My Azure Subscription (12345678-1234-1234-1234-123456789012)
Details: SKU Standard_D16ds_v5 is not available in region eastus2
Available Zones: None

Restrictions:
  Type: Zone
  Reason: NotAvailableForSubscription
  Affected Values: eastus2

VM SKU Specifications:
  vCPUs: 16
  MemoryGB: 64
  MaxDataDiskCount: 32
  PremiumIO: True
  AcceleratedNetworkingEnabled: True

Alternative SKUs:
  - Standard_D16as_v5 (vCPUs: 16, Memory: 64 GB, Family: standardDasv5Family, Similarity: 100%)
  - Standard_D16s_v5 (vCPUs: 16, Memory: 64 GB, Family: standardDsv5Family, Similarity: 100%)
  - Standard_D16s_v4 (vCPUs: 16, Memory: 64 GB, Family: standardDsv4Family, Similarity: 100%)
  - Standard_F16s_v2 (vCPUs: 16, Memory: 32 GB, Family: standardFSv2Family, Similarity: 80%)
  - Standard_E16s_v5 (vCPUs: 16, Memory: 128 GB, Family: standardEsv5Family, Similarity: 80%)

NOT AVAILABLE means the SKU is restricted for this subscription in this region or zone based on the ResourceSkus restriction signals.
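The sample output reports a similarity percentage for each alternative, but the scoring rule itself is not shown in this post. As an illustration only, here is one hypothetical weighting in Python (60% vCPU match, 40% memory match) that happens to reproduce the figures in the sample; the real script may score alternatives differently.

```python
def similarity(target_vcpus: int, target_mem_gb: float,
               cand_vcpus: int, cand_mem_gb: float,
               cpu_weight: float = 0.6) -> int:
    """Hypothetical similarity score (0-100) between a target SKU and a candidate.

    Assumes all sizes are positive numbers.
    """
    def ratio(a: float, b: float) -> float:
        # 1.0 when equal, shrinking toward 0 as the two values diverge.
        return min(a, b) / max(a, b)

    score = (cpu_weight * ratio(target_vcpus, cand_vcpus)
             + (1 - cpu_weight) * ratio(target_mem_gb, cand_mem_gb))
    return round(score * 100)


if __name__ == "__main__":
    # Target: Standard_D16ds_v5 (16 vCPUs, 64 GB)
    print(similarity(16, 64, 16, 64))    # same shape as the D16as_v5 row
    print(similarity(16, 64, 16, 32))    # same shape as the F16s_v2 row
    print(similarity(16, 64, 16, 128))   # same shape as the E16s_v5 row
```

A production ranking would likely also weigh disk, networking, and family generation, which this sketch ignores.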

Setting up scheduled checks

I don't like missing things, so I set mine up to run every hour using cron:

# Open crontab editor
crontab -e

# Add this line to run it every hour
# (cron runs jobs with /bin/sh, where "source" may not exist, so use "." to activate the venv)
0 * * * * cd /path/to/scripts && . venv/bin/activate && python monitor_vm_sku_capacity.py --config config.json >> vm_sku_monitor.log 2>&1

Checking your data in Log Analytics

If you set up Log Analytics, you can run all sorts of cool queries:

// Basic query - see everything
VMSKUCapacity_CL
| order by TimeGenerated desc

// Find when capacity changed
VMSKUCapacity_CL
| where sku_name == "Standard_D16ds_v5" and region == "eastus2"
| project TimeGenerated, is_available
| order by TimeGenerated desc

// Simple dashboard
// arg_max returns the latest TimeGenerated per group plus the is_available value from that row
VMSKUCapacity_CL
| summarize arg_max(TimeGenerated, is_available) by sku_name, region
| extend Status = iff(is_available == true, "Available", "Not Available")
| project sku_name, region, Status, LastChecked = TimeGenerated

You can set up alerts too. That way Azure tells YOU when capacity changes, instead of you finding out during a failed deployment!
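Whether you alert from Log Analytics or from your own tooling, the useful signal is a change in status rather than each individual reading. Here is a minimal Python sketch of that idea, assuming you have already exported (timestamp, is_available) pairs for one SKU and region; availability_changes is a hypothetical helper, not part of the monitoring script.

```python
from typing import Iterable, List, Tuple


def availability_changes(samples: Iterable[Tuple[str, bool]]) -> List[Tuple[str, bool, bool]]:
    """Return (timestamp, old_status, new_status) for every status flip.

    Timestamps are assumed to be ISO 8601 strings in one time zone,
    so sorting them as text puts them in chronological order.
    """
    changes = []
    previous = None
    for timestamp, available in sorted(samples):
        if previous is not None and available != previous:
            changes.append((timestamp, previous, available))
        previous = available
    return changes


if __name__ == "__main__":
    readings = [
        ("2024-05-20T13:00:00Z", True),
        ("2024-05-20T14:00:00Z", True),
        ("2024-05-20T15:00:00Z", False),
        ("2024-05-20T16:00:00Z", True),
    ]
    for when, old, new in availability_changes(readings):
        print(f"{when}: {old} -> {new}")
```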

Troubleshooting

Some common problems I've run into:

"Could not automatically detect subscription ID":
- Make sure you're logged in with az login
- Or just provide it explicitly with --subscription-id
Log Analytics permission errors:
- Make sure you ran the permission commands from the prerequisites section
- Azure's permissions can be weirdly slow - wait 10-15 minutes and try again
Python environment issues:
- Always use a virtual environment! I learned this one the hard way when I messed up my system Python
- Make sure all the packages are installed with pip install azure-identity azure-mgmt-compute azure-mgmt-subscription azure-monitor-ingestion rich

Next Steps

Create a dashboard to visualize VM SKU availability over time
Set up alerts to notify you when specific SKUs become available
Integrate with your CI/CD pipeline to automatically select available SKUs
For a serverless, fully managed option, create an Azure Function version of the monitoring script

Advanced: Bulk-Deploy Feasibility Check

Want to validate up front whether a SKU is eligible in a region and whether your subscription quota would allow N VMs?
We combine:

Hardware-level: Resource SKUs API (is the SKU unrestricted?)
Subscription-level: Usage API (enough free vCPU cores for N instances?)
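The combined decision comes down to simple arithmetic: the SKU must be unrestricted, and vCPUs per VM times the VM count must fit into the free regional cores. Here is a small Python sketch of that check with plain numbers instead of API calls; the function name and messages are made up for this illustration, and it only looks at the total regional core quota, not per-family quotas.

```python
from typing import Tuple


def bulk_deploy_feasible(sku_restricted: bool, vcpus_per_vm: int, vm_count: int,
                         core_limit: int, cores_in_use: int) -> Tuple[bool, str]:
    """Combine the SKU restriction signal with a regional vCPU quota check."""
    if sku_restricted:
        return False, "SKU is restricted for this subscription"
    required = vcpus_per_vm * vm_count
    free = core_limit - cores_in_use
    if required > free:
        return False, f"quota short by {required - free} vCPUs"
    return True, f"{free - required} vCPUs would remain"


if __name__ == "__main__":
    # Ten 16-vCPU VMs need 160 cores; a limit of 200 with 50 in use leaves only 150.
    print(bulk_deploy_feasible(False, 16, 10, core_limit=200, cores_in_use=50))
    print(bulk_deploy_feasible(False, 16, 5, core_limit=200, cores_in_use=50))
```

As with the eligibility check, a passing result is a guardrail, not a capacity guarantee.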

Prerequisites already covered above:

az login
USER_PRINCIPAL=$(az ad signed-in-user show --query userPrincipalName -o tsv)
az group create --name vm-sku-monitor-rg --location eastus2
az role assignment create \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

python3 -m venv venv && source venv/bin/activate
pip install azure-identity azure-mgmt-compute azure-mgmt-subscription rich

File: monitor_vm_sku_capacity_bulk.py

#!/usr/bin/env python
"""
Azure VM SKU Capacity & Quota Monitor (with Zone support)

Checks:
  1) Whether your target SKU is available in a region or zone
  2) Whether your subscription has enough free vCPU quota to deploy N VMs

Optionally logs results into Azure Log Analytics.
"""
import argparse
import datetime
import json
import logging
import subprocess
from typing import List, Tuple, Dict, Any, Optional

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.subscription import SubscriptionClient

# Rich for prettier tables
try:
    from rich.console import Console
    from rich.table import Table
    from rich import box
    RICH_AVAILABLE = True
except ImportError:
    RICH_AVAILABLE = False

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger("vm_sku_capacity_monitor")


def parse_arguments():
    p = argparse.ArgumentParser(
        description="Azure VM SKU Capacity & Quota Monitor (with zone support)"
    )
    p.add_argument("--region", type=str, default="eastus2", help="Azure region to check")
    p.add_argument("--sku", type=str, default="Standard_D16ds_v5", help="VM SKU to check")
    p.add_argument("--zone", type=str, default=None,
                   help="(Optional) Availability zone to check (e.g. '1')")
    p.add_argument("--count", type=int, default=1, help="Number of VMs you plan to deploy")
    p.add_argument("--log-analytics", action="store_true",
                   help="Enable logging to Azure Log Analytics")
    p.add_argument("--endpoint", type=str, help="Data Collection Endpoint URI")
    p.add_argument("--rule-id", type=str, help="Data Collection Rule ID")
    p.add_argument("--stream-name", type=str, default="Custom-VMSKUCapacity_CL",
                   help="Log Analytics stream name")
    p.add_argument("--debug", action="store_true", help="Enable debug logging")
    p.add_argument("--config", type=str, help="Path to JSON config file")
    p.add_argument("--subscription-id", type=str, help="Azure Subscription ID")
    return p.parse_args()


def load_configuration(args) -> Dict[str, Any]:
    cfg = {
        "region": args.region,
        "zone": args.zone,
        "target_sku": args.sku,
        "desired_count": args.count,
        "subscription_id": args.subscription_id,
        "log_analytics": {
            "enabled": args.log_analytics,
            "endpoint": args.endpoint,
            "rule_id": args.rule_id,
            "stream_name": args.stream_name
        }
    }
    if args.config:
        try:
            with open(args.config) as f:
                j = json.load(f)
            # merge known keys
            for k in ("region", "zone", "target_sku", "desired_count", "subscription_id"):
                if k in j:
                    cfg[k] = j[k]
            cfg["log_analytics"].update(j.get("log_analytics", {}))
            logger.info(f"Loaded configuration from {args.config}")
        except Exception as e:
            logger.error(f"Failed loading config {args.config}: {e}")
    # CLI args override file.
    # Note: --region, --sku and --count have argparse defaults, so these three
    # always replace the values loaded from the config file.
    if args.region:
        cfg["region"] = args.region
    if args.zone:
        cfg["zone"] = args.zone
    if args.sku:
        cfg["target_sku"] = args.sku
    if args.count:
        cfg["desired_count"] = args.count
    if args.subscription_id:
        cfg["subscription_id"] = args.subscription_id
    return cfg


def get_subscription_id(explicit: Optional[str]) -> Optional[str]:
    if explicit:
        return explicit
    # Try Azure CLI
    try:
        out = subprocess.run(
            "az account show --query id -o tsv",
            shell=True, check=True, stdout=subprocess.PIPE, text=True
        ).stdout.strip()
        if out:
            return out
    except Exception:
        pass
    # Fallback: Azure SDK
    cred = DefaultAzureCredential()
    subs = list(SubscriptionClient(cred).subscriptions.list())
    return subs[0].subscription_id if subs else None


def check_sku_availability(
    compute: ComputeManagementClient,
    region: str,
    sku: str,
    zone: Optional[str] = None
) -> Tuple[bool, Optional[str], List[str], Dict[str, Any]]:
    """
    Returns:
        is_available (bool),
        reason (str or None),
        supported_zones (list of str),
        capabilities (dict of name→value)
    """
    skus = list(compute.resource_skus.list())
    entry = next(
        (s for s in skus
         if s.name.lower() == sku.lower()
         and region.lower() in [loc.lower() for loc in s.locations]),
        None
    )
    if not entry:
        return False, "NotFound", [], {}

    # Find all zones where this SKU is sold in that region
    supported_zones = []
    for loc_info in entry.location_info or []:
        if loc_info.location.lower() == region.lower():
            supported_zones = loc_info.zones or []
            break

    # Determine restrictions (restriction lists and their fields may be None)
    restrictions = entry.restrictions or []
    if zone:
        # 1) If SKU does not support the requested zone
        if zone not in supported_zones:
            return False, "UnsupportedZone", supported_zones, {}
        # 2) Check zone-level restrictionInfo.zones
        restricted = [
            r for r in restrictions
            if r.restriction_info
            and r.restriction_info.zones
            and zone in r.restriction_info.zones
        ]
    else:
        # Region-level check
        restricted = [
            r for r in restrictions
            if r.restriction_info
            and region.lower() in [l.lower() for l in (r.restriction_info.locations or [])]
        ]

    is_avail = len(restricted) == 0
    reason = restricted[0].reason_code if restricted else None

    # Pull out SKU capabilities (vCPUs, MemoryGB, etc.)
    caps = {c.name: c.value for c in entry.capabilities or []}
    return is_avail, reason, supported_zones, caps


def check_quota(
    compute: ComputeManagementClient,
    region: str,
    vcpus_needed: int,
    count: int
) -> Tuple[int, int, bool]:
    usage = list(compute.usage.list(location=region))
    core = next((u for u in usage if u.name.value.lower() == "cores"), None)
    free = (core.limit - core.current_value) if core else 0
    required = vcpus_needed * count
    return free, required, free >= required


def display(rdata: Dict[str, Any]):
    if RICH_AVAILABLE:
        c = Console()
        c.print(f"\n[bold underline]SKU Capacity & Quota (Zone) Check "
                f"({datetime.datetime.now():%Y-%m-%d %H:%M:%S})[/]\n")

        # Availability table
        t1 = Table(box=box.SIMPLE)
        t1.add_column("SKU")
        t1.add_column("Region")
        t1.add_column("Zone")
        t1.add_column("Available")
        t1.add_column("Reason")
        t1.add_row(
            rdata["target_sku"],
            rdata["region"],
            rdata["zone"] or "-",
            "✅" if rdata["is_available"] else "❌",
            rdata["reason"] or "-"
        )
        c.print(t1)

        # Supported zones
        t0 = Table(box=box.SIMPLE)
        t0.add_column("Supported Zones")
        t0.add_row(", ".join(rdata["supported_zones"]) or "None")
        c.print(t0)

        # Quota table
        t2 = Table(box=box.SIMPLE)
        t2.add_column("Desired VMs", justify="right")
        t2.add_column("vCPUs/VM", justify="right")
        t2.add_column("Free Cores", justify="right")
        t2.add_column("Needs Cores", justify="right")
        t2.add_column("Quota OK?", justify="center")
        t2.add_row(
            str(rdata["desired_count"]),
            str(rdata["vcpus"]),
            str(rdata["free_cores"]),
            str(rdata["required_cores"]),
            "✅" if rdata["quota_ok"] else "❌"
        )
        c.print(t2)
    else:
        print(f"\nSKU {rdata['target_sku']} in {rdata['region']} "
              f"zone {rdata['zone'] or '-'}: "
              f"Available={rdata['is_available']} (Reason={rdata['reason']})")
        print("Supported zones:", ", ".join(rdata["supported_zones"]) or "None")
        print(f"Quota: need {rdata['required_cores']} cores, "
              f"have {rdata['free_cores']} → OK={rdata['quota_ok']}")


def main():
    args = parse_arguments()
    if args.debug:
        logger.setLevel(logging.DEBUG)
    cfg = 
load_configuration(args) cfg["subscription_id"] = get_subscription_id(cfg.get("subscription_id")) logger.info(f"Checking SKU {cfg['target_sku']} x{cfg['desired_count']} " f"in {cfg['region']} zone {cfg['zone']}") cred = DefaultAzureCredential() compute = ComputeManagementClient(cred, cfg["subscription_id"]) # 1) SKU + zone availability is_avail, reason, zones, caps = check_sku_availability( compute, cfg["region"], cfg["target_sku"], cfg["zone"] ) vcpus = int(caps.get("vCPUs", 0)) # 2) Subscription quota check free, required, ok = check_quota( compute, cfg["region"], vcpus, cfg["desired_count"] ) result = { "target_sku": cfg["target_sku"], "region": cfg["region"], "zone": cfg["zone"], "supported_zones": zones, "desired_count": cfg["desired_count"], "is_available": is_avail, "reason": reason, "vcpus": vcpus, "free_cores": free, "required_cores": required, "quota_ok": ok } display(result) # (Optional) send to Log Analytics… # [omitted for brevity] if __name__ == "__main__": main()

Run the bulk-deploy checker (region-level check)

python monitor_vm_sku_capacity_bulk.py \
  --region centralus \
  --sku Standard_B2s_v2 \
  --count 10

(Optionally, add --log-analytics --endpoint <DCE-URI> --rule-id <DCR-ID> to send the results to Log Analytics.)
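The script above leaves the Log Analytics step out. As a rough sketch of what that step has to prepare, the function below shapes the script's result dictionary into a list of records for a custom stream. The column names (Sku, Region, QuotaOk, and so on) are assumptions for illustration, not the author's schema; they would have to match the stream declared in your own Data Collection Rule.

```python
import datetime
import json
from typing import Any, Dict, List


def build_log_records(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Shape one check result into records for a custom log stream.

    Column names are hypothetical and must match your own DCR stream.
    """
    record = {
        "TimeGenerated": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "Sku": result["target_sku"],
        "Region": result["region"],
        "Zone": result["zone"] or "",
        "IsAvailable": bool(result["is_available"]),
        "Reason": result["reason"] or "",
        "SupportedZones": ",".join(result["supported_zones"]),
        "FreeCores": int(result["free_cores"]),
        "RequiredCores": int(result["required_cores"]),
        "QuotaOk": bool(result["quota_ok"]),
    }
    return [record]


if __name__ == "__main__":
    # Sample values taken from the example output in this article
    sample = {
        "target_sku": "Standard_B2s_v2", "region": "centralus", "zone": None,
        "supported_zones": ["1", "3", "2"], "is_available": True, "reason": None,
        "free_cores": 100, "required_cores": 20, "quota_ok": True,
    }
    print(json.dumps(build_log_records(sample), indent=2))
    # The upload itself is not shown. One option is the azure-monitor-ingestion
    # package, whose LogsIngestionClient.upload() takes rule_id, stream_name
    # and a list of records like the one built here.
```

Keeping the record shaping separate from the upload makes it easy to check the payload locally before any data is sent.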

Example output

SKU Capacity & Quota (Zone) Check (2025-06-20 12:49:58)

  SKU               Region      Zone   Available   Reason
 ─────────────────────────────────────────────────────────
  Standard_B2s_v2   centralus   -      ✅          -

  Supported Zones
 ─────────────────
  1, 3, 2

  Desired VMs   vCPUs/VM   Free Cores   Needs Cores   Quota OK?
 ───────────────────────────────────────────────────────────────
           10          2          100            20      ✅

Availability in this output reflects SKU eligibility, not real-time capacity.

Run the bulk-deploy checker (zone-level check)

python monitor_vm_sku_capacity_bulk.py \
  --region centralus \
  --zone 2 \
  --sku Standard_B2s_v2 \
  --count 10

Example output

SKU Capacity & Quota (Zone) Check (2025-06-20 12:42:22)

  SKU               Region      Zone   Available   Reason
 ─────────────────────────────────────────────────────────
  Standard_B2s_v2   centralus   2      ✅          -

  Supported Zones
 ─────────────────
  1, 3, 2

  Desired VMs   vCPUs/VM   Free Cores   Needs Cores   Quota OK?
 ───────────────────────────────────────────────────────────────
           10          2          100            20      ✅

Availability in this output reflects SKU eligibility and zonal exposure, not real-time capacity.

Final Thoughts

This solution has proven to be a valuable asset for Azure infrastructure planning. It helps teams proactively identify SKU restrictions, understand zonal exposure, and spot changes in SKU eligibility over time.

Used correctly, it reduces surprise deployment failures by surfacing early where SKUs cannot be used, enabling better design decisions around regions, zones, and alternatives before production deployments.

Happy monitoring!

The Digital Native's Checklist for Azure: Stuff I wish every startup knew

rmmartins — Tue, 22 Apr 2025 14:50:05 GMT

I’ve had the chance to work with a bunch of digital native customers — you know, those fast-moving, API-first, cloud-from-day-zero teams building the next big thing. And while no two startups are ever quite the same, I’ve noticed a pattern: the same Azure gotchas pop up again and again.

So I thought, why not write down a quick checklist? Not a 100-page whitepaper. Just the stuff that actually helps — especially if you’re trying to go from MVP chaos to something a little more production-grade.

This isn’t just based on my own experience (though there’s been plenty of that). I’ve pulled together insights from some awesome blog posts and official docs to consolidate the essentials into one simple checklist. Let’s jump in!

Identity & Access: First thing to get right

Start here. Trust me, cleaning up Entra ID and access controls after you scale is a nightmare.

Use Microsoft Entra ID as your single source of truth.
Ditch the “Owner” role everywhere. Implement RBAC properly.
Use Managed Identities instead of storing secrets in your code. It’s cleaner, safer, and modern.
PIM (Privileged Identity Management) is your friend. Turn it on.

Extra reading:
Demystifying Entra Tenants and Subscriptions
From Zero to Hero: Identity in AKS

Networking & Security: You can't secure what you can’t see

Yes, even if you're “just prototyping.” Flat networks and open ports will haunt you later.

Set up your VNets, subnets, NSGs with actual thought.
Plan out VNet architecture — even if you think “we’re just testing stuff.”
Turn on Defender for Cloud. The free plan gives you a lot already.
Use Azure Firewall and DDoS protection where it makes sense.
Lock down public IPs, use private endpoints when you can.
Set up Key Vault + Managed Identity — even for “just a demo.”

Bonus:

Building a Secure & Scalable Foundation

AKS Networking Guide — bookmark this one.

Resource Management: Don’t be that team with 243 unnamed resources

I once worked with a customer who had 15 “rg-dev-test-temp” resource groups. No one knew who owned them. Chaos.

Follow a resource organization strategy. Management groups. Subscriptions. Do it.
Use tags everywhere. Tag by owner, environment, cost center — whatever helps. No exceptions.
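As a small illustration of the "no exceptions" rule, here is a sketch that flags resources missing required tags. The tag names and the resource list are made up for the example; in a real subscription you would feed it an inventory export, and Azure Policy can enforce the same rule at deployment time.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical tagging policy; adjust to whatever your team agrees on
REQUIRED_TAGS = ("owner", "environment", "cost-center")


def missing_tags(tags: Optional[Dict[str, str]],
                 required: Iterable[str] = REQUIRED_TAGS) -> List[str]:
    """Return the required tag names that are absent or empty."""
    present = {k.lower() for k, v in (tags or {}).items() if str(v).strip()}
    return [t for t in required if t not in present]


if __name__ == "__main__":
    # Made-up inventory: resource name -> tags
    resources = {
        "rg-dev-test-temp": {},
        "rg-payments-prod": {"Owner": "ana", "Environment": "prod", "cost-center": "42"},
        "rg-search-dev": {"owner": "li", "environment": ""},
    }
    for name, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            print(f"{name}: missing {', '.join(gaps)}")
```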

Cost & FinOps: Avoid billing surprises (and awkward CFO convos)

You will get burned if you don’t track costs. It’s not “extra work” — it’s survival.

Azure Cost Management is free. Use it.
Set budgets + alerts. Even if it’s just $10 over, that’s your early warning system.
Use Azure Advisor regularly. It's free. It’s there. It’s helpful. Just do it.
Check out those “hidden” optimizations — Reservations, Spot, Savings Plans.
Learn FinOps basics from this toolkit

Also:
Slash Your Azure Bill – Tips for Startups

Monitoring & Observability: MELT is not just a buzzword

You need to know what’s happening — before your customers do.

Enable Azure Monitor and Service + Resource Health.
Use Workbooks to make dashboards that are actually useful.
Set up advanced alerts.
MELT = Metrics, Events, Logs, Traces. Here’s a good read: MELT in Azure

Infrastructure as Code: No, clicking around in the portal isn’t “agile”

Use Bicep, ARM, or Terraform — not the portal. (Unless you're debugging.)
Plug it into CI/CD. Infra pipelines are a thing. Use them.
Add Azure Landing Zones for structure, governance, and scale-readiness — even if you’re small. They scale with you.

AKS & App Architecture: Because most of y’all are running Kubernetes anyway

Start here: AKS Guide for Startups
Learn about storage, upgrades, identity, and cluster models.
Add monitoring with Azure Monitor features for Kubernetes
And please, for the love of uptime, use the best practices for AKS

Azure OpenAI (AOAI): Because GenAI is everywhere now

Start with this gem: AOAI Best Practices
Follow this doc if you’re using your own data
Familiarize yourself with how Azure OpenAI processes and stores data.
Watch out for data residency, concurrency, and cost — especially at scale

Bonus: AWS background? Here's your Rosetta Stone

👉 Azure for AWS Pros

Final thought

This isn’t about checking every box on day one. It’s about having a clear, shared view of what “mature” looks like on Azure — for founders, devs, ops, finance, and even the intern shipping ARM templates on day three.

Save this list. Bookmark it. Share it with your team. Better yet, build your own version and make it yours.

Got a checklist you use or a tip you love? I’d seriously love to hear it.

Let’s build smart, not just fast.

Azure OpenAI best practices: A quick-reference guide to optimize your deployments

rmmartins — Fri, 25 Apr 2025 18:16:33 GMT

Contributors: Ahmed Chowdhury

As organizations increasingly integrate Azure OpenAI into their applications, it's essential to be aware of the comprehensive best practices that Microsoft has published. However, these valuable resources are often dispersed across various documentation pages, making it challenging to access them efficiently.

This quick-reference guide consolidates the key best practices for deploying and managing Azure OpenAI workloads. By bringing together architectural considerations, security measures, governance strategies, networking configurations, and more, this guide aims to provide a centralized resource to help you optimize your Azure OpenAI deployments effectively.

Architectural considerations

A robust architecture is the foundation of any successful Azure OpenAI deployment. Azure's Well-Architected Framework provides guidance to design and implement solutions that are reliable, secure, and efficient.

Key recommendations:

Design for scalability: Utilize Azure's scalable services to handle varying loads, ensuring consistent performance during peak times.
Optimize cost: Monitor and manage resources to avoid unnecessary expenditures. Implement auto-scaling and choose appropriate pricing tiers based on workload demands.

Example: An e-commerce platform using Azure OpenAI for personalized recommendations can leverage auto-scaling to handle increased traffic during sales events, ensuring users receive timely suggestions without over-provisioning resources.

For detailed architectural guidance, refer to the Architecture Best Practices for Azure OpenAI Service.

Security best practices

Protecting sensitive data and ensuring compliance are paramount when deploying AI solutions. Azure provides a comprehensive security baseline tailored for Azure OpenAI services.

Key recommendations:

Data encryption: Implement encryption for data at rest and in transit to safeguard against unauthorized access.
Access controls: Utilize Azure's Role-Based Access Control (RBAC) to restrict access to AI resources, ensuring only authorized personnel can interact with sensitive data.

Example: A healthcare provider deploying Azure OpenAI for patient diagnostics should encrypt patient data and restrict access based on roles, ensuring compliance with regulations like HIPAA.

For comprehensive security guidelines, consult the Azure Security Baseline for Azure OpenAI.

Governance strategies

Effective governance ensures that AI deployments align with organizational policies and regulatory requirements. Azure's governance recommendations provide a framework for managing AI resources.

Key recommendations:

Resource tagging: Implement consistent tagging for AI resources to facilitate tracking, management, and cost allocation.
Policy enforcement: Use Azure Policy to enforce organizational standards and assess compliance across AI resources.

Example: A company can use resource tagging to allocate AI resource costs to specific departments, ensuring transparency and accountability.

For detailed governance strategies, refer to the Governance Recommendations for AI Workloads on Azure.

Networking considerations

Efficient and secure networking is crucial for AI workloads, especially when dealing with large datasets and real-time processing. Azure offers networking recommendations tailored for AI services.

Key recommendations:

Virtual networks (VNet): Isolate AI resources within VNets to enhance security and control traffic flow.
Private endpoints: Use private endpoints to connect securely to AI services, reducing exposure to the public internet.

VNet Connectivity Patterns:

When you need AI resources in two VNets to talk to each other, there are two primary approaches:

1. Gateway‑to‑Gateway VPN

Encryption: Built‑in IPsec/IKE tunnel, ensuring all traffic is encrypted in transit.
Transit Support: Enables hub‑and‑spoke or multi‑region topologies without mesh peerings; just connect each spoke to a central transit VNet.
When to use: Regulated workloads, cross‑region connectivity, or any scenario demanding IPsec in-flight encryption.

2. VNet Peering

Performance: Lowest latency over Microsoft’s backbone network.
Cost: No gateway data‑processing charges (peering is metered on inbound and outbound data transfer).
When to use: VNets in the same region/tenant, where encryption‑tunnel overhead isn’t required and you want simplicity and speed.

Note:

Peering is non‑transitive by default: A↔B and B↔C peerings don’t auto-connect A to C. To achieve transit, you need either gateway transit settings on your peering or a VPN hub.

If you require both low latency and encrypted traffic, you can combine peering (data path) with Azure Route Server + NVA‑based IPsec—or stick with VPN for simplicity.
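To make the non-transitive rule concrete, here is a minimal model of peering reachability. It is only an illustration of the rule described in the note (the VNet names are placeholders and nothing here calls Azure): two VNets can talk only if a peering exists directly between them.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def build_peerings(pairs: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Store each peering as an unordered pair of VNet names."""
    return {frozenset(pair) for pair in pairs}


def can_reach(src: str, dst: str, peerings: Set[FrozenSet[str]]) -> bool:
    """Direct peering only: traffic does not hop through a third VNet."""
    return src == dst or frozenset((src, dst)) in peerings


if __name__ == "__main__":
    peerings = build_peerings([("A", "B"), ("B", "C")])
    print(can_reach("A", "B", peerings))  # True: peered directly
    print(can_reach("A", "C", peerings))  # False: B does not forward for A

    # Adding a direct A-C peering (full mesh) is one way to connect them
    peerings = build_peerings([("A", "B"), ("B", "C"), ("A", "C")])
    print(can_reach("A", "C", peerings))  # True
```

The model deliberately ignores gateway transit and hub routing, which are the other ways to connect A and C mentioned above.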

Example: A financial institution processing real-time transactions with AI can use VNets and private endpoints to ensure data remains within a secure network boundary, mitigating risks of data breaches.

For comprehensive networking guidelines, consult the Networking Recommendations for AI Workloads on Azure.

Quota management and optimization

Azure imposes quotas to manage resource usage effectively. Understanding and optimizing these quotas ensures uninterrupted AI operations.

Key recommendations:

Monitor usage: Regularly monitor token usage and request rates to stay within allocated quotas.
Request increases proactively: If approaching quota limits, request increases in advance to avoid service disruptions.

Example: A chatbot service experiencing increased user interactions should monitor token usage and anticipate quota adjustments to maintain seamless user experiences.

For detailed quota management, refer to:

Provisioned throughput units (PTUs)

For workloads requiring consistent and predictable performance, Azure offers Provisioned Throughput Units (PTUs).

Key recommendations:

Assess workload needs: Determine if PTUs align with your workload's performance requirements and cost considerations.
Plan for scalability: Allocate PTUs based on anticipated growth, ensuring the AI system can handle increased demand.
Monitor utilization: Regularly monitor PTU utilization to ensure optimal performance and cost-effectiveness.

Example: A streaming service using Azure OpenAI for content recommendations can deploy PTUs to guarantee consistent performance during peak viewing times.

For detailed information on PTUs, refer to the Provisioned Throughput Units (PTUs) in Azure OpenAI Service.

Monitoring and logging

Comprehensive monitoring and logging are vital for maintaining the health and performance of AI systems. Azure provides tools to monitor AI services effectively.

Key recommendations:

Enable diagnostic logs: Capture detailed logs for troubleshooting and performance analysis.
Set up alerts: Configure alerts for anomalies or performance degradation to enable proactive responses.
Utilize Azure Monitor: Use Azure Monitor to collect, analyze, and act on telemetry data from your Azure OpenAI resources.

Example: An online retailer using Azure OpenAI for customer support chatbots can set up alerts to detect unusual spikes in response times, allowing for immediate investigation and resolution.

For comprehensive monitoring guidelines, consult the Monitor Azure OpenAI Service documentation.

Multi-region gateway deployment strategy for Azure OpenAI

To enhance reliability, latency, and resilience for geographically distributed Azure OpenAI users, a multi-region API gateway architecture is strongly recommended. This has become a key focus for engineering teams and field specialists, and for good reason: regional outages, high traffic scenarios, or backend limitations can impact availability. A well-architected gateway setup helps mitigate these issues.

Why This Matters

You can route requests intelligently across multiple Azure OpenAI deployments or models.
You minimize latency by serving traffic from the closest region.
You reduce single points of failure and improve your disaster recovery posture.

Implementation Patterns

There are two main patterns for implementing this in production:

Option 1: Azure API Management Premium – Multi-region deployment (recommended for enterprise scale)

This option leverages Azure API Management's built-in multi-region deployment capability, available with the Premium tier.

Benefits:

Replicates the gateway component across multiple Azure regions.
Traffic is automatically routed to the nearest regional gateway based on latency.
Ensures localized access points and high availability in case of regional failures.

Considerations:

Requires Premium tier (higher cost).
Management plane and developer portal remain in the primary region.

Option 2: Standard tier APIM with external load balancer (cost-effective alternative)

If Premium tier is not feasible, you can deploy separate APIM instances (Standard tier or higher) in each region and use a global load balancer like Azure Front Door or Traffic Manager to distribute traffic.

Steps:

Deploy multiple APIM instances independently in different regions.
Use Azure Front Door or Traffic Manager to route traffic based on geo-proximity or latency.
Maintain consistent configuration across all APIM instances.

Trade-offs:

No built-in multi-region replication; manual config sync needed.
More flexible cost-wise and supports gradual scaling.

Additional strategies to strengthen resilience

Multi-backend gateway pattern: Configure your APIM to route requests to different OpenAI deployments/models based on performance, availability, or workload type.
Public backbone consumption: Use gateways that connect via the Microsoft Public Backbone to improve performance and reduce exposure to public internet routing.
Business continuity & disaster recovery (BCDR): Integrate failover rules, caching, and retry policies to ensure seamless experiences during disruptions.

Example: A multinational company deploying Azure OpenAI for internal employee support creates deployments in East US, West Europe, and Southeast Asia. They set up regional APIM gateways using the Premium tier and route traffic intelligently through Azure Front Door. If the East US region is unavailable, users are routed to West Europe automatically — with minimal latency impact — ensuring uptime and productivity.
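The routing decision in that example can be sketched in a few lines. This is only a model of the idea (prefer the lowest-latency healthy region, fail over when one goes down); in practice Azure Front Door or API Management makes this choice for you. The region list follows the example above, while the latency numbers are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    region: str          # region hosting the gateway and deployment
    latency_ms: float    # latency seen by the caller (made-up numbers)
    healthy: bool = True


def pick_backend(backends: List[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the lowest latency, or None."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda b: b.latency_ms)


if __name__ == "__main__":
    backends = [
        Backend("eastus", 20.0),
        Backend("westeurope", 95.0),
        Backend("southeastasia", 210.0),
    ]
    print(pick_backend(backends).region)   # eastus

    backends[0].healthy = False            # simulate an East US outage
    print(pick_backend(backends).region)   # westeurope
```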

Resources:

Bonus: Download the full Azure OpenAI review checklist

If you're looking for a structured way to assess your Azure OpenAI implementation, Azure Review Checklists now provides an AI Landing Zone checklist with 180+ best practice items covering every critical area: Governance, Operations, Networking, Identity, Cost Management, and Business Continuity & Disaster Recovery (BCDR):

Download the official Review Checklist Excel Workbook
Select AI Landing Zone and click Import latest checklist
Load the AI Landing Zone checklist and explore categorized recommendations with direct reference links to Microsoft documentation.

This checklist serves as a powerful tool to validate architecture decisions, uncover gaps, and guide implementation discussions across technical and governance domains.

Conclusion

By adhering to these best practices, organizations can effectively manage and secure their Azure OpenAI workloads, ensuring they are reliable, efficient, and aligned with industry standards.

AKS networking made easy: Your comprehensive guide

rmmartins — Thu, 17 Apr 2025 14:55:28 GMT

Azure Kubernetes Service (AKS) is not just about deploying containerized applications—it’s also about architecting robust, secure, and efficient network connectivity for your clusters. In this blog post, we’ll explore the intricacies of AKS networking, clarify the different models and options available, and discuss best practices through real-world scenarios. Whether you’re just starting out or looking to fine-tune an existing deployment, this guide will help you master AKS networking.

1. AKS network topologies and connectivity

Understanding the network topology is the foundation of effective AKS networking. The Cloud Adoption Framework’s AKS network topology and connectivity guide provides a structured look at how AKS clusters integrate into an organization’s network fabric.

Key concepts:

Cluster connectivity: How pods, services, and external resources communicate.
Topology options: From simple flat networks to more segmented designs.

Real-world scenario: Imagine a multi-tier application where frontend pods need to securely talk to backend services and databases. A clear network topology ensures that the traffic flow respects both performance and security requirements.

This diagram illustrates a simplified view of how traffic flows from external users through an ingress controller to both frontend and backend pods.

2. Comparing AKS network models

One of the most important decisions when deploying AKS is choosing between the different networking models.

Kubenet was one of the original networking drivers in Kubernetes, and it still “just works” out of the box in most on‑prem or DIY clusters. But as we’ve moved toward managed, cloud‑hosted Kubernetes, vendor‑built CNIs have become the norm—solving Kubenet’s limitations around IP‑address management, scalability and lack of overlay networking.

That’s why AKS now offers a full spectrum of Azure‑native CNIs—Standard (Node Subnet), Overlay, dynamic IP allocation and even Cilium‑powered variants—each built to fill those gaps. Standard mode injects pod IPs straight into your VNet, Overlay preserves your address space, dynamic IP mode auto‑manages huge clusters, and Cilium brings eBPF‑driven performance and observability.

The AKS concepts on network models outline the primary options:

Kubenet vs. Azure CNI (Standard)

Kubenet:

Kubenet in action: Pods receive overlay network IPs, use NAT for external communication, and preserve VNET addresses.

- Simplicity and flexibility: Pods receive an IP from an overlay network.
- Use case: Historically, kubenet was favored for smaller clusters or scenarios where conserving IP addresses was important.
- Important notice: On 31 March 2028, kubenet networking for Azure Kubernetes Service (AKS) will be retired. To avoid service disruptions, you will need to upgrade your workloads running on kubenet to Azure Container Networking Interface (CNI) overlay before that date. More details can be found in the official Microsoft documentation.

Azure CNI (Standard Mode):

CNI Standard: Pods obtain IPs directly from the VNET, ensuring seamless integration but requiring careful IP planning.

Full integration: Pods get IP addresses directly from the virtual network (VNET), providing seamless integration with other Azure resources.
Scalability and integration: Ideal for large clusters and scenarios that require tight integration with Azure networking.

Azure CNI Standard vs. Azure CNI Overlay

When choosing a networking approach for your AKS cluster, it's important to understand the trade-offs between the two main Azure CNI variants. Azure CNI Standard assigns pod IPs directly from your Azure VNET, offering tight integration with your network infrastructure. In contrast, Azure CNI Overlay assigns pod IPs from a private CIDR that is logically separate from the VNET, which can be advantageous for large-scale deployments with limited IP space. Below is an overview of the differences between these two approaches:

Azure CNI Standard:

CNI Standard: Pods directly receive IPs from the VNET, allowing seamless integration but requiring careful IP planning.

Direct IP assignment: Each pod is assigned a unique IP address from your Azure VNET.
Full VNET integration: Enables use of VNET-level controls (like NSGs) and ensures pods are routable within your VNET.
IP consumption: Requires careful IP planning, as each pod consumes a VNET IP.
Learn more: Azure CNI networking
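Because every pod consumes a VNET address in this mode, it helps to run the numbers before picking a subnet size. The sketch below assumes the commonly documented sizing rule of (nodes + 1) + (nodes + 1) × max pods per node, where the extra node leaves room for upgrades, a default of 30 pods per node, and the 5 addresses Azure reserves in every subnet. Treat the defaults as assumptions and check them against your own cluster settings.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # addresses Azure reserves in each subnet


def ips_required(nodes: int, max_pods_per_node: int = 30, surge_nodes: int = 1) -> int:
    """IPs needed: one per node plus one per potential pod, with surge headroom."""
    total_nodes = nodes + surge_nodes
    return total_nodes * (1 + max_pods_per_node)


def max_nodes_for_subnet(cidr: str, max_pods_per_node: int = 30, surge_nodes: int = 1) -> int:
    """Largest node count that fits in the subnet under the same rule."""
    usable = ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET
    return max(usable // (1 + max_pods_per_node) - surge_nodes, 0)


if __name__ == "__main__":
    print(ips_required(50))                      # 1581
    print(max_nodes_for_subnet("10.0.1.0/24"))   # 7
    print(max_nodes_for_subnet("10.0.0.0/20"))   # 130
```

Under these assumptions, a /24 subnet holds only a handful of nodes, which is why overlay networking is attractive when address space is tight.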

Azure CNI Overlay:

CNI Overlay: Pods receive IPs from an overlay network, decoupled from VNET IP space for efficient IP usage.

Overlay network: Pods receive IP addresses from a private CIDR that is logically separate from the VNET hosting the nodes.
Efficient IP utilization: Decouples pod IP assignment from the VNET's IP range, which is beneficial for large-scale deployments with limited VNET address space.
Performance consideration: Traffic from pods to destinations outside the cluster is source-NATed to the node's IP, so pods are not directly addressable from the VNET.
Learn more: Azure CNI overlay

Additional Azure CNI variants

Beyond the standard modes, Microsoft offers other variants to address different workload needs:

Azure CNI with dynamic IP allocation: Allocates pod IP addresses dynamically, reducing the need for pre-allocation and easing IP management in highly dynamic environments.
- Benefits:
  - Reduces IP waste when pods are ephemeral and can be scaled up or down frequently.
  - Simplifies IP address management by allocating IPs on-demand.
- When to use: Ideal for environments with rapid scaling or high pod churn, where managing a static pool of IPs can be cumbersome.
- Learn more: Azure CNI with dynamic IP allocation
Azure CNI Powered by Cilium: Leverages Cilium and eBPF to provide advanced networking capabilities, enhanced security policies, and improved observability.
- Benefits:
  - Provides granular security and networking policies with high performance, thanks to eBPF.
  - Enables advanced features like transparent encryption, load balancing, and deep visibility into network flows.
- When to use: Suitable for organizations looking for cutting-edge network security, observability, and performance improvements.
- Learn more: Azure CNI Powered by Cilium

Example lab: Deploying an AKS cluster with Azure CNI (Standard)

1. Plan IP addressing: Use the Azure CNI configuration guide to determine your IP range.

2. Create the AKS cluster:

# Variables
resourceGroup="MyResourceGroup"
location="centralus"
vnetName="MyVnet"
vnetAddressPrefix="10.0.0.0/16"
subnetName="MySubnet"
subnetAddressPrefix="10.0.1.0/24"
aksName="MyCNIAKSCluster"
serviceCidr="10.200.0.0/16"
dnsServiceIp="10.200.0.10"

# Create resource group
az group create --name "$resourceGroup" --location "$location"

# Create virtual network
az network vnet create \
  --resource-group "$resourceGroup" \
  --name "$vnetName" \
  --address-prefix "$vnetAddressPrefix"

# Create subnet within the VNET
az network vnet subnet create \
  --resource-group "$resourceGroup" \
  --vnet-name "$vnetName" \
  --name "$subnetName" \
  --address-prefix "$subnetAddressPrefix"

# Retrieve current subscription ID and build the subnet ID dynamically
subId=$(az account show --query id -o tsv)
subnetId="/subscriptions/${subId}/resourceGroups/${resourceGroup}/providers/Microsoft.Network/virtualNetworks/${vnetName}/subnets/${subnetName}"

# Create the AKS cluster using the dynamic subnet ID
az aks create \
  --resource-group "$resourceGroup" \
  --name "$aksName" \
  --location "$location" \
  --network-plugin azure \
  --vnet-subnet-id "$subnetId" \
  --service-cidr "$serviceCidr" \
  --dns-service-ip "$dnsServiceIp" \
  --enable-managed-identity

AKS CNI Standard mode – Key networking parameters

When deploying an AKS cluster using Azure CNI in standard mode, it’s important to understand the key parameters that control network configuration:

--service-cidr:

Purpose: This parameter defines the CIDR block from which Kubernetes service IPs are allocated.
Usage: The service CIDR must be a range that does not conflict with your virtual network (VNET) or pod IP ranges.
Example: If you specify --service-cidr 10.200.0.0/16, all cluster services (such as those created via kubectl expose) will receive IPs from this range. It’s critical to plan this CIDR carefully to ensure there are no overlaps with any other network segments in your environment.

--dns-service-ip:

Purpose: This parameter designates the IP address within the service CIDR that is used for the cluster’s DNS service (typically CoreDNS).
Usage: This IP must fall within the range defined by the service CIDR and must not be in use by any other service.
Example: For a service CIDR of 10.200.0.0/16, you might set --dns-service-ip 10.200.0.10. This reserved IP is then used by the DNS service to resolve names for services and pods within the cluster.

Why these settings are critical:
Using separate CIDR blocks for the VNET, pods, and services ensures there is no overlap, which is essential for proper routing and network isolation. While Azure CNI (standard mode) assigns pod IPs directly from the VNET, the service CIDR is distinct and is only used for service IP allocation. This separation allows you to have more control over your network design and helps prevent conflicts with external networks.
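Both rules are easy to check before you run az aks create. Here is a minimal sketch using Python's standard ipaddress module; the function name is hypothetical and the values in the usage example are the ones from the lab above.

```python
import ipaddress
from typing import List


def validate_service_network(vnet_cidr: str, service_cidr: str,
                             dns_service_ip: str) -> List[str]:
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    vnet = ipaddress.ip_network(vnet_cidr)
    service = ipaddress.ip_network(service_cidr)
    dns_ip = ipaddress.ip_address(dns_service_ip)

    # Rule 1: the service CIDR must not overlap the VNET range
    if vnet.overlaps(service):
        problems.append(f"service CIDR {service} overlaps VNET {vnet}")

    # Rule 2: the DNS service IP must sit inside the service CIDR
    if dns_ip not in service:
        problems.append(f"DNS service IP {dns_ip} is outside {service}")

    return problems


if __name__ == "__main__":
    # Values from the lab: no problems expected
    print(validate_service_network("10.0.0.0/16", "10.200.0.0/16", "10.200.0.10"))

    # A VNET that swallows the service range, plus a stray DNS IP
    print(validate_service_network("10.0.0.0/8", "10.200.0.0/16", "10.201.0.10"))
```

The check covers only the two rules described here; it does not know about other networks your cluster may need to reach, such as peered VNets or on-premises ranges.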

3. Validate networking:

Check that nodes and pods are receiving IPs from the specified VNET. This command displays each node's internal IP address, helping you verify that nodes are attached to the correct network.

kubectl get nodes -o wide

Look for the INTERNAL-IP column to confirm that each node's IP falls within the expected VNET address space.

To check that pods are receiving IPs correctly, list pods across all namespaces:

kubectl get pods --all-namespaces -o wide

The IP column should show addresses allocated from the VNET's defined range (for Azure CNI).

For additional details on a specific node’s networking, including labels and annotations related to IP assignment, you can describe the node:

kubectl describe node <node-name>

Replace <node-name> with one of the node names from the previous command. This output can help confirm that the node is correctly integrated with the VNET.

These commands together help validate that both the node and pod IP assignments are in line with your planned IP ranges, ensuring that your network planning and model selection are correctly implemented.
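When a cluster has many nodes, scanning the INTERNAL-IP column by eye gets tedious. As a small sketch (the function name node_internal_ips is hypothetical), the column can be extracted by its header name so the result is easy to compare against your planned range:

```shell
# Read `kubectl get nodes -o wide` output on stdin and print
# "NAME INTERNAL-IP" for every node row.
node_internal_ips() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "INTERNAL-IP") col = i; next }
       col     { print $1, $col }'
}

# Usage against a live cluster:
#   kubectl get nodes -o wide | node_internal_ips
```

Locating the column by header rather than by fixed position keeps the helper working if the column order of the kubectl output changes.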

3. Private clusters and DNS configurations

For organizations with strict security requirements, AKS offers the ability to create private clusters. Private clusters ensure that the API server is not exposed to the public internet, enhancing security.

Key topics:

Private cluster deployment: Detailed in the private clusters documentation and the DNS prerequisites.
Private DNS: The Azure Private DNS overview explains how to leverage private DNS zones, and the configuration guide provides a step-by-step approach to integrate it with your private clusters.

Example lab: Creating a private AKS cluster with CNI (Standard)

1. Deploy a private AKS cluster:

az aks create \
  --resource-group MyResourceGroup \
  --name MyPrivateAKSCluster \
  --enable-private-cluster \
  --network-plugin azure

Why isn't the VNET or service CIDR specified?

In this example, advanced networking parameters like the virtual network, subnet, service CIDR, and DNS service IP are not explicitly defined. This is because:

Default networking configuration: When these parameters are omitted, AKS automatically provisions a default virtual network and assigns IP ranges for the cluster. With --network-plugin azure, the cluster is created using Azure CNI. This managed configuration is sufficient for many scenarios, reducing complexity during initial deployments.
Focus on enabling privacy: The primary goal in this scenario is to enable the private connectivity feature. By focusing on --enable-private-cluster, the example emphasizes that the API server will be accessible only within the internal network. Customizing networking settings (like specifying a particular VNET or IP ranges) is optional and can be added if you have specific integration or policy requirements.
Flexibility and customization: If your deployment requires integration with an existing virtual network or adherence to particular IP address planning, you can extend the command to include those parameters, similar to the public cluster examples. The minimal command is provided as a baseline for simplicity and ease of deployment.
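To illustrate how the minimal command can be extended, here is a hedged sketch that assembles the az aks create arguments and appends the optional networking flags only when values are supplied. The function name build_private_aks_args is hypothetical. The flags themselves (--vnet-subnet-id, --service-cidr, --dns-service-ip) are standard az aks create parameters, and the CIDR values are reused from the earlier example.

```shell
# Print one az argument per line.
# $1 = subnet resource ID, $2 = service CIDR, $3 = DNS service IP (each may be empty).
build_private_aks_args() {
  subnet_id=$1; service_cidr=$2; dns_ip=$3
  set -- aks create \
    --resource-group MyResourceGroup \
    --name MyPrivateAKSCluster \
    --enable-private-cluster \
    --network-plugin azure
  if [ -n "$subnet_id" ]; then set -- "$@" --vnet-subnet-id "$subnet_id"; fi
  if [ -n "$service_cidr" ]; then set -- "$@" --service-cidr "$service_cidr"; fi
  if [ -n "$dns_ip" ]; then set -- "$@" --dns-service-ip "$dns_ip"; fi
  printf '%s\n' "$@"
}

# Dry run: print the assembled command instead of executing it.
# Replace "echo az" with "az" to create the cluster for real.
build_private_aks_args "" 10.200.0.0/16 10.200.0.10 | xargs echo az
```

Printing the command first makes it easy to review the final flag set before anything is provisioned.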

2. Configure a private DNS zone:

The Azure Private DNS overview explains how to leverage private DNS zones for name resolution within your virtual network. For private clusters, configuring a private DNS zone ensures that your cluster’s API server and internal endpoints are accessible using friendly domain names. The configuration guide provides step-by-step instructions for this setup.

Real-world example: Consider a financial services company that must comply with strict data residency and security guidelines. Deploying AKS as a private cluster—with a dedicated private DNS zone—ensures that all control-plane communications and sensitive endpoints remain isolated within the company’s secure virtual network. If advanced network customization is needed, additional parameters (like a pre-created VNET, custom service CIDR, etc.) can be integrated into the deployment command.

4. Ingress, application routing, and traffic management

Managing incoming traffic is critical for any production-grade application. AKS offers several options for routing traffic:

Application Gateway for Containers

Azure’s latest Ingress offering, Application Gateway for Containers, is the successor to Application Gateway Ingress Controller, bringing numerous performance, resiliency, and layer 7 load balancing capabilities. In addition, it adopts Kubernetes’s latest Gateway API to enable administrators and developers to easily define their load balancing intent.

Application Gateway Ingress Controller (AGIC)

Azure Application Gateway Ingress Controller can provide advanced load balancing, SSL termination, and web application firewall capabilities.

Application routing

HTTP Application Routing: Historically, HTTP Application Routing was a popular option for simplifying DNS management for your applications. Note: Microsoft has announced that HTTP Application Routing will be retired on 03 March 2025. It is recommended that you migrate to the Application Routing add-on by that date to ensure continued support and enhanced functionality. For further details on migration, refer to the App routing migration guide.

Traffic management overview

In addition to ingress and application routing, effective traffic management involves strategies that optimize how traffic is handled within your environment. While a deep dive into these advanced topics is beyond the scope of this article, here is a brief overview:

Traffic splitting & canary deployments: Techniques that enable gradual rollout of new application versions by directing a portion of the traffic to new deployments while the majority remains on the current version. This reduces risk during updates and allows for real-time testing under live conditions.
A/B testing & blue/green deployments: Strategies that allow you to serve different versions of your application to different user groups. This can help in testing features or UI changes before a full rollout, ensuring smoother transitions and minimizing disruption.
Geo-based routing: Directing user requests to the nearest available service endpoint based on geographic location. This not only improves response times but also enhances the overall user experience by reducing latency.
Service mesh integration: Tools like Istio can be deployed alongside AKS to provide fine-grained control over traffic routing, observability, and secure communication between services. These tools add another layer of management for scenarios that require dynamic traffic policies and granular control.

Note: For a comprehensive exploration of these advanced traffic management strategies, a dedicated article would be ideal. This overview aims to provide context on how these techniques integrate with basic ingress and application routing to form a complete traffic management strategy.

Ingress resource example with Gateway API and how to use it

With the new capabilities of Application Gateway for Containers, you can now leverage the Gateway API for more advanced ingress scenarios—such as hosting multiple sites and aligning with Kubernetes’ evolving standards. Unlike the traditional ingress resource, the Gateway API provides a more flexible and standardized way to manage external traffic.

Step 1: Prepare Your Backend Service

Ensure you have a backend service deployed (for example, a service named my-service that listens on port 80). For instance:
Service Configuration (my-service.yaml)

apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: default
spec:
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

Deploy the service using:

kubectl apply -f my-service.yaml

Step 2: Create a Gateway API Configuration

Below is an example of how to configure the Gateway API to work with AGC. This example demonstrates creating a Gateway and an associated HTTPRoute to host traffic for the hostname example.yourdomain.com.

Gateway Configuration (gateway.yaml):

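apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
  namespace: default
  # AGC also needs a reference to its ALB resource (via annotations or
  # spec.addresses); see the AGC documentation for the exact form.
spec:
  # Gateway class registered by Application Gateway for Containers
  gatewayClassName: azure-alb-external
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      # An HTTPS listener also requires a tls.certificateRefs entry
      # that points to a TLS secret.
      allowedRoutes:
        namespaces:
          from: All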

HTTPRoute Configuration (httproute.yaml):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: my-httproute
  namespace: default
spec:
  parentRefs:
    - name: my-gateway
  hostnames:
    - "example.yourdomain.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: my-service
          port: 80

Step 3: Deploy the Gateway and HTTPRoute

Apply the configurations:

kubectl apply -f gateway.yaml
kubectl apply -f httproute.yaml

Step 4: Validate the Deployment

DNS resolution: Ensure that example.yourdomain.com resolves to the frontend of your Application Gateway for Containers. AGC exposes its frontend as an FQDN, so this is typically done with a CNAME record.
Testing connectivity: From an external client, send an HTTPS request to https://example.yourdomain.com and verify that the traffic is routed to your backend service.
Monitoring and troubleshooting: Use the following commands to inspect the status and events of your Gateway and HTTPRoute:

kubectl describe gateway my-gateway -n default
kubectl describe httproute my-httproute -n default

Key advantages of Gateway API with AGC:

Advanced routing capabilities: The Gateway API allows you to define multiple routes, enabling scenarios like multiple site hosting, path-based routing, and more.
Future-proof alignment: With Ingress API development in a freeze state, adopting the Gateway API aligns your deployments with the evolving direction of Kubernetes networking.
Unified management: By using AGC with Gateway API, you benefit from the advanced features of Application Gateway for Containers, including robust load balancing and enhanced security features.

This updated approach not only modernizes your ingress setup but also provides a more scalable and flexible way to manage external traffic into your AKS clusters.

For additional details and the latest examples, see the documentation on multi-site hosting with Application Gateway for Containers.

Another great piece on AGC, written by Jose Moreno, is available here: Application Gateway for Containers: a not-so-gentle intro

Diagram: AGC traffic flow

An example architecture illustrating how Application Gateway for Containers (AGC) uses the Gateway API to route HTTPS traffic from a client to different services within an AKS cluster.

Scenario: A global e-commerce platform leverages Application Gateway for Containers (AGC) integrated with the Gateway API to route traffic based on hostnames, paths, or other advanced routing rules. This approach allows each microservice (e.g., checkout, product catalog, user management) to be served through its own route configuration, simplifying scaling and updates. As the platform grows, the Gateway API’s extensible model ensures a future-proof solution—one that supports multiple site hosting and advanced traffic management without relying on the older Ingress API.
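The path-based routing in this scenario could be expressed with a single HTTPRoute that carries one rule per microservice. The sketch below writes such a manifest to a file. The hostname shop.yourdomain.com and the Service names checkout-service and catalog-service are hypothetical, and the route assumes a Gateway named my-gateway already exists in the default namespace.

```shell
# Write an HTTPRoute with two path-based rules (all names are illustrative).
cat > shop-httproute.yaml <<'EOF'
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: shop-httproute
  namespace: default
spec:
  parentRefs:
    - name: my-gateway
  hostnames:
    - "shop.yourdomain.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-service
          port: 80
    - matches:
        - path:
            type: PathPrefix
            value: /catalog
      backendRefs:
        - name: catalog-service
          port: 80
EOF
echo "Wrote shop-httproute.yaml; apply it with: kubectl apply -f shop-httproute.yaml"
```

Because each rule has its own backendRefs, a team can change or scale one microservice's route without touching the others.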

5. Virtual networks, service endpoints, and private link

Integrating your AKS clusters with Azure Virtual Networks (VNETs) is crucial for secure communication with other Azure services.

Service endpoints and private link:

Service endpoints: The Virtual network service endpoints overview explains how endpoints extend VNET private address space to Azure services.
Private link: For even tighter integration, private link allows you to access Azure PaaS services over a private endpoint in your VNET.

Diagram: VNET Integration:

An AKS cluster integrates with an Azure Virtual Network, enabling secure access to Azure SQL Database, Storage Accounts, and other PaaS services

Example Use Case: A healthcare application that needs to access an Azure SQL Database can use service endpoints or Private Link. This ensures that traffic between the AKS cluster and the database does not traverse the public internet, thereby meeting regulatory compliance and security requirements.

6. Planning IP addressing with Azure CNI

A critical aspect of designing your AKS network is planning the IP address space. The Azure CNI configuration guide helps you to:

Determine IP range requirements: For nodes and pods.
Avoid address overlap: With existing VNETs or on-premises networks.

7. Egress traffic management and security controls

Outbound traffic from your AKS clusters must be managed to ensure security and compliance. There are several approaches:

Egress options:

UDR and Azure Firewall: The Deploy a cluster with outbound type of UDR and Azure Firewall documentation details how to route egress traffic through user-defined routes (UDRs) and Azure Firewall for enhanced control.
Egress outbound types: Additional details in the egress outbound type guide illustrate various configurations.
Limiting egress traffic: The limit egress traffic document offers strategies for restricting outbound access to only trusted destinations.

Security layers:

Network security groups (NSGs): NSGs in virtual networks provide an extra layer of security by filtering traffic at the subnet or NIC level.
Network policies: For pod-level security, the use network policies guide explains how to restrict communication between pods.

Network policies are a key tool for enforcing security at the pod level. They allow you to restrict both ingress and egress traffic between pods. For example, an ingress policy can permit only pods with the label app: frontend to communicate with pods labeled app: backend.
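A minimal sketch of such an ingress policy follows. The policy name allow-frontend-to-backend and the default namespace are assumptions, and the policy is only enforced if the cluster was created with a network policy engine enabled (for example through the --network-policy option of az aks create).

```shell
# Write a NetworkPolicy that allows ingress to backend pods only from frontend pods.
cat > allow-frontend-to-backend.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  # The policy applies to pods labeled app: backend
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    # Only pods labeled app: frontend may connect
    - from:
        - podSelector:
            matchLabels:
              app: frontend
EOF
echo "Wrote allow-frontend-to-backend.yaml; apply it with: kubectl apply -f allow-frontend-to-backend.yaml"
```

Once a pod is selected by an ingress policy, all inbound traffic that is not explicitly allowed is dropped, so any other clients of the backend need their own rule.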

Practical scenario: In a scenario where a cluster hosts a mix of public-facing and internal services, configuring UDRs with Azure Firewall and applying NSGs and network policies ensures that public endpoints are hardened while internal communications remain efficient and secure.

8. Advanced networking: CNI overlay and operator best practices

For those looking to push the envelope in AKS networking, advanced configurations can offer improved performance and flexibility. One such configuration is using Azure CNI Overlay, which helps in scenarios where you need to conserve VNET IP addresses for large-scale deployments.

What is Azure CNI overlay?

Overlay network: Instead of assigning each pod an IP directly from your VNET (as in standard Azure CNI), pods receive IP addresses from a private CIDR that is logically separate from the VNET hosting the nodes. Azure routes pod-to-pod traffic inside the cluster natively, without an encapsulation method such as VXLAN, which decouples pod IP assignment from your VNET’s IP range.
Efficient IP utilization: This approach is particularly beneficial in environments with limited VNET address space or when deploying clusters with high pod density.
Trade-off: Pods are not directly addressable from outside the cluster, and traffic leaving the cluster is translated (SNAT) to the node's IP address. In exchange, the overlay approach greatly enhances scalability.

Advanced concepts:

CNI overlay: The Azure CNI overlay documentation covers how to leverage overlay networks when direct VNET integration is not feasible.
Operator best practices: Following the Operator best practices for networking can help you maintain optimal performance and security in production environments.
CNI overview: The AKS concepts on CNI provide a thorough understanding of how container networking works within Azure.

Example lab: Implementing CNI overlay

1. Deploy a cluster with CNI overlay:

# Variables
resourceGroup="MyResourceGroup"
location="centralus"
aksName="MyOverlayAKSCluster"
podCidr="192.168.0.0/16"
nodeCount=3

# Create resource group (if it doesn't already exist)
az group create --name "$resourceGroup" --location "$location"

# Create AKS cluster with CNI Overlay
az aks create \
  --resource-group "$resourceGroup" \
  --name "$aksName" \
  --location "$location" \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr "$podCidr" \
  --enable-addons monitoring \
  --node-count "$nodeCount"

Note on VNET and pod CIDR with CNI overlay

When deploying an AKS cluster using Azure CNI Overlay, it's important to understand how networking is handled:

Overlay pod CIDR: The pod CIDR you specify (e.g., 192.168.0.0/16) is used exclusively for assigning IP addresses to pods. This overlay CIDR is completely independent of the address space used by the underlying virtual network (VNET).
Default VNET provisioning: If you do not specify a subnet (as in the command above), AKS automatically provisions a VNET in the system-managed node resource group. This VNET hosts the cluster's nodes, and its IP range is independent of the overlay pod CIDR. To place the nodes in an existing network instead, pass your own subnet with --vnet-subnet-id.
Decoupled pod networking: Because pod IP addresses are allocated from the overlay CIDR rather than the VNET, even if the node VNET uses a different range (e.g., 10.0.0.0/16), there is no conflict with pod IPs from the overlay CIDR (e.g., 192.168.0.0/16). This decoupling simplifies IP management and allows for greater scalability, especially in environments where VNET IP space is limited.
When to Use Azure CNI (Standard): If pods must receive IP addresses directly from the VNET, for example so that other resources in the network can reach them without address translation, use Azure CNI (Standard) mode and size the subnet for both nodes and pods.

In summary, Azure CNI Overlay decouples pod addressing from the VNET: nodes take their IPs from the VNET (provisioned by AKS by default, or supplied by you), while pods draw from a separate overlay pod CIDR, which keeps pod networking efficient and scalable.
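The size of the pod CIDR also caps how large the cluster can grow. According to the Azure CNI Overlay documentation, each node is assigned a /24 block carved out of the pod CIDR, so the number of /24 blocks is an upper bound on the node count. A quick sketch of that arithmetic (the function name overlay_max_nodes is hypothetical):

```shell
# Number of /24 node blocks that fit into a pod CIDR such as 192.168.0.0/16.
overlay_max_nodes() {
  prefix=${1#*/}
  if [ "$prefix" -gt 24 ]; then
    echo 0      # smaller than a single /24 block
    return
  fi
  echo $(( 1 << (24 - prefix) ))
}

overlay_max_nodes 192.168.0.0/16   # prints 256
```

For the 192.168.0.0/16 range used in this lab that means room for up to 256 nodes, so pick a larger pod CIDR if you expect the cluster to scale beyond that.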

2. Test connectivity:

Deploy a sample application and verify pod-to-pod connectivity using overlay networking tools and commands.

Step 1: Deploy two test pods:

Create two pods (named test-pod-1 and test-pod-2) using the BusyBox image, which provides basic networking utilities:

kubectl run test-pod-1 --image=busybox --restart=Never -- /bin/sh -c "sleep 3600"
kubectl run test-pod-2 --image=busybox --restart=Never -- /bin/sh -c "sleep 3600"

Step 2: Verify pods are running

Check that both pods are in the running state:

kubectl get pods

You should see output similar to:

NAME         READY   STATUS    RESTARTS   AGE
test-pod-1   1/1     Running   0          1m
test-pod-2   1/1     Running   0          1m

Step 3: Retrieve the IP address of one pod

Get the IP address of test-pod-2:

POD2_IP=$(kubectl get pod test-pod-2 -o jsonpath='{.status.podIP}')
echo "test-pod-2 IP: $POD2_IP"

Step 4: Test connectivity from the other pod

Exec into test-pod-1 and ping test-pod-2 using the retrieved IP address:

kubectl exec test-pod-1 -- ping -c 4 $POD2_IP

You should see output confirming that test-pod-1 can reach test-pod-2, such as:

PING 192.168.2.117 (192.168.2.117): 56 data bytes
64 bytes from 192.168.2.117: seq=0 ttl=64 time=0.123 ms
64 bytes from 192.168.2.117: seq=1 ttl=64 time=0.098 ms
...
--- 192.168.2.117 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss

Optional: Clean up

After testing, remove the test pods:

kubectl delete pod test-pod-1 test-pod-2

9. Managing resource groups and FAQs

Understanding how AKS organizes its resources is critical for efficient management. When you deploy an AKS cluster, two resource groups are created by design: one for the cluster's user-managed resources and a secondary, system-managed resource group that contains supporting components. Here’s what you need to know:

Primary vs. secondary resource group: The primary resource group hosts the cluster’s core components, while the secondary resource group holds system-managed resources like load balancers, managed identities, and network components. It’s important to avoid manual modifications in the secondary group since it is maintained by AKS.
Lifecycle management best practices: To safeguard your resources:

Apply resource locks or policies to prevent accidental deletion or modification.
Use consistent naming conventions and tagging across both resource groups. This aids in tracking, cost management, and operational monitoring.

Role-based access control (RBAC): Implement RBAC not only within your AKS cluster but also across both resource groups. Proper RBAC configuration ensures that access is granted based on roles and responsibilities, enhancing overall security and operational efficiency.
Monitoring and auditing: Regular monitoring using Azure Monitor or other auditing tools is essential. Keeping a close watch on both resource groups can help detect unauthorized changes or unexpected costs early on, ensuring the operational health and security of your deployment.

By integrating these practices into your management strategy, you can efficiently control the lifecycle, security, and performance of your AKS resources, leading to a more stable and secure production environment.
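To apply locks, tags, or RBAC to the system-managed group you first need its name. Unless a custom name was set at creation time, AKS names it MC_<resource group>_<cluster name>_<location>. A small sketch (the function name default_node_rg is hypothetical; the cluster values are taken from the overlay lab above):

```shell
# Build the default name of the system-managed (node) resource group.
default_node_rg() {
  echo "MC_${1}_${2}_${3}"
}

default_node_rg MyResourceGroup MyOverlayAKSCluster centralus
# prints MC_MyResourceGroup_MyOverlayAKSCluster_centralus

# To read the actual name from Azure instead of deriving it:
#   az aks show --resource-group MyResourceGroup --name MyOverlayAKSCluster \
#     --query nodeResourceGroup -o tsv
```

Querying nodeResourceGroup is the safer choice in scripts, because it also returns the right answer when the cluster was created with a custom node resource group name.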

For further details, refer to the AKS FAQ on resource groups.

Conclusion

AKS networking is multifaceted, covering everything from basic connectivity and IP planning to advanced security and routing scenarios. By understanding:

Network topologies and models (Topology & Connectivity, Compare Network Models),
Private clusters and DNS configurations (Private Clusters, Private DNS Overview),
Ingress and routing strategies (Ingress Controller Overview, HTTP Application Routing, which retires on 03 March 2025 in favor of the Application Routing add-on, App Routing Migration),
Integration with virtual networks and security controls (Service Endpoints, Private Link),
Advanced topics like CNI overlay and operator best practices (CNI Overlay, Operator Best Practices),

you can design and operate AKS clusters that are both high-performing and secure. Real-world scenarios, like segregating public-facing and internal services or ensuring regulatory compliance via private networking, illustrate how these concepts are applied in production environments.

Next Steps: Hands-On Labs

Lab 1: Deploy an AKS Cluster with Azure CNI Standard and validate IP addressing
Follow the steps in Section 2 to create your cluster and verify pod IPs.
Lab 2: Implement a Private Cluster and Configure Private DNS
Use Section 3’s instructions to deploy a private cluster and set up a private DNS zone.
Lab 3: Deploy an AKS Cluster with Azure CNI Overlay
Follow the steps in Section 8 to create your cluster and test pod connectivity.

These labs will give you hands-on experience with the core aspects of AKS networking, solidifying your understanding of both the concepts and their practical applications.

Embracing AKS built-in upgrade features and exploring custom solutions

rmmartins — Fri, 28 Mar 2025 12:54:30 GMT

Upgrading your AKS clusters in production is made simple with Microsoft’s robust, automated upgrade and update features. The official AKS upgrade process handles surge nodes, respects Pod Disruption Budgets (PDBs), manages node updates, and performs compatibility checks, all aimed at a smooth, low-downtime experience with minimal manual intervention.

Official AKS upgrade and update features

Microsoft has built a rich set of features into AKS to simplify the upgrade process:

Automated cluster upgrades: AKS provides an automated upgrade process via the az aks upgrade command. This process manages surge nodes for availability, applies necessary health checks, and ensures minimal disruption during the upgrade.

Scheduled and auto-upgrades: With scheduled upgrades, you can define maintenance windows for cluster updates. The auto-upgrade feature (when enabled) automatically updates clusters, ensuring they remain under support without manual intervention.

Node image upgrades: The AKS Node Image Upgrade process automatically updates the underlying node images, reducing the risk of security vulnerabilities and compatibility issues.

Fleet orchestration for multi-cluster management: For organizations managing multiple clusters, Kubernetes Fleet Update Orchestration provides a centralized way to coordinate upgrades and updates across your entire fleet.

These features are robust and continuously evolving, helping you maintain production clusters in line with industry best practices.

Why consider a custom upgrade approach?

For most users, leveraging the built-in AKS upgrade capabilities is the best way to maintain and update clusters. However, some users want complete control over every step of the process. If you have unique requirements (for instance, you prefer to trigger upgrades manually on demand rather than on a schedule, or you need to integrate custom health checks and rollback logic), the custom CLI script presented in this post may be of interest.

Disclaimer: This experimental proof-of-concept custom CLI solution is provided as-is and is not an official Microsoft solution. It hasn’t been tested on every supported configuration and is not production ready. Use it at your own risk and discretion.

By exploring this custom approach, you may gain additional control over the upgrade process. Nevertheless, we strongly encourage most users to leverage the robust, built-in features provided by AKS.

The custom CLI script for AKS upgrades

For users interested in a more granular approach, this custom CLI script automates many aspects of the upgrade process. The script:

Displays available information: It lists your resource groups and AKS clusters (with resource group, cluster name, and location) so you can easily obtain the required parameters.

Dynamic credential download: The script automatically downloads your cluster credentials based on the Resource Group and Cluster Name you provide.

Retrieves the current version and allowed upgrade paths: It displays your current Kubernetes version and uses an interactive menu to show available upgrade targets, clearly marking allowed options.

Performs pre-upgrade health checks: The script checks node readiness, PDBs, failed pods, and even includes a placeholder for surge capacity.

Ensures compatibility checks: It reminds you to verify that your workloads are compatible with the new version before proceeding.

Initiates the upgrade process: Once you confirm, the script triggers the upgrade using the az aks upgrade command.

Validates post-upgrade health: After upgrading, the script verifies application health and provides a simulated rollback option if issues are detected.
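The "allowed upgrade paths" step rests on a simple rule: AKS does not let you skip minor versions, so a 1.30.x cluster can move directly only to another 1.30.x patch or to 1.31.x. As an illustrative sketch of that rule, separate from the script below (the function name is_direct_upgrade is hypothetical):

```shell
# Succeed when the target version ($2) is in the same minor release as the
# current version ($1) or exactly one minor release ahead.
# Patch-level ordering is not checked; az aks get-upgrades remains authoritative.
is_direct_upgrade() {
  cur_major=${1%%.*}; rest=${1#*.}; cur_minor=${rest%%.*}
  tgt_major=${2%%.*}; rest=${2#*.}; tgt_minor=${rest%%.*}
  [ "$cur_major" -eq "$tgt_major" ] || return 1
  step=$(( tgt_minor - cur_minor ))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

if is_direct_upgrade 1.30.10 1.31.6; then
  echo "1.30.10 -> 1.31.6 is a direct upgrade"
fi
if ! is_direct_upgrade 1.30.10 1.32.0; then
  echo "1.30.10 -> 1.32.0 skips a minor version"
fi
```

A local check like this is only a convenience for early feedback; the list returned by az aks get-upgrades reflects what the service will actually accept.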

AKS upgrade script:

#!/bin/bash ############################################################################### # Enhanced AKS Upgrade Script with Health Validation & Rollback # # Prerequisites are handled within this script: # 1. List your subscriptions to get the subscription ID: # az account list --output table # 2. Set your subscription: # az account set --subscription 00000000-0000-0000-0000-000000000000 # 3. List your resource groups: # az group list --output table # 4. List your AKS clusters (ResourceGroup, Cluster Name, Location): # az aks list --output table # 5. The script downloads your cluster credentials dynamically. # # This script retrieves the current Kubernetes version for an AKS cluster, # shows allowed upgrade paths and highlights allowed options, performs # pre-upgrade health and compatibility checks, initiates the upgrade process, # validates application health post-upgrade, and offers a simulated rollback if issues # are detected. # # NOTE: AKS does not officially support downgrades. The rollback here simulates # a recovery by re-upgrading to the previous version. ############################################################################### #--------------------------- # Display Available Resource Groups and AKS Clusters #--------------------------- echo "------ Available Resource Groups ------" az group list --output table echo "" echo "------ Available AKS Clusters (ResourceGroup, Cluster Name, Location) ------" az aks list --query "[].{ResourceGroup: resourceGroup, ClusterName: name, Location: location}" -o table echo "" echo "Please note the Resource Group, Cluster Name, and Location for your AKS cluster." 
echo "----------------------------------------------------------------------" echo "" #--------------------------- # Helper Functions #--------------------------- # Pre-upgrade health checks (nodes, PDBs, failed pods, surge capacity) perform_pre_upgrade_checks() { echo "" echo "--------------------------------------------" echo "Performing Pre-Upgrade Health Checks" echo "--------------------------------------------" echo "1. Checking node status..." kubectl get nodes echo "" echo "2. Checking for any NotReady nodes..." NOTREADY=$(kubectl get nodes | grep NotReady) if [ ! -z "$NOTREADY" ]; then echo "WARNING: Some nodes are not ready. Please investigate before upgrading." else echo "All nodes are Ready." fi echo "" echo "3. Checking Pod Disruption Budgets (PDBs)..." kubectl get pdb --all-namespaces echo "" echo "4. Checking for pods in a failed state (e.g., CrashLoopBackOff, Error)..." kubectl get pods --all-namespaces | grep -E 'CrashLoopBackOff|Error' || echo "No pods in error state found." echo "" echo "5. Checking for surge nodes / additional capacity (placeholder)..." echo " (Ensure your node pool autoscaler or surge capacity is configured properly.)" echo "" echo "Pre-upgrade health checks completed. Please review the output above." read -p "Do you want to continue with the upgrade? (y/N): " CHECK_CONFIRM if [[ ! "$CHECK_CONFIRM" =~ ^[Yy]$ ]]; then echo "Upgrade cancelled based on pre-upgrade health checks." exit 1 fi } # Compatibility check reminder perform_compatibility_checks() { echo "" echo "--------------------------------------------" echo "Performing Compatibility Checks" echo "--------------------------------------------" echo "NOTE: Ensure that all critical workloads, custom resources, and third-party" echo " integrations are compatible with the new Kubernetes version." echo " Review release notes and documentation for any breaking changes." echo "" read -p "Have you verified workload compatibility? (y/N): " COMP_CONFIRM if [[ ! 
"$COMP_CONFIRM" =~ ^[Yy]$ ]]; then echo "Upgrade cancelled. Please verify compatibility and try again." exit 1 fi } # Post-upgrade health checks (applications, deployments, pods) perform_post_upgrade_checks() { echo "" echo "--------------------------------------------" echo "Performing Post-Upgrade Health Checks" echo "--------------------------------------------" echo "Checking deployments status..." kubectl get deployments --all-namespaces echo "" echo "Checking pods status..." kubectl get pods --all-namespaces echo "" echo "Please review the output for any errors or issues with your applications." read -p "Do all applications appear healthy? (y/N): " POST_CHECK_CONFIRM if [[ ! "$POST_CHECK_CONFIRM" =~ ^[Yy]$ ]]; then return 1 fi return 0 } # Attempt rollback to previous version (simulation) attempt_rollback() { echo "" echo "--------------------------------------------" echo "Attempting Rollback" echo "--------------------------------------------" read -p "Rollback to the previous version ($CURRENT_VERSION) ? (y/N): " ROLLBACK_CONFIRM if [[ "$ROLLBACK_CONFIRM" =~ ^[Yy]$ ]]; then echo "Initiating rollback to version $CURRENT_VERSION..." az aks upgrade --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --kubernetes-version "$CURRENT_VERSION" --yes if [ $? -eq 0 ]; then echo "Rollback executed successfully." else echo "Rollback failed. Please check the error messages above and consider manual recovery." exit 1 fi else echo "Rollback aborted. Please perform manual recovery if necessary." exit 1 fi } #--------------------------- # Main Script #--------------------------- # Prompt for input parameters read -p "Enter the Resource Group: " RESOURCE_GROUP read -p "Enter the AKS Cluster Name: " CLUSTER_NAME read -p "Enter the AKS Region (e.g., eastus): " LOCATION # Download cluster credentials dynamically echo "" echo "Downloading cluster credentials for '$CLUSTER_NAME' in resource group '$RESOURCE_GROUP'..." 
az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --overwrite-existing

###############################################################################
# Step 1: Retrieve and Display the Current Kubernetes Version
###############################################################################
echo ""
echo "Fetching the current Kubernetes version for cluster '$CLUSTER_NAME' in '$RESOURCE_GROUP'..."
CURRENT_VERSION=$(az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --query "kubernetesVersion" -o tsv)
if [ -z "$CURRENT_VERSION" ]; then
    echo "ERROR: Failed to retrieve the current Kubernetes version. Please check your cluster details."
    exit 1
fi
echo "Current Kubernetes version: $CURRENT_VERSION"
echo ""

###############################################################################
# Step 2: Retrieve Allowed Upgrade Paths for the Cluster
###############################################################################
echo "Retrieving allowed upgrade paths for your cluster..."
UPGRADES_JSON=$(az aks get-upgrades --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" -o json)
ALLOWED_UPGRADES=$(echo "$UPGRADES_JSON" | jq -r 'if .controlPlaneProfile.upgradeProfile.upgrades then (.controlPlaneProfile.upgradeProfile.upgrades | map(.kubernetesVersion) | join(" ")) else "" end')

# Fallback: if no allowed upgrades are determined and CURRENT_VERSION starts with "1.30"
if [ -z "$ALLOWED_UPGRADES" ]; then
    if [[ "$CURRENT_VERSION" =~ ^1\.30 ]]; then
        echo "WARNING: No allowed upgrade paths could be determined automatically."
        echo "Typically, if your cluster is running a version like 1.30.x (e.g., 1.30.10),"
        echo "you can only upgrade directly to a 1.31.x version."
        echo "For example, allowed upgrade targets might include: 1.31.6, 1.31.5, 1.31.4, 1.31.3, 1.31.2, or 1.31.1."
        ALLOWED_UPGRADES="1.31.6 1.31.5 1.31.4 1.31.3 1.31.2 1.31.1"
        ALLOWED_MAJOR_MINOR="1.31"
    else
        echo "WARNING: No allowed upgrade paths could be determined automatically. Proceed with caution."
        ALLOWED_MAJOR_MINOR=""
    fi
else
    # Extract unique major.minor values from allowed upgrades
    ALLOWED_MAJOR_MINOR=$(for ver in $ALLOWED_UPGRADES; do echo "$ver" | awk -F. '{print $1"."$2}'; done | sort -u | tr '\n' ' ')
fi

if [ -n "$ALLOWED_MAJOR_MINOR" ]; then
    echo ""
    echo "Based on your current version ($CURRENT_VERSION), you can upgrade directly to versions with major.minor:"
    for mm in $ALLOWED_MAJOR_MINOR; do
        echo "  - $mm"
    done
    echo "Only versions matching these allowed major.minor values will be marked as [ALLOWED] below."
    echo "For more details, please see https://aka.ms/aks-supported-k8s-ver"
    echo ""
fi

###############################################################################
# Step 3: Fetch Available Kubernetes Versions
###############################################################################
echo "Fetching available Kubernetes versions in '$LOCATION'..."
VERSIONS_JSON=$(az aks get-versions --location "$LOCATION" -o json)
if [ $? -ne 0 ]; then
    echo "ERROR: Failed to fetch available versions. Please check your Azure CLI configuration."
    exit 1
fi

# Extract list of versions and preview flag from the "values" array
mapfile -t OPTIONS < <(echo "$VERSIONS_JSON" | jq -r '(.values // [])[] | "\(.version) \(.isPreview)"')
if [ ${#OPTIONS[@]} -eq 0 ]; then
    echo "ERROR: No versions found for location '$LOCATION'."
    echo "This might be due to the aks-preview extension altering the output."
    echo "If you don't need preview features, try removing the extension with: az extension remove --name aks-preview"
    exit 1
fi

###############################################################################
# Step 4: Build the Interactive Menu with Highlighted Allowed Options
###############################################################################
declare -a VERSION_LIST
declare -a LABELS
for entry in "${OPTIONS[@]}"; do
    # Extract version and preview flag
    VERSION=$(echo "$entry" | awk '{print $1}')
    IS_PREVIEW=$(echo "$entry" | awk '{print $2}')
    LABEL="$VERSION"
    if [ "$IS_PREVIEW" == "true" ]; then
        LABEL="$LABEL (Preview)"
    else
        LABEL="$LABEL (Stable)"
    fi
    # Highlight if allowed (by comparing major.minor)
    if [ -n "$ALLOWED_MAJOR_MINOR" ]; then
        AVAILABLE_MM=$(echo "$VERSION" | awk -F. '{print $1"."$2}')
        for allowed in $ALLOWED_MAJOR_MINOR; do
            if [ "$AVAILABLE_MM" == "$allowed" ]; then
                LABEL="$LABEL [ALLOWED]"
                break
            fi
        done
    fi
    VERSION_LIST+=("$VERSION")
    LABELS+=("$LABEL")
done

echo ""
echo "Select the Kubernetes version to upgrade to:"
PS3="Enter your choice (or type 'q' to quit): "
select opt in "${LABELS[@]}"; do
    if [[ "$REPLY" == "q" ]]; then
        echo "Exiting..."
        exit 0
    fi
    if [ -z "$opt" ]; then
        echo "Invalid selection. Please try again."
    else
        TARGET_VERSION=${VERSION_LIST[$((REPLY-1))]}
        echo "You selected: $opt"
        break
    fi
done

###############################################################################
# Step 5: Validate the Selected Target Version
###############################################################################
if [ -n "$ALLOWED_MAJOR_MINOR" ]; then
    AVAILABLE_MM=$(echo "$TARGET_VERSION" | awk -F. '{print $1"."$2}')
    ALLOWED_MATCH=0
    for allowed in $ALLOWED_MAJOR_MINOR; do
        if [ "$AVAILABLE_MM" == "$allowed" ]; then
            ALLOWED_MATCH=1
            break
        fi
    done
    if [ $ALLOWED_MATCH -ne 1 ]; then
        echo ""
        echo "WARNING: Upgrading from $CURRENT_VERSION to $TARGET_VERSION is not allowed based on your cluster's upgrade policy."
        echo "Allowed upgrades from your current version are only for versions with major.minor:"
        for mm in $ALLOWED_MAJOR_MINOR; do
            echo "  - $mm"
        done
        echo "Please select one of these allowed versions."
        exit 1
    fi
fi

###############################################################################
# Step 6: Pre-Upgrade Health & Compatibility Checks
###############################################################################
perform_pre_upgrade_checks
perform_compatibility_checks

###############################################################################
# Step 7: Confirm and Execute the Upgrade
###############################################################################
echo ""
read -p "Proceed with upgrading '$CLUSTER_NAME' from version $CURRENT_VERSION to $TARGET_VERSION? (y/N): " CONFIRM
if [[ ! "$CONFIRM" =~ ^[Yy]$ ]]; then
    echo "Upgrade cancelled."
    exit 0
fi

echo ""
echo "Initiating upgrade to version $TARGET_VERSION..."
az aks upgrade --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --kubernetes-version "$TARGET_VERSION" --yes
if [ $? -eq 0 ]; then
    echo "Upgrade command executed successfully."
else
    echo "ERROR: Upgrade command failed. Please check the error messages above."
    exit 1
fi

###############################################################################
# Step 8: Post-Upgrade Health Checks & Rollback Option
# Rollback Mechanism Note: The rollback feature in this script is designed to
# simulate a recovery process by re-upgrading the cluster back to its previous
# version.
# Please note that AKS does not officially support downgrades, so this rollback
# is not a true downgrade in the traditional sense. It is a best-effort approach
# that relies on having a known, working previous version and should only be
# used as a last resort.
# Ensure that you have proper backups and recovery strategies in place before
# relying on this functionality.
###############################################################################
if ! perform_post_upgrade_checks; then
    echo ""
    echo "One or more post-upgrade health checks have failed."
    attempt_rollback
else
    echo ""
    echo "Post-upgrade health checks passed. Your applications appear healthy."
fi

echo ""
echo "Upgrade complete. Please continue monitoring your cluster and applications for any issues."
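
Steps 4 and 5 of the script repeat the same major.minor comparison. As a minimal sketch, that check could be factored into two small functions. The names `major_minor` and `is_allowed_version` are hypothetical and do not appear in the original script; the logic mirrors the `awk` extraction and loop the script already uses.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the original script) that factor out
# the major.minor comparison used in Steps 4 and 5.

# Print the major.minor prefix of a version string, e.g. 1.31.6 -> 1.31.
major_minor() {
    local version="$1"
    echo "$version" | awk -F. '{print $1"."$2}'
}

# Return 0 if the version's major.minor appears in the space-separated
# list passed as the second argument, and 1 otherwise.
is_allowed_version() {
    local version="$1"
    local allowed_list="$2"
    local mm allowed
    mm=$(major_minor "$version")
    for allowed in $allowed_list; do
        if [ "$mm" == "$allowed" ]; then
            return 0
        fi
    done
    return 1
}
```

With these in place, the Step 5 validation could shrink to a single call such as `is_allowed_version "$TARGET_VERSION" "$ALLOWED_MAJOR_MINOR"`, and the Step 4 labeling loop could reuse the same function.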

You can download the full script here: Custom CLI Script for AKS Upgrades

Additional considerations

Rollback mechanism note: The rollback feature in this script is designed to simulate a recovery process by re-upgrading the cluster to its previous version. Please note that AKS does not officially support downgrades, so this rollback is not a true downgrade in the traditional sense. It is a best-effort approach that relies on having a known, working previous version and should only be used as a last resort. Ensure you have proper backups and a comprehensive recovery strategy in place before relying on this functionality.
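
One way to make that limitation visible is to detect when a requested version is lower than the one the cluster is running and warn before calling `az aks upgrade`. The sketch below is an illustration only: `is_downgrade` is a hypothetical helper that is not in the original script, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script).
# Return 0 if the target version is strictly lower than the current one.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
is_downgrade() {
    local current="$1"
    local target="$2"
    local lowest
    # Identical versions are not a downgrade.
    if [ "$current" == "$target" ]; then
        return 1
    fi
    # Version sort compares each dotted component numerically,
    # so 1.30.9 sorts before 1.30.10.
    lowest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | head -n 1)
    [ "$lowest" == "$target" ]
}
```

Inside `attempt_rollback`, a guard such as `if is_downgrade "$RUNNING_VERSION" "$CURRENT_VERSION"; then echo "WARNING: this is a downgrade and AKS may reject it."; fi` could surface the issue before the command runs. Here `RUNNING_VERSION` is an assumed variable holding the version the cluster reports after the upgrade.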

Final thoughts

Microsoft’s official AKS upgrade features are powerful and designed to simplify the process—from automated cluster and node image upgrades to orchestrated updates across multiple clusters using Fleet. For most users, these built-in capabilities offer the most reliable and supported approach.

That said, if you have unique requirements, a custom solution can give you granular control over every step of the process. This custom CLI script is provided as a proof of concept for those who want to tailor the upgrade process to their specific needs. Use it at your own risk: it is experimental, not production-ready, and not fully supported by Microsoft.