
From Error Log to Closed Ticket, Without Leaving Your Terminal

lovanartem — Mon, 15 Jun 2026 19:55:53 GMT

You describe the problem. The assistant pulls the context from what you already have, drafts the ticket, files it, tracks the replies, and closes it out, and it asks before it does anything irreversible. No portal tabs. No re-typing the resource ID Azure already knows.
See a quick demo or jump straight to the code: https://github.com/Azure-Samples/azure_support_ticket_mcp

The common challenge

Every team running on Azure eventually opens a support ticket. Here is the catch: the investigation happens in your terminal or editor, but the ticket frequently happens in a browser. Bridging those two worlds is pure overhead, and you pay it at the worst possible moment, mid-incident.

Opening one ticket means stepping through:

Confirm the tenant, pick the subscription.
Find the right support service among hundreds.
Drill down a per-service problem-classification tree.
Set severity, enter contact details, write up the issue.
Re-type the resource ID and error you were just looking at.

And filing is only the start. A support ticket is a conversation: replies to read, follow-up questions to answer, logs to attach, a status to flip when it resolves. Each one is another trip to the portal, another context switch out of the place you actually work.

The value proposition is simple: collapse that entire lifecycle into the place you already work, in plain language, in seconds. Stay in flow. Let the assistant do the mechanical parts and keep the decisions for yourself.

How the lifecycle works

Every ticket follows the same path, and the same safety gate sits in the middle of it. Opening a ticket runs left to right; once it exists, the same conversational interface carries it through the rest of its life.

The second lane is the part most ticket tools skip. Once a ticket is open, support is a two-way conversation, and the server handles that side too: read the full thread of customer and Microsoft replies, get a local summary of where things stand, reply to the support engineer, and attach logs, traces, or screenshots, all without returning to the portal.

That preview-then-confirm gate is the whole trust model. The first call returns exactly what is about to happen; the second call carries it out, and only if nothing changed in between. Reading, summarizing, and triaging are instant, with nothing to approve; creating, replying, attaching, and closing always pause for your yes.
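One way to picture that gate is a two-call flow: the preview returns a token bound to the exact payload, and the confirm call proceeds only if the payload still matches that token. The sketch below is illustrative only. The names `preview`, `confirm`, and `confirm_token`, and the payload fields, are hypothetical and are not taken from the server's actual tools or schema.

```python
import hashlib
import json


def _digest(payload: dict) -> str:
    """Stable hash of a payload; key order does not change the result."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def preview(payload: dict) -> dict:
    """First call: return exactly what would be sent, plus a token bound to it."""
    return {"payload": dict(payload), "confirm_token": _digest(payload)}


def confirm(payload: dict, confirm_token: str, send) -> bool:
    """Second call: run `send` only if the payload is unchanged since the preview."""
    if _digest(payload) != confirm_token:
        return False  # something changed in between; a fresh preview is needed
    send(payload)
    return True


# Hypothetical draft; the field names are placeholders for illustration.
draft = {"title": "AKS cluster cannot scale out", "severity": "moderate"}
shown = preview(draft)

sent = []
first = confirm(draft, shown["confirm_token"], sent.append)    # approved as previewed

edited = dict(draft, severity="critical")                       # changed after preview
second = confirm(edited, shown["confirm_token"], sent.append)   # rejected
```

Binding the approval to a digest of the payload, rather than to a bare "yes", is what makes "only if nothing changed in between" checkable.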

The Solution

The Azure Support Ticket MCP is a Model Context Protocol (MCP) server that exposes the Azure support ticket lifecycle as a set of conversational tools. Three things make it more than a thin API wrapper:

It infers context. Give it a resource ID or a portal URL and it reads the subscription, resource group, and service from it, then ranks the right problem classification from your description instead of making you walk the tree.
It is local-first. The Azure support catalog, the services and their problem-classification trees, is cached on disk, so the common path is instant and works even on a flaky connection.
It is safe by design. Every action that changes something is preview-then-confirm: nothing reaches Azure until you approve the exact payload.
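To make the context-inference point concrete, here is a minimal sketch of pulling the subscription, resource group, and provider out of an Azure resource ID. It relies only on the standard ID layout (/subscriptions/.../resourceGroups/.../providers/...). The function name and the returned keys are assumptions for illustration, not the server's implementation, and nested child resources are not handled.

```python
def parse_resource_id(resource_id: str) -> dict:
    """Split an Azure resource ID into its main parts (simple, non-nested IDs only)."""
    parts = [p for p in resource_id.strip().split("/") if p]
    lowered = [p.lower() for p in parts]  # segment keys are matched case-insensitively

    def value_after(key: str):
        """Return the segment that follows `key`, or None if it is absent."""
        if key in lowered:
            i = lowered.index(key)
            if i + 1 < len(parts):
                return parts[i + 1]
        return None

    result = {
        "subscription_id": value_after("subscriptions"),
        "resource_group": value_after("resourcegroups"),
        "provider": value_after("providers"),
        "resource_type": None,
        "resource_name": None,
    }
    if "providers" in lowered:
        tail = parts[lowered.index("providers") + 2:]  # segments after the namespace
        if len(tail) >= 2:
            result["resource_type"], result["resource_name"] = tail[0], tail[1]
    return result


example = parse_resource_id(
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-prod/providers/Microsoft.ContainerService"
    "/managedClusters/prod-aks"
)
```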

Getting started

Install the binary (a single command; the repo README has the current one-liner).
Register it with your MCP-capable assistant.
Start describing problems in plain language.

The fastest way in is to pipe a failure straight from your terminal into a ticket:

# a failed infra provision
copilot -i "ticket this: $(azd up 2>&1)"

# a misbehaving pod
copilot -i "ticket this: $(kubectl describe pod my-pod)"

# a red CI run
copilot -i "ticket this: $(gh run view <run-id> --log-failed)"

From that raw output, the server extracts the resource IDs, error codes, correlation IDs, and HTTP status, takes a first guess at severity, and scrubs obvious secrets on a best-effort basis. You then review and approve the full draft before any ticket is created, so anything the scrub might miss is still in front of you to catch first. Prefer plain English? That works too:

copilot -i "open a ticket — my AKS cluster prod-aks can’t scale out"
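For a feel of what extracting signals from raw terminal output can involve, here is a rough regex-based sketch. The patterns, the function name `extract_signals`, and the sample log are all assumptions made for illustration. The real server's extraction and scrubbing rules are not documented here and are likely more thorough.

```python
import re

# Illustrative patterns only; real logs vary and these will miss cases.
RESOURCE_ID_RE = re.compile(r"/subscriptions/[0-9a-f-]{36}(?:/[^\s\"']+)*", re.IGNORECASE)
CORRELATION_RE = re.compile(r"correlation\s*id[\"'\s:=]+([0-9a-f-]{36})", re.IGNORECASE)
STATUS_RE = re.compile(r"\b(?:status(?:\s+code)?|HTTP)\s*[:=]?\s*([1-5]\d{2})\b", re.IGNORECASE)
SECRET_RE = re.compile(
    r"\b(password|secret|token|accountkey|sharedaccesskey)\s*[=:]\s*\S+", re.IGNORECASE
)


def extract_signals(text: str) -> dict:
    """Scrub obvious key=value secrets first, then pull identifiers from the text."""
    scrubbed = SECRET_RE.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return {
        "resource_ids": RESOURCE_ID_RE.findall(scrubbed),
        "correlation_ids": CORRELATION_RE.findall(scrubbed),
        "http_statuses": STATUS_RE.findall(scrubbed),
        "scrubbed_text": scrubbed,
    }


# Made-up sample output, not a real Azure log.
sample = (
    "ERROR: deployment failed. Status: 409 (Conflict)\n"
    "Correlation ID: 11111111-2222-3333-4444-555555555555\n"
    "Resource: /subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-prod/providers/Microsoft.ContainerService/managedClusters/prod-aks\n"
    "AccountKey=abc123==\n"
)
signals = extract_signals(sample)
```

Pattern-based scrubbing is best-effort by nature, which is why the review step before filing still matters.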

In short

The problem: filing and managing Azure support tickets pulls you out of your flow into a multi-step portal process, again and again, over the life of each ticket.
MCP: an open standard that lets AI assistants take real, permissioned actions through tools, not just answer questions.
The solution: an open-source MCP server that runs the whole ticket lifecycle from your terminal, context-aware, local-first, and gated by preview-then-confirm.

Try it, or read the code

It is open source under the MIT license. A short demo, installation, the full capability list, and the design notes are all in the repository — issues, ideas, and pull requests welcome.

https://github.com/Azure-Samples/azure_support_ticket_mcp

Azure Startup Credit Issue

OzanTuranli — Tue, 09 Jun 2026 12:28:44 GMT

Hi, I've been trying to verify my LinkedIn account to unlock $1,000 in Azure startup credits for over a week. Every time, the OAuth callback hits an error and I get: {"error":{"code":"BadRequest","message":"BadRequest (Conflict)"}}.

I've revoked all LinkedIn app permissions on both the LinkedIn and Microsoft sides. I suspect the issue is an orphaned LinkedIn identity record in your backend from a previous failed attempt. My support ticket has gone unanswered for over a week now, and a post I created in the Learn community has also gone unanswered for days.

Could you point me to the right person? Happy to provide any account details needed.

Urgent Guidance Needed - Azure Credits Exhausted During Active AI Infrastructure Scaling

PLIMSOLL — Mon, 18 May 2026 08:00:54 GMT

Hello Microsoft for Startups Community,

My name is Joseph Thomas, and I am currently building our startup through Microsoft for Startups Founders Hub. We have recently exhausted both our initial and extended Azure sponsorship allocations ($1,000 and the subsequent $5,000 sponsorship tier).

Our team is actively scaling AI infrastructure entirely on Azure, including Azure AI Foundry workloads, multi-model orchestration, inference pipelines, and cloud application infrastructure. As development and deployment activity accelerated, our Azure consumption grew significantly faster than originally projected.

We are now beginning to see active billing charges accrue, and I am working proactively to avoid disruption to our ongoing production and development environments. I have already submitted support requests through both the Founders Hub portal and Azure Billing Support.

I wanted to reach out to the community here to ask:

Has anyone successfully secured additional sponsorship credits or a sponsorship extension after exhausting their initial allocations?
What is the best escalation path for an AI startup heavily invested in Azure AI Foundry and Azure infrastructure?
Has anyone recently gone through the “Level Up” review or higher sponsorship-tier evaluation process, and if so, do you have any recommendations?

We are fully committed to building long term on Azure and would greatly appreciate any guidance, recommendations, or insight from other founders or Microsoft team members who have navigated this stage successfully.

Thank you for your time.

Joseph Thomas

Unable to access Founders Hub portal - M365 benefit redemption blocked

Macdmacd — Tue, 05 May 2026 21:03:41 GMT

I am an approved Founders Hub member with an active $1,000 Azure credit and I am unable to access the Founders Hub portal or redeem my Microsoft 365 Business Premium benefit.

I have already contacted Azure Billing Support regarding this issue, but I believe this falls under your team's remit rather than billing, so I am reaching out directly.

The issues I am experiencing:

1. portal.startups.microsoft.com/login only offers a "Log in with LinkedIn" option — my Founders Hub account is not linked to LinkedIn, so this does not work for me.

2. Clicking "Log in with your Microsoft account here" redirects me to portal.startups.microsoft.com/signup, which is the new user signup page — not a login page.

3. Navigating directly to portal.startups.microsoft.com also lands on the signup page with no way to manually enter my Microsoft account email.

All roads lead back to the same circular loop with no resolution.

My Founders Hub account is registered under a Gmail address, not my custom domain. I am fully signed into the Azure Portal (portal.azure.com) with this account and can confirm my Azure credit is active.

What I need help with:

- Accessing my existing Founders Hub account

- Redeeming the Microsoft 365 Business Premium benefit

- Setting up a Microsoft 365 tenant so I can use Teams under my custom domain

Please advise on how to proceed.

Startup credits expired silently — GPT-5.5 charges — is there any path to resolution?

Noviamind — Thu, 30 Apr 2026 13:00:17 GMT

I've been in the Microsoft for Startups program since August 2025. My €5,000 credits expired on April 12 with no notification. I checked my entire mailbox and there is nothing from Microsoft about it.

On April 27, I deployed GPT-5.5 on Foundry — two days after Microsoft released it — still believing I was covered. I found out on April 29 by checking my own consumption data and shut everything down immediately.

I launched my AI startup in production two weeks ago. The invoice for April hasn't arrived yet but I estimate it at €4,000–5,000. This could end my company.

I've been through two levels of billing support and both say they cannot adjust charges. I've now escalated to the support manager.

My question: has anyone been through this and found a path to resolution? And does anyone have a contact at Microsoft for Startups with actual decision-making authority?

I'm not looking to debate the policy. I'm looking for a human at Microsoft who understands what "startup" means.

Can't login to Startup Portal

lspstartup — Wed, 29 Apr 2026 00:14:13 GMT

When I try to log in using my email and password, I keep getting prompted to apply for the startup program. I'm using the same account I use for the Azure portal, and it shows my proper credits.

How can I find out what's wrong?

The flat-subscription problem

rmmartins — Wed, 22 Apr 2026 19:06:12 GMT

A real design review: management groups, policies, break-glass accounts, and the five things I'd tweak before going to production.

Here's what I see at most startups when they first show up on Azure: one subscription, one Global Admin, everything in the same resource group, and everyone's an Owner.

That works when you have three engineers and one environment. It stops working around the time you have a production workload, a dev environment, shared infrastructure, and an engineer who accidentally deleted the wrong resource group on a Friday afternoon.

The next step is usually "let's create more subscriptions." That's the right instinct. But without management groups and policies tying them together, you end up with four subscriptions, four sets of inconsistent RBAC assignments, no shared tagging strategy, and no audit trail showing who deployed what.

If you're at this stage and want a starting point, the Startup-Scale Landing Zone gives you an opinionated Bicep template with management groups, policies, and RBAC already wired together. This post goes deeper: what happens when a team takes those concepts and customizes them for their own environment.

The design

A startup VP of Engineering sent me their proposed management group hierarchy and asked me to review it before going to production. They'd done their homework: read the Cloud Adoption Framework docs, researched config options, and put together a three-level hierarchy with specific policies and RBAC at each level.

Here's the breakdown:

Tenant Root Group is the automatic top-level MG that Azure creates in every tenant. Be very selective about what you assign here. Anything at this level affects every subscription you'll ever create, including ones that don't exist yet. Some organizations do assign enterprise-wide "must have" policies at root, but for a startup still figuring out its governance posture, keeping root clean and pushing baselines to a company MG one level down gives you more flexibility.

Company MG sits directly below and carries the baseline that applies to everything: required tags on all resources (env, owner, cost-center, app), allowed regions locked to three US regions, Defender for Cloud enabled everywhere, and all diagnostic logs routed to a central Log Analytics workspace. Engineering gets Reader at this level, so everyone can see everything but can't change anything by default.

Three child MGs below that:

Nonprod MG is the relaxed zone. Tags are audited but not denied, so engineers can experiment without being blocked by policy. Public IPs are allowed. Engineering gets Contributor. This is where you iterate fast without filing PIM requests.

Prod MG is the strict zone. Tags are denied if missing. Public IPs are blocked. Encryption at rest is required. VM SKUs are restricted. Engineering gets Reader by default, and Contributor access is available through PIM (just-in-time, time-limited activation). You have to explicitly request write access, and it expires.

Platform MG protects the shared infrastructure that everything depends on. The Terraform state storage account, central Log Analytics workspace, and shared Key Vault all live here. Platform team gets Contributor; everyone else gets Reader. Critical resources are protected from deletion.

Under each MG, the subscriptions:

MG	Subscription	Purpose
Nonprod	dev	Development and testing
Nonprod	devtest (MSDN)	Engineer's personal scratch (MSDN-bound)
Prod	prod	Production workloads
Platform	cloud-infra	Terraform state, Log Analytics, Key Vault, workload identity

The parts that nail it

The hierarchy is flat and functional. CAF's guidance is to keep the hierarchy reasonably flat, ideally no more than three to four levels deep, and not to create management groups just for the sake of structure. This design does exactly that: a company MG for baselines, then Nonprod/Prod/Platform for the policy gradient. It's not "the one CAF pattern" (CAF deliberately avoids prescribing a single topology), but it's a clean startup pattern that scales to dozens of subscriptions without restructuring.

Audit in dev, deny in prod. Dev environments that deny everything become unusable. Engineers stop experimenting. Prod environments that only audit become insecure. The split is the right trade-off: visibility without friction in dev, enforcement without exceptions in prod.
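The audit-versus-deny split is easy to reason about with a toy model. The sketch below is not Azure Policy syntax and does not call Azure. It only mimics the behavior described in the design, using the four required tags from the company MG baseline. The function name and the `EFFECT_BY_MG` mapping are hypothetical.

```python
REQUIRED_TAGS = ("env", "owner", "cost-center", "app")

# Effect per management group, mirroring the design: audit in nonprod, deny in prod.
EFFECT_BY_MG = {"nonprod": "audit", "prod": "deny"}


def evaluate_tags(management_group: str, tags: dict) -> dict:
    """Report whether a deployment would go through and which tags are missing."""
    missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
    effect = EFFECT_BY_MG[management_group]
    # Audit only records the gap; deny blocks the deployment outright.
    allowed = not missing or effect == "audit"
    return {"allowed": allowed, "effect": effect, "missing": missing}


partial = {"env": "dev", "owner": "alice"}
in_nonprod = evaluate_tags("nonprod", partial)  # allowed, but flagged as non-compliant
in_prod = evaluate_tags("prod", partial)        # blocked until the tags are added
```

The same incomplete tag set produces visibility in one zone and enforcement in the other, which is the whole point of the gradient.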

The platform subscription for shared services. Centralizing Terraform state, the Log Analytics workspace, and shared Key Vault into a separate subscription (with its own RBAC) means application teams can't accidentally delete the infrastructure that manages their infrastructure. This is the "trust boundary" pattern, and most startups skip it until they learn the hard way.

What I'd change before going live

PIM licensing isn't one-seat-fits-all. They mentioned having "1 P2 seat" for PIM. PIM requires an Entra ID P2 (or Governance) license per user who's eligible for activation, plus anyone who approves or reviews PIM access. If four engineers need just-in-time Contributor access to production and one manager approves, that's five P2 licenses (~$9/user/month). Still cheap insurance compared to "everyone has standing Contributor," but budget for it correctly.
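The license math above is simple enough to script when you budget. This is a minimal sketch: the function name is made up, the user names are placeholders, and the $9 default is just the approximate per-user price quoted above, so check current pricing before relying on it.

```python
def pim_license_estimate(eligible_users, approvers, price_per_user=9.0):
    """Count distinct people who need a license and estimate the monthly cost."""
    # Someone who is both eligible and an approver still needs only one license.
    licensed = set(eligible_users) | set(approvers)
    return len(licensed), len(licensed) * price_per_user


# Four engineers eligible for just-in-time access, one approving manager.
seats, monthly_cost = pim_license_estimate(
    ["eng-1", "eng-2", "eng-3", "eng-4"], ["manager-1"]
)
```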

Think about SKU restrictions as a trade-off. Their prod MG had "restrict to approved SKUs." An allow-list gives you strict standardization (only pre-approved SKUs work), but every time Azure launches a new VM series, someone has to update it. A deny-list ("block these specific expensive or unnecessary SKUs") is easier to maintain since new SKUs are available by default. The right choice depends on your team: if you need tight control over what runs in prod, keep the allow-list. If you move fast and want less policy maintenance, a deny-list with periodic reviews is simpler.

Resource locks beat policy for protecting critical infra. Their Platform MG had "deny deletion of state storage / log workspace" as a policy. Azure Resource Locks (CanNotDelete) are simpler and more visible for this. A lock shows up right on the resource in the portal, so engineers see it immediately. A deny-delete policy is invisible until it blocks you, and the error message doesn't always make it obvious why. Locks are also easier to temporarily remove when you legitimately need to rotate or replace a resource.

Add cost alerts on every subscription from day one. Their design didn't mention budget alerts. Azure Cost Management lets you set budget thresholds per subscription with email and webhook notifications. Set them before any workloads deploy, not after the first surprise bill. Start with 80% and 100% of expected monthly spend. It takes 5 minutes and can save thousands.
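As a quick illustration of the 80% and 100% thresholds, the sketch below checks which ones a given spend has reached. It is plain arithmetic, not a call to Azure Cost Management, and the function name and figures are made up.

```python
def crossed_thresholds(expected_monthly_spend, actual_spend, percents=(80, 100)):
    """Return the budget thresholds (in percent) that actual spend has reached."""
    return [p for p in percents if actual_spend >= expected_monthly_spend * p / 100]


# With a $2,000 expected monthly spend, the alerts would fire at $1,600 and $2,000.
early_warning = crossed_thresholds(2000, 1700)
over_budget = crossed_thresholds(2000, 2100)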

Cap the MSDN subscription. Their devtest sub was MSDN-bound, described as "personal scratch." MSDN subscriptions come with a monthly credit ($50-$150 depending on the license tier), but the spending limit can be removed, which means charges hit a valid payment method with no cap. Keep the spending limit ON for scratch subs. If it's been removed, set a budget alert at the credit amount. Also note that some Marketplace and external services may bill separately regardless of the spending limit.

The break-glass question

This team was federating their primary domain with Google Workspace as the SAML identity provider (their whole company runs on Google). They asked: "Can I use my .onmicrosoft.com account as a break-glass account while my federated company.com is my daily driver?"

Yes. This is exactly the pattern Microsoft recommends.

Microsoft's security benchmark (PA-5) specifically calls for cloud-only break-glass accounts that bypass external IdP dependencies. If your Google SAML federation goes down (Google outage, misconfigured SAML cert, domain issues), all federated accounts fail to sign in. Cloud-only .onmicrosoft.com accounts authenticate directly against Entra ID with no external dependency.

How to harden them:

Create two break-glass accounts. Microsoft recommends at least two. Store credentials in separate physical locations. One person alone shouldn't be able to access both. Docs: Manage emergency access accounts.

Use phishing-resistant auth. Passkeys (FIDO2 security keys) are the strongest option: phishing-resistant and no dependency on a phone or authenticator app that might be unavailable during an emergency. If you already run PKI, certificate-based auth is another viable option. The key is diversity across your two accounts so a single authentication method failure doesn't lock out both. Docs: Enable FIDO2 security key sign-in.

Exclude at least one account from ALL Conditional Access policies. This is the account that guarantees access if a bad CA policy locks everyone out. Microsoft recommends excluding at least one break-glass account from every CA policy. The second account can optionally have phishing-resistant MFA enforced via CA, giving you a safer fallback for non-federation emergencies.

Assign Global Administrator permanently. Not through PIM. Break-glass accounts need immediate access. PIM activation requires the normal auth flow, which defeats the purpose in an emergency.

Monitor every sign-in. Set up alerts in Azure Monitor or Microsoft Sentinel for any authentication from a break-glass account. If these accounts show activity outside an emergency, investigate immediately.

Test quarterly. Actually sign in with the break-glass accounts on a schedule. Verify the credentials work, the FIDO2 keys work, and the monitoring alert fires. Don't wait for a real emergency to discover something is broken.

The pre-production governance checklist

Before deploying workloads into your new hierarchy, verify:

All subscriptions are nested under the correct MG (not dangling under Tenant Root Group)
Baseline policies applied at the company MG and verified with Get-AzPolicyAssignment
PIM configured with appropriate activation duration (4-8 hours max)
P2 licenses assigned to every user eligible for PIM activation, plus approvers and reviewers
Two break-glass accounts exist, tested, and monitored
At least one break-glass account excluded from all Conditional Access policies
Budget alerts set on every subscription (80% and 100% thresholds)
Resource locks on Terraform state, Log Analytics workspace, and Key Vault
MSDN spending limit verified ON (or budget alert set if removed)
Diagnostic settings routing all activity logs to the central Log Analytics workspace

Where this fits in the governance journey

If you're building Azure governance from zero, here's my recommended reading order:

Demystifying Microsoft Entra ID, Tenants and Azure Subscriptions - understand what tenants, subscriptions, and Entra ID actually are
Azure has three permission systems, and you're probably confusing them - the identity, resource, and billing planes
This post - design your management group hierarchy
Role Structures, Anti-Patterns, and the 10 Governance Principles - RBAC patterns and what not to do
Introducing the Startup-Scale Landing Zone - the full reference architecture

Your Azure VM went down and nobody knew why. Here's how to fix that.

rmmartins — Wed, 22 Apr 2026 15:49:46 GMT

If you've ever had a production VM go unhealthy on Azure and found yourself scrambling to figure out what happened, you're not alone. I work with startups running production workloads on Azure, and this is one of the most common patterns I see: something goes wrong, the team opens a support ticket, and then everyone waits for a root cause while the CTO asks "how do we make sure we know about this before our customers do next time?"

The good news: Azure already gives you the tools to answer both questions. Most teams just haven't set them up yet.

Scope note: This post covers platform health and maintenance signals for Azure VMs. We're not covering guest OS metrics, application telemetry, or Azure Monitor/VM Insights here. If you don't have a dedicated SRE team, these are the highest-leverage Azure-native checks to set up first.

Let's get into it.

Step 1: Figure out what actually happened (Resource Health)

Before you open a support ticket, check Resource Health. It's the fastest way to determine whether your VM went down because of something Azure did (platform event) or something on your side (user-initiated or config issue).

Go to your VM in the Azure portal > Resource Health blade. You'll see:

Current status: Available, Unavailable, Degraded, or Unknown
Health history: 30 days of state transitions with annotations explaining each one
Root cause: For platform-initiated outages on VMs, Azure automatically publishes root cause details within 72 hours, directly in this blade

The annotations often tell you what kind of event occurred: live migration, host reboot, planned maintenance, degraded hardware, etc. In many cases, you get this information without filing a support ticket.

If your VM was affected by a live migration, the annotation will show it was a platform-initiated event. Live migration is a memory-preserving operation that causes a brief pause, typically no more than 5 seconds (docs). But if your application is sensitive to even short freezes, or if you're seeing them frequently, that's worth investigating further.

Docs: Resource Health overview

Step 2: Get notified when it happens (Service Health + Resource Health Alerts)

Checking the portal after an incident is fine. Getting an alert when the incident happens is better.

Service Health Alerts

These notify you about service issues, planned maintenance, health advisories, and security advisories for the Azure services and regions you're actually using. Service Health is best for subscription-level and region-level awareness. If there's a regional maintenance wave driving elevated live migrations, this is how you'd know about it proactively.

Set them up to notify your ops channel via email, SMS, webhook (Slack, PagerDuty, Teams), or automation via Logic Apps or Azure Functions.

Docs: Create Service Health alerts | PagerDuty integration

Resource Health Alerts

These fire when a specific resource (or all resources in a resource group) changes health status. The alert includes health-change details such as status, cause type (platform vs. user-initiated), and descriptive event text, so you get more than a generic "VM is unhealthy" notification.

This is the "never be surprised again" alert. If you only set up one thing from this post, make it this.

Docs: Create Resource Health alerts

Step 3: See it coming (Scheduled Events API)

This is the part most teams don't know about, and it's the most powerful tool for handling live migrations gracefully.

Azure exposes an Instance Metadata Service (IMDS) endpoint on every VM that gives your application advance notice of upcoming maintenance events. Live migrations show up as EventType: "Freeze". In typical cases, you get up to ~15 minutes between the event appearing and Azure proceeding with the operation, though exact timing varies and some failures (like hardware issues) can bypass the advance notification entirely.

Note: Most Azure VM families support live migration, but G, L, N, and H series VMs do not. If you run GPU or HPC workloads on these SKUs, you won't see Freeze events. You'll still get Reboot or Redeploy events for other maintenance types.

The endpoint is available from inside the VM at:

http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01

Here's an example response when a live migration is scheduled:

{
  "DocumentIncarnation": 1,
  "Events": [
    {
      "EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
      "EventType": "Freeze",
      "ResourceType": "VirtualMachine",
      "Resources": ["my-production-vm"],
      "EventStatus": "Scheduled",
      "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT",
      "Description": "Virtual machine is being paused for a memory-preserving Live Migration operation.",
      "EventSource": "Platform",
      "DurationInSeconds": 5
    }
  ]
}

You can poll this endpoint and use the lead time to:

Drain connections so active users aren't affected
Checkpoint application state to recover faster
Remove the VM from your load balancer temporarily
Log the event so you have a record of migration frequency
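How much lead time is left can be worked out from the NotBefore field, which uses the HTTP-style date format shown in the example response. The sketch below is a small helper under that assumption. The function name is hypothetical, and treating an empty NotBefore as "no lead time left" is my assumption about how to handle events that have already started.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def seconds_until(not_before: str, now: Optional[datetime] = None) -> float:
    """Seconds of lead time left before the event may start (negative if past)."""
    if not not_before:
        return 0.0  # assumption: an empty value means the event is already under way
    start = parsedate_to_datetime(not_before)  # "GMT" parses to a UTC-aware datetime
    current = now or datetime.now(timezone.utc)
    return (start - current).total_seconds()


# Twelve minutes before the NotBefore time in the example response.
lead = seconds_until(
    "Wed, 22 Apr 2026 19:17:47 GMT",
    now=datetime(2026, 4, 22, 19, 5, 47, tzinfo=timezone.utc),
)
```

A drain routine can compare this value against how long its own shutdown steps take and decide whether to start immediately.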

Here's a simple polling script in Python:

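import json
import time

import requests

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents"
HEADERS = {"Metadata": "true"}
PARAMS = {"api-version": "2020-07-01"}
# Generous timeout: the first call can be slow while Azure enables the feature for the VM.
TIMEOUT_SECONDS = 120

def get_scheduled_events():
    """Fetch the current Scheduled Events document from IMDS."""
    response = requests.get(ENDPOINT, headers=HEADERS, params=PARAMS, timeout=TIMEOUT_SECONDS)
    response.raise_for_status()
    return response.json()

def handle_events(data):
    """Print each pending event; hook your drain/checkpoint logic in here."""
    for event in data.get("Events", []):
        print(f"[{event['EventType']}] {event.get('Description', 'No description')}")
        print(f"  Status: {event['EventStatus']}, Not Before: {event['NotBefore']}")
        print(f"  Duration: {event.get('DurationInSeconds')}s, Source: {event.get('EventSource')}")
        # Your graceful drain/checkpoint logic here. Once it has finished, you can
        # optionally call approve_event(event["EventId"]) to let Azure proceed early.

def approve_event(event_id):
    """Acknowledge the event so Azure can proceed immediately."""
    payload = json.dumps({"StartRequests": [{"EventId": event_id}]})
    response = requests.post(ENDPOINT, headers=HEADERS, params=PARAMS, data=payload, timeout=TIMEOUT_SECONDS)
    response.raise_for_status()

# Poll frequently - the official docs recommend every 1 second for production.
# Adjust based on your workload sensitivity.
last_incarnation = None
while True:
    try:
        data = get_scheduled_events()
        # DocumentIncarnation changes whenever the event list changes, so only
        # handle the document when it differs from the last one we saw.
        if data.get("DocumentIncarnation") != last_incarnation:
            last_incarnation = data.get("DocumentIncarnation")
            handle_events(data)
    except requests.RequestException as exc:
        # A failed poll should not kill the loop; log it and try again.
        print(f"Scheduled Events poll failed: {exc}")
    time.sleep(1)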

Or a quick check in Bash:

curl -s -H "Metadata:true" --noproxy "*" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | jq .

Event approval: Once your application has drained connections or checkpointed state, it can approve the event by POSTing back with the EventId. This tells Azure your app is ready, and the platform can proceed without waiting for the full timeout. If you don't explicitly approve, Azure proceeds when the NotBefore time is reached.

If you're seeing elevated frequency of live migrations, this data lets you quantify the pattern (how often, what times, what durations) and bring hard numbers to a support conversation instead of "it feels like it's happening a lot."
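If you log each event you see, turning that log into numbers takes only a few lines. The sketch below assumes you have stored the events as dictionaries in the same shape as the example response, one entry per unique EventId. The function name is hypothetical, and skipping negative durations assumes they mean "unknown".

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def summarize_freezes(events: list) -> dict:
    """Count Freeze events, bucket them by UTC hour, and average their duration."""
    freezes = [e for e in events if e.get("EventType") == "Freeze"]
    by_hour = Counter(parsedate_to_datetime(e["NotBefore"]).hour for e in freezes)
    # Assumption: a negative duration means "unknown", so it is left out of the average.
    durations = [e["DurationInSeconds"] for e in freezes if e.get("DurationInSeconds", -1) >= 0]
    avg = sum(durations) / len(durations) if durations else None
    return {"count": len(freezes), "by_hour": dict(by_hour), "avg_duration_s": avg}


# Made-up log entries for illustration.
logged = [
    {"EventType": "Freeze", "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT", "DurationInSeconds": 5},
    {"EventType": "Freeze", "NotBefore": "Thu, 23 Apr 2026 03:02:10 GMT", "DurationInSeconds": 9},
    {"EventType": "Reboot", "NotBefore": "Fri, 24 Apr 2026 11:00:00 GMT", "DurationInSeconds": -1},
]
summary = summarize_freezes(logged)
```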

Docs: Scheduled Events for VMs

Step 4: Check your overall posture (Azure Advisor)

While you're at it, check Azure Advisor's Reliability recommendations for your VMs. It flags things like:

VMs not deployed in availability zones
Deprecated VM images that need updating
Missing backup configurations
Other resiliency gaps that make you more susceptible to availability issues

Advisor won't explain a past incident, but it can help prevent the next one.

Docs: Azure Advisor Reliability recommendations

A quick note on resilience

These tools improve your visibility and response time, but they don't eliminate downtime by themselves. If a VM is truly critical, pair this monitoring with basic resilience patterns: multiple instances behind a load balancer, availability zones, health probes, regular backups, and cross-region recovery where needed. Monitoring tells you what's happening. Architecture determines whether it matters.

The setup checklist

Quick wins (15 minutes)

#	What	Why	Time
1	Check Resource Health on your production VMs	See if there are past events you didn't know about	2 min
2	Create a Service Health alert for your regions/services	Get notified about platform issues proactively	3 min
3	Create Resource Health alerts for your VM resource groups	Get notified when any VM changes health state	3 min
4	Review Azure Advisor Reliability tab	Fix any posture gaps	2 min

Advanced hardening (1+ hours depending on your app)

#	What	Why
5	Deploy the Scheduled Events polling script on critical VMs	Get advance notice of live migrations and maintenance
6	Implement drain/checkpoint logic tied to Scheduled Events	Gracefully handle maintenance with zero user impact
7	Wire event approvals into your automation	Control the timing of when Azure proceeds with maintenance

Wrapping up

The pattern I keep seeing is teams treating Azure VM monitoring as something they'll get to "later." Then an incident happens, the RCA takes longer than anyone wants, and everyone wishes they had visibility sooner.

The tools are already there. Resource Health tells you what happened. Service Health and Resource Health alerts tell you when it's happening. Scheduled Events tells you before it happens. And Advisor helps you make sure your setup is resilient in the first place.

Fifteen minutes of setup for the quick wins, and you're in a fundamentally better place than most teams running VMs on Azure today.

$17,493 in Undisclosed Marketplace Charges with No Cost Visibility, No Recourse, No Accountability

chrisbaker2000 — Mon, 13 Apr 2026 14:01:05 GMT

I'm a co-founder of a 13-person startup in the Microsoft for Startups Founders Hub program. I'm posting here because after two months of support tickets, calls, and emails across both Microsoft and Anthropic, I have been unable to get anyone with decision-making authority to address this issue. I'm hoping this reaches someone at Microsoft who can help, and that other affected founders in the program will share their experiences.

What happened:

In February 2026, we deployed Claude Opus 4.6 and Sonnet 4.6 through Azure AI Foundry as part of a migration of our AI infrastructure. We are active users of our Azure sponsorship credits and assumed, as anyone would, that these models were covered the same way Azure OpenAI models are. There was no indication otherwise during deployment.

In early March, we received our first invoice: $1,078.07 (invoice G144899694, billing period 02/01–02/28). We were shocked, but we paid it immediately and removed all Anthropic model deployments from our account to prevent further charges.

It didn't matter. On April 9, we received a second invoice: $16,414.94 (invoice G151890529, billing period 03/01–03/31). Despite removing the deployments in mid-March, charges had already accumulated for the first half of the month. We are unable to pay this invoice. Our total exposure across both invoices is $17,493.01.

Why we had no way to prevent this:

No billing distinction at deployment. Azure AI Foundry presents all models (Microsoft-native and third-party) in the same unified interface. There is no warning, label, or confirmation step indicating that certain models are excluded from sponsorship credits.
No cost visibility whatsoever. The Azure AI Foundry monitoring dashboard has an "Estimated Cost" section that is completely blank for these models, with a disclaimer: "Cost monitoring is available for Foundry Models sold directly by Azure only." We could see token counts but had zero visibility into what we were being charged.
Token counts that don't explain the charges. The dashboards show our Claude Opus 4.6 deployment used 63.4M tokens and our Sonnet 4.6 deployments used roughly 170M tokens combined. At published rates, that should be in the low thousands, not $17,500. My analysis shows the dashboard hides billions of cached tokens (prompt caching reads and writes) that are invisible in the monitoring UI but account for the vast majority of the bill. There is no view in Azure that provides a breakdown of these charges by token type.
No alerts or notifications. There were no cost alerts, no threshold warnings, and no notifications at any point.
No indication in any Azure portal that charges were hitting our credit card. There was no line item, no pending charge, no Marketplace spend summary - nothing anywhere in the Azure ecosystem that showed dollars accumulating against our payment method for these deployments.

What happened when we asked for help:

Azure Support (TrackingID#2603090040002936): After a month-long wait, a support engineer told us Microsoft cannot issue credits for Marketplace charges and directed us to Anthropic. The first version of the response email referenced "Azure DDoS Protection Standard" instead of our actual issue, which suggests a high volume of similar cases in the queue.
Anthropic: Their AI support bot responded within one minute with a blanket statement and closed the ticket four hours later. I escalated, but have still not received a response over a month later.
Microsoft for Startups Team (TrackingID#2604070040009778): Told us they cannot apply Marketplace charges against sponsorship credits and referred us to a Marketplace billing contact.
Azure Marketplace billing contact: Pending response.

The pattern is clear: Microsoft directs us to Anthropic. Anthropic directs us to Microsoft. The Microsoft for Startups team directs us to Azure Marketplace billing. No one takes responsibility.

What I'm asking for:

A full refund of $17,493.01 across invoices G144899694 and G151890529.
That Microsoft implement clear billing warnings in Azure AI Foundry before deploying models that are excluded from sponsorship credits.
That Microsoft provide actual cost visibility in the monitoring dashboard for all models deployed through AI Foundry, not just first-party models.

To other founders in the program:

If you have experienced this same issue, please reply to this thread. I know I'm not alone because this has been covered by The Register, InfoWorld, and Computerworld, and at least 20 founders have signed a Change.org petition about it. The more founders who come forward with specific amounts and case numbers, the harder this is to ignore.

We joined Microsoft for Startups because the program was supposed to help early-stage companies manage infrastructure costs during the most financially vulnerable period of our growth. Instead, the program's own platform generated charges within 2 weeks that exceed the total sponsorship credits we've consumed over the past year, with no visibility, no warning, and no path to resolution.

With Further Inc. | Microsoft for Startups Founders Hub

Azure Support TrackingID #2603090040002936

Startup Support TrackingID #2604070040009778

Role Structures, Anti-Patterns, and the 10 Governance Principles

rmmartins — Thu, 09 Apr 2026 21:25:00 GMT

Part 3 of 3: The implementation playbook for engineering, finance, and security teams

In Part 1, we established Azure's three-plane model: Entra for identity, RBAC for resources, Commerce for billing. In Part 2, we explored where those planes collide: Marketplace governance, Managed Identity, and ABAC.

Now it's time to get practical. This post covers the patterns that work, the anti-patterns that don't, and the governance principles that every digital-native company should adopt before they're forced to adopt them after an incident.

7 anti-patterns to avoid

These seven anti-patterns appear repeatedly across AI, SaaS, and digital-native customers. Every one of them has caused real incidents — surprise invoices, accidental deletions, compliance failures, or governance breakdowns.

❌ Anti-Pattern 1: Giving engineers billing permissions

What happens: Engineers are given Billing Reader or Billing Contributor roles "so they can see costs." They can now see MACC credits, private offer terms, commercial discounts, and Marketplace purchase history, none of which they need.

Symptoms: Engineers purchasing Marketplace SaaS without oversight. Surprise invoices. Procurement loses visibility into vendor commitments.

Fix: Engineers need Cost Management Reader (RBAC) for usage-based cost visibility. They do not need billing roles. If they need to understand MACC impact, create a reporting process; don't give them the keys.

❌ Anti-Pattern 2: Giving finance subscription owner access

What happens: Finance teams are given Owner or Contributor roles on subscriptions "so they can track spending." They now have the ability to deploy, modify, and delete production resources.

Symptoms: Massive over-permissioning. Finance can accidentally delete production resources. Audit risk: regulators will flag this.

Fix: Finance roles belong in the Billing plane, not the resource plane. Give finance Billing Reader for credit and invoice visibility. If they also need resource cost data, add Cost Management Reader (RBAC) scoped to the appropriate subscriptions — that's a read-only, resource-plane role.

❌ Anti-Pattern 3: Too many subscription owners

What happens: Every senior engineer, team lead, and sometimes product managers get Owner on subscriptions. The logic: "they need to unblock themselves."

Symptoms: No accountability: when everyone is Owner, nobody is. High blast radius. Hard to trace role assignments when troubleshooting.

Fix: Maximum 2–3 Owners per subscription: Platform Lead, SRE Lead, and optionally the Cloud Architect. Everyone else gets Contributor or scoped roles. Use PIM for emergency elevation.
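
One way to check the Owner limit is to audit an export of role assignments. The sketch below is illustrative only: it assumes the assignments are dictionaries with `scope`, `roleDefinitionName`, and `principalName` keys (the shape you would typically get from a JSON export of role assignments), and the sample data and the `audit_owners` helper are hypothetical.

```python
from collections import defaultdict

MAX_OWNERS = 3  # upper bound suggested in the text above


def audit_owners(assignments, max_owners=MAX_OWNERS):
    """Return {scope: [owner names]} for every scope with more Owners than allowed."""
    owners = defaultdict(list)
    for assignment in assignments:
        if assignment.get("roleDefinitionName") == "Owner":
            owners[assignment["scope"]].append(assignment.get("principalName", "<unknown>"))
    return {scope: names for scope, names in owners.items() if len(names) > max_owners}


# Hypothetical sample data: four Owners on prod, one on dev.
sample = [
    {"scope": "/subscriptions/sub-prod", "roleDefinitionName": "Owner", "principalName": "alice"},
    {"scope": "/subscriptions/sub-prod", "roleDefinitionName": "Owner", "principalName": "bob"},
    {"scope": "/subscriptions/sub-prod", "roleDefinitionName": "Owner", "principalName": "carol"},
    {"scope": "/subscriptions/sub-prod", "roleDefinitionName": "Owner", "principalName": "dave"},
    {"scope": "/subscriptions/sub-prod", "roleDefinitionName": "Contributor", "principalName": "erin"},
    {"scope": "/subscriptions/sub-dev", "roleDefinitionName": "Owner", "principalName": "alice"},
]

flagged = audit_owners(sample)
for scope, names in flagged.items():
    print(f"{scope}: {len(names)} Owners ({', '.join(names)})")
```

Running a check like this on a schedule makes Owner sprawl visible before it becomes an incident.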

❌ Anti-Pattern 4: Believing Entra Global Admin = Azure Owner

What happens: Leadership assumes Global Admin has universal access: subscriptions, resources, billing. It doesn't. Global Admin controls the identity plane only.

Symptoms: Security teams thinking they can see all resources (they can't). Incorrect governance designs that assume Entra = RBAC.

Fix: Train leadership explicitly: Entra ≠ RBAC ≠ Billing. Three planes, three sets of roles, zero overlap. A Global Admin who needs resource access must be separately granted RBAC roles.

❌ Anti-Pattern 5: Deploying marketplace SaaS without finance

What happens: Engineers purchase Marketplace tools directly because they have billing permissions (see Anti-Pattern 1) or because the org hasn't restricted Marketplace purchases.

Symptoms: Incorrect MACC burn. Licensing duplicates. Vendor lock-in without legal review. Private offer terms not applied.

Fix: Require finance approval for all paid Marketplace purchases. Follow the five-step workflow from Part 2: Engineer requests → Finance reviews → Billing executes → Engineering deploys → Cost monitoring activated.

❌ Anti-Pattern 6: Mixed dev/test/prod in one subscription

What happens: To save time, teams put all environments in one subscription.

Symptoms: Can't isolate production costs. A Contributor on the sub can modify both dev and prod. Can't enforce stricter policies on prod without affecting dev. Compliance teams can't get clean boundaries.

Fix: Separate subscriptions by environment. Pattern: 1 subscription per environment per workload (or at minimum per environment). Use cross-subscription networking via Hub & Spoke or Landing Zones.

❌ Anti-Pattern 7: Not using Azure Policy

What happens: Teams deploy freely with no guardrails. Over time: VMs in unapproved regions, GPU SKUs in non-production, storage accounts without encryption, missing tags, public IP drift.

Symptoms: Inconsistent regions. Wrong VM families. Missing tags make cost attribution impossible. Non-compliant configurations.

Fix: Adopt Azure Policy early, at Management Group scope. Critical policies: allowed locations, allowed VM SKUs, enforce HTTPS, enforce private endpoints, enforce tagging (environment, owner, cost-center).
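
Before turning on tag enforcement, it can help to measure how many existing resources would fail it. The following is a minimal sketch, not an Azure Policy definition: it assumes a resource inventory exported as dictionaries with `name` and `tags` keys, and the `missing_tags` helper and sample inventory are hypothetical. The required tag names come from the list above.

```python
REQUIRED_TAGS = ("environment", "owner", "cost-center")


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag names that a resource does not carry."""
    # Compare case-insensitively so "Environment" and "environment" both count.
    present = {key.lower() for key in (resource.get("tags") or {})}
    return [tag for tag in required if tag not in present]


# Hypothetical inventory export.
inventory = [
    {"name": "vm-api-01", "tags": {"environment": "prod", "owner": "platform", "cost-center": "cc-100"}},
    {"name": "st-logs", "tags": {"Environment": "dev"}},
    {"name": "pip-legacy", "tags": None},
]

report = {r["name"]: missing_tags(r) for r in inventory if missing_tags(r)}
for name, tags in report.items():
    print(f"{name}: missing {', '.join(tags)}")
```

A report like this tells you which teams need to fix their resources before a deny-mode tagging policy starts blocking deployments.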

Recommended role structure

Based on experience with dozens of digital-native customers, here's the role structure that works across the three planes.

Engineering plane (RBAC)

2–3 subscription Owners: Platform Lead, SRE Lead, Cloud Architect
Platform/SRE team as Contributors: deploy and manage infrastructure
Developers as RG-scoped Contributors or Readers: limited to their workload's resource group
Cost Management Reader for budget owners: usage visibility without deployment rights
Azure Policy for guardrails: VM SKUs, regions, encryption, tags
Management Groups for organizational structure

Finance plane (Commerce)

Billing Account Owner = CFO or Finance Director
Billing Contributor = Finance Operations
Billing Reader = FP&A and financial analysts
All Marketplace-paid offers require finance approval
MACC visibility restricted to finance roles

Identity/Security plane (Entra)

2–4 Global Admins (break-glass accounts included)
PIM enforced for all privileged roles; no permanent admin access
Conditional Access for all admin roles (MFA, compliant device, block legacy auth)
Groups used for RBAC assignment; never assign RBAC to individual users
Workload identities (Managed Identity) preferred over service principals

Role mapping templates

Copy these into your onboarding documentation.

Engineering Team

Role	Azure Role	Plane	Allowed actions
Cloud Architect	Owner (2–3 per sub)	RBAC	Govern workloads, assign roles, manage infrastructure
Platform / SRE	Contributor	RBAC	Deploy and manage infrastructure
Developer	Contributor or Reader (RG-scoped)	RBAC	Deploy to specific resource groups
Budget Owner	Cost Management Reader	RBAC	View usage-based cost, manage budgets — not billing

Finance Team

Role	Azure Role	Plane	Allowed actions
Finance Lead	Billing Account Owner	Billing	View and manage credits, invoices, MACC, payment methods
Finance Analyst	Billing Reader	Billing	Read-only billing visibility
FP&A	Billing Reader	Billing	Read-only; no deployments, no resource access

Leadership

Role	Azure Role	Plane	Allowed actions
CTO / VP Engineering	Reader or Cost Mgmt Reader	RBAC	Visibility into platform and resource costs
CFO	Billing Reader	Billing	Visibility into credits, invoices, MACC, commitments

RACI Matrix

Adapted from the Microsoft Cloud Adoption Framework.

Function	Accountable	Responsible	Consulted	Informed
Billing account roles & access	Finance Lead	Finance Ops	Cloud Architect	Engineering
Subscription role assignments	Cloud Architect	Platform / SRE	Finance, Security	Engineering
Cost monitoring & budgets	Finance	Engineering	Leadership	All teams
Marketplace purchases	Finance Lead	Finance Ops	Engineering, Legal	CFO
IaC / Deployment governance	Platform Lead	Engineers	Security	Finance
Policies & guardrails	Security / Cloud Architect	Platform Team	Engineering	Leadership
Identity & access governance	Security Lead	Identity Admin	Cloud Architect	All teams
PIM & Conditional Access	Security Lead	Identity Admin	Platform Lead	Engineering
MACC tracking & credit visibility	Finance Lead	Finance Ops	Cloud Architect	Leadership

Include this template in your onboarding documentation and review it quarterly.

Best Practices

Use Entra Groups for RBAC assignment; never assign directly to users

Benefits: clear separation of identity and resource planes, easy onboarding/offboarding, predictable RBAC inheritance, enables PIM for group-based elevation.

Naming pattern:

grp-sub-<SubscriptionName>-Owner
grp-sub-<SubscriptionName>-Contributor
grp-rg-<WorkloadName>-Reader

Assign the group to the role, not individual users.
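
A naming pattern only helps if it is applied consistently, so it is worth generating group names rather than typing them. This is a small sketch under stated assumptions: the `group_name` helper is hypothetical, and replacing spaces and punctuation with hyphens is my own choice, not part of the pattern above.

```python
import re

VALID_SCOPES = ("sub", "rg")  # subscription or resource group, as in the pattern above


def group_name(scope_kind, scope_name, role):
    """Build a group name of the form grp-<sub|rg>-<Name>-<Role>."""
    if scope_kind not in VALID_SCOPES:
        raise ValueError(f"scope_kind must be one of {VALID_SCOPES}")
    # Assumption: collapse anything that is not a letter or digit into a hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", scope_name).strip("-")
    return f"grp-{scope_kind}-{slug}-{role}"


names = [
    group_name("sub", "Prod Payments", "Owner"),
    group_name("sub", "Prod Payments", "Contributor"),
    group_name("rg", "checkout_api", "Reader"),
]
print("\n".join(names))
```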

Enforce PIM + Conditional Access for all privileged roles

Key CA policies: MFA required for all admins, compliant device requirement, block legacy authentication, block sign-in from high-risk locations, require phishing-resistant MFA.

No permanent admin access. Use time-based elevation for every privileged operation.

Separate subscriptions by environment and workload

Subscriptions are a security boundary. Pattern: 1 subscription per environment per workload. Platform teams get their own subscription. Use Hub & Spoke or Landing Zones for cross-subscription networking.

Keep billing data confidential

Only Billing roles should see credits, commitments, discounts, invoices, and MACC balance. Engineers should never have access to commercial data.

The 10 Principles of Azure Governance

After working with digital natives across AI, SaaS, and infrastructure companies, I can summarize Azure governance into these principles:

#	Principle	Summary
1	Separate identity, resources, and billing. Always.	Never mix roles across planes. An engineer should never hold billing roles. A finance analyst should never hold subscription Owner.
2	Engineering owns the resource plane.	Give them Contributor and Cost Management Reader. Don't burden them with billing or identity administration.
3	Finance owns the billing plane.	Credits, MACC, invoices, private offers. Every Marketplace purchase flows through Finance.
4	Security owns identity and governance.	PIM, Conditional Access, Azure Policy. Identity decisions should not be made by engineering or finance.
5	Keep subscription Owners scarce.	Maximum 2–3 per subscription. Use PIM for emergency elevation. Everyone else gets Contributor or scoped roles.
6	Lock down Marketplace.	Every SaaS purchase approved by Finance. No exceptions. Use the five-step workflow.
7	Use Infrastructure as Code.	Manual deployments don't scale and can't be audited. Use Bicep, Terraform, or Pulumi.
8	Use budgets early.	Set budgets at Management Group, Subscription, and Resource Group levels. Configure alerts to email, Teams, or automation.
9	Use Management Groups from day one.	Every startup that scales beyond a single subscription regrets not using them. Recommended hierarchy: Tenant Root → OrgName → Platform / Production / NonProduction / Sandbox / Shared Services.
10	Build governance before scale.	The companies that scale successfully treat Azure governance as infrastructure, not bureaucracy.

Closing thoughts

Azure's three permission planes aren't a problem to solve; they're a framework to leverage.

The confusion happens when teams try to treat Azure as if it has a single permission system. It doesn't, and it never will, because identity, billing, and resource deployment are fundamentally different domains that must be operated and secured differently.

But when organizations understand these three planes and structure their roles accordingly, something powerful happens:

Engineering moves faster. Clear RBAC scopes mean teams deploy without waiting for approvals they don't need.
Finance gains real oversight. Billing roles provide full commercial visibility without the risk of touching production resources.
Security gets a clean, enforceable boundary model. Entra controls identity; PIM and Conditional Access control elevation; Azure Policy controls the guardrails.
Leadership sees clarity instead of chaos. The right roles in the right planes mean dashboards, reports, and alerts actually reflect what each stakeholder needs.

Good governance doesn't slow down innovation. Bad governance does.

The companies that scale successfully, whether AI-native, SaaS platforms, or global digital-first organizations, are the ones that adopt a clean, intentional model early. They treat Azure governance as infrastructure, not bureaucracy.

The model is simple: Entra for who. RBAC for what. Commerce for how you pay. Start with that, and everything else becomes easier.

This concludes the 3-part series on Azure Governance for Digital Natives. For the full model, start with Part 1: The Three Permission Planes. For collision points and Managed Identity, read Part 2: Marketplace Governance and the Cross-Plane Bridge.

Marketplace governance and the cross-plane bridge

rmmartins — Thu, 09 Apr 2026 21:17:44 GMT

Part 2 of 3: Where resource deployment meets financial authority and how to govern it

In Part 1, we established the foundational model: Azure operates on three completely separate permission planes: Entra (identity), RBAC (resources), and Commerce (billing). A role in one plane grants zero access in the others.

That model is clean in theory. But in practice, the planes collide. And when they do, teams get confused, purchases stall, and governance gaps appear.

This post covers the biggest collision point: Marketplace, where resource deployment meets financial authority. We'll also dig into Managed Identity (the one construct that genuinely bridges two planes), ABAC (advanced conditional governance within the resource plane), and the five-step Marketplace approval workflow every digital-native company should adopt.

Marketplace: Where the resource and billing planes intersect

Marketplace is the most common collision point between Azure's permission planes. Here's why: deploying an Azure resource and purchasing a Marketplace SaaS product feel like the same action from the Portal, but they are governed by completely different permission systems.

Deploying resources ≠ Purchasing SaaS

A Contributor can deploy any native Azure resource: VMs, Storage, AKS, Networking, Databases, Azure OpenAI. These are resource plane operations governed by RBAC.

But purchasing a third-party SaaS product through Marketplace (Datadog, Snowflake, Elastic, Confluent, MongoDB Atlas) is a commercial transaction. It creates a financial obligation between your organization and a vendor. That's the billing plane.

Deploying → RBAC (Resource Plane)
Purchasing → Commerce (Financial Plane)

The marketplace permission model

Action	Requires RBAC?	Requires billing role?
Deploy a VM	✅ Yes	❌ No
Deploy AKS cluster	✅ Yes	❌ No
Deploy Azure OpenAI	✅ Yes	❌ No
Deploy Datadog agent extension	✅ Yes	❌ No
Deploy Confluent cluster (Azure-native)	✅ Yes	❌ No
Purchase Datadog SaaS plan	❌ No	✅ Yes
Purchase Snowflake SaaS	❌ No	✅ Yes
Accept Confluent SaaS contract	❌ No	✅ Yes
View Snowflake private offer	❌ No	✅ Yes
Approve Marketplace private offer	❌ No	✅ Yes

This is why engineers often ask:

"Why can't I buy Snowflake? I'm an Owner."

Because Owner has no financial authority. Owner is the highest role in the resource plane, but Marketplace SaaS purchases are commercial transactions that require billing plane permissions. These are different systems.

The subtlety: Azure-Native vs. SaaS

Some vendors have both Azure-native integrations and SaaS offerings, which makes this even more confusing:

Datadog agent extension: deploys as an Azure resource → RBAC ✅
Datadog SaaS plan: creates a billing relationship → Commerce ✅
Confluent for Azure: deploys Kafka as an Azure resource → RBAC ✅
Confluent Cloud SaaS contract: financial commitment → Commerce ✅

When an engineer deploys a Datadog agent via the Portal, everything works. When they try to subscribe to the Datadog SaaS plan, they hit a wall. Same vendor, same Portal, different permission plane.

The five-step marketplace purchase workflow

For digital natives operating with financial governance, every Marketplace purchase should follow this workflow:

Step 1: Engineer requests a SaaS or marketplace resource

The request should include: why it's needed, expected cost, impact on MACC, preferred vendor, and alternatives considered.

Step 2: Finance reviews commercial implications

Finance checks: MACC impact (does this purchase count toward the commitment?), budget alignment, available discounts (private offers), vendor validation, and contract terms.

Step 3: Billing role executes the purchase

Billing Account Owner or Contributor completes the transaction in the Portal. This is a billing plane operation.

Step 4: Engineering deploys or configures the resource

SaaS connector setup, private offer entitlement, RBAC for workload integration, data pipelines and integration. This is a resource plane operation.

Step 5: Cost monitoring activated

Alerts configured, budgets set, tagging applied, forecasting enabled.

This five-step workflow is simple, but most digital natives skip it and end up with surprise invoices, unapproved vendor commitments, or MACC burn they didn't plan for.
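
If you track purchase requests in an internal tool, the ordering of the five steps can be enforced in code so that nobody deploys before Finance has reviewed. The sketch below is a toy state machine, not an Azure feature; the `Step` and `PurchaseRequest` names are hypothetical.

```python
from enum import IntEnum


class Step(IntEnum):
    REQUESTED = 1         # Step 1: engineer requests
    FINANCE_REVIEWED = 2  # Step 2: finance reviews
    PURCHASED = 3         # Step 3: billing role executes
    DEPLOYED = 4          # Step 4: engineering deploys
    MONITORED = 5         # Step 5: cost monitoring activated


class PurchaseRequest:
    def __init__(self, vendor):
        self.vendor = vendor
        self.completed = []

    def advance(self, step):
        """Record a step, refusing any step taken out of order."""
        if len(self.completed) == len(Step):
            raise ValueError("workflow already complete")
        expected = Step(len(self.completed) + 1)
        if step != expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)


request = PurchaseRequest("Datadog")
request.advance(Step.REQUESTED)
try:
    request.advance(Step.DEPLOYED)  # skips finance review and purchase
    skipped = True
except ValueError as err:
    skipped = False
    print(f"blocked: {err}")
```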

The one cross-plane bridge: Managed Identity

If the three-plane model is about separation, Managed Identity is the one construct that genuinely bridges two of those planes.

A Managed Identity is an Entra identity tied to an Azure resource and authorized via RBAC. It lets Azure workloads authenticate to other Azure services without storing credentials in code, environment variables, or configuration files.

The cross-plane flow

Step	Plane	What happens
1. Identity created	Entra (Identity)	A service principal is registered in the directory
2. Access authorized	RBAC (Resource)	Role assignments grant access to specific resources
3. Identity used	Runtime (Resource)	The workload requests a token from Entra and calls the target service

No secrets. No passwords. No key rotation. The identity lifecycle is managed by Azure itself.
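
To make step 3 concrete, here is roughly what a token request looks like from a workload running on an Azure VM, using the Instance Metadata Service endpoint. This is a sketch under assumptions: it only builds the request and does not send it, because the endpoint is reachable only from inside Azure. Other hosts such as App Service expose a different local endpoint, and in practice most teams use an Azure SDK credential class rather than raw HTTP. The `build_token_request` helper is hypothetical.

```python
import urllib.parse
import urllib.request

# Link-local metadata endpoint available on Azure VMs.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(resource, client_id=None):
    """Build (but do not send) a managed identity token request."""
    params = {"api-version": "2018-02-01", "resource": resource}
    if client_id:
        # Only needed to select a specific user-assigned identity.
        params["client_id"] = client_id
    url = IMDS_TOKEN_URL + "?" + urllib.parse.urlencode(params)
    # The Metadata header is required; no secret or password is sent.
    return urllib.request.Request(url, headers={"Metadata": "true"})


token_request = build_token_request("https://storage.azure.com/")
print(token_request.full_url)
```

Notice what is absent: there is no client secret, certificate, or connection string anywhere in the request.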

AI workload examples

For digital natives building AI applications, Managed Identity is essential:

Scenario	Source	Target	RBAC role needed
App calls Azure OpenAI	App Service / Container App	Azure OpenAI	Cognitive Services OpenAI User
App reads secrets	App Service / Container App	Key Vault	Key Vault Secrets User
App reads/writes blobs	App Service / Container App	Storage Account	Storage Blob Data Contributor
AKS pod calls AOAI	AKS (Workload Identity)	Azure OpenAI	Cognitive Services OpenAI User
AKS pod reads secrets	AKS (Workload Identity)	Key Vault	Key Vault Secrets User
Function processes events	Azure Function	Event Hub	Azure Event Hubs Data Receiver
Pipeline reads training data	ML Workspace	Storage Account	Storage Blob Data Reader
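
The table can be turned into a small lookup that emits the matching Azure CLI role assignment command. This is a sketch, assuming you already know the managed identity's object ID and the target resource's scope; the scenario keys, the `role_assignment_cmd` helper, and the placeholder IDs are hypothetical. The command is only printed, not executed.

```python
import shlex

# Scenario -> built-in role, taken from the table above.
SCENARIO_ROLES = {
    "app-calls-openai": "Cognitive Services OpenAI User",
    "app-reads-secrets": "Key Vault Secrets User",
    "app-reads-writes-blobs": "Storage Blob Data Contributor",
    "function-processes-events": "Azure Event Hubs Data Receiver",
    "pipeline-reads-training-data": "Storage Blob Data Reader",
}


def role_assignment_cmd(scenario, principal_object_id, scope):
    """Return a shell-quoted `az role assignment create` command line."""
    args = [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_object_id,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", SCENARIO_ROLES[scenario],
        "--scope", scope,
    ]
    return shlex.join(args)


cmd = role_assignment_cmd(
    "app-reads-secrets",
    "00000000-0000-0000-0000-000000000000",  # placeholder object ID
    "/subscriptions/sub-id/resourceGroups/rg-app/providers/Microsoft.KeyVault/vaults/kv-app",
)
print(cmd)
```

Scoping the assignment to the single vault, rather than the resource group or subscription, keeps the identity's reach as narrow as the scenario requires.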

System-Assigned vs. User-Assigned

System-assigned: Tied to a single resource. When the resource is deleted, the identity is deleted. Best for simple scenarios with one resource accessing one or a few target services.

User-assigned: Created as a standalone resource. Can be assigned to multiple resources. Best for shared identity across microservices, AKS Workload Identity, or when the identity must persist independently.

AKS Workload Identity

AKS Workload Identity deserves special mention: it's the most common Managed Identity pattern in digital-native companies running Kubernetes.

A User-Assigned Managed Identity is created in Azure
A Kubernetes Service Account is annotated with the identity's client ID
A Federated Identity Credential links the K8s service account to the Managed Identity
RBAC role assignments grant the Managed Identity access to target resources
At runtime, the pod uses the service account to get an Entra token via workload identity federation

This is Entra + RBAC + Kubernetes working together: identity plane creates the trust, resource plane authorizes the access, and the workload uses it at runtime.
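
The two pieces that tie Kubernetes to the Managed Identity are the service account annotation and the federated credential subject. The sketch below builds both as plain dictionaries so the relationship is visible; the helper names, the sample namespace, and the issuer URL are hypothetical, and you would normally express these in YAML and IaC rather than Python.

```python
def service_account_manifest(name, namespace, client_id):
    """Kubernetes ServiceAccount annotated with the managed identity's client ID."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"azure.workload.identity/client-id": client_id},
        },
    }


def federated_credential(issuer_url, name, namespace):
    """Federated credential fields that link the service account to the identity."""
    return {
        "issuer": issuer_url,  # the cluster's OIDC issuer URL
        "subject": f"system:serviceaccount:{namespace}:{name}",
        "audiences": ["api://AzureADTokenExchange"],
    }


sa = service_account_manifest("sa-inference", "ml", "11111111-1111-1111-1111-111111111111")
fic = federated_credential("https://example-oidc-issuer.invalid/", "sa-inference", "ml")
print(fic["subject"])
```

The subject string must match the service account's namespace and name exactly; a mismatch there is a common reason token exchange fails.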

Key insight: Managed Identity bridges Entra and RBAC, but never touches the third plane (billing). No identity, managed or otherwise, can see MACC credits or approve Marketplace purchases.

Advanced: Attribute-Based Access Control (ABAC)

ABAC extends RBAC with conditions based on resource attributes (tags), principal attributes, and request context. It is not a separate permission system; it's an enhancement to the resource plane.

For example, you can write a role assignment that says: "Allow Contributor access only to resources tagged Environment = Dev" or "Allow read access only to storage blobs under a specific path prefix."
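
The path-prefix example can be modeled as a small predicate to show how a condition narrows a role assignment. This is a toy model of the logic only, not the Azure condition syntax or evaluation engine: the `condition_allows` function is hypothetical, and it assumes the condition targets only the blob read action and leaves every other action to the underlying role.

```python
BLOB_READ = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"


def condition_allows(action, blob_path, allowed_prefix):
    """Return True if the condition permits the action on the given blob path."""
    if action != BLOB_READ:
        # The condition does not target this action, so it adds no restriction.
        return True
    # For blob reads, the path must sit under the allowed prefix.
    return blob_path.startswith(allowed_prefix)


checks = {
    "tenant-a/report.csv": condition_allows(BLOB_READ, "tenant-a/report.csv", "tenant-a/"),
    "tenant-b/report.csv": condition_allows(BLOB_READ, "tenant-b/report.csv", "tenant-a/"),
}
print(checks)
```

This is the shape of tenant isolation at the resource layer: the same role assignment, but reads succeed only under the tenant's own prefix.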

ABAC is particularly useful for:

Multi-tenant SaaS applications that need tenant isolation at the resource layer
Regulated workloads that require fine-grained access control beyond what standard RBAC scopes provide

What ABAC cannot do: grant billing access, override Entra roles, access MACC, or purchase Marketplace products. It operates entirely within the RBAC resource plane.

For implementation details, see: Azure RBAC Conditions (ABAC)

What's Next → We've now covered the three-plane model (Part 1) and the biggest collision points: Marketplace, Managed Identity, and ABAC. In Part 3, we get tactical: the 7 anti-patterns to avoid, recommended role structures for Engineering, Finance, and Security teams, RACI templates, and the 10 core governance principles every scaling organization should adopt.

No Decision After 7 Working Days + Portal Loop Issue – Rain Stella Technology

khalidadine — Tue, 24 Mar 2026 16:14:45 GMT

Hi Microsoft for Startups Community & Team,

I am following up on my Microsoft for Startups application submitted on behalf of Rain Stella Technology, and would also like to flag a portal issue I experienced after the application was submitted.

--- Application Status ---

I received a confirmation email acknowledging receipt of my application, however it has now been 7 working days with no further communication — no approval, no decline, and no request for additional information.

According to Microsoft's own guidelines, applications are typically reviewed within 3 business days. We are now more than double that timeframe with no update.

Application Details:

• Applicant Name: Khalid Adine

• Startup Name: Rain Stella Technology

• Status: Receipt confirmation received, no decision communicated since

• Days Elapsed: 7 working days

--- Portal Issue ---

During and after submitting my application, I encountered a frustrating portal bug where, instead of displaying a confirmation screen or application status, the portal kept redirecting me back to the "Apply Now" screen — as if my submission had not been recorded.

I received the receipt confirmation email, which confirmed the application went through, but the portal still does not reflect any application status. This caused significant confusion, and I wanted to flag it in case it is a widely known issue affecting other applicants.

I also attempted to contact support via email (email address removed for privacy reasons), but the address returned a bounce error, leaving the forum and contact form as my only available channels.

--- My Requests ---

1. Can someone from the team confirm the current status of our application?

2. Is any additional information or documentation required from our side?

3. Can the portal loop bug be investigated and fixed for other applicants facing the same issue?

4. What is the correct and currently working support email or channel for urgent queries?

We are actively building our product and are eager to move forward with the program. Any update would be sincerely appreciated.

Thank you for your time and support.

Khalid Adine

Founder, Rain Stella Technology

Azure credit and Azure Portal successful, but login to Startups Portal fails

MessageStack-HB — Sat, 14 Mar 2026 10:13:52 GMT

Hello
I have successfully registered for the Microsoft for Startups basic offering using a new Microsoft Account.
Using my Microsoft Account, I can log into the Azure Portal successfully and can see my "Azure for Startups" subscription with free credit.

However, when I try to log into the Microsoft for Startups Portal using LinkedIn, I see the error "No user found. Please sign up or try a different LinkedIn account".

Initially I thought this was because my primary LinkedIn email address was not the same as the email address of my Microsoft Account.
In LinkedIn, I changed my primary email address to my Microsoft Account email address.
I tried again to log into the Microsoft for Startups Portal using LinkedIn and got the same error as before.
I then created a new LinkedIn account using my new Microsoft Account email address and got the same error again.

Please can someone help?

Introducing the Startup-Scale Landing Zone: Get Azure right from day one

rmmartins — Mon, 16 Mar 2026 18:25:26 GMT

If you've been following this blog, you may recall the post From Zero to Hero with Azure Landing Zones, where we walked through the full Azure Landing Zone journey, from identity and RBAC to Platform and Application Landing Zones. That guide covered the what and the why. This post introduces the how, a deployable, open-source project that distills those principles into something a startup can actually ship in an afternoon.

The problem: Cloud foundations shouldn't take two months

Every startup building on Azure faces the same fork in the road:

Option A: Follow the Azure Landing Zone (ALZ) guidance. It's comprehensive, battle-tested, and designed for organizations with thousands of users. It's also 100+ modules, a multi-layered management group hierarchy, and months of work to understand, let alone implement. For a 10-person startup, it's like buying a commercial kitchen to make breakfast.

Option B: Skip governance entirely. One subscription, no policies, no budgets, no RBAC strategy. Ship fast now, deal with security debt later. This is what most startups actually do, and it works until the first security questionnaire from an enterprise customer, the first runaway cost incident, or the first az group delete that hits production.

Neither option is right. Startups need a third path: just enough governance to be secure and cost-aware from day one, without the operational overhead that slows them down.

That's exactly what the Startup-Scale Landing Zone (SSLZ) provides.

What is the Startup-Scale Landing Zone?

SSLZ is an opinionated, production-ready Azure infrastructure template that deploys in under one hour using Bicep or Terraform. It's built for teams of 5–50 engineers, typically pre-seed to Series A, who don't have a dedicated platform team but need to get Azure right from the start.

It takes the core principles from the Azure Landing Zone architecture and strips them to the essentials:

One management group, two subscriptions (prod + non-prod). That's it. No six-layer hierarchy.
Security built-in. Defender for Cloud, RBAC groups, NSG deny-all defaults, and policy enforcement, all automated.
Cost controls from day one. Budget alerts at 50%, 80%, and 100%, mandatory tagging, and reservation guidance.
An explicit graduation path. When you outgrow SSLZ, there's a step-by-step guide to migrate to the full ALZ architecture.

Important: SSLZ is not a replacement for Azure Landing Zones. It targets a different profile: very early-stage startups with a single workload, a single region, and no hybrid connectivity. For those teams, the realistic alternative isn't ALZ, it's usually no governance at all.

Architecture: Simplicity as a design principle

The architecture is deliberately minimal:

Tenant Root Group
└── mg-<yourcompany>                 ← Policies applied here
    ├── sub-<yourcompany>-prod       ← Production workloads
    └── sub-<yourcompany>-nonprod    ← Dev, staging, QA

Each subscription gets its own VNet with a standardized subnet layout:

vnet-<co>-prod (10.0.0.0/16)
├── snet-aks      10.0.0.0/20    (4,091 IPs: AKS nodes + pods)
├── snet-app      10.0.16.0/22   (1,019 IPs: App Service / Container Apps)
├── snet-data     10.0.20.0/22   (1,019 IPs: Private Endpoints)
└── snet-shared   10.0.24.0/24   (251 IPs: CI/CD agents, jump boxes)
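
The IP counts in the layout follow from the prefix length minus the five addresses Azure reserves in every subnet. The sketch below checks that arithmetic with Python's standard `ipaddress` module and also confirms the subnets fit inside the VNet without overlapping; the variable and helper names are my own.

```python
import ipaddress
from itertools import combinations

AZURE_RESERVED = 5  # addresses Azure reserves in each subnet

vnet = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "snet-aks": "10.0.0.0/20",
    "snet-app": "10.0.16.0/22",
    "snet-data": "10.0.20.0/22",
    "snet-shared": "10.0.24.0/24",
}


def usable_ips(cidr):
    """Addresses left for workloads after Azure's reserved addresses."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED


usable = {name: usable_ips(cidr) for name, cidr in subnets.items()}
networks = [ipaddress.ip_network(cidr) for cidr in subnets.values()]
all_inside_vnet = all(net.subnet_of(vnet) for net in networks)
any_overlap = any(a.overlaps(b) for a, b in combinations(networks, 2))

for name, count in usable.items():
    print(f"{name}: {count} usable IPs")
print(f"inside VNet: {all_inside_vnet}, overlaps: {any_overlap}")
```

The four subnets use only the first 25 /24 blocks of the /16, which leaves most of the address space free for later growth.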

No hub network. No Azure Firewall. No VNet peering. Each subscription is a self-contained island.

Why no hub?

A hub-spoke topology costs a minimum of ~$1,500/month. Azure Firewall alone runs $900+/month. For a startup with a single workload in a single region, that's cost and complexity with no return. NSGs provide L3/L4 filtering for free and handle 95% of startup networking use cases. When compliance or hybrid connectivity demands centralized egress control, the graduation guide walks you through adding a hub, without touching existing resources.

Why two subscriptions?

Two subscriptions give you isolation that resource groups can't:

Cost isolation for free: no tagging gymnastics to separate prod from dev spend.
RBAC without custom roles: developers get Contributor on non-prod and Reader on prod.
Blast radius containment: az group delete in dev can't touch production.
Quota isolation: non-prod experiments don't consume prod quotas.

This is a habit that's cheap to form early and expensive to retrofit later. One primary workload per subscription; when you deploy a second independent workload, create a new subscription.

What you get out of the box

Component	What's deployed
Management Groups	Single MG with two subscriptions
Azure Policy	Microsoft Cloud Security Benchmark (audit mode), required tags (environment, team), allowed locations, diagnostic settings
Networking	VNet + 4 subnets per subscription, NSGs with deny-all-inbound default
Monitoring	Log Analytics workspace, Activity Log forwarding, 90-day retention
Security	Defender for Cloud CSPM (free), Defender for Servers P2 (prod), security contact alerts
Cost Management	Budget alerts at 50/80/100% thresholds, tag enforcement via policy
CI/CD	GitHub Actions workflows for both Bicep and Terraform, Workload Identity Federation (no secrets)
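To make the 50/80/100% budget thresholds from the table concrete, here is a minimal sketch of the comparison behind them. Azure budgets evaluate this for you; the function name and the sample amounts are hypothetical and only illustrate when each alert would fire.

```python
THRESHOLDS = (50, 80, 100)  # percent of budget, as listed in the table above


def crossed_thresholds(spend, budget, thresholds=THRESHOLDS):
    """Return every threshold (in percent) that the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare with integer arithmetic to avoid floating-point rounding surprises.
    return [t for t in thresholds if spend * 100 >= t * budget]


# Hypothetical month: a 1,000 budget with 850 already spent.
print(crossed_thresholds(850, 1000))
```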

Security without friction

The security model avoids compliance theater. Instead of buying Entra ID P2 "to check a box," SSLZ enables Security Defaults, which provide free MFA that blocks 99.9% of identity attacks. Instead of enforcing MCSB in Deny mode on day one (which blocks legitimate deployments and frustrates developers), it starts in Audit mode so you can understand your posture first, then selectively move to Deny as your team matures.

RBAC follows three rules:

Never assign roles to individuals: always use security groups.
Developers don't get Contributor on prod: deployments go through CI/CD.
No Owner at subscription level for non-admins: a compromised account with Owner can grant itself anything.
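The three rules are simple enough to lint for. The sketch below is one possible way to check a list of role assignments against them. The Assignment shape, the is_admin flag, and the decision to exempt service principals from the second rule (so the CI/CD identity can still deploy) are assumptions made for illustration, not part of SSLZ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    principal: str          # display name, illustrative only
    principal_type: str     # "User", "Group", or "ServicePrincipal"
    role: str               # e.g. "Owner", "Contributor", "Reader"
    subscription: str       # "prod" or "nonprod"
    is_admin: bool = False  # hypothetical flag for the platform-admin group


def violations(a):
    """Return a list of rule violations for one role assignment."""
    found = []
    if a.principal_type == "User":
        found.append("role assigned to an individual, use a security group")
    if (a.subscription == "prod" and a.role == "Contributor"
            and a.principal_type != "ServicePrincipal"):
        found.append("Contributor on prod, deployments should go through CI/CD")
    if a.role == "Owner" and not a.is_admin:
        found.append("Owner at subscription level for a non-admin")
    return found


sample = [
    Assignment("sg-developers", "Group", "Contributor", "nonprod"),
    Assignment("sg-developers", "Group", "Contributor", "prod"),
    Assignment("alice", "User", "Reader", "prod"),
]
for a in sample:
    print(a.principal, a.subscription, violations(a))
```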

For CI/CD, SSLZ uses Workload Identity Federation (WIF) instead of client secrets. No credentials to store, rotate, or accidentally commit. Short-lived OIDC tokens scoped to specific repos and branches.
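To show what "scoped to specific repos and branches" looks like in practice, the sketch below builds the body of a federated identity credential for a GitHub Actions workflow. The issuer, audience, and subject format are the standard ones for GitHub OIDC with Microsoft Entra; the organization, repository, and credential names are placeholders, and the SSLZ templates may wire this up differently.

```python
import json

GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"
ENTRA_TOKEN_EXCHANGE_AUDIENCE = "api://AzureADTokenExchange"


def branch_subject(owner, repo, branch):
    """Subject claim GitHub issues for a workflow running on a given branch."""
    return f"repo:{owner}/{repo}:ref:refs/heads/{branch}"


def federated_credential(name, owner, repo, branch):
    """Build the federated credential body that trusts exactly one repo and branch."""
    return {
        "name": name,
        "issuer": GITHUB_OIDC_ISSUER,
        "subject": branch_subject(owner, repo, branch),
        "audiences": [ENTRA_TOKEN_EXCHANGE_AUDIENCE],
    }


# Placeholder values: replace with your own organization, repository, and branch.
credential = federated_credential("github-main", "yourcompany", "platform", "main")
print(json.dumps(credential, indent=2))
```

Because the subject must match exactly, a workflow running from a fork or from another branch gets a token that Entra will not exchange.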

Cost transparency

Every recommendation includes real numbers:

"Azure Firewall: $900+/month. Skip until compliance or hybrid demands it."
"DDoS Protection Standard: $2,944/month. Azure's free basic DDoS + Front Door WAF handles most cases."
"Defender for App Service: ~$15/month. Limited value compared to other plans. Revisit later."
"Standard_D4s_v5 VM: $140/month on-demand → $90/month with 1-year RI. 36% savings."

The documentation also covers the six most common cost mistakes startups make: forgotten dev VMs, over-provisioned databases, ignoring Reserved Instances, premium storage where standard works, not using Spot VMs, and missing Dev/Test pricing. Each mistake comes with a concrete fix and code example.
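The Reserved Instance figure quoted above is easy to reproduce, which also makes it easy to rerun with your own prices. This is a small sketch using only the two numbers from the Standard_D4s_v5 example; the function name is illustrative.

```python
def savings_percent(on_demand, reserved):
    """Percentage saved by paying the reserved price instead of on-demand."""
    return (on_demand - reserved) / on_demand * 100


# Monthly prices from the Standard_D4s_v5 example above.
d4s_v5_savings = savings_percent(140, 90)
annual_savings = (140 - 90) * 12

print(f"{d4s_v5_savings:.1f}% saved, {annual_savings} per year per VM")
```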

Starter examples: Three startup archetypes

SSLZ ships with three production-grade example architectures, each with Bicep + Terraform implementations, deployment instructions, and realistic cost estimates:

SaaS Startup (~$330–440/month)

Container Apps + Azure SQL Elastic Pool + Redis + Key Vault. Multi-tenant with shared schema and tenant_id column. Container Apps scale to zero in non-prod. Elastic pools are 50–70% cheaper than individual databases.

AI Startup (~$1,150–1,250/month)

AKS with GPU Spot node pools (60–90% savings) + Azure OpenAI + Blob Storage + Redis for inference caching. Covers model serving framework choices (vLLM vs Triton vs TGI) and GPU node management with taints and KEDA autoscaling.

API-First Startup (~$163–345/month)

App Service with deployment slots (zero-downtime swaps) + API Management (Consumption tier, pay-per-call) + Cosmos DB + Application Insights. Includes API versioning strategy, rate limiting tiers, and Cosmos DB partitioning guidance.

When to graduate

SSLZ is explicit about its limits. You'll outgrow it when 2–3 of these signals appear simultaneously:

Signal	Why it matters
Second independent workload	Each workload gets its own subscription
Engineering team > 50 people	Different teams need different permissions and cost boundaries
Regulatory compliance (SOC2, HIPAA, PCI)	Requires specific controls SSLZ doesn't cover
Multi-region deployment	Needs centralized network management
Hybrid connectivity (VPN, ExpressRoute)	Requires a Connectivity subscription with gateways
5+ subscriptions	Policy and RBAC at scale needs MG hierarchy
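If you want to track these signals in a runbook or a quarterly review, a small check like the following can help. It reads "2–3 signals at once" as a minimum of two, which is an interpretation rather than a rule from the guide, and the signal identifiers are made-up names for the rows in the table.

```python
SIGNALS = (
    "second_independent_workload",
    "team_over_50",
    "regulatory_compliance",
    "multi_region",
    "hybrid_connectivity",
    "five_plus_subscriptions",
)


def should_plan_graduation(observed, minimum=2):
    """True when at least `minimum` known signals are present at the same time."""
    unknown = set(observed) - set(SIGNALS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return len(set(observed)) >= minimum


print(should_plan_graduation(["regulatory_compliance"]))
print(should_plan_graduation(["regulatory_compliance", "multi_region"]))
```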

The Graduation Guide provides a five-phase migration path to full ALZ: management group hierarchy, hub network + firewall, management subscription, policy hardening, and identity hardening, with risk assessments for each phase. It also includes the cost of the full platform layer ($1,500–3,000/month), so you can make an informed decision about when the investment makes sense.

Quick start: From zero to production-ready in under 1 hour

Prerequisites (5 min)

Azure CLI installed
Two Azure subscriptions (prod + non-prod)
Owner permissions on both subscriptions

git clone https://github.com/ricmmartins/sslz.git
cd sslz
az login
./scripts/validate-prerequisites.sh

Deploy with Bicep (20 min)

cd infra/bicep
cp parameters/prod.bicepparam parameters/prod.local.bicepparam
# Edit prod.local.bicepparam with your values
az deployment sub create \
  --location eastus2 \
  --template-file main.bicep \
  --parameters parameters/prod.local.bicepparam

Or Deploy with Terraform (20 min)

cd infra/terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
terraform init
terraform plan -out=tfplan
terraform apply tfplan

Verify (5 min)

az group list --query "[?contains(name, 'yourcompany')].name" -o tsv
az policy assignment list --query "[].displayName" -o tsv
az security pricing list --query "value[?pricingTier=='Standard'].{Name:name, Tier:pricingTier}" -o table

Design philosophy

Three principles guided every decision in SSLZ:

Opinionated over flexible. "It depends" isn't helpful when you have five engineers and no platform team. SSLZ makes the call (two subscriptions, no hub, deny-all NSGs, MCSB in audit mode) and tells you when to revisit.
Reversible over perfect. Every architectural decision is designed to be easy to change later. Moving subscriptions between management groups is a 10-second operation. Adding a hub VNet requires only a new deployment, not changes to existing resources. Policies can move from Audit to Deny on a schedule. Multi-region is a future add-on, not a prerequisite.
Honest about trade-offs. Instead of claiming "enterprise-grade," SSLZ says: "You'll outgrow this when..." and "Here's exactly what it costs to add the next layer." That transparency is what separates it from frameworks that are either overkill for startups or under-engineered for production.

Get involved

SSLZ is open source under the MIT license. The project welcomes contributions, especially real-world configurations from startup CTOs and platform engineers who've battle-tested the patterns.

GitHub: github.com/ricmmartins/sslz
Documentation site: startupscalelanding.zone
Previous post: From Zero to Hero with Azure Landing Zones

If you're a startup building on Azure, give SSLZ a try. Deploy it, break it, and tell us what your real infrastructure looks like, so the next team doesn't have to figure it out from scratch.

Unable to see Azure credits for verified startup business

IanSR — Wed, 04 Mar 2026 21:28:53 GMT

TL;DR: Business verification only completed a few weeks ago (Feb 2026), can't see Azure for Startup credits anywhere on my account, would like to activate my $5000 credits and put them to use building my application which will leverage Azure in a number of ways. Help needed!

Details:

I have an early stage startup, pre-money, pre-product. Building a native Microsoft/Windows 11 application for AI data analysis with cloud hosted private LLMs.

I signed up for a business M365 account back in September 2025 which was fine for my M365 Office Suite access (OneDrive, OneNote, Word, Excel, PowerPoint, SharePoint), however my business verification ran into mysterious obstacles (it is a Delaware C Corp, incorporated in 2025), so I couldn't set up my corporate developer accounts.

Fast forward to February 2026, and I finally push through the business verification steps, and am now working full time on my business (as of late January 2026). I am "in" the Azure for Startups program, but I can't access this "Founders Hub" area that I'm reading about, and I can't see in my Azure Billing/Invoicing area anything other than my corporate credit card for payments -- no evidence of $1000 or $5000 startup credits for Azure.

What am I missing?

What did I do wrong?

Is there any way I can activate these now that I'm actually in a position where I need them, now that I have my business verified with Microsoft?

Thank you in advance for any assistance anyone can provide on this point!

Production-grade API Gateway patterns for Microsoft Foundry

rmmartins — Thu, 29 Jan 2026 19:35:08 GMT

Most startup teams start with the simplest thing that can work. One or two apps call Microsoft Foundry model endpoints directly, traffic is predictable, and “routing” is just a config value in the app.

The gateway pattern becomes necessary when Foundry stops being “an integration” and becomes “a shared platform”. That shift shows up in a few reliable signals:

You do not fully control client code, or updating client configuration is riskier than updating a central routing configuration.
You need blue-green rollouts for model versions or fine-tuned variants without forcing every client to redeploy.
You need server-side retry and circuit breaking semantics to handle throttling and availability without duplicating logic across every app.
You need consistent token governance and usage visibility across multiple apps and consumers.

On Azure, this is commonly implemented with Azure API Management (APIM) using GenAI-aware “AI Gateway” capabilities, and it can be configured from the Foundry portal and applied per project.

What problems a gateway solves

A production gateway in front of Foundry is not about adding a hop. It is about centralizing cross-cutting concerns that otherwise get reimplemented inconsistently:

Stable API surface while deployments and backends evolve.
Consistent auth termination at the gateway, with trust reestablished from the gateway to the model backend (for example with Azure RBAC).
Token-based throttling and quotas for fairness and cost control across consumers.
Operational resiliency via backend pools, priority and weight routing, retry, and circuit breaker behavior that honors throttling signals like Retry-After.
Unified telemetry at the choke point, even when you have multiple underlying instances.

Decoupling clients from backend topology

One secondary but important effect of introducing a gateway is that it shifts backend-specific details out of application code. Clients call a stable API owned by your platform team, while routing, credentials, and failover semantics live behind that boundary. This does not make models interchangeable, and it does not eliminate platform dependencies. What it does is contain them. As backend topology evolves, whether that means new deployments, additional subscriptions, or additional regions, those changes become operational updates rather than coordinated application rewrites.

In practice, this means your platform team owns the API contract and operational semantics, while backend providers remain an implementation detail behind that contract.

Concrete gateway patterns

Choosing the right gateway pattern

The table below summarizes when each pattern is most appropriate, and what trade-offs it introduces.

Pattern	Primary goal	Isolation level	Throughput scaling	Resiliency impact	Operational complexity
Single Foundry, multi-deployment routing	Decouple clients from models and enable safe rollouts	Logical only (same resource boundary)	Limited to single resource quotas	Low to moderate (deployment-level)	Low
Multi-resource, same region, same subscription	Security segmentation, reliability, backend pooling	Resource-level	Not increased for standard tier	Moderate (backend failover)	Medium
Prioritized failover, spillover (PTU → standard)	Cost control and burst protection	Resource-level	Controlled spillover	High (explicit failover semantics)	Medium to high
Multi-subscription, same region	Quota expansion, org boundaries, central AI service	Subscription-level	Scales with number of subscriptions	High	High
Multi-region	Regional resilience, data residency, global access	Region-level	Region-bounded	Very high	High

How to read this table:

If your problem is model lifecycle and client decoupling, start with Pattern 1.
If your problem is reliability and segmentation, Patterns 2 and 3 are the usual next step.
If your problem is quota ceilings or organizational boundaries, Pattern 4 applies.
If your problem is regional resilience or global scale, Pattern 5 becomes unavoidable.

Below are the most common patterns that show up as startups move from “one app calling one deployment” to “multiple products, multiple teams, and production SLOs”.

Pattern 1: Single Foundry resource with multi-deployment routing

When you use it

You run multiple model deployments under one Foundry resource and want to control routing centrally.
You want safer rollouts (blue-green) without forcing client updates.

What it solves

Routing decisions move from clients to a single place.
You can gradually shift traffic between deployments, but you still need safe deployment practices because changing “which model” can be a breaking change from the client’s perspective.

Key operational detail

Strongly consider credential termination and reestablishment. Clients authenticate to the gateway. The gateway authenticates to the model backend via Azure RBAC.

Pattern 2: Multi-resource in the same region and same subscription

When you use it

You need security segmentation boundaries (separate keys or Azure RBAC per client).
You want an easier chargeback model.
You want failover to cover availability issues or operational mistakes, or you want to pair provisioned and standard deployments for spillover.

What it solves

You can treat multiple backends as active-active and load balance across instances.
You centralize retry and circuit-breaker behavior.

Critical constraint

Standard quotas are subscription-level, not instance-level. Load balancing across standard instances in the same subscription does not create additional throughput.
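To see why this constraint matters, here is a rough illustration. It assumes all backends host the same model in the same region and that standard quota is a single tokens-per-minute number per subscription; the 100,000 figure and the backend names are made up.

```python
def effective_standard_quota(backends, quota_per_subscription):
    """Count quota once per subscription, however many instances it contains."""
    subscriptions = {b["subscription"] for b in backends}
    return len(subscriptions) * quota_per_subscription


QUOTA_TPM = 100_000  # hypothetical tokens-per-minute quota per subscription

same_subscription = [
    {"name": "foundry-a", "subscription": "sub-1"},
    {"name": "foundry-b", "subscription": "sub-1"},
    {"name": "foundry-c", "subscription": "sub-1"},
]
two_subscriptions = [
    {"name": "foundry-a", "subscription": "sub-1"},
    {"name": "foundry-b", "subscription": "sub-2"},
]

print(effective_standard_quota(same_subscription, QUOTA_TPM))
print(effective_standard_quota(two_subscriptions, QUOTA_TPM))
```

Three instances in one subscription buy resilience and segmentation, not throughput. Extra throughput only appears once a second subscription is involved, which is what Pattern 4 is about.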

Pattern 3: Prioritized failover and planned spillover (PTU first, consumption fallback)

This is the pattern you reach for when you want to maximize utilization of dedicated capacity and still survive bursts and outages.

The AI Gateway workshop describes a “Prioritized PTU with Fallback Consumption” approach using APIM backend pools with priority and weight-based routing, combined with circuit breaker rules and retries for 429 and selected 503 cases.

Concrete implementation details from the workshop that are worth copying into your playbook:

Configure a backend pool across multiple endpoints.
Add a circuit breaker rule that can trip on throttling (429) and accept Retry-After.
Use APIM policy to authenticate with managed identity and set the backend to the pool, then retry on 429 or specific 503 conditions.

This moves “resiliency logic” out of every client and into one place you can test and iterate.
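The selection logic behind a prioritized pool can be sketched in a few lines. This is not APIM policy and not the workshop's code, only a Python model of the idea: try the best-priority group first, split traffic by weight inside a group, and skip any backend whose circuit is open. The Backend class and function names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    priority: int            # lower number is tried first
    weight: int = 1          # share of traffic within the same priority
    open_until: float = 0.0  # circuit is open (backend skipped) until this time


def trip(backend, now, retry_after_seconds):
    """Open the circuit for as long as the backend asked callers to wait."""
    backend.open_until = now + retry_after_seconds


def pick_backend(pool, now, rng=random):
    """Pick from the best-priority group that still has a closed circuit."""
    available = [b for b in pool if b.open_until <= now]
    if not available:
        return None  # every circuit is open, the caller should fail fast
    best = min(b.priority for b in available)
    group = [b for b in available if b.priority == best]
    return rng.choices(group, weights=[b.weight for b in group], k=1)[0]


pool = [Backend("ptu", priority=1), Backend("standard", priority=2)]
print(pick_backend(pool, now=0).name)         # provisioned capacity first
trip(pool[0], now=0, retry_after_seconds=30)  # PTU returned 429, Retry-After: 30
print(pick_backend(pool, now=10).name)        # spill over to standard
print(pick_backend(pool, now=31).name)        # circuit closed again
```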

Pattern 4: Multi-subscription, same region (quota scaling and centralized service)

When you use it

You need more quota in standard deployments but must constrain models to a single region.
You are building a centralized “Microsoft Foundry as a service” model. Standard quota is subscription-bound, so capacity pooling often implies multiple subscriptions.

Implementation tips from the Azure Architecture Center guide

Prefer subscriptions backed by the same Microsoft Entra tenant for consistency in Azure RBAC and Azure Policy.
Deploy the gateway in the same region as the backends.
Consider a dedicated gateway subscription.
Ensure private endpoints are reachable across subscriptions, including cross-subscription Private Link where supported.

Pattern 5: Multi-region

When you use it

You need a service availability failover strategy (for example cross-region pairs).
You have data residency and compliance requirements.
You face mixed model availability across regions.

The Azure Architecture Center guide calls out that for business-critical architectures that must survive a complete regional outage, a global unified gateway helps eliminate failover logic from client code. It also notes the trade-offs of a single-region gateway deployment that does active-active load balancing across regions, including added latency and egress charges for cross-region calls.

Real-world scenarios this architecture supports

These are representative scenarios drawn from common production environments and directly supported by the gateway patterns and reference implementations.

Scenario A: Containing a runaway application

A company has five internal applications sharing the same Foundry environment. One application ships a prompt regression that suddenly multiplies average request size by 8x.

Without a gateway:

Token consumption spikes globally.
Other apps experience 429s and degraded latency.
Root cause takes time to identify because telemetry is scattered.

With an AI Gateway in front of Foundry:

Token-based limits are enforced per application.
The faulty app is throttled at the gateway.
Other applications continue operating normally.
The gateway telemetry immediately shows which consumer is exhausting the quota.

Outcome:

Incident blast radius is limited to one consumer.
No global outage.
Faster root cause isolation.

Scenario B: Zero-downtime model migration

A startup is migrating from one production deployment to a newer model version.

They deploy the new model alongside the old one and configure the gateway to:

Route 5 percent of traffic to the new deployment.
Keep 95 percent on the old deployment.

They observe:

Error rate.
Latency.
Token growth.

Over several days they progressively shift traffic to 100 percent without requiring any client changes.

Outcome:

No forced redeployments.
No mass client reconfiguration.
Rollback is a gateway configuration change, not an emergency code change.
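The 5/95 split in this scenario is ordinary weighted routing. The sketch below simulates it in Python to show that shifting traffic, or rolling back, is only a change to a weights table. The deployment names are placeholders, and a real gateway would express the weights in its backend pool configuration rather than in application code.

```python
import random


def choose_deployment(weights, rng):
    """Pick a deployment name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights, requests, seed=0):
    """Route `requests` simulated calls and count where they land."""
    rng = random.Random(seed)
    counts = dict.fromkeys(weights, 0)
    for _ in range(requests):
        counts[choose_deployment(weights, rng)] += 1
    return counts


canary = {"model-old": 95, "model-new": 5}    # hypothetical deployment names
rollback = {"model-old": 100, "model-new": 0}

canary_counts = simulate(canary, 10_000)
rollback_counts = simulate(rollback, 1_000)
print(canary_counts)
print(rollback_counts)
```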

Scenario C: Cost-controlled burst handling

A product runs steady baseline traffic on provisioned capacity and experiences unpredictable spikes.

Gateway configuration:

Priority backend pool.
Provisioned deployment as primary.
Standard deployment as secondary.
Circuit breaker honors Retry-After.

Normal operation:

Nearly all traffic hits provisioned throughput.

During spikes:

Overflow is routed to standard tier.
The gateway absorbs throttling behavior and retries.

Outcome:

Provisioned capacity is fully utilized.
Spikes are handled without hard failures.
Clients are unaware that backend routing changed.

Scenario D: Subscription quota pooling

An organization reaches standard tier quota ceilings in a single subscription.

They deploy Foundry resources across multiple subscriptions and place a single gateway in front.

Gateway behavior:

Distributes requests across subscriptions.
Applies unified token governance.
Exposes one API endpoint to all internal teams.

Outcome:

Aggregate usable quota increases.
Organizational boundaries are preserved.
Clients remain unaware of backend topology.

Operational playbook

This is the part that separates “it works” from “it survives production”.

1. Authentication strategy

Recommended default

Terminate client auth at the gateway.
Reestablish gateway-to-backend authorization via Azure RBAC rather than passing through client secrets.

The AI Gateway workshop provides a concrete example using authentication-managed-identity and setting the Authorization header for the backend call.

Guardrail

If you choose pass-through client credentials, ensure clients cannot bypass the gateway or model restrictions.

2. Token throttling and fairness

You want limits that match how LLMs consume capacity and budget.

APIM GenAI capabilities emphasize controlled token limits and monitoring for cost efficiency.
Foundry AI Gateway governance scenarios explicitly include configuring token limits for models at the project level.

Use token throttling as your primary fairness control, then layer request-rate limits if needed.
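As a mental model for per-consumer token limits, consider the following fixed-window limiter. It is a simplified sketch, not how APIM implements its token limit policy: the class name, the one-minute window, and the consumer identifiers are assumptions, and old windows are never pruned.

```python
from collections import defaultdict


class TokenRateLimiter:
    """Fixed one-minute windows of token budget, tracked per consumer."""

    def __init__(self, tokens_per_minute):
        self.tokens_per_minute = tokens_per_minute
        # (consumer, minute window) -> tokens spent; never pruned in this sketch.
        self._used = defaultdict(int)

    def allow(self, consumer, tokens, now):
        """Record the tokens and return True if the consumer stays within budget."""
        window = (consumer, int(now // 60))
        if self._used[window] + tokens > self.tokens_per_minute:
            return False  # a gateway would answer 429 for this consumer only
        self._used[window] += tokens
        return True


limiter = TokenRateLimiter(tokens_per_minute=1000)
print(limiter.allow("app-a", 800, now=0))   # within budget
print(limiter.allow("app-a", 300, now=10))  # app-a is throttled
print(limiter.allow("app-b", 300, now=10))  # app-b is unaffected
```

Keying the budget by consumer is what contains a runaway application: one app exhausting its window does not consume anyone else's.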

3. Failover semantics

Two rules that prevent most “self-inflicted outages”:

Honor Retry-After from the backend when implementing failover and circuit breaker behavior. Do not continuously hit a throttled endpoint returning 429.
Prefer gateway-side retry and circuit breaking to avoid repeated code and to keep one place to test.

The workshop shows a pragmatic retry condition on 429 and selected 503, combined with backend pool routing and a circuit breaker that can trip on 429 while checking Retry-After.
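Honoring Retry-After mostly comes down to parsing the header correctly and waiting that long before the next attempt. The sketch below shows one way to do that in Python. It treats every 503 as retryable, which is broader than the "selected 503" conditions mentioned above, and the send callable and default wait are assumptions made for the example.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

RETRYABLE = (429, 503)  # simplification: the workshop retries only selected 503s


def retry_after_seconds(header_value, now=None, default=1.0):
    """Convert a Retry-After header (delta-seconds or HTTP-date) into seconds."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header, fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call `send()` until it succeeds, waiting as long as the backend asks."""
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status not in RETRYABLE or attempt == max_attempts:
            return status
        sleep(retry_after_seconds(headers.get("Retry-After")))
```

Here send is any callable that returns a (status, headers) pair, and sleep is injectable so the wait can be observed in tests without actually pausing.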

4. Observability and consumption tracking

A gateway is uniquely positioned to publish telemetry across all consumed models to a single store, which makes unified dashboarding and alerting easier.

APIM’s GenAI positioning highlights token monitoring as part of “cost efficiency”.
The workshop navigation includes model monitoring and consumption tracking as first-class steps in the AI Gateway journey.

Operationally, decide up front what you will dimension your telemetry by (project, tenant, application, environment) and enforce those identifiers at the gateway.

5. APIOps: Treat gateway configuration as code

Even if you configure the first version in the portal, production systems need repeatability:

Use a code-driven workflow for policies and configuration so routing and governance changes are reviewed and promoted like any other production change.
If you adopt a federated model, APIM Workspaces are positioned to help organizations manage APIs more productively and securely.
Keep an eye on the APIM changelog and GenAI feature updates because gateway capabilities are evolving quickly.

When not to add a gateway

The Architecture Center guide is explicit: If controlling client configuration is as easy as controlling gateway routing, the added reliability, security, cost, maintenance, and performance impact might not be worth it.

Also, if you are using a single instance with multiple deployments primarily to simulate identity segmentation, you might be better served by multiple instances with distinct Azure RBAC boundaries instead of pushing that complexity into gateway logic.

Closing thought

A gateway is not a prerequisite for Foundry. It is an operational maturity step.

When Foundry usage becomes multi-tenant, SLO-driven, and quota-sensitive, the gateway stops being “extra architecture” and becomes the place you express your platform intent. Auth boundaries. Token governance. Failover semantics. Telemetry. And a repeatable APIOps process to keep it all sane as the system evolves.

When and why startups add a gateway in front of Microsoft Foundry

rmmartins — Tue, 27 Jan 2026 03:24:31 GMT

Note: This post focuses on when and why startups begin adopting a gateway in front of Microsoft Foundry. In a follow-up article, we’ll go into a technical deep dive, covering design decisions, operational tradeoffs, latency considerations, observability, and patterns used in production-scale environments.

Most teams don’t hit scaling challenges with Microsoft Foundry on day one.

Early on, things are simple. One or two applications call Foundry directly. Traffic is predictable. Model experimentation moves fast. Everything works, and there’s no reason to add extra layers.

Then adoption grows. More applications start calling the same models. Traffic becomes spiky. Teams want better visibility into usage. Questions about rate limits, authentication, and how to evolve models over time begin to surface.

This is usually the moment when teams start asking: “Do we need some kind of control layer in front of Foundry?”

The signals that start to show up

Across many startups, the same patterns tend to emerge as Foundry usage scales:

Multiple clients and services calling the same Foundry endpoints
The need for consistent rate limiting and access control
A desire to evolve models or deployments without touching every client
Limited visibility into who is calling what, and how often

None of these are problems at small scale. But together, they create friction as usage grows.

A pattern we often see working well

A common pattern at this stage is placing a gateway in front of Microsoft Foundry APIs.

Client applications call a single gateway endpoint, where policies such as authentication, rate limits, and routing are applied before requests are forwarded to Foundry model deployments.

Rather than having every application talk directly to Foundry, teams introduce a control layer that sits between clients and Foundry.

On Azure, this is often implemented using API Management with GenAI capabilities.

This gateway does not replace Foundry. Foundry remains the model and AI platform. The gateway simply becomes the entry point for client traffic.

What this enables in practice

When teams introduce a gateway layer, a few things become much easier:

A single, stable API surface for applications, even as models or deployments evolve
Centralized throttling and authentication, instead of per-client logic
Policy-based routing across models or backends without changing clients
Improved request-level observability into usage patterns, latency, and errors

Importantly, this structure lets teams scale without slowing down experimentation. Model teams can continue to iterate, while platform concerns stay centralized.

What this pattern is not

It’s worth calling out what this approach is not:

It’s not required on day one
It’s not mandatory for every startup
It’s not about adding complexity early

Many teams run successfully without a gateway for a long time. This pattern becomes useful when scale, team size, or operational needs make direct integrations harder to manage.

When teams usually consider this

From experience, teams tend to explore this pattern when:

Foundry usage spans multiple applications or teams
Rate limits and quotas need consistent enforcement
There’s a desire to future-proof model or deployment changes
Observability and governance start to matter more

If those conversations are already happening, it’s often a good time to look at a gateway approach.

How this looks on Azure

On Azure, this pattern is commonly implemented using:

Azure API Management as the gateway
AI-aware policies for rate limiting, routing, and governance
Microsoft Foundry as the backend model platform

The architecture stays flexible. Teams can start simple and add capabilities over time as needs evolve.

Closing thoughts

This pattern is less about tooling and more about timing.

Adding a gateway too early can slow teams down. Adding it too late can make change painful. The right moment is usually when Foundry usage starts to feel like a shared platform rather than a single experiment.

For teams approaching that stage, a gateway can provide structure without taking away speed.

Founders Hub billing issue - stuck between Microsoft and ISV with no resolution path

Playpals — Thu, 15 Jan 2026 19:39:05 GMT

Hi everyone,

I'm reaching out to the community because I've hit a wall with standard support channels and hoping someone from the Founders Hub team or other founders who've faced similar issues can help.

QUICK SUMMARY

I'm a Founders Hub participant who got charged ~€1,000 for using Claude (Anthropic) models through Azure AI Foundry. I assumed these were covered by my $25k Sponsorship credits since:

• I used ai.azure.com (not a separate marketplace)

• There was no clear warning about separate billing

• There's no way to monitor Marketplace spending in Founders Hub

THE PROBLEM:

I'm stuck in a loop:

• Microsoft says "contact the ISV for refund approval"

• Anthropic says "billing is handled by Microsoft, not us"

• The "Support" link Microsoft points to redirects back to Microsoft Support

I've sent detailed emails explaining this circular situation. Support keeps responding with the same copy-paste policy text, ignoring my request for escalation.

MY ASK

1. Is there a dedicated support channel for Founders Hub billing issues?

2. Has anyone from the Founders Hub team dealt with similar Marketplace confusion before?

3. Any founders here who successfully resolved unexpected Marketplace charges?

I'm not trying to get something for free - I already paid the first invoice (€460). I just want help with the second invoice that surprised me a month later.

The Founders Hub program has been great otherwise, but this experience has been really frustrating. Any pointers would be hugely appreciated.

Thanks!

Bartek

Azure has three permission systems, and you're probably confusing them

rmmartins — Wed, 06 May 2026 20:10:11 GMT

Series: Azure Governance for Digital Natives and Startups: This is Part 1 of a 3-part series on Azure governance for digital-native companies scaling on Azure.

New to Azure's identity and subscription model? This post assumes you already know how tenants, subscriptions, and Entra ID fit together. If that's fuzzy, read Demystifying Microsoft Entra ID, Tenants and Azure Subscriptions. That post covers the architecture; this one covers the permissions that live inside it.

Part 1 (this post): The three-plane model: Identity, Resources, and Billing

Part 2: Marketplace, Managed Identity, and where the planes collide

Part 3: Anti-patterns, role structures, and the 10 principles of Azure governance

Azure is a powerful cloud platform, but its governance model is widely misunderstood, especially in fast-moving, engineering-led organizations.

After working with dozens of digital-native customers (AI startups, SaaS platforms, companies scaling from zero to millions in Azure spend), I've seen the same confusion play out over and over. Engineers can't see MACC credits. Finance can't see workloads. Global Admins think they own everything. And Marketplace purchases happen without anyone in Finance knowing.

The root cause is always the same: Azure is governed by three completely separate permission systems, and most teams treat it like one.

If you're a customer moving fast on Azure, you've likely heard these questions:

"Why can't my engineering Owner see MACC credits?"
"Why can't a Billing Contributor deploy a VM?"
"Why doesn't Global Admin let me access subscriptions?"
"Why can a Contributor deploy AKS but not buy Snowflake?"
"Why does Cost Management Reader show cost but not credit balance?"

These questions come up with nearly every customer I work with: AI companies consuming Azure OpenAI at scale, SaaS companies running global AKS footprints, and digital natives under Microsoft Azure Consumption Commitments (MACC).

This guide breaks down the entire model with practical patterns and deep insight into each plane — so these questions are never confusing again.

Why digital natives struggle with this

Before diving into the technical model, it's worth understanding why this causes so much friction in digital-native companies specifically. These problems hit startups and scaling companies harder than traditional enterprises for three reasons:

Speed over governance. Engineering-led companies prioritize shipping over process. Governance is added retroactively, often after something goes wrong.
Flat org structures. Without clear Platform, Finance, and Security functions, the same people end up with roles across multiple planes, creating exactly the kind of role sprawl the three-plane model was designed to prevent.
MACC commitments. Digital natives under MACC have a financial relationship with Azure that most team members don't even know exists. When engineers can't see MACC burn and finance can't see resource usage, nobody has the full picture.

The result is predictable:

Role	What They Expect	What They Actually Get
Engineers	"I'm Owner, I should see everything, including billing"	RBAC gives full resource control but zero billing visibility
Finance	"I need to see what's running so I can forecast"	Billing Reader shows credits and invoices but not workloads
Security	"I'm Global Admin, I have total control"	Entra controls identity but not resources or billing
Procurement	"I need to buy Marketplace software for the team"	Marketplace purchases require billing roles, not RBAC
Leadership	"I want a single dashboard for cost, resources, and credits"	No single role spans all three planes; you need a combination

When these expectations go unaddressed: engineers get billing access "just to see costs" (creating financial risk), Marketplace purchases happen without finance oversight, and Global Admin is treated as the "master key" when it controls only one of three planes.

The fix isn't more permissions. It's the right permissions in the right plane for the right people.

The three-plane model

Everything in Azure governance flows from this single truth:

Plane	Controls	Example Roles	See Billing?	Deploy Resources?	Manage Identity?
Microsoft Entra (Identity)	Users, groups, MFA, PIM, Conditional Access	Global Admin, Groups Admin, PIM Admin	❌ No	❌ No	✅ Yes
Azure RBAC (Resources)	VMs, AKS, Storage, AOAI, networking, policies	Owner, Contributor, Reader	❌ No	✅ Yes	❌ No
Billing / Commerce (Financial)	Invoices, credits, MACC, payments, Marketplace purchases	Billing Owner, Billing Reader	✅ Yes	❌ No	❌ No

Three planes. Zero overlap. A role in one plane grants zero access in the others.

Entra Global Admin can't access subscriptions.
Subscription Owner can't see the MACC balance.
Billing Account Owner can't deploy resources.

This separation is by design. Once your company internalizes it, governance becomes dramatically more predictable.
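
To make the "zero overlap" rule concrete, here is a minimal Python sketch that transcribes the table above into a lookup. The names `PLANES`, `plane_of`, and `plane_covers` are hypothetical helpers for illustration, not an Azure API, and the check is only an upper bound: it tells you whether a capability lives in the same plane as a role, not whether that specific role actually grants it (Reader, for example, sits in the resource plane but cannot deploy).

```python
# Illustrative model of the three planes, transcribed from the table above.
# PLANES, plane_of and plane_covers are hypothetical names, not an Azure API.
PLANES = {
    "identity": {
        "roles": {"Global Admin", "Groups Admin", "PIM Admin"},
        "capabilities": {"manage_identity"},
    },
    "resources": {
        "roles": {"Owner", "Contributor", "Reader"},
        "capabilities": {"deploy_resources"},
    },
    "billing": {
        "roles": {"Billing Owner", "Billing Reader"},
        "capabilities": {"see_billing"},
    },
}


def plane_of(role: str) -> str:
    """Return the plane a role belongs to."""
    for plane, spec in PLANES.items():
        if role in spec["roles"]:
            return plane
    raise KeyError(f"unknown role: {role}")


def plane_covers(role: str, capability: str) -> bool:
    """True if the capability lives in the same plane as the role."""
    return capability in PLANES[plane_of(role)]["capabilities"]


if __name__ == "__main__":
    for role, capability in [
        ("Global Admin", "deploy_resources"),
        ("Owner", "see_billing"),
        ("Billing Owner", "deploy_resources"),
    ]:
        print(f"{role} -> {capability}: {plane_covers(role, capability)}")
```

Each of the three checks in the usage block comes back False, which is the same point the three statements above make in prose.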

Plane 1: Microsoft Entra (Identity Plane)

Security, authentication, authorization, administrative boundaries.

Microsoft Entra (formerly Azure AD) is the authoritative identity provider for Azure. It governs identity, authentication, Conditional Access, PIM (Privileged Identity Management), group membership, and tenant-wide administrative policies. Entra is the security boundary for the entire tenant.

💡 Common misunderstanding: "I'm Global Admin, why can't I access subscriptions?"

Because Entra roles do not grant Azure RBAC permissions by default. This behavior is intentional and foundational. A compromised Global Admin cannot delete all subscriptions. A compromised Subscription Owner cannot compromise directory security. Identity and infrastructure operate independently for resiliency.

What Entra roles can do

Create and manage users
Manage MFA & Conditional Access
Approve PIM requests
Manage security settings
Create/assign groups (which can then hold RBAC roles)
Manage enterprise applications, OIDC apps, etc.

What Entra roles cannot do

Action	Allowed?
Deploy resources	❌ No
Access subscriptions	❌ No
View MACC credits	❌ No
Make Marketplace purchases	❌ No
Modify billing profiles	❌ No
Change RBAC roles	❌ No
Access data or storage accounts	❌ No

Most relevant Entra roles for startups

Entra Role	Purpose
Global Administrator	Full directory control (identity, security)
Privileged Role Administrator	Manages privileged role assignments
Groups Administrator	Creates and manages groups (often used for RBAC assignments)
Conditional Access Administrator	Manages CA policies
Authentication Administrator	Controls authentication settings
Security Administrator	Manages security policies and alerts

Key insight: Entra governs identity and security, not cloud resources or billing. Because Entra manages groups, and groups are often used for RBAC assignments, Entra is the root of who can be given access, but not what access they have. This is where many organizations misunderstand the boundary.

Plane 2: Azure RBAC (Resource Plane)

Everything engineering touches: workloads, clusters, deployments, pipelines, resources.

Azure RBAC is the backbone of the Azure operational model. It controls all deployments (IaC, CLI, Portal, API), resource creation and modification, monitoring and diagnostics, Key Vault, Storage, Networking, AKS cluster operations, Azure OpenAI deployments, and everything else under Azure Resource Manager (ARM).

RBAC scopes

RBAC can be assigned at: Tenant root → Management group → Subscription → Resource group → Individual resource → Sub-resource (e.g., Key Vault secret).
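
A role assigned at a broader scope is inherited by everything beneath it. For subscription, resource group, and resource scopes, that hierarchy is visible in the resource ID itself, so inheritance can be sketched as a segment-by-segment prefix check. This is a minimal illustration, not how Azure evaluates access: `scope_applies` is a hypothetical helper, the IDs are placeholders, and management group scopes are not covered because they are not expressed as a path prefix of the subscription ID.

```python
def _segments(path: str) -> list[str]:
    """Split an ARM-style ID into lowercase path segments (IDs are case-insensitive)."""
    return [part.lower() for part in path.split("/") if part]


def scope_applies(assignment_scope: str, resource_id: str) -> bool:
    """True if a role assigned at assignment_scope is inherited by resource_id.

    Compares whole path segments, so 'rg' does not match 'rg-app'.
    """
    parent = _segments(assignment_scope)
    child = _segments(resource_id)
    return child[: len(parent)] == parent


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    sub = "/subscriptions/00000000-0000-0000-0000-000000000000"
    rg = f"{sub}/resourceGroups/rg-app"
    vm = f"{rg}/providers/Microsoft.Compute/virtualMachines/vm1"

    print(scope_applies(sub, vm))  # assignment at subscription reaches the VM
    print(scope_applies(vm, rg))   # assignment on the VM does not reach the group
```

The practical consequence is that a Contributor assignment at subscription scope quietly covers every resource group created later, which is why scoping assignments as narrowly as the workload allows matters.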

RBAC role behavior

Role	Can Deploy?	Can View Usage Cost?	Can View Billing/MACC?
Owner	✅ Yes	✅ Yes	❌ No
Contributor	✅ Yes	✅ Yes	❌ No
Reader	❌ No	✅ Yes (limited)	❌ No
Cost Management Reader	❌ No	✅ Yes	❌ No
User Access Admin	❌ No	❌ No	❌ No

The critical point: RBAC cannot see billing. RBAC cannot view MACC. RBAC cannot read invoices. RBAC cannot approve Marketplace purchases. Even Owner, the highest role in the resource plane, is blind to billing.

Plane 3: Azure Billing/Commerce (Financial Plane)

Governed by the Microsoft Commerce Platform, not Azure Resource Manager.

This plane governs the financial relationship between the customer and Microsoft: billing accounts, invoices, credits (MACC, Azure credits, grants), commitments, discounts, payment methods, invoice sections, Marketplace SaaS purchases, reservations & savings plans, and private offers. Commerce roles live in an entirely different system from RBAC.

Common billing roles

Role	Can see credits?	Can deploy?	Notes
Billing Account Owner	✅ Yes	❌ No	Full financial authority
Billing Contributor	✅ Yes	❌ No	Can update payment methods
Billing Reader	✅ Yes	❌ No	Most finance teams use this
Invoice Section Owner	✅ Yes	❌ No	Scoped financial management

What billing roles can see: MACC balance, credits, invoices, payment history, reservations & savings plans (financial side), and Marketplace purchase capabilities.

What billing roles cannot do: deploy anything, modify RBAC, access resources, see workloads, change policy, or access cost analysis at resource group level.

Billing is where MACC lives. MACC (Microsoft Azure Consumption Commitment) visibility is tied to Billing Account Owner, Billing Account Contributor, and Billing Reader. Even a subscription Owner cannot see MACC burn rate. This single point causes confusion in almost every startup onboarding Azure.

Full comparison matrix

When you need to answer "who can see what?", use this table:

Data type	System	Who can see it
Resource usage cost	ARM (RBAC)	Cost Mgmt Reader, Owner, Contributor
Resource inventory	ARM (RBAC)	Owner, Contributor, Reader
Budgets & cost alerts	ARM (RBAC)	Owner, Contributor, Cost Mgmt Reader
Azure OpenAI cost analysis	ARM (RBAC)	RBAC roles
MACC credit balance	Commerce Platform	Billing roles only
Invoices & payments	Commerce Platform	Billing roles only
Marketplace private offers	Commerce Platform	Billing roles only
Commercial discounts	Commerce Platform	Billing roles only

💡 If your engineering lead says "I can see costs" and your CFO says "I can see costs", they are looking at different data from different systems. Both are right. Neither has the full picture.
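
The matrix above can also be read as a lookup keyed by role. The sketch below is a hedged illustration: `VISIBILITY`, `BILLING_ROLES`, and `visible_to` are hypothetical names, the billing role list is taken from the "Common billing roles" table, and the Azure OpenAI row is left out because the matrix only says "RBAC roles" without naming them.

```python
# Billing roles as listed in the "Common billing roles" table.
BILLING_ROLES = {
    "Billing Account Owner",
    "Billing Contributor",
    "Billing Reader",
    "Invoice Section Owner",
}

# data type -> (system, roles that can see it), transcribed from the matrix.
VISIBILITY = {
    "Resource usage cost": ("ARM (RBAC)", {"Cost Mgmt Reader", "Owner", "Contributor"}),
    "Resource inventory": ("ARM (RBAC)", {"Owner", "Contributor", "Reader"}),
    "Budgets & cost alerts": ("ARM (RBAC)", {"Owner", "Contributor", "Cost Mgmt Reader"}),
    "MACC credit balance": ("Commerce Platform", BILLING_ROLES),
    "Invoices & payments": ("Commerce Platform", BILLING_ROLES),
    "Marketplace private offers": ("Commerce Platform", BILLING_ROLES),
    "Commercial discounts": ("Commerce Platform", BILLING_ROLES),
}


def visible_to(roles: set[str]) -> set[str]:
    """Return the data types that at least one of the given roles can see."""
    return {
        data_type
        for data_type, (_system, allowed) in VISIBILITY.items()
        if roles & allowed
    }


if __name__ == "__main__":
    engineering_lead = visible_to({"Owner"})
    cfo = visible_to({"Billing Reader"})
    print(sorted(engineering_lead))
    print(sorted(cfo))
    print("overlap:", engineering_lead & cfo)
```

In this model the engineering lead and the CFO have no data type in common, and only a person or group holding roles from both systems sees every row.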

The #1 source of confusion: Cost Management Reader vs. Billing Reader

This is the single most frequent misunderstanding in Azure governance. These two roles sound similar. They are completely different systems.

Cost Management Reader (RBAC Plane)

Can see: usage-based resource cost, cost by tags, cost by resource, cost forecast, budgets & alerts.

Cannot see: credits, invoices, payments, MACC, private offers, or contract terms.

Billing Reader (Commerce Plane)

Can see: invoices, credits, payments, MACC balance, Marketplace transaction history.

Cannot see: resource-level cost breakdown, cost by tags, subscription usage trends, or resource inventory.

Data type	Where it lives	Who can see it
Resource usage cost	Azure Cost Management (ARM)	Cost Mgmt Reader, Owner, Contributor
Budgets & cost alerts	ARM	Owner, Contributor, Cost Mgmt Reader
MACC credit balance	Commerce Platform	Billing roles only
Invoices	Commerce Platform	Billing roles only
Marketplace private offers	Commerce Platform	Billing roles only
Commercial discounts	Commerce Platform	Billing roles only

Cost visibility (usage-based cost) comes from RBAC. Billing visibility (credits, invoices, MACC) comes from Commerce. These are two completely different datasets. When you understand this distinction, half of the "why can't I see…?" questions answer themselves.

Quick start: where to set this up

Here's exactly where each plane is configured, in the Portal and via CLI.

Microsoft Entra (Identity Plane)

Portal: Azure Portal → Microsoft Entra ID → Roles and administrators

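# List activated Entra directory roles
az rest --method GET --url "https://graph.microsoft.com/v1.0/directoryRoles"

# Assign a directory role to a user at tenant-wide scope ("/")
az rest --method POST \
  --url "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignments" \
  --headers "Content-Type=application/json" \
  --body '{"principalId":"<user-object-id>","roleDefinitionId":"<role-definition-id>","directoryScopeId":"/"}'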

Azure RBAC (Resource Plane)

Portal: Subscription → Access Control (IAM) → Add role assignment

# Assign Contributor at subscription scope
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Contributor" \
  --scope "/subscriptions/{subscription-id}"

# Assign Cost Management Reader at resource group scope
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Cost Management Reader" \
  --scope "/subscriptions/{sub-id}/resourceGroups/{rg-name}"

Azure Billing/Commerce (Financial Plane)

Portal: Azure Portal → Cost Management + Billing → Billing scopes → select billing account → Access control (IAM)

# List billing accounts
az billing account list --output table

# Assign Billing Reader via REST API
az rest --method PUT \
  --url "https://management.azure.com/providers/Microsoft.Billing/billingAccounts/{billing-account-id}/billingRoleAssignments/{id}?api-version=2024-04-01" \
  --body '{"properties":{"principalId":"{user-object-id}","roleDefinitionId":"/providers/Microsoft.Billing/billingAccounts/{billing-account-id}/billingRoleDefinitions/{billing-reader-role-id}"}}'

What's next → This post established the foundation: Azure's three permission planes are separate by design. But the real complexity begins where these planes intersect.

In Part 2, we'll explore Marketplace governance, where resource deployment meets financial authority, along with Managed Identity, the one construct that bridges two planes, and ABAC for advanced conditional governance.

Azure capacity planning: Using quotas, reservations, VMSS Instance Mix, and Compute Fleet

rmmartins — Thu, 06 Nov 2025 16:47:25 GMT

Introduction

Over the past few months, I’ve been helping several digital-native customers navigate capacity constraints while scaling AI and compute-intensive workloads on Azure. Many teams run into the same frustrating message:

“SkuNotAvailable – The requested size is currently not available in the location.”

This post summarizes the strategy I’ve been using to help customers design around these challenges, combining Quota Groups, Capacity Reservations (ODCR), VMSS Instance Mix, and Compute Fleet. These tools don’t create capacity where none exists, but together, when paired with proactive alerts, they form a practical playbook for scaling reliably through regional constraints.

Quota vs. Capacity: What’s the difference?

Concept	What It Is	Who Controls It	Can You Fix It Yourself?
Quota	A logical limit on how many vCPUs or specific VM series you can deploy.	Microsoft (adjustable on request).	✅ Yes, request an increase.
Capacity	The physical availability of hardware in the datacenter.	Azure datacenter (supply and utilization).	❌ No, if no servers exist, no deployment will succeed.

Example: You have 300 vCPUs of quota for the D-series in East US 2. You try to deploy 30 D8as_v5 VMs (240 vCPUs, comfortably within quota) and get a failure. You open a support request and find:

Your quota is fine
But the region has no physical capacity for D8as_v5

Even if Microsoft raised your quota to 1,000 vCPUs, the deployment would still fail because quota ≠ capacity.

Quota issue: You’ll see errors like OperationNotAllowed or QuotaExceeded.
Capacity issue: The message will be SkuNotAvailable or AllocationFailed.

If you see a quota error, open the Usage + quotas blade and request an increase. If it’s a capacity error, switching zones, SKUs, or regions, or using VMSS Instance Mix or Compute Fleet is your best next step.
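
The triage rule above is simple enough to encode in a deployment script. Here is a minimal Python sketch that uses only the four error codes named in this post; `classify` and `NEXT_STEP` are hypothetical names, and the mapping is a first-pass heuristic rather than an exhaustive list, since a code such as OperationNotAllowed can also be returned for reasons other than quota.

```python
# Error codes named in this post, grouped by failure type.
QUOTA_CODES = {"OperationNotAllowed", "QuotaExceeded"}
CAPACITY_CODES = {"SkuNotAvailable", "AllocationFailed"}

# Suggested next step per failure type, paraphrasing the guidance above.
NEXT_STEP = {
    "quota": "Open Usage + quotas and request an increase.",
    "capacity": "Try another zone, SKU, or region, or use VMSS Instance Mix or Compute Fleet.",
    "unknown": "Inspect the full error message before retrying.",
}


def classify(error_code: str) -> str:
    """Map a deployment error code to 'quota', 'capacity', or 'unknown'."""
    if error_code in QUOTA_CODES:
        return "quota"
    if error_code in CAPACITY_CODES:
        return "capacity"
    return "unknown"


if __name__ == "__main__":
    for code in ("QuotaExceeded", "SkuNotAvailable", "Conflict"):
        kind = classify(code)
        print(f"{code}: {kind} -> {NEXT_STEP[kind]}")
```

Keeping the two sets separate makes it harder to answer a capacity failure with a quota request, which is the mistake the example above describes.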

“Quota is a number on paper. Capacity is what’s physically sitting in the racks.”

Strategy 1: Quota management and Quota Groups

Azure applies vCPU quotas by region and VM family (e.g., Dsv5, Esv5). Quota Groups provide a consolidated way to monitor and manage these logical limits across families.

Quota limits are easy to overlook until automation or scale pipelines fail. AI-heavy startups often discover too late that they’ve maxed out their quota family.

Best practices:

Monitor with Quota Group alerts: Use Quota Alerts (preview) to automatically notify you when usage reaches thresholds (e.g., 80%). Alerts integrate with Azure Monitor and Action Groups.
Request increases proactively: Portal path: Subscriptions → Usage + quotas → Request increase. Most CPU SKUs are approved quickly; GPUs can take longer.
Plan by family, not by SKU: If you only check “D8as_v5 usage,” you may miss that the entire D-series family is at its quota limit.
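
To illustrate why family-level tracking matters, here is a small Python sketch that sums vCPUs per family and flags any family at or above a threshold. Everything in it is an assumption for illustration: the family labels, the quota numbers, the deployment counts, and the helper name `families_near_limit`. It is local arithmetic, not a call to any Azure quota API.

```python
from collections import defaultdict


def families_near_limit(deployments, quota_by_family, threshold=0.8):
    """Return {family: usage_ratio} for families at or above the threshold.

    deployments: iterable of (family, sku, vcpus_per_vm, vm_count) tuples.
    quota_by_family: {family: vCPU quota}.
    """
    used = defaultdict(int)
    for family, _sku, vcpus_per_vm, vm_count in deployments:
        used[family] += vcpus_per_vm * vm_count

    flagged = {}
    for family, vcpus in used.items():
        ratio = vcpus / quota_by_family[family]
        if ratio >= threshold:
            flagged[family] = ratio
    return flagged


if __name__ == "__main__":
    # Hypothetical inventory: two D-series sizes share one family quota.
    deployments = [
        ("Dasv5", "D8as_v5", 8, 10),    # 80 vCPUs
        ("Dasv5", "D16as_v5", 16, 12),  # 192 vCPUs
        ("Easv5", "E8as_v5", 8, 5),     # 40 vCPUs
    ]
    quotas = {"Dasv5": 300, "Easv5": 200}
    print(families_near_limit(deployments, quotas))
```

In this made-up inventory, D8as_v5 alone uses 80 of 300 vCPUs and looks healthy, while the family as a whole is at 272 of 300 and already past an 80% threshold.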

Strategy 2: Capacity Reservations (ODCR)

A Capacity Reservation (formally On-Demand Capacity Reservation, ODCR) lets you pre-book physical infrastructure in a specific region, zone, and VM size. You’re reserving capacity, not committing to a term or discount. Azure holds those servers for your subscription, ensuring your workloads can always start.

Capacity Reservation vs. Reserved Instance (RI)

Aspect	Capacity Reservation (ODCR)	Reserved Instance (RI)
Purpose	Guarantees capacity (hardware availability).	Locks in price (discounted rate).
Scope	Specific region, zone, and VM size.	Region and VM family.
Billing	Pay-as-you-go, no term commitment.	1 or 3-year fixed term.
Capacity Guarantee	✅ Yes, hardware is held for you.	❌ No, no guarantee.
Price Benefit	❌ None, PAYG rate.	✅ Up to ~70% discount.
Flexibility	Modify or cancel anytime.	Bound to term.

In short:

ODCR = “Hold my spot in the datacenter.”

RI = “Give me a discount because I’ll keep using it.”

You can use both: ODCR for capacity, RI for savings.

Example: A startup consistently runs 20× D16as_v5 VMs nightly for training. They reserve that capacity (ODCR) and apply RIs for discounts, ensuring predictable performance and cost.

Limitations:

You can’t reserve SKUs already out of stock.
ODCR doesn’t autoscale; it holds your baseline.
Best for core workloads, not ephemeral jobs.

Strategy 3: VMSS Instance Mix

Virtual Machine Scale Set (VMSS) Instance Mix is a feature of VMSS Flex that enables capacity-aware scaling across multiple VM sizes, and even across different purchase options (Standard and Spot). When you define more than one acceptable VM size, Azure automatically chooses whichever has capacity available during scale-out.

Learn more:

VMSS Instance Mix – Overview

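Example: Here’s a simplified configuration fragment from an ARM template using Instance Mix. The acceptable sizes are listed under skuProfile in the scale set’s properties, and the scale set’s sku name is set to "Mix":

"sku": { "name": "Mix" },
"properties": {
  "skuProfile": {
    "vmSizes": [
      { "name": "Standard_D8as_v5" },
      { "name": "Standard_E8as_v5" },
      { "name": "Standard_F8s_v2" }
    ]
  }
}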

VMSS Instance Mix helps you survive temporary SKU shortages by dynamically selecting the next available size, while Spot Priority Mix lets you blend Spot and Standard instances to reduce cost and improve resilience. This makes it ideal for large-scale app tiers, batch processing, and AI inference.

Limitations:

Works across zones, not regions.
Doesn’t mix Spot + Standard in the same pool.
Doesn’t reserve hardware capacity; it only improves allocation success rates.

Strategy 4: Azure Compute Fleet

Azure Compute Fleet can deploy up to 10,000 VMs across multiple SKUs, zones, and (in preview) regions. You define acceptable SKUs, and Azure picks the ones that have capacity.

Learn more:

Azure Compute Fleet – Overview

Fleet automatically:

Tries alternate SKUs (D8as_v5 → E8as_v5).
Expands to other zones or regions.
Combines Standard and Spot instances.

In short, it automates the “try this, then that” logic, improving your odds of successful deployment.
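
The “try this, then that” idea can be sketched in a few lines of Python. This is a conceptual illustration only, not how Compute Fleet is implemented: `first_available` is a hypothetical helper, the availability snapshot is invented, and a real capacity check would be an actual deployment attempt rather than a set lookup.

```python
from typing import Callable, Iterable, Optional, Tuple

Candidate = Tuple[str, str]  # (VM size, region)


def first_available(
    candidates: Iterable[Candidate],
    has_capacity: Callable[[str, str], bool],
) -> Optional[Candidate]:
    """Return the first (size, region) pair with capacity, in preference order."""
    for sku, region in candidates:
        if has_capacity(sku, region):
            return sku, region
    return None  # every option is full; no ordering can change that


if __name__ == "__main__":
    # Hypothetical snapshot of live availability, for illustration only.
    available = {("E8as_v5", "eastus2"), ("D8as_v5", "westus2")}

    preferences = [
        ("D8as_v5", "eastus2"),  # preferred size, preferred region
        ("E8as_v5", "eastus2"),  # alternate size, same region
        ("D8as_v5", "westus2"),  # preferred size, alternate region
    ]
    choice = first_available(preferences, lambda sku, region: (sku, region) in available)
    print(choice)
```

With this snapshot the helper skips the unavailable first choice and settles on E8as_v5 in eastus2. When nothing in the list has capacity it returns None, which mirrors the limitation that searching smarter cannot create capacity.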

Example: A rendering studio needs 2,000 VMs nightly. Fleet dynamically uses D8s_v5, D16s_v5, or E8s_v5 across East US 2 and West US 2, depending on live availability.

Limitations:

Fleet doesn’t create capacity; it just searches smarter.
If every zone and region is full, it still fails.
Ideal for AI training, batch jobs, rendering, or HPC, not for stateful services.

When to use what

Scenario	Best tool	What it solves
Logical limits before deployment	Quota Groups + Alerts	Prevent hitting soft limits.
Guaranteed baseline	Capacity Reservation (ODCR)	Reserve real hardware.
Managed autoscaling	VMSS Instance Mix	Scale out despite partial shortages.
Large-scale/bursty workloads	Azure Compute Fleet	Try alternate SKUs and regions.
GPU/high-demand SKUs	ODCR + Fleet	Reserve base, burst flexibly.

Real Talk: There’s no magic when a datacenter is full. Let’s be transparent: If a region has no physical servers available, no tool can make capacity appear.

Quota Groups remove logical blockers.
Capacity Reservations secure what you need.
Compute Fleet and VMSS Instance Mix increase the odds of success.

Together, they maximize probability, but none can override a physically full region.

The Azure capacity strategy flow

Final thoughts

For fast-scaling digital-native companies, the right question isn’t “How do I guarantee capacity?” It’s “How do I design for capacity uncertainty?” Start by putting the basics on autopilot:

Configure Quota Group alerts to prevent silent blockers.
Use Capacity Reservations (ODCR) to secure your baseline compute.
Add elasticity through VMSS Instance Mix and, when flexibility allows, Compute Fleet.
Monitor everything with Azure Monitor alerts, from quotas and reservations to scale-out failures and Fleet allocation health.

💡 Pro tip: Combine Quota Group Alerts, Reservation coverage monitoring, and VMSS/Fleet deployment telemetry in Azure Monitor to detect issues early. The faster you know what kind of failure you’re hitting, the faster you can act.

Accept that capacity is finite, but also that visibility is your greatest advantage. Azure gives you multiple levers; success comes from knowing when and how to use each one together.

Over the past few months, I’ve supported multiple customers, from AI platforms to SaaS startups, who faced real capacity challenges in regions like East US 2 and West US 2. This post came directly from those experiences, with one goal: to help others move from reactive firefighting to proactive, layered capacity planning. If your workloads are scaling fast, I hope this guide helps you build not just a plan, but a mindset, for running reliably when the cloud gets crowded.