A real design review: management groups, policies, break-glass accounts, and the five things I'd tweak before going to production.
Here's what I see at most startups when they first show up on Azure: one subscription, one Global Admin, everything in the same resource group, and everyone's an Owner.
That works when you have three engineers and one environment. It stops working around the time you have a production workload, a dev environment, shared infrastructure, and an engineer who accidentally deleted the wrong resource group on a Friday afternoon.
The next step is usually "let's create more subscriptions." That's the right instinct. But without management groups and policies tying them together, you end up with four subscriptions, four sets of inconsistent RBAC assignments, no shared tagging strategy, and no audit trail showing who deployed what.
If you're at this stage and want a starting point, the Startup-Scale Landing Zone gives you an opinionated Bicep template with management groups, policies, and RBAC already wired together. This post goes deeper: what happens when a team takes those concepts and customizes them for their own environment.
The design
A startup VP of Engineering sent me their proposed management group hierarchy and asked me to review it before going to production. They'd done their homework: read the Cloud Adoption Framework docs, researched config options, and put together a three-level hierarchy with specific policies and RBAC at each level.
Here's the breakdown:
Tenant Root Group is the automatic top-level management group (MG) that Azure creates in every tenant. Be very selective about what you assign here. Anything at this level affects every subscription you'll ever create, including ones that don't exist yet. Some organizations do assign enterprise-wide "must have" policies at root, but for a startup still figuring out its governance posture, keeping root clean and pushing baselines to a company MG one level down gives you more flexibility.
Company MG sits directly below and carries the baseline that applies to everything: required tags on all resources (env, owner, cost-center, app), allowed regions locked to three US regions, Defender for Cloud enabled everywhere, and all diagnostic logs routed to a central Log Analytics workspace. Engineering gets Reader at this level, so everyone can see everything but can't change anything by default.
Three child MGs below that:
Nonprod MG is the relaxed zone. Tags are audited but not denied, so engineers can experiment without being blocked by policy. Public IPs are allowed. Engineering gets Contributor. This is where you iterate fast without filing PIM requests.
Prod MG is the strict zone. Tags are denied if missing. Public IPs are blocked. Encryption at rest is required. VM SKUs are restricted. Engineering gets Reader by default, and Contributor access is available through PIM (just-in-time, time-limited activation). You have to explicitly request write access, and it expires.
Platform MG protects the shared infrastructure that everything depends on. The Terraform state storage account, central Log Analytics workspace, and shared Key Vault all live here. Platform team gets Contributor; everyone else gets Reader. Critical resources are protected from deletion.
Under each MG, the subscriptions:
| MG | Subscription | Purpose |
|---|---|---|
| Nonprod | dev | Development and testing |
| Nonprod | devtest (MSDN) | Engineer's personal scratch (MSDN-bound) |
| Prod | prod | Production workloads |
| Platform | cloud-infra | Terraform state, Log Analytics, Key Vault, workload identity |
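To make the inheritance in this design concrete, here is a minimal Python sketch of the hierarchy and subscription table above. It is a toy model, not an Azure API: the policy names are hypothetical, the company-level tag policy is assumed to be an audit that Prod tightens to deny, and "strictest effect wins" is a simplification of how Azure Policy evaluates stacked assignments.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Higher rank = stricter. A simplification of real Azure Policy evaluation.
EFFECT_RANK = {"audit": 1, "deny": 2}


@dataclass
class ManagementGroup:
    name: str
    parent: ManagementGroup | None = None
    # Hypothetical policy names mapped to an effect ("audit" or "deny").
    assignments: dict[str, str] = field(default_factory=dict)

    def effective(self) -> dict[str, str]:
        """Collect assignments from the root down, keeping the strictest effect."""
        merged = dict(self.parent.effective()) if self.parent else {}
        for policy, effect in self.assignments.items():
            current = merged.get(policy)
            if current is None or EFFECT_RANK[effect] > EFFECT_RANK[current]:
                merged[policy] = effect
        return merged


# Assumption: the company baseline audits tags and denies non-approved regions.
company = ManagementGroup(
    "Company", assignments={"required-tags": "audit", "allowed-regions": "deny"}
)
nonprod = ManagementGroup("Nonprod", parent=company)
prod = ManagementGroup(
    "Prod",
    parent=company,
    assignments={"required-tags": "deny", "no-public-ip": "deny"},
)
platform = ManagementGroup("Platform", parent=company)

# Subscription placement from the table above.
SUBSCRIPTIONS = {
    "dev": nonprod,
    "devtest": nonprod,
    "prod": prod,
    "cloud-infra": platform,
}

for sub, mg in SUBSCRIPTIONS.items():
    print(sub, mg.name, mg.effective())
```

The point of the sketch is that a subscription never carries its own rules. Move `dev` under a different MG and its effective policy set changes with it, which is why placement is the first thing to verify.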
The parts that nail it
The hierarchy is flat and functional. CAF says keep it three to four levels deep and don't create management groups just for the sake of structure. This design does exactly that: a company MG for baselines, then Nonprod/Prod/Platform for the policy gradient. It's not "the one CAF pattern" (CAF deliberately avoids prescribing a single topology), but it's a clean startup pattern that scales to dozens of subscriptions without restructuring.
Audit in dev, deny in prod. Dev environments that deny everything become unusable. Engineers stop experimenting. Prod environments that only audit become insecure. The split is the right trade-off: visibility without friction in dev, enforcement without exceptions in prod.
The platform subscription for shared services. Centralizing Terraform state, the Log Analytics workspace, and shared Key Vault into a separate subscription (with its own RBAC) means application teams can't accidentally delete the infrastructure that manages their infrastructure. This is the "trust boundary" pattern, and most startups skip it until they learn the hard way.
What I'd change before going live
PIM licensing isn't one-seat-fits-all. They mentioned having "1 P2 seat" for PIM. PIM requires an Entra ID P2 (or Governance) license per user who's eligible for activation, plus anyone who approves or reviews PIM access. If four engineers need just-in-time Contributor access to production and one manager approves, that's five P2 licenses (~$9/user/month). Still cheap insurance compared to "everyone has standing Contributor," but budget for it correctly.
Think about SKU restrictions as a trade-off. Their prod MG had "restrict to approved SKUs." An allow-list gives you strict standardization (only pre-approved SKUs work), but every time Azure launches a new VM series, someone has to update it. A deny-list ("block these specific expensive or unnecessary SKUs") is easier to maintain since new SKUs are available by default. The right choice depends on your team: if you need tight control over what runs in prod, keep the allow-list. If you move fast and want less policy maintenance, a deny-list with periodic reviews is simpler.
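The maintenance difference shows up the moment a new SKU appears. A minimal Python sketch, assuming illustrative lists (the approved and blocked sets are my examples, not the team's actual policy, and `Standard_D2s_v7` is a hypothetical stand-in for a newly launched series):

```python
# Illustrative lists only; not the reviewed team's real policy parameters.
APPROVED_SKUS = {"Standard_D2s_v5", "Standard_D4s_v5"}
BLOCKED_SKUS = {"Standard_M128ms"}


def allowed_by_allow_list(sku: str, approved: set[str] = APPROVED_SKUS) -> bool:
    """Allow-list: anything not explicitly approved is rejected."""
    return sku in approved


def allowed_by_deny_list(sku: str, blocked: set[str] = BLOCKED_SKUS) -> bool:
    """Deny-list: anything not explicitly blocked is accepted."""
    return sku not in blocked


# Hypothetical name standing in for a VM series launched after the lists were written.
new_sku = "Standard_D2s_v7"

print("allow-list permits new SKU:", allowed_by_allow_list(new_sku))
print("deny-list permits new SKU:", allowed_by_deny_list(new_sku))
```

Under the allow-list the new series is blocked until someone edits the policy; under the deny-list it works immediately, including any expensive sizes nobody has reviewed yet. That is the whole trade-off in two functions.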
Resource locks beat policy for protecting critical infra. Their Platform MG had "deny deletion of state storage / log workspace" as a policy. Azure Resource Locks (CanNotDelete) are simpler and more visible for this. A lock shows up right on the resource in the portal, so engineers see it immediately. A deny-delete policy is invisible until it blocks you, and the error message doesn't always make it obvious why. Locks are also easier to temporarily remove when you legitimately need to rotate or replace a resource.
Add cost alerts on every subscription from day one. Their design didn't mention budget alerts. Azure Cost Management lets you set budget thresholds per subscription with email and webhook notifications. Set them before any workloads deploy, not after the first surprise bill. Start with 80% and 100% of expected monthly spend. It takes 5 minutes and can save thousands.
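As a quick illustration of the 80% and 100% thresholds, here is a Python sketch that turns an expected monthly spend into alert amounts and reports which ones a given actual spend has crossed. The $2,000 figure is a made-up example, and this only models the arithmetic; the real alerts are configured in Azure Cost Management.

```python
def alert_amounts(
    expected_monthly_spend: float, thresholds: tuple[int, ...] = (80, 100)
) -> dict[int, float]:
    """Map each threshold percentage to the spend at which it fires."""
    return {pct: expected_monthly_spend * pct / 100 for pct in thresholds}


def crossed_thresholds(
    actual_spend: float,
    expected_monthly_spend: float,
    thresholds: tuple[int, ...] = (80, 100),
) -> list[int]:
    """Return the threshold percentages the actual spend has reached."""
    amounts = alert_amounts(expected_monthly_spend, thresholds)
    return [pct for pct, amount in amounts.items() if actual_spend >= amount]


# Made-up example: a subscription expected to cost $2,000 per month.
print(alert_amounts(2000))
print(crossed_thresholds(1700, 2000))
```

The 80% alert is the useful one: it fires while there is still time in the billing period to find out what changed.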
Cap the MSDN subscription. Their devtest sub was MSDN-bound, described as "personal scratch." MSDN (now Visual Studio) subscriptions come with a monthly Azure credit ($50-$150 depending on the license tier) and a spending limit that is on by default. That limit can be removed, after which usage beyond the credit is charged to the payment method on file with no cap. Keep the spending limit ON for scratch subs. If it's been removed, set a budget alert at the credit amount. Also note that some Marketplace and external services may bill separately regardless of the spending limit.
The break-glass question
This team was federating their primary domain with Google Workspace as the SAML identity provider (their whole company runs on Google). They asked: "Can I use my .onmicrosoft.com account as a break-glass account while my federated company.com is my daily driver?"
Yes. This is exactly the pattern Microsoft recommends.
Microsoft's security benchmark (PA-5) specifically calls for cloud-only break-glass accounts that bypass external IdP dependencies. If your Google SAML federation goes down (Google outage, misconfigured SAML cert, domain issues), all federated accounts fail to sign in. Cloud-only .onmicrosoft.com accounts authenticate directly against Entra ID with no external dependency.
How to harden them:
Create two break-glass accounts. Microsoft recommends at least two. Store credentials in separate physical locations. One person alone shouldn't be able to access both. Docs: Manage emergency access accounts.
Use phishing-resistant auth. Passkeys (FIDO2 security keys) are the strongest option: phishing-resistant and no dependency on a phone or authenticator app that might be unavailable during an emergency. If you already run PKI, certificate-based auth is another viable option. The key is diversity across your two accounts so a single authentication method failure doesn't lock out both. Docs: Enable FIDO2 security key sign-in.
Exclude at least one account from ALL Conditional Access policies. This is Microsoft's recommendation, and it is the account that guarantees access if a bad Conditional Access (CA) policy locks everyone out. The second account can optionally have phishing-resistant MFA enforced via CA, giving you a safer fallback for non-federation emergencies.
Assign Global Administrator permanently. Not through PIM. Break-glass accounts need immediate access. PIM activation requires the normal auth flow, which defeats the purpose in an emergency.
Monitor every sign-in. Set up alerts in Azure Monitor or Microsoft Sentinel for any authentication from a break-glass account. If these accounts show activity outside an emergency, investigate immediately.
Test quarterly. Actually sign in with the break-glass accounts on a schedule. Verify the credentials work, the FIDO2 keys work, and the monitoring alert fires. Don't wait for a real emergency to discover something is broken.
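The hardening rules above are easy to turn into a review-time check. The Python sketch below is one way to do that, under stated assumptions: the `BreakGlassAccount` record is a hypothetical shape I made up for illustration (it is not an Entra ID or Graph object), `contoso` is a placeholder tenant name, and the `.onmicrosoft.com` suffix test is only a heuristic for "cloud-only."

```python
from dataclasses import dataclass


@dataclass
class BreakGlassAccount:
    """Hypothetical record describing how one break-glass account is set up."""

    upn: str
    auth_method: str  # e.g. "fido2" or "certificate"
    excluded_from_all_ca: bool
    permanent_global_admin: bool


def hardening_gaps(accounts: list[BreakGlassAccount]) -> list[str]:
    """Return a list of rule violations; an empty list means no gaps found."""
    gaps = []
    if len(accounts) < 2:
        gaps.append("fewer than two break-glass accounts")
    if any(not a.upn.lower().endswith(".onmicrosoft.com") for a in accounts):
        gaps.append("an account is not on the .onmicrosoft.com domain")
    if not any(a.excluded_from_all_ca for a in accounts):
        gaps.append("no account is excluded from all Conditional Access policies")
    if any(not a.permanent_global_admin for a in accounts):
        gaps.append("an account lacks a permanent Global Administrator assignment")
    if len(accounts) >= 2 and len({a.auth_method for a in accounts}) < 2:
        gaps.append("all accounts share one authentication method")
    return gaps


# Placeholder accounts on a placeholder tenant.
accounts = [
    BreakGlassAccount("bg1@contoso.onmicrosoft.com", "fido2", True, True),
    BreakGlassAccount("bg2@contoso.onmicrosoft.com", "certificate", False, True),
]
print(hardening_gaps(accounts) or "no gaps found")
```

A check like this covers configuration only. It says nothing about whether the credentials still work or the alert still fires, which is what the quarterly sign-in test is for.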
The pre-production governance checklist
Before deploying workloads into your new hierarchy, verify:
- All subscriptions are nested under the correct MG (not dangling under Tenant Root Group)
- Baseline policies applied at the company MG and verified with Get-AzPolicyAssignment
- PIM configured with appropriate activation duration (4-8 hours max)
- P2 licenses assigned to every user eligible for PIM activation, plus approvers and reviewers
- Two break-glass accounts exist, tested, and monitored
- At least one break-glass account excluded from all Conditional Access policies
- Budget alerts set on every subscription (80% and 100% thresholds)
- Resource locks on Terraform state, Log Analytics workspace, and Key Vault
- MSDN spending limit verified ON (or budget alert set if removed)
- Diagnostic settings routing all activity logs to the central Log Analytics workspace
Where this fits in the governance journey
If you're building Azure governance from zero, here's my recommended reading order:
- Demystifying Microsoft Entra ID, Tenants and Azure Subscriptions - understand what tenants, subscriptions, and Entra ID actually are
- Azure has three permission systems, and you're probably confusing them - the identity, resource, and billing planes
- This post - design your management group hierarchy
- Role Structures, Anti-Patterns, and the 10 Governance Principles - RBAC patterns and what not to do
- Introducing the Startup-Scale Landing Zone - the full reference architecture