<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Startups at Microsoft articles</title>
    <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/bg-p/StartupsatMicrosoftBlog</link>
    <description>Startups at Microsoft articles</description>
    <pubDate>Sat, 02 May 2026 12:47:06 GMT</pubDate>
    <dc:creator>StartupsatMicrosoftBlog</dc:creator>
    <dc:date>2026-05-02T12:47:06Z</dc:date>
    <item>
      <title>The flat-subscription problem</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/the-flat-subscription-problem/ba-p/4513777</link>
      <description>&lt;P&gt;&lt;EM&gt;A real design review: management groups, policies, break-glass accounts, and the five things I'd tweak before going to production.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Here's what I see at most startups when they first show up on Azure: one subscription, one Global Admin, everything in the same resource group, and everyone's an Owner.&lt;/P&gt;
&lt;P&gt;That works when you have three engineers and one environment. It stops working around the time you have a production workload, a dev environment, shared infrastructure, and an engineer who accidentally deleted the wrong resource group on a Friday afternoon.&lt;/P&gt;
&lt;P&gt;The next step is usually "let's create more subscriptions." That's the right instinct. But without management groups and policies tying them together, you end up with four subscriptions, four sets of inconsistent RBAC assignments, no shared tagging strategy, and no audit trail showing who deployed what.&lt;/P&gt;
&lt;P&gt;If you're at this stage and want a starting point, the &lt;A class="lia-external-url" href="https://aka.ms/sslz" target="_blank"&gt;Startup-Scale Landing Zone&lt;/A&gt; gives you an opinionated Bicep template with management groups, policies, and RBAC already wired together. This post goes deeper: what happens when a team takes those concepts and customizes them for their own environment.&lt;/P&gt;
&lt;H2&gt;The design&lt;/H2&gt;
&lt;P&gt;A startup VP of Engineering sent me their proposed management group hierarchy and asked me to review it before going to production. They'd done their homework: read the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/landing-zone/design-area/resource-org-management-groups" target="_blank"&gt;Cloud Adoption Framework&lt;/A&gt; docs, researched config options, and put together a three-level hierarchy with specific policies and RBAC at each level.&lt;/P&gt;
&lt;P&gt;Here's the breakdown:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Tenant Root Group&lt;/STRONG&gt; is the automatic top-level MG that Azure creates in every tenant. Be very selective about what you assign here. Anything at this level affects every subscription you'll ever create, including ones that don't exist yet. Some organizations do assign enterprise-wide "must have" policies at root, but for a startup still figuring out its governance posture, keeping root clean and pushing baselines to a company MG one level down gives you more flexibility.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Company MG&lt;/STRONG&gt; sits directly below and carries the baseline that applies to everything: required tags on all resources (env, owner, cost-center, app), allowed regions locked to three US regions, Defender for Cloud enabled everywhere, and all diagnostic logs routed to a central Log Analytics workspace. Engineering gets Reader at this level, so everyone can see everything but can't change anything by default.&lt;/P&gt;
&lt;P&gt;Three child MGs below that:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Nonprod MG&lt;/STRONG&gt; is the relaxed zone. Tags are audited but not denied, so engineers can experiment without being blocked by policy. Public IPs are allowed. Engineering gets Contributor. This is where you iterate fast without filing PIM requests.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Prod MG&lt;/STRONG&gt; is the strict zone. Tags are denied if missing. Public IPs are blocked. Encryption at rest is required. VM SKUs are restricted. Engineering gets Reader by default, and Contributor access is available through &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/entra/id-governance/privileged-identity-management/pim-configure" target="_blank"&gt;PIM&lt;/A&gt; (just-in-time, time-limited activation). You have to explicitly request write access, and it expires.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Platform MG&lt;/STRONG&gt; protects the shared infrastructure that everything depends on. The Terraform state storage account, central Log Analytics workspace, and shared Key Vault all live here. Platform team gets Contributor; everyone else gets Reader. Critical resources are protected from deletion.&lt;/P&gt;
&lt;P&gt;Under each MG, the subscriptions:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;MG&lt;/th&gt;&lt;th&gt;Subscription&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Nonprod&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;dev&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Development and testing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Nonprod&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;devtest&lt;/STRONG&gt; (MSDN)&lt;/td&gt;&lt;td&gt;Engineer's personal scratch (MSDN-bound)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Prod&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;prod&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Production workloads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;cloud-infra&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Terraform state, Log Analytics, Key Vault, workload identity&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;The parts that nail it&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;The hierarchy is flat and functional.&lt;/STRONG&gt; CAF says keep it &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/landing-zone/design-area/resource-org-management-groups" target="_blank"&gt;three to four levels deep&lt;/A&gt; and don't create management groups just for the sake of structure. This design does exactly that: a company MG for baselines, then Nonprod/Prod/Platform for the policy gradient. It's not "the one CAF pattern" (CAF deliberately avoids prescribing a single topology), but it's a clean startup pattern that scales to dozens of subscriptions without restructuring.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Audit in dev, deny in prod.&lt;/STRONG&gt; Dev environments that deny everything become unusable. Engineers stop experimenting. Prod environments that only audit become insecure. The split is the right trade-off: visibility without friction in dev, enforcement without exceptions in prod.&lt;/P&gt;
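&lt;P&gt;To make the audit/deny split concrete, here is a minimal Python sketch of how the same tag rule behaves under each effect. It is an illustration only: &lt;CODE&gt;evaluate_tags&lt;/CODE&gt; is a hypothetical helper, not an Azure API, and real enforcement happens in Azure Policy. The required tag names are the ones from the company MG baseline.&lt;/P&gt;

```python
REQUIRED_TAGS = {"env", "owner", "cost-center", "app"}

def evaluate_tags(resource_tags, effect):
    """Return (allowed, missing) for a deployment under 'audit' or 'deny'.

    Hypothetical helper that mimics the policy gradient; not an Azure API.
    """
    if effect not in ("audit", "deny"):
        raise ValueError(f"unknown effect: {effect}")
    missing = sorted(REQUIRED_TAGS - set(resource_tags))
    # Audit only records the gap; deny blocks the deployment outright.
    allowed = effect == "audit" or not missing
    return allowed, missing

tags = {"env": "dev", "owner": "alice"}
print(evaluate_tags(tags, "audit"))  # nonprod: deployment proceeds, gap is reported
print(evaluate_tags(tags, "deny"))   # prod: deployment is blocked
```

&lt;P&gt;The rule is identical in both zones. Only the effect changes, which is why one definition can serve Nonprod and Prod.&lt;/P&gt;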
&lt;P&gt;&lt;STRONG&gt;The platform subscription for shared services.&lt;/STRONG&gt; Centralizing Terraform state, the Log Analytics workspace, and shared Key Vault into a separate subscription (with its own RBAC) means application teams can't accidentally delete the infrastructure that manages their infrastructure. This is the "trust boundary" pattern, and most startups skip it until they learn the hard way.&lt;/P&gt;
&lt;H2&gt;What I'd change before going live&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;PIM licensing isn't one-seat-fits-all.&lt;/STRONG&gt; They mentioned having "1 P2 seat" for PIM. &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/entra/id-governance/licensing-fundamentals" target="_blank"&gt;PIM requires an Entra ID P2 (or Governance) license&lt;/A&gt; per user who's eligible for activation, plus anyone who approves or reviews PIM access. If four engineers need just-in-time Contributor access to production and one manager approves, that's five P2 licenses (~$9/user/month). Still cheap insurance compared to "everyone has standing Contributor," but budget for it correctly.&lt;/P&gt;
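&lt;P&gt;The seat count is easy to get wrong, so here is a rough sketch of the arithmetic. The user names are placeholders, &lt;CODE&gt;pim_license_count&lt;/CODE&gt; is a hypothetical helper, and the price is the approximate figure quoted above, so check current pricing before budgeting.&lt;/P&gt;

```python
P2_PRICE_PER_USER_MONTH = 9  # approximate USD figure quoted above; verify current pricing

def pim_license_count(eligible_users, approvers, reviewers=()):
    """Count distinct people who need a P2 (or Governance) license for PIM."""
    # Someone who is both eligible and an approver still needs only one license.
    return len(set(eligible_users) | set(approvers) | set(reviewers))

eligible = ["eng1", "eng2", "eng3", "eng4"]  # placeholder names
approvers = ["manager"]
seats = pim_license_count(eligible, approvers)
print(seats, seats * P2_PRICE_PER_USER_MONTH)  # 5 seats, roughly 45 USD per month
```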
&lt;P&gt;&lt;STRONG&gt;Think about SKU restrictions as a trade-off.&lt;/STRONG&gt; Their prod MG had "restrict to approved SKUs." An allow-list gives you strict standardization (only pre-approved SKUs work), but every time Azure launches a new VM series, someone has to update it. A deny-list ("block these specific expensive or unnecessary SKUs") is easier to maintain since new SKUs are available by default. The right choice depends on your team: if you need tight control over what runs in prod, keep the allow-list. If you move fast and want less policy maintenance, a deny-list with periodic reviews is simpler.&lt;/P&gt;
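&lt;P&gt;The maintenance difference shows up the moment a new SKU appears. The sketch below is plain Python rather than a policy definition: &lt;CODE&gt;sku_allowed&lt;/CODE&gt; is a hypothetical helper, and the SKU names are example values only.&lt;/P&gt;

```python
def sku_allowed(sku, mode, sku_list):
    """Decide whether a VM SKU may be deployed.

    mode="allow": only SKUs in sku_list pass (new SKUs are blocked by default).
    mode="deny":  everything passes except SKUs in sku_list.
    """
    if mode == "allow":
        return sku in sku_list
    if mode == "deny":
        return sku not in sku_list
    raise ValueError(f"unknown mode: {mode}")

approved = {"Standard_D2s_v5", "Standard_D4s_v5"}  # example values only
blocked = {"Standard_M128s"}                        # example values only
new_sku = "Standard_D2s_v6"  # stands in for a newly launched series

print(sku_allowed(new_sku, "allow", approved))  # False until someone updates the list
print(sku_allowed(new_sku, "deny", blocked))    # True without any policy change
```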
&lt;P&gt;&lt;STRONG&gt;Resource locks beat policy for protecting critical infra.&lt;/STRONG&gt; Their Platform MG had "deny deletion of state storage / log workspace" as a policy. &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/lock-resources" target="_blank"&gt;Azure Resource Locks&lt;/A&gt; (CanNotDelete) are simpler and more visible for this. A lock shows up right on the resource in the portal, so engineers see it immediately. A deny-delete policy is invisible until it blocks you, and the error message doesn't always make it obvious why. Locks are also easier to temporarily remove when you legitimately need to rotate or replace a resource.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Add cost alerts on every subscription from day one.&lt;/STRONG&gt; Their design didn't mention budget alerts. &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/tutorial-acm-create-budgets" target="_blank"&gt;Azure Cost Management&lt;/A&gt; lets you set budget thresholds per subscription with email and webhook notifications. Set them before any workloads deploy, not after the first surprise bill. Start with 80% and 100% of expected monthly spend. It takes 5 minutes and can save thousands.&lt;/P&gt;
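&lt;P&gt;As a quick sanity check on the 80% and 100% thresholds, this sketch shows which alerts would have fired for a given spend. It does not call Azure Cost Management; &lt;CODE&gt;budget_alerts&lt;/CODE&gt; and the dollar amounts are made up for illustration.&lt;/P&gt;

```python
def budget_alerts(expected_monthly_spend, actual_spend, thresholds=(0.8, 1.0)):
    """Return the threshold amounts that actual_spend has reached or passed."""
    return [round(expected_monthly_spend * t, 2)
            for t in thresholds
            if actual_spend >= expected_monthly_spend * t]

print(budget_alerts(2000, 1700))  # [1600.0]
print(budget_alerts(2000, 2100))  # [1600.0, 2000.0]
```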
&lt;P&gt;&lt;STRONG&gt;Cap the MSDN subscription.&lt;/STRONG&gt; Their devtest sub was MSDN-bound, described as "personal scratch." MSDN subscriptions come with a monthly credit ($50-$150 depending on the license tier), but the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/devtest/offer/how-to-manage-the-spending-limit" target="_blank"&gt;spending limit can be removed&lt;/A&gt;, which means charges hit a valid payment method with no cap. Keep the spending limit ON for scratch subs. If it's been removed, set a budget alert at the credit amount. Also note that some Marketplace and external services may bill separately regardless of the spending limit.&lt;/P&gt;
&lt;H2&gt;The break-glass question&lt;/H2&gt;
&lt;P&gt;This team was federating their primary domain with Google Workspace as the SAML identity provider (their whole company runs on Google). They asked: "Can I use my .onmicrosoft.com account as a break-glass account while my federated @company.com is my daily driver?"&lt;/P&gt;
&lt;P&gt;Yes. This is exactly the pattern Microsoft recommends.&lt;/P&gt;
&lt;P class="lia-clear-both"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/security/benchmark/azure/mcsb-privileged-access#pa-5-set-up-emergency-access" target="_blank"&gt;Microsoft's security benchmark (PA-5)&lt;/A&gt; specifically calls for cloud-only break-glass accounts that bypass external IdP dependencies. If your Google SAML federation goes down (Google outage, misconfigured SAML cert, domain issues), all federated accounts fail to sign in. Cloud-only .onmicrosoft.com accounts authenticate directly against Entra ID with no external dependency.&lt;/P&gt;
&lt;P&gt;How to harden them:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Create two break-glass accounts.&lt;/STRONG&gt; Microsoft recommends at least two. Store credentials in separate physical locations. One person alone shouldn't be able to access both. Docs: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/entra/identity/role-based-access-control/security-emergency-access" target="_blank"&gt;Manage emergency access accounts&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Use phishing-resistant auth.&lt;/STRONG&gt; Passkeys (FIDO2 security keys) are the strongest option: phishing-resistant and no dependency on a phone or authenticator app that might be unavailable during an emergency. If you already run PKI, certificate-based auth is another viable option. The key is diversity across your two accounts so a single authentication method failure doesn't lock out both. Docs: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/entra/identity/authentication/howto-authentication-passwordless-security-key" target="_blank"&gt;Enable FIDO2 security key sign-in&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Exclude at least one account from ALL Conditional Access policies.&lt;/STRONG&gt; This is the account that guarantees access if a bad CA policy locks everyone out. Microsoft recommends excluding at least one break-glass account from every CA policy. The second account can optionally have phishing-resistant MFA enforced via CA, giving you a safer fallback for non-federation emergencies.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Assign Global Administrator permanently.&lt;/STRONG&gt; Not through PIM. Break-glass accounts need immediate access. PIM activation requires the normal auth flow, which defeats the purpose in an emergency.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Monitor every sign-in.&lt;/STRONG&gt; Set up alerts in Azure Monitor or Microsoft Sentinel for any authentication from a break-glass account. If these accounts show activity outside an emergency, investigate immediately.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Test quarterly.&lt;/STRONG&gt; Actually sign in with the break-glass accounts on a schedule. Verify the credentials work, the FIDO2 keys work, and the monitoring alert fires. Don't wait for a real emergency to discover something is broken.&lt;/P&gt;
&lt;H2&gt;The pre-production governance checklist&lt;/H2&gt;
&lt;P&gt;Before deploying workloads into your new hierarchy, verify:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;All subscriptions are nested under the correct MG (not dangling under Tenant Root Group)&lt;/LI&gt;
&lt;LI&gt;Baseline policies applied at the company MG and verified with &lt;CODE&gt;Get-AzPolicyAssignment&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;PIM configured with appropriate activation duration (4-8 hours max)&lt;/LI&gt;
&lt;LI&gt;P2 licenses assigned to every user eligible for PIM activation, plus approvers and reviewers&lt;/LI&gt;
&lt;LI&gt;Two break-glass accounts exist, tested, and monitored&lt;/LI&gt;
&lt;LI&gt;At least one break-glass account excluded from all Conditional Access policies&lt;/LI&gt;
&lt;LI&gt;Budget alerts set on every subscription (80% and 100% thresholds)&lt;/LI&gt;
&lt;LI&gt;Resource locks on Terraform state, Log Analytics workspace, and Key Vault&lt;/LI&gt;
&lt;LI&gt;MSDN spending limit verified ON (or budget alert set if removed)&lt;/LI&gt;
&lt;LI&gt;Diagnostic settings routing all activity logs to the central Log Analytics workspace&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Where this fits in the governance journey&lt;/H2&gt;
&lt;P&gt;If you're building Azure governance from zero, here's my recommended reading order:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/demystifying-microsoft-entra-id-tenants-and-azure-subscriptions/4155261" data-lia-auto-title="Demystifying Microsoft Entra ID, Tenants and Azure Subscriptions" data-lia-auto-title-active="0" target="_blank"&gt;Demystifying Microsoft Entra ID, Tenants and Azure Subscriptions&lt;/A&gt; - understand what tenants, subscriptions, and Entra ID actually are&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-has-three-permission-systems-and-youre-probably-confusing-them/4471854" data-lia-auto-title="Azure has three permission systems, and you're probably confusing them" data-lia-auto-title-active="0" target="_blank"&gt;Azure has three permission systems, and you're probably confusing them&lt;/A&gt; - the identity, resource, and billing planes&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;This post&lt;/STRONG&gt; - design your management group hierarchy&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/role-structures-anti-patterns-and-the-10-governance-principles/4510070" data-lia-auto-title="Role Structures, Anti-Patterns, and the 10 Governance Principles" data-lia-auto-title-active="0" target="_blank"&gt;Role Structures, Anti-Patterns, and the 10 Governance Principles&lt;/A&gt; - RBAC patterns and what not to do&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/introducing-the-startup-scale-landing-zone-get-azure-right-from-day-one/4501566" data-lia-auto-title="Introducing the Startup-Scale Landing Zone" data-lia-auto-title-active="0" target="_blank"&gt;Introducing the Startup-Scale Landing Zone&lt;/A&gt; - the full reference architecture&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Wed, 22 Apr 2026 19:06:12 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/the-flat-subscription-problem/ba-p/4513777</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-04-22T19:06:12Z</dc:date>
    </item>
    <item>
      <title>Your Azure VM went down and nobody knew why. Here's how to fix that.</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/your-azure-vm-went-down-and-nobody-knew-why-here-s-how-to-fix/ba-p/4513733</link>
      <description>&lt;P&gt;If you've ever had a production VM go unhealthy on Azure and found yourself scrambling to figure out what happened, you're not alone. I work with startups running production workloads on Azure, and this is one of the most common patterns I see: something goes wrong, the team opens a support ticket, and then everyone waits for a root cause while the CTO asks "how do we make sure we know about this before our customers do next time?"&lt;/P&gt;
&lt;P&gt;The good news: Azure already gives you the tools to answer both questions. Most teams just haven't set them up yet.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Scope note:&lt;/STRONG&gt; This post covers &lt;STRONG&gt;platform health and maintenance signals&lt;/STRONG&gt; for Azure VMs. We're not covering guest OS metrics, application telemetry, or Azure Monitor/VM Insights here. If you don't have a dedicated SRE team, these are the highest-leverage Azure-native checks to set up first.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Let's get into it.&lt;/P&gt;
&lt;H2&gt;Step 1: Figure out what actually happened (Resource Health)&lt;/H2&gt;
&lt;P&gt;Before you open a support ticket, check &lt;STRONG&gt;Resource Health&lt;/STRONG&gt;. It's the fastest way to determine whether your VM went down because of something Azure did (platform event) or something on your side (user-initiated or config issue).&lt;/P&gt;
&lt;P&gt;Go to your VM in the Azure portal &amp;gt; &lt;STRONG&gt;Resource Health&lt;/STRONG&gt; blade. You'll see:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Current status&lt;/STRONG&gt;: Available, Unavailable, Degraded, or Unknown&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Health history&lt;/STRONG&gt;: 30 days of state transitions with annotations explaining each one&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Root cause&lt;/STRONG&gt;: For platform-initiated outages on VMs, Azure automatically publishes root cause details within 72 hours, directly in this blade&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The annotations often tell you what kind of event occurred: live migration, host reboot, planned maintenance, degraded hardware, etc. In many cases, you get this information without filing a support ticket.&lt;/P&gt;
&lt;P&gt;If your VM was affected by a live migration, the annotation will show it was a platform-initiated event. Live migration is a memory-preserving operation that causes a brief pause, typically no more than 5 seconds (&lt;A href="https://learn.microsoft.com/en-us/azure/virtual-machines/maintenance-and-updates#maintenance-that-doesnt-require-a-reboot" target="_blank" rel="noopener"&gt;docs&lt;/A&gt;). But if your application is sensitive to even short freezes, or if you're seeing them frequently, that's worth investigating further.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Docs:&lt;/STRONG&gt; &lt;A href="https://learn.microsoft.com/en-us/azure/service-health/resource-health-overview" target="_blank" rel="noopener"&gt;Resource Health overview&lt;/A&gt;&lt;/P&gt;
&lt;H2&gt;Step 2: Get notified when it happens (Service Health + Resource Health Alerts)&lt;/H2&gt;
&lt;P&gt;Checking the portal after an incident is fine. Getting an alert &lt;EM&gt;when&lt;/EM&gt; the incident happens is better.&lt;/P&gt;
&lt;H3&gt;Service Health Alerts&lt;/H3&gt;
&lt;P&gt;These notify you about service issues, planned maintenance, health advisories, and security advisories for the Azure services and regions you're actually using. Service Health is best for subscription-level and region-level awareness. If there's a regional maintenance wave driving elevated live migrations, this is how you'd know about it proactively.&lt;/P&gt;
&lt;P&gt;Set them up to notify your ops channel via email, SMS, webhook (Slack, PagerDuty, Teams), or automation via Logic Apps or Azure Functions.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Docs:&lt;/STRONG&gt; &lt;A href="https://learn.microsoft.com/en-us/azure/service-health/alerts-activity-log-service-notifications-portal" target="_blank" rel="noopener"&gt;Create Service Health alerts&lt;/A&gt; | &lt;A href="https://learn.microsoft.com/en-us/azure/service-health/service-health-alert-webhook-pagerduty" target="_blank" rel="noopener"&gt;PagerDuty integration&lt;/A&gt;&lt;/P&gt;
&lt;H3&gt;Resource Health Alerts&lt;/H3&gt;
&lt;P&gt;These fire when a specific resource (or all resources in a resource group) changes health status. The alert includes health-change details such as status, cause type (platform vs. user-initiated), and descriptive event text, so you get more than a generic "VM is unhealthy" notification.&lt;/P&gt;
&lt;P&gt;This is the "never be surprised again" alert. If you only set up one thing from this post, make it this.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Docs:&lt;/STRONG&gt; &lt;A href="https://learn.microsoft.com/en-us/azure/service-health/resource-health-alert-monitor-guide" target="_blank" rel="noopener"&gt;Create Resource Health alerts&lt;/A&gt;&lt;/P&gt;
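&lt;P&gt;Because the alert carries the cause type, your webhook handler can decide whether a health change is worth a page. The sketch below is a rough illustration under stated assumptions: &lt;CODE&gt;should_page&lt;/CODE&gt; is a hypothetical helper, and the flat &lt;CODE&gt;properties&lt;/CODE&gt; dict with &lt;CODE&gt;currentHealthStatus&lt;/CODE&gt; and &lt;CODE&gt;cause&lt;/CODE&gt; keys is an assumed shape. Check the payload you actually receive and adapt the lookups.&lt;/P&gt;

```python
def should_page(properties):
    """Decide whether a health change deserves a page.

    `properties` is assumed to be a flat dict with the keys below; a real
    alert payload may nest these fields, so adapt the lookup to what you receive.
    """
    status = properties.get("currentHealthStatus")
    cause = properties.get("cause")
    # Skip changes your own team triggered (for example a manual stop).
    if cause == "UserInitiated":
        return False
    return status in ("Unavailable", "Degraded")

print(should_page({"currentHealthStatus": "Unavailable", "cause": "PlatformInitiated"}))  # True
print(should_page({"currentHealthStatus": "Unavailable", "cause": "UserInitiated"}))      # False
```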
&lt;H2&gt;Step 3: See it coming (Scheduled Events API)&lt;/H2&gt;
&lt;P&gt;This is the part most teams don't know about, and it's the most powerful tool for handling live migrations gracefully.&lt;/P&gt;
&lt;P&gt;Azure exposes an &lt;STRONG&gt;Instance Metadata Service (IMDS)&lt;/STRONG&gt; endpoint on every VM that gives your application advance notice of upcoming maintenance events. Live migrations show up as &lt;CODE&gt;EventType: "Freeze"&lt;/CODE&gt;. In typical cases, you get up to ~15 minutes between the event appearing and Azure proceeding with the operation, though exact timing varies and some failures (like hardware issues) can bypass the advance notification entirely.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Note:&lt;/STRONG&gt; Most Azure VM families support live migration, but G, L, N, and H series VMs do not. If you run GPU or HPC workloads on these SKUs, you won't see &lt;CODE&gt;Freeze&lt;/CODE&gt; events. You'll still get &lt;CODE&gt;Reboot&lt;/CODE&gt; or &lt;CODE&gt;Redeploy&lt;/CODE&gt; events for other maintenance types.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The endpoint is available from inside the VM at:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Here's an example response when a live migration is scheduled:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;{
  "DocumentIncarnation": 1,
  "Events": [
    {
      "EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
      "EventType": "Freeze",
      "ResourceType": "VirtualMachine",
      "Resources": ["my-production-vm"],
      "EventStatus": "Scheduled",
      "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT",
      "Description": "Virtual machine is being paused for a memory-preserving Live Migration operation.",
      "EventSource": "Platform",
      "DurationInSeconds": 5
    }
  ]
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;You can poll this endpoint and use the lead time to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Drain connections&lt;/STRONG&gt; so active users aren't affected&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Checkpoint application state&lt;/STRONG&gt; to recover faster&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Remove the VM from your load balancer&lt;/STRONG&gt; temporarily&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Log the event&lt;/STRONG&gt; so you have a record of migration frequency&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Here's a simple polling script in Python:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import json
import time

import requests

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents"
HEADERS = {"Metadata": "true"}
PARAMS = {"api-version": "2020-07-01"}

def get_scheduled_events():
    # A short timeout keeps the loop from hanging if IMDS is slow to respond.
    response = requests.get(ENDPOINT, headers=HEADERS, params=PARAMS, timeout=10)
    response.raise_for_status()
    return response.json()

def handle_events(data):
    for event in data.get("Events", []):
        print(f"[{event['EventType']}] {event.get('Description', 'No description')}")
        print(f"  Status: {event['EventStatus']}, Not Before: {event['NotBefore']}")
        print(f"  Duration: {event['DurationInSeconds']}s, Source: {event['EventSource']}")
        # Your graceful drain/checkpoint logic here, then:
        # approve_event(event["EventId"])

def approve_event(event_id):
    """Acknowledge the event so Azure can proceed immediately."""
    payload = json.dumps({"StartRequests": [{"EventId": event_id}]})
    requests.post(ENDPOINT, headers=HEADERS, params=PARAMS, data=payload, timeout=10)

# Poll frequently - the official docs recommend every 1 second for production.
# Adjust based on your workload sensitivity.
last_incarnation = None
while True:
    data = get_scheduled_events()
    # DocumentIncarnation changes whenever the event list changes,
    # so only handle a document we haven't seen yet.
    if data.get("DocumentIncarnation") != last_incarnation:
        last_incarnation = data.get("DocumentIncarnation")
        handle_events(data)
    time.sleep(1)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Or a quick check in Bash:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;curl -s -H "Metadata:true" --noproxy "*" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | jq .&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;STRONG&gt;Event approval:&lt;/STRONG&gt; Once your application has drained connections or checkpointed state, it can approve the event by POSTing back with the &lt;CODE&gt;EventId&lt;/CODE&gt;. This tells Azure your app is ready, and the platform can proceed without waiting for the full timeout. If you don't explicitly approve, Azure proceeds when the &lt;CODE&gt;NotBefore&lt;/CODE&gt; time is reached.&lt;/P&gt;
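&lt;P&gt;Before approving, it helps to know how much lead time is left so your drain logic can decide whether to finish or cut short. Here is a minimal sketch, assuming the &lt;CODE&gt;NotBefore&lt;/CODE&gt; format shown in the sample response. The helper names &lt;CODE&gt;seconds_until&lt;/CODE&gt; and &lt;CODE&gt;pending_freezes&lt;/CODE&gt; are mine, not part of the Scheduled Events API, and the fixed &lt;CODE&gt;now&lt;/CODE&gt; value is there only to make the example reproducible.&lt;/P&gt;

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def seconds_until(not_before, now=None):
    """Seconds left before Azure may start the event (0 if already due)."""
    now = now or datetime.now(timezone.utc)
    deadline = parsedate_to_datetime(not_before)
    return max(0.0, (deadline - now).total_seconds())

def pending_freezes(document):
    """Return Freeze events that are still in the Scheduled state."""
    return [e for e in document.get("Events", [])
            if e.get("EventType") == "Freeze" and e.get("EventStatus") == "Scheduled"]

doc = {"Events": [{"EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
                   "EventType": "Freeze", "EventStatus": "Scheduled",
                   "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT"}]}
now = datetime(2026, 4, 22, 19, 5, 47, tzinfo=timezone.utc)  # fixed for the example
for event in pending_freezes(doc):
    print(event["EventId"], seconds_until(event["NotBefore"], now))  # 720.0 seconds left
```

&lt;P&gt;Filtering on the &lt;CODE&gt;Scheduled&lt;/CODE&gt; status first is deliberate: it keeps the date parsing away from events that are already under way.&lt;/P&gt;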
&lt;P&gt;If you're seeing elevated frequency of live migrations, this data lets you quantify the pattern (how often, what times, what durations) and bring hard numbers to a support conversation instead of "it feels like it's happening a lot."&lt;/P&gt;
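&lt;P&gt;If your polling loop writes each event to a log, turning that log into numbers takes only a few lines. The sketch below assumes the logged entries keep the same field names as the API response; the &lt;CODE&gt;summarize&lt;/CODE&gt; helper and the sample entries are made up for illustration.&lt;/P&gt;

```python
from collections import Counter
from email.utils import parsedate_to_datetime

def summarize(events):
    """Summarize logged events: count per type, count per UTC hour, longest pause."""
    by_type = Counter(e["EventType"] for e in events)
    by_hour = Counter(parsedate_to_datetime(e["NotBefore"]).hour for e in events)
    longest = max((e.get("DurationInSeconds", 0) for e in events), default=0)
    return by_type, by_hour, longest

# Made-up log entries for illustration; in practice these would come from
# whatever your polling loop wrote out. A duration of -1 is treated here as
# "unknown", which is an assumption about how the API reports it.
log = [
    {"EventType": "Freeze", "NotBefore": "Wed, 22 Apr 2026 19:17:47 GMT", "DurationInSeconds": 5},
    {"EventType": "Freeze", "NotBefore": "Thu, 23 Apr 2026 19:40:02 GMT", "DurationInSeconds": 5},
    {"EventType": "Reboot", "NotBefore": "Fri, 24 Apr 2026 03:05:10 GMT", "DurationInSeconds": -1},
]
by_type, by_hour, longest = summarize(log)
print(dict(by_type), dict(by_hour), longest)
```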
&lt;P&gt;&lt;STRONG&gt;Docs:&lt;/STRONG&gt; &lt;A href="https://learn.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events" target="_blank" rel="noopener"&gt;Scheduled Events for VMs&lt;/A&gt;&lt;/P&gt;
&lt;H2&gt;Step 4: Check your overall posture (Azure Advisor)&lt;/H2&gt;
&lt;P&gt;While you're at it, check &lt;STRONG&gt;Azure Advisor's Reliability recommendations&lt;/STRONG&gt; for your VMs. It flags things like:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;VMs not deployed in availability zones&lt;/LI&gt;
&lt;LI&gt;Deprecated VM images that need updating&lt;/LI&gt;
&lt;LI&gt;Missing backup configurations&lt;/LI&gt;
&lt;LI&gt;Other resiliency gaps that make you more susceptible to availability issues&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Advisor won't explain a past incident, but it can help prevent the next one.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Docs:&lt;/STRONG&gt; &lt;A href="https://learn.microsoft.com/en-us/azure/advisor/advisor-reference-reliability-recommendations" target="_blank" rel="noopener"&gt;Azure Advisor Reliability recommendations&lt;/A&gt;&lt;/P&gt;
&lt;H2&gt;A quick note on resilience&lt;/H2&gt;
&lt;P&gt;These tools improve your visibility and response time, but they don't eliminate downtime by themselves. If a VM is truly critical, pair this monitoring with basic resilience patterns: multiple instances behind a load balancer, availability zones, health probes, regular backups, and cross-region recovery where needed. Monitoring tells you what's happening. Architecture determines whether it matters.&lt;/P&gt;
&lt;H2&gt;The setup checklist&lt;/H2&gt;
&lt;H3&gt;Quick wins (15 minutes)&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="width: 100%; height: 246px; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr style="height: 34.8px;"&gt;&lt;th style="height: 34.8px;"&gt;#&lt;/th&gt;&lt;th style="height: 34.8px;"&gt;What&lt;/th&gt;&lt;th style="height: 34.8px;"&gt;Why&lt;/th&gt;&lt;th style="height: 34.8px;"&gt;Time&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr style="height: 58.8px;"&gt;&lt;td style="height: 58.8px;"&gt;1&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;Check Resource Health on your production VMs&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;See if there are past events you didn't know about&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;2 min&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 58.8px;"&gt;&lt;td style="height: 58.8px;"&gt;2&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;Create a Service Health alert for your regions/services&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;Get notified about platform issues proactively&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;3 min&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 58.8px;"&gt;&lt;td style="height: 58.8px;"&gt;3&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;Create Resource Health alerts for your VM resource groups&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;Get notified when any VM changes health state&lt;/td&gt;&lt;td style="height: 58.8px;"&gt;3 min&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.8px;"&gt;&lt;td style="height: 34.8px;"&gt;4&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;Review Azure Advisor Reliability tab&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;Fix any posture gaps&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;2 min&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;Advanced hardening (1+ hours depending on your app)&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;#&lt;/th&gt;&lt;th&gt;What&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;Deploy the Scheduled Events polling script on critical VMs&lt;/td&gt;&lt;td&gt;Get advance notice of live migrations and maintenance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;Implement drain/checkpoint logic tied to Scheduled Events&lt;/td&gt;&lt;td&gt;Handle maintenance gracefully and minimize user impact&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;Wire event approvals into your automation&lt;/td&gt;&lt;td&gt;Let Azure proceed as soon as your app is ready instead of waiting for &lt;CODE&gt;NotBefore&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Wrapping up&lt;/H2&gt;
&lt;P&gt;The pattern I keep seeing is teams treating Azure VM monitoring as something they'll get to "later." Then an incident happens, the RCA takes longer than anyone wants, and everyone wishes they had visibility sooner.&lt;/P&gt;
&lt;P&gt;The tools are already there. Resource Health tells you what happened. Service Health and Resource Health alerts tell you when it's happening. Scheduled Events tells you before it happens. And Advisor helps you make sure your setup is resilient in the first place.&lt;/P&gt;
&lt;P&gt;Fifteen minutes of setup for the quick wins, and you're in a fundamentally better place than most teams running VMs on Azure today.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Apr 2026 15:49:46 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/your-azure-vm-went-down-and-nobody-knew-why-here-s-how-to-fix/ba-p/4513733</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-04-22T15:49:46Z</dc:date>
    </item>
    <item>
      <title>Role Structures, Anti-Patterns, and the 10 Governance Principles</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/role-structures-anti-patterns-and-the-10-governance-principles/ba-p/4510070</link>
      <description>&lt;P&gt;Part 3 of 3: The implementation playbook for engineering, finance, and security teams&lt;/P&gt;
&lt;P&gt;In&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-has-three-permission-systems-and-youre-probably-confusing-them/4471854" target="_blank" rel="noopener" data-lia-auto-title="Part 1" data-lia-auto-title-active="0"&gt;Part 1&lt;/A&gt;, we established Azure's three-plane model: Entra for identity, RBAC for resources, Commerce for billing. In&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/marketplace-governance-and-the-cross-plane-bridge/4510067" target="_blank" rel="noopener" data-lia-auto-title="Part 2" data-lia-auto-title-active="0"&gt;Part 2&lt;/A&gt;, we explored where those planes collide: Marketplace governance, Managed Identity, and ABAC.&lt;/P&gt;
&lt;P&gt;Now it's time to get practical. This post covers the patterns that work, the anti-patterns that don't, and the governance principles that every digital-native company should adopt&amp;nbsp;&lt;EM&gt;before&lt;/EM&gt;&amp;nbsp;they're forced to adopt them after an incident.&lt;/P&gt;
&lt;H2&gt;7 anti-patterns to avoid&lt;/H2&gt;
&lt;P&gt;These seven anti-patterns appear repeatedly across AI, SaaS, and digital-native customers. Every one of them has caused real incidents — surprise invoices, accidental deletions, compliance failures, or governance breakdowns.&lt;/P&gt;
&lt;H3&gt;❌ Anti-Pattern 1: Giving engineers billing permissions&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt; Engineers are given Billing Reader or Billing Contributor roles "so they can see costs." They can now see MACC credits, private offer terms, commercial discounts, and Marketplace purchase history, none of which they need.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Symptoms:&lt;/STRONG&gt;&amp;nbsp;Engineers purchasing Marketplace SaaS without oversight. Surprise invoices. Procurement loses visibility into vendor commitments.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Fix:&lt;/STRONG&gt;&amp;nbsp;Engineers need&amp;nbsp;&lt;STRONG&gt;Cost Management Reader&lt;/STRONG&gt;&amp;nbsp;(RBAC) for usage-based cost visibility. They do&amp;nbsp;&lt;EM&gt;not&lt;/EM&gt; need billing roles. If they need to understand MACC impact, create a reporting process instead of giving them the keys.&lt;/P&gt;
&lt;H3&gt;❌ Anti-Pattern 2: Giving finance subscription owner access&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&amp;nbsp;Finance teams are given Owner or Contributor roles on subscriptions "so they can track spending." They now have the ability to deploy, modify, and delete production resources.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Symptoms:&lt;/STRONG&gt; Massive over-permissioning. Finance can accidentally delete production resources. Audit risk: regulators will flag this.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Fix:&lt;/STRONG&gt;&amp;nbsp;Finance roles belong in the&amp;nbsp;&lt;STRONG&gt;Billing plane&lt;/STRONG&gt;, not the resource plane. Give finance&amp;nbsp;&lt;STRONG&gt;Billing Reader&lt;/STRONG&gt;&amp;nbsp;for credit and invoice visibility. If they also need resource cost data, add&amp;nbsp;&lt;STRONG&gt;Cost Management Reader&lt;/STRONG&gt;&amp;nbsp;(RBAC) scoped to the appropriate subscriptions — that's a read-only, resource-plane role.&lt;/P&gt;
&lt;H3&gt;❌ Anti-Pattern 3: Too many subscription owners&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&amp;nbsp;Every senior engineer, team lead, and sometimes product managers get Owner on subscriptions. The logic: "they need to unblock themselves."&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Symptoms:&lt;/STRONG&gt; No accountability: when everyone is Owner, nobody is. High blast radius. Hard to trace role assignments when troubleshooting.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Fix:&lt;/STRONG&gt;&amp;nbsp;Maximum&amp;nbsp;&lt;STRONG&gt;2–3 Owners&lt;/STRONG&gt;&amp;nbsp;per subscription: Platform Lead, SRE Lead, and optionally the Cloud Architect. Everyone else gets Contributor or scoped roles. Use PIM for emergency elevation.&lt;/P&gt;
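&lt;P&gt;A minimal sketch of how the Owner cap could be checked, assuming role assignments have already been exported into a list of dictionaries. The field names (&lt;CODE&gt;principal&lt;/CODE&gt;, &lt;CODE&gt;role&lt;/CODE&gt;, &lt;CODE&gt;scope&lt;/CODE&gt;), the sample data, and the function name are illustrative assumptions, not an Azure API.&lt;/P&gt;

```python
from collections import defaultdict

MAX_OWNERS = 3  # upper end of the 2-3 Owner guideline


def scopes_over_owner_cap(assignments, max_owners=MAX_OWNERS):
    """Return {scope: sorted Owner principals} for scopes above the cap.

    `assignments` is a hypothetical export: a list of dicts with
    'principal', 'role', and 'scope' keys.
    """
    owners = defaultdict(set)
    for assignment in assignments:
        if assignment["role"] == "Owner":
            owners[assignment["scope"]].add(assignment["principal"])
    return {
        scope: sorted(principals)
        for scope, principals in owners.items()
        if len(principals) > max_owners
    }


# Hypothetical export; real data would come from your own tooling.
sample = [
    {"principal": "platform-lead", "role": "Owner", "scope": "sub-prod"},
    {"principal": "sre-lead", "role": "Owner", "scope": "sub-prod"},
    {"principal": "cloud-architect", "role": "Owner", "scope": "sub-prod"},
    {"principal": "team-lead-a", "role": "Owner", "scope": "sub-prod"},
    {"principal": "dev-1", "role": "Contributor", "scope": "sub-prod"},
    {"principal": "platform-lead", "role": "Owner", "scope": "sub-dev"},
]

flagged = scopes_over_owner_cap(sample)
```

&lt;P&gt;With this sample, only &lt;CODE&gt;sub-prod&lt;/CODE&gt; is flagged, because it has four distinct Owners. If Owner is assigned to a group, the sketch counts the group as one principal, so group membership would need a separate review.&lt;/P&gt;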
&lt;H3&gt;❌ Anti-Pattern 4: Believing Entra Global Admin = Azure Owner&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt; Leadership assumes Global Admin has universal access: subscriptions, resources, billing. It doesn't. By default, Global Admin controls the identity plane only.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Symptoms:&lt;/STRONG&gt;&amp;nbsp;Security teams thinking they can see all resources (they can't). Incorrect governance designs that assume Entra = RBAC.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Fix:&lt;/STRONG&gt;&amp;nbsp;Train leadership explicitly:&amp;nbsp;&lt;STRONG&gt;Entra ≠ RBAC ≠ Billing&lt;/STRONG&gt;. Three planes, three sets of roles, zero overlap. A Global Admin who needs resource access must be separately granted RBAC roles.&lt;/P&gt;
&lt;H3&gt;❌ Anti-Pattern 5: Deploying marketplace SaaS without finance&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&amp;nbsp;Engineers purchase Marketplace tools directly because they have billing permissions (see Anti-Pattern 1) or because the org hasn't restricted Marketplace purchases.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Symptoms:&lt;/STRONG&gt;&amp;nbsp;Incorrect MACC burn. Licensing duplicates. Vendor lock-in without legal review. Private offer terms not applied.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Fix:&lt;/STRONG&gt;&amp;nbsp;Require finance approval for all paid Marketplace purchases. Follow the five-step workflow from&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/marketplace-governance-and-the-cross-plane-bridge/4510067" target="_blank" rel="noopener" data-lia-auto-title="Part 2" data-lia-auto-title-active="0"&gt;Part 2&lt;/A&gt;: Engineer requests → Finance reviews → Billing executes → Engineering deploys → Cost monitoring activated.&lt;/P&gt;
&lt;H3&gt;❌ Anti-Pattern 6: Mixed dev/test/prod in one subscription&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&amp;nbsp;To save time, teams put all environments in one subscription.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Symptoms:&lt;/STRONG&gt;&amp;nbsp;Can't isolate production costs. A Contributor on the sub can modify both dev and prod. Can't enforce stricter policies on prod without affecting dev. Compliance teams can't get clean boundaries.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Fix:&lt;/STRONG&gt;&amp;nbsp;Separate subscriptions by environment. Pattern:&amp;nbsp;&lt;STRONG&gt;1 subscription per environment per workload&lt;/STRONG&gt;&amp;nbsp;(or at minimum per environment). Use cross-subscription networking via Hub &amp;amp; Spoke or Landing Zones.&lt;/P&gt;
&lt;H3&gt;❌ Anti-Pattern 7: Not using Azure Policy&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&amp;nbsp;Teams deploy freely with no guardrails. Over time: VMs in unapproved regions, GPU SKUs in non-production, storage accounts without encryption, missing tags, public IP drift.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Symptoms:&lt;/STRONG&gt;&amp;nbsp;Inconsistent regions. Wrong VM families. Missing tags make cost attribution impossible. Non-compliant configurations.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Fix:&lt;/STRONG&gt; Adopt Azure Policy early, at Management Group scope. Critical policies: allowed locations, allowed VM SKUs, enforce HTTPS, enforce private endpoints, enforce tagging (environment, owner, cost-center).&lt;/P&gt;
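&lt;P&gt;To make the intent of two of those policies concrete (allowed locations and required tags), here is a small local sketch. It is not Azure Policy and does not call any Azure API. The resource dictionary shape, the function name, and the sample values are assumptions for illustration; the tag names come from the list above.&lt;/P&gt;

```python
REQUIRED_TAGS = ("environment", "owner", "cost-center")


def find_violations(resource, allowed_locations, required_tags=REQUIRED_TAGS):
    """Return a list of human-readable violations for one resource.

    `resource` is a hypothetical dict with 'name', 'location', and 'tags'.
    This only mirrors the intent of the policies; Azure Policy itself
    evaluates resources on the platform side.
    """
    violations = []
    name = resource["name"]
    if resource["location"] not in allowed_locations:
        violations.append(f"{name}: location {resource['location']} not allowed")
    tags = resource.get("tags", {})
    for tag in required_tags:
        if not tags.get(tag):  # absent or empty both count as missing
            violations.append(f"{name}: missing tag {tag}")
    return violations


# Hypothetical resource that sits in the wrong region and lacks two tags.
vm = {"name": "vm1", "location": "westus", "tags": {"environment": "prod"}}
problems = find_violations(vm, allowed_locations={"eastus", "westeurope"})
```

&lt;P&gt;A check like this could run in a pull-request pipeline as an early warning, while the real enforcement stays with Azure Policy at Management Group scope.&lt;/P&gt;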
&lt;H2&gt;Recommended role structure&lt;/H2&gt;
&lt;P&gt;Based on experience with dozens of digital-native customers, here's the role structure that works across the three planes.&lt;/P&gt;
&lt;H3&gt;Engineering plane (RBAC)&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;2–3 subscription Owners:&amp;nbsp;&lt;/STRONG&gt;Platform Lead, SRE Lead, Cloud Architect&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Platform/SRE team&lt;/STRONG&gt;&amp;nbsp;as&amp;nbsp;&lt;STRONG&gt;Contributors:&amp;nbsp;&lt;/STRONG&gt;deploy and manage infrastructure&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Developers&lt;/STRONG&gt;&amp;nbsp;as&amp;nbsp;&lt;STRONG&gt;RG-scoped Contributors or Readers:&amp;nbsp;&lt;/STRONG&gt;limited to their workload's resource group&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cost Management Reader&lt;/STRONG&gt; for budget owners: usage visibility without deployment rights&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Policy&lt;/STRONG&gt; for guardrails: VM SKUs, regions, encryption, tags&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Management Groups&lt;/STRONG&gt;&amp;nbsp;for organizational structure&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Finance plane (Commerce)&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Billing Account Owner&lt;/STRONG&gt;&amp;nbsp;= CFO or Finance Director&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Billing Contributor&lt;/STRONG&gt;&amp;nbsp;= Finance Operations&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Billing Reader&lt;/STRONG&gt;&amp;nbsp;= FP&amp;amp;A and financial analysts&lt;/LI&gt;
&lt;LI&gt;All Marketplace-paid offers require finance approval&lt;/LI&gt;
&lt;LI&gt;MACC visibility restricted to finance roles&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Identity/Security plane (Entra)&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;2–4 Global Admins&lt;/STRONG&gt;&amp;nbsp;(break-glass accounts included)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;PIM enforced&lt;/STRONG&gt; for all privileged roles: no permanent admin access&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Conditional Access&lt;/STRONG&gt;&amp;nbsp;for all admin roles (MFA, compliant device, block legacy auth)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Groups&lt;/STRONG&gt; used for RBAC assignment; never assign RBAC to individual users&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Workload identities&lt;/STRONG&gt;&amp;nbsp;(Managed Identity) preferred over service principals&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Role mapping templates&lt;/H2&gt;
&lt;P&gt;Copy these into your onboarding documentation.&lt;/P&gt;
&lt;H3&gt;Engineering Team&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;Azure Role&lt;/th&gt;&lt;th&gt;Plane&lt;/th&gt;&lt;th&gt;Allowed actions&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Cloud Architect&lt;/td&gt;&lt;td&gt;Owner (2–3 per sub)&lt;/td&gt;&lt;td&gt;RBAC&lt;/td&gt;&lt;td&gt;Govern workloads, assign roles, manage infrastructure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Platform / SRE&lt;/td&gt;&lt;td&gt;Contributor&lt;/td&gt;&lt;td&gt;RBAC&lt;/td&gt;&lt;td&gt;Deploy and manage infrastructure&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Developer&lt;/td&gt;&lt;td&gt;Contributor or Reader (RG-scoped)&lt;/td&gt;&lt;td&gt;RBAC&lt;/td&gt;&lt;td&gt;Deploy to specific resource groups&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Budget Owner&lt;/td&gt;&lt;td&gt;Cost Management Reader&lt;/td&gt;&lt;td&gt;RBAC&lt;/td&gt;&lt;td&gt;View usage-based cost and budgets (read-only; editing budgets needs Cost Management Contributor); no billing access&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;Finance Team&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;Azure Role&lt;/th&gt;&lt;th&gt;Plane&lt;/th&gt;&lt;th&gt;Allowed actions&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Finance Lead&lt;/td&gt;&lt;td&gt;Billing Account Owner&lt;/td&gt;&lt;td&gt;Billing&lt;/td&gt;&lt;td&gt;View and manage credits, invoices, MACC, payment methods&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Finance Analyst&lt;/td&gt;&lt;td&gt;Billing Reader&lt;/td&gt;&lt;td&gt;Billing&lt;/td&gt;&lt;td&gt;Read-only billing visibility&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FP&amp;amp;A&lt;/td&gt;&lt;td&gt;Billing Reader&lt;/td&gt;&lt;td&gt;Billing&lt;/td&gt;&lt;td&gt;Read-only; no deployments, no resource access&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;Leadership&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;Azure Role&lt;/th&gt;&lt;th&gt;Plane&lt;/th&gt;&lt;th&gt;Allowed actions&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;CTO / VP Engineering&lt;/td&gt;&lt;td&gt;Reader or Cost Mgmt Reader&lt;/td&gt;&lt;td&gt;RBAC&lt;/td&gt;&lt;td&gt;Visibility into platform and resource costs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CFO&lt;/td&gt;&lt;td&gt;Billing Reader&lt;/td&gt;&lt;td&gt;Billing&lt;/td&gt;&lt;td&gt;Visibility into credits, invoices, MACC, commitments&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;RACI Matrix&lt;/H2&gt;
&lt;P&gt;Adapted from the Microsoft&amp;nbsp;&lt;A href="https://learn.microsoft.com/azure/cloud-adoption-framework/organize/raci-alignment" target="_blank" rel="noopener"&gt;Cloud Adoption Framework&lt;/A&gt;.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="width: 72.1296%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Function&lt;/th&gt;&lt;th&gt;Accountable&lt;/th&gt;&lt;th&gt;Responsible&lt;/th&gt;&lt;th&gt;Consulted&lt;/th&gt;&lt;th&gt;Informed&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Billing account roles &amp;amp; access&lt;/td&gt;&lt;td&gt;Finance Lead&lt;/td&gt;&lt;td&gt;Finance Ops&lt;/td&gt;&lt;td&gt;Cloud Architect&lt;/td&gt;&lt;td&gt;Engineering&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Subscription role assignments&lt;/td&gt;&lt;td&gt;Cloud Architect&lt;/td&gt;&lt;td&gt;Platform / SRE&lt;/td&gt;&lt;td&gt;Finance, Security&lt;/td&gt;&lt;td&gt;Engineering&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cost monitoring &amp;amp; budgets&lt;/td&gt;&lt;td&gt;Finance&lt;/td&gt;&lt;td&gt;Engineering&lt;/td&gt;&lt;td&gt;Leadership&lt;/td&gt;&lt;td&gt;All teams&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Marketplace purchases&lt;/td&gt;&lt;td&gt;Finance Lead&lt;/td&gt;&lt;td&gt;Finance Ops&lt;/td&gt;&lt;td&gt;Engineering, Legal&lt;/td&gt;&lt;td&gt;CFO&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;IaC / Deployment governance&lt;/td&gt;&lt;td&gt;Platform Lead&lt;/td&gt;&lt;td&gt;Engineers&lt;/td&gt;&lt;td&gt;Security&lt;/td&gt;&lt;td&gt;Finance&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Policies &amp;amp; guardrails&lt;/td&gt;&lt;td&gt;Security / Cloud Architect&lt;/td&gt;&lt;td&gt;Platform Team&lt;/td&gt;&lt;td&gt;Engineering&lt;/td&gt;&lt;td&gt;Leadership&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Identity &amp;amp; access governance&lt;/td&gt;&lt;td&gt;Security Lead&lt;/td&gt;&lt;td&gt;Identity Admin&lt;/td&gt;&lt;td&gt;Cloud Architect&lt;/td&gt;&lt;td&gt;All teams&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PIM &amp;amp; Conditional Access&lt;/td&gt;&lt;td&gt;Security Lead&lt;/td&gt;&lt;td&gt;Identity Admin&lt;/td&gt;&lt;td&gt;Platform Lead&lt;/td&gt;&lt;td&gt;Engineering&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MACC tracking &amp;amp; credit visibility&lt;/td&gt;&lt;td&gt;Finance Lead&lt;/td&gt;&lt;td&gt;Finance Ops&lt;/td&gt;&lt;td&gt;Cloud Architect&lt;/td&gt;&lt;td&gt;Leadership&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Include this template in your onboarding documentation and review it quarterly.&lt;/P&gt;
&lt;H2&gt;Best Practices&lt;/H2&gt;
&lt;H3&gt;Use Entra Groups for RBAC assignment, never assign directly to users&lt;/H3&gt;
&lt;P&gt;Benefits: clear separation of identity and resource planes, easy onboarding/offboarding, predictable RBAC inheritance, and support for PIM group-based elevation.&lt;/P&gt;
&lt;P&gt;Naming pattern:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;grp-sub-&amp;lt;SubscriptionName&amp;gt;-Owner&lt;/LI&gt;
&lt;LI&gt;grp-sub-&amp;lt;SubscriptionName&amp;gt;-Contributor&lt;/LI&gt;
&lt;LI&gt;grp-rg-&amp;lt;WorkloadName&amp;gt;-Reader&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Assign the&amp;nbsp;&lt;STRONG&gt;group&lt;/STRONG&gt;&amp;nbsp;to the role, not individual users.&lt;/P&gt;
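&lt;P&gt;The naming pattern above can be generated rather than typed by hand. This is a minimal sketch: the helper name, the accepted role set, and the alphanumeric-only rule for names are assumptions chosen for illustration, so adjust them to your own convention.&lt;/P&gt;

```python
import re

ROLES = {"Owner", "Contributor", "Reader"}  # assumed set of role suffixes
SCOPE_KINDS = ("sub", "rg")  # subscription or resource group


def group_name(scope_kind, name, role):
    """Build a group name in the form grp-(sub|rg)-(name)-(role)."""
    if scope_kind not in SCOPE_KINDS:
        raise ValueError(f"unknown scope kind: {scope_kind}")
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    # Assumption: names are alphanumeric so the hyphen stays a separator.
    if not re.fullmatch(r"[A-Za-z0-9]+", name):
        raise ValueError(f"name must be alphanumeric: {name}")
    return f"grp-{scope_kind}-{name}-{role}"


prod_owner_group = group_name("sub", "Prod", "Owner")
payments_reader_group = group_name("rg", "Payments", "Reader")
```

&lt;P&gt;Rejecting unknown roles and scope kinds up front keeps typos out of group names, which matters once access reviews and automation start matching on them.&lt;/P&gt;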
&lt;H3&gt;Enforce PIM + Conditional Access for all privileged roles&lt;/H3&gt;
&lt;P&gt;Key CA policies: MFA required for all admins, compliant device requirement, block legacy authentication, block sign-in from high-risk locations, require phishing-resistant MFA.&lt;/P&gt;
&lt;P&gt;No permanent admin access. Use time-based elevation for every privileged operation.&lt;/P&gt;
&lt;H3&gt;Separate subscriptions by environment and workload&lt;/H3&gt;
&lt;P&gt;Subscriptions are a security boundary. Pattern: 1 subscription per environment per workload. Platform teams get their own subscription. Use Hub &amp;amp; Spoke or Landing Zones for cross-subscription networking.&lt;/P&gt;
&lt;H3&gt;Keep billing data confidential&lt;/H3&gt;
&lt;P&gt;Only Billing roles should see credits, commitments, discounts, invoices, and MACC balance. Engineers should never have access to commercial data.&lt;/P&gt;
&lt;H2&gt;The 10 Principles of Azure Governance&lt;/H2&gt;
&lt;P&gt;After working with digital natives across AI, SaaS, and infrastructure companies, I can summarize Azure governance into these principles:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;#&lt;/th&gt;&lt;th&gt;Principle&lt;/th&gt;&lt;th&gt;Summary&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;1&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Separate identity, resources, and billing. Always.&lt;/td&gt;&lt;td&gt;Never mix roles across planes. An engineer should never hold billing roles. A finance analyst should never hold subscription Owner.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;2&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Engineering owns the resource plane.&lt;/td&gt;&lt;td&gt;Give them Contributor and Cost Management Reader. Don't burden them with billing or identity administration.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;3&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Finance owns the billing plane.&lt;/td&gt;&lt;td&gt;Credits, MACC, invoices, private offers. Every Marketplace purchase flows through Finance.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;4&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Security owns identity and governance.&lt;/td&gt;&lt;td&gt;PIM, Conditional Access, Azure Policy. Identity decisions should not be made by engineering or finance.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;5&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Keep subscription Owners scarce.&lt;/td&gt;&lt;td&gt;Maximum 2–3 per subscription. Use PIM for emergency elevation. Everyone else gets Contributor or scoped roles.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;6&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Lock down Marketplace.&lt;/td&gt;&lt;td&gt;Every SaaS purchase approved by Finance. No exceptions. Use the five-step workflow.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;7&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Use Infrastructure as Code.&lt;/td&gt;&lt;td&gt;Manual deployments don't scale and can't be audited. Use Bicep, Terraform, or Pulumi.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;8&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Use budgets early.&lt;/td&gt;&lt;td&gt;Set budgets at Management Group, Subscription, and Resource Group levels. Configure alerts to email, Teams, or automation.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;9&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Use Management Groups from day one.&lt;/td&gt;&lt;td&gt;Every startup that scales beyond a single subscription regrets not using them. Recommended hierarchy: Tenant Root → OrgName → Platform / Production / NonProduction / Sandbox / Shared Services.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;10&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Build governance before scale.&lt;/td&gt;&lt;td&gt;The companies that scale successfully treat Azure governance as infrastructure, not bureaucracy.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 5.46701%" /&gt;&lt;col style="width: 34.5824%" /&gt;&lt;col style="width: 59.932%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;References&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/overview" target="_blank" rel="noopener"&gt;Azure RBAC Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/rbac-and-directory-admin-roles" target="_blank" rel="noopener"&gt;Entra Directory &amp;amp; Admin Roles&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/cost-management-billing/manage/understand-mca-roles" target="_blank" rel="noopener"&gt;Billing Roles (Microsoft Customer Agreement)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/cost-management-billing/costs/assign-access-acm-data" target="_blank" rel="noopener"&gt;Assign Access to Cost Management Data&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/marketplace/azure-purchasing-invoicing" target="_blank" rel="noopener"&gt;Marketplace Purchases &amp;amp; Invoicing&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/marketplace/private-offers" target="_blank" rel="noopener"&gt;Private Offers in Azure Marketplace&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/conditions-overview" target="_blank" rel="noopener"&gt;Azure RBAC Conditions (ABAC)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/governance/policy/overview" target="_blank" rel="noopener"&gt;Azure Policy Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/cloud-adoption-framework/organize/raci-alignment" target="_blank" rel="noopener"&gt;Cloud Adoption Framework RACI&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/active-directory/managed-identities-azure-resources/overview" target="_blank" rel="noopener"&gt;Managed Identities Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/aks/workload-identity-overview" target="_blank" rel="noopener"&gt;AKS Workload Identity&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Closing thoughts&lt;/H2&gt;
&lt;P&gt;Azure's three permission planes aren't a problem to solve; they're a framework to leverage.&lt;/P&gt;
&lt;P&gt;The confusion happens when teams try to treat Azure as if it has a single permission system. It doesn't, and it never will, because identity, billing, and resource deployment are fundamentally different domains that must be operated and secured differently.&lt;/P&gt;
&lt;P&gt;But when organizations understand these three planes and structure their roles accordingly, something powerful happens:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Engineering moves faster.&lt;/STRONG&gt;&amp;nbsp;Clear RBAC scopes mean teams deploy without waiting for approvals they don't need.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Finance gains real oversight.&lt;/STRONG&gt;&amp;nbsp;Billing roles provide full commercial visibility without the risk of touching production resources.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Security gets a clean, enforceable boundary model.&lt;/STRONG&gt;&amp;nbsp;Entra controls identity; PIM and Conditional Access control elevation; Azure Policy controls the guardrails.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Leadership sees clarity instead of chaos.&lt;/STRONG&gt;&amp;nbsp;The right roles in the right planes mean dashboards, reports, and alerts actually reflect what each stakeholder needs.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Good governance doesn't slow down innovation.&amp;nbsp;&lt;STRONG&gt;Bad governance does.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The companies that scale successfully, whether AI-native, SaaS platforms, or global digital-first organizations, are the ones that adopt a clean, intentional model early. They treat Azure governance as infrastructure, not bureaucracy.&lt;/P&gt;
&lt;P&gt;The model is simple:&amp;nbsp;&lt;STRONG&gt;Entra for who. RBAC for what. Commerce for how you pay.&lt;/STRONG&gt;&amp;nbsp;Start with that, and everything else becomes easier.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;This concludes the 3-part series on Azure Governance for Digital Natives. For the full model, start with&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-has-three-permission-systems-and-youre-probably-confusing-them/4471854" target="_blank" rel="noopener" data-lia-auto-title="Part 1: The Three Permission Planes" data-lia-auto-title-active="0"&gt;Part 1: The Three Permission Planes&lt;/A&gt;. For collision points and Managed Identity, read&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/marketplace-governance-and-the-cross-plane-bridge/4510067" target="_blank" rel="noopener" data-lia-auto-title="Part 2: Marketplace Governance and the Cross-Plane Bridge" data-lia-auto-title-active="0"&gt;Part 2: Marketplace Governance and the Cross-Plane Bridge&lt;/A&gt;.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Apr 2026 21:25:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/role-structures-anti-patterns-and-the-10-governance-principles/ba-p/4510070</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-04-09T21:25:00Z</dc:date>
    </item>
    <item>
      <title>Marketplace governance and the cross-plane bridge</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/marketplace-governance-and-the-cross-plane-bridge/ba-p/4510067</link>
      <description>&lt;P&gt;&lt;EM&gt;Part 2 of 3: Where resource deployment meets financial authority and how to govern it&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;In&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-has-three-permission-systems-and-youre-probably-confusing-them/4471854" target="_blank" rel="noopener" data-lia-auto-title="Part 1" data-lia-auto-title-active="0"&gt;Part 1&lt;/A&gt;, we established the foundational model. Azure operates on three completely separate permission planes: &lt;STRONG&gt;Entra&lt;/STRONG&gt;&amp;nbsp;(identity),&amp;nbsp;&lt;STRONG&gt;RBAC&lt;/STRONG&gt;&amp;nbsp;(resources), and&amp;nbsp;&lt;STRONG&gt;Commerce&lt;/STRONG&gt;&amp;nbsp;(billing). A role in one plane grants zero access in the others.&lt;/P&gt;
&lt;P&gt;That model is clean in theory. But in practice, the planes collide. And when they do, teams get confused, purchases stall, and governance gaps appear.&lt;/P&gt;
&lt;P&gt;This post covers the biggest collision point:&amp;nbsp;&lt;STRONG&gt;Marketplace,&amp;nbsp;&lt;/STRONG&gt;where resource deployment meets financial authority. We'll also dig into&amp;nbsp;&lt;STRONG&gt;Managed Identity&lt;/STRONG&gt;&amp;nbsp;(the one construct that genuinely bridges two planes),&amp;nbsp;&lt;STRONG&gt;ABAC&lt;/STRONG&gt;&amp;nbsp;(advanced conditional governance within the resource plane), and the five-step Marketplace approval workflow every digital-native company should adopt.&lt;/P&gt;
&lt;H2&gt;Marketplace: Where the resource and billing planes intersect&lt;/H2&gt;
&lt;P&gt;Marketplace is the most common collision point between Azure's permission planes. Here's why: deploying an Azure resource and purchasing a Marketplace SaaS product feel like the same action from the Portal, but they are governed by completely different permission systems.&lt;/P&gt;
&lt;H3&gt;Deploying resources ≠ Purchasing SaaS&lt;/H3&gt;
&lt;P&gt;A Contributor can deploy any native Azure resource: VMs, Storage, AKS, Networking, Databases, Azure OpenAI. These are&amp;nbsp;&lt;STRONG&gt;resource plane&lt;/STRONG&gt;&amp;nbsp;operations governed by RBAC.&lt;/P&gt;
&lt;P&gt;But purchasing a third-party SaaS product through Marketplace (Datadog, Snowflake, Elastic, Confluent, MongoDB Atlas) is a&amp;nbsp;&lt;STRONG&gt;commercial transaction&lt;/STRONG&gt;. It creates a financial obligation between your organization and a vendor. That's the billing plane.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Deploying&lt;/STRONG&gt;&amp;nbsp;→ RBAC (Resource Plane)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Purchasing&lt;/STRONG&gt;&amp;nbsp;→ Commerce (Financial Plane)&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;The marketplace permission model&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Action&lt;/th&gt;&lt;th&gt;Requires RBAC?&lt;/th&gt;&lt;th&gt;Requires billing role?&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Deploy a VM&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Deploy AKS cluster&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Deploy Azure OpenAI&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Deploy Datadog agent extension&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Deploy Confluent cluster (Azure-native)&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Purchase Datadog SaaS plan&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Purchase Snowflake SaaS&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Accept Confluent SaaS contract&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;View Snowflake private offer&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Approve Marketplace private offer&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;This is why engineers often ask:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;"Why can't I buy Snowflake? I'm an Owner."&lt;BR /&gt;&lt;BR /&gt;Because&amp;nbsp;&lt;STRONG&gt;Owner&lt;/STRONG&gt;&amp;nbsp;has no financial authority. Owner is the highest role in the resource plane, but Marketplace SaaS purchases are commercial transactions that require billing plane permissions. These are different systems.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3&gt;The subtlety: Azure-Native vs. SaaS&lt;/H3&gt;
&lt;P&gt;Some vendors have both Azure-native integrations and SaaS offerings, which makes this even more confusing:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Datadog agent extension:&amp;nbsp;&lt;/STRONG&gt;deploys as an Azure resource → RBAC ✅&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Datadog SaaS plan:&amp;nbsp;&lt;/STRONG&gt;creates a billing relationship → Commerce ✅&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Confluent for Azure:&amp;nbsp;&lt;/STRONG&gt;deploys Kafka as an Azure resource → RBAC ✅&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Confluent Cloud SaaS contract:&amp;nbsp;&lt;/STRONG&gt;financial commitment → Commerce ✅&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;When an engineer deploys a Datadog agent via the Portal, everything works. When they try to subscribe to the Datadog SaaS plan, they hit a wall. Same vendor, same Portal, different permission plane.&lt;/P&gt;
&lt;H2&gt;The five-step marketplace purchase workflow&lt;/H2&gt;
&lt;P&gt;For digital natives operating with financial governance, every Marketplace purchase should follow this workflow:&lt;/P&gt;
&lt;H3&gt;Step 1: Engineer requests a SaaS or marketplace resource&lt;/H3&gt;
&lt;P&gt;The request should include: why it's needed, expected cost, impact on MACC, preferred vendor, and alternatives considered.&lt;/P&gt;
&lt;H3&gt;Step 2: Finance reviews commercial implications&lt;/H3&gt;
&lt;P&gt;Finance checks: MACC impact (does this purchase count toward the commitment?), budget alignment, available discounts (private offers), vendor validation, and contract terms.&lt;/P&gt;
&lt;H3&gt;Step 3: Billing role executes the purchase&lt;/H3&gt;
&lt;P&gt;A Billing Account Owner or Contributor completes the transaction in the Portal. This is a billing plane operation.&lt;/P&gt;
&lt;H3&gt;Step 4: Engineering deploys or configures the resource&lt;/H3&gt;
&lt;P&gt;Engineering handles SaaS connector setup, private offer entitlement, RBAC for workload integration, and data pipelines. This is a resource plane operation.&lt;/P&gt;
&lt;H3&gt;Step 5: Cost monitoring activated&lt;/H3&gt;
&lt;P&gt;Alerts configured, budgets set, tagging applied, forecasting enabled.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;This five-step workflow is simple, but most digital natives skip it&lt;/STRONG&gt; and end up with surprise invoices, unapproved vendor commitments, or MACC burn they didn't plan for.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;The one cross-plane bridge: Managed Identity&lt;/H2&gt;
&lt;P&gt;If the three-plane model is about separation, Managed Identity is the one construct that genuinely bridges two of those planes.&lt;/P&gt;
&lt;P&gt;A Managed Identity is an&amp;nbsp;&lt;STRONG&gt;Entra identity&lt;/STRONG&gt;&amp;nbsp;tied to an&amp;nbsp;&lt;STRONG&gt;Azure resource&lt;/STRONG&gt;&amp;nbsp;and authorized via&amp;nbsp;&lt;STRONG&gt;RBAC&lt;/STRONG&gt;. It lets Azure workloads authenticate to other Azure services without storing credentials in code, environment variables, or configuration files.&lt;/P&gt;
&lt;H3&gt;The cross-plane flow&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="width: 75%; height: 139.2px; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr style="height: 34.8px;"&gt;&lt;th style="height: 34.8px;"&gt;Step&lt;/th&gt;&lt;th style="height: 34.8px;"&gt;Plane&lt;/th&gt;&lt;th style="height: 34.8px;"&gt;What happens&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr style="height: 34.8px;"&gt;&lt;td style="height: 34.8px;"&gt;&lt;STRONG&gt;1. Identity created&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;Entra (Identity)&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;A service principal is registered in the directory&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.8px;"&gt;&lt;td style="height: 34.8px;"&gt;&lt;STRONG&gt;2. Access authorized&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;RBAC (Resource)&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;Role assignments grant access to specific resources&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 34.8px;"&gt;&lt;td style="height: 34.8px;"&gt;&lt;STRONG&gt;3. Identity used&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;Runtime (Resource)&lt;/td&gt;&lt;td style="height: 34.8px;"&gt;The workload requests a token from Entra and calls the target service&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;No secrets. No passwords. No key rotation. The identity lifecycle is managed by Azure itself.&lt;/P&gt;
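&lt;P&gt;The three steps in the table can be sketched with the Azure CLI. This is a minimal illustration under stated assumptions, not a reference implementation: the function name grant_app_secret_access and its arguments are hypothetical, the source is assumed to be an existing App Service app, and the Key Vault is assumed to use the Azure RBAC permission model.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper: give an App Service app a system-assigned identity
# and let that identity read secrets from one Key Vault.
# Usage: grant_app_secret_access RESOURCE_GROUP APP_NAME KEY_VAULT_RESOURCE_ID
grant_app_secret_access() {
  local resource_group="$1" app_name="$2" vault_id="$3"
  local principal_id

  # Step 1 (identity plane): enable the system-assigned identity.
  # Azure registers a service principal in Entra and returns its object ID.
  principal_id="$(az webapp identity assign \
    --resource-group "$resource_group" \
    --name "$app_name" \
    --query principalId --output tsv)"

  # Step 2 (resource plane): authorize the identity with an RBAC role,
  # scoped to the single vault rather than the whole subscription.
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Key Vault Secrets User" \
    --scope "$vault_id"

  # Step 3 happens in application code at runtime: the workload requests
  # a token from Entra and calls Key Vault with it. No secret is stored.
}&lt;/LI-CODE&gt;
&lt;P&gt;After az login, calling grant_app_secret_access with a resource group, an app name, and the vault's resource ID performs steps 1 and 2. Scoping the role to one vault keeps the assignment as narrow as the workload needs.&lt;/P&gt;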
&lt;H3&gt;AI workload examples&lt;/H3&gt;
&lt;P&gt;For digital natives building AI applications, Managed Identity is essential:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-none" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Source&lt;/th&gt;&lt;th&gt;Target&lt;/th&gt;&lt;th&gt;RBAC role needed&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;App calls Azure OpenAI&lt;/td&gt;&lt;td&gt;App Service / Container App&lt;/td&gt;&lt;td&gt;Azure OpenAI&lt;/td&gt;&lt;td&gt;Cognitive Services OpenAI User&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;App reads secrets&lt;/td&gt;&lt;td&gt;App Service / Container App&lt;/td&gt;&lt;td&gt;Key Vault&lt;/td&gt;&lt;td&gt;Key Vault Secrets User&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;App reads/writes blobs&lt;/td&gt;&lt;td&gt;App Service / Container App&lt;/td&gt;&lt;td&gt;Storage Account&lt;/td&gt;&lt;td&gt;Storage Blob Data Contributor&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AKS pod calls AOAI&lt;/td&gt;&lt;td&gt;AKS (Workload Identity)&lt;/td&gt;&lt;td&gt;Azure OpenAI&lt;/td&gt;&lt;td&gt;Cognitive Services OpenAI User&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;AKS pod reads secrets&lt;/td&gt;&lt;td&gt;AKS (Workload Identity)&lt;/td&gt;&lt;td&gt;Key Vault&lt;/td&gt;&lt;td&gt;Key Vault Secrets User&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Function processes events&lt;/td&gt;&lt;td&gt;Azure Function&lt;/td&gt;&lt;td&gt;Event Hub&lt;/td&gt;&lt;td&gt;Azure Event Hubs Data Receiver&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Pipeline reads training data&lt;/td&gt;&lt;td&gt;ML Workspace&lt;/td&gt;&lt;td&gt;Storage Account&lt;/td&gt;&lt;td&gt;Storage Blob Data Reader&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;System-Assigned vs. User-Assigned&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;System-assigned:&lt;/STRONG&gt;&amp;nbsp;Tied to a single resource. When the resource is deleted, the identity is deleted. Best for simple scenarios with one resource accessing one or a few target services.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;User-assigned:&lt;/STRONG&gt;&amp;nbsp;Created as a standalone resource. Can be assigned to multiple resources. Best for shared identity across microservices, AKS Workload Identity, or when the identity must persist independently.&lt;/P&gt;
&lt;H3&gt;AKS Workload Identity&lt;/H3&gt;
&lt;P&gt;AKS Workload Identity deserves special mention because it's the most common Managed Identity pattern in digital-native companies running Kubernetes. It works in five steps:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;A&amp;nbsp;&lt;STRONG&gt;User-Assigned Managed Identity&lt;/STRONG&gt;&amp;nbsp;is created in Azure&lt;/LI&gt;
&lt;LI&gt;A&amp;nbsp;&lt;STRONG&gt;Kubernetes Service Account&lt;/STRONG&gt;&amp;nbsp;is annotated with the identity's client ID&lt;/LI&gt;
&lt;LI&gt;A&amp;nbsp;&lt;STRONG&gt;Federated Identity Credential&lt;/STRONG&gt;&amp;nbsp;links the K8s service account to the Managed Identity&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;RBAC role assignments&lt;/STRONG&gt;&amp;nbsp;grant the Managed Identity access to target resources&lt;/LI&gt;
&lt;LI&gt;At runtime, the pod uses the service account to get an Entra token via workload identity federation&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This is Entra + RBAC + Kubernetes working together: identity plane creates the trust, resource plane authorizes the access, and the workload uses it at runtime.&lt;/P&gt;
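&lt;P&gt;The five steps above can be sketched as a single shell function. Treat it as an illustration under assumptions: the function name setup_workload_identity, its arguments, and the fic- credential name prefix are hypothetical, the cluster is assumed to have the OIDC issuer and workload identity features enabled, and kubectl is assumed to point at that cluster.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper that wires up AKS Workload Identity for one workload.
# Usage: setup_workload_identity RESOURCE_GROUP CLUSTER IDENTITY NAMESPACE SERVICE_ACCOUNT SCOPE ROLE
setup_workload_identity() {
  local rg="$1" cluster="$2" identity="$3" namespace="$4" sa="$5" scope="$6" role="$7"
  local client_id principal_id issuer

  # 1. Create the user-assigned managed identity and read back its IDs.
  az identity create --resource-group "$rg" --name "$identity" --output none
  client_id="$(az identity show --resource-group "$rg" --name "$identity" \
    --query clientId --output tsv)"
  principal_id="$(az identity show --resource-group "$rg" --name "$identity" \
    --query principalId --output tsv)"

  # 2. Create the Kubernetes service account and annotate it with the client ID.
  kubectl create serviceaccount "$sa" --namespace "$namespace"
  kubectl annotate serviceaccount "$sa" --namespace "$namespace" \
    "azure.workload.identity/client-id=${client_id}"

  # 3. Federated credential: trust tokens the cluster issues for this service account.
  issuer="$(az aks show --resource-group "$rg" --name "$cluster" \
    --query oidcIssuerProfile.issuerUrl --output tsv)"
  az identity federated-credential create \
    --name "fic-${sa}" \
    --identity-name "$identity" \
    --resource-group "$rg" \
    --issuer "$issuer" \
    --subject "system:serviceaccount:${namespace}:${sa}" \
    --audiences api://AzureADTokenExchange

  # 4. RBAC: authorize the managed identity on the target resource.
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" \
    --scope "$scope"

  # 5. Runtime: pods that use this service account exchange their
  # Kubernetes token for an Entra token; nothing to script here.
}&lt;/LI-CODE&gt;
&lt;P&gt;Pods still have to reference the service account in their spec, and they typically also need the azure.workload.identity/use label set to "true"; check the AKS Workload Identity documentation linked in the references for the current requirements.&lt;/P&gt;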
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Key insight:&lt;/STRONG&gt; Managed Identity bridges Entra and RBAC, but never touches the third plane (billing). No RBAC role assignment, whether it targets a managed identity or any other principal, grants visibility into MACC credits or the authority to approve Marketplace purchases.&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Advanced: Attribute-Based Access Control (ABAC)&lt;/H2&gt;
&lt;P&gt;ABAC extends RBAC with conditions based on resource attributes (tags), principal attributes, and request context. It is&amp;nbsp;&lt;STRONG&gt;not&lt;/STRONG&gt; a separate permission system; it's an enhancement to the resource plane.&lt;/P&gt;
&lt;P&gt;For example, you can write a role assignment that says:&amp;nbsp;&lt;EM&gt;"Allow Contributor access only to resources tagged&amp;nbsp;Environment = Dev"&lt;/EM&gt;&amp;nbsp;or&amp;nbsp;&lt;EM&gt;"Allow read access only to storage blobs under a specific path prefix."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;ABAC is particularly useful for:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Multi-tenant SaaS applications&lt;/STRONG&gt;&amp;nbsp;that need tenant isolation at the resource layer&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Regulated workloads&lt;/STRONG&gt;&amp;nbsp;that require fine-grained access control beyond what standard RBAC scopes provide&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;What ABAC cannot do:&lt;/STRONG&gt;&amp;nbsp;grant billing access, override Entra roles, access MACC, or purchase Marketplace products. It operates entirely within the RBAC resource plane.&lt;/P&gt;
&lt;P&gt;For implementation details, see:&amp;nbsp;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/conditions-overview" target="_blank" rel="noopener"&gt;Azure RBAC Conditions (ABAC)&lt;/A&gt;&lt;/P&gt;
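&lt;P&gt;As a rough sketch of the blob path example, a condition can be attached when the role assignment is created. The function name grant_prefix_read and its arguments are hypothetical, and the condition string follows the pattern used in the Azure storage condition examples as best understood here; verify it against the linked documentation, including which actions currently support conditions, before relying on it.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper: grant blob read access only under one path prefix.
# Usage: grant_prefix_read PRINCIPAL_OBJECT_ID STORAGE_ACCOUNT_RESOURCE_ID PATH_PREFIX
grant_prefix_read() {
  local principal_id="$1" scope="$2" prefix="$3"
  local read_action="Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
  local path_attr="Microsoft.Storage/storageAccounts/blobServices/containers/blobs:path"

  # The condition reads: either the request is not a blob read, or the
  # blob path starts with the prefix. Reads outside the prefix are not allowed.
  local condition="((!(ActionMatches{'${read_action}'} AND NOT SubOperationMatches{'Blob.List'})) OR (@Resource[${path_attr}] StringLike '${prefix}/*'))"

  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Reader" \
    --scope "$scope" \
    --condition "$condition" \
    --condition-version "2.0"
}&lt;/LI-CODE&gt;
&lt;P&gt;The role assignment still lives in the resource plane; the condition only narrows what the role allows. Keep the function in a script file rather than pasting it into an interactive shell, where the exclamation mark in the condition may trigger history expansion.&lt;/P&gt;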
&lt;H2&gt;References&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/overview" target="_blank" rel="noopener"&gt;Azure RBAC Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/active-directory/managed-identities-azure-resources/overview" target="_blank" rel="noopener"&gt;Managed Identities Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/aks/workload-identity-overview" target="_blank" rel="noopener"&gt;AKS Workload Identity&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/marketplace/azure-purchasing-invoicing" target="_blank" rel="noopener"&gt;Marketplace Purchases &amp;amp; Invoicing&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/marketplace/private-offers" target="_blank" rel="noopener"&gt;Private Offers in Azure Marketplace&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/cost-management-billing/manage/understand-mca-roles" target="_blank" rel="noopener"&gt;Billing Roles (Microsoft Customer Agreement)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/conditions-overview" target="_blank" rel="noopener"&gt;Azure RBAC Conditions (ABAC)&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;What's Next →&lt;/STRONG&gt;&amp;nbsp;We've now covered the three-plane model (&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-has-three-permission-systems-and-youre-probably-confusing-them/4471854" target="_blank" rel="noopener" data-lia-auto-title="Part 1" data-lia-auto-title-active="0"&gt;Part 1&lt;/A&gt;) and the biggest collision points: Marketplace, Managed Identity, and ABAC. In&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/role-structures-anti-patterns-and-the-10-governance-principles/4510070" data-lia-auto-title="Part 3" data-lia-auto-title-active="0" target="_blank"&gt;&lt;STRONG&gt;Part 3&lt;/STRONG&gt;&lt;/A&gt;, we get tactical: the&amp;nbsp;&lt;STRONG&gt;7 anti-patterns&lt;/STRONG&gt;&amp;nbsp;to avoid, recommended&amp;nbsp;&lt;STRONG&gt;role structures&lt;/STRONG&gt;&amp;nbsp;for Engineering, Finance, and Security teams,&amp;nbsp;&lt;STRONG&gt;RACI templates&lt;/STRONG&gt;, and the&amp;nbsp;&lt;STRONG&gt;10 core governance principles&lt;/STRONG&gt; every scaling organization should adopt.&amp;nbsp;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Thu, 09 Apr 2026 21:17:44 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/marketplace-governance-and-the-cross-plane-bridge/ba-p/4510067</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-04-09T21:17:44Z</dc:date>
    </item>
    <item>
      <title>Introducing the Startup-Scale Landing Zone: Get Azure right from day one</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/introducing-the-startup-scale-landing-zone-get-azure-right-from/ba-p/4501566</link>
      <description>&lt;H3&gt;&lt;SPAN style="color: rgb(30, 30, 30); font-size: 16px;"&gt;If you've been following this blog, you may recall the post&amp;nbsp;&lt;/SPAN&gt;&lt;A style="font-style: normal; font-weight: 400; background-color: rgb(255, 255, 255); font-size: 16px;" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/from-zero-to-hero-with-azure-landing-zones/4229195" target="_blank" rel="noopener" data-href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/from-zero-to-hero-with-azure-landing-zones/4229195"&gt;From Zero to Hero with Azure Landing Zones&lt;/A&gt;&lt;SPAN style="color: rgb(30, 30, 30); font-size: 16px;"&gt;, where we walked through the full Azure Landing Zone journey, from identity and RBAC to Platform and Application Landing Zones. That guide covered the&amp;nbsp;&lt;/SPAN&gt;&lt;EM style="color: rgb(30, 30, 30); font-size: 16px;"&gt;what&lt;/EM&gt;&lt;SPAN style="color: rgb(30, 30, 30); font-size: 16px;"&gt;&amp;nbsp;and the&amp;nbsp;&lt;/SPAN&gt;&lt;EM style="color: rgb(30, 30, 30); font-size: 16px;"&gt;why&lt;/EM&gt;&lt;SPAN style="color: rgb(30, 30, 30); font-size: 16px;"&gt;. This post introduces the&amp;nbsp;&lt;/SPAN&gt;&lt;EM style="color: rgb(30, 30, 30); font-size: 16px;"&gt;how,&amp;nbsp;&lt;/EM&gt;&lt;SPAN style="color: rgb(30, 30, 30); font-size: 16px;"&gt;a deployable, open-source project that distills those principles into something a startup can actually ship in an afternoon:&lt;/SPAN&gt;&lt;/H3&gt;
&lt;H2 data-line="4"&gt;The problem: Cloud foundations shouldn't take two months&lt;/H2&gt;
&lt;P data-line="6"&gt;Every startup building on Azure faces the same fork in the road:&lt;/P&gt;
&lt;P data-line="8"&gt;&lt;STRONG&gt;Option A:&lt;/STRONG&gt;&amp;nbsp;Follow the&amp;nbsp;&lt;A href="https://aka.ms/alz" target="_blank" rel="noopener" data-href="https://aka.ms/alz"&gt;Azure Landing Zone (ALZ)&lt;/A&gt; guidance. It's comprehensive, battle-tested, and designed for organizations with thousands of users. It's also 100+ modules, a multi-layered management group hierarchy, and months of work to understand, let alone implement. For a 10-person startup, it's like buying a commercial kitchen to make breakfast.&lt;/P&gt;
&lt;P data-line="10"&gt;&lt;STRONG&gt;Option B:&lt;/STRONG&gt; Skip governance entirely. One subscription, no policies, no budgets, no RBAC strategy. Ship fast now, deal with security debt later. This is what most startups actually do, and it works until the first security questionnaire from an enterprise customer, the first runaway cost incident, or the first az group delete that hits production.&lt;/P&gt;
&lt;P data-line="12"&gt;Neither option is right. Startups need a third path: just enough governance to be secure and cost-aware from day one, without the operational overhead that slows them down.&lt;/P&gt;
&lt;P data-line="14"&gt;That's exactly what the&amp;nbsp;&lt;A class="lia-external-url" href="https://startupscalelanding.zone" target="_blank" rel="noopener"&gt;Startup-Scale Landing Zone (SSLZ)&lt;/A&gt;&amp;nbsp;provides.&lt;/P&gt;
&lt;H2 data-line="16"&gt;What is the Startup-Scale Landing Zone?&lt;/H2&gt;
&lt;P data-line="18"&gt;SSLZ is an opinionated, production-ready Azure infrastructure template that deploys in&amp;nbsp;&lt;STRONG&gt;under one hour&lt;/STRONG&gt; using Bicep or Terraform. It's built for teams of 5–50 engineers, typically pre-seed to Series A, who don't have a dedicated platform team but need to get Azure right from the start.&lt;/P&gt;
&lt;P data-line="20"&gt;It takes the core principles from the Azure Landing Zone architecture and strips them to the essentials:&lt;/P&gt;
&lt;UL data-line="22"&gt;
&lt;LI data-line="22"&gt;&lt;STRONG&gt;One management group, two subscriptions&lt;/STRONG&gt;&amp;nbsp;(prod + non-prod). That's it. No six-layer hierarchy.&lt;/LI&gt;
&lt;LI data-line="23"&gt;&lt;STRONG&gt;Security built-in.&lt;/STRONG&gt; Defender for Cloud, RBAC groups, NSG deny-all defaults, and policy enforcement, all automated.&lt;/LI&gt;
&lt;LI data-line="24"&gt;&lt;STRONG&gt;Cost controls from day one.&lt;/STRONG&gt;&amp;nbsp;Budget alerts at 50%, 80%, and 100%, mandatory tagging, and reservation guidance.&lt;/LI&gt;
&lt;LI data-line="25"&gt;&lt;STRONG&gt;An explicit graduation path.&lt;/STRONG&gt;&amp;nbsp;When you outgrow SSLZ, there's a step-by-step guide to migrate to the full ALZ architecture.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="27"&gt;&lt;STRONG&gt;Important:&lt;/STRONG&gt; SSLZ is not a replacement for Azure Landing Zones. It targets a different profile: very early-stage startups with a single workload, a single region, and no hybrid connectivity. For those teams, the realistic alternative isn't ALZ, it's usually&amp;nbsp;&lt;EM&gt;no governance at all&lt;/EM&gt;.&lt;/P&gt;
&lt;H2 data-line="29"&gt;Architecture: Simplicity as a design principle&lt;/H2&gt;
&lt;P data-line="31"&gt;The architecture is deliberately minimal:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;Tenant Root Group
└── mg-&amp;lt;yourcompany&amp;gt;              ← Policies applied here
    ├── sub-&amp;lt;yourcompany&amp;gt;-prod    ← Production workloads
    └── sub-&amp;lt;yourcompany&amp;gt;-nonprod ← Dev, staging, QA&lt;/LI-CODE&gt;
&lt;P&gt;Each subscription gets its own VNet with a standardized subnet layout:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;vnet-&amp;lt;co&amp;gt;-prod (10.0.0.0/16)
├── snet-aks         10.0.0.0/20    (4,091 IPs — AKS nodes + pods)
├── snet-app         10.0.16.0/22   (1,019 IPs — App Service / Container Apps)
├── snet-data        10.0.20.0/22   (1,019 IPs — Private Endpoints)
└── snet-shared      10.0.24.0/24   (251 IPs — CI/CD agents, jump boxes)&lt;/LI-CODE&gt;
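&lt;P&gt;The IP counts in the layout follow from the prefix length: a subnet with prefix /n contains 2^(32 - n) addresses, and Azure reserves five of them in every subnet. A small sketch of that arithmetic, where the function name usable_ips is illustrative only:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Usable addresses in an Azure subnet for a given prefix length:
# 2^(32 - prefix) total addresses minus the 5 that Azure reserves.
usable_ips() {
  local host_bits=$(( 32 - $1 ))
  local total=1
  # Double the total once per host bit to compute 2^host_bits.
  while [ "$host_bits" -gt 0 ]; do
    total=$(( total * 2 ))
    host_bits=$(( host_bits - 1 ))
  done
  echo $(( total - 5 ))
}

for prefix_length in 20 22 24; do
  echo "/${prefix_length}: $(usable_ips "$prefix_length") usable IPs"
done&lt;/LI-CODE&gt;
&lt;P&gt;This prints 4091 for /20, 1019 for /22, and 251 for /24, matching the figures in the layout. The same function is handy when resizing a subnet for a larger AKS node pool.&lt;/P&gt;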
&lt;P data-line="50"&gt;No hub network. No Azure Firewall. No VNet peering. Each subscription is a self-contained island.&lt;/P&gt;
&lt;H3 data-line="52"&gt;Why no hub?&lt;/H3&gt;
&lt;P data-line="54"&gt;A hub-spoke topology costs a minimum of ~$1,500/month. Azure Firewall alone runs $900+/month. For a startup with a single workload in a single region, that's cost and complexity with no return. NSGs provide L3/L4 filtering for free and handle 95% of startup networking use cases. When compliance or hybrid connectivity demands centralized egress control, the graduation guide walks you through adding a hub, without touching existing resources.&lt;/P&gt;
&lt;H3 data-line="56"&gt;Why two subscriptions?&lt;/H3&gt;
&lt;P data-line="58"&gt;Two subscriptions give you isolation that resource groups can't:&lt;/P&gt;
&lt;UL data-line="60"&gt;
&lt;LI data-line="60"&gt;&lt;STRONG&gt;Cost isolation for free:&amp;nbsp;&lt;/STRONG&gt;no tagging gymnastics to separate prod from dev spend.&lt;/LI&gt;
&lt;LI data-line="61"&gt;&lt;STRONG&gt;RBAC without custom roles:&amp;nbsp;&lt;/STRONG&gt;developers get Contributor on non-prod and Reader on prod.&lt;/LI&gt;
&lt;LI data-line="62"&gt;&lt;STRONG&gt;Blast radius containment:&amp;nbsp;&lt;/STRONG&gt;az group delete&amp;nbsp;in dev can't touch production.&lt;/LI&gt;
&lt;LI data-line="63"&gt;&lt;STRONG&gt;Quota isolation:&amp;nbsp;&lt;/STRONG&gt;non-prod experiments don't consume prod quotas.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="65"&gt;This is a habit that's cheap to form early and expensive to retrofit later. One primary workload per subscription; when you deploy a second independent workload, create a new subscription.&lt;/P&gt;
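&lt;P&gt;The RBAC split described above (Contributor on non-prod, Reader on prod) takes two built-in role assignments per group. A minimal sketch, assuming an existing Entra security group for developers; the function name assign_developer_roles and its arguments are hypothetical and not part of SSLZ:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper: apply the "Contributor on non-prod, Reader on prod"
# split to one Entra security group using built-in roles only.
# Usage: assign_developer_roles GROUP_OBJECT_ID PROD_SUBSCRIPTION_ID NONPROD_SUBSCRIPTION_ID
assign_developer_roles() {
  local group_id="$1" prod_sub="$2" nonprod_sub="$3"

  # Read-only visibility into production.
  az role assignment create \
    --assignee-object-id "$group_id" \
    --assignee-principal-type Group \
    --role "Reader" \
    --scope "/subscriptions/${prod_sub}"

  # Full resource management in non-production.
  az role assignment create \
    --assignee-object-id "$group_id" \
    --assignee-principal-type Group \
    --role "Contributor" \
    --scope "/subscriptions/${nonprod_sub}"
}&lt;/LI-CODE&gt;
&lt;P&gt;Because the subscription is the scope, no custom role definition or per-resource-group assignment is needed to keep developers out of production.&lt;/P&gt;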
&lt;H2 data-line="67"&gt;What you get out of the box&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Component&lt;/th&gt;&lt;th&gt;What's deployed&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Management Groups&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Single MG with two subscriptions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Azure Policy&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Microsoft Cloud Security Benchmark (audit mode), required tags (environment,&amp;nbsp;team), allowed locations, diagnostic settings&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Networking&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;VNet + 4 subnets per subscription, NSGs with deny-all-inbound default&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Monitoring&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Log Analytics workspace, Activity Log forwarding, 90-day retention&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Security&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Defender for Cloud CSPM (free), Defender for Servers P2 (prod), security contact alerts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Cost Management&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Budget alerts at 50/80/100% thresholds, tag enforcement via policy&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;CI/CD&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;GitHub Actions workflows for both Bicep and Terraform, Workload Identity Federation (no secrets)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3 data-line="79"&gt;Security without friction&lt;/H3&gt;
&lt;P data-line="81"&gt;The security model avoids compliance theater. Instead of buying Entra ID P2 "to check a box," SSLZ enables Security Defaults, free MFA that blocks 99.9% of identity attacks. Instead of enforcing MCSB in Deny mode on day one (which blocks legitimate deployments and frustrates developers), it starts in Audit mode so you can understand your posture first, then selectively move to Deny as your team matures.&lt;/P&gt;
&lt;P data-line="83"&gt;RBAC follows three rules:&lt;/P&gt;
&lt;OL data-line="85"&gt;
&lt;LI data-line="85"&gt;&lt;STRONG&gt;Never assign roles to individuals:&amp;nbsp;&lt;/STRONG&gt;always use security groups.&lt;/LI&gt;
&lt;LI data-line="86"&gt;&lt;STRONG&gt;Developers don't get Contributor on prod:&amp;nbsp;&lt;/STRONG&gt;deployments go through CI/CD.&lt;/LI&gt;
&lt;LI data-line="87"&gt;&lt;STRONG&gt;No Owner at subscription level for non-admins:&amp;nbsp;&lt;/STRONG&gt;a compromised account with Owner can grant itself anything.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-line="89"&gt;For CI/CD, SSLZ uses Workload Identity Federation (WIF) instead of client secrets. No credentials to store, rotate, or accidentally commit. Short-lived OIDC tokens scoped to specific repos and branches.&lt;/P&gt;
&lt;H3 data-line="91"&gt;Cost transparency&lt;/H3&gt;
&lt;P data-line="93"&gt;Every recommendation includes real numbers:&lt;/P&gt;
&lt;UL data-line="95"&gt;
&lt;LI data-line="95"&gt;&lt;EM&gt;"Azure Firewall: $900+/month. Skip until compliance or hybrid demands it."&lt;/EM&gt;&lt;/LI&gt;
&lt;LI data-line="96"&gt;&lt;EM&gt;"DDoS Protection Standard: $2,944/month. Azure's free basic DDoS + Front Door WAF handles most cases."&lt;/EM&gt;&lt;/LI&gt;
&lt;LI data-line="97"&gt;&lt;EM&gt;"Defender for App Service: ~$15/month. Limited value compared to other plans. Revisit later."&lt;/EM&gt;&lt;/LI&gt;
&lt;LI data-line="98"&gt;&lt;EM&gt;"Standard_D4s_v5 VM: $140/month on-demand → $90/month with 1-year RI. 36% savings."&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
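&lt;P&gt;The savings figure in the last quote is simple arithmetic: the monthly difference divided by the on-demand price. A short sketch using the prices quoted above, with savings_percent as an illustrative function name:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Percentage saved when moving from on-demand to reserved pricing,
# rounded to the nearest whole percent using integer arithmetic.
savings_percent() {
  local on_demand="$1" reserved="$2"
  echo $(( ((on_demand - reserved) * 100 + on_demand / 2) / on_demand ))
}

echo "Standard_D4s_v5 with 1-year RI: $(savings_percent 140 90)% savings"&lt;/LI-CODE&gt;
&lt;P&gt;With $140 on-demand and $90 reserved, the difference is $50 per month, or about 36% of the on-demand price.&lt;/P&gt;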
&lt;P data-line="100"&gt;The documentation also covers the six most common cost mistakes startups make: forgotten dev VMs, over-provisioned databases, ignoring Reserved Instances, premium storage where standard works, not using Spot VMs, and missing Dev/Test pricing. Each mistake comes with a concrete fix and code example.&lt;/P&gt;
&lt;H2 data-line="102"&gt;Starter examples: Three startup archetypes&lt;/H2&gt;
&lt;P data-line="104"&gt;SSLZ ships with three production-grade example architectures, each with Bicep + Terraform implementations, deployment instructions, and realistic cost estimates:&lt;/P&gt;
&lt;H3 data-line="106"&gt;SaaS Startup (~$330–440/month)&lt;/H3&gt;
&lt;P data-line="108"&gt;Container Apps + Azure SQL Elastic Pool + Redis + Key Vault. Multi-tenant with shared schema and&amp;nbsp;tenant_id&amp;nbsp;column. Container Apps scale to zero in non-prod. Elastic pools are 50–70% cheaper than individual databases.&lt;/P&gt;
&lt;H3 data-line="110"&gt;AI Startup (~$1,150–1,250/month)&lt;/H3&gt;
&lt;P data-line="112"&gt;AKS with GPU Spot node pools (60–90% savings) + Azure OpenAI + Blob Storage + Redis for inference caching. Covers model serving framework choices (vLLM vs Triton vs TGI) and GPU node management with taints and KEDA autoscaling.&lt;/P&gt;
&lt;H3 data-line="114"&gt;API-First Startup (~$163–345/month)&lt;/H3&gt;
&lt;P data-line="116"&gt;App Service with deployment slots (zero-downtime swaps) + API Management (Consumption tier, pay-per-call) + Cosmos DB + Application Insights. Includes API versioning strategy, rate limiting tiers, and Cosmos DB partitioning guidance.&lt;/P&gt;
&lt;H2 data-line="118"&gt;When to graduate&lt;/H2&gt;
&lt;P data-line="120"&gt;SSLZ is explicit about its limits. You'll outgrow it when 2–3 of these signals appear simultaneously:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Signal&lt;/th&gt;&lt;th&gt;Why it matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Second independent workload&lt;/td&gt;&lt;td&gt;Each workload gets its own subscription&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Engineering team &amp;gt; 50 people&lt;/td&gt;&lt;td&gt;Different teams need different permissions and cost boundaries&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Regulatory compliance (SOC2, HIPAA, PCI)&lt;/td&gt;&lt;td&gt;Requires specific controls SSLZ doesn't cover&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multi-region deployment&lt;/td&gt;&lt;td&gt;Needs centralized network management&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Hybrid connectivity (VPN, ExpressRoute)&lt;/td&gt;&lt;td&gt;Requires a Connectivity subscription with gateways&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;5+ subscriptions&lt;/td&gt;&lt;td&gt;Policy and RBAC at scale needs MG hierarchy&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P data-line="131"&gt;The&amp;nbsp;&lt;A href="https://github.com/ricmmartins/sslz/blob/main/docs/graduation-guide.md" target="_blank" rel="noopener" data-href="https://github.com/ricmmartins/sslz/blob/main/docs/graduation-guide.md"&gt;Graduation Guide&lt;/A&gt; provides a five-phase migration path to full ALZ: management group hierarchy, hub network + firewall, management subscription, policy hardening, and identity hardening, with risk assessments for each phase. It also includes the cost of the full platform layer ($1,500–3,000/month), so you can make an informed decision about when the investment makes sense.&lt;/P&gt;
&lt;H2 data-line="133"&gt;Quick start: From zero to production-ready in under 1 hour&lt;/H2&gt;
&lt;H3 data-line="135"&gt;Prerequisites (5 min)&lt;/H3&gt;
&lt;UL data-line="137"&gt;
&lt;LI data-line="137"&gt;Azure CLI installed&lt;/LI&gt;
&lt;LI data-line="138"&gt;Two Azure subscriptions (prod + non-prod)&lt;/LI&gt;
&lt;LI data-line="139"&gt;Owner permissions on both subscriptions&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI-CODE lang="bash"&gt;git clone https://github.com/ricmmartins/sslz.git
cd sslz
az login
./scripts/validate-prerequisites.sh&lt;/LI-CODE&gt;
&lt;H3 data-line="148"&gt;Deploy with Bicep (20 min)&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;cd infra/bicep
cp parameters/prod.bicepparam parameters/prod.local.bicepparam
# Edit prod.local.bicepparam with your values

az deployment sub create \
  --location eastus2 \
  --template-file main.bicep \
  --parameters parameters/prod.local.bicepparam&lt;/LI-CODE&gt;
&lt;H3 data-line="161"&gt;Or Deploy with Terraform (20 min)&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;cd infra/terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values

terraform init
terraform plan -out=tfplan
terraform apply tfplan&lt;/LI-CODE&gt;
&lt;H3 data-line="173"&gt;Verify (5 min)&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;az group list --query "[?contains(name, 'yourcompany')].name" -o tsv
az policy assignment list --query "[].displayName" -o tsv
az security pricing list --query "value[?pricingTier=='Standard'].{Name:name, Tier:pricingTier}" -o table&lt;/LI-CODE&gt;
&lt;H2 data-line="181"&gt;Design philosophy&lt;/H2&gt;
&lt;P data-line="183"&gt;Three principles guided every decision in SSLZ:&lt;/P&gt;
&lt;OL&gt;
&lt;LI data-line="185"&gt;&lt;STRONG&gt; Opinionated over flexible. &lt;/STRONG&gt;"It depends" isn't helpful when you have five engineers and no platform team. SSLZ makes the call (two subscriptions, no hub, deny-all NSGs, MCSB in audit mode) and tells you when to revisit it.&lt;/LI&gt;
&lt;LI data-line="187"&gt;&lt;STRONG&gt; Reversible over perfect. &lt;/STRONG&gt;Every architectural decision is designed to be easy to change later. Moving subscriptions between management groups is a 10-second operation. Adding a hub VNet requires only a new deployment, not changes to existing resources. Policies can move from Audit to Deny on a schedule. Multi-region is a future add-on, not a prerequisite.&lt;/LI&gt;
&lt;LI data-line="189"&gt;&lt;STRONG&gt; Honest about trade-offs. &lt;/STRONG&gt;Instead of claiming "enterprise-grade," SSLZ says: &lt;EM&gt;"You'll outgrow this when..."&lt;/EM&gt;&amp;nbsp;and&amp;nbsp;&lt;EM&gt;"Here's exactly what it costs to add the next layer."&lt;/EM&gt;&amp;nbsp;That transparency is what separates it from frameworks that are either overkill for startups or under-engineered for production.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2 data-line="191"&gt;Get involved&lt;/H2&gt;
&lt;P data-line="193"&gt;SSLZ is open source under the MIT license. The project welcomes contributions, especially real-world configurations from startup CTOs and platform engineers who've battle-tested the patterns.&lt;/P&gt;
&lt;UL data-line="195"&gt;
&lt;LI data-line="195"&gt;&lt;STRONG&gt;GitHub:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://github.com/ricmmartins/sslz" target="_blank" rel="noopener" data-href="https://github.com/ricmmartins/sslz"&gt;github.com/ricmmartins/sslz&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="196"&gt;&lt;STRONG&gt;Documentation site:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://startupscalelanding.zone/" target="_blank" rel="noopener" data-href="https://startupscalelanding.zone"&gt;startupscalelanding.zone&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="197"&gt;&lt;STRONG&gt;Previous post:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/from-zero-to-hero-with-azure-landing-zones/4229195" target="_blank" rel="noopener" data-href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/from-zero-to-hero-with-azure-landing-zones/4229195"&gt;From Zero to Hero with Azure Landing Zones&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="199"&gt;If you're a startup building on Azure, give SSLZ a try. Deploy it, break it, and tell us what your real infrastructure looks like, so the next team doesn't have to figure it out from scratch.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Mar 2026 18:25:26 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/introducing-the-startup-scale-landing-zone-get-azure-right-from/ba-p/4501566</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-03-16T18:25:26Z</dc:date>
    </item>
    <item>
      <title>Production-grade API Gateway patterns for Microsoft Foundry</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/production-grade-api-gateway-patterns-for-microsoft-foundry/ba-p/4490494</link>
      <description>&lt;P data-start="62" data-end="263"&gt;Most startup teams start with the simplest thing that can work. One or two apps call Microsoft Foundry model endpoints directly, traffic is predictable, and “routing” is just a config value in the app.&lt;/P&gt;
&lt;P data-start="265" data-end="424"&gt;The gateway pattern becomes necessary when Foundry stops being “an integration” and becomes “a shared platform”. That shift shows up in a few reliable signals:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-start="265" data-end="424"&gt;You do not fully control client code, or updating client configuration is riskier than updating a central routing configuration.&lt;/LI&gt;
&lt;LI data-start="265" data-end="424"&gt;You need blue-green rollouts for model versions or fine-tuned variants without forcing every client to redeploy.&lt;/LI&gt;
&lt;LI data-start="265" data-end="424"&gt;You need server-side retry and circuit breaking semantics to handle throttling and availability without duplicating logic across every app.&lt;/LI&gt;
&lt;LI data-start="265" data-end="424"&gt;You need consistent token governance and usage visibility across multiple apps and consumers.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;On Azure, this is commonly implemented with Azure API Management (APIM) using GenAI-aware “AI Gateway” capabilities, and it can be configured from the Foundry portal and applied per project.&lt;/P&gt;
&lt;H3 data-start="1306" data-end="1339"&gt;&lt;U&gt;What problems a gateway solves&lt;/U&gt;&lt;/H3&gt;
&lt;P data-start="1341" data-end="1505"&gt;A production gateway in front of Foundry is not about adding a hop. It is about centralizing cross-cutting concerns that otherwise get reimplemented inconsistently:&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-start="1341" data-end="1505"&gt;&lt;STRONG data-start="1509" data-end="1531"&gt;Stable API surface&lt;/STRONG&gt; while deployments and backends evolve.&lt;/LI&gt;
&lt;LI data-start="1341" data-end="1505"&gt;&lt;STRONG data-start="1573" data-end="1604"&gt;Consistent auth termination&lt;/STRONG&gt; at the gateway, with trust then reestablished from the gateway to the model backend (for example, with Azure RBAC).&lt;/LI&gt;
&lt;LI data-start="1341" data-end="1505"&gt;&lt;STRONG data-start="1755" data-end="1792"&gt;Token-based throttling and quotas&lt;/STRONG&gt; for fairness and cost control across consumers.&lt;/LI&gt;
&lt;LI data-start="1341" data-end="1505"&gt;&lt;STRONG data-start="1883" data-end="1909"&gt;Operational resiliency&lt;/STRONG&gt; via backend pools, priority and weight routing, retry, and circuit breaker behavior that honors throttling signals like Retry-After.&lt;/LI&gt;
&lt;LI data-start="1341" data-end="1505"&gt;&lt;STRONG data-start="2087" data-end="2108"&gt;Unified telemetry&lt;/STRONG&gt; at the choke point, even when you have multiple underlying instances.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;U&gt;Decoupling clients from backend topology&lt;/U&gt;&lt;/H3&gt;
&lt;P&gt;One secondary but important effect of introducing a gateway is that it shifts backend-specific details out of application code. Clients call a stable API owned by your platform team, while routing, credentials, and failover semantics live behind that boundary. This does not make models interchangeable, and it does not eliminate platform dependencies. What it does is contain them. As backend topology evolves, whether that means new deployments, additional subscriptions, or additional regions, those changes become operational updates rather than coordinated application rewrites.&lt;/P&gt;
&lt;P&gt;In practice, this means your platform team owns the API contract and operational semantics, while backend providers remain an implementation detail behind that contract.&lt;/P&gt;
&lt;H3 data-start="2225" data-end="2251"&gt;&lt;U&gt;One simple mental model&lt;/U&gt;&lt;/H3&gt;
&lt;img /&gt;
&lt;H3 data-start="2733" data-end="2761"&gt;&lt;U&gt;Concrete gateway patterns&lt;/U&gt;&lt;/H3&gt;
&lt;H4 data-start="421" data-end="458"&gt;Choosing the right gateway pattern&lt;/H4&gt;
&lt;P data-start="460" data-end="560"&gt;The table below summarizes when each pattern is most appropriate, and what trade-offs it introduces.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Pattern&lt;/th&gt;&lt;th&gt;Primary goal&lt;/th&gt;&lt;th&gt;Isolation level&lt;/th&gt;&lt;th&gt;Throughput scaling&lt;/th&gt;&lt;th&gt;Resiliency impact&lt;/th&gt;&lt;th&gt;Operational complexity&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Single Foundry, multi-deployment routing&lt;/td&gt;&lt;td&gt;Decouple clients from models and enable safe rollouts&lt;/td&gt;&lt;td&gt;Logical only (same resource boundary)&lt;/td&gt;&lt;td&gt;Limited to single resource quotas&lt;/td&gt;&lt;td&gt;Low to moderate (deployment-level)&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multi-resource, same region, same subscription&lt;/td&gt;&lt;td&gt;Security segmentation, reliability, backend pooling&lt;/td&gt;&lt;td&gt;Resource-level&lt;/td&gt;&lt;td&gt;Not increased for standard tier&lt;/td&gt;&lt;td&gt;Moderate (backend failover)&lt;/td&gt;&lt;td&gt;Medium&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Prioritized failover, spillover (PTU → standard)&lt;/td&gt;&lt;td&gt;Cost control and burst protection&lt;/td&gt;&lt;td&gt;Resource-level&lt;/td&gt;&lt;td&gt;Controlled spillover&lt;/td&gt;&lt;td&gt;High (explicit failover semantics)&lt;/td&gt;&lt;td&gt;Medium to high&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multi-subscription, same region&lt;/td&gt;&lt;td&gt;Quota expansion, org boundaries, central AI service&lt;/td&gt;&lt;td&gt;Subscription-level&lt;/td&gt;&lt;td&gt;Scales with number of subscriptions&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Multi-region&lt;/td&gt;&lt;td&gt;Regional resilience, data residency, global access&lt;/td&gt;&lt;td&gt;Region-level&lt;/td&gt;&lt;td&gt;Region-bounded&lt;/td&gt;&lt;td&gt;Very high&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P data-start="1660" data-end="1687"&gt;&lt;STRONG data-start="1660" data-end="1687"&gt;How to read this table:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1689" data-end="2058"&gt;
&lt;LI data-start="1689" data-end="1776"&gt;If your problem is &lt;STRONG data-start="1710" data-end="1751"&gt;model lifecycle and client decoupling&lt;/STRONG&gt;, start with Pattern 1.&lt;/LI&gt;
&lt;LI data-start="1777" data-end="1874"&gt;If your problem is &lt;STRONG data-start="1798" data-end="1830"&gt;reliability and segmentation&lt;/STRONG&gt;, Patterns 2 and 3 are the usual next steps.&lt;/LI&gt;
&lt;LI data-start="1875" data-end="1965"&gt;If your problem is &lt;STRONG data-start="1896" data-end="1943"&gt;quota ceilings or organizational boundaries&lt;/STRONG&gt;, Pattern 4 is the usual answer.&lt;/LI&gt;
&lt;LI data-start="1966" data-end="2058"&gt;If your problem is &lt;STRONG data-start="1987" data-end="2026"&gt;regional resilience or global scale&lt;/STRONG&gt;, Pattern 5 becomes unavoidable.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2763" data-end="2926"&gt;Below are the most common patterns that show up as startups move from “one app calling one deployment” to “multiple products, multiple teams, and production SLOs”.&lt;/P&gt;
&lt;H4 data-start="2928" data-end="2996"&gt;Pattern 1: Single Foundry resource with multi-deployment routing&lt;/H4&gt;
&lt;img /&gt;
&lt;P data-start="2998" data-end="3017"&gt;&lt;STRONG data-start="2998" data-end="3017"&gt;When you use it&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="3018" data-end="3231"&gt;
&lt;LI data-start="3018" data-end="3120"&gt;You run multiple model deployments under one Foundry resource and want to control routing centrally.&lt;/LI&gt;
&lt;LI data-start="3121" data-end="3231"&gt;You want safer rollouts (blue-green) without forcing client updates.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3233" data-end="3251"&gt;&lt;STRONG data-start="3233" data-end="3251"&gt;What it solves&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="3252" data-end="3537"&gt;
&lt;LI data-start="3252" data-end="3308"&gt;Routing decisions move from clients to a single place.&lt;/LI&gt;
&lt;LI data-start="3309" data-end="3537"&gt;You can gradually shift traffic between deployments, but you still need safe deployment practices because changing “which model” can be a breaking change from the client’s perspective.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3539" data-end="3565"&gt;&lt;STRONG data-start="3539" data-end="3565"&gt;Key operational detail&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="3566" data-end="3775"&gt;
&lt;LI data-start="3566" data-end="3775"&gt;Strongly consider &lt;STRONG data-start="3586" data-end="3632"&gt;credential termination and reestablishment&lt;/STRONG&gt;. Clients authenticate to the gateway. The gateway authenticates to the model backend via Azure RBAC.&lt;/LI&gt;
&lt;/UL&gt;
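&lt;P&gt;To make the traffic-shifting idea concrete, here is a minimal Python sketch of weighted routing between two deployments. It is an illustration only, not how APIM implements routing: the deployment names chat-v1 and chat-v2, the 90/10 split, and the pick_deployment helper are all assumptions. It hashes a stable key so the same consumer keeps landing on the same deployment while the rollout is in progress.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import hashlib

# Hypothetical deployment names and traffic weights (percent); not taken from the post.
WEIGHTS = {"chat-v1": 90, "chat-v2": 10}


def pick_deployment(request_key, weights=WEIGHTS):
    """Map a stable key (for example a consumer ID) to a deployment by weight."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must sum to a positive number")
    # Hash the key into a bucket in [0, total) so a given key always gets the same answer.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for name, weight in weights.items():
        upper += weight
        if upper > bucket:
            return name
    raise ValueError("weights must be non-negative")


if __name__ == "__main__":
    counts = {name: 0 for name in WEIGHTS}
    for i in range(10_000):
        counts[pick_deployment(f"consumer-{i}")] += 1
    print(counts)  # roughly a 90/10 split&lt;/LI-CODE&gt;
&lt;P&gt;Shifting traffic then means editing the weights in one place. Sticky assignment is a design choice: it keeps a single consumer from seeing two model versions in alternation, at the cost of a slightly uneven split when there are only a few consumers.&lt;/P&gt;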
&lt;H4 data-start="3782" data-end="3852"&gt;Pattern 2: Multi-resource in the same region and same subscription&lt;/H4&gt;
&lt;img /&gt;
&lt;P data-start="3854" data-end="3873"&gt;&lt;STRONG data-start="3854" data-end="3873"&gt;When you use it&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="3874" data-end="4166"&gt;
&lt;LI data-start="3874" data-end="3963"&gt;You need &lt;STRONG data-start="3885" data-end="3910"&gt;security segmentation&lt;/STRONG&gt; boundaries (separate keys or Azure RBAC per client).&lt;/LI&gt;
&lt;LI data-start="3964" data-end="4006"&gt;You want an easier &lt;STRONG data-start="3985" data-end="3999"&gt;chargeback&lt;/STRONG&gt; model.&lt;/LI&gt;
&lt;LI data-start="4007" data-end="4166"&gt;You want failover that covers availability issues and operational mistakes, or you want to pair provisioned and standard deployments for spillover.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4168" data-end="4186"&gt;&lt;STRONG data-start="4168" data-end="4186"&gt;What it solves&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="4187" data-end="4413"&gt;
&lt;LI data-start="4187" data-end="4318"&gt;You can treat multiple backends as &lt;STRONG data-start="4224" data-end="4241"&gt;active-active&lt;/STRONG&gt; and load balance across instances.&lt;/LI&gt;
&lt;LI data-start="4319" data-end="4413"&gt;You centralize retry and circuit-breaker behavior.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4415" data-end="4438"&gt;&lt;STRONG data-start="4415" data-end="4438"&gt;Critical constraint&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="4439" data-end="4651"&gt;
&lt;LI data-start="4439" data-end="4651"&gt;&lt;STRONG data-start="4441" data-end="4503"&gt;Standard quotas are subscription-level, not instance-level&lt;/STRONG&gt;. Load balancing across standard instances in the same subscription does not create additional throughput.&lt;/LI&gt;
&lt;/UL&gt;
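&lt;P&gt;The constraint is easier to see as arithmetic. The sketch below assumes, as stated above, that standard quota is tracked per subscription; the resource names and the figure of 300,000 tokens per minute are made-up numbers for illustration.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def usable_standard_quota(resources):
    """Sum standard quota once per subscription, because instances in a subscription share it."""
    quota_by_subscription = {}
    for resource in resources:
        # Several resources in one subscription map to the same entry, so they add no capacity.
        quota_by_subscription[resource["subscription"]] = resource["quota_tpm"]
    return sum(quota_by_subscription.values())


# Hypothetical numbers: 300,000 tokens per minute of standard quota per subscription.
same_subscription = [
    {"name": "foundry-a", "subscription": "sub-1", "quota_tpm": 300_000},
    {"name": "foundry-b", "subscription": "sub-1", "quota_tpm": 300_000},
]
two_subscriptions = [
    {"name": "foundry-a", "subscription": "sub-1", "quota_tpm": 300_000},
    {"name": "foundry-c", "subscription": "sub-2", "quota_tpm": 300_000},
]

print(usable_standard_quota(same_subscription))   # 300000
print(usable_standard_quota(two_subscriptions))   # 600000&lt;/LI-CODE&gt;
&lt;P&gt;Two instances in one subscription still share a single pool, which is why this pattern buys segmentation and failover rather than throughput.&lt;/P&gt;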
&lt;H4 data-start="4658" data-end="4749"&gt;Pattern 3: Prioritized failover and planned spillover (PTU first, consumption fallback)&lt;/H4&gt;
&lt;img /&gt;
&lt;P data-start="4751" data-end="4882"&gt;This is the pattern you reach for when you want to maximize utilization of dedicated capacity and still survive bursts and outages.&lt;/P&gt;
&lt;P data-start="4884" data-end="5165"&gt;The AI Gateway workshop describes a “Prioritized PTU with Fallback Consumption” approach using APIM backend pools with &lt;STRONG data-start="5003" data-end="5040"&gt;priority and weight-based routing&lt;/STRONG&gt;, combined with &lt;STRONG data-start="5056" data-end="5081"&gt;circuit breaker rules&lt;/STRONG&gt; and retries for 429 and selected 503 cases.&lt;/P&gt;
&lt;P data-start="5167" data-end="5259"&gt;Concrete implementation details from the workshop that are worth copying into your playbook:&lt;/P&gt;
&lt;UL data-start="5261" data-end="5632"&gt;
&lt;LI data-start="5261" data-end="5316"&gt;Configure a &lt;STRONG data-start="5273" data-end="5289"&gt;backend pool&lt;/STRONG&gt; across multiple endpoints.&lt;/LI&gt;
&lt;LI data-start="5317" data-end="5451"&gt;Add a &lt;STRONG data-start="5325" data-end="5349"&gt;circuit breaker rule&lt;/STRONG&gt; that can trip on throttling (429) and honor the Retry-After header.&lt;/LI&gt;
&lt;LI data-start="5317" data-end="5451"&gt;Use APIM policy to authenticate with &lt;STRONG data-start="5491" data-end="5511"&gt;managed identity&lt;/STRONG&gt; and set the backend to the pool, then retry on 429 or specific 503 conditions.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="5634" data-end="5728"&gt;This moves “resiliency logic” out of every client and into one place you can test and iterate.&lt;/P&gt;
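&lt;P&gt;The control flow behind this pattern can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the workshop's APIM policy: the Backend class, the call_with_failover function, and the one-second default pause are hypothetical, and weight-based routing within a priority level is left out to keep it short.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import time


class Backend:
    """One pool member. A lower priority number is tried first (hypothetical model)."""

    def __init__(self, name, priority, send):
        self.name = name
        self.priority = priority
        self.send = send        # callable returning (status_code, retry_after_seconds or None)
        self.open_until = 0.0   # the breaker is open (backend skipped) until this timestamp


def call_with_failover(backends, now=time.monotonic):
    """Try backends in priority order, tripping a breaker on 429 or 503."""
    for backend in sorted(backends, key=lambda b: b.priority):
        if backend.open_until > now():
            continue  # breaker is open: do not keep hitting a throttled backend
        status, retry_after = backend.send()
        if status == 200:
            return backend.name
        if status in (429, 503):
            # Honor Retry-After when present; otherwise pause briefly before retrying here.
            pause = retry_after if retry_after is not None else 1.0
            backend.open_until = now() + pause
            continue
        raise RuntimeError(f"{backend.name} failed with status {status}")
    raise RuntimeError("all backends are unavailable")


if __name__ == "__main__":
    ptu = Backend("ptu", 1, lambda: (429, 30))           # provisioned, currently throttled
    standard = Backend("standard", 2, lambda: (200, None))
    print(call_with_failover([ptu, standard]))           # standard&lt;/LI-CODE&gt;
&lt;P&gt;After the first 429, the provisioned backend is skipped until its Retry-After window has passed, so spillover traffic goes straight to the standard backend instead of paying for a failed attempt on every request.&lt;/P&gt;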
&lt;H4 data-start="5735" data-end="5821"&gt;Pattern 4: Multi-subscription, same region (quota scaling and centralized service)&lt;/H4&gt;
&lt;img /&gt;
&lt;P data-start="5823" data-end="5842"&gt;&lt;STRONG data-start="5823" data-end="5842"&gt;When you use it&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="5843" data-end="6186"&gt;
&lt;LI data-start="5843" data-end="5980"&gt;You need more quota in &lt;STRONG data-start="5868" data-end="5880"&gt;standard&lt;/STRONG&gt; deployments but must constrain models to a single region.&lt;/LI&gt;
&lt;LI data-start="5843" data-end="5980"&gt;You are building a centralized “Microsoft Foundry as a service” model. Standard quota is subscription-bound, so capacity pooling often implies multiple subscriptions.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="6188" data-end="6252"&gt;&lt;STRONG data-start="6188" data-end="6252"&gt;Implementation tips from the Azure Architecture Center guide&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="6253" data-end="6631"&gt;
&lt;LI data-start="6253" data-end="6365"&gt;Prefer subscriptions backed by the same Microsoft Entra tenant for consistency in Azure RBAC and Azure Policy.&lt;/LI&gt;
&lt;LI data-start="6366" data-end="6422"&gt;Deploy the gateway in the same region as the backends.&lt;/LI&gt;
&lt;LI data-start="6423" data-end="6467"&gt;Consider a dedicated gateway subscription.&lt;/LI&gt;
&lt;LI data-start="6468" data-end="6631"&gt;Ensure private endpoints are reachable across subscriptions, including cross-subscription Private Link where supported.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="6638" data-end="6665"&gt;Pattern 5: Multi-region&lt;/H4&gt;
&lt;img /&gt;
&lt;P data-start="6667" data-end="6686"&gt;&lt;STRONG data-start="6667" data-end="6686"&gt;When you use it&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="6687" data-end="6921"&gt;
&lt;LI data-start="6687" data-end="6772"&gt;You need a service availability failover strategy (for example, cross-region pairs).&lt;/LI&gt;
&lt;LI data-start="6773" data-end="6827"&gt;You have data residency and compliance requirements.&lt;/LI&gt;
&lt;LI data-start="6828" data-end="6921"&gt;You face mixed model availability across regions.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The Azure Architecture Center guide calls out that for business-critical architectures that must survive a complete regional outage, a &lt;STRONG data-start="7058" data-end="7084"&gt;global unified gateway&lt;/STRONG&gt; helps eliminate failover logic from client code. It also notes the trade-offs of a gateway deployed in a single region that load balances active-active across regions, including added latency and egress charges for cross-region calls.&lt;/P&gt;
&lt;H3 data-start="2225" data-end="2275"&gt;&lt;U&gt;Real-world scenarios this architecture supports&lt;/U&gt;&lt;/H3&gt;
&lt;P data-start="2277" data-end="2431"&gt;These are representative scenarios drawn from common production environments and directly supported by the gateway patterns and reference implementations.&lt;/P&gt;
&lt;H4 data-start="2433" data-end="2481"&gt;Scenario A: Containing a runaway application&lt;/H4&gt;
&lt;P data-start="2483" data-end="2656"&gt;A company has five internal applications sharing the same Foundry environment. One application ships a prompt regression that suddenly makes the average request 8x larger.&lt;/P&gt;
&lt;P data-start="2658" data-end="2676"&gt;Without a gateway:&lt;/P&gt;
&lt;UL data-start="2677" data-end="2832"&gt;
&lt;LI data-start="2677" data-end="2713"&gt;Token consumption spikes globally.&lt;/LI&gt;
&lt;LI data-start="2714" data-end="2764"&gt;Other apps experience 429s and degraded latency.&lt;/LI&gt;
&lt;LI data-start="2765" data-end="2832"&gt;Root cause takes time to identify because telemetry is scattered.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2834" data-end="2873"&gt;With an AI Gateway in front of Foundry:&lt;/P&gt;
&lt;UL data-start="2874" data-end="3102"&gt;
&lt;LI data-start="2874" data-end="2924"&gt;Token-based limits are enforced per application.&lt;/LI&gt;
&lt;LI data-start="2925" data-end="2970"&gt;The faulty app is throttled at the gateway.&lt;/LI&gt;
&lt;LI data-start="2971" data-end="3020"&gt;Other applications continue operating normally.&lt;/LI&gt;
&lt;LI data-start="3021" data-end="3102"&gt;The gateway telemetry immediately shows which consumer is exhausting the quota.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3104" data-end="3112"&gt;Outcome:&lt;/P&gt;
&lt;UL data-start="3113" data-end="3215"&gt;
&lt;LI data-start="3113" data-end="3164"&gt;Incident blast radius is limited to one consumer.&lt;/LI&gt;
&lt;LI data-start="3165" data-end="3184"&gt;No global outage.&lt;/LI&gt;
&lt;LI data-start="3185" data-end="3215"&gt;Faster root cause isolation.&lt;/LI&gt;
&lt;/UL&gt;
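&lt;P&gt;Per-application enforcement is what keeps the blast radius small. The sketch below shows the idea with a fixed one-minute window per consumer. It is a simplified stand-in for a gateway token limit, and the TokenLimiter class, the consumer names, and the 10,000 tokens-per-minute budget are assumptions.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from collections import defaultdict


class TokenLimiter:
    """Fixed-window tokens-per-minute limit, tracked separately for each consumer."""

    def __init__(self, tokens_per_minute):
        self.tokens_per_minute = tokens_per_minute
        self.used = defaultdict(int)  # (consumer, minute window) -> tokens consumed

    def allow(self, consumer, tokens, now_seconds):
        """Return True and record usage if the request fits the consumer's budget."""
        key = (consumer, int(now_seconds // 60))
        if self.used[key] + tokens > self.tokens_per_minute:
            return False  # the gateway would answer 429 for this consumer only
        self.used[key] += tokens
        return True


if __name__ == "__main__":
    limiter = TokenLimiter(tokens_per_minute=10_000)
    print(limiter.allow("app-a", 8_000, now_seconds=0))    # True
    print(limiter.allow("app-a", 8_000, now_seconds=10))   # False: app-a is over budget
    print(limiter.allow("app-b", 1_000, now_seconds=10))   # True: app-b is unaffected&lt;/LI-CODE&gt;
&lt;P&gt;Because usage is keyed by consumer, the regressed application hits its own ceiling while the others keep their full budgets. A production limiter would also expire old windows, which this sketch skips.&lt;/P&gt;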
&lt;H4 data-start="3222" data-end="3267"&gt;Scenario B: Zero-downtime model migration&lt;/H4&gt;
&lt;P data-start="3269" data-end="3348"&gt;A startup is migrating from one production deployment to a newer model version.&lt;/P&gt;
&lt;P data-start="3350" data-end="3427"&gt;They deploy the new model alongside the old one and configure the gateway to:&lt;/P&gt;
&lt;UL data-start="3428" data-end="3520"&gt;
&lt;LI data-start="3428" data-end="3479"&gt;Route 5 percent of traffic to the new deployment.&lt;/LI&gt;
&lt;LI data-start="3480" data-end="3520"&gt;Keep 95 percent on the old deployment.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3522" data-end="3535"&gt;They observe:&lt;/P&gt;
&lt;UL data-start="3536" data-end="3576"&gt;
&lt;LI data-start="3536" data-end="3549"&gt;Error rate.&lt;/LI&gt;
&lt;LI data-start="3550" data-end="3560"&gt;Latency.&lt;/LI&gt;
&lt;LI data-start="3561" data-end="3576"&gt;Token growth.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3578" data-end="3681"&gt;Over several days they progressively shift traffic to 100 percent without requiring any client changes.&lt;/P&gt;
&lt;P data-start="3683" data-end="3691"&gt;Outcome:&lt;/P&gt;
&lt;UL data-start="3692" data-end="3828"&gt;
&lt;LI data-start="3692" data-end="3718"&gt;No forced redeployments.&lt;/LI&gt;
&lt;LI data-start="3719" data-end="3752"&gt;No mass client reconfiguration.&lt;/LI&gt;
&lt;LI data-start="3753" data-end="3828"&gt;Rollback is a gateway configuration change, not an emergency code change.&lt;/LI&gt;
&lt;/UL&gt;
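&lt;P&gt;The promote-or-roll-back decision can be written down as a small gate over the three signals listed above. The thresholds, the step schedule, and the metric field names in this sketch are assumptions to adapt to your own SLOs, not values from the scenario.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Hypothetical thresholds; tune them to your own SLOs.
MAX_ERROR_RATE = 0.01      # at most 1 percent failed requests on the new deployment
MAX_LATENCY_RATIO = 1.2    # new p95 latency at most 20 percent above the old one
MAX_TOKEN_RATIO = 1.3      # tokens per request at most 30 percent above the old one
STEPS = [5, 25, 50, 100]   # percent of traffic sent to the new deployment


def next_weight(current, old, new):
    """Return the next traffic percentage for the new deployment, or 0 to roll back."""
    healthy = (
        MAX_ERROR_RATE >= new["error_rate"]
        and MAX_LATENCY_RATIO * old["p95_ms"] >= new["p95_ms"]
        and MAX_TOKEN_RATIO * old["tokens_per_request"] >= new["tokens_per_request"]
    )
    if not healthy:
        return 0  # rollback is just a weight change at the gateway
    later = [step for step in STEPS if step > current]
    return later[0] if later else current


if __name__ == "__main__":
    old = {"error_rate": 0.002, "p95_ms": 900, "tokens_per_request": 1200}
    new = {"error_rate": 0.003, "p95_ms": 950, "tokens_per_request": 1250}
    print(next_weight(5, old, new))   # 25&lt;/LI-CODE&gt;
&lt;P&gt;Comparing the new deployment against the old one, rather than against fixed numbers, keeps the gate meaningful when overall traffic shifts during the rollout.&lt;/P&gt;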
&lt;H4 data-start="3835" data-end="3881"&gt;Scenario C: Cost-controlled burst handling&lt;/H4&gt;
&lt;P data-start="3883" data-end="3983"&gt;A product runs steady baseline traffic on provisioned capacity and experiences unpredictable spikes.&lt;/P&gt;
&lt;P data-start="3985" data-end="4007"&gt;Gateway configuration:&lt;/P&gt;
&lt;UL data-start="4008" data-end="4143"&gt;
&lt;LI data-start="4008" data-end="4032"&gt;Priority backend pool.&lt;/LI&gt;
&lt;LI data-start="4033" data-end="4069"&gt;Provisioned deployment as primary.&lt;/LI&gt;
&lt;LI data-start="4070" data-end="4105"&gt;Standard deployment as secondary.&lt;/LI&gt;
&lt;LI data-start="4106" data-end="4143"&gt;Circuit breaker honors Retry-After.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4145" data-end="4162"&gt;Normal operation:&lt;/P&gt;
&lt;UL data-start="4163" data-end="4212"&gt;
&lt;LI data-start="4163" data-end="4212"&gt;Nearly all traffic hits provisioned throughput.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4214" data-end="4228"&gt;During spikes:&lt;/P&gt;
&lt;UL data-start="4229" data-end="4322"&gt;
&lt;LI data-start="4229" data-end="4267"&gt;Overflow is routed to standard tier.&lt;/LI&gt;
&lt;LI data-start="4268" data-end="4322"&gt;The gateway absorbs throttling behavior and retries.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4324" data-end="4332"&gt;Outcome:&lt;/P&gt;
&lt;UL data-start="4333" data-end="4470"&gt;
&lt;LI data-start="4333" data-end="4374"&gt;Provisioned capacity is fully utilized.&lt;/LI&gt;
&lt;LI data-start="4375" data-end="4418"&gt;Spikes are handled without hard failures.&lt;/LI&gt;
&lt;LI data-start="4419" data-end="4470"&gt;Clients are unaware that backend routing changed.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="4477" data-end="4519"&gt;Scenario D: Subscription quota pooling&lt;/H4&gt;
&lt;P data-start="4521" data-end="4599"&gt;An organization reaches standard tier quota ceilings in a single subscription.&lt;/P&gt;
&lt;P data-start="4601" data-end="4697"&gt;They deploy Foundry resources across multiple subscriptions and place a single gateway in front.&lt;/P&gt;
&lt;P data-start="4699" data-end="4716"&gt;Gateway behavior:&lt;/P&gt;
&lt;UL data-start="4717" data-end="4847"&gt;
&lt;LI data-start="4717" data-end="4761"&gt;Distributes requests across subscriptions.&lt;/LI&gt;
&lt;LI data-start="4762" data-end="4797"&gt;Applies unified token governance.&lt;/LI&gt;
&lt;LI data-start="4798" data-end="4847"&gt;Exposes one API endpoint to all internal teams.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4849" data-end="4857"&gt;Outcome:&lt;/P&gt;
&lt;UL data-start="4858" data-end="4982"&gt;
&lt;LI data-start="4858" data-end="4893"&gt;Aggregate usable quota increases.&lt;/LI&gt;
&lt;LI data-start="4894" data-end="4936"&gt;Organizational boundaries are preserved.&lt;/LI&gt;
&lt;LI data-start="4937" data-end="4982"&gt;Clients remain unaware of backend topology.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-start="7365" data-end="7388"&gt;&lt;U&gt;Operational playbook&lt;/U&gt;&lt;/H3&gt;
&lt;P data-start="7390" data-end="7463"&gt;This is the part that separates “it works” from “it survives production”.&lt;/P&gt;
&lt;H4 data-start="7465" data-end="7495"&gt;1. Authentication strategy&lt;/H4&gt;
&lt;P data-start="7497" data-end="7520"&gt;&lt;STRONG data-start="7497" data-end="7520"&gt;Recommended default&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="7521" data-end="7708"&gt;
&lt;LI data-start="7521" data-end="7560"&gt;Terminate client auth at the gateway.&lt;/LI&gt;
&lt;LI data-start="7561" data-end="7708"&gt;Reestablish gateway-to-backend authorization via Azure RBAC rather than passing through client secrets.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The AI Gateway workshop provides a concrete example that uses the APIM authentication-managed-identity policy and sets the Authorization header for the backend call.&lt;/P&gt;
&lt;P data-start="7902" data-end="7915"&gt;&lt;STRONG data-start="7902" data-end="7915"&gt;Guardrail&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="7916" data-end="8070"&gt;
&lt;LI data-start="7916" data-end="8070"&gt;If you choose pass-through client credentials, ensure clients cannot bypass the gateway or model restrictions.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="8077" data-end="8113"&gt;2. Token throttling and fairness&lt;/H4&gt;
&lt;P data-start="8115" data-end="8179"&gt;You want limits that match how LLMs consume capacity and budget.&lt;/P&gt;
&lt;UL data-start="8181" data-end="8483"&gt;
&lt;LI data-start="8181" data-end="8322"&gt;APIM GenAI capabilities emphasize &lt;STRONG data-start="8217" data-end="8244"&gt;controlled token limits&lt;/STRONG&gt; and monitoring for cost efficiency.&lt;/LI&gt;
&lt;LI data-start="8323" data-end="8483"&gt;Foundry AI Gateway governance scenarios explicitly include configuring token limits for models at the project level.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="8485" data-end="8581"&gt;Use token throttling as your primary fairness control, then layer request-rate limits if needed.&lt;/P&gt;
&lt;H4 data-start="8588" data-end="8613"&gt;3. Failover semantics&lt;/H4&gt;
&lt;P data-start="8615" data-end="8668"&gt;Two rules that prevent most “self-inflicted outages”:&lt;/P&gt;
&lt;UL data-start="8670" data-end="9016"&gt;
&lt;LI data-start="8670" data-end="8871"&gt;&lt;STRONG data-start="8672" data-end="8695"&gt;Honor Retry-After&lt;/STRONG&gt; from the backend when implementing failover and circuit breaker behavior. Do not continuously hit a throttled endpoint returning 429.&lt;/LI&gt;
&lt;LI data-start="8872" data-end="9016"&gt;Prefer gateway-side retry and circuit breaking to avoid repeated code and to keep one place to test.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="9018" data-end="9237"&gt;The workshop shows a pragmatic retry condition on 429 and selected 503, combined with backend pool routing and a circuit breaker that can trip on 429 while checking Retry-After.&lt;/P&gt;
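&lt;P&gt;Honoring Retry-After starts with reading it correctly: the HTTP header can carry either a number of seconds or an HTTP date. The helper below is a minimal sketch using only the Python standard library. The function name, the one-second default, and the 60-second cap are assumptions, not values from the workshop.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0, cap=60.0):
    """Convert a Retry-After header (delta-seconds or HTTP-date) into a bounded wait."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        wait = float(value)
    else:
        try:
            target = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default  # unparseable header: fall back to the default pause
        if target.tzinfo is None:
            target = target.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        wait = (target - current).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return min(max(wait, 0.0), cap)


if __name__ == "__main__":
    print(retry_after_seconds("30"))    # 30.0
    print(retry_after_seconds("600"))   # 60.0 (capped)
    print(retry_after_seconds(None))    # 1.0&lt;/LI-CODE&gt;
&lt;P&gt;The cap is a judgment call: it stops one unusually long Retry-After value from parking a backend indefinitely, while still respecting the throttling signal in the common case.&lt;/P&gt;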
&lt;H4 data-start="9244" data-end="9289"&gt;4. Observability and consumption tracking&lt;/H4&gt;
&lt;P data-start="9291" data-end="9486"&gt;A gateway is uniquely positioned to publish telemetry across all consumed models to a single store, which makes unified dashboarding and alerting easier.&lt;/P&gt;
&lt;P data-start="9488" data-end="9777"&gt;APIM’s GenAI positioning highlights token monitoring as part of “cost efficiency”. &lt;BR data-start="9610" data-end="9613" /&gt;The workshop navigation includes model monitoring and consumption tracking as first-class steps in the AI Gateway journey.&lt;/P&gt;
&lt;P data-start="9779" data-end="9941"&gt;Operationally, decide up front which dimensions you will slice your telemetry by (project, tenant, application, environment) and enforce those identifiers at the gateway.&lt;/P&gt;
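&lt;P&gt;Enforcing identifiers can be as simple as rejecting requests that arrive without them and keying every usage record by the same tuple. In this sketch the dimension names come from the list above, while the header names and helper functions are hypothetical.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from collections import Counter

# Dimension names follow the text; the header names are assumptions for illustration.
REQUIRED_DIMENSIONS = {
    "project": "x-project",
    "tenant": "x-tenant",
    "application": "x-application",
    "environment": "x-environment",
}

token_usage = Counter()  # (project, tenant, application, environment) -> total tokens


def extract_dimensions(headers):
    """Return the telemetry dimensions of a request, or raise if an identifier is missing."""
    lowered = {name.lower(): value for name, value in headers.items()}
    missing = [dim for dim, header in REQUIRED_DIMENSIONS.items() if not lowered.get(header)]
    if missing:
        raise ValueError("missing telemetry identifiers: " + ", ".join(missing))
    return tuple(lowered[header] for header in REQUIRED_DIMENSIONS.values())


def record_usage(headers, total_tokens):
    """Add a request's token count to the bucket for its dimension tuple."""
    token_usage[extract_dimensions(headers)] += total_tokens


if __name__ == "__main__":
    request_headers = {
        "X-Project": "search",
        "X-Tenant": "contoso",
        "X-Application": "app-a",
        "X-Environment": "prod",
    }
    record_usage(request_headers, 1200)
    print(token_usage)&lt;/LI-CODE&gt;
&lt;P&gt;Rejecting untagged requests at the edge is stricter than logging them as unknown, but it keeps dashboards and chargeback reports free of an ever-growing "other" bucket.&lt;/P&gt;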
&lt;H4 data-start="9948" data-end="9998"&gt;5. APIOps: Treat gateway configuration as code&lt;/H4&gt;
&lt;P data-start="10000" data-end="10093"&gt;Even if you configure the first version in the portal, production systems need repeatability:&lt;/P&gt;
&lt;UL data-start="10095" data-end="10576"&gt;
&lt;LI data-start="10095" data-end="10248"&gt;Use a code-driven workflow for policies and configuration so routing and governance changes are reviewed and promoted like any other production change.&lt;/LI&gt;
&lt;LI data-start="10249" data-end="10421"&gt;If you adopt a federated model, APIM workspaces let individual teams manage their own APIs inside a shared API Management service while the platform team keeps central governance.&lt;/LI&gt;
&lt;LI data-start="10422" data-end="10576"&gt;Keep an eye on the APIM changelog and GenAI feature updates because gateway capabilities are evolving quickly.&lt;/LI&gt;
&lt;/UL&gt;
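&lt;P&gt;One low-cost way to start is a validation step that runs in CI before any gateway change is promoted. The sketch below assumes a simple dictionary-shaped routing config; the schema, the field names, and the validate_gateway_config function are hypothetical and not an APIM or APIOps format.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def validate_gateway_config(config):
    """Return a list of human-readable problems; an empty list means the config passes."""
    problems = []
    backends = config.get("backends", [])
    if not backends:
        problems.append("no backends defined")
    names = [backend.get("name") for backend in backends]
    if len(set(names)) != len(names):
        problems.append("backend names must be unique")
    for backend in backends:
        name = backend.get("name", "unnamed")
        if not str(backend.get("url", "")).startswith("https://"):
            problems.append(f"{name}: url must use https")
        if not isinstance(backend.get("priority"), int) or 1 > backend["priority"]:
            problems.append(f"{name}: priority must be an integer of 1 or more")
    for consumer, limit in config.get("token_limits_per_minute", {}).items():
        if not isinstance(limit, int) or 0 >= limit:
            problems.append(f"{consumer}: token limit must be a positive integer")
    return problems


if __name__ == "__main__":
    example = {
        "backends": [
            {"name": "ptu", "url": "https://ptu.example.com", "priority": 1},
            {"name": "standard", "url": "http://standard.example.com", "priority": 2},
        ],
        "token_limits_per_minute": {"app-a": 10_000, "app-b": 0},
    }
    for problem in validate_gateway_config(example):
        print(problem)&lt;/LI-CODE&gt;
&lt;P&gt;Returning every problem at once, rather than failing on the first, makes the check more useful in a pull request review.&lt;/P&gt;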
&lt;H3 data-start="10583" data-end="10611"&gt;&lt;U&gt;When not to add a gateway&lt;/U&gt;&lt;/H3&gt;
&lt;P data-start="10613" data-end="10874"&gt;The Architecture Center guide is explicit: If controlling client configuration is as easy as controlling gateway routing, the added reliability, security, cost, maintenance, and performance impact might not be worth it.&lt;/P&gt;
&lt;P data-start="10876" data-end="11169"&gt;Also, if you are using a single instance with multiple deployments primarily to simulate identity segmentation, you might be better served by multiple instances with distinct Azure RBAC boundaries instead of pushing that complexity into gateway logic.&lt;/P&gt;
&lt;H3 data-start="11176" data-end="11194"&gt;&lt;U&gt;Closing thought&lt;/U&gt;&lt;/H3&gt;
&lt;P data-start="11196" data-end="11276"&gt;A gateway is not a prerequisite for Foundry. It is an operational maturity step.&lt;/P&gt;
&lt;P data-start="11278" data-end="11593"&gt;When Foundry usage becomes multi-tenant, SLO-driven, and quota-sensitive, the gateway stops being “extra architecture” and becomes the place you express your platform intent. Auth boundaries. Token governance. Failover semantics. Telemetry. And a repeatable APIOps process to keep it all sane as the system evolves.&lt;/P&gt;
&lt;H3 data-start="11278" data-end="11593"&gt;&lt;U&gt;References&lt;/U&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-multi-backend" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Use a gateway in front of multiple Azure OpenAI deployments or instances - Azure Architecture Center"&gt;Use a gateway in front of multiple Azure OpenAI deployments or instances&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-foundry/configuration/enable-ai-api-management-gateway-portal" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Configure AI Gateway in your Foundry resources - Microsoft Foundry"&gt;Configure AI Gateway in your Foundry resources &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI gateway in Azure API Management"&gt;AI gateway in Azure API Management&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://azure.github.io/api-management-resources/#:~:text=,Enhanced%20Governance%20with%20runtime%20policies" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Azure API Management - apimlove"&gt;Azure API Management &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://azure-samples.github.io/AI-Gateway/docs/azure-openai/dynamic-failover" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Ensure resiliency and optimized resource consumption with load balancer &amp;amp; circuit breaker | AI Gateway workshop"&gt;Ensure resiliency and optimized resource consumption with load balancer &amp;amp; circuit breaker&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://azure-samples.github.io/AI-Gateway/docs/azure-openai/rate-limit" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Control cost and performance with token quotas and limits | AI Gateway workshop"&gt;Control cost and performance with token quotas and limits &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://azure-samples.github.io/AI-Gateway/docs/azure-openai/track-consumption" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Keep visibility into AI consumption with model monitoring | AI Gateway workshop"&gt;Keep visibility into AI consumption with model monitoring &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="GitHub - Azure-Samples/AI-Gateway: Labs to explore AI Models, MCP servers, and Agents with the AI Gateway powered by Azure API Management and Microsoft Foundry 🚀"&gt;GitHub - Azure-Samples/AI-Gateway&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/access-controlling" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/access-controlling at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/access-controlling &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/function-calling" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/function-calling at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/function-calling &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/model-context-protocol" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/model-context-protocol at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/model-context-protocol &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/openai-agents" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/openai-agents at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/openai-agents &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/ai-agent-service" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/ai-agent-service at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/ai-agent-service &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/semantic-caching" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/semantic-caching at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/semantic-caching &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/finops-framework" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/finops-framework at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/finops-framework&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/slm-self-hosting" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/slm-self-hosting at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/slm-self-hosting &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/ai-foundry-deepseek" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AI-Gateway/labs/ai-foundry-deepseek at main · Azure-Samples/AI-Gateway"&gt;AI-Gateway/labs/ai-foundry-deepseek &lt;/A&gt;&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Jan 2026 19:35:08 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/production-grade-api-gateway-patterns-for-microsoft-foundry/ba-p/4490494</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-01-29T19:35:08Z</dc:date>
    </item>
    <item>
      <title>When and why startups add a gateway in front of Microsoft Foundry</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/when-and-why-startups-add-a-gateway-in-front-of-microsoft/ba-p/4489490</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;Note: This post focuses on when and why startups begin adopting a gateway in front of Microsoft Foundry. In a follow-up article, we’ll go into a technical deep dive, covering design decisions, operational tradeoffs, latency considerations, observability, and patterns used in production-scale environments.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="2865" data-end="2938"&gt;Most teams don’t hit scaling challenges with Microsoft Foundry on day one.&lt;/P&gt;
&lt;P data-start="2940" data-end="3134"&gt;Early on, things are simple. One or two applications call Foundry directly. Traffic is predictable. Model experimentation moves fast. Everything works, and there’s no reason to add extra layers.&lt;/P&gt;
&lt;P data-start="3136" data-end="3156"&gt;Then adoption grows. More applications start calling the same models. Traffic becomes spiky. Teams want better visibility into usage. Questions about rate limits, authentication, and how to evolve models over time begin to surface.&lt;/P&gt;
&lt;P data-start="3370" data-end="3488"&gt;This is usually the moment when teams start asking: &lt;STRONG data-start="3424" data-end="3488"&gt;“Do we need some kind of control layer in front of Foundry?”&lt;/STRONG&gt;&lt;/P&gt;
&lt;H2 data-start="3490" data-end="3526"&gt;The signals that start to show up&lt;/H2&gt;
&lt;P data-start="3528" data-end="3607"&gt;Across many startups, the same patterns tend to emerge as Foundry usage scales:&lt;/P&gt;
&lt;UL data-start="3609" data-end="3876"&gt;
&lt;LI data-start="3609" data-end="3677"&gt;Multiple clients and services calling the same Foundry endpoints&lt;/LI&gt;
&lt;LI data-start="3678" data-end="3738"&gt;The need for consistent rate limiting and access control&lt;/LI&gt;
&lt;LI data-start="3739" data-end="3813"&gt;A desire to evolve models or deployments without touching every client&lt;/LI&gt;
&lt;LI data-start="3814" data-end="3876"&gt;Limited visibility into who is calling what, and how often&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3878" data-end="3971"&gt;None of these are problems at small scale. But together, they create friction as usage grows.&lt;/P&gt;
&lt;H2 data-start="3973" data-end="4011"&gt;A pattern we often see working well&lt;/H2&gt;
&lt;P data-start="4013" data-end="4103"&gt;A common pattern at this stage is placing a &lt;STRONG data-start="4057" data-end="4102"&gt;gateway in front of Microsoft Foundry APIs&lt;/STRONG&gt;.&lt;/P&gt;
&lt;img&gt;Client applications call a single gateway endpoint, where policies such as authentication, rate limits, and routing are applied before requests are forwarded to Foundry model deployments.&lt;/img&gt;
&lt;P data-start="4105" data-end="4324"&gt;Rather than having every application talk directly to Foundry, teams introduce a control layer that sits between clients and Foundry.&lt;/P&gt;
&lt;P data-start="4105" data-end="4324"&gt;On Azure, this is often implemented using &lt;STRONG data-start="4281" data-end="4323"&gt;API Management with GenAI capabilities&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-start="4326" data-end="4470"&gt;This gateway does not replace Foundry. Foundry remains the model and AI platform. The gateway simply becomes the entry point for client traffic.&lt;/P&gt;
&lt;H2 data-start="4472" data-end="4504"&gt;What this enables in practice&lt;/H2&gt;
&lt;P data-start="4506" data-end="4576"&gt;When teams introduce a gateway layer, a few things become much easier:&lt;/P&gt;
&lt;UL data-start="4578" data-end="4914"&gt;
&lt;LI data-start="4578" data-end="4669"&gt;&lt;STRONG data-start="4580" data-end="4612"&gt;A single, stable API surface&lt;/STRONG&gt; for applications, even as models or deployments evolve&lt;/LI&gt;
&lt;LI data-start="4670" data-end="4748"&gt;&lt;STRONG data-start="4672" data-end="4717"&gt;Centralized throttling and authentication&lt;/STRONG&gt;, instead of per-client logic&lt;/LI&gt;
&lt;LI data-start="4749" data-end="4828"&gt;&lt;STRONG data-start="4751" data-end="4775"&gt;Policy-based routing&lt;/STRONG&gt; across models or backends without changing clients&lt;/LI&gt;
&lt;LI data-start="4829" data-end="4914"&gt;&lt;STRONG data-start="4831" data-end="4871"&gt;Improved request-level observability&lt;/STRONG&gt; into usage patterns, latency, and errors&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4916" data-end="5077"&gt;Importantly, this structure lets teams scale without slowing down experimentation. Model teams can continue to iterate, while platform concerns stay centralized.&lt;/P&gt;
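&lt;P&gt;To make the flow concrete, here is a minimal Python sketch of the three checks a gateway typically applies in order: authentication, rate limiting, and routing. It is a toy model for illustration, not how Azure API Management is configured; the &lt;CODE&gt;Gateway&lt;/CODE&gt; class, the key and deployment names, and the fixed-window limit are all assumptions made for this example.&lt;/P&gt;

```python
import time
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Toy stand-in for a gateway policy pipeline (hypothetical, not the APIM API)."""
    api_keys: dict                 # assumed mapping: client key to client name
    routes: dict                   # assumed mapping: logical model name to backend deployment
    limit_per_window: int = 3      # max requests per client per window
    window_seconds: float = 60.0
    _windows: dict = field(default_factory=dict)

    def handle(self, api_key, model, now=None):
        now = time.monotonic() if now is None else now

        # 1. Authentication: reject unknown callers before anything else.
        client = self.api_keys.get(api_key)
        if client is None:
            return 401, "unknown client"

        # 2. Rate limiting: one fixed-window counter per client.
        start, count = self._windows.get(client, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0
        if count >= self.limit_per_window:
            return 429, "rate limit exceeded"
        self._windows[client] = (start, count + 1)

        # 3. Routing: clients use a stable logical name; the gateway picks the backend.
        backend = self.routes.get(model)
        if backend is None:
            return 404, "unknown model"
        return 200, backend


gateway = Gateway(
    api_keys={"key-web": "web-app"},
    routes={"chat": "foundry-deployment-a"},
    limit_per_window=2,
)
print(gateway.handle("key-web", "chat", now=0.0))
# Swap the backend without touching the client.
gateway.routes["chat"] = "foundry-deployment-b"
print(gateway.handle("key-web", "chat", now=1.0))
# Third call in the same window is throttled.
print(gateway.handle("key-web", "chat", now=2.0))
```

&lt;P&gt;The client keeps calling the logical name "chat" throughout, which is the point of the stable API surface: the deployment behind it changes in one place.&lt;/P&gt;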
&lt;H2 data-start="5079" data-end="5106"&gt;What this pattern is not&lt;/H2&gt;
&lt;P data-start="5108" data-end="5159"&gt;It’s worth calling out what this approach is &lt;EM data-start="5153" data-end="5158"&gt;not&lt;/EM&gt;:&lt;/P&gt;
&lt;UL data-start="5161" data-end="5289"&gt;
&lt;LI data-start="5161" data-end="5197"&gt;It’s &lt;STRONG data-start="5168" data-end="5195"&gt;not required on day one&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-start="5198" data-end="5242"&gt;It’s &lt;STRONG data-start="5205" data-end="5240"&gt;not mandatory for every startup&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-start="5243" data-end="5289"&gt;It’s &lt;STRONG data-start="5250" data-end="5287"&gt;not about adding complexity early&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="5291" data-end="5468"&gt;Many teams run successfully without a gateway for a long time. This pattern becomes useful when scale, team size, or operational needs make direct integrations harder to manage.&lt;/P&gt;
&lt;H2 data-start="5470" data-end="5505"&gt;When teams usually consider this&lt;/H2&gt;
&lt;P data-start="5507" data-end="5564"&gt;From experience, teams tend to explore this pattern when:&lt;/P&gt;
&lt;UL data-start="5566" data-end="5794"&gt;
&lt;LI data-start="5566" data-end="5620"&gt;Foundry usage spans multiple applications or teams&lt;/LI&gt;
&lt;LI data-start="5621" data-end="5675"&gt;Rate limits and quotas need consistent enforcement&lt;/LI&gt;
&lt;LI data-start="5676" data-end="5740"&gt;There’s a desire to future-proof model or deployment changes&lt;/LI&gt;
&lt;LI data-start="5741" data-end="5794"&gt;Observability and governance start to matter more&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="5796" data-end="5895"&gt;If those conversations are already happening, it’s often a good time to look at a gateway approach.&lt;/P&gt;
&lt;H2 data-start="5897" data-end="5923"&gt;How this looks on Azure&lt;/H2&gt;
&lt;P data-start="5925" data-end="5978"&gt;On Azure, this pattern is commonly implemented using:&lt;/P&gt;
&lt;UL data-start="5980" data-end="6147"&gt;
&lt;LI data-start="5980" data-end="6023"&gt;&lt;STRONG data-start="5982" data-end="6006"&gt;Azure API Management&lt;/STRONG&gt; as the gateway&lt;/LI&gt;
&lt;LI data-start="6024" data-end="6092"&gt;&lt;STRONG data-start="6026" data-end="6047"&gt;AI-aware policies&lt;/STRONG&gt; for rate limiting, routing, and governance&lt;/LI&gt;
&lt;LI data-start="6093" data-end="6147"&gt;&lt;STRONG data-start="6095" data-end="6115"&gt;Microsoft Foundry&lt;/STRONG&gt; as the backend model platform&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="6149" data-end="6252"&gt;The architecture stays flexible. Teams can start simple and add capabilities over time as needs evolve.&lt;/P&gt;
&lt;H2 data-start="6254" data-end="6273"&gt;Closing thoughts&lt;/H2&gt;
&lt;P data-start="6275" data-end="6332"&gt;This pattern is less about tooling and more about timing.&lt;/P&gt;
&lt;P data-start="6334" data-end="6543"&gt;Adding a gateway too early can slow teams down. Adding it too late can make change painful. The right moment is usually when Foundry usage starts to feel like a shared platform rather than a single experiment.&lt;/P&gt;
&lt;P data-start="6545" data-end="6637"&gt;For teams approaching that stage, a gateway can provide structure without taking away speed.&lt;/P&gt;
&lt;H2 data-start="6639" data-end="6652"&gt;References&lt;/H2&gt;
&lt;UL data-start="6654" data-end="7120"&gt;
&lt;LI data-start="6654" data-end="6830"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-foundry/configuration/enable-ai-api-management-gateway-portal?view=foundry" target="_blank" rel="noopener"&gt;Enable API Management gateway for Microsoft Foundry &lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="6654" data-end="6830"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities" target="_blank" rel="noopener"&gt;GenAI gateway capabilities in API Management&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="6968" data-end="7120"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-multi-backend" target="_blank" rel="noopener"&gt;Gateway patterns for multi-backend AI setups&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 27 Jan 2026 03:24:31 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/when-and-why-startups-add-a-gateway-in-front-of-microsoft/ba-p/4489490</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-01-27T03:24:31Z</dc:date>
    </item>
    <item>
      <title>Azure has three permission systems, and you're probably confusing them</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-has-three-permission-systems-and-you-re-probably-confusing/ba-p/4471854</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Series: Azure Governance for Digital Natives and Startups:&amp;nbsp;&lt;/STRONG&gt;This is&amp;nbsp;&lt;STRONG&gt;Part 1&lt;/STRONG&gt;&amp;nbsp;of a 3-part series on Azure governance for digital-native companies scaling on Azure.&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Part 1&lt;/STRONG&gt; (this post): The three-plane model: Identity, Resources, and Billing&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Part 2&lt;/STRONG&gt;: &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/marketplace-governance-and-the-cross-plane-bridge/4510067" target="_blank" rel="noopener" data-lia-auto-title="Marketplace, Managed Identity, and where the planes collide&amp;nbsp;" data-lia-auto-title-active="0"&gt;Marketplace, Managed Identity, and where the planes collide&amp;nbsp;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Part 3&lt;/STRONG&gt;: &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/role-structures-anti-patterns-and-the-10-governance-principles/4510070" target="_blank" rel="noopener" data-lia-auto-title="Anti-patterns, role structures, and the 10 principles of Azure governance" data-lia-auto-title-active="0"&gt;Anti-patterns, role structures, and the 10 principles of Azure governance&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Azure is a powerful cloud platform, but its governance model is widely misunderstood, especially in fast-moving, engineering-led organizations.&lt;/P&gt;
&lt;P&gt;After working with dozens of digital-native customers (AI startups, SaaS platforms, companies scaling from zero to millions in Azure spend), I've seen the same confusion play out over and over. Engineers can't see MACC credits. Finance can't see workloads. Global Admins think they own everything. And Marketplace purchases happen without anyone in Finance knowing.&lt;/P&gt;
&lt;P&gt;The root cause is always the same:&amp;nbsp;&lt;STRONG&gt;Azure is governed by three completely separate permission systems&lt;/STRONG&gt;, and most teams treat it like one.&lt;/P&gt;
&lt;P&gt;If you're a customer moving fast on Azure, you've likely heard these questions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;"Why can't my engineering Owner see MACC credits?"&lt;/LI&gt;
&lt;LI&gt;"Why can't a Billing Contributor deploy a VM?"&lt;/LI&gt;
&lt;LI&gt;"Why doesn't Global Admin let me access subscriptions?"&lt;/LI&gt;
&lt;LI&gt;"Why can a Contributor deploy AKS but not buy Snowflake?"&lt;/LI&gt;
&lt;LI&gt;"Why does Cost Management Reader show cost but not credit balance?"&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;These questions come up with nearly every customer I work with: AI companies consuming Azure OpenAI at scale, SaaS companies running global AKS footprints, and digital natives under Microsoft Azure Consumption Commitments (MACC).&lt;/P&gt;
&lt;P&gt;This guide breaks down the entire model with practical patterns and a detailed look at each plane, so these questions are never confusing again.&lt;/P&gt;
&lt;H2&gt;Why digital natives struggle with this&lt;/H2&gt;
&lt;P&gt;Before diving into the technical model, it's worth understanding&amp;nbsp;&lt;EM&gt;why&lt;/EM&gt;&amp;nbsp;this causes so much friction in digital-native companies specifically. These problems hit startups and scaling companies harder than traditional enterprises for three reasons:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Speed over governance.&lt;/STRONG&gt;&amp;nbsp;Engineering-led companies prioritize shipping over process. Governance is added retroactively, often after something goes wrong.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Flat org structures.&lt;/STRONG&gt; Without clear Platform, Finance, and Security functions, the same people end up with roles across multiple planes, creating exactly the kind of role sprawl the three-plane model is designed to prevent.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;MACC commitments.&lt;/STRONG&gt;&amp;nbsp;Digital natives under MACC have a financial relationship with Azure that most team members don't even know exists. When engineers can't see MACC burn and finance can't see resource usage, nobody has the full picture.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The result is predictable:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;What They Expect&lt;/th&gt;&lt;th&gt;What They Actually Get&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Engineers&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;"I'm Owner, I should see everything, including billing"&lt;/td&gt;&lt;td&gt;RBAC gives full resource control but zero billing visibility&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Finance&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;"I need to see what's running so I can forecast"&lt;/td&gt;&lt;td&gt;Billing Reader shows credits and invoices but not workloads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Security&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;"I'm Global Admin, I have total control"&lt;/td&gt;&lt;td&gt;Entra controls identity but not resources or billing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Procurement&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;"I need to buy Marketplace software for the team"&lt;/td&gt;&lt;td&gt;Marketplace purchases require billing roles, not RBAC&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Leadership&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;"I want a single dashboard for cost, resources, and credits"&lt;/td&gt;&lt;td&gt;No single role spans all three planes; you need a combination&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;When these expectations go unaddressed, engineers get billing access "just to see costs" (creating financial risk), Marketplace purchases happen without finance oversight, and Global Admin is treated as the "master key" when it controls only one of three planes.&lt;/P&gt;
&lt;P&gt;The fix isn't more permissions. It's&amp;nbsp;&lt;STRONG&gt;the right permissions in the right plane for the right people&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H2&gt;The three-plane model&lt;/H2&gt;
&lt;P&gt;Everything in Azure governance flows from this single truth:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Plane&lt;/th&gt;&lt;th&gt;Controls&lt;/th&gt;&lt;th&gt;Example Roles&lt;/th&gt;&lt;th&gt;See Billing?&lt;/th&gt;&lt;th&gt;Deploy Resources?&lt;/th&gt;&lt;th&gt;Manage Identity?&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Microsoft Entra&lt;/STRONG&gt;&amp;nbsp;(Identity)&lt;/td&gt;&lt;td&gt;Users, groups, MFA, PIM, Conditional Access&lt;/td&gt;&lt;td&gt;Global Admin, Groups Admin, Privileged Role Admin&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Azure RBAC&lt;/STRONG&gt;&amp;nbsp;(Resources)&lt;/td&gt;&lt;td&gt;VMs, AKS, Storage, AOAI, networking, policies&lt;/td&gt;&lt;td&gt;Owner, Contributor, Reader&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Billing / Commerce&lt;/STRONG&gt;&amp;nbsp;(Financial)&lt;/td&gt;&lt;td&gt;Invoices, credits, MACC, payments, Marketplace purchases&lt;/td&gt;&lt;td&gt;Billing Owner, Billing Reader&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;col style="width: 16.67%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Three planes. Zero overlap. A role in one plane grants&amp;nbsp;&lt;STRONG&gt;zero&lt;/STRONG&gt;&amp;nbsp;access in the others.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Entra Global Admin can't access subscriptions.&lt;/LI&gt;
&lt;LI&gt;Subscription Owner can't see the MACC balance.&lt;/LI&gt;
&lt;LI&gt;Billing Account Owner can't deploy resources.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This separation is by design. Once your company internalizes it, governance becomes dramatically more predictable.&lt;/P&gt;
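&lt;P&gt;One way to internalize the separation is to model it as a lookup: every role lives in exactly one plane, every action is governed by exactly one plane, and access exists only when the two match. The Python sketch below is a deliberate simplification of the table above. The action names are made up for illustration, and real role definitions are far more granular.&lt;/P&gt;

```python
# Simplified from the three-plane table; not an exhaustive role catalog.
PLANE_OF_ROLE = {
    "Global Admin": "identity",
    "Owner": "resources",
    "Contributor": "resources",
    "Billing Owner": "billing",
    "Billing Reader": "billing",
}

# Hypothetical action names, one per plane.
PLANE_OF_ACTION = {
    "manage_identity": "identity",
    "deploy_resources": "resources",
    "see_billing": "billing",
}


def allowed(roles, action):
    """True if any held role lives in the plane that governs the action."""
    needed = PLANE_OF_ACTION[action]
    return any(PLANE_OF_ROLE[role] == needed for role in roles)


def missing_planes(roles):
    """Planes in which the person holds no role at all."""
    held = {PLANE_OF_ROLE[role] for role in roles}
    return sorted(set(PLANE_OF_ACTION.values()) - held)


print(allowed(["Owner"], "see_billing"))                    # False
print(allowed(["Owner", "Billing Reader"], "see_billing"))  # True
print(missing_planes(["Global Admin"]))                     # ['billing', 'resources']
```

&lt;P&gt;The second call shows the practical consequence: visibility across planes comes from holding a role in each plane, never from a bigger role in one of them.&lt;/P&gt;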
&lt;img /&gt;
&lt;H2&gt;Plane 1: Microsoft Entra (Identity Plane)&lt;/H2&gt;
&lt;P&gt;&lt;EM&gt;Security, authentication, authorization, administrative boundaries.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Microsoft Entra (formerly Azure AD) is the authoritative identity provider for Azure. It governs identity, authentication, Conditional Access, PIM (Privileged Identity Management), group membership, and tenant-wide administrative policies. Entra is the security boundary for the entire tenant.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;💡 Common misunderstanding:&lt;/STRONG&gt;&amp;nbsp;&lt;EM&gt;"I'm Global Admin, why can't I access subscriptions?"&lt;/EM&gt;&lt;BR /&gt;&lt;BR /&gt;Because Entra roles do&amp;nbsp;&lt;STRONG&gt;not&lt;/STRONG&gt;&amp;nbsp;grant Azure RBAC permissions by default. This behavior is intentional and foundational. A compromised Global Admin has no standing access to subscriptions: reaching them requires explicitly elevating access, a separate step that is recorded in logs and can be monitored. A compromised Subscription Owner cannot compromise directory security. Identity and infrastructure operate independently for resiliency.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3&gt;What Entra roles can do&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Create and manage users&lt;/LI&gt;
&lt;LI&gt;Manage MFA &amp;amp; Conditional Access&lt;/LI&gt;
&lt;LI&gt;Approve PIM requests&lt;/LI&gt;
&lt;LI&gt;Manage security settings&lt;/LI&gt;
&lt;LI&gt;Create/assign groups (which can then hold RBAC roles)&lt;/LI&gt;
&lt;LI&gt;Manage enterprise applications, OIDC apps, etc.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;What Entra roles cannot do&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Action&lt;/th&gt;&lt;th&gt;Allowed?&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Deploy resources&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Access subscriptions&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;View MACC credits&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Make Marketplace purchases&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Modify billing profiles&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Change RBAC roles&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Access data or storage accounts&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;Most relevant Entra roles for startups&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Entra Role&lt;/th&gt;&lt;th&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Global Administrator&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Full directory control (identity, security)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Privileged Role Administrator&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Manages privileged role assignments&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Groups Administrator&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Creates and manages groups (often used for RBAC assignments)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Conditional Access Administrator&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Manages CA policies&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Authentication Administrator&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Controls authentication settings&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Security Administrator&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Manages security policies and alerts&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Key insight:&lt;/STRONG&gt;&amp;nbsp;Entra governs&amp;nbsp;&lt;EM&gt;identity and security&lt;/EM&gt;, not cloud resources or billing. Because Entra manages groups, and groups are often used for RBAC assignments, Entra is the root of&amp;nbsp;&lt;EM&gt;who can be given access,&amp;nbsp;&lt;/EM&gt;but not&amp;nbsp;&lt;EM&gt;what access they have&lt;/EM&gt;. This is where many organizations misunderstand the boundary.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Plane 2: Azure RBAC (Resource Plane)&lt;/H2&gt;
&lt;P&gt;&lt;EM&gt;Everything engineering touches: workloads, clusters, deployments, pipelines, resources.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Azure RBAC is the backbone of the Azure operational model. It controls all deployments (IaC, CLI, Portal, API), resource creation &amp;amp; modification, monitoring &amp;amp; diagnostics, Key Vault, Storage, Networking, AKS cluster operations, and Azure OpenAI deployments: in short, everything under Azure Resource Manager (ARM).&lt;/P&gt;
&lt;H3&gt;RBAC scopes&lt;/H3&gt;
&lt;P&gt;RBAC can be assigned at: Tenant root → Management group → Subscription → Resource group → Individual resource → Sub-resource (e.g., Key Vault secret).&lt;/P&gt;
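&lt;P&gt;Assignments made at a higher scope are inherited by everything beneath it. From the subscription level down, that nesting is visible in the scope path itself, so a rough check is whether the assignment scope is a segment-wise prefix of the target scope. The Python sketch below illustrates that idea with made-up subscription and resource group names. It does not cover management groups, whose children are not expressed as path prefixes and would need a hierarchy lookup.&lt;/P&gt;

```python
def scope_segments(scope):
    """Split an ARM-style scope path into lowercase segments."""
    return [part.lower() for part in scope.strip("/").split("/") if part]


def assignment_applies(assignment_scope, target_scope):
    """True if target_scope is the assignment scope itself or nested beneath it."""
    parent = scope_segments(assignment_scope)
    child = scope_segments(target_scope)
    # Compare whole segments so rg-api does not match rg-api-v2.
    return child[:len(parent)] == parent


# Hypothetical scopes for illustration.
sub = "/subscriptions/sub-prod"
rg = sub + "/resourceGroups/rg-api"

print(assignment_applies(sub, rg))                                # True
print(assignment_applies(rg, sub))                                # False
print(assignment_applies(rg, sub + "/resourceGroups/rg-api-v2"))  # False
```

&lt;P&gt;This is why a broad assignment at subscription scope is so consequential: it silently covers every resource group and resource created later.&lt;/P&gt;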
&lt;H3&gt;RBAC role behavior&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;Can Deploy?&lt;/th&gt;&lt;th&gt;Can View Usage Cost?&lt;/th&gt;&lt;th&gt;Can View Billing/MACC?&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Owner&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Contributor&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Reader&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes (limited)&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Cost Management Reader&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;User Access Admin&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The critical point:&lt;/STRONG&gt;&amp;nbsp;RBAC cannot see billing. RBAC cannot view MACC. RBAC cannot read invoices. RBAC cannot approve Marketplace purchases. Even&amp;nbsp;&lt;EM&gt;Owner&lt;/EM&gt;, the highest role in the resource plane, is&amp;nbsp;&lt;STRONG&gt;blind&lt;/STRONG&gt;&amp;nbsp;to billing.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Plane 3: Azure Billing/Commerce (Financial Plane)&lt;/H2&gt;
&lt;P&gt;&lt;EM&gt;Governed by the Microsoft Commerce Platform, not Azure Resource Manager.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;This plane governs the financial relationship between the customer and Microsoft: billing accounts, invoices, credits (MACC, Azure credits, grants), commitments, discounts, payment methods, invoice sections, Marketplace SaaS purchases, reservations &amp;amp; savings plans, and private offers. Commerce roles live in an entirely different system from RBAC.&lt;/P&gt;
&lt;H3&gt;Common billing roles&lt;/H3&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Role&lt;/th&gt;&lt;th&gt;Can see credits?&lt;/th&gt;&lt;th&gt;Can deploy?&lt;/th&gt;&lt;th&gt;Notes&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Billing Account Owner&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;Full financial authority&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Billing Contributor&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;Can update payment methods&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Billing Reader&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;Most finance teams use this&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Invoice Section Owner&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;✅ Yes&lt;/td&gt;&lt;td&gt;❌ No&lt;/td&gt;&lt;td&gt;Scoped financial management&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;What billing roles can see:&lt;/STRONG&gt;&amp;nbsp;MACC balance, credits, invoices, payment history, reservations &amp;amp; savings plans (financial side), and Marketplace purchase capabilities.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What billing roles cannot do:&lt;/STRONG&gt;&amp;nbsp;deploy anything, modify RBAC, access resources, see workloads, change policy, or access cost analysis at resource group level.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Billing is where MACC lives.&lt;/STRONG&gt;&amp;nbsp;MACC (Microsoft Azure Consumption Commitment) visibility is tied to Billing Account Owner, Billing Account Contributor, and Billing Reader. Even a subscription Owner cannot see MACC burn rate. This single point causes confusion in almost every startup onboarding to Azure.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Full comparison matrix&lt;/H2&gt;
&lt;P&gt;When you need to answer "who can see what?", use this table:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Data type&lt;/th&gt;&lt;th&gt;System&lt;/th&gt;&lt;th&gt;Who can see it&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Resource usage cost&lt;/td&gt;&lt;td&gt;ARM (RBAC)&lt;/td&gt;&lt;td&gt;Cost Mgmt Reader, Owner, Contributor&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Resource inventory&lt;/td&gt;&lt;td&gt;ARM (RBAC)&lt;/td&gt;&lt;td&gt;Owner, Contributor, Reader&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Budgets &amp;amp; cost alerts&lt;/td&gt;&lt;td&gt;ARM (RBAC)&lt;/td&gt;&lt;td&gt;Owner, Contributor, Cost Mgmt Reader&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Azure OpenAI cost analysis&lt;/td&gt;&lt;td&gt;ARM (RBAC)&lt;/td&gt;&lt;td&gt;RBAC roles&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MACC credit balance&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Invoices &amp;amp; payments&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Marketplace private offers&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Commercial discounts&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;💡 If your engineering lead says "I can see costs" and your CFO says "I can see costs", they are looking at different data from different systems.&lt;/STRONG&gt;&amp;nbsp;Both are right. Neither has the full picture.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;The #1 source of confusion: Cost Management Reader vs. Billing Reader&lt;/H2&gt;
&lt;P&gt;This is the single most frequent misunderstanding in Azure governance. These two roles sound similar, but they belong to completely different systems.&lt;/P&gt;
&lt;H3&gt;Cost Management Reader (RBAC Plane)&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Can see:&lt;/STRONG&gt;&amp;nbsp;usage-based resource cost, cost by tags, cost by resource, cost forecast, budgets &amp;amp; alerts.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Cannot see:&lt;/STRONG&gt;&amp;nbsp;credits, invoices, payments, MACC, private offers, or contract terms.&lt;/P&gt;
&lt;H3&gt;Billing Reader (Commerce Plane)&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Can see:&lt;/STRONG&gt;&amp;nbsp;invoices, credits, payments, MACC balance, Marketplace transaction history.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Cannot see:&lt;/STRONG&gt;&amp;nbsp;resource-level cost breakdown, cost by tags, subscription usage trends, or resource inventory.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Data type&lt;/th&gt;&lt;th&gt;Where it lives&lt;/th&gt;&lt;th&gt;Who can see it&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Resource usage cost&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Azure Cost Management (ARM)&lt;/td&gt;&lt;td&gt;Cost Mgmt Reader, Owner, Contributor&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Budgets &amp;amp; cost alerts&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;ARM&lt;/td&gt;&lt;td&gt;Owner, Contributor, Cost Mgmt Reader&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;MACC credit balance&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Invoices&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Marketplace private offers&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Commercial discounts&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Commerce Platform&lt;/td&gt;&lt;td&gt;Billing roles only&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Cost visibility (usage-based cost) comes from RBAC. Billing visibility (credits, invoices, MACC) comes from Commerce. These are two completely different datasets. When you understand this distinction, half of the "why can't I see…?" questions answer themselves.&lt;/P&gt;
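&lt;P&gt;As a rough illustration of the split described above, the lookup below maps a data type to the plane that holds it and to the smallest read-only role that exposes it. This is only a sketch: the short keys (usage-cost, macc, and so on) and the function names are invented for the example, and the mapping simply restates the tables in this post.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Sketch: which plane holds a data type, and which read-only role shows it.
# The short keys and function names are hypothetical labels for this example.

plane_for() {
  case "$1" in
    usage-cost|budgets|inventory)           echo "ARM (RBAC)" ;;
    macc|invoices|private-offers|discounts) echo "Commerce Platform" ;;
    *)                                      echo "unknown" ;;
  esac
}

reader_role_for() {
  case "$1" in
    usage-cost|budgets)                     echo "Cost Management Reader" ;;
    inventory)                              echo "Reader" ;;
    macc|invoices|private-offers|discounts) echo "Billing Reader" ;;
    *)                                      echo "unknown" ;;
  esac
}

# Print the plane and role for one item from each system.
for item in usage-cost macc; do
  echo "$item: $(plane_for "$item") via $(reader_role_for "$item")"
done&lt;/LI-CODE&gt;
&lt;P&gt;A lookup like this can sit in an onboarding script or runbook so that "why can't I see…?" questions get routed to the right admin (RBAC owner or billing owner) on the first try.&lt;/P&gt;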
&lt;H2&gt;Quick start: where to set this up&lt;/H2&gt;
&lt;P&gt;Here's exactly where each plane is configured, in the Portal and via CLI.&lt;/P&gt;
&lt;H3&gt;Microsoft Entra (Identity Plane)&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Portal:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://portal.azure.com" target="_blank" rel="noopener"&gt;Azure Portal&lt;/A&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;Microsoft Entra ID&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;Roles and administrators&lt;/STRONG&gt;&lt;/P&gt;
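&lt;LI-CODE lang="bash"&gt;# List the directory roles activated in the tenant
az rest --method GET \
  --url "https://graph.microsoft.com/v1.0/directoryRoles"

# Add a user to a directory role ({role-id} is the id of a role returned above)
az rest --method POST \
  --url 'https://graph.microsoft.com/v1.0/directoryRoles/{role-id}/members/$ref' \
  --body '{"@odata.id":"https://graph.microsoft.com/v1.0/directoryObjects/{user-object-id}"}'&lt;/LI-CODE&gt;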
&lt;H3&gt;Azure RBAC (Resource Plane)&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Portal:&lt;/STRONG&gt;&amp;nbsp;Subscription →&amp;nbsp;&lt;STRONG&gt;Access Control (IAM)&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;Add role assignment&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Assign Contributor at subscription scope
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Contributor" \
  --scope "/subscriptions/{subscription-id}"

# Assign Cost Management Reader at resource group scope
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Cost Management Reader" \
  --scope "/subscriptions/{subscription-id}/resourceGroups/{rg-name}"&lt;/LI-CODE&gt;
&lt;H3&gt;Azure Billing/Commerce (Financial Plane)&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Portal:&lt;/STRONG&gt;&amp;nbsp;&lt;A href="https://portal.azure.com" target="_blank" rel="noopener"&gt;Azure Portal&lt;/A&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;Cost Management + Billing&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;Billing scopes&lt;/STRONG&gt;&amp;nbsp;→ select billing account →&amp;nbsp;&lt;STRONG&gt;Access control (IAM)&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# List billing accounts
az billing account list --output table

# Assign Billing Reader via REST API
az rest --method PUT \
  --url "https://management.azure.com/providers/Microsoft.Billing/billingAccounts/{billing-account-id}/billingRoleAssignments/{id}?api-version=2024-04-01" \
  --body '{"properties":{"principalId":"{user-object-id}","roleDefinitionId":"/providers/Microsoft.Billing/billingAccounts/{billing-account-id}/billingRoleDefinitions/{billing-reader-role-id}"}}'&lt;/LI-CODE&gt;
&lt;H2&gt;References&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/overview" target="_blank" rel="noopener"&gt;Azure RBAC Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/role-based-access-control/rbac-and-directory-admin-roles" target="_blank" rel="noopener"&gt;Entra Directory &amp;amp; Admin Roles&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/cost-management-billing/manage/understand-mca-roles" target="_blank" rel="noopener"&gt;Billing Roles (Microsoft Customer Agreement)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/cost-management-billing/costs/assign-access-acm-data" target="_blank" rel="noopener"&gt;Assign Access to Cost Management Data&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;What's next →&lt;/STRONG&gt;&amp;nbsp;This post established the foundation: Azure's three permission planes are separate by design. But the real complexity begins where these planes&amp;nbsp;&lt;EM&gt;intersect&lt;/EM&gt;. &lt;BR /&gt;&lt;BR /&gt;In &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/marketplace-governance-and-the-cross-plane-bridge/4510067" target="_blank" rel="noopener" data-lia-auto-title="part 2" data-lia-auto-title-active="0"&gt;part 2&lt;/A&gt;, we'll explore &lt;STRONG&gt;Marketplace governance&lt;/STRONG&gt;, where resource deployment meets financial authority, along with&amp;nbsp;&lt;STRONG&gt;Managed Identity&lt;/STRONG&gt;, the one construct that bridges two planes, and&amp;nbsp;&lt;STRONG&gt;ABAC&lt;/STRONG&gt; for advanced conditional governance.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Thu, 09 Apr 2026 21:10:41 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-has-three-permission-systems-and-you-re-probably-confusing/ba-p/4471854</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-04-09T21:10:41Z</dc:date>
    </item>
    <item>
      <title>Azure capacity planning: Using quotas, reservations, VMSS instance mix, and compute fleet</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-capacity-planning-using-quotas-reservations-vmss-instance/ba-p/4464893</link>
      <description>&lt;H4&gt;&lt;SPAN data-contrast="auto"&gt;Introduction&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Over the past few months,&amp;nbsp;I’ve&amp;nbsp;been helping several digital-native customers navigate&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;capacity constraints&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;while scaling AI and compute-intensive workloads on Azure.&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;Many teams run into the same frustrating message:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;“SkuNotAvailable&amp;nbsp;– The requested size is currently not available in the location.”&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This post summarizes the strategy&amp;nbsp;I’ve&amp;nbsp;been using to help customers design around these challenges by combining&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Quota Groups&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Reservations (ODCR)&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, VMSS Instance Mix, and &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Compute Fleet&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;. &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;These tools&amp;nbsp;don’t&amp;nbsp;create capacity where none exists, but together, and paired with&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;proactive alerts&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, they form a practical playbook for scaling reliably through regional constraints.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;SPAN data-contrast="auto"&gt;Quota vs. Capacity:&amp;nbsp;What’s&amp;nbsp;the&amp;nbsp;difference?&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-hidden" border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Concept&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;What It Is&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Who Controls It&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Can You Fix It Yourself?&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Quota&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;A&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;logical limit&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;on how many vCPUs or specific VM series you can deploy.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Microsoft (adjustable on request).&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;✅&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;Yes,&amp;nbsp;request an increase.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Capacity&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;physical availability&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;of hardware in the datacenter.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure datacenter (supply and&amp;nbsp;utilization).&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;❌&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;No,&amp;nbsp;if no servers exist, no deployment will succeed.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Example:&lt;/SPAN&gt;&lt;/U&gt;&lt;SPAN data-contrast="auto"&gt; You have 300 vCPUs of quota for the D-series in East US 2. You try to deploy 30 D8as_v5 VMs (240 vCPUs, which fits within that quota) and get a failure. You open a support request and find:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="18" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Your quota is fine&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="18" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;But the region has no physical capacity for D8as_v5&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Even if Microsoft raised your quota to 1,000 vCPUs, the deployment would still fail because&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;quota ≠ capacity&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Quota issue:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;You’ll see errors like&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;OperationNotAllowed&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;or&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;QuotaExceeded&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;Capacity issue:&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;&amp;nbsp;The message will be&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;SkuNotAvailable&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;&amp;nbsp;or&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;AllocationFailed&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;.&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;If you see a quota error, open the&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Usage + quotas&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;blade and request an increase.&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN data-contrast="auto"&gt;If it’s a capacity error, switching zones, SKUs, or regions, or using VMSS Instance Mix &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;or Compute Fleet, &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;is your best next step.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559685&amp;quot;:720,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
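&lt;P&gt;The triage above can be sketched as a small shell helper. This is an illustration only: the function names and the sample numbers are made up for this example, and the error-code mapping just restates the two bullets above.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Sketch: separate "do I have enough quota?" from "is this a capacity error?"
# Function names and sample numbers are hypothetical.

# vCPUs needed = VM count x vCPUs per VM (a D8as_v5 has 8 vCPUs).
required_vcpus() {
  echo $(( $1 * $2 ))
}

# Succeeds (exit 0) when the request fits inside the family quota.
fits_quota() {
  local needed=$1 used=$2 limit=$3
  [ $(( used + needed )) -le "$limit" ]
}

# Map a deployment error code to the kind of problem it signals.
classify_error() {
  case "$1" in
    OperationNotAllowed|QuotaExceeded) echo "quota" ;;
    SkuNotAvailable|AllocationFailed)  echo "capacity" ;;
    *)                                 echo "unknown" ;;
  esac
}

needed=$(required_vcpus 30 8)
if fits_quota "$needed" 0 300; then
  echo "quota ok for $needed vCPUs"
else
  echo "request a quota increase"
fi
classify_error SkuNotAvailable&lt;/LI-CODE&gt;
&lt;P&gt;Running the quota check before a large deployment tells you which path you are on: if the arithmetic fits but the deployment still fails with a capacity code, a quota increase will not help.&lt;/P&gt;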
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;“Quota is a number on paper. Capacity is&amp;nbsp;what’s&amp;nbsp;physically sitting in the racks.”&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H4&gt;&lt;SPAN data-contrast="auto"&gt;Strategy 1: Quota management and Quota Groups&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure applies vCPU quotas by region and VM family (e.g., Dsv5, Esv5).&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Quota Groups&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;provide&amp;nbsp;a consolidated way to&amp;nbsp;monitor,&amp;nbsp;share, and manage these logical limits across a group of subscriptions.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Learn more:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="9" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/quotas/quota-groups" target="_blank" rel="noopener"&gt;Azure Quota Groups – Overview&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="9" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/quotas/how-to-guide-monitoring-alerting" target="_blank" rel="noopener"&gt;Set up monitoring and alerts for quotas&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Quota limits are easy to overlook until automation or scale pipelines fail.&amp;nbsp;AI-heavy startups often discover too late that&amp;nbsp;they’ve&amp;nbsp;maxed out their quota family.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Best practices:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI aria-setsize="-1" data-leveltext="%1." data-font="" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:0,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[65533,0],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;%1.&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;Monitor with Quota Group alerts&lt;/STRONG&gt;&lt;U&gt;:&lt;/U&gt; &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Use&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Quota Alerts (preview)&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;to automatically&amp;nbsp;notify you&amp;nbsp;when usage reaches thresholds (e.g., 80%).&amp;nbsp;Alerts integrate with Azure Monitor and Action Groups.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="%1." data-font="" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:0,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[65533,0],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;%1.&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;Request increases proactively&lt;/STRONG&gt;: &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Portal path:&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Subscriptions → Usage + quotas → Request increase&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;.&amp;nbsp;Most CPU SKUs are approved quickly; GPUs can take longer.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="%1." data-font="" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:0,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769242&amp;quot;:[65533,0],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;%1.&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;Plan by family, not by SKU&lt;/STRONG&gt;: &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;If you only check “D8as_v5 usage,” you may miss that the entire D-series family is at its quota limit.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
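&lt;P&gt;To make the "plan by family" and 80% threshold ideas concrete, here is a minimal sketch that flags a VM family once its vCPU usage crosses a threshold. It is not the Quota Alerts feature itself. The function names, family labels, and numbers are assumptions for illustration; in practice the used and limit values could be read from the output of az vm list-usage for your region.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Sketch: flag a VM family whose vCPU usage crosses an alert threshold.
# Family labels and numbers are hypothetical sample data.

# Integer percentage of quota in use (0 when the limit is 0).
usage_percent() {
  local used=$1 limit=$2
  if [ "$limit" -eq 0 ]; then
    echo 0
  else
    echo $(( used * 100 / limit ))
  fi
}

# Print ALERT or OK for one family; the threshold defaults to 80 percent.
check_family() {
  local family=$1 used=$2 limit=$3 threshold=${4:-80}
  local pct
  pct=$(usage_percent "$used" "$limit")
  if [ "$pct" -ge "$threshold" ]; then
    echo "ALERT $family at ${pct}% of quota"
  else
    echo "OK $family at ${pct}% of quota"
  fi
}

check_family standardDASv5Family 256 300
check_family standardESv5Family 40 100&lt;/LI-CODE&gt;
&lt;P&gt;Checking at the family level, rather than per SKU, is the point: every size in the family draws from the same vCPU limit.&lt;/P&gt;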
&lt;H4&gt;&lt;SPAN data-contrast="auto"&gt;Strategy 2: Capacity Reservations (ODCR)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;A&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Reservation&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;(formally&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;On-Demand Capacity Reservation&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, ODCR) lets you&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;pre-book physical infrastructure&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;in a specific region, zone, and VM size.&amp;nbsp;You’re&amp;nbsp;reserving capacity, not&amp;nbsp;committing to&amp;nbsp;a term or discount.&amp;nbsp;Azure holds those servers for your subscription, ensuring your workloads can always start.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Learn more:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="20" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-machines/capacity-reservation-overview" target="_blank" rel="noopener"&gt;Capacity Reservations in Azure Virtual Machines&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="20" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cost-management-billing/reservations/save-compute-costs-reservations" target="_blank" rel="noopener"&gt;Save on compute with Azure Reservations&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
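&lt;P&gt;As a rough sketch of what creating a reservation could look like from the CLI, the helper below prints (rather than runs) the two Azure CLI commands involved: one for the capacity reservation group and one for the reservation itself. The resource names, region, zone, and instance count are placeholders, and the flags are worth confirming with az capacity reservation create --help before you run anything.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Sketch: print the Azure CLI commands for a capacity reservation (dry run).
# All names and values passed in below are hypothetical placeholders.

reservation_commands() {
  local rg=$1 group=$2 name=$3 location=$4 zone=$5 sku=$6 count=$7
  echo "az capacity reservation group create -g $rg -n $group -l $location --zones $zone"
  echo "az capacity reservation create -g $rg -c $group -n $name --sku $sku --capacity $count --zone $zone"
}

# Review the printed commands, then paste them into a logged-in shell.
reservation_commands rg-prod crg-eastus2 cr-d8as eastus2 1 Standard_D8as_v5 10&lt;/LI-CODE&gt;
&lt;P&gt;Printing first is a deliberate choice here: a reservation is billed at the pay-as-you-go rate whether or not VMs are running against it, so it is worth reviewing the SKU, zone, and count before committing.&lt;/P&gt;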
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Reservation vs. Reserved Instance (RI)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-hidden" border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Aspect&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Reservation (ODCR)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Reserved Instance (RI)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Purpose&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Guarantees&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;capacity&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;(hardware availability).&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Locks in&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;price&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;(discounted rate).&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Scope&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Specific region, zone, and VM size.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Region and VM family.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Billing&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Pay-as-you-go,&amp;nbsp;no term commitment.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;1 or 3-year fixed term.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Guarantee&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;✅&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;Yes,&amp;nbsp;hardware is held for you.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;❌&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;No,&amp;nbsp;no guarantee.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Price Benefit&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;❌&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;None,&amp;nbsp;PAYG rate.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;✅&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;Up to ~70% discount.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Flexibility&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Modify or cancel anytime.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Bound to term.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;In short:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="15" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;ODCR = “Hold my spot in the datacenter.”&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="15" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;RI = “Give me a discount because I’ll keep using it.”&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;You can use both: &lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;ODCR&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;&amp;nbsp;for capacity,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;RI&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;&amp;nbsp;for savings.&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Example:&lt;/SPAN&gt;&lt;/U&gt;&lt;SPAN data-contrast="auto"&gt; A startup consistently runs 20× D16as_v5 VMs nightly for training. They reserve that capacity (ODCR) and apply RIs for discounts, ensuring predictable performance&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;and&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;cost.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
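&lt;P&gt;A minimal Python sketch of the arithmetic behind this example. The hourly rate and the discount are hypothetical placeholders, not Azure prices. The sketch assumes the reserved capacity is billed at the pay-as-you-go rate for every hour it is held, and that the RI discount applies to those same hours.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;HOURS_PER_MONTH = 730  # common billing approximation


def monthly_cost(vm_count, payg_hourly_rate, ri_discount=0.0):
    """Cost of capacity held all month, with an optional RI discount (0.0 to 1.0)."""
    discount = min(max(ri_discount, 0.0), 1.0)  # clamp to a valid fraction
    return vm_count * payg_hourly_rate * HOURS_PER_MONTH * (1.0 - discount)


# Hypothetical inputs: 20 VMs, a placeholder rate of 1.00 per hour, a 40% RI discount.
odcr_only = monthly_cost(20, 1.00)
odcr_with_ri = monthly_cost(20, 1.00, ri_discount=0.40)
print(f"ODCR only: {odcr_only:.2f}")
print(f"ODCR + RI: {odcr_with_ri:.2f}")
&lt;/LI-CODE&gt;
&lt;P&gt;Swap in the real rate for your VM size and region, and the discount from your own reservation quote, to see what the combination costs for your baseline.&lt;/P&gt;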
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Limitations:&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="7" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;You can’t reserve SKUs already out of stock.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="7" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;ODCR doesn’t autoscale; it only holds your baseline.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="7" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Best for &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;core workloads&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, not ephemeral jobs.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;SPAN data-contrast="auto"&gt;Strategy 3: VMSS Instance Mix&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;Virtual Machine Scale Set (VMSS) Instance Mix is a feature of VMSS Flex that enables capacity-aware scaling across multiple VM sizes, and even across different purchase options (Standard and Spot).&amp;nbsp;When you define more than one acceptable VM size, Azure automatically chooses whichever has capacity available during scale-out.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Learn more:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="4" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/instance-mix-overview" target="_blank" rel="noopener"&gt;VMSS Instance Mix – Overview&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Example:&lt;/SPAN&gt;&lt;/U&gt;&lt;SPAN data-contrast="auto"&gt; Here’s a simplified configuration snippet from an ARM template using Instance Mix:&lt;/SPAN&gt;&lt;/P&gt;
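&lt;LI-CODE lang="json"&gt;"sku": {
  "name": "Mix",
  "capacity": 3
},
"properties": {
  "orchestrationMode": "Flexible",
  "skuProfile": {
    "vmSizes": [
      { "name": "Standard_D8as_v5" },
      { "name": "Standard_E8as_v5" },
      { "name": "Standard_F8s_v2" }
    ],
    "allocationStrategy": "CapacityOptimized"
  }
}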
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;BR /&gt;VMSS Instance Mix helps you survive temporary SKU shortages by dynamically selecting the next available size, while Spot Priority Mix lets you blend Spot and Standard instances to reduce cost and improve resilience. This makes it ideal for large-scale app tiers, batch processing, and AI inference.&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Limitations:&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="5" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Works across zones, not regions.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="5" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Doesn’t mix Spot + Standard in the same pool.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="5" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;Doesn’t reserve hardware capacity; it only improves allocation success rates.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;&lt;SPAN data-contrast="auto"&gt;Strategy&amp;nbsp;4:&amp;nbsp;Azure Compute Fleet&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure Compute Fleet&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;can deploy up to&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;10,000 VMs&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;across multiple SKUs, zones, and (in preview) regions.&amp;nbsp;You define acceptable SKUs, and Azure picks the ones that have capacity.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Learn more:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="12" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-compute-fleet/overview" target="_blank" rel="noopener"&gt;Azure Compute Fleet – Overview&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Fleet automatically:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="11" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Tries alternate SKUs (D8as_v5 → E8as_v5).&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="11" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Expands to other zones or regions.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="11" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Combines &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Standard&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Spot&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;instances.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;In short, it automates the “try this, then that” logic,&amp;nbsp;improving your odds of successful deployment.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Example:&lt;/SPAN&gt;&lt;/U&gt;&lt;SPAN data-contrast="auto"&gt; A rendering studio needs 2,000 VMs nightly.&amp;nbsp;Fleet dynamically uses D8s_v5, D16s_v5, or E8s_v5 across East US 2 and West US 2, depending on live availability.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
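&lt;P&gt;The “try this, then that” logic can be pictured as an ordered search over acceptable SKU and region combinations. The Python sketch below is purely illustrative: &lt;CODE&gt;has_capacity&lt;/CODE&gt; and the availability snapshot are hypothetical stand-ins, not an Azure API, and Fleet’s actual allocation logic is internal to the service.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from itertools import product


def pick_placement(skus, regions, has_capacity):
    """Return the first (sku, region) pair with capacity, or None if all are full."""
    for sku, region in product(skus, regions):
        if has_capacity(sku, region):
            return sku, region
    return None  # every candidate is full: the deployment still fails


# Hypothetical availability snapshot used only for this illustration.
available = {("Standard_E8s_v5", "westus2")}

choice = pick_placement(
    ["Standard_D8s_v5", "Standard_D16s_v5", "Standard_E8s_v5"],
    ["eastus2", "westus2"],
    lambda sku, region: (sku, region) in available,
)
print(choice)
&lt;/LI-CODE&gt;
&lt;P&gt;The wider the list of acceptable SKUs and regions, the more candidates the search can try before giving up.&lt;/P&gt;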
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Limitations:&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Fleet&amp;nbsp;doesn’t&amp;nbsp;create capacity; it just searches smarter.&amp;nbsp;If every zone and region is full, it still fails.&amp;nbsp;It is ideal for&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;AI training, batch jobs,&amp;nbsp;rendering, or HPC,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;not for stateful services.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;When to&amp;nbsp;use&amp;nbsp;what&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-hidden" border="1" style="width: 67.4074%; height: 234px; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Scenario&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Best&amp;nbsp;tool&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;What&amp;nbsp;it solves&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Logical limits before deployment&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Quota Groups + Alerts&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Prevent hitting soft limits.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Guaranteed baseline&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Reservation (ODCR)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Reserve real hardware.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Managed autoscaling&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;VMSS Instance Mix&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Scale out despite partial shortages.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Large-scale/bursty workloads&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure Compute Fleet&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Try alternate SKUs and regions.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;GPU/high-demand SKUs&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;ODCR + Fleet&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Reserve base, burst flexibly.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
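&lt;P&gt;For the first row, a pre-deployment check is simple arithmetic: vCPUs requested versus vCPUs left in quota. Here is a minimal Python sketch. The quota limit and current usage are hypothetical numbers you would read from your own subscription; the 16 vCPUs per VM matches the D16as_v5 size used in the earlier example.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def vcpu_shortfall(vm_count, vcpus_per_vm, quota_limit, quota_used):
    """Return how many vCPUs are missing from quota (0 means the request fits)."""
    requested = vm_count * vcpus_per_vm
    headroom = quota_limit - quota_used
    return max(0, requested - headroom)


# Hypothetical quota figures: a limit of 350 vCPUs with 100 already in use.
missing = vcpu_shortfall(vm_count=20, vcpus_per_vm=16, quota_limit=350, quota_used=100)
if missing:
    print(f"Request a quota increase of at least {missing} vCPUs before deploying.")
else:
    print("Quota headroom is sufficient; physical capacity is still not guaranteed.")
&lt;/LI-CODE&gt;
&lt;P&gt;Passing this check only clears the logical limit. Whether the hardware is actually available is a separate question, which is where the other rows come in.&lt;/P&gt;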
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Real talk: there’s no magic when a datacenter is full&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Let’s&amp;nbsp;be transparent: if a region has no physical servers available, no tool can make capacity appear.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="16" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Quota Groups&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;remove logical blockers.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="16" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Reservations&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;secure what you need.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="16" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Compute Fleet&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;VMSS Instance Mix&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;increase the odds of success.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Together, they&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;maximize your odds of a successful deployment&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, but none can override a physically full region.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;The Azure&amp;nbsp;capacity&amp;nbsp;strategy&amp;nbsp;flow&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;H4&gt;&lt;SPAN data-contrast="auto"&gt;Final&amp;nbsp;thoughts&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
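&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;For fast-scaling digital-native companies, the right question isn’t&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;“How do I guarantee capacity?”&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;It’s&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;“How do I design for capacity uncertainty?”&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;Start by putting the basics on autopilot:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Configure&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Quota Group alerts&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;to prevent silent blockers.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;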
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Use&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Capacity Reservations (ODCR)&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;to secure your baseline compute.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Add elasticity through &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;VMSS Instance Mix&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;and, when flexibility allows,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Compute Fleet&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Monitor everything with &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Azure Monitor alerts&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;— from quotas and reservations to scale-out failures and Fleet allocation health.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;💡&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Pro tip:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;Combine&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Quota Group Alerts&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Reservation coverage monitoring&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;VMSS/Fleet deployment telemetry&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;in&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Azure Monitor&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;to detect issues early.&lt;/SPAN&gt; &amp;nbsp;&lt;SPAN data-contrast="auto"&gt;The faster you know&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;what kind of failure&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;you’re hitting, the faster you can act.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
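&lt;P&gt;One way to act on this tip is to sort deployment error codes into logical (quota) and physical (capacity) failures. The Python sketch below covers only a handful of Azure Resource Manager error codes. The grouping and the suggested next steps are illustrative assumptions, not an official taxonomy.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Illustrative grouping of a few ARM deployment error codes (not exhaustive).
QUOTA_ERRORS = {"QuotaExceeded"}
CAPACITY_ERRORS = {
    "AllocationFailed",
    "ZonalAllocationFailed",
    "OverconstrainedAllocationRequest",
    "SkuNotAvailable",
}


def classify_failure(error_code):
    """Map an error code to the kind of blocker and a suggested next step."""
    if error_code in QUOTA_ERRORS:
        return "logical", "Request a quota increase or rebalance quota across subscriptions."
    if error_code in CAPACITY_ERRORS:
        return "physical", "Try another size, zone, or region, or reserve capacity ahead of time."
    return "unknown", "Inspect the full error message before retrying."


for code in ("QuotaExceeded", "ZonalAllocationFailed", "Conflict"):
    kind, next_step = classify_failure(code)
    print(f"{code}: {kind} blocker. {next_step}")
&lt;/LI-CODE&gt;
&lt;P&gt;Routing alerts by blocker type means a quota problem goes to whoever can file an increase, while a capacity problem triggers the fallback plan.&lt;/P&gt;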
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Accept that&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;capacity is finite&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, but also that&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;visibility is your greatest advantage&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;. Azure gives you multiple levers; success comes from knowing when and how to use each one together.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Over the past few months, I’ve supported multiple customers, from AI platforms to SaaS startups, who faced real capacity challenges in regions like East US 2 and West US 2. This post came directly from those experiences, with one goal: to help others move from reactive firefighting to&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;proactive, layered capacity planning&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;.&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;If your workloads are scaling fast, I hope this guide helps you build not just a plan, but a mindset, for running reliably when the cloud gets crowded.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Further&amp;nbsp;reading&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;UL&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="21" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/quotas/quota-groups" target="_blank" rel="noopener"&gt;Azure Quota Groups – Overview&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="21" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/quotas/how-to-guide-monitoring-alerting" target="_blank" rel="noopener"&gt;Monitoring and Alerting for Quotas&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="21" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-quota-alerts-preview-still-overlooked-but-incredibly-useful/4447140" target="_blank" rel="noopener" data-lia-auto-title="Azure Quota Alerts" data-lia-auto-title-active="0"&gt;Azure Quota Alerts&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="21" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-machines/capacity-reservation-overview" target="_blank" rel="noopener"&gt;Capacity Reservations Overview&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="21" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cost-management-billing/reservations/save-compute-costs-reservations" target="_blank" rel="noopener"&gt;Save on Compute with Azure Reservations&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="21" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/instance-mix-overview" target="_blank" rel="noopener"&gt;VMSS Instance Mix Overview&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI aria-setsize="-1" data-leveltext="" data-font="Symbol" data-listid="21" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" data-aria-posinset="1" data-aria-level="1"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-compute-fleet/overview" target="_blank" rel="noopener"&gt;Azure Compute Fleet Overview&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 06 Nov 2025 16:47:25 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-capacity-planning-using-quotas-reservations-vmss-instance/ba-p/4464893</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-11-06T16:47:25Z</dc:date>
    </item>
    <item>
      <title>Azure Monitor 101: The missing guide to understanding monitoring on Azure</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-monitor-101-the-missing-guide-to-understanding-monitoring/ba-p/4462799</link>
      <description>&lt;H4 data-start="1044" data-end="1061"&gt;Introduction&lt;/H4&gt;
&lt;P data-start="1063" data-end="1363"&gt;Monitoring in the cloud is often misunderstood. Some think it’s about checking whether a virtual machine is up; others equate it with dashboards or alerts. In reality, &lt;STRONG data-start="1233" data-end="1292"&gt;monitoring is about visibility, correlation, and action&lt;/STRONG&gt;, and in Azure, that all converges in one platform: &lt;STRONG data-start="1343" data-end="1360"&gt;Azure Monitor&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-start="1365" data-end="1534"&gt;This article explains, in practical terms, how Azure Monitor works, the role of &lt;STRONG data-start="1445" data-end="1462"&gt;Log Analytics&lt;/STRONG&gt;, and how to build a foundation for observability across your workloads.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="1365" data-end="1534"&gt;If you’ve read our earlier posts, on &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/the-importance-of-setting-up-service-and-resource-health-monitoring-in-azure/4372478" target="_blank" rel="noopener" data-start="1577" data-end="1768" data-lia-auto-title="Service and Resource Health Monitoring" data-lia-auto-title-active="0"&gt;Service and Resource Health Monitoring&lt;/A&gt;, &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/advanced-alerting-strategies-for-azure-monitoring/4268698" target="_blank" rel="noopener" data-start="1770" data-end="1924" data-lia-auto-title="Advanced Alerting Strategies" data-lia-auto-title-active="0"&gt;Advanced Alerting Strategies&lt;/A&gt;, &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-workbooks-advanced-customization-and-data-visualization-in-azure/4369588" target="_blank" rel="noopener" data-start="1926" data-end="2102" data-lia-auto-title="Azure Workbooks Customization" data-lia-auto-title-active="0"&gt;Azure Workbooks Customization&lt;/A&gt;, or &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-monitor--melt-a-comprehensive-approach-to-cloud-observability/4251166" target="_blank" rel="noopener" data-start="2107" data-end="2271" data-lia-auto-title="Azure Monitor &amp;amp; MELT" data-lia-auto-title-active="0"&gt;Azure Monitor &amp;amp; MELT, &lt;/A&gt;this post ties them all together.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H4 data-start="2314" data-end="2341"&gt;What Is Azure Monitor?&lt;/H4&gt;
&lt;P data-start="2343" data-end="2558"&gt;&lt;STRONG data-start="2343" data-end="2360"&gt;Azure Monitor&lt;/STRONG&gt; is Microsoft’s unified platform for collecting, analyzing, and acting on telemetry across applications, infrastructure, and networks, whether they run on Azure, hybrid, or multicloud environments.&lt;/P&gt;
&lt;P data-start="2560" data-end="2597"&gt;It helps you answer four questions:&lt;/P&gt;
&lt;OL data-start="2599" data-end="2723"&gt;
&lt;LI data-start="2599" data-end="2632"&gt;&lt;EM data-start="2602" data-end="2630"&gt;Is my environment healthy?&lt;/EM&gt;&lt;/LI&gt;
&lt;LI data-start="2633" data-end="2667"&gt;&lt;EM data-start="2636" data-end="2665"&gt;What’s happening right now?&lt;/EM&gt;&lt;/LI&gt;
&lt;LI data-start="2668" data-end="2693"&gt;&lt;EM data-start="2671" data-end="2691"&gt;Why did it happen?&lt;/EM&gt;&lt;/LI&gt;
&lt;LI data-start="2694" data-end="2723"&gt;&lt;EM data-start="2697" data-end="2721"&gt;What should I do next?&lt;/EM&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4&gt;The Building Blocks&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-hidden" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Examples&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1. Data Sources&lt;/td&gt;&lt;td&gt;Where telemetry originates: VMs, AKS, databases, applications, networks.&lt;/td&gt;&lt;td&gt;Activity Logs, Performance Counters, Container Metrics, App Insights telemetry.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2. Data Platform (Log Analytics)&lt;/td&gt;&lt;td&gt;Central workspace where logs are stored and queried using &lt;STRONG&gt;KQL&lt;/STRONG&gt;.&lt;/td&gt;&lt;td&gt;Diagnostic Settings → Workspace → Query → Alert/Workbook.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3.&amp;nbsp;&amp;nbsp;Insights &amp;amp; Visualizations&lt;/td&gt;&lt;td&gt;Built-in experiences that interpret raw data.&lt;/td&gt;&lt;td&gt;Azure Monitor for VMs, Containers, Apps, Network.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4.&amp;nbsp;&lt;STRONG&gt; &lt;/STRONG&gt;Action &amp;amp; Automation&lt;/td&gt;&lt;td&gt;Responding through alerts, workflows, or ITSM integrations.&lt;/td&gt;&lt;td&gt;Alerts + Action Groups → Teams, Logic Apps, PagerDuty.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H6&gt;&lt;STRONG&gt;Azure Monitor core layers&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/H6&gt;
&lt;img /&gt;
&lt;H4&gt;Metrics vs. Logs&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-hidden" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Aspect&lt;/th&gt;&lt;th&gt;Metrics&lt;/th&gt;&lt;th&gt;Logs&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Format&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Numeric values sampled over time&lt;/td&gt;&lt;td&gt;Text-based records with context&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Best for&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Performance monitoring and thresholds&lt;/td&gt;&lt;td&gt;Troubleshooting and auditing&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Examples&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;CPU %, latency, requests/sec&lt;/td&gt;&lt;td&gt;Error messages, policy changes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Store&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Azure Monitor metrics DB&lt;/td&gt;&lt;td&gt;Log Analytics workspace&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Metrics are fast and lightweight; logs are richer and more flexible. Both live under Azure Monitor.&lt;/P&gt;
&lt;H4&gt;The role of Log Analytics Workspace&lt;/H4&gt;
&lt;P data-start="4124" data-end="4197"&gt;If Azure Monitor is the nervous system, &lt;STRONG data-start="4164" data-end="4194"&gt;Log Analytics is the brain&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-start="4199" data-end="4382"&gt;Resources send diagnostic and activity data via &lt;STRONG data-start="4247" data-end="4270"&gt;Diagnostic Settings&lt;/STRONG&gt;, agents, or connectors. Once in the workspace, you can query everything using &lt;STRONG data-start="4351" data-end="4381"&gt;Kusto Query Language (KQL)&lt;/STRONG&gt;.&lt;/P&gt;
&lt;LI-CODE lang="kusto"&gt;AzureActivity
| where OperationNameValue contains "Delete"
| summarize Count = count() by Caller, bin(TimeGenerated, 1d)
&lt;/LI-CODE&gt;
&lt;P data-start="4517" data-end="4532"&gt;You can then:&lt;/P&gt;
&lt;UL data-start="4533" data-end="4699"&gt;
&lt;LI data-start="4533" data-end="4582"&gt;Create &lt;STRONG data-start="4542" data-end="4552"&gt;alerts&lt;/STRONG&gt; that fire on query results.&lt;/LI&gt;
&lt;LI data-start="4583" data-end="4639"&gt;Build &lt;STRONG data-start="4591" data-end="4604"&gt;workbooks&lt;/STRONG&gt; for dashboards and storytelling.&lt;/LI&gt;
&lt;LI data-start="4640" data-end="4699"&gt;Export data to &lt;STRONG data-start="4657" data-end="4670"&gt;Event Hub&lt;/STRONG&gt;, &lt;STRONG data-start="4672" data-end="4683"&gt;Storage&lt;/STRONG&gt;, or &lt;STRONG data-start="4688" data-end="4696"&gt;SIEM&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H6 data-start="4701" data-end="4842"&gt;&lt;STRONG data-start="4701" data-end="4755"&gt;Log Analytics as the central data plane&lt;/STRONG&gt;&lt;/H6&gt;
&lt;img /&gt;
&lt;H6&gt;&lt;STRONG&gt;Data flow overview&lt;/STRONG&gt;&lt;/H6&gt;
&lt;img /&gt;
&lt;H4 data-start="5210" data-end="5229"&gt;The MELT Model&lt;/H4&gt;
&lt;P data-start="5231" data-end="5530"&gt;To understand observability holistically, adopt the &lt;STRONG data-start="5283" data-end="5291"&gt;MELT&lt;/STRONG&gt; framework:&amp;nbsp;&lt;STRONG data-start="5302" data-end="5339"&gt;Metrics, Events, Logs, and Traces,&amp;nbsp;&lt;/STRONG&gt;explained in detail in &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-monitor--melt-a-comprehensive-approach-to-cloud-observability/4251166" target="_blank" rel="noopener" data-start="5363" data-end="5527" data-lia-auto-title="Azure Monitor &amp;amp; MELT" data-lia-auto-title-active="0"&gt;Azure Monitor &amp;amp; MELT&lt;/A&gt;.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-hidden" border="1" style="width: 33.6111%; height: 199px; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr style="height: 35px;"&gt;&lt;th style="height: 35px;"&gt;Pillar&lt;/th&gt;&lt;th style="height: 35px;"&gt;Purpose&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr style="height: 35px;"&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;Metrics&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 35px;"&gt;How your system performs&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 35px;"&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;Events&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 35px;"&gt;What changed&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 35px;"&gt;&lt;td style="height: 35px;"&gt;&lt;STRONG&gt;Logs&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 35px;"&gt;Why it happened&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 59px;"&gt;&lt;td style="height: 59px;"&gt;&lt;STRONG&gt;Traces&lt;/STRONG&gt;&lt;/td&gt;&lt;td style="height: 59px;"&gt;How requests flow through components&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H4 data-start="5740" data-end="5788"&gt;From data to action: alerts and automation&lt;/H4&gt;
&lt;P data-start="5790" data-end="5815"&gt;Azure Monitor includes:&lt;/P&gt;
&lt;UL data-start="5816" data-end="5956"&gt;
&lt;LI data-start="5816" data-end="5865"&gt;&lt;STRONG data-start="5818" data-end="5835"&gt;Metric alerts&lt;/STRONG&gt; (near real-time thresholds)&lt;/LI&gt;
&lt;LI data-start="5866" data-end="5910"&gt;&lt;STRONG data-start="5868" data-end="5882"&gt;Log alerts&lt;/STRONG&gt; (KQL queries on schedule)&lt;/LI&gt;
&lt;LI data-start="5911" data-end="5956"&gt;&lt;STRONG data-start="5913" data-end="5936"&gt;Activity Log alerts&lt;/STRONG&gt; (platform events)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="5958" data-end="6037"&gt;Use &lt;STRONG data-start="5962" data-end="5979"&gt;Action Groups&lt;/STRONG&gt; to define responses: email, Teams, Logic App, or ticket.&lt;/P&gt;
&lt;P data-start="6039" data-end="6284"&gt;For advanced patterns like dynamic thresholds and suppression, see &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/advanced-alerting-strategies-for-azure-monitoring/4268698" target="_blank" rel="noopener" data-start="6106" data-end="6281" data-lia-auto-title="Advanced Alerting Strategies for Azure Monitoring" data-lia-auto-title-active="0"&gt;Advanced Alerting Strategies for Azure Monitoring&lt;/A&gt;.&lt;/P&gt;
&lt;H6 data-start="6286" data-end="6333"&gt;&lt;STRONG data-start="6286" data-end="6333"&gt;Alerting and automation workflow&lt;/STRONG&gt;&lt;/H6&gt;
&lt;img /&gt;
&lt;H4 data-start="6340" data-end="6372"&gt;Visualization and Workbooks&lt;/H4&gt;
&lt;P data-start="6374" data-end="6488"&gt;Workbooks transform data into decisions. Combine KQL queries, parameters, and visuals: all within the Azure portal.&lt;/P&gt;
&lt;LI-CODE lang="kusto"&gt;Perf
| where ObjectName == "Processor"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), Computer&lt;/LI-CODE&gt;
&lt;P&gt;To go beyond basics: multi-resource joins, conditional formatting, custom JSON parameters, see &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-workbooks-advanced-customization-and-data-visualization-in-azure/4369588" target="_blank" rel="noopener" data-start="6709" data-end="6927" data-lia-auto-title="Azure Workbooks: Advanced Customization and Data Visualization in Azure" data-lia-auto-title-active="0"&gt;Azure Workbooks: Advanced Customization and Data Visualization in Azure&lt;/A&gt;.&lt;/P&gt;
&lt;H6&gt;&lt;STRONG&gt;Example workbook visualization&lt;/STRONG&gt;&lt;/H6&gt;
&lt;img /&gt;
&lt;H4 data-start="6984" data-end="7027"&gt;Health Monitoring and Platform Signals&lt;/H4&gt;
&lt;P data-start="7029" data-end="7248"&gt;Azure provides &lt;STRONG data-start="7044" data-end="7062"&gt;Service Health&lt;/STRONG&gt; and &lt;STRONG data-start="7067" data-end="7086"&gt;Resource Health&lt;/STRONG&gt; to help differentiate between Azure-side issues and workload issues. They complement Azure Monitor by tracking platform events and maintenance notifications.&lt;/P&gt;
&lt;P data-start="7250" data-end="7521"&gt;Configuration guidance is available in &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/the-importance-of-setting-up-service-and-resource-health-monitoring-in-azure/4372478" target="_blank" rel="noopener" data-start="7289" data-end="7518" data-lia-auto-title="The Importance of Setting Up Service and Resource Health Monitoring in Azure" data-lia-auto-title-active="0"&gt;The Importance of Setting Up Service and Resource Health Monitoring in Azure&lt;/A&gt;.&lt;/P&gt;
&lt;H6 data-start="7523" data-end="7584"&gt;&lt;STRONG data-start="7523" data-end="7584"&gt;Service Health and Resource Health integration&lt;/STRONG&gt;&lt;/H6&gt;
&lt;img /&gt;
&lt;H4 data-start="7591" data-end="7625"&gt;Best practices for workspaces&lt;/H4&gt;
&lt;OL data-start="7627" data-end="7978"&gt;
&lt;LI data-start="7627" data-end="7714"&gt;&lt;STRONG data-start="7630" data-end="7658"&gt;Centralize intelligently: &lt;/STRONG&gt;aggregate where cross-resource correlation matters.&lt;/LI&gt;
&lt;LI data-start="7715" data-end="7782"&gt;&lt;STRONG data-start="7718" data-end="7735"&gt;Control costs: &lt;/STRONG&gt;use Data Collection Rules to filter noise.&lt;/LI&gt;
&lt;LI data-start="7783" data-end="7839"&gt;&lt;STRONG data-start="7786" data-end="7806"&gt;Manage retention: &lt;/STRONG&gt;align with compliance needs.&lt;/LI&gt;
&lt;LI data-start="7840" data-end="7904"&gt;&lt;STRONG data-start="7843" data-end="7860"&gt;Secure access:&amp;nbsp;&lt;/STRONG&gt;apply RBAC and table-level permissions.&lt;/LI&gt;
&lt;LI data-start="7905" data-end="7978"&gt;&lt;STRONG data-start="7908" data-end="7931"&gt;Automate deployment: &lt;/STRONG&gt;define diagnostics via Bicep or Terraform.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H5 data-start="7985" data-end="8011"&gt;Quick start checklist&lt;/H5&gt;
&lt;OL data-start="8013" data-end="8268"&gt;
&lt;LI data-start="8013" data-end="8055"&gt;Create a &lt;STRONG data-start="8025" data-end="8052"&gt;Log Analytics workspace&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI data-start="8056" data-end="8110"&gt;Enable &lt;STRONG data-start="8066" data-end="8089"&gt;Diagnostic Settings&lt;/STRONG&gt; for key resources.&lt;/LI&gt;
&lt;LI data-start="8111" data-end="8157"&gt;Run a basic &lt;STRONG data-start="8126" data-end="8139"&gt;KQL query&lt;/STRONG&gt; to verify data.&lt;/LI&gt;
&lt;LI data-start="8158" data-end="8213"&gt;Configure a &lt;STRONG data-start="8173" data-end="8189"&gt;metric alert&lt;/STRONG&gt; and &lt;STRONG data-start="8194" data-end="8210"&gt;action group&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI data-start="8214" data-end="8268"&gt;Build a simple &lt;STRONG data-start="8232" data-end="8244"&gt;workbook&lt;/STRONG&gt; to visualize results.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-start="8270" data-end="8346"&gt;You now have a full feedback loop: &lt;EM data-start="8305" data-end="8346"&gt;data → query → alert → visualize → act.&lt;/EM&gt;&lt;/P&gt;
&lt;H4 data-start="8353" data-end="8386"&gt;Next steps &amp;amp; further reading&lt;/H4&gt;
&lt;UL data-start="8388" data-end="9131"&gt;
&lt;LI data-start="8388" data-end="8592"&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/the-importance-of-setting-up-service-and-resource-health-monitoring-in-azure/4372478" target="_blank" rel="noopener" data-start="8390" data-end="8590" data-lia-auto-title="Service and Resource Health Monitoring in Azure" data-lia-auto-title-active="0"&gt;Service and Resource Health Monitoring in Azure&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="8593" data-end="8772"&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/advanced-alerting-strategies-for-azure-monitoring/4268698" target="_blank" rel="noopener" data-start="8595" data-end="8770" data-lia-auto-title="Advanced Alerting Strategies for Azure Monitoring" data-lia-auto-title-active="0"&gt;Advanced Alerting Strategies for Azure Monitoring&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="8773" data-end="8962"&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-workbooks-advanced-customization-and-data-visualization-in-azure/4369588" target="_blank" rel="noopener" data-start="8775" data-end="8960" data-lia-auto-title="Azure Workbooks Advanced Customization" data-lia-auto-title-active="0"&gt;Azure Workbooks Advanced Customization&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="8963" data-end="9131"&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-monitor--melt-a-comprehensive-approach-to-cloud-observability/4251166" target="_blank" rel="noopener" data-start="8965" data-end="9129" data-lia-auto-title="Azure Monitor &amp;amp; MELT" data-lia-auto-title-active="0"&gt;Azure Monitor &amp;amp; MELT&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="9133" data-end="9225"&gt;Together these form a complete learning path, from monitoring basics to full observability.&lt;/P&gt;
&lt;H4 data-start="9232" data-end="9247"&gt;Conclusion&lt;/H4&gt;
&lt;P data-start="9249" data-end="9469"&gt;Azure Monitor is more than a tool, it’s the &lt;STRONG data-start="9292" data-end="9318"&gt;observability backbone&lt;/STRONG&gt; of Azure. Once you understand its layers, the rest of the ecosystem, health alerts, workbooks, advanced rules, and MELT falls naturally into place.&lt;/P&gt;
&lt;P data-start="9471" data-end="9611"&gt;Start simple. Connect a resource, explore your workspace, and let data guide your next question. That’s when monitoring becomes insight.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Oct 2025 14:24:09 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-monitor-101-the-missing-guide-to-understanding-monitoring/ba-p/4462799</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-10-20T14:24:09Z</dc:date>
    </item>
    <item>
      <title>Monitoring Azure OpenAI without switching from your existing observability platform</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/monitoring-azure-openai-without-switching-from-your-existing/ba-p/4458898</link>
      <description>&lt;P&gt;Recently, one of my customers asked me a simple but powerful question:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;“We already use Datadog for observability, but the rate-limit metrics we see in the Azure Portal don’t match what we get in Datadog. Why does Azure show higher TPM numbers?”&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;That question led to a deeper conversation about how Azure measures rate limits for Azure OpenAI.&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="font-style: var(--lia-blog-font-style); font-family: var(--lia-blog-font-family); font-size: var(--lia-bs-font-size-base); -webkit-tap-highlight-color: hsla(var(--lia-bs-black-h),var(--lia-bs-black-s),var(--lia-bs-black-l),0); -webkit-text-size-adjust: 100%;"&gt;They weren’t trying to move away from Datadog; in fact, they already have a mature observability stack built on it. They simply wanted to understand and monitor Azure OpenAI usage directly in the portal.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="font-style: var(--lia-blog-font-style); font-family: var(--lia-blog-font-family); font-size: var(--lia-bs-font-size-base); -webkit-tap-highlight-color: hsla(var(--lia-bs-black-h),var(--lia-bs-black-s),var(--lia-bs-black-l),0); -webkit-text-size-adjust: 100%;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN style="font-style: var(--lia-blog-font-style); font-family: var(--lia-blog-font-family); font-size: var(--lia-bs-font-size-base); -webkit-tap-highlight-color: hsla(var(--lia-bs-black-h),var(--lia-bs-black-s),var(--lia-bs-black-l),0); -webkit-text-size-adjust: 100%;"&gt;After reviewing &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/quota" target="_blank" rel="noopener"&gt;the documentation&lt;/A&gt; and confirming with the Azure OpenAI engineering team, the answer made sense:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure’s Tokens-Per-Minute (TPM) metric is based on an estimated token count derived from the character length of the request, not the exact tokenized count used for billing.&lt;/LI&gt;
&lt;LI&gt;This estimate accounts for the worst-case request scenario (prompt + max_tokens + best_of), so Azure’s TPM can appear “inflated” compared to Datadog, which measures actual tokens consumed after completion.&lt;/LI&gt;
&lt;/UL&gt;
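&lt;P&gt;To make the gap concrete, here is a rough Python sketch of a pre-request estimate next to post-completion usage. The 4-characters-per-token ratio, the way max_tokens and best_of are combined, the function name, and the usage figures are all illustrative assumptions; Azure does not publish the exact formula its rate limiter uses.&lt;/P&gt;

```python
import math

# Assumption for illustration only: roughly 4 characters per token.
# The exact ratio used by Azure's rate limiter is not published.
CHARS_PER_TOKEN = 4


def estimated_rate_limit_tokens(prompt: str, max_tokens: int, best_of: int = 1) -> int:
    """Worst-case token estimate made before the request runs (hypothetical formula)."""
    prompt_estimate = math.ceil(len(prompt) / CHARS_PER_TOKEN)
    # Worst case: each of the best_of candidates uses the full max_tokens budget.
    return prompt_estimate + max_tokens * best_of


prompt = "Summarize the last five support tickets in two sentences."
estimate = estimated_rate_limit_tokens(prompt, max_tokens=800, best_of=1)

# Hypothetical usage figures, as an observability tool would read them
# from the response once the completion has finished.
actual_prompt_tokens = 12
actual_completion_tokens = 60
actual = actual_prompt_tokens + actual_completion_tokens

print(f"estimated for rate limiting: {estimate}")
print(f"actually consumed: {actual}")
```

&lt;P&gt;Because the estimate reserves the full max_tokens budget up front, a request that finishes early still counts for much more against the rate limit than the tokens it ends up consuming.&lt;/P&gt;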
&lt;P&gt;That conversation inspired this post because many customers find themselves in a similar spot: they already have powerful observability tools but still want quick, built-in visibility into Azure OpenAI usage and rate limits without adding new integrations or switching platforms.&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="font-family: var(--lia-bs-headings-font-family); font-size: var(--lia-bs-h4-font-size); font-style: var(--lia-headings-font-style); letter-spacing: var(--lia-h4-letter-spacing); -webkit-tap-highlight-color: hsla(var(--lia-bs-black-h),var(--lia-bs-black-s),var(--lia-bs-black-l),0); -webkit-text-size-adjust: 100%;"&gt;&lt;BR /&gt;The two monitoring paths&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;When it comes to monitoring Azure OpenAI, there are two main options:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. The full flow &lt;/STRONG&gt;(most powerful, requires Log Analytics): This unlocks correlation, deep queries, and exporting metrics/logs to external tools.&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Azure OpenAI Service → Azure Monitor → Log Analytics → KQL, Workbooks, Alerts → integrations like Datadog, Grafana.&lt;/P&gt;
&lt;/img&gt;
&lt;P&gt;&lt;STRONG&gt;2. The lightweight flow&amp;nbsp;&lt;/STRONG&gt;(fast, free, no Log Analytics): This is what we’ll explore: simple dashboards and quota-based alerts right in the Azure Portal.&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Azure OpenAI Service → Azure Monitor (Metrics) → Portal Workbooks + Alerts.&lt;/P&gt;
&lt;/img&gt;
&lt;H4&gt;Metrics available in Azure OpenAI&lt;/H4&gt;
&lt;P&gt;Azure OpenAI publishes several key metrics natively (no ingestion required). According to the &lt;A href="https://learn.microsoft.com/azure/ai-foundry/openai/monitor-openai-reference" target="_blank" rel="noopener"&gt;official documentation&lt;/A&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Processed Inference Tokens → tokens consumed (prompt + completion).&lt;/LI&gt;
&lt;LI&gt;Azure OpenAI Requests → total API calls.&lt;/LI&gt;
&lt;LI&gt;Request Errors → failed requests (429s, 5xx).&lt;/LI&gt;
&lt;LI&gt;Availability Rate → percentage of successful calls.&lt;/LI&gt;
&lt;LI&gt;Latency metrics → TTFT (time to first token), TTLB (time to last byte).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;You can view these under: &lt;STRONG&gt;AOAI Resource → Monitoring → Metrics.&lt;/STRONG&gt;&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Azure OpenAI exposes native metrics like tokens, requests, errors, and latency directly in the Azure Portal&lt;/P&gt;
&lt;/img&gt;
&lt;H4&gt;Quotas: The other half of the picture&lt;/H4&gt;
&lt;P&gt;Metrics tell you usage. Quotas tell you capacity. Every deployment has fixed Tokens per Minute (TPM) and Requests per Minute (RPM) limits. You can find these under: &lt;STRONG&gt;AOAI Foundry Portal → Deployments → Select Deployment → Rate Limits.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Example:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;GPT-4.1-mini deployment →&lt;STRONG&gt; 250,000 TPM / 250 RPM&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;These are the values you’ll compare against metrics and use in alerts.&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Each deployment has fixed TPM/RPM quotas. Here, GPT-4.1-mini is capped at 250,000 TPM and 250 RPM.&lt;/P&gt;
&lt;/img&gt;
&lt;P&gt;If you prefer a programmatic approach, you can run this command:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az rest --method get \
  --url "https://management.azure.com/subscriptions/&amp;lt;subscriptionId&amp;gt;/resourceGroups/&amp;lt;resourceGroup&amp;gt;/providers/Microsoft.CognitiveServices/accounts/&amp;lt;accountName&amp;gt;/deployments/&amp;lt;deploymentName&amp;gt;?api-version=2023-05-01" \
  --query "{deployment:name, TPM:properties.rateLimits[?key=='token'].count | [0], RPM:properties.rateLimits[?key=='request'].count | [0]}"
&lt;/LI-CODE&gt;
&lt;P&gt;Sample output:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;{
  "RPM": 250,
  "TPM": 250000,
  "deployment": "gpt-4.1-mini"
}&lt;/LI-CODE&gt;
&lt;H4&gt;Building a lightweight workbook&lt;/H4&gt;
&lt;P&gt;Even without Log Analytics, you can build a simple workbook to track usage vs quota:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Go to &lt;STRONG&gt;Azure Monitor → Workbooks → + New&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Add a metric visualization for Processed Inference Tokens (Sum).
&lt;UL&gt;
&lt;LI&gt;Resource Type: Azure AI Foundry&lt;/LI&gt;
&lt;LI&gt;Azure AI Foundry: Select your instance&lt;/LI&gt;
&lt;LI&gt;Click to add metric&lt;/LI&gt;
&lt;LI&gt;Metric: Processed Inference Tokens&lt;/LI&gt;
&lt;LI&gt;Aggregation: Sum&lt;/LI&gt;
&lt;LI&gt;Display name: Token Usage vs Quota&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Add another metric for Azure OpenAI Requests (Count).
&lt;UL&gt;
&lt;LI&gt;Metric: Azure OpenAI Requests&lt;/LI&gt;
&lt;LI&gt;Aggregation: Count&lt;/LI&gt;
&lt;LI&gt;Display name: Requests per Minute vs Quota&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Click &lt;STRONG&gt;Run Metrics&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Save as AOAI Usage vs Capacity.&lt;/LI&gt;
&lt;/OL&gt;
&lt;img&gt;
&lt;P&gt;Workbooks let you visualize token and request usage against your deployment’s fixed quotas&lt;/P&gt;
&lt;/img&gt;
&lt;H4&gt;Creating alerts (proactive notification)&lt;/H4&gt;
&lt;P&gt;From the portal you can also configure alerts directly on metrics:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Go to &lt;STRONG&gt;Azure Monitor → Alerts → + Create → Alert rule&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Scope: your AOAI resource.&lt;/LI&gt;
&lt;LI&gt;Condition step:
&lt;UL&gt;
&lt;LI&gt;Signal name: Processed Inference Tokens&lt;/LI&gt;
&lt;LI&gt;Threshold type: Static&lt;/LI&gt;
&lt;LI&gt;Value is: Greater than&lt;/LI&gt;
&lt;LI&gt;Unit: Count&lt;/LI&gt;
&lt;LI&gt;Threshold: 200,000 (warning) or 250,000 (critical)&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Actions step:
&lt;UL&gt;
&lt;LI&gt;Use &lt;STRONG&gt;Quick Actions&lt;/STRONG&gt; → add your email (or Azure mobile push).&lt;/LI&gt;
&lt;LI&gt;Or create an &lt;STRONG&gt;Action Group&lt;/STRONG&gt; for Teams/webhook integration.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Details step:
&lt;UL&gt;
&lt;LI&gt;Name: AOAI-TPM-Warning / AOAI-TPM-Critical&lt;/LI&gt;
&lt;LI&gt;Severity: 2 (Warning) or 0 (Critical)&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;Review + Create.&lt;/LI&gt;
&lt;LI&gt;Repeat for &lt;STRONG&gt;Azure OpenAI Requests&lt;/STRONG&gt; with thresholds of 200 (warning) and 250 (critical).&lt;/LI&gt;
&lt;/OL&gt;
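&lt;P&gt;If you manage several deployments, it can help to derive the thresholds from the quota instead of typing them by hand. The sketch below assumes a warning level at 80% of quota, matching the values used in this post; the helper name is hypothetical, and the dictionary mirrors the sample output of the command shown earlier.&lt;/P&gt;

```python
def alert_thresholds(quota: int, warning_percent: int = 80) -> dict:
    """Return warning and critical alert thresholds for a per-minute quota."""
    return {
        # Integer arithmetic keeps the result exact for whole-number quotas.
        "warning": quota * warning_percent // 100,
        "critical": quota,
    }


# Quota values for the sample deployment used in this post.
deployment = {"deployment": "gpt-4.1-mini", "TPM": 250000, "RPM": 250}

tpm = alert_thresholds(deployment["TPM"])
rpm = alert_thresholds(deployment["RPM"])

print(f"AOAI-TPM-Warning threshold: {tpm['warning']}")
print(f"AOAI-TPM-Critical threshold: {tpm['critical']}")
print(f"RPM warning / critical: {rpm['warning']} / {rpm['critical']}")
```

&lt;P&gt;The computed numbers are the same ones entered in the Condition step: 200,000 and 250,000 for tokens, 200 and 250 for requests.&lt;/P&gt;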
&lt;P&gt;&lt;STRONG&gt;Alert conditions:&lt;/STRONG&gt;&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Configure alert conditions directly on metrics. Here, we trigger at 200,000 tokens per minute (80% of quota)&lt;/P&gt;
&lt;/img&gt;
&lt;P&gt;&lt;STRONG&gt;Quick Actions:&lt;/STRONG&gt;&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Quick Actions let you add email or mobile notifications without creating a full Action Group.&lt;/P&gt;
&lt;/img&gt;
&lt;P&gt;&lt;STRONG&gt;Overview from the Alert:&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;img&gt;
&lt;P&gt;Give your alert a descriptive name and severity. Here, AOAI-TPM-Warning at Severity 2.&lt;/P&gt;
&lt;/img&gt;
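&lt;P&gt;The 200,000 and 200 thresholds above are simply 80% of this example’s quotas. As a minimal sketch, assuming a deployment quota of 250,000 TPM and 250 RPM (substitute your own values), the alert levels can be derived in the shell. The function name alert_threshold is a hypothetical helper, not an Azure tool:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper: a percentage of a quota, using integer arithmetic.
# Usage: alert_threshold QUOTA PERCENT
alert_threshold() {
  echo $(( $1 * $2 / 100 ))
}

# Assumed example quotas from this walkthrough; replace with your own.
TPM_QUOTA=250000
RPM_QUOTA=250

echo "TPM warning threshold:  $(alert_threshold "$TPM_QUOTA" 80)"
echo "TPM critical threshold: $(alert_threshold "$TPM_QUOTA" 100)"
echo "RPM warning threshold:  $(alert_threshold "$RPM_QUOTA" 80)"
echo "RPM critical threshold: $(alert_threshold "$RPM_QUOTA" 100)"&lt;/LI-CODE&gt;
&lt;P&gt;Keeping the percentages in one place makes it easier to update every alert rule when a quota increase is approved.&lt;/P&gt;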
&lt;H4&gt;How this helps with 429 errors&lt;/H4&gt;
&lt;P&gt;One of the most common issues Azure OpenAI customers face is the dreaded &lt;STRONG&gt;“Too Many Requests” (429)&lt;/STRONG&gt; error.&lt;/P&gt;
&lt;P&gt;Why it happens:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Each deployment enforces hard TPM/RPM quotas.&lt;/LI&gt;
&lt;LI&gt;If you send more tokens or requests than allowed in a minute, the service rejects them with a 429.&lt;/LI&gt;
&lt;LI&gt;You may see headers like x-ms-retry-after-ms telling you how long to wait.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;How monitoring helps:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Metrics as early warning&lt;/STRONG&gt;: Watching token/request metrics shows when you’re approaching the cap.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Alerts before throttling&lt;/STRONG&gt;: Warning alerts at 80% (200k TPM / 200 RPM) give you time to react before 429s hit.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Critical alerts at 100%&lt;/STRONG&gt;: Confirm you’ve saturated the quota and need to adjust.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Important note:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Monitoring doesn’t prevent 429s; your app should still implement retry with backoff and consider batching or queuing requests.&lt;/LI&gt;
&lt;LI&gt;But with this setup, you’ll know before the error storm begins, and can respond faster.&lt;/LI&gt;
&lt;/UL&gt;
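&lt;P&gt;The retry-with-backoff advice can be sketched in a few lines of shell. This is an illustration only, assuming your client is a command that exits non-zero when it is throttled (for example, curl run with --fail). The names retry_with_backoff and BACKOFF_BASE_SECONDS are hypothetical, and a fuller client would also honor the retry header returned with the 429:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]
retry_with_backoff() {
  max_attempts=$1
  shift
  delay=${BACKOFF_BASE_SECONDS:-1}   # first wait in seconds (assumed default)
  attempt=1
  while true; do
    if "$@"; then
      return 0                       # the command succeeded
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts"
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s"
    sleep "$delay"
    delay=$(( delay * 2 ))           # double the wait each time
    attempt=$(( attempt + 1 ))
  done
}&lt;/LI-CODE&gt;
&lt;P&gt;Call it as retry_with_backoff 5 followed by whatever command sends the request. Doubling the delay spreads retries out so a throttled deployment is not hit again within the same minute.&lt;/P&gt;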
&lt;H4&gt;Why this matters&lt;/H4&gt;
&lt;P&gt;For many companies, time-to-value is more important than building a new monitoring stack.&lt;/P&gt;
&lt;P&gt;This approach means:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;No Log Analytics ingestion.&lt;/LI&gt;
&lt;LI&gt;No need to replace Datadog or Splunk.&lt;/LI&gt;
&lt;LI&gt;Free visibility into &lt;STRONG&gt;usage vs quota&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Proactive notifications on approaching limits.&lt;/LI&gt;
&lt;LI&gt;Fewer surprises with 429 errors.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;And if later you want deeper insights, you can still enable Log Analytics and export into your existing observability platform.&lt;/P&gt;
&lt;H4&gt;References:&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/monitor-openai" target="_blank"&gt;Monitor Azure OpenAI in Azure AI Foundry Models&amp;nbsp;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/azure-monitor/visualize/workbooks-overview" target="_blank"&gt;Azure Workbooks overview - Azure Monitor&amp;nbsp;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/azure-monitor/visualize/workbooks-templates" target="_blank"&gt;Azure Workbooks templates - Azure Monitor&amp;nbsp;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/fasttrackforazureblog/azure-openai-insights-monitoring-ai-with-confidence/4026850" target="_blank"&gt;Monitoring Azure OpenAI&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/ai-services/diagnostic-logging" target="_blank"&gt;Enable diagnostic logging - Azure AI services&amp;nbsp;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/the-importance-of-setting-up-service-and-resource-health-monitoring-in-azure/4372478" target="_blank"&gt;The importance of setting up Service and Resource Health monitoring in Azure&amp;nbsp;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Closing thoughts&lt;/H4&gt;
&lt;P&gt;This article was inspired by a customer request, but I believe many others will benefit from the same approach. In just a few minutes, you can build a dashboard, set alerts, and gain confidence in your Azure OpenAI usage, all without leaving the Azure Portal.&lt;BR /&gt;&lt;BR /&gt;I’d love to hear from you: how is your team monitoring Azure OpenAI today? Share in the comments; your feedback will help shape what we build next.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Oct 2025 20:16:23 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/monitoring-azure-openai-without-switching-from-your-existing/ba-p/4458898</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-10-09T20:16:23Z</dc:date>
    </item>
    <item>
      <title>Azure routing preference: A hidden lever for performance vs. cost trade-offs</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-routing-preference-a-hidden-lever-for-performance-vs-cost/ba-p/4451425</link>
      <description>&lt;img /&gt;
&lt;P data-start="313" data-end="540"&gt;For Digital Native companies, every engineering decision is also a business decision. How you design your cloud architecture affects not just performance but also your burn rate, margins, and ultimately your ability to scale.&lt;/P&gt;
&lt;P data-start="542" data-end="729"&gt;One of the most overlooked levers in Azure networking is &lt;STRONG data-start="599" data-end="621"&gt;Routing Preference&lt;/STRONG&gt;, a simple setting that determines how your outbound internet traffic leaves Azure. The choice is binary:&lt;/P&gt;
&lt;UL data-start="731" data-end="920"&gt;
&lt;LI data-start="731" data-end="844"&gt;&lt;STRONG data-start="733" data-end="771"&gt;Microsoft Global Network (Premium): &lt;/STRONG&gt;High-quality, low-latency routing on Microsoft’s backbone (default).&lt;/LI&gt;
&lt;LI data-start="845" data-end="920"&gt;&lt;STRONG data-start="847" data-end="881"&gt;ISP Transit (Internet Routing): &lt;/STRONG&gt;Lower-cost routing via local ISPs.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="922" data-end="1064"&gt;Most startups never change the default, but understanding when to switch can save you serious money without sacrificing customer experience.&lt;/P&gt;
&lt;H4 data-start="1071" data-end="1110"&gt;Why it matters for digital natives&lt;/H4&gt;
&lt;P data-start="1112" data-end="1339"&gt;Bandwidth is one of those quiet COGS items (&lt;EM data-start="1060" data-end="1138"&gt;Cost of Goods Sold, the direct cost of delivering your product to customers&lt;/EM&gt;) that doesn’t make noise until the bill arrives. If your product depends on moving data, whether streaming, analytics, or SaaS APIs, outbound traffic is part of your unit economics.&lt;/P&gt;
&lt;P data-start="1341" data-end="1426"&gt;Routing Preference is your &lt;STRONG data-start="1368" data-end="1407"&gt;toggle between performance and cost&lt;/STRONG&gt;. Think of it as:&lt;/P&gt;
&lt;UL data-start="1427" data-end="1582"&gt;
&lt;LI data-start="1427" data-end="1498"&gt;&lt;STRONG data-start="1429" data-end="1465"&gt;Business class routing (Premium): &lt;/STRONG&gt;smoother ride, higher price.&lt;/LI&gt;
&lt;LI data-start="1499" data-end="1582"&gt;&lt;STRONG data-start="1501" data-end="1526"&gt;Economy routing (ISP): &lt;/STRONG&gt;cheaper seat, gets you there, but less predictable.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="1589" data-end="1610"&gt;Pricing Snapshot&lt;/H4&gt;
&lt;P data-start="1612" data-end="1755"&gt;Outbound internet rates vary by region, and they differ sharply between routing options. For example (first 10 TB/month, beyond the free 100 GB):&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 71.7593%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&amp;nbsp;Region&lt;/th&gt;&lt;th&gt;&amp;nbsp;Microsoft Global Network (Premium)&lt;/th&gt;&lt;th&gt;&amp;nbsp;ISP Transit (Internet Routing)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;United States&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;$0.087 / GB&lt;/td&gt;&lt;td&gt;$0.04 / GB&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Australia&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;$0.12 / GB&lt;/td&gt;&lt;td&gt;$0.06 / GB&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" target="_blank" rel="noopener" data-start="2073" data-end="2160"&gt;Azure Bandwidth Pricing&lt;/A&gt;&lt;/P&gt;
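&lt;P&gt;To see what the table means for a bill, here is a rough back-of-the-envelope sketch using the United States rates above. It assumes 10,000 GB of billable outbound traffic in a month (an illustrative figure) and ignores the free allowance and higher-volume tiers. The function egress_cost is a hypothetical helper:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper: monthly egress cost in USD.
# Usage: egress_cost BILLABLE_GB RATE_PER_GB
egress_cost() {
  awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", gb * rate }'
}

GB=10000   # assumed monthly billable egress, in GB

premium=$(egress_cost "$GB" 0.087)   # Microsoft Global Network rate
isp=$(egress_cost "$GB" 0.04)        # ISP Transit rate
saving=$(awk -v a="$premium" -v b="$isp" 'BEGIN { printf "%.2f\n", a - b }')

echo "Premium: \$${premium}  ISP: \$${isp}  Saving: \$${saving}"&lt;/LI-CODE&gt;
&lt;P&gt;Under these assumptions the difference is about $470 a month. Swap in your own volume and your region’s rates from the pricing page to get a figure that matches your workload.&lt;/P&gt;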
&lt;H4 data-start="2169" data-end="2223"&gt;Which Azure resources support routing preference?&lt;/H4&gt;
&lt;P data-start="2225" data-end="2357"&gt;Routing Preference applies to Azure resources backed by a&amp;nbsp;&lt;STRONG data-start="2330" data-end="2343"&gt;Public IP&lt;/STRONG&gt;, and it can also be set directly on Storage Accounts. Supported resources include:&lt;/P&gt;
&lt;UL data-start="2359" data-end="2588"&gt;
&lt;LI data-start="2359" data-end="2385"&gt;Virtual Machines (VMs)&lt;/LI&gt;
&lt;LI data-start="2386" data-end="2423"&gt;Virtual Machine Scale Sets (VMSS)&lt;/LI&gt;
&lt;LI data-start="2424" data-end="2458"&gt;Azure Kubernetes Service (AKS)&lt;/LI&gt;
&lt;LI data-start="2459" data-end="2504"&gt;Public Load Balancers (NIC-based backend)&lt;/LI&gt;
&lt;LI data-start="2505" data-end="2528"&gt;Application Gateway&lt;/LI&gt;
&lt;LI data-start="2529" data-end="2547"&gt;Azure Firewall&lt;/LI&gt;
&lt;LI data-start="2548" data-end="2588"&gt;Storage Accounts (Blob, Files, etc.)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2590" data-end="2721"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/routing-preference-overview" target="_blank" rel="noopener" data-start="2593" data-end="2719"&gt;Routing preference overview&lt;/A&gt;&lt;/P&gt;
&lt;H4&gt;How to configure it&lt;/H4&gt;
&lt;P data-start="2590" data-end="2721"&gt;&lt;U&gt;Public IP Example (CLI)&lt;/U&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Create a Standard public IP that uses ISP transit (Internet routing)
az network public-ip create \
  --name MyPublicIP \
  --resource-group MyResourceGroup \
  --location eastus \
  --ip-tags 'RoutingPreference=Internet' \
  --sku Standard \
  --allocation-method Static \
  --version IPv4&lt;/LI-CODE&gt;
&lt;P data-start="2590" data-end="2721"&gt;&amp;nbsp;&lt;/P&gt;
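&lt;P&gt;Storage accounts take the setting on the account itself rather than on a Public IP. A minimal sketch, assuming an existing account: the account and resource group names are placeholders, set_storage_internet_routing is a hypothetical wrapper, and it is worth confirming the flag with az storage account update --help for your CLI version:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical wrapper: switch a storage account's default endpoint
# to Internet routing (ISP transit).
# Usage: set_storage_internet_routing ACCOUNT_NAME RESOURCE_GROUP
set_storage_internet_routing() {
  az storage account update \
    --name "$1" \
    --resource-group "$2" \
    --routing-choice InternetRouting
}

# Example call with placeholder names (uncomment to run):
# set_storage_internet_routing mystorageaccount MyResourceGroup&lt;/LI-CODE&gt;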
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;
&lt;P data-start="3212" data-end="3222"&gt;Docs:&lt;/P&gt;
&lt;UL data-start="3223" data-end="3485"&gt;
&lt;LI data-start="3223" data-end="3356"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/routing-preference-portal" target="_blank" rel="noopener" data-start="3225" data-end="3354"&gt;Routing Preference for Public IP&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="3357" data-end="3485"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/storage/common/network-routing-preference" target="_blank" rel="noopener" data-start="3359" data-end="3483"&gt;Routing Preference for Storage Accounts&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="3492" data-end="3522"&gt;A digital native playbook&lt;/H4&gt;
&lt;P data-start="3524" data-end="3566"&gt;Here’s a quick guide to help you decide:&lt;/P&gt;
&lt;table border="1" style="width: 89.8148%; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Recommended Routing&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Latency-sensitive SaaS APIs&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Premium (Global Network)&lt;/td&gt;&lt;td&gt;Predictable performance, better customer experience&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Dev/Test environments&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;ISP Transit&lt;/td&gt;&lt;td&gt;Optimize cost where performance isn’t critical&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Bulk log exports, backups&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;ISP Transit&lt;/td&gt;&lt;td&gt;Cut bandwidth costs significantly&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Production workloads with end-users&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Premium&lt;/td&gt;&lt;td&gt;Protect SLA and latency for customers&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H4 data-start="4045" data-end="4063"&gt;Key takeaways&lt;/H4&gt;
&lt;UL data-start="4065" data-end="4491"&gt;
&lt;LI data-start="4065" data-end="4146"&gt;By default, you’re paying for &lt;STRONG data-start="4097" data-end="4116"&gt;Premium routing&lt;/STRONG&gt; whether you need it or not.&lt;/LI&gt;
&lt;LI data-start="4147" data-end="4238"&gt;ISP Transit can &lt;STRONG data-start="4165" data-end="4193"&gt;cut outbound bandwidth costs roughly in half&lt;/STRONG&gt;, a huge win for cost-sensitive workloads.&lt;/LI&gt;
&lt;LI data-start="4239" data-end="4340"&gt;Routing Preference applies to &lt;STRONG data-start="4271" data-end="4337"&gt;VMs, AKS, Load Balancers, App Gateways, Firewalls, and Storage&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI data-start="4341" data-end="4491"&gt;The right choice depends on your &lt;STRONG data-start="4376" data-end="4410"&gt;growth stage and workload type&lt;/STRONG&gt;: optimize for performance where it matters, optimize for cost everywhere else.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="4498" data-end="4511"&gt;Closing&lt;/H4&gt;
&lt;P data-start="4513" data-end="4793"&gt;For digital natives, scaling is a balance: you need to &lt;STRONG data-start="4568" data-end="4589"&gt;delight customers&lt;/STRONG&gt; while &lt;STRONG data-start="4596" data-end="4613"&gt;watching COGS&lt;/STRONG&gt;. Routing Preference is a small Azure feature that gives you a big lever on both. Next time you spin up a VM, AKS cluster, or Storage account, don’t just accept the defaults.&lt;/P&gt;
&lt;P data-start="4795" data-end="4910"&gt;Ask:&amp;nbsp;&lt;EM data-start="4803" data-end="4849"&gt;Do I want business class routing or economy? &lt;/EM&gt;That one decision could save you thousands as you scale.&lt;/P&gt;
&lt;P data-start="4795" data-end="4910"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 05 Sep 2025 20:40:21 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-routing-preference-a-hidden-lever-for-performance-vs-cost/ba-p/4451425</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-09-05T20:40:21Z</dc:date>
    </item>
    <item>
      <title>Azure Quota Alerts (Preview): Still overlooked, but incredibly useful</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-quota-alerts-preview-still-overlooked-but-incredibly/ba-p/4447140</link>
      <description>&lt;P data-start="322" data-end="496"&gt;Quota limits are one of those hidden blockers that can catch digital native companies by surprise. You’re scaling fast, deploying more VMs or GPU nodes, and suddenly:&amp;nbsp;&lt;STRONG data-start="473" data-end="494"&gt;“Quota exceeded.”&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="498" data-end="733"&gt;Since late 2024, Azure has offered &lt;STRONG data-start="533" data-end="559"&gt;Quota Alerts (Preview)&lt;/STRONG&gt;, a built-in way to monitor usage and get notified before you hit subscription limits. It’s not brand new, but many digital native companies still aren’t taking advantage of it.&lt;/P&gt;
&lt;H4 data-start="740" data-end="790"&gt;Why this matters for startups &amp;amp; digital natives&lt;/H4&gt;
&lt;UL data-start="791" data-end="1105"&gt;
&lt;LI data-start="791" data-end="877"&gt;&lt;STRONG data-start="793" data-end="820"&gt;Avoid outages at scale:&lt;/STRONG&gt; get warned before deployments fail on quota ceilings.&lt;/LI&gt;
&lt;LI data-start="878" data-end="949"&gt;&lt;STRONG data-start="880" data-end="906"&gt;Save engineering time:&lt;/STRONG&gt; no need for custom monitoring pipelines.&lt;/LI&gt;
&lt;LI data-start="950" data-end="1015"&gt;&lt;STRONG data-start="952" data-end="969"&gt;Simple setup:&lt;/STRONG&gt; alerts in minutes directly from the portal.&lt;/LI&gt;
&lt;LI data-start="1016" data-end="1105"&gt;&lt;STRONG data-start="1018" data-end="1046"&gt;Fits existing workflows:&lt;/STRONG&gt; integrates with Action Groups (email, Teams, PagerDuty).&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="1112" data-end="1161"&gt;How to create a quota alert in Azure (Preview)&lt;/H4&gt;
&lt;P data-start="1163" data-end="1197"&gt;&lt;STRONG&gt;1. Open &lt;EM data-start="1175" data-end="1183"&gt;Quotas&lt;/EM&gt; in the Portal&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="1198" data-end="1258"&gt;Search for &lt;STRONG data-start="1205" data-end="1215"&gt;Quotas&lt;/STRONG&gt; in the Azure Portal and open the blade.&lt;/P&gt;
&lt;P data-start="1265" data-end="1301"&gt;&lt;STRONG&gt;2. Go to &lt;EM data-start="1278" data-end="1301"&gt;Alert rules (Preview)&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="1302" data-end="1376"&gt;Click &lt;STRONG data-start="1308" data-end="1333"&gt;Alert rules (Preview)&lt;/STRONG&gt;, then &lt;STRONG data-start="1340" data-end="1373"&gt;+ Create Alert Rule (Preview)&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-start="1383" data-end="1409"&gt;&lt;STRONG&gt;3. Configure the Alert&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P data-start="1410" data-end="1432"&gt;On the &lt;STRONG data-start="668" data-end="695"&gt;Create usage alert rule&lt;/STRONG&gt; page:&lt;/P&gt;
&lt;UL data-start="705" data-end="1229"&gt;
&lt;LI data-start="705" data-end="752"&gt;&lt;STRONG data-start="707" data-end="723"&gt;Subscription&lt;/STRONG&gt; → choose the subscription.&lt;/LI&gt;
&lt;LI data-start="753" data-end="798"&gt;&lt;STRONG data-start="755" data-end="767"&gt;Provider&lt;/STRONG&gt; → e.g., &lt;EM data-start="776" data-end="785"&gt;Compute&lt;/EM&gt; for vCPUs.&lt;/LI&gt;
&lt;LI data-start="799" data-end="860"&gt;&lt;STRONG data-start="801" data-end="820"&gt;Alert rule name&lt;/STRONG&gt; → e.g., &lt;EM data-start="829" data-end="857"&gt;Quota Alert – EastUS vCPUs&lt;/EM&gt;.&lt;/LI&gt;
&lt;LI data-start="861" data-end="901"&gt;&lt;STRONG data-start="863" data-end="876"&gt;Threshold&lt;/STRONG&gt; → usage % (e.g., 80%).&lt;/LI&gt;
&lt;LI data-start="902" data-end="946"&gt;&lt;STRONG data-start="904" data-end="916"&gt;Severity&lt;/STRONG&gt; → pick according to policy.&lt;/LI&gt;
&lt;LI data-start="947" data-end="998"&gt;&lt;STRONG data-start="949" data-end="976"&gt;Frequency of evaluation&lt;/STRONG&gt; → e.g., 15 minutes.&lt;/LI&gt;
&lt;LI data-start="999" data-end="1042"&gt;&lt;STRONG data-start="1001" data-end="1019"&gt;Resource group&lt;/STRONG&gt; → select/create one.&lt;/LI&gt;
&lt;LI data-start="1043" data-end="1091"&gt;&lt;STRONG data-start="1045" data-end="1065"&gt;Managed Identity&lt;/STRONG&gt; → click &lt;STRONG data-start="1074" data-end="1088"&gt;Create new&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI data-start="1092" data-end="1147"&gt;&lt;STRONG data-start="1094" data-end="1110"&gt;Notify me by&lt;/STRONG&gt; → email, Action Group, Teams, etc.&lt;/LI&gt;
&lt;LI data-start="1148" data-end="1229"&gt;&lt;STRONG data-start="1150" data-end="1164"&gt;Dimensions&lt;/STRONG&gt; → select &lt;STRONG data-start="1174" data-end="1186"&gt;Location&lt;/STRONG&gt; and &lt;STRONG data-start="1191" data-end="1200"&gt;Quota&lt;/STRONG&gt; (e.g., DSv5 Family vCPUs).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1796" data-end="1820"&gt;Save the rule, then complete step 4 below so the new managed identity can read quota usage. You can find more detailed configuration options in the&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/quotas/how-to-guide-monitoring-alerting" target="_blank" rel="noopener" data-start="546" data-end="652"&gt;official Microsoft docs&lt;/A&gt;.&lt;/P&gt;
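&lt;P&gt;The percentage threshold works the same way you would check usage by hand. As a rough sketch (not how the alert rule itself is implemented), given the current value and limit for a quota, for example as reported by az vm list-usage --location eastus --output table, you can compute the usage percentage and compare it with your threshold. The helpers quota_usage_percent and quota_over_threshold are hypothetical, and the numbers are made up:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical helper: whole-number usage percentage.
# Usage: quota_usage_percent CURRENT_VALUE LIMIT
quota_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Hypothetical helper: succeeds when usage has reached the threshold.
# Usage: quota_over_threshold CURRENT_VALUE LIMIT THRESHOLD_PERCENT
quota_over_threshold() {
  [ "$(quota_usage_percent "$1" "$2")" -ge "$3" ]
}

# Made-up example: 85 of 100 vCPUs used, 80% threshold.
if quota_over_threshold 85 100 80; then
  echo "Quota usage at $(quota_usage_percent 85 100)%: request an increase"
fi&lt;/LI-CODE&gt;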
&lt;P data-start="1260" data-end="1311"&gt;&lt;STRONG&gt;4. Assign Permissions to the Managed Identity&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="1312" data-end="1439"&gt;When the new Managed Identity is created (e.g., quota-alert-managed-identity), you must give it access to read quota usage.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL data-start="1441" data-end="1800"&gt;
&lt;LI data-start="1441" data-end="1488"&gt;Go to &lt;STRONG data-start="1449" data-end="1471"&gt;Managed Identities&lt;/STRONG&gt; in the portal.&lt;/LI&gt;
&lt;LI data-start="1489" data-end="1541"&gt;Select the identity created for the quota alert.&lt;/LI&gt;
&lt;LI data-start="1542" data-end="1616"&gt;Open &lt;STRONG data-start="1549" data-end="1575"&gt;Azure role assignments&lt;/STRONG&gt; → &lt;STRONG data-start="1578" data-end="1613"&gt;+ Add role assignment (Preview)&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI data-start="1617" data-end="1790"&gt;Set:
&lt;UL data-start="1628" data-end="1790"&gt;
&lt;LI data-start="1628" data-end="1655"&gt;&lt;STRONG data-start="1630" data-end="1639"&gt;Scope&lt;/STRONG&gt;: Subscription&lt;/LI&gt;
&lt;LI data-start="1658" data-end="1723"&gt;&lt;STRONG data-start="1660" data-end="1676"&gt;Subscription&lt;/STRONG&gt;: the subscription where quotas are monitored&lt;/LI&gt;
&lt;LI data-start="1726" data-end="1790"&gt;&lt;STRONG data-start="1728" data-end="1736"&gt;Role&lt;/STRONG&gt;: &lt;STRONG data-start="1738" data-end="1748"&gt;Reader&lt;/STRONG&gt; (or any role that includes read access)&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI data-start="1791" data-end="1800"&gt;Save.&lt;/LI&gt;
&lt;/UL&gt;
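&lt;P&gt;The same assignment can be scripted. A minimal sketch, assuming the identity name shown above and placeholder values for the resource group and subscription ID. The function grant_quota_reader is a hypothetical wrapper around two standard Azure CLI commands:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical wrapper: give a managed identity Reader on a subscription.
# Usage: grant_quota_reader IDENTITY_NAME RESOURCE_GROUP SUBSCRIPTION_ID
grant_quota_reader() {
  # Look up the identity's service principal (object) ID.
  principal_id=$(az identity show \
    --name "$1" \
    --resource-group "$2" \
    --query principalId \
    --output tsv)

  # Assign the Reader role at subscription scope.
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role Reader \
    --scope "/subscriptions/$3"
}

# Example call with placeholder values (uncomment to run):
# grant_quota_reader quota-alert-managed-identity MyResourceGroup 00000000-0000-0000-0000-000000000000&lt;/LI-CODE&gt;
&lt;P&gt;Passing the object ID together with the principal type avoids a directory lookup that can fail when the identity was created only moments earlier.&lt;/P&gt;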
&lt;P data-start="1827" data-end="1845"&gt;&lt;STRONG&gt;5. Track &amp;amp; Act&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1846" data-end="1977"&gt;
&lt;LI data-start="1846" data-end="1914"&gt;Your rules are visible under &lt;STRONG data-start="1877" data-end="1911"&gt;Quotas → Alert rules (Preview)&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI data-start="1915" data-end="1977"&gt;Triggered alerts show up under &lt;STRONG data-start="1948" data-end="1974"&gt;Fired alerts (Preview)&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4 data-start="1984" data-end="2005"&gt;Old vs New Reality&lt;/H4&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;Custom Scripts &amp;amp; Logs&lt;/th&gt;&lt;th&gt;Quota Alerts (Preview)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Effort to set up&lt;/td&gt;&lt;td&gt;High&lt;/td&gt;&lt;td&gt;Very low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Extra services needed&lt;/td&gt;&lt;td&gt;Log Analytics, Automation&lt;/td&gt;&lt;td&gt;None&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Visibility&lt;/td&gt;&lt;td&gt;Manual dashboards&lt;/td&gt;&lt;td&gt;Native alert rules&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Best fit&lt;/td&gt;&lt;td&gt;Ops-heavy teams&lt;/td&gt;&lt;td&gt;Startups, lean teams&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H4 data-start="2510" data-end="2523"&gt;Takeaway&lt;/H4&gt;
&lt;P data-start="2524" data-end="2752"&gt;Quota alerts may have been around since late 2024, but they remain one of the most &lt;STRONG data-start="2607" data-end="2630"&gt;underrated features&lt;/STRONG&gt; in Azure. For startups and digital native companies scaling quickly, they provide peace of mind with almost zero setup.&lt;/P&gt;
&lt;P data-start="2754" data-end="2946"&gt;Don’t wait until your next deployment fails. Set up a&amp;nbsp;&lt;STRONG data-start="2812" data-end="2827"&gt;quota alert&lt;/STRONG&gt; today (start with Regional vCPUs in your main region). It only takes a couple of minutes and could save your launch.&lt;/P&gt;
&lt;P data-start="2953" data-end="3098"&gt;⚡️ Pro tip: You can reuse your &lt;STRONG data-start="2984" data-end="3001"&gt;Action Groups&lt;/STRONG&gt; for quota alerts, keeping all notifications consistent with your existing monitoring strategy.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Aug 2025 16:24:56 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-quota-alerts-preview-still-overlooked-but-incredibly/ba-p/4447140</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-08-22T16:24:56Z</dc:date>
    </item>
    <item>
      <title>Azure Support Slack Bot on Azure Container Apps: Production-ready guide</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-support-slack-bot-on-azure-container-apps-production-ready/ba-p/4436423</link>
      <description>&lt;P data-start="416" data-end="592"&gt;Launch a secure, scalable Slack bot for Azure support tickets in minutes — no secrets in code, no manual admin steps, and fully aligned with modern cloud-native best practices.&lt;/P&gt;
&lt;P data-start="594" data-end="960"&gt;This guide walks you through deploying the GitHub sample &lt;A class="lia-external-url" href="https://github.com/Azure-Samples/azure-support-slack-bot" target="_blank" rel="noopener" data-start="651" data-end="732"&gt;azure-support-slack-bot&lt;/A&gt; on &lt;STRONG data-start="736" data-end="760"&gt;Azure Container Apps&lt;/STRONG&gt;, using &lt;STRONG data-start="768" data-end="790"&gt;managed identities&lt;/STRONG&gt;, &lt;STRONG data-start="792" data-end="805"&gt;Key Vault&lt;/STRONG&gt;, and a &lt;STRONG data-start="831" data-end="861"&gt;scale-to-zero architecture&lt;/STRONG&gt; that just works, whether you're building from scratch or plugging into your existing DevOps flow.&lt;/P&gt;
&lt;H3 data-start="962" data-end="991"&gt;Here’s what you’ll build:&lt;/H3&gt;
&lt;UL data-start="993" data-end="1330"&gt;
&lt;LI data-start="993" data-end="1066"&gt;Zero-admin secrets management with Managed Identity + Key Vault&lt;/LI&gt;
&lt;LI data-start="1067" data-end="1112"&gt;RBAC-first access to Azure Support APIs&lt;/LI&gt;
&lt;LI data-start="1113" data-end="1177"&gt;A clean, local-first development workflow (with ngrok support)&lt;/LI&gt;
&lt;LI data-start="1178" data-end="1224"&gt;Slack integration using manifest-based setup&lt;/LI&gt;
&lt;LI data-start="1225" data-end="1274"&gt;Observability with App Insights + Log Analytics&lt;/LI&gt;
&lt;LI data-start="1275" data-end="1330"&gt;Scale from 0 to N replicas, with autoscaling baked in&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1332" data-end="1443"&gt;And yes, you get all of this&amp;nbsp;&lt;STRONG data-start="1355" data-end="1391"&gt;without ever hardcoding a secret&lt;/STRONG&gt; or exposing a public endpoint you didn’t intend to.&lt;/P&gt;
&lt;P data-start="1332" data-end="1443"&gt;If you’re running lean and building fast, this bot is a solid foundation. It’s not just a cool demo — it’s a production-ready blueprint for any digital native team that wants to integrate Slack with Azure support in a secure, automated, and developer-friendly way.&lt;/P&gt;
&lt;H3&gt;1. What you're building&lt;/H3&gt;
&lt;img /&gt;
&lt;P&gt;&lt;STRONG&gt;Features:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;/azure-support slash command&lt;/LI&gt;
&lt;LI&gt;Auto-scaling from 0→N replicas based on HTTP load&lt;/LI&gt;
&lt;LI&gt;Zero secrets in code or environment variables&lt;/LI&gt;
&lt;LI&gt;Comprehensive logging and Azure RBAC integration&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;2. Why Azure Container Apps (ACA)?&lt;/H3&gt;
&lt;P data-start="242" data-end="373"&gt;When you're building for speed, security, and scale without a huge ops team, &lt;STRONG data-start="322" data-end="352"&gt;Azure Container Apps (ACA)&lt;/STRONG&gt; hits the sweet spot.&lt;/P&gt;
&lt;P data-start="375" data-end="550"&gt;This Slack bot doesn't need a full-blown cluster. It needs &lt;STRONG data-start="434" data-end="456"&gt;event-driven scale&lt;/STRONG&gt;, &lt;STRONG data-start="458" data-end="481"&gt;zero-trust security&lt;/STRONG&gt;, and &lt;STRONG data-start="487" data-end="510"&gt;built-in automation&lt;/STRONG&gt;, and that’s exactly what ACA delivers. Here’s why ACA is a better fit than the usual suspects:&lt;/P&gt;
&lt;P data-start="614" data-end="652"&gt;&lt;STRONG&gt;Azure Container Instances (ACI)&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="653" data-end="906"&gt;
&lt;LI data-start="653" data-end="766"&gt;Great for quick scripts or batch jobs, but it has no built-in TLS ingress or scaling rules, and only limited managed identity support.&lt;/LI&gt;
&lt;LI data-start="767" data-end="906"&gt;ACA gives you all that out of the box, with production features and native integration with Key Vault, App Insights, and autoscaling.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="908" data-end="937"&gt;&lt;STRONG&gt;Web App for Containers&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="938" data-end="1181"&gt;
&lt;LI data-start="938" data-end="1068"&gt;Web Apps are more suited for classic web hosting. You’ll hit limits with scaling flexibility, networking, and secret management.&lt;/LI&gt;
&lt;LI data-start="1069" data-end="1181"&gt;ACA gives you Kubernetes-grade scale and observability, without having to think about servers or patching.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1183" data-end="1220"&gt;&lt;STRONG&gt;Azure Kubernetes Service (AKS)&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1221" data-end="1611"&gt;
&lt;LI data-start="1221" data-end="1338"&gt;Powerful, but heavy. You manage clusters, patch nodes, deal with autoscaler configs, ingress controllers, and more.&lt;/LI&gt;
&lt;LI data-start="1339" data-end="1433"&gt;ACA does the heavy lifting for you: zero node management, zero cluster maintenance.&lt;/LI&gt;
&lt;LI data-start="1434" data-end="1611"&gt;And here’s the kicker: AKS charges for the VM nodes 24/7, even when idle.
&lt;UL data-start="1516" data-end="1611"&gt;
&lt;LI data-start="1516" data-end="1611"&gt;ACA? Pay only for what you use. When there’s no traffic, it scales to zero, and&amp;nbsp;you don’t pay for idle compute.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 data-start="1618" data-end="1661"&gt;Cost Efficiency That Scales With You&lt;/H4&gt;
&lt;P data-start="1663" data-end="1792"&gt;For digital native teams, especially startups and growth-stage companies, ACA’s&amp;nbsp;&lt;STRONG data-start="1745" data-end="1773"&gt;serverless pricing model&lt;/STRONG&gt; is a game-changer:&lt;/P&gt;
&lt;UL data-start="1793" data-end="1957"&gt;
&lt;LI data-start="1793" data-end="1852"&gt;You scale from 0 to N replicas based on actual demand&lt;/LI&gt;
&lt;LI data-start="1853" data-end="1896"&gt;You only pay when your app is running&lt;/LI&gt;
&lt;LI data-start="1897" data-end="1957"&gt;No need to over-provision VMs or guess your future traffic&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1959" data-end="2091"&gt;This means you can launch a support bot, an internal API, or a microservice without worrying about burning cash while it's idle.&lt;/P&gt;
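&lt;P&gt;Scale-to-zero is a replica setting on the container app. A minimal sketch, assuming an existing app: the app name, resource group, and the maximum of 5 replicas are placeholders, and enable_scale_to_zero is a hypothetical wrapper around az containerapp update:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Hypothetical wrapper: let a container app scale between 0 and MAX replicas.
# Usage: enable_scale_to_zero APP_NAME RESOURCE_GROUP MAX_REPLICAS
enable_scale_to_zero() {
  az containerapp update \
    --name "$1" \
    --resource-group "$2" \
    --min-replicas 0 \
    --max-replicas "$3"
}

# Example call with placeholder values (uncomment to run):
# enable_scale_to_zero my-slack-bot MyResourceGroup 5&lt;/LI-CODE&gt;
&lt;P&gt;A minimum of 0 is what removes the idle cost. The trade-off is a cold start on the first request after a quiet period, so raise the minimum to 1 if that delay matters for your bot.&lt;/P&gt;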
&lt;H4 data-start="2098" data-end="2126"&gt;Built for These Teams&lt;/H4&gt;
&lt;P data-start="2128" data-end="2145"&gt;ACA is ideal for:&lt;/P&gt;
&lt;UL data-start="2146" data-end="2486"&gt;
&lt;LI data-start="2146" data-end="2228"&gt;&lt;STRONG data-start="2151" data-end="2181"&gt;Platform engineering teams&lt;/STRONG&gt; who want secure templates, not snowflake infra&lt;/LI&gt;
&lt;LI data-start="2229" data-end="2306"&gt;&lt;STRONG data-start="2234" data-end="2256"&gt;DevOps-light teams&lt;/STRONG&gt; who need autoscaling without managing YAML storms&lt;/LI&gt;
&lt;LI data-start="2307" data-end="2397"&gt;&lt;STRONG data-start="2312" data-end="2343"&gt;Growth-stage product squads&lt;/STRONG&gt; rolling out bots, APIs, or event-driven services fast&lt;/LI&gt;
&lt;LI data-start="2398" data-end="2486"&gt;&lt;STRONG data-start="2403" data-end="2415"&gt;Startups&lt;/STRONG&gt; who care about velocity, observability, and not hiring a full SRE team&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Comparison table:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-solid" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;STRONG&gt;Platform&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Best Fit&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Where It Falls Short&lt;/STRONG&gt;&lt;/th&gt;&lt;th&gt;&lt;STRONG&gt;Why ACA Wins&lt;/STRONG&gt;&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;ACI&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Short-lived scripts &amp;amp; jobs&lt;/td&gt;&lt;td&gt;No ingress, limited identity, lacks autoscaling&lt;/td&gt;&lt;td&gt;ACA supports scale-to-zero, secure access, and managed identity out of the box&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Web App&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Traditional web hosting&lt;/td&gt;&lt;td&gt;Rigid scaling, fewer network/runtime controls&lt;/td&gt;&lt;td&gt;ACA offers greater flexibility and microservice patterns&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;AKS&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Complex, large-scale distributed apps&lt;/td&gt;&lt;td&gt;Operational overhead, always-on cost&lt;/td&gt;&lt;td&gt;ACA simplifies ops with managed scaling &amp;amp; lower cost&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;ACA&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Cloud-native APIs, internal tools, microservices&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;Built-in identity, ingress, scale-to-zero, lower total cost&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P data-start="3098" data-end="3157"&gt;ACA is your serverless container platform when you want:&lt;/P&gt;
&lt;UL data-start="3158" data-end="3366"&gt;
&lt;LI data-start="3158" data-end="3184"&gt;TLS and ingress baked in&lt;/LI&gt;
&lt;LI data-start="3185" data-end="3224"&gt;GitHub Actions support out of the box&lt;/LI&gt;
&lt;LI data-start="3225" data-end="3307"&gt;Built-in support for Key Vault, managed identities, and auto-scaling&lt;/LI&gt;
&lt;LI data-start="3308" data-end="3366"&gt;Production-grade infra, without managing a single VM&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3368" data-end="3461"&gt;If you're moving fast and don’t want to build a platform just to run a bot, ACA is the move.&lt;/P&gt;
&lt;H3&gt;3. Prerequisites &amp;amp; verification&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Required:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure subscription with&amp;nbsp;&lt;STRONG&gt;Contributor&lt;/STRONG&gt;&amp;nbsp;access&lt;/LI&gt;
&lt;LI&gt;Azure CLI ≥ 2.49.0 with the containerapp extension&lt;/LI&gt;
&lt;LI&gt;Docker Desktop or equivalent &amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Slack workspace with app creation permissions&lt;/LI&gt;
&lt;LI&gt;Python 3.8+ for local development&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Optional:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://ngrok.com/" target="_blank" rel="noopener"&gt;ngrok &lt;/A&gt;&amp;nbsp;account for stable local testing URLs&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Setup verification:&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Verify Azure CLI and login
az --version
az login
az account set --subscription &amp;lt;your-subscription-id&amp;gt;
az extension add --name containerapp --upgrade

# Verify Docker and Python
docker --version
python --version

# Verify current user permissions
currentUserId=$(az ad signed-in-user show --query id -o tsv)
subscriptionId=$(az account show --query id -o tsv)
az role assignment list --assignee $currentUserId --scope "/subscriptions/$subscriptionId" --query "[].roleDefinitionName" -o table&lt;/LI-CODE&gt;
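&lt;P&gt;The prerequisite list asks for Azure CLI 2.49.0 or later, while the commands above only print the installed version. A minimal sketch of an automated comparison follows. It assumes a sort that supports -V (version sort); version_ge and check_az_cli are hypothetical helper names and are not part of the repository:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumption: sort -V (version sort) is available on your platform.
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# check_az_cli (hypothetical helper): compares the installed Azure CLI with the 2.49.0 minimum.
check_az_cli() {
  installed=$(az version --query '"azure-cli"' -o tsv)
  if version_ge "$installed" "2.49.0"; then
    echo "Azure CLI $installed is new enough"
  else
    echo "Azure CLI $installed is older than 2.49.0; run: az upgrade"
    return 1
  fi
}&lt;/LI-CODE&gt;
&lt;P&gt;Call check_az_cli after az login. A plain string comparison would rank 2.100.0 below 2.49.0, which is why the sketch uses a version sort.&lt;/P&gt;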
&lt;H3&gt;4. Azure permissions setup&amp;nbsp;&lt;/H3&gt;
&lt;P&gt;The repository requires specific Azure RBAC roles that are often missed:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Get current subscription and user
subscriptionId=$(az account show --query id -o tsv)
currentUserId=$(az ad signed-in-user show --query id -o tsv)

# Support Request Contributor - Required to create/manage Azure support tickets
az role assignment create \
  --assignee $currentUserId \
  --role "Support Request Contributor" \
  --scope "/subscriptions/$subscriptionId"

# Reader - Required to list and view Azure resources in the bot
az role assignment create \
  --assignee $currentUserId \
  --role "Reader" \
  --scope "/subscriptions/$subscriptionId"&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;Why these roles are required:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Support Request Contributor: Allows creating and managing Azure support tickets&lt;/LI&gt;
&lt;LI&gt;Reader: Allows the bot to list subscriptions, services, and resources in dropdown menus&lt;/LI&gt;
&lt;/UL&gt;
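&lt;P&gt;To confirm that both roles are in place before moving on, a small check along these lines may help. This is a sketch: missing_roles is a hypothetical helper that compares role names only, so a broader role such as Owner is not treated as covering Reader:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# missing_roles ASSIGNED: prints each required role that is absent from ASSIGNED,
# where ASSIGNED is a newline-separated list of role names.
missing_roles() {
  for role in "Support Request Contributor" "Reader"; do
    if ! printf '%s\n' "$1" | grep -qxF "$role"; then
      printf '%s\n' "$role"
    fi
  done
}

# Example wiring (assumes currentUserId and subscriptionId are set as in the block above):
# assigned=$(az role assignment list --assignee "$currentUserId" \
#   --scope "/subscriptions/$subscriptionId" --query "[].roleDefinitionName" -o tsv)
# missing_roles "$assigned"&lt;/LI-CODE&gt;
&lt;P&gt;Empty output means both roles were found at the subscription scope.&lt;/P&gt;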
&lt;H3&gt;5. Local development setup&lt;/H3&gt;
&lt;H4&gt;5.1 Clone and initialize project&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;git clone https://github.com/Azure-Samples/azure-support-slack-bot.git
cd azure-support-slack-bot

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/macOS

# Install dependencies
pip install -r requirements.txt

# Create local environment file
cp .env-example .env&lt;/LI-CODE&gt;
&lt;H4&gt;5.2 Create Slack app with manifest&lt;/H4&gt;
&lt;P&gt;Using the provided manifest is crucial for correct configuration:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Visit &lt;A class="lia-external-url" href="https://api.slack.com/apps" target="_blank" rel="noopener"&gt;Slack API Apps&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Click "&lt;STRONG&gt;Create New App&lt;/STRONG&gt;" → "&lt;STRONG&gt;From a manifest&lt;/STRONG&gt;"&lt;/LI&gt;
&lt;LI&gt;Choose &lt;STRONG&gt;YAML&lt;/STRONG&gt; and paste the contents from &lt;A class="lia-external-url" href="https://github.com/Azure-Samples/azure-support-slack-bot/blob/main/slack_app_manifest.yaml" target="_blank" rel="noopener"&gt;slack_app_manifest.yaml&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Click &lt;STRONG&gt;Next&lt;/STRONG&gt; →&amp;nbsp;&lt;STRONG&gt;Create&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;Copy the&amp;nbsp;&lt;STRONG&gt;Signing Secret&lt;/STRONG&gt;&amp;nbsp;from Basic Information&lt;/LI&gt;
&lt;LI&gt;Important: Click "&lt;STRONG&gt;Install App&lt;/STRONG&gt;" → "&lt;STRONG&gt;Install to Workspace&lt;/STRONG&gt;" to generate the Bot User OAuth Token (xoxb-...)&lt;/LI&gt;
&lt;LI&gt;After installation, copy the&amp;nbsp;&lt;STRONG&gt;Bot User OAuth Token&lt;/STRONG&gt;&amp;nbsp;from the OAuth &amp;amp; Permissions page&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4&gt;5.3 Local testing with ngrok&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Edit .env with your tokens (local development only)
# SLACK_SIGNING_SECRET=your-signing-secret-here
# SLACK_BOT_TOKEN=xoxb-your-bot-token-here


# Terminal 1: Start the Flask app 
python app.py&lt;/LI-CODE&gt;
&lt;P&gt;Expected output:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;INFO:azure_support:Azure credentials configured successfully
INFO:azure_support:Preloading subscriptions completed
⚡️ Bolt app is running on port 5000!&lt;/LI-CODE&gt;&lt;LI-CODE lang="bash"&gt;# Terminal 2: Create ngrok tunnel 
ngrok http 5000&lt;/LI-CODE&gt;
&lt;P&gt;Copy the &lt;STRONG&gt;https&lt;/STRONG&gt; forwarding URL (e.g., &lt;A class="lia-external-url" href="https://abc123.ngrok-free.app" target="_blank" rel="noopener"&gt;https://abc123.ngrok-free.app&lt;/A&gt;).&lt;/P&gt;
&lt;H4&gt;5.4 Update Slack app manifest for local testing&lt;/H4&gt;
&lt;P&gt;This is the critical step that's often missed:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;In your Slack app settings, go to&amp;nbsp;&lt;STRONG&gt;App Manifest&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;Replace &lt;STRONG&gt;ALL instances&lt;/STRONG&gt; of YOUR-DOMAIN-NAME with your ngrok domain&lt;/LI&gt;
&lt;LI&gt;Example replacement:&lt;/LI&gt;
&lt;/OL&gt;
&lt;LI-CODE lang="yaml"&gt;   # Before
   request_url: https://YOUR-DOMAIN-NAME/slack/events
   
   # After  
   request_url: https://abc123.ngrok-free.app/slack/events&lt;/LI-CODE&gt;
&lt;OL start="4"&gt;
&lt;LI&gt;Click&amp;nbsp;&lt;STRONG&gt;Save Changes&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;Go to &lt;STRONG&gt;Install App&lt;/STRONG&gt;&amp;nbsp;and install it to your workspace&lt;/LI&gt;
&lt;LI&gt;Copy the &lt;STRONG&gt;Bot User OAuth Token&lt;/STRONG&gt; (xoxb-...)&lt;/LI&gt;
&lt;/OL&gt;
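&lt;P&gt;Replacing every placeholder by hand is easy to get wrong. As a sketch, the substitution from step 2 can be scripted and the result pasted into App Manifest. It assumes the placeholder text is exactly YOUR-DOMAIN-NAME; render_manifest is a hypothetical helper name:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# render_manifest DOMAIN: reads a manifest on stdin and writes it to stdout with
# every YOUR-DOMAIN-NAME placeholder replaced by DOMAIN.
render_manifest() {
  sed "s|YOUR-DOMAIN-NAME|$1|g"
}

# Example (run from the repository root, then paste the output into App Manifest):
# cat slack_app_manifest.yaml | render_manifest abc123.ngrok-free.app&lt;/LI-CODE&gt;
&lt;P&gt;The same helper works later for the production hostname, since only the domain changes between environments.&lt;/P&gt;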
&lt;H4&gt;5.5 Test local integration&lt;/H4&gt;
&lt;OL&gt;
&lt;LI&gt;Invite the bot to channels:&amp;nbsp;&amp;nbsp;&lt;/LI&gt;
&lt;/OL&gt;
&lt;LI-CODE lang="bash"&gt;/invite @azure-support&lt;/LI-CODE&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;Test the slash command:&lt;/LI&gt;
&lt;/OL&gt;
&lt;LI-CODE lang="bash"&gt;/azure-support&lt;/LI-CODE&gt;
&lt;OL start="3"&gt;
&lt;LI&gt;You should be able to see this screen:&lt;BR /&gt;&lt;BR /&gt;&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;OL start="4"&gt;
&lt;LI&gt;Monitor logs:&lt;/LI&gt;
&lt;/OL&gt;
&lt;LI-CODE lang="bash"&gt;   # Check your Python terminal for incoming requests

   INFO:azure_support:Opened modal for support request&lt;/LI-CODE&gt;
&lt;H3&gt;6. Azure infrastructure setup&lt;/H3&gt;
&lt;H4&gt;6.1 Define resource names&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Set consistent naming convention
RG="rg-slack-support-prod"
LOCATION="eastus"
ACR_NAME="acrsupport$RANDOM"  # Must be globally unique; $RANDOM only makes a collision unlikely
ENV_NAME="aca-slack-env"
APP_NAME="slack-support-app"
KV_NAME="kv-slack-$RANDOM"    # Must also be globally unique
UAMI_NAME="id-slack-support"
LAW_NAME="law-slack-support"

# Print the generated names and confirm the registry name is still available
echo "ACR Name: $ACR_NAME"
echo "Key Vault: $KV_NAME"
az acr check-name --name $ACR_NAME --query nameAvailable -o tsv&lt;/LI-CODE&gt;
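&lt;P&gt;If you change the prefixes, it is worth checking the names against Azure's naming rules before creating anything. The sketch below encodes the documented rules for registry and Key Vault names as offline checks; valid_acr_name and valid_kv_name are hypothetical helper names, and neither check tells you whether a name is already taken:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# valid_acr_name NAME: registry names are 5-50 alphanumeric characters.
valid_acr_name() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9]{5,50}$'
}

# valid_kv_name NAME: Key Vault names are 3-24 characters (letters, digits, hyphens),
# start with a letter, end with a letter or digit, and contain no consecutive hyphens.
valid_kv_name() {
  case "$1" in
    *--*) return 1 ;;
  esac
  printf '%s' "$1" | grep -Eq '^[A-Za-z][A-Za-z0-9-]{1,22}[A-Za-z0-9]$'
}

# Example:
# valid_acr_name "$ACR_NAME" || echo "Invalid registry name: $ACR_NAME"
# valid_kv_name "$KV_NAME" || echo "Invalid Key Vault name: $KV_NAME"&lt;/LI-CODE&gt;
&lt;P&gt;With the default prefixes both generated names pass, because $RANDOM adds at most five digits.&lt;/P&gt;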
&lt;H4&gt;6.2 Create resource group&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az group create \
  --name $RG \
  --location $LOCATION \
  --tags environment=production project=slack-support&lt;/LI-CODE&gt;
&lt;H3&gt;7. Container registry with security&lt;/H3&gt;
&lt;H4&gt;7.1 Create Azure container registry&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az acr create \
  --name $ACR_NAME \
  --resource-group $RG \
  --sku Standard \
  --admin-enabled false  # Security: No admin credentials&lt;/LI-CODE&gt;
&lt;H4&gt;7.2 Build and push image&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Login to ACR
az acr login --name $ACR_NAME

# Build and push 
IMAGE_NAME="$ACR_NAME.azurecr.io/azure-support-slack-bot:latest"

docker build -t $IMAGE_NAME .
docker push $IMAGE_NAME

# Verify image
az acr repository show \
  --name $ACR_NAME \
  --repository azure-support-slack-bot&lt;/LI-CODE&gt;
&lt;H3&gt;8. Managed Identity and RBAC Setup&lt;/H3&gt;
&lt;H4&gt;8.1 Create User-Assigned Managed Identity&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az identity create \
  --name $UAMI_NAME \
  --resource-group $RG \
  --location $LOCATION

# Get identity details
UAMI_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query id -o tsv)
UAMI_PRINCIPAL_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query principalId -o tsv)
UAMI_CLIENT_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query clientId -o tsv)&lt;/LI-CODE&gt;
&lt;H4&gt;8.2 Grant ACR pull permissions&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;ACR_ID=$(az acr show --name $ACR_NAME --resource-group $RG --query id -o tsv)

az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "AcrPull" \
  --scope $ACR_ID

# Wait for role propagation
echo "Waiting 60 seconds for role assignment propagation..."
sleep 60&lt;/LI-CODE&gt;
&lt;H4&gt;8.3 Grant Azure support API permissions&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Get subscription ID (in case it's not set from earlier)
subscriptionId=$(az account show --query id -o tsv)

# Support Request Contributor for the managed identity
az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "Support Request Contributor" \
  --scope "/subscriptions/$subscriptionId"

# Reader role for listing Azure resources
az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "Reader" \
  --scope "/subscriptions/$subscriptionId"

# Verify the role assignments were created successfully
echo "Azure Support API permissions granted to managed identity"
az role assignment list \
  --assignee $UAMI_PRINCIPAL_ID \
  --query "[].{Role:roleDefinitionName,Scope:scope}" -o table&lt;/LI-CODE&gt;
&lt;H3&gt;9. Azure Key Vault Setup&lt;/H3&gt;
&lt;H4&gt;9.1 Create Key Vault with RBAC&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az keyvault create \
  --name $KV_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --enable-rbac-authorization true \
  --retention-days 7 &lt;/LI-CODE&gt;
&lt;H4&gt;9.2 Grant Key Vault permissions&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Get current user and Key Vault scope
USER_PRINCIPAL_ID=$(az ad signed-in-user show --query id -o tsv)
KV_SCOPE=$(az keyvault show --name $KV_NAME --resource-group $RG --query id -o tsv)

# Grant admin access to current user
az role assignment create \
  --assignee $USER_PRINCIPAL_ID \
  --role "Key Vault Administrator" \
  --scope $KV_SCOPE

# Grant read access to managed identity
az role assignment create \
  --assignee $UAMI_PRINCIPAL_ID \
  --role "Key Vault Secrets User" \
  --scope $KV_SCOPE

# Wait for propagation
sleep 30&lt;/LI-CODE&gt;
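&lt;P&gt;A fixed sleep can be too short on a slow day and is wasted time on a fast one. As an alternative sketch, a small retry loop can poll until the new role assignment actually works. retry is a hypothetical helper, and the attempt count and delay in the example are arbitrary:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# retry ATTEMPTS DELAY COMMAND...: runs COMMAND until it succeeds,
# pausing DELAY seconds between tries and giving up after ATTEMPTS tries.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: wait up to about a minute for the Key Vault Administrator role to take effect.
# Listing secrets is a data-plane call, so it fails until the assignment has propagated.
# retry 6 10 az keyvault secret list --vault-name "$KV_NAME" -o none&lt;/LI-CODE&gt;
&lt;P&gt;The probe only proves access for the signed-in user. The managed identity's assignment is created moments later and may propagate on its own schedule.&lt;/P&gt;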
&lt;H4&gt;9.3 Store secrets&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Store Slack secrets (replace with your actual values)
echo "Enter your Slack Bot Token (xoxb-...):"
read -s SLACK_BOT_TOKEN

echo "Enter your Slack Signing Secret:"
read -s SLACK_SIGNING_SECRET

az keyvault secret set \
  --vault-name $KV_NAME \
  --name "slack-bot-token" \
  --value "$SLACK_BOT_TOKEN"

az keyvault secret set \
  --vault-name $KV_NAME \
  --name "slack-signing-secret" \
  --value "$SLACK_SIGNING_SECRET"

# Verify secrets are stored
az keyvault secret list --vault-name $KV_NAME --query "[].name" -o table&lt;/LI-CODE&gt;
&lt;H3&gt;10. Container Apps environment&lt;/H3&gt;
&lt;H4&gt;10.1 Create Log Analytics Workspace&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az monitor log-analytics workspace create \
  --workspace-name $LAW_NAME \
  --resource-group $RG \
  --location $LOCATION

# Get workspace details
LAW_CUSTOMER_ID=$(az monitor log-analytics workspace show \
  --workspace-name $LAW_NAME \
  --resource-group $RG \
  --query customerId -o tsv)

LAW_SHARED_KEY=$(az monitor log-analytics workspace get-shared-keys \
  --workspace-name $LAW_NAME \
  --resource-group $RG \
  --query primarySharedKey -o tsv)&lt;/LI-CODE&gt;
&lt;H4&gt;10.2 Create Container Apps environment&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az containerapp env create \
  --name $ENV_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --logs-workspace-id $LAW_CUSTOMER_ID \
  --logs-workspace-key $LAW_SHARED_KEY&lt;/LI-CODE&gt;
&lt;H3&gt;11. Deploy Container App with security&lt;/H3&gt;
&lt;H4&gt;11.1 Create Container App&lt;/H4&gt;
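&lt;LI-CODE lang="bash"&gt;# --registry-identity makes the app pull from ACR with the managed identity;
# without it the CLI falls back to ACR admin credentials, which are disabled here
az containerapp create \
  --name $APP_NAME \
  --resource-group $RG \
  --environment $ENV_NAME \
  --image $IMAGE_NAME \
  --target-port 5000 \
  --ingress external \
  --registry-server "$ACR_NAME.azurecr.io" \
  --registry-identity $UAMI_ID \
  --user-assigned $UAMI_ID \
  --min-replicas 1 \
  --max-replicas 10 \
  --cpu 0.5 \
  --memory 1Gi&lt;/LI-CODE&gt;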
&lt;H4&gt;11.2 Configure Key Vault secret references&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Create secret references to Key Vault
az containerapp secret set \
  --name $APP_NAME \
  --resource-group $RG \
  --secrets \
  "slack-bot-token=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-bot-token,identityref:$UAMI_ID" \
  "slack-signing-secret=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-signing-secret,identityref:$UAMI_ID"&lt;/LI-CODE&gt;
&lt;H4&gt;11.3 Configure environment variables&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az containerapp update \
  --name $APP_NAME \
  --resource-group $RG \
  --set-env-vars \
  "SLACK_BOT_TOKEN=secretref:slack-bot-token" \
  "SLACK_SIGNING_SECRET=secretref:slack-signing-secret" \
  "AZURE_CLIENT_ID=$UAMI_CLIENT_ID" \
  "PORT=5000"&lt;/LI-CODE&gt;
&lt;H4&gt;11.4 Configure scaling rules&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;az containerapp update \
  --name $APP_NAME \
  --resource-group $RG \
  --scale-rule-name "http-rule" \
  --scale-rule-type "http" \
  --scale-rule-http-concurrency 50 \
  --min-replicas 0 \
  --max-replicas 10&lt;/LI-CODE&gt;
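&lt;P&gt;As a rough mental model of the rule above, the platform aims for about one replica per 50 concurrent requests, bounded by the minimum and maximum. The sketch below does only that arithmetic. Real scaling decisions are made by the platform and also depend on polling and cool-down behavior, so treat desired_replicas (a hypothetical helper) as an estimate:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# desired_replicas CONCURRENT: estimated replica count for CONCURRENT in-flight requests,
# using the values configured above (concurrency 50, min 0, max 10).
desired_replicas() {
  concurrency=50
  min=0
  max=10
  # Ceiling division: 1-50 requests need 1 replica, 51-100 need 2, and so on.
  replicas=$(( ($1 + concurrency - 1) / concurrency ))
  if [ "$replicas" -gt "$max" ]; then replicas=$max; fi
  if [ "$min" -gt "$replicas" ]; then replicas=$min; fi
  echo "$replicas"
}

desired_replicas 0     # prints 0
desired_replicas 120   # prints 3
desired_replicas 1000  # prints 10&lt;/LI-CODE&gt;
&lt;P&gt;The first line is the scale-to-zero case: with no traffic the estimate drops to zero replicas, so the next request has to wait for a replica to start.&lt;/P&gt;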
&lt;H3&gt;12. Production configuration&lt;/H3&gt;
&lt;H4&gt;12.1 Get application URL&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;APP_FQDN=$(az containerapp show \
  --name $APP_NAME \
  --resource-group $RG \
  --query properties.configuration.ingress.fqdn -o tsv)

APP_URL="https://$APP_FQDN"
echo "Production URL: $APP_URL/slack/events"&lt;/LI-CODE&gt;
&lt;H4&gt;12.2 Update Slack app manifest for production&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;Critical: Replace ngrok URLs with production URLs:&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;In your Slack app settings, go to App Manifest&lt;/LI&gt;
&lt;LI&gt;Replace &lt;STRONG&gt;all &lt;/STRONG&gt;ngrok URLs with your Azure Container Apps URL:&lt;BR /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;settings:
  event_subscriptions:
    request_url: https://your-app-fqdn.region.azurecontainerapps.io/slack/events
  interactivity:
    is_enabled: true
    request_url: https://your-app-fqdn.region.azurecontainerapps.io/slack/events
    message_menu_options_url: https://your-app-fqdn.region.azurecontainerapps.io/slack/events&lt;/LI-CODE&gt;&lt;/LI&gt;
&lt;LI&gt;Click&amp;nbsp;Save Changes&lt;/LI&gt;
&lt;LI&gt;Reinstall the app&amp;nbsp;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;STRONG&gt;Critical: URL Verification Step&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;After updating your Slack App Manifest with the production URL, Slack will attempt to verify the new endpoint. This verification process is&amp;nbsp;&lt;STRONG&gt;mandatory &lt;/STRONG&gt;and must succeed before your bot will work in production.&lt;/P&gt;
&lt;P&gt;What happens during verification:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Slack sends a POST request to your new URL (https://your-app-fqdn.region.azurecontainerapps.io/slack/events)&lt;/LI&gt;
&lt;LI&gt;The request contains a challenge parameter that your Flask app must echo back&lt;/LI&gt;
&lt;LI&gt;If verification fails, Slack will reject the manifest changes&lt;/LI&gt;
&lt;/UL&gt;
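&lt;P&gt;You can rehearse this handshake before touching the manifest. The sketch below signs a url_verification payload the way Slack documents request signing (HMAC-SHA256 over v0:timestamp:body, keyed with the signing secret) and posts it to the endpoint. It assumes openssl and curl are installed; slack_signature and send_challenge are hypothetical helper names, and the challenge value is arbitrary:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# slack_signature SECRET TIMESTAMP BODY: prints the X-Slack-Signature value, which is
# "v0=" followed by the hex HMAC-SHA256 of "v0:TIMESTAMP:BODY" keyed with the signing secret.
slack_signature() {
  printf 'v0:%s:%s' "$2" "$3" | openssl dgst -sha256 -hmac "$1" | awk '{print "v0=" $NF}'
}

# send_challenge URL SECRET: posts a signed url_verification payload, similar to Slack's check.
# The challenge value "test-challenge" is arbitrary; the response should contain it.
send_challenge() {
  body='{"type":"url_verification","challenge":"test-challenge"}'
  ts=$(date +%s)
  curl -s -X POST "$1" \
    -H "Content-Type: application/json" \
    -H "X-Slack-Request-Timestamp: $ts" \
    -H "X-Slack-Signature: $(slack_signature "$2" "$ts" "$body")" \
    --data "$body"
}

# Example: send_challenge "$APP_URL/slack/events" "$SLACK_SIGNING_SECRET"&lt;/LI-CODE&gt;
&lt;P&gt;If the response does not contain the challenge value, check the container logs before saving the manifest. A signature error usually points to a wrong signing secret in Key Vault.&lt;/P&gt;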
&lt;P&gt;Common verification failures:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Container App not running: Ensure your Azure Container App is deployed and healthy&lt;/LI&gt;
&lt;LI&gt;Wrong URL format: Must end with /slack/events exactly&lt;/LI&gt;
&lt;LI&gt;HTTPS required: Slack only accepts HTTPS endpoints (Container Apps provides this automatically)&lt;/LI&gt;
&lt;LI&gt;Timeout issues: Container App must respond within Slack's timeout window&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4&gt;12.3 Bot installation and invitation&lt;/H4&gt;
&lt;P&gt;Required post-deployment steps:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Update the Slack App Manifest with the production URL&lt;/LI&gt;
&lt;LI&gt;Reinstall the bot in your Slack workspace&lt;/LI&gt;
&lt;LI&gt;Invite the bot to channels: /invite @azure-support&lt;/LI&gt;
&lt;LI&gt;Test with the /azure-support command&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;13. Testing and validation&lt;/H3&gt;
&lt;H4&gt;13.1 Health and connectivity checks&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Test basic connectivity (note: this will return 404 since the app has no root endpoint handler)
curl -f "$APP_URL/" || echo "Expected 404 - app only handles /slack/events endpoint"

# Check container app status
az containerapp show \
  --name $APP_NAME \
  --resource-group $RG \
  --query properties.provisioningState

# Check logs
az containerapp logs show \
  --name $APP_NAME \
  --resource-group $RG \
  --follow&lt;/LI-CODE&gt;
&lt;H4&gt;13.2 Functional testing&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Test Slack integration:&lt;BR /&gt;&lt;BR /&gt;&lt;LI-CODE lang="bash"&gt;/azure-support&lt;/LI-CODE&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Complete workflow:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Fill out the support ticket modal completely (details below)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Submit the form&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Verify ticket appears in Azure Portal → Help + Support&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;13.3 Opening the support request&lt;/H4&gt;
&lt;P&gt;When you open the support request form, you’ll see a few fields that need your attention:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Subject: &lt;/STRONG&gt;Think of this as your headline. Keep it short and clear.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Problem Details: &lt;/STRONG&gt;Here’s your chance to explain what’s going wrong. Be specific! The more details, the better.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Subscription, Service, Problem Type, and Resource: &lt;/STRONG&gt;Select the right options from the dropdown menus. This helps the support team route your ticket to the right experts.&lt;/LI&gt;
&lt;/UL&gt;
&lt;img /&gt;
&lt;P&gt;You’ll notice options for advanced diagnostic info. If you’re not sure, just say “Yes” (it’s recommended). Set the severity; if it’s a minor issue, pick “Minimal impact.” Then choose how you’d like to be contacted (email is usually easiest). Make sure your name and email are correct. If you want someone else to get updates, add their email too.&lt;/P&gt;
&lt;img /&gt;
&lt;img /&gt;
&lt;P&gt;Once you’ve filled everything out, click &lt;STRONG&gt;Submit&lt;/STRONG&gt;. You’ll see a confirmation message: your ticket is on its way!&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;If you chose a Slack channel, you’ll get a message like this:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;You’ll also get a link to view your ticket in the Azure portal, along with an email containing all the details you provided.&lt;/P&gt;
&lt;H3&gt;14. Production observability&lt;/H3&gt;
&lt;H4&gt;14.1 Application Insights Integration&lt;/H4&gt;
&lt;LI-CODE lang="bash"&gt;# Create Application Insights
APPINSIGHTS_NAME="ai-slack-support"

az monitor app-insights component create \
  --app $APPINSIGHTS_NAME \
  --location $LOCATION \
  --resource-group $RG \
  --workspace $LAW_NAME

# Get instrumentation key
APPINSIGHTS_KEY=$(az monitor app-insights component show \
  --app $APPINSIGHTS_NAME \
  --resource-group $RG \
  --query instrumentationKey -o tsv)

# Add to container app
az containerapp update \
  --name $APP_NAME \
  --resource-group $RG \
  --set-env-vars \
  "APPLICATIONINSIGHTS_INSTRUMENTATION_KEY=$APPINSIGHTS_KEY"&lt;/LI-CODE&gt;
&lt;H4&gt;14.2 Monitoring and alerts&lt;/H4&gt;
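&lt;LI-CODE lang="bash"&gt;# Alert when the container app receives no requests in a 5-minute window
# (a rough liveness signal that also fires during quiet periods; assumes $subscriptionId is still set from section 8.3)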
az monitor metrics alert create \
  --name "SlackBot-ContainerFailures" \
  --resource-group $RG \
  --scopes "/subscriptions/$subscriptionId/resourceGroups/$RG/providers/Microsoft.App/containerApps/$APP_NAME" \
  --condition "avg Requests &amp;lt; 1" \
  --description "Slack bot container app is not receiving requests" \
  --window-size 5m \
  --evaluation-frequency 1m

# Alert when Key Vault records no API calls in the window
# (a proxy signal; it does not detect denied requests)
az monitor metrics alert create \
  --name "SlackBot-KeyVaultAccess" \
  --resource-group $RG \
  --scopes "/subscriptions/$subscriptionId/resourceGroups/$RG/providers/Microsoft.KeyVault/vaults/$KV_NAME" \
  --condition "total ServiceApiHit &amp;lt; 1" \
  --description "No Key Vault API activity from the Slack bot" \
  --target-resource-type "Microsoft.KeyVault/vaults" \
  --target-resource-region $LOCATION \
  --window-size 5m \
  --evaluation-frequency 1m&lt;/LI-CODE&gt;
&lt;H3&gt;15. Security Hardening Checklist&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Authentication &amp;amp; Authorization&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;User-Assigned Managed Identity for all Azure resources&lt;/LI&gt;
&lt;LI&gt;RBAC-based access (no admin credentials)&lt;/LI&gt;
&lt;LI&gt;Key Vault for all secrets with proper role assignments&lt;/LI&gt;
&lt;LI&gt;Azure Support API permissions (Support Request Contributor + Reader)&lt;/LI&gt;
&lt;LI&gt;Least-privilege permissions&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Network Security &amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;HTTPS-only ingress (Container Apps provides TLS termination)&lt;/LI&gt;
&lt;LI&gt;No public admin endpoints&lt;/LI&gt;
&lt;LI&gt;Container registry private access via managed identity&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Operational Security&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Comprehensive logging with Log Analytics&lt;/LI&gt;
&lt;LI&gt;Health monitoring and alerting&lt;/LI&gt;
&lt;LI&gt;Automated vulnerability scanning of ACR images (requires enabling Microsoft Defender for Containers, which this guide does not configure)&lt;/LI&gt;
&lt;LI&gt;Secret rotation capability via Key Vault&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Application Security&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;No secrets in code or plain-text configuration (environment variables reference Key Vault-backed secrets)&lt;/LI&gt;
&lt;LI&gt;Slack request signature verification&lt;/LI&gt;
&lt;LI&gt;Input validation and sanitization (built into Slack Bolt framework)&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;16. Complete deployment script&lt;/H3&gt;
&lt;P&gt;Before running the one-command deployment script, ensure you've completed sections 3 and 4 above, then verify:&lt;BR /&gt;&lt;BR /&gt;1. You're in the repository root directory&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;pwd  # Should end with: azure-support-slack-bot
ls   # Should show: Dockerfile, requirements.txt, app.py&lt;/LI-CODE&gt;
&lt;P&gt;2. Docker is ready for building&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;docker ps  # Should not show permission errors&lt;/LI-CODE&gt;
&lt;P&gt;3. You have your Slack tokens ready&lt;/P&gt;
&lt;P&gt;Now, go from zero to a production Slack bot in one command. Save the following script as deploy-slack-bot.sh:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;#!/bin/bash
set -euo pipefail

# Check parameters
if [ $# -ne 3 ]; then
    echo "Usage: $0 &amp;lt;subscription-id&amp;gt; &amp;lt;slack-bot-token&amp;gt; &amp;lt;slack-signing-secret&amp;gt;"
    exit 1
fi

SUBSCRIPTION_ID="$1"
SLACK_BOT_TOKEN="$2"
SLACK_SIGNING_SECRET="$3"

# Configuration - modern naming conventions
RG="rg-slack-support-prod"
LOCATION="eastus"
ACR_NAME="acrsupport$RANDOM"
ENV_NAME="aca-slack-env"
APP_NAME="slack-support-app"
KV_NAME="kv-slack-$RANDOM"
UAMI_NAME="id-slack-support"
LAW_NAME="law-slack-support"

echo "🚀 Deploying secure Azure Support Slack Bot..."

# Set subscription context
az account set --subscription "$SUBSCRIPTION_ID"

# Create resource group
az group create --name $RG --location $LOCATION --tags environment=production project=slack-support

# Create ACR with security defaults
az acr create --name $ACR_NAME --resource-group $RG --sku Standard --admin-enabled false
az acr login --name $ACR_NAME

# Build and push image
IMAGE_NAME="$ACR_NAME.azurecr.io/azure-support-slack-bot:latest"
docker build -t $IMAGE_NAME .
docker push $IMAGE_NAME

# Create managed identity - zero-trust foundation
az identity create --name $UAMI_NAME --resource-group $RG --location $LOCATION
UAMI_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query id -o tsv)
UAMI_PRINCIPAL_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query principalId -o tsv)
UAMI_CLIENT_ID=$(az identity show --name $UAMI_NAME --resource-group $RG --query clientId -o tsv)

# Grant ACR pull permissions
ACR_ID=$(az acr show --name $ACR_NAME --resource-group $RG --query id -o tsv)
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "AcrPull" --scope $ACR_ID

# Grant Azure Support API permissions - least privilege
subscriptionId="$SUBSCRIPTION_ID"
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "Support Request Contributor" --scope "/subscriptions/$subscriptionId"
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "Reader" --scope "/subscriptions/$subscriptionId"

echo "Azure Support API permissions granted to managed identity"

# Create Key Vault with RBAC (no access policies)
az keyvault create --name $KV_NAME --resource-group $RG --location $LOCATION --enable-rbac-authorization true --retention-days 7
KV_SCOPE=$(az keyvault show --name $KV_NAME --resource-group $RG --query id -o tsv)

# Grant Key Vault permissions
USER_PRINCIPAL_ID=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --assignee $USER_PRINCIPAL_ID --role "Key Vault Administrator" --scope $KV_SCOPE
az role assignment create --assignee $UAMI_PRINCIPAL_ID --role "Key Vault Secrets User" --scope $KV_SCOPE

# Wait for RBAC propagation
sleep 60

# Store secrets securely
az keyvault secret set --vault-name $KV_NAME --name "slack-bot-token" --value "$SLACK_BOT_TOKEN"
az keyvault secret set --vault-name $KV_NAME --name "slack-signing-secret" --value "$SLACK_SIGNING_SECRET"

# Create observability foundation
az monitor log-analytics workspace create --workspace-name $LAW_NAME --resource-group $RG --location $LOCATION
LAW_CUSTOMER_ID=$(az monitor log-analytics workspace show --workspace-name $LAW_NAME --resource-group $RG --query customerId -o tsv)
LAW_SHARED_KEY=$(az monitor log-analytics workspace get-shared-keys --workspace-name $LAW_NAME --resource-group $RG --query primarySharedKey -o tsv)

# Create Container Apps environment
az containerapp env create --name $ENV_NAME --resource-group $RG --location $LOCATION --logs-workspace-id $LAW_CUSTOMER_ID --logs-workspace-key $LAW_SHARED_KEY

# Deploy Container App with scale-to-zero
az containerapp create \
  --name $APP_NAME \
  --resource-group $RG \
  --environment $ENV_NAME \
  --image $IMAGE_NAME \
  --target-port 5000 \
  --ingress external \
  --registry-server "$ACR_NAME.azurecr.io" \
  --registry-identity $UAMI_ID \
  --user-assigned $UAMI_ID \
  --min-replicas 0 \
  --max-replicas 10 \
  --cpu 0.5 \
  --memory 1Gi

# Configure Key Vault secret references
az containerapp secret set --name $APP_NAME --resource-group $RG --secrets \
  "slack-bot-token=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-bot-token,identityref:$UAMI_ID" \
  "slack-signing-secret=keyvaultref:https://$KV_NAME.vault.azure.net/secrets/slack-signing-secret,identityref:$UAMI_ID"

# Configure environment variables
az containerapp update --name $APP_NAME --resource-group $RG --set-env-vars \
  "SLACK_BOT_TOKEN=secretref:slack-bot-token" \
  "SLACK_SIGNING_SECRET=secretref:slack-signing-secret" \
  "AZURE_CLIENT_ID=$UAMI_CLIENT_ID" \
  "PORT=5000"

# Configure HTTP-based autoscaling
az containerapp update --name $APP_NAME --resource-group $RG \
  --scale-rule-name "http-requests" \
  --scale-rule-type "http" \
  --scale-rule-http-concurrency 50 \
  --min-replicas 0 \
  --max-replicas 10

# Get deployment results
APP_FQDN=$(az containerapp show --name $APP_NAME --resource-group $RG --query properties.configuration.ingress.fqdn -o tsv)

echo ""
echo "🎉 Deployment Complete!"
echo ""
echo "Slack Webhook URL: https://$APP_FQDN/slack/events"
echo "Resource Group: $RG"
echo "Key Vault: $KV_NAME"
echo "ACR: $ACR_NAME"
echo ""
echo "Next Steps:"
echo "1. Update your Slack App Manifest with: https://$APP_FQDN/slack/events"
echo "2. Reinstall the Slack app in your workspace"
echo "3. Invite the bot to channels: /invite @azure-support"
echo "4. Test with: /azure-support"
echo ""
echo "Monitor: az containerapp logs show -n $APP_NAME -g $RG --follow"
echo "Debug: az containerapp show -n $APP_NAME -g $RG --query properties.provisioningState"
&lt;/LI-CODE&gt;
&lt;P&gt;Usage:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;chmod +x deploy-slack-bot.sh
./deploy-slack-bot.sh "your-subscription-id" "xoxb-your-bot-token" "your-signing-secret"&lt;/LI-CODE&gt;
&lt;H4&gt;Cost Expectations&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Scale-to-zero architecture&amp;nbsp;= minimal compute costs&lt;/LI&gt;
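&lt;LI&gt;Base charges: Key Vault (about $0.03 per 10,000 operations on the Standard tier), Log Analytics ($2.30/GB ingested), and the Standard-tier container registry (billed per day)&lt;/LI&gt;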
&lt;LI&gt;Container Apps: Only charges when processing requests (true serverless)&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Deployment Notes&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Script appends&amp;nbsp;$RANDOM to resource names to reduce the chance of global name collisions (re-run it if a name is already taken)&lt;/LI&gt;
&lt;LI&gt;Takes ~8-12 minutes due to RBAC propagation delays&lt;/LI&gt;
&lt;LI&gt;After deployment,&amp;nbsp;update your Slack App Manifest&amp;nbsp;with the production URL&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Post-Deployment Steps&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Update Slack App Manifest&amp;nbsp;with your Azure Container Apps URL&lt;/LI&gt;
&lt;LI&gt;Reinstall the Slack app&amp;nbsp;(required for URL changes)&lt;/LI&gt;
&lt;LI&gt;Test with /azure-support&amp;nbsp;or the global shortcut&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;17. Cleanup&lt;/H3&gt;
&lt;P&gt;When you're ready to remove all resources:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Delete resource group (removes all resources)
az group delete --name $RG --yes --no-wait

# Purge Key Vault (if purge protection was enabled)
az keyvault purge --name $KV_NAME --location $LOCATION&lt;/LI-CODE&gt;
&lt;H3&gt;18. You’re live, what’s next?&lt;/H3&gt;
&lt;P data-start="315" data-end="487"&gt;You’ve just deployed a &lt;STRONG data-start="338" data-end="386"&gt;production-grade Slack bot for Azure Support&lt;/STRONG&gt; using a modern, secure-by-default architecture — no manual secrets, no patchy scripts, no guesswork.&lt;/P&gt;
&lt;P data-start="489" data-end="622"&gt;What you now have is more than a bot — it’s a &lt;STRONG data-start="535" data-end="612"&gt;template for how digital native teams should approach platform automation&lt;/STRONG&gt; on Azure:&lt;/P&gt;
&lt;UL data-start="624" data-end="1016"&gt;
&lt;LI data-start="624" data-end="690"&gt;&lt;STRONG data-start="629" data-end="654"&gt;Zero-trust foundation&lt;/STRONG&gt; with managed identities + Key Vault&lt;/LI&gt;
&lt;LI data-start="691" data-end="747"&gt;&lt;STRONG data-start="696" data-end="719"&gt;Dev-first workflows&lt;/STRONG&gt; for local testing and CI/CD&lt;/LI&gt;
&lt;LI data-start="748" data-end="807"&gt;&lt;STRONG data-start="753" data-end="783"&gt;Scale-to-zero architecture&lt;/STRONG&gt; on Azure Container Apps&lt;/LI&gt;
&lt;LI data-start="808" data-end="875"&gt;&lt;STRONG data-start="813" data-end="839"&gt;Built-in observability&lt;/STRONG&gt; with Log Analytics and App Insights&lt;/LI&gt;
&lt;LI data-start="876" data-end="966"&gt;&lt;STRONG data-start="882" data-end="908"&gt;RBAC-controlled access&lt;/STRONG&gt; to support APIs — no over-permissioned service principals&lt;/LI&gt;
&lt;LI data-start="967" data-end="1016"&gt;&lt;STRONG data-start="972" data-end="997"&gt;End-to-end automation&lt;/STRONG&gt; via GitHub Actions&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1018" data-end="1169"&gt;This isn't just a bot — it's a pattern. A way to wire your internal tools to your platform securely, scalably, and with full auditability from day one.&lt;/P&gt;
&lt;P data-start="1018" data-end="1169"&gt;This guide was made for fast-moving teams who prefer CLI over click-ops and automation over tribal knowledge. If you're building platforms, bots, or tools to empower your engineering org, this is a foundation you can trust.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 17:28:18 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-support-slack-bot-on-azure-container-apps-production-ready/ba-p/4436423</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-09-04T17:28:18Z</dc:date>
    </item>
    <item>
      <title>A practical guide to Azure VM SKU eligibility and zonal support monitoring</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/a-practical-guide-to-azure-vm-sku-eligibility-and-zonal-support/ba-p/4415773</link>
      <description>&lt;img /&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="793" data-end="840"&gt;&lt;STRONG data-start="793" data-end="840"&gt; Important clarification about “capacity”&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="844" data-end="929"&gt;This guide does not provide real-time, deployable capacity signals for Azure VM SKUs. The solution is based on the Azure ResourceSkus API, which exposes SKU metadata, regional availability, zonal support, and subscription-level restrictions. It can tell you whether a SKU is eligible for your subscription in a region and which zones are supported.&lt;/P&gt;
&lt;P data-start="1203" data-end="1397"&gt;It does not guarantee that capacity is available at deployment time. Azure capacity is dynamic, and allocation failures can still occur even when a SKU appears available and quota is sufficient. This solution is best used to proactively detect SKU restrictions, understand zonal exposure, and build guardrails and alternatives before deployments. For guaranteed capacity, Azure Capacity Reservations or pre-deployment validation are required.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="92" data-end="276"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="1664" data-end="1929"&gt;Look, Azure allocation failures can really derail your day. Most of the time, you only find out there’s a problem when a deployment fails. no clear early signal, no easy way to validate whether a SKU is even usable in a given region or zone for your subscription.&lt;/P&gt;
&lt;P data-start="1936" data-end="2220"&gt;After seeing this happen repeatedly with customers, I built a simple monitor that helps you proactively validate &lt;STRONG data-start="2049" data-end="2101"&gt;SKU eligibility, restrictions, and zonal support&lt;/STRONG&gt;, so you can catch “this SKU won’t work here” scenarios early and design alternatives before you hit deployment time.&lt;/P&gt;
&lt;P data-start="2227" data-end="2304"&gt;Thought I’d share it here. hopefully it saves you some of the same headaches.&lt;/P&gt;
&lt;H4&gt;What this thing does&lt;/H4&gt;
&lt;P&gt;This solution isn't fancy, but it works. Here's what it'll do for you:&lt;/P&gt;
&lt;OL&gt;
&lt;LI data-start="281" data-end="385"&gt;Checks whether specific VM SKUs are &lt;STRONG data-start="319" data-end="343"&gt;eligible and exposed&lt;/STRONG&gt; in a given region for your subscription&lt;/LI&gt;
&lt;LI data-start="386" data-end="529"&gt;Shows exactly &lt;STRONG data-start="402" data-end="429"&gt;why a SKU can’t be used&lt;/STRONG&gt; when there’s a restriction (for example, not available for the subscription or in specific zones)&lt;/LI&gt;
&lt;LI data-start="530" data-end="610"&gt;Shows which &lt;STRONG data-start="544" data-end="580"&gt;availability zones are supported&lt;/STRONG&gt; for each SKU in that region&lt;/LI&gt;
&lt;LI data-start="611" data-end="687"&gt;Suggests &lt;STRONG data-start="622" data-end="641"&gt;similar VM SKUs&lt;/STRONG&gt; you could consider when a SKU is restricted&lt;/LI&gt;
&lt;LI data-start="688" data-end="798"&gt;Logs all results to &lt;STRONG data-start="710" data-end="733"&gt;Azure Log Analytics&lt;/STRONG&gt; so you can track SKU exposure and restriction trends over time&lt;/LI&gt;
&lt;LI data-start="799" data-end="862"&gt;Runs directly from your terminal, no complex setup required&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4 data-start="2817" data-end="2850"&gt;What this solution does not do&lt;/H4&gt;
&lt;P data-start="2854" data-end="2938"&gt;This solution does not provide a real-time view of free or remaining Azure capacity. There is currently no public API that exposes live, deploy-time capacity per SKU, per zone, per region. As a result, even if a SKU appears eligible and zonally supported, deployments may still fail due to transient or regional capacity constraints.&lt;/P&gt;
&lt;P data-start="3196" data-end="3250"&gt;If you need allocation certainty, you should consider:&lt;/P&gt;
&lt;UL data-start="3255" data-end="3413"&gt;
&lt;LI data-start="3255" data-end="3286"&gt;Azure Capacity Reservations&lt;/LI&gt;
&lt;LI data-start="3289" data-end="3349"&gt;Running validation deployments as a point-in-time signal&lt;/LI&gt;
&lt;LI data-start="3352" data-end="3413"&gt;Designing for flexibility across SKUs, zones, and regions&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;How it's put together&lt;/H4&gt;
&lt;P&gt;It's pretty simple really - just two main Python scripts:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;The Monitoring Script&lt;/STRONG&gt;: Checks VM SKU eligibility, restrictions, and zonal support using Azure’s ResourceSkus API&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Log Analytics Setup&lt;/STRONG&gt;: Stores your data for later analysis (optional, but super useful)&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Here's a quick diagram:&lt;/P&gt;
&lt;img /&gt;
&lt;H4&gt;Before you start&lt;/H4&gt;
&lt;P&gt;You'll need a few things:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Azure CLI&lt;/STRONG&gt; installed and working on your machine&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# If you haven't logged in yet
az login&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;2. Azure permissions&lt;/STRONG&gt;&amp;nbsp;if you're doing the Log Analytics part:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Get your username first
USER_PRINCIPAL=$(az ad signed-in-user show --query userPrincipalName -o tsv)
echo "Looks like you're logged in as: $USER_PRINCIPAL"

# Create a resource group - you can change the name if you want
az group create --name vm-sku-monitor-rg --location eastus2

# Give yourself the right permissions
az role assignment create \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

# Double-check it worked
az role assignment list \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"&lt;/LI-CODE&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Azure can be kinda slow with permissions sometimes. If you get weird 403 errors later, maybe grab a coffee and try again in 10-15 mins.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;3. Python environment setup&lt;/STRONG&gt;:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Set up a virtual environment - don't skip this step!
# I learned this the hard way when I borked my system Python...
python3 -m venv venv

# Activate it
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install what we need
pip install azure-identity azure-mgmt-compute azure-mgmt-subscription azure-monitor-ingestion rich&lt;/LI-CODE&gt;
&lt;H4&gt;Let's build this thing&lt;/H4&gt;
&lt;P&gt;&lt;STRONG&gt;1. The VM Capacity Checking Script&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The star of the show is the monitoring script itself. This script does all the heavy lifting: checking SKU eligibility and zonal support, showing you what's happening, and logging the data for later. I'll call it &lt;A class="lia-external-url" href="https://gist.github.com/ricmmartins/7f8fd1c3408464e5ea652301017c701c" target="_blank" rel="noopener"&gt;monitor_vm_sku_capacity.py&lt;/A&gt;.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;The script uses compute_client.resource_skus.list() to evaluate SKU metadata, regional exposure, supported zones, and restriction codes. This API does not surface live allocation capacity.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
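&lt;P&gt;To make those restriction signals concrete, here is a minimal sketch of how a single ResourceSkus entry could be classified. It works on a plain dictionary shaped like one item of the REST response (locationInfo, restrictions, reasonCode). The function name classify_sku, the extra PARTIALLY RESTRICTED status, and the sample entry are illustrative assumptions, not code taken from the linked gist.&lt;/P&gt;

```python
from typing import Any, Dict, List, Optional, Tuple


def classify_sku(entry: Dict[str, Any], region: str) -> Tuple[str, List[str], Optional[str]]:
    """Return (status, usable_zones, reason_code) for one SKU entry in one region.

    `entry` is assumed to be a dict shaped like one item of the ResourceSkus
    REST response. This helper is hypothetical and only for illustration.
    """
    region = region.lower()

    # Zones the SKU is offered in for this region (empty for non-zonal regions).
    zones: List[str] = []
    for info in entry.get("locationInfo") or []:
        if info.get("location", "").lower() == region:
            zones = list(info.get("zones") or [])
            break

    reason = None
    blocked = set()
    for r in entry.get("restrictions") or []:
        info = r.get("restrictionInfo") or {}
        locations = [loc.lower() for loc in info.get("locations") or []]
        if region not in locations:
            continue
        if r.get("type") == "Location":
            # The whole region is restricted for this subscription.
            return "NOT AVAILABLE", [], r.get("reasonCode")
        # Zone-type restriction: remember which zones are blocked and why.
        blocked.update(info.get("zones") or [])
        reason = r.get("reasonCode")

    usable = sorted(z for z in zones if z not in blocked)
    if blocked and not usable:
        return "NOT AVAILABLE", [], reason
    if blocked:
        return "PARTIALLY RESTRICTED", usable, reason
    return "AVAILABLE", usable, None


# Made-up sample entry: zone 1 is restricted for the subscription.
sample = {
    "name": "Standard_D16ds_v5",
    "locations": ["eastus2"],
    "locationInfo": [{"location": "eastus2", "zones": ["1", "2", "3"]}],
    "restrictions": [{
        "type": "Zone",
        "reasonCode": "NotAvailableForSubscription",
        "restrictionInfo": {"locations": ["eastus2"], "zones": ["1"]},
    }],
}
print(classify_sku(sample, "eastus2"))
```

&lt;P&gt;Treating Location restrictions as a hard stop and Zone restrictions as a per-zone filter is one reasonable reading of the API; none of it says anything about live capacity.&lt;/P&gt;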
&lt;P&gt;&lt;STRONG&gt;&lt;BR /&gt;2. Log Analytics Setup Script&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Now for the script that sets up all the Log Analytics stuff. This part is optional, but really helpful if you want to track capacity trends over time: &lt;A class="lia-external-url" href="https://gist.github.com/ricmmartins/76b0e2e96f288a9b2635233570f5d4d7" target="_blank" rel="noopener"&gt;setup_log_analytics.py&lt;/A&gt;&lt;/P&gt;
&lt;H4&gt;Setting default region and VM SKU&lt;/H4&gt;
&lt;P&gt;You've got a few options to set your preferred region and VM SKU:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Edit script defaults&lt;/STRONG&gt;: Open monitor_vm_sku_capacity.py and look for:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;parser.add_argument('--region', type=str, default='eastus2',  # Change this!
                    help='Azure region to check (default: eastus2)')
parser.add_argument('--sku', type=str, default='Standard_D16ds_v5',  # And this!
                    help='VM SKU to check (default: Standard_D16ds_v5)')&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;2. Specify on command line&lt;/STRONG&gt;:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;python monitor_vm_sku_capacity.py --region westus2 --sku Standard_D8ds_v5&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;3. Edit config file&lt;/STRONG&gt;: After running the setup script, it creates a config.json with these values:&lt;/P&gt;
&lt;LI-CODE lang="json"&gt;{
  "region": "eastus2",
  "target_sku": "Standard_D16ds_v5",
  "check_zones": true,
  ...
}&lt;/LI-CODE&gt;
&lt;H4&gt;Finding Available Regions and SKUs&lt;/H4&gt;
&lt;P&gt;If you're wondering which regions and SKUs to monitor, here's how to get that info:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Using Azure CLI&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# List all regions
az account list-locations --query "[].name" -o tsv

# List all VM SKUs in a region 
az vm list-skus --location eastus2 --resource-type virtualMachines --query "[].name" -o tsv  

# Get detailed info about a specific SKU
az vm list-skus --location eastus2 --size Standard_D16ds_v5 -o table&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;Using Azure Portal&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Just go to the VM creation page in the portal and click "See all sizes" - you'll get a nice visual list of all available options. I sometimes just take a screenshot of this for reference.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;Using this tool&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;So here's how you use this thing. I tried to make it as simple as possible:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Set up Log Analytics first&lt;/STRONG&gt; (optional but recommended):&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;python setup_log_analytics.py&lt;/LI-CODE&gt;
&lt;P&gt;This builds all the Log Analytics stuff and spits out a config file you can use in the next step. The default options should work fine for most people, but you can customize if needed.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Run the monitoring script&lt;/STRONG&gt;:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;python monitor_vm_sku_capacity.py --config config.json&lt;/LI-CODE&gt;
&lt;P&gt;If you don't want to mess with Log Analytics, you can just run it directly:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;python monitor_vm_sku_capacity.py --region eastus2 --sku Standard_D16ds_v5&lt;/LI-CODE&gt;
&lt;P&gt;The output will look something like this (way prettier if you have the rich package installed):&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;================================================================================
AZURE VM SKU CAPACITY MONITOR - 2024-05-20 14:32:45
================================================================================

Status:       AVAILABLE
SKU:          Standard_D16ds_v5
Region:       eastus2
Subscription: My Azure Subscription (12345678-1234-1234-1234-123456789012)

Available Zones:
  - 1
  - 2
  - 3

VM SKU Specifications:
  vCPUs: 16
  MemoryGB: 64
  MaxDataDiskCount: 32
  PremiumIO: True
  AcceleratedNetworkingEnabled: True&lt;/LI-CODE&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG data-start="3654" data-end="3667"&gt;AVAILABLE&lt;/STRONG&gt;&amp;nbsp;means no subscription-level restriction was detected and the SKU is exposed in this region. It does not guarantee deploy-time capacity.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Or if the VM is unavailable:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;================================================================================
AZURE VM SKU CAPACITY MONITOR - 2024-05-20 14:32:45
================================================================================

Status:       NOT AVAILABLE
SKU:          Standard_D16ds_v5
Region:       eastus2
Subscription: My Azure Subscription (12345678-1234-1234-1234-123456789012)
Details:      SKU Standard_D16ds_v5 is not available in region eastus2

Available Zones:
  None

Restrictions:
  Type:           Zone
  Reason:         NotAvailableForSubscription
  Affected Values: eastus2

VM SKU Specifications:
  vCPUs: 16
  MemoryGB: 64
  MaxDataDiskCount: 32
  PremiumIO: True
  AcceleratedNetworkingEnabled: True

Alternative SKUs:
  - Standard_D16as_v5 (vCPUs: 16, Memory: 64 GB, Family: standardDasv5Family, Similarity: 100%)
  - Standard_D16s_v5 (vCPUs: 16, Memory: 64 GB, Family: standardDsv5Family, Similarity: 100%)
  - Standard_D16s_v4 (vCPUs: 16, Memory: 64 GB, Family: standardDsv4Family, Similarity: 100%)
  - Standard_F16s_v2 (vCPUs: 16, Memory: 32 GB, Family: standardFSv2Family, Similarity: 80%)
  - Standard_E16s_v5 (vCPUs: 16, Memory: 128 GB, Family: standardEsv5Family, Similarity: 80%)&lt;/LI-CODE&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG data-start="3808" data-end="3823"&gt;NOT AVAILABLE&amp;nbsp;&lt;/STRONG&gt;means the SKU is restricted for this subscription in this region or zone based on the ResourceSkus restriction signals.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H4&gt;&lt;BR /&gt;Setting up scheduled checks&lt;/H4&gt;
&lt;P&gt;I don't like missing things, so I set mine up to run every hour using cron:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Open crontab editor
crontab -e

# Add this line to run it every hour
# (cron runs /bin/sh, where "source" may not exist, so call the venv's Python directly)
0 * * * * cd /path/to/scripts &amp;amp;&amp;amp; ./venv/bin/python monitor_vm_sku_capacity.py --config config.json &amp;gt;&amp;gt; vm_sku_monitor.log 2&amp;gt;&amp;amp;1&lt;/LI-CODE&gt;
&lt;H4&gt;Checking your data in Log Analytics&lt;/H4&gt;
&lt;P&gt;If you set up Log Analytics, you can run all sorts of cool queries:&lt;/P&gt;
&lt;LI-CODE lang="kusto"&gt;// Basic query - see everything
VMSKUCapacity_CL
| order by TimeGenerated desc

// Find when capacity changed
VMSKUCapacity_CL
| where sku_name == "Standard_D16ds_v5" and region == "eastus2"
| project TimeGenerated, is_available
| order by TimeGenerated desc


// Simple dashboard: latest status per SKU and region
VMSKUCapacity_CL
| summarize arg_max(TimeGenerated, is_available) by sku_name, region
| extend Status = iff(is_available == true, "Available", "Not Available")
| project sku_name, region, Status, LastChecked = TimeGenerated&lt;/LI-CODE&gt;
&lt;P&gt;You can &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/tutorial-log-alert" target="_blank" rel="noopener"&gt;set up alerts too&lt;/A&gt;. That way Azure tells YOU when capacity changes, instead of you finding out during a failed deployment!&lt;/P&gt;
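&lt;P&gt;If you export the rows from the "Find when capacity changed" query, a small helper can reduce them to the moments where the status actually flipped. This is a sketch only: find_transitions and the sample history are made up for illustration, and it assumes each row is a (TimeGenerated, is_available) pair with an ISO 8601 timestamp.&lt;/P&gt;

```python
from typing import Iterable, List, Tuple


def find_transitions(rows: Iterable[Tuple[str, bool]]) -> List[Tuple[str, bool, bool]]:
    """Return (timestamp, previous, current) for every change in is_available.

    `rows` is assumed to be (TimeGenerated, is_available) pairs. ISO 8601
    timestamps in the same time zone sort correctly as plain strings.
    """
    changes = []
    previous = None
    for timestamp, available in sorted(rows):
        if previous is not None and available != previous:
            changes.append((timestamp, previous, available))
        previous = available
    return changes


# Made-up hourly samples for one SKU and region.
history = [
    ("2024-05-20T14:00:00Z", True),
    ("2024-05-20T15:00:00Z", True),
    ("2024-05-20T16:00:00Z", False),
    ("2024-05-20T17:00:00Z", True),
]
for when, before, after in find_transitions(history):
    print(f"{when}: {before} -> {after}")
```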
&lt;H4&gt;Troubleshooting&lt;/H4&gt;
&lt;P&gt;Some common problems I've run into:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;"Could not automatically detect subscription ID"&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Make sure you're logged in with&amp;nbsp;az login&lt;/LI&gt;
&lt;LI&gt;Or just provide it explicitly with&amp;nbsp;--subscription-id&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Log Analytics permission errors&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Make sure you ran the permission commands from the prerequisites section&lt;/LI&gt;
&lt;LI&gt;Azure's permissions can be weirdly slow - wait 10-15 minutes and try again&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Python environment issues&lt;/STRONG&gt;:
&lt;UL&gt;
&lt;LI&gt;Always use a virtual environment! I learned this one the hard way when I messed up my system Python&lt;/LI&gt;
&lt;LI&gt;Make sure all the packages are installed with pip install azure-identity azure-mgmt-compute azure-mgmt-subscription azure-monitor-ingestion rich&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4&gt;Next Steps&lt;/H4&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/visualize/tutorial-logs-dashboards" target="_blank" rel="noopener"&gt;Create a dashboard &lt;/A&gt;&amp;nbsp;to visualize VM SKU availability over time&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/tutorial-log-alert" target="_blank" rel="noopener"&gt;Set up alerts&amp;nbsp;&lt;/A&gt; to notify you when specific SKUs become available&lt;/LI&gt;
&lt;LI&gt;Integrate with your CI/CD pipeline to automatically select available SKUs&lt;/LI&gt;
&lt;LI&gt;For a serverless, fully managed option, create an Azure Function version of the monitoring script&lt;/LI&gt;
&lt;/OL&gt;
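&lt;P&gt;For the CI/CD idea, one simple approach is to walk an ordered preference list and take the first SKU that passed the checks. The sketch below assumes you have already collected one result per SKU with is_available and quota_ok flags; pick_sku and the sample data are hypothetical, so adapt the field names to whatever your pipeline stores.&lt;/P&gt;

```python
from typing import Dict, Iterable, Optional


def pick_sku(preferred: Iterable[str], results: Dict[str, Dict[str, bool]]) -> Optional[str]:
    """Return the first preferred SKU that is unrestricted and within quota.

    `results` maps SKU name to a dict with "is_available" and "quota_ok"
    flags (assumed field names). SKUs with no result are skipped.
    """
    for sku in preferred:
        result = results.get(sku, {})
        if result.get("is_available") and result.get("quota_ok"):
            return sku
    return None


# Made-up check results, in order of preference.
preferred = ["Standard_D16ds_v5", "Standard_D16as_v5", "Standard_D16s_v5"]
results = {
    "Standard_D16ds_v5": {"is_available": False, "quota_ok": True},
    "Standard_D16as_v5": {"is_available": True, "quota_ok": False},
    "Standard_D16s_v5": {"is_available": True, "quota_ok": True},
}
print(pick_sku(preferred, results))
```

&lt;P&gt;A pipeline should still handle the None case (no SKU passed) and an allocation failure at deploy time, since eligibility is not a capacity guarantee.&lt;/P&gt;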
&lt;H4&gt;Advanced: Bulk-Deploy Feasibility Check&lt;/H4&gt;
&lt;P data-start="946" data-end="1035"&gt;Want to validate up front whether a SKU is eligible in a region and whether your subscription quota would allow N VMs?&lt;BR data-start="1021" data-end="1024" /&gt;We combine:&lt;/P&gt;
&lt;OL data-start="1037" data-end="1187"&gt;
&lt;LI data-start="1037" data-end="1106"&gt;&lt;STRONG data-start="1040" data-end="1058"&gt;Hardware-level&lt;/STRONG&gt;: Resource SKUs API (is the SKU unrestricted?)&lt;/LI&gt;
&lt;LI data-start="1107" data-end="1187"&gt;&lt;STRONG data-start="1110" data-end="1132"&gt;Subscription-level&lt;/STRONG&gt;: Usage API (enough free vCPU cores for &lt;EM data-start="1172" data-end="1175"&gt;N&lt;/EM&gt; instances?)&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;STRONG&gt;Prerequisites already covered above:&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az login
USER_PRINCIPAL=$(az ad signed-in-user show --query userPrincipalName -o tsv)

az group create --name vm-sku-monitor-rg --location eastus2

az role assignment create \
  --assignee "$USER_PRINCIPAL" \
  --role "Monitoring Metrics Publisher" \
  --scope "/subscriptions/$(az account show --query id -o tsv)/resourcegroups/vm-sku-monitor-rg"

python3 -m venv venv &amp;amp;&amp;amp; source venv/bin/activate

pip install azure-identity azure-mgmt-compute azure-mgmt-subscription rich&lt;/LI-CODE&gt;
&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;nbsp;File: monitor_vm_sku_capacity_bulk.py&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;#!/usr/bin/env python
"""
Azure VM SKU Capacity &amp;amp; Quota Monitor (with Zone support)

Checks:
  1) Whether your target SKU is available in a region or zone
  2) Whether your subscription has enough free vCPU quota to deploy N VMs
Optionally logs results into Azure Log Analytics.
"""

import argparse
import datetime
import json
import logging
import subprocess
from typing import List, Tuple, Dict, Any

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.subscription import SubscriptionClient

# Rich for prettier tables
try:
    from rich.console import Console
    from rich.table import Table
    from rich import box
    RICH_AVAILABLE = True
except ImportError:
    RICH_AVAILABLE = False

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger("vm_sku_capacity_monitor")


def parse_arguments():
    p = argparse.ArgumentParser(
        description="Azure VM SKU Capacity &amp;amp; Quota Monitor (with zone support)"
    )
    # region/sku/count default to None so load_configuration() can tell
    # "not passed on the CLI" apart from an explicit value.
    p.add_argument("--region",        type=str,   default=None,
                   help="Azure region to check (default: eastus2)")
    p.add_argument("--sku",           type=str,   default=None,
                   help="VM SKU to check (default: Standard_D16ds_v5)")
    p.add_argument("--zone",          type=str,   default=None,
                   help="(Optional) Availability zone to check (e.g. '1')")
    p.add_argument("--count",         type=int,   default=None,
                   help="Number of VMs you plan to deploy (default: 1)")
    p.add_argument("--log-analytics", action="store_true",
                   help="Enable logging to Azure Log Analytics")
    p.add_argument("--endpoint",      type=str,
                   help="Data Collection Endpoint URI")
    p.add_argument("--rule-id",       type=str,
                   help="Data Collection Rule ID")
    p.add_argument("--stream-name",   type=str, default="Custom-VMSKUCapacity_CL",
                   help="Log Analytics stream name")
    p.add_argument("--debug",         action="store_true",
                   help="Enable debug logging")
    p.add_argument("--config",        type=str,
                   help="Path to JSON config file")
    p.add_argument("--subscription-id", type=str,
                   help="Azure Subscription ID")
    return p.parse_args()


def load_configuration(args) -&amp;gt; Dict[str, Any]:
    # Precedence: built-in defaults, then config file, then explicit CLI arguments
    cfg = {
        "region": "eastus2",
        "zone": None,
        "target_sku": "Standard_D16ds_v5",
        "desired_count": 1,
        "subscription_id": None,
        "log_analytics": {
            "enabled": args.log_analytics,
            "endpoint": args.endpoint,
            "rule_id": args.rule_id,
            "stream_name": args.stream_name
        }
    }
    if args.config:
        try:
            with open(args.config) as f:
                j = json.load(f)
            # merge known keys
            for k in ("region","zone","target_sku","desired_count","subscription_id"):
                if k in j:
                    cfg[k] = j[k]
            cfg["log_analytics"].update(j.get("log_analytics", {}))
            logger.info(f"Loaded configuration from {args.config}")
        except (OSError, json.JSONDecodeError) as e:
            logger.error(f"Failed loading config {args.config}: {e}")
    # CLI args override the file, but only when they were actually passed
    if args.region is not None: cfg["region"] = args.region
    if args.zone is not None:   cfg["zone"] = args.zone
    if args.sku is not None:    cfg["target_sku"] = args.sku
    if args.count is not None:  cfg["desired_count"] = args.count
    if args.subscription_id:
        cfg["subscription_id"] = args.subscription_id
    return cfg


def get_subscription_id(explicit: str) -&amp;gt; str:
    if explicit:
        return explicit
    # Try Azure CLI
    try:
        out = subprocess.run(
            "az account show --query id -o tsv",
            shell=True, check=True,
            stdout=subprocess.PIPE, text=True
        ).stdout.strip()
        if out:
            return out
    except (subprocess.CalledProcessError, OSError):
        pass
    # Fallback: Azure SDK
    cred = DefaultAzureCredential()
    subs = list(SubscriptionClient(cred).subscriptions.list())
    return subs[0].subscription_id if subs else None


def check_sku_availability(
    compute: ComputeManagementClient,
    region: str, sku: str, zone: str = None
) -&amp;gt; Tuple[bool, str, List[str], Dict[str, Any]]:
    """
    Returns:
      is_available (bool),
      reason (str or None),
      supported_zones (list of str),
      capabilities (dict of name→value)
    """
    # Ask the API for this region only instead of listing every SKU in every region
    skus = compute.resource_skus.list(filter=f"location eq '{region}'")
    entry = next(
        (s for s in skus
         if s.name.lower() == sku.lower()
         and region.lower() in [loc.lower() for loc in s.locations or []]),
        None
    )
    if not entry:
        return False, "NotFound", [], {}

    # Pull out SKU capabilities (vCPUs, MemoryGB, etc.) up front so every
    # return path below can report them
    caps = {c.name: c.value for c in entry.capabilities or []}

    # Find all zones where this SKU is offered in that region
    supported_zones = []
    for loc_info in entry.location_info or []:
        if loc_info.location.lower() == region.lower():
            supported_zones = loc_info.zones or []
            break

    # restrictions and restriction_info can be None, so guard before iterating
    restrictions = [r for r in entry.restrictions or [] if r.restriction_info]

    # Determine restrictions
    if zone:
        # 1) If SKU doesn’t support the requested zone
        if zone not in supported_zones:
            return False, "UnsupportedZone", supported_zones, caps
        # 2) Check zone-level restrictionInfo.zones
        restricted = [
            r for r in restrictions
            if zone in (r.restriction_info.zones or [])
        ]
    else:
        # Region-level check
        restricted = [
            r for r in restrictions
            if region.lower() in [l.lower() for l in r.restriction_info.locations or []]
        ]

    is_avail = len(restricted) == 0
    reason   = restricted[0].reason_code if restricted else None

    return is_avail, reason, supported_zones, caps


def check_quota(
    compute: ComputeManagementClient,
    region: str, vcpus_needed: int, count: int
) -&amp;gt; Tuple[int,int,bool]:
    usage = list(compute.usage.list(location=region))
    # "cores" is the total regional vCPU quota; per-family vCPU quotas
    # are not checked here and can still block a deployment
    core = next((u for u in usage if u.name.value.lower()=="cores"), None)
    free = (core.limit - core.current_value) if core else 0
    required = vcpus_needed * count
    return free, required, free &amp;gt;= required


def display(rdata: Dict[str, Any]):
    if RICH_AVAILABLE:
        c = Console()
        c.print(f"\n[bold underline]SKU Capacity &amp;amp; Quota (Zone) Check "
                f"({datetime.datetime.now():%Y-%m-%d %H:%M:%S})[/]\n")

        # Availability table
        t1 = Table(box=box.SIMPLE)
        t1.add_column("SKU"); t1.add_column("Region"); t1.add_column("Zone")
        t1.add_column("Available"); t1.add_column("Reason")
        t1.add_row(
            rdata["target_sku"], rdata["region"],
            rdata["zone"] or "-",
            "✅" if rdata["is_available"] else "❌",
            rdata["reason"] or "-"
        )
        c.print(t1)

        # Supported zones
        t0 = Table(box=box.SIMPLE)
        t0.add_column("Supported Zones")
        t0.add_row(", ".join(rdata["supported_zones"]) or "None")
        c.print(t0)

        # Quota table
        t2 = Table(box=box.SIMPLE)
        t2.add_column("Desired VMs", justify="right")
        t2.add_column("vCPUs/VM",   justify="right")
        t2.add_column("Free Cores", justify="right")
        t2.add_column("Needs Cores",justify="right")
        t2.add_column("Quota OK?",  justify="center")
        t2.add_row(
            str(rdata["desired_count"]),
            str(rdata["vcpus"]),
            str(rdata["free_cores"]),
            str(rdata["required_cores"]),
            "✅" if rdata["quota_ok"] else "❌"
        )
        c.print(t2)

    else:
        print(f"\nSKU {rdata['target_sku']} in {rdata['region']} "
              f"zone {rdata['zone'] or '-'}: "
              f"Available={rdata['is_available']} (Reason={rdata['reason']})")
        print("Supported zones:", ", ".join(rdata["supported_zones"]) or "None")
        print(f"Quota: need {rdata['required_cores']} cores, "
              f"have {rdata['free_cores']} → OK={rdata['quota_ok']}")


def main():
    args = parse_arguments()
    if args.debug:
        logger.setLevel(logging.DEBUG)

    cfg = load_configuration(args)
    cfg["subscription_id"] = get_subscription_id(cfg.get("subscription_id"))
    logger.info(f"Checking SKU {cfg['target_sku']} x{cfg['desired_count']} "
                f"in {cfg['region']} zone {cfg['zone']}")

    cred = DefaultAzureCredential()
    compute = ComputeManagementClient(cred, cfg["subscription_id"])

    # 1) SKU + zone availability
    is_avail, reason, zones, caps = check_sku_availability(
        compute, cfg["region"], cfg["target_sku"], cfg["zone"]
    )
    vcpus = int(caps.get("vCPUs", 0))

    # 2) Subscription quota check
    free, required, ok = check_quota(
        compute, cfg["region"], vcpus, cfg["desired_count"]
    )

    result = {
        "target_sku":      cfg["target_sku"],
        "region":          cfg["region"],
        "zone":            cfg["zone"],
        "supported_zones": zones,
        "desired_count":   cfg["desired_count"],
        "is_available":    is_avail,
        "reason":          reason,
        "vcpus":           vcpus,
        "free_cores":      free,
        "required_cores":  required,
        "quota_ok":        ok
    }

    display(result)

    # (Optional) send to Log Analytics…
    # [omitted for brevity]


if __name__ == "__main__":
    main()
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Run the bulk-deploy checker (region-level check)&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;python monitor_vm_sku_capacity_bulk.py \
  --region centralus \
  --sku Standard_B2s_v2 \
  --count 10 &lt;/LI-CODE&gt;
&lt;P&gt;(Optionally, add the parameters &lt;STRONG&gt;--log-analytics --endpoint &amp;lt;DCE-URI&amp;gt; --rule-id &amp;lt;DCR-ID&amp;gt;&lt;/STRONG&gt; to send the results to Log Analytics.)&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Example output&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;SKU Capacity &amp;amp; Quota (Zone) Check (2025-06-20 12:49:58)


  SKU               Region      Zone   Available   Reason
 ─────────────────────────────────────────────────────────
  Standard_B2s_v2   centralus   -      ✅          -


  Supported Zones
 ─────────────────
  1, 3, 2


  Desired VMs   vCPUs/VM   Free Cores   Needs Cores   Quota OK?
 ───────────────────────────────────────────────────────────────
           10          2          100            20      ✅&lt;/LI-CODE&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Availability in this output reflects SKU eligibility, not real-time capacity.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Run the bulk-deploy checker (zone-level check)&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;python monitor_vm_sku_capacity_bulk.py \
  --region centralus \
  --zone 2 \
  --sku Standard_B2s_v2 \
  --count 10 &lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;Example output&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;SKU Capacity &amp;amp; Quota (Zone) Check (2025-06-20 12:42:22)


  SKU               Region      Zone   Available   Reason
 ─────────────────────────────────────────────────────────
  Standard_B2s_v2   centralus   2      ✅          -


  Supported Zones
 ─────────────────
  1, 3, 2


  Desired VMs   vCPUs/VM   Free Cores   Needs Cores   Quota OK?
 ───────────────────────────────────────────────────────────────
           10          2          100            20      ✅&lt;/LI-CODE&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Availability in this output reflects SKU eligibility and zonal exposure, not real-time capacity.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
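&lt;P&gt;The quota half of that output is plain arithmetic: desired VMs times vCPUs per VM gives the cores needed, which is then compared with the free cores in the region. Here is a minimal sketch of that step. The &lt;STRONG&gt;quota_check&lt;/STRONG&gt; helper and its shortfall value are hypothetical names used for illustration, not the script's own check_quota function, and the numbers are taken from the example output above.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def quota_check(free_cores, vcpus_per_vm, desired_count):
    """Return (required_cores, shortfall, quota_ok) for a planned bulk deployment.

    Hypothetical helper for illustration; it only mirrors the arithmetic
    shown in the example output, not the script's actual check_quota.
    """
    required_cores = vcpus_per_vm * desired_count
    # Cores still missing after using everything that is free (0 means the batch fits).
    shortfall = max(0, required_cores - free_cores)
    return required_cores, shortfall, shortfall == 0


# Values from the example output: 10 VMs, 2 vCPUs each, 100 free cores.
required, shortfall, ok = quota_check(free_cores=100, vcpus_per_vm=2, desired_count=10)
print(f"need {required} cores, shortfall {shortfall}, quota OK: {ok}")&lt;/LI-CODE&gt;
&lt;P&gt;With the example values this reports 20 cores needed and no shortfall, which matches the table. A passing quota check still says nothing about real-time capacity.&lt;/P&gt;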
&lt;H4&gt;Final Thoughts&lt;/H4&gt;
&lt;P data-start="3103" data-end="3316"&gt;This solution has proven to be a valuable asset for Azure infrastructure planning. It helps teams proactively identify SKU restrictions, understand zonal exposure, and spot changes in SKU eligibility over time.&lt;/P&gt;
&lt;P data-start="3323" data-end="3533"&gt;Used correctly, it reduces surprise deployment failures by surfacing &lt;STRONG data-start="3392" data-end="3421"&gt;where SKUs cannot be used&lt;/STRONG&gt; early, enabling better design decisions around regions, zones, and alternatives before production deployments.&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;Happy monitoring!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Jan 2026 21:43:33 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/a-practical-guide-to-azure-vm-sku-eligibility-and-zonal-support/ba-p/4415773</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2026-01-20T21:43:33Z</dc:date>
    </item>
    <item>
      <title>The Digital Native's Checklist for Azure: Stuff I wish every startup knew</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/the-digital-native-s-checklist-for-azure-stuff-i-wish-every/ba-p/4406669</link>
      <description>&lt;P data-start="223" data-end="510"&gt;I’ve had the chance to work with a bunch of digital native customers — you know, those fast-moving, API-first, cloud-from-day-zero teams building the next big thing. And while no two startups are ever quite the same, I’ve noticed a pattern: the same Azure gotchas pop up again and again.&lt;/P&gt;
&lt;P data-start="512" data-end="721"&gt;So I thought, why not write down a quick checklist? Not a 100-page whitepaper. Just the stuff that actually helps — especially if you’re trying to go from MVP chaos to something a little more production-grade.&lt;/P&gt;
&lt;P data-start="723" data-end="968"&gt;This isn’t just based on my own experience (though there’s been plenty of that). I’ve pulled together insights from some awesome blog posts and official docs to consolidate the essentials into one simple checklist. Let’s jump in!&lt;/P&gt;
&lt;H3 data-start="1063" data-end="1112"&gt;Identity &amp;amp; Access: First thing to get right&lt;/H3&gt;
&lt;P data-start="1114" data-end="1210"&gt;Start here. Trust me, cleaning up Entra ID and access controls &lt;EM data-start="1177" data-end="1184"&gt;after&lt;/EM&gt; you scale is a nightmare.&lt;/P&gt;
&lt;UL data-start="1212" data-end="1862"&gt;
&lt;LI data-start="1212" data-end="1394"&gt;Use&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/security/fundamentals/identity-management-best-practices" target="_blank" rel="noopener" data-start="1221" data-end="1362"&gt;Microsoft Entra ID&lt;/A&gt; as your single source of truth.&lt;/LI&gt;
&lt;LI data-start="1395" data-end="1566"&gt;Ditch the “Owner” role everywhere. Implement&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/role-based-access-control/best-practices" target="_blank" rel="noopener" data-start="1445" data-end="1565"&gt;RBAC properly&lt;/A&gt;.&lt;/LI&gt;
&lt;LI data-start="1567" data-end="1792"&gt;Use&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/managed-identity-best-practice-recommendations" target="_blank" rel="noopener" data-start="1576" data-end="1751"&gt;Managed Identities&lt;/A&gt; instead of storing secrets in your code. It’s cleaner, safer, and modern.&lt;/LI&gt;
&lt;LI data-start="1793" data-end="1862"&gt;PIM (Privileged Identity Management) is your friend. Turn it on.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1864" data-end="2267"&gt;Extra reading:&lt;BR data-start="1882" data-end="1885" /&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/demystifying-microsoft-entra-id-tenants-and-azure-subscriptions/4155261" target="_blank" rel="noopener" data-start="1888" data-end="2072" data-lia-auto-title="Demystifying Entra Tenants and Subscriptions" data-lia-auto-title-active="0"&gt;Demystifying Entra Tenants and Subscriptions&lt;/A&gt;&lt;BR data-start="2072" data-end="2075" /&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/from-zero-to-hero-with-identity-and-access-control-in-azure-kubernetes-service/4386350" target="_blank" rel="noopener" data-start="2078" data-end="2267" data-lia-auto-title="From Zero to Hero: Identity in AKS" data-lia-auto-title-active="0"&gt;From Zero to Hero: Identity in AKS&lt;/A&gt;&lt;/P&gt;
&lt;H3 data-start="2274" data-end="2338"&gt;Networking &amp;amp; Security: You can't secure what you can’t see&lt;/H3&gt;
&lt;P data-start="2340" data-end="2430"&gt;Yes, even if you're “just prototyping.” Flat networks and open ports will haunt you later.&lt;/P&gt;
&lt;UL data-start="2432" data-end="2848"&gt;
&lt;LI data-start="2432" data-end="2602"&gt;Set up your&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/security/fundamentals/network-best-practices" target="_blank" rel="noopener" data-start="2450" data-end="2581"&gt;VNets, subnets, NSGs&lt;/A&gt; with actual thought.&lt;/LI&gt;
&lt;LI data-start="2432" data-end="2602"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-vnet-plan-design-arm" target="_blank" rel="noopener"&gt;Plan out VNet architecture&lt;/A&gt; — even if you think “we’re just testing stuff.”&lt;/LI&gt;
&lt;LI data-start="2432" data-end="2602"&gt;Turn on Defender for Cloud. &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/defender-for-cloud/defender-for-cloud-introduction#improve-your-security-posture" target="_blank" rel="noopener"&gt;The free plan gives you a lot already.&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="2603" data-end="2668"&gt;Use Azure Firewall and DDoS protection where it makes sense.&lt;/LI&gt;
&lt;LI data-start="2669" data-end="2848"&gt;Lock down public IPs and use &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview" target="_blank" rel="noopener"&gt;private endpoints&lt;/A&gt; when you can.&lt;/LI&gt;
&lt;LI data-start="2669" data-end="2848"&gt;Set up Key Vault + Managed Identity — even for “just a demo.”&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2850" data-end="3024"&gt;Bonus:&lt;/P&gt;
&lt;P data-start="2850" data-end="3024"&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/building-a-secure-and-scalable-foundation-for-your-startup-on-azure/4146456" target="_blank" rel="noopener" data-start="4287" data-end="4470" data-lia-auto-title="Building a Secure &amp;amp; Scalable Foundation" data-lia-auto-title-active="0"&gt;Building a Secure &amp;amp; Scalable Foundation&lt;/A&gt;&lt;/P&gt;
&lt;P data-start="2850" data-end="3024"&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/aks-networking-made-easy-your-comprehensive-guide/4398603" target="_blank" rel="noopener" data-start="2857" data-end="3003" data-lia-auto-title="AKS Networking Guide" data-lia-auto-title-active="0"&gt;AKS Networking Guide&lt;/A&gt; (bookmark this one).&lt;/P&gt;
&lt;H3 data-start="3031" data-end="3103"&gt;Resource Management: Don’t be that team with 243 unnamed resources&lt;/H3&gt;
&lt;P data-start="3105" data-end="3216"&gt;I once worked with a customer who had 15 “rg-dev-test-temp” resource groups. No one knew who owned them. Chaos.&lt;/P&gt;
&lt;UL data-start="3218" data-end="3636"&gt;
&lt;LI data-start="3218" data-end="3438"&gt;Follow a&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-setup-guide/organize-resources" target="_blank" rel="noopener" data-start="3232" data-end="3396"&gt;resource organization strategy&lt;/A&gt;. Management groups. Subscriptions. Do it.&lt;/LI&gt;
&lt;LI data-start="3439" data-end="3636"&gt;Use&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/tag-resources" target="_blank" rel="noopener" data-start="3449" data-end="3567"&gt;tags&lt;/A&gt; everywhere. Tag by owner, environment, cost center — whatever helps. No exceptions.&lt;/LI&gt;
&lt;/UL&gt;
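&lt;P&gt;If you want to see what "no exceptions" looks like in practice, here's a tiny sketch of a tag check you could run over an exported resource list. The required tag set, the &lt;STRONG&gt;missing_tags&lt;/STRONG&gt; helper, and the sample resource groups are all made up for illustration. In a real setup you would more likely enforce this with Azure Policy, so treat this as a quick audit idea rather than the recommended mechanism.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Hypothetical tagging standard; pick whatever keys your team agrees on.
REQUIRED_TAGS = ("owner", "environment", "cost-center")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or have an empty value.

    Keys are compared case-insensitively, so "Owner" satisfies "owner".
    """
    present = {key.lower() for key, value in resource_tags.items() if str(value).strip()}
    return [tag for tag in required if tag not in present]


# Sample data standing in for an exported list of resource groups and their tags.
resources = {
    "rg-dev-test-temp": {"environment": "dev"},
    "rg-prod-api": {"Owner": "platform-team", "environment": "prod", "cost-center": "1234"},
}

for name, tags in resources.items():
    gaps = missing_tags(tags)
    print(name, "OK" if not gaps else "missing: " + ", ".join(gaps))&lt;/LI-CODE&gt;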
&lt;H3 data-start="3643" data-end="3712"&gt;Cost &amp;amp; FinOps: Avoid billing surprises (and awkward CFO convos)&lt;/H3&gt;
&lt;P data-start="3714" data-end="3800"&gt;You &lt;EM data-start="3718" data-end="3724"&gt;will&lt;/EM&gt; get burned if you don’t track costs. It’s not “extra work” — it’s survival.&lt;/P&gt;
&lt;UL data-start="3802" data-end="4386"&gt;
&lt;LI data-start="3802" data-end="3966"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/overview-cost-management" target="_blank" rel="noopener" data-start="3807" data-end="3949"&gt;Azure Cost Management&lt;/A&gt; is free. Use it.&lt;/LI&gt;
&lt;LI data-start="3967" data-end="4127"&gt;Set&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/cost-mgt-alerts-monitor-usage-spending" target="_blank" rel="noopener" data-start="3975" data-end="4126"&gt;budgets + alerts&lt;/A&gt;. Even if it’s just $10 over, that’s your early warning system.&lt;/LI&gt;
&lt;LI data-start="4128" data-end="4296"&gt;Use&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/how-azure-advisor-can-help-you-to-optimize-cloud-resources/4372082" target="_blank" rel="noopener" data-start="4137" data-end="4285" data-lia-auto-title="Azure Advisor" data-lia-auto-title-active="0"&gt;Azure Advisor&lt;/A&gt; regularly. It’s free. It’s there. It’s helpful. Just do it.&lt;/LI&gt;
&lt;LI data-start="4128" data-end="4296"&gt;Check out those “hidden” optimizations — Reservations, Spot, Savings Plans.&lt;/LI&gt;
&lt;LI data-start="4297" data-end="4386"&gt;Learn FinOps basics from &lt;A class="lia-external-url" href="https://microsoft.github.io/finops-toolkit/" target="_blank" rel="noopener"&gt;this toolkit&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4388" data-end="4560"&gt;Also:&lt;BR data-start="4393" data-end="4396" /&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/slash-your-azure-bill-top-tips-for-startups/4141839" target="_blank" rel="noopener" data-start="4399" data-end="4560" data-lia-auto-title="Slash Your Azure Bill – Tips for Startups" data-lia-auto-title-active="0"&gt;Slash Your Azure Bill – Tips for Startups&lt;/A&gt;&lt;/P&gt;
&lt;H3 data-start="4567" data-end="4628"&gt;Monitoring &amp;amp; Observability: MELT is not just a buzzword&lt;/H3&gt;
&lt;P data-start="4630" data-end="4691"&gt;You need to know what’s happening — before your customers do.&lt;/P&gt;
&lt;UL data-start="4693" data-end="5591"&gt;
&lt;LI data-start="4693" data-end="5004"&gt;Enable&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/fundamentals/overview" target="_blank" rel="noopener" data-start="4705" data-end="4820"&gt;Azure Monitor&lt;/A&gt; and &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/the-importance-of-setting-up-service-and-resource-health-monitoring-in-azure/4372478" target="_blank" rel="noopener" data-start="4825" data-end="5003" data-lia-auto-title="Service + Resource Health" data-lia-auto-title-active="0"&gt;Service + Resource Health&lt;/A&gt;.&lt;/LI&gt;
&lt;LI data-start="5005" data-end="5215"&gt;Use&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-workbooks-advanced-customization-and-data-visualization-in-azure/4369588" target="_blank" rel="noopener" data-start="5014" data-end="5170" data-lia-auto-title="Workbooks" data-lia-auto-title-active="0"&gt;Workbooks&lt;/A&gt; to make dashboards that are actually useful.&lt;/LI&gt;
&lt;LI data-start="5216" data-end="5370"&gt;Set up&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/advanced-alerting-strategies-for-azure-monitoring/4268698" target="_blank" rel="noopener" data-start="5228" data-end="5369" data-lia-auto-title="advanced alerts" data-lia-auto-title-active="0"&gt;advanced alerts&lt;/A&gt;.&lt;/LI&gt;
&lt;LI data-start="5371" data-end="5591"&gt;MELT = Metrics, Events, Logs, Traces. Here’s a good read:&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-monitor--melt-a-comprehensive-approach-to-cloud-observability/4251166" target="_blank" rel="noopener" data-start="5434" data-end="5591" data-lia-auto-title="MELT in Azure" data-lia-auto-title-active="0"&gt;MELT in Azure&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-start="5598" data-end="5675"&gt;Infrastructure as Code: No, clicking around in the portal isn’t “agile”&lt;/H3&gt;
&lt;UL data-start="5677" data-end="6306"&gt;
&lt;LI data-start="5677" data-end="5903"&gt;Use&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview" target="_blank" rel="noopener" data-start="5687" data-end="5796"&gt;Bicep&lt;/A&gt;, ARM, or &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/developer/terraform/" target="_blank" rel="noopener" data-start="5806" data-end="5902"&gt;Terraform&lt;/A&gt; — not the portal. (Unless you're debugging.)&lt;/LI&gt;
&lt;LI data-start="5904" data-end="6110"&gt;Plug it into CI/CD.&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/devsecops/playbook/articles/infrastructure/best-practices-infrastructure-pipelines" target="_blank" rel="noopener" data-start="5929" data-end="6087"&gt;Infra pipelines&lt;/A&gt; are a thing. Use them.&lt;/LI&gt;
&lt;LI data-start="6111" data-end="6306"&gt;Add&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/from-zero-to-hero-with-azure-landing-zones/4229195" target="_blank" rel="noopener" data-start="6120" data-end="6258" data-lia-auto-title="Azure Landing Zones" data-lia-auto-title-active="0"&gt;Azure Landing Zones&lt;/A&gt; for structure, governance, and scale-readiness — even if you’re small. They scale &lt;EM data-start="3879" data-end="3885"&gt;with&lt;/EM&gt; you.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-start="6313" data-end="6394"&gt;AKS &amp;amp; App Architecture: Because most of y’all are running Kubernetes anyway&lt;/H3&gt;
&lt;UL data-start="6396" data-end="7509"&gt;
&lt;LI data-start="6396" data-end="6573"&gt;Start here: &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-kubernetes-service-%E2%80%93-a-friendly-guide-for-startups/4374796" target="_blank" rel="noopener" data-start="6410" data-end="6573" data-lia-auto-title="AKS Guide for Startups" data-lia-auto-title-active="0"&gt;AKS Guide for Startups&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="6574" data-end="7175"&gt;Learn about &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/from-zero-to-hero-mastering-storage-in-azure-kubernetes-service-aks/4397734" target="_blank" rel="noopener" data-start="6588" data-end="6739" data-lia-auto-title="storage" data-lia-auto-title-active="0"&gt;storage&lt;/A&gt;, &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/embracing-aks-built-in-upgrade-features-and-exploring-custom-solutions/4398230" target="_blank" rel="noopener" data-start="6741" data-end="6896" data-lia-auto-title="upgrades" data-lia-auto-title-active="0"&gt;upgrades&lt;/A&gt;, &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/understanding-identity-concepts-in-aks/4256435" target="_blank" rel="noopener" data-start="6898" data-end="7021" data-lia-auto-title="identity" data-lia-auto-title-active="0"&gt;identity&lt;/A&gt;, and &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/aks-standard-vs-aks-automatic-a-comprehensive-comparison/4264516" target="_blank" rel="noopener" data-start="7027" data-end="7174" data-lia-auto-title="cluster models" data-lia-auto-title-active="0"&gt;cluster models&lt;/A&gt;.&lt;/LI&gt;
&lt;LI data-start="7176" data-end="7335"&gt;Add monitoring with &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview" target="_blank" rel="noopener"&gt;Azure Monitor features for Kubernetes&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="7336" data-end="7509"&gt;And please, for the love of uptime, use the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/best-practices" target="_blank" rel="noopener"&gt;best practices for AKS&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-start="7516" data-end="7574"&gt;Azure OpenAI (AOAI): Because GenAI is everywhere now&lt;/H3&gt;
&lt;UL data-start="7576" data-end="8019"&gt;
&lt;LI data-start="7576" data-end="7775"&gt;Start with this gem: &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/startupsatmicrosoftblog/azure-openai-best-practices-a-quick-reference-guide-to-optimize-your-deployments/4403546" target="_blank" rel="noopener" data-start="7599" data-end="7775" data-lia-auto-title="AOAI Best Practices" data-lia-auto-title-active="0"&gt;AOAI Best Practices&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="7776" data-end="7943"&gt;Follow &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/on-your-data-best-practices" target="_blank" rel="noopener" data-start="7785" data-end="7913"&gt;this doc&lt;/A&gt; if you’re using your own data&lt;/LI&gt;
&lt;LI data-start="7776" data-end="7943"&gt;Familiarize yourself with &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy" target="_blank" rel="noopener"&gt;how Azure OpenAI processes and stores data&lt;/A&gt;.&lt;/LI&gt;
&lt;LI data-start="7944" data-end="8019"&gt;Watch out for data residency, concurrency, and cost — especially at scale&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-start="8026" data-end="8080"&gt;Bonus: AWS background? Here's your Rosetta Stone&lt;/H3&gt;
&lt;UL data-start="8082" data-end="8137"&gt;
&lt;LI data-start="8082" data-end="8137"&gt;👉 &lt;A class="lia-external-url" href="https://aka.ms/Azure4AWSPros" target="_blank" rel="noopener" data-start="8087" data-end="8137"&gt;Azure for AWS Pros&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-start="8144" data-end="8160"&gt;Final thought&lt;/H3&gt;
&lt;P data-start="8162" data-end="8378"&gt;This isn’t about checking every box on day one. It’s about having a clear, shared view of what “mature” looks like on Azure — for founders, devs, ops, finance, and even the intern shipping ARM templates on day three.&lt;/P&gt;
&lt;P data-start="8380" data-end="8487"&gt;Save this list. Bookmark it. Share it with your team. Better yet, build your own version and make it yours.&lt;/P&gt;
&lt;P data-start="8489" data-end="8562"&gt;Got a checklist you use or a tip you love? I’d seriously love to hear it.&lt;/P&gt;
&lt;P data-start="8564" data-end="8600"&gt;Let’s build smart, not just fast.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Apr 2025 14:50:05 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/the-digital-native-s-checklist-for-azure-stuff-i-wish-every/ba-p/4406669</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-04-22T14:50:05Z</dc:date>
    </item>
    <item>
      <title>Azure OpenAI best practices: A quick-reference guide to optimize your deployments</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-openai-best-practices-a-quick-reference-guide-to-optimize/ba-p/4403546</link>
      <description>&lt;P&gt;&lt;EM&gt;Contributors: &lt;A href="https://www.linkedin.com/in/ahmed-chowdhury-b8a35112/" target="_blank" rel="noopener"&gt;Ahmed Chowdhury&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;As organizations increasingly integrate Azure OpenAI into their applications, it's essential to be aware of the comprehensive best practices that Microsoft has published. However, these valuable resources are often dispersed across various documentation pages, making it challenging to access them efficiently.&lt;/P&gt;
&lt;P&gt;This quick-reference guide consolidates the key best practices for deploying and managing Azure OpenAI workloads. By bringing together architectural considerations, security measures, governance strategies, networking configurations, and more, this guide aims to provide a centralized resource to help you optimize your Azure OpenAI deployments effectively.&lt;/P&gt;
&lt;H3&gt;Architectural considerations&lt;/H3&gt;
&lt;P&gt;A robust architecture is the foundation of any successful Azure OpenAI deployment. Azure's Well-Architected Framework provides guidance to design and implement solutions that are reliable, secure, and efficient.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key recommendations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Design for scalability:&lt;/STRONG&gt; Utilize Azure's scalable services to handle varying loads, ensuring consistent performance during peak times.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Optimize cost:&lt;/STRONG&gt; Monitor and manage resources to avoid unnecessary expenditures. Implement auto-scaling and choose appropriate pricing tiers based on workload demands.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Example:&lt;/EM&gt; An e-commerce platform using Azure OpenAI for personalized recommendations can leverage auto-scaling to handle increased traffic during sales events, ensuring users receive timely suggestions without over-provisioning resources.&lt;/P&gt;
&lt;P&gt;For detailed architectural guidance, refer to the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/well-architected/service-guides/azure-openai" target="_blank" rel="noopener"&gt;Architecture Best Practices for Azure OpenAI Service&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;Security best practices&lt;/H3&gt;
&lt;P&gt;Protecting sensitive data and ensuring compliance are paramount when deploying AI solutions. Azure provides a comprehensive security baseline tailored for Azure OpenAI services.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key recommendations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Data encryption:&lt;/STRONG&gt; Implement encryption for data at rest and in transit to safeguard against unauthorized access.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Access controls:&lt;/STRONG&gt; Utilize Azure's Role-Based Access Control (RBAC) to restrict access to AI resources, ensuring only authorized personnel can interact with sensitive data.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Example:&lt;/EM&gt; A healthcare provider deploying Azure OpenAI for patient diagnostics should encrypt patient data and restrict access based on roles, ensuring compliance with regulations like HIPAA.&lt;/P&gt;
&lt;P&gt;For comprehensive security guidelines, consult the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/security/benchmark/azure/baselines/azure-openai-security-baseline" target="_blank" rel="noopener"&gt;Azure Security Baseline for Azure OpenAI&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;Governance strategies&lt;/H3&gt;
&lt;P&gt;Effective governance ensures that AI deployments align with organizational policies and regulatory requirements. Azure's governance recommendations provide a framework for managing AI resources.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key recommendations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Resource tagging:&lt;/STRONG&gt; Implement consistent tagging for AI resources to facilitate tracking, management, and cost allocation.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Policy enforcement:&lt;/STRONG&gt; Use Azure Policy to enforce organizational standards and assess compliance across AI resources.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Example:&lt;/EM&gt; A company can use resource tagging to allocate AI resource costs to specific departments, ensuring transparency and accountability.&lt;/P&gt;
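&lt;P&gt;To make the cost-allocation idea concrete, the following is a minimal sketch that groups cost rows by a department tag. The row layout, the &lt;STRONG&gt;cost_by_tag&lt;/STRONG&gt; helper, the "department" tag key, and the sample figures are assumptions made for illustration. Real data would come from a cost export, whose schema will differ.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from collections import defaultdict


def cost_by_tag(cost_rows, tag_key="department"):
    """Sum costs per value of tag_key; rows without the tag land in "untagged"."""
    totals = defaultdict(float)
    for row in cost_rows:
        bucket = row.get("tags", {}).get(tag_key, "untagged")
        totals[bucket] += row["cost"]
    return dict(totals)


# Hypothetical cost rows; names and amounts are invented for this sketch.
rows = [
    {"resource": "aoai-chat", "cost": 120.0, "tags": {"department": "support"}},
    {"resource": "aoai-search", "cost": 80.0, "tags": {"department": "product"}},
    {"resource": "aoai-sandbox", "cost": 15.5, "tags": {}},
    {"resource": "aoai-summaries", "cost": 40.0, "tags": {"department": "support"}},
]

for department, total in cost_by_tag(rows).items():
    print(f"{department}: {total:.2f}")&lt;/LI-CODE&gt;
&lt;P&gt;The "untagged" bucket is the useful part: it shows how much spend cannot yet be attributed to any department.&lt;/P&gt;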
&lt;P&gt;For detailed governance strategies, refer to the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/ai/platform/governance" target="_blank" rel="noopener"&gt;Governance Recommendations for AI Workloads on Azure&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;Networking considerations&lt;/H3&gt;
&lt;P&gt;Efficient and secure networking is crucial for AI workloads, especially when dealing with large datasets and real-time processing. Azure offers networking recommendations tailored for AI services.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key recommendations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Virtual networks (VNet):&lt;/STRONG&gt; Isolate AI resources within VNets to enhance security and control traffic flow.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Private endpoints:&lt;/STRONG&gt; Use private endpoints to connect securely to AI services, reducing exposure to the public internet.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="200" data-end="230"&gt;&lt;STRONG&gt;VNet Connectivity Patterns:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="232" data-end="328"&gt;When you need AI resources in two VNets to talk to each other, there are two primary approaches:&lt;/P&gt;
&lt;P&gt;&lt;U&gt;1. Gateway‑to‑Gateway VPN&lt;/U&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL data-start="557" data-end="923"&gt;
&lt;LI data-start="557" data-end="649"&gt;Encryption: Built‑in IPsec/IKE tunnel, ensuring all traffic is encrypted in transit.&lt;/LI&gt;
&lt;LI data-start="652" data-end="798"&gt;Transit Support: Enables hub‑and‑spoke or multi‑region topologies without mesh peerings; just connect each spoke to a central transit VNet.&lt;/LI&gt;
&lt;LI data-start="801" data-end="923"&gt;When to use: Regulated workloads, cross‑region connectivity, or any scenario demanding IPsec in-flight encryption.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;U&gt;2. VNet Peering&lt;/U&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-start="948" data-end="1018"&gt;Performance: Lowest latency over Microsoft’s backbone network.&lt;/LI&gt;
&lt;LI data-start="1021" data-end="1111"&gt;Cost: No gateway data‑processing charges (peering is billed per GB on traffic entering and leaving each peered VNet).&lt;/LI&gt;
&lt;LI data-start="1114" data-end="1250"&gt;When to use: VNets in the same region/tenant, where encryption‑tunnel overhead isn’t required and you want simplicity and speed.&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="1254" data-end="1265"&gt;&lt;STRONG data-start="1254" data-end="1263"&gt;Note:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1268" data-end="1627"&gt;
&lt;LI data-start="1268" data-end="1452"&gt;Peering is non‑transitive by default: A↔B and B↔C peerings don’t automatically connect A to C. To achieve transit, you need either gateway transit settings on your peerings or a VPN hub.&lt;/LI&gt;
&lt;LI data-start="1455" data-end="1627"&gt;If you require both low latency &lt;STRONG data-start="1489" data-end="1496"&gt;and&lt;/STRONG&gt; encrypted traffic, you can combine peering (data path) with Azure Route Server + NVA‑based IPsec—or stick with VPN for simplicity.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/BLOCKQUOTE&gt;
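&lt;P&gt;The non-transitive behaviour is easy to model. The sketch below treats each peering as an unordered pair of VNets and checks only for direct links, which is the default behaviour described in the note. The function names and the VNet labels A, B, and C are illustrative assumptions. The sketch deliberately ignores gateway transit and NVAs.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from itertools import combinations


def can_reach(peerings, source, target):
    """True only when a direct peering exists, because peering is not transitive."""
    return frozenset((source, target)) in peerings


def missing_for_full_mesh(vnets, peerings):
    """List the VNet pairs that would still need a peering for any-to-any reachability."""
    return [pair for pair in combinations(sorted(vnets), 2) if frozenset(pair) not in peerings]


# A is peered with B, and B is peered with C; there is no A to C peering.
peerings = {frozenset(("A", "B")), frozenset(("B", "C"))}

print("A to B:", can_reach(peerings, "A", "B"))
print("A to C:", can_reach(peerings, "A", "C"))
print("missing peerings:", missing_for_full_mesh(["A", "B", "C"], peerings))&lt;/LI-CODE&gt;
&lt;P&gt;Here the A to C pair is reported as missing even though both VNets are peered with B, which is why larger topologies usually move to a hub design instead of a full mesh.&lt;/P&gt;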
&lt;H3&gt;Quota management and optimization&lt;/H3&gt;
&lt;P&gt;Azure imposes quotas to manage resource usage effectively. Understanding and optimizing these quotas ensures uninterrupted AI operations.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key recommendations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Monitor usage:&lt;/STRONG&gt; Regularly monitor token usage and request rates to stay within allocated quotas.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Request increases proactively:&lt;/STRONG&gt; If approaching quota limits, request increases in advance to avoid service disruptions.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Example:&lt;/EM&gt; A chatbot service experiencing increased user interactions should monitor token usage and anticipate quota adjustments to maintain seamless user experiences.&lt;/P&gt;
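&lt;P&gt;One way to picture the monitoring recommendation is a small client-side tracker that adds up tokens per minute and compares the total with the deployment's tokens-per-minute quota. The sketch below is a simplification under stated assumptions: the &lt;STRONG&gt;TokenUsageTracker&lt;/STRONG&gt; class, the fixed one-minute buckets, and the 10,000 tokens-per-minute limit are hypothetical, and the service's own rate-limit accounting may differ from this model.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from collections import Counter


class TokenUsageTracker:
    """Track token usage in fixed one-minute buckets against a per-minute quota."""

    def __init__(self, tokens_per_minute_limit):
        self.limit = tokens_per_minute_limit
        self.buckets = Counter()  # minute index mapped to tokens used in that minute

    def record(self, timestamp_seconds, tokens):
        """Add the tokens consumed by one request to the bucket for its minute."""
        self.buckets[int(timestamp_seconds // 60)] += tokens

    def utilization(self, timestamp_seconds):
        """Fraction of the quota consumed in the minute containing the timestamp."""
        return self.buckets[int(timestamp_seconds // 60)] / self.limit

    def headroom(self, timestamp_seconds):
        """Tokens left in the minute containing the timestamp (never negative)."""
        used = self.buckets[int(timestamp_seconds // 60)]
        return max(0, self.limit - used)


tracker = TokenUsageTracker(tokens_per_minute_limit=10_000)  # hypothetical quota
tracker.record(5, 3_000)
tracker.record(42, 4_500)
tracker.record(61, 500)  # lands in the next one-minute bucket

print(f"utilization: {tracker.utilization(30):.0%}, headroom: {tracker.headroom(30)} tokens")&lt;/LI-CODE&gt;
&lt;P&gt;A sustained utilization close to 100% is the signal to request a quota increase before users start seeing throttled requests.&lt;/P&gt;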
&lt;P&gt;For detailed quota management, refer to:​&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits" target="_blank" rel="noopener"&gt;Azure OpenAI Service Quotas and Limits&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/quota" target="_blank" rel="noopener"&gt;Manage Azure OpenAI Service Quota&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
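&lt;P&gt;The monitoring recommendation can be sketched with the Azure CLI as shown here. The resource names are placeholders, and the two token metric names are assumptions that should be confirmed against the output of the list-definitions command before relying on them.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Placeholders (assumptions): replace with your own resource names
resourceGroup="MyResourceGroup"
openAiName="my-openai-resource"
location="eastus"

# Resource ID of the Azure OpenAI account
openAiId=$(az cognitiveservices account show \
  --resource-group "$resourceGroup" \
  --name "$openAiName" \
  --query id -o tsv)

# Current usage versus limits for the subscription in this region
az cognitiveservices usage list --location "$location" -o table

# Discover which metrics the resource actually exposes
az monitor metrics list-definitions --resource "$openAiId" -o table

# Hourly token totals (metric names assumed; confirm them in the output above)
az monitor metrics list \
  --resource "$openAiId" \
  --metric ProcessedPromptTokens GeneratedTokens \
  --interval PT1H \
  --aggregation Total \
  -o table
&lt;/LI-CODE&gt;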
&lt;P&gt;&lt;EM&gt;Example:&lt;/EM&gt; A financial institution processing real-time transactions with AI can use VNets and private endpoints to ensure data remains within a secure network boundary, mitigating risks of data breaches.&lt;/P&gt;
&lt;P&gt;For comprehensive networking guidelines, consult the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/ai/platform/networking" target="_blank" rel="noopener"&gt;Networking Recommendations for AI Workloads on Azure&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;Provisioned throughput units (PTUs)&lt;/H3&gt;
&lt;P&gt;For workloads requiring consistent and predictable performance, Azure offers Provisioned Throughput Units (PTUs).&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key recommendations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Assess workload needs: Determine if PTUs align with your workload's performance requirements and cost considerations.&lt;/LI&gt;
&lt;LI&gt;Plan for scalability: Allocate PTUs based on anticipated growth, ensuring the AI system can handle increased demand.&lt;/LI&gt;
&lt;LI&gt;Monitor utilization: Regularly monitor PTU utilization to ensure optimal performance and cost-effectiveness.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Example:&lt;/EM&gt; A streaming service using Azure OpenAI for content recommendations can deploy PTUs to guarantee consistent performance during peak viewing times.&lt;/P&gt;
&lt;P&gt;For detailed information on PTUs, refer to the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput" target="_blank" rel="noopener"&gt;Provisioned Throughput Units (PTUs) in Azure OpenAI Service&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;Monitoring and logging&lt;/H3&gt;
&lt;P&gt;Comprehensive monitoring and logging are vital for maintaining the health and performance of AI systems. Azure provides tools to monitor AI services effectively.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key recommendations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Enable diagnostic logs:&lt;/STRONG&gt; Capture detailed logs for troubleshooting and performance analysis.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Set up alerts:&lt;/STRONG&gt; Configure alerts for anomalies or performance degradation to enable proactive responses.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Use Azure Monitor:&lt;/STRONG&gt; Use Azure Monitor to collect, analyze, and act on telemetry data from your Azure OpenAI resources.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Example:&lt;/EM&gt; An online retailer using Azure OpenAI for customer support chatbots can set up alerts to detect unusual spikes in response times, allowing for immediate investigation and resolution.&lt;/P&gt;
&lt;P&gt;For comprehensive monitoring guidelines, consult the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/monitor-openai" target="_blank" rel="noopener"&gt;Monitor Azure OpenAI Service&lt;/A&gt; documentation.&lt;/P&gt;
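&lt;P&gt;One way to enable diagnostic logs is to send them to a Log Analytics workspace. The sketch below assumes an existing workspace and Azure OpenAI resource; every name in it is a placeholder, and the allLogs category group is used so that no individual log category names have to be assumed.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Placeholders (assumptions): replace with your own resource names
resourceGroup="MyResourceGroup"
openAiName="my-openai-resource"
workspaceName="my-log-analytics"

openAiId=$(az cognitiveservices account show \
  --resource-group "$resourceGroup" \
  --name "$openAiName" \
  --query id -o tsv)

workspaceId=$(az monitor log-analytics workspace show \
  --resource-group "$resourceGroup" \
  --workspace-name "$workspaceName" \
  --query id -o tsv)

# Route all log categories and all metrics to the workspace
az monitor diagnostic-settings create \
  --name openai-diagnostics \
  --resource "$openAiId" \
  --workspace "$workspaceId" \
  --logs '[{"categoryGroup":"allLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'
&lt;/LI-CODE&gt;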
&lt;H3&gt;Multi-region gateway deployment strategy for Azure OpenAI&lt;/H3&gt;
&lt;P data-start="461" data-end="882"&gt;To enhance &lt;STRONG data-start="472" data-end="487"&gt;reliability&lt;/STRONG&gt;, &lt;STRONG data-start="489" data-end="500"&gt;latency&lt;/STRONG&gt;, and &lt;STRONG data-start="506" data-end="520"&gt;resilience&lt;/STRONG&gt; for geographically distributed Azure OpenAI users, a multi-region API gateway architecture is strongly recommended. This has become a key focus for engineering teams and field specialists, and for good reason: regional outages, high traffic scenarios, or backend limitations can impact availability. A well-architected gateway setup helps mitigate these issues.&lt;/P&gt;
&lt;P data-start="884" data-end="904"&gt;&lt;U&gt;Why This Matters&lt;/U&gt;&lt;/P&gt;
&lt;UL data-start="905" data-end="1144"&gt;
&lt;LI data-start="905" data-end="995"&gt;You can route requests intelligently across multiple Azure OpenAI deployments or models.&lt;/LI&gt;
&lt;LI data-start="996" data-end="1062"&gt;You minimize latency by serving traffic from the closest region.&lt;/LI&gt;
&lt;LI data-start="1063" data-end="1144"&gt;You reduce single points of failure and improve your disaster recovery posture.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;U&gt;Implementation Patterns&lt;/U&gt;&lt;/P&gt;
&lt;P data-start="1146" data-end="1214"&gt;There are &lt;STRONG data-start="1156" data-end="1177"&gt;two main patterns&lt;/STRONG&gt; for implementing this in production:&lt;/P&gt;
&lt;P data-start="1221" data-end="1324"&gt;&lt;STRONG&gt;Option 1: Azure API Management Premium – Multi-region deployment (recommended for enterprise scale)&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="1326" data-end="1551"&gt;This option leverages &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-deploy-multi-region" target="_blank" rel="noopener" data-start="1348" data-end="1513"&gt;Azure API Management's built-in multi-region deployment capability&lt;/A&gt;, available with the &lt;STRONG data-start="1534" data-end="1550"&gt;Premium tier&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-start="1553" data-end="1566"&gt;&lt;STRONG data-start="1553" data-end="1566"&gt;Benefits:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1567" data-end="1802"&gt;
&lt;LI data-start="1567" data-end="1632"&gt;Replicates the gateway component across multiple Azure regions.&lt;/LI&gt;
&lt;LI data-start="1633" data-end="1716"&gt;Traffic is automatically routed to the nearest regional gateway based on latency.&lt;/LI&gt;
&lt;LI data-start="1717" data-end="1802"&gt;Ensures localized access points and high availability in case of regional failures.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1804" data-end="1823"&gt;&lt;STRONG data-start="1804" data-end="1823"&gt;Considerations:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1824" data-end="1932"&gt;
&lt;LI data-start="1824" data-end="1862"&gt;Requires Premium tier (higher cost).&lt;/LI&gt;
&lt;LI data-start="1863" data-end="1932"&gt;Management plane and developer portal remain in the primary region.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1939" data-end="2028"&gt;&lt;STRONG&gt;Option 2: Standard tier APIM with external load balancer (cost-effective alternative)&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="2030" data-end="2241"&gt;If Premium tier is not feasible, you can deploy &lt;STRONG data-start="2078" data-end="2105"&gt;separate APIM instances&lt;/STRONG&gt; (Standard tier or higher) in each region and use a global load balancer like Azure Front Door or Traffic Manager to distribute traffic.&lt;/P&gt;
&lt;P data-start="2243" data-end="2253"&gt;&lt;STRONG data-start="2243" data-end="2253"&gt;Steps:&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL data-start="2254" data-end="2482"&gt;
&lt;LI data-start="2254" data-end="2323"&gt;Deploy multiple APIM instances independently in different regions.&lt;/LI&gt;
&lt;LI data-start="2324" data-end="2418"&gt;Use Azure Front Door or Traffic Manager to route traffic based on geo-proximity or latency.&lt;/LI&gt;
&lt;LI data-start="2419" data-end="2482"&gt;Maintain consistent configuration across all APIM instances.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-start="2484" data-end="2499"&gt;&lt;STRONG data-start="2484" data-end="2499"&gt;Trade-offs:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="2500" data-end="2622"&gt;
&lt;LI data-start="2500" data-end="2566"&gt;No built-in multi-region replication; manual config sync needed.&lt;/LI&gt;
&lt;LI data-start="2567" data-end="2622"&gt;More flexible cost-wise and supports gradual scaling.&lt;/LI&gt;
&lt;/UL&gt;
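&lt;P&gt;The first two steps of Option 2 might look like the following sketch, which uses Traffic Manager with performance routing. The resource names, regions, and publisher details are hypothetical, and the health probe path assumes API Management's default status endpoint. Custom domains, TLS certificates, and configuration sync between instances are outside its scope.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Placeholders (assumptions): replace with your own values
resourceGroup="MyResourceGroup"
regions="eastus westeurope southeastasia"
apimPrefix="contoso-apim"   # hypothetical; APIM names must be globally unique

# Step 1: one independent Standard-tier APIM instance per region
for region in $regions; do
  az apim create \
    --resource-group "$resourceGroup" \
    --name "${apimPrefix}-${region}" \
    --location "$region" \
    --publisher-name "Contoso" \
    --publisher-email "admin@contoso.example" \
    --sku-name Standard \
    --no-wait
done

# Step 2: Traffic Manager profile that sends each client to the lowest-latency endpoint
az network traffic-manager profile create \
  --resource-group "$resourceGroup" \
  --name "${apimPrefix}-tm" \
  --routing-method Performance \
  --unique-dns-name "${apimPrefix}-tm" \
  --protocol HTTPS \
  --port 443 \
  --path "/status-0123456789abcdef"

# Register each regional gateway as an external endpoint
for region in $regions; do
  az network traffic-manager endpoint create \
    --resource-group "$resourceGroup" \
    --profile-name "${apimPrefix}-tm" \
    --name "$region" \
    --type externalEndpoints \
    --target "${apimPrefix}-${region}.azure-api.net" \
    --endpoint-location "$region"
done
&lt;/LI-CODE&gt;
&lt;P&gt;Because --no-wait returns before provisioning finishes, the APIM instances may still be deploying when the endpoints are registered; Traffic Manager marks them healthy once the probe succeeds.&lt;/P&gt;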
&lt;P data-start="2629" data-end="2682"&gt;&lt;STRONG&gt;Additional strategies to strengthen resilience&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-start="1848" data-end="1970"&gt;&lt;STRONG data-start="1850" data-end="1883"&gt;Multi-backend gateway pattern&lt;/STRONG&gt;: Configure your APIM to route requests to different OpenAI deployments/models based on performance, availability, or workload type.​&lt;/LI&gt;
&lt;LI data-start="1972" data-end="2092"&gt;&lt;STRONG data-start="1974" data-end="2005"&gt;Public backbone consumption&lt;/STRONG&gt;: Use gateways that connect via the Microsoft Public Backbone to improve performance and reduce exposure to public internet routing.​&lt;/LI&gt;
&lt;LI data-start="2094" data-end="2233"&gt;&lt;STRONG data-start="2096" data-end="2146"&gt;Business continuity &amp;amp; disaster recovery (BCDR)&lt;/STRONG&gt;: Integrate failover rules, caching, and retry policies to ensure seamless experiences during disruptions.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Example: &lt;/EM&gt;A multinational company deploying Azure OpenAI for internal employee support creates deployments in East US, West Europe, and Southeast Asia. They set up regional APIM gateways using the Premium tier and route traffic intelligently through Azure Front Door. If the East US region is unavailable, users are routed to West Europe automatically — with minimal latency impact — ensuring uptime and productivity.&lt;/P&gt;
&lt;P data-start="3627" data-end="3641"&gt;&lt;STRONG data-start="3627" data-end="3641"&gt;Resources:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="3642" data-end="4146"&gt;
&lt;LI data-start="3642" data-end="3796"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-deploy-multi-region" target="_blank" rel="noopener" data-start="3644" data-end="3794"&gt;Deploy Azure API Management across multiple regions&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="3797" data-end="3973"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-multi-backend" target="_blank" rel="noopener" data-start="3799" data-end="3971"&gt;Use a gateway in front of multiple Azure OpenAI deployments or models&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-start="3974" data-end="4146"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/business-continuity-disaster-recovery" target="_blank" rel="noopener" data-start="3976" data-end="4146"&gt;Designing for consumption through the Microsoft public backbone&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Bonus: Download the full Azure OpenAI review checklist&lt;/H3&gt;
&lt;P&gt;If you're looking for a structured way to assess your Azure OpenAI implementation, the Azure Review Checklists project now provides an &lt;STRONG&gt;AI Landing Zone checklist&lt;/STRONG&gt; with &lt;STRONG&gt;180+ best practice items&lt;/STRONG&gt; covering every critical area: Governance, Operations, Networking, Identity, Cost Management, and Business Continuity &amp;amp; Disaster Recovery (BCDR). To use it:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Download the official &lt;A class="lia-external-url" href="https://github.com/Azure/review-checklists/blob/main/spreadsheet/review_checklist.xlsm" target="_blank" rel="noopener"&gt;Review Checklist Excel Workbook&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Select &lt;STRONG&gt;AI Landing Zone&lt;/STRONG&gt; and click &lt;STRONG&gt;Import latest checklist&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Load the AI Landing Zone checklist and explore categorized recommendations with direct reference links to Microsoft documentation.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This checklist serves as a powerful tool to validate architecture decisions, uncover gaps, and guide implementation discussions across technical and governance domains.&lt;/P&gt;
&lt;H3&gt;Conclusion&lt;/H3&gt;
&lt;P&gt;By adhering to these best practices, organizations can effectively manage and secure their Azure OpenAI workloads, ensuring they are reliable, efficient, and aligned with industry standards.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Apr 2025 18:16:33 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/azure-openai-best-practices-a-quick-reference-guide-to-optimize/ba-p/4403546</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-04-25T18:16:33Z</dc:date>
    </item>
    <item>
      <title>AKS networking made easy: Your comprehensive guide</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/aks-networking-made-easy-your-comprehensive-guide/ba-p/4398603</link>
      <description>&lt;P&gt;Azure Kubernetes Service (AKS) is not just about deploying containerized applications—it’s also about architecting robust, secure, and efficient network connectivity for your clusters. In this blog post, we’ll explore the intricacies of AKS networking, clarify the different models and options available, and discuss best practices through real-world scenarios. Whether you’re just starting out or looking to fine-tune an existing deployment, this guide will help you master AKS networking.&lt;/P&gt;
&lt;H3&gt;1. AKS network topologies and connectivity&lt;/H3&gt;
&lt;P&gt;Understanding the network topology is the foundation of effective AKS networking. The&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/app-platform/aks/network-topology-and-connectivity" target="_blank" rel="noopener"&gt;Cloud Adoption Framework’s AKS network topology and connectivity guide&lt;/A&gt; provides a structured look at how AKS clusters integrate into an organization’s network fabric.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Key concepts:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Cluster connectivity: How pods, services, and external resources communicate.&lt;/LI&gt;
&lt;LI&gt;Topology options: From simple flat networks to more segmented designs.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Real-world scenario: Imagine a multi-tier application where frontend pods need to securely talk to backend services and databases. A clear network topology ensures that the traffic flow respects both performance and security requirements.&lt;/P&gt;
&lt;img&gt;This diagram illustrates a simplified view of how traffic flows from external users through an ingress controller to both frontend and backend pods.&lt;/img&gt;
&lt;H3&gt;2. Comparing AKS network models&lt;/H3&gt;
&lt;P&gt;One of the most important decisions when deploying AKS is choosing between the different networking models.&lt;/P&gt;
&lt;P&gt;Kubenet was one of the original networking drivers in Kubernetes, and it still “just works” out of the box in most on‑prem or DIY clusters. But as we’ve moved toward managed, cloud‑hosted Kubernetes, vendor‑built CNIs have become the norm—solving Kubenet’s limitations around IP‑address management, scalability and lack of overlay networking.&lt;/P&gt;
&lt;P&gt;That’s why AKS now offers a full spectrum of Azure-native CNIs: Standard (Node Subnet), Overlay, dynamic IP allocation, and Cilium-powered variants, each built to fill those gaps. Standard mode assigns pod IPs straight from your VNet subnet, Overlay draws pod IPs from a separate private CIDR so it preserves your VNet address space, dynamic IP allocation hands out pod IPs on demand from a dedicated pod subnet, and Cilium brings eBPF-driven performance and observability.&lt;/P&gt;
&lt;P&gt;The&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/concepts-network#compare-network-models" target="_blank" rel="noopener"&gt;AKS concepts on network models&lt;/A&gt; outline the primary options:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Kubenet vs. Azure CNI (Standard)&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Kubenet:&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;img&gt;Kubenet in action: Pods receive overlay network IPs, use NAT for external communication, and preserve VNET addresses.&lt;/img&gt;
&lt;UL&gt;
&lt;LI style="list-style-type: none;"&gt;
&lt;UL&gt;
&lt;LI&gt;Simplicity and flexibility: Pods receive an IP from an overlay network.&lt;/LI&gt;
&lt;LI&gt;Use case: Historically, kubenet was favored for smaller clusters or scenarios where conserving IP addresses was important.&lt;/LI&gt;
&lt;LI&gt;Important notice: &lt;STRONG&gt;Kubenet networking for Azure Kubernetes Service (AKS) will be retired on 31 March 2028&lt;/STRONG&gt;. To avoid service disruptions, upgrade clusters running kubenet to Azure Container Networking Interface (CNI) Overlay before that date. More details can be found in the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/use-network-policies" target="_blank" rel="noopener"&gt;official Microsoft documentation&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
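&lt;P&gt;For an existing kubenet cluster, the move to Azure CNI Overlay can be sketched as an in-place update. The resource names are placeholders, and because the update reimages the node pools, it is worth rehearsing on a non-production cluster first.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Placeholders (assumptions): replace with your own names
resourceGroup="MyResourceGroup"
aksName="MyKubenetCluster"

# Confirm which plugin the cluster currently uses (expected: kubenet)
az aks show \
  --resource-group "$resourceGroup" \
  --name "$aksName" \
  --query networkProfile.networkPlugin -o tsv

# Upgrade the cluster in place to Azure CNI Overlay
az aks update \
  --resource-group "$resourceGroup" \
  --name "$aksName" \
  --network-plugin azure \
  --network-plugin-mode overlay
&lt;/LI-CODE&gt;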
&lt;UL&gt;
&lt;LI&gt;Azure CNI (Standard Mode):&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img&gt;CNI Standard: Pods obtain IPs directly from the VNET, ensuring seamless integration but requiring careful IP planning&lt;/img&gt;
&lt;UL&gt;
&lt;UL&gt;
&lt;LI&gt;Full integration: Pods get IP addresses directly from the virtual network (VNET), providing seamless integration with other Azure resources.&lt;/LI&gt;
&lt;LI&gt;Scalability and integration: Ideal for large clusters and scenarios that require tight integration with Azure networking.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Azure CNI Standard vs. Azure CNI Overlay&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;When choosing a networking approach for your AKS cluster, it's important to understand the trade-offs between the two main Azure CNI variants. Azure CNI Standard assigns pod IPs directly from your Azure VNET, offering tight integration with your network infrastructure. In contrast, Azure CNI Overlay assigns pod IPs from a private CIDR that is logically separate from the VNET. The Azure networking stack routes pod-to-pod traffic directly, without VXLAN-style encapsulation, and traffic leaving the cluster is source-NATed to the node IP. This can be advantageous for large-scale deployments with limited IP space. Below is an overview of the differences between these two approaches:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure CNI Standard:&lt;/LI&gt;
&lt;/UL&gt;
&lt;img&gt;CNI Standard: Pods directly receive IPs from the VNET, allowing seamless integration but requiring careful IP planning&lt;/img&gt;
&lt;UL&gt;
&lt;UL&gt;
&lt;LI&gt;Direct IP assignment: Each pod is assigned a unique IP address from your Azure VNET.&lt;/LI&gt;
&lt;LI&gt;Full VNET integration: Enables use of VNET-level controls (like NSGs) and ensures pods are routable within your VNET.&lt;/LI&gt;
&lt;LI&gt;IP consumption: Requires careful IP planning, as each pod consumes a VNET IP.&lt;/LI&gt;
&lt;LI&gt;Learn more: &lt;A class="lia-external-url" href="http://%20https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni" target="_blank" rel="noopener"&gt;Azure CNI networking&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI&gt;Azure CNI Overlay:&lt;/LI&gt;
&lt;/UL&gt;
&lt;img&gt;CNI Overlay: Pods receive IPs from a private CIDR that is decoupled from the VNET IP space for efficient IP usage; traffic leaving the cluster is NATed to the node IP.&lt;/img&gt;
&lt;UL&gt;
&lt;UL&gt;
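&lt;LI&gt;Overlay network: Pods receive IP addresses from a private CIDR that is logically separate from the VNET; each node is assigned a /24 block from that CIDR.&lt;/LI&gt;
&lt;LI&gt;Efficient IP utilization: Decouples pod IP assignment from the VNET's IP range, which is beneficial for large-scale deployments with limited VNET address space.&lt;/LI&gt;
&lt;LI&gt;Performance consideration: Pod-to-pod traffic is routed by the Azure networking stack without tunneling encapsulation, so performance is on par with VMs in a VNET. Pod IPs are not directly reachable from outside the cluster, and outbound traffic is source-NATed to the node IP.&lt;/LI&gt;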
&lt;LI&gt;Learn more: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay" target="_blank" rel="noopener"&gt;Azure CNI overlay&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
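&lt;P&gt;Creating an Overlay cluster differs from Standard mode mainly in the plugin mode and the pod CIDR. In this sketch the resource names and the 192.168.0.0/16 pod range are illustrative assumptions; the pod CIDR only needs to avoid overlapping the VNET and any networks the cluster must reach.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Placeholders (assumptions): replace with your own names
resourceGroup="MyResourceGroup"
location="centralus"
aksName="MyOverlayAKSCluster"

az group create --name "$resourceGroup" --location "$location"

# Pods get addresses from --pod-cidr, not from the VNET subnet
az aks create \
  --resource-group "$resourceGroup" \
  --name "$aksName" \
  --location "$location" \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --generate-ssh-keys
&lt;/LI-CODE&gt;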
&lt;P&gt;&lt;STRONG&gt;Additional Azure CNI variants&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="192" data-end="287"&gt;Beyond the standard modes, Microsoft offers other variants to address different workload needs:&lt;/P&gt;
&lt;UL data-start="289" data-end="1658"&gt;
&lt;LI data-start="289" data-end="964"&gt;Azure CNI with dynamic IP allocation: Allocates pod IP addresses dynamically, reducing the need for pre-allocation and easing IP management in highly dynamic environments.&lt;BR /&gt;
&lt;UL data-start="337" data-end="964"&gt;
&lt;LI data-start="494" data-end="671"&gt;Benefits:&amp;nbsp;
&lt;UL data-start="516" data-end="671"&gt;
&lt;LI data-start="516" data-end="601"&gt;Reduces IP waste when pods are ephemeral and can be scaled up or down frequently.&lt;/LI&gt;
&lt;LI data-start="606" data-end="671"&gt;Simplifies IP address management by allocating IPs on-demand.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI data-start="674" data-end="818"&gt;When to use: Ideal for environments with rapid scaling or high pod churn, where managing a static pool of IPs can be cumbersome.&lt;/LI&gt;
&lt;LI data-start="821" data-end="964"&gt;Learn more: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni-dynamic-ip-allocation" target="_blank" rel="noopener" data-start="839" data-end="964"&gt;Azure CNI with dynamic IP allocation&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI data-start="966" data-end="1658"&gt;Azure CNI Powered by Cilium: Leverages Cilium and eBPF to provide advanced networking capabilities, enhanced security policies, and improved observability.&lt;BR /&gt;
&lt;UL data-start="1005" data-end="1658"&gt;
&lt;LI data-start="1155" data-end="1389"&gt;Benefits:
&lt;UL data-start="1177" data-end="1389"&gt;
&lt;LI data-start="1177" data-end="1270"&gt;Provides granular security and networking policies with high performance, thanks to eBPF.&lt;/LI&gt;
&lt;LI data-start="1275" data-end="1389"&gt;Enables advanced features like transparent encryption, load balancing, and deep visibility into network flows.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI data-start="1392" data-end="1535"&gt;When to use: Suitable for organizations looking for cutting-edge network security, observability, and performance improvements.&lt;/LI&gt;
&lt;LI data-start="1538" data-end="1658"&gt;Learn more: &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium" target="_blank" rel="noopener" data-start="1556" data-end="1658"&gt;Azure CNI Powered by Cilium&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
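&lt;P&gt;Both variants are selected through flags on az aks create. The sketch below assumes that a VNET with separate node and pod subnets already exists and that their resource IDs are held in nodeSubnetId and podSubnetId; those variables and the cluster names are placeholders.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Placeholders (assumptions): set these to your own values
resourceGroup="MyResourceGroup"
nodeSubnetId="REPLACE_WITH_NODE_SUBNET_RESOURCE_ID"
podSubnetId="REPLACE_WITH_POD_SUBNET_RESOURCE_ID"

# Dynamic IP allocation: nodes and pods draw addresses from separate subnets
az aks create \
  --resource-group "$resourceGroup" \
  --name "MyDynamicIpCluster" \
  --network-plugin azure \
  --vnet-subnet-id "$nodeSubnetId" \
  --pod-subnet-id "$podSubnetId" \
  --generate-ssh-keys

# Azure CNI Powered by Cilium, here combined with Overlay addressing
az aks create \
  --resource-group "$resourceGroup" \
  --name "MyCiliumCluster" \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --network-dataplane cilium \
  --generate-ssh-keys
&lt;/LI-CODE&gt;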
&lt;P&gt;&lt;STRONG&gt;Example lab: Deploying an AKS cluster with Azure CNI (Standard)&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;1. Plan IP addressing: Use the&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni?tabs=configure-networking-portal#plan-ip-addressing-for-your-cluster" target="_blank" rel="noopener"&gt;Azure CNI configuration guide&lt;/A&gt; to determine your IP range.&lt;/P&gt;
&lt;P&gt;2. Create the AKS cluster:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Variables
resourceGroup="MyResourceGroup"
location="centralus"
vnetName="MyVnet"
vnetAddressPrefix="10.0.0.0/16"
subnetName="MySubnet"
subnetAddressPrefix="10.0.1.0/24"
aksName="MyCNIAKSCluster"
serviceCidr="10.200.0.0/16"
dnsServiceIp="10.200.0.10"

# Create resource group
az group create --name "$resourceGroup" --location "$location"

# Create virtual network
az network vnet create \
  --resource-group "$resourceGroup" \
  --name "$vnetName" \
  --address-prefix "$vnetAddressPrefix"

# Create subnet within the VNET
az network vnet subnet create \
  --resource-group "$resourceGroup" \
  --vnet-name "$vnetName" \
  --name "$subnetName" \
  --address-prefix "$subnetAddressPrefix"

# Look up the subnet's resource ID instead of assembling it by hand
subnetId=$(az network vnet subnet show \
  --resource-group "$resourceGroup" \
  --vnet-name "$vnetName" \
  --name "$subnetName" \
  --query id -o tsv)

# Create the AKS cluster in that subnet using Azure CNI
az aks create \
  --resource-group "$resourceGroup" \
  --name "$aksName" \
  --location "$location" \
  --network-plugin azure \
  --vnet-subnet-id "$subnetId" \
  --service-cidr "$serviceCidr" \
  --dns-service-ip "$dnsServiceIp" \
  --enable-managed-identity \
  --generate-ssh-keys
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;AKS CNI Standard mode – Key networking parameters&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;When deploying an AKS cluster using Azure CNI in standard mode, it’s important to understand the key parameters that control network configuration:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;--service-cidr:&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;Purpose: This parameter defines the CIDR block from which Kubernetes service IPs are allocated.&lt;/LI&gt;
&lt;LI&gt;Usage: The service CIDR must be a range that does not conflict with your virtual network (VNET) or pod IP ranges.&lt;/LI&gt;
&lt;LI&gt;Example: If you specify --service-cidr 10.200.0.0/16, all cluster services (such as those created via kubectl expose) will receive IPs from this range. It’s critical to plan this CIDR carefully to ensure there are no overlaps with any other network segments in your environment.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI&gt;--dns-service-ip:&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;Purpose: This parameter designates the IP address within the service CIDR that is used for the cluster’s DNS service (typically CoreDNS).&lt;/LI&gt;
&lt;LI&gt;Usage: This IP must fall within the range defined by the service CIDR and must not be in use by any other service.&lt;/LI&gt;
&lt;LI&gt;Example: For a service CIDR of 10.200.0.0/16, you might set --dns-service-ip 10.200.0.10. This reserved IP is then used by the DNS service to resolve names for services and pods within the cluster.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/UL&gt;
&lt;P&gt;Why these settings are critical:&lt;BR /&gt;Using separate CIDR blocks for the VNET, pods, and services ensures there is no overlap, which is essential for proper routing and network isolation. While Azure CNI (standard mode) assigns pod IPs directly from the VNET, the service CIDR is distinct and is only used for service IP allocation. This separation allows you to have more control over your network design and helps prevent conflicts with external networks.&lt;/P&gt;
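&lt;P&gt;The two rules above (the service CIDR must not overlap the VNET, and the DNS service IP must sit inside the service CIDR) can be checked before running az aks create. The helper functions below are illustrative and not part of any Azure tooling; they handle IPv4 only and use integer division rather than bit operations.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  # With IFS set to ".", the unquoted expansion splits into the four octets.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# Succeed if address $1 lies inside CIDR block $2.
ip_in_cidr() {
  local base="${2%/*}" prefix="${2#*/}"
  local size=$(( 2 ** (32 - prefix) ))
  [ $(( $(ip_to_int "$1") / size )) -eq $(( $(ip_to_int "$base") / size )) ]
}

# Succeed if two CIDR blocks share any address.
cidrs_overlap() {
  ip_in_cidr "${1%/*}" "$2" || ip_in_cidr "${2%/*}" "$1"
}

# Values taken from the lab in this section
vnetAddressPrefix="10.0.0.0/16"
serviceCidr="10.200.0.0/16"
dnsServiceIp="10.200.0.10"

if cidrs_overlap "$serviceCidr" "$vnetAddressPrefix"; then
  echo "service CIDR overlaps the VNET"
else
  echo "service CIDR is separate from the VNET"
fi

if ip_in_cidr "$dnsServiceIp" "$serviceCidr"; then
  echo "DNS service IP is inside the service CIDR"
else
  echo "DNS service IP is outside the service CIDR"
fi
&lt;/LI-CODE&gt;
&lt;P&gt;With the lab's values, both checks pass: the service CIDR is separate from the VNET, and the DNS service IP falls inside the service CIDR.&lt;/P&gt;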
&lt;P&gt;3. Validate networking:&lt;/P&gt;
&lt;P&gt;Check that nodes and pods are receiving IPs from the specified VNET. The following command displays each node's internal IP address, which lets you verify that nodes are attached to the correct subnet.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get nodes -o wide&lt;/LI-CODE&gt;
&lt;P&gt;Look for the INTERNAL-IP column to confirm that each node's IP falls within the expected VNET address space.&lt;/P&gt;
&lt;P&gt;To check that pods are receiving IPs correctly, list pods across all namespaces:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pods --all-namespaces -o wide&lt;/LI-CODE&gt;
&lt;P&gt;The IP column should show addresses allocated from the VNET's defined range (for Azure CNI).&lt;/P&gt;
&lt;P&gt;For additional details on a specific node’s networking, including labels and annotations related to IP assignment, you can describe the node:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl describe node &amp;lt;node-name&amp;gt;&lt;/LI-CODE&gt;
&lt;P&gt;Replace &amp;lt;node-name&amp;gt; with one of the node names from the previous command. This output can help confirm that the node is correctly integrated with the VNET.&lt;/P&gt;
&lt;P&gt;These commands together help validate that both the node and pod IP assignments are in line with your planned IP ranges, ensuring that your network planning and model selection are correctly implemented.&lt;/P&gt;
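&lt;P&gt;To compare pod and node addresses side by side, a custom-columns view can help. This is a sketch that reuses the resource names from the lab above; pods running with host networking report the node's IP as their pod IP, which is expected.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Pod IP next to the IP of the node hosting it
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,POD_IP:.status.podIP,NODE_IP:.status.hostIP,HOST_NETWORK:.spec.hostNetwork'

# Network settings the cluster was actually created with
az aks show \
  --resource-group MyResourceGroup \
  --name MyCNIAKSCluster \
  --query "networkProfile.{plugin:networkPlugin, serviceCidr:serviceCidr, dnsServiceIp:dnsServiceIp}" \
  -o table
&lt;/LI-CODE&gt;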
&lt;H3&gt;3. Private clusters and DNS configurations&lt;/H3&gt;
&lt;P&gt;&lt;SPAN style="font-style: var(--lia-blog-font-style); font-weight: var(--lia-blog-font-weight); font-family: var(--lia-blog-font-family); background-color: var(--lia-rte-bg-color); color: var(--lia-bs-body-color); font-size: var(--lia-bs-font-size-base);"&gt;For organizations with strict security requirements, AKS offers the ability to create private clusters. Private clusters ensure that the API server is not exposed to the public internet, enhancing security.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Key topics:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Private cluster deployment: Detailed in the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal" target="_blank" rel="noopener"&gt;private clusters documentation&lt;/A&gt; and the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal#no-private-dns-zone-prerequisites" target="_blank" rel="noopener"&gt;DNS prerequisites&lt;/A&gt;.&lt;/LI&gt;
&lt;LI&gt;Private DNS: The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/dns/private-dns-overview" target="_blank" rel="noopener"&gt;Azure Private DNS overview&lt;/A&gt; explains how to leverage private DNS zones, and the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal#configure-private-dns-zone" target="_blank" rel="noopener"&gt;configuration guide&lt;/A&gt; provides a step-by-step approach to integrate it with your private clusters.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Example lab: Creating a private AKS cluster with CNI (Standard)&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;1. Deploy a private AKS cluster:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az aks create \
  --resource-group MyResourceGroup \
  --name MyPrivateAKSCluster \
  --enable-private-cluster \
  --network-plugin azure&lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;Why isn't the VNET or service CIDR specified?&lt;BR /&gt;&lt;/U&gt;&lt;BR /&gt;In this example, advanced networking parameters like the virtual network, subnet, service CIDR, and DNS service IP are not explicitly defined. This is because:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Default networking configuration: When these parameters are omitted, AKS automatically provisions a default virtual network and assigns IP ranges for the cluster. With --network-plugin azure, the cluster is created using Azure CNI. This managed configuration is sufficient for many scenarios, reducing complexity during initial deployments.&lt;/LI&gt;
&lt;LI&gt;Focus on enabling privacy: The primary goal in this scenario is to enable the private connectivity feature. By focusing on --enable-private-cluster, the example emphasizes that the API server will be accessible only within the internal network. Customizing networking settings (like specifying a particular VNET or IP ranges) is optional and can be added if you have specific integration or policy requirements.&lt;/LI&gt;
&lt;LI&gt;Flexibility and customization: If your deployment requires integration with an existing virtual network or adherence to particular IP address planning, you can extend the command to include those parameters, similar to the public cluster examples. The minimal command is provided as a baseline for simplicity and ease of deployment.&lt;/LI&gt;
&lt;/UL&gt;
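&lt;P&gt;Because the API server of a private cluster is reachable only from inside the network, kubectl run from a workstation outside the VNET will not connect. One option, sketched here with the cluster name from the example above, is to let Azure run the command inside the cluster on your behalf.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Show the private FQDN that the API server is published under
az aks show \
  --resource-group MyResourceGroup \
  --name MyPrivateAKSCluster \
  --query privateFqdn -o tsv

# Run kubectl inside the cluster without direct network access to the API server
az aks command invoke \
  --resource-group MyResourceGroup \
  --name MyPrivateAKSCluster \
  --command "kubectl get nodes -o wide"
&lt;/LI-CODE&gt;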
&lt;P&gt;2. Configure a private DNS zone:&lt;/P&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/dns/private-dns-overview" target="_blank" rel="noopener"&gt;Azure Private DNS overview&lt;/A&gt; explains how to leverage private DNS zones for name resolution within your virtual network. For private clusters, configuring a private DNS zone ensures that your cluster’s API server and internal endpoints are accessible using friendly domain names. The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal#configure-private-dns-zone" target="_blank" rel="noopener"&gt;configuration guide&lt;/A&gt; provides step-by-step instructions for this setup.&lt;/P&gt;
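&lt;P&gt;A minimal sketch of the DNS options, assuming the resource group and cluster names used earlier: the --private-dns-zone parameter accepts system (AKS creates and manages the zone), none, or the resource ID of a zone you manage yourself. A custom zone also requires an identity with permissions on that zone, which is not shown here.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# "system" lets AKS create and manage the private DNS zone
az aks create \
  --resource-group MyResourceGroup \
  --name MyPrivateAKSCluster \
  --enable-private-cluster \
  --network-plugin azure \
  --private-dns-zone system

# Show the private FQDN that resolves to the API server inside the VNET
az aks show \
  --resource-group MyResourceGroup \
  --name MyPrivateAKSCluster \
  --query privateFqdn \
  --output tsv

# Run kubectl through the Azure API when you have no network path to the cluster
az aks command invoke \
  --resource-group MyResourceGroup \
  --name MyPrivateAKSCluster \
  --command "kubectl get nodes"&lt;/LI-CODE&gt;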
&lt;P&gt;Real-world example:&amp;nbsp;&lt;SPAN style="font-style: var(--lia-blog-font-style); font-weight: var(--lia-blog-font-weight); font-family: var(--lia-blog-font-family); background-color: var(--lia-rte-bg-color); color: var(--lia-bs-body-color); font-size: var(--lia-bs-font-size-base);"&gt;Consider a financial services company that must comply with strict data residency and security guidelines. Deploying AKS as a private cluster—with a dedicated private DNS zone—ensures that all control-plane communications and sensitive endpoints remain isolated within the company’s secure virtual network. If advanced network customization is needed, additional parameters (like a pre-created VNET, custom service CIDR, etc.) can be integrated into the deployment command.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;4. Ingress, application routing, and traffic management&lt;/H3&gt;
&lt;P&gt;Managing incoming traffic is critical for any production-grade application. AKS offers several options for routing traffic:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Application Gateway for Containers&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Azure’s latest Ingress offering, &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/application-gateway/for-containers/overview" target="_blank" rel="noopener"&gt;Application Gateway for Containers&lt;/A&gt;, is the successor to Application Gateway Ingress Controller and adds performance, resiliency, and layer 7 load balancing improvements. It also adopts the Kubernetes Gateway API, so administrators and developers can declare their load balancing intent directly in Kubernetes resources.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-teams="true"&gt;&lt;STRONG&gt;Application Gateway Ingress Controller (AGIC)&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-teams="true"&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/application-gateway/ingress-controller-overview" target="_blank" rel="noopener"&gt;Azure Application Gateway Ingress Controller&lt;/A&gt; can provide advanced load balancing, SSL termination, and web application firewall capabilities.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Application routing&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;HTTP Application Routing: Historically, HTTP Application Routing was a popular option for simplifying DNS management for your applications. Note: Microsoft announced the retirement of HTTP Application Routing for 03 March 2025. If a cluster still depends on it, migrate to the Application Routing add-on to keep receiving support and enhanced functionality. For further details on migration, refer to the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/app-routing-migration" target="_blank" rel="noopener"&gt;App routing migration guide&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;
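&lt;P&gt;As a rough sketch of the migration target, the Application Routing add-on can be enabled on an existing cluster from the Azure CLI. MyAKSCluster is a placeholder name, and the steps for moving existing Ingress resources are covered in the migration guide rather than here.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Enable the Application Routing add-on (managed NGINX ingress) on an existing cluster
az aks approuting enable \
  --resource-group MyResourceGroup \
  --name MyAKSCluster

# Confirm that the ingress class created by the add-on is present
kubectl get ingressclass webapprouting.kubernetes.azure.com&lt;/LI-CODE&gt;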
&lt;P&gt;&lt;STRONG&gt;Traffic management overview&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In addition to ingress and application routing, effective traffic management involves strategies that optimize how traffic is handled within your environment. While a deep dive into these advanced topics is beyond the scope of this article, here is a brief overview:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Traffic splitting &amp;amp; canary deployments: Techniques that enable gradual rollout of new application versions by directing a portion of the traffic to new deployments while the majority remains on the current version. This reduces risk during updates and allows for real-time testing under live conditions.&lt;/LI&gt;
&lt;LI&gt;A/B testing &amp;amp; blue/green deployments: Strategies that allow you to serve different versions of your application to different user groups. This can help in testing features or UI changes before a full rollout, ensuring smoother transitions and minimizing disruption.&lt;/LI&gt;
&lt;LI&gt;Geo-based routing: Directing user requests to the nearest available service endpoint based on geographic location. This not only improves response times but also enhances the overall user experience by reducing latency.&lt;/LI&gt;
&lt;LI&gt;Service mesh integration: Tools like &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/istio-deploy-addon" target="_blank" rel="noopener"&gt;Istio can be deployed alongside AKS&lt;/A&gt; to provide fine-grained control over traffic routing, observability, and secure communication between services. These tools add another layer of management for scenarios that require dynamic traffic policies and granular control.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Note: A comprehensive exploration of these advanced traffic management strategies deserves a dedicated article. This overview aims to provide context on how these techniques integrate with basic ingress and application routing to form a complete traffic management strategy.&lt;/P&gt;
&lt;P data-start="126" data-end="189"&gt;&lt;STRONG&gt;Ingress resource example with Gateway API and how to use it&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="126" data-end="189"&gt;With the new capabilities of Application Gateway for Containers, you can now leverage the Gateway API for more advanced ingress scenarios—such as hosting multiple sites and aligning with Kubernetes’ evolving standards. Unlike the traditional ingress resource, the Gateway API provides a more flexible and standardized way to manage external traffic.&lt;/P&gt;
&lt;P data-start="542" data-end="583"&gt;Step 1: Prepare Your Backend Service&lt;/P&gt;
&lt;P data-start="585" data-end="710"&gt;Ensure you have a backend service deployed (for example, a service named my-service that listens on port 80). For instance:&lt;BR /&gt;&lt;EM&gt;Service Configuration (my-service.yaml)&lt;/EM&gt;&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: default
spec:
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80&lt;/LI-CODE&gt;
&lt;P data-start="585" data-end="710"&gt;Deploy the service using:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl apply -f my-service.yaml&lt;/LI-CODE&gt;
&lt;P data-start="969" data-end="1016"&gt;Step 2: Create a Gateway API Configuration&lt;/P&gt;
&lt;P data-start="1018" data-end="1223"&gt;Below is an example of how to configure the Gateway API to work with Application Gateway for Containers (AGC). This example creates a Gateway and an associated HTTPRoute that serve traffic for the hostname example.yourdomain.com.&lt;/P&gt;
&lt;P data-start="585" data-end="710"&gt;&lt;EM&gt;Gateway Configuration (gateway.yaml):&lt;/EM&gt;&lt;/P&gt;
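&lt;LI-CODE lang="yaml"&gt;apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  namespace: default
  annotations:
    # Placeholders: namespace and name of your ApplicationLoadBalancer resource
    # (used when the ALB Controller manages the AGC deployment)
    alb.networking.azure.io/alb-namespace: alb-infra
    alb.networking.azure.io/alb-name: my-alb
spec:
  # GatewayClass installed by the AGC ALB Controller
  gatewayClassName: azure-alb-external
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      # HTTPS listeners need a certificate to terminate TLS
      tls:
        mode: Terminate
        certificateRefs:
          # Placeholder: a kubernetes.io/tls Secret in this namespace
          - kind: Secret
            name: my-tls-secret
      allowedRoutes:
        namespaces:
          from: All&lt;/LI-CODE&gt;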
&lt;P data-start="585" data-end="710"&gt;&lt;EM&gt;HTTPRoute Configuration (httproute.yaml):&lt;/EM&gt;&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-httproute
  namespace: default
spec:
  parentRefs:
    - name: my-gateway
  hostnames:
    - "example.yourdomain.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: my-service
          port: 80
&lt;/LI-CODE&gt;
&lt;P data-start="1933" data-end="1978"&gt;Step 3: Deploy the Gateway and HTTPRoute&lt;/P&gt;
&lt;P data-start="1980" data-end="2005"&gt;Apply the configurations:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl apply -f gateway.yaml
kubectl apply -f httproute.yaml&lt;/LI-CODE&gt;
&lt;P data-start="2082" data-end="2118"&gt;Step 4: Validate the Deployment&lt;/P&gt;
&lt;OL data-start="2120" data-end="2692"&gt;
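&lt;LI data-start="2120" data-end="2252"&gt;DNS resolution: Ensure that example.yourdomain.com resolves to the frontend of your Application Gateway for Containers, for example through a CNAME record that points to the frontend FQDN reported in the Gateway status.&lt;/LI&gt;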
&lt;LI data-start="2254" data-end="2433"&gt;Testing connectivity: From an external client, send an HTTPS request to https://example.yourdomain.com and verify that the traffic is routed to your backend service.&lt;/LI&gt;
&lt;LI data-start="2435" data-end="2692"&gt;Monitoring and troubleshooting: Use the following commands to inspect the status and events of your Gateway and HTTPRoute:&lt;/LI&gt;
&lt;/OL&gt;
&lt;LI-CODE lang="bash"&gt;kubectl describe gateway my-gateway -n default
kubectl describe httproute my-httproute -n default
&lt;/LI-CODE&gt;
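&lt;P&gt;To test routing before DNS is switched over, one possible approach is to read the address published in the Gateway status and point curl at it directly. This sketch assumes the Gateway and hostname from the examples above, that dig is installed on the client, and that the listener may be using a test certificate (hence --insecure).&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Read the frontend address that AGC published in the Gateway status
fqdn=$(kubectl get gateway my-gateway -n default -o jsonpath='{.status.addresses[0].value}')
echo "Gateway frontend: $fqdn"

# Resolve the frontend and send a request for the routed hostname
frontendIp=$(dig +short "$fqdn" | tail -n 1)
curl --insecure --resolve "example.yourdomain.com:443:$frontendIp" https://example.yourdomain.com/&lt;/LI-CODE&gt;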
&lt;P data-start="2699" data-end="2742"&gt;Key advantages of Gateway API with AGC:&lt;/P&gt;
&lt;UL data-start="2744" data-end="3305"&gt;
&lt;LI data-start="2744" data-end="2915"&gt;Advanced routing capabilities: The Gateway API allows you to define multiple routes, enabling scenarios like multiple site hosting, path-based routing, and more.&lt;/LI&gt;
&lt;LI data-start="2917" data-end="3101"&gt;Future-proof alignment: The Kubernetes Ingress API is frozen, with new features going into the Gateway API, so adopting the Gateway API aligns your deployments with the evolving direction of Kubernetes networking.&lt;/LI&gt;
&lt;LI data-start="3103" data-end="3305"&gt;Unified management: By using AGC with Gateway API, you benefit from the advanced features of Application Gateway for Containers, including robust load balancing and enhanced security features.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3307" data-end="3469"&gt;This updated approach not only modernizes your ingress setup but also provides a more scalable and flexible way to manage external traffic into your AKS clusters.&lt;/P&gt;
&lt;P data-start="3471" data-end="3712"&gt;For additional details and the latest examples, see the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/application-gateway/for-containers/how-to-multiple-site-hosting-gateway-api?tabs=alb-managed" target="_blank" rel="noopener"&gt;multi-site hosting with Application Gateway for Containers&lt;/A&gt; guide.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="3471" data-end="3712"&gt;Another great resource on AGC, written by&amp;nbsp;&lt;A class="lia-external-url" href="https://www.linkedin.com/in/erjosito/" target="_blank" rel="noopener"&gt;Jose Moreno&lt;/A&gt;, is available here: &lt;A class="lia-external-url" href="https://blog.cloudtrooper.net/2025/04/02/application-gateway-for-containers-a-not-so-gentle-intro-4/" target="_blank" rel="noopener"&gt;Application Gateway for Containers: a not-so-gentle intro&lt;/A&gt;.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="3471" data-end="3712"&gt;&lt;U style="font-style: var(--lia-blog-font-style); font-weight: var(--lia-blog-font-weight); font-family: var(--lia-blog-font-family); background-color: var(--lia-rte-bg-color); color: var(--lia-bs-body-color); font-size: var(--lia-bs-font-size-base);"&gt;Diagram: AGC traffic flow&lt;/U&gt;&lt;/P&gt;
&lt;img&gt;An example architecture illustrating how Application Gateway for Containers (AGC) uses the Gateway API to route HTTPS traffic from a client to different services within an AKS cluster.&lt;BR /&gt;&lt;/img&gt;
&lt;P&gt;&lt;SPAN style="font-style: var(--lia-blog-font-style); font-weight: var(--lia-blog-font-weight); font-family: var(--lia-blog-font-family); background-color: var(--lia-rte-bg-color); color: var(--lia-bs-body-color); font-size: var(--lia-bs-font-size-base);"&gt;Scenario: A global e-commerce platform leverages Application Gateway for Containers (AGC) integrated with the Gateway API to route traffic based on hostnames, paths, or other advanced routing rules. This approach allows each microservice (e.g., checkout, product catalog, user management) to be served through its own route configuration, simplifying scaling and updates. As the platform grows, the Gateway API’s extensible model ensures a future-proof solution—one that supports multiple site hosting and advanced traffic management without relying on the older Ingress API.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;5. Virtual networks, service endpoints, and private link&lt;/H3&gt;
&lt;P&gt;Integrating your AKS clusters with Azure Virtual Networks (VNETs) is crucial for secure communication with other Azure services.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Service endpoints and private link:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Service endpoints: The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview" target="_blank" rel="noopener"&gt;Virtual network service endpoints overview&lt;/A&gt; explains how endpoints extend VNET private address space to Azure services.&lt;/LI&gt;
&lt;LI&gt;Private link: For even tighter integration, &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noopener"&gt;private link&lt;/A&gt; allows you to access Azure PaaS services over a private endpoint in your VNET.&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;U&gt;Diagram: VNET integration&lt;/U&gt;&lt;/P&gt;
&lt;img&gt;An AKS cluster integrates with an Azure Virtual Network, enabling secure access to Azure SQL Database, Storage Accounts, and other PaaS services&lt;/img&gt;
&lt;P&gt;Example Use Case: A healthcare application that needs to access an Azure SQL Database can use service endpoints or Private Link. This ensures that traffic between the AKS cluster and the database does not traverse the public internet, thereby meeting regulatory compliance and security requirements.&lt;/P&gt;
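&lt;P&gt;The two options in this use case could be set up roughly as follows. The VNET, subnet, SQL server, and endpoint names are hypothetical placeholders. With Private Link, name resolution for the database also needs a private DNS zone linked to the VNET, which this sketch does not cover.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Option 1: enable the Microsoft.Sql service endpoint on the AKS node subnet
az network vnet subnet update \
  --resource-group MyResourceGroup \
  --vnet-name MyVnet \
  --name aks-subnet \
  --service-endpoints Microsoft.Sql

# Option 2: create a private endpoint for the SQL server in a dedicated subnet
sqlServerId=$(az sql server show \
  --resource-group MyResourceGroup \
  --name my-sql-server \
  --query id \
  --output tsv)

az network private-endpoint create \
  --resource-group MyResourceGroup \
  --name sql-private-endpoint \
  --vnet-name MyVnet \
  --subnet private-endpoints \
  --private-connection-resource-id "$sqlServerId" \
  --group-id sqlServer \
  --connection-name sql-connection&lt;/LI-CODE&gt;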
&lt;H3&gt;6. Planning IP addressing with Azure CNI&lt;/H3&gt;
&lt;P&gt;A critical aspect of designing your AKS network is planning the IP address space. The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni?tabs=configure-networking-portal#plan-ip-addressing-for-your-cluster" target="_blank" rel="noopener"&gt;Azure CNI configuration guide&lt;/A&gt; helps you to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Determine IP range requirements: For nodes and pods.&lt;/LI&gt;
&lt;LI&gt;Avoid address overlap: With existing VNETs or on-premises networks.&lt;/LI&gt;
&lt;/UL&gt;
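&lt;P&gt;To make the sizing step concrete, here is a small sketch of the calculation for a cluster where every pod takes an address from the node subnet. It assumes the formula from the Azure CNI planning guidance, (nodes + 1) + ((nodes + 1) * max pods per node), where the extra node leaves room for an upgrade, and it assumes Azure reserves five addresses in each subnet. The function names are illustrative only.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;#!/usr/bin/env bash

# Addresses needed when pods take IPs from the node subnet:
# (nodes + 1) + ((nodes + 1) * max pods per node)
required_ips() {
  local nodes=$1 max_pods=$2
  echo $(( (nodes + 1) + (nodes + 1) * max_pods ))
}

# Smallest subnet prefix whose usable size (total minus 5 reserved) fits the need
smallest_prefix() {
  local need=$1 prefix=29
  while [ "$prefix" -ge 8 ]; do
    if [ $(( 2 ** (32 - prefix) - 5 )) -ge "$need" ]; then
      echo "/$prefix"
      return 0
    fi
    prefix=$(( prefix - 1 ))
  done
  return 1
}

# Example: 50 nodes with 30 pods per node
need=$(required_ips 50 30)
echo "Addresses needed: $need, smallest subnet: $(smallest_prefix "$need")"&lt;/LI-CODE&gt;
&lt;P&gt;For 50 nodes at 30 pods per node this reports 1581 addresses and a /21 subnet. Leave extra headroom if you expect the node count to grow.&lt;/P&gt;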
&lt;H3&gt;7. Egress traffic management and security controls&lt;/H3&gt;
&lt;P&gt;Outbound traffic from your AKS clusters must be managed to ensure security and compliance. There are several approaches:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Egress options:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;UDR and Azure Firewall: The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/egress-outboundtype#deploy-a-cluster-with-outbound-type-of-udr-and-azure-firewall" target="_blank" rel="noopener"&gt;Deploy a cluster with outbound type of UDR and Azure Firewall&lt;/A&gt; documentation details how to route egress traffic through user-defined routes (UDRs) and Azure Firewall for enhanced control.&lt;/LI&gt;
&lt;LI&gt;Egress outbound types: Additional details in the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/egress-outboundtype" target="_blank" rel="noopener"&gt;egress outbound type guide&lt;/A&gt; illustrate various configurations.&lt;/LI&gt;
&lt;LI&gt;Limiting egress traffic: The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/limit-egress-traffic?tabs=aks-with-system-assigned-identities" target="_blank" rel="noopener"&gt;limit egress traffic&lt;/A&gt; document offers strategies for restricting outbound access to only trusted destinations.&lt;/LI&gt;
&lt;/UL&gt;
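&lt;P&gt;A condensed sketch of the UDR approach, assuming a route table named aks-egress-rt already exists and a firewall is reachable at the private IP 10.0.1.4. Both are placeholders, as are the VNET and subnet names. The linked guides list the firewall rules AKS itself needs for outbound traffic.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Send all outbound traffic from the subnet to the firewall
az network route-table route create \
  --resource-group MyResourceGroup \
  --route-table-name aks-egress-rt \
  --name default-to-firewall \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.1.4

# Associate the route table with the node subnet
az network vnet subnet update \
  --resource-group MyResourceGroup \
  --vnet-name MyVnet \
  --name aks-subnet \
  --route-table aks-egress-rt

# Create the cluster with user-defined routing as the outbound type
subnetId=$(az network vnet subnet show \
  --resource-group MyResourceGroup \
  --vnet-name MyVnet \
  --name aks-subnet \
  --query id \
  --output tsv)

az aks create \
  --resource-group MyResourceGroup \
  --name MyEgressAKSCluster \
  --network-plugin azure \
  --vnet-subnet-id "$subnetId" \
  --outbound-type userDefinedRouting&lt;/LI-CODE&gt;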
&lt;P&gt;&lt;STRONG&gt;Security layers:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Network security groups (NSGs): &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-network/network-security-group-how-it-works" target="_blank" rel="noopener"&gt;NSGs in virtual networks&lt;/A&gt; provide an extra layer of security by filtering traffic at the subnet or NIC level.&lt;/LI&gt;
&lt;LI&gt;Network policies: For pod-level security, the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/use-network-policies" target="_blank" rel="noopener"&gt;use network policies&lt;/A&gt; guide explains how to restrict communication between pods.&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Network policies are a key tool for enforcing security at the pod level. They allow you to restrict both ingress and egress traffic between pods. For example, an ingress policy can permit only pods with the label app: frontend to communicate with pods labeled app: backend.&lt;/P&gt;
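&lt;P&gt;A minimal sketch of such an ingress policy is shown below. It assumes the cluster was created with a network policy engine enabled and that both sets of pods run in the default namespace. The policy name is illustrative.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  # Pods this policy protects
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    # Only pods labeled app: frontend in the same namespace may connect
    - from:
        - podSelector:
            matchLabels:
              app: frontend&lt;/LI-CODE&gt;
&lt;P&gt;Save the manifest under a name of your choice (for example, allow-frontend-to-backend.yaml) and apply it with kubectl apply -f. Once a policy selects the backend pods, ingress traffic from any other source is denied.&lt;/P&gt;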
&lt;P&gt;Practical scenario: When a cluster hosts a mix of public-facing and internal services, configuring UDRs with Azure Firewall and applying NSGs and network policies ensures that public endpoints are hardened while internal communications remain efficient and secure.&lt;/P&gt;
&lt;H3&gt;8. Advanced networking: CNI overlay and operator best practices&lt;/H3&gt;
&lt;P&gt;For those looking to push the envelope in AKS networking, advanced configurations can offer improved performance and flexibility. One such configuration is using Azure CNI Overlay, which helps in scenarios where you need to conserve VNET IP addresses for large-scale deployments.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What is Azure CNI overlay?&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Overlay network: Instead of assigning each pod an IP directly from your VNET (as in standard Azure CNI), pods receive IP addresses from a private CIDR that is logically separate from the VNET hosting the nodes. The Azure networking stack routes traffic for that CIDR between pods directly, so no encapsulation (such as VXLAN) or custom routes are required, and pod IP assignment is decoupled from your VNET’s IP range.&lt;/LI&gt;
&lt;LI&gt;Efficient IP utilization: This approach is particularly beneficial in environments with limited VNET address space or when deploying clusters with high pod density.&lt;/LI&gt;
&lt;LI&gt;Trade-off: Pods are not directly addressable from outside the cluster. Traffic leaving the cluster is source-NATed to the node IP, and workloads must be published through a Kubernetes Service (for example, a load balancer) or an ingress to be reachable from the VNET.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Advanced concepts:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;CNI overlay: The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl" target="_blank" rel="noopener"&gt;Azure CNI overlay documentation&lt;/A&gt; covers how to leverage overlay networks when direct VNET integration is not feasible.&lt;/LI&gt;
&lt;LI&gt;Operator best practices: Following the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-network" target="_blank" rel="noopener"&gt;Operator best practices for networking&lt;/A&gt; can help you maintain optimal performance and security in production environments.&lt;/LI&gt;
&lt;LI&gt;CNI overview: The &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/concepts-network-cni-overview" target="_blank" rel="noopener"&gt;AKS concepts on CNI&lt;/A&gt; provide a thorough understanding of how container networking works within Azure.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Example lab: Implementing CNI overlay&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;1. Deploy a cluster with CNI overlay:&lt;/U&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Variables
resourceGroup="MyResourceGroup"
location="centralus"
aksName="MyOverlayAKSCluster"
podCidr="192.168.0.0/16"
nodeCount=3

# Create resource group (if it doesn't already exist)
az group create --name "$resourceGroup" --location "$location"

# Create AKS cluster with CNI Overlay
az aks create \
  --resource-group "$resourceGroup" \
  --name "$aksName" \
  --location "$location" \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr "$podCidr" \
  --enable-addons monitoring \
  --node-count "$nodeCount"
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;Note on VNET and pod CIDR with CNI overlay&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;When deploying an AKS cluster using &lt;STRONG&gt;Azure CNI Overlay&lt;/STRONG&gt;, it's important to understand how networking is handled:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Overlay pod CIDR: The pod CIDR you specify (e.g., 192.168.0.0/16) is used exclusively for assigning IP addresses to pods. This overlay CIDR is completely independent of the address space used by the underlying virtual network (VNET).&lt;/LI&gt;
&lt;LI&gt;VNET for the nodes: Nodes still receive their IP addresses from a VNET subnet. If you omit --vnet-subnet-id, as in the command above, AKS provisions a managed VNET in the system-managed node resource group. To place the nodes in a network you control, create the VNET and subnet first and pass the subnet's resource ID with --vnet-subnet-id. In both cases the node subnet's IP range is separate from the overlay pod CIDR.&lt;/LI&gt;
&lt;LI&gt;Decoupled pod networking: Because pod IP addresses are allocated from the overlay CIDR rather than the VNET, a node VNET that uses a different range (e.g., 10.0.0.0/16) does not conflict with pod IPs from the overlay CIDR (e.g., 192.168.0.0/16). Choose a pod CIDR that does not overlap the node subnet or any peered or on-premises network the cluster must reach. This decoupling simplifies IP management and allows for greater scalability, especially in environments where VNET IP space is limited.&lt;/LI&gt;
&lt;LI&gt;When to use Azure CNI (Standard): If pods need VNET-routable IP addresses, for example so that resources outside the cluster can reach them directly without going through a Service, use Azure CNI (Standard) mode. The trade-off is that every pod consumes an address from your VNET, so subnet sizing needs more careful planning.&lt;/LI&gt;
&lt;/UL&gt;
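&lt;P&gt;Before settling on a pod CIDR, it can help to check it against the address ranges the cluster has to coexist with. The following is a small, self-contained sketch of such a check in plain bash. The function names are illustrative, and the ranges are the example values used above.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;#!/usr/bin/env bash

# Convert a dotted IPv4 address to an integer
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# Print the first and last address of a CIDR block as integers
cidr_range() {
  local ip=${1%/*} prefix=${1#*/}
  local size=$(( 2 ** (32 - prefix) ))
  local base
  base=$(ip_to_int "$ip")
  local start=$(( base / size * size ))
  echo "$start $(( start + size - 1 ))"
}

# Return success (0) when the two CIDR blocks share at least one address
cidrs_overlap() {
  # After this line: $1 and $2 bound the first block, $3 and $4 the second
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] || return 1
  [ "$3" -le "$2" ]
}

# Example: node VNET range versus overlay pod CIDR
if cidrs_overlap "10.0.0.0/16" "192.168.0.0/16"; then
  echo "Pod CIDR overlaps the node VNET: pick another range"
else
  echo "No overlap"
fi&lt;/LI-CODE&gt;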
&lt;P&gt;In summary, &lt;STRONG&gt;Azure CNI Overlay&lt;/STRONG&gt; keeps pod addressing out of your VNET: nodes use a VNET subnet (managed by AKS or supplied by you), while pods draw their addresses from a decoupled overlay pod CIDR, which provides efficient and scalable pod networking.&lt;/P&gt;
&lt;P&gt;&lt;U&gt;2. Test connectivity:&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;Deploy two test pods and verify pod-to-pod connectivity across the overlay network using standard kubectl commands.&lt;/P&gt;
&lt;P&gt;Step 1: Deploy two test pods:&lt;/P&gt;
&lt;P&gt;Create two pods (named test-pod-1 and test-pod-2) using the BusyBox image, which provides basic networking utilities:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl run test-pod-1 --image=busybox --restart=Never -- /bin/sh -c "sleep 3600"
kubectl run test-pod-2 --image=busybox --restart=Never -- /bin/sh -c "sleep 3600"
&lt;/LI-CODE&gt;
&lt;P&gt;Step 2: Verify pods are running&lt;/P&gt;
&lt;P&gt;Check that both pods are in the running state:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pods&lt;/LI-CODE&gt;
&lt;P&gt;You should see output similar to:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;NAME        READY   STATUS   RESTARTS   AGE
test-pod-1  1/1     Running  0          1m
test-pod-2  1/1     Running  0          1m&lt;/LI-CODE&gt;
&lt;P&gt;Step 3: Retrieve the IP address of one pod&lt;/P&gt;
&lt;P&gt;Get the IP address of test-pod-2:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;POD2_IP=$(kubectl get pod test-pod-2 -o jsonpath='{.status.podIP}')
echo "test-pod-2 IP: $POD2_IP"
&lt;/LI-CODE&gt;
&lt;P&gt;Step 4: Test connectivity from the other pod&lt;/P&gt;
&lt;P&gt;Exec into test-pod-1 and ping test-pod-2 using the retrieved IP address:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl exec test-pod-1 -- ping -c 4 "$POD2_IP"&lt;/LI-CODE&gt;
&lt;P&gt;You should see output confirming that test-pod-1 can reach test-pod-2, such as:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;PING 192.168.2.117 (192.168.2.117): 56 data bytes
64 bytes from 192.168.2.117: seq=0 ttl=64 time=0.123 ms
64 bytes from 192.168.2.117: seq=1 ttl=64 time=0.098 ms
...
--- 192.168.2.117 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss&lt;/LI-CODE&gt;
&lt;P&gt;Optional: Clean up&lt;/P&gt;
&lt;P&gt;After testing, remove the test pods:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl delete pod test-pod-1 test-pod-2&lt;/LI-CODE&gt;
&lt;H3&gt;9. Managing resource groups and FAQs&lt;/H3&gt;
&lt;P&gt;Understanding how AKS organizes its resources is critical for efficient management. When you deploy an AKS cluster, two resource groups are involved by design: the one you deploy the cluster resource into, and a second, system-managed node resource group (named with an MC_ prefix by default) that contains supporting components. Here’s what you need to know:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Primary vs. secondary resource group: The primary resource group holds the AKS cluster resource itself, while the secondary (node) resource group holds the infrastructure AKS creates for it, such as virtual machine scale sets, load balancers, managed identities, and network components. It’s important to avoid manual modifications in the secondary group since it is maintained by AKS.&lt;/LI&gt;
&lt;LI&gt;Lifecycle management best practices: To safeguard your resources:&lt;/LI&gt;
&lt;UL&gt;
&lt;LI&gt;Apply resource locks or policies to prevent accidental deletion or modification.&lt;/LI&gt;
&lt;LI&gt;Use consistent naming conventions and tagging across both resource groups. This aids in tracking, cost management, and operational monitoring.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI&gt;Role-based access control (RBAC): Implement RBAC not only within your AKS cluster but also across both resource groups. Proper RBAC configuration ensures that access is granted based on roles and responsibilities, enhancing overall security and operational efficiency.&lt;/LI&gt;
&lt;LI&gt;Monitoring and auditing: Regular monitoring using Azure Monitor or other auditing tools is essential. Keeping a close watch on both resource groups can help detect unauthorized changes or unexpected costs early on, ensuring the operational health and security of your deployment.&lt;/LI&gt;
&lt;/UL&gt;
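&lt;P&gt;As an illustration of these practices, the commands below locate the node resource group, list what AKS manages in it, and place a delete lock on the resource group you own. The cluster and lock names are placeholders. Whether a lock fits your environment depends on your own deployment and cleanup workflows.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Look up the system-managed (node) resource group for a cluster
nodeRg=$(az aks show \
  --resource-group MyResourceGroup \
  --name MyAKSCluster \
  --query nodeResourceGroup \
  --output tsv)

# Review what AKS manages there, without modifying it
az resource list --resource-group "$nodeRg" --output table

# Protect the resource group you own against accidental deletion
az lock create \
  --name no-delete \
  --resource-group MyResourceGroup \
  --lock-type CanNotDelete&lt;/LI-CODE&gt;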
&lt;P&gt;By integrating these practices into your management strategy, you can efficiently control the lifecycle, security, and performance of your AKS resources, leading to a more stable and secure production environment.&lt;/P&gt;
&lt;P&gt;For further details, refer to the &lt;A href="https://learn.microsoft.com/en-us/azure/aks/faq#why-are-two-resource-groups-created-with-aks" target="_blank" rel="noopener"&gt;AKS FAQ on resource groups&lt;/A&gt;.&lt;/P&gt;
&lt;H3&gt;Conclusion&lt;/H3&gt;
&lt;P&gt;AKS networking is multifaceted, covering everything from basic connectivity and IP planning to advanced security and routing scenarios. By understanding:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Network topologies and models (&lt;A href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/app-platform/aks/network-topology-and-connectivity" target="_blank" rel="noopener"&gt;Topology &amp;amp; Connectivity&lt;/A&gt;, &lt;A href="https://learn.microsoft.com/en-us/azure/aks/concepts-network#compare-network-models" target="_blank" rel="noopener"&gt;Compare Network Models&lt;/A&gt;),&lt;/LI&gt;
&lt;LI&gt;Private clusters and DNS configurations (&lt;A href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal" target="_blank" rel="noopener"&gt;Private Clusters&lt;/A&gt;, &lt;A href="https://learn.microsoft.com/en-us/azure/dns/private-dns-overview" target="_blank" rel="noopener"&gt;Private DNS Overview&lt;/A&gt;),&lt;/LI&gt;
&lt;LI&gt;Ingress and routing strategies (&lt;A href="https://learn.microsoft.com/en-us/azure/application-gateway/ingress-controller-overview" target="_blank" rel="noopener"&gt;Ingress Controller Overview&lt;/A&gt;, HTTP Application Routing and its 03 March 2025 retirement in favor of the Application Routing add-on, &lt;A href="https://learn.microsoft.com/en-us/azure/aks/app-routing-migration" target="_blank" rel="noopener"&gt;App Routing Migration&lt;/A&gt;),&lt;/LI&gt;
&lt;LI&gt;Integration with virtual networks and security controls (&lt;A href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview" target="_blank" rel="noopener"&gt;Service Endpoints&lt;/A&gt;, &lt;A href="https://learn.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noopener"&gt;Private Link&lt;/A&gt;),&lt;/LI&gt;
&lt;LI&gt;Advanced topics like CNI overlay and operator best practices (&lt;A href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl" target="_blank" rel="noopener"&gt;CNI Overlay&lt;/A&gt;, &lt;A href="https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-network" target="_blank" rel="noopener"&gt;Operator Best Practices&lt;/A&gt;),&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;you can design and operate AKS clusters that are both high-performing and secure. Real-world scenarios, like segregating public-facing and internal services or ensuring regulatory compliance via private networking, illustrate how these concepts are applied in production environments.&lt;/P&gt;
&lt;H3&gt;Next Steps: Hands-On Labs&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Lab 1:&lt;/STRONG&gt; &lt;EM&gt;Deploy an AKS Cluster with Azure CNI Standard and validate IP addressing&lt;/EM&gt;&lt;BR /&gt;Follow the steps in Section 2 to create your cluster and verify pod IPs.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Lab 2:&lt;/STRONG&gt; &lt;EM&gt;Implement a Private Cluster and Configure Private DNS&lt;/EM&gt;&lt;BR /&gt;Use Section 3’s instructions to deploy a private cluster and set up a private DNS zone.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Lab 3: &lt;/STRONG&gt;&lt;EM&gt;Deploy an AKS Cluster with Azure CNI Overlay&lt;BR /&gt;&lt;/EM&gt;Follow the steps in Section 8 to create your cluster and test pod connectivity.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;These labs will give you hands-on experience with the core aspects of AKS networking, solidifying your understanding of both the concepts and their practical applications.&lt;/P&gt;
&lt;P&gt;&lt;U&gt;References&lt;/U&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;A href="https://www.youtube.com/watch?v=mAGqnX2WW1M" target="_blank" rel="noopener"&gt;Networking Best Practices - AKS&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/app-platform/aks/network-topology-and-connectivity" target="_blank" rel="noopener"&gt;AKS Network Topology and Connectivity&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/concepts-network#compare-network-models" target="_blank" rel="noopener"&gt;Compare Network Models&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal" target="_blank" rel="noopener"&gt;Private Clusters&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/dns/private-dns-overview" target="_blank" rel="noopener"&gt;Azure Private DNS Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/faq#why-are-two-resource-groups-created-with-aks" target="_blank" rel="noopener"&gt;AKS FAQ – Resource Groups&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal#no-private-dns-zone-prerequisites" target="_blank" rel="noopener"&gt;No Private DNS Zone Prerequisites&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/application-gateway/ingress-controller-overview" target="_blank" rel="noopener"&gt;Application Gateway Ingress Controller Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/previous-versions/azure/aks/http-application-routing" target="_blank" rel="noopener"&gt;HTTP Application Routing&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/app-routing-migration" target="_blank" rel="noopener"&gt;App Routing Migration&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview" target="_blank" rel="noopener"&gt;Virtual Network Service Endpoints Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/private-link/private-link-overview" target="_blank" rel="noopener"&gt;Azure Private Link Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=default-basic-networking%2Cazure-portal#configure-private-dns-zone" target="_blank" rel="noopener"&gt;Configure Private DNS Zone&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni?tabs=configure-networking-portal#plan-ip-addressing-for-your-cluster" target="_blank" rel="noopener"&gt;Plan IP Addressing for Your Cluster&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/egress-outboundtype#deploy-a-cluster-with-outbound-type-of-udr-and-azure-firewall" target="_blank" rel="noopener"&gt;Deploy a Cluster with Outbound Type of UDR and Azure Firewall&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/concepts-network-cni-overview" target="_blank" rel="noopener"&gt;AKS CNI Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/egress-outboundtype" target="_blank" rel="noopener"&gt;Egress Outbound Type&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/limit-egress-traffic?tabs=aks-with-system-assigned-identities" target="_blank" rel="noopener"&gt;Limit Egress Traffic&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/virtual-network/network-security-group-how-it-works" target="_blank" rel="noopener"&gt;Network Security Groups Overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/use-network-policies" target="_blank" rel="noopener"&gt;Use Network Policies&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl" target="_blank" rel="noopener"&gt;Azure CNI Overlay&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-network" target="_blank" rel="noopener"&gt;Operator Best Practices – Networking&lt;/A&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Thu, 17 Apr 2025 14:55:28 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/aks-networking-made-easy-your-comprehensive-guide/ba-p/4398603</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-04-17T14:55:28Z</dc:date>
    </item>
    <item>
      <title>Embracing AKS built-in upgrade features and exploring custom solutions</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/embracing-aks-built-in-upgrade-features-and-exploring-custom/ba-p/4398230</link>
      <description>&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Upgrading your AKS clusters in production is made simple with Microsoft’s robust, automated upgrade and update features. The official AKS upgrade process seamlessly handles surge nodes, optimizes Pod Disruption Budgets (PDBs), manages node updates, and performs comprehensive compatibility checks—all ensuring a smooth, low-downtime experience with minimal manual intervention.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;Official AKS upgrade and update features&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Microsoft has built a rich set of features into AKS to simplify the upgrade process:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Automated cluster upgrades:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;AKS provides an &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster" target="_blank" rel="noopener"&gt;automated upgrade process&lt;/A&gt; via the az aks upgrade command. This process manages surge nodes for availability, applies necessary health checks, and ensures minimal disruption during the upgrade.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Scheduled and auto-upgrades: &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;With &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/planned-maintenance" target="_blank" rel="noopener"&gt;scheduled upgrades&lt;/A&gt;, you can define maintenance windows for cluster updates. The auto-upgrade feature (when enabled) automatically updates clusters so they stay on a supported version without manual intervention.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Node image upgrades:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/node-image-upgrade" target="_blank" rel="noopener"&gt;AKS Node Image Upgrade&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; process automatically updates the underlying node images, reducing the risk of security vulnerabilities and compatibility issues.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="4" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Fleet orchestration for multi-cluster management:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;For organizations managing multiple clusters, &lt;/SPAN&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/kubernetes-fleet/update-orchestration" target="_blank" rel="noopener"&gt;Kubernetes Fleet Update Orchestration&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; provides a centralized way to coordinate upgrades and updates across your entire fleet.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;These features are robust and continuously evolving, helping you maintain production clusters in line with industry best practices.&lt;/SPAN&gt;&lt;/P&gt;
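&lt;P&gt;As a rough illustration, the built-in features listed above map to Azure CLI commands along the following lines. This is a minimal sketch, not an official reference: the wrapper function names and the default channel, day, start time, and duration are assumptions chosen for the example, and Fleet update orchestration is left out.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;#!/bin/bash
# Sketch: thin wrappers around the built-in AKS upgrade features.
# Function names and default values are illustrative assumptions.

# Manually upgrade the control plane and node pools to a given version.
upgrade_cluster() {
    local rg="$1" cluster="$2" version="$3"
    az aks upgrade --resource-group "$rg" --name "$cluster" --kubernetes-version "$version" --yes
}

# Enable cluster auto-upgrade on a channel (rapid, stable, patch, node-image, none).
enable_auto_upgrade() {
    local rg="$1" cluster="$2" channel="${3:-stable}"
    az aks update --resource-group "$rg" --name "$cluster" --auto-upgrade-channel "$channel"
}

# Restrict auto-upgrades to a weekly maintenance window.
set_maintenance_window() {
    local rg="$1" cluster="$2" day="${3:-Saturday}" start="${4:-02:00}"
    az aks maintenanceconfiguration add --resource-group "$rg" --cluster-name "$cluster" \
        --name aksManagedAutoUpgradeSchedule --schedule-type Weekly \
        --day-of-week "$day" --interval-weeks 1 --start-time "$start" --duration 4
}

# Upgrade only the node image of one node pool; the Kubernetes version stays the same.
upgrade_node_image() {
    local rg="$1" cluster="$2" pool="$3"
    az aks nodepool upgrade --resource-group "$rg" --cluster-name "$cluster" --name "$pool" --node-image-only
}&lt;/LI-CODE&gt;
&lt;P&gt;For example, after sourcing the file, enable_auto_upgrade myResourceGroup myAKSCluster patch would switch a cluster to the patch channel. The resource group and cluster names here are placeholders.&lt;/P&gt;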
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;Why consider a custom upgrade approach?&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;For most users, the built-in AKS upgrade capabilities are the best way to maintain and update clusters. However, some users want complete control over every step of the process. If you have unique requirements (for instance, you prefer to trigger upgrades manually on demand rather than on a schedule, or you need to integrate custom health checks and rollback logic), the custom CLI script presented in this post may be of interest.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Disclaimer: &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;This experimental proof-of-concept custom CLI solution is provided as-is and is not an official Microsoft solution. It hasn’t been tested on every supported configuration and is not production-ready. Use it at your own risk and discretion.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;By exploring this custom approach, you may gain additional control over the upgrade process. Nevertheless, we strongly encourage most users to rely on the robust, built-in features provided by AKS.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;The custom CLI script for AKS upgrades&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;For users interested in a more granular approach, this custom CLI script automates many aspects of the upgrade process. The script:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Displays available information:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;It lists your resource groups and AKS clusters (with resource group, cluster name, and location) so you can easily obtain the required parameters.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Dynamic credential download:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;The script automatically downloads your cluster credentials based on the Resource Group and Cluster Name you provide.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Retrieves the current version and allowed upgrade paths:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;It displays your current Kubernetes version and uses an interactive menu to show available upgrade targets, clearly marking allowed options.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="4" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Performs pre-upgrade health checks:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;The script checks node readiness, PDBs, failed pods, and even includes a placeholder for surge capacity.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Prompts for compatibility checks:&lt;/SPAN&gt; &lt;SPAN data-contrast="auto"&gt;It reminds you to verify that your workloads are compatible with the new version and stops unless you confirm.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="6" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Initiates the upgrade process: &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Once you confirm, the script triggers the upgrade using the az aks upgrade command.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="7" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Validates post-upgrade health:&amp;nbsp;&lt;/SPAN&gt;After upgrading, the script lists deployments and pods for you to review and offers a simulated rollback if you report a problem.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;AKS upgrade script:&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;#!/bin/bash
###############################################################################
# Enhanced AKS Upgrade Script with Health Validation &amp;amp; Rollback
#
# Prerequisites (run steps 1-2 yourself; the script performs steps 3-5):
#   1. List your subscriptions to get the subscription ID:
#      az account list --output table
#   2. Set your subscription:
#      az account set --subscription 00000000-0000-0000-0000-000000000000
#   3. List your resource groups:
#      az group list --output table
#   4. List your AKS clusters (ResourceGroup, Cluster Name, Location):
#      az aks list --output table
#   5. The script downloads your cluster credentials dynamically.
#
# This script retrieves the current Kubernetes version for an AKS cluster,
# shows allowed upgrade paths and highlights allowed options, performs 
# pre-upgrade health and compatibility checks, initiates the upgrade process, 
# validates application health post-upgrade, and offers a simulated rollback if issues 
# are detected.
#
# NOTE: AKS does not officially support downgrades. The rollback here simulates
#       a recovery by re-upgrading to the previous version.
###############################################################################

#---------------------------
# Display Available Resource Groups and AKS Clusters
#---------------------------
echo "------ Available Resource Groups ------"
az group list --output table
echo ""
echo "------ Available AKS Clusters (ResourceGroup, Cluster Name, Location) ------"
az aks list --query "[].{ResourceGroup: resourceGroup, ClusterName: name, Location: location}" -o table
echo ""
echo "Please note the Resource Group, Cluster Name, and Location for your AKS cluster."
echo "----------------------------------------------------------------------"
echo ""

#---------------------------
# Helper Functions
#---------------------------

# Pre-upgrade health checks (nodes, PDBs, failed pods, surge capacity)
perform_pre_upgrade_checks() {
    echo ""
    echo "--------------------------------------------"
    echo "Performing Pre-Upgrade Health Checks"
    echo "--------------------------------------------"
    
    echo "1. Checking node status..."
    kubectl get nodes

    echo ""
    echo "2. Checking for any NotReady nodes..."
    NOTREADY=$(kubectl get nodes --no-headers | grep NotReady)
    if [ -n "$NOTREADY" ]; then
        echo "WARNING: Some nodes are not ready. Please investigate before upgrading."
    else
        echo "All nodes are Ready."
    fi

    echo ""
    echo "3. Checking Pod Disruption Budgets (PDBs)..."
    kubectl get pdb --all-namespaces

    echo ""
    echo "4. Checking for pods in a failed state (e.g., CrashLoopBackOff, Error)..."
    kubectl get pods --all-namespaces | grep -E 'CrashLoopBackOff|Error' || echo "No pods in error state found."

    echo ""
    echo "5. Checking for surge nodes / additional capacity (placeholder)..."
    echo "   (Ensure your node pool autoscaler or surge capacity is configured properly.)"
    
    echo ""
    echo "Pre-upgrade health checks completed. Please review the output above."
    read -p "Do you want to continue with the upgrade? (y/N): " CHECK_CONFIRM
    if [[ ! "$CHECK_CONFIRM" =~ ^[Yy]$ ]]; then
        echo "Upgrade cancelled based on pre-upgrade health checks."
        exit 1
    fi
}

# Compatibility check reminder
perform_compatibility_checks() {
    echo ""
    echo "--------------------------------------------"
    echo "Performing Compatibility Checks"
    echo "--------------------------------------------"
    echo "NOTE: Ensure that all critical workloads, custom resources, and third-party"
    echo "      integrations are compatible with the new Kubernetes version."
    echo "      Review release notes and documentation for any breaking changes."
    echo ""
    read -p "Have you verified workload compatibility? (y/N): " COMP_CONFIRM
    if [[ ! "$COMP_CONFIRM" =~ ^[Yy]$ ]]; then
        echo "Upgrade cancelled. Please verify compatibility and try again."
        exit 1
    fi
}

# Post-upgrade health checks (applications, deployments, pods)
perform_post_upgrade_checks() {
    echo ""
    echo "--------------------------------------------"
    echo "Performing Post-Upgrade Health Checks"
    echo "--------------------------------------------"
    echo "Checking deployments status..."
    kubectl get deployments --all-namespaces

    echo ""
    echo "Checking pods status..."
    kubectl get pods --all-namespaces

    echo ""
    echo "Please review the output for any errors or issues with your applications."
    read -p "Do all applications appear healthy? (y/N): " POST_CHECK_CONFIRM
    if [[ ! "$POST_CHECK_CONFIRM" =~ ^[Yy]$ ]]; then
        return 1
    fi
    return 0
}

# Attempt rollback to previous version (simulation)
attempt_rollback() {
    echo ""
    echo "--------------------------------------------"
    echo "Attempting Rollback"
    echo "--------------------------------------------"
    read -p "Rollback to the previous version ($CURRENT_VERSION) ? (y/N): " ROLLBACK_CONFIRM
    if [[ "$ROLLBACK_CONFIRM" =~ ^[Yy]$ ]]; then
        echo "Initiating rollback to version $CURRENT_VERSION..."
        az aks upgrade --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --kubernetes-version "$CURRENT_VERSION" --yes
        if [ $? -eq 0 ]; then
            echo "Rollback executed successfully."
        else
            echo "Rollback failed. Please check the error messages above and consider manual recovery."
            exit 1
        fi
    else
        echo "Rollback aborted. Please perform manual recovery if necessary."
        exit 1
    fi
}

#---------------------------
# Main Script
#---------------------------

# Prompt for input parameters
read -p "Enter the Resource Group: " RESOURCE_GROUP
read -p "Enter the AKS Cluster Name: " CLUSTER_NAME
read -p "Enter the AKS Region (e.g., eastus): " LOCATION

# Download cluster credentials dynamically
echo ""
echo "Downloading cluster credentials for '$CLUSTER_NAME' in resource group '$RESOURCE_GROUP'..."
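# Stop if credentials cannot be downloaded; otherwise the kubectl checks below
# would run against whatever cluster the current kubeconfig context points to.
if ! az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --overwrite-existing; then
    echo "ERROR: Failed to download cluster credentials. Please check the Resource Group and Cluster Name."
    exit 1
fi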

###############################################################################
# Step 1: Retrieve and Display the Current Kubernetes Version
###############################################################################
echo ""
echo "Fetching the current Kubernetes version for cluster '$CLUSTER_NAME' in '$RESOURCE_GROUP'..."
CURRENT_VERSION=$(az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --query "kubernetesVersion" -o tsv)
if [ -z "$CURRENT_VERSION" ]; then
    echo "ERROR: Failed to retrieve the current Kubernetes version. Please check your cluster details."
    exit 1
fi
echo "Current Kubernetes version: $CURRENT_VERSION"
echo ""

###############################################################################
# Step 2: Retrieve Allowed Upgrade Paths for the Cluster
###############################################################################
echo "Retrieving allowed upgrade paths for your cluster..."
UPGRADES_JSON=$(az aks get-upgrades --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" -o json)
ALLOWED_UPGRADES=$(echo "$UPGRADES_JSON" | jq -r 'if .controlPlaneProfile.upgradeProfile.upgrades then (.controlPlaneProfile.upgradeProfile.upgrades | map(.kubernetesVersion) | join(" ")) else "" end')

# Fallback: if no allowed upgrades are determined and CURRENT_VERSION starts with "1.30"
if [ -z "$ALLOWED_UPGRADES" ]; then
    if [[ "$CURRENT_VERSION" =~ ^1\.30 ]]; then
        echo "WARNING: No allowed upgrade paths could be determined automatically."
        echo "Typically, if your cluster is running a version like 1.30.x (e.g., 1.30.10),"
        echo "you can only upgrade directly to a 1.31.x version."
        echo "For example, allowed upgrade targets might include: 1.31.6, 1.31.5, 1.31.4, 1.31.3, 1.31.2, or 1.31.1."
        ALLOWED_UPGRADES="1.31.6 1.31.5 1.31.4 1.31.3 1.31.2 1.31.1"
        ALLOWED_MAJOR_MINOR="1.31"
    else
        echo "WARNING: No allowed upgrade paths could be determined automatically. Proceed with caution."
        ALLOWED_MAJOR_MINOR=""
    fi
else
    # Extract unique major.minor values from allowed upgrades
    ALLOWED_MAJOR_MINOR=$(for ver in $ALLOWED_UPGRADES; do
        echo "$ver" | awk -F. '{print $1"."$2}';
    done | sort -u | tr '\n' ' ')
fi

if [ -n "$ALLOWED_MAJOR_MINOR" ]; then
    echo ""
    echo "Based on your current version ($CURRENT_VERSION), you can upgrade directly to versions with major.minor:"
    for mm in $ALLOWED_MAJOR_MINOR; do
        echo "  - $mm"
    done
    echo "Only versions matching these allowed major.minor values will be marked as [ALLOWED] below."
    echo "For more details, please see https://aka.ms/aks-supported-k8s-ver"
    echo ""
fi

###############################################################################
# Step 3: Fetch Available Kubernetes Versions
###############################################################################
echo "Fetching available Kubernetes versions in '$LOCATION'..."
VERSIONS_JSON=$(az aks get-versions --location "$LOCATION" -o json)
if [ $? -ne 0 ]; then
    echo "ERROR: Failed to fetch available versions. Please check your Azure CLI configuration."
    exit 1
fi

# Extract list of versions and preview flag from the "values" array
mapfile -t OPTIONS &amp;lt; &amp;lt;(echo "$VERSIONS_JSON" | jq -r '(.values // [])[] | "\(.version) \(.isPreview)"')

if [ ${#OPTIONS[@]} -eq 0 ]; then
    echo "ERROR: No versions found for location '$LOCATION'."
    echo "This might be due to the aks-preview extension altering the output."
    echo "If you don't need preview features, try removing the extension with: az extension remove --name aks-preview"
    exit 1
fi

###############################################################################
# Step 4: Build the Interactive Menu with Highlighted Allowed Options
###############################################################################
declare -a VERSION_LIST
declare -a LABELS

for entry in "${OPTIONS[@]}"; do
    # Extract version and preview flag
    VERSION=$(echo "$entry" | awk '{print $1}')
    IS_PREVIEW=$(echo "$entry" | awk '{print $2}')
    LABEL="$VERSION"
    if [ "$IS_PREVIEW" == "true" ]; then
        LABEL="$LABEL (Preview)"
    else
        LABEL="$LABEL (Stable)"
    fi
    
    # Highlight if allowed (by comparing major.minor)
    if [ -n "$ALLOWED_MAJOR_MINOR" ]; then
        AVAILABLE_MM=$(echo "$VERSION" | awk -F. '{print $1"."$2}')
        for allowed in $ALLOWED_MAJOR_MINOR; do
            if [ "$AVAILABLE_MM" == "$allowed" ]; then
                LABEL="$LABEL [ALLOWED]"
                break
            fi
        done
    fi

    VERSION_LIST+=("$VERSION")
    LABELS+=("$LABEL")
done

echo ""
echo "Select the Kubernetes version to upgrade to:"
PS3="Enter your choice (or type 'q' to quit): "
select opt in "${LABELS[@]}"; do
    if [[ "$REPLY" == "q" ]]; then
        echo "Exiting..."
        exit 0
    fi
    if [ -z "$opt" ]; then
        echo "Invalid selection. Please try again."
    else
        TARGET_VERSION=${VERSION_LIST[$((REPLY-1))]}
        echo "You selected: $opt"
        break
    fi
done

###############################################################################
# Step 5: Validate the Selected Target Version
###############################################################################
if [ -n "$ALLOWED_MAJOR_MINOR" ]; then
    AVAILABLE_MM=$(echo "$TARGET_VERSION" | awk -F. '{print $1"."$2}')
    ALLOWED_MATCH=0
    for allowed in $ALLOWED_MAJOR_MINOR; do
        if [ "$AVAILABLE_MM" == "$allowed" ]; then
            ALLOWED_MATCH=1
            break
        fi
    done
    if [ $ALLOWED_MATCH -ne 1 ]; then
        echo ""
        echo "ERROR: Upgrading from $CURRENT_VERSION to $TARGET_VERSION is not allowed based on your cluster's upgrade policy."
        echo "Allowed upgrades from your current version are only for versions with major.minor:"
        for mm in $ALLOWED_MAJOR_MINOR; do
            echo "  - $mm"
        done
        echo "Please select one of these allowed versions."
        exit 1
    fi
fi

###############################################################################
# Step 6: Pre-Upgrade Health &amp;amp; Compatibility Checks
###############################################################################
perform_pre_upgrade_checks
perform_compatibility_checks

###############################################################################
# Step 7: Confirm and Execute the Upgrade
###############################################################################
echo ""
read -p "Proceed with upgrading '$CLUSTER_NAME' from version $CURRENT_VERSION to $TARGET_VERSION? (y/N): " CONFIRM
if [[ ! "$CONFIRM" =~ ^[Yy]$ ]]; then
    echo "Upgrade cancelled."
    exit 0
fi

echo ""
echo "Initiating upgrade to version $TARGET_VERSION..."
az aks upgrade --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" --kubernetes-version "$TARGET_VERSION" --yes

if [ $? -eq 0 ]; then
    echo "Upgrade command executed successfully."
else
    echo "ERROR: Upgrade command failed. Please check the error messages above."
    exit 1
fi

###############################################################################
# Step 8: Post-Upgrade Health Checks &amp;amp; Rollback Option
# Rollback note: the rollback in this script simulates a recovery by
# re-upgrading the cluster to its previous version. AKS does not officially
# support downgrades, so this is not a true downgrade. It is a best-effort,
# last-resort approach that relies on a known, working previous version.
# Make sure you have backups and a recovery strategy in place before relying on it.
###############################################################################
if ! perform_post_upgrade_checks; then
    echo ""
    echo "One or more post-upgrade health checks have failed."
    attempt_rollback
else
    echo ""
    echo "Post-upgrade health checks passed. Your applications appear healthy."
fi

echo ""
echo "Upgrade complete. Please continue monitoring your cluster and applications for any issues."&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;You can download the full script here:&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://gist.github.com/ricmmartins/c3f58b2f45fad4e143df3e6c10920cc6" target="_blank" rel="noopener"&gt;Custom CLI Script for AKS Upgrades&lt;/A&gt;&lt;/P&gt;
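&lt;P&gt;The pre-upgrade checks in the script print output for a person to review, and the surge-capacity step is only a placeholder. If you want a check that can stop a pipeline without a prompt, one possible variation is sketched below. The function names are hypothetical and not part of the script; the sketch assumes kubectl already points at the target cluster and that treating a cordoned node (status Ready,SchedulingDisabled) as not ready is acceptable for your environment.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;#!/bin/bash
# Sketch: non-interactive variants of the pre-upgrade checks.
# Function names are illustrative assumptions.

# Count nodes whose STATUS column is anything other than exactly "Ready".
# A cordoned node (Ready,SchedulingDisabled) is counted as not ready.
count_not_ready_nodes() {
    kubectl get nodes --no-headers | awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Return non-zero when any node is not Ready, so a calling script can stop early.
require_ready_nodes() {
    local not_ready
    not_ready=$(count_not_ready_nodes)
    if [ "$not_ready" -gt 0 ]; then
        echo "ERROR: $not_ready node(s) are not Ready."
        return 1
    fi
    echo "All nodes are Ready."
}

# Show the configured max surge for each node pool (empty when not set explicitly).
list_max_surge() {
    local rg="$1" cluster="$2"
    az aks nodepool list --resource-group "$rg" --cluster-name "$cluster" \
        --query "[].{Pool:name, MaxSurge:upgradeSettings.maxSurge}" --output table
}&lt;/LI-CODE&gt;
&lt;P&gt;A caller could run require_ready_nodes before the upgrade step and exit when it returns a non-zero status, then call list_max_surge with the resource group and cluster name to review surge settings.&lt;/P&gt;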
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;Additional considerations&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Rollback mechanism note: &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;The rollback feature in this script is designed to simulate a recovery process by re-upgrading the cluster to its previous version. Please note that AKS does not officially support downgrades, so this rollback is not a true downgrade in the traditional sense. It is a best-effort approach that relies on having a known, working previous version and should only be used as a last resort. Ensure you have proper backups and a comprehensive recovery strategy in place before relying on this functionality.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;Final thoughts&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Microsoft’s official AKS upgrade features are powerful and designed to simplify the process—from automated cluster and node image upgrades to orchestrated updates across multiple clusters using Fleet. For most users, these built-in capabilities offer the most reliable and supported approach.&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;That said, if you’re a user with unique requirements, exploring a custom solution can provide granular control over every step of the process. This custom CLI script is provided as a proof-of-concept to inspire those who wish to tailor the upgrade process to their specific needs—but always remember, &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;use it at your own risk&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;. It is experimental, not production-ready, and is not fully supported by Microsoft.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 28 Mar 2025 12:54:30 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/embracing-aks-built-in-upgrade-features-and-exploring-custom/ba-p/4398230</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-03-28T12:54:30Z</dc:date>
    </item>
    <item>
      <title>From zero to hero: Mastering storage in Azure Kubernetes Service (AKS)</title>
      <link>https://techcommunity.microsoft.com/t5/startups-at-microsoft/from-zero-to-hero-mastering-storage-in-azure-kubernetes-service/ba-p/4397734</link>
      <description>&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;In today’s cloud-native era, managing persistent storage is a critical component of any production-grade Kubernetes deployment. Azure Kubernetes Service (AKS) provides a robust and scalable platform for orchestrating containerized applications. However, the challenge lies in selecting and configuring the right storage solution for stateful applications such as databases and content management systems.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This guide is designed to take you from a storage novice to an expert in AKS storage management. You’ll learn essential storage concepts, evaluate various storage classes, and implement best practices tailored for AKS. To make these concepts tangible, we’ll walk you through hands-on labs where we deploy a sample WordPress and MySQL application in a dedicated namespace. In the process, you’ll see how to provision dynamic persistent volumes, benchmark performance using fio, and set up Velero for comprehensive backup and disaster recovery. Whether you’re looking to optimize performance or ensure data resilience, this guide has you covered.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-ccp-props="{}"&gt;1. Understanding storage in AKS&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;1.1 Why storage matters in Kubernetes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Kubernetes pods are ephemeral and may be recreated at any time. For stateful applications—such as databases or content management systems—data persistence is essential. AKS supports several storage backends, letting you choose the best option for your workload. Whether you need high-performance storage for databases or shared file systems for applications, understanding these options ensures your data remains durable and accessible.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;1.2 Storage options overview&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-style-solid" border="1" style="width: 100%; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Storage Type&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Persistence&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Performance&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Use Case&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Ephemeral OS Disk&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;No&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;High&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Stateless apps, caching&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure Managed Disks&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Yes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;High&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Databases, stateful applications&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure Files&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Yes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Medium&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Shared file systems, SMB/NFS storage&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure NetApp Files&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Yes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Very High&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Enterprise-grade workloads&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure Blob Storage&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Yes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Variable&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Data lakes, backup, archiving&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Azure Container Storage&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Yes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Variable; optimized for container use&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Container-native block storage for databases/streaming/caching/messaging/other generic stateful applications&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;P&gt;&lt;BR /&gt;&lt;EM&gt;Consider both performance and cost. Premium SSD-backed disks (managed-csi-premium) offer high IOPS and low latency for databases, whereas Azure Files can be more cost-effective for shared access needs.&lt;/EM&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;2. Choosing the right storage class for AKS&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P aria-level="3"&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;What is a &lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;StorageClass&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt; and its relation to storage options?&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:281,&amp;quot;335559739&amp;quot;:281}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;A &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;StorageClass&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; is a Kubernetes resource that defines a "class" or policy for dynamically provisioning persistent storage. It specifies parameters—such as the type of disk (e.g., SSD, HDD), performance characteristics, replication policies, and other configuration options—used by the underlying storage provider.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;How it relates to storage options:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="10" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559683&amp;quot;:0,&amp;quot;335559684&amp;quot;:-2,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Storage options&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; refer to the various types of storage available in your environment, such as Azure Managed Disks, Azure Files, Azure Blob Storage, etc. Each storage option has distinct characteristics in terms of performance, cost, and use case.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="10" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559683&amp;quot;:0,&amp;quot;335559684&amp;quot;:-2,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;hybridMultilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;A &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;StorageClass&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; acts as a bridge between your Kubernetes applications and these underlying storage options. When you create a PersistentVolumeClaim (PVC), you reference a StorageClass, which instructs Kubernetes on how to dynamically provision a PersistentVolume (PV) based on the specified parameters and the chosen storage option.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;In essence, while storage options represent the physical or managed storage resources available, a StorageClass defines the rules for how that storage is allocated and managed within Kubernetes. This abstraction allows developers to simply request storage with a particular performance or cost profile, without needing to know the details of the underlying storage infrastructure.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;AKS offers several predefined storage classes tailored for different needs:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;managed-csi:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; Standard managed disk (SSD) for general-purpose workloads.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;managed-csi-premium:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; Premium SSD-backed disks for high-performance requirements.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;azurefile-csi:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; Azure Files with SMB/NFS support for shared file systems.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;azurefile-csi-premium:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; Premium Azure Files (via CSI) for workloads that require higher throughput and performance.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233118&amp;quot;:false,&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;azureblob-nfs-premium:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; Premium Azure Blob Storage using NFS for scenarios that require file-system access.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:0,&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;azureblob-fuse-premium:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; Premium Azure Blob Storage mounted using Blobfuse for specialized workloads.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;When selecting a storage class, consider your application’s IOPS, latency, and capacity needs. For instance, a database may require the low latency and high IOPS offered by premium SSDs (managed-csi-premium), while a file server or shared resource might be better served by Azure Files. Understanding these trade-offs is key to optimizing both performance and cost.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;SPAN data-contrast="auto"&gt;3. Hands-on labs&lt;/SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;In this section, we deploy MySQL and WordPress within a dedicated namespace called &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;wordpress&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;. This exercise demonstrates the practical usage of one AKS storage option (Azure Managed Disks) while teaching key storage concepts such as dynamic PV binding, PVC provisioning via StatefulSet, and automated database initialization.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Overview diagram&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Below is a diagram illustrating how MySQL and WordPress interact in AKS:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;img /&gt;&lt;BR /&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Lab 1: Setting ip the AKS cluster and deploying MySQL&lt;/SPAN&gt; &lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 1: Create an AKS cluster&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;First, create your resource group and AKS cluster:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az group create --name MyResourceGroup --location eastus 
az aks create                       \ 
  --resource-group MyResourceGroup  \ 
  --name MyAKSCluster               \ 
  --node-count 3                    \ 
  --enable-addons monitoring        \ 
  --generate-ssh-keys &lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;After the cluster is created, retrieve its credentials:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az aks get-credentials --resource-group MyResourceGroup --name MyAKSCluster &lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 2: Create the "wordpress" namespace&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Now that your cluster is ready, create a dedicated namespace for your WordPress and MySQL deployments:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl create namespace wordpress &lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the namespace:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get namespaces &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;(Optional step: Manual PVC creation for MySQL)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Note:&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt; When using a StatefulSet with a volumeClaimTemplate (as shown in Step 5), Kubernetes automatically creates a PVC for each pod (e.g., mysql-storage-mysql-0). This is the recommended approach for stateful applications. Manual PVC creation for MySQL is unnecessary and may lead to redundant resources.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;SPAN data-contrast="auto"&gt;This step is shown here only for demonstration and can be skipped.&lt;/SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: v1 
kind: PersistentVolumeClaim 
metadata: 
  name: mysql-pvc 
  namespace: wordpress 
spec: 
  accessModes: 
    - ReadWriteOnce 
  resources: 
    requests: 
      storage: 10Gi 
  storageClassName: managed-csi-premium 
EOF&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the PVC (if created):&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pvc mysql-pvc -n wordpress &lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 3: Create a headless service for MySQL&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;A headless service ensures proper DNS resolution for MySQL pods:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: v1 
kind: Service 
metadata: 
  name: mysql 
  namespace: wordpress 
spec: 
  clusterIP: None 
  selector: 
    app: mysql 
  ports: 
    - port: 3306 
      targetPort: 3306 
EOF&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;SPAN data-contrast="auto"&gt;Verify the service:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get svc -n wordpress mysql &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 4: Deploy MySQL with persistent storage (Including subPath &amp;amp; auto DB initialization)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Dynamically provisioned PVs (e.g., using ext4) include a default lost+found directory, which could cause MySQL to consider the data directory non-empty. We use the subPath option to mount a specific subdirectory (e.g., mysql-data) and avoid this issue. Additionally, we automate database creation so you don't have to log in manually.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Automate database initialization:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: v1 
kind: ConfigMap 
metadata: 
  name: mysql-initdb 
  namespace: wordpress 
data: 
  init-wordpress.sql: | 
    CREATE DATABASE IF NOT EXISTS wordpress; 
EOF&lt;/LI-CODE&gt;
&lt;P&gt;Verify the ConfigMap:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get configmaps -n wordpress &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 5: Deploy MySQL using a StatefulSet with a volumeClaimTemplate&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This approach automatically creates a unique PVC for the MySQL pod (e.g., mysql-storage-mysql-0), ensuring consistency and eliminating the need for a separate manual PVC.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: apps/v1 
kind: StatefulSet 
metadata: 
  name: mysql 
  namespace: wordpress 
spec: 
  serviceName: mysql 
  replicas: 1 
  selector: 
    matchLabels: 
      app: mysql 
  template: 
    metadata: 
      labels: 
        app: mysql 
    spec: 
      containers: 
      - name: mysql 
        image: mysql:5.7 
        env: 
        - name: MYSQL_ROOT_PASSWORD 
          value: "password123" 
        volumeMounts: 
        - name: mysql-storage 
          mountPath: /var/lib/mysql 
          subPath: mysql-data  # Mount only this subdirectory to avoid 'lost+found' 
        - name: initdb 
          mountPath: /docker-entrypoint-initdb.d 
      volumes: 
      - name: initdb 
        configMap: 
          name: mysql-initdb 
  volumeClaimTemplates: 
  - metadata: 
      name: mysql-storage 
    spec: 
      accessModes: [ "ReadWriteOnce" ] 
      resources: 
        requests: 
          storage: 10Gi 
      storageClassName: managed-csi-premium 
EOF&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;SPAN data-contrast="auto"&gt;Verify the StatefulSet:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get statefulset mysql -n wordpress &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the created pods:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pods -n wordpress -l app=mysql &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the PVCs created by the volumeClaimTemplate:&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pvc -n wordpress &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;And check its logs to confirm successful initialization (including automatic database creation):&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl logs -n wordpress mysql-0 &lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;BR /&gt;Lab 2: Deploying WordPress and connecting to MySQL&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;U&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;SPAN data-contrast="auto"&gt;Step 1: Create a separate PVC for WordPress files&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/U&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;SPAN data-contrast="auto"&gt;We use a separate PVC for WordPress files to keep application data distinct from database storage:&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: v1 
kind: PersistentVolumeClaim 
metadata: 
  name: wordpress-pvc 
  namespace: wordpress 
spec: 
  accessModes: 
    - ReadWriteOnce 
  resources: 
    requests: 
      storage: 10Gi 
  storageClassName: azurefile-csi 
EOF&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the PVC:&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pvc wordpress-pvc -n wordpress&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 2: Create secrets for WordPress credentials&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Store the required credentials securely:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: v1 
kind: Secret 
metadata: 
  name: wordpress-secrets 
  namespace: wordpress 
type: Opaque 
data: 
  WORDPRESS_DB_USER: cm9vdA==              # "root" in base64
  WORDPRESS_DB_PASSWORD: cGFzc3dvcmQxMjM=  # "password123" in base64
EOF&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the secrets:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 3: Deploy WordPress&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Deploy WordPress using a Deployment that connects to MySQL:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: apps/v1 
kind: Deployment 
metadata: 
  name: wordpress 
  namespace: wordpress 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: wordpress 
  template: 
    metadata: 
      labels: 
        app: wordpress 
    spec: 
      containers: 
      - name: wordpress 
        image: wordpress:latest 
        env: 
        - name: WORDPRESS_DB_HOST 
          value: mysql 
        - name: WORDPRESS_DB_USER 
          valueFrom: 
            secretKeyRef: 
              name: wordpress-secrets 
              key: WORDPRESS_DB_USER 
        - name: WORDPRESS_DB_PASSWORD 
          valueFrom: 
            secretKeyRef: 
              name: wordpress-secrets 
              key: WORDPRESS_DB_PASSWORD 
        volumeMounts: 
        - name: wp-storage 
          mountPath: /var/www/html 
      volumes: 
      - name: wp-storage 
        persistentVolumeClaim: 
          claimName: wordpress-pvc 
EOF&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the WordPress deployment:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get deployments wordpress -n wordpress &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the WordPress pods:&lt;/SPAN&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pods -n wordpress &lt;/LI-CODE&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 4: Expose WordPress externally&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Create a LoadBalancer service to expose WordPress externally:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cat &amp;lt;&amp;lt;EOF | kubectl apply -f - 
apiVersion: v1 
kind: Service 
metadata: 
  name: wordpress-service 
  namespace: wordpress 
spec: 
  selector: 
    app: wordpress 
  ports: 
    - protocol: TCP 
      port: 80 
      targetPort: 80 
  type: LoadBalancer 
EOF
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Verify the service:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get svc wordpress-service -n wordpress &lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;To extract just the external IP, run:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get svc wordpress-service -n wordpress \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}{"\n"}'&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Open the external IP in your browser to complete the WordPress setup. With our automated initialization, the wordpress database is created automatically in MySQL, so no manual intervention is required.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="auto"&gt;4. Performance benchmarking for storage classes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;SPAN data-contrast="auto"&gt;To measure IOPS and throughput, deploy a temporary pod with fio:&lt;/SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl run storage-test -n wordpress --rm -it --image=debian -- bash&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Inside the pod, run:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y fio 
fio --name=randwrite --ioengine=libaio --rw=randwrite --bs=4k --size=1G --numjobs=4 --runtime=60 --group_reporting 
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Example fio output explanation&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Interpretation:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;1. IOPS (Input/Output Operations Per Second)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;IOPS = 8913&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;On average, this setup handled about 8,900 random 4K write operations per second. This is a solid indicator of how many small, random writes the storage can handle in parallel.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;2. Bandwidth (Throughput)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;BW = 34.8 MiB/s (≈36.5 MB/s)&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;Over the 60-second test, the system wrote roughly 34.8 MiB of data per second. This corresponds well to the IOPS figure when considering each write is 4K in size.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
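&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;As a quick sanity check, throughput should equal IOPS multiplied by the block size. The sketch below redoes that arithmetic with awk, using the figures reported above (8913 IOPS, 4 KiB blocks); the variable names are only illustrative.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;iops=8913      # IOPS reported by fio
block_kib=4    # block size from --bs=4k, in KiB
# MiB/s = IOPS * KiB per I/O / 1024
bw_mib=$(awk -v i="$iops" -v b="$block_kib" 'BEGIN { printf "%.1f", i * b / 1024 }')
# MB/s = IOPS * bytes per I/O / 1,000,000
bw_mb=$(awk -v i="$iops" -v b="$block_kib" 'BEGIN { printf "%.1f", i * b * 1024 / 1000000 }')
echo "$bw_mib MiB/s, $bw_mb MB/s"   # prints: 34.8 MiB/s, 36.5 MB/s&lt;/LI-CODE&gt;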
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;3. Latency&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;slat (submission latency): ~446 µs on average.&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;This is the time it takes fio to submit the I/O request to the kernel or I/O subsystem.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;clat (completion latency): ~1.59 µs on average (but with some large outliers).&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;This is the time from when the request is submitted to when it’s completed by the storage.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;lat (total latency): ~448 µs on average.&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;The overall latency (submission + completion) is under half a millisecond on average, which is decent for a cloud-based block storage scenario. However, there are a few high spikes, as seen by the maximum lat of ~123 ms.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;4. Percentiles&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;High-percentile (tail) latencies show how the slowest requests behave, which matters for understanding near-worst-case performance:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;99.00th = 9024 ns (~9 µs)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;99.90th = 26496 ns (~26 µs)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;99.95th = 64256 ns (~64 µs)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;99.99th = 692224 ns (~692 µs)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;Most I/Os complete within a few microseconds, but there are occasional outliers, from hundreds of microseconds at the 99.99th percentile up to the ~123 ms maximum total latency. This can happen in bursty or cloud storage environments.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;5. CPU Usage&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;usr=0.48%, sys=1.43%&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;The CPU overhead is relatively low, indicating that the storage performance (rather than CPU resources) is the primary bottleneck.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;6. IO Depth&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="o" data-font="Courier New" data-listid="8" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:1440,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Courier New&amp;quot;,&amp;quot;469769242&amp;quot;:[9675],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;o&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="2"&gt;&lt;SPAN data-contrast="auto"&gt;iodepth=1 for each job (4 jobs total).&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;This means fio is issuing only one I/O request at a time per job. The results might differ if you increase the iodepth to allow more in-flight requests.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
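&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;With only one request in flight per job, latency caps the request rate: IOPS is roughly the number of jobs divided by the mean total latency. A minimal check of that relationship, using the numbers from this run (4 jobs, ~448 µs mean latency):&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;jobs=4         # --numjobs=4, each with iodepth=1
lat_us=448     # mean total latency in microseconds
# IOPS ~= in-flight requests / mean latency (converted to seconds)
est_iops=$(awk -v j="$jobs" -v l="$lat_us" 'BEGIN { printf "%.0f", j * 1000000 / l }')
echo "estimated IOPS: $est_iops"   # prints: estimated IOPS: 8929&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The estimate (~8,929) is close to the measured 8,913, which is consistent with the run being latency-bound at this queue depth.&lt;/SPAN&gt;&lt;/P&gt;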
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;7. Summary&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Overall IOPS: ~8,900&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Throughput: ~34.8 MiB/s&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Average Latency: ~448 µs, with occasional spikes&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;CPU usage is low, suggesting storage is the limiting factor rather than compute.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;These metrics suggest that the AKS storage (backed by Azure Managed Disks in this scenario) can handle ~8,900 random 4K writes per second at an average latency of under half a millisecond, which is respectable performance for many stateful applications. Use these values as a baseline to compare against your workload's expected performance.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;For instance, if you plan to run a high-transaction database, you may need higher IOPS and lower latencies; in that case, consider a higher-performance storage class (e.g., Azure Ultra Disk), increasing replication, or tuning application-level caching and concurrency. Conversely, if your workload is less I/O intensive, these results suggest that your current storage configuration is sufficient.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
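&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;To see how the volume behaves with more requests in flight, you could rerun the test with a deeper queue. This is a sketch rather than a tuned benchmark: the job name and the queue depth of 32 are arbitrary choices, and --direct=1 (which bypasses the page cache so that libaio can queue requests asynchronously) assumes the underlying filesystem supports direct I/O.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Same random-write test as above, but with 32 in-flight requests per job
fio --name=randwrite-qd32 --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=4 --iodepth=32 --runtime=60 --time_based --group_reporting&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Compare the IOPS and latency percentiles with the iodepth=1 run; deeper queues usually raise IOPS at the cost of higher per-request latency.&lt;/SPAN&gt;&lt;/P&gt;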
&lt;H3&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;SPAN data-contrast="auto"&gt;5. Backup &amp;amp; Disaster Recovery&lt;/SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Reliable backup and disaster recovery are critical for production systems. Here, we cover two methods: using Velero and backing up MySQL with database dumps.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;5.1 Installing Velero for cluster backups&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 1: Install the Velero CLI&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Download the Velero CLI from the &lt;/SPAN&gt;&lt;A href="https://github.com/vmware-tanzu/velero/releases" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Velero&lt;/SPAN&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt; releases page&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;. For Linux:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;wget https://github.com/vmware-tanzu/velero/releases/download/v1.9.3/velero-v1.9.3-linux-amd64.tar.gz 
tar -xzvf velero-v1.9.3-linux-amd64.tar.gz 
sudo mv velero-v1.9.3-linux-amd64/velero /usr/local/bin/velero 
velero version&lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 2: Retrieve your Subscription ID&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Run:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az account show --query id --output tsv &lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This command returns your subscription ID.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 3: Create a service principal for Velero&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Replace &amp;lt;your-subscription-id&amp;gt; with your subscription ID:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az ad sp create-for-rbac --name VeleroSP --role Contributor --scopes /subscriptions/&amp;lt;your-subscription-id&amp;gt; &lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;SPAN data-contrast="auto"&gt;Sample output:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;{ 
  "appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", 
  "displayName": "VeleroSP", 
  "password": "your-generated-password", 
  "tenant": "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy" 
} &lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Verify the service principal:&lt;/SPAN&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az ad sp list --display-name VeleroSP &lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 4: Create the Velero credentials file&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Create a file named credentials-velero with the following content (replace placeholders with actual values):&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;AZURE_SUBSCRIPTION_ID=&amp;lt;your-subscription-id&amp;gt; 
AZURE_TENANT_ID=&amp;lt;your-tenant-id&amp;gt; 
AZURE_CLIENT_ID=&amp;lt;your-appId&amp;gt; 
AZURE_CLIENT_SECRET=&amp;lt;your-password&amp;gt;&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 5: Install Velero&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
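&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Run the following command, replacing the bucket (the blob container name), resource group, and storage account with your own values. Use a plugin version that is compatible with your Velero version; the plugin's README lists the compatible pairs:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;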
&lt;LI-CODE lang="bash"&gt;velero install                                              \
  --provider azure                                          \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.1 \
  --bucket my-backup-bucket                                 \
  --secret-file ./credentials-velero                        \
  --use-volume-snapshots=true                               \
  --backup-location-config resourceGroup=myResourceGroup,storageAccount=myStorageAccount &lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The output will show the creation of CRDs, namespace, service account, and deployment. When finished, it will state:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status. &lt;/LI-CODE&gt;
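&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;With Velero running, a first backup and a test restore of the wordpress namespace could look like the sketch below. The backup name wordpress-backup is an arbitrary example.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Back up everything in the wordpress namespace
velero backup create wordpress-backup --include-namespaces wordpress
# Check the backup's status and contents
velero backup describe wordpress-backup
# Restore from that backup (for example, after deleting the namespace)
velero restore create --from-backup wordpress-backup&lt;/LI-CODE&gt;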
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;5.2 Backing up MySQL using database dumps&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 1: Get the MySQL pod name&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Retrieve the MySQL pod name with:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pods -n wordpress -l app=mysql -o jsonpath='{.items[0].metadata.name}' &lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 2: Create a MySQL dump&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
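&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Replace &amp;lt;mysql-pod-name&amp;gt; with the actual name (e.g., mysql-0). The command uses -i without -t so that no terminal control characters or prompt text end up in the dump file; type the MySQL root password when prompted (it may be echoed, because no TTY is allocated):&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl exec -i &amp;lt;mysql-pod-name&amp;gt; -n wordpress -- mysqldump -u root -p wordpress &amp;gt; wordpress-backup.sql&lt;/LI-CODE&gt;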
&lt;P&gt;&lt;U&gt;&lt;SPAN data-contrast="auto"&gt;Step 3: Restore MySQL from the dump&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/U&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Copy the dump into the pod, then run the redirect inside the pod's shell so it reads the pod's copy
kubectl cp wordpress-backup.sql &amp;lt;mysql-pod-name&amp;gt;:/tmp/wordpress-backup.sql -n wordpress
kubectl exec -it &amp;lt;mysql-pod-name&amp;gt; -n wordpress -- sh -c 'mysql -u root -p wordpress &amp;lt; /tmp/wordpress-backup.sql'
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-ccp-props="{&amp;quot;335559739&amp;quot;:0}"&gt;&amp;nbsp;6. &lt;/SPAN&gt;&lt;SPAN style="font-family: var(--lia-blog-font-family); background-color: var(--lia-rte-bg-color); color: var(--lia-bs-body-color); font-size: var(--lia-bs-font-size-base); font-style: var(--lia-font-style-base);"&gt;Troubleshooting &amp;amp; Best Practices&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="6" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="1" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Stuck PVCs:&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;If a PVC remains stuck during deletion, inspect for finalizers and remove them:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI-CODE lang="bash"&gt;kubectl get pvc mysql-pvc -n wordpress -o jsonpath='{.metadata.finalizers}'
kubectl patch pvc mysql-pvc -n wordpress --type=merge -p '{"metadata":{"finalizers":[]}}'&lt;/LI-CODE&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="6" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="2" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;MySQL initialization errors:&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;The error regarding a non-empty data directory is usually due to the default lost+found folder on ext4 file systems. The subPath fix ensures MySQL only sees an empty subdirectory.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="6" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="3" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;PVC pending state:&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;When a PVC is created, it may initially show as Pending while the dynamic storage provisioner creates and binds an underlying PV. This typically takes only a few moments.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
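&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;If a PVC stays Pending for longer than expected, its events usually explain why. Note that storage classes using the WaitForFirstConsumer volume binding mode keep a PVC Pending until a pod that uses it is scheduled. A minimal way to check, with &amp;lt;pvc-name&amp;gt; as a placeholder:&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Show the PVC's status, storage class, and recent events
kubectl describe pvc &amp;lt;pvc-name&amp;gt; -n wordpress
# List namespace events in chronological order
kubectl get events -n wordpress --sort-by=.lastTimestamp&lt;/LI-CODE&gt;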
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="6" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="4" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;DNS resolution &amp;amp; network connectivity:&lt;/SPAN&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN data-contrast="auto"&gt;Verify that WordPress resolves the MySQL service correctly by running:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI-CODE lang="bash"&gt;kubectl run debug --rm -it --image=busybox -n wordpress -- nslookup mysql &lt;/LI-CODE&gt;
&lt;UL&gt;
&lt;LI data-leveltext="" data-font="Symbol" data-listid="6" data-list-defn-props="{&amp;quot;335552541&amp;quot;:1,&amp;quot;335559685&amp;quot;:720,&amp;quot;335559991&amp;quot;:360,&amp;quot;469769226&amp;quot;:&amp;quot;Symbol&amp;quot;,&amp;quot;469769242&amp;quot;:[8226],&amp;quot;469777803&amp;quot;:&amp;quot;left&amp;quot;,&amp;quot;469777804&amp;quot;:&amp;quot;&amp;quot;,&amp;quot;469777815&amp;quot;:&amp;quot;multilevel&amp;quot;}" aria-setsize="-1" data-aria-posinset="5" data-aria-level="1"&gt;&lt;SPAN data-contrast="auto"&gt;Credentials &amp;amp; Environment Variables:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;Ensure that the environment variables in your WordPress deployment match those required by MySQL.&lt;/LI&gt;
&lt;/UL&gt;
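&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;One way to compare the two sides is to print the non-secret values each container actually received. This sketch assumes the deployment name used earlier in this post (wordpress) and that the MySQL container was configured through the official image's MYSQL_USER and MYSQL_DATABASE variables; adjust the names if your manifests differ.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Values as seen inside the running WordPress container
kubectl exec deploy/wordpress -n wordpress -- printenv WORDPRESS_DB_HOST WORDPRESS_DB_USER
# Values as seen inside the MySQL container (assumed variable names)
kubectl exec &amp;lt;mysql-pod-name&amp;gt; -n wordpress -- printenv MYSQL_USER MYSQL_DATABASE&lt;/LI-CODE&gt;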
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN style="font-family: var(--lia-blog-font-family); background-color: var(--lia-rte-bg-color); color: var(--lia-bs-body-color); font-size: var(--lia-bs-font-size-base); font-style: var(--lia-font-style-base);"&gt;7. Final Thoughts and Next Steps&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This blog post focused on mastering storage in AKS by exploring various storage options and best practices. The deployment of WordPress and MySQL served as an exercise to demonstrate how to implement one AKS storage option, Azure Managed Disks, while teaching key storage concepts such as PVC provisioning via a StatefulSet (using volumeClaimTemplates), dynamic PV binding, and automated database initialization using subPath. We also covered performance benchmarking using fio and provided a step-by-step guide to installing Velero for backup and disaster recovery.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;References about AKS Storage:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/aks/concepts-storage" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Storage options for applications in AKS&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/architecture/aws-professional/eks-to-aks/storage#aks-storage-options" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;AKS storage options&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://techcommunity.microsoft.com/blog/coreinfrastructureandsecurityblog/field-tips-for-aks-storage-provisioning/3761105" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Field tips for AKS storage provisioning&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/samples/azure-samples/aks-ephemeral-os-disk/aks-ephemeral-os-disk/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Everything you want to know about ephemeral OS disks and AKS&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;Next Steps:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Experiment with advanced scaling and monitoring for both MySQL and WordPress.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Implement additional security measures for production deployments.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Explore other AKS storage options (like Azure Files or Blob Storage) and benchmark their performance based on your specific workload requirements.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Happy deploying, and enjoy mastering storage in AKS!&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/DIV&gt;</description>
      <pubDate>Wed, 16 Apr 2025 15:48:09 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/startups-at-microsoft/from-zero-to-hero-mastering-storage-in-azure-kubernetes-service/ba-p/4397734</guid>
      <dc:creator>rmmartins</dc:creator>
      <dc:date>2025-04-16T15:48:09Z</dc:date>
    </item>
  </channel>
</rss>

