Proactive Reliability Series — Article 1: Fault Types in Azure
Welcome to the Proactive Reliability Series — a collection of articles dedicated to raising awareness about the importance of designing, implementing, and operating reliable solutions in Azure. Each article will focus on a specific area of reliability engineering: from identifying critical flows and setting reliability targets, to designing for redundancy, testing strategies, and disaster recovery. This series draws its foundation from the Reliability pillar of the Azure Well-Architected Framework, Microsoft's authoritative guidance for building workloads that are resilient to malfunction and capable of returning to a fully functioning state after a failure occurs.

In the cloud, failures are not a matter of if but when. Whether it is a regional outage, an availability zone going dark, a misconfigured resource, or a downstream service experiencing degradation — your workload will eventually face adverse conditions. The difference between a minor blip and a major incident often comes down to how deliberately you have planned for failure. In this first article, we start with one of the most foundational practices: Fault Mode Analysis (FMA) — and the question that underpins it: what kinds of faults can actually happen in Azure?

Disclaimer: The views expressed in this article are my own and do not represent the views or positions of Microsoft. This article is written in a personal capacity and has not been reviewed, endorsed, or approved by Microsoft.

Why Fault Mode Analysis Matters

Fault Mode Analysis is the practice of systematically identifying potential points of failure within your workload and its associated flows, and then planning mitigation actions accordingly. A key tenet of FMA is that in any distributed system, failures can occur regardless of how many layers of resiliency are applied. More complex environments are simply exposed to more types of failures.
Given this reality, FMA allows you to design your workload to withstand most types of failures and recover gracefully within defined recovery objectives. If you skip FMA altogether, or perform an incomplete analysis, your workload is at risk of unpredicted behavior and potential outages caused by suboptimal design. But to perform FMA effectively, you first need to understand what kinds of faults can actually occur in Azure infrastructure — and that is where most teams hit a gap.

Sample "Azure Fault Type" Taxonomy

Azure infrastructure is complex and distributed, and while Microsoft invests heavily in reliability, faults can and do occur. These faults can range from large-scale global service outages to localized issues affecting a single VM. The following is a sample taxonomy of common Azure infrastructure fault types, categorized by their characteristics, likelihood, and mitigation strategies. The taxonomy is organized from a customer impact perspective — focusing on how fault types affect customer workloads and what mitigation options are available — rather than from an internal Azure engineering perspective.

Some of these "faults" may not even be caused by an actual failure in Azure infrastructure. They can be caused by a lack of understanding of designed Azure service behaviors (e.g., underestimating the impact of Azure planned maintenance) or by Azure platform design decisions (e.g., capacity constraints). However, from a customer perspective, they all represent potential failure modes that need to be considered and mitigated when designing for reliability. The following table presents infrastructure fault types from a customer impact perspective.

Disclaimer: This is an unofficial taxonomy sample of Azure infrastructure fault types. It is not an official Microsoft publication and is not officially supported, endorsed, or maintained by Microsoft.
The fault type definitions, likelihood assessments, and mitigation recommendations are based on publicly available Azure documentation and general cloud architecture best practices, but may not reflect the most current Azure platform behavior. Always refer to official Azure documentation and Azure Service Health for authoritative guidance. The "Likelihood" values below are relative planning heuristics intended to help prioritize resilience investments. They are not statistical probabilities, do not represent Azure SLA commitments, and are not derived from official Azure reliability data.

| Fault Type | Blast Radius | Likelihood | Mitigation / Redundancy Level Requirements |
| --- | --- | --- | --- |
| Service Fault (Global) | Worldwide or multiple regions | Very Low | High |
| Service Fault (Region) | Single service in region | Medium | Region redundancy |
| Region Fault | Single region | Very Low | Region redundancy |
| Partial Region Fault | Multiple services in a single region | Low | Region redundancy |
| Availability Zone Fault | Single AZ within region | Low | Availability zone redundancy |
| Single Resource Fault | Single VM/instance | High | Resource redundancy |
| Platform Maintenance Fault | Variable (resource to region) | High | Resource redundancy, maintenance schedules |
| Region Capacity Constraint Fault | Single region | Low | Region redundancy, capacity reservations |
| Network POP Location Fault | Network hardware colocation site | Low | Site redundancy |

In future articles we will examine each of these fault types in detail. For this first article, let's take a closer look at one that is often underestimated: the Partial Region Fault.

Deep Dive: "Partial Region Fault"

A Partial Region Fault is a fault affecting multiple Azure services within a single region simultaneously, typically due to shared regional infrastructure dependencies, regional network issues, or regional platform incidents. Sometimes, the number of affected services may be significant enough to resemble a full region outage — but the key distinction is that it is not a complete loss of the region.
Some services may continue to operate normally, while others experience degradation or unavailability. Unlike a region outage caused by a natural disaster, in the documented cases referenced later in this article, such "Partial Region Faults" have historically been resolved within hours.

| Attribute | Description |
| --- | --- |
| Blast Radius | Multiple services within a single region |
| Likelihood | Low |
| Typical Duration | Minutes to hours |
| Fault Tolerance Options | Multi-region architecture; cross-region failover |
| Fault Tolerance Cost | High |
| Impact | Severe |
| Typical Cause | Regional networking infrastructure failure affecting multiple services; regional storage subsystem degradation impacting dependent services; regional control plane issues affecting service management |

These faults are rare, but they can happen — and when they do, they can have a severe impact on customer solutions that are not architected for multi-region resilience. What makes Partial Region Faults particularly dangerous is that they fall into a blind spot in most teams' resilience planning. When organizations think about regional failures, they tend to think in binary terms: either a region is up or it is down. Disaster recovery runbooks are written around the idea of a full region outage — triggered by a natural disaster or a catastrophic infrastructure event — where the response is to fail over everything to a secondary region.

But a Partial Region Fault is not a full region outage. It is something more insidious. A subset of services in the region degrades or becomes unavailable while others continue to function normally. Your VMs might still be running, but the networking layer that connects them is broken. Your compute is fine, but Azure Resource Manager — the control plane through which you manage everything — is unreachable. This partial nature creates several problems that teams rarely plan for:

Failover logic may not trigger. Most automated failover mechanisms are designed to detect a complete loss of connectivity to a region.
When only some services are affected, health probes may still pass, traffic managers may still route requests to the degraded region, and your failover automation may sit idle — while your users are already experiencing errors.

Recovery is more complex. With a full region outage, the playbook is straightforward: fail over to the secondary region. With a partial fault, you may need to selectively fail over some services while others remain in the primary region — a scenario that few teams have tested and most architectures do not support gracefully.

The real-world examples below illustrate this clearly. In each case, a shared infrastructure dependency — regional networking, Managed Identities, or Azure Resource Manager — experienced an issue that cascaded into a multi-service fault lasting hours. None of these were full region outages, yet the scope and duration of affected services was significant in each case.

Switzerland North — Network Connectivity Impact (BT6W-FX0)

A platform issue resulted in an impact to customers in Switzerland North who may have experienced service availability issues for resources hosted in the region.

| Attribute | Value |
| --- | --- |
| Date | September 26–27, 2025 |
| Region | Switzerland North |
| Time Window | 23:54 UTC on 26 Sep – 21:59 UTC on 27 Sep 2025 |
| Total Duration | ~22 hours |
| Services Impacted | Multiple (network-dependent services in the region) |

According to the official Post Incident Review (PIR) published by Microsoft on Azure Status History, a platform issue caused network connectivity degradation affecting multiple network-dependent services across the Switzerland North region, with impact lasting approximately 22 hours. The full root cause analysis, timeline, and remediation steps are documented in the linked PIR below.
🔗 View PIR on Azure Status History

East US and West US — Managed Identities and Dependent Services (_M5B-9RZ)

A platform issue with the Managed Identities for Azure resources service impacted customers trying to create, update, or delete Azure resources, or acquire Managed Identity tokens, in the East US and West US regions.

| Attribute | Value |
| --- | --- |
| Date | February 3, 2026 |
| Regions | East US, West US |
| Time Window | 00:10 UTC – 06:05 UTC on 03 February 2026 |
| Total Duration | ~6 hours |
| Services Impacted | Managed Identities + dependent services (resource create/update/delete, token acquisition) |

🔗 View PIR on Azure Status History

Azure Government — Azure Resource Manager Failures (ML7_-DWG)

Customers using any Azure Government region experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations through the Azure portal, Azure REST APIs, Azure PowerShell, and the Azure CLI.

| Attribute | Value |
| --- | --- |
| Date | December 8, 2025 |
| Regions | Azure Government (all regions) |
| Time Window | 11:04 EST (16:04 UTC) – 14:13 EST (19:13 UTC) |
| Total Duration | ~3 hours |
| Services Impacted | 20+ services (ARM and all ARM-dependent services) |

🔗 View PIR on Azure Status History

Wrapping Up

Designing resilient Azure solutions requires understanding the full spectrum of potential infrastructure faults. The Partial Region Fault is just one of many fault types you should account for during your Fault Mode Analysis — but it is a powerful reminder that even within a single region, shared infrastructure dependencies can amplify a single failure into a multi-service outage. Use this taxonomy as a starting point for FMA when designing your Azure architecture. The area is continuously evolving as the Azure platform and industry evolve — watch the space and revisit your fault type analysis periodically. In the next article, we will continue exploring additional fault types from the taxonomy. Stay tuned.
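As a practical starting point, the taxonomy above can also be kept in machine-readable form alongside FMA worksheets, so fault types can be sorted and filtered consistently during reviews. The sketch below is illustrative only; the numeric rank mapping and field names are assumptions, not official guidance, and the "Likelihood" values remain planning heuristics rather than probabilities.

```python
from dataclasses import dataclass

# Relative planning heuristics from the taxonomy table above — not statistical probabilities.
LIKELIHOOD_RANK = {"Very Low": 1, "Low": 2, "Medium": 3, "High": 4}

@dataclass(frozen=True)
class FaultType:
    name: str
    blast_radius: str
    likelihood: str
    mitigation: str

TAXONOMY = [
    FaultType("Service Fault (Global)", "Worldwide or multiple regions", "Very Low", "High"),
    FaultType("Service Fault (Region)", "Single service in region", "Medium", "Region redundancy"),
    FaultType("Region Fault", "Single region", "Very Low", "Region redundancy"),
    FaultType("Partial Region Fault", "Multiple services in a single region", "Low", "Region redundancy"),
    FaultType("Availability Zone Fault", "Single AZ within region", "Low", "Availability zone redundancy"),
    FaultType("Single Resource Fault", "Single VM/instance", "High", "Resource redundancy"),
    FaultType("Platform Maintenance Fault", "Variable (resource to region)", "High", "Resource redundancy, maintenance schedules"),
    FaultType("Region Capacity Constraint Fault", "Single region", "Low", "Region redundancy, capacity reservations"),
    FaultType("Network POP Location Fault", "Network hardware colocation site", "Low", "Site redundancy"),
]

def review_order(taxonomy):
    """Order fault types for FMA review, most likely first."""
    return sorted(taxonomy, key=lambda f: LIKELIHOOD_RANK[f.likelihood], reverse=True)

for fault in review_order(TAXONOMY):
    print(f"{fault.likelihood:>9}  {fault.name} -> {fault.mitigation}")
```

A structure like this makes it easy to walk the list top-down in an FMA session and to record per-workload mitigation status against each fault type.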
Authors & Reviewers

Authored by Zoran Jovanovic, Cloud Solutions Architect at Microsoft. Peer review by Catalina Alupoaie, Cloud Solutions Architect at Microsoft. Peer review by Stefan Johner, Cloud Solutions Architect at Microsoft.

References

- Azure Well-Architected Framework — Reliability Pillar
- Failure Mode Analysis
- Shared Responsibility for Reliability
- Azure Availability Zones
- Business Continuity and Disaster Recovery
- Transient Fault Handling
- Azure Service Level Agreements
- Azure Reliability Guidance by Service
- Azure Status History

AWS to Azure Migration — From the Cloud Economics & FinOps Lens
“ROI fails when FinOps joins late.” That single pattern explains why many cloud migrations deliver technical success but financial disappointment. Workloads move. SLAs hold. Teams celebrate go‑live. Then the CFO asks: where are the savings we modeled? In most cases, FinOps was engaged after architecture decisions were locked, licenses were double‑paid, and governance debt had already accumulated.

This article frames AWS‑to‑Azure migration through a FinOps lens—not to chase immediate modernization, but to deliver defensible, incremental cost savings during and after migration, without increasing risk. Azure migration guidance consistently emphasizes a structured, phased approach—discover, migrate like‑for‑like, stabilize, then optimize. From a FinOps perspective, this sequencing is not conservative—it is economically rational:

- Like‑for‑like preserves performance baselines and business KPIs
- Cost comparisons remain apples‑to‑apples
- Optimization levers can be applied surgically, not blindly

The real value emerges in the first 90 days after migration, when cost signals stabilize and commitment‑based savings become safe to apply.

TL;DR: Cloud migrations miss ROI when FinOps joins late. AWS → Azure migrations deliver real savings when FinOps leads early, migrations stay like‑for‑like, and optimization is applied after costs stabilize. Azure enables this through four levers: AI‑assisted planning (Copilot + Azure Migrate), cheaper non‑prod with Dev/Test pricing, license reuse via Azure Hybrid Benefit, and low‑risk long‑term savings with Reservations—across compute and storage. The result: lower migration risk, controlled spend, and sustainable savings post‑move.

This article covers the top four FinOps levers in an AWS → Azure migration.

1. Azure Copilot Migration Agent + Azure Migrate

Azure Copilot Migration Agent (currently in public preview) is a planning‑focused, AI‑assisted experience built on Azure Migrate.
It analyzes inventory, readiness, landing zone requirements, and ROI before execution. You can interact with the Agent using natural language prompts to explore inventory, migration readiness, strategies, ROI considerations, and landing zone requirements. From a FinOps perspective, this directly translates into faster decision cycles and lower planning overhead. By simplifying and compressing activities that traditionally required weeks of manual analysis or external managed services support, organizations can reduce the cost of migration planning, accelerate business case creation, and bring cost and ROI discussions forward—before environments are deployed and financial commitments are made.

2. Azure Dev/Test Pricing

Azure Dev/Test pricing provides discounted rates for non‑production workloads for eligible subscriptions, significantly reducing dev and test environment costs (Azure Dev/Test pricing). You can save up to 57 percent for a typical web app dev/test environment running SQL Database and App Service. Unlike offerings from other cloud providers, this directly reduces environment sprawl costs, which often exceed production waste post‑migration. It also enables wave‑based migration by lowering the cost of parallel environments, allowing teams to migrate deliberately rather than under financial pressure.

3. Azure Hybrid Benefit

Azure Hybrid Benefit allows organizations to reuse existing Windows Server, SQL Server, and supported Linux subscriptions (RHEL and SLES) on Azure, reducing both migration and steady‑state run costs. It enables license portability across Azure services, helping organizations avoid repurchasing software licenses they already own and redirect savings toward innovation and modernization. During migration, Azure Hybrid Benefit is especially impactful because it addresses migration overlap costs.
The 180‑day migration allowance for Windows Server and SQL Server allows workloads to run on‑premises and in Azure simultaneously, supporting parallel validation, phased cutovers, and rollback readiness without double‑paying for licenses. For Linux, Azure Hybrid Benefit enables RHEL and SLES workloads to move to Azure without redeployment, ensuring continuity and avoiding downtime. From a FinOps perspective, this reduces one of the most underestimated migration cost drivers, delivering up to 76% savings versus pay‑as‑you‑go pricing for Linux and up to 29% versus leading cloud providers for SQL Server, while keeping migration timelines driven by readiness—not cost pressure.

4. Azure Reservations

Azure Reservations enable organizations to reduce costs by committing to one‑year or three‑year plans for eligible Azure services, receiving a billing discount that is automatically applied to matching resources. Reservations provide discounts of up to 72% compared to pay‑as‑you‑go pricing, do not affect the runtime state of workloads, and can be paid upfront or monthly with no difference in total cost. Importantly, Azure Reservations apply not only to compute and database services, but also to storage services such as Azure Blob Storage, Azure Data Lake Storage Gen2, and Azure Files (for storage capacity), which often represent a significant portion of enterprise cloud spend.

In the context of migration, Azure Reservations matter because they allow FinOps teams to optimize baseline costs across both compute and data layers once workloads stabilize. Unlike AWS, where commitment‑based discounts are largely compute‑centric and storage services such as Amazon S3 do not offer reservation‑style pricing, Azure enables long‑term cost optimization for persistent storage footprints that continue to grow post‑migration. Additionally, Azure Reservations offer greater flexibility—customers can modify, exchange, or cancel reservations through a self‑service program, subject to defined limits.
This is particularly valuable during wave‑based migrations, where workload shapes evolve over time. From a FinOps perspective, Azure Reservations allow organizations to commit to predictable savings with broader scope and lower risk, covering both infrastructure and data‑heavy workloads common in migration scenarios.

Successful migrations are no longer measured by workloads moved, but by cost control maintained and value unlocked. Azure’s FinOps‑aligned migration capabilities allow organizations to reduce risk first, optimize deliberately, and ensure that savings are sustained long after the last workload migrates.

Resiliency Patterns for Azure Front Door: Field Lessons
Abstract

Azure Front Door (AFD) sits at the edge of Microsoft’s global cloud, delivering secure, performant, and highly available applications to users worldwide. As adoption has grown—especially for mission‑critical workloads—the need for resilient application architectures that can tolerate rare but impactful platform incidents has become essential. This article summarizes key lessons from Azure Front Door incidents in October 2025, outlines how Microsoft is hardening the platform, and—most importantly—describes proven architectural patterns customers can adopt today to maintain business continuity when global load‑balancing services are unavailable.

Who this is for

This article is intended for:

- Cloud and solution architects designing mission‑critical internet‑facing workloads
- Platform and SRE teams responsible for high availability and disaster recovery
- Security architects evaluating WAF placement and failover trade‑offs
- Customers running revenue‑impacting workloads on Azure Front Door

Introduction

Azure Front Door (AFD) operates at massive global scale, serving secure, low‑latency traffic for Microsoft first‑party services and thousands of customer applications. Internally, Microsoft is investing heavily in tenant isolation, independent infrastructure resiliency, and active‑active service architectures to reduce blast radius and speed recovery. However, no global distributed system can completely eliminate risk. Customers hosting mission‑critical workloads on AFD should therefore design for the assumption that global routing services can become temporarily unavailable—and provide alternative routing paths as part of their architecture.

Resiliency options for mission‑critical workloads

The following patterns are in active use by customers today. Each represents a different trade‑off between cost, complexity, operational maturity, and availability.

1. No CDN with Application Gateway

Figure 1: Azure Front Door primary routing with DNS failover

When to use: Workloads without CDN caching requirements that prioritize predictable failover.

Architecture summary

- Azure Traffic Manager (ATM) runs in Always Serve mode to provide DNS‑level failover.
- Web Application Firewall (WAF) is implemented regionally using Azure Application Gateway.
- Application Gateway can be private, provided AFD Premium is used, and is the default path.
- DNS failover is available when AFD is not reachable.
- When failover is triggered, one of the steps is to switch the Application Gateway IP to public (ATM can route to public endpoints only).
- Switch back to the AFD route once AFD resumes service.

Pros

- DNS‑based failover away from the global load balancer
- Consistent WAF enforcement at the regional layer
- Application Gateways can remain private during normal operations

Cons

- Additional cost and reduced composite SLA from extra components
- Application Gateway must be made public during failover
- Active‑passive pattern requires regular testing to maintain confidence

2. Multi‑CDN for mission‑critical applications

Figure 2: Multi‑CDN architecture using Azure Front Door and Akamai with DNS‑based traffic steering

When to use: Mission‑critical applications with strict availability requirements and heavy CDN usage.

Architecture summary

- Dual CDN setup (for example, Azure Front Door + Akamai)
- Azure Traffic Manager in Always Serve mode
- Traffic split (for example, 90/10) to keep both CDN caches warm
- During failover, 100% of traffic is shifted to the secondary CDN
- Ensure origin servers can handle the load of extra hits (cache misses)

Pros

- Highest resilience against CDN‑specific or control‑plane outages
- Maintains cache readiness on both providers

Cons

- Expensive and operationally complex
- Requires origin capacity planning for cache‑miss surges
- Not suitable if applications rely on CDN‑specific advanced features

3. Multi‑layered CDN (sequential CDN architecture)

Figure 3: Sequential CDN architecture with Akamai as caching layer in front of Azure Front Door

When to use: Rare, niche scenarios where a layered CDN approach is acceptable. This is not a common approach, because the fronting CDN (Akamai) becomes a single entry point of failure. However, if AFD isn't available, you can update Akamai properties to route directly to origin servers.

Architecture summary

- Akamai used as the front caching layer
- Azure Front Door used as the L7 gateway and WAF
- During failover, Akamai routes traffic directly to origin services

Pros

- Direct fallback path to origins if AFD becomes unavailable
- Single caching layer in normal operation

Cons

- Fronting CDN remains a single point of failure
- Not generally recommended due to complexity
- Requires a well‑tested operational playbook

4. No CDN – Traffic Manager redirect to origin (with Application Gateway)

Figure 4: DNS‑based failover directly to origin via Application Gateway when Azure Front Door is unavailable

When to use: Applications that require L7 routing but no CDN caching.

Architecture summary

- Azure Front Door provides L7 routing and WAF
- Azure Traffic Manager enables DNS failover
- During an AFD outage, Traffic Manager routes directly to Application Gateway‑protected origins

Pros

- Alternative ingress path to origin services
- Consistent regional WAF enforcement

Cons

- Additional infrastructure cost
- Operational dependency on Traffic Manager configuration accuracy

5. No CDN – Traffic Manager redirect to origin (no Application Gateway)

Figure 5: Direct DNS failover to origin services without Application Gateway

When to use: Cost‑sensitive scenarios with clearly accepted security trade‑offs.
Architecture summary

- WAF implemented directly in Azure Front Door
- Traffic Manager provides DNS failover
- During an outage, traffic routes directly to origins

Pros

- Simplest architecture
- No Application Gateway in the primary path

Cons

- Risk of unscreened traffic during failover
- Failover operations can be complex if WAF consistency is required

Frequently asked questions

Is Azure Traffic Manager a single point of failure? No. Traffic Manager operates as a globally distributed service. For extreme resilience requirements, customers can combine Traffic Manager with a backup FQDN hosted in a separate DNS provider.

Should every workload implement these patterns? No. These patterns are intended for mission‑critical workloads where downtime has material business impact. Non‑critical applications do not require multi‑CDN or alternate routing paths.

What does Microsoft use internally? Microsoft uses a combination of active‑active regions, multi‑layered CDN patterns, and controlled fail‑away mechanisms, selected based on service criticality and performance requirements.

What happened in October 2025 (summary)

Two separate Azure Front Door incidents in October 2025 highlighted the importance of architectural resiliency:

- A control‑plane defect caused erroneous metadata propagation, impacting approximately 26% of global edge sites
- A later compatibility issue across control‑plane versions resulted in DNS resolution failures

Both incidents were mitigated through automated restarts, manual intervention, and controlled failovers. These events accelerated platform‑level hardening investments.
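A recurring theme in the patterns and incidents above is that failover automation keyed to "is the region fully down?" can sit idle during partial degradation. The sketch below shows one way to make the failover decision sensitive to per-dependency health rather than a single reachability probe. It is illustrative only: the dependency names and threshold are hypothetical, not an Azure API or recommended values.

```python
# Decide whether to fail away from an entry path based on per-dependency health,
# rather than a single "is everything reachable?" probe. Illustrative sketch only.

CRITICAL = {"ingress", "identity", "control-plane"}  # hypothetical dependency names

def should_fail_over(health: dict[str, bool], max_unhealthy_noncritical: int = 2) -> bool:
    """Fail over if any critical dependency is unhealthy, or if several
    non-critical dependencies degrade at once (a partial-fault signal)."""
    unhealthy = {name for name, ok in health.items() if not ok}
    if unhealthy & CRITICAL:
        return True
    return len(unhealthy) > max_unhealthy_noncritical

# A full outage and a partial fault both trigger failover here:
full_outage = {n: False for n in ("ingress", "identity", "control-plane", "cache", "queue")}
partial_fault = {"ingress": True, "identity": False, "control-plane": True, "cache": True, "queue": True}
healthy = {"ingress": True, "identity": True, "control-plane": True, "cache": False, "queue": True}

print(should_fail_over(full_outage))    # True
print(should_fail_over(partial_fault))  # True (identity is critical, even though ingress still answers)
print(should_fail_over(healthy))        # False
```

In practice such logic would be fed by synthetic probes against each dependency, and the resulting signal would drive a Traffic Manager or DNS weight change; the point is simply that the trigger condition must be richer than "region unreachable."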
How Azure Front Door is being hardened

Microsoft has already completed or initiated major improvements, including:

- Synchronous configuration processing before rollout
- Control‑plane and data‑plane isolation
- Reduced configuration propagation times
- Active‑active fail‑away for major first‑party services
- Microcell segmentation to reduce blast radius

These changes reinforce a core principle: no single tenant configuration should ever impact others, and recovery must be fast and predictable.

Key takeaways

- Global platforms can experience rare outages—architect for them
- Mission‑critical workloads should include alternate routing paths
- Multi‑CDN and DNS‑based failover patterns remain the most robust
- Resiliency is a business decision, not just a technical one

References

- Azure Front Door: Implementing lessons learned following October outages | Microsoft Community Hub
- Azure Front Door Resiliency Deep Dive and Architecting for Mission Critical — John Savill's deep dive into Azure Front Door resilience and options for mission-critical applications
- Global Routing Redundancy for Mission-Critical Web Applications — Azure Architecture Center | Microsoft Learn
- Architecture Best Practices for Azure Front Door — Microsoft Azure Well-Architected Framework | Microsoft Learn

How Azure NetApp Files Object REST API powers Azure and ISV Data and AI services – on YOUR data
This article introduces the Azure NetApp Files Object REST API, a transformative solution for enterprises seeking seamless, real-time integration between their data and Azure's advanced analytics and AI services. By enabling direct, secure access to enterprise data—without costly transfers or duplication—the Object REST API accelerates innovation, streamlines workflows, and enhances operational efficiency. With S3-compatible object storage support, it empowers organizations to make faster, data-driven decisions while maintaining compliance and data security. Discover how this new capability unlocks business potential and drives a new era of productivity in the cloud.

Azure Local LENS workbook—deep insights at scale, in minutes
Azure Local at scale needs fleet-level visibility

As Azure Local deployments grow from a handful of instances to hundreds (or even thousands), the operational questions change. You’re no longer troubleshooting a single environment—you’re looking for patterns across your entire fleet:

- Which sites are trending with a specific health issue?
- Where are workload deployments increasing over time, and do we have enough capacity available?
- Which clusters are outliers compared to the rest?

Today we’re sharing Azure Local LENS: a free, community-driven Azure Workbook designed to help you gain deep insights across a large Azure Local fleet—quickly and consistently—so you can move from reactive troubleshooting to proactive operations. Get the workbook and step-by-step instructions to deploy it here: https://aka.ms/AzureLocalLENS

Who is it for?

This workbook is especially useful if you manage or support:

- Large Azure Local fleets distributed across many sites (retail, manufacturing, branch offices, healthcare, etc.)
- Central operations teams that need standardized health/update views
- Architects who want to aggregate data to gain insights into cluster and workload deployment trends over time

What is Azure Local LENS?

The Azure Local - Lifecycle, Events & Notification Status (or LENS) workbook brings together the signals you need to understand your Azure Local estate through a fleet lens. Instead of jumping between individual resources, you can use a consistent set of views to compare instances, spot outliers, and drill into the focus areas that need attention.

- Fleet-first design: Start with an estate-wide view, then drill down to a specific site/cluster using the seven tabs in the workbook.
- Operational consistency: Standard dashboards help teams align on “what good looks like” across environments, update trends, health check results, and more.
- Actionable insights: Identify hotspots and trends early so you can prioritize remediation and plan health remediation, updates, and workload capacity with confidence.

What insights does it provide?

Azure Local LENS is built to help you answer the questions that matter at scale, such as:

- Fleet scale overview and connection status: How many Azure Local instances do you have, and what are their connection, health, and update status?
- Workload deployment trends: Where have you deployed Azure Local VMs and AKS Arc clusters, how many do you have in total, and are they connected and in a healthy state?
- Top issues to prioritize: What are the common signals across your estate that deserve operational focus, such as update health checks, extension failures, or Azure Resource Bridge connectivity issues?
- Updates: What is your overall update compliance status for Solution and SBE updates? What are the average, standard deviation, or 95th percentile update duration run times for your fleet?
- Drilldown workflow: After spotting an outlier, what does the instance-level view show, so you can act or link directly to the Azure portal for more actions and support?

Get started in minutes

If you are managing Azure Local instances, give Azure Local LENS a try and see how quickly a fleet-wide view can help with day-to-day management, helping to surface trends and actionable insights. The workbook is an open-source, community-driven project, accessed via a public GitHub repository that includes full step-by-step setup instructions at https://aka.ms/AzureLocalLENS. Most teams can deploy the workbook and start exploring insights in a matter of minutes (depending on your environment).

An example of the “Azure Local Instances” tab:

How teams are using fleet dashboards like LENS

- Weekly fleet review: Use a standard set of views to review top outliers and trend shifts, then assign follow-ups.
- Update planning: Identify clusters with system health check failures, and prioritize resolving the issues based on the frequency of each issue category.
- Update progress: Review clusters’ update status (InProgress, Failed, Success) and take action based on trends and insights from real-time data.
- Baseline validation: Spot clusters that consistently differ from the norm—this can be a sign of a configuration or environmental difference, such as network access, policies, operational procedures, or other factors.

Feedback and what’s next

This workbook is a community-driven, open-source project intended to be practical and easy to adopt. The project is not a Microsoft‑supported offering. If you encounter any issues, have feedback, or have a new feature request, please raise an Issue on the GitHub repository so we can track discussions, prioritize improvements, and keep updates transparent for everyone.

Author Bio

Neil Bird is a Principal Program Manager in the Azure Edge & Platform Engineering team at Microsoft. His background is in Azure and hybrid/sovereign cloud infrastructure, specialising in operational excellence and automation. He is passionate about helping customers deploy and manage cloud solutions successfully using Azure and Azure Edge technologies.

Reference Architecture for Highly Available Multi-Region Azure Kubernetes Service (AKS)
Introduction

Cloud-native applications often support critical business functions and are expected to stay available even when parts of the platform fail. Azure Kubernetes Service (AKS) already provides strong availability features within a single region, such as availability zones and a managed control plane. However, a regional outage is still a scenario that architects must plan for when running important workloads.

This article walks through a reference architecture for running AKS across multiple Azure regions. The focus is on availability and resilience, using practical patterns that help applications continue to operate during regional failures. It covers common design choices such as traffic routing, data replication, and operational setup, and explains the trade-offs that come with each approach. This content is intended for cloud architects, platform engineers, and Site Reliability Engineers (SREs) who design and operate Kubernetes platforms on Azure and need to make informed decisions about multi-region deployments.

Resilience Requirements and Design Principles

Before designing a multi-region Kubernetes platform, it is essential to define resilience objectives aligned with business requirements:
- Recovery Time Objective (RTO): Maximum acceptable downtime during a regional failure.
- Recovery Point Objective (RPO): Maximum acceptable data loss.
- Service-Level Objectives (SLOs): Availability targets for applications and platform services.

The architecture described in this article aligns with the Azure Well-Architected Framework Reliability pillar, emphasizing fault isolation, redundancy, and automated recovery.

Multi-Region AKS Architecture Overview

The reference architecture uses two independent AKS clusters deployed in separate Azure regions, such as West Europe and North Europe. Each region is treated as a separate deployment stamp, with its own networking, compute, and data resources.
This regional isolation helps reduce blast radius and allows each environment to be operated and scaled independently. Traffic is routed at a global level using Azure Front Door together with DNS. This setup provides a single public entry point for clients and enables traffic steering based on health checks, latency, or routing rules. If one region becomes unavailable, traffic can be automatically redirected to the healthy region.

Each region exposes applications through a regional ingress layer, such as Azure Application Gateway for Containers or an NGINX Ingress Controller. This keeps traffic management close to the workload and allows region-specific configuration when needed. Data services are deployed with geo-replication enabled to support multi-region access and recovery scenarios. Centralized monitoring and security tooling provides visibility across regions and helps operators detect, troubleshoot, and respond to failures consistently.

The main building blocks of the architecture are:
- Azure Front Door as the global entry point
- Azure DNS for name resolution
- An AKS cluster deployed in each region
- A regional ingress layer (Application Gateway for Containers or NGINX Ingress)
- Geo-replicated data services
- Centralized monitoring and security services

Deployment Patterns for Multi-Region AKS

There is no single “best” way to run AKS across multiple regions. The right deployment pattern depends on availability requirements, recovery objectives, operational maturity, and cost constraints. This section describes three common patterns used in multi-region AKS architectures and highlights the trade-offs associated with each one.

Active/Active Deployment Model

In an active/active deployment model, AKS clusters in multiple regions serve production traffic at the same time. Global traffic routing distributes requests across regions based on health checks, latency, or weighted rules.
If one region becomes unavailable, traffic is automatically shifted to the remaining healthy region. This model provides the highest level of availability and the lowest recovery time, but it requires careful handling of data consistency, state management, and operational coordination across regions.

| Capability | Pros | Cons |
| --- | --- | --- |
| Availability | Very high availability with no single active region | Requires all regions to be production-ready at all times |
| Failover behavior | Near-zero downtime when a region fails | More complex to test and validate failover scenarios |
| Data consistency | Supports read/write traffic in multiple regions | Requires strong data replication and conflict handling |
| Operational complexity | Enables full regional redundancy | Higher operational overhead and coordination |
| Cost | Maximizes resource utilization | Highest cost due to duplicated active resources |

Active/Passive Deployment Model

In an active/passive deployment model, one region serves all production traffic, while a second region remains on standby. The passive region is kept in sync but does not receive user traffic until a failover occurs. When the primary region becomes unavailable, traffic is redirected to the secondary region. This model reduces operational complexity compared to active/active and is often easier to operate, but it comes with longer recovery times and underutilized resources.

| Capability | Pros | Cons |
| --- | --- | --- |
| Availability | Protects against regional outages | Downtime during failover is likely |
| Failover behavior | Simpler failover logic | Higher RTO compared to active/active |
| Data consistency | Easier to manage a single write region | Requires careful promotion of the passive region |
| Operational complexity | Easier to operate and test | Manual or semi-automated failover processes |
| Cost | Lower cost than active/active | Standby resources are mostly idle |

Deployment Stamps and Isolation

Deployment stamps are a design approach rather than a traffic pattern.
Each region is deployed as a fully isolated unit, or stamp, with its own AKS cluster, networking, and supporting services. Stamps can be used with both active/active and active/passive models. The goal of deployment stamps is to limit blast radius, enable independent lifecycle management, and reduce the risk of cross-region dependencies.

| Capability | Pros | Cons |
| --- | --- | --- |
| Availability | Limits impact of regional or platform failures | Requires duplication of platform components |
| Failover behavior | Enables clean and predictable failover | Failover logic must be implemented at higher layers |
| Data consistency | Encourages clear data ownership boundaries | Data replication can be more complex |
| Operational complexity | Simplifies troubleshooting and isolation | More environments to manage |
| Cost | Supports targeted scaling per region | Increased cost due to duplicated infrastructure |

Global Traffic Routing and Failover

In a multi-region setup, global traffic routing is responsible for sending users to the right region and keeping the application reachable when a region becomes unavailable. In this architecture, Azure Front Door acts as the global entry point for all incoming traffic. Azure Front Door provides a single public endpoint that uses Anycast routing to direct users to the closest available region. TLS termination and Web Application Firewall (WAF) capabilities are handled at the edge, reducing latency and protecting regional ingress components from unwanted traffic. Front Door also performs health checks against regional endpoints and automatically stops sending traffic to a region that is unhealthy.

DNS plays a supporting role in this design. Azure DNS or Traffic Manager can be used to define geo-based or priority-based routing policies and to control how traffic is initially directed to Front Door. Health probes continuously monitor regional endpoints, and routing decisions are updated when failures are detected. When a regional outage occurs, unhealthy endpoints are removed from rotation.
Traffic is then routed to the remaining healthy region without requiring application changes or manual intervention. This allows the platform to recover quickly from regional failures and minimizes impact to users.

Choosing Between Azure Traffic Manager and Azure DNS

Both Azure Traffic Manager and Azure DNS can be used for global traffic routing, but they solve slightly different problems. The choice depends mainly on how fast you need to react to failures and how much control you want over traffic behavior.

| Capability | Azure Traffic Manager | Azure DNS |
| --- | --- | --- |
| Routing mechanism | DNS-based with built-in health probes | DNS-based only |
| Health checks | Native endpoint health probing | No native health checks |
| Failover speed (RTO) | Low RTO (typically seconds to < 1 minute) | Higher RTO (depends on DNS TTL, often minutes) |
| Traffic steering options | Priority, weighted, performance, geographic | Basic DNS records |
| Control during outages | Automatic endpoint removal | Relies on DNS cache expiration |
| Operational complexity | Slightly higher | Very low |
| Typical use cases | Mission-critical workloads | Simpler or cost-sensitive scenarios |

Data and State Management Across Regions

Kubernetes platforms are usually designed to be stateless, which makes scaling and recovery much easier. In practice, most enterprise applications still depend on stateful services such as databases, caches, and file storage. When running across multiple regions, handling this state correctly becomes one of the hardest parts of the architecture.

The general approach is to keep application components stateless inside the AKS clusters and rely on Azure managed services for data persistence and replication. These services handle most of the complexity involved in synchronizing data across regions and provide well-defined recovery behaviors during failures. Common patterns include using Azure SQL Database with active geo-replication or failover groups for relational workloads.
This allows a secondary region to take over when the primary region becomes unavailable, with controlled failover and predictable recovery behavior. For globally distributed applications, Azure Cosmos DB provides built-in multi-region replication with configurable consistency levels. This makes it easier to support active/active scenarios, but it also requires careful thought around how the application handles concurrent writes and potential conflicts.

Caching layers such as Azure Cache for Redis can be geo-replicated to reduce latency and improve availability. These caches should be treated as disposable and rebuilt when needed, rather than relied on as a source of truth. For object and file storage, Azure Blob Storage and Azure Files support geo-redundant options such as GRS and RA-GRS. These options provide data durability across regions and allow read access from secondary regions, which is often sufficient for backup, content distribution, and disaster recovery scenarios.

When designing data replication across regions, architects should be clear about trade-offs. Strong consistency across regions usually increases latency and limits scalability, while eventual consistency improves availability but may expose temporary data mismatches. Replication lag, failover behavior, and conflict resolution should be understood and tested before going to production.

Security and Governance Considerations

In a multi-region setup, security and governance should look the same in every region. The goal is to avoid special cases and reduce the risk of configuration drift as the platform grows. Consistency is more important than introducing region-specific controls.

Identity and access management is typically centralized using Microsoft Entra ID. Access to AKS clusters is controlled through a combination of Azure RBAC and Kubernetes RBAC, allowing teams to manage permissions in a way that aligns with existing Azure roles while still supporting Kubernetes-native access patterns.
Network security is enforced through segmentation. A hub-and-spoke topology is commonly used, with shared services such as firewalls, DNS, and connectivity hosted in a central hub and application workloads deployed in regional spokes. This approach helps control traffic flows, limits blast radius, and simplifies auditing.

Policy and threat protection are applied at the platform level. Azure Policy for Kubernetes is used to enforce baseline configurations, such as allowed images, pod security settings, and resource limits. Microsoft Defender for Containers provides visibility into runtime threats and misconfigurations across all clusters.

Landing zones play a key role in this design. By integrating AKS clusters into a standardized landing zone setup, governance controls such as policies, role assignments, logging, and network rules are applied consistently across subscriptions and regions. This makes the platform easier to operate and reduces the risk of gaps as new regions are added.

AKS Observability and Resilience Testing

Running AKS across multiple regions only works if you can clearly see what is happening across the entire platform. Observability should be centralized so operators don’t need to switch between regions or tools when troubleshooting issues. Azure Monitor and Log Analytics are typically used as the main aggregation point for logs and metrics from all clusters. This makes it easier to correlate signals across regions and quickly understand whether an issue is local to one cluster or affecting the platform as a whole.

Distributed tracing adds another important layer of visibility. By using OpenTelemetry, requests can be traced end to end as they move through services and across regions. This is especially useful in active/active setups, where traffic may shift between regions based on health or latency. Synthetic probes and health checks should be treated as first-class signals.
These checks continuously test application endpoints from outside the platform and help validate that routing, failover, and recovery mechanisms behave as expected.

Observability alone is not enough. Resilience assumptions must be tested regularly. Chaos engineering and planned failover exercises help teams understand how the system behaves under failure conditions and whether operational runbooks are realistic. These tests should be performed in a controlled way and repeated over time, especially after platform changes. The goal is not to eliminate failures, but to make failures predictable, visible, and recoverable.

Conclusion and Next Steps

Building a highly available, multi-region AKS platform is mostly about making clear decisions and understanding their impact. Traffic routing, data replication, security, and operations all play a role, and there are always trade-offs between availability, complexity, and cost. The reference architecture described in this article provides a solid starting point for running AKS across regions on Azure. It focuses on proven patterns that work well in real environments and scale as requirements grow. The most important takeaway is that multi-region is not a single feature you turn on. It is a set of design choices that must work together and be tested regularly.
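The failover behavior that these tests validate can be modeled with a small sketch: health probes mark regional endpoints healthy or unhealthy, and traffic selection moves to the next available region. This is illustrative only; the endpoint names, probe window, and thresholds below are hypothetical, and this is not how Azure Front Door is actually implemented.

```python
# Illustrative sketch of health-probe-based failover selection, similar
# in spirit to a global load balancer removing unhealthy regional
# endpoints from rotation. Names and thresholds are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Endpoint:
    name: str
    priority: int                 # lower value = preferred region
    recent_probes: list = field(default_factory=list)

    def record_probe(self, success: bool, window: int = 4) -> None:
        # Keep only the most recent probe results.
        self.recent_probes = (self.recent_probes + [success])[-window:]

    def is_healthy(self, required_successes: int = 3) -> bool:
        # Healthy if enough recent probes succeeded.
        return sum(self.recent_probes) >= required_successes

def select_endpoint(endpoints):
    """Route to the highest-priority healthy endpoint, or None if all are down."""
    healthy = [e for e in endpoints if e.is_healthy()]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)

west = Endpoint("aks-westeurope", priority=1)
north = Endpoint("aks-northeurope", priority=2)

for _ in range(4):
    west.record_probe(True)
    north.record_probe(True)
assert select_endpoint([west, north]).name == "aks-westeurope"

# Primary region starts failing probes: traffic shifts to the secondary.
for _ in range(4):
    west.record_probe(False)
assert select_endpoint([west, north]).name == "aks-northeurope"
```

A chaos or failover exercise essentially verifies that the real platform matches this expected behavior, including how long the transition takes.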
Deployment Models

| Area | Active/Active | Active/Passive | Deployment Stamps |
| --- | --- | --- | --- |
| Availability | Highest | High | Depends on routing model |
| Failover time | Very low | Medium | Depends on implementation |
| Operational complexity | High | Medium | Medium to high |
| Cost | Highest | Lower | Medium |
| Typical use case | Mission-critical workloads | Business-critical workloads | Large or regulated platforms |

Traffic Routing and Failover

| Aspect | Azure Front Door + Traffic Manager | Azure DNS |
| --- | --- | --- |
| Health-based routing | Yes | No |
| Failover speed (RTO) | Seconds to < 1 minute | Minutes (TTL-based) |
| Traffic steering | Advanced | Basic |
| Recommended for | Production and critical workloads | Simple or non-critical workloads |

Data and State Management

| Data Type | Recommended Approach | Notes |
| --- | --- | --- |
| Relational data | Azure SQL with geo-replication | Clear primary/secondary roles |
| Globally distributed data | Cosmos DB multi-region | Consistency must be chosen carefully |
| Caching | Azure Cache for Redis | Treat as disposable |
| Object and file storage | Blob / Files with GRS or RA-GRS | Good for DR and read scenarios |

Security and Governance

| Area | Recommendation |
| --- | --- |
| Identity | Centralize with Microsoft Entra ID |
| Access control | Combine Azure RBAC and Kubernetes RBAC |
| Network security | Hub-and-spoke topology |
| Policy enforcement | Azure Policy for Kubernetes |
| Threat protection | Defender for Containers |
| Governance | Use landing zones for consistency |

Observability and Testing

| Practice | Why It Matters |
| --- | --- |
| Centralized monitoring | Faster troubleshooting |
| Metrics, logs, traces | Full visibility across regions |
| Synthetic probes | Early failure detection |
| Failover testing | Validate assumptions |
| Chaos engineering | Build confidence in recovery |

Recommended Next Steps

If you want to move from design to implementation, the following steps usually work well:
- Start with a proof of concept using two regions and a simple workload
- Define RTO and RPO targets and validate them with tests
- Create operational runbooks for failover and recovery
- Automate deployments and configuration using CI/CD and GitOps
- Regularly test failover and recovery, not just once
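Validating RTO and RPO targets with tests can be made concrete with a small check like the sketch below. The target numbers are hypothetical examples, not recommendations; the DNS estimate simply adds outage detection time to DNS cache expiry, which is why DNS-only failover tends to land in the "minutes" range.

```python
# Sketch: checking failover drill results against RTO/RPO targets.
# All numeric values here are hypothetical examples.

def meets_targets(measured_rto_s: float, measured_rpo_s: float,
                  target_rto_s: float, target_rpo_s: float) -> dict:
    """Compare measurements from a failover test against agreed targets."""
    return {"rto_ok": measured_rto_s <= target_rto_s,
            "rpo_ok": measured_rpo_s <= target_rpo_s}

def dns_failover_rto_estimate_s(probe_interval_s: int,
                                failed_probes_to_trip: int,
                                dns_ttl_s: int) -> int:
    # Worst case for DNS-based failover: time to detect the outage
    # plus the time for cached DNS answers to expire (the TTL).
    return probe_interval_s * failed_probes_to_trip + dns_ttl_s

# A 30 s probe interval, 3 failed probes to trip, and a 300 s TTL put
# the DNS-only worst case at 390 s, i.e. in the "minutes" range.
estimate = dns_failover_rto_estimate_s(30, 3, 300)
assert estimate == 390

# Against a 60 s RTO target, this drill would fail the RTO check.
result = meets_targets(measured_rto_s=390, measured_rpo_s=5,
                       target_rto_s=60, target_rpo_s=30)
assert result == {"rto_ok": False, "rpo_ok": True}
```

Running a check like this after every failover drill turns RTO/RPO from a document into an enforced, regression-tested property of the platform.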
For deeper guidance, the Azure Well-Architected Framework and the Azure Architecture Center provide additional patterns, checklists, and reference implementations that build on the concepts discussed here.

Unlocking Advanced Data Analytics & AI with Azure NetApp Files object REST API
Azure NetApp Files object REST API enables object access to enterprise file data stored on Azure NetApp Files, without copying, moving, or restructuring that data. This capability allows analytics and AI platforms that expect object storage to work directly against existing NFS-based datasets, while preserving Azure NetApp Files’ performance, security, and governance characteristics.

Building a Secure and Compliant Azure AI Landing Zone: Policy Framework & Best Practices
As organizations accelerate their AI adoption on Microsoft Azure, governance, compliance, and security become critical pillars for success. Deploying AI workloads without a structured compliance framework can expose enterprises to data privacy issues, misconfigurations, and regulatory risks. To address this challenge, the Azure AI Landing Zone provides a scalable and secure foundation, bringing together Azure Policy, Blueprints, and Infrastructure-as-Code (IaC) to ensure every resource aligns with organizational and regulatory standards.

The Azure Policy & Compliance Framework acts as the governance backbone of this landing zone. It enforces consistency across environments by applying policy definitions, initiatives, and assignments that monitor and remediate non-compliant resources automatically.

This blog will guide you through:
🧭 The architecture and layers of an AI Landing Zone
🧩 How Azure Policy as Code enables automated governance
⚙️ Steps to implement and deploy policies using IaC pipelines
📈 Visualizing compliance flows for AI-specific resources

What is Azure AI Landing Zone (AI ALZ)?
AI ALZ is a foundational architecture that integrates core Azure services (ML, OpenAI, Cognitive Services) with best practices in identity, networking, governance, and operations. To ensure consistency, security, and responsibility, a robust policy framework is essential.

Policy & Compliance in AI ALZ
Azure Policy helps enforce standards across subscriptions and resource groups. You define policies (single rules), group them into initiatives (policy sets), and assign them with defined scopes and exemptions. Compliance reporting helps surface non-compliant resources for remediation.
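To make the definition/initiative/assignment model concrete, the sketch below evaluates a deny-style "allowed locations" rule, one of the most common Azure Policy patterns, in plain Python. It mirrors the shape of a policy rule but is not the Azure Policy engine; the resource names and regions are illustrative.

```python
# Illustrative sketch: evaluating an "allowed locations" style rule
# against resource descriptions. This is NOT the Azure Policy engine;
# names and regions below are hypothetical examples.

ALLOWED_LOCATIONS = {"westeurope", "northeurope"}  # e.g. a "Europe only" rule

def evaluate(resource: dict) -> str:
    """Return 'Deny' for resources outside approved regions, else 'Allow'."""
    if resource.get("location") not in ALLOWED_LOCATIONS:
        return "Deny"
    return "Allow"

resources = [
    {"name": "aml-ws-eu", "type": "Microsoft.MachineLearningServices/workspaces",
     "location": "westeurope"},
    {"name": "openai-us", "type": "Microsoft.CognitiveServices/accounts",
     "location": "eastus"},
]

decisions = {r["name"]: evaluate(r) for r in resources}
assert decisions == {"aml-ws-eu": "Allow", "openai-us": "Deny"}
```

In Azure, the same logic lives in a policy definition's `policyRule`, is grouped into an initiative with related rules, and is assigned at a management group or subscription scope so the platform, not the reviewer, enforces it.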
In AI workloads, some unique considerations apply:
- Sensitive data (PII, models)
- Model accountability, logging, and audit trails
- Cost and performance from heavy compute usage
- Preview features and frequent updates

Scope
This framework covers:
- Azure Machine Learning (AML)
- Azure API Management
- Azure AI Foundry
- Azure App Service
- Azure Cognitive Services
- Azure OpenAI
- Azure Storage Accounts
- Azure Databases (SQL, Cosmos DB, MySQL, PostgreSQL)
- Azure Key Vault
- Azure Kubernetes Service

Core Policy Categories

1. Networking & Access Control
- Restrict resource deployment to approved regions (e.g., Europe only).
- Enforce private link and private endpoint usage for all critical resources.
- Disable public network access for workspaces, storage, search, and key vaults.

2. Identity & Authentication
- Require user-assigned managed identities for resource access.
- Disable local authentication; enforce Microsoft Entra ID (Azure AD) authentication.

3. Data Protection
- Enforce encryption at rest with customer-managed keys (CMK).
- Restrict public access to storage accounts and databases.

4. Monitoring & Logging
- Deploy diagnostic settings to Log Analytics for all key resources.
- Ensure activity/resource logs are enabled and retained for at least one year.

5. Resource-Specific Guardrails
- Apply built-in and custom policy initiatives for OpenAI, Kubernetes, App Services, Databases, etc.

A detailed list of all policies is bundled and attached at the end of this blog. Be sure to check it out for a ready-to-use Excel file, perfect for customer workshops, which includes policy type (Standalone/Initiative), origin (Built-in/Custom), and more.

Implementation: Policy-as-Code using EPAC
To turn policies from Excel/JSON into operational governance, Enterprise Policy as Code (EPAC) is a powerful tool. EPAC transforms policy artifacts into a desired-state repository and handles deployment, lifecycle, versioning, and CI/CD automation.

What is EPAC & Why Use It?
- EPAC (Enterprise Policy As Code) is a set of PowerShell scripts/modules to deploy policy definitions, initiatives, assignments, role assignments, and exemptions.
- It supports CI/CD integration (GitHub Actions, Azure DevOps) so policy changes can be treated like code.
- It handles ordering, dependency resolution, and enforcement of a “desired state”; any policy resources not in your repo may be pruned (depending on configuration).
- It integrates with Azure Landing Zones (including the governance baseline) out of the box.

References & Further Reading
- EPAC GitHub Repository
- Advanced Azure Policy management - Microsoft Learn
- How to deploy Azure policies the DevOps way - Rabobank

Cross-Region Zero Trust: Connecting Power Platform to Azure PaaS across different regions
In the modern enterprise cloud landscape, data rarely sits in one place. You might face a scenario where your Power Platform environment (Dynamics 365, Power Apps, or Power Automate) is hosted in Region A for centralized management, while your sensitive SQL Databases or Storage Accounts must reside in Region B due to data sovereignty, latency requirements, or legacy infrastructure. Connecting these two worlds usually involves traversing the public internet, a major "red flag" for security teams.

The Missing Link in Cloud Security
When we talk about enterprise security, "Public Access: Disabled" is the holy grail. But for Power Platform architects, this setting is often followed by a headache. The challenge is simple but daunting: how can a Power Platform environment (e.g., in Region A) communicate with an Azure PaaS service (e.g., Storage or SQL in Region B) when that resource is completely locked down behind a Private Endpoint?

Existing documentation usually covers single-region setups with no firewalls. This post details a "Zero Trust" architecture that bridges this gap: a walkthrough for setting up a cross-region Private Link that routes traffic from the Power Platform in Region A, through a secure Azure Hub, and down the Azure Global Backbone to a Private Endpoint in Region B, without a single packet ever touching the public internet.

1. Understanding the Foundation: VNet Support
Before we build, we must understand what moves: Power Platform VNet integration is an "outbound" technology. It allows the platform to connect to data sources secured within an Azure Virtual Network and "inject" its traffic into your Virtual Network, without needing to install or manage an on-premises data gateway.

According to Microsoft's official documentation, this integration supports a wide range of services:
- Dataverse: Plugins and Virtual Tables.
- Power Automate: Cloud Flows using standard connectors.
- Power Apps: Canvas Apps calling private APIs.
This means once the "tunnel" is built, your entire Power Platform ecosystem can reach your private Azure universe. See: Virtual Network support overview – Power Platform | Microsoft Learn

2. The Architecture: A Cross-Region Global Bridge
Based on the Hub-and-Spoke topology, this architecture relies on four key components working in unison:
- Source (Region A): The Power Platform environment utilizes VNet Injection. This injects the platform's outbound traffic into a dedicated, delegated subnet within your Region A Spoke VNet.
- The Hub: A central VNet containing an Azure Firewall. This acts as the regional traffic cop and DNS Proxy, inspecting traffic and resolving private names before allowing packets to traverse the global backbone.
- The Bridge (Global Backbone): We utilize Global VNet Peering to connect Region A to the Region B Spoke. This keeps traffic on Microsoft's private fiber backbone.
- Destination (Region B): The Azure PaaS service (e.g., Storage Account) is locked down with Public Access Disabled. It is only accessible via a Private Endpoint.

The Architecture: Visualizing the Flow
As illustrated in the diagram below, this solution separates the responsibilities into two distinct layers: the Network Admin (Azure Infrastructure) and the Power Platform Admin (Enterprise Policy).

3. The High Availability Constraint: Regional Pairs
A common pitfall of these deployments is configuring only a single region. Power Platform environments are inherently redundant. In a geography like Europe, your environment is actually hosted across a Regional Pair (e.g., West Europe and North Europe).

Why does this matter? If one Azure region in the pair experiences an outage, your Power Platform environment will fail over to the second region. If your VNet Policy isn't already there, your private connectivity will break. To maintain High Availability (HA) for your private tunnel, your Azure footprint must mirror this:
- Two VNets: You must create a Virtual Network in each region of the pair.
- Two Delegated Subnets: Each VNet requires a subnet delegated specifically to Microsoft.PowerPlatform/enterprisePolicies.
- Two Network Policies: You must create an Enterprise Policy in each region and link both to your environment to ensure traffic flows even during a regional failover.

Ensure your Azure subscription is registered for the Microsoft.PowerPlatform resource provider by running the SetupSubscriptionForPowerPlatform.ps1 script.

4. Solving the DNS Riddle with Azure Firewall
In a Hub-and-Spoke model, peering the VNets is only half the battle. If your Power Platform environment in Region A asks for mystorage.blob.core.windows.net, it will receive a public IP by default, and your connection will be blocked. To fix this, we utilize the Azure Firewall as a DNS Proxy:
1. Link the Private DNS Zone: Ensure your Private DNS Zones (e.g., privatelink.blob.core.windows.net) are linked to the Hub VNet.
2. Enable DNS Proxy: Turn on the DNS Proxy feature on your Azure Firewall.
3. Configure Custom DNS: Set the DNS servers of your Spoke VNets (Region A) to the Firewall's internal IP.

Now the DNS query flows through the Firewall, which "sees" the Private DNS Zone and returns the private IP to the Power Platform.

5. Secretless Security with User-Assigned Managed Identity
Private networking secures the path, but identity secures the access. Instead of managing fragile Client Secrets, we use a User-Assigned Managed Identity (UAMI).

Phase A: The Azure Setup
1. Create the Identity: Generate a User-Assigned Managed Identity in your Azure subscription.
2. Assign RBAC Roles: Grant this identity specific permissions on your destination resource. For example, assign the Storage Blob Data Contributor role to allow the identity to manage files in your private storage account.

Phase B: The Power Platform Integration
To make the environment recognize this identity, you must register it as an Application User:
1. Navigate to the Power Platform Admin Center.
2. Go to Environments > [Your Environment] > Settings > Users + permissions > Application users.
3. Add a new app and select the Managed Identity you created in Azure.

6. Creating Enterprise Policies Using PowerShell Scripts
One of the most important things to realize is that Enterprise Policies cannot be created manually in the Azure Portal UI. They must be deployed via PowerShell or CLI. While Microsoft provides a comprehensive official GitHub repository with all the necessary templates, it is designed to be highly modular and granular. This means that to achieve a High Availability (HA) setup, an admin usually needs to execute deployments for each region separately and then perform the linking step.

To simplify this workflow, I have developed a Simplified Scripts Repository on my GitHub. These scripts use the official Microsoft templates as their foundation but add an orchestration layer specifically for the Regional Pair requirement:
- Regional Pair Automation: Instead of running separate deployments, my script handles the dual-VNet injection in a single flow. It automates the creation of policies in both regions and links them to your environment in one execution.
- Focused Scenarios: I've distilled the most essential scripts for Network Injection and Encryption (CMK), making it easier for admins to get up and running without navigating the entire modular library.
- The Goal: To provide a "fast-track" experience that follows Microsoft's best practices while reducing the manual steps required to achieve a resilient, multi-region architecture.

Owning the Keys with Encryption Policies (CMK)
While Microsoft encrypts Dataverse data by default, many enterprise compliance standards require Customer-Managed Keys (CMK). This ensures that you, not Microsoft, control the encryption keys for your environments.
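Conceptually, CMK follows the envelope-encryption pattern: data is encrypted with a data encryption key (DEK), and that DEK is in turn "wrapped" by a key-encryption key (KEK) that stays under your control in Key Vault. The sketch below illustrates only the wrap/unwrap idea; it uses XOR purely for brevity, so it is not real cryptography and not the Key Vault API.

```python
# Conceptual sketch of envelope encryption, the idea behind CMK:
# a data encryption key (DEK) is "wrapped" by a key-encryption key
# (KEK) that stays in your Key Vault. XOR stands in for a real
# cipher here; this is NOT real cryptography and NOT the Key Vault API.

import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Toy reversible transform used only to illustrate wrap/unwrap.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

kek = secrets.token_bytes(32)   # customer-managed key, held in Key Vault
dek = secrets.token_bytes(32)   # data encryption key for the environment

# "Wrap": the platform stores only the wrapped DEK, never the KEK.
wrapped_dek = xor_bytes(dek, kek)

# "Unwrap": requires access to the KEK, which is exactly the permission
# (wrap/unwrap) granted to the managed identity on the Key Vault key.
unwrapped = xor_bytes(wrapped_dek, kek)
assert unwrapped == dek
```

This is also why revoking access to the KEK is so powerful: without it, the stored wrapped DEK is useless, which is what gives the customer ultimate control over the data.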
See: Manage your customer-managed encryption key - Power Platform | Microsoft Learn

Key Requirements:
- Key Vault Configuration: Your Key Vault must have Purge Protection and Soft Delete enabled to prevent accidental data loss.
- The Identity Bridge: The Encryption Policy uses the User-Assigned Managed Identity (created in Step 5) to authenticate against the Key Vault.
- Permissions: You must grant the Managed Identity the Key Vault Crypto Service Encryption User role so it can wrap and unwrap the encryption keys.

7. The Final Handshake: Linking Policies to Your Environment
Creating the Enterprise Policy in Azure is only the first half of the process. You must now "inform" your Power Platform environment that it should use these policies for its outbound traffic and identity.

Linking the Policies to Your Environment:
- For VNet Injection: In the Admin Center, go to Security > Data and privacy > Azure Virtual Network Policies. Select your environment and link it to the Network Injection policies you created.
- For Encryption (CMK): Go to Security > Data and privacy > Customer-managed encryption key. Select the Encryption Enterprise Policy, choose Edit Policy, and add the environment.
- Crucial Step: You must first grant the Power Platform service "Get", "List", "Wrap", and "Unwrap" permissions on your specific key within Azure Key Vault before the environment can successfully validate the policy.

Verification: The "Smoking Gun" in Log Analytics
After successfully reaching a resource from one of the Power Platform services, you can check whether the connection was private. How do you prove it is private? Use KQL in Azure Log Analytics to verify the Network Security Perimeter (NSP) ID.

The Proof: When you see a GUID in the NetworkPerimeter field, it is cryptographic evidence that the resource accepted the request only because it arrived via your authorized private bridge.
In the Azure Portal, navigate to your resource (for example, a Key Vault), open Logs, and use the following KQL:

```
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where OperationName == "KeyGet" or OperationName == "KeyUnwrap"
| where ResultType == "Success"
| project TimeGenerated, OperationName, VaultName = Resource, ResultType,
    CallerIP = CallerIPAddress,
    EnterprisePolicy = identity_claim_xms_mirid_s,
    NetworkPerimeter = identity_claim_xms_az_nwperimid_s
| sort by TimeGenerated desc
```

Result: By implementing the Network and Encryption Enterprise Policies, you transition the Power Platform from a public SaaS tool into a fully governed, private extension of your Azure infrastructure. You no longer have to choose between the agility of low-code and the security of a private cloud.

To summarize the transformation from public endpoints to a complete Zero Trust architecture across regions, here is the end-to-end workflow:

PHASE 1: Azure Infrastructure Foundation
1. Create Network Fabric (HA): Deploy VNets and Delegated Subnets in both regional pairs.
2. Deploy the Hub: Set up the Central Hub VNet with Azure Firewall.
3. Connect Globally: Establish Global VNet Peering between all Spokes and the Hub.
4. Solve DNS: Enable DNS Proxy on the Firewall and link Private DNS Zones to the Hub VNet.
↓
PHASE 2: Identity & Security Prep
1. Create Identity: Generate a User-Assigned Managed Identity (UAMI).
2. Grant Access (RBAC): Give the UAMI permissions on the target PaaS resource (e.g., Storage Contributor).
3. Prepare CMK: Configure Key Vault access policies for the UAMI (Wrap/Unwrap permissions).
↓
PHASE 3: Deploy Enterprise Policies (PowerShell/IaC)
1. Deploy Network Policies: Create "Network Injection" policies in Azure for both regions.
2. Deploy Encryption Policy: Create the "CMK" policy linking to your Key Vault and Identity.
↓
PHASE 4: Power Platform Final Link (Admin Center)
1. Link Network: Associate the Environment with the two Network Policies.
Link Encryption: Activate the Customer-Managed Key on the environment. Register User: Add the Managed Identity as an "Application User" in the environment. ↓ PHASE 5: Verification Run Workload: Trigger a Flow or Plugin. Audit Logs: Use KQL in Log Analytics to confirm the presence of the NetworkPerimeter ID.1KViews2likes2CommentsArchitecting an Azure AI Hub-and-Spoke Landing Zone for Multi-Tenant Enterprises
A large enterprise customer adopting AI at scale typically needs three non-negotiables in its AI foundation:

- End-to-end tenant isolation across network, identity, compute, and data
- Secure, governed traffic flow from users to AI services
- Transparent chargeback/showback for shared AI and platform services

At the same time, the platform must enable rapid onboarding of new tenants or applications and scale cleanly from proof-of-concept to production.

This article proposes an Azure Landing Zone-aligned architecture using a Hub-and-Spoke model, where:

- The AI Hub centralizes shared services and governance
- AI Spokes host tenant-dedicated AI resources
- Application logic and AI agents run on AKS

The result is a secure, scalable, and operationally efficient enterprise AI foundation.

1. Architecture goals & design principles

Goals:
- Host application logic and AI agents on Azure Kubernetes Service (AKS) as custom deployments instead of using agents under Azure AI Foundry
- Enforce strong tenant isolation across all layers
- Support cross-tenant chargeback and cost attribution
- Adopt a Hub-and-Spoke model with clear separation of shared vs. tenant-specific services

Design principles (Azure Landing Zone aligned). Azure Landing Zone (ALZ) guidance emphasizes:
- Separation of platform and workload subscriptions
- Management groups and policy inheritance
- Centralized connectivity using hub-and-spoke networking
- Policy-driven governance and automation

For infrastructure as code, ALZ-aligned deployments typically use Bicep or Terraform, increasingly leveraging Azure Verified Modules (AVM) for consistency and long-term maintainability.

2. Subscription & management group model

A practical enterprise layout looks like this:

- Tenant Root Management Group
  - Platform Management Group
    - Connectivity subscription (Hub VNet, Firewall, DNS, ExpressRoute/VPN)
    - Management subscription (Log Analytics, Monitor)
    - Security subscription (Defender for Cloud, Sentinel if required)
  - AI Hub Management Group
    - AI Hub subscription (shared AI and governance services)
  - AI Spokes Management Group
    - One subscription per tenant, business unit, or regulated boundary

This structure supports enterprise-scale governance while allowing teams to operate independently within well-defined guardrails.

3. Logical architecture — AI Hub vs. AI Spoke

AI Hub (central/shared services). The AI Hub acts as the governed control plane for AI consumption:
- Ingress & edge security: Azure Application Gateway with WAF (or Front Door for global scenarios)
- Central egress control: Azure Firewall with forced tunneling
- API governance: Azure API Management (private/internal mode)
- Shared AI services: Azure OpenAI (shared deployments where appropriate), safety controls
- Monitoring & observability: Azure Monitor, Log Analytics, centralized dashboards
- Governance: Azure Policy, RBAC, naming and tagging standards

All tenant traffic enters through the hub, ensuring consistent enforcement of security, identity, and usage policies.

AI Spoke (tenant-dedicated services). Each AI Spoke provides a tenant-isolated data and execution plane:
- Tenant-dedicated storage accounts and databases
- Vector stores and retrieval systems (Azure AI Search with isolated indexes or services)
- AKS runtime for tenant-specific AI agents and backend services
- Tenant-scoped keys, secrets, and identities

4. Logical architecture diagram (Hub vs. Spoke)

5. Network architecture — Hub and Spoke

6. Tenant onboarding & isolation strategy

Tenant onboarding flow. Tenant onboarding is automated using a landing zone vending model:
1. Request a new tenant or application
2. Provision a spoke subscription and baseline policies
3. Deploy the spoke VNet and peer it to the hub
4. Configure private DNS and firewall routes
5. Deploy AKS tenancy and data services
6. Register identities and API subscriptions
7. Enable monitoring and cost attribution

This approach enables consistent, repeatable onboarding with minimal manual effort.

Isolation by design:
- Network: Dedicated VNets, private endpoints, no public AI endpoints
- Identity: Microsoft Entra ID with tenant-aware claims and conditional access
- Compute: AKS isolation using namespaces, node pools, or dedicated clusters
- Data: Per-tenant storage, databases, and vector indexes

7. Identity & access management (Microsoft Entra ID)

Key IAM practices include:
- Central Microsoft Entra ID tenant for authentication and authorization
- Application and workload identities using managed identities
- Tenant context enforced at API Management and propagated downstream
- Conditional Access and least-privilege RBAC

This ensures zero-trust access while supporting both internal and partner scenarios.

8. Secure traffic flow (end-to-end)

1. The user accesses the application via Application Gateway + WAF
2. Traffic is inspected and routed through Azure Firewall
3. API Management validates identity, quotas, and tenant context
4. AKS workloads invoke AI services over Private Link
5. Responses return through the same governed path

This pattern provides full auditability, threat protection, and policy enforcement.

9. AKS multitenancy options

| Model | When to use | Characteristics |
| --- | --- | --- |
| Namespace per tenant | Default | Cost-efficient, logical isolation |
| Dedicated node pools | Medium isolation | Reduced noisy-neighbor risk |
| Dedicated AKS cluster | High compliance | Maximum isolation, higher cost |

Enterprises typically adopt a tiered approach, choosing the isolation level per tenant based on regulatory and risk requirements.

10. Cost management & chargeback model

Tagging strategy (mandatory): the tags tenantId, costCenter, application, environment, and owner are enforced via Azure Policy across all subscriptions.

Chargeback approach:
- Dedicated spoke resources: Direct attribution via subscription and tags
- Shared hub resources: Allocated using usage telemetry
  - API calls and token usage from API Management
  - CPU/memory usage from AKS namespaces

Cost data is exported to Azure Cost Management and visualized using Power BI to support showback and chargeback.

11. Security controls checklist

- Private endpoints for AI services, storage, and search
- No public network access for sensitive services
- Azure Firewall for centralized egress and inspection
- WAF for OWASP protection
- Azure Policy for governance and compliance

12. Deployment & automation

- Foundation: Azure Landing Zone accelerators (Bicep or Terraform)
- Workloads: Modular IaC for hub and spokes
- AKS apps: GitOps (Flux or Argo CD)
- Observability: Policy-driven diagnostics and centralized logging

13. Final thoughts

This Azure AI Landing Zone design provides a repeatable, secure, and enterprise-ready foundation for any large customer adopting AI at scale. By combining:
- Hub-and-Spoke networking
- AKS-based AI agents
- Strong tenant isolation
- FinOps-ready chargeback
- Azure Landing Zone best practices

organizations can confidently move AI workloads from experimentation to production without sacrificing security, governance, or cost transparency.

Disclaimer: While the above article discusses hosting custom agents on AKS alongside customer-developed application logic, the following sections focus on a baseline deployment model with no customizations. This approach uses Azure AI Foundry, where models and agents are fully managed by Azure, with centrally governed LLMs (AI Hub) hosted in Azure AI Foundry and agents deployed in a spoke environment.
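The mandatory tagging strategy from section 10 can be illustrated with a small validation helper, for example as a pre-deployment check in a spoke pipeline. This is a minimal sketch, not an Azure Policy definition: the tag names come from the article, while the function and sample values are hypothetical.

```python
# Sketch: checking the mandatory chargeback tags from section 10 before a
# resource is deployed into a spoke. Illustrative only; in production this
# rule would be enforced by Azure Policy across all subscriptions.

REQUIRED_TAGS = {"tenantId", "costCenter", "application", "environment", "owner"}

def missing_tags(resource_tags: dict) -> set:
    """Return the mandatory tags absent from a resource's tag dictionary."""
    return REQUIRED_TAGS - set(resource_tags)

# Example: a storage account tagged by a spoke deployment pipeline.
tags = {
    "tenantId": "contoso-bu1",
    "costCenter": "CC-4711",
    "application": "rag-chat",
    "environment": "prod",
}
print(sorted(missing_tags(tags)))  # ['owner'] — the owner tag is missing
```

A gate like this catches attribution gaps early; the authoritative enforcement still belongs in policy, where non-compliant deployments can be denied rather than merely flagged.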
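The shared-hub chargeback described in section 10 amounts to splitting one bill in proportion to usage telemetry. A minimal sketch, with invented figures standing in for token counts exported from API Management:

```python
# Sketch: allocating a shared AI Hub cost across tenants in proportion to
# token usage, as in the chargeback approach of section 10. All figures
# are invented for illustration.

def allocate_shared_cost(total_cost: float, usage_by_tenant: dict) -> dict:
    """Split total_cost proportionally to each tenant's recorded usage."""
    total_usage = sum(usage_by_tenant.values())
    if total_usage == 0:
        return {tenant: 0.0 for tenant in usage_by_tenant}
    return {
        tenant: round(total_cost * usage / total_usage, 2)
        for tenant, usage in usage_by_tenant.items()
    }

hub_bill = 10_000.00  # monthly shared Azure OpenAI + APIM cost (hypothetical)
tokens = {"tenant-a": 6_000_000, "tenant-b": 3_000_000, "tenant-c": 1_000_000}
print(allocate_shared_cost(hub_bill, tokens))
# tenant-a is charged 6000.0, tenant-b 3000.0, tenant-c 1000.0
```

The same shape works for AKS namespace CPU/memory shares; the only design decision is which telemetry stream defines the allocation key for each shared service.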
🚀 Get Started: Building a Secure & Scalable Azure AI Platform

To help you accelerate your Azure AI journey, Microsoft and the community provide several reference architectures, solution accelerators, and best-practice guides. Together, these form a strong foundation for designing secure, governed, and cost-efficient GenAI and AI workloads at scale. Below is a recommended starting path.

1️⃣ AI Landing Zone (Foundation)

Purpose: Establish a secure, enterprise-ready foundation for AI workloads. The AI Landing Zone extends the standard Azure Landing Zone with AI-specific considerations such as:
- Network isolation and hub-spoke design
- Identity and access control for AI services
- Secure connectivity to data sources
- Alignment with enterprise governance and compliance

🔗 AI Landing Zone (GitHub): https://github.com/Azure/AI-Landing-Zones?tab=readme-ov-file
👉 Start here if you want a standardized baseline before onboarding any AI workloads.

2️⃣ AI Hub Gateway – Solution Accelerator

Purpose: Centralize and control access to AI services across multiple teams or customers. The AI Hub Gateway Solution Accelerator helps you:
- Expose AI capabilities (models, agents, APIs) via a centralized gateway
- Apply consistent security, routing, and traffic controls
- Support both Chat UI and API-based consumption
- Enable multi-team or multi-tenant AI usage patterns

🔗 AI Hub Gateway Solution Accelerator: https://github.com/mohamedsaif/ai-hub-gateway-landing-zone?tab=readme-ov-file
👉 Ideal when you want a shared AI platform with controlled access and visibility.

3️⃣ Citadel Governance Hub (Advanced Governance)

Purpose: Enforce strong governance, compliance, and guardrails for AI usage. The Citadel Governance Hub builds on top of the AI Hub Gateway and focuses on:
- Policy enforcement for AI usage
- Centralized governance controls
- Secure onboarding of teams and workloads
- Alignment with enterprise risk and compliance requirements

🔗 Citadel Governance Hub (README): https://github.com/Azure-Samples/ai-hub-gateway-solution-accelerator/blob/citadel-v1/README.md
👉 Recommended for regulated environments or large enterprises with strict governance needs.

4️⃣ AKS Cost Analysis (Operational Excellence)

Purpose: Understand and optimize the cost of running AI workloads on AKS. AI platforms often rely on AKS for agents, inference services, and gateways. This guide explains:
- How AKS costs are calculated
- How to analyze node, pod, and workload costs
- Techniques to optimize cluster spend

🔗 AKS Cost Analysis: https://learn.microsoft.com/en-us/azure/aks/cost-analysis
👉 Use this early to avoid unexpected cost overruns as AI usage scales.

5️⃣ AKS Multi-Tenancy & Cluster Isolation

Purpose: Safely run workloads for multiple teams or customers on AKS. This guidance covers:
- Namespace vs cluster isolation strategies
- Security and blast-radius considerations
- When to use shared clusters vs dedicated clusters
- Best practices for multi-tenant AKS platforms

🔗 AKS Multi-Tenancy & Cluster Isolation: https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-cluster-isolation
👉 Critical reading if your AI platform supports multiple teams, business units, or customers.

🧭 Suggested Learning Path

If you're new, follow this order:
1. AI Landing Zone → build the foundation
2. AI Hub Gateway → centralize AI access
3. Citadel Governance Hub → enforce guardrails
4. AKS Cost Analysis → control spend
5. AKS Multi-Tenancy → scale securely