ExpressRoute Gateway Microsoft initiated migration
Important: Microsoft-initiated gateway migrations are temporarily paused. You will be notified when migrations resume.

Objective

The backend migration process is an automated upgrade performed by Microsoft to ensure your ExpressRoute gateways use the Standard IP SKU. This migration enhances gateway reliability and availability while maintaining service continuity. You receive notifications about scheduled maintenance windows and have options to control the migration timeline. For guidance on upgrading Basic SKU public IP addresses for other networking services, see Upgrading Basic to Standard SKU.

Important: As of September 30, 2025, Basic SKU public IPs are retired. For more information, see the official announcement.

You can initiate the ExpressRoute gateway migration yourself at a time that best suits your business needs, before the Microsoft team performs the migration on your behalf. This gives you control over the migration timing. Use the ExpressRoute Gateway Migration Tool to migrate your gateway public IP to the Standard SKU. The tool provides a guided workflow in the Azure portal and PowerShell, enabling a smooth migration with minimal service disruption.

Backend migration overview

The backend migration is scheduled during your preferred maintenance window. During this time, the Microsoft team performs the migration with minimal disruption; you don't need to take any action. The process includes the following steps:

Deploy new gateway: Azure provisions a second virtual network gateway in the same GatewaySubnet alongside your existing gateway. Microsoft automatically assigns a new Standard SKU public IP address to this gateway.
Transfer configuration: The process copies all existing configurations (connections, settings, routes) from the old gateway. Both gateways run in parallel during the transition to minimize downtime. You may experience brief connectivity interruptions.
Clean up resources: After migration completes successfully and passes validation, Azure removes the old gateway and its associated connections. The new gateway includes a tag CreatedBy: GatewayMigrationByService to indicate it was created through the automated backend migration.

Important: To ensure a smooth backend migration, avoid making non-critical changes to your gateway resources or connected circuits during the migration process. If modifications are absolutely required, you can choose (after the Migrate stage completes) to either commit or abort the migration and then make your changes.

Backend process details

This section provides an overview of the Azure portal experience during backend migration for an existing ExpressRoute gateway. It explains what to expect at each stage and what you see in the Azure portal as the migration progresses. To reduce risk and ensure service continuity, the process performs validation checks before and after every phase. The backend migration follows four key stages:

Validate: Checks that your gateway and connected resources meet all migration requirements for the Basic to Standard public IP migration.
Prepare: Deploys the new gateway with the Standard IP SKU alongside your existing gateway.
Migrate: Cuts over traffic from the old gateway to the new gateway with a Standard public IP.
Commit or abort: Finalizes the public IP SKU migration by removing the old gateway, or reverts to the old gateway if needed.

These stages mirror the Gateway migration tool process, ensuring consistency across both migration approaches.
The Azure resource group RGA serves as a logical container that displays all associated resources as the process updates, creates, or removes them. Before the migration begins, RGA contains the resources shown below.

Figure: An example ExpressRoute gateway named ERGW-A with two connections (Conn-A and LAconn) in the resource group RGA.

Portal walkthrough

Before the backend migration starts, a banner appears in the Overview blade of the ExpressRoute gateway. It notifies you that the gateway uses the deprecated Basic IP SKU and will undergo backend migration between March 7, 2026, and April 30, 2026.

Validate stage

Once the migration starts, the banner on your gateway's Overview page updates to indicate that migration is in progress. In this initial stage, all resources are checked to ensure they are in a Passed state. If any prerequisites aren't met, validation fails and the Azure team doesn't proceed with the migration, to avoid traffic disruption. No resources are created or modified in this stage. After the validation phase completes successfully, a notification appears indicating that validation passed and the migration can proceed to the Prepare stage.

Prepare stage

In this stage, the backend process provisions a new virtual network gateway in the same region and of the same SKU type as the existing gateway. Azure automatically assigns a new public IP address and re-establishes all connections. This preparation step typically takes up to 45 minutes. To indicate that the new gateway was created by migration, the backend mechanism appends _migrate to the original gateway name. During this phase, the existing gateway is locked to prevent configuration changes, but you retain the option to abort the migration, which deletes the newly created gateway and its connections. After the Prepare stage starts, a notification appears showing that new resources are being deployed to the resource group.

Deployment status

In the resource group RGA, under Settings > Deployments, you can view the status of all resources newly deployed as part of the backend migration process. In the Activity Log blade of RGA, you can see events related to the Prepare stage. These events are initiated by GatewayRP, which indicates they are part of the backend process.

Deployment verification

After the Prepare stage completes, you can verify the deployment details in the resource group RGA under Settings > Deployments. This section lists all components created as part of the backend migration workflow. The new gateway ERGW-A_migrate is deployed successfully along with its corresponding connections: Conn-A_migrate and LAconn_migrate.

Gateway tag

The newly created gateway ERGW-A_migrate includes the tag CreatedBy: GatewayMigrationByService, which indicates it was provisioned by the backend migration process.

Migrate stage

After the Prepare stage finishes, the backend process starts the Migrate stage. During this stage, the process switches traffic from the existing gateway ERGW-A to the new gateway ERGW-A_migrate.

Figure: Old gateway (ERGW-A) handles traffic.

After the backend team initiates the traffic migration, the process switches traffic from the old gateway to the new gateway. This step can take up to 15 minutes and might cause brief connectivity interruptions.

Figure: New gateway (ERGW-A_migrate) handles traffic.

Commit stage

After migration, the Azure team monitors connectivity for 15 days to ensure everything is functioning as expected.
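As a quick programmatic check, the CreatedBy tag described above can be read with the azure-mgmt-network Python SDK. This is a minimal sketch, not part of the migration tooling; it assumes azure-identity credentials are available, the subscription ID is a placeholder, and the group and gateway names come from the walkthrough example.

```python
# Minimal sketch: verify that a gateway was created by the backend
# migration by inspecting its tags (azure-identity + azure-mgmt-network).
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "RGA"                  # example group from the walkthrough
GATEWAY_NAME = "ERGW-A_migrate"         # example migrated gateway name

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
gateway = client.virtual_network_gateways.get(RESOURCE_GROUP, GATEWAY_NAME)

tags = gateway.tags or {}
if tags.get("CreatedBy") == "GatewayMigrationByService":
    print(f"{GATEWAY_NAME} was provisioned by the backend migration.")
else:
    print(f"{GATEWAY_NAME} has no migration tag: {tags}")
```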
The banner automatically updates to indicate completion of the migration. During this validation period, you can't modify resources associated with either the old or the new gateway. To resume normal CRUD operations without waiting 15 days, you have two options:

Commit: Finalize the migration and unlock resources.
Abort: Revert to the old gateway, which deletes the new gateway and its connections.

To initiate Commit before the 15-day window ends, type yes and select Commit in the portal. When the commit is initiated from the backend, you see "Committing migration. The operation may take some time to complete." The old gateway and its connections are deleted; the event shows as initiated by GatewayRP in the activity logs. After the old connections are deleted, the old gateway itself is deleted. Finally, the resource group RGA contains only the resources related to the migrated gateway ERGW-A_migrate. The ExpressRoute gateway migration from Basic to Standard public IP SKU is now complete.

Frequently asked questions

How long will the Microsoft team wait before committing to the new gateway? The Microsoft team waits around 15 days after migration to allow you time to validate connectivity and ensure all requirements are met. You can commit at any time during this 15-day period.

What is the traffic impact during migration? Is there packet loss or routing disruption? Traffic is rerouted seamlessly during migration. Under normal conditions, no packet loss or routing disruption is expected. Brief connectivity interruptions (typically less than 1 minute) might occur during the traffic cutover phase.

Can we make changes to the ExpressRoute gateway deployment during the migration? Avoid making non-critical changes to the deployment (gateway resources, connected circuits, and so on). If modifications are absolutely required, you have the option (after the Migrate stage) to either commit or abort the migration.

Azure Front Door: Resiliency Series – Part 2: Faster recovery (RTO)
In Part 1 of this blog series, we outlined our four-pillar strategy for resiliency in Azure Front Door: configuration resiliency, data plane resiliency, tenant isolation, and accelerated Recovery Time Objective (RTO). Together, these pillars help Azure Front Door remain continuously available and resilient at global scale. Part 1 focused on the first two pillars: configuration and data plane resiliency. Our goal is to make configuration propagation safer, so incompatible changes never escape pre-production environments. We discussed how incompatible configurations are blocked early, and how data plane resiliency ensures the system continues serving traffic from a last-known-good (LKG) configuration even if a bad change manages to propagate. We also introduced 'Food Taster', a dedicated sacrificial process running in each edge server's data plane that pretests every configuration change in isolation, before it ever reaches the live data plane.

In this post, we turn to the recovery pillar. We describe the key enhancements we have made to the Azure Front Door recovery path so the system can return to full operation in a predictable and bounded timeframe. For a global service like Azure Front Door, serving hundreds of thousands of tenants across 210+ edge sites worldwide, we set an explicit target: to be able to recover any edge site – or all edge sites – within approximately 10 minutes, even in worst-case scenarios. In typical data plane crash scenarios, we expect recovery in under a second.

Repair status

The first blog post in this series mentioned the two Azure Front Door incidents from October 2025 – learn more by watching our Azure Incident Retrospective session recordings for the October 9th incident and/or the October 29th incident. Before diving into our platform investments for improving our Recovery Time Objectives (RTO), we wanted to provide a quick update on the overall repair items from these incidents. We are pleased to report that the work on configuration propagation and data plane resiliency is now complete and fully deployed across the platform (below, "Completed" means broadly deployed in production). With this, we have reduced configuration propagation latency from ~45 minutes to ~20 minutes. We anticipate reducing this even further – to ~15 minutes by the end of April 2026 – while ensuring that platform stability remains our top priority.

Safe customer configuration deployment – Goal: incompatible configuration never propagates beyond EUAP or canary regions. Repairs: control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state. Status: Completed.
Data plane resiliency – Goal: configuration processing cannot impact data plane availability. Repairs: manage the data-plane lifecycle to prevent outages caused by configuration-processing defects (Completed); isolated work process in every data plane server to process and load the configuration (Completed).
100% Azure Front Door resiliency posture for Microsoft internal services – Goal: Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services. Repairs: Phase 1, onboarded the batch of critical services impacted in the October 29th outage, running on a day-old configuration (Completed); Phase 2, automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services (March 2026).
Recovery improvements – Goal: data plane crash recovery in under 10 minutes. Repairs: data plane boot-up time optimized via local cache, ~1 hour (Completed); accelerate recovery time to under 10 minutes (April 2026).
Tenant isolation – Goal: no configuration or traffic regression can impact other tenants. Repairs: micro-cellular Azure Front Door with ingress-layered shards (June 2026).

Why recovery at edge scale is deceptively hard

To understand why recovery took as long as it did, it helps to first understand how the Azure Front Door data plane processes configuration. Azure Front Door operates in 210+ edge sites with multiple servers per site. The data plane of each edge server hosts multiple processes. A master process orchestrates the lifecycle of multiple worker processes that serve customer traffic. A separate configuration translator process runs alongside the data plane processes and is responsible for converting customer configuration bundles from the control plane into optimized binary FlatBuffer files. This translation step, covering hundreds of thousands of tenants, represents hours of cumulative computation. A cache is kept locally on each edge server to enable fast recovery of the data plane, if needed.

Once the configuration translator process produces these FlatBuffer files, each worker processes them independently and memory-maps them for zero-copy access. Configuration updates flow through a two-phase commit: new FlatBuffers are first loaded into a staging area and validated, then atomically swapped into production maps. In-flight requests continue using the old configuration until the last request referencing it completes.

The data plane recovery process is designed to be resilient to different failure modes. A failure or crash at the worker process level has a typical recovery time of less than one second. Since each server has multiple such worker processes serving customer traffic, this type of crash has no impact on the data plane. In the case of a master process crash, the system automatically tries to recover using the local cache. When the local cache is reused, the system recovers in approximately 60 minutes, since most of the configurations in the cache were already loaded into the data plane before the crash. However, if the cache becomes unavailable or must be invalidated because of corruption, the recovery time increases significantly. During the October 29th incident, a data plane crash triggered a complete recovery sequence that took approximately 4.5 hours. This was not because restarting a process is slow; it is because a defect in the recovery process invalidated the local cache, which meant that "restart" meant rebuilding everything from scratch. The configuration translator process then had to re-fetch and re-translate every one of the hundreds of thousands of customer configurations before workers could memory-map them and begin serving traffic.
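As an illustration of the two-phase commit described above, here is a minimal Python sketch. It is not Front Door's implementation – the real data plane is native code operating over memory-mapped FlatBuffers – and the class, field, and validation details are invented for illustration.

```python
import threading

class TwoPhaseConfigStore:
    """Illustrative two-phase commit for config updates: load a candidate
    into a staging area, validate it, then atomically swap it into the
    production map that live traffic reads from."""

    def __init__(self):
        self._lock = threading.Lock()
        self._production = {}   # tenant -> validated config (live traffic)
        self._staging = {}      # tenant -> candidate config (not yet live)

    def stage(self, tenant, config):
        # Phase 1: load the candidate where it cannot affect live traffic.
        self._staging[tenant] = config

    def validate(self, tenant):
        # Reject obviously malformed candidates before they go live
        # (a stand-in for the real validation pass).
        config = self._staging.get(tenant)
        return config is not None and config.get("routes") is not None

    def commit(self, tenant):
        # Phase 2: atomic swap. In-flight requests holding a reference to
        # the old config keep using it until they complete.
        if not self.validate(tenant):
            self._staging.pop(tenant, None)   # discard this candidate only
            return False
        with self._lock:
            self._production[tenant] = self._staging.pop(tenant)
        return True

    def lookup(self, tenant):
        return self._production.get(tenant)
```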
This experience crystallized three fundamental learnings related to our recovery path:

Expensive rework: A subset of crashes discarded all previously translated FlatBuffer artifacts, forcing the configuration translator process to repeat hours of conversion work that had already been validated and stored.
High restart costs: Every worker on every node had to wait for the configuration translator process to complete the full translation before it could memory-map any configuration and begin serving requests.
Unbounded recovery time: Recovery time grew linearly with total tenant footprint rather than with active traffic, creating a 'scale penalty' as more tenants onboarded to the system.

Separately and together, the insight was clear: recovery must stop being proportional to the total configuration size.

Persisting 'validated configurations' across restarts

One of the key recovery improvements was strengthening how validated customer configurations are cached and reused across failures, rather than rebuilding configuration state from scratch during recovery. Azure Front Door already cached customer configurations on host-mounted storage prior to the October incident. The platform enhancements after the outage focused on making the local configuration cache resilient to crashes, partial failures, and bad tenant inputs. Our goal was to ensure that recovery behavior is dominated by serving traffic safely, not by reconstructing configuration state. This led us to two explicit design goals.

Design goals

No category of crash should invalidate the configuration cache: Configuration cache invalidation must never be the default response to failures. Whether the failure is a worker crash, master crash, data plane restart, or coordinated recovery action, previously validated customer configurations should remain usable, unless there is a proven reason to discard them.
Bad tenant configuration must not poison the entire cache: A single faulty or incompatible tenant configuration should result in targeted eviction of that tenant's configuration only, not wholesale cache invalidation across all tenants.

Platform enhancements

Previously, customer configurations persisted to host-mounted storage, but certain failure paths treated the cache as unsafe and invalidated it entirely. In those cases, recovery implicitly meant reloading and reprocessing configuration for hundreds of thousands of tenants before traffic could resume, even though the vast majority of cached data was still valid. We changed the recovery model to avoid invalidating customer configurations, with strict scoping around when and how cached entries are discarded:

Cached configurations are no longer invalidated based on crash type. Failures are assumed to be orthogonal to configuration correctness unless explicitly proven otherwise.
Cache eviction is granular and tenant-scoped. If a cached configuration fails validation or load checks, only that tenant's configuration is discarded and reloaded. All other tenant configurations remain available.

This ensures that recovery does not regress into a fleet-wide rebuild due to localized or unrelated faults.
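To illustrate the tenant-scoped eviction model, here is a minimal Python sketch of a recovery loop over a persisted configuration cache. The cache layout, file format, and validation rule are invented for illustration; the point is that a failed entry evicts only that tenant, never the whole cache.

```python
import json
from pathlib import Path

def load_cached_configs(cache_dir: Path) -> dict:
    """Illustrative recovery loop: reuse every cached tenant config that
    validates; evict only the entries that fail, never the whole cache."""
    production = {}
    for entry in cache_dir.glob("*.json"):
        tenant = entry.stem
        try:
            config = json.loads(entry.read_text())
            if "routes" not in config:       # per-tenant validation on load
                raise ValueError("missing routes")
            production[tenant] = config
        except ValueError:                   # JSONDecodeError is a ValueError
            entry.unlink(missing_ok=True)    # evict this tenant only
            # ...schedule targeted re-translation for `tenant` here...
    return production
```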
Safety and correctness

Durability is paired with strong correctness controls to prevent unsafe configurations from being served:

Per-tenant validation on load: Each cached tenant configuration is validated during the 'load and verification' phase, before being promoted for traffic serving. Failures are therefore contained to that tenant.
Targeted re-translation: When validation fails, only the affected tenant's configuration is reloaded or reprocessed. The cache for other tenants is left untouched.
Operational escape hatch: Operators retain the ability to explicitly instruct a clean rebuild of the configuration cache (with proper authorization), preserving control without compromising the default fast-recovery path.

Resulting behavior

With these changes, recovery behavior now aligns with real-world traffic patterns: configuration defects impact tenants locally and predictably, rather than globally. The system now prefers isolated tenant impact, and continued service using last-known-good configuration, over aggressive invalidation – both of which are critical for predictable recovery at the scale of Azure Front Door.

Making recovery scale with active traffic, not total tenants

Reusing the configuration cache solves the problem of rebuilding configuration in its entirety, but even with a warm cache, the original startup path had a second bottleneck: eagerly loading a large volume of tenant configurations into memory before serving any traffic. At our scale, memory-mapping and parsing hundreds of thousands of FlatBuffers, constructing internal lookup maps, and adding Transport Layer Security (TLS) certificates and configuration blocks for each tenant collectively added almost an hour to startup time. This was the case even though a majority of those tenants had no active traffic at that moment.

We addressed this by fundamentally changing when configuration is loaded into workers. Rather than eagerly loading most tenants at startup across all edge locations, Azure Front Door now uses a Machine Learning (ML)-optimized lazy-loading model. In the new architecture, instead of loading a large number of tenant configurations, we load only the small subset of tenants that are known to be historically active at a given site – we call this the "warm tenants" list. The warm tenants list for each edge site is created through a sophisticated traffic analysis pipeline that leverages ML.

However, loading the warm tenants alone is not enough, because when a request arrives and we don't have the configuration in memory, we need to know two things: is this a request for a real Azure Front Door tenant, and if so, where can we find its configuration? To answer these questions, each worker maintains a hostmap that tracks the state of each tenant's configuration. The hostmap is constructed during startup as we process each tenant configuration: if the tenant is in the warm list, we process and load its configuration fully; if not, we simply add a hostmap entry mapping all of the tenant's domain names to the location of its configuration. When a request arrives for one of these tenants, the worker loads and validates that tenant's configuration on demand and immediately begins serving traffic. This allows a node to start serving its busiest tenants within a few minutes of startup, while additional tenants are loaded incrementally only when traffic actually arrives – allowing the system to progressively absorb cold tenants as demand increases. A sketch of this hostmap-based lazy loading follows below.

The effect on recovery is transformative. Instead of recovery time scaling with the total number of tenants configured on a server, it scales with the number of tenants actively receiving traffic. In practice, even at our busiest edge sites, the active tenant set is a small fraction of the total.
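Here is the promised sketch of hostmap-based lazy loading, in Python. The data structures and names are invented for illustration and stand in for the real memory-mapped FlatBuffer machinery.

```python
class LazyConfigLoader:
    """Illustrative lazy-loading model: warm tenants are fully loaded at
    startup; for everyone else the hostmap records only where the config
    lives, and it is loaded on the first request that needs it."""

    def __init__(self, warm_tenants, all_tenants):
        # hostmap: domain -> (tenant, path to that tenant's config)
        self.hostmap = {}
        self.loaded = {}
        for tenant, (domains, config_path) in all_tenants.items():
            if tenant in warm_tenants:
                self.loaded[tenant] = self._load_and_validate(config_path)
            for domain in domains:
                self.hostmap[domain] = (tenant, config_path)

    def _load_and_validate(self, config_path):
        # Stand-in for memory-mapping and validating a FlatBuffer file.
        return {"path": config_path}

    def get_config(self, host):
        entry = self.hostmap.get(host)
        if entry is None:
            return None                      # not an Azure Front Door tenant
        tenant, config_path = entry
        if tenant not in self.loaded:        # cold tenant: load on demand
            self.loaded[tenant] = self._load_and_validate(config_path)
        return self.loaded[tenant]
```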
Just as importantly, this modified form of lazy loading provides a natural failure isolation boundary. Most edge sites won't ever load a faulty configuration belonging to an inactive tenant, and when a request for an inactive tenant with an incompatible configuration does arrive, the impact is contained to a single worker. The configuration load architecture now prefers serving as many customers as quickly as possible, rather than waiting until everything is ready before serving anyone. These changes are slated to complete in April 2026 and will bring our RTO from the current ~1 hour to under 10 minutes for complete recovery from a worst-case scenario.

Continuous validation through Game Days

A critical element of our recovery confidence comes from GameDay fault-injection testing. We don't simply design recovery mechanisms and assume they work – we break the system deliberately and observe how it responds. Since late 2025, we have conducted recurring GameDay drills that simulate the exact failure scenarios we are defending against:

Food Taster crash scenarios: Injecting deliberately faulty tenant configurations to verify that they are caught and isolated with zero impact on live traffic. In our January 2026 GameDay, the Food Taster process crashed as expected, the system halted the update within approximately 5 seconds, and no customer traffic was affected.
Master process crash scenarios: Triggering master process crashes across test environments to verify that workers continue serving traffic, that the Local Config Shield engages within 10 seconds, and that the coordinated recovery tool restores full operation within the expected timeframe.
Multi-region failure drills: Simulating simultaneous failures across multiple regions to validate that global Config Shield mechanisms engage correctly, and that recovery procedures scale without requiring manual per-region intervention.
Fallback test drills for critical Azure services running behind Azure Front Door: In our February 2026 GameDay, we simulated the complete unavailability of Azure Front Door and successfully validated failover for critical Azure services with no impact to traffic.

These drills have both surfaced corner cases and built operational confidence. They have transformed recovery from a theoretical plan into tested, repeatable muscle memory. As we noted in an internal communication to our team: "Game day testing is a deliberate shift from assuming resilience to actively proving it – turning reliability into an observed and repeatable outcome."

Closing

Part 1 of this series emphasized preventing unsafe configurations from reaching the data plane, and data plane resiliency in case an incompatible configuration reaches production. This post has shown that prevention alone is not enough: when failures do occur, recovery must be fast, predictable, and bounded. By ensuring that the FlatBuffer cache is never invalidated, by loading only active tenants, and by building safe coordinated recovery tooling, we have transformed failure handling from a fleet-wide crisis into a controlled operation. These recovery investments work in concert with the prevention mechanisms described in Part 1. Together, they ensure that the path from incident detection to full service restoration is measured in minutes, with customer traffic protected at every step.
In the next post of this series, we will cover the third pillar of our resiliency strategy: tenant isolation – how micro-cellular architecture and ingress-layered sharding can reduce the blast radius of any failure to a small subset of tenants, ensuring that one customer's configuration or traffic anomaly never becomes everyone's problem. We deeply value our customers' trust in Azure Front Door. We are committed to transparently sharing our progress on these resiliency investments, and to exceeding expectations for safety, reliability, and operational readiness.

Announcing Azure DNS security policy with Threat Intelligence feed general availability
Azure DNS security policy with Threat Intelligence feed allows early detection and prevention of security incidents on customer virtual networks: known malicious domains, sourced by the Microsoft Security Response Center (MSRC), can be blocked from name resolution. Azure DNS security policy with Threat Intelligence feed is being announced to all customers and will have regional availability in all public regions.

Improve your resiliency posture with new capabilities and intelligent assistance
At Microsoft Ignite 2025, Azure introduces intelligent automation and expanded capabilities to keep your business running, no matter what. From zonal protection and disaster recovery to ransomware defense, discover how the new AI innovations in Azure Copilot help you move from reactive recovery to proactive resilience.

Optimize Your Cloud Environment Using Agentic AI
In today's cloud-first world, optimization is no longer a luxury – it's a strategic imperative. As IT professionals and developers navigate increasingly complex environments, the need to reduce costs, improve sustainability, and accelerate decision-making has never been more urgent. At Ignite 2025, Microsoft is introducing a new wave of agentic capabilities within Azure Copilot. One of the key capabilities is the optimization agent, designed to help you identify, validate, and act on opportunities to streamline cloud operations. For FinOps teams, this agent becomes especially powerful, enabling cost governance, carbon insights, and actionable recommendations to maximize financial efficiency at scale.

From Complexity to Clarity

For users familiar with Azure's cost and performance tools, the new operations center experience in the Azure portal provides a unified agentic experience to monitor spend and carbon emissions side by side, surface the most critical optimization opportunities, and seamlessly trigger actions by invoking the optimization agent – bringing governance, efficiency, and sustainability into one streamlined experience.

What's New in Optimization

The optimization agent in Azure Copilot empowers teams to:

Identify top actions prioritized by impact, cost savings, and ease of implementation.
Evaluate cost and carbon impacts side by side, helping you make informed decisions that align with financial and sustainability goals.
Validate recommendations with supporting evidence, current and projected utilization trends, and alternative SKU choices.
Accelerate implementation with step-by-step guidance and agentic workflows that reduce toil and increase confidence.

These capabilities are designed to scale FinOps impact, enabling collaboration across engineering, finance, procurement, and sustainability teams – all within a unified experience.

A Day in the Life: FinOps in Action

Let's step into the shoes of a FinOps practitioner at a large enterprise navigating the complexities of cost management. It's Monday morning. Over the weekend, a set of development VMs were left running, quietly accumulating costs. The optimization agent – a capability within Azure Copilot – surfaces a top action: resize or shut down the idle resources. With a few clicks, the practitioner reviews the supporting evidence, including usage trends, cost impact, and carbon footprint. The agent offers visibility into alternative SKUs and guides the practitioner through a step-by-step implementation – all within the same interface.

But it doesn't stop there. For teams that prefer automation or scripting, the agent also generates Azure CLI and PowerShell scripts tailored to the recommended action. This gives practitioners flexibility: they can execute changes directly in the portal or integrate scripts into their existing workflows for repeatability and scale. The experience is seamless – every recommendation is actionable, verifiable, and aligned with enterprise policy.

By midweek, the practitioner has implemented multiple optimizations without leaving the console or writing custom code. Each action is logged for audit visibility, ensuring compliance and transparency across the organization. What used to take hours of manual investigation and coordination now happens in minutes, freeing the team to focus on strategic initiatives rather than firefighting cost overruns.

Why It Matters

These aren't just features – they're answers to the pain points customers have been voicing for years.
Cost visibility and predictability: Azure Copilot centralizes insights across subscriptions, helping teams avoid surprise bills and understand where every dollar goes.
Resource inefficiencies: The optimization agent proactively identifies underutilized resources and guides teams to act before costs escalate.
Scalability and complexity: Azure Copilot's unified experience simplifies operations for even the most complex setups.

Azure Copilot isn't just simplifying cloud operations – it's transforming how teams collaborate, govern, and optimize.

Get Started at Ignite

At Ignite 2025, you'll get hands-on with Azure Copilot's optimization capabilities. Explore how intelligent assistance can help you:

Reduce cloud costs
Improve sustainability metrics
Strengthen governance and compliance
Drive better outcomes, faster

Azure Copilot: turning cloud operations into intelligent collaboration. Sign up for the Agents in Azure Copilot Limited (Preview) and try the experience today.

Azure Virtual Network Manager + Azure Virtual WAN
Azure continues to expand its networking capabilities, with Azure Virtual Network Manager and Azure Virtual WAN (vWAN) standing out as two of the most transformative services. When deployed together, they offer the best of both worlds: the operational simplicity of a managed hub architecture combined with the ability for spoke VNets to communicate directly, avoiding additional hub hops and minimizing latency.

Revisiting the classic hub-and-spoke pattern

Hub VNet: A centralized network that hosts shared services, including firewalls (e.g., Azure Firewall, NVAs), VPN/ExpressRoute gateways, DNS servers, domain controllers, and central route tables for traffic management. It acts as the connectivity and security anchor for all spoke networks.
Spoke VNets: Host individual application workloads and peer directly to the hub VNet. Traffic flows through the hub for north-south connectivity (to/from on-premises or the internet) and for cross-spoke communication (east-west traffic between spokes).
Benefits: A single enforcement point for security policies and network controls; no duplication of shared services across environments; simplified routing logic and traffic flow management; clear network segmentation and isolation between workloads; cost optimization through centralized resources.

However, this architecture comes with a trade-off: every spoke-to-spoke packet must route through the hub, introducing additional network hops, increased latency, and potential throughput constraints.

How Virtual WAN modernizes that design

Virtual WAN replaces a do-it-yourself hub VNet with a fully managed hub service:

Managed hubs – Azure owns and operates the hub infrastructure.
Automatic route propagation – routes learned once are usable everywhere.
Integrated add-ons – firewalls, VPN, and ExpressRoute ports are first-class citizens.

By default, Virtual WAN enables any-to-any routing between spokes. Traffic transits the hub fabric automatically – no configuration required.

Why direct spoke mesh?

Certain patterns require single-hop connectivity:

Micro-service meshes that sit in different spokes and exchange chatty RPC calls.
Database replication and backups, where throughput counts and hub bandwidth is precious.
Dev/Test/Prod spokes that need to sync artifacts quickly yet stay isolated from hub services.
Segmentation mandates, where a workload must bypass hub inspection for compliance yet still talk to a partner VNet.

Benefits:

Lower latency – the hub detour disappears.
Better bandwidth – no hub congestion or firewall throughput cap.
Higher resilience – spoke pairs can keep talking even if the hub is under maintenance.

The peering explosion problem

With pure VNet peering, the math escalates fast: for n spokes you need n × (n − 1)/2 links. Ten spokes? 45 peerings. Add four more? Now 91. Each extra peering forces you to:

Touch multiple route tables.
Update NSG rules to cover the new paths.
Repeat every time you add or retire a spoke.
Troubleshoot an ever-growing spider web.

The short calculation below shows how quickly this grows.
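A quick, illustrative Python snippet makes the growth concrete:

```python
def peering_links(n: int) -> int:
    """Full-mesh VNet peerings needed for n spokes: n choose 2."""
    return n * (n - 1) // 2

for spokes in (10, 14, 50):
    print(f"{spokes} spokes -> {peering_links(spokes)} peerings")
# 10 spokes -> 45 peerings
# 14 spokes -> 91 peerings
# 50 spokes -> 1225 peerings
```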
Where Azure Virtual Network Manager Steps In

Azure Virtual Network Manager introduces Network Groups plus a Mesh connectivity policy:

Network group: A logical container that groups multiple VNets together, allowing you to apply configurations and policies to all members simultaneously.
Mesh connectivity: Automated peering between all VNets in the group, ensuring every member can communicate directly with every other member without manual configuration.
Declarative config: An intent-based approach where you define the desired network state and Azure Virtual Network Manager handles the implementation and ongoing maintenance.
Dynamic updates: Automatic topology management – when VNets are added to or removed from a group, Azure Virtual Network Manager reconfigures all necessary connections without manual intervention.

Operational complexity collapses from O(n²) to O(1) – you manage a group, not 100+ individual peerings.

A complementary model: Azure Virtual Network Manager mesh inside vWAN

Since Azure Virtual Network Manager works on any Azure VNet – including the VNets you already attach to a vWAN hub – you can apply mesh policies on top of your existing managed hub architecture:

Spoke VNets join a vWAN hub for branch connectivity, centralized firewalling, or multi-region reach.
The same spokes are added to an Azure Virtual Network Manager network group with a mesh policy.
Azure Virtual Network Manager builds direct peering links between the spokes, while vWAN continues to advertise and learn routes.

Result:

All VNets still benefit from vWAN's global routing and on-premises integration.
Latency-critical east-west flows now travel the shortest path – one hop – as if the VNets were traditionally peered.

Rather than choosing one over the other, organizations can leverage both vWAN and Azure Virtual Network Manager together, as the combination enhances the strengths of each service.

Performance illustration

Figure: Spoke-to-spoke communication with Virtual WAN, without Azure Virtual Network Manager mesh.
Figure: Spoke-to-spoke communication with Virtual WAN, with Azure Virtual Network Manager mesh.

Observability & protection

NSG flow logs – granular packet logs on every peered VNet.
Azure Virtual Network Manager admin rules – org-wide guardrails that trump local NSGs.
Azure Monitor + SIEM – route flow logs to Log Analytics, Sentinel, or a third-party SIEM for threat detection.
Layered design – hub firewalls inspect north-south traffic; NSGs plus admin rules secure east-west flows.

Putting it all together

Virtual WAN offers fully managed global connectivity, simplifying the integration of branch offices and on-premises infrastructure into your Azure environment. Azure Virtual Network Manager mesh establishes direct communication paths between spoke VNets, making it ideal for workloads requiring high throughput or minimal latency in east-west traffic patterns. When combined, these services provide architects with granular control over traffic routing. Each flow can be directed through hub services when needed or routed directly between spokes for optimal performance – all without re-architecting your network or creating additional management complexity. By pairing Azure Virtual Network Manager's group-based mesh with vWAN's managed hubs, you get the best of both worlds: worldwide reach, centralized security, and single-hop performance where it counts.

Deploying Third-Party Firewalls in Azure Landing Zones: Design, Configuration, and Best Practices
As enterprises adopt Microsoft Azure for large-scale workloads, securing network traffic becomes a critical part of the platform foundation. Azure's Well-Architected Framework provides the blueprint for enterprise-scale Landing Zone design and deployments, and while Azure Firewall is a built-in PaaS option, some organizations prefer third-party firewall appliances for familiarity, feature depth, and vendor alignment. This blog explains the basic design patterns, key configurations, and best practices for deploying third-party firewalls (Palo Alto, Fortinet, Check Point, etc.) as part of an Azure Landing Zone.

1. Landing Zone Architecture and Firewall Role

The Azure Landing Zone is Microsoft's recommended enterprise-scale architecture for adopting cloud at scale. It provides a standardized, modular design that organizations can use to deploy and govern workloads consistently across subscriptions and regions. At its core, the Landing Zone follows a hub-and-spoke topology:

Hub (Connectivity Subscription): The central place for shared services such as DNS, private endpoints, VPN/ExpressRoute gateways, Azure Firewall (or third-party firewall appliances), Bastion, and monitoring agents. It provides consistent security controls and connectivity for all workloads. Firewalls are deployed here to act as the traffic inspection and enforcement point.
Spokes (Workload Subscriptions): Application workloads (e.g., SAP, web apps, data platforms) are placed in spoke VNets. Additional specialized spokes may exist for Identity, Shared Services, Security, or Management. These are isolated for governance and compliance, but all connectivity back to other workloads or on-premises routes through the hub.

Traffic Flows Through Firewalls

North-South Traffic: Inbound connections from the Internet (e.g., customer access to applications), outbound connections from Azure workloads to Internet services, and hybrid connectivity to on-premises datacenters or other clouds. This traffic is routed through the external firewall set for inspection and policy enforcement.
East-West Traffic: Lateral traffic between spokes (e.g., application VNet to database VNet) and communication across environments like Dev → Test → Prod (if allowed). This traffic is routed through an internal firewall set to apply segmentation and zero-trust principles and to prevent lateral movement of threats.

Why Firewalls Matter in the Landing Zone

While Azure provides NSGs (Network Security Groups) and route tables for basic packet filtering and routing, these are not sufficient for advanced security scenarios. Firewalls add:

Deep packet inspection (DPI) and application-level filtering.
Intrusion Detection/Prevention (IDS/IPS) capabilities.
Centralized policy management across multiple spokes.
Segmentation of workloads to reduce the blast radius of potential attacks.
Consistent enforcement of enterprise security baselines across hybrid and multi-cloud environments.

Organizations May Choose

Depending on security needs, cost tolerance, and operational complexity, organizations typically adopt one of two models for third-party firewalls:

Two sets of firewalls: One set dedicated to north-south traffic (external to Azure), and another set for east-west traffic (between VNets and spokes). This provides the highest security granularity but comes with higher cost and management overhead.
A single set of firewalls: A consolidated deployment where the same firewall cluster handles both east-west and north-south traffic. This is simpler and more cost-effective, but may introduce complexity in routing and policy segregation.
This design choice is usually made during Landing Zone design, balancing security requirements, budget, and operational maturity.

2. Why Choose Third-Party Firewalls Over Azure Firewall?

While Azure Firewall provides simplicity as a managed service, customers often choose third-party solutions due to:

Advanced features – deep packet inspection, IDS/IPS, SSL decryption, threat feeds.
Vendor familiarity – network teams trained on Palo Alto, Fortinet, or Check Point.
Existing contracts – enterprise license agreements and support channels.
Hybrid alignment – the same vendor firewalls across on-premises and Azure.

Azure Firewall is a fully managed PaaS service, ideal for customers who want a simple, cloud-native solution without worrying about underlying infrastructure. However, many enterprises continue to choose third-party firewall appliances (Palo Alto, Fortinet, Check Point, etc.) when implementing their Landing Zones. The decision usually depends on capabilities, familiarity, and enterprise strategy.

Key Reasons to Choose Third-Party Firewalls

Feature Depth and Advanced Security: Third-party vendors offer advanced capabilities such as deep packet inspection (DPI) for application-aware filtering, Intrusion Detection and Prevention (IDS/IPS), SSL/TLS decryption and inspection, advanced threat feeds, malware protection, sandboxing, and botnet detection. While Azure Firewall continues to evolve, these vendors have a longer track record in advanced threat protection.

Operational Familiarity and Skills: Network and security teams often have years of experience managing Palo Alto, Fortinet, or Check Point appliances on-premises. Adopting the same technology in Azure reduces the learning curve and ensures faster troubleshooting, smoother operations, and reuse of existing playbooks.

Integration with Existing Security Ecosystem: Many organizations already use vendor-specific management platforms (e.g., Panorama for Palo Alto, FortiManager for Fortinet, or SmartConsole for Check Point). Extending the same tools into Azure allows centralized management of policies across on-premises and cloud, ensuring consistent enforcement.

Compliance and Regulatory Requirements: Certain industries (finance, healthcare, government) require proven, certified firewall vendors for security compliance. Customers may already have third-party solutions validated by auditors and prefer extending those to Azure for consistency.

Hybrid and Multi-Cloud Alignment: Many enterprises run a hybrid model, with workloads split across on-premises, Azure, AWS, or GCP. Third-party firewalls provide a common security layer across environments, simplifying multi-cloud operations and governance.

Customization and Flexibility: Unlike Azure Firewall, which is a managed service with limited backend visibility, third-party firewalls give admins full control over operating systems, patching, advanced routing, and custom integrations. This flexibility can be essential when supporting complex or non-standard workloads.

Licensing Leverage (BYOL): Enterprises with existing enterprise agreements or volume discounts can bring their own firewall licenses (BYOL) to Azure. This often reduces cost compared to pay-as-you-go Azure Firewall pricing.

When Azure Firewall Might Still Be Enough

Organizations with simple security needs (basic north-south inspection, FQDN filtering).
Cloud-first teams that prefer managed services with minimal infrastructure overhead.
Customers who want to avoid the manual scaling and VM patching that come with IaaS appliances.
In practice, many large organizations use a hybrid approach: Azure Firewall for lightweight scenarios or specific environments, and third-party firewalls for enterprise workloads that require advanced inspection, vendor alignment, and compliance certifications.

3. Deployment Models in Azure

Third-party firewalls in Azure are primarily IaaS-based appliances deployed as virtual machines (VMs). Leading vendors publish Azure Marketplace images and ARM/Bicep templates, enabling rapid, repeatable deployments across multiple environments. These firewalls allow organizations to enforce advanced network security policies, perform deep packet inspection, and integrate with Azure-native services such as Virtual Network (VNet) peering, Azure Monitor, and Azure Sentinel.

Note: Some vendors now also release PaaS versions of their firewalls, offering managed firewall services with simplified operations. However, this blog focuses mainly on IaaS-based firewall deployments.

Common Deployment Modes

Active-Active

Description: In this mode, multiple firewall VMs operate simultaneously, sharing the traffic load. An Azure Load Balancer distributes inbound and outbound traffic across all active firewall instances.
Use cases: Ideal for environments requiring high throughput, resilience, and near-zero downtime, such as enterprise data centers, multi-region deployments, or mission-critical applications.
Considerations: Requires careful route and policy synchronization between firewall instances to ensure consistent traffic handling. Typically involves BGP or user-defined routes (UDRs) for optimal traffic steering. Scaling is easier: additional firewall VMs can be added behind the load balancer to handle traffic spikes.

Active-Passive

Description: One firewall VM handles all traffic (active), while the secondary VM (passive) stands by for failover. When the active node fails, Azure service principals manage IP reassignment and traffic rerouting.
Use cases: Suitable for environments where simpler management and lower operational complexity are preferred over continuous load balancing.
Considerations: Failover may result in brief downtime, typically measured in seconds to a few minutes. Synchronization between the active and passive nodes ensures firewall policies, sessions, and configurations are mirrored. Recommended for smaller deployments or those with predictable traffic patterns.

Network Interfaces (NICs)

Third-party firewall VMs often include multiple NICs, each dedicated to a specific type of traffic:

Untrust/Public NIC: Connects to the Internet or external networks. Handles inbound/outbound public traffic and enforces perimeter security policies.
Trust/Internal NIC: Connects to private VNets or subnets. Manages internal traffic between application tiers and enforces internal segmentation.
Management NIC: Dedicated to firewall management traffic. Keeps administration separate from data plane traffic, improving security and reducing performance interference.
HA NIC (Active-Passive setups): Facilitates synchronization between active and passive firewall nodes, ensuring session and configuration state is maintained across failovers.

Figure: NICs of Palo Alto external firewalls and FortiGate internal firewalls in a two-sets-of-firewalls scenario.
4. Key Configuration Considerations

When deploying third-party firewalls in Azure, several design and configuration elements play a critical role in ensuring security, performance, and high availability. These considerations should be carefully aligned with organizational security policies, compliance requirements, and operational practices.

Routing

User-Defined Routes (UDRs): Define UDRs in spoke virtual networks to ensure all outbound traffic flows through the firewall, enforcing inspection and security policies before traffic reaches the Internet or other virtual networks. Centralized routing helps standardize controls across multiple application VNets. Depending on the traffic flow design, use the appropriate load balancer IP as the next hop in the spoke VNets' UDRs (a programmatic sketch appears at the end of this section).
Symmetric Routing: Ensure traffic follows symmetric paths (i.e., outbound and inbound flows pass through the same firewall instance). Avoid asymmetric routing, which can cause stateful firewalls to drop return traffic. Leverage BGP with Azure Route Server, where supported, to simplify route propagation across hub-and-spoke topologies.

Figure: An Azure UDR directing all traffic from a spoke VNet to the firewall IP address.

Policies

NAT Rules: Configure DNAT (Destination NAT) rules to publish applications securely to the Internet. Use SNAT (Source NAT) to mask private IPs when workloads access external resources.
Security Rules: Define granular allow/deny rules for both north-south traffic (Internet to VNet) and east-west traffic (between VNets or subnets). Ensure least privilege by allowing only the required ports, protocols, and destinations.
Segmentation: Apply firewall policies to separate workloads, environments, and tenants (e.g., Production vs. Development). Enforce compliance by isolating workloads subject to regulatory standards (PCI-DSS, HIPAA, GDPR).
Application-Aware Policies: Many vendors support Layer 7 inspection, enabling controls based on applications, users, and content (not just IP/port). Integrate with identity providers (Azure AD, LDAP, etc.) for user-based firewall rules.

Figure: Example configuration of NAT rules on a Palo Alto external firewall.

Load Balancers

Internal Load Balancer (ILB): Use ILBs for east-west traffic inspection between VNets or subnets. This ensures traffic between applications always passes through the firewall, regardless of origin.
External Load Balancer (ELB): Use ELBs for north-south traffic, handling Internet ingress and egress. Required in Active-Active firewall clusters to distribute traffic evenly across firewall nodes.
Other configurations: Configure health probes for firewall instances so that faulty nodes are automatically bypassed. Validate the Floating IP configuration on load balancing rules according to the respective vendor's recommendations.

Identity Integration

Azure Service Principals: In Active-Passive deployments, configure service principals to enable automated IP reassignment during failover. This ensures continuous service availability without manual intervention.
Role-Based Access Control (RBAC): Integrate firewall management with Azure RBAC to control who can deploy, manage, or modify firewall configurations.
SIEM Integration: Stream logs to Azure Monitor, Sentinel, or third-party SIEMs for auditing, monitoring, and incident response.
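Returning to the routing guidance above, the spoke UDR can be provisioned programmatically. Below is a minimal sketch using the azure-identity and azure-mgmt-network Python packages; the subscription ID, resource group, region, and firewall ILB frontend IP are placeholders.

```python
# Minimal sketch: a route table that sends all outbound traffic from a
# spoke subnet to a firewall's internal load balancer frontend IP.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import Route, RouteTable

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "rg-spoke-network"     # placeholder
FIREWALL_ILB_IP = "10.0.1.4"            # placeholder ILB frontend IP

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

route_table = RouteTable(
    location="eastus",                  # placeholder region
    routes=[
        Route(
            name="default-to-firewall",
            address_prefix="0.0.0.0/0",        # all outbound traffic
            next_hop_type="VirtualAppliance",  # NVA / firewall ILB
            next_hop_ip_address=FIREWALL_ILB_IP,
        )
    ],
)

poller = client.route_tables.begin_create_or_update(
    RESOURCE_GROUP, "rt-spoke-to-firewall", route_table
)
print(f"Route table provisioned: {poller.result().name}")
# The route table must then be associated with the spoke subnet(s).
```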
Licensing

Pay-As-You-Go (PAYG): Licenses are bundled into the VM cost when deploying from the Azure Marketplace. Best for short-term projects, PoCs, or variable workloads.
Bring Your Own License (BYOL): Enterprises can apply existing vendor contracts and licenses to Azure deployments. Often more cost-effective for large-scale, long-term deployments.
Hybrid Licensing Models: Some vendors support license mobility from on-premises to Azure, reducing duplication of costs.

5. Common Challenges

Third-party firewalls in Azure provide strong security controls, but organizations often face practical challenges in day-to-day operations:

Misconfiguration: Incorrect UDRs, route tables, or NAT rules can cause dropped traffic or bypassed inspection. Asymmetric routing is a frequent issue in hub-and-spoke topologies, leading to session drops in stateful firewalls.
Performance Bottlenecks: Firewall throughput depends on the VM SKU (CPU, memory, NIC limits). Under-sizing causes latency and packet loss, while over-sizing adds unnecessary cost. Continuous monitoring and vendor sizing guides are essential.
Failover Downtime: Active-Passive models introduce brief service interruptions while IPs and routes are reassigned. Some sessions may be lost even with state sync, making Active-Active more attractive for mission-critical workloads.
Backup & Recovery: Azure Backup doesn't support vendor firewall operating systems. Configurations must be exported and stored externally (e.g., in storage accounts, repos, or vendor management tools). Without proper backups, recovery from failures or misconfigurations can be slow.
Azure Platform Limits on Connections: Azure imposes a per-VM cap of 250,000 active connections, regardless of what the firewall vendor appliance supports. This means that even if an appliance is designed for millions of sessions, it will be constrained by Azure's networking fabric. Hitting this cap can lead to unexplained traffic drops despite available CPU and memory. The workaround is to scale out horizontally (multiple firewall VMs behind a load balancer) and carefully monitor connection distribution.

6. Best Practices for Third-Party Firewall Deployments

To maximize the security, reliability, and performance of third-party firewalls in Azure, organizations should follow these best practices:

Deploy in Availability Zones: Place firewall instances across different Availability Zones to ensure regional resilience and minimize downtime in case of zone-level failures.
Prefer Active-Active for Critical Workloads: Where zero downtime is a requirement, use Active-Active clusters behind an Azure Load Balancer. Active-Passive can be simpler but introduces failover delays.
Use Dedicated Subnets for Interfaces: Separate trust, untrust, HA, and management NICs into their own subnets. This enforces segmentation, simplifies route management, and reduces misconfiguration risk.
Apply Least Privilege Policies: Always start with a deny-all baseline, then allow only necessary applications, ports, and protocols. Regularly review rules to avoid policy sprawl.
Standardize Naming & Tagging: Adopt consistent naming conventions and resource tags for firewalls, subnets, route tables, and policies. This aids troubleshooting, automation, and compliance reporting.
Validate End-to-End Traffic Flows: Test both north-south (Internet ↔ VNet) and east-west (VNet ↔ VNet/subnet) flows after deployment. Use tools like Azure Network Watcher and vendor traffic logs to confirm inspection.
Plan for Scalability: Monitor throughput, CPU, memory, and session counts to anticipate when scale-out or higher VM SKUs are needed. Some vendors support autoscaling clusters for bursty workloads.
Maintain Firmware & Threat Signatures: Regularly update the firewall's software, patches, and threat intelligence feeds to ensure protection against emerging vulnerabilities and attacks. Automate updates where possible.

Conclusion

Third-party firewalls remain a core building block in many enterprise Azure Landing Zones. They provide the deep security controls and operational familiarity enterprises need, while Azure provides the scalable infrastructure to host them. By following the hub-and-spoke architecture, carefully planning deployment models, and enforcing best practices for routing, redundancy, monitoring, and backup, organizations can ensure a secure and reliable network foundation in Azure.

Empower Smarter AI Agent Investments
This curated series of modules is designed to equip technical and business decision-makers – including IT, developers, engineers, AI engineers, administrators, solution architects, business analysts, and technology managers – with the practical knowledge and guidance needed to make cost-conscious decisions at every stage of the AI agent journey. From identifying high-impact use cases and understanding cost drivers, to forecasting ROI, adopting best practices, designing scalable and effective architectures, and optimizing ongoing investments, this learning path provides actionable guidance for building, deploying, and managing AI agents on Azure with confidence. Whether you're just starting your AI journey or looking to scale enterprise adoption, these modules will help you align innovation with financial discipline, ensuring your AI agent initiatives deliver sustainable value and long-term success. Discover the full learning path here: aka.ms/Cost-Efficient-AI-Agents

Explore the sections below for an overview of each module included in this learning path, highlighting the core concepts, practical strategies, and actionable insights designed to help you maximize the value of AI agent investments on Azure.

Module 1: Identify and Prioritize High-Impact, Cost-Effective AI Agent Use Cases

The journey begins with a strategic approach to selecting AI agent use cases that maximize business impact and cost efficiency. This module introduces a structured framework for researching proven use cases, collaborating across teams, and defining KPIs to evaluate feasibility and ROI. You'll learn how to target "quick wins" while ensuring alignment with organizational goals and resource constraints. Explore this module.

Module 2: Understand the Key Cost Drivers of AI Agents

Building on the foundation of use case selection, Module 2 dives into the core cost drivers of AI agent development and operations on Azure. It covers infrastructure, integration, data quality, team expertise, and ongoing operational expenses, offering actionable strategies to optimize spending at every stage. The module emphasizes right-sizing resources, efficient data preparation, and leveraging Microsoft tools to streamline development and ensure sustainable, scalable success. Explore this module.

Module 3: Forecast the Return on Investment (ROI) of AI Agents

With a clear understanding of costs, the next step is to quantify value. Module 3 empowers both business and technical leaders with practical frameworks for forecasting and communicating ROI, even without a finance background. Through step-by-step guides and real-world examples, you'll learn to measure tangible and intangible outcomes, apply NPV calculations, and use sensitivity analysis to prioritize AI investments that align with broader organizational objectives. Explore this module.

Module 4: Implement Best Practices to Empower AI Agent Efficiency and Ensure Long-Term Success

To drive efficiency and governance at scale, Module 4 introduces essential frameworks such as the AI Center of Excellence (CoE), FinOps, GenAI Ops, the Cloud Adoption Framework (CAF), and the Well-Architected Framework (WAF). These best practices help organizations accelerate adoption, optimize resources, and foster operational excellence, ensuring AI agents deliver measurable value, remain secure, and support sustainable enterprise growth.
Explore this module

Module 5: Maximize Cost Efficiency by Choosing the Right AI Agent Development Approach

Selecting the right development approach is critical for balancing speed, customization, and cost. In Module 5, you’ll learn how to align business needs and technical skills with SaaS, PaaS, or IaaS options, empowering both business users and developers to efficiently build, deploy, and manage AI agents. The module also highlights how Microsoft Copilot Studio, Visual Studio, and Azure AI Foundry can help your organization achieve its goals.

Explore this module

Module 6: Architect Scalable and Cost-Efficient AI Agent Solutions on Azure

As your AI initiatives grow, architectural choices become paramount. Module 6 explores how to leverage Azure Landing Zones and reference architectures for secure, well-governed, and cost-optimized deployments. It compares single-agent and multi-agent systems, highlights strategies for cost-aware model selection, and details best practices for governance, tagging, and pricing, ensuring your AI solutions remain flexible, resilient, and financially sustainable.

Explore this module

Module 7: Manage and Optimize AI Agent Investments on Azure

The learning path concludes with a focus on operational excellence. Module 7 provides guidance on monitoring agent performance and spending using Azure AI Foundry Observability, Azure Monitor Application Insights, and Microsoft Cost Management. Learn how to track key metrics, set budgets, receive real-time alerts, and optimize resource allocation (a budget-creation sketch also appears at the end of this overview), empowering your organization to maximize ROI, stay within budget, and deliver ongoing business value.

Explore this module

Ready to accelerate your AI agent journey with financial confidence? Start exploring the new learning path and unlock proven strategies to maximize the cost efficiency of your AI agents on Azure, transforming innovation into measurable, sustainable business success. Get started today
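For readers who want to see the Module 3 idea concretely, here is a minimal sketch of the standard net present value formula as it might apply to an AI agent investment. The cash-flow figures and discount rate are illustrative assumptions, not figures from the learning path.

```latex
% Net present value over T periods with discount rate r and cash flows CF_t:
\[
\mathrm{NPV} = \sum_{t=0}^{T} \frac{CF_t}{(1+r)^{t}}
\]
% Illustrative example: a $100k build cost (CF_0 = -100{,}000), followed by
% $60k of annual net benefit for two years, discounted at 10%:
\[
\mathrm{NPV} = -100{,}000 + \frac{60{,}000}{1.1} + \frac{60{,}000}{1.1^{2}}
\approx 4{,}132
\]
```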
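And as a taste of the Module 7 tooling, the following Azure PowerShell sketch creates a monthly Microsoft Cost Management budget with an email alert at 80% of the limit. The budget name, amount, scope, and contact address are hypothetical placeholders.

```powershell
# Sketch: a subscription-scoped monthly cost budget with a near-limit alert.
# Name, amount, dates, and contact address are illustrative placeholders.
$start = (Get-Date -Day 1).Date   # budgets must start on the first of a month

New-AzConsumptionBudget `
    -Name 'ai-agents-monthly-budget' `
    -Amount 5000 `
    -Category Cost `
    -TimeGrain Monthly `
    -StartDate $start `
    -EndDate $start.AddYears(1) `
    -NotificationKey 'near-limit' `
    -NotificationEnabled `
    -NotificationThreshold 80 `
    -ContactEmail 'finops-team@contoso.com'
```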
Cloud and AI Cost Efficiency: A Strategic Imperative for Long-Term Business Growth

In this blog, we’ll explore why cost efficiency is a top priority for organizations today and how Azure Essentials can help address this challenge, and we’ll provide an overview of Microsoft’s solutions, tools, programs, and resources designed to help organizations maximize the value of their cloud and AI investments.
Using Application Gateway to secure access to the Azure OpenAI Service: Customer success story

Introduction

A large enterprise customer set out to build a generative AI application using Azure OpenAI. While the app would be hosted on-premises, the customer wanted to leverage the latest large language models (LLMs) available through Azure OpenAI. However, they faced a critical challenge: how to securely access Azure OpenAI from an on-premises environment without private network connectivity or a full Azure landing zone. This blog post walks through how the customer overcame these limitations using Application Gateway as a reverse proxy in front of Azure OpenAI, along with other Azure services, to meet their security and governance requirements.

Customer landscape and challenges

The customer’s environment lacked:
- Private network connectivity (no Site-to-Site VPN or ExpressRoute), because they were operating in a new Azure Government environment and had not yet stood up a cloud operations team
- A common network topology such as Virtual WAN or a hub-and-spoke network design
- A full Enterprise Scale Landing Zone (ESLZ) of shared infrastructure
- Security components such as private DNS zones, DNS resolvers, API Management, and firewalls

This meant they couldn’t use private endpoints or the other standard security controls typically available in mature Azure environments. Security was non-negotiable: public access to Azure OpenAI was unacceptable. The customer needed to:
- Restrict access to specific IP CIDR ranges from on-premises user machines and data centers
- Limit the ports used to communicate with Azure OpenAI
- Implement a reverse proxy with SSL termination and a Web Application Firewall (WAF)
- Use a customer-provided SSL certificate to secure traffic

Proposed solution

To address these challenges, the customer designed a secure architecture using the following Azure components.

Key Azure services
- Application Gateway – Layer 7 reverse proxy, SSL termination, and Web Application Firewall (WAF)
- Public IP – allows communication over the public internet between the customer’s IP addresses and Azure IP addresses
- Virtual Network – allows control of network traffic in Azure
- Network Security Group (NSG) – Layer 4 network controls such as port numbers and service tags, evaluated using five-tuple information (source, source port, destination, destination port, protocol)
- Azure OpenAI – large language model (LLM) service

NSG configuration
- Inbound rules: allow traffic only from specific IP CIDR ranges and HTTP(S) ports (see the first sketch at the end of this post)
- Outbound rules: target the AzureCloud.<region> service tag with HTTP(S) ports (there is no dedicated service tag for Azure OpenAI yet)

Application Gateway setup
- SSL certificate: issued by the customer’s on-premises Certificate Authority
- HTTPS listener: uses the on-premises certificate to terminate SSL
- Traffic flow:
  1. Decrypt incoming traffic
  2. Scan with WAF
  3. Re-encrypt using a well-known Azure CA
  4. Override the backend hostname
- Custom health probe: configured to treat a 404 response from Azure OpenAI as healthy, since the service exposes no dedicated health-check endpoint (see the second sketch at the end of this post)

Azure OpenAI configuration
- IP firewall restrictions: only allow traffic from the Application Gateway subnet (see the third sketch at the end of this post)

Outcome

By combining Application Gateway, NSGs, and custom SSL configurations, the customer successfully secured their Azure OpenAI deployment without needing a full ESLZ or private connectivity. This approach enabled them to move forward with their generative AI app while maintaining enterprise-grade security and governance.
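To make the NSG configuration concrete, here is a minimal Azure PowerShell sketch of the inbound and outbound rules described above. The CIDR range, region, and resource names are hypothetical placeholders, not values from the customer’s environment.

```powershell
# Sketch: NSG allowing inbound HTTPS only from known on-prem ranges and
# outbound HTTPS only to the regional AzureCloud service tag.
# The CIDR range, region, and names below are illustrative placeholders.
$inbound = New-AzNetworkSecurityRuleConfig `
    -Name 'allow-onprem-https-in' `
    -Direction Inbound -Access Allow -Protocol Tcp -Priority 100 `
    -SourceAddressPrefix '203.0.113.0/24' -SourcePortRange '*' `
    -DestinationAddressPrefix '*' -DestinationPortRange 443

$outbound = New-AzNetworkSecurityRuleConfig `
    -Name 'allow-azurecloud-https-out' `
    -Direction Outbound -Access Allow -Protocol Tcp -Priority 100 `
    -SourceAddressPrefix '*' -SourcePortRange '*' `
    -DestinationAddressPrefix 'AzureCloud.usgovvirginia' -DestinationPortRange 443

# Note: an Application Gateway v2 subnet also requires the standard
# infrastructure rules (e.g., inbound from GatewayManager on ports
# 65200-65535); those are omitted here for brevity.
New-AzNetworkSecurityGroup `
    -Name 'nsg-appgw-subnet' `
    -ResourceGroupName 'rg-ai-gateway' `
    -Location 'usgovvirginia' `
    -SecurityRules $inbound, $outbound
```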
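The custom health probe can be expressed roughly as follows: New-AzApplicationGatewayProbeHealthResponseMatch lets the probe accept the 404 that the Azure OpenAI base URL returns, so the backend is not marked unhealthy. The host name and probe timings are assumptions for illustration.

```powershell
# Sketch: custom probe that treats a 404 from the Azure OpenAI endpoint as
# healthy, since the service has no dedicated health-check path.
# The host name and timings are illustrative placeholders.
$match = New-AzApplicationGatewayProbeHealthResponseMatch -StatusCode '404'

$probe = New-AzApplicationGatewayProbeConfig `
    -Name 'probe-aoai' `
    -Protocol Https `
    -HostName 'contoso-aoai.openai.azure.com' `
    -Path '/' `
    -Interval 30 -Timeout 30 -UnhealthyThreshold 3 `
    -Match $match
# $probe is then attached to the gateway's backend HTTP settings
# (e.g., via New-AzApplicationGatewayBackendHttpSetting -Probe $probe).
```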
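Finally, locking the Azure OpenAI resource down so that only the Application Gateway subnet can reach it might look like the following. This assumes the subnet has a Microsoft.CognitiveServices service endpoint enabled; the account name, resource group, and subnet ID are hypothetical.

```powershell
# Sketch: deny public access to the Azure OpenAI account by default, then
# allow only the Application Gateway subnet via a virtual network rule.
# Names and IDs are illustrative placeholders.
Update-AzCognitiveServicesAccountNetworkRuleSet `
    -ResourceGroupName 'rg-ai-gateway' `
    -Name 'aoai-contoso' `
    -DefaultAction Deny

$subnetId = '/subscriptions/<sub-id>/resourceGroups/rg-ai-gateway' +
            '/providers/Microsoft.Network/virtualNetworks/vnet-ai-gateway' +
            '/subnets/snet-appgw'

Add-AzCognitiveServicesAccountNetworkRule `
    -ResourceGroupName 'rg-ai-gateway' `
    -Name 'aoai-contoso' `
    -VirtualNetworkResourceId $subnetId
```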