Azure VMware Solution Availability Design Considerations
Published Nov 28 2022 06:00 AM
Microsoft

Azure VMware Solution Design Series

 

 

Overview

A global enterprise wants to migrate thousands of VMware vSphere virtual machines (VMs) to Microsoft Azure as part of their application modernization strategy. The first step is to exit their on-premises data centers and rapidly relocate their legacy application VMs to the Azure VMware Solution as a staging area for the first phase of their modernization strategy. What should the Azure VMware Solution look like?

 

Azure VMware Solution is a VMware-validated, first-party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It enables customers to leverage their existing investments in VMware skills and tools, allowing them to focus on developing and running their VMware-based workloads on Azure.

 

In this post, I will introduce the typical customer workload availability requirements, describe the Azure VMware Solution architectural components, and describe the availability design considerations for Azure VMware Solution private clouds.

 

In the next section, I will introduce the typical availability requirements of a customer’s workload.

 

Customer Workload Requirements

A typical customer has multiple application tiers that have specific Service Level Agreement (SLA) requirements that need to be met. These SLAs are normally named by a tiering system such as Platinum, Gold, Silver, and Bronze or Mission-Critical, Business-Critical, Production, and Test/Dev. Each SLA will have different availability, recoverability, performance, manageability, and security requirements that need to be met.

 

For the availability design quality, customers will normally have an uptime percentage requirement with an availability zone (AZ) or region requirement that defines each SLA level. For example:

 

SLA Name | Uptime                           | AZ/Region
Gold     | 99.999% (5.26 min downtime/year) | Dual Regions
Silver   | 99.99% (52.6 min downtime/year)  | Dual AZs
Bronze   | 99.9% (8.76 hrs downtime/year)   | Single AZ

Table 1 – Typical Customer SLA requirements for Availability
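The downtime figures in Table 1 follow directly from the uptime percentages. As a minimal sketch, assuming a 365-day year (the function and constant names are illustrative only), the arithmetic can be reproduced like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


# SLA names and uptime targets taken from Table 1.
for name, uptime in [("Gold", 99.999), ("Silver", 99.99), ("Bronze", 99.9)]:
    minutes = allowed_downtime_minutes(uptime)
    print(f"{name}: {uptime}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hrs/year)")
```

Each additional "nine" reduces the downtime budget by a factor of ten, which is why each SLA level in the table calls for a wider failure domain.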

 

A typical legacy business-critical application will have the following application architecture:

 

  • Load Balancer layer: Uses load balancers to distribute traffic across multiple web servers in the web layer to improve application availability.
  • Web layer: Uses web servers to process client requests made via the secure Hypertext Transfer Protocol (HTTPS). Receives traffic from the load balancer layer and forwards to the application layer.
  • Application layer: Uses application servers to run software that delivers a business application through a communication protocol. Receives traffic from the web layer and uses the database layer to access stored data.
  • Database layer: Uses a relational database management system (RDBMS) cluster to store data and provide database services to the application layer.

 

Depending upon the availability requirements for the service, the application components could be many and spread across multiple sites and regions to meet the customer SLA.

 


Figure 1 – Typical Legacy Business-Critical Application Architecture

 

In the next section, I will introduce the architectural components of the Azure VMware Solution.

 

Architectural Components

The diagram below describes the architectural components of the Azure VMware Solution.

 


Figure 2 – Azure VMware Solution Architectural Components

 

Each Azure VMware Solution architectural component has the following function:

 

  • Azure Subscription: Used to provide controlled access, budget and quota management for the Azure VMware Solution.
  • Azure Region: Physical locations around the world where we group data centers into Availability Zones (AZs) and then group AZs into regions.
  • Azure Resource Group: Container used to place Azure services and resources into logical groups.
  • Azure VMware Solution Private Cloud: Uses VMware software, including vCenter Server, NSX software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported.
  • Azure VMware Solution Resource Cluster: Uses VMware software, including vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads by scaling out the Azure VMware Solution private cloud. Azure NetApp Files, Azure Elastic SAN, and Pure Cloud Block Store are also supported.
  • VMware HCX: Provides mobility, migration, and network extension services.
  • VMware Site Recovery: Provides disaster recovery automation and storage replication services with VMware vSphere Replication. The third-party disaster recovery solutions Zerto DR and JetStream DR are also supported.
  • Dedicated Microsoft Enterprise Edge (D-MSEE): Router that provides connectivity between Azure cloud and the Azure VMware Solution private cloud instance.
  • Azure Virtual Network (VNet): Private network used to connect Azure services and resources together.
  • Azure Route Server: Enables network appliances to exchange dynamic route information with Azure networks.
  • Azure Virtual Network Gateway: Cross-premises gateway for connecting Azure services and resources to other private networks using IPsec VPN, ExpressRoute, and VNet-to-VNet.
  • Azure ExpressRoute: Provides high-speed private connections between Azure data centers and on-premises or colocation infrastructure.
  • Azure Virtual WAN (vWAN): Aggregates networking, security, and routing functions together into a single unified Wide Area Network (WAN).

 

In the next section, I will describe the availability design considerations for the Azure VMware Solution.

 

Availability Design Considerations

The architectural design process takes the business problem to be solved and the business goals to be achieved and distills these into customer requirements, design constraints and assumptions. Design constraints can be characterized by the following three categories:

 

  • Laws of the Land – data and application sovereignty, governance, regulatory, compliance, etc.
  • Laws of Physics – data and machine gravity, network latency, etc.
  • Laws of Economics – owning versus renting, total cost of ownership (TCO), return on investment (ROI), capital expenditure, operational expenditure, earnings before interest, taxes, depreciation, and amortization (EBITDA), etc.

 

Each design consideration will be a trade-off between the availability, recoverability, performance, manageability, and security design qualities. The desired result is to deliver business value with the minimum of risk by working backwards from the customer problem.

 

Design Consideration 1 – Azure Region and AZs: Azure VMware Solution is available in 30 Azure Regions around the world (US Government has 2 additional Azure Regions). Select the relevant Azure Regions and AZs that meet your geographic requirements. These locations will typically be driven by your design constraints.

 

Design Consideration 2 – Deployment topology: Select the Azure VMware Solution topology that best matches the uptime and geographic requirements of your SLAs. For very large deployments, it may make sense to have separate private clouds dedicated to each SLA for cost efficiency.

 

The Azure VMware Solution supports a maximum of 12 clusters per private cloud. Each cluster supports a minimum of 3 hosts and a maximum of 16 hosts, and each private cloud supports a maximum of 96 hosts.
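These limits interact: 12 clusters of 16 hosts would be 192 hosts, so the 96-host private cloud limit is reached first. A minimal sketch of a layout check, using only the limits quoted in this post (the function and constant names are hypothetical and not part of any Azure SDK), might look like this:

```python
# Limits as quoted in this post; confirm against current service limits before use.
MAX_CLUSTERS_PER_PRIVATE_CLOUD = 12
MIN_HOSTS_PER_CLUSTER = 3
MAX_HOSTS_PER_CLUSTER = 16
MAX_HOSTS_PER_PRIVATE_CLOUD = 96


def validate_private_cloud(hosts_per_cluster: list[int]) -> list[str]:
    """Return a list of limit violations; an empty list means the layout fits."""
    problems = []
    if len(hosts_per_cluster) > MAX_CLUSTERS_PER_PRIVATE_CLOUD:
        problems.append(
            f"{len(hosts_per_cluster)} clusters exceeds {MAX_CLUSTERS_PER_PRIVATE_CLOUD}"
        )
    for index, hosts in enumerate(hosts_per_cluster, start=1):
        if not MIN_HOSTS_PER_CLUSTER <= hosts <= MAX_HOSTS_PER_CLUSTER:
            problems.append(
                f"Cluster-{index} has {hosts} hosts "
                f"(allowed {MIN_HOSTS_PER_CLUSTER}-{MAX_HOSTS_PER_CLUSTER})"
            )
    total_hosts = sum(hosts_per_cluster)
    if total_hosts > MAX_HOSTS_PER_PRIVATE_CLOUD:
        problems.append(f"{total_hosts} hosts exceeds {MAX_HOSTS_PER_PRIVATE_CLOUD}")
    return problems


print(validate_private_cloud([16] * 6))  # six full clusters, 96 hosts in total
print(validate_private_cloud([16] * 7))  # 112 hosts, over the private cloud limit
```

A design that needs more than 96 hosts therefore requires an additional private cloud rather than more clusters.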

 

VMware vSphere HA provides protection against ESXi host failures and VMware vSphere DRS provides distributed resource management. These two features are preconfigured as part of the managed service and cannot be changed by the customer. VMware vSphere Fault Tolerance is not supported by the Azure VMware Solution.

 

VMware vCenter Server, VMware HCX Manager, VMware SRM and VMware vSphere Replication Manager are individual appliances and are protected by vSphere HA.

 

VMware NSX Manager is a cluster of 3 unified appliances that have a VM-VM anti-affinity placement policy to spread them across the hosts of the cluster. The VMware NSX Edge cluster is a pair of appliances that also use a VM-VM anti-affinity placement policy.

 

Topology 1 – Standard: The Azure VMware Solution standard private cloud is deployed within a single AZ in an Azure Region, which delivers an infrastructure SLA of 99.9%.

 


Figure 3 – Azure VMware Solution Private Cloud Standard Topology

 

Topology 2 – Multi-AZ: Azure VMware Solution private clouds are deployed in separate AZs within an Azure Region. VMware HCX is used to connect private clouds across AZs. Application clustering is required to provide the multi-AZ availability mechanism. The customer is responsible for ensuring their application clustering solution is within the limits of bandwidth and latency between private clouds. This topology will deliver an SLA of greater than 99.9%; however, the achieved level depends upon the application clustering solution used by the customer.
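To see why the clustering solution matters so much, consider a simplified availability model. The sketch below assumes that the two private clouds fail independently and that the application clustering layer sits in series with them; both are modelling assumptions, and the 99.99% figure for the clustering layer is purely hypothetical. It is a back-of-the-envelope estimate, not an SLA calculation.

```python
import math


def parallel_availability(availabilities: list[float]) -> float:
    """Availability of redundant components, assuming independent failures."""
    return 1 - math.prod(1 - a for a in availabilities)


def series_availability(availabilities: list[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


# Two private clouds, each with the 99.9% infrastructure SLA of a standard deployment.
infrastructure = parallel_availability([0.999, 0.999])

# Hypothetical availability of the customer's application clustering mechanism.
assumed_clustering = 0.9999

end_to_end = series_availability([infrastructure, assumed_clustering])
print(f"Infrastructure pair: {infrastructure:.6%}")
print(f"With clustering layer: {end_to_end:.6%}")
```

Under these assumptions the infrastructure pair is no longer the limiting factor; the end-to-end result is bounded by the clustering layer, and any failure mode shared by both private clouds would lower it further.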

 

The Azure VMware Solution does not support AZ selection during provisioning. This is mitigated by having separate Azure Subscriptions with quota in each separate AZ. You can open a ticket with Microsoft to configure a Special Placement Policy to deploy your Azure VMware Solution private cloud to a particular AZ per subscription.

 


Figure 4 – Azure VMware Solution Private Cloud Multi-AZ Topology

 

Topology 3 – Stretched: The Azure VMware Solution stretched clusters private cloud is deployed across dual AZs in an Azure Region, which delivers a 99.99% infrastructure SLA. This also includes a third AZ for the Azure VMware Solution witness site. Stretched clusters support policy-based synchronous replication to deliver a recovery point objective (RPO) of zero. It is possible to use placement policies and storage policies to mix SLA levels within stretched clusters by pinning lower-SLA workloads to a particular AZ; these pinned workloads will experience downtime if that AZ fails.

 

This feature is generally available (GA) and is currently only available in the Australia East, West Europe, UK South, and Germany West Central Azure Regions.

 


Figure 5 – Azure VMware Solution Private Cloud with Stretched Clusters Topology

 

Topology 4 – Multi-Region: Azure VMware Solution private clouds are deployed across Azure Regions. VMware HCX is used to connect private clouds across Azure Regions. Application clustering is required to provide the multi-region availability mechanism. The customer is responsible for ensuring their application clustering solution is within the limits of bandwidth and latency between private clouds. This topology will deliver an SLA of greater than 99.9%; however, the achieved level depends upon the application clustering solution used by the customer.

 

An additional enhancement could be using Azure VMware Solution stretched clusters in one or both Azure Regions.

 


Figure 6 – Azure VMware Solution Private Cloud Multi-Region Topology
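The four topologies can be lined up against the AZ/Region requirement of each SLA level in Table 1. The lookup below is one possible reading of that mapping, offered as a starting point for discussion and not as a sizing rule; the enum and dictionary names are hypothetical.

```python
from enum import Enum


class Footprint(Enum):
    """AZ/Region requirement of an SLA level, as listed in Table 1."""
    SINGLE_AZ = "Single AZ"
    DUAL_AZ = "Dual AZs"
    DUAL_REGION = "Dual Regions"


# Example SLA levels from Table 1.
SLA_FOOTPRINT = {
    "Gold": Footprint.DUAL_REGION,
    "Silver": Footprint.DUAL_AZ,
    "Bronze": Footprint.SINGLE_AZ,
}

# One interpretation of which topologies cover which footprint.
CANDIDATE_TOPOLOGIES = {
    Footprint.SINGLE_AZ: ["Topology 1 - Standard"],
    Footprint.DUAL_AZ: ["Topology 2 - Multi-AZ", "Topology 3 - Stretched"],
    Footprint.DUAL_REGION: ["Topology 4 - Multi-Region"],
}


def candidate_topologies(sla_name: str) -> list[str]:
    """Return the topologies whose footprint matches the named SLA level."""
    return CANDIDATE_TOPOLOGIES[SLA_FOOTPRINT[sla_name]]


for sla in SLA_FOOTPRINT:
    print(f"{sla}: {', '.join(candidate_topologies(sla))}")
```

Where two candidates exist, as with the dual-AZ footprint, the choice depends on regional availability of stretched clusters and on whether the application can provide its own clustering.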

 

Design Consideration 3 – Shared Services or Separate Services Model: The management and control plane cluster (Cluster-1) can be shared with customer workload VMs or be a dedicated cluster for management and control, including customer enterprise services, such as Active Directory, DNS, and DHCP. Additional resource clusters can be added to support customer workload demand. This also includes the option of using separate clusters for each customer SLA.

 


Figure 7 – Azure VMware Solution Shared Services Model

 


Figure 8 – Azure VMware Solution Separate Services Model

 

Design Consideration 4 – SKU type: Three SKU types can be selected for provisioning an Azure VMware Solution private cloud. The smaller AV36 SKU can be used to minimize the impact radius of a failed node. The larger AV36P and AV52 SKUs can be used to run more workloads with fewer nodes, which increases the impact radius of a failed node.
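The impact radius trade-off can be made concrete with a rough calculation. The sketch below sizes a cluster on CPU cores only and treats the share of capacity lost to one failed node as 1/n. The per-node core counts (36 for AV36, 52 for AV52) are assumptions that should be checked against the current SKU specifications, and the 400-core demand is a hypothetical workload; memory, storage, and CPU overcommit are ignored.

```python
import math

MIN_HOSTS_PER_CLUSTER = 3  # minimum cluster size stated in this post


def nodes_needed(required_cores: int, cores_per_node: int) -> int:
    """Nodes required to supply the cores, never below the cluster minimum."""
    return max(MIN_HOSTS_PER_CLUSTER, math.ceil(required_cores / cores_per_node))


def impact_radius(nodes: int) -> float:
    """Fraction of cluster capacity removed when a single node fails."""
    return 1 / nodes


REQUIRED_CORES = 400  # hypothetical workload demand
ASSUMED_CORES_PER_NODE = {"AV36": 36, "AV52": 52}  # assumption, verify per SKU

for sku, cores in ASSUMED_CORES_PER_NODE.items():
    n = nodes_needed(REQUIRED_CORES, cores)
    print(f"{sku}: {n} nodes, one failed node removes {impact_radius(n):.1%} of capacity")
```

With these assumed figures, the AV36 layout needs 12 nodes and loses about 8.3% of capacity per failed node, while the AV52 layout needs 8 nodes and loses 12.5%.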

 

The AV36 SKU is available in most Azure Regions, while the AV36P and AV52 SKUs are limited to certain Azure Regions. Azure VMware Solution does not support mixing different SKU types within a private cloud (the AV64 SKU is the exception). SKU availability by Azure Region is published in the Azure VMware Solution documentation.

 

The AV64 SKU is currently only available for mixed SKU deployments in certain regions.

 


Figure 9 – AV64 Mixed SKU Topology

 

Design Consideration 5 – Placement Policies: Placement policies are used to increase the availability of a service by separating the VMs in an application availability layer across ESXi hosts. When an ESXi host fails, only one VM of a multi-part application layer is impacted, and vSphere HA then restarts that VM on another ESXi host. Placement policies support VM-VM and VM-Host affinity and anti-affinity rules. The vSphere Distributed Resource Scheduler (DRS) is responsible for migrating VMs to enforce the placement policies.

 

To increase the availability of an application cluster, a placement policy with VM-VM anti-affinity rules for each of the web, application and database service layers can be used. Alternatively, VM-Host affinity rules can be used to segment the web, application, and database components to dedicated groups of hosts.
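The intent of a per-layer VM-VM anti-affinity rule can be expressed as a simple check: no ESXi host should run more than one VM from the same layer. The sketch below works on plain dictionaries and does not call any vSphere or Azure API; the VM names, host names, and function name are all hypothetical.

```python
from collections import defaultdict


def anti_affinity_violations(
    placement: dict[str, str], tiers: dict[str, list[str]]
) -> dict[str, list[str]]:
    """For each tier, list the hosts that run more than one VM of that tier."""
    violations = {}
    for tier, vms in tiers.items():
        vms_by_host = defaultdict(list)
        for vm in vms:
            vms_by_host[placement[vm]].append(vm)
        shared_hosts = sorted(
            host for host, members in vms_by_host.items() if len(members) > 1
        )
        if shared_hosts:
            violations[tier] = shared_hosts
    return violations


# Hypothetical inventory: VM name -> ESXi host it currently runs on.
placement = {
    "web-1": "esx-1", "web-2": "esx-2",
    "app-1": "esx-1", "app-2": "esx-1",
    "db-1": "esx-2", "db-2": "esx-3",
}
tiers = {
    "web": ["web-1", "web-2"],
    "app": ["app-1", "app-2"],
    "db": ["db-1", "db-2"],
}

print(anti_affinity_violations(placement, tiers))
```

In this example both application VMs share esx-1, so a failure of that host would take down the whole application layer until vSphere HA restarts the VMs elsewhere.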

 

The placement policies for stretched clusters can use VM-Host affinity rules to pin workloads to the preferred and secondary sites, if needed.

 


Figure 10 – Azure VMware Solution Placement Policies – VM-VM Anti-Affinity

 


Figure 11 – Azure VMware Solution Placement Policies – VM-Host Affinity

 

Design Consideration 6 – Storage Policies: Table 2 lists the pre-defined VM Storage Policies available for use with VMware vSAN. The appropriate redundant array of independent disks (RAID) and failures to tolerate (FTT) settings per policy need to be considered to match the customer workload SLAs. Each policy has a trade-off between availability, performance, capacity, and cost that needs to be considered.

 

The storage policies for stretched clusters include a designation for the dual site (synchronous replication), preferred site and secondary site policies that need to be considered.

 

To comply with the Azure VMware Solution SLA, you are responsible for using an FTT=2 storage policy when the cluster has 6 or more nodes in a standard cluster. You must also retain a minimum slack space of 25% for backend vSAN operations.
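The capacity cost of these rules can be estimated with simple arithmetic. The sketch below reserves the 25% slack space and then divides by a per-policy capacity multiplier. The multipliers are the commonly cited vSAN values (RAID-1 keeps FTT+1 copies, RAID-5 uses about 1.33x, RAID-6 uses 1.5x) and are assumptions not stated in this post, as is the FTT=1 default below six nodes. Deduplication, compression, and other overheads are ignored.

```python
SLACK_SPACE = 0.25  # minimum slack space stated in this post

# Assumed raw-to-usable capacity multipliers per policy (not from this post).
ASSUMED_CAPACITY_MULTIPLIER = {
    "RAID-1 FTT-1": 2.0,
    "RAID-1 FTT-2": 3.0,
    "RAID-1 FTT-3": 4.0,
    "RAID-5 FTT-1": 4 / 3,
    "RAID-6 FTT-2": 1.5,
}


def required_ftt(nodes: int) -> int:
    """FTT needed for the SLA in a standard cluster: 2 with six or more nodes.

    The FTT=1 value below six nodes is an assumption; this post only states
    the six-or-more rule.
    """
    return 2 if nodes >= 6 else 1


def usable_capacity_tb(raw_tb: float, policy: str) -> float:
    """Estimate usable capacity after slack space and policy overhead."""
    return raw_tb * (1 - SLACK_SPACE) / ASSUMED_CAPACITY_MULTIPLIER[policy]


print(required_ftt(8))
print(usable_capacity_tb(100, "RAID-1 FTT-2"))
print(usable_capacity_tb(100, "RAID-6 FTT-2"))
```

Under these assumptions, 100 TB of raw capacity yields roughly 25 TB usable with RAID-1 FTT-2 and 50 TB with RAID-6 FTT-2, which illustrates the capacity side of the trade-off between the two FTT=2 policies.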

 

Deployment Type | Policy Name            | RAID | Failures to Tolerate (FTT) | Site
Standard        | RAID-1 FTT-1           | 1    | 1                          | N/A
Standard        | RAID-1 FTT-2           | 1    | 2                          | N/A
Standard        | RAID-1 FTT-3           | 1    | 3                          | N/A
Standard        | RAID-5 FTT-1           | 5    | 1                          | N/A
Standard        | RAID-6 FTT-2           | 6    | 2                          | N/A
Standard        | VMware Horizon         | 1    | 1                          | N/A
Stretched       | RAID-1 FTT-1 Dual Site | 1    | 1                          | Site mirroring
Stretched       | RAID-1 FTT-1 Preferred | 1    | 1                          | Preferred
Stretched       | RAID-1 FTT-1 Secondary | 1    | 1                          | Secondary
Stretched       | RAID-1 FTT-2 Dual Site | 1    | 2                          | Site mirroring
Stretched       | RAID-1 FTT-2 Preferred | 1    | 2                          | Preferred
Stretched       | RAID-1 FTT-2 Secondary | 1    | 2                          | Secondary
Stretched       | RAID-1 FTT-3 Dual Site | 1    | 3                          | Site mirroring
Stretched       | RAID-1 FTT-3 Preferred | 1    | 3                          | Preferred
Stretched       | RAID-1 FTT-3 Secondary | 1    | 3                          | Secondary
Stretched       | RAID-5 FTT-1 Dual Site | 5    | 1                          | Site mirroring
Stretched       | RAID-5 FTT-1 Preferred | 5    | 1                          | Preferred
Stretched       | RAID-5 FTT-1 Secondary | 5    | 1                          | Secondary
Stretched       | RAID-6 FTT-2 Dual Site | 6    | 2                          | Site mirroring
Stretched       | RAID-6 FTT-2 Preferred | 6    | 2                          | Preferred
Stretched       | RAID-6 FTT-2 Secondary | 6    | 2                          | Secondary
Stretched       | VMware Horizon         | 1    | 1                          | Site mirroring

Table 2 – VMware vSAN Storage Policies

 

Design Consideration 7 – Network Connectivity: Azure VMware Solution private clouds can be connected using IPsec VPN and Azure ExpressRoute circuits, including a variety of Azure Virtual Networking topologies such as Hub-Spoke and Azure Virtual WAN with Azure Firewall and third-party Network Virtual Appliances (NVAs).

 

Multiple Azure ExpressRoute circuits can be used to provide redundant connectivity. VMware HCX also supports redundant Network Extension appliances to provide high availability for Layer-2 network extensions.

 

For more information, refer to the Azure VMware Solution networking and interconnectivity concepts. The Azure VMware Solution Cloud Adoption Framework also has example network scenarios that can be considered. 

 

And, if you are interested in Azure ExpressRoute design:

 

 

In the following section, I will describe the next steps that need to be taken to progress this high-level design estimate towards a validated detailed design.

 

Next Steps

The Azure VMware Solution sizing estimate should be assessed using Azure Migrate. With large enterprise solutions for strategic and major customers, an Azure VMware Solution Solutions Architect from Azure, VMware, or a VMware Partner should be engaged to ensure the solution is correctly sized to deliver business value with the minimum of risk. This should also include an application dependency assessment to understand the mapping between application groups and identify areas of data gravity, application network traffic flows, and network latency dependencies.

 

Summary

In this post, we took a closer look at the typical availability requirements of a customer workload, the architectural building blocks, and the availability design considerations for the Azure VMware Solution. We also discussed the next steps to continue an Azure VMware Solution design.

 

If you are interested in the Azure VMware Solution, please use these resources to learn more about the service:

 

 

Author Bio

René van den Bedem is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in enterprise architecture with extensive experience across all facets of the enterprise, public cloud, and service provider spaces, including digital transformation and the business, enterprise, and technology architecture stacks. René works backwards from the problem to be solved and designs solutions that deliver business value with the minimum of risk. In addition to being the first quadruple VMware Certified Design Expert (VCDX), he is also a Dell Technologies Certified Master Enterprise Architect, a Nutanix Platform Expert (NPX), and a VMware vExpert.

Last update: Apr 05 2024 07:35 AM