A global enterprise wants to migrate thousands of VMware vSphere virtual machines (VMs) to Microsoft Azure as part of their application modernization strategy. The first step is to exit their on-premises data centers and rapidly relocate their legacy application VMs to the Azure VMware Solution as a staging area for the first phase of their modernization strategy. What should the Azure VMware Solution look like?
Azure VMware Solution is a VMware-validated, first-party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It lets customers reuse their existing investments in VMware skills and tools, so they can focus on developing and running their VMware-based workloads on Azure.
In this post, I will introduce the typical customer workload recoverability requirements, describe the Azure VMware Solution architectural components, and describe the recoverability design considerations for Azure VMware Solution private clouds.
In the next section, I will introduce the typical recoverability requirements of a customer’s workload.
A typical customer has multiple application tiers that have specific Service Level Agreement (SLA) requirements that need to be met. These SLAs are normally named by a tiering system such as Platinum, Gold, Silver, and Bronze or Mission-Critical, Business-Critical, Production, and Test/Dev. Each SLA will have different availability, recoverability, performance, manageability, and security requirements that need to be met.
For the recoverability design quality, customers will normally define each SLA level by an uptime percentage, a recovery point objective (RPO), a recovery time objective (RTO), a work recovery time (WRT), a maximum tolerable downtime (MTD), and whether a Disaster Recovery Site is required. These values are normally documented in the customer’s Business Continuity Plan (BCP). For example:
| SLA Name | Uptime | RPO | RTO | WRT | MTD | DR Site |
| --- | --- | --- | --- | --- | --- | --- |
| Gold | 99.999% (5.26 min downtime/year) | 5 min | 3 min | 2 min | 5 min | Yes |
| Silver | 99.99% (52.6 min downtime/year) | 1 hour | 20 min | 10 min | 30 min | Yes |
| Bronze | 99.9% (8.76 hrs downtime/year) | 4 hours | 6 hours | 2 hours | 8 hours | No |
Table 1 – Typical Customer SLA requirements for Recoverability
The recoverability concepts introduced in Table 1 have the following definitions:
Figure 1 – Recoverability Concepts
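The numbers in Table 1 can be sanity-checked with a little arithmetic. The sketch below converts an uptime percentage into allowed downtime per year and checks each row against the common convention that MTD is the sum of RTO and WRT. That convention is an assumption here, although the Table 1 values are consistent with it. The function and variable names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


def max_tolerable_downtime(rto_minutes: float, wrt_minutes: float) -> float:
    """MTD under the common convention MTD = RTO + WRT (an assumption here)."""
    return rto_minutes + wrt_minutes


# Rows from Table 1, with every duration converted to minutes.
TABLE_1 = {
    "Gold": {"uptime": 99.999, "rto": 3, "wrt": 2, "mtd": 5},
    "Silver": {"uptime": 99.99, "rto": 20, "wrt": 10, "mtd": 30},
    "Bronze": {"uptime": 99.9, "rto": 360, "wrt": 120, "mtd": 480},
}

for name, row in TABLE_1.items():
    downtime = downtime_minutes_per_year(row["uptime"])
    mtd = max_tolerable_downtime(row["rto"], row["wrt"])
    print(f"{name}: {downtime:.2f} min/year allowed downtime, MTD = {mtd} min")
```

Note that for the Gold tier the MTD of a single incident (5 min) consumes almost the entire yearly downtime budget, which is why such tiers usually need automated failover rather than manual runbooks.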
A typical legacy business-critical application will have the following application architecture:
Depending upon the recoverability requirements for each service, the disaster recovery protection mechanisms could be a mix of manual runbooks and disaster recovery automation solutions with replication and clustering mechanisms connected to many different regions to meet the customer SLAs.
Figure 2 – Typical Legacy Business-Critical Application Architecture
In the next section, I will introduce the architectural components of the Azure VMware Solution.
The diagram below describes the architectural components of the Azure VMware Solution.
Figure 3 – Azure VMware Solution Architectural Components
Each Azure VMware Solution architectural component has the following function:
In the next section, I will describe the recoverability design considerations for the Azure VMware Solution.
The architectural design process takes the business problem to be solved and the business goals to be achieved and distills these into customer requirements, design constraints and assumptions. Design constraints can be characterized by the following three categories:
Each design consideration will be a trade-off between the availability, recoverability, performance, manageability, and security design qualities. The desired result is to deliver business value with the minimum of risk by working backwards from the customer problem.
Design Consideration 1 – Azure Region: Azure VMware Solution is available in 30 Azure Regions around the world (US Government has 2 additional Azure Regions). Select the relevant Azure Regions that meet your geographic requirements. These locations will typically be driven by your design constraints and the required distance the Disaster Recovery Site needs to be from the Primary Site. The Primary Site can be located on-premises, in a co-location or in the public cloud.
Figure 4 – Azure VMware Solution Region for Disaster Recovery
Design Consideration 2 – Deployment topology: Select the Azure VMware Solution Disaster Recovery Pod topology that best matches the uptime and geographic requirements of your SLAs. For very large deployments, it may make sense to have separate Disaster Recovery Pods (private clouds) dedicated to each SLA for cost efficiency.
The management and control plane cluster (Cluster-1) can be shared with customer workload VMs or be a dedicated cluster for management and control, including customer enterprise services, such as Active Directory, DNS, & DHCP. Additional resource clusters can be added to support customer workload demand. This also includes the option of using separate clusters for each customer SLA.
The best practice for Disaster Recovery design is to follow a pod architecture where each protected site has a matching private cloud in the Disaster Recovery Azure Region. Complex mesh topologies should be avoided for operational simplicity.
The required workload Service Level Agreement values must be mapped to the appropriate Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) and use a naming convention that is easy to understand. For example, Gold, Silver and Bronze or Tier-1, Tier-2 and Tier-3. Each pod should be designated with an SLA capability for operational simplicity. On a smaller scale, the pod concept could be per cluster instead of per private cloud.
The Disaster Recovery pods are provisioned to support the necessary replicated storage capacity during steady state. When a disaster is declared, the necessary compute resources are added to the private cloud. This scale-out can be automated using an Auto-Scale function built with Azure Automation Accounts and PowerShell Runbooks.
Figure 5 – Azure VMware Solution DR Shared Services
Figure 6 – Azure VMware Solution Dedicated DR Pods
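To make the pilot-light idea concrete, the following sketch estimates how many hosts would need to be added to a Disaster Recovery pod when a disaster is declared. It is a back-of-the-envelope calculation only, not a replacement for a proper sizing exercise. The host specification, the vCPU-to-core ratio, the usable RAM fraction, and the three-host minimum are all assumptions supplied as parameters, and every name is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    """Per-host resources; supply values from the current SKU documentation."""
    cores: int
    ram_gb: int


def hosts_required(vcpus: int, ram_gb: float, host: HostSpec,
                   vcpu_per_core: float = 4.0,  # assumed CPU overcommit ratio
                   ram_usable: float = 0.8,     # assumed usable fraction of RAM
                   min_hosts: int = 3) -> int:  # assumed minimum cluster size
    """Hosts needed to run the recovered workload, limited by CPU or RAM."""
    by_cpu = math.ceil(vcpus / (host.cores * vcpu_per_core))
    by_ram = math.ceil(ram_gb / (host.ram_gb * ram_usable))
    return max(by_cpu, by_ram, min_hosts)


def hosts_to_add(current_hosts: int, required_hosts: int) -> int:
    """Hosts to add to the pilot-light pod when a disaster is declared."""
    return max(0, required_hosts - current_hosts)


# Illustrative inputs only: a 3-host pilot light protecting 1,200 vCPUs / 6,000 GB RAM.
example_host = HostSpec(cores=36, ram_gb=576)
required = hosts_required(vcpus=1200, ram_gb=6000, host=example_host)
print(f"required hosts: {required}, hosts to add: {hosts_to_add(3, required)}")
```

In this example RAM, not CPU, is the limiting resource. The time needed to add those hosts counts towards the RTO, so it should be measured against the SLA of the pod.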
Design Consideration 3 – Disaster Recovery Solution: The Azure VMware Solution supports the following first-party and third-party Disaster Recovery solutions. Depending upon your recoverability and cost efficiency requirements, the best solution can be selected from Table 2 below.
For cost efficiency, a best-effort RPO and RTO can be met using backup replication of daily snapshots to the Disaster Recovery Site or using the Disaster Recovery replication feature of VMware HCX (Solution 4).
If these solutions are not viable, you can also consider application, database or message bus clustering as an option.
| Solution | RPO | RTO | DR Automation |
| --- | --- | --- | --- |
| 1. VMware Site Recovery | 5min – 24hr | Minutes | Yes, with Protection Groups & Recovery Plans |
| 2. Zerto DR | Seconds | Minutes | Yes, with Virtual Protection Groups (VPGs) |
| 3. JetStream Software DR | Seconds | Minutes | Yes, with Protection Domains, Runbooks & Runbook Groups |
| 4. VMware HCX | 5min – 24hr | Hours | No, manual process only |
Table 2 – Disaster Recovery Vendor Products
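As a rough illustration of how Table 2 can drive the selection, the sketch below encodes the table and shortlists the solutions that can meet a required RPO and an automation requirement. The data structure and function names are hypothetical. The "Seconds" entries are modelled as having no practical lower bound above zero, which is a simplification, and an RPO of 0 is rejected because none of these replication products claim it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DrSolution:
    name: str
    min_rpo_minutes: Optional[float]  # None models the "Seconds" entries in Table 2
    rto: str
    automated: bool


TABLE_2 = [
    DrSolution("VMware Site Recovery", 5, "Minutes", True),
    DrSolution("Zerto DR", None, "Minutes", True),
    DrSolution("JetStream Software DR", None, "Minutes", True),
    DrSolution("VMware HCX", 5, "Hours", False),
]


def shortlist(required_rpo_minutes: float, need_automation: bool) -> List[str]:
    """Names of Table 2 solutions whose best RPO meets the requirement."""
    if required_rpo_minutes <= 0:
        raise ValueError("an RPO of 0 is outside the scope of Table 2")
    return [
        s.name
        for s in TABLE_2
        if (s.min_rpo_minutes is None or s.min_rpo_minutes <= required_rpo_minutes)
        and (s.automated or not need_automation)
    ]


print(shortlist(required_rpo_minutes=1, need_automation=True))
print(shortlist(required_rpo_minutes=60, need_automation=False))
```

A shortlist like this only narrows the field. Cost, licensing, RTO, and operational familiarity still decide the final choice.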
Note: Azure Site Recovery can be used to protect Azure VMware Solution but is not listed here since we are describing how to use Azure VMware Solution to protect on-premises VMware vSphere solutions.
Solution 1 – VMware Site Recovery supports Disaster Recovery automation with an RPO of 5 minutes to 24 hours with VMware SRM Virtual Appliance, VMware vSphere Replication and VMware vSAN. Currently, using VMware Site Recovery with Azure NetApp Files is not supported. When designing a solution with VMware Site Recovery, the documented Azure VMware Solution limits should be considered.
Figure 7 – Azure VMware Solution with VMware Site Recovery Manager
Solution 2 – Zerto Disaster Recovery supports Disaster Recovery automation with an RPO of seconds using continuous replication with the Zerto Virtual Manager (ZVM), Zerto Virtual Replication Appliance (ZVRA) and VMware vSAN. When designing a solution with Zerto Disaster Recovery, the Zerto Architecture Guide should be considered.
Figure 8 – Azure VMware Solution with Zerto Disaster Recovery
Solution 3 – JetStream Software Disaster Recovery supports Disaster Recovery automation with an RPO of seconds using continuous replication with the JetStream DR Management Server Virtual Appliance (MSA), JetStream DR Virtual Appliance (DRVA) and VMware vSAN. When designing a solution with JetStream Software Disaster Recovery, the JetStream Software resources should be considered.
Figure 9 – Azure VMware Solution with JetStream Software Disaster Recovery
Solution 4 – VMware HCX Disaster Recovery supports manual Disaster Recovery with an RPO of 5 minutes to 24 hours with VMware HCX Manager, VMware vSphere Replication and VMware vSAN. When designing a solution with VMware HCX, the documented Azure VMware Solution limits should be considered.
Figure 10 – Azure VMware Solution with VMware HCX Disaster Recovery
Design Consideration 5 – SKU type: Three SKU types can be selected for provisioning an Azure VMware Solution private cloud. The smaller AV36 SKU can be used at the Disaster Recovery Site to build a pilot light cluster with the minimum storage resources for cost efficiency while the Primary Site can use the larger and more expensive AV36P and AV52 SKUs.
The AV36 SKU is available in most Azure regions, while the AV36P and AV52 SKUs are limited to certain Azure regions. Azure VMware Solution does not support mixing different SKU types within a private cloud (the AV64 SKU is the exception). Check Azure VMware Solution SKU availability by Azure Region before finalizing the design.
The AV64 SKU is currently only available for mixed SKU deployments in certain regions.
Figure 11 – AV64 Mixed SKU Topology
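The SKU mixing rules described above are easy to get wrong in a large design, so a simple pre-check can help. The sketch below is a hypothetical validator that only encodes the two rules stated in this post: a private cloud uses one base SKU type, and AV64 appears only alongside a base SKU. It does not model regional availability, and the function name is illustrative.

```python
from typing import List


def check_private_cloud_skus(cluster_skus: List[str]) -> List[str]:
    """Return a list of rule violations for the clusters of one private cloud."""
    problems = []
    # AV64 is the stated exception, so it is excluded from the mixing check.
    base_skus = sorted(set(cluster_skus) - {"AV64"})
    if len(base_skus) > 1:
        problems.append("mixed base SKU types are not supported: " + ", ".join(base_skus))
    if "AV64" in cluster_skus and not base_skus:
        problems.append("AV64 is only available for mixed SKU deployments")
    return problems


# One entry per cluster in the private cloud.
print(check_private_cloud_skus(["AV36P", "AV36P", "AV64"]))  # no violations
print(check_private_cloud_skus(["AV36", "AV52"]))            # one violation
```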
Design Consideration 6 – Runbook Application Groups: After the application dependency assessment is complete, this data will be used to create the runbook application groups to ensure that the application SLAs are met during a disaster event. If the application dependency assessment is incomplete, the runbook application groups can be initially designed using the process knowledge from your application architecture team and IT operations. The idea is to ensure each application is captured in a runbook that allows the application to be recovered completely and consistently using the runbook architecture and order of operations.
Figure 12 – VMware Site Recovery Application Recovery Plans
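One way to turn application dependency data into a recovery order is a topological sort. The sketch below uses Python's standard `graphlib` module to group services into recovery waves, where every service in a wave can be started once the previous wave is healthy. The service names and dependencies are invented for illustration, and the mapping of waves to a vendor's recovery plan priorities is left to the chosen Disaster Recovery product.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def recovery_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into ordered waves; each wave depends only on earlier ones.

    Raises graphlib.CycleError if the dependency data contains a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, readable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical application: each key lists the services it depends on.
dependencies = {
    "active-directory": {"dns"},
    "message-bus": {"dns"},
    "database": {"active-directory", "dns"},
    "app-server": {"database", "message-bus"},
    "web-server": {"app-server"},
}

for number, wave in enumerate(recovery_waves(dependencies), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

A cycle in the dependency data raises an error instead of producing an order, which is a useful signal that the application dependency assessment needs another pass.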
Design Consideration 7 – Storage Policies: Table 3 lists the pre-defined VM Storage Policies available for use with VMware vSAN. The redundant array of independent disks (RAID) and failures to tolerate (FTT) settings of each policy need to be matched to the customer workload SLAs. Each policy is a trade-off between availability, performance, capacity, and cost.
To comply with the Azure VMware Solution SLA, you are responsible for using an FTT=2 storage policy when the cluster has 6 or more nodes in a standard cluster. You must also retain a minimum slack space of 25% for backend vSAN operations.
| Deployment Type | Policy Name | RAID | Failures to Tolerate (FTT) | Site |
| --- | --- | --- | --- | --- |
| Standard | RAID-1 FTT-1 | 1 | 1 | N/A |
| Standard | RAID-1 FTT-2 | 1 | 2 | N/A |
| Standard | RAID-1 FTT-3 | 1 | 3 | N/A |
| Standard | RAID-5 FTT-1 | 5 | 1 | N/A |
| Standard | RAID-6 FTT-2 | 6 | 2 | N/A |
| Standard | VMware Horizon | 1 | 1 | N/A |
Table 3 – VMware vSAN Storage Policies
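The capacity side of this trade-off can be estimated with simple arithmetic. The sketch below applies the 25% slack space mentioned above and a per-policy capacity multiplier to a raw vSAN capacity, and checks the rule that clusters with 6 or more nodes use FTT=2. The multipliers assume the classic vSAN layouts (RAID-1 keeps FTT+1 full copies, RAID-5 is 3+1, RAID-6 is 4+2), ignore deduplication and compression, and treat FTT values above 2 as also satisfying the rule. These are assumptions, not figures from this post.

```python
# Raw capacity consumed per unit of VM data (assumed classic vSAN layouts).
CAPACITY_MULTIPLIER = {
    "RAID-1 FTT-1": 2.0,
    "RAID-1 FTT-2": 3.0,
    "RAID-1 FTT-3": 4.0,
    "RAID-5 FTT-1": 4 / 3,
    "RAID-6 FTT-2": 1.5,
    "VMware Horizon": 2.0,  # RAID-1, FTT=1 according to Table 3
}

# Failures to tolerate per policy, taken from Table 3.
FTT = {
    "RAID-1 FTT-1": 1,
    "RAID-1 FTT-2": 2,
    "RAID-1 FTT-3": 3,
    "RAID-5 FTT-1": 1,
    "RAID-6 FTT-2": 2,
    "VMware Horizon": 1,
}


def usable_tb(raw_tb: float, policy: str, slack: float = 0.25) -> float:
    """Approximate usable capacity after slack space and policy overhead."""
    return raw_tb * (1.0 - slack) / CAPACITY_MULTIPLIER[policy]


def meets_ftt_rule(nodes: int, policy: str) -> bool:
    """Clusters with 6 or more nodes need a policy with FTT of at least 2."""
    return nodes < 6 or FTT[policy] >= 2


for policy in CAPACITY_MULTIPLIER:
    print(f"{policy}: {usable_tb(100, policy):.2f} TB usable from 100 TB raw, "
          f"OK on 8 nodes: {meets_ftt_rule(8, policy)}")
```

For the same raw capacity, RAID-6 FTT-2 yields noticeably more usable space than RAID-1 FTT-2 while tolerating the same number of failures, at the cost of extra write overhead.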
Design Consideration 8 – Network Connectivity: Azure VMware Solution private clouds can be connected using IPSec VPN and Azure ExpressRoute circuits, including a variety of Azure Virtual Networking topologies such as Hub-Spoke and Virtual WAN with Azure Firewall and third-party Network Virtual Appliances (NVAs). For more information, refer to the Azure VMware Solution networking and interconnectivity concepts. The Azure VMware Solution Cloud Adoption Framework also has example network scenarios that can be considered.
Design Consideration 9 – Layer 2 Network Extension: VMware HCX can be used to provide Layer 2 network extension functionality to maintain the same IP address schema between sites.
Figure 13 – VMware HCX Layer 2 Network Extension with VMware Site Recovery
Design Consideration 10 – Anti-Patterns: Try to avoid using these anti-patterns in your recoverability design.
Anti-Pattern 1 – Stretched Clusters: Azure VMware Solution Stretched Clusters is the only option for meeting an RPO of 0 requirement. Remember that stretched clusters are considered an availability solution, not disaster recovery, because the management and control plane running across dual Availability Zones (AZs) is a single fault domain. Azure VMware Solution stretched clusters (GA) do not currently support the VMware Site Recovery add-on.
Figure 14 – Azure VMware Solution Private Cloud with Stretched Clusters
Anti-Pattern 2 – Ransomware Protection: A Disaster Recovery Automation solution does not provide protection against a ransomware attack. Ransomware protection requires additional security functionality, where an isolated and secure area is used to work through a series of data restores and validate that the point-in-time copy is free from ransomware. This process can take months, and it may require access to data backups that are months or years old. This is because the ransom demand is merely the end of a long period of reconnaissance by an attacker, so every system needs to be checked for active security vulnerabilities and spyware agents.
Disaster Recovery Automation assumes that ransomware is not present, and that data corruption has not replicated to the Disaster Recovery Site. That said, some Disaster Recovery Automation vendors now have a Ransomware Protection feature that can be leveraged as part of the solution.
In the following section, I will describe the next steps that would need to be made to progress this high-level design estimate towards a validated detailed design.
The Azure VMware Solution sizing estimate should be assessed using Azure Migrate. With large enterprise solutions for strategic and major customers, an Azure VMware Solution Solutions Architect from Azure, VMware, or a trusted VMware Partner should be engaged to ensure the solution is correctly sized to deliver business value with the minimum of risk. This should also include an application dependency assessment to understand the mapping between application groups and identify areas of data gravity, application network traffic flows, and network latency dependencies.
In this post, we took a closer look at the typical recoverability requirements of a customer workload, the architectural building blocks, and the recoverability design considerations for the Azure VMware Solution. We also discussed the next steps to continue an Azure VMware Solution design.
If you are interested in the Azure VMware Solution, please use these resources to learn more about the service:
René van den Bedem is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in enterprise architecture with extensive experience across all facets of the enterprise, public cloud, and service provider spaces, including digital transformation and the business, enterprise, and technology architecture stacks. René works backwards from the problem to be solved and designs solutions that deliver business value with the minimum of risk. In addition to being the first quadruple VMware Certified Design Expert (VCDX), he is also a Dell Technologies Certified Master Enterprise Architect, a Nutanix Platform Expert (NPX), and a VMware vExpert.
Link to PPTX Diagrams: azure-vmware-solution/azure-vmware-master-diagrams