Azure Architecture Blog
Empowering Disaster Recovery for Azure VMs with Azure Site Recovery and Terraform

FreddyAyala
Microsoft
Aug 01, 2023

Azure provides comprehensive backup and disaster recovery solutions that are easy to use, secure, scalable, and budget-friendly. They can also be seamlessly integrated with existing on-premises data protection setups.

 

In this article, we'll explore a practical demonstration of implementing Virtual Machine Disaster Recovery using Azure Site Recovery VM Replication through Terraform. This approach ensures the continuity of your business apps and workloads during unexpected outages, replicating your virtual machines from one location to another in a straightforward manner.

 

Azure Site Recovery

 

Site Recovery helps ensure business continuity by keeping business apps and workloads running during outages. It replicates workloads running on physical and virtual machines (VMs) from a primary site to a secondary location. When an outage occurs at your primary site, you fail over to the secondary location and access your apps from there. Once the primary location is running again, you can fail back to it.

 

Use Case and architecture

 

Our use case is to replicate a VM to a different region using Site Recovery, so that if the primary VM fails, we can fail over to a secondary VM that is available under the same FQDN and public IP as the first one.

 

The architecture of implementing Azure VM Disaster Recovery using Azure Site Recovery and Terraform involves several key components and steps:

 

  1. Virtual Machines (VMs): These are the primary workloads that need to be protected and replicated to a secondary location. The VMs run on Azure infrastructure and are the core components of the system.

  2. Terraform: Terraform is used as the infrastructure as code tool to provision and manage Azure resources. It provides a declarative approach to define the desired state of the infrastructure and enables easy replication and management of resources.

  3. Azure Site Recovery (ASR): ASR is a Microsoft service that orchestrates and automates the disaster recovery process for Azure VMs. It ensures business continuity by replicating VMs from a primary Azure region to a secondary region.

  4. Recovery Vault: A Recovery Services vault (called Recovery Vault in this article) is the central component in ASR that holds the configuration settings for replication, failover, and recovery. It acts as a control center for managing the disaster recovery process.

  5. Primary and Secondary Regions: The architecture involves at least two Azure regions: the primary region where the original VMs reside and the secondary region where the replicated VMs will be stored.

  6. Replication Policy: A replication policy defines the rules for replication, including how long recovery points are retained and how often application-consistent snapshots are taken.

  7. Network Infrastructure: Virtual Networks and Subnets are set up in both the primary and secondary regions to facilitate communication between VMs and ensure that the replication process works seamlessly.

  8. Managed Disks: The VM's OS and data disks are replicated to managed disks in the secondary region, using a staging (cache) Storage Account in the primary region. Managed Disks provide high availability and reliability for data storage.

  9. Failover Testing: Before a disaster occurs, it's essential to perform failover testing to ensure that the replication process works as expected and that applications can run successfully in the secondary region.

  10. Failover and Failback: In the event of a disaster, a failover operation is executed to switch the applications from the primary VMs to the replicated VMs in the secondary region. After the primary region is restored, a failback operation can be performed to return to the original setup.

In summary, the architecture leverages Terraform's infrastructure as code capabilities to set up the necessary Azure resources, including VMs, Storage Accounts, Virtual Networks, and Subnets. Azure Site Recovery orchestrates the replication, failover, and failback processes to ensure business continuity and disaster recovery for Azure VMs. The well-defined workflow and automation provided by these services enable a resilient and reliable disaster recovery solution for Azure Virtual Machines.

 

An advantage of using Site Recovery is that the secondary VM does not run until a failover takes place, so we do not pay for its compute resources, only for the storage and the traffic to the secondary region.

 

Implementation example

 

Before starting, you should already have a VNET and a Subnet deployed in the primary region.
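
The snippets in this article reference several input variables and two providers without declaring them. As a minimal sketch, and assuming the variable names used later in the article (`location`, `location_secondary`, `resource_group`, `resource_group_secondary`), the declarations could look like this. The descriptions are assumptions, and the remaining variables (`var.vnet`, `var.tags`, `var.environment`, and so on) would be declared in the same way.

```hcl
terraform {
  required_providers {
    # Provider versions are deliberately not pinned here;
    # pin the versions you have actually tested against.
    azurerm = {
      source = "hashicorp/azurerm"
    }
    random = {
      source = "hashicorp/random"
    }
  }
}

provider "azurerm" {
  features {}
}

# Assumed meaning: region of the source VM.
variable "location" {
  description = "Primary Azure region, where the source VM runs"
  type        = string
}

# Assumed meaning: region the VM is replicated to.
variable "location_secondary" {
  description = "Secondary Azure region, target of the replication"
  type        = string
}

variable "resource_group" {
  description = "Name of the existing resource group in the primary region"
  type        = string
}

variable "resource_group_secondary" {
  description = "Name of the existing resource group in the secondary region"
  type        = string
}
```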

 

First of all we will deploy our virtual machine with a public IP:

 

 


data "azurerm_subnet" "snet-backend" {
  depends_on           = [var.subnets]
  name                 = var.vm.snet_name
  virtual_network_name = var.vm.vnet_name
  resource_group_name  = var.resource_group
}

resource "azurerm_public_ip" "pip-vm-app" {
  name                    = "pip-app"
  location                = var.location
  resource_group_name     = var.resource_group
  allocation_method       = "Static"
  idle_timeout_in_minutes = 30
  domain_name_label       = var.vm.fqdn

  tags = {
    environment = "test"
  }
}
resource "azurerm_network_interface" "main" {
  name                = var.vm.nic_name
  location            = var.location
  resource_group_name = var.resource_group

  ip_configuration {
    name                          = "testconfiguration1"
    subnet_id                     = data.azurerm_subnet.snet-backend.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.pip-vm-app.id
  }
}

resource "azurerm_virtual_machine" "main" {
  name                  = var.vm.name
  location              = var.location
  resource_group_name   = var.resource_group
  network_interface_ids = [azurerm_network_interface.main.id]
  vm_size               = var.vm.size

  # Uncomment this line to delete the OS disk automatically when deleting the VM
  # delete_os_disk_on_termination = true

  # Uncomment this line to delete the data disks automatically when deleting the VM
  # delete_data_disks_on_termination = true

  storage_image_reference {
    publisher = var.vm.storage_image_reference.publisher
    offer     = var.vm.storage_image_reference.offer
    sku       = var.vm.storage_image_reference.sku
    version   = var.vm.storage_image_reference.version
  }

  storage_os_disk {
    name              = "disk-${var.vm.name}-os"
    caching           = var.vm.storage_os_disk.caching
    create_option     = var.vm.storage_os_disk.create_option
    managed_disk_type = var.vm.storage_os_disk.managed_disk_type
  }

  os_profile {
    computer_name  = var.vm.os_profile.computer_name
    admin_username = var.vm.os_profile.admin_username
    admin_password = var.vm.os_profile.admin_password
    #custom_data    = file(var.vm.os_profile.custom_data)
  }

  os_profile_linux_config {
    disable_password_authentication = false
  }

  boot_diagnostics {
    enabled     = true
    storage_uri = "https://${var.storage_account.name}.blob.core.windows.net"
  }
  tags = var.tags
}


resource "azurerm_managed_disk" "disk-data-app" {
  name                 = "disk-${azurerm_virtual_machine.main.name}-data"
  location             = var.location
  resource_group_name  = var.resource_group
  storage_account_type = "StandardSSD_LRS"
  create_option        = "Empty"
  disk_size_gb         = var.vm.storage_data_disk.disk_size_gb
}

resource "azurerm_virtual_machine_data_disk_attachment" "example" {
  managed_disk_id    = azurerm_managed_disk.disk-data-app.id
  virtual_machine_id = azurerm_virtual_machine.main.id
  lun                = var.vm.storage_data_disk.lun
  caching            = "ReadWrite"
}
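
The VM snippet reads all of its settings from a single `var.vm` object. The article does not show its declaration, so the following is only a sketch whose shape is inferred from the attribute references above. The attribute types are assumptions.

```hcl
# Hypothetical declaration of var.vm, inferred from how it is referenced above.
variable "vm" {
  description = "Settings for the source VM"
  type = object({
    name      = string
    size      = string
    fqdn      = string # used as the domain_name_label of the public IP
    nic_name  = string
    vnet_name = string # existing VNET in the primary region
    snet_name = string # existing subnet in that VNET

    storage_image_reference = object({
      publisher = string
      offer     = string
      sku       = string
      version   = string
    })

    storage_os_disk = object({
      caching           = string
      create_option     = string
      managed_disk_type = string
    })

    storage_data_disk = object({
      disk_size_gb = number
      lun          = number
    })

    os_profile = object({
      computer_name  = string
      admin_username = string
      admin_password = string # better supplied from a secret store than from a committed file
    })
  })
}
```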

 

 

Once the VM is deployed, we will deploy a Recovery Vault in the secondary region so that we can use the Site Recovery service:

 

 


data "azurerm_resource_group" "secondary" {
  name = var.resource_group_secondary
}


data "azurerm_resource_group" "primary" {
  name = var.resource_group
}


resource "azurerm_recovery_services_vault" "vault" {
  name                = "rv-app-${var.region_secondary}-${var.environment}"
  location            = var.location_secondary
  resource_group_name = data.azurerm_resource_group.secondary.name
  sku                 = "Standard"
}

 

 

Then we will deploy a recovery fabric and a protection container for each region:

 

 


resource "azurerm_site_recovery_fabric" "primary" {
  name                = "primary-fabric"
  resource_group_name = data.azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = data.azurerm_resource_group.primary.location
}

resource "azurerm_site_recovery_fabric" "secondary" {
  name                = "secondary-fabric"
  resource_group_name = data.azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = var.location_secondary
}

resource "azurerm_site_recovery_protection_container" "primary" {
  name                 = "primary-protection-container"
  resource_group_name  = data.azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
}

resource "azurerm_site_recovery_protection_container" "secondary" {
  name                 = "secondary-protection-container"
  resource_group_name  = data.azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
}

 

 

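Next, we will define a replication policy that retains recovery points for 24 hours and takes an application-consistent snapshot every 4 hours: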

 

 

resource "azurerm_site_recovery_replication_policy" "policy" {
  name                                                 = "policy"
  resource_group_name                                  = data.azurerm_resource_group.secondary.name
  recovery_vault_name                                  = azurerm_recovery_services_vault.vault.name
  recovery_point_retention_in_minutes                  = 24 * 60
  application_consistent_snapshot_frequency_in_minutes = 4 * 60
}

 

 

We will map the source protection container to the target one and attach the replication policy to the mapping:

 

 

resource "azurerm_site_recovery_protection_container_mapping" "container-mapping" {
  name                                      = "container-mapping"
  resource_group_name                       = data.azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name                      = azurerm_site_recovery_fabric.primary.name
  recovery_source_protection_container_name = azurerm_site_recovery_protection_container.primary.name
  recovery_target_protection_container_id   = azurerm_site_recovery_protection_container.secondary.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
}

 

 

Now we will deploy a VNET and Subnet in the secondary region, where our main Virtual Machine will be replicated, a second VNET and Subnet to test the failover, and a staging storage account for data replication.

 

 


resource "random_string" "lower" {
  length  = 4
  upper   = false
  lower   = true
  numeric = true # "number" is deprecated in recent versions of the random provider
  special = false
}

resource "azurerm_storage_account" "primary" {
  name                     = "prireccache${random_string.lower.result}"
  location                 = var.location
  resource_group_name      = var.resource_group
  account_tier             = "Standard"
  account_replication_type = "LRS"
}


resource "azurerm_virtual_network" "secondary" {
  name                = "vnet-app"
  resource_group_name = data.azurerm_resource_group.secondary.name
  address_space       = var.vnet.address_space
  location            = var.location_secondary
}

resource "azurerm_subnet" "secondary" {
  name                 = "snet-backend"
  resource_group_name  = data.azurerm_resource_group.secondary.name
  virtual_network_name = azurerm_virtual_network.secondary.name
  address_prefixes     = [var.vnet.subnets[0].address_prefix] # "address_prefix" was removed in azurerm 3.x
}


resource "azurerm_virtual_network" "test-failover" {
  name                = "vnet-app-test-failover"
  resource_group_name = data.azurerm_resource_group.secondary.name
  address_space       = [var.vnet.address_space_failover_test[0]]
  location            = var.location_secondary
}

resource "azurerm_subnet" "test-failover" {
  name                 = var.vnet.subnets_failover_test[0].name
  resource_group_name  = data.azurerm_resource_group.secondary.name
  virtual_network_name = azurerm_virtual_network.test-failover.name
  address_prefixes     = [var.vnet.subnets_failover_test[0].address_prefix] # "address_prefix" was removed in azurerm 3.x
}

resource "azurerm_network_interface" "vm" {
  name                = "vm-nic"
  location            = var.location_secondary
  resource_group_name = data.azurerm_resource_group.secondary.name

  ip_configuration {
    name                          = "nic-vm-app-01"
    subnet_id                     = azurerm_subnet.secondary.id
    private_ip_address_allocation = "Dynamic"
  }
}
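
The snippets in this article do not map the primary VNET to the secondary one. The azurerm provider offers the `azurerm_site_recovery_network_mapping` resource for this. The following is a minimal sketch, assuming the primary VNET is the existing one named in `var.vm.vnet_name` and that it lives in `var.resource_group`. The data source and the `network-mapping` name are illustrative and not part of the original setup.

```hcl
# Look up the existing primary VNET (assumption: same VNET the source VM's subnet belongs to).
data "azurerm_virtual_network" "primary" {
  name                = var.vm.vnet_name
  resource_group_name = var.resource_group
}

# Map the primary VNET to the secondary VNET so that Site Recovery
# knows which network the failed-over VM should join.
resource "azurerm_site_recovery_network_mapping" "network-mapping" {
  name                        = "network-mapping"
  resource_group_name         = data.azurerm_resource_group.secondary.name
  recovery_vault_name         = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
  target_recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
  source_network_id           = data.azurerm_virtual_network.primary.id
  target_network_id           = azurerm_virtual_network.secondary.id
}
```

If you add such a mapping, the replicated VM resource can list it, together with the container mapping, in a `depends_on` argument so that both exist before replication is enabled.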

 

 

Finally, we will enable replication for our VM, including its OS disk and data disk:

 

 


resource "azurerm_site_recovery_replicated_vm" "vm-replication" {
  name                                      = "vm-replication"
  resource_group_name                       = data.azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name               = azurerm_site_recovery_fabric.primary.name
  source_vm_id                              = azurerm_virtual_machine.main.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
  source_recovery_protection_container_name = azurerm_site_recovery_protection_container.primary.name

  target_resource_group_id                = data.azurerm_resource_group.secondary.id
  target_recovery_fabric_id               = azurerm_site_recovery_fabric.secondary.id
  target_recovery_protection_container_id = azurerm_site_recovery_protection_container.secondary.id
   
  managed_disk {
    disk_id                    = azurerm_virtual_machine.main.storage_os_disk[0].managed_disk_id
    staging_storage_account_id = azurerm_storage_account.primary.id
    target_resource_group_id   = data.azurerm_resource_group.secondary.id
    target_disk_type           = var.vm.storage_os_disk.managed_disk_type
    target_replica_disk_type   = var.vm.storage_os_disk.managed_disk_type
  }

  managed_disk {
    disk_id                    = azurerm_managed_disk.disk-data-app.id
    staging_storage_account_id = azurerm_storage_account.primary.id
    target_resource_group_id   = data.azurerm_resource_group.secondary.id
    target_disk_type           = "StandardSSD_LRS"
    target_replica_disk_type   = "StandardSSD_LRS"
  }

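  target_network_id = azurerm_virtual_network.secondary.id

  network_interface {
    source_network_interface_id = azurerm_network_interface.main.id
    target_subnet_name          = azurerm_subnet.secondary.name
    # target_static_ip expects a private IP address string for the failover NIC,
    # not a public IP resource ID, so it is omitted here. The argument that accepts
    # a public IP resource ID is recovery_public_ip_address_id.
  }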
}

 

 

Once the resources are deployed, we can verify the replication in the Recovery Vault in the Azure portal and run a test failover.
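
To make the vault and the replicated item easier to find in the portal, the configuration can expose a few outputs. This is a small sketch. The output names are illustrative, and the resource references are the ones defined in the snippets above.

```hcl
# Name of the Recovery Vault that holds the replication settings.
output "recovery_vault_name" {
  value = azurerm_recovery_services_vault.vault.name
}

# Resource group (secondary region) in which the vault was created.
output "recovery_vault_resource_group" {
  value = data.azurerm_resource_group.secondary.name
}

# Resource ID of the Site Recovery replicated item for the VM.
output "replicated_vm_id" {
  value = azurerm_site_recovery_replicated_vm.vm-replication.id
}
```

After `terraform apply`, running `terraform output` prints these values.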

 

And voilà! Your VM is replicated to a second region. In the case of a region outage, you perform a manual failover so that the VM keeps the same FQDN and public IP. The failover can also be automated using Azure Monitor and an Automation Account.

 

Conclusion

In conclusion, setting up Azure VM Disaster Recovery with Azure Site Recovery and Terraform helps businesses protect their important work and keep their operations running smoothly. By using Terraform's simple way to manage resources and Azure Site Recovery's automatic disaster recovery features, companies can easily replicate VMs to another location.

 

This approach ensures that even if something goes wrong or there's a problem, the applications and data are safe. When a disaster happens, the failover process lets businesses switch to the replicated VMs in a different region, minimizing downtime and data loss.

 

The combination of Azure Site Recovery and Terraform makes it cost-effective and easy to manage the disaster recovery setup. Businesses can focus on strategic tasks instead of spending time on manual recovery efforts.

 

With this solution, businesses can be well-prepared for unexpected disruptions, keeping their services up and running and protecting their important data. Implementing Azure VM Disaster Recovery with Azure Site Recovery and Terraform is a smart choice to ensure business continuity and peace of mind.

Updated Aug 01, 2023
Version 1.0

2 Comments

  • charanjitsingh

    If VM, Public IPs, Vnet, Subnets and storage account are already created, can we fetch the VM and storage account details via data block in terraform can you suggest the code for it.

  • rahuldhande

    Hello,

    That's Informative. To test the above terraform scripts do we require to enable disaster recovery for the source machine in advance? 

     

    Thanks,

    Rahul