This article explains the steps involved in a lift-and-shift migration of Kubernetes workloads running on virtual machines (from any location – on-premises or a third-party cloud provider) to an Azure public region. The approach was carried out at a customer site where the specific requirement was to migrate the Kubernetes workloads as-is while retaining their IP addresses.
Because services such as Docker and Weave are installed manually on the VMs to set up the Kubernetes cluster, different IP addresses are used at different layers of the application (within the VM), and reconfiguring them manually in an Azure public region would be a huge task, estimated at 6-8 months of effort.
We decided to leverage the Azure Migrate service and perform a P2A (Physical to Azure) migration, keeping in mind that this migration would be unique for two reasons.
Of course, there was an option to modernize the application by moving the workloads to AKS, but due to a strict migration timeline, we decided to migrate as-is and focus on modernization later. This article is therefore intended to help with effective planning and migration of Kubernetes workloads running on VMs, as-is, to Azure.
Abstract:
The customer had ~700 virtual machines running Kubernetes workloads across two environments (Prod and Non-prod) on two separate networks. With IP retention being the key requirement, we could not migrate the workloads in phases the way a normal workload migration is planned.
Due to IP overlap, two networks with the same CIDR range cannot be connected to the same network. Hence, an entire network (Prod or Non-prod) running on-premises had to be migrated (failed over) in one go, so that once failover was performed, the on-premises network could be disconnected and the new network in Azure could function without any DNS conflicts.
We migrated each environment in just a day, ensuring that all the servers retained their IPs and that the Kubernetes workloads worked exactly as they did on-premises (the source location). Planning the migration (including replication time) took about 3 weeks, and we migrated each environment over a weekend, which also included functional testing.
Current Architecture
Steps followed in Migration and Planning:
The most important aspect of this migration is how it is planned and executed. Since the IPs had to be retained, the critical factor is that on the day of migration all ~350 servers, belonging to various application categories, must complete replication without any sync issues. Even if a few machines run into sync issues, the entire migration takes a hit. Hence, planning is critical for a successful migration, and there must be proper coordination between the migration engineers, application team, database team and network engineers.
We followed a P2A migration approach by deploying the configuration server and process servers on-premises.
Planning:
Eg: Built 4 migration projects in Azure Migrate (Project 1, 2, 3 and 4) in the target subscription.
Eg: As part of building Kubernetes workloads on VMs, you configure multiple IP address ranges that are consumed by Docker services, Weave services, load balancers configured inside the server, etc. As this is a lift-and-shift migration, all the IP addresses configured while setting up Kubernetes are also migrated as-is. During replication, Azure Migrate detects that multiple IP addresses are configured inside the OS and asks you to define the corresponding target-side settings in the Azure Migrate tool.
The other IPs configured as part of the Kubernetes build live inside the VM and belong to Docker services, Weave services, load balancers, etc. This is the prime reason the customer wanted a lift-and-shift migration: the IP addresses configured inside the OS are migrated as-is. In other words, if you run "ipconfig /all" (or "ip addr" on Linux) on a Kubernetes VM at the source, you will see the primary IP address plus a few additional IPs belonging to different ranges inside the OS.
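Before replication, it helps to record every address configured inside the OS so you can later verify nothing changed after failover. The snippet below is a minimal sketch (not part of the migration tooling) that uses Python with the third-party psutil package; any equivalent of "ipconfig /all" or "ip addr" works just as well.

```python
# Minimal sketch: list every IPv4 address configured inside the OS, so the
# primary NIC IP and the extra Docker/Weave/LB addresses can be recorded
# before replication. Requires the third-party psutil package.
import socket
import psutil

def list_ipv4_addresses():
    inventory = {}
    for nic, addrs in psutil.net_if_addrs().items():
        ipv4 = [a.address for a in addrs if a.family == socket.AF_INET]
        if ipv4:
            inventory[nic] = ipv4
    return inventory

if __name__ == "__main__":
    for nic, ips in list_ipv4_addresses().items():
        print(f"{nic}: {', '.join(ips)}")
```

Running this on the source machine and again on the migrated VM gives a simple before/after comparison of the in-OS addresses.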
When you configure the target attributes, set the primary NIC with the same IP address as the source and make it static. For the remaining IPs related to Docker and Weave, set them to "Do not create" so that new NICs are not attached to the server as secondary NICs. As mentioned above, these additional IP addresses belong to the various services deployed as part of Kubernetes, are configured in the OS, and will be migrated as-is during the lift and shift. Hence, do not disturb these additional IP address settings while migrating Kubernetes VM workloads; leave them as "Do not create" so that new IPs are not attached. Refer to the screenshot below.
Eg: Priority 1, 2, 3, where 1 means high priority and 3 means low priority.
As the Kubernetes workloads depend on a common image server, this server was considered high priority (Priority 1) and hence was failed over at the end.
Eg: In Windows, the services are put into the Stopped state and the Automatic startup type is disabled.
In Linux, we disabled auto-start (upon reboot), as shown in the sketch below.
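The following is a minimal sketch of how this can be scripted on Linux nodes. It assumes systemd-managed units named "kubelet" and "docker" (adjust the names to whatever your cluster actually runs) and uses systemctl to stop each service and disable auto-start so it stays down when the replicated VM boots in Azure.

```python
# Minimal sketch (assumption: systemd units named "kubelet" and "docker").
# Stops each service and disables auto-start so it does not come back up
# when the replicated VM boots in Azure. Run as root on each Linux node.
import subprocess

SERVICES = ["kubelet", "docker"]  # adjust to your cluster's actual unit names

def stop_and_disable(service: str) -> None:
    subprocess.run(["systemctl", "stop", service], check=False)
    subprocess.run(["systemctl", "disable", service], check=True)

if __name__ == "__main__":
    for svc in SERVICES:
        stop_and_disable(svc)
        print(f"{svc}: stopped and auto-start disabled")
```

On Windows, the equivalent is to stop the service and set its startup type to Disabled, for example with "sc config <service> start= disabled".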
Network Change Planning:
One of the customer's prime requirements was that the IP addresses of the servers (Kubernetes workload VMs) must be retained post-migration. As the volume of servers is huge (~350 servers per environment) and all of them belong to the same network, this network switch had to be carefully planned.
Since multiple VNets with the same IP address range cannot be peered to the same hub (IP overlap), we must first disable the existing connectivity from on-premises to the hub network in Azure so that the new (target) VNet in the Azure region can be peered to it. On the day of migration, as soon as the sync was completed across all priority groups (1, 2 and 3), the plan was to fail over all ~350 servers, then terminate the connectivity of the source network to the hub and initiate a new peering from the new VNet in Azure where the servers were failed over. A sketch of this peering change follows this paragraph.
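For reference, creating the new hub-to-target peering can be automated with the Azure SDK for Python (azure-identity and azure-mgmt-network). The subscription ID, resource group, VNet names and peering name below are placeholders for illustration, not the customer's actual values; this is a minimal sketch of the peering call, not a complete cut-over script.

```python
# Minimal sketch: peer the hub VNet with the newly failed-over target VNet.
# Assumes the azure-identity and azure-mgmt-network packages and that the
# on-premises connection using the overlapping range is already disabled.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

subscription_id = "<subscription-id>"          # placeholder
resource_group = "rg-network-hub"              # placeholder
hub_vnet = "vnet-hub"                          # placeholder
target_vnet_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-k8s-prod/"
    "providers/Microsoft.Network/virtualNetworks/vnet-k8s-prod"
)                                              # placeholder resource ID

client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.virtual_network_peerings.begin_create_or_update(
    resource_group,
    hub_vnet,
    "hub-to-k8s-prod",                         # peering name (placeholder)
    {
        "remote_virtual_network": {"id": target_vnet_id},
        "allow_virtual_network_access": True,
        "allow_forwarded_traffic": True,
    },
)
print(poller.result().provisioning_state)
```

Note that a peering must exist on both sides (hub to target and target to hub) for traffic to flow, so the same call is repeated in the opposite direction against the target VNet.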
NOTE: We cannot switch the network before the entire set of servers is failed over, because the sync is a continuous, ongoing process; if the connectivity is terminated, the configuration server becomes unreachable and you may not be able to reach it again to perform the failover. Hence, the existing connectivity should be disabled only after the entire set of servers has been migrated (failed over) to the new region.
NOTE: If you want to log in to the migrated servers temporarily, you can set up Azure Bastion, because connecting via the VPN will not work until the target VNet is peered with the hub network. The customer uses a VPN to connect to the servers, and by default only the IP address range of the source network was allowed. Even though the target VNet contains the same IP address range, it is not yet peered to the hub (after migrating only one set of servers), so a direct VPN connection to the new servers will not work until the on-premises connection is disabled and the new peering is established. We do not have to add a new IP address range to the VPN for RDP, as the address range does not change post-migration.
Hence, RDP or SSH login to all the VMs was possible only after the entire set of servers had been migrated (failed over) and the peering connection was established from the new region. So we used a combination of Azure Bastion and the customer VPN to connect to the VMs post-migration: Azure Bastion was primarily used to check the health of the VMs immediately after failover, and the VPN was used after the servers from all priority groups had been migrated.
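Once the nodes were reachable, a quick way to confirm that the cluster came back healthy is to check node status with kubectl. The sketch below is an illustration only; it assumes kubectl is installed on the jump host you connect from and is already pointed at the migrated cluster.

```python
# Minimal sketch: post-failover health check, assuming kubectl is installed
# and configured for the migrated cluster. Flags any node that is not Ready.
import json
import subprocess

def not_ready_nodes():
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    bad = []
    for node in json.loads(out)["items"]:
        ready = any(
            c["type"] == "Ready" and c["status"] == "True"
            for c in node["status"]["conditions"]
        )
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad

if __name__ == "__main__":
    failing = not_ready_nodes()
    print("All nodes Ready" if not failing else f"Not Ready: {failing}")
```

A similar check on pod status per namespace rounds out the functional testing that was done over the migration weekend.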
On the Migration Day:
Post Failover
In a workload migration project, effective planning always helps minimize post-migration issues. Having a good understanding of the network architecture of the target platform also helps. Happy journey to the cloud!