Azure is Microsoft’s cloud computing offering which provides IaaS (infra as a service) virtual machines (VM), PaaS (platform as a service) containers and many other SaaS services (e.g., Azure Storge, Networking, etc.). Azure, being one of the largest cloud service providers, hosts millions of customer virtual machines (VMs) in our data centers. The operating system that runs on these hosts is a modified version of Windows called Cloud Host. I talked about this and the Azure Host OS architecture (incl. the root OS and the hypervisor) in an earlier blog post. In this blog post we will talk about how we update the operating system that runs on those hosts, in particular, a new advancement we made in updating the hypervisor called “Hypervisor Hot Restart (HHR)”.
Azure Host OS Updates Overview
Ensuring the security of the Azure hosts is critical to maintaining our customers’ trust as their applications run in a public cloud where customers have limited control over infrastructure updates. We ensure the security of the Azure host by patching and keeping it up to date with all the latest applicable security updates. These patches are typically rolled out every month with no disruption to the customer workloads. In addition, to security updates, we also update the Azure Host OS to provide new features and functionality to the customer VMs, e.g., new hardware generation support or new features such as Azure confidential computing.
Note: this blog focuses on internal Azure Host OS technical details and does not talk about Azure customer facing VM updates and control mechanisms. Those maintenance controls or scheduled events are documented for our customers on Azure’s website.
Different Azure Host OS update mechanisms
These are the most common types of Azure Host OS update technologies used in the Azure fleet.
Best – in milliseconds. Not visible to customer VMs
Typically used for monthly security updates (e.g., MSRC). More detailed blog on internals here.
Typically, 30 secs
Update the entire Azure Host OS. Paper in EuroSys 2021 for tech details.
With that introduction on Azure Host OS update technologies, we are going to do a deep dive into our latest and most advanced update technology: Hypervisor Hot Restart (HHR). HHR allows us to update and replace the hypervisor on a running system with sub second blackout time for customer VMs and importantly without dropping any packets. With Hypervisor Hot Restart, we can deploy new hypervisor features or fixes easily, providing enormous customer value. This is especially important in today's world, where security threats are becoming more prevalent and sophisticated.
Hypervisor Hot Restart in Action
This is a demonstration of how Hypervisor Hot Restart works in action. It showcases 4 VMs (Virtual Machines) that continue to run while the hypervisor is fully replaced under them. The network connection remains stable throughout the process and no packets are lost. Additionally, we showcase the speed of the restart process, with a maximum packet delay of 600 milliseconds. (Apologies for the low-quality GIF, the blogging platform has a low size limit. The original videos are attached to this blog post for offline viewing.)
How Does Hypervisor Hot Restart Work?
On an Azure node there will be one active hypervisor running the host operating system and the guest VMs. When we are ready to update the hypervisor, this active hypervisor creates a service partition where the new updated or latest hypervisor is initialized. The other partitions hosting the customer VMs continue to run normally, uninterrupted.
Once the new-hypervisor initialization is complete, it is ready, and the active hypervisor can now call into the new-hypervisor. Next, the active hypervisor creates a mirroring thread for each active partition, which replicates all state associated with the partition to the new-hypervisor. All partitions remain running while the mirroring threads reflect important state changes from the active hypervisor to the new-hypervisor. This mirrored state includes information such as memory ownership, partition lifecycle changes, device ownership, and so on.
All partitions are then temporarily suspended, and their state is saved into an internal hypervisor buffer to capture any state that has not already been mirrored. This phase is known as the "blackout" period, during which neither the host OS nor any guest VMs are running. Control of the physical machine is then passed to the new-hypervisor, which becomes the new active hypervisor. This time is well under a second as you see in the demo.
Finally, the active hypervisor restores the host OS and guest VMs, and their virtual processors resume execution. We can then reclaim memory that was used by the old hypervisor but is no longer needed by the new hypervisor. This allows us to perform repeated HHR operations without exhausting system resources.
To help visualize this process, we have created an animation that demonstrates Hypervisor Hot Restart. (Apologies for the low-quality GIF, the blogging platform has a low size limit. The original videos are attached to this blog post for offline viewing.)
The development of Hypervisor Hot Restart enables easy deployment of new hypervisor versions with new features and capabilities without VM downtime. For example, we used Hypervisor Hot Restart to mitigate Retbleed, a side-channel vulnerability that can compromise data security in virtualized environments. We deployed the latest Hypervisor with HyperClear to protect against Retbleed, marking our first utilization of Hypervisor Hot Restart in the Azure Fleet. During this deployment, we were able to deploy HyperClear across the Azure Fleet with sub-second blackout.
With that we conclude our look into the internals of Azure Host OS updates with the latest Hypervisor Hot Restart technology. Expect to see more of Azure Host and Windows internals in future blogs.
Meghna, Hari, Bruce (on behalf of the entire Core OS Team)