Azure Host OS Update with Hypervisor Hot Restart
Published Sep 06 2023 09:48 AM 6,989 Views
Microsoft

Azure is Microsoft’s cloud computing offering which provides IaaS (infra as a service) virtual machines (VM), PaaS (platform as a service) containers and many other SaaS services (e.g., Azure Storge, Networking, etc.). Azure, being one of the largest cloud service providers, hosts millions of customer virtual machines (VMs) in our data centers. The operating system that runs on these hosts is a modified version of Windows called Cloud Host. I talked about this and the Azure Host OS architecture (incl. the root OS and the hypervisor) in an earlier blog post. In this blog post we will talk about how we update the operating system that runs on those hosts, in particular, a new advancement we made in updating the hypervisor called “Hypervisor Hot Restart (HHR)”.

 

Azure Host OS Updates Overview

Ensuring the security of the Azure hosts is critical to maintaining our customers’ trust as their applications run in a public cloud where customers have limited control over infrastructure updates. We ensure the security of the Azure host by patching and keeping it up to date with all the latest applicable security updates. These patches are typically rolled out every month with no disruption to the customer workloads. In addition, to security updates, we also update the Azure Host OS to provide new features and functionality to the customer VMs, e.g., new hardware generation support or new features such as Azure confidential computing.

 

Note: this blog focuses on internal Azure Host OS technical details and does not talk about Azure customer facing VM updates and control mechanisms. Those maintenance controls or scheduled events are documented for our customers on Azure’s website.

 

Different Azure Host OS update mechanisms

These are the most common types of Azure Host OS update technologies used in the Azure fleet.

 

Update Tech

Performance

Purpose

Hot Patching

Best – in milliseconds. Not visible to customer VMs

Typically used for monthly security updates (e.g., MSRC). More detailed blog on internals here.  

VM PHU

Typically, 30 secs

Update the entire Azure Host OS. Paper in EuroSys 2021 for tech details.  

Live Migration [1]

Multiple seconds

Migrate the VM to a different node and potentially empty the node for Host OS updates or other needs.  

Hypervisor Hot Restart

Under a second

Update the entire Hypervisor. Useful when updating to a new version with the latest features.

 

[1] Future post on Live Migration internals

 

Introducing Hypervisor Hot Restart

With that introduction on Azure Host OS update technologies, we are going to do a deep dive into our latest and most advanced update technology: Hypervisor Hot Restart (HHR). HHR allows us to update and replace the hypervisor on a running system with sub second blackout time for customer VMs and importantly without dropping any packets. With Hypervisor Hot Restart, we can deploy new hypervisor features or fixes easily, providing enormous customer value. This is especially important in today's world, where security threats are becoming more prevalent and sophisticated.

 

Hypervisor Hot Restart in Action

This is a demonstration of how Hypervisor Hot Restart works in action. It showcases 4 VMs (Virtual Machines) that continue to run while the hypervisor is fully replaced under them. The network connection remains stable throughout the process and no packets are lost. Additionally, we showcase the speed of the restart process, with a maximum packet delay of 600 milliseconds. (Apologies for the low-quality GIF, the blogging platform has a low size limit. The original videos are attached to this blog post for offline viewing.) 

 

HHRDemo.gif

 

How Does Hypervisor Hot Restart Work?

On an Azure node there will be one active hypervisor running the host operating system and the guest VMs. When we are ready to update the hypervisor, this active hypervisor creates a service partition where the new updated or latest hypervisor is initialized. The other partitions hosting the customer VMs continue to run normally, uninterrupted.

 

Once the new-hypervisor initialization is complete, it is ready, and the active hypervisor can now call into the new-hypervisor. Next, the active hypervisor creates a mirroring thread for each active partition, which replicates all state associated with the partition to the new-hypervisor. All partitions remain running while the mirroring threads reflect important state changes from the active hypervisor to the new-hypervisor. This mirrored state includes information such as memory ownership, partition lifecycle changes, device ownership, and so on.

 

All partitions are then temporarily suspended, and their state is saved into an internal hypervisor buffer to capture any state that has not already been mirrored. This phase is known as the "blackout" period, during which neither the host OS nor any guest VMs are running. Control of the physical machine is then passed to the new-hypervisor, which becomes the new active hypervisor. This time is well under a second as you see in the demo.

 

Finally, the active hypervisor restores the host OS and guest VMs, and their virtual processors resume execution. We can then reclaim memory that was used by the old hypervisor but is no longer needed by the new hypervisor. This allows us to perform repeated HHR operations without exhausting system resources.

 

To help visualize this process, we have created an animation that demonstrates Hypervisor Hot Restart. (Apologies for the low-quality GIF, the blogging platform has a low size limit. The original videos are attached to this blog post for offline viewing.)

 

HHR.gif

 

The development of Hypervisor Hot Restart enables easy deployment of new hypervisor versions with new features and capabilities without VM downtime. For example, we used Hypervisor Hot Restart to mitigate Retbleed, a side-channel vulnerability that can compromise data security in virtualized environments. We deployed the latest Hypervisor with HyperClear to protect against Retbleed, marking our first utilization of Hypervisor Hot Restart in the Azure Fleet. During this deployment, we were able to deploy HyperClear across the Azure Fleet with sub-second blackout.

 

With that we conclude our look into the internals of Azure Host OS updates with the latest Hypervisor Hot Restart technology. Expect to see more of Azure Host and Windows internals in future blogs.  

 

Cheers,

Meghna, Hari, Bruce (on behalf of the entire Core OS Team)

3 Comments
Co-Authors
Version history
Last update:
‎Sep 06 2023 08:39 AM
Updated by: