Today I'm sharing details about Known Issue Rollback (KIR), a new capability that can quickly return an impacted device back to productive use if an issue arises during a Windows update. In this blog, my engineering partner, Vatsan Madhavan, and I will walk you through this new capability.
In today’s digital world, we know you rely on trustworthy systems and services to stay ahead of potential security threats to your enterprise environments. At the same time, you strive to sustain the productivity of your workforce. We know you need both: security and productivity.
The Windows Servicing & Delivery team works to help keep Microsoft customers protected and productive by continuously building and delivering updates to the intelligent cloud and intelligent edge. Over the past year, we have been working on improvements that help you ensure uninterrupted productivity even when rolling out critical updates to prevent or address potential issues.
Known Issue Rollback is an important Windows servicing improvement to support non-security bug fixes, enabling us to quickly revert a single, targeted fix to a previously released behavior if a critical regression is discovered.
Built in direct response to your feedback, the pieces that make Known Issue Rollback work came together in a functionally complete system beginning in Windows 10, version 2004. Every month, we release updates with many quality changes “contained” using the Known Issue Rollback capability.
While Known Issue Rollback was originally designed for user-mode processes, we have made phased improvements over the last year to the OS kernel and the boot loader to support this capability in kernel mode. Some versions of Windows 10 prior to version 2004, for example versions 1909 and 1809, have partial support for Known Issue Rollback built into the OS and we leverage that support whenever possible when shipping updates for those versions.
When Windows developers code a non-security bug fix, they keep the old code intact and add the fix alongside it.
The Known Issue Rollback infrastructure in the OS provides developers with a method that evaluates a policy to determine the execution path. This policy tells the OS whether a fix should remain enabled or not. If the policy states that the fix is enabled, then the new code runs; and if the policy says that the fix is disabled, then the OS falls back to the old code-path.
Today, fixes in our monthly updates are enabled by default; that is, the old code is disabled, and the new code is enabled. If a fix turns out to have a serious problem, Azure-hosted services and Windows work in tandem to update this policy setting on the device and disable the problematic fix. Enterprises will be able to exercise control over this policy.
Note: As mentioned earlier, we only use Known Issue Rollback with non-security fixes. Using this coding scheme retains the old code. In the context of security fixes, older code is typically more vulnerable or exploitable; this is why we don't use Known Issue Rollback with security fixes today.
When Microsoft decides to roll back a bug fix in an update because of a known issue, we make a configuration change in the cloud. Devices connected to Windows Update or Windows Update for Business are notified of this change, and it takes effect with the next reboot.
Once this happens, the Known Issue Rollback infrastructure will start reporting that the fix (the new code that has a problem) is no longer enabled. From this point on, the OS will fall back to the previous code. That code still has the original bug, but the bug is a much more benign issue than the regression in the new code.
While these devices would still require a reboot, in most cases we have identified and published a rollback before most end user devices would have had the chance to even install the update containing the issue. In other words, most end users will never see the regression!
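The notify-then-reboot sequence described above can be sketched as follows. This is a simplified model under stated assumptions, not how Windows stores or applies the configuration: the `Device` class, its `pending` and `active` dictionaries, and the fix identifier are hypothetical. The point it illustrates is that a rollback received from the cloud does not change behavior until the next reboot.

```python
class Device:
    """Hypothetical model of a device receiving a rollback configuration."""

    def __init__(self) -> None:
        self.active = {}   # policy in effect since the last boot
        self.pending = {}  # received from the cloud, not yet applied

    def receive_rollback(self, fix_id: str) -> None:
        # The cloud configuration change arrives; stage it for the next boot.
        self.pending[fix_id] = False

    def reboot(self) -> None:
        # Staged policy takes effect at reboot.
        self.active.update(self.pending)
        self.pending.clear()

    def is_fix_enabled(self, fix_id: str) -> bool:
        # Fixes are enabled unless an active policy disables them.
        return self.active.get(fix_id, True)


device = Device()
device.receive_rollback("store_license_fix")
enabled_before_reboot = device.is_fix_enabled("store_license_fix")
device.reboot()
enabled_after_reboot = device.is_fix_enabled("store_license_fix")
```

In this sketch, `enabled_before_reboot` is `True` and `enabled_after_reboot` is `False`: the fix keeps running until the device restarts, after which the old code path is used.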
Devices that have opted into providing Microsoft with diagnostic data then send very scoped information about which code path is being exercised. This data helps us learn how well the rollback is succeeding in the ecosystem.
Enterprise devices are typically behind Network Address Translation (NAT) and a firewall. They tend to be part of an Active Directory forest and are often managed using Group Policy. For a Known Issue Rollback, Microsoft publishes a specific Group Policy on the Download Center that can be used to configure and apply a rollback policy (rolling back the code in the latest cumulative update or LCU) within an enterprise. A link to the Group Policy is included in the Windows Update KB article and release notes as a mitigation for a “Known Issue.”
In the KB article, we describe the issue and related information to help you and your IT administrators make informed choices. Our customer service teams are also aware of the Known Issue Rollback system and will be able to work with customers to identify problems with monthly updates and in turn coordinate a rollback if necessary.
Similar to the end-user scenario, devices that have opted into providing Microsoft with diagnostic data send specific information about which code-path is being exercised. This data helps us learn how well the rollback is succeeding in the ecosystem.
We have been using Known Issue Rollback since late 2019 to contain non-security fixes. Today, about 80% of fixes shipped on Windows 10, version 2004 (and later in-market versions), ship using Known Issue Rollback containment. Like most mitigation solutions, the real value of Known Issue Rollback does not become truly apparent until you need it, and it works! Here’s an example.
The April 2020 preview non-security release (KB4550945) for Windows 10, version 1903 had a regression in a fix for the Microsoft Store. For gamers who had purchased in-app content (like PC game add-ons) from the Microsoft Store prior to the April update, that in-app content would no longer appear licensed once the OS was updated with this release. The net result was that the games would not run.
Fortunately, the Microsoft Store team used Known Issue Rollback.
On April 27th, our social media monitoring team noticed a pattern indicating that a regression might be emerging. With just over 110,000 devices updated, around a dozen posts on Twitter, Reddit, and similar sites pointed to the problem.
Internally, we created a post-release incident report to research, identify, and confirm the root cause. By April 29th, 170,000 devices had installed the April update and we had pinpointed the code change in the release that was problematic.
At 7:40 PM PST on April 29th, we made the decision to initiate a Known Issue Rollback. By 10:00 PM PST, we had kicked off the Known Issue Rollback process through our Azure configuration service and began configuring Windows 10, version 1903, devices to remedy the Microsoft Store issue. By 6:00 PM PST on April 30th, 145 million Windows 10, version 1903 devices had been configured to roll back the Microsoft Store regression. By May 3rd, after only 72 hours, 236 million devices had been rolled back.
In this example, we were able to deploy a Known Issue Rollback mitigation within 24 hours of identifying the root cause of a problem. The result is that the overwhelming majority of Windows users never saw the regression. For them, the problem never existed; their machines quietly updated and there was no visible issue.
Known Issue Rollback configurations have a limited lifespan—a few months at most—because we expect to solve the underlying problem quickly and re-issue the fix. Once the underlying problem has been fixed, the Group Policy has outlived its usefulness. It becomes a benign setting and can be undeployed safely.
It’s worth noting that each Known Issue Rollback Group Policy is unique to a specific issue, i.e. a regression, and, as such, these policies are not cumulative in nature.
Today, Windows runs on over a billion devices spanning industries and countries around the globe. At Microsoft, we have been cultivating a service mindset in our journey to continuously improve our engineering and software update experience.
With Known Issue Rollback, your organization can remain secure and productive. We understand that the size of our ecosystem and the scale of our service requires a service maturity mindset. Known Issue Rollback is a great capability that can enable you to quickly recover from regressions without the risks associated with rolling back critical security protections.
Vatsan and I are happy to be on the team leading these innovations to keep you, our customers, secure while you also stay productive.
For more information on Known Issue Rollback, watch this video: