Known Issue Rollback: Helping you keep Windows devices protected and productive

Published 03-02-2021 08:00 AM
Microsoft

Today I'm sharing details about Known Issue Rollback (KIR), a new capability that can quickly return an impacted device back to productive use if an issue arises during a Windows update. In this blog, my engineering partner, Vatsan Madhavan, and I will walk you through this new capability.

In today’s digital world, we know you rely on trustworthy systems and services to stay ahead of potential security threats to your enterprise environments. At the same time, you strive to sustain the productivity of your workforce. We know you need both: security and productivity.

The Windows Servicing & Delivery team works to help keep Microsoft customers protected and productive by continuously building and delivering updates to the intelligent cloud and intelligent edge. Over the past year, we have been working on improvements that help you ensure uninterrupted productivity even when rolling out critical updates to prevent or address potential issues.

Introducing Known Issue Rollback

Known Issue Rollback is an important Windows servicing improvement to support non-security bug fixes, enabling us to quickly revert a single, targeted fix to a previously released behavior if a critical regression is discovered.

Built in direct response to your feedback, the pieces that make Known Issue Rollback work came together in a functionally complete system beginning in Windows 10, version 2004. Every month, we release updates in which many of the quality changes are “contained” using the Known Issue Rollback capability.

While Known Issue Rollback was originally designed for user-mode processes, we have made phased improvements over the last year to the OS kernel and the boot loader to support this capability in kernel mode. Some versions of Windows 10 prior to version 2004, for example versions 1909 and 1809, have partial support for Known Issue Rollback built into the OS and we leverage that support whenever possible when shipping updates for those versions.

At the code level

When Windows developers code a non-security bug-fix, they keep the old code intact and add the fix.

The Known Issue Rollback infrastructure in the OS provides developers with a method that evaluates a policy to determine the execution path. This policy tells the OS whether a fix should remain enabled or not. If the policy states that the fix is enabled, then the new code runs; and if the policy says that the fix is disabled, then the OS falls back to the old code-path.
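
As a rough illustration of the pattern described above, the sketch below keeps both code paths and lets a policy lookup choose between them. It is written in Python for readability and is not the Windows implementation; the names `KirPolicy`, `is_fix_enabled`, `format_size`, and the fix identifier `fix_1234` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KirPolicy:
    """Hypothetical policy store: tracks which fixes have been disabled."""
    disabled_fixes: set = field(default_factory=set)

    def is_fix_enabled(self, fix_id: str) -> bool:
        # Fixes are enabled unless the policy explicitly disables them.
        return fix_id not in self.disabled_fixes


def _format_size_old(n_bytes: int) -> str:
    # Previously released behavior, kept intact alongside the fix.
    return f"{n_bytes // 1024} KB"


def _format_size_new(n_bytes: int) -> str:
    # The non-security fix: round to nearest instead of truncating.
    return f"{round(n_bytes / 1024)} KB"


def format_size(n_bytes: int, policy: KirPolicy) -> str:
    # The policy decides which path runs (hypothetical fix ID).
    if policy.is_fix_enabled("fix_1234"):
        return _format_size_new(n_bytes)
    return _format_size_old(n_bytes)
```

With an empty policy the new path runs; adding `"fix_1234"` to `disabled_fixes` sends callers back to the old path without changing any shipped code.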

Today, fixes in our monthly updates are enabled by default -- i.e., the old code is disabled, and the new code is enabled. If a fix turns out to have a serious problem, Azure hosted services and Windows work in tandem to update this policy-setting on the device and disable the problematic fix. Enterprises will be able to exercise control over this policy.

Note: As mentioned earlier, we only use Known Issue Rollback with non-security fixes. Using this coding scheme retains the old code. In the context of security fixes, older code is typically more vulnerable or exploitable; this is why we don't use Known Issue Rollback with security fixes today.

How Known Issue Rollback works for the end user

When Microsoft decides to roll back a bug fix in an update because of a known issue, we make a configuration change in the cloud. Devices connected to Windows Update or Windows Update for Business are notified of this change, and it takes effect with the next reboot.
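
To make the "notified now, effective at next reboot" behavior concrete, here is a minimal sketch of a device that stages a rollback and applies it on reboot. The `Device` class and its method names are hypothetical and only model the sequence described above, not how Windows stores or applies the configuration.

```python
class Device:
    """Hypothetical device model: rollback config takes effect at reboot."""

    def __init__(self):
        self.active_disabled = set()   # fixes disabled in the running OS
        self.pending_disabled = set()  # received from the cloud, not yet applied

    def receive_rollback(self, fix_id: str) -> None:
        # The device is notified of the configuration change...
        self.pending_disabled.add(fix_id)

    def reboot(self) -> None:
        # ...and the change takes effect with the next reboot.
        self.active_disabled |= self.pending_disabled
        self.pending_disabled.clear()

    def is_fix_enabled(self, fix_id: str) -> bool:
        # Fixes are enabled by default until a rollback becomes active.
        return fix_id not in self.active_disabled
```

In this model, a device that has received a rollback still reports the fix as enabled until `reboot()` is called.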

Microsoft Privacy Statement on Windows Diagnostics

Once this happens, the Known Issue Rollback infrastructure will start reporting that the fix (the new code that has a problem) is no longer enabled. From this point on, the OS falls back to the previous code, which still has the original bug, albeit a much more benign issue than the regression in the new code.

While these devices would still require a reboot, in most cases we have identified and published a rollback before most end user devices would have had the chance to even install the update containing the issue. In other words, most end users will never see the regression!

Devices that have opted into providing Microsoft with diagnostic data then send narrowly scoped information about which code path is being exercised. This data helps us learn how well the rollback is succeeding in the ecosystem.
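
One way to picture how such data could indicate rollback progress is to compute the share of reporting devices that are running the old code path. The sketch below assumes each report is simply the string `"old"` or `"new"`; that format and the function name `rollback_share` are invented for illustration, since the post does not describe the actual diagnostic payload.

```python
from collections import Counter


def rollback_share(reports):
    """Fraction of reporting devices running the old (rolled-back) path.

    `reports` is an iterable of strings, "old" or "new"; this report
    format is an assumption made for illustration.
    """
    counts = Counter(reports)
    total = counts["old"] + counts["new"]
    # Avoid dividing by zero when no device has reported yet.
    return counts["old"] / total if total else 0.0
```

A value that climbs toward 1.0 after a rollback is published would suggest the rollback is reaching devices.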

Putting enterprises in control

Enterprise devices are typically behind a Network Address Translation (NAT) device and a firewall, which means they tend to be part of an Active Directory forest and are often managed using Group Policy. For a Known Issue Rollback, Microsoft publishes a specific Group Policy on the Download Center that can be used to configure and apply a rollback policy (rolling back the code in the latest cumulative update, or LCU) within an enterprise. A link to the Group Policy is included in the Windows Update KB article and release notes as a mitigation for a “Known Issue.”

In the KB article, we describe the issue and related information to help you and your IT administrators make informed choices. Our customer service teams are also aware of the Known Issue Rollback system and will be able to work with customers to identify problems with monthly updates and in turn coordinate a rollback if necessary.

Similar to the end-user scenario, devices that have opted into providing Microsoft with diagnostic data send specific information about which code-path is being exercised. This data helps us learn how well the rollback is succeeding in the ecosystem.

An example of Known Issue Rollback in action

We have been using Known Issue Rollback since late 2019 to contain non-security fixes. Today, about 80% of fixes shipped on Windows 10, version 2004 (and later in-market versions), ship using Known Issue Rollback containment. Like most mitigation solutions, the real value of Known Issue Rollback does not become truly apparent until you need it, and it works! Here’s an example.

The April 2020 preview non-security release (KB4550945) for Windows 10, version 1903 had a regression in a fix for the Microsoft Store. For gamers who had purchased in-app content (like PC game add-ons) from the Microsoft Store prior to the April update, that in-app content would no longer appear licensed once the OS was updated with this release. The net result was that the games would not run.

Fortunately, the Microsoft Store team used Known Issue Rollback.

On April 27th, our social media monitoring team noticed a pattern indicating that a regression might be emerging. With just over 110,000 devices updated, around a dozen posts on Twitter, Reddit, and similar sites indicated the following:

  • A consistent description of the problem (“Trial Period”, “App won’t run”, “App crash”, etc.)
  • Keywords that suggested that a Windows Update may be at fault (“KB4550945”, “1903”, “Latest Windows Update” etc.)

Internally, we created a post-release incident report to research, identify, and confirm the root cause. By April 29th, 170,000 devices had installed the April update and we had pinpointed the code change in the release that was problematic.

At 7:40 PM PST on April 29th, we made the decision to initiate a Known Issue Rollback. By 10:00 PM PST, we had kicked off the Known Issue Rollback process through our Azure configuration service and began configuring Windows 10, version 1903, devices to remedy the Microsoft Store issue. By 6:00 PM PST on April 30th, 145 million Windows 10, version 1903 devices had been configured to roll back the Microsoft Store regression. By May 3rd, after only 72 hours, 236 million devices had been rolled back.

In this example, we were able to deploy a Known Issue Rollback mitigation within 24 hours of identifying the root cause of a problem. The result is that the overwhelming majority of Windows users never saw the regression. For them, the problem never existed; their machines quietly updated and there was no visible issue.
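
The timeline above can be checked with a little date arithmetic. The sketch below uses only the timestamps and device count quoted in this example; the average rate it prints assumes the configured-device count started near zero at kickoff and grew evenly, which the example does not state.

```python
from datetime import datetime

# Timestamps from the example above (April 2020, same time zone throughout).
decision = datetime(2020, 4, 29, 19, 40)   # decision to roll back
kickoff = datetime(2020, 4, 29, 22, 0)     # rollback process started
milestone = datetime(2020, 4, 30, 18, 0)   # 145 million devices configured

decision_to_kickoff = kickoff - decision
kickoff_to_milestone = milestone - kickoff

hours = kickoff_to_milestone.total_seconds() / 3600
devices_per_hour = 145_000_000 / hours  # average rate, not a measured one

print(decision_to_kickoff)   # 2:20:00
print(hours)                 # 20.0
print(devices_per_hour)      # 7250000.0
```

So the rollback started 2 hours 20 minutes after the decision, and the 145 million figure was reached 20 hours after kickoff, consistent with the "within 24 hours" summary.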

The Known Issue Rollback lifecycle

Known Issue Rollback configurations have a limited lifespan—a few months at most—because we expect to solve the underlying problem quickly and re-issue the fix. Once the underlying problem has been fixed, the Group Policy has outlived its usefulness. It becomes a benign setting and can be undeployed safely.

It’s worth noting that each Known Issue Rollback Group Policy is unique to a specific issue, i.e. a regression, and, as such, these policies are not cumulative in nature.
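
For administrators tracking deployed rollbacks, the lifecycle above suggests a simple per-issue bookkeeping model. The sketch below is hypothetical: `KirGroupPolicy`, its fields, and `policies_to_undeploy` are invented names, and the `fix_reissued` flag stands in for whatever signal (such as the release notes) tells you the corrected fix has shipped and been installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KirGroupPolicy:
    """Hypothetical record: one Group Policy per regression."""
    issue_id: str       # the single regression this policy rolls back
    fix_reissued: bool  # True once the corrected fix is installed


def policies_to_undeploy(deployed):
    # Policies are not cumulative: each covers only its own issue, so each
    # is checked on its own and a newer one never supersedes an older one.
    return [p.issue_id for p in deployed if p.fix_reissued]
```

Reviewing such a list periodically is one way to avoid accumulating policies that have outlived their usefulness.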

Conclusion

Today, Windows runs on over a billion devices spanning industries and countries around the globe. At Microsoft, we have been cultivating a service mindset in our journey to continuously improve our engineering and software update experience.

With Known Issue Rollback, your organization can remain secure and productive. We understand that the size of our ecosystem and the scale of our service require a service maturity mindset. Known Issue Rollback is a great capability that can enable you to quickly recover from regressions without the risks associated with rolling back critical security protections.

Vatsan and I are happy to be on the team leading these innovations to keep you, our customers, secure while you also stay productive.

To learn more

For more information on Known Issue Rollback, watch this video:

 

21 Comments
Occasional Contributor

Hi @Eric_Vernon!

Thanks for the blog post, it's nice to have a word from the PG on top of personal guesswork based on the MSKB article.

 

I've got a couple of questions.

1. Is KIR related in any way to the Windows Feature Store / Windows Notification Facility?

2. Is KIR related to another approach of fixing issues via the WU troubleshooter? E.g. https://support.microsoft.com/kb/4570336 Can you shed some light on this method anyway? E.g. whether it's still in use, when it's used instead of KIR, etc.

 

Thanks,
Vadim

 

Microsoft

Hi Vadim,

 

Thanks for your interest in Known Issue Rollback (KIR) and your questions. KIR is somewhat related to the troubleshooters from the standpoint that both are trying to solve customer issues. Troubleshooters tend to focus on a class of issues such as Activation, Networking, Microphone etc. Known Issue Rollback is much more targeted however; it deals with a specific issue that is a result of a specific code fix that shipped in a monthly Latest Cumulative Update (LCU).

 

We are not sharing further details about the internals or implementation of KIR at this time.

 

Thanks,

 

Eric

 

Occasional Contributor

Eric, will Known Issue Rollback also be available on the Windows Server platform, or for other Microsoft products such as Office or .NET Framework?

Occasional Contributor

@Eric_Vernon: Great stuff.

Like @will nimmo I'd be interested to know if this is a thing that will impact the server OS where the criticality is often a bigger deal.  Though you could argue that on the server side there's probably less of a chance that a non-security update breaks critical functionality.

The other feedback I'd give is KIR GPO bloat. I can promise you that most orgs are just going to add a KIR to the policy and never think of it again. While on the individual level that's probably not a huge problem, if each KIR is its own policy/whatever that you're publishing then over time that's going to create bloat that will impact GPO processing times. GPO is the poster child for 'set it and forget it' until a new admin comes in and asks what these 2000 policies do and why they were set.

Microsoft

Hi @will nimmo and @bdam55 ,

 

Thanks for your interest in KIR and your questions.

 

This technology works on Windows Server the same way it works on Windows client editions; that is, newer versions such as Windows Server 2004 have fully functional support for KIR today, while older versions have limited support.

 

At this time, .NET Framework and other Microsoft products like Office do not have support for Known Issue Rollback.

 

The concerns you've raised about GPO bloat and the consequent impact on processing times are valid, @bdam55. Thanks for bringing these up! This is something we hope to address in the future as we continue to make improvements to KIR.

 

Thanks,

@Vatsan_Madhavan 

Occasional Contributor

Thanks @Vatsan_Madhavan, that's great info! I should have specified though: can you describe the support for Server LTSC (2012, 2016, and 2019)? I'm sure there are large farms of SAC out there somewhere but I've not seen them in the enterprises I interact with.

"Not happening" is a perfectly fine answer and kind of what users signed up for but I'd love to be surprised.

Microsoft
Thanks for the clarification @bdam55

As I mentioned previously, there is partial support for KIR in versions older than Windows Server 2004.

Specifically, we have some support for KIR in each of Windows Server 2019 LTSC (1809) and Windows Server 2016 LTSC (1607), though the extent of capability in each version varies (in general, newer versions of the OS have more refinements/capabilities). There is no support for KIR in Windows Server 2012.
Occasional Contributor

Point of clarification. Is this feature limited to Windows in a business setting?

Are Windows Home users left out in the cold, with broken machines?

Microsoft

@ron S. This is not limited to the business setting. Group Policy is the KIR solution for business settings. For versions of Windows that are outside of a business setting and are connected to Windows Update, KIR is implemented from our cloud service.

Occasional Contributor

Great timing. I'm sure this isn't the right place for a request like this, but I'm not sure where else to request it. Maybe report as an issue on M365 Admin Portal? Can we get a KIR policy for this issue in KB5000802? It's been a known issue for several days and still no fix. The barcodes on our college ID cards are not getting printed until I uninstall this update.

 

After installing updates released March 9, 2021 or March 15, 2021, you might get unexpected results when printing from some apps. Issues might include:

  • Elements of the document might print as solid black/color boxes or might be missing, including barcodes, QR codes, and graphics elements, such as logos.

Visitor

@Brian_Klish_work 

Unfortunately Microsoft doesn't have a mechanism for handling feedback from IT professionals, they want us to use the Feedback Hub too, which is about the same as writing a note on toilet paper and flushing it. My feedback would be that Microsoft needs to up their game in regards to testing updates. The past several years have proven that their current testing process is totally inadequate and demonstrates incompetence at the highest levels. There is not a day that passes that I'm not fighting a bad Windows (or Office365) update somewhere. I'd like to sugar coat this, but there is not enough lipstick produced on the planet to make this Microsoft pig look good.

Microsoft

@Brian_Klish_work Sorry to hear you are having issues with printing. Take a look at the Windows Release dashboard (https://aka.ms/windowsreleasehealth) for details of the issue and when the fix will be released. Unfortunately in this particular instance, KIR could not be used.

Help us understand when KIR can and cannot be used so we know when (or if) to expect things to be handled in this manner?

P.S. The dashboard at this time just says you are working on the fix, so there is no ETA for the image printing problem and the Dymo label issue; for now the only workaround is to uninstall the update.

Microsoft

@Susan Bradley Thanks for the question.

 

There are several considerations for when KIR can be used (today):

  • Used only for non-security fixes.
  • More recent platforms, Windows 10 2004 and newer, leverage KIR more fully. Older platforms have limited KIR support.
  • Each fix is evaluated by the developer prior to release to see if KIR is feasible.

Today we don’t publish KIR usage for each fix that goes into a monthly update (LCU). KIR is part of a larger strategy to improve customer confidence in the LCU. The callout to the Windows Release dashboard is to highlight that this is where Microsoft will communicate not only information about issues but also whether an issue is solved with KIR. KIR mitigations are typically released much more quickly than updates that fix regressions, and we are working to expand KIR usage in Windows platforms.

@Brian_Klish_work 

Another out of band update out tonight:

https://docs.microsoft.com/en-us/windows/release-health/windows-message-center#1574 I think your fix might be in that

Valued Contributor

Thank you for the post, it is very valuable. However, I see one minor issue here. It will work if the update has been installed, Windows has booted, and the device is able to send a diagnostic report from inside the Windows environment. But consider the case where the update affects the network driver or the reporting mechanism, so no report is sent, or where the update causes a boot failure, in which case no report is sent either.

In this case, systems are not reporting, and the reason might be benign (users turned off their devices, are not connected to the internet, or have a network connection issue) or it might be an update failure. Is there any solution for such scenarios, or should we leave it to the expertise of the administrator to determine whether a missing report is due to an update failure or to a successful installation with only a connectivity problem?

@Eric_Vernon In your response to @Brian_Klish_work, when he asked whether KIR is in effect for the current printing issues from the March 2021 patches, you mentioned "Unfortunately in this particular instance, KIR could not be used", however in your video in this blog at around marker 00:30, you describe the issue with a printing issue caused by a patch in September 2020, and KIR came to the rescue. So, why doesn't it work for the current printing issues? 

Microsoft

@Harjit Dhaliwal Each fix is evaluated individually based on the component, the risk of the change, fix type, the platform etc. In the case of the print issue mentioned by Brian, this was a security fix that had issues and KIR currently only supports non-security fixes.

New Contributor

Hi @Eric_Vernon 

So in an AD environment, we should regularly check the release notes to see if a regression is confirmed, check whether it affects our environment, and, if so, apply the related GPO?

Another question: approximately when will Intune include this functionality? Azure Arc (for servers)?

 

Microsoft

@Andres Pae You are correct about your AD environment. If you are not seeing the issue, there is no need to apply the GP.

 

We are working on integration with Intune and other management solutions but I don't have an ETA to share at this time.

Last update: Mar 26 2021 01:34 PM