This post was co-authored by @archana_ramesh (Senior Data Scientist, Microsoft Cloud and AI) and @michael_stephenson (Partner Data Scientist, Microsoft Cloud and AI).
Regular Microsoft updates to your Windows 10 PC help ensure that it’s kept secure from possible threats and empowered with the latest features for peak performance and productivity. Because of the wonderful diversity of hardware, devices and applications available to Windows customers, each PC’s update experience may be slightly different. To ensure that all PCs have a seamless update experience, regardless of their differences, we use a combination of testing, close partner engagement, feedback, diagnostic data, and real-life insights to manage quality.
To help with the complexity of the aspects we need to evaluate, we are increasing our investments in machine learning (ML) technologies. Machine learning helps us detect potential issues more quickly and helps us decide the best time to update each PC once a new version of Windows is available.
In this blog, we’ll cover the technical details of how machine learning is used in the rollout of Windows 10 feature updates.
The rollout of any feature update is a gradual process. We start with PCs predicted to have a great update experience while safeguarding those PCs with known issues and, ultimately, expand to all eligible PCs as issues are resolved.
Windows 10, version 1803 (the April 2018 Update) was the first time we used ML on a broad scale. We started with six core areas of PC health (e.g. overall PC reliability) to determine whether the feature update process went smoothly. With Windows 10, version 1903 (the May 2019 Update), our third iteration of using ML in a feature update rollout, we can now evaluate 35 areas of PC health. The process will continue to evolve with additional health measures to improve your update experience.
Throughout our ML journey, we consistently see that PCs nominated for updates via ML have a significantly better update experience. For example, as shown in the chart below, PCs chosen via ML have fewer than half as many system-initiated uninstalls, half as many kernel mode crashes, and one-fifth as many post-update driver issues.
Figure 1. A comparison of system-initiated uninstalls, post-update kernel mode crashes, and post-update driver issues for the baseline and the ML model.
Because building an ML model to effectively support the rollout of Windows 10 updates is a complex process, we are sharing more detail on the data science behind the model, including what makes this problem different from other ML problems. In addition, we are outlining how the ML model produces a probability of having a seamless update experience, how we identify possible safeguards, and how we determine whether the model has learned enough to be put into action.
From an ML standpoint, operating system (OS) updates are unique for several reasons:
Given this complexity, we need a model that is dynamically trained on the most recent set of updated PCs and that can differentiate between PCs having a good update experience and those having a poor one.
The graphic below shows the overall architecture of how ML is used to nominate PCs.
Figure 2. Machine learning architecture used for the Windows 10 intelligent rollout process
Every release starts with offering the Windows 10 update to early adopters (such as Windows Insiders and those actively seeking out the update). Once the initial set of PCs has been offered the update, we monitor their update experience via diagnostic data (e.g. kernel mode crashes, system-initiated uninstalls, abnormal shutdowns, and driver issues).
Machine learning provides two key capabilities here:
As this entire process repeats daily, the model constantly learns from the most recent set of updated PCs. Over time, as issues are fixed, PCs previously predicted to have a poor update experience will now be predicted to have a better one, leading to them being offered the update.
We build a classification-based ML solution for each update. The training data consists of the latest set of PCs on the newest Windows 10 feature update, the PC configurations at the time of the update (i.e. hardware characteristics, drivers, apps, etc.), and a binary label constructed from a set of core diagnostic signals (e.g. whether a PC had a system-initiated uninstall, or how reliable the PC was after the update).
Figure 3. An example of the diagnostic data used to train the ML model used for intelligent rollout
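To make the idea of a binary label built from diagnostic signals more concrete, here is a minimal Python sketch. The field names and thresholds are illustrative assumptions, not the actual signals or cutoffs used in the rollout; the point is only that several post-update signals collapse into a single good/poor label that a classifier can be trained on.

```python
from dataclasses import dataclass


@dataclass
class PostUpdateSignals:
    """Hypothetical post-update diagnostic signals for one PC."""
    system_initiated_uninstall: bool
    kernel_mode_crashes: int
    abnormal_shutdowns: int
    driver_issues: int


def good_update_label(s: PostUpdateSignals,
                      max_crashes: int = 0,
                      max_shutdowns: int = 1,
                      max_driver_issues: int = 0) -> int:
    """Return 1 for a good update experience, 0 for a poor one.

    The thresholds are assumptions chosen for illustration only.
    """
    if s.system_initiated_uninstall:
        return 0
    if s.kernel_mode_crashes > max_crashes:
        return 0
    if s.abnormal_shutdowns > max_shutdowns:
        return 0
    if s.driver_issues > max_driver_issues:
        return 0
    return 1
```

Each PC's configuration at update time would then be paired with this label to form one training example.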
We use Microsoft Azure Databricks to build the ML model:
These elements all come together as follows: If your PC is eligible to receive an update, we will apply the best-known ML model to your configuration to assess how likely your PC is to have a good update experience and which compatibility issues we need to fix in order to ensure your update experience will be great.
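One way to picture this decision is the short Python sketch below. It assumes a hypothetical `predict_good` callable standing in for the trained model, a list of safeguard holds (configuration patterns that block the update) expressed as sets of (feature, value) pairs, and an arbitrary probability threshold. None of these names or values come from the actual system.

```python
from typing import Callable, FrozenSet, Iterable, Mapping, Tuple

Config = Mapping[str, str]
Pattern = FrozenSet[Tuple[str, str]]


def should_offer_update(config: Config,
                        predict_good: Callable[[Config], float],
                        safeguard_holds: Iterable[Pattern],
                        threshold: float = 0.9) -> bool:
    """Decide whether to offer the update to one eligible PC.

    predict_good is a stand-in for the trained model and returns the
    predicted probability of a good update experience. The threshold
    value is an assumption for illustration.
    """
    items = set(config.items())
    # A PC that matches any safeguard hold is not offered the update.
    for hold in safeguard_holds:
        if hold <= items:
            return False
    # Otherwise, offer only when the predicted probability is high enough.
    return predict_good(config) >= threshold
```

Because the model is retrained as issues are fixed, the same PC can move from "not offered" to "offered" on a later day without any change to this logic.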
A key element of the ML-driven rollout process is the capability to identify compatibility issues early, enabling us to establish safeguard holds to protect specific PCs from receiving a given update. Historically, compatibility issues were detected via laborious lab tests, feedback, support calls, and other channels. While these channels are still used, applying ML to the diagnostic data from the PCs in our broad ecosystem enables us to identify the patterns (in hardware characteristics, drivers, applications, etc.) that are most correlated with any update-related disruption.
To achieve this, we use anomaly detection, which identifies when a feature or pattern (two or more features) results in a higher failure rate than we see for the entire population. Because this is implemented on Microsoft Azure Databricks, we can rapidly scale to millions of PCs and establish safeguard holds to prevent PCs from being disrupted by potential update-related issues.
Figure 5. Chart showing a feature or pattern that is failing at 82% against a baseline failure rate of about 3%. This identifies where a safeguard hold is needed to prevent other PCs from experiencing similar issues.
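The core comparison, a pattern's failure rate against the population baseline, can be sketched in a few lines of Python. This is a simplified, single-machine illustration and not the Databricks implementation: it counts every single feature and every pair of features, and flags those whose failure rate is at least `min_lift` times the baseline. The function name, the `min_support` and `min_lift` parameters, and their defaults are all assumptions.

```python
from collections import Counter
from itertools import combinations


def find_anomalous_patterns(pcs, max_pattern_size=2,
                            min_support=50, min_lift=5.0):
    """Flag feature patterns that fail far more often than the population.

    pcs is a list of (config, failed) pairs, where config maps feature
    names to values and failed is True if the update went poorly.
    Returns (baseline_rate, flagged), where flagged is a list of
    (pattern, failure_rate, pc_count) sorted by failure rate, highest first.
    """
    total = len(pcs)
    if total == 0:
        return 0.0, []
    baseline = sum(1 for _, failed in pcs if failed) / total

    seen = Counter()
    failures = Counter()
    for config, failed in pcs:
        items = sorted(config.items())
        for size in range(1, max_pattern_size + 1):
            for pattern in combinations(items, size):
                seen[pattern] += 1
                if failed:
                    failures[pattern] += 1

    flagged = []
    for pattern, count in seen.items():
        if count < min_support:
            continue  # too few PCs to trust the rate
        rate = failures[pattern] / count
        if baseline > 0 and rate >= min_lift * baseline:
            flagged.append((pattern, rate, count))
    flagged.sort(key=lambda entry: entry[1], reverse=True)
    return baseline, flagged
```

With a 3% baseline and the default `min_lift`, any sufficiently common pattern failing at 15% or more would be flagged, so a pattern failing at 82% stands out clearly. A production system would likely add a statistical significance test rather than rely on a fixed ratio.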
You can find a list and details on the latest known issues and safeguards by visiting the Windows release health dashboard.
With the diversity of the Windows ecosystem and an ML model that is refreshed every day, it's important to determine when the ML model is ready to be broadly applied. In other words, we can only use ML to determine when to offer the update to your PC if we have seen enough similar hardware configurations that have successfully updated.
Typical ML scenarios use a learning curve to determine when models are adequately trained; however, due to the unique diversity of the Windows ecosystem, we use a concept called saturation, which looks at how many of the diverse hardware components, drivers, applications, etc. have been seen so far from updated PCs. Saturation helps us understand the extent to which the feature update rollout has penetrated the hardware and component ecosystem and, thus, is representative of the population of PCs to be updated.
Figure 6. Monitoring the saturation of the Windows feature update space. Continuous dynamic training of the ML model typically starts once saturation exceeds 60%, indicating that the training data is representative of the diversity of the Windows ecosystem.
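As a rough illustration of saturation, the Python sketch below treats it as the fraction of distinct (feature, value) components in the target population that have already appeared on at least one updated PC. That exact definition is an assumption made for this example, as are the function names; only the 60% threshold comes from the figure above.

```python
def saturation(updated_pcs, all_pcs):
    """Fraction of distinct components in the ecosystem seen on updated PCs.

    Each PC is a mapping from feature name (e.g. a hardware component,
    driver, or application slot) to its value. This definition is an
    illustrative assumption.
    """
    ecosystem = {item for config in all_pcs for item in config.items()}
    if not ecosystem:
        return 0.0
    seen = {item for config in updated_pcs for item in config.items()}
    return len(seen & ecosystem) / len(ecosystem)


def ready_for_training(updated_pcs, all_pcs, threshold=0.60):
    """True once saturation exceeds the threshold (60% by default)."""
    return saturation(updated_pcs, all_pcs) > threshold
```

Unlike a learning curve, this measure says nothing about model accuracy; it only indicates whether the updated PCs cover enough of the ecosystem for the training data to be representative.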
Keeping Windows 10 PC users safe and current has been an exciting journey, not just in terms of building out the technologies to support a large-scale ML deployment, but also in terms of determining how to measure impact in this context. We measure impact via a few different progress indicators:
While we are excited by the promise of machine learning, there is still much work to be done to ensure that ML is comprehensive, more automated, and agile enough to catch issues in a few seconds rather than hours. In upcoming feature updates, we will continue to evolve ML and share more details on the progress we make.