One Windows Kernel
Published Oct 17 2018 12:39 PM 214K Views
Microsoft

Windows is one of the most versatile and flexible operating systems out there, running on a variety of machine architectures and available in multiple SKUs. It currently supports the x86, x64, ARM and ARM64 architectures, and it used to support Itanium, PowerPC, DEC Alpha, and MIPS (wiki entry). In addition, Windows supports a variety of SKUs that run in a multitude of environments, from data centers, laptops, Xbox consoles and phones to embedded IoT devices such as ATMs.

 

The most amazing aspect of all this is that the core of Windows, its kernel, remains virtually unchanged on all these architectures and SKUs. The Windows kernel scales dynamically depending on the architecture and the processor that it’s run on to exploit the full power of the hardware. There is of course some architecture-specific code in the Windows kernel; however, this is kept to a minimum to allow Windows to run on a variety of architectures.

 

In this blog post, I will talk about the evolution of the core pieces of the Windows kernel that allow it to transparently scale from a low power NVIDIA Tegra chip on the Surface RT from 2012 to the giant behemoths that power Azure data centers today.

 

This is a picture of Windows taskmgr running on a pre-release Windows DataCenter class machine with 896 cores supporting 1792 logical processors and 2TB of RAM!

 

Task Manager showing 1792 logical processors

Evolution of one kernel

Before we talk about the details of the Windows kernel, I am going to take a small detour to talk about something called Windows refactoring. Windows refactoring plays a key part in increasing the reuse of Windows components across different SKUs and platforms (e.g. client, server and phone). The basic idea of Windows refactoring is to allow the same DLL to be reused in different SKUs, with minor modifications tailored to each SKU, without renaming the DLL and breaking apps.

 

The base technology used for Windows refactoring is a lightly documented technology (entirely by design) called API sets. API sets are a mechanism that allows Windows to decouple the DLL an app links against from where its implementation is located. For example, API sets allow win32 apps to continue to use kernel32.dll while the implementation of the APIs lives in a different DLL. These implementation DLLs can also be different depending on your SKU. You can see API sets in action if you launch dependency walker on a traditional Windows DLL, e.g. kernel32.dll.

 

Dependency walker
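To make the API set idea a little more concrete, here is a minimal sketch of the lookup it implies: a contract name is mapped to a host DLL, and the mapping can differ per SKU. The schema contents, the SKU names and the `resolve_api_set` function are all made up for illustration; the real schema format and resolution logic are not described in this post.

```python
# Illustrative only: the SKU names and host DLLs below are assumptions made
# for this sketch, not the real API set schema.
API_SET_SCHEMA = {
    "desktop": {"api-ms-win-core-file-l1-1-0": "kernelbase.dll"},
    "hypothetical-small-sku": {"api-ms-win-core-file-l1-1-0": "hypothetical_host.dll"},
}


def resolve_api_set(sku, contract):
    """Return the DLL that hosts `contract` on `sku`, or None if unknown."""
    name = contract.lower()
    if name.endswith(".dll"):
        name = name[:-4]  # callers may pass the name with or without ".dll"
    return API_SET_SCHEMA.get(sku, {}).get(name)
```

The app keeps importing the same contract name on every SKU; only the table entry behind it changes.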

With that detour into how Windows is built to maximize code reuse and sharing, let’s go into the technical depths of the kernel starting with the scheduler which is key to the scaling of Windows.

 

Kernel Components

Windows NT is like a microkernel in the sense that it has a core Kernel (KE) that does very little and uses the Executive layer (EX) to perform all the higher-level policy. Note that EX is still kernel mode, so it's not a true microkernel. The kernel is responsible for thread dispatching, multiprocessor synchronization, hardware exception handling, and the implementation of low-level machine-dependent functions. The EX layer contains various subsystems which provide the bulk of the functionality traditionally thought of as kernel, such as IO, Object Manager, Memory Manager, Process Subsystem, etc.

 

arch.png 

 

To get a better idea of the size of the components, here is a rough breakdown on the number of lines of code in a few key directories in the Windows kernel source tree (counting comments). There is a lot more to the Kernel not shown in this table. 

 

Kernel subsystems      Lines of code
Memory Manager         501,000
Registry               211,000
Power                  238,000
Executive              157,000
Security               135,000
Kernel                 339,000
Process sub-system     116,000

 

For more information on the architecture of Windows, the “Windows Internals” series of books is a good reference.

 

Scheduler

With that background, let's talk a little bit about the scheduler, its evolution and how the Windows kernel can scale across so many different architectures with so many processors.

 

A thread is the basic unit that runs program code, and it is this unit that is scheduled by the Windows scheduler. The Windows scheduler uses the thread priority to decide which thread to run, and in theory the highest priority thread on the system always gets to run, even if that entails preempting a lower priority thread.

 

As a thread runs and experiences quantum end (the end of its time slice), its dynamic priority decays, so that a high priority CPU bound thread doesn’t run forever and starve everyone else. When a waiting thread is awakened to run, it is given a priority boost based on the importance of the event that caused the wait to be satisfied (e.g. a large boost for a foreground UI thread vs. a smaller one for completing disk I/O). A thread therefore runs at a high priority as long as it’s interactive. When it becomes CPU (compute) bound, its priority decays, and it is considered only after other, higher priority threads get their time on the CPU. In addition, the kernel arbitrarily boosts the priority of ready threads that haven't received any processor time for a given period of time, to prevent starvation and correct priority inversions.
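The boost-and-decay behavior described above can be sketched as a tiny model. This is only an illustration under stated assumptions: the ceiling of 15, the decay of one level per quantum and the names `SimThread`, `on_wait_satisfied` and `on_quantum_end` are choices made for this sketch, not the kernel's actual code.

```python
from dataclasses import dataclass, field

MAX_DYNAMIC_PRIORITY = 15  # assumed ceiling for boosted priorities in this sketch


@dataclass
class SimThread:
    name: str
    base_priority: int
    priority: int = field(init=False)  # current (dynamic) priority

    def __post_init__(self):
        self.priority = self.base_priority

    def on_wait_satisfied(self, boost):
        """Waking up applies a boost whose size depends on the event."""
        self.priority = min(self.base_priority + boost, MAX_DYNAMIC_PRIORITY)

    def on_quantum_end(self):
        """Each quantum end decays the priority one level toward the base."""
        self.priority = max(self.priority - 1, self.base_priority)
```

In this model a thread that keeps waking on interactive events stays boosted, while one that burns through quantum after quantum drifts back to its base priority.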

 

The Windows scheduler initially had a single ready queue from which it picked the next highest priority thread to run on the processor. However, as Windows started supporting more and more processors, the single ready queue turned out to be a bottleneck, and around Windows Server 2003 the scheduler changed to one ready queue per processor. Moving to per-processor queues avoided having a single global lock protecting all the queues and allowed the scheduler to make locally optimum decisions. This means that at any point the single highest priority thread in the system runs, but that doesn’t necessarily mean that the top N (N is the number of cores) priority threads on the system are running. This proved to be good enough until Windows started moving to low power CPUs, e.g. in laptops and tablets. On these systems, not running a high priority thread (such as the foreground UI thread) caused noticeable glitches in the UI. And so, in Windows 8.1, the scheduler changed to a hybrid model with per-processor ready queues for affinitized (tied to a processor) work and shared ready queues between processors. This did not cause a noticeable impact on performance because of other architectural changes in the scheduler, such as the dispatcher database lock refactoring, which we will talk about later.
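The difference between locally optimum and globally optimum picks is easy to see in a toy model. In this sketch each ready queue is just a list of thread priorities, and the function names are made up for illustration; real ready queues and the hybrid Windows 8.1 model are considerably more involved.

```python
import heapq


def pick_per_processor(queues):
    """Each processor runs the highest priority thread in its own ready queue."""
    return [max(queue) if queue else None for queue in queues]


def pick_shared(queues):
    """Processors sharing one ready queue run the top N priorities overall."""
    all_threads = [priority for queue in queues for priority in queue]
    return heapq.nlargest(len(queues), all_threads)
```

With two processors whose queues hold priorities `[15, 14]` and `[4, 3]`, the per-processor picks are 15 and 4, so the priority 14 thread waits even though it outranks a running thread. With a shared queue the picks are 15 and 14.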

 

Windows 7 introduced something called the Dynamic Fair Share Scheduler; this feature was introduced primarily for terminal servers. The problem it tried to solve was that one terminal server session with a CPU intensive workload could impact the threads in other terminal server sessions. Since the scheduler didn’t consider sessions and simply used the priority as the key to schedule threads, users in different sessions could impact the user experience of others by starving their threads. It also unfairly advantaged the sessions (users) that had a lot of threads, because sessions with more threads got more opportunities to be scheduled and receive CPU time. This feature tried to add policy to the scheduler such that each session was treated fairly and roughly the same amount of CPU was available to each session. Similar functionality is available in Linux as well, with its Completely Fair Scheduler.

In Windows 8, this concept was generalized as a scheduler group and added to the Windows scheduler, with each session in an independent scheduler group. In addition to the thread priority, the scheduler uses the scheduler groups as a second level index to decide which thread should run next. In a terminal server, all the scheduler groups are weighted equally, and hence all sessions (scheduler groups) receive the same amount of CPU regardless of the number or priorities of the threads in the scheduler groups. In addition to their utility in a terminal server session, scheduler groups are also used to have fine grained control over a process at runtime. In Windows 8, Job objects were enhanced to support CPU rate control. Using the CPU rate control APIs, one can decide how much CPU a process can use, whether that limit should be a hard cap or a soft cap, and receive notifications when a process meets those CPU limits. This is like the resource control features available in cgroups on Linux.
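To see why grouping by session changes the outcome, here is a rough simulation comparing the two policies. It assumes all threads have equal priority and are always ready to run, and it hands out CPU one tick at a time; the function names and the "give the next tick to the group that has used the least" rule are assumptions made for this sketch, not the actual scheduler group algorithm.

```python
def per_thread_share(groups, ticks):
    """Round-robin over equal-priority threads, ignoring sessions.

    groups maps a session name to its number of ready threads.
    """
    threads = [name for name, count in groups.items() for _ in range(count)]
    used = {name: 0 for name in groups}
    if not threads:
        return used
    for tick in range(ticks):
        used[threads[tick % len(threads)]] += 1
    return used


def per_group_share(groups, ticks):
    """Equal-weight groups: the group with the least CPU used runs next."""
    runnable = [name for name, count in groups.items() if count > 0]
    used = {name: 0 for name in groups}
    if not runnable:
        return used
    for _ in range(ticks):
        name = min(runnable, key=lambda group: used[group])
        used[name] += 1
    return used
```

For one session with 1 thread and another with 10 threads, 110 ticks split 10 to 100 under the per-thread policy and 55 to 55 under the per-group policy.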

 

Starting with Windows 7, Windows Server started supporting more than 64 logical processors in a single machine. To add support for so many processors, Windows internally introduced a new entity called a “processor group”. A group is a static set of up to 64 logical processors that is treated as a single scheduling entity. The kernel determines at boot time which processor belongs to which group, and for machines with fewer than 64 cores, the overhead of the group structure indirection is mostly not noticeable. While a single process (such as a SQL Server instance) can span groups, an individual thread could only execute within a single processor group at a time.
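The arithmetic behind processor groups can be sketched in a few lines. This assumes groups are simply filled 64 processors at a time in index order, which is a simplification made for this sketch; the real assignment is decided by the kernel at boot and may not pack groups this way. The helper names are hypothetical.

```python
GROUP_SIZE = 64  # a processor group holds up to 64 logical processors


def to_group_relative(index):
    """Map a system-wide logical processor index to (group, number in group)."""
    return divmod(index, GROUP_SIZE)


def group_count(logical_processors):
    """Number of groups needed, rounding up."""
    return -(-logical_processors // GROUP_SIZE)


def affinity_mask(numbers):
    """Build a 64-bit mask from processor numbers within one group."""
    mask = 0
    for number in numbers:
        if not 0 <= number < GROUP_SIZE:
            raise ValueError("processor number must be in 0..63")
        mask |= 1 << number
    return mask
```

Under this simple packing, the 1792 logical processor machine from the Task Manager screenshot would need 28 groups, and a thread's affinity is a 64-bit mask that only makes sense relative to one group.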

 

However, on machines with more than 64 cores, Windows started showing some bottlenecks that prevented high performance applications such as SQL Server from scaling their performance linearly with the number of processor cores. Thus, even if you added more cores and memory, the benchmarks wouldn’t show much increase in performance. One of the main causes of this lack of performance was contention around the dispatcher database lock. The dispatcher database lock protected access to those objects that needed to be dispatched, i.e. scheduled. Examples of objects that were protected by this lock included threads, timers, I/O completion ports, and other waitable kernel objects (events, semaphores, mutants, etc.). Thus, in Windows 7, due to the impetus provided by the support for more than 64 processors, work was done to eliminate the dispatcher database lock and replace it with fine grained locks such as per object locks. This allowed benchmarks such as the SQL TPC-C to show a 290% improvement when compared to Windows 7 with a dispatcher database lock on certain machine configurations. This was one of the biggest performance boosts seen in Windows history due to a single feature.

 

Windows 10 brought us another innovation in the scheduler space with CPU Sets. CPU Sets allow a process to partition the system so that it can take over a group of processors and keep any other process or the system from running threads on those processors. The Windows kernel even steers interrupts from devices away from the processors that are part of your CPU set. This ensures that even devices cannot target their code at the processors which have been partitioned off by CPU Sets for your app or process. Think of this as a low-tech virtual machine. As you can imagine, this is a powerful capability, and hence there are a lot of safeguards built in to prevent an app developer from making the wrong choice within the API. CPU Sets functionality is used when customers use Game Mode to run their games.

 

Finally, this brings us to ARM64 support with Windows 10 on ARM. The ARM architecture supports big.LITTLE, a heterogeneous architecture where the “big” core runs fast and consumes more power, and the “LITTLE” core runs slow and consumes less power. The idea here is that you run unimportant tasks on the little core, saving battery. To support the big.LITTLE architecture and provide great battery life on Windows 10 on ARM, the Windows scheduler added support for heterogeneous scheduling, which takes the app intent into account when scheduling on big.LITTLE architectures.

 

By app intent, I mean Windows tries to provide a quality of service for apps by tracking threads which are running in the foreground (or starved of CPU) and ensuring those threads always run on the big core, whereas the background tasks, services, and other ancillary threads in the system run on the little cores. (As an aside, you can also programmatically mark your thread as unimportant, which will make it run on the LITTLE core.)

 

Work on Behalf: In Windows, a lot of work for the foreground is done by other services running in the background. For example, in Outlook, when you search for a mail, the search is conducted by a background service (Indexer). If we simply ran all the services on the little core, the experience and performance of the foreground app would be affected. To ensure that these scenarios are not slow on big.LITTLE architectures, Windows actually tracks when an app calls into another process to do work on its behalf. When this happens, we donate the foreground priority to the service thread and force the thread in the service to run on the big core.
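The core selection and the work-on-behalf donation described above can be modeled roughly like this. It is only a sketch: the `HeteroThread` type, its fields and the three functions are hypothetical names invented for this example, and the real scheduler tracks far more state (including CPU starvation, which is left out here).

```python
from dataclasses import dataclass


@dataclass
class HeteroThread:
    name: str
    foreground: bool = False          # thread belongs to the foreground app
    donated_foreground: bool = False  # currently working on behalf of foreground


def core_for(thread):
    """Foreground (or foreground-serving) threads get a big core."""
    if thread.foreground or thread.donated_foreground:
        return "big"
    return "LITTLE"


def begin_work_on_behalf(caller, service):
    """The caller's foreground importance is donated to the service thread."""
    service.donated_foreground = caller.foreground or caller.donated_foreground


def end_work_on_behalf(service):
    """Once the request completes, the service returns to background status."""
    service.donated_foreground = False
```

In this model an indexer thread normally lands on a LITTLE core, moves to a big core while it serves a search from the foreground mail app, and drops back once the request is done.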

 

That concludes our first (huge?) One Windows Kernel post, giving you an overview of the Windows Kernel Scheduler. We will have more similarly technical posts about the internals of the Windows Kernel. 

 

Hari Pulapaka

(Windows Kernel Team)

43 Comments

Great insight into the Windows Kernel. Thanks for sharing. 

Copper Contributor
Great info into Windows Kernel. Can't wait for next post!
Thanks for the article! Can you please, make bigger images in the next article?
Copper Contributor

Hari, thanks for stepping out there in the ether, man.

 

As a former - albeit short-lived - member of that team, it pains me to see this great code base not "reach its full potential", to use a MSFT cliche. ;)

 

Linux rules and Windows is dying, so please please don't call it "Windows Kernel". Let's please call it "NT Kernel", like "God" named it in the first place. Windows was just an app "environment subsystem" hoisted on top of the NT Kernel. It will die out like the OS/2 one. I hear that the POSIX one is back in full force. Maybe that's the future ...

 

All the best. Keep up the fight!

~Vasile

Microsoft
Hey Vasile, long time. there will be a post soon (next 2 weeks), I suspect you will like that a lot. keep an eye out :).
Copper Contributor

Great info! So I have a question from this: the sum of "lines of code" in the post is not a big number, so why does Microsoft need this LargeFileHandler extension for git? In which direction should I look?

Copper Contributor

Great post, hope many more will follow. Is 'windows refactoring' the official name for what was referred to as MinWin?  I remember the Mark Russinovich and Arun Kishan Channel 9 interviews on MinWin and the dispatcher lock optimization.

 

BTW dependency walker does *not* support ApiSetSchema, and shows api sets as resolution errors - thereby confusing the hell out of a lot of people. Any chance of msft maintaining it?

Copper Contributor

Yet we still have to use third party applications for long-name files and locations. 

Copper Contributor

I like reading things like this. Looking forward for the next article.

Brass Contributor

Thank you so much for providing information with the context of "which version" and the related goals. As a 3 month owner of a Snapdragon Lenovo Miix 630, having ripped 8 DVDs with a Win32 program & getting 22 hours a charge, thank you again for including comments regarding ARM64. All the years I thought Windows had an issue with 100% Utilization and it turns out that it was all the Intel CPU's fault. WOS is frankly amazing. Snapdragon Chip for Video and Sound humiliates everything that has ever been put into a PC. #SoOverIntel --and the lame Tech Media will never tell anyone that the system fully recharges in 2 hours from 0%. ......Keep up the wonderful work folks.....and I will sign a NDA if U need more input/feedback.

Copper Contributor

 Amazing, thanks for the interesting read! I was hoping (and looking for) such articles about the Windows Kernel for a long time, but nothing was to be found with recent and confirmed information. It´s great to hear you guys are starting a series about it! Hungry for the next article already :) Thank you again!

Copper Contributor

Kernels and low-level software happen to intrigue me the most. Thanks for sharing! 

Copper Contributor

@ Vasile Paraschiv

 

Hello. Your assessment of Windows future seems all too harsh to me. Windows Server is here to stay for a long time despite Linux gains (and I am writing as fan and user of Linux with Windows being my primary platform). Windows client seems irreplaceable right now. And for the nomenclature “NT Kernel” vs. Windows Kernel, that is a moot point given the fact that this particular kernel is presently used in various Windows installations all over the world and practically nowhere else. Now as for a new POSIX subsystem being in works, I would certainly be interested to hear more about that.

Copper Contributor

Great post, thanks!

Keep it coming.

 

Copper Contributor

Thanks for sharing the post. It was a question for me before about Windows Kernel and its upcoming approach to be modernized and unified. 

And yes from this post I got answer for my question.

Copper Contributor

Great read!  Thanks for the post!

Brass Contributor

Nice write-up!

Copper Contributor

Interesting read. Thanks for the article!

Copper Contributor

 This is fantastic that Microsoft has not ditched efforts to promote Windows kernel to developers but the best way to do this is providing access to source code. Apple does this with the Darwin/XNU kernel and it doesn't hurt their business model. The source code is the best way to promote Windows kernel to university students and those who make some research on OS internals. Windows kernel is the last closed source major OS kernel and I blame this for the Windows core/Windows IoT failure . Instead of providing a platform builder (like it was with Windows CE) where a kernel and drivers can be tuned and rebuilt for a custom hardware Microsoft ventured to provide a binary closed source system for a tiny subset of proprietary hardware and lost the momentum so developers abandoned the platform.

Copper Contributor
Now that processors have many cores, would it be possible to have an API to treat some cores as "realtime" processors, by putting a single thread on that core that runs exclusively until it finishes? At the moment I'm having to use external hardware (Arduino, FPGA etc.) to do this sort of task. The CPU sets feature sounds hopeful but does it guarantee good timing of the process I've put on that core? (e.g. if <1ms response is considered real-time)
Copper Contributor

 @Stephen Brooks a core still shares hardware resources such as memory bus and an interconnection network so it does not have a realtime behaviour in a true sense if a bandwidth is not reserved on each shared bus. Without resources reservation a core is not immune from unpredictable pipeline stalls on memory access to DRAM or device mapped memory/registers .

Copper Contributor

Good article, look forward next.

Copper Contributor

Fascinating article, I hope this will be the start of something that will enable those who have a technical interest in the "under the hood" aspects of Windows to sate their curiosity.

Copper Contributor

nice post, full of details.

 

If I can move a minor critic, in some places I found expressions and grammars different from usual technical US English.

Consider for example the phrase: 

"As a thread runs and experiences quantum end (minimum amount of time a thread gets to run), its dynamic priority decays, so that a high priority CPU bound thread doesn’t run forever starving everyone else."

Copper Contributor

It's good to know the evolution of the Windows Kernel Scheduler over the years.

Copper Contributor

Thanks for the sharing and look forward to more sharings.

Copper Contributor

I have a question about the CPU Sets API, it doesn't seem from the available calls that it is possible to isolate a core from all other threads/processes.  The SetProcessDefaultCpuSets() only restricts what cores threads can be scheduled on for the given process.  I assume that it would be necessary to iterate all the running processes in order to set their CpuSet to exclude the core you want to isolate.  That doesn't really seem feasible especially as you'd have to handle processes that start afterwards.

 

Is there something I'm missing or is the functionality mentioned above not yet available as a public API?

Microsoft
@mendelmonteiro, yes good observation. as you can imagine having the ability to cordon off cores is quite powerful, we have not exposed a way for non-windows code to use that portion of the feature.
Copper Contributor

@Hari_Pulapaka it would be great if this could be exposed in a future version of the Windows API. The same functionality exists on Linux (isolcpus) and it doesn’t seem to be too controversial.

Copper Contributor

Great article..

Copper Contributor

Just like CPU sets, it would be nice if it was possible to “Create Process but stay away from Cores used for RSS”.

And why that?

We are running online trading and are sensitive to latency.

Where we often are hit is under heavy network traffic, especially with Multicast.

To keep up with our performance needs today we need at least 8 Cores set aside (with Meltdown Patch 16 cores) for RSS and afterwards DPC execution.

Under heavy load we are hit by this “Thread Stealing”, so instead that our developers are forced to know what affinity mask to set to avoid cores used for this type of IO , it would be really nice to have this possibility of dynamically starting a Process with attribute “Stay away from Cores used for IO”

Copper Contributor

Hallo Windows Kernel Team !

great job at time,

miss more comments in Headerfiles by SDK's, by same Functions.

Good Info about Windows Kernel !, take more for developers !.

Take more about Windows source code construction,

C++ unmanaged Code or C++ .NET and C# managed Code or What ?

by interest !.

Give in Future a Part of Windows as sourcecode ?

That would be really good for learning for Developer, make more better and secure App's!

best regard to all the Windows Team !

Christian 'TIPPO' Kurs - Visual C#, C++ Developer and .NET Nerd !

Copper Contributor

Thanks for the article; I suppose it explains what are the changes that made the Windows 10 scheduler practically broken for heavily multithreaded applications.

Copper Contributor

Re. "Note that EX is still kernel mode, so it's not a true microkernel":

 

In the books that I wrote about Windows I have called it a "hybrid-kernel".

 

Copper Contributor

First screen shows 100% CPU load and 0% memory load, how is that possible?

Copper Contributor

@Hari_Pulapaka So the only way to isolate cores for an application is for games?

We need the functionality for VMs; is there no way to do this? I know we could reserve cores at the hypervisor level, but can't do that with the "root" scheduler mode. And the "root" scheduler mode is the only supported one for client machines :(

Copper Contributor

While scheduling is a primary concern for performance, the end result is that all of the aforementioned considerations, and navigation of resource details creates microseconds of latency which further reduce the performance of the system overall.  Scheduling should be a 2-5 instruction operation around changing some pointers and restoring register contents.  It's great to see MS working on improving the kernel, but the NT kernel seems to be a bigger thorn than a pretty flower that users see the effects thereof.

 

It would be great to see how the app environment can be hoisted out of the kernel and into the user space where complex needs of system interaction are things that complex apps lose quanta over, not stuff that simple apps suffer through and thus tend to have very intermittent scheduling throughput due to how many things are being considered, in the kernel, around who runs next.

That taskmanager with 869 cores though...just Amazing!

Copper Contributor

Awesome. Great read.

Copper Contributor

Great stuff. Thank you.

Copper Contributor

Hi,

 

Thank you for your post.

If we compare Win10 Pro with Win 10 IoT.

1) Is it the same kernel?

2) If yes, is there such thing like "kernel configuration" which may differ?

3) Another question (it can be slightly out-of-scope). If I design system that is on one hand dedicated-concrete-tasks. On other, on host with huge resources... From kernel perspective, (performance, thread scheduling, soft-real-time e.t.c) how to choose between Win10 Pro with Win 10 IoT.?

 

Regards,

Copper Contributor

The largest problem is that windows is licensed based on CPU size and core count.  Microsoft believes that they should be able to make money off of the size of your environment, because whatever gain you get from having more CPU power, is only enabled by them.  This is one of the more primary reasons why people are choosing linux or macos as well as freeBSD instead of windows.  There is no direct relationship to the compute size of a deployed system to the amount of "money" that may be made.  But, alas, microsoft only operates in the money making mode.  This is why consumer versions of windows have always suffered from security risks and very poor performance under load.  The complexity of the scheduling details mentioned here are part of the problem.  Instead of fixing problems and providing APIs and systems that keep this stuff from happening, they've engineered solutions around something that one person can do.  There is no end to end view of what windows and the kernel might be such that the OS can completely function with the least amount of visible presence in the operation of the software systems.

Copper Contributor

Because capacity describes complexity, and complexity is inefficient.

What capacity does the Windows NT kernel presently use?

Last update: Dec 12 2022 11:06 AM