First published on MSDN on Sep 20, 2016
We all know (or should!) that SQL Server is NUMA aware. In fact, there are certain background processes that are created and run per NUMA node inside of SQL Server, which is why the soft-numa changes in 2016 can make a big performance gain.
There was a need to affinitize certain cpus for a SQL Server instance which can easily be accomplished through the ALTER SERVER CONFIGURATION DDL.
The first thing I like to do is use coreinfo (a sysinternals tool) to double check numa and cores. In this case the tool showed exactly what I expected to see. Turning my gaze to SQL Server and checking sys.dm_os_schedulers I wasn't expecting the output I saw.
Notice that parent_node_id (numa node) changes from 0 to 1 indicating a different numa node, but the cpu_id changes from 35 to 64! This struck me as very odd, as I expected the cpu_id to not have any gaps in the numbering. The customer I was working with was also intrigued and wondered if this was a problem.
I started to dig into why this would be happening and what would cause it. Since SQL Server runs on Windows it seemed like starting with the Windows documentation was the place to begin. I came upon the explanation of how Windows has support for NUMA and the APIs associated with accessing NUMA information. There are many very interesting pieces of information in the text, however there is a specific piece that interests us:
"On systems with more than 64 logical processors, nodes are assigned to processor groups according to the capacity of the nodes. The capacity of a node is the number of processors that are present when the system starts together with any additional logical processors that can be added while the system is running."
This leads to the question of, "What is a processor group ?" Following the documentation there we have two very interesting pieces of information:
"Systems with fewer than 64 logical processors always have a single group, Group 0."
While this doesn't explain my current situation it is good to note that less than 64 logical processors will always have a single processor group, which is group 0. This explains why on smaller servers with multiple numa nodes the numbering in Windows and SQL Server is sequential with no gaps. But what about our case where we have much more than 64 logical processors?
"Support for systems that have more than 64 logical processors is based on the concept of a processor group , which is a static set of up to 64 logical processors that is treated as a single scheduling entity."
Ok, we're getting closer. We now know that the maximum size of a group is 64 cores.
"When the system starts, the operating system creates processor groups and assigns logical processors to the groups. If the system is capable of hot-adding processors, the operating system allows space in groups for processors that might arrive while the system is running. The operating system minimizes the number of groups in a system."
This is a hot add system, but doesn't fit out ticket. It is good to know that the OS will minimize the number of processor groups, though.
"For better performance, the operating system takes physical locality into account when assigning logical processors to groups. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. A NUMA node is assigned to a single group unless the capacity of the node exceeds the maximum group size."
AH HA! Now everything is making sense and coming together.
In this specific system there are 36 logical cores in each numa node, which is less than the 64 max size of the processor group. However, Windows wants to keep the same numa node in the same processor group and not straddle groups if it doesn't have to. This means, in our situation we should see our logical processors grouped in sizes of 64 and that each group should start on a processor group boundary.
From the initial screen shot, this is true and exactly what is happening. In this case it starts at cpu 0->35 for the first processor group. Since 35+36 > 64 Windows creates a new processor group and starts the next numa there 64->99.
Since, again, SQL Server runs on Windows and uses the same APIs to gather information it sees the same exact information. This is why we may not have contiguous cpu_ids in sys.dm_os_schedulers on systems with a large number of cores and numa nodes!