If you're using SQL CLR on a SQL Server with a large number of CPUs that are unevenly distributed across the processor groups, you may encounter an issue where the SQL Server service terminates unexpectedly.
In the ERRORLOG, you'll see the following message:
<datetime> Server * *******************************************************************************
<datetime> Server *
<datetime> Server * BEGIN STACK DUMP:
<datetime> Server * 04/27/20 10:30:16 spid 38076
<datetime> Server *
<datetime> Server * A fatal error occurred in .NET Framework runtime.
<datetime> Server Error: 6536, Severity: 16, State: 1.
<datetime> Server A fatal error occurred in the .NET Framework common language runtime. SQL Server is shutting down. If the error recurs after the server is restarted, contact Customer Support Services.
Not all fatal errors in the .NET Framework runtime are caused by the same issue. Just because you observe "A fatal error occurred in .NET Framework runtime" doesn't mean it's the issue described in this article; this is a generic message indicating that the .NET Framework runtime failed.
This article addresses an issue in the CLR Threadpool Manager when it handles a large number of threads. If you're able to open the dump file, you'll see a stack like:
1a ntdll!RtlDispatchException
1b ntdll!KiUserExceptionDispatch
1c clr!ThreadpoolMgr::GetCPUBusyTime_NT
1d clr!ThreadpoolMgr::GateThreadStart
1e kernel32!BaseThreadInitThunk
1f ntdll!RtlUserThreadStart
More information about ThreadpoolMgr::GetCPUBusyTime_NT can be found at GitHub Win32ThreadPool.cpp.
IMPORTANT - The issue can occur when both of the following conditions are present:
The number of processors is unevenly distributed across the processor groups
Each processor group contains 40 or more processors
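To make the two conditions concrete, here is a hypothetical helper (not part of any Microsoft tooling) that flags whether a given per-group CPU layout matches them. The 40-CPU threshold and the example layouts are taken from this article's 160-processor scenario.

```python
def at_risk(group_cpu_counts):
    """Return True when the layout matches both conditions from the article:
    uneven CPU distribution across processor groups, and 40+ CPUs per group."""
    uneven = len(set(group_cpu_counts)) > 1      # at least two groups differ in size
    large = all(n >= 40 for n in group_cpu_counts)  # every group has 40 or more CPUs
    return uneven and large

print(at_risk([60, 60, 40]))  # uneven 160-CPU layout like the one below
print(at_risk([40, 40]))      # even groups, as after the workarounds
```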
We've observed this issue on multiple instances running on systems with 160 processors. At the top of the SQL Server ERRORLOG, the NUMA node configuration for the CPUs is logged:
{datetime} Server Node configuration: node 0: CPU mask: 0x00000000000fffff:0 Active CPU mask: 0x00000000000fffff:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
{datetime} Server Node configuration: node 1: CPU mask: 0x00000000000fffff:1 Active CPU mask: 0x00000000000fffff:1. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
{datetime} Server Node configuration: node 2: CPU mask: 0x00000000000fffff:2 Active CPU mask: 0x00000000000fffff:2. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
{datetime} Server Node configuration: node 3: CPU mask: 0x000000fffff00000:1 Active CPU mask: 0x000000fffff00000:1. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
{datetime} Server Node configuration: node 4: CPU mask: 0x000000fffff00000:0 Active CPU mask: 0x000000fffff00000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
{datetime} Server Node configuration: node 5: CPU mask: 0x000000fffff00000:2 Active CPU mask: 0x000000fffff00000:2. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
{datetime} Server Node configuration: node 6: CPU mask: 0x0fffff0000000000:1 Active CPU mask: 0x0fffff0000000000:1. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
{datetime} Server Node configuration: node 7: CPU mask: 0x0fffff0000000000:0 Active CPU mask: 0x0fffff0000000000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
If we look at the NUMA node masks arranged by processor group (the digit after the colon in each mask):
Processor group 0:
Node configuration: node 0: CPU mask: 0x00000000000fffff:0 Active CPU mask: 0x00000000000fffff:0.
Node configuration: node 4: CPU mask: 0x000000fffff00000:0 Active CPU mask: 0x000000fffff00000:0.
Node configuration: node 7: CPU mask: 0x0fffff0000000000:0 Active CPU mask: 0x0fffff0000000000:0.
Processor group 1:
Node configuration: node 1: CPU mask: 0x00000000000fffff:1 Active CPU mask: 0x00000000000fffff:1.
Node configuration: node 3: CPU mask: 0x000000fffff00000:1 Active CPU mask: 0x000000fffff00000:1.
Node configuration: node 6: CPU mask: 0x0fffff0000000000:1 Active CPU mask: 0x0fffff0000000000:1.
Processor group 2:
Node configuration: node 2: CPU mask: 0x00000000000fffff:2 Active CPU mask: 0x00000000000fffff:2.
Node configuration: node 5: CPU mask: 0x000000fffff00000:2 Active CPU mask: 0x000000fffff00000:2.
Each 'f' in a mask indicates 4 CPUs, so each node's mask (five f's) covers 20 CPUs. The above therefore translates to:
Processor group 0: nodes 0, 4, and 7 = 60 CPUs
Processor group 1: nodes 1, 3, and 6 = 60 CPUs
Processor group 2: nodes 2 and 5 = 40 CPUs
That's 160 CPUs in total, with 60 CPUs in groups 0 and 1 but only 40 in group 2 - an uneven distribution across processor groups.
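The per-group totals can be checked mechanically. The following Python sketch counts the set bits in each node's CPU mask (one bit per logical CPU) and sums them per processor group; the mask values and group numbers are copied from the ERRORLOG excerpt above.

```python
# node -> (CPU mask from ERRORLOG, processor group number after the colon)
node_masks = {
    0: ("0x00000000000fffff", 0),
    1: ("0x00000000000fffff", 1),
    2: ("0x00000000000fffff", 2),
    3: ("0x000000fffff00000", 1),
    4: ("0x000000fffff00000", 0),
    5: ("0x000000fffff00000", 2),
    6: ("0x0fffff0000000000", 1),
    7: ("0x0fffff0000000000", 0),
}

group_cpus = {}
for node, (mask, group) in node_masks.items():
    cpus = bin(int(mask, 16)).count("1")  # each set bit is one logical CPU
    group_cpus[group] = group_cpus.get(group, 0) + cpus

print(group_cpus)  # processor group -> total logical CPUs
```

Running this reports 60 CPUs in groups 0 and 1 and 40 in group 2, matching the totals above.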
The problem arises when there is a mismatch between the number of CPUs in the SQL CLR host configuration and the number of CPUs the SQL Server engine sees for a given processor group. The exception and the SQL Server service crash occur when the CLR starts a new thread.
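As a loose analogy only (this is not the actual CLR code), the failure mode resembles sizing a per-CPU buffer from one group's CPU count and then indexing it with a larger group's count:

```python
# Hypothetical illustration: buffer sized for 40 CPUs, walked as if there were 60.
clr_group_cpus = 40     # assumed CPU count the CLR host sized its buffer for
engine_group_cpus = 60  # assumed CPU count the engine reports for the group

busy_times = [0] * clr_group_cpus  # per-CPU buffer sized for 40 CPUs

try:
    for cpu in range(engine_group_cpus):  # iterates past the buffer's end
        busy_times[cpu] += 1
except IndexError:
    print("out-of-bounds access at CPU", cpu)
```

In native code such an out-of-bounds access is not a catchable IndexError but a fatal access violation, which is why the whole process goes down.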
If you want to check the processor mapping on your system, you can use the Windows Sysinternals tool Coreinfo (v3.5 or later).
This issue has been fixed in updates for .NET Framework 4.6 and later:
Preview of Quality Rollup for .NET Framework 4.8 for Windows 8.1, RT 8.1, and Windows Server 2012 R2 (KB4537482)
https://support.microsoft.com/en-us/help/4537482/kb4537482
Preview of Quality Rollup for .NET Framework 4.6, 4.6.1, 4.6.2, 4.7, 4.7.1, 4.7.2 for Windows 8.1, RT 8.1, and Windows Server 2012 R2 (KB4537488)
https://support.microsoft.com/en-us/help/4537488/kb4537488
Other versions
https://devblogs.microsoft.com/dotnet/net-framework-february-2020-preview-of-quality-rollup/
If the version of .NET Framework on your system is earlier than 4.6, you will need to upgrade to at least .NET Framework 4.6 to get the fix; we recommend updating to .NET Framework 4.8. If you're unable to upgrade, see the workarounds below.
Note: the workaround details depend on the number of logical processors in the individual machine. The following examples are based on the 160-logical-CPU machine described above.
We've identified two workarounds for this issue.
1st Workaround
The issue doesn't occur after disabling hyper-threading. With hyper-threading disabled, the machine has 80 logical processors split across 2 processor groups:
Processor Group 0 has 40 Logical Processors
Processor Group 1 has 40 Logical Processors
With 40 processors in each group, there is no uneven distribution across processor groups, and the issue doesn't recur.
2nd Workaround
BCDEdit is a command-line tool for managing Boot Configuration Data (BCD). Among other things, it lets you set the size of the processor groups. In the 160-processor scenario described above, you can run the following command to distribute the CPUs evenly across the processor groups:
bcdedit.exe /set groupsize 40
You need to reboot the server after running the above command. After the reboot, you'll see:
Processor Group 0 has 40 Logical Processors
Processor Group 1 has 40 Logical Processors
Processor Group 2 has 40 Logical Processors
Processor Group 3 has 40 Logical Processors
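If you later need to revert to the default processor-group layout (for example, after applying the .NET Framework fix), the setting can be removed with BCDEdit's /deletevalue verb; this sketch assumes an elevated command prompt, and another reboot is required for the change to take effect:

```shell
:: Remove the custom processor group size and return to Windows defaults
bcdedit.exe /deletevalue groupsize
```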
Nathan Schoenack
Support Escalation Engineer - SQL Server Engine