SQL CLR .NET Framework runtime fatal error due to Uneven Number of CPUs Across Processor Groups

Problem

If you're using SQL CLR, you may have encountered an unexpected SQL Server service termination issue on SQL Servers with a large number of CPUs that are unevenly distributed across the processor groups.

Symptoms

SQL Server service terminates unexpectedly.

 

In the ERRORLOG, you'll see the following message:

 

<datetime> Server      * *******************************************************************************

<datetime> Server      *

<datetime> Server      * BEGIN STACK DUMP:

<datetime> Server      *   04/27/20 10:30:16 spid 38076

<datetime> Server      *

<datetime> Server      * A fatal error occurred in .NET Framework runtime.

 

  • In addition, a memory dump will be created.
  • The last message in the ERRORLOG will show something like:

 

<datetime> Server      Error: 6536, Severity: 16, State: 1.

<datetime> Server      A fatal error occurred in the .NET Framework common language runtime. SQL Server is shutting down. If the error recurs after the server is restarted, contact Customer Support Services.

 

Cause

Not all fatal errors in the .NET Framework runtime are caused by the same issue. Just because you observe "A fatal error occurred in .NET Framework runtime" doesn't mean you're hitting the issue described in this article; it's a generic message indicating that the .NET Framework runtime failed.

 

This article addresses an issue in the Threadpool Manager's handling of a large number of threads. If you're able to open the dump file, you'll see a stack like:

 

1a ntdll!RtlDispatchException

1b ntdll!KiUserExceptionDispatch

1c clr!ThreadpoolMgr::GetCPUBusyTime_NT

1d clr!ThreadpoolMgr::GateThreadStart

1e kernel32!BaseThreadInitThunk

1f ntdll!RtlUserThreadStart

 

More information about ThreadpoolMgr::GetCPUBusyTime_NT can be found in Win32ThreadPool.cpp on GitHub.

 

IMPORTANT - The issue can occur when the following conditions are present:

  • The number of processors is unevenly distributed across the processor groups
  • Each processor group has 40 or more processors

 

We've observed this on multiple SQL Server instances running on systems with 160 processors. At the top of the SQL Server ERRORLOG, the NUMA node configuration for the CPUs is logged:

 

{datetime} Server      Node configuration: node 0: CPU mask: 0x00000000000fffff:0 Active CPU mask: 0x00000000000fffff:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

{datetime} Server      Node configuration: node 1: CPU mask: 0x00000000000fffff:1 Active CPU mask: 0x00000000000fffff:1. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

{datetime} Server      Node configuration: node 2: CPU mask: 0x00000000000fffff:2 Active CPU mask: 0x00000000000fffff:2. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

{datetime} Server      Node configuration: node 3: CPU mask: 0x000000fffff00000:1 Active CPU mask: 0x000000fffff00000:1. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

{datetime} Server      Node configuration: node 4: CPU mask: 0x000000fffff00000:0 Active CPU mask: 0x000000fffff00000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

{datetime} Server      Node configuration: node 5: CPU mask: 0x000000fffff00000:2 Active CPU mask: 0x000000fffff00000:2. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

{datetime} Server      Node configuration: node 6: CPU mask: 0x0fffff0000000000:1 Active CPU mask: 0x0fffff0000000000:1. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

{datetime} Server      Node configuration: node 7: CPU mask: 0x0fffff0000000000:0 Active CPU mask: 0x0fffff0000000000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.

 

If we look at just the NUMA node masks and their processor group assignments:

 

Node configuration: node 0: CPU mask: 0x00000000000fffff:0 Active CPU mask: 0x00000000000fffff:0.

Node configuration: node 1: CPU mask: 0x00000000000fffff:1 Active CPU mask: 0x00000000000fffff:1.

Node configuration: node 2: CPU mask: 0x00000000000fffff:2 Active CPU mask: 0x00000000000fffff:2.

Node configuration: node 3: CPU mask: 0x000000fffff00000:1 Active CPU mask: 0x000000fffff00000:1.

Node configuration: node 4: CPU mask: 0x000000fffff00000:0 Active CPU mask: 0x000000fffff00000:0.

Node configuration: node 5: CPU mask: 0x000000fffff00000:2 Active CPU mask: 0x000000fffff00000:2.

Node configuration: node 6: CPU mask: 0x0fffff0000000000:1 Active CPU mask: 0x0fffff0000000000:1.

Node configuration: node 7: CPU mask: 0x0fffff0000000000:0 Active CPU mask: 0x0fffff0000000000:0.

 

Each 'f' in the mask represents 4 CPUs, so each node mask (five f's) covers 20 CPUs, and the number after the colon is the processor group the node belongs to. Therefore, the above translates to the following (a small counting sketch follows the list):

  • Processor Group 0 (nodes 0, 4, 7) has 5*4*3=60 CPUs
  • Processor Group 1 (nodes 1, 3, 6) has 5*4*3=60 CPUs
  • Processor Group 2 (nodes 2, 5) has 5*4*2=40 CPUs
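
To make the arithmetic concrete, here is a minimal C++ counting sketch (added as an illustration, not part of the original article) that popcounts each node's CPU mask and totals the logical CPUs per processor group, using the masks and group numbers from the sample ERRORLOG above:

// A minimal counting sketch (C++17). Each set bit in a node's CPU mask is one
// logical processor; the ":n" suffix in the log is the processor group.
#include <bitset>
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

int main()
{
    // (CPU mask, processor group) pairs for nodes 0-7 from the sample ERRORLOG.
    const std::pair<std::uint64_t, int> nodes[] = {
        {0x00000000000fffffULL, 0}, {0x00000000000fffffULL, 1},
        {0x00000000000fffffULL, 2}, {0x000000fffff00000ULL, 1},
        {0x000000fffff00000ULL, 0}, {0x000000fffff00000ULL, 2},
        {0x0fffff0000000000ULL, 1}, {0x0fffff0000000000ULL, 0},
    };

    std::map<int, int> cpusPerGroup;
    for (const auto& [mask, group] : nodes)
        cpusPerGroup[group] += static_cast<int>(std::bitset<64>(mask).count());

    for (const auto& [group, cpus] : cpusPerGroup)
        std::cout << "Processor Group " << group << " has " << cpus << " CPUs\n";
    // Prints 60 / 60 / 40 for groups 0 / 1 / 2 -- the uneven layout described above.
    return 0;
}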

 

The problem arises when there is a mismatch between the number of CPUs in the SQL CLR Host configuration and the number of CPUs in the SQL Server Engine for a given processor group. The exception and the resulting SQL Server service crash occur when the CLR starts a new thread.

 

If you want to check the processor mapping on your system, you can use Windows Sysinternals tool Coreinfo v3.5.
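
If you'd rather query the layout programmatically, the following is a minimal C++ sketch (an illustration, not part of the original article) that uses the documented Windows APIs GetActiveProcessorGroupCount and GetActiveProcessorCount to print the size of each processor group:

// A minimal sketch, assuming Windows 7 / Windows Server 2008 R2 or later.
#include <windows.h>
#include <iostream>

int main()
{
    const WORD groupCount = GetActiveProcessorGroupCount();
    for (WORD group = 0; group < groupCount; ++group)
    {
        // GetActiveProcessorCount returns the number of active logical
        // processors in the specified processor group.
        std::cout << "Processor Group " << group << " has "
                  << GetActiveProcessorCount(group) << " logical processors\n";
    }
    // Uneven output (for example 60 / 60 / 40) matches the conditions described above.
    return 0;
}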

 

Resolution

This issue is fixed in .NET Framework 4.6 and later; the fix is delivered through the February 2020 quality rollups:

 

Preview of Quality Rollup for .NET Framework 4.8 for Windows 8.1, RT 8.1, and Windows Server 2012 R2 (KB4537482)

https://support.microsoft.com/en-us/help/4537482/kb4537482

 

Preview of Quality Rollup for .NET Framework 4.6, 4.6.1, 4.6.2, 4.7, 4.7.1, 4.7.2 for Windows 8.1, RT 8.1, and Windows Server 2012 R2 (KB4537488)

https://support.microsoft.com/en-us/help/4537488/kb4537488

 

Other versions

https://devblogs.microsoft.com/dotnet/net-framework-february-2020-preview-of-quality-rollup/

 

If the version of .NET Framework on your system is older than 4.6, you will need to upgrade to at least .NET Framework 4.6 to get the fix; we recommend updating to .NET Framework 4.8. If you're unable to upgrade, please see the workarounds below.
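
To check which .NET Framework 4.x build is installed on the server, you can read the documented Release value under the NDP\v4\Full registry key; a value of 393295 or higher indicates .NET Framework 4.6 or later. The following minimal C++ sketch (added as an illustration) performs that check:

// A minimal sketch: read the documented "Release" DWORD for .NET Framework 4.x.
// A value of 393295 or higher indicates .NET Framework 4.6 or later.
#include <windows.h>
#include <iostream>

int main()
{
    DWORD release = 0;
    DWORD size = sizeof(release);
    const LSTATUS status = RegGetValueW(
        HKEY_LOCAL_MACHINE,
        L"SOFTWARE\\Microsoft\\NET Framework Setup\\NDP\\v4\\Full",
        L"Release",
        RRF_RT_REG_DWORD, nullptr, &release, &size);

    if (status != ERROR_SUCCESS)
    {
        std::cout << ".NET Framework 4.5 or later is not installed\n";
        return 1;
    }

    std::cout << "Release = " << release
              << (release >= 393295 ? " (.NET Framework 4.6 or later)"
                                    : " (older than 4.6 - update needed)")
              << "\n";
    return 0;
}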

 

Workaround

Note: the workaround depends on the number of logical processors on the individual machine. The following is an example based on the 160-logical-CPU machine used in the example above.

 

There are two workarounds we have identified for this issue.

 

1st Workaround

The issue doesn't happen after disabling hyper-threading. On the example machine above, disabling hyper-threading leaves 80 logical processors split into two processor groups:

 

Processor Group 0 has 40 Logical Processors

Processor Group 1 has 40 Logical Processors

 

With 40 processors in each group, there is no uneven distribution across processor groups and the issue doesn't recur.

 

2nd Workaround

BCDEdit is a command-line tool for managing Boot Configuration Data (BCD). It allows you to set the maximum size of a processor group. In the scenario described above with 160 processors, you can execute the following command to create an evenly distributed number of CPUs across the processor groups:

 

bcdedit.exe /set groupsize 40

 

You need to reboot the server after running the above command. After the reboot, you'll see:

 

Processor Group 0 has 40 Logical Processors

Processor Group 1 has 40 Logical Processors

Processor Group 2 has 40 Logical Processors

Processor Group 3 has 40 Logical Processors
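
If you later need to return to the default processor group sizing, the groupsize element can be removed with standard BCDEdit syntax followed by another reboot (shown here as an illustration):

bcdedit.exe /deletevalue groupsize

You can verify the resulting layout with Coreinfo or with the processor group query sketch shown earlier.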

 

Nathan Schoenack

Support Escalation Engineer - SQL Server Engine 
