A new way to troubleshoot out-of-memory errors in the database engine

Introduction

As part of our efforts to improve database engine supportability, we have added a new dynamic management view (DMV), sys.dm_os_out_of_memory_events. It is now available in Azure SQL Database and Azure SQL Managed Instance and will become available in a future version of SQL Server. If an out-of-memory (OOM) event occurs in the database engine, this view provides details to help you troubleshoot the problem, including the OOM cause, memory consumption by database engine components at the time of the event, potential memory leaks (if any), and other relevant information.
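
If you are not sure whether the instance you are working with already exposes this view, a quick (hedged) check against the system catalog will tell you:

-- Check whether the new DMV is present on this instance; it may not be
-- available yet on SQL Server builds that predate this improvement.
SELECT name, type_desc
FROM sys.system_objects
WHERE name = N'dm_os_out_of_memory_events';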

 

How it works

The MSSQL database engine can produce a detailed memory usage report when encountering an OOM event. However, the report is large and complex, and without expert-level knowledge of database engine internals, most customers cannot easily use it to troubleshoot the problem. With this improvement, we have added a new heuristic algorithm that analyzes the memory report, determines the likely cause of OOM, and outputs a summary to an in-memory ring buffer. Memory for the ring buffer is reserved at engine startup, so that it is available even if the engine is under memory pressure.

 

The sys.dm_os_out_of_memory_events view exposes the contents of this ring buffer, which include:

 

  • The time of the OOM event.
  • The OOM cause, determined heuristically from the contents of the memory report.
  • Memory-related details, such as available physical memory, job object limits, process memory, and memory usage by SOS and non-SOS components.
  • A list of memory clerks with the highest memory consumption.
  • A list of memory clerks that might have leaked memory.
  • A list of resource pools that are out of memory, if any.
  • A list of resource pools with the highest memory consumption.

For full details, including the list of possible OOM causes produced by the heuristic algorithm, see documentation for sys.dm_os_out_of_memory_events.

 

Data in the ring buffer is stored in memory only and is not persisted across database engine restarts. Also, because the size of the ring buffer is finite, older events may be removed as new OOM events occur. To preserve older entries for longer, the view may omit similar OOM events that occur in quick succession. Consider capturing the reported data soon after an OOM event, so that it is at your disposal when troubleshooting the problem.
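
One simple way to capture the data is to copy the view's contents into a user table, either on a schedule or right after an incident. A minimal sketch, assuming a hypothetical table named dbo.oom_event_history:

-- Copy the ring buffer contents into a user table so the data survives an
-- engine restart or ring buffer wraparound. The table name is hypothetical.
IF OBJECT_ID(N'dbo.oom_event_history') IS NULL
    SELECT *
    INTO dbo.oom_event_history
    FROM sys.dm_os_out_of_memory_events;
ELSE
    INSERT INTO dbo.oom_event_history
    SELECT e.*
    FROM sys.dm_os_out_of_memory_events AS e
    WHERE e.event_time > ISNULL((SELECT MAX(event_time) FROM dbo.oom_event_history), '19000101');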

 

Seeing it in action

Let’s cause an OOM event to demonstrate this new capability. To do that, we will run a specially crafted complex query against an otherwise idle Basic database in Azure SQL Database. Using a database engine instance with a small resource allocation makes it easier to produce an OOM event.
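
The exact query is not reproduced in this post; as described later, it consists of thousands of repeated statements combined into a single statement with UNION clauses. A rough sketch of how that kind of query could be generated (the iteration count is arbitrary and may need tuning for your instance):

-- A rough sketch, not the exact query used in this demo: build one very
-- large statement out of thousands of UNION branches, then execute it.
-- Parsing a statement this size requires a large amount of memory.
DECLARE @sql nvarchar(max) = N'SELECT 1 AS c',
        @i   int = 1;

WHILE @i <= 20000
BEGIN
    SET @sql += N' UNION SELECT ' + CAST(@i AS nvarchar(10));
    SET @i += 1;
END;

EXEC sys.sp_executesql @sql;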

 

After running for about 30 seconds, our query fails with an out-of-memory error (error 701):

 

Msg 40197, Level 20, State 123, Line 66109
The service has encountered an error processing your request. Please try again. Error code 701.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command.  The results, if any, should be discarded.

 Let’s query the new DMV:

 

SELECT event_time,
       oom_cause_desc,
       process_memory_usage_mb,
       committed_memory_target_mb,
       committed_memory_mb,
       oom_resource_pools,
       top_memory_clerks,
       top_resource_pools
FROM sys.dm_os_out_of_memory_events;

 

[Image: oom events result.png — query results from sys.dm_os_out_of_memory_events]

We can see that the OOM cause is MEMORY_LOW, which is not particularly surprising, considering that the total process memory for a Basic database is only about 2 GB, as seen in the results.

 

Can we tell more specifically why this query ran out of memory? We can, by looking at the top_memory_clerks column. It provides the list of memory clerks that have consumed the most memory on the instance, along with their memory allocations. This data is provided as a JSON value (shown here formatted for readability):

 

 

[
  {
    "clerk_type_name": "CACHESTORE_PHDR",
    "page_allocated_mb": 950,
    "vm_committed_mb": 0
  },
  {
    "clerk_type_name": "OBJECTSTORE_SNI_PACKET",
    "page_allocated_mb": 297,
    "vm_committed_mb": 0
  },
  {
    "clerk_type_name": "MEMORYCLERK_XE_BUFFER",
    "page_allocated_mb": 0,
    "vm_committed_mb": 175
  }
]

 

 

The top memory clerk is CACHESTORE_PHDR, with close to 1 GB of memory allocated. Documentation for sys.dm_os_memory_clerks explains how this memory is used: “This cache store is used for temporary memory caching during parsing for views, constraints, and defaults algebrizer trees during compilation of a query. Once query is parsed, the memory should be released. Some examples include: many statements in one batch - thousands of inserts or updates into one batch, a T-SQL batch that contains a large dynamically generated query, a large number of values in an IN clause.”

 

The query we executed was indeed abnormally large, consisting of thousands of repeated statements combined into one statement using UNION clauses. This required more memory for parsing than what’s available for query processing in a Basic database, causing the OOM error.
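
If you prefer relational output to raw JSON, the column can be shredded with OPENJSON. A hedged sketch, using the property names visible in the sample above:

-- Shred the top_memory_clerks JSON column into one row per memory clerk,
-- using the property names shown in the sample above.
SELECT e.event_time,
       e.oom_cause_desc,
       c.clerk_type_name,
       c.page_allocated_mb,
       c.vm_committed_mb
FROM sys.dm_os_out_of_memory_events AS e
CROSS APPLY OPENJSON(e.top_memory_clerks)
     WITH (
         clerk_type_name   nvarchar(128) '$.clerk_type_name',
         page_allocated_mb bigint        '$.page_allocated_mb',
         vm_committed_mb   bigint        '$.vm_committed_mb'
     ) AS c
ORDER BY e.event_time DESC, c.page_allocated_mb DESC;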

 

Another column in the result set, also returning JSON values, is top_resource_pools. Here’s the value for this OOM event:

 

[
  {
    "pool_name": "SloSharedPool1",
    "allocations_mb": 1104,
    "pool_target_mb": 1803,
    "is_oom": 0
  }
]

 

 

In this case, the only resource pool listed is SloSharedPool1, which is the pool used for user workloads in Azure SQL Database (see the Resource Governance documentation for details). Seeing this pool, or the default pool in Azure SQL Managed Instance and SQL Server, as the top resource pool is normal and expected, but seeing other resource pools with large memory allocations on this list may warrant a closer investigation.
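
The same OPENJSON approach works for this column as well. A sketch that lists the reported pools for each event and surfaces the is_oom flag, again based on the property names in the sample above:

-- List the resource pools reported for each OOM event and flag any pool
-- that was itself out of memory (is_oom = 1).
SELECT e.event_time,
       p.pool_name,
       p.allocations_mb,
       p.pool_target_mb,
       p.is_oom
FROM sys.dm_os_out_of_memory_events AS e
CROSS APPLY OPENJSON(e.top_resource_pools)
     WITH (
         pool_name       nvarchar(128) '$.pool_name',
         allocations_mb  bigint        '$.allocations_mb',
         pool_target_mb  bigint        '$.pool_target_mb',
         is_oom          bit           '$.is_oom'
     ) AS p
ORDER BY e.event_time DESC, p.allocations_mb DESC;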

 

Not just a DMV

Having read this far, an experienced DBA may be wondering: “But what if the OOM condition is so impactful that it causes the database engine to terminate?” The OOM data in the ring buffer is not persisted to disk, so it would be lost. How could we troubleshoot the OOM error in that scenario?

 

We can do that by using a new extended event introduced together with the DMV, named summarized_oom_snapshot. Here is how the event looks in the SSMS extended events UI:

[Image: summarized_oom_snapshot_event.png — the summarized_oom_snapshot event in the SSMS extended events UI]

This event provides the same data as the DMV, but it lets you persist the data to a file target on disk or in Azure blob storage to avoid losing it in case of an engine crash. Using extended events also lets you track OOM events beyond the capacity of the ring buffer. Note though that depending on the type of OOM, this event may fire only in a server-scoped extended event session. Azure SQL Database supports only database-scoped event sessions, so you may or may not see this event there. However, you can use this event in a server-scoped event session in Azure SQL Managed Instance, or in SQL Server once this improvement becomes available there.
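
For example, a server-scoped session that writes the event to a file target could look roughly like the following. This is a sketch under assumptions: the sqlserver package name, the session name, and the storage URL placeholder are not confirmed by this post, and on Azure SQL Managed Instance the event_file target requires a credential with a SAS token for the blob container (not shown here).

-- A hedged sketch of a server-scoped event session capturing
-- summarized_oom_snapshot to an event_file target. The storage URL is a
-- placeholder; on Azure SQL Managed Instance a credential for the container
-- must exist before creating the session.
CREATE EVENT SESSION oom_snapshots ON SERVER
ADD EVENT sqlserver.summarized_oom_snapshot
ADD TARGET package0.event_file
    (SET filename = N'https://<storageaccount>.blob.core.windows.net/xe/oom_snapshots.xel')
WITH (STARTUP_STATE = ON);

ALTER EVENT SESSION oom_snapshots ON SERVER STATE = START;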

 

Conclusion

With this improvement, DBAs and database engineers working with the MSSQL database engine gain the ability to troubleshoot out-of-memory errors more easily and without having in-depth knowledge of database engine internals. The sys.dm_os_out_of_memory_events DMV and summarized_oom_snapshot extended event report the cause of the OOM and the relevant details of memory consumption at the time of the error. This lets you find and address the root cause without a lengthy investigation, and in many cases avoids a support case.
