What caused the sudden DTU/CPU spikes to the limit
Published Aug 19 2019 09:07 AM 14.6K Views
Microsoft

It is commonly seen questions from our customers that why are their DTU/CPU usage spikes to the limit.

 

Hope following workflow can help you find the possible cause and solution

 

  1. Understand the usage trend

 

Check which resource is in question here. Check azure portal or TSQL command Select * FROM sys.dm_db_resource_stats. In this article, we will talk about CPU spike issue.

 

Questions I may have to get started: 

When did the spike start to happen?

Is it really suddenly spike or it has been using 70-90% usage over the time?

 

If the DTU resource has been using above 70% and recently reached to 100%, you may want to consider to scale the database up.

 

Has the database performance tier recently been lowered?

 

2. Find out the top CPU queries

This can be found in Query Performance Insight.

Or it can be seen in Query Store in SSMS. For query store, more info can be found here https://docs.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-usin...

 

For ongoing issue, we can also run following DMV against customer's database:

print '--top 10 Active CPU Consuming Queries by sessions--'

SELECT top 10 req.session_id, req.start_time, cpu_time 'cpu_time_ms', object_name(st.objectid,st.dbid) 'ObjectName' ,

                                substring (REPLACE (REPLACE (SUBSTRING(ST.text, (req.statement_start_offset/2) + 1, 

                                ((CASE statement_end_offset  

                                                WHEN -1 THEN DATALENGTH(ST.text) 

                                                ELSE req.statement_end_offset END  

                                                                - req.statement_start_offset)/2) + 1), CHAR(10), ' '), CHAR(13), ' '), 1, 512)  AS statement_text 

                                 FROM sys.dm_exec_requests AS req 

                                 CROSS APPLY sys.dm_exec_sql_text(req.sql_handle) as ST

                                order by cpu_time desc

Go

 

If the high CPU issue occurred in the past, you can run following query;

-- Top 15 CPU consuming queries by query hash

-- note that a query hash can have many query id if not parameterized or not parameterized properly

-- it grabs a sample query text by min

WITH AggregatedCPU AS (SELECT q.query_hash, SUM(count_executions * avg_cpu_time / 1000.0) AS total_cpu_millisec, SUM(count_executions * avg_cpu_time / 1000.0)/ SUM(count_executions) AS avg_cpu_millisec, MAX(rs.max_cpu_time / 1000.00) AS max_cpu_millisec, MAX(max_logical_io_reads) max_logical_reads, COUNT(DISTINCT p.plan_id) AS number_of_distinct_plans, COUNT(DISTINCT p.query_id) AS number_of_distinct_query_ids, SUM(CASE WHEN rs.execution_type_desc='Aborted' THEN count_executions ELSE 0 END) AS Aborted_Execution_Count, SUM(CASE WHEN rs.execution_type_desc='Regular' THEN count_executions ELSE 0 END) AS Regular_Execution_Count, SUM(CASE WHEN rs.execution_type_desc='Exception' THEN count_executions ELSE 0 END) AS Exception_Execution_Count, SUM(count_executions) AS total_executions, MIN(qt.query_sql_text) AS sampled_query_text

                       FROM sys.query_store_query_text AS qt

                           JOIN sys.query_store_query AS q ON qt.query_text_id=q.query_text_id

                           JOIN sys.query_store_plan AS p ON q.query_id=p.query_id

                           JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id=p.plan_id

                           JOIN sys.query_store_runtime_stats_interval AS rsi ON rsi.runtime_stats_interval_id=rs.runtime_stats_interval_id

                       WHERE rs.execution_type_desc IN ('Regular', 'Aborted', 'Exception')AND rsi.start_time>=DATEADD(HOUR, -2, GETUTCDATE())

                       GROUP BY q.query_hash), OrderedCPU AS (SELECT query_hash, total_cpu_millisec, avg_cpu_millisec, max_cpu_millisec, max_logical_reads, number_of_distinct_plans, number_of_distinct_query_ids, total_executions, Aborted_Execution_Count, Regular_Execution_Count, Exception_Execution_Count, sampled_query_text, ROW_NUMBER() OVER (ORDER BY total_cpu_millisec DESC, query_hash ASC) AS RN

                                                             FROM AggregatedCPU)

SELECT OD.query_hash, OD.total_cpu_millisec, OD.avg_cpu_millisec, OD.max_cpu_millisec, OD.max_logical_reads, OD.number_of_distinct_plans, OD.number_of_distinct_query_ids, OD.total_executions, OD.Aborted_Execution_Count, OD.Regular_Execution_Count, OD.Exception_Execution_Count, OD.sampled_query_text, OD.RN

FROM OrderedCPU AS OD

WHERE OD.RN<=15

ORDER BY total_cpu_millisec DESC;

 

 

If the top CPU query is a new query which is not commonly seen, investigate this query and see if there is any room to tune this query. Or maybe reschedule it to less business hours if possible. Consider to check the query plan and add any missing indexes if needed. Please note that index created also causes performance degradation on all INSERT, UPDATE, DELETE, BULK INSERT and BCP operations.

 

Here is our documentation about performance tuning: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-performance-guidance

 

3. Any increase in the workload?

Check the top CPU queries has any execution count increase during the time in question which can be found in Query performance insight.

 

If the workload increase is expected, consider to scale the database to next level or move some cold data to different table and see if it helps.

 

SELECT q.query_id, qt.query_text_id, qt.query_sql_text,  

   SUM(rs.count_executions) AS total_execution_count

FROM sys.query_store_query_text AS qt  

JOIN sys.query_store_query AS q  

   ON qt.query_text_id = q.query_text_id  

JOIN sys.query_store_plan AS p  

   ON q.query_id = p.query_id  

JOIN sys.query_store_runtime_stats AS rs  

   ON p.plan_id = rs.plan_id

GROUP BY q.query_id, qt.query_text_id, qt.query_sql_text

ORDER BY total_execution_count DESC;

 

4. Check if there is any regression on query plan

This can be checked in Query Store in SSMS. Select Regressed Queries to open the Regressed Queries pane or enter the query id in Tracked Queries in SQL Server Management Studio.

 

The following query example returns all queries for which execution time doubled in last 48 hours due to a plan choice change. Query compares all runtime stat intervals side by side.

 

SELECT  

   qt.query_sql_text,  

   q.query_id,  

   qt.query_text_id,  

   rs1.runtime_stats_id AS runtime_stats_id_1,

   rsi1.start_time AS interval_1,  

   p1.plan_id AS plan_1,  

   rs1.avg_duration AS avg_duration_1,  

   rs2.avg_duration AS avg_duration_2,

   p2.plan_id AS plan_2,  

   rsi2.start_time AS interval_2,  

   rs2.runtime_stats_id AS runtime_stats_id_2

FROM sys.query_store_query_text AS qt  

JOIN sys.query_store_query AS q  

   ON qt.query_text_id = q.query_text_id  

JOIN sys.query_store_plan AS p1  

   ON q.query_id = p1.query_id  

JOIN sys.query_store_runtime_stats AS rs1  

   ON p1.plan_id = rs1.plan_id  

JOIN sys.query_store_runtime_stats_interval AS rsi1  

   ON rsi1.runtime_stats_interval_id = rs1.runtime_stats_interval_id  

JOIN sys.query_store_plan AS p2  

   ON q.query_id = p2.query_id  

JOIN sys.query_store_runtime_stats AS rs2  

   ON p2.plan_id = rs2.plan_id  

JOIN sys.query_store_runtime_stats_interval AS rsi2  

   ON rsi2.runtime_stats_interval_id = rs2.runtime_stats_interval_id

WHERE rsi1.start_time > DATEADD(hour, -48, GETUTCDATE())  

   AND rsi2.start_time > rsi1.start_time  

   AND p1.plan_id <> p2.plan_id

   AND rs2.avg_duration > 2*rs1.avg_duration

ORDER BY q.query_id, rsi1.start_time, rsi2.start_time;

 

There are several reasons that can cause the statistics change which eventually cause the plan change, such as underlying data volume change or distribution change, schema change, parameter change and etc. It is very important that update statistics [tablename] with fullscan is run to ensure statistics is up-to-date. This statistics maintenance should be done on regular base depend on how often the underlying data changes.

 

In case you are seeing the average CPU usage spikes with only one plan used, the query may be sensitive to the parameters and we call it parameter sniffing. There are several workarounds talked in https://docs.microsoft.com/en-us/azure/sql-database/sql-database-monitor-tune-overview#ParamSniffing

 

Several options to avoid bad plans:

  1. You can consider to force using the good plans based on the historical performance. However, the problem is we don't know if these plan will be always good for all the parameters as the table grows or changes. 
  2. Azure SQL DB has a feature called automatic tuning where you can also set force plan too. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-automatic-tuning-enable This feature will force the last known good plan. It also uses back off strategy to let the SQL engine picks the plan and then this automatic tuning feature will decide if the engine generate a better plan so it can change the plan to force. The only risk is that the bad plan may be picked during the back off time.
  3. Add with recompile option to the query so that it will force the SQL engine to generate plan in every execution. However, this option will sacrifice CPU in every execution.

 

5. Compilation can also be possibly the problem

High number of compiles of a query can result in high CPU utilization. It can happen when frequent statistics change, or query is not properly parameterized.

 

The following query shows the count of queries by query hash to determine if a query is properly parameterized or not:

 

SELECT TOP 10

q.query_hash

, count (distinct p.query_id ) AS number_of_distinct_query_ids

, min(qt.query_sql_text) AS sampled_query_text

FROM sys.query_store_query_text AS qt

JOIN sys.query_store_query AS q

     ON qt.query_text_id = q.query_text_id

JOIN sys.query_store_plan AS p

     ON q.query_id = p.query_id

JOIN sys.query_store_runtime_stats AS rs

     ON rs.plan_id = p.plan_id

JOIN sys.query_store_runtime_stats_interval AS rsi

     ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id

WHERE

rsi.start_time >= DATEADD(hour, -2, GETUTCDATE())

AND query_parameterization_type_desc IN ('User', 'None')

GROUP BY q.query_hash

ORDER BY count (distinct p.query_id) DESC

 

 

6. Check wait statistics

When CPU usage reaches to 100%, high SOS_Scheduler_Yield wait is expected as most queries may wait for the CPU resource and get slow down. But you can also exam if there is any other wait which may cause any issues, for example blocking related wait.

1 Comment
Version history
Last update:
‎Aug 20 2019 12:25 PM
Updated by: