A common question from our customers is why their DTU/CPU usage spikes to the limit.
Hopefully the following workflow can help you find the possible cause and a solution.
1. Understand the usage trend
First check which resource is in question: look at the Azure portal, or run the T-SQL command SELECT * FROM sys.dm_db_resource_stats. In this article, we will talk about CPU spike issues.
Questions to ask to get started:
- When did the spike start?
- Is it really a sudden spike, or has usage been in the 70-90% range over time? If DTU usage has been above 70% and recently reached 100%, you may want to consider scaling the database up.
- Has the database performance tier recently been lowered?
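As an example, the recent trend can be summarized from sys.dm_db_resource_stats, which keeps roughly one hour of 15-second samples; the second statement shows the current service objective, which helps answer the tier question above:

```sql
-- Summarize resource usage over the retained window (~1 hour of 15-second samples)
SELECT
    MIN(end_time)              AS from_time,
    MAX(end_time)              AS to_time,
    AVG(avg_cpu_percent)       AS avg_cpu_percent,
    MAX(avg_cpu_percent)       AS max_cpu_percent,
    AVG(avg_data_io_percent)   AS avg_data_io_percent,
    AVG(avg_log_write_percent) AS avg_log_write_percent
FROM sys.dm_db_resource_stats;

-- Current performance tier of this database (e.g. S3, P1)
SELECT DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') AS service_objective;
```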
2. Find out the top CPU queries
This can be found in Query Performance Insight in the Azure portal, or in Query Store in SSMS. For more information about Query Store, see https://docs.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
For an ongoing issue, we can also run the following DMV query against the customer's database:
print '--Top 10 active CPU-consuming queries by session--'
SELECT TOP 10
    req.session_id,
    req.start_time,
    req.cpu_time AS cpu_time_ms,
    OBJECT_NAME(st.objectid, st.dbid) AS ObjectName,
    SUBSTRING(REPLACE(REPLACE(SUBSTRING(st.text, (req.statement_start_offset / 2) + 1,
        ((CASE req.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE req.statement_end_offset
          END - req.statement_start_offset) / 2) + 1), CHAR(10), ' '), CHAR(13), ' '), 1, 512) AS statement_text
FROM sys.dm_exec_requests AS req
CROSS APPLY sys.dm_exec_sql_text(req.sql_handle) AS st
ORDER BY req.cpu_time DESC;
GO
If the high-CPU issue occurred in the past, you can run the following Query Store query:
-- Top 15 CPU consuming queries by query hash
-- note that a query hash can have many query id if not parameterized or not parameterized properly
-- it grabs a sample query text by min
WITH AggregatedCPU AS (
    SELECT
        q.query_hash,
        SUM(count_executions * avg_cpu_time / 1000.0) AS total_cpu_millisec,
        SUM(count_executions * avg_cpu_time / 1000.0) / SUM(count_executions) AS avg_cpu_millisec,
        MAX(rs.max_cpu_time / 1000.00) AS max_cpu_millisec,
        MAX(max_logical_io_reads) AS max_logical_reads,
        COUNT(DISTINCT p.plan_id) AS number_of_distinct_plans,
        COUNT(DISTINCT p.query_id) AS number_of_distinct_query_ids,
        SUM(CASE WHEN rs.execution_type_desc = 'Aborted' THEN count_executions ELSE 0 END) AS Aborted_Execution_Count,
        SUM(CASE WHEN rs.execution_type_desc = 'Regular' THEN count_executions ELSE 0 END) AS Regular_Execution_Count,
        SUM(CASE WHEN rs.execution_type_desc = 'Exception' THEN count_executions ELSE 0 END) AS Exception_Execution_Count,
        SUM(count_executions) AS total_executions,
        MIN(qt.query_sql_text) AS sampled_query_text
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON qt.query_text_id=q.query_text_id
JOIN sys.query_store_plan AS p ON q.query_id=p.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id=p.plan_id
JOIN sys.query_store_runtime_stats_interval AS rsi ON rsi.runtime_stats_interval_id=rs.runtime_stats_interval_id
WHERE rs.execution_type_desc IN ('Regular', 'Aborted', 'Exception') AND rsi.start_time >= DATEADD(HOUR, -2, GETUTCDATE())
GROUP BY q.query_hash
),
OrderedCPU AS (
    SELECT
        query_hash, total_cpu_millisec, avg_cpu_millisec, max_cpu_millisec, max_logical_reads,
        number_of_distinct_plans, number_of_distinct_query_ids, total_executions,
        Aborted_Execution_Count, Regular_Execution_Count, Exception_Execution_Count, sampled_query_text,
        ROW_NUMBER() OVER (ORDER BY total_cpu_millisec DESC, query_hash ASC) AS RN
    FROM AggregatedCPU
)
SELECT OD.query_hash, OD.total_cpu_millisec, OD.avg_cpu_millisec, OD.max_cpu_millisec, OD.max_logical_reads, OD.number_of_distinct_plans, OD.number_of_distinct_query_ids, OD.total_executions, OD.Aborted_Execution_Count, OD.Regular_Execution_Count, OD.Exception_Execution_Count, OD.sampled_query_text, OD.RN
FROM OrderedCPU AS OD
WHERE OD.RN<=15
ORDER BY total_cpu_millisec DESC;
If the top CPU query is a new query that is not commonly seen, investigate it and see whether there is any room to tune it, or reschedule it outside business hours if possible. Consider checking the query plan and adding any missing indexes if needed. Please note that each index you add also degrades performance of all INSERT, UPDATE, DELETE, BULK INSERT, and BCP operations on the table.
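As a starting point for index tuning, the missing-index DMVs record suggestions the engine has accumulated since the last restart. A sketch (treat the output as hints to validate, not as automatic recommendations):

```sql
-- Missing-index suggestions, roughly ranked by estimated impact
SELECT TOP 10
    migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) AS improvement_measure,
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
    ON mig.index_handle = mid.index_handle
ORDER BY improvement_measure DESC;
```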
Here is our documentation about performance tuning: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-performance-guidance
3. Any increase in the workload?
In Query Performance Insight, check whether the top CPU queries show an increase in execution count during the time in question.
If the workload increase is expected, consider scaling the database to the next tier, or moving some cold data to a different table, and see if that helps. The following Query Store query lists queries by total execution count:
SELECT q.query_id, qt.query_text_id, qt.query_sql_text,
SUM(rs.count_executions) AS total_execution_count
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs
ON p.plan_id = rs.plan_id
GROUP BY q.query_id, qt.query_text_id, qt.query_sql_text
ORDER BY total_execution_count DESC;
4. Check if there is any regression on query plan
This can be checked in Query Store in SSMS: select Regressed Queries to open the Regressed Queries pane, or enter the query id under Tracked Queries in SQL Server Management Studio.
The following example returns all queries whose execution time doubled in the last 48 hours because of a plan choice change. The query compares all runtime stats intervals side by side.
SELECT
qt.query_sql_text,
q.query_id,
qt.query_text_id,
rs1.runtime_stats_id AS runtime_stats_id_1,
rsi1.start_time AS interval_1,
p1.plan_id AS plan_1,
rs1.avg_duration AS avg_duration_1,
rs2.avg_duration AS avg_duration_2,
p2.plan_id AS plan_2,
rsi2.start_time AS interval_2,
rs2.runtime_stats_id AS runtime_stats_id_2
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p1
ON q.query_id = p1.query_id
JOIN sys.query_store_runtime_stats AS rs1
ON p1.plan_id = rs1.plan_id
JOIN sys.query_store_runtime_stats_interval AS rsi1
ON rsi1.runtime_stats_interval_id = rs1.runtime_stats_interval_id
JOIN sys.query_store_plan AS p2
ON q.query_id = p2.query_id
JOIN sys.query_store_runtime_stats AS rs2
ON p2.plan_id = rs2.plan_id
JOIN sys.query_store_runtime_stats_interval AS rsi2
ON rsi2.runtime_stats_interval_id = rs2.runtime_stats_interval_id
WHERE rsi1.start_time > DATEADD(hour, -48, GETUTCDATE())
AND rsi2.start_time > rsi1.start_time
AND p1.plan_id <> p2.plan_id
AND rs2.avg_duration > 2*rs1.avg_duration
ORDER BY q.query_id, rsi1.start_time, rsi2.start_time;
Several things can cause a statistics change that eventually causes a plan change, such as a change in the underlying data volume or distribution, a schema change, a parameter change, and so on. It is very important to run UPDATE STATISTICS [tablename] WITH FULLSCAN to ensure statistics are up to date. This statistics maintenance should be done on a regular basis, depending on how often the underlying data changes.
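For example, to check when the statistics on a table were last updated and then refresh them (dbo.SalesOrder is a placeholder table name):

```sql
-- When were the statistics on this table last updated?
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.SalesOrder');

-- Refresh them with a full scan
UPDATE STATISTICS dbo.SalesOrder WITH FULLSCAN;
```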
If you see the average CPU usage spike while only one plan is used, the query may be sensitive to its parameter values; this is called parameter sniffing. Several workarounds are discussed in https://docs.microsoft.com/en-us/azure/sql-database/sql-database-monitor-tune-overview#ParamSniffing
Several options to avoid bad plans:
- You can force a good plan based on historical performance. However, we don't know whether such a plan will stay good for all parameters as the table grows or changes.
- Azure SQL DB has a feature called automatic tuning, where you can also enable plan forcing: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-automatic-tuning-enable This feature forces the last known good plan. It also uses a back-off strategy: it periodically lets the SQL engine pick a plan, and if the engine generates a better one, automatic tuning changes the plan it forces. The only risk is that a bad plan may be picked during the back-off period.
- Add the RECOMPILE option to the query to force the SQL engine to generate a plan on every execution. However, this option costs extra CPU on every execution.
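The three options above can be sketched as follows; the query_id/plan_id values and the sample query are placeholders, so take real ids from Query Store first:

```sql
-- Option 1: force a historically good plan from Query Store (ids are examples)
EXEC sp_query_store_force_plan @query_id = 48, @plan_id = 49;
-- ...and undo it later if needed
EXEC sp_query_store_unforce_plan @query_id = 48, @plan_id = 49;

-- Option 2: let automatic tuning force the last known good plan
ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);

-- Option 3: compile a fresh plan on every execution (extra CPU per execution);
-- table, columns, and @CustomerId are illustrative
SELECT o.OrderId, o.Total
FROM dbo.SalesOrder AS o
WHERE o.CustomerId = @CustomerId
OPTION (RECOMPILE);
```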
5. Compilation can also be part of the problem
A high number of compilations of a query can result in high CPU utilization. This can happen when statistics change frequently, or when a query is not properly parameterized.
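As an illustration of the parameterization point (table and column names are hypothetical): an ad hoc statement with embedded literals produces a new query text, and typically a new compilation, for every distinct value, while a parameterized call through sp_executesql reuses one cached plan:

```sql
-- Ad hoc: each distinct literal is a distinct query text, usually compiled separately
SELECT * FROM dbo.SalesOrder WHERE CustomerId = 42;
SELECT * FROM dbo.SalesOrder WHERE CustomerId = 43;

-- Parameterized: one plan, reused for every CustomerId value
EXEC sp_executesql
    N'SELECT * FROM dbo.SalesOrder WHERE CustomerId = @CustomerId',
    N'@CustomerId int',
    @CustomerId = 42;
```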
The following query shows the count of queries by query hash to determine if a query is properly parameterized or not:
SELECT TOP 10
q.query_hash
, count (distinct p.query_id ) AS number_of_distinct_query_ids
, min(qt.query_sql_text) AS sampled_query_text
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs
ON rs.plan_id = p.plan_id
JOIN sys.query_store_runtime_stats_interval AS rsi
ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE
rsi.start_time >= DATEADD(hour, -2, GETUTCDATE())
AND query_parameterization_type_desc IN ('User', 'None')
GROUP BY q.query_hash
ORDER BY count (distinct p.query_id) DESC
6. Check wait statistics
When CPU usage reaches 100%, high SOS_SCHEDULER_YIELD waits are expected, as most queries wait for the CPU resource and slow down. But you can also examine whether any other wait type is causing issues, for example blocking-related waits.
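For example, the top waits for the database can be read from sys.dm_db_wait_stats, filtering out some common idle/background wait types:

```sql
-- Top waits accumulated for this database, excluding common benign waits
SELECT TOP 10
    wait_type,
    wait_time_ms,
    signal_wait_time_ms,
    waiting_tasks_count
FROM sys.dm_db_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
                        'XE_TIMER_EVENT', 'REQUEST_FOR_DEADLOCK_SEARCH', 'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```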