Lesson Learned #426:Identifying and Mitigating Query Regressions in Azure SQL using Query Data Store

Jose_Manuel_Jurado · ‎Sep 11 2023

SQL performance is often a critical aspect for most business applications. Over time, the execution characteristics of a SQL query can change, leading to unexpected performance issues. This article will guide you on how to identify regressive queries in Azure SQL using the powerful tool, Query Data Store (QDS), and a detailed script to put it into practice.

This article is regarding a question received in support to know how much time in terms of duration a query has increased comparing this month with the year. Of course, that you could adapt the query using other columns and time.

What is a Regressive Query?

A regressive query is one that, over time, performs worse than it did in the past. There are numerous reasons for a query to become regressive:

Data volume growth.
Index changes.
Statistics getting out of date.
Plan changes due to database updates or configurations.

Identifying such queries is the first step towards optimizing them.

The Power of Query Data Store

Azure SQL's Query Data Store (QDS) is a treasure trove of query performance data. It retains a history of query execution plans and runtime statistics, providing deep insights into query performance. Using QDS, we can compare a query's performance over different time frames and detect any regressions.

The Script

We've constructed a script that compares the average duration of queries over the past month with their average duration from the previous year. If a query's recent average duration exceeds 1.5 times its past average, it's flagged as regressive.

Here's a breakdown of the script:

CTE DistinctPlans: This calculates the number of distinct execution plans for each query. Different execution plans for the same query could hint at potential optimization opportunities.
CTE Recent: Gathers performance statistics for the past month, including the latest and earliest execution times, the forced and parallel plan flags, the actual query text, and the number of distinct plans for each query.
CTE Historical: Gathers the average duration statistics for queries from the past year (excluding the last month).
Final Query: Combines the results from the Recent and Historical CTEs to identify which queries have regressed recently, by comparing their past month's average duration to the previous year's.

WITH DistinctPlans AS (
    SELECT 
        p.query_id,
        COUNT(DISTINCT p.plan_id) AS number_of_plans
    FROM sys.query_store_plan p
    GROUP BY p.query_id
),
Recent AS (
    SELECT 
        p.query_id, 
        r.plan_id, 
        AVG(r.avg_duration) AS avg_duration_last_month,
        MAX(r.last_execution_time) AS latest_high_duration_time,
        MIN(r.first_execution_time) AS earliest_high_duration_time,
        p.query_plan,
        p.is_forced_plan,
        p.is_parallel_plan,
        qt.query_sql_text,
        dp.number_of_plans
    FROM sys.query_store_runtime_stats r
    JOIN sys.query_store_plan p ON r.plan_id = p.plan_id
    JOIN sys.query_store_query q ON p.query_id = q.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
    JOIN DistinctPlans dp ON p.query_id = dp.query_id
    WHERE r.last_execution_time > DATEADD(DAY, -30, GETDATE())
    GROUP BY p.query_id, r.plan_id, p.query_plan, p.is_forced_plan, p.is_parallel_plan, qt.query_sql_text, dp.number_of_plans
),
Historical AS (
    SELECT 
        p.query_id, 
        AVG(r.avg_duration) AS avg_duration_last_year
    FROM sys.query_store_runtime_stats r
    JOIN sys.query_store_plan p ON r.plan_id = p.plan_id
    WHERE r.first_execution_time BETWEEN DATEADD(DAY, -360, GETDATE()) AND DATEADD(DAY, -30, GETDATE())
    GROUP BY p.query_id
)
SELECT 
    r.query_id, 
    r.plan_id, 
    r.avg_duration_last_month, 
    h.avg_duration_last_year,
    r.latest_high_duration_time,
    r.earliest_high_duration_time,
    r.query_plan,
    r.is_forced_plan,
    r.is_parallel_plan,
    r.query_sql_text,
    r.number_of_plans
FROM Recent r
JOIN Historical h ON r.query_id = h.query_id
WHERE r.avg_duration_last_month > 1.5 * h.avg_duration_last_year
ORDER BY r.query_id;

Example of data

Analysis of Results

Once the script is executed:

query_id: A unique identifier for each query. If this ID recurs multiple times in the result set, it indicates multiple regressive plans for the same query.
plan_id: The identifier for a specific execution plan.
avg_duration_last_month: Average duration for the query's execution over the past month.
avg_duration_last_year: Average duration for the query's execution over the past year.
latest_high_duration_time & earliest_high_duration_time: The latest and earliest times when the query was executed.
query_plan: The XML representation of the query execution plan.
is_forced_plan & is_parallel_plan: Flags indicating whether the plan is forced or executed in parallel, which can be key indicators of performance characteristics.
query_sql_text: The actual SQL text of the query.
number_of_plans: Indicates the number of distinct plans the query has had. Multiple plans could suggest variations in how the query was optimized and executed over time.

Conclusion

Identifying regressive queries is crucial for maintaining optimal database performance. By leveraging the Query Data Store in Azure SQL, we can pinpoint problematic queries and take targeted action. Whether it's optimizing the query, updating statistics, changing indexes, or other tuning methods, knowing is half the battle. With this script and insights from QDS, you're well-equipped to keep your Azure SQL Database performing at its best.

Remember, while it's essential to detect regressive queries, it's equally vital to test any changes in a non-production environment before implementing them in a live setting.

And additional question is? Why it changed?

Well, Azure SQL's Query Data Store (QDS) provides a wealth of information about query behavior, but there are certain limitations. QDS captures and retains information about query execution and plans, but it doesn't provide a specific "reason" for why a new plan was generated or why a query took longer on a particular occasion. However, with the right analysis, you can infer certain reasons or circumstances leading to these events.

Creation of a new execution plan: There are several reasons why SQL Server might generate a new execution plan for a query:
- Statistics Changes: Statistics get updated, leading SQL Server to reevaluate and generate a new execution plan.
- Schema Changes: Alterations to related tables, such as adding/dropping indexes, can lead to new plans.
- Configuration Changes: Modifying server or database configurations, like compatibility mode, might result in new plans.
- Plan Eviction: Due to memory pressure, older plans might get evicted from the plan cache, and when the query is run again, a new plan might get generated.
Increased query execution time: The reasons can be varied, and some of them include:
- Locking and Waits: If a query is waiting for resources due to locks, its execution time can increase.
- Increased Workload: A higher load on the server might impact the performance of individual queries.
- Data Changes: A significant increase in data volumes or changes in data distribution can impact execution time.
- Execution Plan Changes: As mentioned earlier, a new execution plan might be less efficient than the previous one.

To truly understand the underlying reasons behind these behaviors, you often need to combine QDS information with other monitoring and diagnostic tools and techniques, such as Extended Events, DMVs related to locks and waits, and system performance monitoring.

While QDS is a powerful tool, understanding query behavior in a particular system often requires a holistic approach and in-depth analysis.

Enjoy!

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs