Lesson Learned #482: Identifying Potential Duplicate Statistics

Microsoft

Apr 01, 2024

Some time ago, we encountered a support case where a customer experienced significant delays in updating auto-created and user-created statistics. I would like to share the insights gained from this experience, including the underlying causes of the issue and the potential solutions we identified and implemented to address the problem effectively.

The Scenario

Consider the following table:

CREATE TABLE EmployeesStats (
    EmployeeID INT PRIMARY KEY IDENTITY(1,1),
    LastName VARCHAR(255),
    FirstName VARCHAR(255),
    Department VARCHAR(255)
);

We added rows to this table and rebuilt the clustered index:

-- Insert sample data
BEGIN TRANSACTION;
    DECLARE @i int = 0;
    WHILE @i < 1000000
    BEGIN
        INSERT INTO EmployeesStats (LastName, FirstName, Department) 
        VALUES (CONCAT('LastName', @i), CONCAT('FirstName', @i), CASE WHEN @i % 5 = 0 THEN 'Engineering' ELSE 'Marketing' END);
        SET @i = @i + 1;
    END;
COMMIT TRANSACTION;

ALTER INDEX [PK__Employee__7AD04FF149345346] ON EmployeesStats rebuild

To obtain the total number of rows filtered by the Engineering department, we executed the following T-SQL:

select Department, COUNT(*) 
FROM EmployeesStats WHERE Department = 'Engineering'
group by Department

With auto-create statistics enabled in our database, SQL Server created an auto-created statistic using the Department column for this table.

Following this, we found that the performance was not optimal, and upon reviewing the execution plan, we discovered that creating an index on the Department column would improve performance by approximately 82%. Consequently, the index was created:

-- Creating a nonclustered index
CREATE NONCLUSTERED INDEX [Index_Departament]
ON [dbo].[EmployeesStats] ([Department]);

At this point, we observed the following:

Two statistics were present: _WA_Sys_00000004_09C5A64A and Index_Departament.
The execution plan would utilize and update (if needed, with auto-update statistics enabled) the Index_Departament statistics.
The _WA_Sys_00000004_09C5A64A statistic would never be used.
However, when our customer used sp_updatestats or another tool that required updating all auto/user-created statistics, this statistic would be recalculated, which could be time-consuming depending on the number of rows.

To address this, we advised our customer to check for potentially duplicated statistics and to review each one to determine whether it could be deleted. As an example, we provided the following T-SQL to identify all statistics for this table where a column is involved in any statistics, allowing for later review and decision-making on whether to retain or delete them.

It's important to note that manual review is often the best approach to deciding on the maintenance of statistics, as a column may be used in multiple definitions. Additionally, it's crucial to test any changes in a test environment before implementation to avoid potential performance issues:

WITH StatsColumns AS (
    SELECT
        s.object_id,
        s.stats_id,
        s.name AS StatisticName,
        s.auto_created,
        s.user_created,
        c.name AS ColumnName,
        sc.stats_column_id
    FROM 
        sys.stats s
    INNER JOIN 
        sys.stats_columns sc ON s.stats_id = sc.stats_id AND s.object_id = sc.object_id
    INNER JOIN 
        sys.columns c ON sc.column_id = c.column_id AND sc.object_id = c.object_id
),
DuplicatedStats AS (
    SELECT
        sc1.object_id,
        sc1.ColumnName,
        STRING_AGG(sc1.StatisticName, ', ') WITHIN GROUP (ORDER BY sc1.StatisticName) AS StatisticsNames,
        COUNT(*) AS NumberOfDuplicatedStats
    FROM 
        StatsColumns sc1
    GROUP BY
        sc1.object_id,
        sc1.ColumnName
    HAVING 
        COUNT(*) > 1
)
SELECT
    OBJECT_NAME(ds.object_id) AS TableName,
    ds.ColumnName,
    ds.StatisticsNames,
    ds.NumberOfDuplicatedStats
FROM 
    DuplicatedStats ds
WHERE 
    ds.object_id = OBJECT_ID('EmployeesStats')
ORDER BY 
    TableName,
    ds.ColumnName;

Here's an example of the output