Some time ago, we encountered a support case where a customer experienced significant delays in updating auto-created and user-created statistics. I would like to share the insights gained from this experience, including the underlying causes of the issue and the potential solutions we identified and implemented to address the problem effectively.
Consider the following table:
CREATE TABLE EmployeesStats (
EmployeeID INT PRIMARY KEY IDENTITY(1,1),
LastName VARCHAR(255),
FirstName VARCHAR(255),
Department VARCHAR(255)
);
We added rows to this table and rebuilt the clustered index:
-- Insert sample data
BEGIN TRANSACTION;
DECLARE @i int = 0;
WHILE @i < 1000000
BEGIN
INSERT INTO EmployeesStats (LastName, FirstName, Department)
VALUES (CONCAT('LastName', @i), CONCAT('FirstName', @i), CASE WHEN @i % 5 = 0 THEN 'Engineering' ELSE 'Marketing' END);
SET @i = @i + 1;
END;
COMMIT TRANSACTION;
ALTER INDEX [PK__Employee__7AD04FF149345346] ON EmployeesStats rebuild
To obtain the total number of rows filtered by the Engineering department, we executed the following T-SQL:
select Department, COUNT(*)
FROM EmployeesStats WHERE Department = 'Engineering'
group by Department
With auto-create statistics enabled in our database, SQL Server created an auto-created statistic using the Department column for this table.
Following this, we found that the performance was not optimal, and upon reviewing the execution plan, we discovered that creating an index on the Department column would improve performance by approximately 82%. Consequently, the index was created:
-- Creating a nonclustered index
CREATE NONCLUSTERED INDEX [Index_Departament]
ON [dbo].[EmployeesStats] ([Department]);
At this point, we observed the following:
To address this, we advised our customer to check for potentially duplicated statistics and to review each one to determine whether it could be deleted. As an example, we provided the following T-SQL to identify all statistics for this table where a column is involved in any statistics, allowing for later review and decision-making on whether to retain or delete them.
It's important to note that manual review is often the best approach to deciding on the maintenance of statistics, as a column may be used in multiple definitions. Additionally, it's crucial to test any changes in a test environment before implementation to avoid potential performance issues:
WITH StatsColumns AS (
SELECT
s.object_id,
s.stats_id,
s.name AS StatisticName,
s.auto_created,
s.user_created,
c.name AS ColumnName,
sc.stats_column_id
FROM
sys.stats s
INNER JOIN
sys.stats_columns sc ON s.stats_id = sc.stats_id AND s.object_id = sc.object_id
INNER JOIN
sys.columns c ON sc.column_id = c.column_id AND sc.object_id = c.object_id
),
DuplicatedStats AS (
SELECT
sc1.object_id,
sc1.ColumnName,
STRING_AGG(sc1.StatisticName, ', ') WITHIN GROUP (ORDER BY sc1.StatisticName) AS StatisticsNames,
COUNT(*) AS NumberOfDuplicatedStats
FROM
StatsColumns sc1
GROUP BY
sc1.object_id,
sc1.ColumnName
HAVING
COUNT(*) > 1
)
SELECT
OBJECT_NAME(ds.object_id) AS TableName,
ds.ColumnName,
ds.StatisticsNames,
ds.NumberOfDuplicatedStats
FROM
DuplicatedStats ds
WHERE
ds.object_id = OBJECT_ID('EmployeesStats')
ORDER BY
TableName,
ds.ColumnName;
Here's an example of the output
Enjoy!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.