Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

- Home
- SQL Server
- SQL Server Support Blog
- Sampling can produce less accurate statistics if the data is not evenly distributed

- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

Published
Jan 15 2019 01:09 PM
127
Views

Jan 15 2019
01:09 PM

Jan 15 2019
01:09 PM

Recently I worked with a very knowledgable customer who called in and wanting to know things about statistics. This is because he noticed that his query would get inaccurate cardinality estimate due to ‘inaccurate histogram. Specifically, he has questioned why AVG_RANGE_ROWS would be very high when he did 10% sampling. But it became very low (almost distinct) when he used 100% sampling.

In order to illustrate the issue, let me create a fake table and populate data using this script:

create database dbStats

go

go

alter database dbstats set recovery simple

go

use dbStats

go

create table t(c1 uniqueidentifier)

go

set nocount on

begin tran

declare @i int = 0

declare @id1 uniqueidentifier = newid()

while @i< 8000000

begin

declare @id uniqueidentifier

if @i % 100 = 0

set @id = NEWID()

insert into t values (NEWID())

insert into t values (@id)

if @i < 2000000

insert into t values (@id1)

set @i +=1

if (@i % 100000 = 0)

begin

commit tran

Technorati Tags:
Performance

begin tran

end

end

commit tran

go

create index ix on t(c1)

go

If you update the statistics with 10% sampling, you will get histogram 1 (below). But if you update statistics with 100% sampling, you will get histogram 2. Note that one major difference is that AVG_RANGE_ROWs are much higher in histogram 1 than in histogram 2.

In fact, as you increase sampling rate from 10% to a larger number, the AVG_RANGE_ROWS will gradually decrease. First of all, AVG_RANGE_ROWS basically means for any value within a histogram step, how many duplicates are there. If you have a value of 2 for AVG_RANGE_ROWS, it means for any given value within the histogram step, it will have 2 duplicates. SQL Server optimizer uses this to do cardinality estimate for the values falling within a histogram step.

In order to explain what’s going on, let’s take a look at data first. The data is constructed in a way that is not evenly distributed. The column has 8 million distinct values that only appear once. There is one value that appears 2 million times. Then there are 80,000 values that appear 100 times within the table.

So overall, the data is very selective. Out of 18 million rows, there are more than distinct 8 million values. The customer’s argument is that for any given value that falls within a histogram, SQL really should estimate less than 2 rows. AVG_RANGE_ROWS should be less than 2 rows.

When you do 100% sampling, the AVG_RANGE_ROWS is 1.8 or 2. So it’s accurate. But with 10% sampling, most AVG_RANGE_ROWS is 18 (much higher).

The reason is that the data is not evenly distributed. If the data is truly evenly distributed, 10% sampling and 100% sampling will produce similar results. But if some values appear way more than other values, sampling produce less accurate results. This is because the more frequent values will have higher chance of being selected to compute statistics. At final stage, the values are scaled up to produce the ‘inflated’ statistics. This eventually ends up with a statistics histogram that tells SQL Server that data is less selective than really is.

**
What’s the solution?
**

There are a couple of things. If you can afford fullscan (100%) or increasing sampling, do that. If you can’t, you may have to rely in index hints for some queries.

Histogram 1 (10% sampling)

RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS

------------------------------------ ------------- ------------- -------------------- --------------

ADA7D301-B912-4927-BC84-0000730876EE 0 1 0 1

1D4D7A35-9E7B-46E7-BF87-01B486B2166B 113919.4 1005.627 6442 17.6839

2F22F6DE-A893-4826-A8B8-04E5A31C5819 187018.4 1005.627 11932 15.67406

2602AD30-5365-4DE6-BEFB-07E79CB221F0 193601.4 1005.627 11077 17.47708

ADC703C1-2CD6-4107-9177-09268149C7C6 73038.45 1005.627 4597 15.88914

968BE242-C378-4FDA-9A6F-0B883FCBAB52 169844.2 1005.627 9031 18.80737

87B47694-B775-4414-B638-103D45D87C84 303451.8 1005.627 17569 17.27227

7DE3FC0A-2A74-4C90-B0E3-12177F27B7F2 110587.5 1005.627 6711 16.47944

7B9DB839-7F96-46A6-821D-130BC4345A90 66637.23 1005.627 3668 18.16596

BFC1684B-EB83-4D20-B0DF-150805AC5A8E 109729.3 1005.627 7341 14.9474

Histogram 2 (100% sampling)

RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS

------------------------------------ ------------- ------------- -------------------- --------------

7972667E-CC57-42F4-A414-00000190A98E 0 1 0 1

CFF7C764-41AA-4E89-B802-00AA81D1D512 39129 100 21012 1.862222

473C80D5-A9CC-4186-B4B3-093CCCEAD064 538744 100 270949 1.988359

61E3899D-74CD-4090-9E76-0A031A54EFD6 44973 100 24480 1.837132

E857ABD0-DA6D-4367-92C7-1D98186AF634 1224175 100 617701 1.981825

8863C50F-160D-4DD3-A1C4-1E24D9DE79D3 31413 100 17454 1.799759

3AF1D87E-805E-49B6-9870-301658472C85 1109208 100 565698 1.960778

20BEE350-A654-43F3-BB84-30A978964057 33231 100 17985 1.847706

443AC49E-FE4B-4E32-9FF9-47A667FF1E0F 1451375 100 725309 2.001044

5BAFE28B-6AF7-42D8-BB9E-48C3D1FB3F7B 67121 100 35342 1.899185

FE43EE54-4D31-4CCF-B231-4D6E02B3F852 295111 100 147502 2.000726

8EC2F5D1-50BC-4452-8F9A-4DFD046D3E5B 32713 100 17566 1.862291

C636084E-7EC5-48B6-A273-5451C9EE8269 393640 100 199600 1.972144

75DE7AB9-EEE9-4426-A907-54E4F00BA1C1 33538 100 18292 1.833479

5F5A4E68-3573-4EB3-8ABC-5ACA7ABE7C73 361499 100 185873 1.944871

48BC6F49-D712-4B7B-A91D-5BC822179B2F 65751 100 31497 2.087532

B22C40FE-1F99-47F8-872D-ACFB95246E52 5078911 100 2564212 1.980691

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support

0
Likes

Labels