When we introduced the new CE back in SQL Server 2014, several changes were done regarding the assumptions of the model. These changes were introduced because the vast potential for variations in data distribution, volume and query patterns, made some of the model assumptions not applicable.
The 4 assumptions for the pre-2014 model were:
: Data distributions on different columns are assumed to be independent, unless correlation information is available and usable.
: Distinct values are evenly spaced and that they all have the same frequency. More precisely, within each histogram step, distinct values are evenly spread and each value has same frequency.
: Users query for data that exists. For equi-join of two tables, we factor in the predicates selectivity in each input histogram before joining histograms and come up with the JOIN selectivity.
: For filter predicates where Col = Const, the constant is assumed to actually exist for the associated column. If a corresponding histogram step is non-empty, one of the step’s distinct values is assumed to match the value from the predicate.
A few of those assumptions changed in SQL Server 2014 CE, namely:
: The combination of the different column values are not necessarily independent. This may resemble more real-life data querying.
Simple Containment becomes
: Users might query for data that does not exist, so we use probabilistic approach. For equi-join of two tables, we use the base tables histograms to come up with the JOIN selectivity, and then factor in the predicates selectivity.
For Base Containment, there was a trace flag (2301) to enable it in prior versions – see the article
Tuning options for SQL Server when running in high performance workloads
For correlated columns in the 2014 CE, to derive cardinality we need to sort the filters according to their density, where the smallest density value is first. Since the density is expressed in a range between 0.xxxx and 1 the smaller values means lower density, better selectivity or more different values. We then use only the first four most selective filters to calculate the combined density, using
, expressed like:
This hasn’t changed
in the 2016 CE, but there is a relevant difference. While in the 2014 CE we always used the single column statistics for each of the predicate columns, in 2016 CE, if multi-column statistics over the predicate columns are available, those will be used to derive estimations with a different calculator. Allow me to demonstrate using AdventureWorksDW2014.