The Problem of Cardinality Estimation
There is a single main query processing "engine" that powers SQL Server and Azure SQL Database. Within that engine is the part of a program that optimizes and executes queries. We call the process of optimizing and executing a query “Query Processing” or QP for short. Optimization is a step in which the query is, or can be, rewritten or reorganized to semantically equivalent but faster/more efficient operators. A critical part of that process is determining the approximate sizes of different sub-parts of a query. Think answers to questions like: “How many rows will this join return?” or “How many rows will be filtered out of this table with this clause?” This information is used by the optimizer at large to determine the right operators and right order of operations for best performance, and the process of generating these size-of-sub-result numbers is called cardinality estimation, or CE.
CE is based on two things:
- Statistical information about the underlying data as stored in histograms.
- General rules, approaches, or assumptions on how to combine sub-results and interpret histograms. These rules, taken together, are called the model.
The Model
The CE Model is a nuanced part of the estimation component, as it attempts to create general rules that will apply to specific situations, with limited ability to detect specifics about any one given situation. Take, for example, the rules within the model that determine how we estimate rows coming out of a join. Sometimes, the two tables being joined have an inherent relationship. The example we’ve used most often for this scenario is ‘Sales’ and ‘Returns’. It holds that every entry in the ‘Returns’ table must have a corresponding entry in the ‘Sales’ table. If the model assumes this kind of a relationship between any two tables, it would be effective for queries that match this pattern.
However, consider the situation where two tables are not related in such a direct way – and any relationship between them becomes apparent only after certain filters are applied. Take for example a table tracking the languages people speak and people’s addresses. Only after filtering for people living in, say, Quebec on one side and filtering for people who speak French on the other would you have a kind of implied relationship. (Many people who live in Quebec speak French). There is no inherent relationship of meaning between the two tables - the meaning is only added by the filters. Therefore, a different assumption on the part of CE – that the filters impute meaning into the join relationship – is also a very valid way to approach estimating joins between two tables.
So now we’ve just presented, at a high level, two different model assumptions, and given examples where one model fits but the other does not. The problem in software is that we need a single general rule – we can’t know BEFORE executing the query which rule is the better rule for any given situation.
This is the challenge of the model – finding that perfect single set of rules that will work for all (or at least a majority of) situations.
It can be helpful to think of the model across three basic dimensions:
- Correlation to Independence
- Data uniformity to skewed data
- Semantic relationships between tables being implicit in structure, or implied by filters
Of these three dimensions, the first two exist on a little bit of a continuum, whereas the last is a bit of a toggle – either one model or the other, but not really a sliding scale in between. Across these multiple dimensions, different data sets and different query patterns can be anywhere in the space of what assumptions would work best for each query. Some patterns may be more common than others, but each pattern is valid. So, how do you design a system that will work in a general way?
Before SQL Server 2022, we had to pick a single default model for all queries. There were ways that advanced users could debug a query and possibly add trace flags or hints to a query to force a different model. That process was tedious, time-consuming, and required expert knowledge. We had to select a default model, and that was pretty much what customers got.
Enter Cardinality Estimation (CE) Feedback
CE Feedback was introduced in SQL Server 2022 and is now also available Azure SQL Database.
This feature tailors the model to each query after a few executions – ensuring any changes made do not regress the query.
CE Feedback can tailor the model to an individual query, in a do no harm way. Where before, we had only one model that was used for all queries, we can now modify the model used for each query appropriately.
How it works
- A query is executed repeatedly (using the default model) and captured by the Query Store.
- If the query estimates seem “off” or like they could use improvement, we identify one or more ways in which a different model might address the problematic estimate.
- We try out the new model and evaluate:
- Are estimates better?
- Is performance better?
- Did the query plan change?
- If the answer to all three of these questions is ‘yes’, we persist this new model for this query only. It will be stored as a query store hint.
- We try only one model variant at a time but can continue to identify further model variants to be combined with the ones we have already found, so the process may repeat and refine.
- This all happens without user action or input.
CE Feedback works only for queries that are executed repeatedly, and we must have the query information in Query Store. Each query starts with the default model for estimation. After the requisite number of executions, we can determine if a different model might work better for the query – we make these decisions based on the accuracy of our estimates. If a new model variant could be tried, the CE Feedback framework tries it out on the next execution. If the plan changes and the estimates/runtime are better, we keep the new variant. If the plan does not change, or if the estimates do not improve, we do NOT keep the change. Keeping the change involves adding a hint (using Query Store Hints) to the query which will be persisted and used for that query for all future executions.
It sounds very simple, right? Identify a problem, try a solution, and if the solution works, keep it. All without user input or action. So, if your workload contains a range of queries whose optimal model spans our model dimensions, we can adjust the model for each query, with no action on your part, to be optimal or closer to optimal than the default.
CE Feedback customizes the model used for each query automatically, without user action.
This feature reduces the limitations of a one-model-fits-all approach.
The model and available changes
Each query still starts/begins with the default CE Model for database compatibility level 160, and there are three currently implemented model changes that are available for users. In this section, we'll discuss the available dimensions, explain where the model starts, and explain the variants available.
Independence vs. Containment
Relevant scenario:
The query has multiple filters on a table joined by an ‘AND’ clause. Something like:
WHERE City = ‘Seattle’ AND State = ‘WA’
In this case, we have two predicates:
- p1: City = ‘Seattle’
- p2: State = ‘WA’
They are joined in a conjunctive clause. Each predicate has a selectivity, or the probability that the predicate is true for a given row in the database. In this case, assume the selectiveness of p1 and p2 are s1 and s2. Selectivity values are always less than or equal to 1. So, if 1/3 of the rows in the table match a filter, that filter’s selectivity would be 0.33.
As we work through this example, it’s useful to remember that in probability theory, the probability of two independent events occurring is the product of the probabilities of each event respectively. As an example, for a fair coin toss, we must multiply the probability of flip 1 being heads (0.5) by the probability of flip 2 being heads (0.5) to get the probability of BOTH flips being heads (0.25).
Default Model:
The default model along this spectrum is a middle ground between assuming complete independence and assuming complete dependence between the filters. This is called exponential backoff. To find the selectivity of the conjunction, we order the "selectivities" from most to least selective, and then reduce the impact of successive selectiveness by taking them to the 1/ ½^nth power. That is, if we have 3 predicates in order of most to least selective of s1, s3, s2, the selectivity of p1 AND p2 AND p3 is:
Selectivity of conjunction = s1 * s3^(1/2) * s2^(1/4)
Relevant Hint:
You get this model without any hint as it is the default, but if you wanted to force it, you could use the hint: ASSUME_PARTIAL_CORRELATION_FOR_FILTER_ESTIMATES (See this documentation link for more details)
Variants available:
For this model, we have two available variants. One is complete independence, and one is complete dependence.
Independence
The independence model is the most selective option, and it simply multiplies the "selectivities" of each predicate, a known principle from probability theory. Thus, the selectivity of the conjunction becomes s1 * s2 * s3. Given that all the "selectivities" are numbers less than or equal to one, this number will be smaller than the exponential backoff number computed above. We use this when we are overestimating the number of rows returned by a filter with a conjunctive clause.
Relevant Hint:
ASSUME_FULL_INDEPENDENCE_FOR_FILTER_ESTIMATES
Dependence
The dependence model is the least selective option, as it simply takes the single most selective predicate and uses that as the selectivity of the conjunction. Thus, returning only s1. The case of above seems to fit this model best – everything located in Seattle is also located in Washington state. So, once we have reduced to the Seattle scope, we do not need to further reduce for Washington. We use this model when we are underestimating the rows returned by a conjunction using the default model.
Relevant Hint:
ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES
Uniformity
For the database compatibility level 160 version of CE feedback, we only make one change that has to do with the overall model assumption of uniformity. This has to do with using ‘Row Goal’ during optimization. There are other scenarios and ways in which the uniformity assumption is used that we do not modify at all using CE Feedback currently.
Relevant Scenario and default model:
A user executes a query with a ‘Top N’ or ‘Fast N’ or ‘IN’ construct. The optimizer looks at the number of rows that it thinks will match any underlying filters and chooses to use a ‘Row Goal’ optimization – assuming that we can scan just a small number of rows in the table to fulfill the specified number of rows or relevant clause.
Alternative model:
However, due to the way that we build and interpret histograms, it is possible to incorrectly overestimate the number of rows and assume that a small scan will meet whatever goal is set by the query. When this happens, the query ends up scanning a lot more rows during execution than expected, and likely an index seek (which, instead of paging in all rows, pages in a targeted set of rows matching given predicates) would have been a more appropriate choice.
Relevant Hint:
DISABLE_OPTIMIZER_ROWGOAL
Join Relationships (a.k.a. join containment)
Relevant Scenario and default model:
This scenario is as described above in the section introducing the concept of the model – with ‘Sales’ and ‘Returns’ tables. We have two tables that are joined together, with filters on one or both tables. The current default model assumes that there is an underlying relationship between the tables (think ‘Sales’ and ‘Returns’). So, we estimate the join assuming a containment relationship, and scale down the number of rows from the join based on the filters. We sometimes call this “Base Containment”.
Alternative Model
Since the CE doesn’t know anything about the underlying semantic relationship that may or may not exist in a user query, we have an alternate way of estimating joins if the default model seems like it could be improved. This other option assumes that the filters impute meaning into the join, and so we estimate the filters first, followed by estimating the join based on assuming an underlying relationship between the rows returned from the filters. We sometimes call this “Simple Containment”.
It is not guaranteed that the new model will always be better, but if we think one model creates poor overall estimates, we may try the other option to see if it results in better estimates. It may not, and we will not keep the change if it does not.
Relevant Hint:
ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS
Limitations
As with everything in the query optimizer, there is no magic button that will solve all estimation errors. There may still be queries with out of model constructs that we cannot fix with CE Feedback today. And sometimes, under a new model, even when estimates get better, the plan does not change, or performance doesn’t improve enough to justify keeping the new model. And of course, we have only a handful of model variants that we can try with CE Feedback, so, the ideal model, or assumption, may not exist for a given query.
Conclusion
This was a whirlwind tour of cardinality estimation, the model, and the limitations of the model. We provided an overview of what CE Feedback does to address the limitation of a single fixed model, with explanations of the three model variants that are currently supported by CE feedback.
Drop a comment below if you think of a specific model variation that you’d like to see us add.