Statistics & Cardinality Estimator Model Variations

This blog post covers some of the fundamentals of statistics in SQL Server, the cardinality estimation models across different versions of SQL Server, how cardinality estimates are derived in different scenarios and variations in estimates under the different cardinality estimation models.

SQL Server Versions 7 – 2012 used the same cardinality estimation model, version 70.  In SQL Server 2014, a redesigned, revamped cardinality estimator model 120 was introduced. Further improvements were made in SQL Server 2016, which uses cardinality estimation model 130. Notice how the CE (Cardinality Estimation) model versions tie in nicely with the database compatibility levels.  Confirming the CE version used for a given query can be accomplished by checking the CardinalityEstimationModelVersion attribute in an execution plan.

The demos/examples in this post use Microsoft’s new sample database WideWorldImporters, and the sample code is provided so the reader can follow along and see the differences in cardinality estimates under the different models for themselves.

Statistics Terminology & Definitions

Statistic objects in SQL Server are light-weight objects containing statistical information about the distribution of values in one or more columns.  Statistics are used to calculate cardinality estimates during optimisation allowing the query optimiser to generate efficient query execution plans.

Density is a measure of the uniqueness of the data in a column(s). It is a ratio that shows just how many unique values there are within a given column, or set of columns. Highly dense columns aren’t very unique. Density is calculated as:
Density = 1 / ( # distinct values for column(s) )
Density = SELECT 1.0 / COUNT( DISTINCT [ColumnName] ) FROM [TableName];

Selectivity is a measure of the uniqueness of a given predicate. It is a number between 0 and 1, representing the fraction of rows in the table that pass the predicate. For example, if a predicate selects 12 rows from a table of 100 rows, the selectivity is (12/100) = 0.12. When a column is referred to as “highly selective” that usually means values aren’t repeating all that often, if at all. Selectivity for a filter predicate can be calculated as:
Selectivity = (# rows that pass the predicate) / (# rows in the table)
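As a quick worked example (a sketch only; the predicate value and the resulting ratio depend on the data in your copy of WideWorldImporters), selectivity can be computed directly with T-SQL:

-- (# rows that pass the predicate) / (# rows in the table)
SELECT CAST(SUM(CASE WHEN OrderDate = '2016-05-31' THEN 1 ELSE 0 END) AS decimal(18,6))
       / COUNT(*) AS [Selectivity]
FROM Purchasing.PurchaseOrders;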

Cardinality is the number of rows returned by a query operator. Each operator in a query plan has an estimated cardinality (the number of rows the optimiser guessed that the operator would return) and an actual cardinality (the number of rows that the operator returned in actuality). You can see both by running a query with “SET STATISTICS PROFILE ON” or by capturing the actual execution plan.
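For example, the following (a minimal sketch; SET STATISTICS PROFILE is deprecated but still handy for this purpose) returns an EstimateRows and a Rows column for every operator so the two can be compared side by side:

SET STATISTICS PROFILE ON;

SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate >= '2016-05-01';

SET STATISTICS PROFILE OFF;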

Data Skew is the uneven distribution of data within a column.

Statistical Model Assumptions

The query optimiser needs to perform well across several workload types such as OLTP, relational data warehouses (OLAP) and hybrid workloads. In order to strike a balance given typical SQL Server customer usage scenarios and the vast potential for variations in data distribution, volume and query patterns, the query optimiser has to make certain assumptions which may or may not reflect the actual state of any given database design and data distribution. The core assumption models are:

  • Independence: Data distributions on different columns are independent unless correlation information is available (think multi-column statistics).
  • Uniformity: Within each statistics object histogram step, distinct values are evenly spread and each value has the same frequency.
  • Inclusion: For filter predicates involving a column-equals-constant expression, the constant is assumed to actually exist for the associated column.
  • Containment: For a join predicate involving an equijoin for two tables, it is assumed that distinct join column values from one side of the join will exist on the other side of the join. In addition, the smaller range of distinct values is assumed to be contained in the larger range. In the presence of additional non-join filter predicates the legacy CE assumes some level of correlation. This implied correlation is called “simple containment”.

A few of those assumptions changed in the new SQL Server 2014/2016 CE, namely:

  • Independence becomes Correlation: In the absence of existing multi-column statistics, the legacy CE views the distribution of data contained across different columns as uncorrelated with one another. This assumption of independence often does not reflect the reality of a typical SQL Server database schema, where implied correlations do actually exist. The new CE uses an increased correlation assumption for multiple predicates and an exponential backoff algorithm to derive cardinality estimates.
  • Simple Join Containment becomes Base Join Containment: Under the legacy CE, the assumption is that non-join predicates are somehow correlated, which is called “Simple Containment”. The new CE assumes these non-join predicates are independent (“Base Containment”), which can translate into a reduced row estimate for the join: at a high level, the new CE computes the join selectivity from the base-table histograms before applying the selectivity of the non-join filter predicates, rather than scaling the histograms down by those filters first.

The Importance of Accurate Cardinality Estimation

Good cardinality estimation (row count expectations at each node of the logical tree) is vital; if these numbers are wrong, all later decisions are affected. Cardinality estimates are a major factor in deciding which physical operator algorithms are used, the overall plan shape (join orders etc) and ultimately they determine the final query plan that executes. Given these critical plan choices, when the cardinality estimation process contains a significantly skewed assumption, this can lead to an inefficient plan choice. This can, in turn, result in degraded performance.

Underestimating rows can lead to:

  • Memory spills to disk, for example, where not enough memory was requested for sort or hash operations.
  • The selection of a serial plan when parallelism would have been more optimal.
  • Inappropriate join strategies.
  • Inefficient index selection and navigation strategies.

Conversely, overestimating rows can lead to:

  • Inflated memory grants.
  • Selection of a parallel plan when a serial plan might be more optimal.
  • Inappropriate join strategy selection.
  • Inefficient index navigation strategies (scan versus seek).

Improving the accuracy of cardinality estimates can improve the quality of the query execution plan and, as a result, improve the performance of the query.

Statistics Objects in SQL Server

Cardinality estimates are row count estimates calculated for each operator within a query execution plan. Estimates are calculated using input from statistics associated with objects referenced in the query. A statistics object has three main areas of information associated with it; the header, density vector and histogram.

  • Header: The header includes information such as the last time statistics were updated and the number of sampled rows.
  • Density Vector: The density vector measures the uniqueness of a column or set of columns, with lower density values indicating a higher uniqueness.
  • Histogram: The histogram represents a column’s data distribution and frequency of occurrence for distinct values. Histograms are limited to 200 contiguous steps, with steps representing noteworthy boundary values.

Using the WideWorldImporters database, create an index on the OrderDate column, then take a look at the statistic object created as a by-product of the index creation:

CREATE INDEX NC_IX_Purchasing_PurchaseOrders_OrderDate
ON Purchasing.PurchaseOrders (OrderDate)

DBCC SHOW_STATISTICS(N'Purchasing.PurchaseOrders', NC_IX_Purchasing_PurchaseOrders_OrderDate)
--WITH STAT_HEADER, DENSITY_VECTOR, HISTOGRAM
GO

[Screenshot: DBCC SHOW_STATISTICS output showing the statistics header, density vector and histogram]

Note that in the statistics header, ‘Rows’ is equal to ‘Rows Sampled’, telling us that this statistic was created/updated using a full scan, which is to be expected since statistics created as a by-product of index creation/re-building will use a full scan.  Also note that the density vector contains a second row for the combination of the OrderDate and PurchaseOrderID columns despite the nonclustered index explicitly being created only on the OrderDate column.  This is because the clustered index key columns are included in all nonclustered indexes.  Finally, note that the histogram only represents data distribution for the first column in the density vector, which is the first named column or left-most column of the index.

Querying Statistics With T-SQL

EXEC [sp_helpindex] 'schema.object';
GO

EXEC [sp_helpstats] 'schema.object', 'all';
GO

SELECT
    sp.last_updated,
    stat.stats_id,
    stat.name AS stats_name,
    STUFF((SELECT ', ' + cols.name
           FROM sys.stats_columns AS statcols
           JOIN sys.columns AS cols
               ON statcols.column_id = cols.column_id
              AND statcols.object_id = cols.object_id
           WHERE statcols.stats_id = stat.stats_id
             AND statcols.object_id = stat.object_id
           ORDER BY statcols.stats_column_id
           FOR XML PATH(''), TYPE
          ).value('.', 'NVARCHAR(MAX)'), 1, 2, '') AS stat_cols,
    sp.modification_counter,
    sp.rows,
    sp.rows_sampled
FROM sys.stats AS stat
CROSS APPLY sys.dm_db_stats_properties (stat.object_id, stat.stats_id) AS sp
JOIN sys.objects AS so
    ON stat.object_id = so.object_id
JOIN sys.schemas AS sc
    ON so.schema_id = sc.schema_id
WHERE sc.name = 'Schema'  -- Schema Name
  AND so.name = 'Table'   -- Table Name
ORDER BY 1 DESC
OPTION (RECOMPILE)
GO

DBCC SHOW_STATISTICS(N'schema.object', statistic_name)
--WITH STAT_HEADER, DENSITY_VECTOR, HISTOGRAM
GO 

The T-SQL above lists the indexes and statistics on an object, shows which columns each statistic covers along with row counts, sample sizes and modification counters, and displays the full detail of an individual statistic via DBCC SHOW_STATISTICS.

Find the Statistics Used to Compile an Execution Plan

Trace Flags: There are a number of undocumented trace flags that will allow you to determine which statistics objects were used by the query optimiser to produce an execution plan.  The below trace flags are useful when troubleshooting cardinality estimate issues:

  • 3604 – Redirects trace output to the client.
  • 9481 – Forces the legacy CE model used by SQL Server 7.0 – 2012.
  • 2312 – Forces the new SQL Server 2014 CE.
  • 2363 – New CE: shows statistical computation information used by the new CE.
  • 9292 – Legacy CE: shows statistics objects which are considered ‘interesting’ by the query optimiser when compiling.
  • 9204 – Legacy CE: shows the ‘interesting’ statistics which end up being fully loaded and used by the query optimiser.

For the upcoming examples in this blog post, the trace flags are used in the below manner:

-- New CE stats report
SELECT column_name FROM table_name
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );

-- Legacy CE stats report
SELECT column_name FROM table_name
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 );

Extended Events: The query_optimizer_estimate_cardinality XEvent fires when the Query Optimiser estimates cardinality on a relational expression.  This is outwith the scope of this blog, but see this post by Pedro Lopes and this post by Joe Sack for more details.
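For reference, a minimal Extended Events session capturing this event might look like the below (a sketch only; the session and target file names are arbitrary, and the event is a debug-channel event available from SQL Server 2014 onwards, so use it sparingly):

CREATE EVENT SESSION [CE_Estimates] ON SERVER
ADD EVENT sqlserver.query_optimizer_estimate_cardinality
ADD TARGET package0.event_file (SET filename = N'CE_Estimates.xel');
GO
ALTER EVENT SESSION [CE_Estimates] ON SERVER STATE = START;
GO
-- Remember to stop and drop the session when finished; the event is verbose.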

Mining The Plan Cache With TF 8666: If you really want to get your hands dirty, you can use the undocumented trace flag 8666 which exposes hidden internal information that is not normally available within an execution plan.  This trace flag exposes the ‘InternalInfo’ element in the execution plan XML which shows the statistic objects that were used by a query.  The below T-SQL, a modified version of the script in Fabiano Amorim’s post, can be used to mine the plan cache for this information:

DBCC TRACEON(8666)
GO
WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/showplan' AS p)
SELECT DB_NAME(qt.dbid) AS [Database],
       cp.objtype AS [CachedPlan_ObjectType],
       cp.usecounts AS [Use_Counts],
       TableName.XMLCol.value('@FieldValue', 'NVARCHAR(500)') AS [TableName],
       StatsUsed.XMLCol.value('@FieldValue', 'NVARCHAR(500)') AS [StatsName],
       qt.text AS [SQL_Statement],
       qp.query_plan AS [QueryPlan]
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) qt
CROSS APPLY qp.query_plan.nodes('//p:Field[@FieldName="wszTable"]') TableName(XMLCol)
CROSS APPLY qp.query_plan.nodes('//p:Field[@FieldName="wszStatName"]') StatsUsed(XMLCol)
WHERE qt.text NOT LIKE '%sys.%'
--AND DB_NAME(qt.dbid) = 'UserDatabase'
GO
DBCC TRACEOFF(8666)
GO

If you were considering dropping auto-created column statistics, this technique could be used to measure which statistic objects are being used and which ones aren’t.  This trace flag is also discussed by Dave Ballantyne in this excellent post.

Local Variables Cardinality Estimates

The optimiser can sniff the values of parameters and constants, but with local variables it cannot. This is because optimisation is done at the batch level, and at the point the batch is compiled the variables have been declared but not yet assigned values. This leads to different row estimations when using variables as opposed to parameters or constants. In the absence of a known value the optimiser can’t use the histogram to estimate the row count; instead it has to resort to the other information in the statistics object to derive cardinality estimates. Depending on the type of predicate and other factors, the estimate will be derived differently:

Equality Predicate (WHERE Column = @Variable)
If the filtered column is unique, ( that is, it has a primary key or unique constraint defined on it, created implicitly through index creation or explicitly otherwise ), the query optimiser knows that the values are unique and there can’t be more than one match, so it estimates one row. If the column is non-unique, the query optimiser uses the density vector information to derive an estimate:
Estimate = Density Vector x Number Of Rows In Table
Estimate = SELECT [AllDensity_DensityVector] * [TableCardinality]

-- (Ctrl + M) Include Actual Execution Plan

-- Local Variable (New CE Equality)
DECLARE @OrderDate date
SET @OrderDate = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = @OrderDate
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- SELECT (2074 * 0.0009398496) = 1.94925

--Local Variable (Legacy CE Equality)
DECLARE @OrderDate date
SET @OrderDate = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = @OrderDate
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 );
-- Legacy Ce Estimate -- SELECT (2074 * 0.0009398496) = 1.94925

[Screenshot: execution plan showing an estimated number of rows of 1.94925 for the local variable query]

The same estimate is derived under both the legacy CE (CE Model 70) and the new CE (CE Model 130) and the same execution plan shape is generated too.  Looking back at the earlier screenshot of the statistics histogram, the last step in the histogram is ‘2016-05-31’ which has an EQ_ROWS value of 2.  Since the query optimiser can’t use the histogram in this scenario, it uses the density vector to calculate an estimate of 1.94925 rows which in this case is very close to the actual 2 rows returned.  By looking at the messages tab, you will see some information on statistics and decisions made by the query optimiser generated via the various trace flags.

Inequality Predicate (WHERE Column >, >=, <, <= @Variable)
Estimate = 30% of Number Of Rows In Table
Estimate = SELECT [TableCardinality] * 0.3

--Local Variable (New CE InEquality)
DECLARE @OrderDate date
SET @OrderDate = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate > @OrderDate
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT (2074 * 0.3) = 622.2

--Local Variable (Legacy CE InEquality)
DECLARE @OrderDate date
SET @OrderDate = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate > @OrderDate
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT (2074 * 0.3) = 622.2

BETWEEN
When using the BETWEEN operator with local variables the query optimiser will make estimations differently depending on the CE model:

  • Legacy CE: A hard-coded guess of 9% of the table’s cardinality is used.
    Estimate = SELECT [TableCardinality] * 0.09
  • New CE: The guess is computed using exponential backoff, applied to the separate >= and <= comparisons that BETWEEN is shorthand for. The guess for >= and <= is 30% (0.3) each, so the overall guess is 0.3 * SQRT(0.3) = 0.164317 (* table cardinality), i.e. 16.4317% of the table’s cardinality.
    Estimate = SELECT [TableCardinality] * 0.164317
--Local Variable (New CE BETWEEN)
DECLARE @OrderDate1 date = '2016-05-28'
DECLARE @OrderDate2 date = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate BETWEEN @OrderDate1 AND @OrderDate2
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT (2074 * 0.164317) = 340.793

--Local Variable (Legacy CE BETWEEN)
DECLARE @OrderDate1 date = '2016-05-28'
DECLARE @OrderDate2 date = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate BETWEEN @OrderDate1 AND @OrderDate2
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT (2074 * 0.09) = 186.66

LIKE
When using the LIKE operator with local variables the query optimiser will make estimations differently depending on the CE model:

  • Legacy CE: Uses a quirky, complex algorithm typical of the legacy CE:
    Estimate = C * (G / M * LOG(D))
    C = Table Cardinality (Number of rows as shown in the statistic header)
    G = The standard 9% guess
    M = A factor 6 (magic number)
    D = Average length rounded down to integer (average data length taken from the statistics density vector). LOG(D) is omitted if D is between 1 and 2. If D is less than 1 (including for missing or NULL statistics) then D = FLOOR(0.5 * maximum column byte length).
    Estimate = SELECT [TableCardinality] * (0.09 / 6 * LOG(D))
  • New CE: The query optimiser will estimate 9% of the table’s cardinality.
    Estimate = SELECT [TableCardinality] * 0.09
-- Local Variable (New CE LIKE)
DECLARE @SupplierRef AS NVARCHAR(25) = N'BC%'
SELECT PurchaseOrderID, OrderDate, SupplierReference
FROM Purchasing.PurchaseOrders
WHERE SupplierReference LIKE @SupplierRef
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT (2074 * 0.09) = 186.66

--Local Variable (Legacy CE LIKE)
DECLARE @SupplierRef AS NVARCHAR(25) = N'BC%'
SELECT PurchaseOrderID, OrderDate, SupplierReference
FROM Purchasing.PurchaseOrders
WHERE SupplierReference LIKE @SupplierRef
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 );
-- Legacy Ce Estimate -- Estimate = SELECT 2074 * (0.09 / 6 * LOG(14)) = 82.1011
-- Note that the optimiser will automatically generate a column statistic on SupplierReference
-- for this query using the legacy CE and not the query above that used the new CE.  This is
-- because the legacy CE uses information from the density vector, the new CE does not.

Selectivity Guesses (Missing Statistics)

In the absence of statistics, or indexes with their associated statistics, or constraints, the optimiser has to resort to heuristics and essentially has to guess the number of estimated rows. The type of guess depends on the form of the predicate, with specific fixed percentages used for different predicate types:

Equality Predicates
In the absence of statistics, when using an equality predicate (= operator) estimates are derived differently under the different CEs:

  • Legacy CE: Uses an estimate of C^0.75 (an exponent of three quarters) where C is the cardinality.
    Estimate = SELECT POWER( [TableCardinality],.75 )
  • New CE: Uses an estimate of C^0.5 (the square root) where C is the cardinality.
    Estimate = SELECT POWER( [TableCardinality],.5 )
ALTER DATABASE WideWorldImporters SET AUTO_CREATE_STATISTICS OFF
GO

SELECT * INTO Purchasing.PurchaseOrders_NoStats
FROM Purchasing.PurchaseOrders

-- Selectivity Guess (New CE Equality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders_NoStats
WHERE OrderDate = '2016-05-31'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate = Estimate = SELECT POWER(2074.00,.5) = 45.54

-- Selectivity Guess (Legacy CE Equality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders_NoStats
WHERE OrderDate = '2016-05-31'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT POWER(2074.00,.75) = 307.33

In the example above, the automatic creation of statistics is disabled and a new table is created to simulate a scenario where the query optimiser has to resort to a selectivity guess.  There will be ‘Columns With No Statistics’ warnings in the execution plan.   For the first query, labelled ‘– Selectivity Guess (New CE Equality)’, the selectivity guess can be seen in the messages tab:
[Screenshot: selectivity guess shown in the trace flag 2363 output on the messages tab]

Inequality Predicates
In the absence of statistics, when using an inequality predicate (>, >=, <, <= operators) the query optimiser will estimate 30% of the table’s cardinality; this is the same across all versions of the CE.
Estimate = SELECT [TableCardinality] * 0.3

-- Selectivity Guess (New CE InEquality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders_NoStats
WHERE OrderDate > '2016-05-31'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT (2074 * 0.3) = 622.2

-- Selectivity Guess (Legacy CE InEquality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders_NoStats
WHERE OrderDate > '2016-05-31'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT (2074 * 0.3) = 622.2

BETWEEN
In the absence of statistics, when using the BETWEEN operator the query optimiser will estimate 9% of the table’s cardinality; this is the same across all versions of the CE.
Estimate = SELECT [TableCardinality] * 0.09

Note that the query optimiser internally evaluates BETWEEN as two inequality operators, i.e. ‘WHERE FilterPredicate BETWEEN Value1 AND Value2’ is evaluated as ‘WHERE FilterPredicate >= Value1 AND FilterPredicate <= Value2’. So two inequality operators explicitly specified together as a conjunction will result in the same estimate of 9%, as the example below shows.
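The same pattern as the other missing-statistics demos can be used to confirm the 9% guess against the Purchasing.PurchaseOrders_NoStats table created earlier (a sketch; the 186.66 figure assumes the 2,074-row table used throughout this post):

-- Selectivity Guess (BETWEEN - same 9% guess under both CEs)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders_NoStats
WHERE OrderDate BETWEEN '2016-05-28' AND '2016-05-31'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- Estimate = SELECT (2074 * 0.09) = 186.66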

LIKE
In the absence of statistics, when using the LIKE operator, estimates are derived differently under the different CEs and depending on the use of the LIKE operator:

  • Legacy CE: Estimates vary depending on the use of the LIKE operator and the number of wildcards. For a single leading or trailing wildcard ( LIKE ‘%A’; LIKE ‘A%’; ) the estimate will be 26.9616% of the table cardinality.
    Estimate = SELECT [TableCardinality] * 0.269616
    For a leading and trailing wildcard ( LIKE ‘%A%’; ) the estimate will be 53.9232% of the table cardinality.
    Estimate = SELECT [TableCardinality] * 0.539232
  • New CE: The query optimiser will estimate 9% of the table’s cardinality regardless of the form of the LIKE pattern and the number of wildcards:
    Estimate = SELECT [TableCardinality] * 0.09

This blog post by Joe Sack covers more scenarios.

-- Selectivity Guess (New CE LIKE)
SELECT PurchaseOrderID, OrderDate, SupplierReference
FROM Purchasing.PurchaseOrders_NoStats
WHERE SupplierReference LIKE N'B%'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT (2074 * 0.09) = 186.66

-- Selectivity Guess (Legacy CE LIKE)
SELECT PurchaseOrderID, OrderDate, SupplierReference
FROM Purchasing.PurchaseOrders_NoStats
WHERE SupplierReference LIKE N'B%'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT (2074 * 0.269616) = 559.183

DROP TABLE Purchasing.PurchaseOrders_NoStats

ALTER DATABASE WideWorldImporters SET AUTO_CREATE_STATISTICS ON
GO

Multiple Predicate Cardinality Estimates

By default, during query optimisation, SQL Server automatically creates single-column statistics on all predicate columns (if not already available). SQL Server will not automatically create multi-column statistics; they can be created manually, but even then the histogram is only built over the first named column.

In the absence of multi-column statistics, the cardinality estimate model is as follows:

Legacy CE: The cardinality estimator uses the independence assumption model, where SQL Server generally assumes that values of different attributes in a table are distributed completely independently of each other.

  • AND Selectivity: Using the independency assumption, three predicates connected by AND (known as a conjunction) with selectivities S1, S2 and S3 result in a combined selectivity of: (S1 * S2 * S3)
    Estimate = Table Cardinality * (S1 * S2 * S3)
    Estimate = SELECT ( SELECT COUNT(*) FROM [TableName] ) * ( S1 * S2 * S3 )
  • OR Selectivity: Two predicates connected by OR (a disjunction) with selectivities S1 and S2 result in a combined selectivity of: (S1 + S2) - (S1 * S2)
    Estimate = Table Cardinality (TotalRows) * ( (S1 + S2) - (S1 * S2) )
    Estimate = SELECT ( SELECT COUNT(*) FROM [TableName] ) * ( (S1 + S2) - (S1 * S2) )

New CE: The new cardinality estimator uses an exponential backoff formula for both conjunctive (AND) and disjunctive (OR) predicates, though the formula used in the disjunctive (OR) case has not yet been documented (officially or otherwise). The new approach to conjunctive predicates is to use exponential backoff: given a table with cardinality C and predicate selectivities S1, S2, S3, S4, where S1 is the most selective and S4 the least:
Estimate = Table Cardinality * S1 * SQRT(S2) * SQRT(SQRT(S3)) * SQRT(SQRT(SQRT(S4)))
Estimate = SELECT ( SELECT COUNT(*) FROM [TableName] ) * S1 * SQRT(S2) * SQRT(SQRT(S3)) * SQRT(SQRT(SQRT(S4)))

-- Manually create single column statistics
CREATE STATISTICS AC_Region ON Application.Countries (Region) WITH FULLSCAN
CREATE STATISTICS AC_SubRegion ON Application.Countries (SubRegion) WITH FULLSCAN

-- Multiple Predicate in absence of multi-column statistics (New CE Conjunction)
SELECT CountryName, Region, Subregion
FROM Application.Countries
WHERE Region = 'Americas'
AND Subregion = 'Caribbean'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- Region Selectivity = SELECT(35.00 / 190) = 0.184210 (S2)
-- SubRegion Selectivity = SELECT(13.00 / 190) = 0.068421 (S1 Most Selective)
-- New Ce Estimate -- Estimate = SELECT ( 190 ) * 0.068421 * SQRT(0.184210) = 5.5795

-- Multiple Predicate in absence of multi-column statistics (Legacy CE Conjunction)
SELECT CountryName, Region, Subregion
FROM Application.Countries
WHERE Region = 'Americas'
AND Subregion = 'Caribbean'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT (190) * (0.068421 * 0.184210) = 2.39474

-- Multiple Predicate in absence of multi-column statistics (New CE Disjunction)
SELECT CountryName, Region, Subregion
FROM Application.Countries
WHERE Region = 'Americas'
OR Subregion = 'Caribbean'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = 40.3966
-- Exponential backoff formula for disjunctive(OR) predicates is undocumented
-- Region Selectivity = SELECT(35.00 / 190) = 0.184210 (S2)
-- SubRegion Selectivity = SELECT(13.00 / 190) = 0.068421 (S1 Most Selective)

-- Multiple Predicate in absence of multi-column statistics (Legacy CE Disjunction)
SELECT CountryName, Region, Subregion
FROM Application.Countries
WHERE Region = 'Americas'
OR Subregion = 'Caribbean'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT (190) * ((0.068421 + 0.184210) - (0.068421 * 0.184210)) = 45.6053

In the presence of multi-column statistics, the cardinality estimates are calculated under the different models as follows:

  • SQL Server 7 – 2012 (Legacy CE): The legacy cardinality estimator will take advantage of multi-column statistics and takes the ‘all density’ value for the combination of columns and multiplies it by the total number of rows in the table.
  • SQL Server 2014: The new CE introduced in SQL Server 2014 RTM does not make use of multi-column statistics and uses exponential backoff with the individual single column statistics.
  • SQL Server 2016: Further improvements were made in SQL Server 2016 and CE model version 130 takes advantage of multi-column statistics again, using the ‘all density’ value for the combination of columns and multiplying it by the total number of rows in the table.

Despite multi-column statistics (and filtered statistics for that matter) not being used by the SQL Server 2014 CE, you can still leverage them by forcing the legacy CE using trace flags and query hints.

-- Manually create a multi-column statistic
CREATE STATISTICS AC_Region_Subregion ON Application.Countries (Region, Subregion) WITH FULLSCAN

-- Multiple Predicate (New CE Conjunction With MultiColumn Stats)
SELECT CountryName, Region, Subregion
FROM Application.Countries
WHERE Region = 'Americas'
AND Subregion = 'Caribbean'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT 0.04545455 * 190 = 8.63636 

-- Multiple Predicate (Legacy CE Conjunction With MultiColumn Stats)
SELECT CountryName, Region, Subregion
FROM Application.Countries
WHERE Region = 'Americas'
AND Subregion = 'Caribbean'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT 0.04545455 * 190 = 8.63636

ALTER DATABASE [WideWorldImporters] SET COMPATIBILITY_LEVEL = 120
GO

-- Multiple Predicate (New CE (SQL Server 2014 only) Conjunction With MultiColumn Stats)
SELECT CountryName, Region, Subregion
FROM Application.Countries
WHERE Region = 'Americas'
AND Subregion = 'Caribbean'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- SQL Server 2014 CE does not make use of multi-column stats
-- Region Selectivity = SELECT(35.00 / 190) = 0.184210 (S2)
-- SubRegion Selectivity = SELECT(13.00 / 190) = 0.068421 (S1 Most Selective)
-- New Ce Estimate -- Estimate = SELECT ( 190 ) * 0.068421 * SQRT(0.184210) = 5.5795 

DROP STATISTICS Application.Countries.AC_Region
DROP STATISTICS Application.Countries.AC_SubRegion
DROP STATISTICS Application.Countries.AC_Region_Subregion

ALTER DATABASE [WideWorldImporters] SET COMPATIBILITY_LEVEL = 130
GO

Scaling Cardinality Estimates

All of the demos/examples so far have used statistic objects generated with a full scan where no data is being updated and no new rows are being inserted so the statistics can be considered as being fresh, accurate and up to date.  What happens when the data isn’t static, are the cardinality estimates affected?

Statistics Up To Date
When there is a direct hit in the statistics histogram for an equality predicate, the query optimiser can use the EQ_ROWS (equality rows) value to get a perfect estimation, assuming the statistics are up to date. When the equality predicate falls within the entire histogram range but doesn’t result in a direct histogram step hit the query optimiser can use the AVG_RANGE_ROWS (average range rows) to derive an accurate estimation. This is the same across all versions of the CE.
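This can be confirmed against the statistic created earlier, where ‘2016-05-31’ is the last histogram step with an EQ_ROWS value of 2 (a sketch; run this before the INSERT statements further down so the statistics are still up to date):

-- Literal Equality (Direct Histogram Step Hit - both CEs)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = '2016-05-31'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- Estimate = EQ_ROWS for the '2016-05-31' step = 2 (matches the 2 actual rows)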

Statistics Out Of Date – Literal or Parameter Equality Predicate
Both CEs are aware of outdated statistics and scale the estimate according to the actual total number of rows in the table. When there is a direct histogram step hit, the below formula is used, where the equality rows value from the statistics histogram is multiplied by the current number of rows in the table and then divided by the number of rows shown in the statistics header:

Estimate = (EQ_ROWS * Actual Current Rows) / Statistics Total Rows

When the equality predicate is within the entire histogram range but doesn’t result in a direct histogram step hit, the query optimiser will use the AVG_RANGE_ROWS value from the statistics object to derive an estimate.  The behaviour is slightly different under the different CE models.  Under the legacy CE, the estimate does not scale and is based solely on the AVG_RANGE_ROWS value from the statistics object.  Under the new CE, the estimate is again scaled using the same principle:

Legacy CE Estimate = AVG_RANGE_ROWS
New CE Estimate = (AVG_RANGE_ROWS * Actual Current Rows) / Statistics Total Rows

-- All of the previous examples so far have had up to date, full scan statistics, but what happens when
-- new rows are inserted, do the cardinality estimates scale?

-- Insert rows, some within the current histogram range, some outside the current histogram range, but not
-- enough rows to breach the threshold for triggering an automatic update of statistics.
-- Turn off actual execution plans before doing the INSERT operation
INSERT INTO [Purchasing].[PurchaseOrders]
           ([SupplierID],[OrderDate],[DeliveryMethodID],[ContactPersonID],[ExpectedDeliveryDate],[SupplierReference]
           ,[IsOrderFinalized],[Comments],[InternalComments],[LastEditedBy])
     VALUES (7, '2016-05-31', 2, 2, '2016-06-09', 'BC0280982', 1, NULL, NULL, 7);
GO 163
INSERT INTO [Purchasing].[PurchaseOrders]
           ([SupplierID],[OrderDate],[DeliveryMethodID],[ContactPersonID],[ExpectedDeliveryDate],[SupplierReference]
           ,[IsOrderFinalized],[Comments],[InternalComments],[LastEditedBy])
     VALUES (7, '2016-11-08', 2, 2, '2016-11-20', 'BC0280982', 1, NULL, NULL, 7);
GO 163

DBCC SHOW_STATISTICS(N'Purchasing.PurchaseOrders', NC_IX_Purchasing_PurchaseOrders_OrderDate)
--WITH STAT_HEADER, DENSITY_VECTOR, HISTOGRAM
GO

-- Scaling Estimate (New CE Equality - Direct Histogram Step Hit)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = '2016-05-31'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- Estimate = (EQ_ROWS * Actual Current Rows) / Statistics Total Rows
-- New Ce Estimate -- Estimate = SELECT( 2.00 * 2400 ) / 2074 = 2.31437

-- Scaling Estimate (Legacy CE Equality - Direct Histogram Step Hit)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = '2016-05-31'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 );
-- Legacy Ce Estimate -- Estimate = SELECT( 2.00 * 2400 ) / 2074 = 2.31437

-- Scaling Estimate (New CE Equality - Non Direct Histogram Step Hit)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = '2013-01-03'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT( 5.00 * 2400 ) / 2074 = 5.78592

-- Scaling Estimate (Legacy CE Equality - Non Direct Histogram Step Hit)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = '2013-01-03'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 );
-- Legacy Ce Estimate = 5
-- The estimate does not scale and is based on the AVG_RANGE_ROWS value from the statistics object.

Statistics Out Of Date – Local Variable Equality Predicate
The values assigned to variables are not known by the optimiser at compile time and as such it can’t use the histogram and has to resort to the density vector to derive cardinality estimates. In this scenario, where the statistics are out of date, the estimate does not scale under the legacy CE:

Estimate = SELECT [AllDensity_DensityVector] * [StatisticsHeaderTotalRows]

Under the new CE, the estimation is again scaled and uses the actual number of rows in the table:

Estimate = SELECT [AllDensity_DensityVector] * [ActualTotalRows]

-- Scaling Estimate (New CE Equality - Local Variable)
DECLARE @OrderDate date
SET @OrderDate = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate =  @OrderDate
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT 0.0009398496 * 2400 = 2.25564
-- The estimate is scaled and is based on the tables current cardinality

-- Scaling Estimate (Legacy CE Equality - Local Variable)
DECLARE @OrderDate date
SET @OrderDate = '2016-05-31'
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate =  @OrderDate
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate -- Estimate = SELECT 0.0009398496 * 2074 = 1.94925
-- The estimate does not scale and is based on the total rows in the statistic object

Searching For Off-Histogram Values (Selectivity Guesses)

The inclusion assumption dictates that we query for data that does exist and as such the cardinality estimator will (almost) never estimate zero rows.  Even if a query selects from a table that has been truncated, the query optimiser will still estimate one row despite the table being completely empty as pointed out by Kendra Little here.

Legacy CE: In the absence of an actual histogram step when a filter predicate falls outside the histogram range, the cardinality estimator assumes the value actually exists and will estimate one row. This is true regardless of the predicate, ( =, >, <, >=, <=, BETWEEN ), as long as the filter predicate falls outside the histogram range, the estimate will be one row.

-- Searching For Off-Histogram Values (Legacy CE Equality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = '2016-11-08'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate = 1

New CE (SQL Server 2014 & 2016): In the absence of an actual histogram step when an equality filter predicate falls outside the histogram range, the cardinality estimator will use the density vector information. The estimate is calculated by using the all density value from the density vector multiplied by the total number of rows from the statistics header, the estimate is not scaled:

Estimate = Density Vector x Table Cardinality
Estimate = SELECT [AllDensity_DensityVector] * [StatisticsHeaderTotalRows]

-- Searching For Off-Histogram Values (New CE Equality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate = '2016-11-08'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- New Ce Estimate -- Estimate = SELECT 0.0009398496 * 2074 = 1.94925

For literals beyond the statistics histogram, the new CE effectively ignores the specific literal value; for inequality predicates it estimates 30% of the number of modifications made to the leading statistics column since the last statistics update. In other words, it makes a 30% guess over the added rows, which can be calculated as:

Estimate = SELECT (rowmodctr * 0.3) FROM sys.sysindexes WHERE name = 'IndexName'

For the BETWEEN logical operator, the 9% guess over the added rows is made:

Estimate = SELECT (rowmodctr * 0.09) FROM sys.sysindexes WHERE name = 'IndexName'

-- Searching For Off-Histogram Values (New CE InEquality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate > '2016-05-31'
OPTION ( QUERYTRACEON 3604, QUERYTRACEON 2363 );
-- Estimate = SELECT (rowmodctr * 0.3) FROM sys.sysindexes WHERE name = 'NC_IX_Purchasing_PurchaseOrders_OrderDate'
-- New Ce Estimate = 97.8 

-- Searching For Off-Histogram Values (Legacy CE InEquality)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate > '2016-05-31'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 3604, QUERYTRACEON 9292, QUERYTRACEON 9204 )
-- Legacy Ce Estimate = 1

Ascending Key Problem

Ascending keys are a common data pattern found in most databases, usually implemented using identity columns, real-time timestamp columns, etc. Insert operations append new values to ascending columns, and when the number of rows added is too small to trigger a statistics update, the most recently added range is not modelled by the histogram. This can lead to inaccurate cardinality estimates, as demonstrated in the previous section ‘Searching For Off-Histogram Values (Selectivity Guesses)’.

SQL Server can detect when the leading column of a statistics object is ascending and can mark it as ascending. A statistics object that belongs to an ascending column is branded as “Ascending” after three statistics updates using full scan and for each statistics update, more than 90% of newly added rows need to have a value greater than the current maximum value in the statistics histogram. If older data is inserted, breaking the ascending sequence, the column will be branded as “Stationary”.

The branding of a statistics object is hidden by default but can be viewed using trace flag 2388. With this trace flag turned on, the output of DBCC SHOW_STATISTICS changes and displays a new column called “Leading column type” which contains the column branding.
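For example, against the statistic created earlier (a minimal sketch using the undocumented trace flag, so behaviour may vary between builds):

DBCC TRACEON(2388);
GO
DBCC SHOW_STATISTICS(N'Purchasing.PurchaseOrders', NC_IX_Purchasing_PurchaseOrders_OrderDate);
GO
DBCC TRACEOFF(2388);
GO
-- The output now includes a 'Leading column type' column showing the branding.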

By default, the query optimiser keeps the information about the branding of statistics, but doesn’t make use of it. The optimiser won’t choose a different plan based on whether the column is ascending or not. To change this, there are several trace flags available:

  • TF 2389: Helps the Query Optimiser to come up with better estimations and correct execution plans only for columns branded as Ascending.
  • TF 2390: Instructs SQL Server to perform on-the-fly calculation for statistics objects for columns marked as Ascending or Unknown.
  • TF 4139: Works with all three leading column types: Ascending, Unknown and Stationary. However, it is only available from SQL Server 2012 SP1 CU10 or SP2 CU1 onwards.

These trace flags are not required with the new CE in SQL Server 2014 as this behaviour is enabled by default.
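Under the legacy CE, the flags can also be applied at the query level via QUERYTRACEON (a sketch only, reusing the earlier off-histogram inequality example; trace flags 2389 and 2390 are among those supported by the QUERYTRACEON hint):

-- Legacy CE with the ascending key fixes enabled for this query only
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders
WHERE OrderDate > '2016-05-31'
OPTION ( QUERYTRACEON 9481, QUERYTRACEON 2389, QUERYTRACEON 2390 );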

Identifying Cardinality Estimate Issues

Execution Plans: The estimated plan will not help you in diagnosing cardinality estimation issues, but the actual plan will, and this can be captured using:

  • SET STATISTICS XML ON;
  • SET STATISTICS PROFILE ON; — (Deprecated)
  • Graphical Showplan (“Include Actual Execution Plan”)
  • Extended Events ( query_post_execution_showplan ) — (Beware of Overhead)

Query Profiles DMV: The sys.dm_exec_query_profiles DMV (introduced in SQL Server 2014) shows the number of rows processed by the various operators at that point in time. Row counts that are already much higher than the estimated row count indicate an under-estimate. Over-estimates are trickier to spot with this DMV, because a low row count so far could mean either a true over-estimate or simply that not all of the rows have flowed through the operator yet.

SELECT [session_id], [node_id], [physical_operator_name],
[estimate_row_count], [row_count]
FROM [sys].[dm_exec_query_profiles]
ORDER BY [node_id];

Root Level Skews via Query Stats: The sys.dm_exec_query_stats DMV can be cross applied to sys.dm_exec_sql_text and sys.dm_exec_query_plan to show the estimated number of rows (shredded from the query plan XML), the last rows (the number of rows returned by the most recent execution of the query) and the query text. This can be used to identify root level skews, but it will not show leaf level skews that are not propagated up through the tree to the root level.

SELECT DB_NAME(qt.dbid) AS [Database], cp.objtype AS [Cached_Plan_Type], cp.usecounts AS [Use_Counts],
qp.query_plan.value('(//@CardinalityEstimationModelVersion)[1]','INT') AS [CE_Version],
CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) AS [Estimated_Rows],
qs.last_rows AS [Last_Rows],
CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) - (qs.last_rows) AS [Estimate_Skew],
qs.total_logical_reads / qs.execution_count AS [Avg_Logical_Reads],
CAST((qs.total_elapsed_time ) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_Execution_Time(µs)],
CAST((qs.total_worker_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_CPU_Time(µs)],
SUBSTRING(qt.TEXT, (qs.statement_start_offset / 2) + 1, ((CASE qs.statement_end_offset WHEN - 1 THEN DATALENGTH(qt.TEXT) ELSE qs.statement_end_offset END - qs.statement_start_offset ) / 2 ) + 1) AS [Query_Text],
qs.query_hash AS [QueryHash], qp.query_plan AS [QueryPlan]
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) qt
INNER JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle
WHERE qt.text NOT LIKE '%sys.%'
AND DB_NAME(qt.dbid) = '[Database_Name]'
ORDER BY [Avg_CPU_Time(µs)] DESC OPTION(RECOMPILE);
GO

Troubleshooting Questions

When troubleshooting cardinality estimate skews, there are several troubleshooting questions that should be asked in order to get to the root cause:

  • Are statistics missing? Is auto-creation of statistics disabled? Missing statistics warnings in query execution plans?
  • Are my statistics stale? Are the statistics stale and no longer reflect realistic distributions? Is auto-update of statistics disabled? Is ‘No Recompute’ enabled for specific statistics? Are some tables large enough that automatic statistics updates simply aren’t happening frequently enough?
  • Is the sampling adequate? If the statistics are sampled, would better quality statistics be produced with a full scan, or a higher sampling percentage?
  • Would multi-column statistics help? If multiple predicates are involved in the query and they happen to be correlated, would multi-column statistics help with the estimates?
  • Is this a parameter sniffing issue? This affects objects that take input parameters, such as stored procedures.
  • Are table variables and/or MSTVF’s causing problems? If there are significant underestimates, look at the use of table variables or multi-statement table valued functions, which both use fixed cardinality and don’t have statistics associated with them.
  • Are only the most recent rows being queried? Is this a typical ascending key problem, where the most recent data most likely isn’t reflected in the statistics histogram?
  • Data type conversion issues? Some data type conversions are known to block proper estimation.
  • Are linked servers involved? Linked servers can be associated with skews if your linked server credentials don’t have high enough permissions to pull statistics information. This was something fixed starting in SQL Server 2012 SP1.
  • Are there Intra-Table Column Comparisons? Cardinality estimation issues can occur when comparing columns within the same table. This is a known issue. If you have to do so, you can improve the cardinality estimates of the column comparisons by using computed columns instead or by re-writing the query to use self-joins or common table expressions.

Missing Statistics: Is the ‘Auto Create Statistics’ database setting set to false? If so, why? Verify that the database is not read-only. If the database is read-only, the query optimiser cannot save statistics (unless it’s a secondary read-only replica in an Availability Group in which case temporary statistics can be created in tempdb).

Stale Statistics: Stale statistics does not necessarily mean old statistics or statistics that haven’t been updated in a while, but rather outdated statistics that no longer accurately represent the current data distribution.

With default database settings, the query optimiser looks at how many changes have occurred for a given column statistic as part of query optimisation. If the number of changes to the column exceeds certain thresholds, SQL Server updates the statistic and then optimises the query.  Why optimise a query on bad data estimates?

The query optimiser determines whether to update statistics based on changes to column modification counters (colmodctrs) and uses the below thresholds in SQL Server 2005 – 2014.

  • If the table has zero rows and one or more rows are added.
  • If the table has between 1 and 500 rows and 500 rows have changed.
  • If the table has 500+ rows and 500 rows + 20% of the total rows have changed.

Note that the statistics don’t auto-update when the rows are changed. It’s only when a query is being compiled and the statistics would actually be used to make an optimisation decision that the update is kicked off.

For very large tables (millions of rows) the default thresholds for automatic updating of statistics may not be triggered frequently enough. This could lead to potential performance problems. SQL Server 2008 R2 Service Pack 1 and later versions introduce trace flag 2371 that you can enable to change this default behaviour. Trace flag 2371 makes the formula for large tables more dynamic and only applies to tables with more than 25,000 rows. Trace flag 2371 uses a decreasing, dynamic statistics update threshold, calculated as the square root of the table cardinality multiplied by 1000, which can be expressed as RT = SQRT(1000 * Table Cardinality). Simply put, under this trace flag, the higher the number of rows in a table, the lower the threshold will become to trigger an update of the statistics. For example, if the trace flag is activated, update statistics will be triggered on a table with 1 billion rows when ~1 million changes occur. If the trace flag is not activated, then the same table with 1 billion rows would need ~200 million changes before an update statistics is triggered. SQL Server 2016 automatically uses this improved algorithm.
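The figures quoted above can be reproduced with a quick calculation:

-- TF 2371 dynamic threshold: RT = SQRT(1000 * table cardinality)
SELECT SQRT(1000 * 1000000000.0) AS [Threshold_1_Billion_Rows],   -- ~1,000,000 changes
       SQRT(1000 * 25000.0)      AS [Threshold_25_Thousand_Rows]; -- 5,000 changes
-- Without the trace flag, a 1 billion row table needs 500 + (0.2 * 1,000,000,000)
-- = ~200 million changes before an automatic statistics update is triggered.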

Are statistics stale because they have been set not to recompute?

  • When creating or altering an index, there is an option ‘WITH STATISTICS_NORECOMPUTE = { ON | OFF}’ which specifies whether distribution statistics are recomputed. The default is OFF which enables automatic statistics updating for the particular statistic objects associated with that index. If set to ON, out-of-date statistics are not automatically recomputed which may prevent the query optimiser from picking optimal execution plans.
  • When creating or updating statistics, there is an option ‘WITH NORECOMPUTE’ that, when specified, will complete the current statistics update but disable future updates, effectively disabling the automatic statistics update option for the specified statistic (a query to list such statistics follows below).
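To find statistics that currently have automatic recompute disabled, something like the following can be used (a sketch against sys.stats):

SELECT OBJECT_SCHEMA_NAME(s.[object_id]) AS [Schema],
       OBJECT_NAME(s.[object_id]) AS [Table],
       s.[name] AS [Statistic],
       s.[no_recompute]
FROM sys.stats AS s
WHERE s.[no_recompute] = 1
  AND OBJECTPROPERTY(s.[object_id], 'IsUserTable') = 1
ORDER BY [Schema], [Table], [Statistic];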

The below query can be used to check for stale statistics on a specific object:

SELECT
OBJECT_NAME([sp].[object_id]) AS "Table",
[sp].[stats_id] AS "Statistic ID",
[s].[name] AS "Statistic",
[sp].[last_updated] AS "Last Updated",
[sp].[rows],
[sp].[rows_sampled],
[sp].[unfiltered_rows],
[sp].[modification_counter] AS "Modifications"
FROM [sys].[stats] AS [s]
OUTER APPLY sys.dm_db_stats_properties ([s].[object_id],[s].[stats_id]) AS [sp]
WHERE [s].[object_id] = OBJECT_ID(N'schema.object');

Sampling: Often, sampling is good enough, but sometimes you might need to use a full scan for your statistics generation in order to capture all interesting points in your data. Histograms are not lossless: sometimes distinct values aren’t represented by steps, or some frequencies are not accurate because they are an average across very high and very low values, and this can cause issues with estimates.

The automatic creation, or automatic updating, of statistics triggered during query optimisation will use default sampling. Statistics created as a by-product of index creation will use a full scan. The default sample size is calculated using a non-linear algorithm where the sample size decreases as the table size gets larger. If the default sample does not create a good histogram, it is possible to manually update statistics with a higher sampling rate, as shown in the example after this list:

  • FULLSCAN: Scans every row to create the best histogram possible.
    UPDATE STATISTICS WITH FULLSCAN
  • SAMPLE PERCENT: Specify the percentage of data or number of rows to scan.
    UPDATE STATISTICS WITH SAMPLE 25 PERCENT
    UPDATE STATISTICS WITH SAMPLE 75 PERCENT
  • RESAMPLE: Inherits the previous sample size
    UPDATE STATISTICS WITH RESAMPLE
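As a worked example against the statistic used throughout this post (a sketch; the 50 percent figure is arbitrary):

-- Full scan
UPDATE STATISTICS Purchasing.PurchaseOrders NC_IX_Purchasing_PurchaseOrders_OrderDate WITH FULLSCAN;
-- Sampled
UPDATE STATISTICS Purchasing.PurchaseOrders NC_IX_Purchasing_PurchaseOrders_OrderDate WITH SAMPLE 50 PERCENT;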

Table Variables & MSTVFs: Table variables do not have statistics objects associated with them and the query optimiser uses a hard-coded estimate of one row. Consider using alternatives like temporary tables and/or permanent staging tables.

The same applies to MSTVFs, where the query optimiser uses a hard-coded estimate of one row, although the new SQL Server 2014 CE now uses a fixed guess of 100 rows. Consider using inline table valued functions instead.
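The fixed guess for table variables is easy to demonstrate (a sketch; run as a single batch with the actual execution plan enabled):

-- Table variable: the optimiser estimates 1 row for the scan regardless of content
DECLARE @Orders TABLE (PurchaseOrderID int, OrderDate date);

INSERT INTO @Orders (PurchaseOrderID, OrderDate)
SELECT PurchaseOrderID, OrderDate
FROM Purchasing.PurchaseOrders;

SELECT PurchaseOrderID, OrderDate
FROM @Orders
WHERE OrderDate = '2016-05-31';
-- Estimated number of rows for the table scan = 1; actual = the full row count
-- (unless OPTION (RECOMPILE) is added, which lets the optimiser see the real cardinality)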

Data Type Conversions: Mismatched data type usage within join or filter predicates can affect cardinality estimates and the associated plan selection. To avoid cardinality estimate issues caused by data type mismatches, use identical data types in search and join conditions.

Resources

For anyone wanting to learn more on this topic, I would highly recommend Joe Sack’s ‘Common Query Tuning Problems and Solutions‘ PluralSight course, which gave me the inspiration for this blog post.  Joe also has a great blog over at SQLSkills.  I’d also recommend reading anything by Paul White, who helped me out with the quirky, complex algorithm used by the legacy CE in the local variable and LIKE operator example.
