Troubleshooting Query Regressions Caused By The New Cardinality Estimator

In this blog post I’ll show methods that can be used to identify and troubleshoot query plan regressions caused by the new cardinality estimator.  I’ll show how this was a very manual task in SQL Server 2014 and how the process was made much easier in SQL Server 2016 with the introduction of Query Store.

For the purpose of this demo, I will use HammerDB to generate and populate a database that I will then run TPC-C (OLTP) workloads against.  Benchmarks will be taken against the HammerDB generated database under the different database compatibility levels to show the effects the different cardinality estimator model versions have on the workload:

  1. Benchmark 1: Database Compatibility Level = 110 (Legacy CE)
  2. Benchmark 2: Database Compatibility Level = 120 (New SQL Server 2014 CE)
  3. Benchmark 3: Database Compatibility Level = 130 (New SQL Server 2016 CE)

Trace Flag 4199, which enables various query optimiser hotfixes, could potentially affect the results of these workload tests, so it will not be enabled in this demo.
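If you want to double-check that the flag isn’t already enabled globally on your own test instance before running the benchmarks, DBCC TRACESTATUS will tell you:

-- Check whether Trace Flag 4199 is enabled globally on this instance.
-- A Status of 0 (or no row returned) means the flag is off.
DBCC TRACESTATUS (4199, -1);
GO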

Using HammerDB in this manner to generate a synthetic workload is a very contrived example.  In reality, you should use tools like RML Utilities or Distributed Replay to test what effect the new cardinality estimator has on your specific application workload.

My demo environment consists of a Windows Server 2012 R2 VM with a single installation of SQL Server 2016 RTM, configured to use a maximum of 8GB of memory, with MAXDOP set to 4, the cost threshold for parallelism set to 25 and the tpcc database using the simple recovery model.
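For reference, those instance settings roughly translate to the T-SQL below (the values match the ones quoted above; adjust them for your own test box, and the recovery model change obviously has to wait until the HammerDB schema build has created the database):

-- Instance-level settings used for this demo environment
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max server memory (MB)', 8192;
EXEC sys.sp_configure 'max degree of parallelism', 4;
EXEC sys.sp_configure 'cost threshold for parallelism', 25;
RECONFIGURE;
GO
-- Run after the HammerDB schema build has created the tpcc database
ALTER DATABASE [tpcc] SET RECOVERY SIMPLE;
GO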

Configuring HammerDB & PerfMon

HammerDB:  Launch HammerDB and select ‘MSSQL Server’ from the benchmark options.  Expand ‘TPC-C’, navigate to ‘Schema Build’ and double click ‘Options’.  This is the template for the database which will be created and populated.  For this demo I have used the options below.  To create the ‘tpcc’ database, double click ‘Build’.

tpcc_database

Backup the newly created ‘tpcc’ database which will serve as the starting point for each benchmark test:

BACKUP DATABASE [tpcc] TO  DISK = N'C:\Temp\tpcc.bak'
WITH COPY_ONLY, COMPRESSION,  STATS = 10, CHECKSUM
GO
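Optionally, verify the backup is readable before relying on it as the restore point for every benchmark run:

RESTORE VERIFYONLY FROM DISK = N'C:\Temp\tpcc.bak' WITH CHECKSUM
GO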

Prepare the workload driver options as below by double clicking on ‘Options’ under ‘Driver Script’.  Then, double click ‘Load’ which should populate the ‘Script Editor’ window with 387 lines of code.

workload-options

Next, configure the virtual users as below by double clicking ‘Options’ under ‘Virtual User’.  Create the virtual users by double clicking ‘Create’.

virtualuser-options

PerfMon:  Configure PerfMon to capture ‘Batch Requests/sec’ and ‘% Processor Time’.  Set a sample interval of one second and a duration of 300 seconds (5 minutes).

perfmon-counters

perfmon-properties

Tracking Query Regressions The Old Way

SQL Server 2012 Legacy CE Benchmark Test

Preparation:  Set the database compatibility level to 110 so the query optimiser will use the legacy CE model when compiling queries for this database.  Flush the plan cache and buffer pool prior to each workload test to ensure a consistent starting point.

ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 110
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO
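A quick check that the compatibility level change has taken effect:

SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'tpcc'
GO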


With PerfMon configured and the HammerDB workload defined, it’s time to start the test.  Start the PerfMon counters, wait 30 seconds and then start the HammerDB workload by double clicking ‘Run’ under ‘Virtual Users’ or by selecting the green arrow icon highlighted below:

runvirtualusers

Analysing Results:  Under the ‘Virtual User Output’ tab, as shown below, you can see the results: 30875 TPM (Transactions Per Minute) and 6666 NOPM (New Orders Per Minute).

legacy-ce-tpm

Looking at the PerfMon graph, you can see that Average Batch Requests/sec was ~40 and Average % Processor Time was ~15%.

legacyceperfmon

This workload test against the legacy cardinality estimator will serve as the baseline which we will use for comparison after running the same HammerDB workload against the newer cardinality estimator model versions.

hammerdb_workloadresults1

Use the below T-SQL to filter the contents of the plan cache and insert query stat information into a temporary table named #TempCEStats for later analysis.

--INSERT INTO #TempCEStats
SELECT  DB_NAME(qt.dbid) AS [Database], cp.objtype AS [Cached_Plan_Type],
        cp.usecounts AS [Use_Counts],
		qp.query_plan.value('(//@CardinalityEstimationModelVersion)[1]','INT') AS [CE_Version],
		CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) AS [Estimated_Rows],
		qs.last_rows AS [Last_Rows],
		CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) - (qs.last_rows) AS [Estimate_Skew],
		qs.total_logical_reads / qs.execution_count AS [Avg_Logical_Reads],
		CAST((qs.total_elapsed_time ) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_Execution_Time(µs)],
		CAST((qs.total_worker_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_CPU_Time(µs)],
		SUBSTRING(qt.TEXT, (qs.statement_start_offset / 2) + 1, ((CASE qs.statement_end_offset WHEN - 1 THEN DATALENGTH(qt.TEXT) ELSE qs.statement_end_offset END - qs.statement_start_offset ) / 2 ) + 1) AS [Query_Text],
		qs.query_hash AS [QueryHash],
        qp.query_plan AS [QueryPlan]
INTO #TempCEStats
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
        CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) qp
        CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) qt
		INNER JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle
WHERE   qt.text NOT LIKE '%sys.%'
AND   DB_NAME(qt.dbid) = 'tpcc'
ORDER BY [Avg_CPU_Time(µs)] DESC OPTION(RECOMPILE);
GO
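As a quick sanity check after each capture (not part of the original script), you can summarise what has landed in #TempCEStats per CE model version:

SELECT  [CE_Version],
        COUNT(*) AS [Cached_Statements],
        AVG([Avg_CPU_Time(µs)]) AS [Avg_CPU_Time(µs)]
FROM #TempCEStats
GROUP BY [CE_Version]
ORDER BY [CE_Version]
GO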

SQL Server 2014 CE Model Version 120 Benchmark Test

Preparation:  Restore the tpcc database from the initial backup.  Set the database compatibility level to 120 so the query optimiser will use the new cardinality estimator model version 120 when compiling queries for this database.  Flush the plan cache and buffer pool prior to each workload test to ensure a consistent starting point.


RESTORE DATABASE [tpcc] FROM  DISK = N'C:\Temp\tpcc.bak' WITH  FILE = 1,  NOUNLOAD,  REPLACE,  STATS = 5
GO
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 120
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO

Clear the PerfMon screen, re-start the PerfMon counters, wait 30 seconds and then start the HammerDB workload once again.  Once the workload is complete, run the below T-SQL once again to insert the contents of the plan cache into the same temporary table for later analysis:

INSERT INTO #TempCEStats
SELECT  DB_NAME(qt.dbid) AS [Database], cp.objtype AS [Cached_Plan_Type],
        cp.usecounts AS [Use_Counts],
		qp.query_plan.value('(//@CardinalityEstimationModelVersion)[1]','INT') AS [CE_Version],
		CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) AS [Estimated_Rows],
		qs.last_rows AS [Last_Rows],
		CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) - (qs.last_rows) AS [Estimate_Skew],
		qs.total_logical_reads / qs.execution_count AS [Avg_Logical_Reads],
		CAST((qs.total_elapsed_time ) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_Execution_Time(µs)],
		CAST((qs.total_worker_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_CPU_Time(µs)],
		SUBSTRING(qt.TEXT, (qs.statement_start_offset / 2) + 1, ((CASE qs.statement_end_offset WHEN - 1 THEN DATALENGTH(qt.TEXT) ELSE qs.statement_end_offset END - qs.statement_start_offset ) / 2 ) + 1) AS [Query_Text],
		qs.query_hash AS [QueryHash],
        qp.query_plan AS [QueryPlan]
--INTO #TempCEStats
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
        CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) qp
        CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) qt
		INNER JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle
WHERE   qt.text NOT LIKE '%sys.%'
AND   DB_NAME(qt.dbid) = 'tpcc'
ORDER BY [Avg_CPU_Time(µs)] DESC OPTION(RECOMPILE);
GO

Analysing Results:  A quick glance at the PerfMon graph below shows that CPU usage has increased significantly and batch requests/sec has decreased.

newce_sql2014_perfmon

Plug the PerfMon and HammerDB metrics into the table for a better comparison:

hammerdb_workloadresults2

It’s clear that the HammerDB synthetic workload performs better under the legacy cardinality estimator than under the new cardinality estimator model version 120.  What tools are available to us for analysing how individual queries or stored procedures perform?  It’s possible to use Extended Events or Profiler, but for this demo we are going to use the various DMOs, and fortunately we’ve already been inserting query statistics into a temporary table, #TempCEStats.  Since we know that there was a significant increase in CPU usage during the second workload test, issue the below query against the temporary table to filter on this metric:

SELECT * FROM #TempCEStats
ORDER BY [Avg_CPU_Time(µs)] DESC


query-stats
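To make the side-by-side comparison easier, a small variation on the query above (not from the original scripts) sorts the captured statements so that both CE versions of the same query hash sit next to each other:

SELECT  [QueryHash],
        [CE_Version],
        [Use_Counts],
        [Estimated_Rows],
        [Last_Rows],
        [Estimate_Skew],
        [Avg_Logical_Reads],
        [Avg_CPU_Time(µs)]
FROM #TempCEStats
ORDER BY [QueryHash], [CE_Version]
GO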

It is instantly apparent that the most resource-intensive query was the same query across both workload tests; note that the query hash is consistent too.  It is also apparent that this query performs worse under the new cardinality estimator model version 120.  To understand why this particular query behaves differently under the different cardinality estimators, we’ll need to look at the actual query and the execution plans.

Looking at the information in #TempCEStats and the execution plans, the problematic query below belongs to the SLEV stored procedure.  The execution plans are as follows:

ce70_plan

ce120_plan

This excerpt, taken from my previous blog post, is very relevant here:
“Good cardinality estimation (row count expectations at each node of the logical tree) is vital; if these numbers are wrong, all later decisions are affected. Cardinality estimates are a major factor in deciding which physical operator algorithms are used, the overall plan shape (join orders etc) and ultimately they determine the final query plan that executes. Given these critical plan choices, when the cardinality estimation process contains a significantly skewed assumption, this can lead to an inefficient plan choice. This can, in turn, result in degraded performance.”

The different assumption models and algorithms used by the new cardinality estimator model will lead to different cardinality estimates in certain scenarios and this is apparent here in the form of a completely different execution plan shape.  Under the legacy CE, a serial plan with an index seek and RID lookup using a nested loop join was used.  Under the new SQL Server 2014 CE, a parallel plan with a table scan and hash match join was used.

SQL Server 2014 Potential Solutions:  In my previous blog post I showed how Trace Flag 9481 can be used in SQL Server 2014 and SQL Server 2016 at various scopes to force the legacy CE.  In our scenario, the HammerDB synthetic workload performs much the same under the different cardinality estimators, with the exception of the SLEV stored procedure.  The recommendation here would be to keep using the new SQL Server 2014 CE but force the legacy CE for the one specific query regression as a temporary measure, until the stored procedure can be tuned or rewritten so that it performs well without the hint.  This can be done by setting Trace Flag 9481 at the individual statement level within the stored procedure, assuming we have permission to edit the stored procedure code.

Let’s modify the stored procedure and re-run the HammerDB workload:

RESTORE DATABASE [tpcc] FROM  DISK = N'C:\Temp\tpcc.bak' WITH  FILE = 1,  NOUNLOAD,  REPLACE,  STATS = 5
GO
USE [tpcc]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

ALTER PROCEDURE [dbo].[SLEV]
    @st_w_id int,
    @st_d_id int,
    @threshold int
AS
BEGIN
    DECLARE
        @st_o_id int,
        @stock_count int

    BEGIN TRANSACTION
    BEGIN TRY
        SELECT @st_o_id = DISTRICT.D_NEXT_O_ID
        FROM dbo.DISTRICT
        WHERE DISTRICT.D_W_ID = @st_w_id AND DISTRICT.D_ID = @st_d_id

        -- Trace Flag 9481 forces the legacy CE for this statement only
        SELECT @stock_count = count_big(DISTINCT STOCK.S_I_ID)
        FROM dbo.ORDER_LINE, dbo.STOCK
        WHERE ORDER_LINE.OL_W_ID = @st_w_id
          AND ORDER_LINE.OL_D_ID = @st_d_id
          AND (ORDER_LINE.OL_O_ID < @st_o_id)
          AND ORDER_LINE.OL_O_ID >= (@st_o_id - 20)
          AND STOCK.S_W_ID = @st_w_id
          AND STOCK.S_I_ID = ORDER_LINE.OL_I_ID
          AND STOCK.S_QUANTITY < @threshold
        OPTION (QUERYTRACEON 9481)

        SELECT @st_o_id AS N'@st_o_id', @stock_count AS N'@stock_count'
    END TRY
    BEGIN CATCH
        SELECT
             ERROR_NUMBER() AS ErrorNumber
            ,ERROR_SEVERITY() AS ErrorSeverity
            ,ERROR_STATE() AS ErrorState
            ,ERROR_PROCEDURE() AS ErrorProcedure
            ,ERROR_LINE() AS ErrorLine
            ,ERROR_MESSAGE() AS ErrorMessage;
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
    END CATCH;
    IF @@TRANCOUNT > 0
        COMMIT TRANSACTION;
END
GO

ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 120
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO

Analysing Results:

newce_sql2014_tf9481_perfmon

hammerdb_workloadresults3

The results clearly show the effect that setting Trace Flag 9481 on the problematic statement in the SLEV stored procedure has had on workload performance.  CPU usage is back to what it was under the legacy cardinality estimator, and TPM, NOPM and batch requests/sec have all almost doubled.  To confirm that the problematic statement is using the legacy cardinality estimator, open the query execution plan XML and search for CardinalityEstimationModelVersion="70", or scroll down the plan XML to the relevant statement where you will find this attribute.
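Alternatively, the per-statement CE model version can be pulled straight out of the cached plan with a little XQuery.  This is a convenience sketch rather than one of the original scripts:

-- List the CE model version used by each statement in the cached SLEV plan
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT  OBJECT_NAME(qp.objectid, qp.dbid) AS [Object],
        stmt.node.value('@StatementText', 'NVARCHAR(MAX)') AS [Statement],
        stmt.node.value('@CardinalityEstimationModelVersion', 'INT') AS [CE_Version]
FROM sys.dm_exec_cached_plans cp
        CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) qp
        CROSS APPLY qp.query_plan.nodes('//StmtSimple') AS stmt(node)
WHERE   qp.dbid = DB_ID('tpcc')
AND     OBJECT_NAME(qp.objectid, qp.dbid) = 'SLEV';
GO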

In the scenario where you can’t modify the stored procedure, you could use a plan guide to achieve the same effect.
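For illustration, such a plan guide might look something like the sketch below.  The text passed to @stmt has to match the statement inside the unmodified dbo.SLEV procedure exactly, whitespace included, so treat this as a template rather than something to run verbatim:

EXEC sp_create_plan_guide
    @name   = N'PG_SLEV_LegacyCE',
    @stmt   = N'SELECT @stock_count = count_big(DISTINCT STOCK.S_I_ID) FROM dbo.ORDER_LINE, dbo.STOCK
WHERE ORDER_LINE.OL_W_ID = @st_w_id AND ORDER_LINE.OL_D_ID = @st_d_id AND (ORDER_LINE.OL_O_ID < @st_o_id) AND
ORDER_LINE.OL_O_ID >= (@st_o_id - 20) AND STOCK.S_W_ID = @st_w_id AND STOCK.S_I_ID = ORDER_LINE.OL_I_ID AND STOCK.S_QUANTITY < @threshold',
    @type   = N'OBJECT',
    @module_or_batch = N'dbo.SLEV',
    @params = NULL,
    @hints  = N'OPTION (QUERYTRACEON 9481)';
GO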

SQL Server 2016 CE Model Version 130 Benchmark Test

To complete the picture, let’s run the same workload test against the new SQL Server 2016 CE and collect the same metrics.

Preparation:  Restore the tpcc database from the backup taken earlier prior to the first workload test.  Set the database compatibility level to 130 so the query optimiser will use the new SQL Server 2016 cardinality estimator model version 130 when compiling queries for this database.  Flush the plan cache and buffer pool prior to each workload test to ensure a consistent starting point.

RESTORE DATABASE [tpcc] FROM  DISK = N'C:\Temp\tpcc.bak' WITH  FILE = 1,  NOUNLOAD,  REPLACE,  STATS = 5
GO
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 130
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO

Clear the PerfMon screen, re-start the PerfMon counters, wait 30 seconds and then start the HammerDB workload once again.  Once the workload is complete, run the below T-SQL once again to insert the contents of the plan cache into the same temporary table for later analysis:

INSERT INTO #TempCEStats
SELECT  DB_NAME(qt.dbid) AS [Database], cp.objtype AS [Cached_Plan_Type],
        cp.usecounts AS [Use_Counts],
		qp.query_plan.value('(//@CardinalityEstimationModelVersion)[1]','INT') AS [CE_Version],
		CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) AS [Estimated_Rows],
		qs.last_rows AS [Last_Rows],
		CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) - (qs.last_rows) AS [Estimate_Skew],
		qs.total_logical_reads / qs.execution_count AS [Avg_Logical_Reads],
		CAST((qs.total_elapsed_time ) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_Execution_Time(µs)],
		CAST((qs.total_worker_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_CPU_Time(µs)],
		SUBSTRING(qt.TEXT, (qs.statement_start_offset / 2) + 1, ((CASE qs.statement_end_offset WHEN - 1 THEN DATALENGTH(qt.TEXT) ELSE qs.statement_end_offset END - qs.statement_start_offset ) / 2 ) + 1) AS [Query_Text],
		qs.query_hash AS [QueryHash],
        qp.query_plan AS [QueryPlan]
--INTO #TempCEStats
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
        CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) qp
        CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) qt
		INNER JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle
WHERE   qt.text NOT LIKE '%sys.%'
AND   DB_NAME(qt.dbid) = 'tpcc'
ORDER BY [Avg_CPU_Time(µs)] DESC OPTION(RECOMPILE);
GO

Analysing Results: 

newce_sql2016_perfmon

Putting the metrics into the table below, we can see that the HammerDB workload as a whole performs slightly better under the SQL Server 2016 CE as opposed to the SQL Server 2014 CE.

hammerdb_workloadresults4

The evidence above is reinforced when querying the #TempCEStats temporary table: the single problematic query also performs slightly better under the SQL Server 2016 CE than under the SQL Server 2014 CE, but it is still not as performant as under the Legacy CE.  Looking at the execution plan, it’s unsurprising to see the same parallel plan with a table scan and hash match join.

query-stats_final

The next step would be to test the HammerDB workload against the SQL Server 2016 CE with Trace Flag 9481 added to the problematic statement within the SLEV stored procedure, but I’ll leave that as an exercise for the reader.

Tracking Query Regressions The New Way (Feat. Query Store)

For this demo, the setup and approach will be slightly different.  The scenario this time is that the tpcc database will be upgraded to the latest database compatibility level 130 but we want to continue using the legacy CE until given the go ahead to switch to the SQL Server 2016 CE.  In SQL Server 2016, this can be achieved using the following methods:

  1. Trace Flag: Executing ‘DBCC TRACEON (9481, -1);’ will enable this trace flag globally and will affect all databases on the SQL Server instance.
  2. Database Scoped Configuration: Executing ‘ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;’ will only affect the database in which it is executed.

Using the database scoped configuration method is the better option and is what we’ll use in this demo.
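The current setting can be checked at any time through the sys.database_scoped_configurations catalog view (a value of 1 means the legacy CE is in use for that database):

USE [tpcc]
GO
SELECT name, value
FROM sys.database_scoped_configurations
WHERE name = N'LEGACY_CARDINALITY_ESTIMATION'
GO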

Preparation:  Restore the tpcc database from the initial backup.  Set the database compatibility level to 130 and turn on the Legacy CE database scoped configuration so the query optimiser will use the legacy CE when compiling queries for this database.  Enable Query Store, as it is not active for new databases by default.  Flush the plan cache and buffer pool.

RESTORE DATABASE [tpcc] FROM DISK = N'C:\Temp\tpcc.bak' WITH  FILE = 1,  NOUNLOAD,  REPLACE,  STATS = 5
GO
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 130
GO
USE [tpcc]
GO
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON
GO
ALTER DATABASE [tpcc] SET QUERY_STORE = ON
GO
ALTER DATABASE [tpcc] SET QUERY_STORE (OPERATION_MODE = READ_WRITE, DATA_FLUSH_INTERVAL_SECONDS = 900, INTERVAL_LENGTH_MINUTES = 10)
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO
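Before starting the workload, it’s worth confirming that Query Store is actually up and collecting; sys.database_query_store_options shows the current state and storage usage:

USE [tpcc]
GO
SELECT actual_state_desc, readonly_reason, current_storage_size_mb, max_storage_size_mb
FROM sys.database_query_store_options
GO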

Run the HammerDB workload and record the results in the table below.  Next, turn off the Legacy CE database scoped configuration with the below T-SQL, then run another HammerDB workload and record those results as well.

USE [tpcc]
GO
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = OFF
GO

hammerdb_workloadresults5

Analysing Results:  Looking at the HammerDB metrics above, we can see once again the throughput of our HammerDB workload has decreased significantly when switching to the new SQL Server 2016 CE.  Instead of trawling through the various DMOs, let’s take a look at Query Store.  Expand the tpcc database in object explorer, navigate to ‘Query Store’ and open ‘Top Resource Consuming Queries’.  Filtering on Total CPU Time or Total Duration in the top left pane we can clearly see the same problematic query.

querystore_1
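The ‘Top Resource Consuming Queries’ report is driven by the Query Store catalog views, so roughly the same list can be produced in T-SQL if you prefer.  The query below is a simplified sketch, not the exact query the report runs; the statement text can be retrieved by joining through sys.query_store_query to sys.query_store_query_text:

SELECT TOP (10)
        p.query_id,
        rs.plan_id,
        SUM(rs.count_executions) AS [Executions],
        AVG(rs.avg_cpu_time) AS [Avg_CPU_Time(µs)],
        AVG(rs.avg_duration) AS [Avg_Duration(µs)]
FROM sys.query_store_runtime_stats rs
        INNER JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
GROUP BY p.query_id, rs.plan_id
ORDER BY AVG(rs.avg_cpu_time) DESC
GO

That said, the graphical report makes the next step much easier.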

Looking at the top right pane, we can see there are three execution plans for this query, and it’s clear that the execution plan depicted by the yellow circle at the top right performs the worst.  To get more detailed information on the performance of the individual execution plans, select the ‘View Plan Summary In A Grid Format’ button highlighted in red below.  Selecting the individual plan_ids will show the graphical execution plan in the bottom pane.  The execution plan with plan_id 49 is the most efficient plan and is the one we will force using Query Store.

querystore_2

Switching back to the chart format, I will select plan_id 90 and plan_id 49 and then compare the execution plans side by side.  Looking at the execution plans and plan properties below, we can see the poorly performing plan at the top with plan_id 90, is a parallel plan with a table scan and hash match join compiled using the SQL Server 2016 CE.  The more performant plan at the bottom with plan_id 49 is a serial plan with an index seek and RID lookup using a nested loop join compiled using the Legacy CE.  The query regression is obvious here.  Query Store has made it very easy for us to visually identify that this particular query performs better under the Legacy CE.

querystore_3

querystore_4

To force plan_id 49, select the plan, click the ‘Force Plan’ button highlighted below and accept the pop-up confirmation box.

querystore_5
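If you prefer T-SQL to the GUI, sp_query_store_force_plan does the same job.  The query_id below is a placeholder; look it up in the report or in sys.query_store_plan for plan_id 49 before running this:

-- Force plan_id 49 for its parent query (the query_id shown is a placeholder)
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 49;
GO
-- To remove the forced plan later:
-- EXEC sp_query_store_unforce_plan @query_id = 42, @plan_id = 49;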

Kick off another HammerDB workload test and enter the results into the table below:

hammerdb_workloadresults6

Using Query Store to force the good plan has resulted in the best workload throughput.  The PerfMon graph for this series of tests looks like this:

perfmon_querystoretest

Conclusion

Query Store provides insight on query plan choice and performance.  It makes this information easily accessible and it simplifies performance troubleshooting by helping you quickly find performance differences caused by query plan changes.  Prior to Query Store, troubleshooting query regressions was more challenging and relied on manual analysis of the plan cache and various DMVs.

Query Store enables the database administrator to directly affect the performance of queries by forcing execution plans for specific queries.  Prior to Query Store, this type of intervention was achieved through plan guides and trace flags, which was a much more involved process and sometimes meant modifying code.

The key points of this blog post were to show how useful Query Store can be in a SQL Server DBA’s tool belt and to highlight the importance of testing application workloads against the new cardinality estimator models prior to upgrading to SQL Server 2014 or SQL Server 2016.

Resources & Further Reading

I highly recommend reading @SQL_JGood’s excellent posts on this topic.  I built upon his examples and added what I have learned recently out in the real world.

Enrico Van De Laar ( @evdlaar ) has a great series on Query Store on the Simple-Talk site for anyone wanting a deep look into this feature.


