In this blog post I’ll show methods that can be used to identify and troubleshoot query plan regressions caused by the new cardinality estimator. I’ll show how this was a very manual task in SQL Server 2014 and how the process was made much easier in SQL Server 2016 with the introduction of Query Store.
For the purpose of this demo, I will use HammerDB to generate and populate a database that I will then run TPC-C (OLTP) workloads against. Benchmarks will be taken against the HammerDB generated database under the different database compatibility levels to show the effects the different cardinality estimator model versions have on the workload:
- Benchmark 1: Database Compatibility Level = 110 (Legacy CE)
- Benchmark 2: Database Compatibility Level = 120 (New SQL Server 2014 CE)
- Benchmark 3: Database Compatibility Level = 130 (New SQL Server 2016 CE)
Trace Flag 4199, which enables various query optimiser hotfixes, could potentially affect the results of these workload tests, so it will not be enabled in this demo.
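If you want to confirm the flag is not already enabled globally on your test instance, a quick check is below (a minimal sketch; DBCC TRACESTATUS simply reports the status of the flag):

-- Check whether Trace Flag 4199 is enabled globally; an empty result set
-- (or a Status of 0) means the optimiser hotfix flag is not active
DBCC TRACESTATUS (4199, -1);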
Using HammerDB in this manner to generate a synthetic workload is a very contrived example. In reality you should be using tools like RML Utilities or Distributed Replay to test what effect the new cardinality estimator has against your specific application workload.
My demo environment consists of a Windows Server 2012 R2 VM with a single installation of SQL Server 2016 RTM, configured to use a maximum of 8GB of memory, with MAXDOP set to 4, the cost threshold for parallelism set to 25 and the tpcc database using the simple recovery model.
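For reference, a sketch of the instance settings described above, assuming a default instance and the values stated (adjust to suit your own environment):

-- Instance settings used for the demo environment (values as described above)
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 8192;
EXEC sys.sp_configure N'max degree of parallelism', 4;
EXEC sys.sp_configure N'cost threshold for parallelism', 25;
RECONFIGURE;
-- The tpcc database uses the simple recovery model (run once the database has been built)
-- ALTER DATABASE [tpcc] SET RECOVERY SIMPLE;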
Configuring HammerDB & PerfMon
HammerDB: Launch HammerDB and select ‘MSSQL Server’ from the benchmark options. Expand ‘TPC-C’, navigate to ‘Schema Build’ and double click ‘Options’. This is the template for the database which will be created and populated. For this demo I have used the options below. To create the ‘tpcc’ database, double click ‘Build’.
Backup the newly created ‘tpcc’ database which will serve as the starting point for each benchmark test:
BACKUP DATABASE [tpcc]
TO DISK = N'C:\Temp\tpcc.bak'
WITH COPY_ONLY, COMPRESSION, STATS = 10, CHECKSUM
GO
Prepare the workload driver options as below by double clicking on ‘Options’ under ‘Driver Script’. Then, double click ‘Load’ which should populate the ‘Script Editor’ window with 387 lines of code.
Next, configure the virtual users as below by double clicking ‘Options’ under ‘Virtual User’. Create the virtual users by double clicking ‘Create’.
PerfMon: Configure PerfMon to capture ‘Batch Requests/sec’ and ‘% Processor Time’. Set a sample interval of one second for a duration of 300 seconds (5 minutes).
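If you would rather sample from inside SQL Server than from PerfMon, the same counter is exposed through sys.dm_os_performance_counters. Note that Batch Requests/sec is stored there as a cumulative value, so you need to take two samples and divide the difference by the elapsed seconds to get a rate. A minimal sketch:

-- Cumulative value of the Batch Requests/sec counter; sample it twice and
-- divide the difference by the elapsed seconds to get a per-second rate
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Batch Requests/sec';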
Tracking Query Regressions The Old Way
SQL Server 2012 Legacy CE Benchmark Test
Preparation: Set the database compatibility level to 110 so the query optimiser will use the legacy CE model when compiling queries for this database. Flush the plan cache and buffer pool prior to each workload test to ensure a consistent starting point.
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 110
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO
With PerfMon configured and the HammerDB workload defined, it’s time to start the test. Start the PerfMon counters, wait 30 seconds and then start the HammerDB workload by double clicking ‘Run’ under ‘Virtual Users’ or by selecting the green arrow icon highlighted below:
Analysing Results: Under the ‘Virtual User Output’ tab as shown below you can see the results which are: 30875 TPM (Transactions Per Minute) & 6666 NOPM (New Orders Per Minute).
Looking at the PerfMon graph, you can see that Average Batch Requests/sec was ~40 and Average % Processor Time was ~15%.
This workload test against the legacy cardinality estimator will serve as the baseline which we will use for comparison after running the same HammerDB workload against the newer cardinality estimator model versions.
Use the below T-SQL to filter the contents of the plan cache and insert query stat information into a temporary table named #TempCEStats for later analysis.
--INSERT INTO #TempCEStats
SELECT DB_NAME(qt.dbid) AS [Database],
       cp.objtype AS [Cached_Plan_Type],
       cp.usecounts AS [Use_Counts],
       qp.query_plan.value('(//@CardinalityEstimationModelVersion)[1]', 'INT') AS [CE_Version],
       CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) AS [Estimated_Rows],
       qs.last_rows AS [Last_Rows],
       CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) - (qs.last_rows) AS [Estimate_Skew],
       qs.total_logical_reads / qs.execution_count AS [Avg_Logical_Reads],
       CAST((qs.total_elapsed_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_Execution_Time(µs)],
       CAST((qs.total_worker_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_CPU_Time(µs)],
       SUBSTRING(qt.TEXT, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(qt.TEXT)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS [Query_Text],
       qs.query_hash AS [QueryHash],
       qp.query_plan AS [QueryPlan]
INTO #TempCEStats
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) qt
INNER JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle
WHERE qt.text NOT LIKE '%sys.%'
  AND DB_NAME(qt.dbid) = 'tpcc'
ORDER BY [Avg_CPU_Time(µs)] DESC
OPTION (RECOMPILE);
GO
SQL Server 2014 CE Model Version 120 Benchmark Test
Preparation: Restore the tpcc database from the initial backup. Set the database compatibility level to 120 so the query optimiser will use the new cardinality estimator model version 120 when compiling queries for this database. Flush the plan cache and buffer pool prior to each workload test to ensure a consistent starting point.
RESTORE DATABASE [tpcc]
FROM DISK = N'C:\Temp\tpcc.bak'
WITH FILE = 1, NOUNLOAD, REPLACE, STATS = 5
GO
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 120
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO
Clear the PerfMon screen, re-start the PerfMon counters, wait 30 seconds and then start the HammerDB workload once again. Once the workload is complete, run the below T-SQL once again to insert the contents of the plan cache into the same temporary table for later analysis:
INSERT INTO #TempCEStats
SELECT DB_NAME(qt.dbid) AS [Database],
       cp.objtype AS [Cached_Plan_Type],
       cp.usecounts AS [Use_Counts],
       qp.query_plan.value('(//@CardinalityEstimationModelVersion)[1]', 'INT') AS [CE_Version],
       CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) AS [Estimated_Rows],
       qs.last_rows AS [Last_Rows],
       CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) - (qs.last_rows) AS [Estimate_Skew],
       qs.total_logical_reads / qs.execution_count AS [Avg_Logical_Reads],
       CAST((qs.total_elapsed_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_Execution_Time(µs)],
       CAST((qs.total_worker_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_CPU_Time(µs)],
       SUBSTRING(qt.TEXT, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(qt.TEXT)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS [Query_Text],
       qs.query_hash AS [QueryHash],
       qp.query_plan AS [QueryPlan]
--INTO #TempCEStats
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) qt
INNER JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle
WHERE qt.text NOT LIKE '%sys.%'
  AND DB_NAME(qt.dbid) = 'tpcc'
ORDER BY [Avg_CPU_Time(µs)] DESC
OPTION (RECOMPILE);
GO
Analysing Results: A quick glance at the PerfMon graph below shows that CPU usage has increased significantly and batch requests/sec has decreased.
Plug the PerfMon and HammerDB metrics into the table for a better comparison:
It’s clear that the HammerDB synthetic workload performs better under the legacy cardinality estimator than under the new cardinality estimator model version 120. What tools are available for analysing how individual queries or stored procedures perform? Extended Events or Profiler are options, but for this demo we are going to use the various DMOs, and fortunately we’ve already been inserting query statistics into a temporary table, #TempCEStats. Since we know that there was a significant increase in CPU usage during the second workload test, issue the below query against the temporary table to filter on this metric:
SELECT * FROM #TempCEStats ORDER BY [Avg_CPU_Time(µs)] DESC
It is immediately apparent that the most resource-intensive query was the same query across both workload tests; note that the query hash is consistent too. It is also apparent that this query performs worse under the new cardinality estimator model version 120. To understand why this particular query behaves differently under the different cardinality estimators, we’ll need to look at the actual query and its execution plans.
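Because the query hash is stable across runs, you can also line the two benchmarks up side by side straight from the temporary table. A sketch, using only the columns captured above:

-- Compare statements captured under both CE model versions, matched on query hash
SELECT [QueryHash],
       [CE_Version],
       [Use_Counts],
       [Avg_Logical_Reads],
       [Avg_CPU_Time(µs)],
       [Avg_Execution_Time(µs)]
FROM #TempCEStats
WHERE [QueryHash] IN (SELECT [QueryHash]
                      FROM #TempCEStats
                      GROUP BY [QueryHash]
                      HAVING COUNT(DISTINCT [CE_Version]) > 1)
ORDER BY [QueryHash], [CE_Version];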
Looking at the information in #TempCEStats and the execution plans, the problematic query below belongs to the SLEV stored procedure. The execution plans are as follows:
This excerpt, taken from my previous blog post, is very relevant here:
“Good cardinality estimation (row count expectations at each node of the logical tree) is vital; if these numbers are wrong, all later decisions are affected. Cardinality estimates are a major factor in deciding which physical operator algorithms are used, the overall plan shape (join orders etc) and ultimately they determine the final query plan that executes. Given these critical plan choices, when the cardinality estimation process contains a significantly skewed assumption, this can lead to an inefficient plan choice. This can, in turn, result in degraded performance.”
The different assumption models and algorithms used by the new cardinality estimator model will lead to different cardinality estimates in certain scenarios and this is apparent here in the form of a completely different execution plan shape. Under the legacy CE, a serial plan with an index seek and RID lookup using a nested loop join was used. Under the new SQL Server 2014 CE, a parallel plan with a table scan and hash match join was used.
SQL Server 2014 Potential Solutions: In my previous blog post I showed how Trace Flag 9481 can be used in SQL Server 2014 and SQL Server 2016 at various scopes to force the legacy CE. In our scenario, the HammerDB synthetic workload performs much the same under the different cardinality estimators, with the exception of the SLEV stored procedure. The recommendation here would therefore be to keep using the new SQL Server 2014 CE but force the legacy CE for the one specific query regression, as a temporary measure until the stored procedure can be tuned or rewritten so it performs well without the hint. This can be done by setting Trace Flag 9481 at the individual statement level within the stored procedure, assuming we have permission to edit the stored procedure code.
Let’s modify the stored procedure and re-run the HammerDB workload:
RESTORE DATABASE [tpcc]
FROM DISK = N'C:\Temp\tpcc.bak'
WITH FILE = 1, NOUNLOAD, REPLACE, STATS = 5
GO
USE [tpcc]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[SLEV]
    @st_w_id int,
    @st_d_id int,
    @threshold int
AS
BEGIN
    DECLARE @st_o_id int, @stock_count int
    BEGIN TRANSACTION
    BEGIN TRY
        SELECT @st_o_id = DISTRICT.D_NEXT_O_ID
        FROM dbo.DISTRICT
        WHERE DISTRICT.D_W_ID = @st_w_id
          AND DISTRICT.D_ID = @st_d_id

        SELECT @stock_count = count_big(DISTINCT STOCK.S_I_ID)
        FROM dbo.ORDER_LINE, dbo.STOCK
        WHERE ORDER_LINE.OL_W_ID = @st_w_id
          AND ORDER_LINE.OL_D_ID = @st_d_id
          AND (ORDER_LINE.OL_O_ID < @st_o_id)
          AND ORDER_LINE.OL_O_ID >= (@st_o_id - 20)
          AND STOCK.S_W_ID = @st_w_id
          AND STOCK.S_I_ID = ORDER_LINE.OL_I_ID
          AND STOCK.S_QUANTITY < @threshold
        OPTION (QUERYTRACEON 9481)

        SELECT @st_o_id AS N'@st_o_id', @stock_count AS N'@stock_count'
    END TRY
    BEGIN CATCH
        SELECT ERROR_NUMBER() AS ErrorNumber,
               ERROR_SEVERITY() AS ErrorSeverity,
               ERROR_STATE() AS ErrorState,
               ERROR_PROCEDURE() AS ErrorProcedure,
               ERROR_LINE() AS ErrorLine,
               ERROR_MESSAGE() AS ErrorMessage;
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
    END CATCH;
    IF @@TRANCOUNT > 0
        COMMIT TRANSACTION;
END
GO
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 120
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO
Analysing Results:
The results clearly show the effect that setting Trace Flag 9481 on the problematic statement in the SLEV stored procedure has had on the workload performance. CPU usage is back to what it was under the legacy cardinality estimator, and TPM, NOPM and batch requests/sec have all almost doubled. To confirm that the problematic statement is using the legacy cardinality estimator, open the query execution plan XML and search for CardinalityEstimationModelVersion="70", or scroll down the execution plan XML to the relevant statement and you will find this attribute.
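Alternatively, you can pull the attribute straight out of the cached plan rather than opening the XML by hand. A sketch below follows the same XQuery pattern used in the #TempCEStats query, but lists the CE model version per statement so the hinted statement stands out:

-- Report the cardinality estimation model version used by each statement
-- in the cached plan(s) for the SLEV stored procedure
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT OBJECT_NAME(qt.objectid, qt.dbid) AS [Object],
       stmt.value('@StatementText', 'nvarchar(max)') AS [Statement_Text],
       stmt.value('@CardinalityEstimationModelVersion', 'int') AS [CE_Version]
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) qt
CROSS APPLY qp.query_plan.nodes('//StmtSimple') AS s(stmt)
WHERE qt.dbid = DB_ID('tpcc')
  AND OBJECT_NAME(qt.objectid, qt.dbid) = 'SLEV';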
In the scenario where you can’t modify the stored procedure, you could use a plan guide to achieve the same effect, as sketched below.
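A rough sketch of what that plan guide might look like. The plan guide name below is made up, and the @stmt text must match the statement inside the unmodified procedure exactly (whitespace included) or the guide will not be matched at compile time:

-- Hypothetical plan guide attaching OPTION (QUERYTRACEON 9481) to the
-- stock-count statement in dbo.SLEV without editing the procedure itself
EXEC sys.sp_create_plan_guide
    @name            = N'PG_SLEV_LegacyCE',   -- made-up name
    @stmt            = N'SELECT @stock_count = count_big(DISTINCT STOCK.S_I_ID) FROM dbo.ORDER_LINE, dbo.STOCK WHERE ORDER_LINE.OL_W_ID = @st_w_id AND ORDER_LINE.OL_D_ID = @st_d_id AND (ORDER_LINE.OL_O_ID < @st_o_id) AND ORDER_LINE.OL_O_ID >= (@st_o_id - 20) AND STOCK.S_W_ID = @st_w_id AND STOCK.S_I_ID = ORDER_LINE.OL_I_ID AND STOCK.S_QUANTITY < @threshold',
    @type            = N'OBJECT',
    @module_or_batch = N'dbo.SLEV',
    @params          = NULL,
    @hints           = N'OPTION (QUERYTRACEON 9481)';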
SQL Server 2016 CE Model Version 130 Benchmark Test
To complete the picture, let’s run the same workload test against the new SQL Server 2016 CE and collect the same metrics.
Preparation: Restore the tpcc database from the backup taken earlier prior to the first workload test. Set the database compatibility level to 130 so the query optimiser will use the new SQL Server 2016 cardinality estimator model version 130 when compiling queries for this database. Flush the plan cache and buffer pool prior to each workload test to ensure a consistent starting point.
RESTORE DATABASE [tpcc]
FROM DISK = N'C:\Temp\tpcc.bak'
WITH FILE = 1, NOUNLOAD, REPLACE, STATS = 5
GO
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 130
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO
Clear the PerfMon screen, re-start the PerfMon counters, wait 30 seconds and then start the HammerDB workload once again. Once the workload is complete, run the below T-SQL once again to insert the contents of the plan cache into the same temporary table for later analysis:
INSERT INTO #TempCEStats
SELECT DB_NAME(qt.dbid) AS [Database],
       cp.objtype AS [Cached_Plan_Type],
       cp.usecounts AS [Use_Counts],
       qp.query_plan.value('(//@CardinalityEstimationModelVersion)[1]', 'INT') AS [CE_Version],
       CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) AS [Estimated_Rows],
       qs.last_rows AS [Last_Rows],
       CONVERT(float, (qp.query_plan.value('(//@EstimateRows)[1]', 'varchar(128)'))) - (qs.last_rows) AS [Estimate_Skew],
       qs.total_logical_reads / qs.execution_count AS [Avg_Logical_Reads],
       CAST((qs.total_elapsed_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_Execution_Time(µs)],
       CAST((qs.total_worker_time) / qs.execution_count AS DECIMAL(28, 2)) AS [Avg_CPU_Time(µs)],
       SUBSTRING(qt.TEXT, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(qt.TEXT)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS [Query_Text],
       qs.query_hash AS [QueryHash],
       qp.query_plan AS [QueryPlan]
--INTO #TempCEStats
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) qt
INNER JOIN sys.dm_exec_query_stats qs ON qs.plan_handle = cp.plan_handle
WHERE qt.text NOT LIKE '%sys.%'
  AND DB_NAME(qt.dbid) = 'tpcc'
ORDER BY [Avg_CPU_Time(µs)] DESC
OPTION (RECOMPILE);
GO
Analysing Results:
Putting the metrics into the table below, we can see that the HammerDB workload as a whole performs slightly better under the SQL Server 2016 CE as opposed to the SQL Server 2014 CE.
The evidence above is reinforced when querying the #TempCEStats temporary table: the single problematic query also performs slightly better under the SQL Server 2016 CE than under the SQL Server 2014 CE, but still not as well as under the Legacy CE. Looking at the execution plan, it’s unsurprising to see the same parallel plan with a table scan and hash match join.
The next step now would be to test the HammerDB workload against the SQL Server 2016 CE with trace flag 9481 added to the problematic statement within the SLEV stored procedure but I’ll leave that as an exercise for the reader.
Tracking Query Regressions The New Way (Feat. Query Store)
For this demo, the setup and approach will be slightly different. The scenario this time is that the tpcc database will be upgraded to the latest database compatibility level 130 but we want to continue using the legacy CE until given the go ahead to switch to the SQL Server 2016 CE. In SQL Server 2016, this can be achieved using the following methods:
- Trace Flag: Executing ‘DBCC TRACEON (9481, -1);’ will enable this trace flag globally and will affect all databases on the SQL Server instance.
- Database Scoped Configuration: Executing ‘ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;’ will only affect the database it is executed under.
Using the database scoped configuration method is the better option and is what we’ll use in this demo.
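Once the setting has been changed, it can be confirmed from the sys.database_scoped_configurations catalog view, as sketched below:

USE [tpcc]
GO
-- LEGACY_CARDINALITY_ESTIMATION shows a value of 1 when the legacy CE is forced
SELECT [name], [value]
FROM sys.database_scoped_configurations
WHERE [name] = N'LEGACY_CARDINALITY_ESTIMATION';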
Preparation: Restore the tpcc database from the initial backup. Set the database compatibility level to 130 and turn on the Legacy CE database scoped configuration so the query optimiser will use the legacy CE when compiling queries for this database. Enable Query Store, as it is not active for new databases by default. Flush the plan cache and buffer pool.
RESTORE DATABASE [tpcc]
FROM DISK = N'C:\Temp\tpcc.bak'
WITH FILE = 1, NOUNLOAD, REPLACE, STATS = 5
GO
ALTER DATABASE [tpcc] SET COMPATIBILITY_LEVEL = 130
GO
USE [tpcc]
GO
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON
GO
ALTER DATABASE [tpcc] SET QUERY_STORE = ON
GO
ALTER DATABASE [tpcc] SET QUERY_STORE (OPERATION_MODE = READ_WRITE, DATA_FLUSH_INTERVAL_SECONDS = 900, INTERVAL_LENGTH_MINUTES = 10)
GO
DBCC FREEPROCCACHE
GO
DBCC DROPCLEANBUFFERS
GO
Run the HammerDB workload and put the results into the table below. Next, turn off the Legacy CE database scoped configuration with the below T-SQL. Run another HammerDB workload and enter the results into the table below.
USE [tpcc]
GO
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = OFF
GO
Analysing Results: Looking at the HammerDB metrics above, we can see once again the throughput of our HammerDB workload has decreased significantly when switching to the new SQL Server 2016 CE. Instead of trawling through the various DMOs, let’s take a look at Query Store. Expand the tpcc database in object explorer, navigate to ‘Query Store’ and open ‘Top Resource Consuming Queries’. Filtering on Total CPU Time or Total Duration in the top left pane we can clearly see the same problematic query.
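If you prefer T-SQL to the built-in report, roughly the same list can be pulled from the Query Store catalog views. The sketch below ranks statements by average CPU time (Query Store stores durations and CPU time in microseconds):

-- Top resource consuming statements straight from the Query Store catalog views
SELECT TOP (10)
       q.query_id,
       p.plan_id,
       qt.query_sql_text,
       SUM(rs.count_executions) AS [Executions],
       AVG(rs.avg_cpu_time) AS [Avg_CPU_Time(µs)],
       AVG(rs.avg_duration) AS [Avg_Duration(µs)]
FROM sys.query_store_runtime_stats rs
INNER JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
INNER JOIN sys.query_store_query q ON q.query_id = p.query_id
INNER JOIN sys.query_store_query_text qt ON qt.query_text_id = q.query_text_id
GROUP BY q.query_id, p.plan_id, qt.query_sql_text
ORDER BY AVG(rs.avg_cpu_time) DESC;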
Looking at the top right pane, we can see there are three execution plans for this query, and it’s clear that the execution plan depicted by the top-right yellow circle performs the worst. To get more detailed information on the performance of the individual execution plans, select the ‘View Plan Summary In A Grid Format’ button highlighted in red below. Selecting the individual plan_ids will show the graphical execution plan in the bottom pane. The execution plan with plan_id 49 is the most efficient and is the one we will force using Query Store.
Switching back to the chart format, I will select plan_id 90 and plan_id 49 and then compare the execution plans side by side. Looking at the execution plans and plan properties below, we can see the poorly performing plan at the top with plan_id 90, is a parallel plan with a table scan and hash match join compiled using the SQL Server 2016 CE. The more performant plan at the bottom with plan_id 49 is a serial plan with an index seek and RID lookup using a nested loop join compiled using the Legacy CE. The query regression is obvious here. Query Store has made it very easy for us to visually identify that this particular query performs better under the Legacy CE.
To force plan_id 49, select the plan, click the ‘Force Plan’ button highlighted below and accept the confirmation pop-up.
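The same thing can be done in T-SQL if you are not working in SSMS. A sketch, assuming plan_id 49 from the report above (the query_id value below is a placeholder; take it from the lookup query first):

-- Find the query that owns the plan we want to force
SELECT p.plan_id, p.query_id, qt.query_sql_text
FROM sys.query_store_plan p
INNER JOIN sys.query_store_query q ON q.query_id = p.query_id
INNER JOIN sys.query_store_query_text qt ON qt.query_text_id = q.query_text_id
WHERE p.plan_id = 49;

-- Force plan 49 for that query (query_id 42 is illustrative; use the value returned above)
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 49;

-- To remove the forced plan later:
-- EXEC sys.sp_query_store_unforce_plan @query_id = 42, @plan_id = 49;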
Kick off another HammerDB workload test and enter the results into the table below:
Using Query Store to force the good plan has resulted in the best workload throughput. The PerfMon graph for this series of tests would look like this:
Conclusion
Query Store provides insight on query plan choice and performance. It makes this information easily accessible and it simplifies performance troubleshooting by helping you quickly find performance differences caused by query plan changes. Prior to Query Store, troubleshooting query regressions was more challenging and relied on manual analysis of the plan cache and various DMVs.
Query Store enables the database administrator to directly affect query performance by forcing execution plans for specific queries. Prior to Query Store, this type of manual intervention was achieved through plan guides and trace flags, which was a much more involved process and often required modifying code.
The key points of this blog post were to show how useful Query Store can be in a SQL Server DBA’s tool belt and to highlight the importance of testing application workloads against the new cardinality estimator models prior to upgrading to SQL Server 2014 or SQL Server 2016.
Resources & Further Reading
I highly recommend reading @SQL_JGood’s excellent posts on this topic. I built upon his examples and added what I have learned recently out in the real world.
Enrico Van De Laar ( @evdlaar ) has a great series on Query Store on the Simple-Talk site for anyone wanting a deep look into this feature.
In conclusion the new CE releases are worse than previous versions?
That’s quite a conclusion to draw from this contrived example AleD! If anything, hopefully this post highlighted to you the importance of testing when upgrading to newer versions of SQL Server and using the new cardinality estimator.