Troubleshooting Query Plan Regressions guide #20893

Draft

bsanchez-the-roach wants to merge 1 commit into main from DOC-15152

+147 −0

Contributor

bsanchez-the-roach commented Oct 31, 2025

This is a first draft. I definitely want a review for accuracy, since I'm still pretty new to this product.

There's an unfinished section at the very bottom. I've left a note there and am looking for some guidance.

Happy to iterate on this more; I just want to get eyes on it.


Troubleshooting Query Plan Regressions guide

eeb57a9

netlify bot commented Oct 31, 2025

✅ Deploy Preview for cockroachdb-api-docs canceled.

Name	Link
🔨 Latest commit	`eeb57a9`
🔍 Latest deploy log	https://app.netlify.com/projects/cockroachdb-api-docs/deploys/6904e0cac8609900081c481b

netlify bot commented Oct 31, 2025

✅ Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name	Link
🔨 Latest commit	`eeb57a9`
🔍 Latest deploy log	https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/6904e0ca4f3af500084b7905

github-actions bot commented Oct 31, 2025

Files changed:

src/current/_includes/v25.4/sidebar-data/troubleshooting.json
src/current/images/v25.4/troubleshooting-query-plan-regressions-1.png
src/current/v25.4/troubleshoot-query-plan-regressions.md

bsanchez-the-roach marked this pull request as draft

October 31, 2025 16:16

bsanchez-the-roach requested review from kevin-v-ngo and mwang1026

October 31, 2025 16:17

netlify bot commented Oct 31, 2025

✅ Netlify Preview

Name	Link
🔨 Latest commit	`eeb57a9`
🔍 Latest deploy log	https://app.netlify.com/projects/cockroachdb-docs/deploys/6904e0cac6bfe700086d1ef9
😎 Deploy Preview	https://deploy-preview-20893--cockroachdb-docs.netlify.app

rytaft reviewed

Contributor

rytaft left a comment

This is really great! Thank you for doing this! I left a few suggestions, and I bet @yuzefovich may have some more.

src/current/v25.4/troubleshoot-query-plan-regressions.md


		## Query plan regressions vs. suboptimal plans

		The DB Console's [Insights page]({% link {{page.version.version}}/ui-insights-page.md %}) keeps track of [suboptimal plans]({% link {{page.version.version}}/ui-insights-page.md %}#suboptimal-plan). A suboptimal plan is a query plan whose execution time exceeds a certain threshold (configurable with the `sql.insights.latency_threshold` cluster setting) and whose slow execution has caused CockroachDB to generate an index recommendation for the table. Table statistics that were once valid, but which are now stale, can lead to a suboptimal plan scenario. A suboptimal plan scenario does not imply that the query plan has changed, and in fact a failure to change the query plan is often the root problem. The Insights page identifies these scenarios and provides recommendations on how to fix them.

Contributor

rytaft Nov 3, 2025

I know the Insights page mentions outdated statistics as a possible cause of a suboptimal plan, but I think it's confusing two different concepts. I could be wrong, but the current Insights page seems entirely focused on index recommendations. As far as I can tell, whether or not stats are stale doesn't currently show up as one of the insights. I think it's good for you to mention this, but I'd make it a separate paragraph.

Member

yuzefovich Nov 4, 2025

Yes, the "suboptimal plan" insight only means that we generated an index recommendation which, if applied, will lead to a better plan than the currently chosen one. It's completely independent of the reason why we don't choose that better plan right now (chances are it's not stale statistics, but rather a missing index).
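
For reference, the threshold named in the quoted paragraph can be inspected or changed from a SQL shell. This is only a minimal sketch, and the `250ms` value is an illustrative assumption, not a recommendation:

```sql
-- Show the current latency threshold used for insights.
SHOW CLUSTER SETTING sql.insights.latency_threshold;

-- Change the threshold. '250ms' is an arbitrary example value.
SET CLUSTER SETTING sql.insights.latency_threshold = '250ms';
```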

src/current/v25.4/troubleshoot-query-plan-regressions.md


		A query plan regression occurs when the cost-based optimizer chooses an optimal query plan, but then later it changes that query plan to a less optimal one. It is not the same thing as a suboptimal plan, though it is possible that the conditions that triggered a suboptimal plan insight were caused by a query plan regression.

Contributor

rytaft Nov 3, 2025

It is not the same thing as a suboptimal plan

This doesn't seem necessary to mention (and might be confusing). A regression is definitely a suboptimal plan, but maybe it just doesn't fit the definition used by the insights page.

src/current/v25.4/troubleshoot-query-plan-regressions.md


		- [Understand how the cost-based optimizer chooses query plans]({% link {{page.version.version}}/cost-based-optimizer.md %}) based on table statistics, and how those statistics are refreshed.

		## Query plan regressions vs. suboptimal plans

Contributor

rytaft Nov 3, 2025

This section seems a bit too focused on the technicality of what the Insights page currently supports. I think it's worth mentioning that the Insights page can help, but I'm not sure you need to distinguish between plan regressions vs. suboptimal plans.

Member

yuzefovich Nov 4, 2025

I agree with Becca on this. This section seems confusing to me in the current form. "Slow execution" and "suboptimal plan" insights might be good starting points for troubleshooting an unsatisfactory latency for a given query, yet neither necessarily confirms / disproves that this query has experienced a query plan regression.

Perhaps a better way to include the information about the insights would be to have just a single sentence in the "Before you begin" section to indicate that the "suboptimal plan" insight might help with identifying / understanding the query plan regression. I'd probably omit the mention of the "slow execution" insight altogether since it doesn't give much useful signal for query plan regressions. After all, an execution time that exceeds the threshold controlled by the cluster setting could be the best we can do.

src/current/v25.4/troubleshoot-query-plan-regressions.md


		Though these two scenarios are conceptually different, both scenarios will likely require an update to the problematic query plan.

		## What to look out for

Contributor

rytaft Nov 3, 2025

This section is great!

Member

yuzefovich Nov 4, 2025

+1

src/current/v25.4/troubleshoot-query-plan-regressions.md

+. If you've already identified specific time intervals in Step 1, you can use the time interval selector to create a custom time interval. Click **Apply**.
+. If there is only one plan in the resulting table, there was only one plan used for this statement fingerprint during this time interval, and therefore a query plan regression could not have occurred. If there are multiple plans listed in the resulting table, the query plan changed within the given time interval. By default, the table is sorted from most recent to least recent query plan. Compare the **Average Execution Time** of the different plans.
+              If a plan in the table has a significantly higher average execution time than the one that preceded it, it's possible that this is a query plan regression. It's also possible that the increase in latency is coincidental, or that the plan change was not the actual cause. For example, if the average execution time of the latest query plan is significantly higher than the average execution time of the previous query plan, this could be explained by a significant increase in the **Average Rows Read** column.

Contributor

rytaft Nov 3, 2025

An increase in Average Rows Read could indicate a query plan regression, since it's possible that the bad query plan is scanning more rows than it should.

But as I think you're intending to show, an increase in Average Rows Read could also indicate that more data was added to the table. It's probably worth mentioning both possibilities here.

Member

yuzefovich Nov 4, 2025

To me it seems more likely that a significant increase (like an order of magnitude growth) in Average Rows Read is actually due to a plan regression rather than table size growth, since we're comparing two plans for the given query fingerprint that presumably were executed close together in time. I agree though that both are possibilities.
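
One rough way to tell the two possibilities apart is to check whether the table's row count actually grew between statistics collections. A sketch, assuming a hypothetical table named `orders` whose primary key column is `id`:

```sql
-- Each row is one statistics collection that the cluster still retains
-- for the primary key column. A roughly flat row_count alongside a jump
-- in Average Rows Read points toward a plan change rather than table growth.
SELECT created, row_count
FROM [SHOW STATISTICS FOR TABLE orders]
WHERE column_names = ARRAY['id']
ORDER BY created DESC;
```

This only covers the window for which statistics are still retained, so it is a sanity check rather than proof.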

src/current/v25.4/troubleshoot-query-plan-regressions.md

+. In the **Explain Plans** tab, click on the Plan Gist of the more recent plan to see it in more detail.
+. Click on **All Plans** above to return to the list of plans.
+. Click on the Plan Gist of the previous plan to see it in more detail. Compare the two plans to understand what changed. Do the plans use different indexes? Are they scanning the different portions of the table? Do they use different join strategies?

Contributor

rytaft Nov 3, 2025

nit: the different portions -> different portions

src/current/v25.4/troubleshoot-query-plan-regressions.md


		#### Determine if a literal in the SQL statement has changed

		[NOTE FROM BRANDON: I need more information on this case, mainly how to identify that this is the case, and what to do about it.]

Contributor

rytaft Nov 3, 2025

I'm not sure there is a good way to determine this without collecting a conditional statement bundle for a slow execution of the statement fingerprint (unless the DB operator happens to know that the application is using a new value for a particular placeholder). Maybe @yuzefovich has another idea?

Member

yuzefovich Nov 4, 2025

Oof, yeah, this is a hard one. The tutorial so far assumes that there is a single good plan for a query fingerprint that might have regressed, but it's actually possible that multiple plans are good, depending on the values of placeholders ("literals").

Here is an example of two different optimal plans (although they do look similar):

```sql
CREATE TABLE small (k INT PRIMARY KEY, v INT);
CREATE TABLE large (k INT PRIMARY KEY, v INT, INDEX (v));
INSERT INTO small SELECT i, i FROM generate_series(1, 10) AS g(i);
INSERT INTO large SELECT i, 1 FROM generate_series(1, 10000) AS g(i);
ANALYZE small;
ANALYZE large;
-- this scans `large` on the _left_ side of merge join
EXPLAIN SELECT * FROM small INNER JOIN large ON small.v = large.v AND small.v = 1;
-- this scans `large` on the _right_ side of merge join
EXPLAIN SELECT * FROM small INNER JOIN large ON small.v = large.v AND small.v = 2;
```

Complicating things, we deal with query fingerprints internally, so all such constants are removed from our observability tooling. If there were an escalation saying that a particular query fingerprint is occasionally slow, then like Becca I'd ask for a conditional statement bundle, and I'd play around locally with different placeholder values to see whether multiple plans could be chosen based on concrete values. But so far we've used statement bundles mostly as internal tooling (for the Queries team in particular and Cockroach Labs support in general), so I'd probably not mention going down this route.

Instead, I'd consider suggesting that users look at the application side to see whether the literal has changed, or something like that.
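
If the application-side value is known, one way to see why it might matter is to look at how the filtered column's values are distributed. A sketch against the `large` table from the example above (note that this scans the whole table, so it may be expensive on real data):

```sql
-- Count rows per value of v. In the example data every row has v = 1,
-- so a filter on v = 1 matches the whole table while v = 2 matches nothing.
SELECT v, count(*) AS row_count
FROM large
GROUP BY v
ORDER BY row_count DESC
LIMIT 10;
```

A heavily skewed distribution like this is where different literal values are most likely to produce different plans.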

src/current/v25.4/troubleshoot-query-plan-regressions.md


		[NOTE FROM BRANDON: I need more information on this case, mainly how to identify that this is the case, and what to do about it.]

		If you suspect that the query plan change is the cause of the latency increase, and you suspect that the query plan changed due to a changed query literal, [what should you do]

Contributor

rytaft Nov 3, 2025

what should you do

The likely problem is that the table statistics don't accurately reflect how this value is represented in the data. This can be fixed by running `ANALYZE <table>` to refresh the statistics for the table. It's also possible that a good index isn't available; to check, look at the index recommendations shown when you `EXPLAIN` the query, or on the Insights page. If none of these options fixes the issue, a more drastic redesign of the schema or application may be needed.
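
As a sketch of that sequence, using the `small` and `large` tables from the earlier example as stand-ins for the affected tables:

```sql
-- Refresh table statistics so the optimizer sees the current data distribution.
ANALYZE small;
ANALYZE large;

-- Re-check the plan for the literal value that was slow.
-- The EXPLAIN output may also include index recommendations.
EXPLAIN SELECT * FROM small INNER JOIN large ON small.v = large.v AND small.v = 2;
```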

yuzefovich reviewed

Member

yuzefovich left a comment

Nice, glad to see this work!

src/current/v25.4/troubleshoot-query-plan-regressions.md


		One way of tracking down query plan regressions is to identify SQL statements whose executions are relatively high in latency. Use one or both of the following methods to identify queries that might be associated with a latency increase.

		#### Use workload insights

Member

yuzefovich Nov 4, 2025

As I mentioned in another comment, my understanding of the "slow execution" and "suboptimal plan" insights is that they cannot really be used to find or troubleshoot query plan regressions, so I'd remove the "Use workload insights" approach altogether.

That said, it might be worth reaching out to TSEs / EEs to check whether their experience matches my understanding.

src/current/v25.4/troubleshoot-query-plan-regressions.md

+. Among the resulting Statement Fingerprints, look for those with high latency. Click on the column headers to sort the results by **Statement Time** or **Max Latency**.
+. Click on the Statement Fingerprint to go to the page that details the statement and its executions.
+              {{site.data.alerts.callout_success}}
+              Look for statements whose **Execution Count** is high. Statements that are run once, such as import statements, aren't likely to be the cause of increased latency due to query plan regressions.

Member

yuzefovich Nov 4, 2025

nit: capitalize IMPORT and perhaps link to the IMPORT docs page.

src/current/v25.4/troubleshoot-query-plan-regressions.md

+. Go to the [**SQL Activity** page]({% link {{page.version.version}}/ui-overview.md %}#sql-activity) in the DB Console.
+. If you've already identified specific time intervals in Step 1, you can use the time interval selector to create a custom time interval. Click **Apply**.
+. Among the resulting Statement Fingerprints, look for those with high latency. Click on the column headers to sort the results by **Statement Time** or **Max Latency**.

Member

yuzefovich Nov 4, 2025

I'd also mention Rows Processed as a possible column to sort by. Often, when a plan regression occurs, we end up scanning more rows than before.
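
For readers who prefer SQL to the DB Console, a roughly equivalent ranking can be sketched against `crdb_internal.statement_statistics`. The JSON field names used here (`cnt`, `rowsRead`, `svcLat`) and the one-day window are assumptions based on how that table is commonly queried, so verify them against your version before relying on this:

```sql
-- Rank recent statement fingerprints and plans by mean rows read.
-- Field names inside the statistics JSON are assumptions; check them
-- against the crdb_internal documentation for your version.
SELECT
  encode(fingerprint_id, 'hex') AS fingerprint,
  encode(plan_hash, 'hex') AS plan,
  metadata ->> 'query' AS statement,
  (statistics -> 'statistics' ->> 'cnt')::INT AS executions,
  (statistics -> 'statistics' -> 'rowsRead' ->> 'mean')::FLOAT AS avg_rows_read,
  (statistics -> 'statistics' -> 'svcLat' ->> 'mean')::FLOAT AS avg_latency_seconds
FROM crdb_internal.statement_statistics
WHERE aggregated_ts > now() - INTERVAL '1 day'
ORDER BY avg_rows_read DESC
LIMIT 20;
```

Two rows with the same fingerprint but different plan values and very different `avg_rows_read` would be candidates for a closer look.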

src/current/v25.4/troubleshoot-query-plan-regressions.md

+              #### Determine if the table indexes changed
+. Look at the **Used Indexes** column for the older and the newer query plans. If these aren't the same, it's likely that the creation or deletion of an index resulted in a change to the statement's query plan.
+. In the **Explain Plans** tab, click on the Plan Gist of the more recent plan to see it in more detail. Identify the table used in the initial "scan" step of the plan.

Member

yuzefovich Nov 4, 2025

nit: s/table/tables/ - it's possible that we have initial scans of multiple tables.
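
The current index definitions can also be checked from SQL. A sketch, using `large` from the example in this review as a stand-in for whichever tables the plan scans:

```sql
-- List the indexes that currently exist on the table.
SHOW INDEXES FROM large;

-- Show the full table definition, including index definitions.
SHOW CREATE TABLE large;
```

These statements only show the current state, so a dropped index will simply be absent from the output.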
