A PostgreSQL planner semi-join gotcha with CTE, LIMIT, and RETURNING (shayon.dev)
55 points by namanyayg 1 day ago | 49 comments

This really has nothing to do with CTEs.

The SQL standard does not specify how many times a subquery must or can be run. In this case, Postgres is free to run the uncorrelated subquery many times. Unfortunately, the subquery result is not idempotent.

The real solution is not to rely on the optimizer for the correctness of your query. The query should be logically correct, irrespective of the query plan, right? The correct solution would be to ensure that the subquery runs only once, using a materialized CTE or a temp table. Do not base your "solution" on the discovery that, for this particular version of Postgres, the planner runs this type of subquery once for the '=' operator and multiple times for 'IN'.
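For instance, a temp-table version (just a sketch, reusing the post's table and predicate, and assuming id is a bigint) pins the picked row before the DELETE ever runs:

    BEGIN;
    -- Capture (and lock) the candidate row exactly once.
    CREATE TEMP TABLE picked (id bigint) ON COMMIT DROP;
    INSERT INTO picked
      SELECT id FROM task_queue
      WHERE queue_group_id = 15
      LIMIT 1
      FOR UPDATE SKIP LOCKED;
    -- The DELETE can now only ever see the ids captured above.
    DELETE FROM task_queue
    WHERE id IN (SELECT id FROM picked)
    RETURNING item_id;
    COMMIT;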


Based on my understanding, your comment is the most accurate in this thread.

However, I think it's telling that almost every single comment here has a different understanding of how PostgreSQL works, and nobody can link to any documentation that conclusively addresses this case. I think that indicates that this is a failing if not of implementation, then of documentation.

For example: another comment says that `=` forces PostgreSQL to evaluate the subquery exactly once. I couldn't find any justification for that statement. Is it guaranteed? I don't know.

Another comment says "if only the subquery was ordered then the result would be consistent", but AFAIK, this is not true in the presence of "SKIP LOCKED".

I think the problem stems from the fact that the SQL standard is written from an academic point of view: the ordering of execution of subqueries is not defined because it is irrelevant within the relational algebra, so optimizations which change the execution order are just that: optimizations, with no observable side effects.

Real database systems do not actually have the nice algebraic properties of the relational algebra... The same subquery may give different results over multiple executions, etc. Given this, it's important that the database system give sufficient and sufficiently clear guarantees such that they can still be used correctly, and so that the optimizations implemented by the system can be justified.


The comment about = is mine and I thought about rewriting that part of the comment after the fact. It is soooo hard to explain these things without writing a book.

The reason = means it runs once is that the outer query will only run once, and in this case that query, when using =, can only delete based on a single id. But if that outer query were itself a subquery in a context where it could be run more than once, you are back where you started. Hence me saying their fix was sort of incidental.


I think your logic is flawed (even if PostgreSQL may behave this way in practice).

First, let's make sure we're talking about the same two examples.

A:

    DELETE FROM task_queue
    WHERE id = (
      SELECT id FROM task_queue
      WHERE queue_group_id = 15
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
B:

    DELETE FROM task_queue
    WHERE id IN (
      SELECT id FROM task_queue
      WHERE queue_group_id = 15
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
You seem to be saying that B may exhibit the problem whilst A does not (i.e. there is a difference between using `=` vs `IN`). I would like to see some documentation justifying that.

Here's my logic:

- In both cases, the outer query is run once.

- In both cases the outer WHERE clause is evaluated for each row in `task_queue`.

- In both cases, it is up to the optimizer whether the result of the subquery is materialized or not.

- In both cases, if the subquery is not materialized, multiple rows in the outer query may match the condition.

In practice, it may be that the optimizer always materializes uncorrelated subqueries on the RHS of an `=` expression. My contention is whether that is a formal guarantee.


Most of what you say I agree with. But if this outer query is run only once with version A, there is a caveat where it says "where id = X".

This cannot match more than one X at a time. So that forces the inner query to be run once, as we can only have one id, and running it twice may produce 2.

I am not sure though to be honest.

We are too deep for me to reply now, but to your next comment I didn't mean only one row, but only one id. It is easy for a small difference in word choice to get things wrong.

I think if it did return two rows, i.e. LIMIT 2, the query with = will fail. Hmmm, maybe that will happen even with LIMIT 1 under certain plans. I wouldn't trust it.


Why would it only be run once? The WHERE condition of the outer query is run multiple times: once for each row, so of course it can return TRUE multiple times.

For example:

    DELETE FROM example WHERE id = (SELECT RANDOM(0, 100) FROM other_table LIMIT 1)
This could delete multiple rows in principle, since there may be multiple rows where the `=` expression is true.

After thinking about it more, yes I agree with your take. Maybe it can even delete different ids using an =.

Since I don't often write raw SQL, I can only assume the author named their CTE `deleted_tasks` to elucidate that the query might delete multiple items. Otherwise, it makes little sense, for they intended to "pop" a single row, and yet their aptly named `deleted_tasks` ended up removing more than one!

The query reads to me like a conceptual mish-mash. Without understanding what the innermost `SELECT` was meant to accomplish, I'd naturally interpret the `WHERE id IN (...)` as operating on a set. But the most sacrilegious aspect is the inclusion of `FOR UPDATE SKIP LOCKED`. It assumes a very specific execution order that the query syntax doesn't actually enforce.

Am I right to think that not avoiding lock contention, i.e. omitting `SKIP LOCKED` would have actually produced the intended result?


DELETE with an overly-broad operator in the WHERE clause and no explicit limit: check. Non-trivial subquery in the WHERE: check. This should not have passed code review, let alone made it to production.

I will give OP the benefit of the doubt and say that automated testing did not catch this because the optimisations depend on table statistics, and not because it was not appropriately covered.


I don’t know about “gotcha”. This sounds like a full blown planner bug to me.

I think arguably there is no bug here, but the blog doesn't do a good job of explaining the issue. Their fix may work because they remove the CTE and switch to `=`, but they prove they don't understand why there is a difference when they suggest "using id IN (...) might still be necessary"; if you do that, then the problem will return.

There are two factors here.

The subquery

    SELECT id FROM task_queue WHERE queue_group_id = 15 FOR UPDATE SKIP LOCKED LIMIT 1
can return different ids each time you run it. If it were ordered, then it would always return the same id, and if postgres optimized in a way that it runs more than once, it would just get the same result each time anyway.

Otherwise, you need to force postgres to evaluate your subquery exactly once by materializing it. There are different ways this might be accomplished - the blog post does this incidentally by using `=`. But it is not the only way to tell postgres that.

For instance, like this. But it is fragile - without AS MATERIALIZED, it could be run more than once.

    WITH candidate AS MATERIALIZED (
      SELECT id FROM task_queue WHERE queue_group_id = 15 FOR UPDATE SKIP LOCKED LIMIT 1
    )
    DELETE FROM task_queue t USING candidate c WHERE t.id = c.id
    RETURNING t.item_id;

That's correct. If you want to keep using a CTE, the guarantee can be achieved via something like this:

    WITH deleted_tasks AS MATERIALIZED (
      SELECT id
      FROM task_queue
      WHERE queue_group_id = 5
      AND status = 'pending'
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    DELETE FROM task_queue
    WHERE id IN (SELECT id FROM deleted_tasks)
    RETURNING item_id, id;


This is interesting - I don’t expect undefined behavior like this. Should the query delete more than one row or not?

The blog post doesn't really give answers - it isn't obvious to me that the second query can't be executed in the exact same way. They even cop to this fact:

> This structure appears less ambiguous to the planner and typically encourages it to evaluate the subquery first.

So then - their bug still exists maybe?

I have thoughts - probably we do expect that it should never delete multiple rows, but after considering the query plan, I can see it being ambiguous. But I would have expected Postgres to use only a single interpretation.


It's not undefined behavior, it's the fact that the uncorrelated subquery within the CTE doesn't specify an ordering, therefore it cannot be implicitly materialized / evaluated once. Postgres documentation is clear here [1]:

> If sorting is not chosen, the rows will be returned in an unspecified order.

The original query from TFA could've instead just had the uncorrelated subquery moved into a materialized CTE.

[1] https://www.postgresql.org/docs/current/queries-order.html


I see what you're saying, but it is very subtle, wouldn't you agree?

Under a different plan, the inner query would only be evaluated once; it is hard to mentally parse how it will first find the rows with the group id and then pass them into the subquery.

And still I am not sure how using a CTE or not in the manner in the post is supposed to avoid this issue, so now I’m a bit skeptical it does. I see how a sort would.

I hope if the sub query was its own CTE, the limit would be applied correctly, but am no longer sure… before this post I wouldn’t have questioned it.

Edit to say - now I see you need to explicitly use AS MATERIALIZED if you bump the subquery to a CTE. Someone should really write a better blog post on this… it raises an interesting case but fails to explain it… they probably have not even solved it for themselves.


That is correct and our curiosities are very much aligned here. I wrote this post in passing on a bus, so it def lacks some critical detail :D. I can very much solve this with AS MATERIALIZED like you pointed out, and as also mentioned in - https://news.ycombinator.com/item?id=43886522

I will look into a more thought out post perhaps. Thanks for sharing all the comments and feedback. Really enjoying the discussions.


Also, I believe using an `ORDER BY` is only as good as `=`: you are giving the planner a hint about the route to choose, but it doesn't necessarily solve the issue the way `AS MATERIALIZED` does. The issue being - it should only delete the number of rows, and return them, as authored by the value of `LIMIT`. So using `ORDER BY` with `LIMIT` will ensure the right set of IDs is returned, but it's not stopping the subquery from getting executed more than once.

It is still a good rule of thumb to pair your LIMITs with ORDER BY, though, yeah.
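For concreteness, the ordered variant under discussion would be something like this (a sketch):

    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED;

Repeated evaluations now at least agree on which unlocked row comes first; whether that is sufficient under SKIP LOCKED is exactly the contention upthread.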


It doesn't matter. The output of a query cannot depend on the plan. All plans should necessarily generate semantically equivalent output. Notice I say semantically equivalent because obviously SELECT random() can return different numbers each time. But it should still be, semantically, one single random number.

In this case, the number of rows changing given the same input dataset, is a bug.


It is allowed to be a planner choice though! Very surprising, but if you understand the docs, it will follow that it is not actually a bug. It could change in the future and has changed in the past apparently - but that change was to give the planner more opportunity to optimize queries by not materializing parts of the query and inlining that part into the parent.

Regarding selects: [0]: “A key property of WITH queries is that they are normally evaluated only once per execution of the primary query… However, a WITH query can be marked NOT MATERIALIZED to remove this guarantee. … By default, a side-effect-free WITH query is folded into the primary query if it is used exactly once in the primary query’s FROM clause.”

Regarding CTEs: [1]: “A useful property of WITH queries is that they are normally evaluated only once per execution of the parent query… However, the other side of this coin is that the optimizer is not able to push restrictions from the parent query down into a multiply-referenced WITH query"

Now, in either case - if you don't want the planner to inline the query - you might have to be explicit about it (since PostgreSQL 12), or otherwise - yes, the output of the query will depend on the plan, and this is allowed based on the docs.

[0]: https://www.postgresql.org/docs/current/sql-select.html

[1]: https://www.postgresql.org/docs/current/queries-with.html
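For concreteness, the explicit forms (available since PostgreSQL 12) look like this, reusing the post's table as an example:

    -- Forced materialization: evaluated once, acting as an optimization fence.
    WITH c AS MATERIALIZED (
      SELECT id FROM task_queue WHERE queue_group_id = 15 LIMIT 1
    )
    SELECT * FROM c;

    -- Explicitly allow the planner to fold the CTE into the parent query.
    WITH c AS NOT MATERIALIZED (
      SELECT id FROM task_queue WHERE queue_group_id = 15 LIMIT 1
    )
    SELECT * FROM c;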


The output of a query absolutely will depend on the plan. In ANSI SQL you're right it cannot (up to ordering), which is why ANSI SQL doesn't have anything like random() or nondeterministic user functions. But nearly all databases support something that is outside of that standard because it's genuinely useful, and then you get into poorly-defined situations that you need to deal with ad-hoc as an implementer.

E.g., for a very simple case, in SELECT * FROM a,b WHERE random() < 0.5, would you push random() down through the join or not? To one table? Both tables? Evaluate it once at the start of a query because it depends on neither a nor b? What if it's random() < a.x? Different databases have different and often poorly-defined semantics here.


I wrote this to a deleted comment, but even if the CTE was materialized, the subquery of the CTE would still not be...

For instance, with the stand alone query:

  DELETE … WHERE id IN (
    SELECT id … LIMIT 1 FOR UPDATE SKIP LOCKED
  )
  
the planner is free to turn that IN (subquery) into a nested-loop semi-join, re-executing the subquery as many times as it deems optimal. Therefore it can delete more than 1 row.
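You can check which shape the planner picked without actually running the statement (a sketch; plain EXPLAIN does not execute the DELETE, whereas EXPLAIN ANALYZE would):

    EXPLAIN
    DELETE FROM task_queue
    WHERE id IN (
      SELECT id FROM task_queue
      WHERE queue_group_id = 15
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    );

Roughly: a one-shot InitPlan in the output means a single execution of the subquery, while a join shape means it can be re-executed.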

More to the point: The SQL standard doesn't talk about semijoins at all, it conceptually evaluates the WHERE clause for each and every row. So the question is whether the subquery is guaranteed to give the same result each and every time. That would presumably depend on the transaction isolation level in use, except that SKIP LOCKED (which is nonstandard, from what I know) presumably calls off all bets anyway.

Why was the author not just doing RETURNING *?

No need for the CTE to select everything…


Could be a generational thing. There was a time not every query planner would properly eliminate columns based on the final projection and any intermediary copies required more I/O if you included everything. Even now I still find myself following "best practices" for things that haven't been relevant in a decade.

Hmm, it seems like the subquery is getting re-run on every single row of the task queue table, which seems like a performance issue in addition to correctness.

Personally, if I were going to put any part of that query in a CTE, it would have been (just) the select, as I generally prefer CTEs to subqueries.

I'm really not sure what motivated the author to create the CTE in the first place, as the final select seems to be essentially the identity function.


"I'm really not sure what motivated the author to create the CTE in the first place, as the final select seems to be essentially the identity function."

We're likely seeing a simplified example. All queries should be able to be embedded in a CTE without integrity loss.


I think the CTE is a red herring here; the problem is in the sub-query regardless of whether it's in a CTE or not.

Yes, it is a red herring for sure; using a limit without a sort in a sub-query is always an issue.

> it seems like the subquery is getting re-run on every single row of the task queue table

Y'all might find the term "correlated subquery" helpful here.
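An illustration, with hypothetical tables a and b:

    -- Uncorrelated: the subquery references nothing from the outer row.
    SELECT * FROM a WHERE a.x IN (SELECT b.x FROM b);

    -- Correlated: the subquery references the outer row (a.id), so it is
    -- conceptually re-evaluated once per row of a.
    SELECT * FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id);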

When I want them, I always get them. Sometimes I get them when I don't want them. Sometimes it's even my fault, not the result of the query planner looking at me sideways.


  DELETE FROM task_queue
  WHERE id = ( -- Use '=' for a single expected ID
    SELECT id FROM task_queue
    WHERE queue_group_id = 15
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING item_id;
I don't understand where that item_id value comes from, since that's not a column that's mentioned anywhere else in the query.

I guess it must be an unmentioned column on that task_queue table?


In DELETE ... RETURNING, RETURNING works like SELECT in a read query, so, yes, item_id is a column in task_queue and the result set is the item_id value of the row deleted.

Interesting to think about how to guard against these cases where query optimisation leads to unexpected results.

I mean, this could lead to serious bugs. What can be a way to detect these using linters or in CI before they hit production?


On a somewhat related note...

We've moved to MSSQL due to several reasons including customer demand.

We're experiencing the MSSQL query planner occasionally generating terrible plans for core queries, which then get cached, leading to support calls.

The workaround for now has been to have our query generator append a fixed-value column and change the value every 10 minutes, as a cache defeat.

Still, I'm surprised the engine doesn't figure this out itself, like regenerating plans frequently if they contain non-trivial table scans, say.

Or just expire cache entries every 15 minutes or so, so a bad plan doesn't stick around for too long.


To add to the sibling comment: this sounds like parameter sniffing. Read up on it; it should help you understand the problem. It is a known issue with SQL Server. Or rather a tradeoff: you don't want to compile a new execution plan for every query, as that would eat up CPU by itself.
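For what it's worth, once you've identified a specific statement that suffers, a common mitigation is a recompile hint (a T-SQL sketch; the table and parameter names are made up):

    -- Build a fresh plan for this statement on every execution,
    -- trading extra compile-time CPU for plan stability.
    SELECT *
    FROM orders
    WHERE customer_id = @customer_id
    OPTION (RECOMPILE);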

Yeah, seems likely, but I haven't had time to really dig into it. For now our workaround works.

So my point was that if they want to keep parameter sniffing as default then short cache expiration can at the very least mitigate the performance cliffs that can occur. Better to have a slow query for 10-15 minutes than for hours. Recompiling once in a while shouldn't have a significant performance impact.

In general though I think in many cases Carmack's approach when he made Quake's renderer is a good one.

That is, it's often better to use an algorithm which is on average slightly slower but has much lower variance in execution time, than one which is on average slightly faster but has much greater variance. Parameter sniffing seems to fall into the latter category.


It's one of those RDBMS peculiarities that you just need to learn about. Any half-decent MSSQL course or book will tell you about the plan cache and the occasional gotcha exactly like yours. Time wasted debugging this issue by the unaware is probably in kilo-man-years by now.

…but MSSQL is still a fantastic database, if you can afford it. Postgres and MySQL come with their own set of gotchas, some of which need an actually decent book to be explained. (Note: all the RDBMS manuals are decent books, and everyone without exception should read at least the TOC of the DB they're using, which IME is still a rarity.)


My point was more that I feel it's a discrepancy that the DB tries to be very clever, but seemingly doesn't self-evaluate its decisions enough to determine when it wasn't.

Why should I have to guard against poor optimization decisions when the whole point is to just let the DB figure out the best way?


Because the point isn't that the DB should figure everything out; and it is a baseless claim that this is a 'poor optimization decision'. The point is to allow you to meet your requirements within your constraints. You're trading some of your constraints for different ones (here: you don't want to implement your own database, so you're using one that's already built, with its own requirements and constraints). A database is a database, not a holy grail.

I haven't used MSSQL for some time now, but I'm sure there are tools which do exactly what you say you want, i.e. continuously monitor the DBMS and suggest actions or even act itself. This is usually done by a team of DBAs, though.


In general, end-to-end/whole-system tests which mock/stub as little as possible and focus on whole use-case flows are a good starting point. (Sadly, in many teams this is also a constant uphill battle: from convincing PMs to allocate time to bootstrap it, to making CI harder since you basically have to bootstrap a whole testing system including the DB, to educating devs to write such tests properly instead of writing unit tests at an end-to-end level, which rarely ends well.)

Optimally, you then combine this with automatic collection and analysis of log traces to detect performance regressions (which is hard, not just because of the tooling involved, but because it needs a CI env with reliable perf characteristics, e.g. not GitHub-hosted GitHub Actions runners).

Though in this case the problem is a trap in the behavior of `IN (...)` subqueries, which probably is not specific to the WITH clause and might be rooted as deeply as the SQL standard - but as you have to pay to be able to read it, I'm not sure.


IN is not recommended. Use = ANY instead.

Aren't they the same in Postgres?

Yes, according to the manual, IN is equivalent to = ANY() https://www.postgresql.org/docs/current/functions-subquery.h...

I had to check because for some reason, I always thought =ANY was somehow better than IN.


IN is fine. The biggest problem comes with NOT IN, which has NULL semantics that makes life difficult for the planner. It is consistent but rarely what the user wants. NOT EXISTS is typically better there.
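A quick illustration of the NULL trap:

    -- A NULL in the list makes NOT IN yield NULL, never TRUE:
    SELECT 1 WHERE 2 NOT IN (VALUES (1), (NULL));   -- zero rows
    -- NOT EXISTS behaves intuitively and plans as an anti-join:
    SELECT 1 WHERE NOT EXISTS (
      SELECT 1 FROM (VALUES (1), (NULL)) v(x) WHERE v.x = 2
    );                                              -- one row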

ANY can be used with arrays, particularly query parameters: `id = ANY($1::int[])`, but not `id IN $1::int[]`.

Oh that must be why I got into this habit.

It is, but in a pretty minor way: ANY can be used with no items; IN cannot and will error.
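For example, against the post's table:

    SELECT * FROM task_queue WHERE id = ANY('{}'::int[]);  -- fine, zero rows
    SELECT * FROM task_queue WHERE id IN ();               -- syntax error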

DELETE in a SKIP LOCKED scenario is a recipe for failure. God/Postgres gave you transactions for atomicity... only UPDATE with SKIP LOCKED, holy brother in Christ, and only DELETE once the work has completed normally.
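The pattern being alluded to, as a sketch (assuming a status column like in the examples upthread):

    -- Claim the job inside a transaction instead of deleting it up front;
    -- if the transaction aborts, the claim rolls back and the row stays
    -- available for other workers.
    UPDATE task_queue
    SET status = 'processing'
    WHERE id = (
      SELECT id FROM task_queue
      WHERE status = 'pending'
      ORDER BY id
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, item_id;
    -- ...do the work, then DELETE the claimed row once it completed normally.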

God gave us Postgres to teach humanity that everything is an INSERT.

SWEs are learning about Type-2 tables the hard way round; of course, this is pure entertainment to data engineering people! Wait until "Audit table considered harmful" blog posts start popping up...


