Why is Cockroach adding READ COMMITTED? Is using a lower isolation level better for performance, or does it just reduce the number of serialization errors and retries that need to be handled?
READ COMMITTED is great for applications that need a coherent snapshot of the database but not necessarily the absolutely most recent data, which in my experience is actually most apps.
It allows readers to see valid data (relationships are correct), while not blocking writers. It can be the difference between constant deadlocks and super-high throughput without lock contention.
The main motivation is to reduce serialization errors, for applications that can handle the weaker isolation level, especially applications that were previously running fine under READ COMMITTED on another database.
It's often talked about how NewSQL databases offer better scalability than standard SQL databases, but I think it's maybe sometimes underappreciated how (some, not all) of them are also much simpler in terms of their consistency models.
I'd speculate this is because Postgres and friends try to eke out every bit of single-node performance (which helps with single-row throughput and overall throughput, where they are obviously much better than NewSQL), but the scalability of NewSQL databases might allow them to prefer easy consistency over single-node performance.
Possibly this is also just the passage of time benefiting newer systems.
Read committed is explicitly asking for hard mode. If you want a simple life, stick with Serializable, as always. It took years before people found anomalies in Repeatable Read in Postgres. This stuff is hard even for world-class researchers.
You always need app-level retries for the SERIALIZABLE isolation level. You don't need any explicit locking - the database should handle that for you (and in the case of PostgreSQL, locking is not the only tool it uses for avoiding serialization anomalies).
The strategy I use is to keep transactions as small as possible, and have retry functionality built into the transaction abstraction, so the business logic doesn't really need to worry about it. I also explicitly use read-only transactions where possible.
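That retry abstraction can be sketched in a few lines. This is illustrative only: `SerializationFailure` stands in for whatever your driver raises on SQLSTATE 40001, and the backoff numbers are arbitrary.

```python
import random
import time


class SerializationFailure(Exception):
    """Stand-in for a driver-level SQLSTATE 40001 error."""


def run_transaction(txn_body, max_retries=5):
    """Run txn_body, retrying on serialization failures.

    The retry loop lives here, so business logic never sees it.
    """
    for attempt in range(max_retries):
        try:
            return txn_body()
        except SerializationFailure:
            if attempt == max_retries - 1:
                raise
            # Jittered exponential backoff before the next attempt.
            time.sleep(random.uniform(0, 0.001 * 2 ** attempt))


# Demo: a transaction body that aborts twice before committing.
attempts = {"n": 0}

def flaky_body():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"

result = run_transaction(flaky_body)  # succeeds on the third attempt
```

In a real app, `txn_body` would open the transaction, run its statements, and commit, so a retry re-executes the whole read-write cycle rather than replaying stale reads.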
Even more generally, distributed systems can find simpler solutions to things like "raise the throughput ceiling", and "handle disk failure", and "handle power failure" than single-box systems. This is for the simple reason that they have more options: beyond the constraints of a box, resource allocation is more flexible, failures less correlated, etc. That allows modern distributed databases to simply avoid some of the super hard problems that prior databases had to solve. Efficiency is still important, but the thing to optimize is mean system efficiency, not the peak performance of a handful of super hot boxes.
There's also the fact that decades of DB research have brought techniques and approaches that beat old ones, and retrofitting existing systems with them can be hard (e.g. see the efforts to remove some of the sharp edges of PG's MVCC behavior and how hard they've turned out to be).
How does the CockroachDB approach not deadlock? Surely retrying could encounter a situation where two competing UPDATEs lock rows in different orders, and no amount of retrying will unlock the required rows, right?
I wondered this too and found this in the docs[0]:
> Transactions at all isolation levels are subject to lock contention, where a transaction attempts to lock a row that is already locked by a write or locking read. In such cases, the later transaction is blocked until the earlier transaction commits or rolls back, thus releasing its lock on the row. Lock contention that produces a deadlock between two transactions will result in a transaction abort and a 40001 error (ABORT_REASON_ABORTED_RECORD_FOUND or ABORT_REASON_PUSHER_ABORTED) returned to the client.
So it looks like you still get a good old 40001 error, just like with SERIALIZABLE isolation.
It's a good question. For simple UPDATEs, CockroachDB always executes in a deterministic, serial order and so it's likely the rows will be locked in the same order by any competing updates. (This can be confirmed by looking at the query plans.) Complex UPDATEs using joins and subqueries will need explicit ORDER BY to always lock in the same order.
If an UPDATE has to retry halfway through, locks are held across the retry to help the system make progress. But as you point out, this could cause lock acquisition to happen in an unexpected order if new rows qualify during a retry. So far we haven't run into this, but we might need to provide an option for an UPDATE to drop locks on retry if deadlock turns into a bigger problem than livelock. It depends on the workload.
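The lock-ordering idea can be illustrated outside the database. This is a toy sketch with ordinary mutexes, nothing to do with CRDB's internals: if every transaction acquires its row locks in the same global order, the classic two-row deadlock can't form.

```python
import threading

# Toy "rows": one mutex per row, keyed by primary key.
row_locks = {1: threading.Lock(), 2: threading.Lock()}

results = []

def update_rows(keys, label):
    """Lock the given rows in sorted key order, then 'update' them.

    Sorting is the crux: with txn1 asking for (1, 2) and txn2 for
    (2, 1), both contend on row 1 first instead of each grabbing
    one row and waiting forever for the other.
    """
    for k in sorted(keys):
        row_locks[k].acquire()
    try:
        results.append(label)
    finally:
        for k in sorted(keys, reverse=True):
            row_locks[k].release()

t1 = threading.Thread(target=update_rows, args=([1, 2], "txn1"))
t2 = threading.Thread(target=update_rows, args=([2, 1], "txn2"))
t1.start(); t2.start()
t1.join(); t2.join()  # both finish; without sorting this could deadlock
```

Databases can't always sort like this (the set of qualifying rows may only be discovered during the scan), which is exactly why the comment above mentions explicit ORDER BY for complex UPDATEs.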
> So far we haven't run into this, but we might need to provide an option for an UPDATE to drop locks on retry if deadlock turns into a bigger problem than livelock.
I'm not sure what you mean by that: the design can deadlock but you just have not seen it happening yet?
Edit: oh, I see in a comment below that deadlocks are detected and abort the transaction.
If `UPDATE player SET level = 'AA' WHERE team = 'Gophers';` is executed before the player swap, then why should "Stonebreaker" be upgraded to "AA"? I'd be pretty mad at my database if I sent those 2 queries in sequence and my DB decided to re-order them.
The sleep actually really complicates things here. I understand some queries run slower than others and the sleep is a useful tool to artificially slow things down, but now I don't know if I should interpret that as one command or two. If `WITH sleep AS (SELECT pg_sleep(5))
UPDATE player SET level = 'AA' FROM sleep WHERE team = 'Gophers';` is atomic then I'd expect it to put a lock on the 3 Gophers (which doesn't include Stonebreaker), wait the 5 seconds and then complete the update. The player swap would be blocked for those 5 seconds because it touches a row that's being updated.
The timing of this example is tricky because the two update statements execute concurrently (which is only possible under read committed isolation; under serializable isolation it's much more like what you're describing).
Here's a full timeline in PG (U1 for first update, U2 for second update):
> If `UPDATE player SET level = 'AA' WHERE team = 'Gophers';` is executed before the player swap, then why should "Stonebreaker" be upgraded to "AA"? I'd be pretty mad at my database if I sent those 2 queries in sequence and my DB decided to re-order them.
It’s easy to miss but in the swap query the levels also get swapped. Because — and it’s harder to miss but easy to skip over — given what constraint 2 says the level is actually a team level, not a player level.
So in a seqcst view, either the team’s players get upgraded to AA then the players’ levels get swapped during the exchange, or the players get exchanged then the team’s players get upgraded.
In both sequences you end up with Stonebreaker as an AA gopher and Lamport as an A dolphin.
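Assuming both teams start at level A (my reading of the example; the initial levels aren't stated in this thread), the two serial orders can be checked mechanically:

```python
def upgrade_gophers(players):
    # UPDATE player SET level = 'AA' WHERE team = 'Gophers';
    for p in players.values():
        if p["team"] == "Gophers":
            p["level"] = "AA"

def swap(players, a, b):
    # The swap exchanges both team and level between the two players.
    players[a]["team"], players[b]["team"] = players[b]["team"], players[a]["team"]
    players[a]["level"], players[b]["level"] = players[b]["level"], players[a]["level"]

def initial():
    # Assumed starting state: both teams at level A.
    return {
        "Lamport": {"team": "Gophers", "level": "A"},
        "Stonebreaker": {"team": "Dolphins", "level": "A"},
    }

s1 = initial()          # order 1: upgrade, then swap
upgrade_gophers(s1)
swap(s1, "Lamport", "Stonebreaker")

s2 = initial()          # order 2: swap, then upgrade
swap(s2, "Lamport", "Stonebreaker")
upgrade_gophers(s2)
# Both serial orders agree: Stonebreaker is an AA Gopher,
# Lamport is an A Dolphin.
```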
> Then it's bad unnormalized data design that is the problem here.
That the table is not normalised makes the example somewhat confusing but it does not actually affect the issue being demonstrated. And denormalisation is a fact of data modelling.
Why not simply use an example that isn't confusing? Many developers, and especially academics, love wasting time and effort solving issues they made up but that have no real-life examples. Examples with trivial alternative solutions don't help me as a reader distinguish whether this is a real problem or something made up.
Because every example you can think of is confusing, because understanding how concurrent transactions should operate is confusing.
Almost all problems like this can be solved by improving an application data model, but here’s the thing: lots of applications have dodgy data models, either due to time constraints, or because the application evolved over time and the data model didn’t. So these are all real-world problems and examples, but creating a “simple” problem to demonstrate the issue almost certainly means also creating an example where other “obvious” solutions exist.
Just because you can’t imagine how this simple example might represent a much more complex “real world” problem, doesn’t mean it doesn’t exist.
> If that is a team level, it should be in the team table, not the player table.
It's all contrived, of course, but the reason I would consider skill level to be a player attribute rather than a team attribute is that there could be free agents with a skill level but no team:
INSERT INTO player VALUES (10, 'Pavlo', 'AAA', NULL);
Then with enough free agents, you could imagine building a new team out of free agents that are all at the same skill level:
UPDATE player SET team = 'Otters' WHERE level = 'AAA' AND team IS NULL ORDER BY id LIMIT 3;
My guess is orgs that have a real need for a distributed SQL database (which are rare in the space of orgs) will make their choices based on requirements analysis rather than naming.
So probably if you would be turned off enough by the name not to use the software, you don't actually need a distributed SQL database and are not the target customer.
The flip side of that is that businesses are more than okay with selling their products to organizations for which it’s overkill.
Gyms know that people that sign up on Jan 2 don’t need annual (or any) membership. They take their money anyway. The more honorable ones allow cancellation or something but I don’t think anyone expects them to actively turn these folks away.
I couldn’t imagine suggesting cockroachDB in a planning meeting for a new product. Especially when over half the room isn’t engineering. Just won’t happen.
PG only uses EvalPlanQual under read committed isolation. Under repeatable read the first update fails with a "could not serialize" error, just as it does under serializable.
How do you incorporate SELECT FOR UPDATE locking into Raft? Also, maybe slightly off topic, but how does CockroachDB deal with read-only queries and the Raft log?
Prior to building read committed, CRDB only supported intents as a form of replicated locks. Intents are provisional values written by active transactions which are resolved once the writing transaction commits or aborts.
As part of building read committed, we introduced other forms of replicated locks. In particular, replicated shared and exclusive locks (SELECT FOR UPDATE statements acquire exclusive locks). Locks acquired by read committed transactions are always replicated.
The unit of replication in CockroachDB is called a range, and every range is its own raft group. Every range has a reserved portion of its keyspace where locks can be stored. This is known as the replicated lock table. Incorporating exclusive/shared locks in the replicated lock table involved adding a notion of lock strength to all locks that were stored (as previously they were all assumed to be intents).
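A toy model of the lock-strength idea may help; this is nothing like CRDB's actual replicated lock table, just the shared/exclusive compatibility rule the comment describes:

```python
# Toy lock-table sketch: per-key list of (txn, strength) holders.
# Compatibility rule: shared/shared is fine; any pair involving an
# exclusive lock conflicts, so the later request would have to wait.
COMPATIBLE = {("shared", "shared")}

class LockTable:
    def __init__(self):
        self.locks = {}  # key -> list of (txn, strength)

    def try_acquire(self, key, txn, strength):
        holders = self.locks.setdefault(key, [])
        for held_txn, held_strength in holders:
            if held_txn != txn and (held_strength, strength) not in COMPATIBLE:
                return False  # conflict: caller would block on held_txn
        holders.append((txn, strength))
        return True

lt = LockTable()
r1 = lt.try_acquire("k1", "txn1", "shared")     # granted
r2 = lt.try_acquire("k1", "txn2", "shared")     # shared is compatible
r3 = lt.try_acquire("k1", "txn3", "exclusive")  # blocked by the shared holders
```

In CRDB the holders live in a reserved portion of each range's keyspace and are replicated through Raft; intents are effectively the strongest entry, since they carry a provisional value as well as a lock.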
Look, every single time this database is mentioned here, this exact comment comes up. Every. Time. For years, and years, and years. And yet, true to its name, the database and the team building it are still around.
At some point we've got to realize the CRDB team is comfortable with the name, it's not going to change, and all comments lambasting it are just farts in the wind.
I'm relatively ambivalent on the name issue specifically, but I do think this has been hashed over ad nauseam, to the point of making for boring reading. This line from the HN guidelines applies:
> Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead.
I scoffed at first, but I guess cockroach sounds to some what "maggot" sounds like to me.
Like the other person said, it's related to the survivability of cockroaches. One can hope for a management tier competent enough to evaluate an internal database on the merits.
If not for git, you'd have to commit to Subversion! And then back up the whole thing with Duplicity.
I'm not a fan of such sense of humor, and prefer neutral monikers like rsync, or Xfce, or Linux, but the ship has sailed.
It just gets normalized and the edginess erased with time and daily use. Imagine that somebody would suggest that you have to wash your hair with some "sham poo". They would be laughed out of the room, had not this word existed for centuries.
You name tools to your management? They won’t know what anything is anyway? The word database (some sort of Excel!) is usually an issue already. Why would you go into that much detail; just don’t?
Ha! Do you think this is uncommon? The path to becoming a manager in many organizations is biased towards those who are good at politics, loyal, a friend, and charismatic—all sorts of things that don't ensure effective approaches to technology selection.
Idk, to me this looks like a data modeling issue. There should be a team table that contains team-specific data such as the skill level; then these two queries wouldn't run into any problems.
There were many interesting design decisions, such as:
- whether to use multiple snapshots or a single snapshot per statement
- how to handle read uncertainty intervals
- how to incorporate SELECT FOR UPDATE locking into Raft
- how to handle SELECT FOR UPDATE subqueries
- how to prevent lost update anomalies between two UPDATEs
Some of the gory details are in the public RFC: https://github.com/cockroachdb/cockroach/blob/master/docs/RF...
This blog post just discusses the last point, but please AMA.
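For anyone unfamiliar with the last bullet: a lost update is the classic read-modify-write race, where two transactions each read the same value and write back their own increment. A toy sketch (not CRDB semantics, just the anomaly itself):

```python
# Two "transactions" both snapshot the counter before either writes.
balance = {"v": 100}

snap1 = balance["v"]   # txn1 reads 100
snap2 = balance["v"]   # txn2 reads 100

balance["v"] = snap1 + 1   # txn1 commits 101
balance["v"] = snap2 + 1   # txn2 commits 101, clobbering txn1's write

lost_update_result = balance["v"]   # one increment was lost
```

Preventing this under read committed is what the blog's UPDATE machinery is about: the second writer must either see the first writer's committed value before writing (re-evaluating the row, as PG's EvalPlanQual does) or abort.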