Selective async commits in PostgreSQL – balancing durability and performance (shayon.dev)



I suspect many apps would benefit from splitting the commit completion notification into two pieces:

- commit logically committed (meaning it won't fail to complete for any reason other than power or hardware failure; in particular, no trigger or other such logic can keep it from completing)

- commit completed durably (i.e., all associated writes have reached stable, persistent storage, and future reads will see them even if there is a power failure in between)

A UI could indicate that a transaction is logically complete and have a second way to indicate that the transaction is durably committed.
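
You can even approximate this split in stock Postgres today, at least as a sketch: commit with synchronous_commit = off (that's your "logically committed" ack), note the WAL position, and later check whether the flush pointer has passed it (that's "durably committed"). The LSN literal below is a placeholder for whatever the first SELECT returned:

    -- Writer session: async commit, then note the WAL position
    BEGIN;
    SET LOCAL synchronous_commit = off;
    -- ... writes ...
    COMMIT;                              -- returns once logically committed
    SELECT pg_current_wal_insert_lsn();  -- remember this LSN

    -- Any session, later: durable once the flush pointer has passed it
    SELECT pg_current_wal_flush_lsn() >= '0/1234ABC'::pg_lsn AS durable;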


IIRC, that doesn't work in practice because once a synchronous commit is requested, all async commits that preceded it must be fsynced before that sync commit can complete. Which kind of makes sense if you think about the later work potentially reading the result of the earlier one.

The high throughput gains of async commit shine when the totality of the workload is async.


I think you mean: If anybody anywhere decided to invoke a wait() until the state "my COMMIT; has been durably committed", then that means all previous commits must also be in durably committed state before we can continue.

Sure. But this doesn't affect other code that is merely wait()ing for 'logically committed'. Not in the past, and usefully, not in the future either. The 'logical system' can be far ahead of the 'durable' system, it doesn't have to wait.

Where it would go wrong is if code A needs to wait around for code B to do a thing (such as provide a value), and B's code for some reason contains a wait() on durable. Possibly because B was written before the split in commit behaviour was around, and for backwards compatibility reasons, for them COMMIT; means: "wait for durable commit".

Given that `SET synchronous_commit = off;` works per transaction, I'm kinda inspired here. I can think of a few places where we commit for various logical reasons but I don't need the fsync guarantee.
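
Roughly (audit_log is a made-up table):

    -- Session level: stays off on this connection until changed
    SET synchronous_commit = off;

    -- Transaction level: SET LOCAL reverts automatically at COMMIT/ROLLBACK
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO audit_log (event) VALUES ('login');
    COMMIT;  -- returns without waiting for the WAL flush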


But....

Only the sync committer needs to wait for those. (Nothing changes.)

The async committers can go on their merry way.


Right, but there are common use cases (audit, log, analytics tables) where you can tolerate a small risk of loss and get some hefty speedups.


Probably a higher cost to send two notifications to the user than to just always fsync all writes on an SSD.


+1 - I have found it's better to instead use it on paths that you know are idempotent and recoverable. Like expiring data on a cron and so on.


The network is faster than the storage.


No it isn't, it hasn't been for quite some time, and it will probably remain that way for the rest of our lives.


It still doesn't matter. All you need is enough CPU cycles to send the additional completion notice.


With nearly all DBaaS, and _especially_ Aurora, as TFA mentions, the storage is also on the network, so it's moot.


There is a paper exploring this concept: https://cs.uwaterloo.ca/~kdaudjee/ED.pdf

UI-wise it does not make sense to have this distinction, as the window to reach durability is a small fraction of a second. But for concurrent modifications the reduction in lock duration can mean an order-of-magnitude throughput improvement.


If you're careful and you know what you're doing, sure. A much more impressive speed boost is to not use Aurora. "Let us separate storage and compute with a minimum latency of 1 msec," said no sane person ever.

Go spin up vanilla Postgres on an i4i instance, or even your personal computer, assuming you have a fast NVMe drive. You'll see what fast is. Nothing comes remotely close, not even io2 Block Express. A modern enterprise NVMe drive can do over 1,000,000 IOPS.


While I generally agree, I must say that an NVMe drive can sustain many more reads than writes per second.


Fair point, but random write IOPS is still usually around 500K.


I didn’t know this could be per-session or per-transaction. That seems a way better way to take advantage of this for very carefully thought-through use-cases without setting up an invisible footgun in a global setting someone else doesn’t realize is changed from the default.


Thanks for sharing this! Like many others, we're at a scale where we use Postgres for almost everything to reduce operational overhead, and there are some ETL, log, and audit tables that would love a 30% speed bump for such a small risk window.


Exactly! Those are the places I have found it to be super useful. Glad you found the post useful.


    def with_synchronous_commit_off(&block)
      # Session-level SET: persists on this connection until changed
      ActiveRecord::Base.connection.exec_query("SET synchronous_commit = off")
      yield
    ensure
      ActiveRecord::Base.connection.exec_query("SET synchronous_commit = on")
    end
I'm not a Rails expert: does this reliably give you the same connection from the pool? If not, you could be setting `synchronous_commit = off` on one "physical" connection, and clearing it on another. This is setting it at the Postgres session level, so it will persist for the lifetime of the connection. If it's not managed reliably, it could lead to some very confusing bugs (only triggered in a very small window when Postgres shuts down unexpectedly while the transaction is still unsynced to durable storage, making it extra fun to debug).

In general, though, this is a very handy feature.
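
If you're ever debugging that failure mode, you can at least ask each pooled connection what it currently has:

    SHOW synchronous_commit;  -- per-connection value; the default is 'on'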


In your code, in the `ensure` part you want `SET synchronous_commit TO DEFAULT` instead of the explicit `on` (https://www.postgresql.org/docs/current/sql-set.html)

Then, I suspect you may probably rather want to use `SET LOCAL` rather than just `SET`, so it applies to a single transaction only. So maybe something more like

    def asynchronous_transaction(&block)
      ActiveRecord::Base.connection.transaction do
        # SET LOCAL lasts only until this transaction commits or rolls back
        ActiveRecord::Base.connection.execute("SET LOCAL synchronous_commit = off")
        yield
      end
    end
Please note that I don't really know Rails (in particular, I've no idea whether `Base.connection` is guaranteed to be exactly the same throughout this function's lifecycle or if e.g. there's a connection pool underneath), so I could be introducing some subtle bugs here.


Yes, the connection is "sticky" for the duration of the request. The middleware also automatically starts a transaction for you when you check out the connection.


There is something to be said for SET LOCAL over SET SESSION here, especially when pgbouncer sits between the Rails app and PostgreSQL and a session-level var can leak across pooled connections.


We use PostgreSQL locally but moved our production database over to AWS Aurora.

If I use `SET synchronous_commit = off;` in my Rails code will Aurora also mimic this behaviour?


Yes, I touched on it a bit towards the end. If you turn it off, Aurora won't wait for 4 out of 6 nodes to acknowledge the commit (https://aws.amazon.com/blogs/database/amazon-aurora-postgres...)


Presumably if you disable this on a per-transaction basis you are still guaranteed that transactions are replicated in FIFO order? Otherwise you could end up in weird situations where a synced txn reads from a lost txn and your state is effectively corrupted.


Correct. Async commit transactions add their changes to the WAL the same way normal transactions do. The only relevant change is that the WAL is not synchronously flushed to disk before COMMIT completes.
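
For reference, synchronous_commit is a spectrum rather than a boolean (paraphrasing the Postgres docs):

    -- synchronous_commit levels, weakest to strongest:
    --   off           commit returns before the local WAL flush
    --   local         wait for the local WAL flush only
    --   remote_write  also wait for standbys to receive and write the WAL
    --   on (default)  also wait for standbys to flush it (with sync replication)
    --   remote_apply  also wait for standbys to replay it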


Yes and no. On the primary, durability order and visibility order are different. So an async transaction that starts committing later can become visible to readers before a sync transaction that comes before it.

A read-write sync transaction that reads the result of a non-durable transaction cannot commit ahead of the async transaction. But there is no such guarantee for read-only transactions. So a transaction could read a non-durable state, decide that no further action is needed, and have that decision be invalidated by a crash/failover.

To make things even worse, similar things can happen with synchronously replicated transactions. If a committing transaction waiting on replication is cancelled it will become visible immediately. A reasonably written application would run a retry of the transaction, see that the result is there and conclude that no action is necessary, even though the data is not replicated and would be lost if a failover then happens.
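
A concrete two-session sketch of that first hazard (jobs is a made-up table):

    -- Session A: async commit
    BEGIN;
    SET LOCAL synchronous_commit = off;
    UPDATE jobs SET state = 'done' WHERE id = 42;
    COMMIT;                                -- returns before the WAL flush

    -- Session B: read-only, a moment later
    SELECT state FROM jobs WHERE id = 42;  -- already sees 'done'
    -- If the server crashes before the flush, that update is gone,
    -- but session B may have already acted on what it saw.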


Yeah these are the kind of subtle issues I expected the article might be either glossing over or unaware of.


Just run work in parallel if possible. You are in large part fighting fsync latency, which can be amortized across parallel workers.


Parallel doesn't really help that much IME, but batching your work into larger transactions to amortize locks and fsyncs does.


Concurrent commits can group commit with one WAL flush (i.e. one fsync/fdatasync).
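
And you can nudge Postgres toward larger groups, e.g.:

    -- Delay the flush up to 1ms when at least 5 other transactions are
    -- open, so more commits share a single fsync.
    -- (Changing commit_delay normally requires superuser/SET privilege.)
    SET commit_delay = 1000;   -- microseconds, default 0
    SET commit_siblings = 5;   -- default 5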


+1


I'm not sure working in parallel is always a good decision.

An anecdote: just the other day I had to implement batching instead of concurrent parallel processing because PostgreSQL really hated me having thousands of concurrent transactions on the same table.

My particular workflow was essentially this - I get a batch (full state dump) with some products, and I need to update my `products` table to keep track of them (soft-deleting what had disappeared, inserting new, updating existing):

    BEGIN;
    -- Quickly load the batch into a temporary table
    CREATE TEMPORARY TABLE products_tmp (LIKE products INCLUDING ALL) ON COMMIT DROP;
    COPY products_tmp FROM STDIN;
    -- Soft-delete products missing from the current batch
    UPDATE products SET is_active = FALSE WHERE is_active AND store_id = ANY($1) AND id NOT IN (SELECT id FROM products_tmp WHERE store_id = ANY($1));
    -- Upsert products from the current batch (add new, update existing)
    INSERT INTO products (...) SELECT ... FROM products_tmp ON CONFLICT (id) DO UPDATE SET ...;
    COMMIT;
With just a few thousand concurrent writers, things started to look quite ugly, with constant serialization failures (I started at SERIALIZABLE, then downgraded to REPEATABLE READ, was reluctant to use READ COMMITTED) and deadlocks preventing me from performing some DDL (schema migrations) on the products table.

So I started to batch those batches elsewhere and dump them at periodic intervals - and things started to look a lot better. Maybe that was a naive/brutish approach and I should've done some parameter tweaking and/or fancy table partitioning or something else (idk) for congestion control instead, but at least it worked.


It just breaks the transaction guarantee. In most cases, you don't want it. In other cases, like logging, pipeline mode may be better suited.


No? Can you explain?

It breaks the durability for that transaction until a later transaction with synchronous_commit = on completes or the background WAL writer flushes it.
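
If you ever need to close that window on demand, one synchronous commit is enough: flushing the WAL through its own commit record also flushes every async commit before it (heartbeat is a made-up single-row table):

    BEGIN;                            -- synchronous_commit is on by default
    UPDATE heartbeat SET ts = now();  -- any real write will do
    COMMIT;                           -- waits for the flush; all earlier
                                      -- async commits become durable too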


On the client side, there is no explicit way to know which transaction is broken when there are multiple transactions in flight. If your use case relies on transactional assumptions, it may fail at some point.


On the client side you know, because you're setting async commit on a case-by-case basis.

All transactions work the same. When/if the server fails, some transactions (the ones you allowed) may not be durable (or not replicated). But everything is still transactional.



