Scalable OLTP in the Cloud: What's the Big Deal?

taeric · 2025-01-30T17:18:00 1738257480

Fun read. I am a bit curious on not mentioning OLAP, at all?

From my experience, the vast majority of complications in systems is people not realizing they are asking an OLAP question while wanting parts of OLTP semantics.

I'm also curious how much of this extra complication comes from having a central database that tries to be a source of truth for everything? As an easy example, inventory systems that try and have a cloud source of truth have obvious problems when on the ground inventory issues crop up. That is, a lot of the complication is between the distributed nature of the application and database, sure. But, another large source is the non-distributed abstraction that a central database can represent not centralized activity.

mamcx · 2025-01-30T18:48:07 1738262887

> I'm also curious how much of this extra complication comes from having a central database that tries to be a source of truth for everything?

You can reverse this as 'how much extra complication comes from having many disparate databases, so now you push all ACID complications at the company level instead of the software(rdbms)'

It is a tricky problem!

Having a single source of truth is a tremendous simplification, especially for the people.

It should be ideal that the other databases are pure materializations on top of it, but then not matter what you have the need to input more stuff that is not part of the central one, and now you have both problems.

taeric · 2025-01-30T18:59:54 1738263594

A single source of truth for facts is fine. A single source of truth that has to be reflected by external reality? Less fine. At that point, you pretty much have to have reconciliation processes to account for when they don't match.

Oddly, I'd argue that the better way to view this is that the central database is the materialization of the data sources scattered around where work actually happens. As such, you have to have constructs that account for provenance and timing introduced there that aren't as necessary at the edges.

nyrikki · 2025-01-30T19:38:09 1738265889

Single source of truth is different in distributed systems, well the costs are different.

Using the language of 'software architecture: thee hard parts' this forces your entire system into a single 'architectural quanta', basically it is a monolith.

There are situations where monoliths are acceptable or even the least worst option.

The 'Fallacies of distributed computing' covers most of these.

It can work better if you use a hierarchical model, but OLTP is almost exclusively the relational model.

ACID transactions become pretty brittle, software becomes set in stone etc...

There are use cases, but the value proposition is typically much weaker

datadrivenangel · 2025-01-30T18:32:25 1738261945

I agree that analytical queries cause a significant portion of the complication in transactional systems.

However, at global/twitter scale, just basic CRUD operations become pretty hard for OLTP, so scalable OLTP in the cloud is a pretty big deal, especially when you get fun things like phantom read/writes as writes may take time to get to distributed read replicas, so a user may send a tweet and then not see it on their profile for a while.

taeric · 2025-01-30T19:02:28 1738263748

I mean, sure? I hesitate to take the global/twitter scale debates seriously, unfortunately. They are dominated by straw men arguments that you can get sub 50ms latency on every action for all viewers. :(

tharkun__ · 2025-01-30T22:36:09 1738276569

You can't and it shouldn't matter and I don't care if everyone sees my upvotes on Reddit right away / at the same time as others or if I see theirs or when exactly my comments show up.

But global/twitter/reddit etc. scale does matter in that I don't really remember a PHPBB or slashdot or HN swallowing my comments, ever. Even if you get an error you can always just go back in the browser and your input box is there with your comment.

But reddit has been effing atrocious over the last few months. You get "Internal Server Error" when trying to do any voting or commenting a lot and it won't go away until you refresh the page like 17 times (until presumably you round robin get to some BE server that actually works. It's also been swallowing comments, where about every third or fourth comment it accepts the comment instead of throwing an Internal Server Error at you (which would be preferable) and instead it "accepts" it but it will never show up. Ever. Only chance is to copy every comment before submitting, in case you have to re-submit it from scratch.

zo1 · 2025-01-31T06:08:17 1738303697

You can tell a lot about Reddit and it's early "architecture" from this video.

https://www.infoq.com/presentations/reddit-architecture-evol...

That and the Instagram architecture video are such eye-opening examples that I always send to juniors when they have doubts about their own abilities. In most cases, we're astronomically no where near Reddit/Twitter/Instagram scale, and yet we avoid a lot of the common pitfalls we see presented in those videos. Side note, I'm shocked at the quality, but hey "Time To Market!"

The Instagram video: https://www.youtube.com/watch?v=hnpzNAPiC0E

tharkun__ · 2025-01-31T23:23:58 1738365838

Haha yeah.

I had the same eye opening experience when I first "met" Jira. I was pretty fresh out of uni myself, yet I knew you should never use a business key that people might want to change and always use surrogate keys for your database.

And here is Jira "happily" using Issue Key as both the business key that signifies which project an issue belongs to, which can very obviously change as soon as you move a ticket to another project and as the primary key in their database and thus for everything related to an issue like say issue links.

I was laughing so hard that you couldn't tell I was also crying and cursing at the same time coz I had to deal with that fact.

They've since introduced ids for things but ... duuuudes! WTF! lol

lmm · 2025-01-31T04:36:25 1738298185

> don't really remember a PHPBB or slashdot or HN swallowing my comments, ever.

Really? I remember it happening a lot.

> Even if you get an error you can always just go back in the browser and your input box is there with your comment.

That's a relatively new feature and doesn't always work. (Old, I can't speak to new) Reddit's "push a button and it submits without moving off the page" is much nicer.

tharkun__ · 2025-01-31T23:05:17 1738364717

New reddit. Same submit without moving off the page. It's great if there is an actual error returned from their BE. You can try again etc.

The problematic case is that very often now it accepts your comment without any errors and... Then nothing. Your comment will never ever show up. The system swallowed it. Whole. Without chewing.

lmm · 2025-02-03T05:24:49 1738560289

> The problematic case is that very often now it accepts your comment without any errors and... Then nothing. Your comment will never ever show up. The system swallowed it. Whole. Without chewing.

Huh. I've never seen that. Does it not show on your user page even? I've known it to take a few seconds for comments to show up, and mods on certain subs love to silently delete comments (sometimes automatically), but I've never seen them just disappear.

tharkun__ · 2025-02-07T00:28:51 1738888131

Not the thread, not the user page. Nowhere. Just gone.

I do also have the case where it disappears but a refresh / going to the user page will show it, i.e. something async on their end was just a few milliseconds too slow or something.

And posting the exact same comment again (I c+p them now) works. So it wouldn't be any automatic removals and it's too fast for manual removals.

taeric · 2025-01-30T22:40:12 1738276812

Exactly on that first statement.

I think I agree across the board. HN and Stack Overflow are both amusing examples of much simpler architectures that kind of get to the point, though? Such that I'm not clear if we are on the same page, there.

(And I, sadly, have zero experience with Reddit.)

honestSysAdmin · 2025-01-30T17:48:45 1738259325

> From my experience, the vast majority of complications in systems is people not realizing they are asking an OLAP question while wanting parts of OLTP semantics.

If you could elaborate on this further, I and others are probably very interested in reading more about it.

taeric · 2025-01-30T18:32:09 1738261929

As the sibling says, it is easy to think about in terms of what you are doing with the data. Reporting on how many transactions of a type have happened over a duration? Probably OLAP. Really, if the word "report" shows up, at all, probably OLAP. Executing parts of a workflow for an active transaction? OLTP.

Briefly looking, I can't find the books that I thought covered a lot of this in a good way. Will keep looking, apologies.

datadrivenangel · 2025-01-30T18:33:08 1738261988

Designing Data Intensive Applications is a very good book in this space.

bobnamob · 2025-01-31T06:44:03 1738305843

Designing Data Intensive Applications by Martin Kleppmann is required reading if you're in this space.

I can't recommend it highly enough.

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...

taeric · 2025-01-31T17:07:53 1738343273

Indeed, I upvoted that response to me, but I should have said as much. This is the book I couldn't remember the name of for the life of me. Really good book.

juancn · 2025-01-30T18:17:09 1738261029

The main data access difference between OLAP systems and OLTP systems is how many records on average do you need to access:

- OLAP: most queries need most records (aggregations span large swaths of data)

- OLTP: most queries access just a few records

Also, in OLAP, in many cases, you can live with a single-updater model without much trouble, where OLTP, the strength is to have many concurrent updaters (but mostly non-overlapping).

greggyb · 2025-01-30T21:11:51 1738271511

- OLAP: read-mostly, table-scan heavy, many queries run ad-hoc by users

- OLTP: write-mostly, index-seek heavy, ~all queries pre-defined up front

hobs · 2025-01-30T19:18:56 1738264736

OLAP - Most queries need an aggregate of records. Generally you do NOT need most records, but simply the records grouped by dimensions per interval (for almost all olap reporting). You do not change the data, you observe it. If you change it, you are not dealing with OLAP data.

OLTP - You are dealing with the ins and outs of people using stuff to do things. You buy something, you check out something, you some way perturb the state of things. This should not require large amount of row lookups in 99.9% of cases.

yoda97 · 2025-01-30T22:14:06 1738275246

So the first focuses on analytics and reporting, the second on transactions and performance. They are not meant to replace each other, they serve different purposes. Some teams may need both.

guru4consulting · 2025-01-31T19:56:51 1738353411

Apart from the above differences, another important difference is that OLAP is often columnar based, as opposed to the typical OLTP being row-based. So, OLAP queries use different kinds of index. Snowflake has introduced Hybrid tables where the same data is stored and indexed twice, once in OLAP columnar type and the other in OLTP style row index.

whartung · 2025-01-30T18:14:17 1738260857

Maybe someone can answer how this is done.

Simply, the mad crushing dash to get the last bit of committed inventory.

Ticketmaster has 50,000 General Admission Taylor Swift tickets and 1M fans eager to hoover them up.

This is a crushing load on a shared resource.

I don't know if there's any reasonable outcome from this besides the data center not catching on fire.

ndriscoll · 2025-01-30T21:22:03 1738272123

Add a rate limit in nginx so that a single client has bounded work on your backend (e.g. 1/1-10s), and batch your requests into ~100/transaction using an in-memory queue. 1 server has been able to deal with ~1M connections for many years. No fires needed.

On a business side, they drastically lower the request load and make scalping unprofitable by holding a Dutch auction.

dangoodmanUT · 2025-01-30T18:15:14 1738260914

You build systems designed for this workload, you don't foolishy try to use the same patterns as a CRUD app.

michaelt · 2025-01-31T11:34:13 1738323253

> This is a crushing load on a shared resource.

Taylor Swift tickets cost $500 each [1]

An AWS u7in-32tb.224xlarge with 896 vCPUs, 32 TB of memory and 200 Gbps network bandwidth costs $407/hour even at list price [2].

The question is whether they feel like going to the expense and engineering effort.

[1] https://www.businessinsider.com/guides/streaming/how-to-buy-... [2] https://instances.vantage.sh/aws/ec2/u7in-32tb.224xlarge

immibis · 2025-01-30T19:55:12 1738266912

You, as a business entity, have to figure out how you want to allocate them. They aren't going to get sold the normal way, so you get to choose whether you want to acknowledge that and design the system, or not design it and let nature take its course.

I've been to events that used a lottery. You order a ticket any time in, say, a month window. At the end, they tell you if you actually got a ticket. They have a process for linking orders together so you can choose to get a ticket if and only if your friends do.

I've also been to C3, which (in the second phase) knowingly used a first-come-first-serve three times, putting some kind of lightweight proxy in front of the actual shop that would only allow a certain number of people to access it at a time (this is important because you don't know how many tickets each user is going to order). In the first phase, they use a system of "self-replicating vouchers" for various trusted C3-affiliated groups: one voucher is given to each group, allowing an order; at the end of each day until this portion of the ticket pool runs out, a new voucher is given to whoever made an order the previous day. I don't know the reasons why self-replicating vouchers are designed exactly that way, but it means each group gets to run down their own self-determined priority order and gets punished for inefficiency.

The capitalist approach is, of course, raise the price to twenty thousand dollars or whatever level it takes for only 50,000 people to want to buy a ticket.

bigmutant · 2025-01-30T23:35:09 1738280109

As others have said, this is a solved problem in a lot of companies. Basic answers are: 1. Queuing 2. Asynchronous APIs (don't wait for the 'real' response, just submit the transaction) 3. Call-backs to the Client

A good async setup can easily handle 100k+ TPS

If you want to go the synchronous route, it's more complicated but amounts to partitioning and creating separate swim-lanes (copies of the system, both at the compute and data layers)

ndriscoll · 2025-01-31T01:36:05 1738287365

Note that the client doesn't need to know about async operations/you don't need an async api at the http layer. Put the queue in memory. Have your queue workers wait up to ~5 ms to build a batch, or run the transaction when a batch is big enough (at 100k RPS, you already have a batch of 100 every ms). You're adding ~1-5 ms latency, so no reason not to respond synchronously to the client. Conceptually, the queue and workers are an implementation detail within the model. As far as the controller knows, the db query just took an extra ms (or under any real load, responded more quickly).

zapkyeskrill · 2025-01-31T01:46:29 1738287989

Sure, but no matter how many async request you accept you still only have 50k items available. You also presumably take people's money, having them input their personal and card information so not waiting for real response means what? Thank you for your money and the data, we'll be in touch soon; pinky promise?

lmm · 2025-01-31T05:07:33 1738300053

> Thank you for your money and the data, we'll be in touch soon; pinky promise?

That's very much an option when it's something this popular - the Olympics I went to did an even more extreme version of that ("Thank you for putting in which events you wanted to see, your card may be charged up to x some time within the next month").

Or you can do it like plane seats: allocate 50k provisional tickets during the initial release (async but on a small timescale), and then if a provisional ticket isn't paid for within e.g. 3 days you put it back on sale.

Ultimately if it takes you x minutes to confirm payment details then you have to either take payment details from some people who then don't get tickets, or put some tickets back on sale when payment for them fails. But that's not really a scaling issue - you have the same problem trying to sell 1 thing to 5 people on an online shop.

foota · 2025-01-31T03:10:56 1738293056

You have 50,000 tickets to spread between one million people, you can partition people to tickets and only have 20 people per ticket. You won't have strict ordering (e.g., someone who applied after may get a ticket where someone who applied earlier doesn't), but we'd be talking about errors in the milliseconds.

dangets · 2025-01-30T19:54:59 1738266899

Sharding, pre-allocating leases of blocks of tickets across available resources, and eventual consistency. You don't need to keep the UX transactionally correct; you are able to say "0 tickets remaining" and then a minute or hour or day later say "100 tickets remaining". For something as popular as Taylor Swift, the fans will keep checking.

guru4consulting · 2025-02-01T03:33:15 1738380795

while all of the above engineering solutions work, I wish they adopt a human centric solution. When there is too much demand and too little inventory, they could introduce some kind of lottery system where they accept partial payments through out the day and then randomly pick tickets in a lottery (individuals and group bookings as well). In this way, fans can avoid the mad rush, ticketing systems don't have to handle the peak mad rush for 10 minutes.

riku_iki · 2025-01-30T22:24:51 1738275891

> ticketmaster has 50,000 General Admission Taylor Swift tickets and 1M fans eager to hoover them up.

I think something like trivial postgres setup can handle this thing..

danpalmer · 2025-01-30T23:10:52 1738278652

Surprising that neither this post nor the linked paper that it's about talk about Spanner, and instead only look at Oracle/SQL Server/Postgres. From what I understand, Spanner really does scale up OLTP to global level, where those alternatives struggle beyond regional level.

foota · 2025-01-31T03:04:45 1738292685

I think this is really talking about an orthogonal sort of scaling to what you're looking at. Yes, given sufficiently balanced traffic patterns, spanner can serve essentially arbitrary load. If I understand correctly, this article is proposing that most of the work of transaction processing can be done in a layer that's not affinitized, and only do the bare minimum in the key order affinitized server that is responsible for enforcing consistent ordering. This way, single key ranges can handle significantly greater load than they otherwise could, and load from read only calls can be entirely offloaded.

danpalmer · 2025-01-31T05:01:04 1738299664

Maybe, but I think that's just a function of where you draw the database boundary. My naive (external to Google) understanding of Spanner is that it probably is drawn to include these sorts of things, and similarly FoundationDB's internal architecture looks a bit like this setup.

I think perhaps the view of a database being a single server codebase like this is a bit naive. When you read about how Meta deploy MySQL for example, it's a whole service ecosystem that provides different levels of caching, replication, etc, to create a database at a higher level of abstraction that provides the necessary properties. FoundationDB is similarly better viewed as a set of microservices. When you architect a database like that it is possible to achieve these things, but that doesn't seem to be a new idea, that seems to be just how it has been done in the industry for a while now. The article isn't entirely clear on whether they realise this or are proposing something new.

jburbank · 2025-01-30T23:31:02 1738279862

Spanner really can. It has its own quirks, but I've enjoyed working with it.

theLiminator · 2025-01-31T00:54:47 1738284887

I'm really curious about the capabilities of https://cedardb.com/ they claim that they can make a single DB that does both OLTP and OLAP. It's basically the production version of umbra.

tomrod · 2025-01-31T01:36:42 1738287402

They seem to be missing a pricing page (or pricing information of any kind) as well as how to actually purchase the product.

vsipuli · 2025-01-31T03:47:39 1738295259

How does this compare to Amazon Aurora DSQL (https://aws.amazon.com/rds/aurora/dsql/)?

solatic · 2025-01-31T13:12:50 1738329170

Database design fundamentally trades off the speed of acknowledging writes with durability/availability (these are the same thing at scale) guarantees. If you need to guarantee that a change has been replicated globally then you cannot acknowledge the write as quickly as if you have a single control server with no need for persistence.

"Scalable"? Needs more context please. Scalable how? And what are you willing to compromise on for that scale?

honestSysAdmin · 2025-01-30T17:50:52 1738259452

I could be wrong here, I'm not an expert with databases, but I believe -- this is a belief, please help me disconfirm it if you think it is wrong -- that CitusDB's approach that shards Postgres across redundant mirrors, should successfully scale horizontally for OLTP without sacrificing consistency.

AlexB138 · 2025-01-30T18:52:38 1738263158

Citus still has a centralized coordinator that all requests go through to be routed to their shard, or at least it did several years ago when I last ran it in production. It definitely scales further than single instances, but it's not a magic bullet.

dangoodmanUT · 2025-01-30T18:16:14 1738260974

Others can do it too: Spanner, FoundationDB. SI just has generally better performance because there is less to check, and you can always opt-in to row locking if you need (because of write-write conflicts)

teekert · 2025-01-31T10:14:53 1738318493

Ah yes, not the One Laptop Child Per, as I'm typing it, I see my error. This is some DeepSeek level reasoning right here.