> The data is assumed to be randomly distributed to memory nodes in the shared memory pool. Any memory node is equi-distant to any compute node, and a compute node needs to reach multiple memory nodes for transaction execution.
and also:
> Once a memory server fails, NAM-DB halts the complete system and recover all memory servers to a consistent state from the last persisted checkpoint. The recovery procedure is executed by one dedicated compute server that replays the merged log for all memory servers.
> This global stall is problematic for production, but NAM is a research proof-of-concept system. Maybe using redundancy/quorums would be a way to solve this, but then that could be introducing challenges for consistency. Not straightforward.
Yeah, research and proof-of-concept sounds about right. Some might even call it a toy database.
Also worth noting that this is a paper from 2016/2017. The world of (actually!) distributed database design has significantly moved on since then.
Distributed transactions that can assume fast and reliable access to some shared memory are interesting, but they're really the easy part of the problem.
"Scale" means going across large physical distances where neither low latency nor high availability can be assumed, not within a single datacenter where nodes are connected by InfiniBand or whatever.
This is impressive, but won't have huge impact, IMO.
Way back in 2015, MySQL Cluster (NDB Cluster engine) benchmarked 200m transactions/second on commodity hardware [1]. It was read-committed transactions, not snapshot isolation, but still impressive. NDB (or RonDB, the new DB by its author) uses a non-blocking 2-phase commit protocol (failed transaction coordinators are failed over) and is even open-source. Still, it hasn't had a big impact outside of the Telecom and real-time gaming worlds.
RonDB is a KV store with SQL capabilities (the MySQL storage engine NDB is maintained by Oracle as part of MySQL NDB Cluster). Hopsworks is building on top of this, adding a new REST API (REST + gRPC) service that is already in the GitHub tree and will be ready for production use in a few months.
I can't say whether the name is bad or not. I dislike it, and I dislike it strongly enough to not discuss the database, despite personally liking and respecting the developer.
Simple and honest answer to the question and I have probably the most downvotes I've ever gotten for anything. Y'all are just plain broken.
Also you:
> I can't say whether the name is bad or not. I dislike it, and I dislike it strongly enough to not discuss the database
> probably the most downvotes I've ever gotten for anything. Y'all are just plain broken.
You didn't add to the conversation, except adding your own personal problems as noise. This is working as intended. Hopefully this information will be helpful to all of us, going forward.
You submit that this answer is "adding personal problems as noise" and was downvoted, so it's "working as intended".
If the relevance leap is hard for you, let's try this.
Question: We just launched a new product called LAKJfkdshfdskjfwfdsfnbozg, but no one is talking about it. Why doesn't anyone ever talk about LAKJfkdshfdskjfwfdsfnbozg?
Answer: I don't talk about it because I don't like the name.
Your response to this would be to downvote the answer because it is a personal problem, noise, which does not contribute to the conversation?
I hope you do not function in any capacity which requires understanding of sentiment.
I think reasons for slow adoption are probably a mix of:
1. Lack of developer awareness.
2. Security implications (or perceived implications) of exposing memory directly to a network without passing through CPU or application-level access control mechanisms.
The second point may be prohibitive for a lot of general purpose database systems which are intended to be run on shared infrastructure on virtualized instances.
Another reason may be that a lot of production systems are CPU-bound, not memory-bound. RDMA seems ideal for systems which require a lot of memory. I'm thinking maybe with recent advancements in AI/LLMs, it could be an interesting technology as these do require a huge amount of memory relative to CPU.
There are several good ideas in distributed databases that are effectively not deployable in cloud environments because the requisite hardware features don’t exist. Since cloud deployability is a pervasive prerequisite for commercial viability, database designs tend to overfit for the limitations of cloud environments even though we know how to do much better.
Basically, we are stuck in a local minimum of making databases work well in cloud environments that are not designed to enable efficient distributed databases. It is wasteful, and it also provides an arbitrage opportunity for cloud companies.
There's nothing proprietary in Spanner, but I believe no other vendor has atomic-clock infrastructure similar to what Google has. Hence they are not able to implement the Paxos-based global transaction ordering, which requires that the clocks of all servers participating in the transaction be tightly synchronized. This is what one of the comments above refers to, I think.
So while the Spanner paper is open for any vendor to implement, they don't have the proprietary advantage that Google has: the atomic clocks. That's why Yugabyte and CockroachDB don't rely on atomic clocks. I tried to get down to the ground-level basics of this, but I haven't understood the matter completely yet.
I guess using worse clocks would mean a (slightly?) slower Spanner, but I'm not sure what the impact is. In any case, if a big vendor (e.g., Amazon, IBM, Oracle, Dell...) wanted something on par with Google's clocks, they could probably achieve it (though I don't know much about these clocks).
The problem is that the worse the clock is, the more the window for error grows, and the harder it is to recover. It also imposes a hard lower bound on latency: you cannot commit faster than your clock error without taking risks.
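To make that latency floor concrete, here is a minimal Python sketch of TrueTime-style commit-wait (my own illustration of the idea, not Spanner's actual implementation; the 7 ms error bound is an assumed placeholder): a transaction picks a timestamp at the top of the clock's uncertainty interval and then has to sit out the rest of the interval before acknowledging.

```python
import time

CLOCK_ERROR_BOUND_S = 0.007  # epsilon: assumed worst-case clock uncertainty (placeholder)

def now_interval():
    """Return (earliest, latest) bounds on the true time, TrueTime-style."""
    t = time.time()
    return t - CLOCK_ERROR_BOUND_S, t + CLOCK_ERROR_BOUND_S

def commit(apply_writes):
    """Commit-wait: choose a timestamp guaranteed to be >= true time, apply the
    writes, then wait until that timestamp is guaranteed to be in the past
    everywhere before acknowledging the client."""
    _, latest = now_interval()
    commit_ts = latest
    apply_writes(commit_ts)
    while now_interval()[0] < commit_ts:   # wait out the uncertainty window (~2 * epsilon)
        time.sleep(0.0005)
    return commit_ts

print(commit(lambda ts: None))   # demo: pays roughly 2 * CLOCK_ERROR_BOUND_S of wait
```

Every commit pays roughly two error bounds of wall-clock wait in this scheme, which is exactly why the size of the clock error matters so much.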
Note that even Spanner has had multiple outages due to clock and/or network failures. In those cases, any operational guarantees are lost. This makes it really dangerous.
Do you have indications of why they're CPU-bound? I have never seen systems, other than for example linear algebra or special algorithms, max out a modern CPU. Almost all loads today are in one way or another memory-bound.
Distributed transactions are invariably latency (usually storage I/O) bound, rather than CPU or memory bound. I think your #2 is a big part of the challenge with RDMA, as well as the "trickiness" of the programming paradigm.
A bulk purchase of ~60 FDR IB cards, cabling, and network switches to support them sounds pretty expensive.
That being said, IB FDR gear is "older tech" now so the cards and switches can commonly be found reasonably cheaply on Ebay. The switches tend to be bloody loud though, so they're not something you'd want nearby if you can help it.
The idea that one of many writer-compute-nodes can literally reach into a memory buffer that is shared across machines, atomically flip some lock bits and propagate some cache-coherence messages, and use that to build a multi-writer distributed database without needing to partition (and where any writer-compute-node can handle any message, so you can just round-robin a firehose of messages at them)... and that there's a chance (though not yet implemented) that one could implement ACID on top of this? It's absolute madness, and wildly exciting.
The last discussion point from this article very much rings true for me, too. It's been how many years since the Spanner paper, and I still can't get a GCP VM with atomic clocks?
At least now I can provision Cloud Spanner as a managed service, but is this the future of clouds keeping their services at an advantage?
Not GCP related, but I found some telcos local to my server, reached out to their IT department, and got access to use their internal NTP servers. So my servers now sync to stratum 1 servers connected to GPS and atomic clocks, with skew handled via chrony. I'm not paying a dime. I merely had to agree to keep my usage low, and my servers peer with each other to keep their clocks within a few nanoseconds of each other.
It’s amazing what you can get, if you just ask. It’s the latter part you really want, which is doable with modern hardware accelerated time stamping and a peering NTP system (like chrony).
Based on your other post, it sounds like you are doing NTP with your ISP and then using PTP to sync your servers on the local network. Nothing about this implies that you are accurate to the global clock within a microsecond, let alone accurate within any number of nanoseconds.
With a correctly equipped and configured PTP installation locally, you can expect your clocks to be synchronized on the order of a very small number of microseconds with respect to each other, but the relationship between those and the atomic global clock is something else.
If you are using satellite/GPS time sources, then that's another matter. Why then is it relevant to have an NTP hookup at your ISP?
Ah yes, I figured as much, this is why I skipped over the GPS and satellite parts initially. Having a high quality highly local NTP source is better than most sources, but it is nowhere near as close as having your own local time sources, which is what a lot of time sensitive installations do.
Really? It is now Tuesday in central Europe, and I guess most machines¹ managed to figure out that fact.
Most machines will also know it is 8:11.
The question is how much resolution you need and whether your clocks are synced accurately enough for the thing you plan to do.
You are aware that many of the physics experiments that operate at the edge of what is possible use extremely accurate clocks synced over national and sometimes continental borders?
¹ let's ignore the ones that have been configured incorrectly
We've got several thousand nodes in our system. Our ops people have witnessed transient clock skew ranging from seconds to months. Time is constantly synced via ntpd.
You can't trust system clocks in a distributed system to ensure ordering. Some reading:
Perfect time synchronization over a network with unknown latency is provably impossible.
Fairly certain distributed physics experiments will be using atomic clocks that are not synchronized over the network. And/or they'll use direct satellite or GPS time sources which have predictable latencies.
CERN at least seems to be using a solution that is running over Ethernet[1] but with custom hardware that is probably fairly expensive. They use a single time source and then measure the delay between each switch/node.
Though this is limited by needing to be able to run a cable between each node, so I don't know how well that fits your definition of a distributed experiment.
Cleaner perhaps, but wrong. If you know that latency is statistically symmetrical (same in both directions, on average) then you can synchronize clocks to arbitrary precision (asymptotically at least).
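For reference, that symmetry assumption is exactly what the classic NTP sample calculation bakes in. A minimal sketch of the textbook formula (not chrony's implementation):

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """One NTP-style sample:
    t0 = client send, t1 = server receive, t2 = server send, t3 = client receive.
    Returns (estimated clock offset, round-trip delay). The offset estimate is
    exact only if the outbound and return path delays are equal."""
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Example: 5 ms each way, server clock running 2 ms ahead of the client.
print(ntp_offset_and_delay(100.000, 100.007, 100.008, 100.011))
# -> roughly (0.002, 0.010): the 2 ms offset is recovered because the path was symmetric.
```

Any asymmetry between the two directions goes straight into the offset estimate as error, which is why averaging many samples only helps if the asymmetry itself averages out.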
> If you know that latency is statistically symmetrical (same in both directions, on average)
My understanding of relativity is that this is in principle unknowable, and we only assume the speed of light is the same in every direction by convention (every few years there's a paper that claims to have measured the one-way speed of light, but it later turns out they actually measured the two-way speed in a roundabout way), so you can't really know that?
PTP to the ToR isn't rocket science. If you need something hyper-precise and are a large enough customer, you can get just about anything you need built, even in the big clouds.
Crazy customer needs made up most of the projects I saw or worked on.
I don’t think Chrony is NTP, if we are being pedantic, just NTP compatible. My servers have hardware timestamping on the network interface and chrony uses this.
No, NTP and PTP are two different protocols. They can both use hardware timestamps and reach single-digit nanosecond accuracy in ideal conditions. The main difference is in existing support in switches and routers, which is needed to avoid the impact of asymmetric delay between ports (typically tens of nanoseconds per switch).
PTP has good support in higher-end switches and routers, but it's difficult to secure and make resilient to failures. It was designed for automation and control networks in factories, etc. NTP is a better fit for computer networks, but there don't seem to be any switches or routers with HW NTP support. If you really need the best accuracy with NTP, you can find old 100 Mb/s hubs on eBay and create a separate network.
Yes, PTP and NTP are indeed very different protocols.
There's no networking hardware timestamp support for NTP because NTP has nothing to do with hardware timestamps.
PTP can be done without hardware timestamps, but it was designed with hardware support in mind.
I don't know where you got the idea that NTP does anything even orders of magnitude close to nanoseconds:
> NTP can usually maintain time to within tens of milliseconds over the public Internet, and can achieve better than one millisecond accuracy in local area networks under ideal conditions
> The Precision Time Protocol (PTP) is a protocol used to synchronize clocks throughout a computer network. On a local area network, it achieves clock accuracy in the sub-microsecond range, making it suitable for measurement and control systems.
> There's no networking hardware timestamp support for NTP because NTP has nothing to do with hardware timestamps.
Both NTP and PTP don't care (as protocols) where the timestamps are coming from. That's an implementation detail.
> NTP can usually maintain time to within tens of milliseconds over the public Internet, and can achieve better than one millisecond accuracy in local area networks under ideal conditions
That was maybe 20-30 years ago, but not today; the Wikipedia article needs an update. If you don't hit a routing asymmetry, in my experience it's usually milliseconds over the Internet and tens of microseconds in a local network when using SW timestamping. Please note that NTP clients by default use long polling intervals to avoid excessive load on public servers on the Internet, so they need to be specifically configured for better performance in local networks.
Note that this is for the system clock, which has to be synchronized over PCIe to the hardware clock of the NIC. That adds hundreds of nanoseconds of uncertainty. It doesn't matter if the hardware clock is synchronized by PTP or NTP.
If you care only about the hardware clocks, it's easy to show how accurate the synchronization is by comparing their PPS signals on a scope. NTP between two directly connected NICs, or with a hub, can get to single-digit nanosecond accuracy. I have seen that in my testing. It's just timestamps; it doesn't matter how they are exchanged.
The atomic clock is not in the VM; from what I understand, it's a service similar to NTP but obviously much more localised.
I in fact think that once atomic clocks are offered as a service on any public cloud, distributed transactions become a lot easier to implement, thereby disrupting Spanner.
That's a clickbait title, IMHO. Sure, with some hardware changes you can get different tradeoffs with regards to performance, but that does not solve the distributed transactions problem completely. E.g., what about availability?
Haven't read the full paper, but several aspects of this sound less than fully reliable.
It reminds me specifically of 90's multi-client LAN database systems (dBase, Clipper) where clients coordinated via file locks. Unreliability & hangs became a big problem for us.
In the summarized RDMA database, I'd be pretty concerned about reliability & integrity:
1) Crashed servers will leave records locked, and the system will hang.
2) It's questionable whether lock timeouts can be adjudicated reliably (a lease-style approach is sketched after this list).
3) Any errors in server behaviour can easily & widely corrupt data across any other nodes.
4) Overall the RDMA coordination makes me cautious. Can we really replace Paxos with RDMA reliably? If not, problems squeeze out elsewhere.
5) The proposed single-threaded recovery procedure sounds like a hazardous operational bottleneck.
6) I'm also cautious about the coordination requirements around recovery, or around transacting while knowing that recovery is not in progress, unless we can show that this can be reliable & not add cost to the protocol.
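On 1) and 2): the usual workaround is a lease, where the lock word itself carries an expiry and a waiter may steal the lock once the lease has run out. A toy single-machine sketch of the idea (my own illustration; a real RDMA design would compare-and-swap a remote 64-bit lock word rather than mutate a Python dict):

```python
import time

LEASE_S = 0.5  # how long a holder may keep the lock before others may steal it

def try_acquire(lock_word, me, now=None):
    """Toy lease lock. The lock is free if it is unowned or if the current
    holder's lease has expired -- and judging 'expired' is exactly the part
    that is hard to adjudicate reliably across machines with different clocks."""
    now = time.monotonic() if now is None else now
    owner = lock_word.get("owner")
    expiry = lock_word.get("expiry", 0.0)
    if owner is None or now >= expiry:
        lock_word["owner"], lock_word["expiry"] = me, now + LEASE_S
        return True
    return False

lock = {}
print(try_acquire(lock, "node-A"))   # True: lock was free
print(try_acquire(lock, "node-B"))   # False: node-A's lease is still live
time.sleep(LEASE_S + 0.05)
print(try_acquire(lock, "node-B"))   # True: the lease expired -- but is node-A really dead?
```

The catch is the last line: a paused-but-alive holder can still believe it owns the lock, so a lease has to be paired with something like fencing tokens or epoch checks before it's safe.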
I implemented multiversion concurrency control in Java with threads, but I want to extend it to multiple machines.
The problem I have is that the read and write timestamps need to be available to detect whether transactions on different machines conflict.
How do I synchronize read and write timestamps with minimal latency?
When a database replica receives a transaction for key X, it needs (a) a timestamp that is globally accurate and (b) to tell other replicas about it so they can detect dangerous read-write dependencies.
I feel it inherently requires serialisation of timestamps to accurately determine what other nodes are doing with data.
Otherwise you get dangerous read-write skew.
(See the whitepaper "Serializable Snapshot Isolation")
Load balancers don't do much work, except shift traffic. I wonder if a load balancer that collects timestamps and enriches requests with timestamp information would be a scalable solution?
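To make the conflict check itself concrete (setting the latency question aside), here's a toy single-node validator; it's my own sketch of plain optimistic, timestamp-based validation, not the full algorithm from the SSI paper. Each key remembers the commit timestamp of its last write, and a transaction aborts if anything it read or wrote was overwritten after its snapshot. The distributed-systems problem above is precisely how to keep the `_last_write` map and the timestamp counter behaving like this once they are spread over many machines.

```python
import itertools
import threading

class ToyValidator:
    """Single-node sketch of commit-time validation against read/write timestamps.
    Not the SSI algorithm itself: just first-committer-wins plus read-set
    validation, to show where the cross-machine timestamps are needed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ts = itertools.count(1)   # stand-in for a timestamp oracle
        self._last_write = {}           # key -> commit timestamp of its last write

    def begin(self):
        with self._lock:
            return next(self._ts)       # snapshot (start) timestamp

    def commit(self, start_ts, read_set, write_set):
        with self._lock:
            for key in read_set | write_set:
                if self._last_write.get(key, 0) > start_ts:
                    return None         # conflict: someone committed after our snapshot
            commit_ts = next(self._ts)
            for key in write_set:
                self._last_write[key] = commit_ts
            return commit_ts

v = ToyValidator()
t1, t2 = v.begin(), v.begin()
print(v.commit(t1, read_set={"x"}, write_set={"y"}))  # commits
print(v.commit(t2, read_set={"y"}, write_set={"x"}))  # None: it read "y", which was just overwritten
```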
Everything old is new again. The MongoDB people are still convinced they "invented" NoSQL databases, but really just poorly implemented an idea that https://en.wikipedia.org/wiki/ADABAS had perfected while the MongoDB hipsters were in diapers.
If we're bypassing TCP/IP, does that mean we need to build our own protocol on top of this to achieve the reliability guarantees etc provided by the TCP stack?
RDMA (remote direct memory access) is a zero-copy communication standard.
RDMA is a user-space networking solution, accessed via queue pairs: lock-free data structures shared between user code and the network controller (NIC), consisting of a send queue and a receive queue.
RDMA supports several modes of operation [such as] reliable two-sided RDMA operations, which behave similarly to TCP. With this mode, the sender and receiver bind their respective queue pairs together, creating a session fully implemented by the NIC endpoints.
Once a send and the matching receive are posted, the data is copied directly from the sender’s memory to the receiver’s designated ___location, reliably and at the full rate the hardware can support.
A completion queue reports outcomes. End-to-end software resending or acknowledgments are not needed: either the hardware delivers the correct data (in FIFO order) and reports success, or the connection breaks.
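If it helps to see that vocabulary in one place, here's a toy in-process model of the two-sided send/receive flow just described (pure illustration in Python; real RDMA goes through the verbs API and the NIC, and the "zero-copy" below is only zero-copy in spirit):

```python
from collections import deque

class CompletionQueue:
    """Reports outcomes of posted work requests."""
    def __init__(self):
        self.entries = deque()
    def poll(self):
        return self.entries.popleft() if self.entries else None

class QueuePair:
    """Toy model of a reliable two-sided queue pair: the receiver posts buffers
    in advance, the sender posts sends, and 'the NIC' (plain Python here) copies
    bytes straight into the posted buffer and reports completions on both sides."""
    def __init__(self, cq):
        self.recv_queue = deque()   # receive buffers posted by the application
        self.cq = cq
        self.peer = None

    def connect(self, other):       # bind the two queue pairs into a session
        self.peer, other.peer = other, self

    def post_recv(self, buf: bytearray):
        self.recv_queue.append(buf)

    def post_send(self, data: bytes):
        dst = self.peer.recv_queue.popleft()      # a matching receive must be posted
        dst[:len(data)] = data                    # delivered directly into the receiver's buffer
        self.peer.cq.entries.append(("recv", len(data)))
        self.cq.entries.append(("send_ok", len(data)))

cq_a, cq_b = CompletionQueue(), CompletionQueue()
a, b = QueuePair(cq_a), QueuePair(cq_b)
a.connect(b)
buf = bytearray(16)
b.post_recv(buf)                  # receiver posts a buffer first
a.post_send(b"hello")             # sender posts a send; data lands in buf
print(cq_a.poll(), cq_b.poll(), bytes(buf[:5]))   # ('send_ok', 5) ('recv', 5) b'hello'
```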
> Is RDMA mature (robust/reliable) enough to use in distributed transactions?
At least on Linux systems, RDMA has been pretty robust/reliable for probably a decade, maybe more.
> What are the handicaps?
It works differently to TCP/IP, which everyone in IT has at least passing familiarity with. So, it tends to be automatically passed over unless people hit a situation where they're open to "exotic" solutions.
That being said, there's a TCP/IP shim layer (IPoIB) available which can be used by existing software to run on an IB network.
That shim layer though used to have a reputation for flakiness, and it was (or at least used to be) measurably slower than using native IB.
> What are the reasons for slow uptake on this?
Repeated self-inflicted foot-guns by Mellanox leadership or perhaps their sales and marketing leadership is my best guess.
Mellanox adapters when brand new are priced fairly high for network adapters, or at least they used to be. However, when a network using them is upgraded to the next generation gear, a substantial number of the old ones would commonly end up for ~cheap resale on places like Ebay.
So, *nix DevOps staff ("Sysadmins" back in the day) and anyone else who needed fast networking for their home labs and similar would pick them up and figure out how to use them.
Which of course meant that, over time, there was a growing group of people familiar with IB who, when figuring out solutions for their workplaces, had enough confidence and knowledge to order them.
Sounds like typical organic growth right?
<rant>
Except Mellanox leadership - or at least their Sales & Marketing people - seemed to be fucking horrified that people were buying their expensive adapters cheaply on Ebay.
So, while the Mellanox technical people were receptive to this ground swell growth in usage by "unofficial" people, and tried to help out, their leadership time and time again did their level best to stamp it out.
Including killing off one of their decent "Community" growth initiatives -> just turned it into a fucking marketing channel for press releases and other crap <- Ugh.
But also doing things like instructing their support staff to not answer anyone on their forums who seemed to have purchased their gear through unofficial channels, etc. Nor let anyone else do so.
There were many, many examples of this bullshit over the years. And they were always like "Why don't we have huge adoption?"
Gee, I wonder? :( :( :(
</rant>
Anyway, I gave up on them and moved on a few years afterwards, prior to Nvidia buying them. I still buy older Mellanox ConnectX-3 (VPI) adapters off Ebay occasionally for home gear use (with Linux), and they're still solid. 10/40GbE. :)
Good article, although the title might be a bit too click-bait-y... there's plenty of research, and there are available services, that can reliably do low-latency distributed writes at the highest data isolation levels (strict serializability). Both the Calvin and Spanner papers describe such systems.
The two papers mentioned in the post are highly insightful. This is one of my favorite subjects.
To fully disclose, I have a biased view on this given that I work for a (closed source) serverless, no-ops DB provider (Fauna) that implements a distributed transaction engine that is natively document-relational and doesn't compromise on relational (ACID, transactional) guarantees.
From what I understand, this doesn't change anything at a fundamental level.
The particular timestamp oracle implementation might still suffer from metadata bloat, and the commit protocol assumes an optimistic segment lock, which would be a bottleneck in multiple scenarios.
So, from what I understand, this is a (good?) practical speedup for some scenarios, but it doesn't change the fact that it's not possible to make an arbitrary system distributed without sacrificing correctness or throughput.
An easy reasoning: imagine that every node is one light year apart from all others.
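On the timestamp-oracle point above: a single global counter that every transaction bumps is indeed a hot spot, and the usual way to relieve it is to shard the counter so each worker advances only its own slot while readers take the whole vector as their snapshot. A rough Python sketch of that shape (my own illustration, not the paper's exact design; a real system would fetch these slots with one-sided RDMA reads rather than locks):

```python
import threading

class VectorTimestampOracle:
    """Sharded timestamp oracle sketch: each worker owns one slot, so there is
    no single contended counter; a reader's snapshot is the whole vector, and
    a version tagged (w, c) is visible to snapshot S iff c <= S[w]."""
    def __init__(self, n_workers):
        self.slots = [0] * n_workers
        self.locks = [threading.Lock() for _ in range(n_workers)]

    def next_commit_ts(self, worker_id):
        with self.locks[worker_id]:     # contention limited to one worker's slot
            self.slots[worker_id] += 1
            return (worker_id, self.slots[worker_id])

    def read_snapshot(self):
        return list(self.slots)         # vector read timestamp

oracle = VectorTimestampOracle(n_workers=4)
print(oracle.next_commit_ts(2))   # (2, 1)
print(oracle.read_snapshot())     # [0, 0, 1, 0]
```

Whether that removes the bottleneck in practice is exactly the open question raised above, since the snapshot still has to be fetched and the vector grows with the number of workers (the metadata-bloat concern).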
Wait, why aren't they just using an FPGA to offload processing? How does a 100G programmable NIC with direct DMA not give them the performance they need?
Interesting paper. That said, I find the image at the top of the article and the play on words in NAM-DB disturbing, considering the many atrocities and war crimes the US committed in the Vietnam War. The authors should have been more thoughtful given the usefulness of the paper on pure technical merits.