> The data is assumed to be randomly distributed to memory nodes in the shared memory pool. Any memory node is equi-distant to any compute node, and a compute node needs to reach multiple memory nodes for transaction execution.
and also:
> Once a memory server fails, NAM-DB halts the complete system and recover all memory servers to a consistent state from the last persisted checkpoint. The recovery procedure is executed by one dedicated compute server that replays the merged log for all memory servers.
> This global stall is problematic for production, but NAM is a research proof-of-concept system. Maybe using redundancy/quorums would be a way to solve this, but then that could be introducing challenges for consistency. Not straightforward.
Yeah, research and proof-of-concept sounds about right. Some might even call it a toy database.
Also worth noting that this is a paper from 2016/2017. The world of (actually!) distributed database design has significantly moved on since then.
Distributed transactions that can assume fast and reliable access to some shared memory are interesting, but they're really the easy part of the problem.
"Scale" means going across large physical distances where neither low latency nor high availability can be assumed, not within a single datacenter where nodes are connected by InfiniBand or whatever.
This is impressive, but won't have huge impact, IMO.
Way back in 2015, MySQL Cluster (NDB Cluster engine) benchmarked 200m transactions/second on commodity hardware [1]. It was read-committed transactions, not snapshot isolation, but still impressive. NDB (or RonDB, the new DB by its author) uses a non-blocking 2-phase commit protocol (failed transaction coordinators are failed over) and is even open-source. Still, it hasn't had a big impact outside of the Telecom and real-time gaming worlds.
RonDB is a KV store with SQL capabilities (the MySQL storage engine NDB is maintained by Oracle as part of MySQL NDB Cluster). Hopsworks is building on top of this, adding a new REST API (REST + gRPC) service that is already in the GitHub tree and will be ready for production use in a few months.
I can't say whether the name is bad or not. I dislike it, and I dislike it strongly enough to not discuss the database, despite personally liking and respecting the developer.
Simple and honest answer to the question and I have probably the most downvotes I've ever gotten for anything. Y'all are just plain broken.
Also you:
> I can't say whether the name is bad or not. I dislike it, and I dislike it strongly enough to not discuss the database
> probably the most downvotes I've ever gotten for anything. Y'all are just plain broken.
You didn't add to the conversation, except adding your own personal problems as noise. This is working as intended. Hopefully this information will be helpful to all of us, going forward.
You submit that this answer is "adding personal problems as noise" and was downvoted, so it's "working as intended".
If the relevance leap is hard for you, let's try this.
Question: We just launched a new product called LAKJfkdshfdskjfwfdsfnbozg, but no one is talking about it. Why doesn't anyone ever talk about LAKJfkdshfdskjfwfdsfnbozg?
Answer: I don't talk about it because I don't like the name.
Your response to this would be to downvote the answer because it is a personal problem, noise, which does not contribute to the conversation?
I hope you do not function in any capacity which requires understanding of sentiment.
I think reasons for slow adoption are probably a mix of:
1. Lack of developer awareness.
2. Security implications (or perceived implications) of exposing memory directly to a network without passing through CPU or application-level access control mechanisms.
The second point may be prohibitive for a lot of general purpose database systems which are intended to be run on shared infrastructure on virtualized instances.
Another reason may be that a lot of production systems are CPU-bound, not memory-bound. RDMA seems ideal for systems which require a lot of memory. I'm thinking maybe with recent advancements in AI/LLMs, it could be an interesting technology as these do require a huge amount of memory relative to CPU.
There are several good ideas in distributed databases that are effectively not deployable in cloud environments because the requisite hardware features don’t exist. Since cloud deployability is a pervasive prerequisite for commercial viability, database designs tend to overfit for the limitations of cloud environments even though we know how to do much better.
Basically, we are stuck in a local minimum of making databases work well in cloud environments that are not designed to enable efficient distributed databases. It is wasteful, and it also provides an arbitrage opportunity for cloud companies.
There's nothing proprietary in Spanner, but I believe no other vendor has atomic-clock infrastructure similar to what Google has. Hence they are not able to implement the Paxos-based global transaction ordering, which requires that the clocks of all servers participating in the transaction be tightly synchronized. This is what one of the comments above refers to, I think.
So while the Spanner paper is open for any vendor to implement, they don't have the proprietary advantage that Google has: the atomic clocks. That's why Yugabyte and CockroachDB don't rely on atomic clocks. I tried to get down to the ground-level basics of this, but I haven't understood the matter completely yet.
I guess using worse clocks would mean a (slightly?) slower Spanner, but I'm not sure what the impact is. In any case, if a big vendor (e.g., Amazon, IBM, Oracle, Dell...) wanted something on par with Google's clocks, they could probably achieve it (though I don't know much about these clocks).
The problem is that the worse the clock is, the more the window for error grows, and the harder it is to recover. It also imposes a hard lower bound on latency: you cannot commit faster than your clock error without taking risks.
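To make that latency floor concrete, here is a minimal Python sketch of TrueTime-style commit-wait (my own illustration of the idea, not Spanner's actual implementation; the 7 ms error bound is an assumed placeholder): a transaction picks a timestamp at the top of the clock's uncertainty interval and then has to sit out the rest of the interval before acknowledging.

```python
import time

CLOCK_ERROR_BOUND_S = 0.007  # epsilon: assumed worst-case clock uncertainty (placeholder)

def now_interval():
    """Return (earliest, latest) bounds on the true time, TrueTime-style."""
    t = time.time()
    return t - CLOCK_ERROR_BOUND_S, t + CLOCK_ERROR_BOUND_S

def commit(apply_writes):
    """Commit-wait: choose a timestamp guaranteed to be >= true time, apply the
    writes, then wait until that timestamp is guaranteed to be in the past
    everywhere before acknowledging the client."""
    _, latest = now_interval()
    commit_ts = latest
    apply_writes(commit_ts)
    while now_interval()[0] < commit_ts:   # wait out the uncertainty window (~2 * epsilon)
        time.sleep(0.0005)
    return commit_ts

print(commit(lambda ts: None))   # demo: pays roughly 2 * CLOCK_ERROR_BOUND_S of wait
```

Every commit pays roughly two error bounds of wall-clock wait in this scheme, which is exactly why the size of the clock error matters so much.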
Note that even Spanner has had multiple outages due to clock and/or network failures. In those cases, any operational guarantees are lost. This makes it really dangerous.
Do you have indications of why they're CPU-bound? I have never seen systems, other than for example linear algebra or special algorithms, max out a modern CPU. Almost all loads today are in one way or another memory-bound.
Distributed transactions are invariably latency (usually storage I/O) bound, rather than CPU or memory bound. I think your #2 is a big part of the challenge with RDMA, as well as the "trickiness" of the programming paradigm.
A bulk purchase of ~60 FDR IB cards, cabling, and network switches to support them sounds pretty expensive.
That being said, IB FDR gear is "older tech" now so the cards and switches can commonly be found reasonably cheaply on Ebay. The switches tend to be bloody loud though, so they're not something you'd want nearby if you can help it.
The idea that one of many writer-compute-nodes can literally reach into a memory buffer that is shared across machines, atomically flip some lock bits and propagate some cache-coherence messages, and use that to build a multi-writer distributed database without needing to partition (and where any writer-compute-node can handle any message, so you can just round-robin a firehose of messages at them)... and that there's a chance (though not yet implemented) that one could implement ACID on top of this? It's absolute madness, and wildly exciting.
The last discussion point from this article very much rings true for me, too. It's been how many years since the Spanner paper, and I still can't get a GCP VM with atomic clocks?
At least now I can provision Cloud Spanner as a managed service, but is this the future of clouds keeping their services at an advantage?
Not GCP related, but I found some telcos local to my server, reached out to their IT department, and got access to use their internal NTP servers. So my servers now sync to stratum 1 servers connected to GPS and atomic clocks, with skew handled via chrony. I'm not paying a dime. I merely had to agree to keep my usage low, and my servers peer with each other to keep their clocks within a few nanoseconds of each other.
It’s amazing what you can get, if you just ask. It’s the latter part you really want, which is doable with modern hardware accelerated time stamping and a peering NTP system (like chrony).
Based on your other post, it sounds like you are doing NTP with your ISP and then using PTP to sync your servers on the local network. Nothing about this implies that you are accurate to the global clock within a microsecond, let alone accurate within any number of nanoseconds.
With a correctly equipped and configured PTP installation locally, you can expect your clocks to be synchronized on the order of a very small number of microseconds with respect to each other, but the relationship between those and the atomic global clock is something else.
If you are using satellite/GPS time sources, then that's another matter. Why then is it relevant to have an NTP hookup at your ISP?
Ah yes, I figured as much, this is why I skipped over the GPS and satellite parts initially. Having a high quality highly local NTP source is better than most sources, but it is nowhere near as close as having your own local time sources, which is what a lot of time sensitive installations do.
Really? It is now Tuesday in central Europe, and I guess most machines¹ managed to figure out that fact.
Most machines will also know it is 8:11.
The question is how much resolution you need and whether your clocks are synced accurately enough for the thing you plan to do.
You are aware that many of the physics experiments that operate at the edge of what is possible use extremely accurate clocks synced over national and sometimes continental borders?
¹ let's ignore the ones that have been configured incorrectly
We've got several thousand nodes in our system. Our ops people have witnessed transient clock skew ranging from seconds to months. Time is constantly synced via ntpd.
You can't trust system clocks in a distributed system to ensure ordering. Some reading:
Perfect time synchronization over a network with unknown latency is provably impossible.
Fairly certain distributed physics experiments will be using atomic clocks that are not synchronized over the network. And/or they'll use direct satellite or GPS time sources which have predictable latencies.
CERN at least seems to be using a solution that is running over Ethernet[1] but with custom hardware that is probably fairly expensive. They use a single time source and then measure the delay between each switch/node.
Though this is limited by needing to be able to run a cable between each node, so I don't know how well that fits your definition of a distributed experiment.
Cleaner perhaps, but wrong. If you know that latency is statistically symmetrical (same in both directions, on average) then you can synchronize clocks to arbitrary precision (asymptotically at least).
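For reference, that symmetry assumption is exactly what the classic NTP sample calculation bakes in. A minimal sketch of the textbook formula (not chrony's implementation):

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """One NTP-style sample:
    t0 = client send, t1 = server receive, t2 = server send, t3 = client receive.
    Returns (estimated clock offset, round-trip delay). The offset estimate is
    exact only if the outbound and return path delays are equal."""
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Example: 5 ms each way, server clock running 2 ms ahead of the client.
print(ntp_offset_and_delay(100.000, 100.007, 100.008, 100.011))
# -> roughly (0.002, 0.010): the 2 ms offset is recovered because the path was symmetric.
```

Any asymmetry between the two directions goes straight into the offset estimate as error, which is why averaging many samples only helps if the asymmetry itself averages out.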
> If you know that latency is statistically symmetrical (same in both directions, on average)
My understanding of relativity is that this is in principle unknowable, and we only assume the speed of light is the same in every direction by convention (every few years there's a paper that claims to have measured the one-way speed of light, but it later turns out they actually measured the two-way speed in a roundabout way), so you can't really know that?
PTP to the ToR isn't rocket science. If you need something hyper-precise and are a large enough customer, you can get just about anything you need built, even in the big clouds.
Crazy customer needs made up most of the projects I saw or worked on.
I don’t think Chrony is NTP, if we are being pedantic, just NTP compatible. My servers have hardware timestamping on the network interface and chrony uses this.
No, NTP and PTP are two different protocols. They can both use hardware timestamps and reach single-digit nanosecond accuracy in ideal conditions. The main difference is in existing support in switches and routers, which is needed to avoid the impact of asymmetric delay between ports (typically tens of nanoseconds per switch).
PTP has good support in higher-end switches and routers, but it's difficult to secure and make resilient to failures. It was designed for automation and control networks in factories, etc. NTP is a better fit for computer networks, but there don't seem to be any switches or routers with HW NTP support. If you really need the best accuracy with NTP, you can find old 100 Mb/s hubs on eBay and create a separate network.
Yes, PTP and NTP are indeed very different protocols.
There's no networking hardware timestamp support for NTP because NTP has nothing to do with hardware timestamps.
PTP can be done without hardware timestamps, but it was designed with hardware support in mind.
I don't know where you got the idea that NTP does anything even orders of magnitude close to nanoseconds:
> NTP can usually maintain time to within tens of milliseconds over the public Internet, and can achieve better than one millisecond accuracy in local area networks under ideal conditions
> The Precision Time Protocol (PTP) is a protocol used to synchronize clocks throughout a computer network. On a local area network, it achieves clock accuracy in the sub-microsecond range, making it suitable for measurement and control systems.
> There's no networking hardware timestamp support for NTP because NTP has nothing to do with hardware timestamps.
Both NTP and PTP don't care (as protocols) where the timestamps are coming from. That's an implementation detail.
> NTP can usually maintain time to within tens of milliseconds over the public Internet, and can achieve better than one millisecond accuracy in local area networks under ideal conditions
That was maybe 20-30 years ago, but not today; the Wikipedia article needs an update. If you don't hit a routing asymmetry, in my experience it's usually milliseconds over the Internet and tens of microseconds in a local network when using SW timestamping. Please note that NTP clients by default use long polling intervals to avoid excessive load on public servers on the Internet, so they need to be specifically configured for better performance in local networks.
Note that this is for the system clock, which has to be synchronized over PCIe to the hardware clock of the NIC. That adds hundreds of nanoseconds of uncertainty. It doesn't matter if the hardware clock is synchronized by PTP or NTP.
If you care only about the hardware clocks, it's easy to show how accurate the synchronization is by comparing their PPS signals on a scope. NTP between two directly connected NICs, or with a hub, can get to single-digit nanosecond accuracy. I have seen that in my testing. It's just timestamps; it doesn't matter how they are exchanged.
The atomic clock is not in the VM; from what I understand, it's a service similar to NTP but obviously much more localised.
I in fact think that once atomic clocks are offered as a service on any public cloud, distributed transactions become a lot easier to implement, thereby disrupting Spanner.
That's a clickbait title, IMHO. Sure, with some hardware changes you can get different tradeoffs with regards to performance, but that does not solve the distributed transactions problem completely. E.g., what about availability?
Haven't read the full paper, but several aspects of this sound less than fully reliable.
It reminds me specifically of 90's multi-client LAN database systems (dBase, Clipper) where clients coordinated via file locks. Unreliability & hangs became a big problem for us.
In the summarized RDMA database, I'd be pretty concerned about reliability & integrity:
1) Crashed servers will leave records locked, and the system will hang.
2) It's questionable whether lock timeouts can be adjudicated reliably (a lease-style approach is sketched after this list).
3) Any errors in server behaviour can easily & widely corrupt data across any other nodes.
4) Overall the RDMA coordination makes me cautious. Can we really replace Paxos with RDMA reliably? If not, problems squeeze out elsewhere.
5) The proposed single-threaded recovery procedure sounds like a hazardous operational bottleneck.
6) I'm also cautious about the coordination requirements around recovery, or around transacting while knowing that recovery is not in progress, unless we can show that this can be reliable & not add cost to the protocol.
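On 1) and 2): the usual workaround is a lease, where the lock word itself carries an expiry and a waiter may steal the lock once the lease has run out. A toy single-machine sketch of the idea (my own illustration; a real RDMA design would compare-and-swap a remote 64-bit lock word rather than mutate a Python dict):

```python
import time

LEASE_S = 0.5  # how long a holder may keep the lock before others may steal it

def try_acquire(lock_word, me, now=None):
    """Toy lease lock. The lock is free if it is unowned or if the current
    holder's lease has expired -- and judging 'expired' is exactly the part
    that is hard to adjudicate reliably across machines with different clocks."""
    now = time.monotonic() if now is None else now
    owner = lock_word.get("owner")
    expiry = lock_word.get("expiry", 0.0)
    if owner is None or now >= expiry:
        lock_word["owner"], lock_word["expiry"] = me, now + LEASE_S
        return True
    return False

lock = {}
print(try_acquire(lock, "node-A"))   # True: lock was free
print(try_acquire(lock, "node-B"))   # False: node-A's lease is still live
time.sleep(LEASE_S + 0.05)
print(try_acquire(lock, "node-B"))   # True: the lease expired -- but is node-A really dead?
```

The catch is the last line: a paused-but-alive holder can still believe it owns the lock, so a lease has to be paired with something like fencing tokens or epoch checks before it's safe.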
I implemented multiversion concurrency control in Java with threads, but I want to extend it to multiple machines.
The problem I have is that the read and write timestamps need to be available to detect whether transactions on different machines conflict.
How do I synchronize read and write timestamps with minimal latency?
When a database replica receives a transaction for key X, it needs (a) a timestamp that is globally accurate and (b) to tell other replicas about it so they can detect dangerous read-write dependencies.
I feel it inherently requires serialisation of timestamps to accurately determine what other nodes are doing with data.
Otherwise you get dangerous read-write skew.
(See the whitepaper "Serializable Snapshot Isolation")
Load balancers don't do much work, except shift traffic. I wonder if a load balancer that collects timestamps and enriches requests with timestamp information would be a scalable solution?
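To make the conflict check itself concrete (setting the latency question aside), here's a toy single-node validator; it's my own sketch of plain optimistic, timestamp-based validation, not the full algorithm from the SSI paper. Each key remembers the commit timestamp of its last write, and a transaction aborts if anything it read or wrote was overwritten after its snapshot. The distributed-systems problem above is precisely how to keep the `_last_write` map and the timestamp counter behaving like this once they are spread over many machines.

```python
import itertools
import threading

class ToyValidator:
    """Single-node sketch of commit-time validation against read/write timestamps.
    Not the SSI algorithm itself: just first-committer-wins plus read-set
    validation, to show where the cross-machine timestamps are needed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ts = itertools.count(1)   # stand-in for a timestamp oracle
        self._last_write = {}           # key -> commit timestamp of its last write

    def begin(self):
        with self._lock:
            return next(self._ts)       # snapshot (start) timestamp

    def commit(self, start_ts, read_set, write_set):
        with self._lock:
            for key in read_set | write_set:
                if self._last_write.get(key, 0) > start_ts:
                    return None         # conflict: someone committed after our snapshot
            commit_ts = next(self._ts)
            for key in write_set:
                self._last_write[key] = commit_ts
            return commit_ts

v = ToyValidator()
t1, t2 = v.begin(), v.begin()
print(v.commit(t1, read_set={"x"}, write_set={"y"}))  # commits
print(v.commit(t2, read_set={"y"}, write_set={"x"}))  # None: it read "y", which was just overwritten
```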
Everything old is new again. The MongoDB people are still convinced they "invented" NoSQL databases, but really just poorly implemented an idea that https://en.wikipedia.org/wiki/ADABAS had perfected while the MongoDB hipsters were in diapers.
If we're bypassing TCP/IP, does that mean we need to build our own protocol on top of this to achieve the reliability guarantees etc provided by the TCP stack?
RDMA (remote direct memory access) is a zero-copy communication standard.
RDMA is a user-space networking solution, accessed via queue pairs: lock-free data structures shared between user code and the network controller (NIC), consisting of a send queue and a receive queue.
RDMA supports several modes of operation [such as] reliable two-sided RDMA operations, which behave similarly to TCP. With this mode, the sender and receiver bind their respective queue pairs together, creating a session fully implemented by the NIC endpoints.
Once a send and the matching receive are posted, the data is copied directly from the sender’s memory to the receiver’s designated ___location, reliably and at the full rate the hardware can support.
A completion queue reports outcomes. End-to-end software resending or acknowledgments are not needed: either the hardware delivers the correct data (in FIFO order) and reports success, or the connection breaks.
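If it helps to see that vocabulary in one place, here's a toy in-process model of the two-sided send/receive flow just described (pure illustration in Python; real RDMA goes through the verbs API and the NIC, and the "zero-copy" below is only zero-copy in spirit):

```python
from collections import deque

class CompletionQueue:
    """Reports outcomes of posted work requests."""
    def __init__(self):
        self.entries = deque()
    def poll(self):
        return self.entries.popleft() if self.entries else None

class QueuePair:
    """Toy model of a reliable two-sided queue pair: the receiver posts buffers
    in advance, the sender posts sends, and 'the NIC' (plain Python here) copies
    bytes straight into the posted buffer and reports completions on both sides."""
    def __init__(self, cq):
        self.recv_queue = deque()   # receive buffers posted by the application
        self.cq = cq
        self.peer = None

    def connect(self, other):       # bind the two queue pairs into a session
        self.peer, other.peer = other, self

    def post_recv(self, buf: bytearray):
        self.recv_queue.append(buf)

    def post_send(self, data: bytes):
        dst = self.peer.recv_queue.popleft()      # a matching receive must be posted
        dst[:len(data)] = data                    # delivered directly into the receiver's buffer
        self.peer.cq.entries.append(("recv", len(data)))
        self.cq.entries.append(("send_ok", len(data)))

cq_a, cq_b = CompletionQueue(), CompletionQueue()
a, b = QueuePair(cq_a), QueuePair(cq_b)
a.connect(b)
buf = bytearray(16)
b.post_recv(buf)                  # receiver posts a buffer first
a.post_send(b"hello")             # sender posts a send; data lands in buf
print(cq_a.poll(), cq_b.poll(), bytes(buf[:5]))   # ('send_ok', 5) ('recv', 5) b'hello'
```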
> Is RDMA mature (robust/reliable) enough to use in distributed transactions?
At least on Linux systems, RDMA has been pretty robust/reliable for probably a decade, maybe more.
> What are the handicaps?
It works differently to TCP/IP, which everyone in IT has at least passing familiarity with. So, it tends to be automatically passed over unless people hit a situation where they're open to "exotic" solutions.
That being said, there's a TCP/IP shim layer (IPoIB) available which can be used by existing software to run on an IB network.
That shim layer though used to have a reputation for flakiness, and it was (or at least used to be) measurably slower than using native IB.
> What are the reasons for slow uptake on this?
Repeated self-inflicted foot-guns by Mellanox leadership or perhaps their sales and marketing leadership is my best guess.
Mellanox adapters when brand new are priced fairly high for network adapters, or at least they used to be. However, when a network using them is upgraded to the next generation gear, a substantial number of the old ones would commonly end up for ~cheap resale on places like Ebay.
So, *nix DevOps staff ("Sysadmins" back in the day) and anyone else who needed fast networking for their home labs and similar would pick them up and figure out how to use them.
Which of course meant that, over time, there was a growing group of people familiar with IB who, when figuring out solutions for their workplaces, had enough confidence and knowledge to order them.
Sounds like typical organic growth right?
<rant>
Except Mellanox leadership - or at least their Sales & Marketing people - seemed to be fucking horrified that people were buying their expensive adapters cheaply on Ebay.
So, while the Mellanox technical people were receptive to this ground swell growth in usage by "unofficial" people, and tried to help out, their leadership time and time again did their level best to stamp it out.
Including killing off one of their decent "Community" growth initiatives -> just turned it into a fucking marketing channel for press releases and other crap <- Ugh.
But also doing things like instructing their support staff to not answer anyone on their forums who seemed to have purchased their gear through unofficial channels, etc. Nor let anyone else do so.
There were many, many examples of this bullshit over the years. And they were always like "Why don't we have huge adoption?"
Gee, I wonder? :( :( :(
</rant>
Anyway, I gave up on them and moved on a few years afterwards, prior to Nvidia buying them. I still buy older Mellanox ConnectX-3 (VPI) adapters off Ebay occasionally for home gear use (with Linux), and they're still solid. 10/40GbE. :)
Good article, although the title might be a bit too click-bait-y... there's plenty of research, and there are available services, that can reliably do low-latency distributed writes at the highest data isolation levels (strict serializability). Both the Calvin and Spanner papers describe such systems.
The two papers mentioned in the post are highly insightful. This is one of my favorite subjects.
To fully disclose, I have a biased view on this given that I work for a (closed source) serverless, no-ops DB provider (Fauna) that implements a distributed transaction engine that is natively document-relational and doesn't compromise on relational (ACID, transactional) guarantees.
From what I understand, this doesn't change anything at a fundamental level.
The particular timestamp oracle implementation might still suffer from metadata bloat, and the commit protocol assumes an optimistic segment lock, which would be a bottleneck in multiple scenarios.
So, from what I understand, this is a (good?) practical speedup for some scenarios, but it doesn't change the fact that it's not possible to make an arbitrary system distributed without sacrificing correctness or throughput.
An easy reasoning: imagine that every node is one light year apart from all others.
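On the timestamp-oracle point above: a single global counter that every transaction bumps is indeed a hot spot, and the usual way to relieve it is to shard the counter so each worker advances only its own slot while readers take the whole vector as their snapshot. A rough Python sketch of that shape (my own illustration, not the paper's exact design; a real system would fetch these slots with one-sided RDMA reads rather than locks):

```python
import threading

class VectorTimestampOracle:
    """Sharded timestamp oracle sketch: each worker owns one slot, so there is
    no single contended counter; a reader's snapshot is the whole vector, and
    a version tagged (w, c) is visible to snapshot S iff c <= S[w]."""
    def __init__(self, n_workers):
        self.slots = [0] * n_workers
        self.locks = [threading.Lock() for _ in range(n_workers)]

    def next_commit_ts(self, worker_id):
        with self.locks[worker_id]:     # contention limited to one worker's slot
            self.slots[worker_id] += 1
            return (worker_id, self.slots[worker_id])

    def read_snapshot(self):
        return list(self.slots)         # vector read timestamp

oracle = VectorTimestampOracle(n_workers=4)
print(oracle.next_commit_ts(2))   # (2, 1)
print(oracle.read_snapshot())     # [0, 0, 1, 0]
```

Whether that removes the bottleneck in practice is exactly the open question raised above, since the snapshot still has to be fetched and the vector grows with the number of workers (the metadata-bloat concern).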
Wait, why aren't they just using an FPGA to offload processing? How does a 100G programmable NIC with direct DMA not give them the performance they need?
Interesting paper. That said, I find the image at the top of the article and the play on words in NAM-DB disturbing, considering the many atrocities and war crimes the US committed in the Vietnam War. The authors should have been more thoughtful given the usefulness of the paper on pure technical merits.