CRDTs are often presented as a way of building asynchronous replication systems with eventual consistency. This means that when modifying a CRDT object, you just update your local copy and then synchronize in the background with the other nodes.
However, this is not the only way of using CRDTs. At their core, CRDTs are a way to resolve conflicts between different versions of an object without coordination: once all of the different states of the object that exist in the network become known to a node, it applies a local procedure known as a merge that produces a deterministic outcome. All nodes do this, and once they have all done it, they are all in the same state. This way, nodes do not need to coordinate beforehand when making modifications, in the sense that they do not need to run a two-phase commit protocol that ensures operations are applied one after the other, in a specific order replicated identically at all nodes. (That is the problem of consensus, which is much harder to solve, both from a distributed systems theory perspective and from an implementation perspective.)
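To make the merge idea concrete, here is a minimal sketch in Rust of one of the simplest CRDTs, a grow-only set. This is purely illustrative, not Garage's actual types: the point is that the merge is commutative, associative and idempotent, so every node that has seen every update converges to the same state, whatever order the updates arrived in.

    use std::collections::BTreeSet;

    // A grow-only set, one of the simplest CRDTs.
    #[derive(Clone, Default, PartialEq, Debug)]
    struct GSet {
        items: BTreeSet<String>,
    }

    impl GSet {
        fn add(&mut self, item: &str) {
            self.items.insert(item.to_string());
        }

        // The merge is purely local and deterministic: no coordination needed.
        fn merge(&mut self, other: &GSet) {
            self.items.extend(other.items.iter().cloned());
        }
    }

    fn main() {
        let mut node_a = GSet::default();
        let mut node_b = GSet::default();
        node_a.add("x"); // two concurrent updates on two different nodes
        node_b.add("y");

        let mut a_then_b = node_a.clone();
        a_then_b.merge(&node_b);
        let mut b_then_a = node_b.clone();
        b_then_a.merge(&node_a);

        assert_eq!(a_then_b, b_then_a); // same state in either merge order
    }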
In Garage, we have a bit of a special way of using CRDTs. To simplify a bit, each file stored in Garage is a CRDT that is replicated on three known nodes (chosen deterministically). While these three nodes could synchronize in the background when an update is made, this would mean two things that we don't want: 1/ when a write is made, it would be written on only one node, so if that node crashes before it has had a chance to synchronize with the other nodes, data would be lost; 2/ reading from a node wouldn't necessarily return the last version of the data, so the system would not be read-after-write consistent. To fix this, we add a simple synchronization layer based on read/write quorums to our CRDT system. More precisely, when updating a CRDT, we wait for the value to be known to at least two of the three responsible nodes before returning OK, which allows us to tolerate one node failure while always ensuring durability of stored data. Further, when performing a read, we ask at least two of the three nodes for their current state of the CRDT: this ensures that at least one of the two will know about the last version that was written (because the intersection of the quorums is non-empty), making the system read-after-write consistent. These are basically the same principles that are applied in CRDT databases such as Riak.
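Here is a simplified sketch of that 2-out-of-3 quorum logic, again illustrative rather than Garage's actual code: nodes are simulated as in-memory maps, and a last-write-wins rule on timestamps stands in for the CRDT merge.

    use std::collections::HashMap;

    // key -> (timestamp, value); the timestamp makes the merge a simple
    // last-write-wins rule, standing in for a full CRDT merge.
    type Node = HashMap<String, (u64, String)>;

    const QUORUM: usize = 2;

    // Try to store (ts, value) on every node; `reachable` marks the nodes we
    // can currently talk to. Return Ok only if at least QUORUM nodes acked.
    fn quorum_write(
        nodes: &mut [Node],
        reachable: &[bool],
        key: &str,
        ts: u64,
        value: &str,
    ) -> Result<(), ()> {
        let mut acks = 0;
        for (node, up) in nodes.iter_mut().zip(reachable) {
            if *up {
                node.insert(key.to_string(), (ts, value.to_string()));
                acks += 1;
            }
        }
        if acks >= QUORUM { Ok(()) } else { Err(()) }
    }

    // Ask the reachable nodes for their current state and merge the answers
    // by keeping the highest timestamp. In this sketch, None covers both
    // "no quorum reached" and "key not found".
    fn quorum_read(nodes: &[Node], reachable: &[bool], key: &str) -> Option<(u64, String)> {
        let mut answers = 0;
        let mut best: Option<(u64, String)> = None;
        for (node, up) in nodes.iter().zip(reachable) {
            if *up {
                answers += 1;
                if let Some(v) = node.get(key) {
                    if best.as_ref().map_or(true, |b| v.0 > b.0) {
                        best = Some(v.clone());
                    }
                }
            }
        }
        if answers >= QUORUM { best } else { None }
    }

    fn main() {
        let mut nodes = vec![Node::new(), Node::new(), Node::new()];
        // Node 2 is down, but the write still succeeds: 2 of 3 nodes acked it.
        quorum_write(&mut nodes, &[true, true, false], "k", 1, "v1").unwrap();
        // Any read quorum of 2 nodes intersects the write quorum, so the
        // latest acknowledged value is always seen by at least one of them.
        let read = quorum_read(&nodes, &[true, false, true], "k");
        assert_eq!(read, Some((1, "v1".to_string())));
    }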
Sounds interesting. How do you handle the case where you're unable to send the update to other nodes?
So an update goes to node A, but not to B and C. Meanwhile, the connection to the client may be disrupted, so the client doesn't know the fate of the update. If you're unlucky here, a subsequent read will ask B and C for data, but the newest data is actually on A. Right?
I assume there's some kind of async replication between the nodes to ensure that B and C eventually catch up, but you do have an inconsistency there.
You also say there is no async replication, but surely there must be some, since by definition there is a quorum, and updates aren't hitting all of the nodes.
I understand that CRDTs make it easier to order updates, which solves part of consistent replication, but you still need a consistent view of the data. That is something Paxos, Raft, etc. solve, but CRDTs separated across multiple nodes don't automatically give you that, unless I am missing something. You need more than one node in order to figure out what the newest version is, assuming the client needs perfect consistency.
True; we don't solve this. If there is a fault and data is stored only on node A and not nodes B and C, the view might be inconsistent until the next repair procedure is run (which happens on each read operation if an inconsistency is detected, and also regularly on the whole dataset using an anti-entropy procedure). However if that happens, the client that sends an update will not have received an OK response, so it will know that its update is in an indeterminate state. The only guarantee that Garage gives you is that if you have an OK, the update will be visible in all subsequent reads (read-after-write consistency).
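To illustrate, here is a minimal sketch of what the read-time repair can look like, under the same simplified model as the quorum sketch above (nodes as in-memory maps, last-write-wins on timestamps); this is not Garage's actual code.

    use std::collections::HashMap;

    // Same simplified model as before: key -> (timestamp, value) per node.
    type Node = HashMap<String, (u64, String)>;

    // After a read has merged the answers from the queried nodes, push the
    // merged value back to every queried node that returned a stale or
    // missing copy, so the inconsistency heals as a side effect of the read.
    fn read_repair(nodes: &mut [Node], queried: &[usize], key: &str, merged: &(u64, String)) {
        for &i in queried {
            let stale = nodes[i].get(key).map_or(true, |v| v.0 < merged.0);
            if stale {
                nodes[i].insert(key.to_string(), merged.clone());
            }
        }
    }

    fn main() {
        let mut nodes = vec![Node::new(), Node::new(), Node::new()];
        nodes[0].insert("k".to_string(), (1, "v1".to_string())); // only node 0 got the write
        read_repair(&mut nodes, &[0, 2], "k", &(1, "v1".to_string()));
        assert_eq!(nodes[2].get("k"), Some(&(1, "v1".to_string())));
    }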
I'll also admit to having difficulty understanding how all this is distinct from non-CRDT replication mechanisms. Great mission and work by the Deuxfleurs team btw. Bonne chance!
> I'll also admit to having difficulty understanding how all this is distinct from non-CRDT replication mechanisms.
This is because "CRDT" is not about a new or different approach to replication, although for some reason this has become a widely held perception. CRDT is about a new approach to the _analysis_ of replication mechanisms, using order theory. If you read the original CRDT paper(s) you'll find old-school mechanisms like Lamport Clocks.
So when someone says "we're using a CRDT" this can be translated as: "we're using an eventually consistent replication mechanism proven to converge using techniques from the CRDT paper".
The thing is, if you don't have CRDTs, the only way to replicate things over nodes so that they end up in consistent states is to have a way of ordering operations so that all nodes apply them in the same order, which is costly.
Let me give you a small example. Suppose we have a very simple key-value storage, and that two clients are writing different values at the same time on the same key. The first one will invoke write(k, v1), and the second one will invoke write(k, v2), where v1 and v2 are different values.
If all nodes receive the two write operations but have no way to know which one came first, some will receive v1 before v2 and end up with v2 as the last written value, while other nodes will receive v2 before v1 and keep v1 as the definitive value. The system is now in an inconsistent state.
There are several ways to avoid this.
The first one is Raft consensus: all write operations go through a specific node of the system, the leader, which is responsible for putting them in order and informing everyone of the order it selected. This adds the cost of talking to the leader on each operation, as well as the cost of selecting which node is the leader in the first place.
CRDTs are another way to ensure that we have a consistent result after applying the two writes: not by having a leader that puts everything in a certain order, but by embedding certain metadata in the write operation itself, which is enough to disambiguate between the two writes in a consistent fashion.
In our example, the node that does write(k, v1) will for instance generate a timestamp ts1 corresponding to the (approximate) time at which v1 was written, and it will also generate a UUID id1. Similarly, the node that does write(k, v2) will generate ts2 and id2. When they send their new values to the other nodes in the network, they send along ts1, id1, ts2 and id2. Nodes now know enough to always deterministically select either v1 or v2, consistently everywhere: if ts1 > ts2, or ts1 = ts2 and id1 > id2, then v1 is selected, otherwise v2 is selected (we suppose that id1 = id2 has a negligible probability of happening). In terms of message round-trips, nodes simply communicate once with all other nodes in the network, which is much faster than having to go through a leader.
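As a minimal sketch, this rule can be written as a small last-write-wins register in Rust (the field names are illustrative, not Garage's actual types):

    // Keep (v1, ts1, id1) if ts1 > ts2, or ts1 = ts2 and id1 > id2; otherwise
    // keep (v2, ts2, id2). The merge is deterministic, so every node that has
    // seen both writes converges to the same value, whatever order it saw them in.
    #[derive(Clone, PartialEq, Debug)]
    struct LwwRegister {
        timestamp: u64, // (approximate) wall-clock time of the write
        id: u128,       // random tie-breaker, e.g. a UUID
        value: String,
    }

    impl LwwRegister {
        fn merge(&mut self, other: &LwwRegister) {
            // Lexicographic comparison: timestamp first, then id.
            if (other.timestamp, other.id) > (self.timestamp, self.id) {
                *self = other.clone();
            }
        }
    }

    fn main() {
        // Two concurrent writes to the same key on two different nodes.
        let w1 = LwwRegister { timestamp: 10, id: 42, value: "v1".to_string() };
        let w2 = LwwRegister { timestamp: 10, id: 7, value: "v2".to_string() };

        let mut a = w1.clone(); a.merge(&w2); // a node that saw w1 first
        let mut b = w2.clone(); b.merge(&w1); // a node that saw w2 first

        assert_eq!(a, b);          // both nodes converge...
        assert_eq!(a.value, "v1"); // ...on v1, since ts1 = ts2 and id1 > id2
    }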
Here the example uses timestamps as a way to disambiguate between v1 and v2, but CRDTs are more general and other ways of handling concurrent updates can be devised. The core property is that concurrent operations are combined a posteriori, using a deterministic rule, once they reach the storage nodes, and not a priori by putting them in a certain order.
Got it, thanks. This reminded me of e-Paxos (e for Egalitarian), and after a bit of digging there are now 2 additional contenders (Atlas and Gryff, both 2020) in this space per below:
This sounds like ABD, but avoiding the read round trip as long as the register can be modeled with a CRDT. It's interesting to see this applied to the metadata portion of an object storage system. I'm sure this has implications for how rich a storage API you can offer. Can you do conditional writes? Can old versions of a register be compacted to reduce metadata space? How does garbage collecting dead blobs work?
It would be helpful to see specific goals of the project laid out, especially since this bucks the trend of being an enterprise/production grade object storage service (and that's totally ok).
> Can you do conditional writes?

If by that you mean something like compare-and-swap or test-and-set, then no, that would require consensus number > 1 [0].
> Can old versions of a register be compacted to reduce metadata space?
Yes, when using last-write-wins registers, old versions are deleted as soon as they are superseded by a newer version. Almost all metadata stored in Garage uses this mechanism.
> How does garbage collecting dead blobs work?
Nodes have a shared table of block references, which identifies, for each data block, the set of objects (or rather, versions of objects) that use this block. Block references have a "deleted" flag that is set to true when the version in question is deleted from the store. Once set to true, it cannot go back to false, but a new version can always be created that references the same block. Nodes keep a local count of the number of non-deleted block references for each block, and when it reaches zero, they queue the block up for deletion (after a small delay). There is also a metadata GC system so that block references with the deleted flag are dropped from the table after a certain time.
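For illustration, here is a minimal sketch of that counting logic; the types and names are made up for the example and are not Garage's actual tables.

    use std::collections::HashMap;

    type BlockHash = u64;
    type VersionId = u64;

    // A block reference links a (block, version) pair and carries a `deleted`
    // flag that only ever goes from false to true; the local count of
    // non-deleted references per block decides when a block becomes garbage.
    #[derive(Default)]
    struct BlockRefTable {
        refs: HashMap<(BlockHash, VersionId), bool>, // (block, version) -> deleted?
        live_count: HashMap<BlockHash, usize>,       // block -> non-deleted refs
        pending_deletion: Vec<BlockHash>,            // zero refs, queued for deletion
    }

    impl BlockRefTable {
        fn add_ref(&mut self, block: BlockHash, version: VersionId) {
            // A brand-new reference starts out non-deleted; re-adding an
            // existing one (even a deleted one) changes nothing.
            if !self.refs.contains_key(&(block, version)) {
                self.refs.insert((block, version), false);
                *self.live_count.entry(block).or_insert(0) += 1;
            }
        }

        // Mark a reference as deleted; the flag never goes back to false.
        fn delete_ref(&mut self, block: BlockHash, version: VersionId) {
            if let Some(deleted) = self.refs.get_mut(&(block, version)) {
                if !*deleted {
                    *deleted = true;
                    let count = self.live_count.get_mut(&block).unwrap();
                    *count -= 1;
                    if *count == 0 {
                        // In Garage the block is only erased after a small
                        // delay, in case a new version shows up that still
                        // references it.
                        self.pending_deletion.push(block);
                    }
                }
            }
        }
    }

    fn main() {
        let mut table = BlockRefTable::default();
        table.add_ref(0xabc, 1); // two object versions use the same block
        table.add_ref(0xabc, 2);
        table.delete_ref(0xabc, 1);
        assert!(table.pending_deletion.is_empty()); // version 2 still uses it
        table.delete_ref(0xabc, 2);
        assert_eq!(table.pending_deletion, vec![0xabc]); // now garbage
    }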