(Garage contributor here) We reviewed many of the existing solutions and none of them had the feature set we wanted. Compared to SeaweedFS, the main difference we introduce with Garage is that our nodes are not specialized, which leads to the following benefits:
- Garage is easier to deploy and to operate: you don't have to manage independent components like the filer, the volume manager, the master, etc. It also seems that a bucket must be pinned to a volume server on SeaweedFS. In Garage, all buckets are spread across the whole cluster, so you do not have to worry about a bucket filling up one of your volume servers.
- Garage works better in the presence of crashes: I would be very interested in a deep analysis of SeaweedFS's "automatic master failover". They use Raft, I suppose either by running a healthcheck every second, which can lead to data loss on a crash, or by sending a request for each transaction, which creates a huge bottleneck in their design.
- Better scalability: because there is no special node, there are no bottlenecks. I suppose that with SeaweedFS, all requests have to pass through the master. We do not have such limitations.
As a conclusion, we chose a radically different design with Garage. We plan to do a more in-depth comparison in the future, but even today, I can say that although we implement the same API, our radically different designs lead to radically different properties and trade-offs.
> independent components like the filer, the volume manager, the master, etc.
You can run volume/master/filer on a single server (a single command).
> filer probably needs an external rdbms to handle the metadata
This is true. You can use an external db, or build/embed some other db inside it (think of a distributed KV store in Go that you embed to host the metadata).
> It also seems that a bucket must be pinned to a volume server on SeaweedFS.
This is not true. A bucket will use its own volumes, but it can be (and by default is) distributed across the whole cluster.
> They use Raft, I suppose either by running a healthcheck every second, which can lead to data loss on a crash, or by sending a request for each transaction, which creates a huge bottleneck.
Raft is for synchronized writes. A single write is slow because you have to wait for an "ok" from replicas, but that is a good thing (compared to async replication in, say, Cassandra/DynamoDB). Keep in mind that S3 also moved to synced replication. The slowness is addressed by having more parallelism.
> Better scalability: because there is no special node, there are no bottlenecks. I suppose that with SeaweedFS, all requests have to pass through the master. We do not have such limitations.
Going to the master is only needed for writes, to get a unique id. This can easily be avoided with a plugin that, say, generates Twitter-snowflake-style ids, which are very efficient. For reads, you keep a cache in your client of the volume-to-server mapping so you can read directly from the server that has the data, or you can randomly query a server and it will handle everything underneath.
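For illustration, here is a minimal sketch of such a snowflake-style id generator; the field widths and epoch are assumptions for the example, not SeaweedFS's actual id scheme:

```python
import threading
import time

# Snowflake-style id sketch: 41 bits of milliseconds since a custom epoch,
# 10 bits of node id, 12 bits of per-millisecond sequence. Ids are unique
# per node without any round trip to a master.
class SnowflakeGenerator:
    EPOCH_MS = 1_288_834_974_657  # assumed custom epoch (Twitter's original)

    def __init__(self, node_id: int):
        assert 0 <= node_id < 1024
        self.node_id = node_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:          # sequence exhausted, wait for the next ms
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) | (self.node_id << 12) | self.sequence

gen = SnowflakeGenerator(node_id=7)
print(gen.next_id())  # roughly time-ordered, generated locally
```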
Having researched all the other open-source distributed object storage systems that exist, I'm pretty sure SeaweedFS has very good fundamentals.
> Raft is for synchronized writes. A single write is slow because you have to wait for an "ok" from replicas, but that is a good thing (compared to async replication in, say, Cassandra/DynamoDB). Keep in mind that S3 also moved to synced replication. The slowness is addressed by having more parallelism.
We have synchronous writes without Raft, meaning we are both much faster and still strongly consistent (in the sense of read-after-write consistency, not linearizability). This is all thanks to CRDTs.
If you don't sync immediately, you may lose the node before it has replicated, and lose the data forever. There's no fancy algorithm that helps when the machine gets destroyed before it has replicated the data. And you can't write to 2 replicas simultaneously from the client, as you would with, say, a Cassandra smart driver, since S3 doesn't support that.
So let's take the example of a 9-node cluster with a 100ms RTT over the network. In this specific (if a little artificial) situation, Garage particularly shines compared to Minio or SeaweedFS (or any Raft-based object store) while providing the same consistency properties.
For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+100ms, 2 messages). Then, the leader will forward this write in parallel to the 9 nodes of the cluster and wait for a majority to answer (+100ms, 18 messages). Then the leader will confirm the write to the whole cluster and wait for a majority again (+100ms, 18 messages). Finally, it answers your gateway (already counted in the first step). In the end, our write took 300ms and generated 38 messages over the cluster.
Another critical point with Raft is that your writes do not scale: they all have to go through your leader. So from the point of view of writes, it is not very different from having a single server.
For a Dynamo-like object store (Riak CS, Pithos, OpenStack Swift, Garage), the gateway receives the request and knows directly on which nodes it must store the write. In Garage, we chose to store every write on 3 different nodes. So the gateway sends the write request to these 3 nodes and waits for at least 2 of them to confirm the write (+100ms, 6 messages). In the end, our write took 100ms, generated 6 messages over the cluster, and the number of messages per write does not depend on the number of (Raft) nodes in the cluster.
With this model, we can still provide always-up-to-date values. When performing a read request, we also query the 3 nodes that must contain the data and wait for 2 of them. Because we have 3 nodes, wrote to at least 2 of them, and read from 2 of them, we will necessarily see the latest value. This algorithm is discussed in Amazon's Dynamo paper[0].
I reasoned in a model with no bandwidth limits, no CPU limits, and no contention at all. In real systems, these limits apply, and we think that's another argument in favor of Garage :-)
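For readers who want to check the arithmetic, here is the same model as a tiny script, under the assumptions stated above (9 nodes, 100ms RTT, replication factor 3, write quorum of 2); it simply mirrors the counting in the comment and is not a benchmark:

```python
RTT_MS = 100   # assumed round-trip time between any two nodes
NODES = 9      # Raft cluster size
REPLICAS = 3   # Garage replication factor

# Raft-based store: gateway <-> leader, leader <-> cluster (append),
# leader <-> cluster (commit), counted the way the comment counts them.
raft_latency_ms = 3 * RTT_MS                  # 300 ms
raft_messages = 2 + 2 * NODES + 2 * NODES     # 2 + 18 + 18 = 38

# Dynamo-style store: gateway writes to the 3 replicas in parallel and
# returns as soon as 2 of them acknowledge.
quorum_latency_ms = 1 * RTT_MS                # 100 ms
quorum_messages = 2 * REPLICAS                # 6

print(f"Raft:   {raft_latency_ms} ms, {raft_messages} messages per write")
print(f"Quorum: {quorum_latency_ms} ms, {quorum_messages} messages per write")
```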
> For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+100ms, 2 messages). Then, the leader will forward this write in parallel to the 9 nodes of the cluster and wait for a majority to answer (+100ms, 18 messages). Then the leader will confirm the write to the whole cluster and wait for a majority again (+100ms, 18 messages). Finally, it answers your gateway (already counted in the first step). Our write took 300ms and generated 38 messages over the cluster.
No. The "proxy" node, a random node that you connect to, will do:
0. split the file into chunks of ~4MB (can be changed) while streaming
for each chunk (you can write chunks in parallel):
1. get an id from the master (this round trip can be avoided by generating an id in the proxy node with some custom code; 0 messages with a custom plugin)
2. write to 1 volume-server (which will write to another node for majority) (2 messages)
3. update the metadata layer, to keep track of chunks so you can resume/cancel/clean failed uploads (metadata may be another Raft subsystem, think YugabyteDB/CockroachDB, so it needs to do its own 2 writes) (2 messages)
Then mark as "ok" in the metadata layer and return ok to the client (2 messages). A rough sketch of this flow follows.
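A minimal sketch of that per-chunk flow, for illustration only; `assign_id`, `write_chunk`, and the `metadata` object are hypothetical placeholders for the steps above, not SeaweedFS's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # ~4MB, configurable

def upload(stream, assign_id, write_chunk, metadata):
    """Sketch of the proxy-side upload flow described in the steps above.

    assign_id()            -> unique chunk id (local snowflake or a master call)
    write_chunk(id, data)  -> stores the chunk on a volume server (which replicates it)
    metadata               -> tracks chunks so failed uploads can be resumed/cleaned
    """
    chunk_ids = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = []
        while True:
            data = stream.read(CHUNK_SIZE)      # 0. split into ~4MB chunks while streaming
            if not data:
                break
            chunk_id = assign_id()              # 1. unique id, no master round trip needed
            futures.append(pool.submit(write_chunk, chunk_id, data))  # 2. write to a volume server
            metadata.track_chunk(chunk_id)      # 3. record the chunk for resume/cleanup
            chunk_ids.append(chunk_id)
        for f in futures:
            f.result()                          # propagate any chunk write error
    metadata.mark_ok(chunk_ids)                 # final step: commit the upload in the metadata layer
    return chunk_ids
```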
The chunking is more complex, you have to track more data, but in the end it is better. You spread a file across multiple servers and disks. If a server fails when using erasure coding and you need to read a file, you won't have to "erasure-decode" the whole file, only the missing chunks. If you have a hot file, you can spread reads over many machines/disks. You can upload very big files (terabytes), and you can "append" to a file. You can have a smart client (or colocate a proxy on your client server) for smart routing and so on.
If you're still talking about SeaweedFS, the answer seems to be that it's not a "raft-based object store", hence it's not as chatty as the parent comment described.
That proxy node is a volume server itself, and uses simple replication to mirror its volume on another server. Raft consensus is not used for the writes. Upon replication failure, the data becomes read-only [1], thus giving up partition tolerance. These are not really comparable.
How does step 1 work? My understanding is that the ID from the master tells you which volume server to write to. If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?
> If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?
You just need a unique id (which you generate locally). And you need a writable volume id, which you can get by querying the master or a master follower (and caching it), or by querying a volume server directly.
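A tiny sketch of such a cache; `lookup_writable_volume` is a hypothetical placeholder for whichever master or volume-server query you use, not a real SeaweedFS call:

```python
import time

class WritableVolumeCache:
    """Caches the writable-volume lookup so most requests skip the master entirely."""

    def __init__(self, lookup_writable_volume, ttl_seconds=30):
        self.lookup = lookup_writable_volume   # hypothetical: returns (volume_id, server_addr)
        self.ttl = ttl_seconds
        self.cached = None
        self.cached_at = 0.0

    def get(self):
        if self.cached is None or time.time() - self.cached_at > self.ttl:
            self.cached = self.lookup()        # only this path talks to the master
            self.cached_at = time.time()
        return self.cached
```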
We ensure the CRDT is synced with at least two nodes in different geographical areas before returning an OK status to a write operation. We are using CRDTs not so much for their asynchronous replication properties (what is usually touted as eventual consistency), but more as a way to avoid conflicts between concurrent operations, so that we don't need a consensus algorithm like Raft. By combining this with the quorum system (two writes out of three need to be successful before returning OK), we ensure durability of written data without having to pay the synchronization penalty of Raft.
> We ensure the CRDT is synced with at least two nodes in different geographical areas before returning an OK status to a write operation [...] we ensure durability of written data without having to pay the synchronization penalty of Raft.
This is, in essence, strongly consistent replication, in the sense that you wait for a majority of writes before answering a request: so you're still paying the latency cost of a round trip with at least one other node on each write. How is this any better than a Raft cluster with the same behavior (N/2+1 write consistency)?
Raft consensus apparently needs more round-trips than that (maybe two round-trips to another node per write?), as evidenced by this benchmark we made against Minio:
Yes, we do round trips to other nodes, but we do far fewer of them to ensure the same level of consistency.
This is to be expected from a distributed systems theory perspective, as consensus (or total order) is a much harder problem to solve than what we are doing.
We haven't (yet) gone into dissecting the Raft protocol or Minio's implementation to figure out why exactly it is much slower, but the benchmark I linked above is already strong enough evidence for us.
I think it would be great if you could make a GitHub repo that is just about summarising the performance characteristics and round-trip types of different storage systems.
You would invite Minio/Ceph/SeaweedFS/etc. authors to make pull requests in there to get their numbers right and explanations added.
This way, you could learn a lot about how other systems work, and users would have an easier time choosing the right system for their problems.
Currently, one only gets detailed comparisons from HN discussions, which arguably aren't a great place for reference and easily get outdated.
Raft needs a lot more round trips than that: it needs to send a message about the transaction, the nodes need to confirm that they received it, then the leader needs to send back that it was committed (no response required). This is largely implementation-specific (etcd does more round trips than that IIRC), but that's the bare minimum.
As someone who wants this technology to be useful to commercial and open-source entities, GPL software is more compatible with the status quo, and I don't want to fight both the license battle and the technology battle.
More-permissive-than-AGPLv3 licensing for service-based software creates a potential (or, I'd argue, likely) route towards large-platform capture and resale of the value of those services without upstream contribution in return, followed at a later date by the turning-down and closure of the software products and communities behind them.
That's the strategy the platforms have, I think, for rational (albeit potentially immoral) economic and competitive reasons. I'd like to hear if you disbelieve that and/or have other reasons to think that permissive licensing is superior long-term.
Unless Garage has a special mitigation against this, performance usually gets much worse in large clusters as the filesystem fills up. As files are added and deleted, the system struggles to keep nodes exactly balanced, and a growing percentage of nodes end up full and unavailable for new writes.
So in a high throughput system, you may notice a soft performance degradation before actually running out of space.
If you aren't performance sensitive or don't have high write throughput, you might not notice. This is definitely something you should be forecasting and alerting on so you can acquire and add capacity or delete old data.
If you really don't like the idea of everything falling over at the same time, you could use multiple clusters or come up with a quota system (e.g. disable workload X if it's using more than Y TiB).
Yes, in which case you just make your cluster bigger or delete some data. Seems like a reasonable compromise to me.
Garage has advanced functionality to control how much data is stored on each node, and is also very flexible with respect to adding or removing nodes, making all of this very easy to do.
We don't currently have a comparison with SeaweedFS. For the record, we have been developing Garage for almost two years now, and we hadn't heard of SeaweedFS when we started.
To me, the two key differentiators of Garage over its competitors are as follows:
- Garage contains an evolved metadata system based on CRDTs and consistent hashing inspired by Dynamo, solidly grounded in distributed systems theory. This allows us to be very efficient as we don't use Raft or other consensus algorithms between nodes, and we also do not rely on an external service for metadata storage (Postgres, Cassandra, whatever), meaning we don't pay an additional communication penalty.
- Garage was designed from the start to be multi-datacenter aware, again helped by insights from distributed systems theory. In practice, we explicitly chose not to implement erasure coding; instead, we spread three full copies of the data over different zones so that overall availability is maintained with no degradation in performance when one full zone goes down, and data locality is preserved at all locations for faster access (in the case of a deployment with three zones, our ideal scenario).
CRDTs are often presented as a way of building asynchronous replication systems with eventual consistency. This means that when modifying a CRDT object, you just update your local copy, and then synchronize in the background with other nodes.
However, this is not the only way of using CRDTs. At their core, CRDTs are a way to resolve conflicts between different versions of an object without coordination: when all of the different states of the object that exist in the network become known to a node, it applies a local procedure known as a merge that produces a deterministic outcome. All nodes do this, and once they have all done it, they are all in the same state. In this way, nodes do not need to coordinate beforehand when making modifications, in the sense that they do not need to run a two-phase commit protocol that ensures operations are applied one after the other in a specific order replicated identically at all nodes. (This is the problem of consensus, which is much harder to solve, both from a distributed systems theory perspective and from an implementation perspective.)
In Garage, we have a bit of a special way of using CRDTs. To simplify a bit, each file stored in Garage is a CRDT that is replicated on three known nodes (deterministically decided). While these three nodes could synchronize in the background when an update is made, this would mean two things that we don't want: 1/ when a write is made, it would be written only on one node, so if that node crashes before it has a chance to synchronize with other nodes, data would be lost; 2/ reading from a node wouldn't necessarily give you the last version of the data, so the system would not be read-after-write consistent. To fix this, we add a simple synchronization system based on read/write quorums to our CRDT system. More precisely, when updating a CRDT, we wait for the value to be known to at least two of the three responsible nodes before returning OK, which allows us to tolerate one node failure while always ensuring durability of stored data. Further, when performing a read, we ask at least two of the three nodes for their current state of the CRDT: this ensures that at least one of the two will know about the last version that was written (because the intersection of the quorums is non-empty), making the system read-after-write consistent. These are basically the same principles that are applied in CRDT databases such as Riak.
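As an illustration only (not Garage's actual code, which is written in Rust), here is a rough sketch of that quorum-plus-merge logic; the `put`/`get` RPCs and the CRDT's `merge` method are assumed interfaces:

```python
import random

QUORUM = 2  # out of 3 replicas

def write(replicas, key, crdt_state):
    """Send the new CRDT state to the 3 responsible nodes (done in parallel in reality);
    return OK once at least 2 of them have stored it."""
    acks = 0
    for node in replicas:                  # replicas = the 3 deterministically chosen nodes
        try:
            node.put(key, crdt_state)      # hypothetical RPC
            acks += 1
        except ConnectionError:
            pass                           # a single failed node is tolerated
    if acks < QUORUM:
        raise RuntimeError("no write quorum: update is in an indeterminate state")

def read(replicas, key):
    """Ask 2 of the 3 nodes for their state and merge them: the quorum intersection
    guarantees at least one of them saw the last acknowledged write."""
    states = [node.get(key) for node in random.sample(replicas, k=QUORUM)]
    merged = states[0]
    for other in states[1:]:
        merged = merged.merge(other)       # CRDT merge resolves divergent states
    return merged
```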
Sounds interesting. How do you handle the case where you're unable to send the update to other nodes?
So an update goes to node A, but not to B and C. Meanwhile, the connection to the client may be disrupted, so the client doesn't know the fate of the update. If you're unlucky here, a subsequent read will ask B and C for data, but the newest data is actually on A. Right?
I assume there's some kind of async replication between the nodes to ensure that B and C eventually catch up, but you do have an inconsistency there.
You also say there is no async replication, but surely there must be some, since by definition there is a quorum, and updates aren't hitting all of the nodes.
I understand that CRDTs make it easier to order updates, which solves part of consistent replication, but you still need a consistent view of the data, which is something Paxos, Raft, etc. solve; CRDTs spread across multiple nodes don't automatically give you that, unless I am missing something. You need more than one node in order to figure out what the newest version is, assuming the client needs perfect consistency.
True; we don't solve this. If there is a fault and data is stored only on node A and not nodes B and C, the view might be inconsistent until the next repair procedure is run (which happens on each read operation if an inconsistency is detected, and also regularly on the whole dataset using an anti-entropy procedure). However, if that happens, the client that sent the update will not have received an OK response, so it will know that its update is in an indeterminate state. The only guarantee that Garage gives you is that if you got an OK, the update will be visible in all subsequent reads (read-after-write consistency).
I'll also admit to having difficulty understanding how all this is distinct from non-CRDT replication mechanisms. Great mission and work by the Deuxfleurs team btw. Bonne chance!
> I'll also admit to having difficulty understanding how all this is distinct from non-CRDT replication mechanisms.
This is because "CRDT" is not about a new or different approach to replication, although for some reason this has become a widely held perception. CRDT is about a new approach to the _analysis_ of replication mechanisms, using order theory. If you read the original CRDT paper(s) you'll find old-school mechanisms like Lamport Clocks.
So when someone says "we're using a CRDT" this can be translated as: "we're using an eventually consistent replication mechanism proven to converge using techniques from the CRDT paper".
The thing is, if you don't have CRDTs, the only way to replicate things across nodes so that they end up in consistent states is to order operations so that all nodes apply them in the same order, which is costly.
Let me give you a small example. Suppose we have a very simple key-value store, and two clients are writing different values at the same time to the same key. The first one will invoke write(k, v1), and the second one will invoke write(k, v2), where v1 and v2 are different values.
If all nodes receive the two write operations but have no way to know which one came first, some will receive v1 before v2 and end up with v2 as the last written value, while other nodes will receive v2 before v1, meaning they will keep v1 as the definitive value. The system is now in an inconsistent state.
There are several ways to avoid this.
The first one is Raft consensus: all write operations will go through a specific node of the system, the leader, which is responsible for putting them in order and informing everyone of the order it selected for the operations. This adds a cost of talking to the leader at each operation, as well as a cost of simply selecting which node is the leader node.
CRDTs are another way to ensure that we have a consistent result after applying the two writes: not by having a leader that puts everything in a certain order, but by embedding certain metadata in the write operation itself, which is enough to disambiguate between the two writes in a consistent fashion.
In our example, the node that does write(k, v1) will for instance generate a timestamp ts1 corresponding to the (approximate) time at which v1 was written, and it will also generate a UUID id1. Similarly, the node that does write(k, v2) will generate ts2 and id2. Now when they send their new values to other nodes in the network, they will send along ts1, id1, ts2 and id2. Nodes now know enough to always deterministically select either v1 or v2, consistently everywhere: if ts1 > ts2, or ts1 = ts2 and id1 > id2, then v1 is selected; otherwise v2 is selected (we suppose that id1 = id2 has a negligible probability of happening). In terms of message round trips between nodes, we see that with this method, nodes simply communicate once with all other nodes in the network, which is much faster than having to pass through a leader.
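A minimal sketch of such a last-write-wins register in Python, illustrative only (Garage's actual implementation is in Rust):

```python
import time
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    """Last-write-wins register: the (timestamp, id) pair totally orders concurrent writes."""
    value: bytes
    ts: float
    id: str

    @staticmethod
    def write(value: bytes) -> "LWWRegister":
        # Each writer stamps its value locally: no coordination with other nodes.
        return LWWRegister(value, ts=time.time(), id=str(uuid.uuid4()))

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Deterministic rule applied identically on every node:
        # higher timestamp wins; ties are broken by the id.
        return self if (self.ts, self.id) > (other.ts, other.id) else other

# Two concurrent writes issued on different nodes...
a = LWWRegister.write(b"v1")
b = LWWRegister.write(b"v2")
# ...converge to the same value on every node, whatever the delivery order.
assert a.merge(b) == b.merge(a)
```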
Here the example uses timestamps as a way to disambiguate between v1 and v2, but CRDTs are more general and other ways of handling concurrent updates can be devised. The core property is that concurrent operations are combined a-posteriori using a deterministic rule once they reach the storage nodes, and not a-priori by putting them in a certain order.
Got it, thanks. This reminded me of e-Paxos (e for Egalitarian), and after a bit of digging there are now 2 additional contenders in this space (Atlas and Gryff, both 2020), per below:
This sounds like ABD, but avoiding the read round trip as long as the register can be modeled with a CRDT. It's interesting to see this applied to the metadata portion of an object storage system. I'm sure this has implications for how rich a storage API you can offer. Can you do conditional writes? Can old versions of a register be compacted to reduce metadata space? How does garbage collecting dead blobs work?
It would be helpful to see specific goals of the project laid out, especially since this bucks the trend of being an enterprise/production grade object storage service (and that's totally ok).
If by that you mean something like compare-and-swap or test-and-set, then no, that would require consensus number > 1 [0]
> Can old versions of a register be compacted to reduce metadata space?
Yes, when using last-write-wins registers, old versions are deleted as soon as they are superseded by a newer version. Almost all metadata stored in Garage uses this mechanism.
> How does garbage collecting dead blobs work?
Nodes have a shared table of block references, which identifies, for each data block, the set of objects (or rather, versions of objects) that use this block. Block references have a "deleted" flag that is set to true when the version in question is deleted from the store. Once set to true, it cannot go back, but a new version can always be created that references the same block. Nodes keep a local count of the number of non-deleted block references for each block, so that when it reaches zero, they queue up the block for deletion (after a small delay). There is also a metadata GC system so that block references with the deleted flag are dropped from the table after a certain time.
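A rough sketch of that bookkeeping, for illustration only; the class name, the in-memory dictionaries, and the delay value are assumptions (Garage's real block reference table is a replicated structure in Rust):

```python
import time
from collections import defaultdict

DELETION_DELAY_S = 600  # assumed small delay before an unreferenced block is queued for deletion

class BlockRefTable:
    """Tracks which object versions reference each data block (sketch)."""

    def __init__(self):
        self.refs = {}                          # (block_hash, version_id) -> deleted flag
        self.live_count = defaultdict(int)      # block_hash -> non-deleted reference count
        self.pending_deletion = {}              # block_hash -> time it became unreferenced

    def add_ref(self, block_hash, version_id):
        if (block_hash, version_id) not in self.refs:
            self.refs[(block_hash, version_id)] = False
            self.live_count[block_hash] += 1
            self.pending_deletion.pop(block_hash, None)  # a new version references the block again

    def delete_ref(self, block_hash, version_id):
        # The deleted flag can only go from False to True, never back.
        if self.refs.get((block_hash, version_id)) is False:
            self.refs[(block_hash, version_id)] = True
            self.live_count[block_hash] -= 1
            if self.live_count[block_hash] == 0:
                self.pending_deletion[block_hash] = time.time()

    def blocks_to_delete(self):
        now = time.time()
        return [b for b, t in self.pending_deletion.items()
                if now - t > DELETION_DELAY_S and self.live_count[b] == 0]
```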
Basically, our association is trying to move away from file systems, for scalability and availability purposes.
There's a similarity between RDBMSes (say, SQL databases) and filesystems, in the sense that they distribute like crap, the main reason being that they have overly strong consistency properties.
One can, however, use Garage as a file system, notably via WebDAV. It's a beta feature called Bagage: https://git.deuxfleurs.fr/Deuxfleurs/bagage
I'll let my knowledgeable fellows answer, if you'd like to hear about differences between seaweedfs and Garage in more detail :)
"Storage optimizations: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication."