Raft Consensus Animated (2014) (thesecretlivesofdata.com)
411 points by pkilgore on Aug 16, 2022 | 67 comments



Author here. I made this visualization over a decade ago and I'm glad it's still useful for folks! Let me know if you have any questions.

I've also been experimenting on and off with different techniques for doing the visualization, as I'd like to do more of these. I'm currently looking at trying to make it work with Remotion[1]. The JavaScript version I did for Raft was time-intensive, and I ended up having to write an entire (albeit terrible) implementation of Raft to even get it to work. lol.

[1] https://www.remotion.dev/


It's awesome. Thanks for this. I kinda-sorta understood how it worked from watching logs of systems that used Raft, but seeing it clearly like this made me say "oooh!" a couple of times.


Fascinating! Thank you. Perhaps eventually the work of Heidi Howard would inspire a domain model that would work for multiple consensus algorithms? Great work, visualizations help a lot.


LOL - I was wondering how you would do this without actually implementing Raft.

It appears you actually did implement it!


Funny enough, I found this while browsing the litefs repo and trying to understand the choice to go with Consul. It was about five links down the rabbit hole; I didn't realize I had walked in a circle.

Thanks!


Welcome back! :)


I've only heard about Raft Consensus algorithm thrown around in a few GitHub repos/HN comments but never got a chance to really know it.

This webpage cleared up some long-standing doubts about what distributed computing means, what a consensus algorithm is, and what this Raft thing is.

Kudos to the developer. You got a newbie interested in the field!


This is the first I have heard of Raft, but enjoyed the animations and ideas. I work on multi-node radio communications for ag automation. I had two questions after watching this:

- Is Raft alone in this space, or are there other popular algorithms/libraries that fill the same space?

- What happens when the node count gets larger than a handful? What happens when you hit hundreds or even thousands of nodes, that are trying to achieve consensus? In particular, the part where all of the nodes respond (semi) simultaneously to a broadcast node. In a radio spectrum world, that would be a disaster. N:1 communication slots are choke points for timely communication.


- There also is Paxos[0] as the most significant option.

- You should not have too many nodes making a decision; this is usually reserved for leaders. If you have a large distributed system, you can split it into clusters or forward decisions to leaders, who decide by consensus. If you clusterize, the leader of each cluster can also be selected by consensus. If you can't do any of those, then a consensus protocol might not even be a good idea; you'd end up with a sort of Merkle tree (or some sort of blockchain) to make sure all the data is registered, or maybe audit transactions. In any case, this[1] might be interesting.

[0] https://en.wikipedia.org/wiki/Paxos_(computer_science)

[1] https://doi.org/10.1016/j.neucom.2016.10.011


Paxos and Viewstamped Replication are basically the two other best-known asynchronous consensus mechanisms that have been mathematically verified.

If you just need eventual consistency, CRDTs are also possible.

Going in the other direction, if you don't mind the latency of full consensus with global locking, you could just do that.


> - Is Raft alone in this space, or are there other popular algorithms/libraries that fill the same space?

To give an obvious answer, blockchains are one of the methods used for trustless consensus (imagine how things could go wrong in Raft's case if the leader were malicious).


Warning: you're going to get buried here.

The 'right answer' is going to be Paxos and its many flavors; a few references to Lamport, Google's Chubby, "Paxos Made Live". If the right people are on HN today you're going to get a few references to ZAB (Zookeeper Atomic Broadcast) and Viewstamped Replication. Oh, and go watch all of Heidi Howard's talks.

The problem spaces of Byzantine and non-Byzantine consensus are far enough apart ... well, kinda like atomicity in Postgres and atomics in C++. You decide if that's a lot.


I’m the biggest cryptocurrency hater there is but even I admit that, despite being mostly useless in the real world for any purpose other than committing crimes, Bitcoin solved a class of consensus problem that hadn’t been solved before and is therefore a major academic achievement. My GP probably shouldn’t have been downvoted.


I agree completely! I always joke that the block and chain part of "blockchain" are boring: that's basically git! The cool part is byzantine consensus!

But like I said, I also think that, as a class of problems, Byzantine consensus is just super different from the normal consensus problems Paxos and Raft are addressing. Way more than it might seem at first. Enough that bringing it up here is off-topic, to the point of being naive or misleading.


You don’t normally have hundreds or thousands of nodes trying to achieve consensus. You have 3-5 nodes trying to achieve consensus and then serving requests to the other 100s-1000s nodes.


Is it a weakness to only commit on majority consensus? I'm thinking of a very unstable global network, where partitions are happening everywhere. In that scenario, only one cluster can reach consensus (if you're lucky). If the partitions are such that no cluster has majority, nothing can proceed.

Is there a better way to proceed with tentative consensus, until a majority cluster can be realized, and then have a conflict resolution strategy? People operate this way.
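
To make the majority rule concrete, here is a minimal Python sketch of which side of a partition can still commit. The function names are made up for illustration and do not come from any Raft library; it only does the counting and models nothing else about the protocol.

```python
from typing import List


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def partitions_that_can_commit(cluster_size: int, partition_sizes: List[int]) -> List[int]:
    """Return the indexes of partitions big enough to elect a leader and commit."""
    if sum(partition_sizes) > cluster_size:
        raise ValueError("partitions contain more nodes than the cluster")
    quorum = majority(cluster_size)
    return [i for i, size in enumerate(partition_sizes) if size >= quorum]


# A 5-node cluster split 3/2: only the 3-node side can make progress.
print(partitions_that_can_commit(5, [3, 2]))     # [0]
# The same cluster split 2/2/1: no side has a majority, so nothing commits.
print(partitions_that_can_commit(5, [2, 2, 1]))  # []
```

Because two disjoint groups cannot both hold a strict majority, at most one partition ever appears in the result, which is the property that rules out two leaders committing conflicting entries.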


This is a consistent and partition-tolerant system. What you are describing is an available and partition-tolerant system, but not one that can provide consistent results. (That you cannot have all three properties is the CAP theorem. Some people say they have all three, but they just put a tight bound on unavailability and claim it doesn't exist.) There are a variety of ways to achieve availability and partition tolerance, with conflict resolution implemented as a rule by the database or by the application.


I'm not sure Raft is the best distributed consensus algorithm for the situation of a global, unstable, frequently-partitioning network. I think it is in its niche when leaders are running on fairly stable networks (>1-2 nines), and the main sources of node failure are task cycles / rolling deploys.

I've played around with Hashicorp Consul on "edge boxes" - long-haul, wirelessly-connected embedded computers, with unreliable power supplies. Allowing edge boxes to be Consul leaders results in all kinds of mayhem: split brain situations, corrupted state, stale DNS resolutions (Consul handles DNS as well), cats and dogs living together, mass hysteria. A much better topology is to have 3 server nodes on a LAN as the "head cluster" and let all the edge boxes be clients of the head.

I haven't used it, but Consul has a multi-datacenter mode, which I believe runs a dedicated Raft cluster per datacenter and is designed to better handle such a situation.

https://learn.hashicorp.com/tutorials/consul/federation-goss...


You either need consistency or you don't. Raft is for systems that need this guarantee. If you don't need it, something like CRDTs can be used.
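
For anyone who hasn't met a CRDT, here is a rough sketch of one of the simplest, a grow-only counter. The class and method names are my own for illustration, not from a particular library. Each node only bumps its own slot, and merging takes the per-node maximum, so replicas converge no matter the order in which they exchange state.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> total increments originating at that node

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot makes merge commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Two replicas accept writes independently (say, during a partition)...
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
# ...and agree once they exchange state, in either order.
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

No quorum is involved: each replica stays writable while partitioned. The price is that you only get eventual agreement, and only for data types whose merge can be defined this way.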


Yes, IIRC they are called leaderless protocols, where you have more than 1 writer at a time. When there's a conflict you either let the user resolve it (slow) or you pick a default resolution strategy. For example, LWW (last write wins) simply accepts the last write.

It's been a while since I read "Designing Data-Intensive Applications", but there's a chapter on that.
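
A minimal sketch of last-write-wins, assuming each write carries a timestamp from the writer's clock plus a node id to break ties. The `Versioned` type and `lww_merge` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # writer's clock; assumed to be comparable across nodes
    node_id: str      # tie-breaker when two timestamps are equal


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep whichever write has the larger (timestamp, node_id) pair."""
    return a if (a.timestamp, a.node_id) >= (b.timestamp, b.node_id) else b


# Two replicas accepted conflicting writes for the same key.
first = Versioned("red", timestamp=10.0, node_id="n1")
second = Versioned("blue", timestamp=12.0, node_id="n2")
print(lww_merge(first, second).value)  # blue
print(lww_merge(second, first).value)  # blue
```

The merge gives the same answer in either order, which is what lets replicas converge. The catch is that the losing write is silently dropped, and "last" is only as trustworthy as the clocks that produced the timestamps.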


Great timing. I'm part of a German podcast on fundamentals of computing [1], and we just recorded an episode on Distributed Systems that discusses Raft as an example. We will probably be adding an addendum to link to this.

[1] https://www.schluesseltechnologie-podcast.de


Anything like this one but in English?


On a related note, I’ve found https://martinfowler.com/articles/patterns-of-distributed-sy... to be quite instructive in understanding distributed systems in general.


More generally, the Raft page on GitHub lists some good resources on that subject (including that really good animation):

https://raft.github.io/


I've had a surprisingly hard time finding a bare-bones Raft implementation in Java purely for leader election.

The same hunt also left me surprised that there is no common way to do leader election among pods in Kubernetes.


How long ago was this? There is now a native Lease resource which allows you to piggyback off the etcd consensus.

https://kubernetes.io/docs/reference/kubernetes-api/cluster-...
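
The general idea behind lease-based election can be sketched in a few lines of Python. To be clear, this is an in-memory toy and not the Kubernetes API: the `Lease` class and `try_acquire` function are hypothetical, and in a real cluster the update would have to be an atomic compare-and-swap against the shared store rather than a plain assignment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    expires_at: float = 0.0


def try_acquire(lease: Lease, candidate: str, now: float, duration: float) -> bool:
    """Acquire or renew the lease; return True if `candidate` leads afterwards."""
    free_or_ours = lease.holder in (None, candidate)
    expired = now >= lease.expires_at
    if free_or_ours or expired:
        lease.holder = candidate
        lease.expires_at = now + duration
        return True
    return False


lease = Lease()
print(try_acquire(lease, "pod-a", now=0.0, duration=15.0))   # True: lease was free
print(try_acquire(lease, "pod-b", now=5.0, duration=15.0))   # False: pod-a still holds it
print(try_acquire(lease, "pod-a", now=10.0, duration=15.0))  # True: renewal, now expires at 25
print(try_acquire(lease, "pod-b", now=25.0, duration=15.0))  # True: pod-a stopped renewing
```

The time is passed in explicitly so the example is deterministic. The leader has to keep renewing before the lease runs out, and everyone else just polls until it expires.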


Awesome, I failed to find that. Thanks!


Operator Framework (and, I assume, the upstream k8s Go library) provides leader election.


This is just ridiculously good. I am normally a word learner but lately my mind has been going a little bit more visual. This was very very well thought out and helped me enormously.


Previous discussion (in 2020): https://news.ycombinator.com/item?id=25326645

Also, I personally think the current blockchain literature is much more intuitive and easier to follow, for learning about consensus. The Byzantine case isn't really that different than the crash case if we assume cryptography. On the other hand, Raft is a spiderweb of a protocol, very easy to get wrong.


I am working on MIT 6.824[0] by myself as a side project. This visualization is very very helpful at the beginning just so that I can build up the right mental model and understand how components interact with each other.

[0]: https://nil.csail.mit.edu/6.824/2020/schedule.html


Related:

Raft Visualization - https://news.ycombinator.com/item?id=25326645 - Dec 2020 (35 comments)

Raft: Understandable Distributed Consensus - https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79 comments)


This is genuinely lovely and informative. Thank you!


Excellent!

A couple of questions:

1) In the case of a network partition, the client that is currently connected to the leader, do they get notified that there's a partition, or that the cluster is not in a healthy situation?

2) If a client writes to the partition that will get rolled back, and all their transactions get rolled back after the partition heals, do they get notified that their data was rolled back?


> 1) In the case of a network partition, the client that is currently connected to the leader, do they get notified that there's a partition, or that the cluster is not in a healthy situation?

The cluster - or any server of the cluster - finds out about network partition only when the timeout passes. At this point the leader - which becomes the former leader - can notify the client, or the client can see for itself that the timeout has passed.

> 2) If a client writes to the partition that will get rolled back, and all their transactions get rolled back after the partition heals, do they get notified that their data was rolled back?

Note that the client was never notified that their data was committed in the first place. So the client can assume that if the timeout passed without notification that the data wasn't set in the cluster.

Surely there could be problems between the client and the leader. Idempotent messages could be useful.
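
On the idempotency point, here is a rough sketch of what I mean, assuming the client attaches a unique request id and reuses it when it retries after a timeout. The `KeyValueServer` class is made up for illustration, and a real server would also need to expire old ids.

```python
class KeyValueServer:
    """Applies each request at most once, keyed by a client-chosen request id."""

    def __init__(self):
        self.data = {}     # key -> list of appended items
        self.results = {}  # request id -> result returned the first time

    def append(self, request_id: str, key: str, item: str) -> list:
        if request_id in self.results:
            # A retry of something already applied: replay the old answer.
            return self.results[request_id]
        self.data.setdefault(key, []).append(item)
        result = list(self.data[key])
        self.results[request_id] = result
        return result


server = KeyValueServer()
print(server.append("req-1", "log", "x"))  # ['x']
# The reply was lost, the client timed out and retried with the same id.
print(server.append("req-1", "log", "x"))  # ['x'], not ['x', 'x']
print(server.append("req-2", "log", "y"))  # ['x', 'y']
```

With this in place a client that never heard back can simply resend: the write lands once whether or not the first attempt actually made it into the cluster.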


It would be cool to also see an animated visualization of the paxos consensus algorithm


Indeed. An animation by Terry Gilliam with each of the distributed processes represented by Leslie Lamport wearing a different disguise.


So, I am imagining making a distributed storage (maybe some global database)

If I implement Raft on top of S3 I can kinda see this working. Is there a sensible "file system on top of S3 like storage" out there already?



I’ve worked on two implementations of Raft and two other multi-Paxos implementations, and I still think single-decree Paxos + 2PC is probably a better idea, just super hard to implement.


I ran into this while setting up Hashicorp Vault a year or two ago. It was good at helping me understand what's happening, but I don't particularly like raft. I want to be able to recover from one server, and I don't want to have to wait for a majority on every transaction should I add many servers. I know it's an impossible problem to solve generally, but I think in many situations an alert saying some specific data had a conflict and might not have been resolved correctly is a much better outcome than an outage.
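
For reference on the "wait for a majority" cost, this is roughly how a Raft leader works out what it may commit, as a simplified sketch. The function name is mine, and real Raft adds a further condition: the leader only commits this way when the entry at that index is from its own current term.

```python
from typing import List


def majority_match_index(match_index: List[int]) -> int:
    """Highest log index stored on a strict majority of servers.

    `match_index` holds, for every server including the leader, the highest
    log index known to be replicated on that server.
    """
    ranked = sorted(match_index, reverse=True)
    quorum = len(ranked) // 2 + 1
    # The quorum-th largest value is present on at least `quorum` servers.
    return ranked[quorum - 1]


# Five servers; two followers are lagging badly.
print(majority_match_index([7, 7, 6, 3, 2]))  # 6
```

So the wait is for the fastest majority to acknowledge, not for every server: in the example the two slow followers do not hold up index 6.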


If you don't want distributed consensus, don't use a distributed consensus algorithm. Raft/Paxos is not the best fit for every problem, but for problems where you NEED to ensure consistency, it is the best tool for the job. And while it could have outage problems theoretically, Google's Chubby lock service, written using Paxos, has such high availability in its global instance that the SREs introduce artificial mini outages, just so dependent services don't assume it has a 100% SLA.


> Google's Chubby lock service, written using Paxos, has such high availability in its global instance that the SREs introduce artificial mini outages, just so dependent services don't assume it has a 100% SLA

That's fascinating. Got more information on that?


I think there was something about it in the Google SRE book?

https://sre.google/sre-book/table-of-contents/


See "The Global Chubby Planned Outage" on this page:

https://sre.google/sre-book/service-level-objectives/


Oh. I had read the book before, but that snippet simply disappeared from my mind.

Thank you!


Largely unrelated to the content but cutting the fade animations would really improve the presentation.


Have there been any notable amendments made to the protocol, whether to improve correctness or performance?


How does this relate to validator nodes for blockchain consensus, same concept?


Thank You!


Is this similar to how the Ethereum network operates? This is an awesome animation.


Not at all. Today, ETH is PoW based for consensus. It is moving to PoS in the future.

ETH has to deal with at least one thing that Raft doesn't have to deal with... bad actors trying to inject bad data into the system, also known as the Byzantine generals problem [1].

[1] https://en.wikipedia.org/wiki/Byzantine_fault


Unsolicited feedback: use fewer text-appear animations, and allow people to skip through stuff. I've spent a full minute clicking next next and still haven't seen a visualization aside from text slides loading slowly with animations. It's like a long YouTube ad that you cannot skip.


Author here. Yeah, I think I'll go with actual video for future visualizations. I made this visualization about 10 years ago and going back to it I feel the same way about the slowness. At least with video you can run it at 2x. :)


What did you guys run this on? I have had zero issues with it and the animations and progression felt guided and informative.


...for you. You may love the speed, but it's not right for everyone. When designing interactive interfaces like this, it's important to cede control to the user so that they can choose the rate at which they consume content. Otherwise, half your users won't like it and bail.


I didn't even make it past the introduction.


Arrow keys worked for me, but sadly back-arrow didn't go to the previous animation (Firefox, Mac).


Agreed!

I really like this, but not being able to go slightly faster with arrow keys was aggravating.

Cool explanation though!


yep, basically no animation just explanation


I like the animation because it shows the dynamic behavior.

But the slow nature of the introduction to the elements on each incremental click is a bit irritating.

I'd recommend static image(s) with legend/highlights for each node and message, etc. And animations for each relevant scenario illustrated.


After clicking six times and learning nothing, I left.


Ugh... how slow the animations are... I read much much faster than that but it feels like playing through an old JRPG that doesn't let you speed up the text playback.


You're complaining about an amazing animation in the world of complicated distributed protocol papers.


Rephrasing slightly: The animations are playing slower than grogenaut's reading speed, and they are forced to wait for animations to complete before advancing. Faster animations, a way to control animation speed, or even a way to skip animations could make for a more comfortable user experience.

Granted, this animation hails from 2014, so the above might not be possible without significant effort.


I went through the animations and they’re not too slow. I guess if you’re trying to speed run through it then sure, but then why even go through it if you’re not trying to learn the protocol?


The first 8 animations just show text or one circle with an X or 8 in them. But I guess if it works for you then my experience is invalid. For me I gave up and just read a blog on it.



