
Many good points there, but I think a lot of people could be misled by the advice about timeouts and retries. In my experience, which is a lot more than the author's two years, having each component retry and eventually fail on its own schedule is a terrible way to produce an available and debuggable system. Simple retries are fine and even necessary, but there should only be one part of the system making the often-complex decisions about when to declare something dead and give up. That way you don't end up with things that are half-dead or half-connected, which is a real pain in the butt.



In my experience, which is also a fair bit more than two years, that is deeply not-great advice for what the author's talking about, which is available systems. Systems with one true decider need to serialize through the decider, which tends to lower availability and limit scalability. And the decider can itself die, make wrong decisions, or get partitioned away from an otherwise perfectly functioning set of shards or whatever.

It is, no lie, far harder to make a system that can stay available and do the right thing during a partition. Even though the research has been around for decades, we're still taking baby steps in actual functioning systems. Even Riak, the poster child for this sort of thing, was straight-up last-write-wins for years while they were making fun of everyone else. Hard problems!


I think you're confused about what "one part of the system" means. I don't mean one process running on one piece of hardware. I mean one software module with many instances - itself a distributed system, with one purpose in life (determine liveness) so it can be simple and verifiable. There's no serialization, and it does not lower availability or scalability as you claim. In fact, it preserves those properties far better than "everyone doing their own failure detection a hundred different ways" does.


Oh, not confused at all! I use Erlang, like the OP. Right now I have several Erlang nodes doing a few hundred thousand HTTP connections concurrently. Looking at the logs, about 10-20 of those concurrent processes are dying every second for various reasons (upstream crappiness, network errors, parse errors, sunspots for all I know right now), and are getting resurrected by their supervisors. Every blue moon, an entire network target will go out, the restart frequency will be too high, and the supervisors themselves will commit suicide and be restarted by their supervisors.

Sometimes a machine will go down and come back up. Looks like the one I'm on has an uptime of about 90 days, which seems low, but doesn't really matter. If too many of them go down at the same time, I'm notified. Haven't ever been notified.

You don't have to do 'failure detection a hundred different ways' (what does that even mean?) if you just crash, and build your systems to go with the flow, independently, like ants.
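To make that concrete, the supervisor side is just a few lines of OTP boilerplate. A toy sketch (module name, child, and restart limits are made up here, not from my actual system):

    %% Toy supervisor: if more than 10 children crash within 5 seconds,
    %% this supervisor gives up and exits, and is in turn restarted by
    %% *its* supervisor, one level up the tree.
    -module(fetcher_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1, start_worker/0]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        SupFlags = #{strategy => one_for_one, intensity => 10, period => 5},
        Child = #{id => fetcher,
                  start => {?MODULE, start_worker, []},
                  restart => permanent},
        {ok, {SupFlags, [Child]}}.

    %% Stand-in for a real worker (an HTTP fetcher, say). Any crash in
    %% here just means the supervisor spawns a fresh copy.
    start_worker() ->
        {ok, spawn_link(fun Loop() -> timer:sleep(1000), Loop() end)}.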


Well, somebody has to detect the failures, restart failed processes, etc., so you can just "go with the flow". If you can "just crash", that's great, but it doesn't lead to much helpful advice for the people who actually design the underlying systems.


Sorry I wasn't clear. That's exactly the advice. Design systems so that they can crash harmlessly in the small, and availability and scalability come to you quite naturally.

I agree the (distant) second-best alternative is to do as you suggest: rely on a Paxos implementation that keeps a stable leader and tries to keep track of liveness. That's pretty hard too! Besides all of aphyr's Jepsen test series, check this out: https://github.com/coreos/fleet/issues/1289. Tough! Tough even when you think you know what you're doing. And of course it's no help against freak occurrences where you can't elect a stable leader any more, or leaders flap, or the election gets stuck.

Currently the only known-good implementation of Raft is in Idris. Do you know Idris? I sure as hell don't. Fortunately I know ants. Ants can die in great numbers and still move the leaf. In practice, in production, thousands of terabytes, even on AWS, my ants move and never bother me. Let it crash.


Consensus algorithms and failure-detection algorithms are two different things. For example, a gossip-based failure detector is statistically likely to converge in well-bounded time despite partitions, without providing the absolute guarantees of a consensus system. Your criticisms are specific to the category that's irrelevant to this discussion. Please, if you've only built on top of such systems and merely dabbled in their implementation, don't try to bluff your way through.
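Roughly, the core of such a detector is just: merge the heartbeat counters you hear about, and suspect any peer whose counter stops advancing. A toy Erlang sketch of that core (names and timing are illustrative, not any particular published protocol):

    %% Each node keeps Peer => {HeartbeatCounter, LastAdvanceMs} and
    %% periodically swaps views with a random peer.
    -module(gossip_fd).
    -export([merge/3, suspects/3]).

    %% Merge a view received from a peer into our own: keep the higher
    %% counter, and record when we saw it advance.
    merge(Mine, Theirs, NowMs) ->
        maps:fold(
          fun(Peer, {Count, _}, Acc) ->
                  case maps:find(Peer, Acc) of
                      {ok, {MyCount, _}} when MyCount >= Count -> Acc;
                      _ -> maps:put(Peer, {Count, NowMs}, Acc)
                  end
          end, Mine, Theirs).

    %% Peers whose counter hasn't advanced within SuspectAfterMs are
    %% suspected dead; no global agreement required.
    suspects(View, NowMs, SuspectAfterMs) ->
        [Peer || {Peer, {_Count, Last}} <- maps:to_list(View),
                 NowMs - Last > SuspectAfterMs].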


Could you go into greater detail about how the only known-good implementation of Raft is in Idris? I'd never heard of Raft before you mentioned it (I don't deal with distributed systems day-to-day), but I am reading the paper now.

And what do you mean by ants?


First, a correction: the only known-good implementation of Raft is actually in Coq (https://github.com/uwplse/verdi), which predates Idris (which aims to be an easier-to-use Coq).

So the problem with distributed consensus algorithms is that they are hard to understand. It didn't help that Lamport wrote his original paper playfully, using a complex and unfamiliar metaphor. As a result, many implementations of the relevant algorithms miss complex edge cases. Even famous ones that many large companies rely on have either had serious bugs or have been misunderstood and misused by downstream applications.

There are a couple of ways to try to fix this. The most common way is to write a bunch of unit tests. This doesn't work: unit tests cover only the cases you thought to write, and you will probably not think of all of the edge cases.

The next most common way is to use something like QuickCheck, which automatically generates millions of cases and spends as long as you want (hours, days, weeks) hammering your code. This is much better, but still nondeterministic.
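In Erlang, for instance, that typically means PropEr or QuviQ's QuickCheck; a property looks roughly like this (the module and property here are made up for illustration):

    %% PropEr generates many random inputs, runs the property on each,
    %% and shrinks any failing input to a minimal counterexample.
    -module(prop_dedup).
    -include_lib("proper/include/proper.hrl").

    prop_usort_keeps_every_element() ->
        ?FORALL(L, list(integer()),
                lists:all(fun(X) -> lists:member(X, lists:usort(L)) end, L)).

    %% Run with e.g.:
    %% proper:quickcheck(prop_dedup:prop_usort_keeps_every_element(), 10000).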

The better way is to go fully deterministic and prove that your algorithm works, either by exhaustively checking all possible interleavings (code paths) with a model checker, or by mathematical proof.

Historically, the model checkers (e.g. Spin or Groove) have used a separate modeling language (Promela, in Spin's case) in which you describe your algorithm, which is then exhaustively checked. This approach can prove that you are on the right track, but since you cannot run those modeling languages in production, it is not a complete solution: you must then translate the model into source code in your chosen language. This is nontrivial, and subtle transcription errors are very common.

An alternative approach is to use a model checker that uses your native language directly; e.g., Erlang has the brilliant Concuerror program, which instruments your actual code. This is great because if it can verify that your code works properly, then you are done; no transcription is needed. Nevertheless I don't believe there are yet any Concuerror-assisted public distributed consensus algorithms, even in Erlang. I would love to be mistaken on this point.
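For flavor, here's the kind of test Concuerror can explore exhaustively; the module is a made-up toy, not from any real codebase. Two processes do a non-atomic read-modify-write on the same ETS key, so some interleavings lose an update, and Concuerror finds those interleavings deterministically:

    -module(race_test).
    -export([test/0]).

    test() ->
        Tab = ets:new(t, [public]),
        ets:insert(Tab, {counter, 0}),
        Parent = self(),
        Incr = fun() ->
                       [{counter, N}] = ets:lookup(Tab, counter),
                       ets:insert(Tab, {counter, N + 1}),
                       Parent ! done
               end,
        spawn(Incr),
        spawn(Incr),
        receive done -> ok end,
        receive done -> ok end,
        [{counter, Final}] = ets:lookup(Tab, counter),
        %% Fails (badmatch) in any interleaving where an increment is lost;
        %% Concuerror, pointed at race_test:test/0, reports exactly those.
        Final = 2.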

The last approach is to take a proof assistant like Coq or Idris and formally prove the properties of the algorithm using mathematical proof techniques. This is probably optimal, because the exhaustive model checkers can, with complex or slow enough algorithms, run forever or run out of memory trying to test all the cases. However, Coq and Idris are not exactly popular languages, and at this time it's not easy to implement line-of-business applications with them. So although there is a proven, correct implementation of Raft in Coq that guarantees linearizability, good luck accessing it. If you don't use Coq, you're forced to transcribe it into your chosen language, which, as before, is error-prone and does not result in a proven implementation.

It would be possible to mechanically transcribe Coq/Idris code into a more common language while preserving its provability, but to my knowledge that hasn't happened yet. More likely, Idris and its successors will inherit mainstream language features and start making inroads.

Note also that maybe you don't care about being correct. For example, in the spirit of moving fast and breaking things, at least one notable VC-funded project in the news has taken the approach of just increasing timeouts in order to mask bugs in their implementation. And 99.999% of the time that will probably work fine, just as centralizing your database into one king-hell instance and dealing with downtime every blue moon will probably be fine too. Own your own reliability alpha.


Ants (the insect) are an analogy for Erlang worker processes.


But isn't this having one part of the system responsible for retries (the tree/hierarchy of supervisor processes)?


> Even Riak, the poster child for this sort of thing, was straight up last-write-wins for years while they were making fun of everyone else.

Disclaimer: I've only used Riak for toy projects, never on a real production app, but...

I think that's being overly simplistic. They've had vector clocks from basically the beginning (though I gather they've been superseded now) and have favored the strategy of offloading conflict resolution to the code that's reading the value. If you don't implement that conflict resolution, it might default to last-write-wins, but that's not the same thing as "straight up last-write-wins", since they give you the tools to implement something more intelligent.


I was referring to the shipped default. You could always switch to client-side resolution, but then, how's the client supposed to know how to resolve? I would have looked forward to their increasingly-good-looking CRDT ideas if that company hadn't undergone an uncommanded descent into terrain.


Yeah. I run roughly one autonomous retry, then alert/log and pass the decision upstream.

I tend to find more than one retry with a clear timeout is better. [e.g. I know after Y seconds that X is dead to the world and it can safely be re-run from the top]
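Something like this sketch, say (the module name, backoff, and limits are illustrative, not from a real system):

    %% Retry Fun (whenever it crashes) up to MaxAttempts times, but never
    %% past an overall deadline; after that, give up and let the caller decide.
    -module(bounded_retry).
    -export([call/3]).

    call(Fun, MaxAttempts, DeadlineMs) ->
        Deadline = erlang:monotonic_time(millisecond) + DeadlineMs,
        attempt(Fun, 1, MaxAttempts, Deadline).

    attempt(Fun, N, MaxAttempts, Deadline) ->
        try
            {ok, Fun()}
        catch
            _Class:_Reason ->
                Now = erlang:monotonic_time(millisecond),
                if
                    N >= MaxAttempts -> {error, too_many_attempts};
                    Now >= Deadline  -> {error, deadline_exceeded};
                    true ->
                        timer:sleep(min(1000, 100 * N)),  %% simple backoff
                        attempt(Fun, N + 1, MaxAttempts, Deadline)
                end
        end.

    %% e.g. bounded_retry:call(fun() -> fetch(Url) end, 3, 5000),
    %% where fetch/1 is whatever does the actual work.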


Health checks are orthogonal to retries. It's not one or the other; you should have both (along with sane limits on retries so you don't cause a cascading failure against a persistently-down server).


No, it's not one or the other, but they're related. Or should be. I've had to debug systems with dozens of different retry/failure intervals that had to relate to one another in very complex ways for failure detection and response to happen properly. It sucked. Systems where the different layers/modules handle simple retries but give up only according to one set of times and rules are much more robust and pleasant to work with.



