
How about (automated) testing? Seems the author missed an important one.



It's necessary, but insufficient. You should have automated testing for each individual component. The problem is that by itself, automated testing won't give you a reliable distributed system. You need to assume that components will fail in ways that won't be caught by your tests, and design your architecture accordingly.

(Among other problems, automated testing won't catch a hardware failure, or someone tripping over the power cord, or a cosmic ray flipping a bit and corrupting memory.)


> automated testing won't catch a hardware failure, or someone tripping over the power cord

While it's never 100%, you can still test quite a lot of those conditions.

We developed a TCP proxy[0] that allows us to test such cases: datastore X being down / timing out / being excessively slow (a usage sketch follows the link below).

[0] https://github.com/shopify/toxiproxy
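
For anyone curious, here is a minimal sketch of driving it from Go, assuming a toxiproxy server running locally on its default API port (8474) and a Redis datastore on 6379; the import path and attribute names follow the repo's README (they may differ between versions), and names like "slow_redis" are just illustrative:

    package main

    import (
        "log"

        toxiproxy "github.com/Shopify/toxiproxy/client"
    )

    func main() {
        // Talk to a toxiproxy server on its default API port.
        client := toxiproxy.NewClient("localhost:8474")

        // The application under test connects to localhost:26379, which
        // proxies to the real datastore on localhost:6379.
        proxy, err := client.CreateProxy("redis", "localhost:26379", "localhost:6379")
        if err != nil {
            log.Fatal(err)
        }
        defer proxy.Delete()

        // Simulate an excessively slow datastore: ~1s of extra latency
        // (+/- 250ms jitter) on every downstream response.
        _, err = proxy.AddToxic("slow_redis", "latency", "downstream", 1.0, toxiproxy.Attributes{
            "latency": 1000,
            "jitter":  250,
        })
        if err != nil {
            log.Fatal(err)
        }

        // ... run the test suite against localhost:26379 here ...

        // Remove the toxic to restore normal behaviour.
        if err := proxy.RemoveToxic("slow_redis"); err != nil {
            log.Fatal(err)
        }
    }

The "timing out" and "down" cases are covered the same way, with a timeout toxic or by disabling the proxy for the duration of the test.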


Netflix approaches this with their "chaos engineering": characterise fault behaviour by deliberately triggering known faults.


Sure, but that's a larger topic than the "best practices" article from the OP. Large-scale distributed systems are very difficult to test in an automated way, and in practice comprehensive coverage mostly comes from real production "tests" (i.e., real traffic).


I think that, thanks to the famous Jepsen series, vendors of distributed database systems are slowly acknowledging that this kind of testing, even if it is very difficult to perform, is nevertheless a must. For application software it is still a luxury.


Start a group of VMs / Docker containers and write a script to randomly reboot one or a few of them (see the sketch below).

Writing tests like that is not that hard.

Debugging the issues is harder.

To the original post author's comment about gdb/Go: you can't use gdb to debug this type of problem.
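
For concreteness, here is a minimal sketch of the random-reboot idea in Go. It assumes the Docker CLI is available on the host and that a blunt `docker restart` of a random container is an acceptable fault to inject; the interval is a placeholder you'd tune for your own setup:

    package main

    import (
        "log"
        "math/rand"
        "os/exec"
        "strings"
        "time"
    )

    func main() {
        rand.Seed(time.Now().UnixNano())
        for {
            // List the IDs of the currently running containers.
            out, err := exec.Command("docker", "ps", "-q").Output()
            if err != nil {
                log.Printf("docker ps failed: %v", err)
                time.Sleep(time.Minute)
                continue
            }
            ids := strings.Fields(string(out))
            if len(ids) > 0 {
                // Restart one container at random to exercise the
                // system's fault-recovery paths.
                victim := ids[rand.Intn(len(ids))]
                log.Printf("restarting container %s", victim)
                if err := exec.Command("docker", "restart", victim).Run(); err != nil {
                    log.Printf("restart of %s failed: %v", victim, err)
                }
            }
            // Placeholder interval between injected faults.
            time.Sleep(5 * time.Minute)
        }
    }

The value is less in the script itself than in watching how the rest of the system (and your monitoring/alerting) behaves while it runs.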


Among experienced teams, most failures aren't caused by single-node/single-service errors. They've already designed & tested for those cases, and the ability to handle them is baked into the architecture.

The interesting failures are caused by a cascade of errors - someone writes an innocent bug, which causes a single-node fault, which exercises some pathway in the fault recovery code that has unintended side-effects, which results in an unexpected condition elsewhere in the system.


What you suggest is a start and can certainly be helpful in uncovering some failure modes of the application, but it is by no means a complete answer. Debugging can be hard, but at least it is tractable: you start with the effect and slowly work your way towards the cause. Anticipating which changes in the environment (including, but not limited to: sudden spikes in traffic, changes in the statistical distribution of data, subtle partial hardware failures, peculiarities of the system software, etc.) will result in failures is impossible.

Worst of all, each cause can be relatively benign in isolation, but when they are all present they interact in some devilish way, resulting in a total breakdown of your service. This Facebook engineering blog post comes to mind as an example: https://code.facebook.com/posts/1499322996995183/solving-the....

So even if you test rigorously, you still need to monitor everything and have some smart people on call to look into the issue when some obscure metric goes haywire.


That one is important. I give details in this comment:

https://news.ycombinator.com/item?id=10322298


Indeed - I'd argue that's the most important one.



