Start a group of VMs / Docker containers. Write some script to randomly reboot o...

nostrademons · on Oct 2, 2015

Among experienced teams, most failures aren't caused by single-node/single-service errors. They've already designed & tested for that case, and the ability to handle them is baked into the architecture.

The interesting failures are caused by a cascade of errors - someone writes an innocent bug, which causes a single-node fault, which exercises some pathway in the fault recovery code that has unintended side-effects, which results in an unexpected condition elsewhere in the system.

dunkelheit · on Oct 2, 2015

What you suggest is a start and can certainly be helpful in uncovering some failure modes of the application, but it is by no means complete answer. Debugging can be hard but at least it is tractable - you start with the effect and slowly work your way towards the cause. But anticipating which of the changes in environment (including but not limited to: sudden spikes in traffic, changes in statistical distribution of data, subtle partial hardware failures, peculiarities of the systems software etc.) will result in failures is impossible. What is the worst is that all causes can be relatively benign when taken in isolation but when they are all present they interact in some devilish way resulting in a total breakdown of your service. This facebook engineering blog post comes to mind as an example: https://code.facebook.com/posts/1499322996995183/solving-the.... So even if you test rigorously you still need to monitor everything and have some smart people on call to look into the issue when some obscure metric goes haywire.