It's necessary, but insufficient. You should have automated testing for each individual component. The problem is that by itself, automated testing won't give you a reliable distributed system. You need to assume that components will fail in ways that won't be caught by your tests, and design your architecture accordingly.
(Among other problems, automated testing won't catch a hardware failure, or someone tripping over the power cord, or a cosmic ray flipping a bit and corrupting memory.)
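To make "design your architecture accordingly" slightly more concrete, here's a minimal sketch in Python (the flaky fetch_recommendations dependency and all the numbers are hypothetical): wrap calls to a component you expect to fail in bounded retries with backoff and a fallback, so a failing component degrades one feature instead of failing the whole request.

    import random
    import time

    class DependencyError(Exception):
        pass

    def fetch_recommendations(user_id):
        # Hypothetical downstream call that fails ~30% of the time.
        if random.random() < 0.3:
            raise DependencyError("recommendation service unavailable")
        return ["item-%d" % n for n in range(3)]

    def fetch_with_fallback(user_id, retries=2, backoff=0.1):
        # Bounded retries with exponential backoff, then a graceful fallback:
        # the page still renders if recommendations are down, just without them.
        for attempt in range(retries + 1):
            try:
                return fetch_recommendations(user_id)
            except DependencyError:
                if attempt < retries:
                    time.sleep(backoff * (2 ** attempt))
        return []  # degrade instead of propagating the failure upward

    print(fetch_with_fallback(user_id=42))

The same idea scales up to timeouts, circuit breakers and bulkheads; the point is that a dependency's failure is an expected input to the design, not an exception to it.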
Sure, but that's a larger topic than just the "best practices" article from OP. Large-scale distributed systems are very difficult to test in an automated way and mostly rely on real production "tests" (i.e., real traffic) for comprehensive coverage.
I think that, thanks to the famous Jepsen series, vendors of distributed database systems are slowly acknowledging that this kind of testing (even if it is very difficult to perform) is nevertheless a must. But for application software it is still a luxury.
Among experienced teams, most failures aren't caused by single-node/single-service errors. They've already designed & tested for that case, and the ability to handle them is baked into the architecture.
The interesting failures are caused by a cascade of errors - someone writes an innocent bug, which causes a single-node fault, which exercises some pathway in the fault recovery code that has unintended side-effects, which results in an unexpected condition elsewhere in the system.
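One common shape of such a cascade, sketched as a toy Python model (all the capacity and retry numbers are invented): a single node goes down, the clients' "recovery" path retries every failed request, and the retries inflate the load until the remaining healthy nodes are overloaded as well.

    # Toy model of a retry-driven cascade. Numbers are made up.
    NODES = 4
    CAPACITY_PER_NODE = 100      # requests/sec a healthy node can serve
    BASE_LOAD = 320              # steady incoming requests/sec
    RETRIES_PER_FAILURE = 3      # the clients' fault recovery: retry 3 times

    def offered_load_after_failures(failed_nodes, rounds=5):
        healthy_capacity = (NODES - failed_nodes) * CAPACITY_PER_NODE
        load = BASE_LOAD
        for _ in range(rounds):  # let the feedback loop settle (or blow up)
            dropped = max(0, load - healthy_capacity)
            # every dropped request comes back as retries next round
            load = BASE_LOAD + dropped * RETRIES_PER_FAILURE
        return load, healthy_capacity

    for failed in range(NODES):
        load, capacity = offered_load_after_failures(failed)
        status = "ok" if load <= capacity else "OVERLOADED"
        print(f"{failed} node(s) down: load {load:.0f} vs capacity {capacity} -> {status}")

With zero nodes down the system is fine; with just one node down, the retries push the offered load an order of magnitude past capacity within a few rounds, even though each individual piece (the node failure, the retry logic) looks harmless on its own.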
What you suggest is a start and can certainly help uncover some failure modes of the application, but it is by no means a complete answer. Debugging can be hard, but at least it is tractable: you start with the effect and slowly work your way back towards the cause. Anticipating which changes in the environment (including, but not limited to: sudden spikes in traffic, changes in the statistical distribution of data, subtle partial hardware failures, peculiarities of the system software, etc.) will result in failures is impossible.

Worst of all, each cause can be relatively benign taken in isolation, but when they are all present they interact in some devilish way, resulting in a total breakdown of your service. This Facebook engineering blog post comes to mind as an example: https://code.facebook.com/posts/1499322996995183/solving-the....

So even if you test rigorously, you still need to monitor everything and have some smart people on call to look into the issue when some obscure metric goes haywire.
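On the "monitor everything" point, the automated half doesn't have to be clever. A rough sketch (assuming you already collect per-metric time series somewhere): flag any metric whose latest sample drifts far from its recent history, and let the humans do the actual diagnosis.

    import statistics

    def looks_haywire(recent_values, latest, threshold=4.0):
        # Crude anomaly check: flag the metric if the latest sample sits more
        # than `threshold` standard deviations away from its recent mean.
        mean = statistics.mean(recent_values)
        stdev = statistics.pstdev(recent_values)
        if stdev == 0:
            return latest != mean
        return abs(latest - mean) / stdev > threshold

    # e.g. a queue-depth metric that has been flat and suddenly spikes
    history = [12, 14, 13, 12, 15, 13, 14, 12, 13, 14]
    print(looks_haywire(history, latest=13))  # False -> nothing to do
    print(looks_haywire(history, latest=90))  # True  -> page the on-call

The point isn't the statistics, it's that the obscure metrics are watched at all, so a human gets pulled in before the devilish interaction finishes playing out.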