
What you suggest is a start and can certainly help uncover some failure modes of the application, but it is by no means a complete answer. Debugging can be hard, but at least it is tractable: you start with the effect and slowly work your way back to the cause. But anticipating which changes in the environment (including but not limited to sudden spikes in traffic, changes in the statistical distribution of data, subtle partial hardware failures, peculiarities of the systems software, etc.) will result in failures is impossible. Worst of all, each cause can be relatively benign in isolation, yet when they are all present they interact in some devilish way and bring about a total breakdown of your service. This Facebook engineering blog post comes to mind as an example: https://code.facebook.com/posts/1499322996995183/solving-the.... So even if you test rigorously, you still need to monitor everything and have some smart people on call to look into the issue when some obscure metric goes haywire.


