Sounds like a terrible situation. I wish those guys luck. One useful sys ops pra...

Sounds like a terrible situation. I wish those guys luck.

One useful sys ops practice is the creation and yearly validation of disaster recovery runbooks. We have a validated catalog of runbooks that describe the recovery process for each part of our infrastructure. The validation process involves provoking a failure (eliminate a master database), running the documented recovery steps and then validating the result. The validation process is a lot easier if you're in the cloud since it's cheap and easy to set up a validation environment that mirrors your production env.