I was testing disaster recovery for the database cluster I was managing. Spun up...

I was testing disaster recovery for the database cluster I was managing. Spun up new instances on AWS, pulled down production data, created various disasters, tested recovery.

Surprisingly it all seemed to work well. These disaster recovery steps weren't heavily tested before. Brilliant! I went to shut down the AWS instances. Kill DB group. Wait. Wait... The DB group? Wasn't it DB-test group...

I'd just killed all the production databases. And the streaming replicas. And... everything... All at the busiest time of day for our site.

Panic arose in my chest. Eyes glazed over. It's one thing to test disaster recovery when it doesn't matter, but when it suddenly does matter... I turned to the disaster recovery code I'd just been testing. I was reasonably sure it all worked... Reasonably...

Less than five minutes later, I'd spun up a brand new database cluster. The only loss was a minute or two of user transactions, which for our site wasn't too problematic.

My friends joked later that at least we now knew for sure that disaster recovery worked in production...

Lesson: When testing disaster recovery, ensure you're not actually creating a disaster in production.

(repeating my old story from https://news.ycombinator.com/item?id=7147108)