One of the first things I did at my current place was to add a Nagios check "Is ...

saidajigumi · on March 23, 2016

This is a fantastic illustration of testing 101: often it's the dumbest possible checks that catch huge errors. Get those in place before overthinking.

E.g. a team adjacent to mine years ago had a dev who made a one-character typo in a commit that went to production, which caused many $MM to incorrectly flow out the door post-haste. The bad transactions were fortunately reversible with some work, but I was floored that there were no automated tests gating these changes. It wasn't a subtle problem. The most basic, boring integration test of "run a set of X transactions, check the expected sum" would have prevented that failure.
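To make "run a set of X transactions, check the expected sum" concrete, here's a rough sketch of that kind of check. Everything in it is made up for illustration: `apply_transactions` is a hypothetical stand-in for whatever the real system does, and the amounts are invented. The point is only that the expected total is worked out by hand and hard-coded, so a stray sign or digit in the code under test makes the check fail.

```python
from decimal import Decimal


def apply_transactions(opening_balance, transactions):
    """Hypothetical stand-in for the production code under test.

    Applies each signed amount in order and returns the final balance.
    """
    balance = opening_balance
    for amount in transactions:
        balance += amount
    return balance


def check_expected_sum():
    """Run a fixed batch of transactions and compare with a hand-computed total."""
    # Invented sample data; a real test would use a known-good fixture.
    transactions = [
        Decimal("100.00"),
        Decimal("-25.50"),
        Decimal("10.25"),
        Decimal("-4.75"),
    ]
    # Worked out by hand: 1000.00 + 100.00 - 25.50 + 10.25 - 4.75
    expected = Decimal("1080.00")
    actual = apply_transactions(Decimal("1000.00"), transactions)
    assert actual == expected, f"expected {expected}, got {actual}"
    return actual


if __name__ == "__main__":
    check_expected_sum()
    print("transaction sum check passed")
```

Using `Decimal` instead of floats is my own choice here so that the equality comparison is exact; the original story doesn't say how the amounts were represented.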

chris_wot · on March 23, 2016

All these threads are complete gold mines of years of hard-won experience and system administration tips! Fascinating, especially the war stories I'm reading - both hilarious and horrifying in equal measure.