* Multiple logins to the conserver, down the wrong system.
* rm -rf in the wrong directory as root on a dev box, get that sick feeling when it's taking too long.
* While I'm sitting at the console about to replace multiple failed drives in a Sun A5200 storage array under a production Oracle DB, a more senior colleague walks up, says "Just pull it, we've got hot spares," and yanks a blinking drive before I can reply. Except we only have two hot spares left, and now we have three failed drives. Under RAID 5. Legato only took eight hours to restore it.
* Another SA hoses config on one side of a core router pair after hours doing who knows what and leaves telling me to fix it. We've got backups on CF cards, so restore to last good state. Nope, he's managed to trash the backups. Okay, pull config from other side's backup. Nope, he told me the wrong side and now I've copied the bad config. Restore? Nope, that backup was trashed by some other admin. Spent the night going through change logs to rebuild config.
There were a few others over the years, but they all had the same things in common: not having/knowing/following procedure, missing tooling, and good old human error.
I accidentally changed permissions on an entire server (holding hundreds of customers' data) when I left a space out of a bash command.
Once I realized what I had done (took me 1 sec), I got that sick feeling. I had to go to the bathroom to do #2. I know what it means to be scared sxxtless.
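For anyone wondering how one stray space does that much damage: the shell splits arguments on whitespace, so a single intended path becomes two separate targets, and a recursive command walks both. A hypothetical illustration (not the actual command from the story, just the same class of mistake):

```python
import shlex

# Hypothetical example -- the original comment doesn't say which command was
# typed, only that a missing space changed what got hit. Shell word splitting
# treats every space-separated token as its own argument, so one intended
# path becomes two targets.
intended = "chmod -R 750 /srv/app/uploads"    # one target: the uploads dir
mistyped = "chmod -R 750 /srv /app/uploads"   # stray space: /srv is now a target too

print(shlex.split(intended)[3:])  # ['/srv/app/uploads']
print(shlex.split(mistyped)[3:])  # ['/srv', '/app/uploads']
```

Tab-completing the full path so the shell fills it in for you is the cheap guard here.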
Is it really a good idea to use RAID 5 on a database? If the database is large enough, rebuild time can be longer than a straight restore, and under many RAID 5 setups you have the added problem of slower write performance.
> Is it really a good idea to use RAID 5 on a database
Hell no. Had I been involved in that setup it would have been RAID 10 or RAID 50. Actually, had there been some planning there would have been a second array, and it would not have been physically co-located in the same rack as the first, so when the cooling or power inevitably fails it won't take out both. But, you know, not my circus.
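To put rough numbers on the write-performance point (my assumptions, not anything measured on that array): every small random write on RAID 5 costs four disk I/Os (read old data, read old parity, write new data, write new parity), while RAID 10 costs two (one per mirror). A quick sketch:

```python
# Back-of-the-envelope only; the per-disk IOPS and array size are assumptions,
# not numbers from the thread.
DISK_IOPS = 150   # assumed random IOPS per spindle
DISKS = 8         # assumed number of drives in the array

RAID5_WRITE_PENALTY = 4    # read data + read parity + write data + write parity
RAID10_WRITE_PENALTY = 2   # write to both mirror copies

raid5_write_iops = DISKS * DISK_IOPS / RAID5_WRITE_PENALTY
raid10_write_iops = DISKS * DISK_IOPS / RAID10_WRITE_PENALTY

print(f"RAID 5:  ~{raid5_write_iops:.0f} random write IOPS")   # ~300
print(f"RAID 10: ~{raid10_write_iops:.0f} random write IOPS")  # ~600
```

Rebuilds are the other half: RAID 5 has to read every surviving disk to rebuild one, and a further failure during that window loses the array, which is exactly the scenario above.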