Early 2000s, 9-day complete outage.

I don't know if that was the same one, but this happened to me at Rent.com. The story is that a change in a shell script meant that backups were not actually being sent properly to tape. That was OK in itself, because there was another online backup copy. But the restore process deleted that copy for 1 hour each day before it was recreated.

The production database died during that hour. We had to take the last good backup (several months earlier) and replay WAL logs to bring it up to date.

The sysadmin whose mistake it was offered her resignation, and was turned down by the head of tech because he knew she wouldn't have made it if she had had a more reasonable workload. The head of tech offered his resignation to the CEO and was turned down because the CEO knew that it was due to incorrect company priorities.

The next tech hire was a DBA whose sole task was to make sure that we had multiple levels of verified backups.

In less than a year we were sold to eBay at a nice price. Part of the reason was that they thought that the way that we handled failure said very positive things about the organization.




And this is how you run a business. It seems every person who worked with you had integrity and true leadership in spades!

That's rare. Too rare!


Oh, and one other memorable detail that I couldn't make up.

The database went down DURING the CEO's 50th birthday party!


I love happy endings :)



