
This gives SaaS providers in general a bad name. The error here isn't the engineer deleting the db, it's the complete lack of restore testing.

A complex restore never goes well when the first real attempt happens under the pressure of a live incident. Other SaaS providers will be cursing such a big-name tool for making such a public mess.




I can't remember who said this, but it seems apt: "Don't ask if they do regular backups -- ask if they do regular restores".


Good point. And in the cloud, restoring a backup can take ages, which is important to take into consideration. In fact, I'm not sure how to do it efficiently other than swapping the drives.


Yep. My regular rant: until it is tested, what you have is just a file, or a collection of files or other objects, that might be a backup. You don't really have a backup until you have tested it.


Schrödinger's backup.



DR testing is hard, complicated and costs a lot. Yes, it should be done regularly, but it's not an easy task; I believe small(ish) companies simply can't afford it.


Then perhaps small(ish) companies shouldn't hold data that's critical to their customers. Just as real engineering companies don't build nuclear reactors if they can't test their safety systems, or cars if they can't afford crash testing, and so on.

DR is not a luxury. Systems that don't properly do DR aren't unoptimized or something, they're badly engineered.


Doubt any business will go down as a result of a web-based diagram tool being unavailable for a few hours.


I wasn't referring to this specific case. Also, "don't worry, our services may suck, but not to the point where they bring down your business" is not exactly the kind of reliability one would want to aspire to.


In the DR services my company provides, failover testing is baked into the cost, and is _mandatory_ annually. It is hard and complicated and expensive, but not as hard as explaining to the customer that "we have your backups, we just can't use them because we never tested whether the software would run on the DR hardware" or "don't worry, the replica of the file server is secure in the datacenter, you just can't log in right now because the Active Directory server is tombstoned and won't process your logon." When someone needs their data, what's the difference? Can I see my files or not?

It's more than a cliché that without restore tests you don't really have a backup -- if the customer won't commit to testing the DR, we won't provide the service anymore. Anyone who pretends anything else is acceptable is kidding themselves.


Testing complicated failover systems is hard. Testing backups is not. Especially if you just do a full nightly backup and not something more complex. The options range from "manually restore a random 10%" to "write a moderately complex script to automatically restore and validate everything".

Those all look like more trouble than they're worth before you start, but they aren't that hard. And you'll recover the whole investment the first time you need to restore something for real and it turns out to be just a small variation on your standard test procedure.
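
For illustration (just a sketch; the dump path, scratch database name and table names below are made-up placeholders), a nightly restore check for a Postgres dump can be as small as:

    # Sketch of a nightly restore check: restore the latest dump into a
    # scratch database and make sure a few critical tables came back non-empty.
    # The path, database name and table names are placeholders.
    import subprocess
    import psycopg2

    DUMP = "/backups/latest.dump"   # wherever the nightly pg_dump lands
    SCRATCH_DB = "restore_check"    # throwaway DB, recreated on every run

    # Recreate the scratch database and restore the dump into it.
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    subprocess.run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, DUMP], check=True)

    # Minimal validation: every critical table exists and has rows.
    conn = psycopg2.connect(dbname=SCRATCH_DB)
    with conn, conn.cursor() as cur:
        for table in ("users", "documents", "billing_events"):  # your critical tables
            cur.execute("SELECT count(*) FROM " + table)
            rows = cur.fetchone()[0]
            assert rows > 0, "restore check failed: %s is empty" % table
            print("%s: %d rows" % (table, rows))
    conn.close()

Even something that crude catches the "backup job has been silently writing empty files for months" failure mode.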


They can apparently afford to run four restores in parallel now, using different methods.

If they had done this exercise even once a year, they would have known better what to expect, or how long it would take.


Apologies if I've put the wrong slant on this — designer, not a tech expert. I can still edit the title of the submission if there's something else it should say?


It wasn't aimed at you.



