
A coworker of mine used to say "It's not the backup, it's the restore." Meaning your backup process isn't meaningful unless you have a tested and effective means for recreating your system from that backup. It has stuck with me.



It's important to check not only that you can restore the database, but also that it contains the data you think it contains.

Finding out that your backup consists of a perfectly backed up empty database isn't much fun!


One of the first things I did at my current place was to add a Nagios check: "Is the most recent backup size within x bytes of the previous one?" It wasn't comprehensive, but it was fast to implement, and it was later replaced with full restores and checks. Nevertheless, a few weeks later it caught a perfectly formed tar.gz of a completely empty database, caused by a bug that would have kept creating empty archives until it was fixed.
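
For anyone curious, a check like that is only a handful of lines. Here's a rough Python sketch of the idea (the backup directory and the 20% threshold are made-up examples; it follows the usual Nagios exit-code convention):

    #!/usr/bin/env python3
    # Rough sketch of a "latest backup size" sanity check, Nagios-style.
    # Assumes backups land in one directory as timestamped .tar.gz files;
    # the directory and threshold below are invented examples.
    import glob
    import os
    import sys

    BACKUP_DIR = "/var/backups/db"   # hypothetical path
    MAX_DELTA = 0.20                 # alert if size changes by more than 20%

    backups = sorted(glob.glob(os.path.join(BACKUP_DIR, "*.tar.gz")),
                     key=os.path.getmtime)
    if len(backups) < 2:
        print("UNKNOWN: need at least two backups to compare")
        sys.exit(3)

    prev, latest = (os.path.getsize(p) for p in backups[-2:])
    delta = abs(latest - prev) / max(prev, 1)

    if latest == 0 or delta > MAX_DELTA:
        print(f"CRITICAL: latest backup is {latest} bytes vs previous {prev}")
        sys.exit(2)
    print(f"OK: latest backup is {latest} bytes (change {delta:.0%})")
    sys.exit(0)

It's no proof that the backup is restorable, but it's exactly the kind of cheap tripwire that catches the empty-archive bug on day one.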


This is a fantastic illustration of testing 101: often it's the dumbest possible checks that catch huge errors. Get those in place before overthinking.

E.g. a team adjacent to mine years ago had a dev who made a one-character typo in a commit that went to production. Which caused many $MM to incorrectly flow out the door post-haste. The bad transactions were fortunately reversible with some work, but I was floored that there were no automated tests gating these changes. It wasn't a subtle problem. The most basic, boring integration test of "run a set of X transactions, check the expected sum" would have prevented that failure.
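
That kind of test is almost embarrassingly small to write. Something like this hypothetical pytest-style sketch (process_batch and the field names are invented stand-ins for the real pipeline) would have flagged the typo before it ever reached production:

    # Boring-but-effective integration test: run a fixed batch of
    # transactions through the pipeline and check the expected sum.
    # process_batch is a hypothetical stand-in for the real entry point.
    from decimal import Decimal

    def test_known_batch_pays_out_expected_total():
        batch = [
            {"account": "A", "amount": Decimal("100.00")},
            {"account": "B", "amount": Decimal("-25.50")},
            {"account": "C", "amount": Decimal("0.01")},
        ]
        result = process_batch(batch)
        # Any sign flip, dropped digit, or off-by-one in the pipeline
        # shows up as a mismatch against the hand-computed total.
        assert result.total_paid_out == Decimal("74.51")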


All these threads are complete gold mines of years of hard won experience and system administration tips! Fascinating, especially the war stories I'm reading - both hilarious and horrifying in equal measure.


And as Gliffy is discovering to their detriment, exactly how long said restore takes...


Resonates. I didn't spend the extra $5 to get a USB 3.0 flash drive. Currently the read process for the image backup looks like it will take 4 hours to create the write media and 2 to actually restore. Good lesson above about the restore being more important than the backup.

Wish I learned that.


This morning I backed up my iPhone with iTunes. Once it finished, iTunes said the backup was successful. Then I formatted my iPhone and went to restore. It says the backup is corrupt.

Tried some closed-source iTunes backup fixer; didn't work.

Pretty heated right now, but oh well, what can I do? I guess I'll just have to start over. Thankfully I am an iCloud Photo Library subscriber.

So yes, can confirm, "It's not the backup, it's the restore."


If you've never restored the backup, you haven't backed-up.


My first job was doing software engineering consulting. My main client was a defense contractor. We wanted to at least use cvs or svn for source control, but you wouldn't believe the red tape associated with getting approval to use a free source control system... so we used date-named zip files for "source control". The rule was the last person out at the end of the day zipped up the shared drive contents and shut off the lights.

Though, we were required to keep good backups: three sets of tapes, one always in the tape drive, one always in a fire-proof, bomb-resistant bunker, and one sometimes in transit to or from the bunker.

Our manager was paying some obscene sum for this backup service, so I suggested we just hide one of the daily backup zip files and pretend we deleted it. The head of the group humored my request. It turned out that nobody was monitoring the tape or the backup job. The tape had filled up and nobody had swapped tapes and called the bunker courier.

Luckily, they at least used PVCS for configuration management of the releases. No source control for development, but every time we cut a release of the software, the head office sent over no fewer than 3 people to literally watch over our poor release guy's shoulder as he zipped up the source, built the binaries, checked both into PVCS, and burned two CDs of binaries and source.

Defense industry... things you're required to do get done in triplicate. Things you're not required to do, but which cost no money and significantly reduce risk, require approvals from 5 different people 9 levels above you. I guess how often the backup tape needed to be checked for available space was insufficiently specified.

On a side note, at that client I also once sat quietly in a meeting for 30 minutes watching two grown men argue over whether my use of "will" in a design document needed to be changed to "shall". I didn't care, and said right away that I was fine changing the word to "shall", but the second reviewer was adamantly opposed to unnecessary changes.


"Will" to "Shall" -- very important legal distinction. Thank the second reviewer. Your job in "software engineering consulting" was actually managing risk.


Except that, as mentioned, this was a software design document and not a legal document. I described intended behavior of a software component. There's no ambiguity in using "will" or "shall" since software does not (at least did not at that time) have intent and does not make promises.

I tried to avoid argument because it didn't matter, not because I thought my word choice was incorrect. It was an internal document describing intended software behavior, to help the poor soul who had to maintain that code.

"When 'shall' is used to describe a status, to describe future actions, or to seemingly impose an obligation on an inanimate object, it's being used incorrectly."[0]

[0]https://law.utexas.edu/faculty/wschiess/legalwriting/2005/05...


I have a similar issue.

It's a formerly critical tool (now the sole repository of necessary historical information) that runs only on an outdated stack and, for "copy protection", keeps its critical data in an undocumented, obfuscated DAT file. It also requires a parallel-port dongle, which it checks for as part of every read operation. The vendor went bust a decade before I was ever hired.

I've automated a 'backup to zip file' each time the application terminates. It's saved me more than once - it's easy to clobber the data in the thing and the users have a tendency to screw it up when trying to navigate its cryptic keystroke-driven interface.
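
The wrapper for that is tiny. Something like this Python sketch (all paths here are invented, and the real thing has a bit more error handling) launches the legacy app, waits for it to exit, then zips its data directory with a timestamp:

    # Sketch of the "zip the data on every exit" idea. Launch the legacy
    # tool, block until the user closes it, then archive its data directory.
    # All paths here are hypothetical.
    import datetime
    import os
    import shutil
    import subprocess

    APP = r"C:\legacy\tool.exe"        # made-up path to the old application
    DATA_DIR = r"C:\legacy\data"       # directory holding the obfuscated DAT files
    BACKUP_DIR = r"D:\backups\legacy"  # where the zip archives pile up

    subprocess.run([APP])              # returns when the application terminates

    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    shutil.make_archive(os.path.join(BACKUP_DIR, f"legacy-{stamp}"), "zip", DATA_DIR)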

Trying to export all the legacy data into a new tool met with incredible frustration the couple of times we tried. It all becomes irrelevant in late 2017, and there's been a new system in place since 2007, so this abomination only needs to live on for a little while longer.


When appropriate, this is why I adore full-disk, bootable system backups. Plug a backup drive in, spot-check the contents (e.g. by data priority), run a filesystem compare tool, boot from it, etc. Backup is easy, as is restore and testing. A booted backup drive can restore to a replaced system drive.

That's obviously not the right solution for many IT-centric backup needs, but when full-disk backup became cheap and easy it set a new standard in how I think about backup and restore process everywhere.


This entire discussion has me wanting to backup (and test) everything that I can get my hands on.


World Backup Day — March 31st


World Restore Day April 1st


Postmortem on date field mismatch following backup/restore procedure of the World: Octocember 99st


What could possibly go wrong?


As long as it's not February 29th.


Had the same training from the head of operations at my first gig: if you don't restore the backup each time it's pulled, as a test to verify it's valid, then it's a risk.


This is made more serious in practice by the continual recurrence of bugs in Backup Exec over the last several versions, which would often manifest as "backups ran fine, verify ran fine, restores claim to run fine. But all your restored files are 0 bytes long".


How do you test your restore without destroying the only trustworthy copy of your data?


Do it in the Staging or User Acceptance Testing environment.


This is the right answer! You want a staging/acceptance/mirror environment that's the same as production, right? So make it with nightly restores of the production backups. You get a crisp, fresh staging environment, and regular validation of your backups too. Just remember to run full production monitoring against your staging environment too.
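
As a rough illustration, assuming a Postgres dump, the nightly job can be as simple as this sketch (database names, hosts, the backup path, and the users-table check are all invented):

    # Sketch of a nightly "restore prod backup into staging" job.
    # Assumes Postgres dumps in /var/backups/prod; everything named here
    # (hosts, database names, the sanity-check table) is hypothetical.
    import glob
    import os
    import subprocess
    import sys

    latest = max(glob.glob("/var/backups/prod/*.dump"), key=os.path.getmtime)

    # Recreate the staging database from last night's production dump.
    subprocess.run(["dropdb", "--if-exists", "-h", "staging-db", "app_staging"], check=True)
    subprocess.run(["createdb", "-h", "staging-db", "app_staging"], check=True)
    subprocess.run(["pg_restore", "-h", "staging-db", "-d", "app_staging",
                    "--no-owner", latest], check=True)

    # Cheap sanity check: fail loudly if a key table came back empty.
    rows = subprocess.run(
        ["psql", "-h", "staging-db", "-d", "app_staging", "-tAc",
         "SELECT count(*) FROM users"],
        capture_output=True, text=True, check=True).stdout.strip()
    if int(rows) == 0:
        sys.exit("Restore 'succeeded' but the users table is empty")

And if the monitoring you already run against production also points at this environment, a failed restore pages someone the same morning.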


A friend of mine has his staging databases restored every morning (and anonymised) from the previous day's production backup.


Restore it elsewhere.


Made that mistake once. I was interning at the IT Helpdesk at a Pharma Startup. Our CEO calls us one day, informing us that he deleted a slide deck, and was wondering if we could do anything about it. We dutifully attempt to restore from tape, to find the tapes blank and that our backup has done literally nothing for the past month. Thankfully, the data loss was nothing more severe, but the lesson stuck.


In my last job I was a member of an IT team of three. One guy was the CIO, plus he did sysadmin and IT support for an office of about 60. I did sysadmin and support for about half a dozen offices, 200-odd staff, in two countries with a geographical spread of several hundred kilometres. The third guy was support and sysadmin for our US office, which had about 12 people (cushy job).

I popped out there once for a week or so to help train the guy on our new systems. They had an old backup system that copied diffs to disk, which he had to take home every day, but he didn't like doing that, so they invested in a very expensive off-site system. I was having a dig around while I was over there and it appeared to me that the diffs hadn't been happening. I confronted the guy and he said that he hadn't set it up yet and knew that would bite him in the ass eventually.

They hadn't run any backups for six months, but he had been giving positive reports to both his manager and the rest of the IT team about the hard work he was putting in setting it up, how reliably it had been backing up, how we celebrated when the first full copy of data finally finished, etc. Needless to say, he lost his job.

We were very lucky that I caught it.


We were using an external vendor for our production database (I've changed this now), and I was poking around their management console one morning when I noticed something (I was about two months into this gig). The CTO walked into the office at that point and I asked him: "Hey, I can see that the staging database is being backed up, but the prod database doesn't have any backup files. Where are the backups for prod?" His response: "Press that manual backup button immediately, please." It turned out that production had been running for nearly three years without backups actually being taken...



