Treating app servers as cattle, i.e. if there's a problem just shoot & replace it, is easy nowadays if you're running any kind of blue/green automated deployment best practice. But DBs remain problematic and pet-like, in that you may find yourself nursing them back to health. Even if you're using a managed DB service, do you know exactly what to do, and how long it will take, to restore when there's corruption or data loss? Managed RDS replication, for example, doesn't help a bit when it happily replicates your latest app version deleting a bunch of data in prod.
Some policies I've personally adopted, having worked with sensitive data at past jobs:
- If the dev team needs to investigate an issue in the prod data, do it on a staging DB instance that is restored from the latest backup. You gain several advantages: confidence that your backups work (otherwise you only have what's called a Schrödinger's Backup in the biz), confidence that you can quickly rebuild the basic server itself (try not to have pets, remember), and an incentive for the dev team to make restores go faster! Simply knowing how long a restore takes already puts you ahead of most teams, unfortunately. (The first sketch after this list shows what such a timed restore drill can look like.)
- Have you considered the data security of your backup artifacts as well? If your data is valuable, consider storing it with something like https://www.tarsnap.com (highly recommended!)
- In the case of a total data loss, is your data retention policy sufficient? If you have some standard setup of 30 days' worth of daily backups, are you sure losing a day's worth of data isn't going to be catastrophic for your business? Personally I deploy a great little tool called Tarsnapper (can you tell I like Tarsnap?) that implements an automatic 1H-1D-30D-360D backup rotation policy for me. This way I have hourly backups for the most valuable last 24 hours, 30 days of daily backups, and monthly backups for a year to easily compare month-to-month data. (The second sketch below illustrates the kind of keep/prune selection such a policy boils down to.)
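To make the first point concrete, here is a minimal sketch of what a timed restore drill could look like, assuming PostgreSQL and a custom-format dump that has already been pulled down (e.g. extracted from the latest Tarsnap archive). The database name and dump path are made up, and connection settings are assumed to come from the usual libpq environment variables (PGHOST, PGUSER, ...):

    # Hypothetical restore drill: rebuild the staging DB from the newest dump and time it.
    import subprocess
    import time

    STAGING_DB = "app_staging"          # hypothetical staging database name
    DUMP_FILE = "/backups/latest.dump"  # hypothetical path to the newest pg_dump -Fc file

    start = time.monotonic()

    # Recreate the staging database from scratch, so the drill proves we can
    # rebuild, not just restore on top of whatever state is already there.
    subprocess.run(["dropdb", "--if-exists", STAGING_DB], check=True)
    subprocess.run(["createdb", STAGING_DB], check=True)

    # Restore the dump; -j runs several restore jobs in parallel.
    subprocess.run(["pg_restore", "-d", STAGING_DB, "-j", "4", DUMP_FILE], check=True)

    minutes = (time.monotonic() - start) / 60
    print(f"Restore into {STAGING_DB} took {minutes:.1f} minutes")

Run something like that on a schedule and "how long does a restore take?" stops being a guess.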
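And on the retention side, this isn't Tarsnapper's actual code, just a rough sketch of the keep/prune selection that an hourly/daily/monthly policy boils down to; anything not in the returned set would then be deleted, which with Tarsnap is a plain tarsnap -d -f <archive>:

    # Illustrative only: pick which archive timestamps to keep under a rough
    # "hourly for a day, daily for a month, monthly for a year" policy.
    import datetime

    def archives_to_keep(timestamps, now=None):
        now = now or datetime.datetime.now()
        keep, daily_seen, monthly_seen = set(), set(), set()
        for ts in sorted(timestamps, reverse=True):  # newest first
            age = now - ts
            if age <= datetime.timedelta(hours=24):
                keep.add(ts)                           # keep every hourly from the last day
            elif age <= datetime.timedelta(days=30) and ts.date() not in daily_seen:
                daily_seen.add(ts.date())              # keep the newest archive of each day
                keep.add(ts)
            elif age <= datetime.timedelta(days=365) and (ts.year, ts.month) not in monthly_seen:
                monthly_seen.add((ts.year, ts.month))  # keep the newest archive of each month
                keep.add(ts)
        return keep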
Shameless plug: If you're looking to draw some AWS diagrams while Gliffy is down, check out https://cloudcraft.co, a free diagram tool I made. Backed up hourly with Tarsnap ;)
I've found Tarsnap to be slow at restoring in the past. My recollection is a few hours for a ~1GB maildir. I was using it for my personal things, but I would (as with anything) test restore times if I were using it for serious stuff.
The amount of de-duplication performed by Tarsnap, and the number of files, which for a maildir I imagine is a lot of tiny files, probably negatively impact it. Dealing with a single DB dump file, performance has been fine so far, at least. I can imagine one could also try to partition the data into multiple independent dumps that can be restored in parallel if speed became a concern (rough sketch below).
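For the partitioning idea, a rough sketch, assuming PostgreSQL and a database that has been dumped per schema or per table group into separate custom-format files; the paths, target database and worker count are all made up, and it only really helps if the dumps cover disjoint tables (cross-dump foreign keys complicate the ordering):

    # Hypothetical parallel restore of several independent dump files.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    DUMPS = ["/backups/users.dump", "/backups/orders.dump", "/backups/events.dump"]
    TARGET_DB = "app_staging"  # hypothetical target database

    def restore(dump_path):
        # Each pg_restore call is an independent process, so they can run side by side.
        subprocess.run(["pg_restore", "-d", TARGET_DB, dump_path], check=True)
        return dump_path

    with ThreadPoolExecutor(max_workers=len(DUMPS)) as pool:
        for done in pool.map(restore, DUMPS):
            print("restored", done)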
I've been a customer of theirs for a long while and will note that their customer service is amazing. They helped implement what I needed and offer support anytime I need it.
Now that S3 has matured and prices have continued to drop, though, I am going to be moving to trim costs. I actually kicked off backups to S3 earlier this month and am backing up to both S3 and rsync.net at the moment, with the plan of ending rsync.net once I've tested restores and made it through a billing cycle at Amazon.
Wow. To stress the amazing level of customer service, someone there just ran across my comment and reached out - noting that my pricing was set for an older structure, updating me to a far more competitive rate and offering a retroactive credit.
While Amazon has offered some great service, it's never been as good as that.
They really do stand by and provide a superior level of support and assistance if you need it on the technical side as well.
They sound well intentioned, but they don't work in some backwards Enterprise companies.
1) That'd be great. Except that management refuses to get into the 21st century; everything is virtualised, but no, you can't get a sandbox: that's a 3-month requisition that needs a business case and approvals all the way up the line, even though we have unlimited licenses for OS and databases.
So no, you can't have sandboxes that work that way.
Also we know servers can't be restored piecemeal like that. Why?
Well I don't know what wonderful world you're living in, though I would like to live there, but our management is 100% focused on REDUCING NUMBERS. What's our server count? 3000? They want that count reduced to 5.
I'm not joking. That's a meeting with senior management and a set KPI.
We actually haven't managed to reduce server count because they also keep authorising so many new servers for "special projects" of their own, but we have consolidated servers that run 30-40-50 different applications now...
Except that patching and rebooting them is a nightmare as you cannot get 30-40-50 product managers to agree on downtime to do so. You can't restore it piecemeal for testing or anything like that either. And... well we know it's not backed up... I mean the databases are but nothing else is (because Infrastructure agrees with us that nothing should be on a database server except the databases)... and that's not my problem...
2) I consider it. And then I consider the fucking joke that is the rest of the business, and the fact that the second we try to introduce some kind of key into the situation, it's going to be lost, and then the data will be lost. Lost inaccessible data is a far more serious violation than insecure accessible data - that's a fact. One will get you fired immediately, the other will be understood.
It doesn't help that companies often don't have easily accessible PKI; not in any way we can automate, use, trust, know, be trained on and rely on in the database space. The way I've seen Enterprise work, it would be put behind a firewall, and you'd be requesting a key with a filled-out paper form and waiting a few weeks of authorisations to get it. Now how the fuck are you going to roll that into your automated backup strategy across a couple hundred servers and rotate keys every few quarters?
3) Hahaha. Okay for mom and pop stores, sure. But you have to realise that Enterprise carves out every fucking piece of the pie for a different person. This team looks after databases. This team looks after applications. This team looks after the underlying infrastructure. This team looks after storage. This team looks after DR. This team looks after LONG TERM BACKUPS.
And then the long term backups team does whatever the fuck they want, has zero accountability, and literally nobody in management cares or wants to touch it because either a contract is in place or "they like that manager" or "that manager is on the same level as me so I can't do anything", and the manager above is their friend who got them in ;-)
And then, sometimes, sometimes, it's not even their fault. They get some order from some miscellaneous manager at the very top to "start keeping every single backup". But they can't because disk space is finite. And suddenly the entire organisation starts being crippled as disks fill and your normal day to day backups start failing, and then your operational systems go offline! But still - you have to keep every single backup - and so they SECRETLY start deleting the older backups because there is literally no choice, you can't have the business running AND keep those old backups, and because it's a secret they can't tell ANYONE and so those backups are GONE.
(And no, we can't circumvent that process and do it ourselves, because we don't have a spare petabyte of storage, and we aren't the storage team, so we can't just buy it or get it allocated, and management would squash that as inefficient duplication of effort if we tried).
Man I'm really ranting tonight. You all have no idea how bad it is.
What we see here is (or should be) Darwinism in action. Companies that become balkanized, politicized and resistant to change are inefficient, prone to catastrophes and easily disrupted by (hi HN!) the startup crowd.
In the past I've had the misfortune to work for some lumbering corporates with all these pathologies and more. You tolerate the perpetual car crash for the money, but however good you are you can't change them & instead run yourself ragged trying to bring order to the chaos. Even if it can be fixed (& often I wonder if organisations can get too big to fix), it's the responsibility of management and way above your pay grade.
If you can diagnose all these problems you're clearly a sound engineer. You can do so much better than losing your hair in some self-destructive megacorp that disempowers you from doing good work. Life is too short and IT staff are in demand: they don't deserve you so get out while you can.
I wonder if that kind of backup retention (or any backups at all) is even legal. Under EU law (and even US law in specific situations), user data must be deleted upon request. Unless your live production systems can go in and delete things from your months-old backups (yikes!), this kind of scheme would seem to be a crime.
Good advice, but things get a lot more complicated with HIPAA-protected data. Alas we can't simply move our prod data to any place less secure than prod.