Short answer: promising five nines of uptime is not a thing for startups. Downtime is going to happen, and you are going to be asleep, drunk, or otherwise not fit to do any emergency ops when it does. It's not the end of the world. Happens to the best of us.
So given that, just do the right things to prevent outages and get to a reasonable level of comfort.
I recently shut down the infrastructure for my (failed) startup. Some parts of it had been up and running for close to four years. We had some incidents over the years, of course, but nothing that impacted our business.
Simple things you can do:
- CI & CD + deployment automation. This is an investment, but a reliable CI & CD pipeline means your deployments are automated and predictable. It's much easier if you do it from day 1 (there's a minimal deploy-script sketch after this list).
- Have good tests. Sounds obvious, but you can't do CD without good tests, and writing them is a skill worth building. Many startups just wing it here, and if you don't later get the funding to rewrite your software, that can kill the startup.
- Have redundancy. E.g. two app servers instead of one, spread across availability zones, and a sane database setup that can survive the primary going down (see the health-check sketch after this list).
- Have backups (verified ones) and a well-tested procedure and plan for restoring them (see the restore-check sketch after this list).
- Pick your favorite cloud provider and go for hosted solutions for the infrastructure you need rather than saving a few pennies hosting shit yourself on some cheap rack server. E.g. use Amazon RDS or equivalent and don't reinvent the wheel of configuring, deploying, monitoring, operating, and backing up a database. Your time (even if you had spare time, which you don't) is worth more than several years of the hosted service's cost, even if you'd only spend a few days on the DIY version. There's more to this stuff than apt-get install whatever and walking away.
- Make conservative/boring choices for infrastructure. E.g. use PostgreSQL instead of some relatively obscure NoSQL thingy. They both might work, but PostgreSQL is a lot less likely to break, and when it does it's probably because of something you did. If you take risks with some parts of the stack, make a point of not taking risks with others. In other words, balance the risks.
- When stuff goes wrong, learn from it and don't let it happen again.
- Manage expectations for your users and customers. Don't promise them anything you can't deliver, like five nines. When shit goes wrong, be honest and open about it.
- Have a battle plan for when the worst happens. What do you do if some hacker gets into your system, or your data center gets taken out by a comet or some other freak accident? Who do you call? What do you do? How would you even find out? Hope for the best but definitely plan for the worst. When your servers are down, improvising is likely to cause more problems.
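To make the deployment automation point concrete, here's a minimal sketch of the kind of deploy script a reliable pipeline ends up wrapping: run the tests, build and push an image, then roll it out one host at a time, stopping at the first failure. The host names, image tag, and use of docker compose are all placeholders, not a prescription.

```python
#!/usr/bin/env python3
"""Minimal deploy sketch: test, build, push, then a rolling deploy.
Hosts, image tag, and compose setup are hypothetical placeholders."""
import subprocess
import sys

HOSTS = ["app-1.internal", "app-2.internal"]   # hypothetical app servers
IMAGE = "registry.example.com/myapp:latest"    # hypothetical image tag


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the whole deploy on the first failure


def main():
    run(["pytest", "-q"])                          # never deploy on red tests
    run(["docker", "build", "-t", IMAGE, "."])
    run(["docker", "push", IMAGE])
    for host in HOSTS:                             # rolling deploy, one host at a time
        run(["ssh", host, f"docker pull {IMAGE} && docker compose up -d"])
    print("deploy finished")


if __name__ == "__main__":
    try:
        main()
    except subprocess.CalledProcessError as e:
        sys.exit(f"deploy aborted: {e}")
```

The exact tooling matters less than the shape: one command, fully scripted, that either finishes or stops loudly partway through.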
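On the redundancy point: two app servers only help if the load balancer can tell which one is healthy. A rough sketch of a health-check endpoint, using only the Python standard library, is below; the database host/port are made-up, and a real check would also verify the application itself, not just TCP reachability.

```python
#!/usr/bin/env python3
"""Tiny /health endpoint sketch for a load balancer to probe.
DB address is a placeholder; checks only basic reachability."""
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_HOST, DB_PORT = "db.internal", 5432  # hypothetical database address


def db_reachable(timeout=1.0):
    try:
        with socket.create_connection((DB_HOST, DB_PORT), timeout=timeout):
            return True
    except OSError:
        return False


class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        ok = db_reachable()
        self.send_response(200 if ok else 503)  # the LB drops the node on 503
        self.end_headers()
        self.wfile.write(b"ok" if ok else b"db unreachable")


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8081), Health).serve_forever()
```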
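And for "verified backups": a backup you've never restored is a hope, not a backup. A sketch of a nightly restore check follows, assuming pg_dump custom-format dumps in a backup directory; the paths, database names, and the sanity query are placeholders for whatever actually matters in your schema.

```python
#!/usr/bin/env python3
"""Backup verification sketch: restore the newest dump into a scratch
database and run a sanity query. Paths and names are placeholders."""
import glob
import subprocess
import sys

SCRATCH_DB = "restore_test"                    # throwaway DB, recreated each run
DUMPS = sorted(glob.glob("/backups/*.dump"))   # hypothetical backup directory


def sh(cmd):
    subprocess.run(cmd, check=True)


def main():
    if not DUMPS:
        sys.exit("no backups found -- that is itself an alarm")
    latest = DUMPS[-1]
    sh(["dropdb", "--if-exists", SCRATCH_DB])
    sh(["createdb", SCRATCH_DB])
    sh(["pg_restore", "--no-owner", "-d", SCRATCH_DB, latest])  # assumes pg_dump -Fc dumps
    out = subprocess.run(
        ["psql", "-tA", SCRATCH_DB, "-c", "SELECT count(*) FROM users"],  # 'users' is a placeholder table
        check=True, capture_output=True, text=True,
    )
    count = int(out.stdout.strip())
    if count < 1:
        sys.exit(f"restore of {latest} looks empty ({count} rows)")
    print(f"restore of {latest} OK, users={count}")


if __name__ == "__main__":
    main()
```

Wire something like this into a scheduled job and page yourself when it fails; that's what turns "we have backups" into "we can restore".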