You know how to set up a rock-solid remote hands console to all your servers, I take it? Dial-up modem to a serial console server, serial cables to all the servers (or IPMI on a segregated network with management ports). Then you deal with varying hardware implementations and OSes, and with setting all of that up in every rack in every colo.
Compare that to AWS, where there are 6 different kinds of remote hands, that work on all hardware and OSes, with no need for expertise, no time taken. No planning, no purchases, no shipment time, no waiting for remote hands to set it up, no diagnosing failures, etc, etc, etc...
That's just one thing. There's a thousand more things, just for a plain old VM. And the cloud provides way more than VMs.
The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this), and you need hot backups/spares, because otherwise you'll only find out your spares don't work when you actually need them. Getting new gear in can take weeks (it "shouldn't" take that long, but there are little things like pandemics and global shortages of chips and disks that you can't predict). Power and cooling can go out. There are so many things that can (and eventually will) go wrong.
Why expose your business to that much risk, and have to build that much expertise? To save a few bucks on a server?
It's really not like that at all. If it were, I expect FastMail would have noticed sometime in its 25 years of growth. Much of what you're describing assumes a poorly run company that isn't able to make good choices -- if you have such a mix of odd hardware or OSes, then that's a pretty bad sign.
Prioritise simplicity.
For remote hands, 2 kinds is sufficient: IP KVM, and an actual person walking over to your machine. Can't say I've had an AWS person talk to me on a cell phone whilst standing at my server to help me sort out an issue.
It's actually really fun, and saving 90% of what can be your largest cost can be a fundamental driver of startup success. You can undercut the competition on price and offer stuff that's just not available otherwise.
Every time this conversation has come up online over the last few decades, there are always a few people who parrot the claim that it's all too hard. I can't imagine these comments come from people who have actually gone and done it.
> Every time this conversation has come up online over the last few decades, there are always a few people who parrot the claim that it's all too hard. I can't imagine these comments come from people who have actually gone and done it.
My experience is that people either have done it under a set of non-ideal constraints (leading them to do it badly), or they're post-rationalising the fact that they just don't want to.
Complex cloud infra can also fail for all kinds of reasons, and they are often harder to troubleshoot than a hardware failure. My experience with server grade hardware in a reliable colo with a good uplink is it's generally an extremely reliable combination.
And my experience is the opposite, on both counts. I guess it's moot because two anecdotes cancel each other out?
Cloud VMs fail when the instance itself doesn't come back online, when an EBS volume fails, or when some other AZ-wide or region-wide failure affects networking or the control plane. It's very rare, but I have seen it happen - twice, across more than a thousand AWS accounts in 10 years. But even when it does happen, you can just spin up a new instance, restoring from a snapshot or backup. It's ridiculously easier to recover from than dealing with an on-prem hardware failure, and it's actually reliable, as there's always capacity [I guess barring GPU-heavy instances].
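To make "just spin up a new instance" concrete, here's a minimal sketch using boto3; the AMI and subnet IDs are placeholders, and a real DR runbook would also reattach volumes, IPs, and DNS:

```python
# Minimal sketch: replace a failed VM by launching a fresh instance from a
# known-good backup image. All IDs are placeholders, not real resources.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder: latest backup AMI
    InstanceType="c6g.8xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # placeholder
)
print("replacement instance:", resp["Instances"][0]["InstanceId"])
```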
"Server grade hardware in a reliable colo with good uplink" literally failed on my company last week, went hard down, couldn't get it back up. Not only that server but the backup server too. 3 day outage for one of the company's biggest products. But I'm sure you'll claim my real world issue is somehow invalid. If we had just been "more perfect", used "better hardware", "a better colo", or had "better people", nothing bad would have happened.
There is a lot of statistical and empirical data on this topic - MTBF estimates from vendors (typically 100k to 1M+ hours), drive-failure data from Backblaze and Google (~1-2% annual failure rate), IEEE studies, and others. With N+1 redundancy (backup servers/RAID plus spare drives) and proper design and change-control processes, operational failures should be very rare.
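As a rough sanity check on why N+1 holds up (figures below are illustrative assumptions, not vendor data):

```python
# Back-of-the-envelope check on N+1: probability that a drive fails AND its
# spare/mirror partner also fails before the rebuild finishes.
# All figures are illustrative assumptions.
afr = 0.02                         # ~2% annual failure rate (pessimistic end)
rebuild_days = 3                   # assumed time to swap in and resilver a spare
p_fail_during_rebuild = afr * rebuild_days / 365

p_double_failure = afr * p_fail_during_rebuild
print(f"~{p_double_failure:.6%} per pair per year")   # ≈ 0.0003%
```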
With cloud, hardware issues are just the start - yes, you MUST "plan for failure", leveraging load balancers, auto scaling, CloudWatch, and dozens of other proprietary dials and knobs. However, you must also consider control plane, quotas, capacity, IAM, spend, and other non-hardware breaking points.
Your autoscaling isn't working - is the AZ out of capacity, did you hit a quota limit, run out of IPv4s, or was an AMI inadvertently removed? Your instance is unable to write to S3 - is the metadata service being flaky (for your IAM role), or is it due to an IAM role / S3 policy change? Your Lambda function is failing - did it hit a timeout, or exhaust the (512MB) temp storage? Need help diagnosing an issue - what's your paid support tier? Submit a ticket and we'll get back to you sometime in the next 24 hours.
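None of this is hypothetical; when an autoscaling group stalls, the failure reason usually surfaces first in the scaling activity history. A minimal sketch, assuming boto3 and a placeholder group name:

```python
# First diagnostic step when an Auto Scaling group stalls: read the recent
# scaling activities; capacity, quota, and missing-AMI errors surface here.
# The group name is a placeholder.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-2")
activities = asg.describe_scaling_activities(
    AutoScalingGroupName="my-app-asg",    # placeholder
    MaxRecords=20,
)["Activities"]

for activity in activities:
    if activity["StatusCode"] != "Successful":
        print(activity["StartTime"], activity["StatusCode"],
              activity.get("StatusMessage", ""))
```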
> The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this)
Cloud vendors are not immune from hardware failure. What do you think their underlying infrastructure runs on, some magical contraption made from Lego bricks, Swiss chocolate, and positive vibes?
It's the same hardware, prone to the same failures. You've just outsourced worrying about it.
The hardware is prone to the same failures, but the customers rarely experience them, because they handle it for you. EBS means never worrying about disks. S3 means never worrying about losing objects. An EC2 ASG means never worrying about failed machines/VMs. Multi-AZ means never worrying about an entire datacenter going down.
Yes, you pay someone else to worry about it. That's kinda the whole idea.
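As a concrete example of what you're paying them to worry about, here's a minimal sketch (boto3, placeholder IDs) of an Auto Scaling group spread across two AZs that replaces failed instances on its own; a real setup would add health checks behind a load balancer:

```python
# Sketch of "paying someone else to worry about it": an Auto Scaling group
# spread across two AZs that replaces unhealthy instances automatically.
# Launch template and subnet IDs are placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-2")
asg.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",                               # placeholder
    LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0"},  # placeholder
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # two AZs, placeholders
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
)
```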
But it comes at a cost. And that cost is significant. Like orders-of-magnitude significant.
At what point does it become cheaper to hire an infra engineer? Let's see.
In the US a good infra engineer might cost you $150K/yr all in. That's not taking into account freelancers/contractors who can do it for less.
That's ~$12.5K/mo.
That's a lot of compute on AWS...but that's not the end of the story. Ever try getting data OUT of AWS? Yeah, those egress costs are not chump change. But that's not even the end of it.
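Rough numbers, using the salary above and the 3-year reserved-instance price quoted later in this thread (all figures are assumptions for illustration):

```python
# Breakeven sketch: how much AWS compute does one engineer's salary cover?
# Figures are assumptions taken from this thread.
engineer_per_month = 150_000 / 12                 # $12,500

aws_server_3yr = 11_437                           # 3-yr reserved c6g.8xlarge
aws_server_per_month = aws_server_3yr / 36        # ≈ $318

print(f"~{engineer_per_month / aws_server_per_month:.0f} such servers/month")  # ≈ 39
```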
The more important question is: what's the ratio of hosting/cloud costs to overall revenue? If a colo or owned DC will yield better financials within a few quarters, you'd be bananas as a CTO to recommend the cloud.
The bigger cost is what will happen to your business when you're hard-down for a week because all your SQL servers are down, and you don't have spares, and it will take a week to ship new servers and get them racked. Even if you think you could do that very fast, there is no guarantee. I've seen Murphy's Law laugh in the face of assumptions and expectations too many times.
But let's not just make vague claims. Everybody keeps saying AWS is more expensive, right? So let's look at one random example: the cost of a server in AWS vs buying your own server in a colo.
AWS:
1x c6g.8xlarge (32-vCPU, 64GB RAM, us-east-2, Reserved Instance plan @ 3yrs)
Cost up front: $5,719
Cost over 3 years: $11,437 ($158.85/month + $5,719 upfront)
On-prem:
1x Supermicro 1U WIO A+ Server (AS-1115SV-WTNRT), 1x AMD EPYC™ 8324P 32-core 2.65GHz processor (128MB cache, 180W), 2x 32GB DDR5 5600MHz ECC RDIMM, 2x 240GB 2.5" PM893 SATA 6Gb/s SSD (1 DWPD), 2x 10GbE RJ45 ports, Supermicro 1U rail kit (MCP-290-00063-0N, included), 3 years parts and labor + 2 years of cross-shipment: $4,953.40
1x Colo shared rack 1U 2-PS @ 120VAC: $120/month (100Mbps only)
Cost up front: $4,953.40 (before shipping & tax)
Cost over 3 years: $9,273 (minimum)
So, yes, the AWS server is double the cost (not an order of magnitude) of the Supermicro (and this varies by configuration). But with colocation fees, remote hands fees, faster internet speeds, taxes, shipping, and all the rest of the nickel-and-diming, the cost of a single server in a colo is almost the same as AWS. Switch to a full rack, buy the networking gear, remote hands gear, APCs, etc. that you'll probably want, and it's way, way more expensive to colo. In this one example.
Obviously, it all depends on a huge number of factors. Which is why it's better not to just take the copious number of "we do on-prem and everything is easy and cheap" stories at face value. Instead one should do a TCO analysis based on business risk, computing requirements, and the non-monetary costs of running your own micro-datacenter.
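As a starting point, even a toy TCO model makes the hidden line items visible; the extras below are assumptions I've labelled as such, not quotes:

```python
# Toy 3-year TCO for the single-server comparison above, with some
# loudly-labelled assumptions for the extras that usually get left out.
YEARS = 3

# AWS: 3-year reserved c6g.8xlarge, us-east-2 (figures from above)
aws_total = 5_719 + 158.85 * 12 * YEARS                 # ≈ $11,437

# Colo: Supermicro quote + shared-rack fee from above, plus assumed extras
server        = 4_953.40
colo_fee      = 120 * 12 * YEARS                        # $4,320
shipping_tax  = 600                                     # assumption
remote_hands  = 150 * 2 * YEARS                         # assumption: 2 tickets/yr
faster_uplink = 50 * 12 * YEARS                         # assumption: >100Mbps
colo_total = server + colo_fee + shipping_tax + remote_hands + faster_uplink

print(f"AWS:  ${aws_total:,.0f}")    # ≈ $11,438
print(f"Colo: ${colo_total:,.0f}")   # ≈ $12,573
```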
> The bigger cost is what will happen to your business when you're hard-down for a week because all your SQL servers are down, and you don't have spares, and it will take a week to ship new servers and get them racked. Even if you think you could do that very fast, there is no guarantee. I've seen Murphy's Law laugh in the face of assumptions and expectations too many times.
Let's ignore the loaded, cherry-picked situation of no redundancy, no spares, and no warranty service. Apparently this all became magically hard once cloud providers appeared, even though many of us did it, and have kept doing it, for years...
There is nothing stopping an on-prem user from renting a replacement from a cloud provider while waiting for hardware to show up. That's a good logical use case for the cloud we can all agree upon.
Next, your cost comparison isn't very accurate. One is isolated dedicated hardware, the other is shared. Junk fees such as egress, IPs, the premium for bare-metal instances, IOPS provisioning for a database, etc. will infest the AWS side. And the performance of SAN vs local SSD is night and day for a database.
Finally, I could acquire hardware at that performance level much more cheaply if I wanted to; an order-of-magnitude difference is plausible, and it depends more on where it's located, colo costs, etc.
These servers are kinda tiny, and the comparison ignores the cost of storage. From the article, $252,000/yr for 1 PB is crazy, and that's just storing it. There's also the CapEx vs OpEx aspect.
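For what it's worth, the arithmetic on that figure checks out against list pricing (assuming the article means object storage at that scale):

```python
# Sanity check on the article's storage number: $252,000/yr for 1 PB works
# out to about $0.021 per GB-month, roughly S3 Standard's large-volume rate.
annual_cost = 252_000
petabyte_in_gb = 1_000_000
print(f"${annual_cost / 12 / petabyte_in_gb:.4f}/GB-month")   # $0.0210
```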
Yeah, if you don't have layers of redundancy, then you're pretty screwed. We could theoretically lose 2/3 of our systems and still have sufficient capacity, because our metric is 2N primary plus N secondary: we can run with half the racks in the primary switched off, or with the secondary entirely switched off, or (in theory - there are still some kinks with failover) on just the secondary.
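A toy version of that capacity math, with made-up rack counts just to show the 2N+N logic:

```python
# Toy version of the 2N primary + N secondary capacity math, with made-up
# rack counts: losing 2/3 of all racks still leaves enough for peak load.
primary_racks   = 12        # 2N: twice what peak load needs
secondary_racks = 6         # N: exactly what peak load needs
needed_for_peak = 6         # N

surviving = (primary_racks + secondary_racks) // 3    # worst case: lose 2/3
print(surviving >= needed_for_peak)                   # True
```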