> Hardware can fail for all kinds of reasons

Complex cloud infra can also fail for all kinds of reasons, and those failures are often harder to troubleshoot than a hardware failure. My experience is that server-grade hardware in a reliable colo with a good uplink is generally an extremely reliable combination.

And my experience is the opposite, on both counts. I guess it's moot because two anecdotes cancel each other out?

Cloud VMs fail when the instance itself doesn't come back online, when an EBS volume fails, or when some AZ-wide or region-wide failure takes out networking or the control plane. It's very rare, but I have seen it happen - twice, across more than a thousand AWS accounts in 10 years. But even when it does happen, you can just spin up a new instance and restore from a snapshot or backup. It's ridiculously easier to recover from than an on-prem hardware failure, and actually reliable, as there's always capacity [I guess barring GPU-heavy instances].
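To make the "spin up a new instance" part concrete, here's a rough boto3 sketch, assuming your data lives in an EBS snapshot and you have a known-good AMI; the IDs, region, and instance type are placeholders, not anything from the thread:

    # Recovery sketch: launch a replacement instance with the data volume
    # restored from an EBS snapshot. All IDs below are illustrative placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxx",        # known-good base AMI (placeholder)
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=[{
            "DeviceName": "/dev/xvdf",
            "Ebs": {"SnapshotId": "snap-xxxxxxxx", "VolumeType": "gp3"},
        }],
    )
    print("replacement instance:", resp["Instances"][0]["InstanceId"])

The point is less the exact calls and more that the whole recovery path is an API you can script and rehearse ahead of time.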

"Server grade hardware in a reliable colo with good uplink" literally failed on my company last week, went hard down, couldn't get it back up. Not only that server but the backup server too. 3 day outage for one of the company's biggest products. But I'm sure you'll claim my real world issue is somehow invalid. If we had just been "more perfect", used "better hardware", "a better colo", or had "better people", nothing bad would have happened.


There is a lot of statistical and empirical data on this topic - MTBF estimates from vendors (typically 100k to 1M+ hours), Backblaze and Google drive-failure data (~1-2% annual failure rate), IEEE studies, and others. With N+1 redundancy (backup servers/RAID + spare drives) and proper design and change-control processes, operational failures should be very rare.
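To put rough numbers on "very rare" (back-of-the-envelope only, assuming independent failures, a 2% annual failure rate per server, and a 3-day window to replace a dead box):

    # Back-of-the-envelope: chance that both a primary and its N+1 backup are
    # down at once. Assumes INDEPENDENT failures, 2% AFR, 3-day replacement time.
    afr = 0.02                 # annual failure rate per server
    repair_days = 3
    p_primary_dies = afr
    p_backup_dies_during_repair = afr * (repair_days / 365)
    p_both_down = p_primary_dies * p_backup_dies_during_repair
    print(f"P(both down in a given year) ~ {p_both_down:.1e}")   # ~3.3e-06

Of course the independence assumption is exactly what breaks in practice - shared power, shared firmware, or the same colo incident can take out primary and backup together, which sounds like what happened to the poster above.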

With cloud, hardware issues are just the start - yes, you MUST "plan for failure", leveraging load balancers, auto scaling, CloudWatch, and dozens of other proprietary dials and knobs. However, you must also consider the control plane, quotas, capacity, IAM, spend, and other non-hardware breaking points.

Your autoscaling isn't working - is the AZ out of capacity, did you hit a quota limit, run out of IPv4s, or was an AMI inadvertently deregistered? Your instance is unable to write to S3 - is the metadata service being flaky (for your IAM role), or was there an IAM role / S3 policy change? Your Lambda function is failing - did it hit a timeout, or exhaust the (512 MB default) temp storage? Need help diagnosing an issue? What's your paid support tier - submit a ticket and we'll get back to you sometime in the next 24 hours.
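FWIW, a lot of that first round of triage can at least be scripted; here's a rough boto3 sketch, where the group name, subnet, and AMI IDs are placeholders:

    # Quick triage for a stuck Auto Scaling group. Names and IDs are placeholders.
    import boto3

    asg = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")

    # 1. Scaling activities usually spell out the failure cause
    #    (insufficient capacity, quota exceeded, missing AMI, ...).
    acts = asg.describe_scaling_activities(AutoScalingGroupName="my-asg", MaxRecords=5)
    for a in acts["Activities"]:
        print(a["StatusCode"], a.get("StatusMessage", ""))

    # 2. Out of IPv4s? Check free addresses in the subnets the group launches into.
    for s in ec2.describe_subnets(SubnetIds=["subnet-xxxxxxxx"])["Subnets"]:
        print(s["SubnetId"], "free IPs:", s["AvailableIpAddressCount"])

    # 3. Was the AMI deregistered? A missing image shows up here as an error
    #    or an empty Images list.
    print(ec2.describe_images(ImageIds=["ami-xxxxxxxx"]))

None of that replaces a support ticket when the control plane itself is the problem, which was sort of the point.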
