
"the entire site became unavailable since the load balancer took the servers out of rotation." I don't care about the regexp, this is bad SRE, you can't just take servers out of rotation without some compensation action.

Never mind that it looks like all web servers were taken out of rotation; even one server going down can cause a cascading effect (more traffic gets directed to the healthy ones, which then die in turn, in a traffic-driven failure). One compensating action, for example after n servers have gone down (besides bringing up m other servers), is to put at least some servers into a more basic mode (read only/static, some features disabled). Not guaranteed, but that could have prevented this and other types of downtime.
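A minimal sketch of that compensating action, assuming a hypothetical health-check hook; `pool`, `results`, `set_mode`, and `DEGRADE_THRESHOLD` are all invented names, not anything from the article:

    # Sketch of a compensating health-check hook (all names hypothetical):
    # once too many servers have failed their checks, degrade the survivors
    # to a basic mode instead of letting the pool drain to zero.

    DEGRADE_THRESHOLD = 0.5  # degrade once half the pool is unhealthy

    def on_health_check_results(pool, results):
        healthy = [s for s, ok in results.items() if ok]
        unhealthy = [s for s, ok in results.items() if not ok]

        if len(unhealthy) >= DEGRADE_THRESHOLD * len(pool):
            for server in healthy:
                # Read only/static, expensive features disabled.
                server.set_mode("read_only")
            # This is also the place to start bringing up m spare servers.

        for server in unhealthy:
            pool.remove(server)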




I have the same concern - including with our own system. As far as I know, all LBs are designed with this faulty logic. That's the case with our in-house F5 BigIPs, and I believe it's also true of AWS's ELBs.

Sure, it makes sense to cut malfunctioning servers out of the production pool when they're discovered. But once you've dumped every production server, you've created the worst possible situation: you're absolutely guaranteed not to be able to serve any traffic. At the point you get to zero functioning servers, a better response would be to restore all servers to the pool and just hope. That would be bad, but less so.

What we've done on our internal systems is to set things up so that when the last server is dropped, it fails over to a passive pool that is really just the normal production pool with no health checks. But I'm not aware of any LB that supports this directly; we had to glue together some ugly configuration to get it to work.
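The fail-open behavior described above can be sketched as backend selection logic; a minimal sketch with invented names (`pick_backend`, `passing_checks`), not the actual F5 or ELB configuration:

    import random

    # Sketch of fail-open backend selection (all names hypothetical):
    # prefer servers that currently pass health checks, but if the checker
    # has removed every server, fall back to the full unchecked pool (the
    # "passive pool" above) rather than serving nothing at all.

    def pick_backend(all_servers, passing_checks):
        candidates = [s for s in all_servers if s in passing_checks]
        if not candidates:
            # Zero healthy servers: failing closed guarantees a total
            # outage, so restore the whole pool and hope.
            candidates = list(all_servers)
        return random.choice(candidates)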


I was thinking the same thing while reading that part, and I also find it very scary. Why would you even _let_ the load balancer take _all_ of your servers out of rotation?



