
"the entire site became unavailable since the load balancer took the servers out of rotation." I don't care about the regexp, this is bad SRE, you can't just take servers out of rotation without some compensation action.

Never mind that it looks like all web servers were taken out of rotation; even one server going down can cause a cascading effect (more traffic gets directed to the healthy ones, which then die in turn, in a traffic-driven failure). One compensating action, for example after n servers have gone down (besides bringing up m other servers), is to put at least some servers into a more basic mode (read only/static, some features disabled). Not guaranteed, but that could have prevented this and other types of downtime.
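A minimal sketch of that compensating action, assuming a hypothetical health-check hook; `pool`, `results`, `set_mode`, and `DEGRADE_THRESHOLD` are all invented names, not anything from the article:

    # Sketch of a compensating health-check hook (all names hypothetical):
    # once too many servers have failed their checks, degrade the survivors
    # to a basic mode instead of letting the pool drain to zero.

    DEGRADE_THRESHOLD = 0.5  # degrade once half the pool is unhealthy

    def on_health_check_results(pool, results):
        healthy = [s for s, ok in results.items() if ok]
        unhealthy = [s for s, ok in results.items() if not ok]

        if len(unhealthy) >= DEGRADE_THRESHOLD * len(pool):
            for server in healthy:
                # Read only/static, expensive features disabled.
                server.set_mode("read_only")
            # This is also the place to start bringing up m spare servers.

        for server in unhealthy:
            pool.remove(server)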




I have the same concern - including with our own system. As far as I know, all LBs are designed with this faulty logic. That's the case with our in-house F5 BigIPs, and I believe it's also true of AWS's ELBs.

Sure, it makes sense to cut malfunctioning servers out of the production pool when they're discovered. But once you've dumped every production server, you've created the worst possible situation: you're absolutely guaranteed not to be able to serve any traffic. At the point you get to zero functioning servers, a better response would be to restore all servers to the pool and just hope. That would be bad, but less so.

What we've done on our internal systems is to set things up so that when the last server is dropped, it fails over to a passive pool that is really just the normal production pool with no health checks. But I'm not aware of any LB that supports this directly; we had to glue together some ugly configuration to get it to work.
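The fail-open behavior described above can be sketched as backend selection logic; a minimal sketch with invented names (`pick_backend`, `passing_checks`), not the actual F5 or ELB configuration:

    import random

    # Sketch of fail-open backend selection (all names hypothetical):
    # prefer servers that currently pass health checks, but if the checker
    # has removed every server, fall back to the full unchecked pool (the
    # "passive pool" above) rather than serving nothing at all.

    def pick_backend(all_servers, passing_checks):
        candidates = [s for s in all_servers if s in passing_checks]
        if not candidates:
            # Zero healthy servers: failing closed guarantees a total
            # outage, so restore the whole pool and hope.
            candidates = list(all_servers)
        return random.choice(candidates)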


I was thinking the same thing while reading that part, and I also find it very scary. Why would you even _let_ the load balancer take _all_ of your servers out of rotation?



