I'd be VERY careful about including external dependencies in an HTTP health check that results in a web server being removed from service - it's usually an invitation for cascading failures.
1) If you do have a back-end failure, this setup can cloud the root cause during recovery because your downstream web servers are down as well.
2) Transitory back-end failures can cascade, and your health checks can make this worse as things flap around. You need to be VERY careful with your timers and retry logic, and tune them appropriately. e.g., how does `/status` handle a single TCP reset from the DB? When does it decide to time out if the DB doesn't respond? It's hard to get this right - network errors vs protocol errors vs timeouts are often handled differently at different layers of the controller, and the load balancer has its own retries and timeouts. (Sketched below, after this list.)
3) If this health check is being used for an API, recovery can be more difficult because of thundering herds. e.g., 5/5 API servers go down because of a temporary DB failure. Queues and requests back up. 1/5 API servers is restored to service a bit earlier than the others, and immediately goes down again because it isn't degrading gracefully under heavy load. Wash/rinse/repeat until you figure out how to load shed.
4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken? Ideally, the app will degrade gracefully if certain requests can be served from cache, or don't require the failed backend.
5) The failure case where this type of thing might be useful (network partition affecting DB connectivity from a subset of your webservers) isn't very common in the environments I've been involved with.
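To make point 2 concrete, here's a rough Go sketch (the 500ms budget and the `db` handle are assumptions for illustration) of what a dependency-touching `/status` ends up needing: its own short deadline and a single attempt, independent of the driver's behaviour and of the load balancer's own retries and timeouts.

```go
// Sketch only: a /status handler that pings the DB with a hard deadline.
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// db is assumed to be opened elsewhere (sql.Open plus a driver import);
// it is left nil here because this is only a sketch of the handler.
var db *sql.DB

func statusHandler(w http.ResponseWriter, r *http.Request) {
	// One attempt, bounded well below the LB's own health-check timeout,
	// so a hung DB connection can't hang the check itself.
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()

	if err := db.PingContext(ctx); err != nil {
		// A single TCP reset and a timeout both land here; deciding whether
		// one such error should really pull the node is the hard part.
		http.Error(w, "db unreachable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/status", statusHandler)
	http.ListenAndServe(":8080", nil)
}
```

Even with the deadline in place, you still have to decide whether one reset or one timeout should really pull the node - which is exactly the tuning problem above.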
A far more productive approach if you absolutely need this sort of thing is to implement circuit breakers[1] at the app level, and explicitly make certain tripped breakers a hard failure for the /status endpoint. This has the advantage of not depending solely on synthetic queries, and having very explicit restoration logic.
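A minimal sketch of that breaker-backed `/status`, with a hand-rolled breaker rather than a real library (the thresholds, `queryUser`, and the route names are all made up). The point is that real traffic trips the breaker, and only a tripped breaker fails the health check:

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"sync"
	"time"
)

// breaker is a deliberately tiny circuit breaker: it trips after maxFails
// consecutive failures and stays open for cooldown, which is where the
// explicit restoration logic lives.
type breaker struct {
	mu        sync.Mutex
	fails     int
	maxFails  int
	cooldown  time.Duration
	openUntil time.Time
}

var errOpen = errors.New("circuit open")

func (b *breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return errOpen // fail fast while the breaker is open
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.openUntil = time.Now().Add(b.cooldown) // trip
		}
		return err
	}
	b.fails = 0 // a real success closes the breaker again
	return nil
}

func (b *breaker) Open() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().Before(b.openUntil)
}

var dbBreaker = &breaker{maxFails: 5, cooldown: 30 * time.Second}

// queryUser stands in for a real DB call in the normal request path.
func queryUser(ctx context.Context) error { return nil }

func userHandler(w http.ResponseWriter, r *http.Request) {
	// Real traffic drives the breaker, not synthetic health-check queries.
	if err := dbBreaker.Call(func() error { return queryUser(r.Context()) }); err != nil {
		http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func statusHandler(w http.ResponseWriter, r *http.Request) {
	// Only a tripped breaker is a hard failure for /status.
	if dbBreaker.Open() {
		http.Error(w, "db breaker open", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/user", userHandler)
	http.HandleFunc("/status", statusHandler)
	http.ListenAndServe(":8080", nil)
}
```

This sketch simply lets the cooldown expire and real traffic re-trip the breaker if the DB is still down; a fuller implementation would add an explicit half-open probe, as described in Fowler's write-up [1].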
For load balancer health checks, I prefer ones that hit the controller and expect a 200, and nothing more. (The DB and overall health of the service cluster are monitored through more organic instrumentation.)
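For contrast, the liveness-only flavour is about as small as it gets - in a real app the route would go through the normal controller stack rather than a bare handler, but the idea is the same:

```go
// Sketch of a liveness-only /status: it proves the process is up and the
// HTTP stack can answer, and nothing more.
package main

import "net/http"

func main() {
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // 200, no dependency checks
	})
	http.ListenAndServe(":8080", nil)
}
```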
I was laughing to myself when I read the Stack Overflow post mortem because I have a healthy pile of healthcheck-too-expensive war stories, and their experience resonated.
You make a good point about health checks. It's easy to conflate single-node health with overall service health. The LB check should really be about "can this node serve HTTP traffic", even if the non-health-check requests can only send back 5xx responses because their dependencies are failing.
I've also had the opposite problem, where the health check would answer 200 OK but the real app had corrupted internal state because of a poor design/threading bug. If the health check had been the home page, the node would have been pulled from the LB. While a more in-depth health check would have helped here, I think it's better to alert/monitor/kill any node with a higher-than-normal error rate and leave the health check simple.
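A sketch of that alert/monitor/kill approach (the window and thresholds are made up): middleware counts 5xx responses on the real traffic path, an out-of-band watcher acts on the rate, and `/status` itself stays dumb.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var total, server5xx atomic.Int64

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.code = code
	s.ResponseWriter.WriteHeader(code)
}

// withErrorRate counts requests and 5xx responses on the real traffic path.
func withErrorRate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)
		total.Add(1)
		if rec.code >= 500 {
			server5xx.Add(1)
		}
	})
}

// watchErrorRate is the out-of-band part: alerting (or a supervisor
// restarting the process) acts on the rate, not the LB health check.
func watchErrorRate(window time.Duration, minRequests int64, maxRate float64) {
	for range time.Tick(window) {
		t, e := total.Swap(0), server5xx.Swap(0)
		if t >= minRequests && float64(e)/float64(t) > maxRate {
			log.Printf("error rate %.0f%% over last %s: alert / pull this node",
				100*float64(e)/float64(t), window)
		}
	}
}

func main() {
	go watchErrorRate(30*time.Second, 100, 0.2) // made-up thresholds

	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // the health check itself stays simple
	})
	http.ListenAndServe(":8080", withErrorRate(mux))
}
```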
You make a good point, but I don't see it as black and white.
> 4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken?
Yes, if that's the case you should not respond "unable to serve". However, you should probably fail the setup check and not start serving traffic if you're a newly provisioned node.
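A minimal sketch of that boot-time gate, with a hypothetical `checkDependencies` probe: verify the critical backends once at startup and refuse to serve at all, so a half-broken new node never joins the LB pool instead of flapping on `/status` later.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// checkDependencies is a stand-in for boot-time probes of the DB, redis, etc.
func checkDependencies(ctx context.Context) error {
	// e.g. db.PingContext(ctx), a redis PING, ...; omitted in this sketch.
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := checkDependencies(ctx); err != nil {
		// Newly provisioned node with a broken critical dependency:
		// bail out before the LB ever sees it.
		log.Fatalf("refusing to start: %v", err)
	}

	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // once running, the check stays simple
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```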
But!
I wouldn't call this cascading failure so much as error propagation. If a critical dependency is down, making the node essentially worthless, it should be removed. This can be caused by things like network partitions and whatnot - as you say in 5. You say "it's not likely", which is sadly a common statement, but when it does happen it can be nasty - and the bigger the environment, the more often it happens.
Cascading failure usually (in my experience, anyway) refers to the failover itself causing more outages. In the case of a failing /status I can only see this happening if the architecture of the system is broken anyway - or the load is genuinely too high for the failure that just occurred.
You say: "The DB and overall health of the service cluster are monitored through more organic instrumentation". What is acting on that? What happens when you do get that rare network partition and half of your web nodes cannot reach the database?
[1] http://martinfowler.com/bliki/CircuitBreaker.html