I'd be VERY careful about including external dependencies in an HTTP health check that results in a web server being removed from service - it's usually an invitation for cascading failures.
1) If you do have a back-end failure, this setup can cloud the root cause during recovery because your downstream web servers are down as well.
2) Transitory back-end failures can cascade, and your health checks can make this worse as things flap around. You need to be VERY careful with your timers and retry logic, and tune them appropriately. e.g., how does `/status` handle a single TCP reset from the DB? When does it decide to time out if the DB doesn't respond? It's hard to get this right - network errors vs protocol errors vs timeouts are often handled differently at different layers of the controller, and the load balancer has its own retries and timeouts. (Sketched below, after this list.)
3) If this health check is being used for an API, recovery can be more difficult because of thundering herds. e.g., 5/5 API servers go down because of a temporary DB failure. Queues and requests back up. 1/5 API servers is restored to service a bit earlier than the others, and immediately goes down again because it isn't degrading gracefully under heavy load. Wash/rinse/repeat until you figure out how to load shed.
4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken? Ideally, the app will degrade gracefully if certain requests can be served from cache, or don't require the failed backend.
5) The failure case where this type of thing might be useful (network partition affecting DB connectivity from a subset of your webservers) isn't very common in the environments I've been involved with.
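To make point 2 concrete, here's a rough Go sketch (the 500ms budget and the `db` handle are assumptions for illustration) of what a dependency-touching `/status` ends up needing: its own short deadline and a single attempt, independent of the driver's behaviour and of the load balancer's own retries and timeouts.

```go
// Sketch only: a /status handler that pings the DB with a hard deadline.
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// db is assumed to be opened elsewhere (sql.Open plus a driver import);
// it is left nil here because this is only a sketch of the handler.
var db *sql.DB

func statusHandler(w http.ResponseWriter, r *http.Request) {
	// One attempt, bounded well below the LB's own health-check timeout,
	// so a hung DB connection can't hang the check itself.
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()

	if err := db.PingContext(ctx); err != nil {
		// A single TCP reset and a timeout both land here; deciding whether
		// one such error should really pull the node is the hard part.
		http.Error(w, "db unreachable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/status", statusHandler)
	http.ListenAndServe(":8080", nil)
}
```

Even with the deadline in place, you still have to decide whether one reset or one timeout should really pull the node - which is exactly the tuning problem above.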
A far more productive approach if you absolutely need this sort of thing is to implement circuit breakers[1] at the app level, and explicitly make certain tripped breakers a hard failure for the /status endpoint. This has the advantage of not depending solely on synthetic queries, and having very explicit restoration logic.
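A minimal sketch of that breaker-backed `/status`, with a hand-rolled breaker rather than a real library (the thresholds, `queryUser`, and the route names are all made up). The point is that real traffic trips the breaker, and only a tripped breaker fails the health check:

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"sync"
	"time"
)

// breaker is a deliberately tiny circuit breaker: it trips after maxFails
// consecutive failures and stays open for cooldown, which is where the
// explicit restoration logic lives.
type breaker struct {
	mu        sync.Mutex
	fails     int
	maxFails  int
	cooldown  time.Duration
	openUntil time.Time
}

var errOpen = errors.New("circuit open")

func (b *breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return errOpen // fail fast while the breaker is open
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.openUntil = time.Now().Add(b.cooldown) // trip
		}
		return err
	}
	b.fails = 0 // a real success closes the breaker again
	return nil
}

func (b *breaker) Open() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().Before(b.openUntil)
}

var dbBreaker = &breaker{maxFails: 5, cooldown: 30 * time.Second}

// queryUser stands in for a real DB call in the normal request path.
func queryUser(ctx context.Context) error { return nil }

func userHandler(w http.ResponseWriter, r *http.Request) {
	// Real traffic drives the breaker, not synthetic health-check queries.
	if err := dbBreaker.Call(func() error { return queryUser(r.Context()) }); err != nil {
		http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func statusHandler(w http.ResponseWriter, r *http.Request) {
	// Only a tripped breaker is a hard failure for /status.
	if dbBreaker.Open() {
		http.Error(w, "db breaker open", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/user", userHandler)
	http.HandleFunc("/status", statusHandler)
	http.ListenAndServe(":8080", nil)
}
```

This sketch simply lets the cooldown expire and real traffic re-trip the breaker if the DB is still down; a fuller implementation would add an explicit half-open probe, as described in Fowler's write-up [1].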
For load balancer health checks, I prefer ones that hit the controller and expect a 200, and nothing more. (The DB and overall health of the service cluster are monitored through more organic instrumentation.)
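For contrast, the liveness-only flavour is about as small as it gets - in a real app the route would go through the normal controller stack rather than a bare handler, but the idea is the same:

```go
// Sketch of a liveness-only /status: it proves the process is up and the
// HTTP stack can answer, and nothing more.
package main

import "net/http"

func main() {
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // 200, no dependency checks
	})
	http.ListenAndServe(":8080", nil)
}
```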
I was laughing to myself when I read the Stack Overflow post mortem because I have a healthy pile of healthcheck-too-expensive war stories, and their experience resonated.
You make a good point about health checks. It's easy to conflate single-node health with overall service health. The LB check should really be about "can this node serve HTTP traffic", even if the non-health-check requests can only send back 5xx responses because their dependencies are failing.
I've also had the opposite problem, where the health check would answer 200 OK but the real app had corrupted internal state because of a poor design/threading bug. If the health check had been the home page, the node would have been pulled from the LB. While a more in-depth health check would have helped here, I think it's better to alert/monitor/kill any node with a higher-than-normal error rate and leave the health check simple.
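A sketch of that alert/monitor/kill approach (the window and thresholds are made up): middleware counts 5xx responses on the real traffic path, an out-of-band watcher acts on the rate, and `/status` itself stays dumb.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var total, server5xx atomic.Int64

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.code = code
	s.ResponseWriter.WriteHeader(code)
}

// withErrorRate counts requests and 5xx responses on the real traffic path.
func withErrorRate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)
		total.Add(1)
		if rec.code >= 500 {
			server5xx.Add(1)
		}
	})
}

// watchErrorRate is the out-of-band part: alerting (or a supervisor
// restarting the process) acts on the rate, not the LB health check.
func watchErrorRate(window time.Duration, minRequests int64, maxRate float64) {
	for range time.Tick(window) {
		t, e := total.Swap(0), server5xx.Swap(0)
		if t >= minRequests && float64(e)/float64(t) > maxRate {
			log.Printf("error rate %.0f%% over last %s: alert / pull this node",
				100*float64(e)/float64(t), window)
		}
	}
}

func main() {
	go watchErrorRate(30*time.Second, 100, 0.2) // made-up thresholds

	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // the health check itself stays simple
	})
	http.ListenAndServe(":8080", withErrorRate(mux))
}
```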
You make a good point, but I don't see it as black and white.
> 4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken?
Yes, if that's the case you should not respond "unable to serve". However, you should probably fail the setup check and not start serving traffic if you're a newly provisioned node.
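A minimal sketch of that boot-time gate, with a hypothetical `checkDependencies` probe: verify the critical backends once at startup and refuse to serve at all, so a half-broken new node never joins the LB pool instead of flapping on `/status` later.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// checkDependencies is a stand-in for boot-time probes of the DB, redis, etc.
func checkDependencies(ctx context.Context) error {
	// e.g. db.PingContext(ctx), a redis PING, ...; omitted in this sketch.
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := checkDependencies(ctx); err != nil {
		// Newly provisioned node with a broken critical dependency:
		// bail out before the LB ever sees it.
		log.Fatalf("refusing to start: %v", err)
	}

	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // once running, the check stays simple
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```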
But!
I wouldn't call this cascading failure so much as error propagation. If a critical dependency is down, making the node essentially worthless, it should be removed. This can be caused by things like network partitions and whatnot - as you say in 5. You say "it's not likely", which is sadly a common statement, but when it does happen it can be nasty - and the bigger the environment, the more often it happens.
Cascading failure usually (in my experience, anyway) refers to the failover itself causing more outages. In the case of a failing /status I can only see this happening if the architecture of the system is broken anyway - or the load is genuinely too high for the failure that just occurred.
You say: "The DB and overall health of the service cluster are monitored through more organic instrumentation". What is acting on that? What happens when you do get that rare network partition and half of your web nodes cannot reach the database?
[1] http://martinfowler.com/bliki/CircuitBreaker.html