You make a good point, but I don't see it as black and white.
> 4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken?
Yes, if that's the case you shouldn't respond "unable to serve". However, a newly provisioned node should probably fail its setup check and not start serving traffic.
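To make the setup-check idea concrete, here's a minimal sketch (Go is just my pick, and the dependency address, port and endpoint name are made up, not anything from your setup): a freshly provisioned node refuses to come up at all if it can't reach its critical dependency, so it never registers behind the load balancer, while the /status it serves afterwards stays shallow.

```go
// Minimal sketch of "fail the setup check on a newly provisioned node".
// db.internal:5432, :8080 and /status are placeholder names.
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

const dbAddr = "db.internal:5432" // hypothetical critical dependency

func main() {
	// Setup check: if the critical dependency is unreachable, exit non-zero.
	// Provisioning sees the failure and never puts this node into rotation.
	conn, err := net.DialTimeout("tcp", dbAddr, 2*time.Second)
	if err != nil {
		log.Fatalf("setup check failed, refusing to serve: %v", err)
	}
	conn.Close()

	// Once in rotation, /status only says "the process is up"; whether it
	// should also probe dependencies is exactly the point under debate.
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```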
But!
I wouldn't call this a cascading failure so much as error propagation. If a critical dependency is down, making the node essentially worthless, it should be removed. This can be caused by things like network partitions and whatnot - as you say in point 5. You say "it's not likely", which is sadly a common statement, but when it does happen it can be nasty - and the bigger the environment, the more often it happens.
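For the "remove the worthless node" case, the mechanism I have in mind is just a /status that probes the one dependency the node genuinely cannot work without and returns 503 so the balancer pulls it. A sketch, again with made-up names and a plain TCP dial standing in for a real dependency check:

```go
// Sketch: /status propagates "my critical dependency is gone" to the load
// balancer as a 503. Address and timeout are placeholders.
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

const dbAddr = "db.internal:5432" // hypothetical critical dependency

func statusHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := net.DialTimeout("tcp", dbAddr, 500*time.Millisecond)
	if err != nil {
		// The node can't do any useful work: propagate the error to the LB.
		http.Error(w, "unable to serve: database unreachable", http.StatusServiceUnavailable)
		return
	}
	conn.Close()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/status", statusHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```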
Cascading failure usually (in my experience anyway) refers to the failover itself causing further outages. In the case of failing /status I can only see that happening if the architecture of the system is already broken - or the load is genuinely too high to absorb the failure that just occurred.
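And if the worry is that pulling nodes during a correlated failure (everyone lost the DB at once) funnels all traffic onto a few survivors, the usual guard lives on the balancer side: ignore health results once too large a fraction of the fleet is failing, similar in spirit to Envoy's healthy panic threshold. Rough sketch, thresholds invented:

```go
// Sketch of a "fail open" guard on the balancer side: if too few nodes look
// healthy, the health data is probably reflecting a correlated failure
// (DB outage, partition), so route to everything rather than overload the
// survivors. The 50% threshold is an assumption for illustration.
package main

import "fmt"

func nodesToRoute(all, healthy []string) []string {
	const panicThreshold = 0.5
	if len(all) == 0 {
		return nil
	}
	if float64(len(healthy))/float64(len(all)) < panicThreshold {
		return all // fail open: health checks are no longer informative
	}
	return healthy
}

func main() {
	all := []string{"web1", "web2", "web3", "web4"}
	fmt.Println(nodesToRoute(all, []string{"web1", "web2", "web3"})) // normal: only healthy nodes
	fmt.Println(nodesToRoute(all, []string{"web1"}))                 // correlated failure: fail open
}
```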
You say: "The DB and overall health of the service cluster are monitored through more organic instrumentation". What is acting on that? What happens when you do get that rare network partition and half of your web nodes cannot reach the database?