Not before each change but before any major new initiative or refactor. Having t...

ansgri · on Sept 20, 2016

Replying to the second paragraph: there is often real value in maintaining a strong upper bound on latency, especially in distributed real-time systems (which are most real systems, anyway).

E.g. (99% sub-200ms and 1% _unbounded_) vs (80% sub-200ms and _always_ sub-500ms) means 1% of requests potentially causing unanticipated crashes (hell to debug and explain to customers!) vs a highly reliable system and happy customers.
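To make the comparison concrete, here is a minimal sketch that summarizes a latency sample by the two numbers discussed above: the fraction of requests under a target and the worst case observed. The function name `profile` and both sample lists are made up for illustration (the 30-second outlier merely stands in for "unbounded"); they are not measurements from any real system.

```python
def profile(latencies_ms, target_ms=200.0):
    """Return (fraction of requests under target_ms, worst observed latency)."""
    under = sum(1 for x in latencies_ms if x < target_ms) / len(latencies_ms)
    return under, max(latencies_ms)


# Hypothetical samples, 100 requests each.
mostly_fast_unbounded = [50.0] * 99 + [30_000.0]   # 99% fast, 1% runaway
slower_but_bounded = [150.0] * 80 + [450.0] * 20   # 80% fast, never above 500 ms

print(profile(mostly_fast_unbounded))
print(profile(slower_but_bounded))
```

The first sample wins on the percentile figure and loses badly on the worst case, which is the trade-off the comment describes.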

kasey_junk · on Sept 20, 2016

For sure, that's the definition of a real-time system, after all. But having conversations about what the long tails do to the "normal" path, and what the costs are (both in money and in performance on the "normal" path), is quite simply something that you can't back into.

ansgri · on Sept 20, 2016

Maybe I didn't understand you correctly, but you can at least have a "return error on timeout" and process that with predictable logic. Or maybe you do have an architecture where any individual tardy request absolutely cannot impact others. After all, I come from stream processing systems where there are only a few "users" with constant streams of requests, and these users are interdependent (think control modules in a self-driving car).
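A "return error on timeout" wrapper might look roughly like the sketch below, using Python's `asyncio.wait_for`. The names `handle` and `run`, the 0.5 s default deadline, and the use of `asyncio.sleep` as a stand-in for the real work are all assumptions made for illustration.

```python
import asyncio


async def handle(work_s: float, deadline_s: float = 0.5) -> str:
    """Do the work, or return an explicit error once the deadline passes."""
    try:
        # asyncio.sleep is a placeholder for the real (hypothetical) request.
        return await asyncio.wait_for(
            asyncio.sleep(work_s, result="ok"), timeout=deadline_s
        )
    except asyncio.TimeoutError:
        # The caller always gets an answer within roughly deadline_s.
        return "error: timeout"


def run(work_s: float, deadline_s: float = 0.5) -> str:
    """Synchronous convenience wrapper around handle()."""
    return asyncio.run(handle(work_s, deadline_s))
```

Note that this only bounds how long the caller waits. What the caller then does with `"error: timeout"` still has to be decided somewhere, and the wrapper does not settle that.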

kasey_junk · on Sept 20, 2016

What I'm suggesting is that the decision on what to do in the case of long-tail performance problems is not something you can back into.

If you are going to have timeouts with logic, that has downstream implications. If you are going to have truly independent event loops, that is a fundamental architectural decision.

None of those things match "make it work, then make it fast". You literally have to design that into the system from jump street, as it is part of the definition of "works".