I'm still confused. You mention resynchronizing when the "partition" "recovers". First, can you clarify what a partition is? Second, can you define "recovery"? I'm not worried about performance degradation; I'm worried about nodes being marked down when they aren't down.
Please correct me if I'm wrong, but it sounds like this software only works reliably when you have two sets of nodes that suddenly can't communicate at all and are eventually reconnected. Sometimes that does happen on a real network, but often the cause of a failure is intermittent and goes undiagnosed for hours, days, or weeks. In that case, how would this program behave? Would network nodes keep appearing and disappearing, triggering floods of handler scripts, loading up boxes, and keeping services unavailable?
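To make that concern concrete, here is a rough sketch of a membership event handler, assuming Serf's documented convention of setting SERF_EVENT and passing one member per line on stdin (the exact field layout below is an assumption). Every flap of the same node re-invokes the script, so whatever the handler does gets repeated each time:

```python
#!/usr/bin/env python3
# Hypothetical Serf membership event handler. The stdin field layout is an
# assumption based on Serf's documented handler convention (one member per line).
import os
import sys
import time

event = os.environ.get("SERF_EVENT", "unknown")

if event in ("member-join", "member-failed", "member-leave"):
    for line in sys.stdin:
        fields = line.strip().split("\t")
        name = fields[0] if fields else "?"
        addr = fields[1] if len(fields) > 1 else "?"
        # Each flap of the same node re-invokes this script, so anything
        # expensive done here (reloading a load balancer, bouncing a service)
        # is repeated on every flap.
        print(f"{time.strftime('%H:%M:%S')} {event}: {name} ({addr})", file=sys.stderr)
```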
Yes, TCP performance does degrade under packet loss. It also continues to operate (at well over 50% loss) and automatically tunes itself to regain performance once the degradation ends. And it does not present false positives.
It maintains its own state (ordered delivery), checks its own integrity, stands up to Byzantine events (hacking), and is supported by virtually every platform and application. Unfortunately, due to its highly-available nature, it will eventually report a failure to the application, but only if one actually exists. But if latency is a higher priority than reliability, UDP-based protocols are more useful.
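A minimal sketch of that last point, assuming nothing beyond the standard socket API (the hostname, port, and "ping" payload are placeholders): the kernel keeps retransmitting through transient loss, and the application only sees an error once the connection has genuinely failed.

```python
#!/usr/bin/env python3
# Minimal sketch: TCP retransmits through transient packet loss; the application
# is only handed an error once the OS gives up on the connection.
import socket

def peer_reachable(host: str, port: int, timeout: float = 30.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"ping\n")
            data = sock.recv(1024)  # blocks (up to `timeout`) while the kernel retransmits
            return bool(data)       # ordered bytes arrived, so the peer is genuinely up
    except OSError:
        # Reaching this branch takes a sustained failure (retransmission timeout,
        # reset, unreachable host), not a single dropped packet.
        return False

if __name__ == "__main__":
    print("peer reachable:", peer_reachable("service.example.com", 8080))
```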
If you're designing a distributed, decentralized, peer-to-peer network, that's cool! But I personally wouldn't use one to support highly-available network services (which accounts for three of the five suggested use cases for Serf).