Classic blog from 2010. The idea didn't catch on, however, and no one uses these terms. PNUTS is still completely closed source inside Yahoo -- this is because PNUTS depends on an internal queuing system.
As for YCSB, it's commonly referenced and is easily available on GitHub. It hasn't changed much lately, and now with tools like Jepsen that focus on correctness as well as performance, YCSB is no longer the preferred testing tool.
I agree, and I wish the concepts here were more widely adopted. Many people seem to equate the low-consistency modes (offered by Dynamo/Cassandra etc.) with CAP and guaranteeing availability in the presence of network partitions. In practice, network partitions seem much less of a concern, and the low-consistency modes are really needed to get reasonable performance/latency instead.
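To make the "low consistency for latency" point concrete, here is a minimal sketch using the DataStax Python driver's tunable consistency levels. The contact point, keyspace, and table are placeholders I made up; the ConsistencyLevel knob itself is the only point.

    # pip install cassandra-driver
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])             # placeholder contact point
    session = cluster.connect("demo_keyspace")   # placeholder keyspace

    # CL=ONE: answer as soon as a single replica responds -- lowest latency,
    # weakest consistency. Chosen for speed, not because a partition is expected.
    fast_read = SimpleStatement(
        "SELECT value FROM kv WHERE key = %s",
        consistency_level=ConsistencyLevel.ONE,
    )

    # CL=QUORUM: wait for a majority of replicas -- higher latency, but the read
    # reflects the latest majority-acknowledged write.
    safe_read = SimpleStatement(
        "SELECT value FROM kv WHERE key = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )

    row = session.execute(fast_read, ["some-key"]).one()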
The concepts listed here are either (a) too complex or (b) not complete enough to provide a full, rich mental model.
CAP was highly successful because it fit in with an existing meme: 3 things, and you can only get 2 of them. It's simple, and easy to apply in a trivial fashion.
But the CAP theorem doesn't have enough "meat" as an engineering analysis, and it doesn't guide your system design. Yes, there's a proof, but it doesn't tell you how to balance the three concepts or how to directly compare system designs.
The thing is that as a descriptive model of how an entire distributed database works, CAP just ain't enough. That's why blog posts like https://aphyr.com/posts/313-strong-consistency-models and concepts like 'linearizability' are very useful.
Having implemented and used multiple consistent coordination systems -- Raft+ in c5 (which I helped write), Single Decree Paxos via DConE at WanDISCO, and timeline-consistent implementations via HBase -- I have to say the academics miss that the devil is in the implementation details. We all focus on high-level things like CAP (although FLP is what the real hard-core academics care about), when system details like how you aggregate fsyncs, system pauses (especially with GC), and how you integrate your coordination system into the larger system play a much larger role in overall system latency. The coordination posture is a minor detail when compared to GC issues. Now I know what you're going to say: "He's not an academic, he ran a real DB company." I totally disagree. The DB community in general focuses on the wrong things. That's true in Hadoop, it's true with Cassandra, and I would bet it's true at Google as well.
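Since "aggregating fsyncs" may be unfamiliar: here is a minimal group-commit sketch of the idea, where many writer threads append log entries but a single background thread issues one fsync that makes the whole batch durable. This is only an illustration of the general technique under assumptions I made up; it is not how c5, DConE, or HBase actually implement it.

    import os
    import threading

    class GroupCommitLog:
        """Toy write-ahead log: one fsync covers every entry appended since the last flush."""

        def __init__(self, path):
            self._f = open(path, "ab")
            self._lock = threading.Lock()
            self._flushed = threading.Condition(self._lock)
            self._pending = 0      # entries written but not yet durable
            self._generation = 0   # bumped after every successful fsync

        def append(self, entry: bytes) -> None:
            """Blocks until the entry has been covered by a group fsync."""
            with self._lock:
                self._f.write(entry)
                self._pending += 1
                gen = self._generation
                while self._generation == gen:   # wait for the next group flush
                    self._flushed.wait()

        def flush_loop(self) -> None:
            """Run in a background thread; batches all pending entries into one fsync."""
            while True:
                with self._lock:
                    if self._pending == 0:
                        self._flushed.wait(timeout=0.001)  # nothing to flush yet
                        continue
                    self._f.flush()
                    os.fsync(self._f.fileno())   # one disk sync for the whole batch
                    self._pending = 0
                    self._generation += 1
                    self._flushed.notify_all()   # wake every waiting append()

    # Usage sketch:
    # log = GroupCommitLog("/tmp/group_commit.log")
    # threading.Thread(target=log.flush_loop, daemon=True).start()
    # log.append(b"entry-1\n")   # many threads can call this concurrently

The win is that N concurrent appends cost one fsync instead of N, which is the kind of implementation detail the comment above says dominates overall latency.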
A classic! If you had to take away one thing from this article, it would be that the CAP theorem (while useful) doesn't take latency into consideration, which might be even more important than partition tolerance. Yes, partition tolerance is inevitable, but you have to deal with latency every single time!
The author of CAP himself said that the CAP theorem gained undue popularity. It is an interesting theorem only because it models the trade-offs a partition implies when one happens.
But partitions are rare. And when partitions happen, you generally can't access the partitioned area, so a lot of problems disappear.
Availability isn't an on/off switch; there is a wide range of how "available" you can be and what you can do. For example, you can allow reads but disallow writes (see the sketch below).
Last but not least, the most important question is what happens after the partition is over and what level of guarantees you offer regarding coherence.
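To illustrate the reads-allowed/writes-refused mode mentioned above, here is a toy sketch of a node that degrades to read-only when it cannot reach a write quorum instead of going fully unavailable. The ping()/replicate() peer methods and everything else here are assumptions, not anything from the article.

    class PartitionedNode:
        def __init__(self, store, peers, quorum):
            self.store = store      # this node's local key/value copy
            self.peers = peers      # peer handles with ping() and replicate()
            self.quorum = quorum    # replicas required to accept a write

        def reachable_peers(self):
            return [p for p in self.peers if p.ping()]

        def read(self, key):
            # Reads stay available during a partition, but may be stale.
            return self.store.get(key)

        def write(self, key, value):
            live = self.reachable_peers()
            if len(live) + 1 < self.quorum:        # +1 counts this node
                raise RuntimeError("read-only: cannot reach a write quorum")
            self.store[key] = value
            for p in live:
                p.replicate(key, value)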
Partitions are not rare. A partition simply means that one node is not available to another, which does not have to be caused by a network failure. For example, a stop-the-world GC pause is equivalent to a network failure: the node is unreachable during that time. So is a deployment with downtime (e.g. a 10 second restart per node).
Also, while you can't access the partitioned area, you might have nodes outside that area. The question is whether the non-partitioned nodes become useless to you or not.
10 seconds/day of downtime breaks 4 9's (99.99% allows only about 8.6 seconds/day). The fact that this might be below the TCP timeout is irrelevant; that's almost certainly NOT the timeout that will be considered unavailability. E.g., suppose you are delivering an advertising tag: unavailable means > 200ms.
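For reference, the per-day downtime budget for each level of "nines" is plain arithmetic over an 86,400-second day:

    SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

    for nines in (2, 3, 4, 5):
        availability = 1 - 10 ** -nines               # e.g. 3 nines -> 0.999
        budget = SECONDS_PER_DAY * (1 - availability)
        print(f"{nines} nines ({availability:.5f}): {budget:.2f} s/day allowed")

    # 2 nines (0.99000): 864.00 s/day allowed
    # 3 nines (0.99900): 86.40 s/day allowed
    # 4 nines (0.99990): 8.64 s/day allowed
    # 5 nines (0.99999): 0.86 s/day allowed

So 10 seconds/day fits inside three 9's but blows through four.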
So those nodes can't access the partitioned area, and you don't have any conflict.
The TCP timeout is kind of irrelevant if your application-defined timeout is lower. Cassandra's default timeout is 10s, but in practice most users expect their responses in less than a second (see the timeout sketch below).
Also, a partition in the system's definition can be as small as a single node failure.
But I've seen plenty of systems that handle single-node failure and still get fucked when you have a network split where 50% of the nodes can talk to each other but not to the other 50%. The CAP theorem doesn't help you build a system that avoids doing very bad things when that happens.
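Picking up the timeout point a couple of comments up (the application-defined deadline, and the 200 ms ad-tag example before that): here is a minimal sketch of a client enforcing its own deadline, far below anything TCP-level. The URL is a placeholder; the timeout parameter of requests is the only point.

    import requests

    def fetch_ad_tag():
        try:
            # Give up after 200 ms -- far below any TCP retransmission timeout,
            # which can run to many seconds or minutes.
            resp = requests.get("http://ads.example.com/tag", timeout=0.2)
            return resp.text
        except requests.exceptions.Timeout:
            return ""   # degrade gracefully instead of blocking the caller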
The way I've heard the CAP theorem used practically at a high level is to frame the question as follows:
What happens in the case of a (logical) network partition? An AP system continues taking requests and provides eventual consistency, while a CP system waits for the partition to go away, or tells clients to come back later.
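A toy sketch of that framing, with every name made up: during a partition the CP node refuses requests it cannot coordinate, while the AP node keeps answering from its local replica and reconciles later.

    class CPNode:
        def __init__(self, local, can_reach_majority):
            self.local = local                            # this node's copy of the data
            self.can_reach_majority = can_reach_majority

        def get(self, key):
            if not self.can_reach_majority():
                # Gives up availability: tell the client to come back later.
                raise TimeoutError("partitioned: come back later")
            return self.local[key]

    class APNode:
        def __init__(self, local):
            self.local = local
            self.hinted = []            # writes to replay once the partition heals

        def get(self, key):
            return self.local.get(key)  # always answers, possibly stale

        def put(self, key, value):
            self.local[key] = value
            self.hinted.append((key, value))  # eventual consistency: sync later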
You can be available, or you can be consistent. The higher the consistency guarantee you demand, the lower your availability.
Imagine a small database with a million copies on a million machines. When I write something, I'm going to have to wait a bit while all million machines acknowledge the write is complete. If I have a lower threshold, say 1, the write completes really fast, but the million machines aren't consistent (see the sketch below).
Real systems balance consistency and availability.
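A sketch of the threshold idea in that example, assuming each replica exposes a blocking write() that returns an acknowledgment (all names here are hypothetical): return as soon as w replicas have acked, so w = len(replicas) is the slow, consistent end and w = 1 is the fast end where replicas temporarily diverge.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def replicated_write(replicas, key, value, w):
        """Return True once `w` of the replicas have acknowledged the write."""
        pool = ThreadPoolExecutor(max_workers=len(replicas))
        futures = [pool.submit(r.write, key, value) for r in replicas]
        acks = 0
        ok = False
        for fut in as_completed(futures):
            if fut.result():            # replica acknowledged the write
                acks += 1
            if acks >= w:
                ok = True
                break                   # don't wait for the stragglers
        pool.shutdown(wait=False)       # remaining writes finish in the background
        return ok

    # replicated_write(replicas, "k", "v", w=len(replicas))  -> consistent but slow
    # replicated_write(replicas, "k", "v", w=1)              -> fast; replicas lag briefly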
If I understand the author correctly: a CP system sacrifices Availability only during a Partition, whereas an AP system tends to sacrifice Consistency all the time.