
I think that your description of exactly-once messaging is inaccurate.

> This is the holy grail of messaging, and also the fountain of a lot of snake-oil.

Exactly-once message delivery is quite possible with messaging systems that support transactions. When combined with other transactional resources (e.g. a database) and a distributed transaction monitor, exactly-once messaging works well and is rock-solid reliable. The grand-daddy of message brokers, IBM MQ, is absolutely capable of exactly-once messaging.
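Roughly what a transactional send looks like with the pymqi client, as a sketch only (the queue manager, channel, and queue names below are placeholders, not anything from this thread):

```python
import pymqi

# Placeholder connection details and queue names, for illustration only.
qmgr = pymqi.connect('QM1', 'DEV.APP.SVRCONN', 'localhost(1414)')
queue = pymqi.Queue(qmgr, 'DEV.QUEUE.1')

# Put the message under syncpoint: it stays invisible to consumers
# until the unit of work commits.
pmo = pymqi.PMO()
pmo.Options = pymqi.CMQC.MQPMO_SYNCPOINT
queue.put(b'order created', pymqi.MD(), pmo)

# Commit makes the send durable; backout() would discard it and roll
# back any other resources enlisted in the same unit of work.
qmgr.commit()

queue.close()
qmgr.disconnect()
```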




AWS makes the same claim with FIFO SQS, and maybe I’m getting it wrong, but these claims 1) have a lot of caveats and 2) only work inside the boundaries of the messaging system.

There’s a note in the next paragraph about how these systems manage to say that: if you pass in the same message ID / token within X minutes the message won’t be duplicated, and by ensuring FIFO there’s a side effect of not giving out the next message until the current one is acknowledged.
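For reference, that token is the MessageDeduplicationId on an SQS FIFO queue. A minimal boto3 sketch (the queue URL and IDs are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # placeholder

# A second send with the same MessageDeduplicationId inside the
# 5-minute deduplication window is accepted by the API but dropped.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"order_id": 42}',
    MessageGroupId="orders",
    MessageDeduplicationId="order-42-created",
)
```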

This leads to a situation where there’s a guarantee of exactly once acknowledgement, but not necessarily exactly-once processing or delivery. Given that the semantics of at-most-once and at-least-once apply to processing and delivery, I personally don’t think the goalposts should move on exactly once.

Systems claiming exactly-once lull developers into not planning for multiple deliveries on the subscriber, or the need to do multiple publishes, both of which can still happen.


It's better to use an SQS standard queue and have the consuming system provide the exactly-once processing guarantee, for various reasons. You will need to introduce something like Redis if you are not already using it, but I still think it's net superior to using an SQS FIFO queue if you want exactly-once processing.
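Something like the following, assuming redis-py and a TTL long enough to cover redeliveries (all names are illustrative):

```python
import redis

r = redis.Redis()

def process_once(message_id: str, handler, ttl_seconds: int = 86400) -> bool:
    # SET NX succeeds only for the first consumer to claim this message ID.
    claimed = r.set(f"seen:{message_id}", "1", nx=True, ex=ttl_seconds)
    if not claimed:
        return False  # duplicate delivery; skip it
    try:
        handler()
    except Exception:
        # If processing fails after claiming, release the key so the
        # redelivered message can be retried -- one of the edge cases
        # mentioned further down the thread.
        r.delete(f"seen:{message_id}")
        raise
    return True
```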


Might not even need to use Redis. If the message has a proper idempotency ID, a transactional database is more than enough. If the consumer is running MySQL/Postgres/DynamoDB etc., nothing else is needed.
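E.g. with Postgres and psycopg2, assuming a processed_messages table keyed on the message ID (the DSN and table/column names are illustrative):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN

def handle(message_id: str, payload: str) -> None:
    # Claim the ID and apply the side effects in one transaction; a
    # redelivered message hits the primary key and is skipped.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO processed_messages (id) VALUES (%s)"
            " ON CONFLICT (id) DO NOTHING",
            (message_id,),
        )
        if cur.rowcount == 0:
            return  # already processed
        cur.execute("INSERT INTO orders (payload) VALUES (%s)", (payload,))
```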


Not quite: there are always a bunch of edge cases that inevitably make "exactly once" actually "almost always exactly once, but always at least once."


Indeed. "exactly once" violates the CAP theorem, so if you actually make a system that can guarantee "exactly once" then you should apply for a Turing medal immediately.


I think that you are misunderstanding the CAP theorem. The CAP theorem states that in the event of a network partition, a system can either be consistent or available, but not both. So a messaging system that provided exactly once message delivery would not provide availability during a network partition. However, there are many applications for which consistency is more important than availability, especially if the period of unavailability can be limited.


Ah, but "especially if the period of unavailability can be limited" is exactly the type of edge case kasey_junk was talking about. Network partitions may persist for unbounded amounts of time as far as the CAP theorem is concerned, and an unspecified amount of packets may be dropped and/or delayed. It could be the case that every message you send gets dropped due to a persisting partition and in such a case none would arrive, thereby violating the "guarantee" of exactly-once delivery.

In practice I agree that these problems are quite rare, since most networks are reasonably stable. However, especially at scale it's not rare to see messages dropped or delivered more than once. I have no doubt IBM MQ can achieve exactly-once most of the time, but no distributed system can achieve exactly-once delivery all of the time.


> It could be the case that every message you send gets dropped due to a persisting partition and in such a case none would arrive, thereby violating the "guarantee" of exactly-once delivery.

That is not correct. All interactions between the client and the broker are performed in transactional units. If the transaction in which messages are sent fails to commit, then the messages are not sent, and all work is rolled back. Once a message is successfully sent (that is, sent and the transaction committed), it will be delivered once and only once to the receiver.

Likewise on the receiving side, a message is delivered and the encompassing transaction is committed once and only once. A message may be delivered more than once if the encompassing transaction is later rolled back due to, say, a network failure. But a message delivery in a transaction that does not commit is not a delivery.

The benefit here is that application programmers don't need to concern themselves with message duplicate checking, or with the risk that duplicate checking is done incorrectly, leading to bugs that are very difficult to identify.
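A sketch of that receiving side with pymqi, under the same assumptions as above (placeholder names, error handling trimmed):

```python
import pymqi

def handle(body: bytes) -> None:
    # Application work goes here, e.g. a database write enlisted in the
    # same global transaction via an XA coordinator.
    print(body)

qmgr = pymqi.connect('QM1', 'DEV.APP.SVRCONN', 'localhost(1414)')  # placeholders
queue = pymqi.Queue(qmgr, 'DEV.QUEUE.1')

gmo = pymqi.GMO()
gmo.Options = pymqi.CMQC.MQGMO_SYNCPOINT | pymqi.CMQC.MQGMO_WAIT
gmo.WaitInterval = 5000  # milliseconds

try:
    msg = queue.get(None, pymqi.MD(), gmo)
    handle(msg)
    qmgr.commit()    # the delivery "counts" only once this succeeds
except Exception:
    qmgr.backout()   # the message goes back on the queue; no delivery happened
    raise
```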


A transaction which is partition-tolerant in the way you're describing requires stronger semantics than mere client acknowledgement; it requires all participants to engage in the consensus protocol. Unless your application joins the message broker's topology as an active member -- some systems do work this way, like ZooKeeper -- it can still suffer message loss.

But even if it does join, that's still not sufficient, because these systems can become unavailable during partitions, and that is definitionally incompatible with "exactly once".


It’s been an age since I’ve worked with IBM MQ, and there are dials upon dials when setting up MQ-based systems, but it doesn’t offer exactly once in the face of broker failure in most of its HA configurations, and it uses deduplication at the protocol level to prevent duplicates.

When people say “exactly once” is impossible they really mean in the face of failure at the queue level.


> When people say “exactly once” is impossible they really mean in the face of failure at the queue level.

And what exactly is impossible with that? Just wait it out, i.e. like all the CP systems do (as per CAP).


The premise is that unavailability is the same as zero delivered messages, not one.

Note that none of this is rigorously defined, either in the article or by most message queues, and the configuration of queues/brokers/clients means that there are all manner of edge cases around delivery guarantees in practice.


"Wait it out" is only a valid strategy when message rates are low enough that you can buffer them all until the network partition goes away again.

As an example, imagine a system sending a million 1 KB messages per second. To survive a 1-minute network outage it would need 60 GB of extra storage to park the messages. If the outage lasts longer than it has space available, dropping messages becomes inevitable.
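The back-of-the-envelope math:

```python
msgs_per_sec = 1_000_000
msg_size_bytes = 1_000       # ~1 KB per message
outage_seconds = 60

buffer_gb = msgs_per_sec * msg_size_bytes * outage_seconds / 1e9
print(buffer_gb)             # ~60 GB just to park one minute of traffic
```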


Even in the face of network partitions?


See my answer to WJW above. Yes, even in the face of a network partition, but with system unavailability for the duration of the partition.


System unavailability means the messages get delivered zero times, which is not exactly once.


So we're just playing semantic games at this point, using different definitions of terms. The definition of "exactly once" you're using isn't the formal definition.



