The architectural error was acknowledging messages at the end of a long processing step, probably to avoid handling messages produced by the worker, instead of using two messages with quick acks: one at the start, one from the worker at the end.
RabbitMQ's job is to handle message transfer; if you tie business logic (job completion) to its state, you have a problem.
So basically, producer and consumer MUST each have their own storage for tracking in-flight work, typically by ID, with RabbitMQ acting as the message handler for syncing states.
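A minimal sketch of that split, with each side keeping its own job store keyed by id. `JobStore` and the message shapes are hypothetical, and the broker calls are stubbed as plain callables so the logic stands on its own; a real setup would wire these to a client library such as pika:

```python
# Sketch of the "quick ack + own storage" pattern: the ack covers only
# message transfer; business completion is reported by a second message.
import uuid

class JobStore:
    """Each side keeps its own record of job state, keyed by id (hypothetical)."""
    def __init__(self):
        self.jobs = {}

    def record(self, job_id, state):
        self.jobs[job_id] = state

    def state(self, job_id):
        return self.jobs.get(job_id)

def producer_submit(store, publish):
    """Producer: record the job in its own store, then hand it to the broker."""
    job_id = str(uuid.uuid4())
    store.record(job_id, "submitted")
    publish({"type": "job.start", "id": job_id})
    return job_id

def consumer_receive(store, ack, publish, message):
    """Consumer: record the id and ack immediately -- transport is done.
    The long processing is tracked by id, outside the broker's state."""
    job_id = message["id"]
    store.record(job_id, "received")
    ack()                        # quick ack: RabbitMQ's responsibility ends here
    do_work(job_id)              # long business processing
    store.record(job_id, "done")
    publish({"type": "job.done", "id": job_id})

def do_work(job_id):
    pass  # placeholder for the actual long task
```

The producer matches the later `job.done` message against its own store by id, so neither side depends on broker state for completion.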
"two messages with quick ack, one message at start, one message at end from worker."
How do you deal with the worker getting killed in that scenario? If you can't rely on the queue's job state, then you need a whole other set of code somewhere to handle timeouts and retries, don't you?
A worker, or at least the consumer receiving the messages and spawning sub-workers, should restart at some point after being killed, inspect the unfinished job ids (the ones stored just before ack'ing reception), and either resume the work or notify failure.
Answering is the worker's sole responsibility, and no viable implementation of an app can reliably substitute for that.
As for timeouts, they are necessary when waiting for answers from services that are not under your control, i.e. when you cannot assume they will always answer. For services under your control, you have to keep yourself in a situation where a worker response is guaranteed, whether success or failure.
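The restart inspection described above can be sketched as a single recovery pass. The job store is a plain dict here, and `resume`/`notify_failure` are hypothetical callbacks standing in for whatever the consumer actually does:

```python
# Recovery sketch for a restarted consumer. Any job recorded as
# "received" (stored just before the ack) but never marked "done"
# was interrupted mid-processing.

def recover_unfinished(jobs, resume, notify_failure):
    """jobs maps job_id -> state; resume/notify_failure are callbacks
    (hypothetical names). Interrupted jobs are resumed or marked failed."""
    for job_id, state in list(jobs.items()):
        if state != "received":
            continue
        try:
            resume(job_id)
            jobs[job_id] = "done"
        except Exception:
            notify_failure(job_id)
            jobs[job_id] = "failed"
```

This is the extra bookkeeping the sibling comment points at: state transitions, retries, and failure notification all live in application code rather than in the broker.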
I feel like this would reimplement a lot of the machinery that's built into the job queue (atomic state locking to take ownership of a job, with a deadlock-prevention mechanism to make sure a killed process doesn't keep the job locked, and timeouts).
At this point, why not throw the whole job queue away and simply use that system?
I wish you wouldn't speak so authoritatively because much of this is just one way (and hardly the only way) to implement such a system. If you read the RabbitMQ docs (see https://www.rabbitmq.com/confirms.html) you'll see that ACK'ing the messages after processing is explicitly described as a way to handle consumer failures.
Depends on your definition of processing. In the case I describe, the processing is receiving and recording the job id sent by the producer. The ack should be sent ASAP, so that all non-business issues in message transmission are handled by the messaging logic.
Keeping the job processing itself independent of the messaging solution seems like a good way to go, as the actual messaging implementation might have to change. You really don't want your messaging implementation details ossified into your business logic.
I did not claim it was the only solution, merely something I know would be reliable, an implicit «one way to do it».
Again though, RabbitMQ and many AMQP consumers are explicitly designed so that you don't have to do this if you don't want to. If you have long idempotent tasks, it is both possible (and, in many cases, recommended!) to set QoS (prefetch) to 1 and ACK only after the long task has completed. This means you don't have to write any logic to deal with worker crashes/disconnects/etc. because this is a fully anticipated and supported workflow. There are certainly downsides to this approach, but not having to worry about tracking job ids is much simpler.
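The prefetch-1, ack-after-completion setup can be sketched roughly as below. The queue name and task body are placeholders; the pika wiring is shown in comments, while the callback itself is plain Python so the flow is visible:

```python
# Late-ack consumer sketch (names are placeholders). With pika the
# wiring would be roughly:
#   channel.basic_qos(prefetch_count=1)   # one unacked message at a time
#   channel.basic_consume(queue="tasks", on_message_callback=on_message)
#   channel.start_consuming()

def on_message(channel, method, properties, body):
    """Ack only after the long task completes. If the worker dies first,
    the broker sees no ack and redelivers the message to another consumer."""
    handle_task(body)  # the long, idempotent job
    channel.basic_ack(delivery_tag=method.delivery_tag)

def handle_task(body):
    pass  # placeholder for the long-running, idempotent work
```

Because the task must be safe to run twice (a crash after completion but before the ack triggers redelivery), idempotency is the price of skipping your own id tracking.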
It's not only simpler, it's also far more reliable for high volumes of messages with a large number of workers and short processing times.
I was really focusing on low volumes and long jobs, where RabbitMQ is overkill and worker respawns don't need proper dequeuing.
One aspect to keep in mind is the kind of use case: when the job is just a subtask and the final result is stored, completion produces a message to be handled for continuation, so you get direct id tracking without further logic, and a simple ack is not enough.
For example, when sending emails, the job does not create a further message and no shared mutable state is involved. There, the ack-on-end-of-processing scenario covers all cases.
That seems like a rather uncommon definition of processing. Certainly not what people mean when they generally talk about processing a message.
You’re right that if the business processing is a long, heavyweight operation then it should be decoupled from the message queue, with failure handled internally, in which case your definition does make sense as an intermediary hand-off step. In the general case, though, I think most people wouldn’t consider hand-off alone to be processing.