> If you have large messages and use keepalives (and you'll need keepalives), yo...

fake-name · on May 23, 2020

> I'm confused by what you mean by that.

By large, I mean 10+ MByte.

> Completely agree. Having hacked on and patched the code inside Celery, it's really quite a bummer.

I don't understand what the point of Celery is. Literally everything I do requires /some/ persistent state in the workers, and there's no way to do that with Celery.

> Are you talking about publishing connections? Consuming connections? One used for both? What does "stuck" mean? I'd be interested in hearing more about this.

TCP connections. As in, a connection to the server from a consumer. High latency connections seem to exacerbate the issue.

I think the issue is the state machines server-side and client-side get out of sync, and things just stop until the keep-alives/heartbeat cause the connection to reset, but that's a bunch of time to wait with no messages.
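To make the "wait for the heartbeat to notice" part concrete, here is a minimal sketch of a stall detector. The class name `HeartbeatMonitor`, its parameters, and the two-missed-intervals threshold are assumptions for illustration, not the API of any real AMQP client.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: flags a connection as stalled when no traffic
    has been seen for more than `max_missed` heartbeat intervals."""

    def __init__(self, interval_s, max_missed=2, clock=time.monotonic):
        self._interval_s = interval_s
        self._max_missed = max_missed  # assumed threshold, tune per client
        self._clock = clock
        self._last_seen = clock()

    def record_traffic(self):
        # Call whenever any frame (data or heartbeat) arrives from the peer.
        self._last_seen = self._clock()

    def is_stalled(self):
        # True once the silence exceeds the allowed number of intervals.
        silence = self._clock() - self._last_seen
        return silence > self._interval_s * self._max_missed
```

With a 60 second interval, a connection that goes quiet is only flagged after more than two minutes, which is the dead time described above.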

I also ran into the issue that basically every Python library had at least one or two locations where `read()` was called without a timeout, but that was at least easier to fix.
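A minimal sketch of that kind of fix, using only the standard library `socket` module. The helper name `recv_with_timeout` and the 5 second default are made up for this example and are not taken from any particular client library.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_s=5.0):
    """Read up to nbytes from sock, or return None if nothing arrives
    within timeout_s seconds (instead of blocking forever)."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # No data in time; the caller decides whether to retry or reconnect.
        return None
```

Returning `None` rather than raising is just one design choice; the point is that the caller gets control back and can decide to tear down the connection.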

> Kinda pedantic, but exactly once delivery is possible in some very restricted situations (see Kafka's implementation of this guarantee: https://www.confluent.io/blog/exactly-once-semantics-are-pos...). Exactly once processing is what's tough-née-impossible. So yeah, idempotence is great.

Well, it isn't really a thing, so you at least shouldn't depend on it being a thing for your architecture if possible.
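For example, an idempotent consumer can be sketched roughly like this. `IdempotentHandler` and the `message_id` argument are hypothetical, and the in-memory set is only for illustration; a real system would need a durable store for the seen IDs.

```python
class IdempotentHandler:
    """Hypothetical wrapper: runs `process` at most once per message ID,
    so a redelivered message becomes a no-op."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # assumption: in-memory only, lost on restart

    def handle(self, message_id, payload):
        if message_id in self._seen:
            return False  # duplicate delivery, skip it
        self._process(payload)
        # Mark as seen only after processing succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(message_id)
        return True
```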

zbentley · on May 30, 2020

> By large, I mean 10+ MByte.

OK. Did Rabbit or your client libraries bug out when sending single giant messages? And what does message fragmentation have to do with keepalives? By fragmentation, I assume you mean splitting one logical message up over multiple AMQP messages, or is it something else? And by keepalives, do you mean connection heartbeats or TCP keepalives?

> Literally everything I do requires /some/ persistent state in the workers, and there's no way to do that with Celery.

Sure there is. In-memory caches persist between requests. And there's always sqlite and friends. Celery's more intended for the "RPC/fire-and-forget" case than stateful workloads, but it's not too painful to use those with it. And you get the benefits of its (reasonably) hardened connection/heartbeat management, which may help with some of your other issues.
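A rough sketch of the in-memory cache idea, using only the standard library. `get_model` and `handle_request` are hypothetical names; in a Celery app the body of `handle_request` would live inside a task function. The cache is per worker process and disappears when that process restarts.

```python
import functools


@functools.lru_cache(maxsize=None)
def get_model(name):
    # Stand-in for an expensive load (file, DB, network). Because of the
    # cache, this runs once per process for each distinct name.
    return {"name": name}


def handle_request(name, value):
    # Every call in the same worker process reuses the cached object.
    model = get_model(name)
    return (model["name"], value)
```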

Basically every time I've seen code that rolled its own bespoke consumer loop for RabbitMQ, it was wrong in some fundamental ways; the state machine on the consumer side did indeed get out of whack, and badly. Best to outsource the "keep the connection alive, establish subscription, detect failures" work to a higher-level library (like Celery) that provides a long-lived consumer so your code can just be occupied with data processing.