I've been trying to rationalize using either RabbitMQ or Kafka for something I'm building. High messages per second but with more complex routing topologies.
Rabbit seems to be the right path but I'm worried about scaling out as many sources seem to point as Kafka being more scalable (at least horizontally). I've been looking into Rabbit's Federation but it's still not clear if that will solve the problem down the road.
I've been running RabbitMQ on pretty small VMs for a long time. RabbitMQ doesn't need a lot of resources per message, even with very small VMs (512MB RAM, single CPU) I've seen it handle peaks of many thousands of messages a second without running into problems. Give it a bit of beefy hardware and it'll probably handle whatever load you were thinking, unless you're saturating 10gig links with messages or something.
RabbitMQ and Kafka are very different struggles when thinking of scaling and performance. Kafka is almost a database itself of messages which have routed through the system. In many configurations clients can come back and demand to replay the message stream from almost any point in time. This means you need to handle _a lot_ of disk and memory access. With RabbitMQ, messages are traditionally very ephemeral. Once a message has been ack'd, its gone. Poof. Not in memory. Not on disk. Nobody is going to come back asking for that message. This leads to a lot more efficiency in handling things per message, but at the cost of not being able to remember the messages that went through the system a few milliseconds ago.
CPU usage highly depends on the number of connected clients, not that much on message throughput. You can experiment with the excellent rabbitmq-perf-test tool to get some ballpark numbers.
I have a system that only pushes 5k messages per second but it needs 32 cores.
Yeah that sounds about right. Of course if you had 200 connections and 50 queues you'd more likely be seeing 100000 msg/s. The number of connections and queues has a big effect on total throughput.
As someone who has ran a number of messaging systems in production, this is what my current take is in general:
If you are moving to a more "event-sourced" architecture, usually two main concerns (beyond basic operational stuff of uptime, scale, etc) are routing and long-term retention.
RabbitMQ has the routing but not the retention. Kafka can have the retention and the routing, but it can be complex/expensive. Apache Pulsar really shines here as the API is pub/sub but it is underpinned by a log structure that gives you long-term retention (that doesn't need to be manually re-balanced) but it's flexibility does come with some operation complexity when compared to RabbitMQ.
If your needs is pretty much just moving large amounts of data, Kafka is definitely the most mature and has a big ecosystem, but long term-retention is difficult and there are some sharp edges around consumer groups.
If you really really don't need long-term retention and need complex topologies, RabbitMQ is your best bet and is fairly reasonable to operate even up to fairly high message rates (~10k msgs/sec shouldn't be too hard to achieve)
There are a TON more options these days though, older more java solutions like activeMQ and rocketMQ or more "minimal" implementations like NATs, not to mention the hosted services on cloud providers.
Personally, I am a big fan of Apache Pulsar for it's flexibility and some nice design choices, but I don't think there is any silver bullet in this space.
I'm guessing that the pain points surrounded having to set up a Zookeeper cluster in conjunction with Pulsar. I think Pulsar has the best model of the various queuing systems at the moment for the routing flexibility of RabbitMQ, the high-throughput of Kafka (topic/partitions), as well as the ability to seamlessly integrate with cold storage (S3/GCS) and to recall messages from cold storage without extra code (unlike Kafka), I just wish that ZK wasn't an additional dependency.
Adding to what the sibling comment say, be careful about buying into RabbitMQ's clustering; having run it for years, I found it to be extremely brittle.
We often lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was. It has a bunch of hacks to mitigate this, but they don't solve the core problem; the only way to run mirrored queues ("classic mirrored queues", as they're not called) reliably is to disable automatic recovery, and then you have to manually repair RabbitMQ every time this happens. If you care about integrity, you can use the new quorum queues instead, which use a Raft-based consensus system, but they lack a lot of the features of the "classic" queues. No message priorities, for example.
I've never used federation or Shovel, which are different features with other pros/cons.
If you're willing to lose the occasional message under very high load, NATS [3] is absolutely fantastic, and extremely fast and easy to cluster. Alternatively, NATS Streaming [4] and Liftbridge [5] are two message brokers built on top of NATS that implement reliable delivery. I've not used them, but heard good things.
> lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was
I can offer a similar anecdote: we started seeing rabbitmq reporting alleged cluster partitions in production after enabling TLS between rabbitmq nodes, where manual recovery was needed each time.
After a bit of investigation we noticed that cluster partition seemed to correlate with sending an unusually large message (think something dumb like 30 megs) through rabbitmq when TLS between rabbitmq nodes was enabled. What I believe was happening was Rabbitmq was so busy encrypting/decrypting large message that it delayed sending or receiving heartbeat & then the cluster falsely assumed there has been a network partition.
Mitigated that issue by rewriting system to not send 30 meg messages- there was only one message producer that sent messages anywhere near that large, and after a bit of thought realised it was not necessary to send any message at all in that case (sending large message was to hack around some other old system performance problem that had gotten fixed properly a year back, but the hack that generated a huge message was still in place)
Erlang/OTP-22 (released last year) introduced TLS distribution optimizations and message fragmentation which sound very related to the problem you saw:
The fragmentation in particular addresses the problem where a large message would block all other messages, including heartbeats, and cause nodes to look “down” when they’re not.
fantastic. thank you for sharing that -- my anecdote about this problem is slightly dated -- it would have been late 2017 early 2018 we were seeing the issue, which indeed predates OTP 22 release.
nowadays? it's actually quite simple to setup and works pretty well (source: i know two different companies that setup clustering recently and both had good experiences with no downtime).
I've used both. I was introduced to Rabbit at one job and at another, was "fed" Kafka during a selection process. At that time, I was definitely not opposed to Kafka because, hey new resume item. I ended up yearning for Rabbit for three reasons.
1) Much easier to implement and maintain for small to medium architectures. However, war stories I've heard is that it starts to become a hassle for large clustering architectures.
2) Because it's a traditional message broker, the input and output ends, which I was responsible for, were much simpler to write because I didn't have to worry about replays when it came back online. Rabbit knows which client it has already routed to and where messages went. Kafka is not that sophisticated in that regard. Kafka has been described as "dumb broker/smart clients" while Rabbit is "smart broker, dumb clients."
3) The scaling. Rabbit is very scalable. Once you get to the Uber/Paypal level (like, a couple of million writes per second), then Kafka becomes the obvious choice. Rabbit handles thousands or writes per second just fine. However, at that second company and like many others, they thought they'd have to suck up all the data, so of course, Kafka was the more scalable tool long term. Spoiler: We were never, ever close to PayPal-level transactions. If the size of the sun represents paypal/Uber transactions, we were basically Manhattan.
Kafka is one of those things where if you're new to it, especially if you're coming from Rabbit or similar, you might tend to assume the happy path - exactly once delivery. This is a bad mistake (whether that's possible and to what definition is not a debate I'd like to dive into now). What you should expect from Kafka is at least once delivery.
There will be times when you lose offsets or when you actually want to replay every message, so take an hour and figure out what that means to your app. It's usually only a few lines of code in your consumer that compares source timestamps, but it's by far the most beneficial thing you can do when working with Kafka in my experience.
It's also relatively easy to hit "tens of thousands" messages/second, especially in replay or bootstrapping scenarios, and that's when Kafka becomes useful to the non-FAANG companies.
I've seen quite a lot messages going through RabbitMQ. I wouldn't worry too much about scaling, because the possibilities depend very much on the architecture. With some tuning RabbitMQ can take you a long way. I would give clustering a go and see where the limits are before exploring more complicated architectures like federation.
With clustering, you can have more nodes and you can shard (distribute) your queues over the cluster. You don't need to mirror every queue on every node. But you are right, mirroring alone will add more load.
Rabbit's federation is a good way to bridge point-to-point connections between geographically distributed systems. I'm not sure that's a great scaling pattern for throughput though.
The clustering might look tempting but it hasn't been resilient for me in the face of janky networks. Split brains and data loss can result.
In the past I've scaled my rabbits for throughput by implementing my own routing/sharding layer.
If you're tempted to use the message persistence and you care about retaining messages, kafka is a bigger but much more capable hammer.
If you’re trying to “rationalize” a decision, that’s already a red flag. Also, Kafka and RabbitMQ are intended for different use cases. One is (the log component of) a streaming data processing system, the other is a message queue. Figure out which kind of system you need before deciding on a particular system. BTW, if you need to really scale, Apache Pulsar is designed to handle both scenarios.
Look into Pulsar, it can function as a message queue or pub/sub like Kafka.
By default it only retains non-acked messages, multiple subscription modes, can use non-persistent messaging, dead letter queue, scheduled delivery, can use Pulsar Functions to implement custom routing etc.
Scales like Kafka (probably better) and has cluster replication built in.
Rabbit MQ is a traditional message broker; you use it when you have lots of messages you don't particularly want/need to be stored persistently, and where you want/need to take advantage of the routing feature--that you put keyed messages into some topic/exchange and then subscribe to only part of the messages any given application is interested in.
Kafka creates the abstraction of a persistently stored, offset-indexed log of events. You read all events in a topic. Kafka can be used to distribute messages in the way AMQP is used, but is more likely to be the centerpiece of an architecture for your entire system where system state is pushed forward/transformed by deterministically processing the event logs.
If your main concern is scalability: Each queue in rabbit gets its own thread. So if you can spread your workload across multiple different queues you can scale without too many problems.
Then the odds of you hitting the scale where RabbitMQ v. Kafka is relevant are a million to one. There is a lot of overhead with Kafka compared to RabbitMQ.
Unless you already have Kafka infrastructure, setting up Kafka for a brand new project is crazy unless your only goal is learning how to set up Kafka.
Rabbit seems to be the right path but I'm worried about scaling out as many sources seem to point as Kafka being more scalable (at least horizontally). I've been looking into Rabbit's Federation but it's still not clear if that will solve the problem down the road.
Can anyone shine some light?