Apache Kafka, Samza, and the Unix Philosophy of Distributed Data

module0000 · on Aug 7, 2015

Kafka is represented as a very unix-friendly message tool, when it is no such thing. Kafka is (one of) the worst of the collection of message bus daemons. It imposes small(3-6mb) message size limits, or else performance goes to hell. Kafka also fails to "do one thing well", and instead tries to provide everything-and-the-kitchen-sink (clustering, compression, etc) built-in, but poorly.

For some contrast, take a language like python, which will allow you to create example programs to test out Kafka VS RabbitMQ VS zeromq. You will quickly find that one of the message bus daemons constantly gets in your way, fails to reliably deliver messages, and consumes a ton of system resources compared to the other two. Hint: It's kafka!

I really can't say enough bad things about kafka. Having used it and unfortunately been forced to implement it at a number of "big"(juniper, cisco, vmware) companies, it has been a horrible and disappointing experience for both the end-users and the developers, every single time.

tldr; don't use Kafka, use a real messaging bus.

schmidtc · on Aug 9, 2015

Can anyone provide a legitimate criticism of Kafka? "It sucks, doesn't use it" isn't very helpful or productive. The closest thing I can find is http://engineering.onlive.com/2013/12/12/didnt-use-kafka/ which is more critical of zookeeper then anything else.

I'm evaluating Kafka for a new project and it seems to be a perfect fit. I've contemplated building something from scratch in python, as my reliability and performance demands are pretty minimal. However, it seems that a lot of thought went into Kafka's design and it's feature set is perfect match for my problem. Specifically the unlimited buffering, log compaction and the ability to replay logs from arbitrary offsets.

If there are any viable alternatives to Kafka what are they? Bonus points if the JVM isn't involved.

toomanybits · on Aug 5, 2015

there's a live stream of this talk happening now https://www.youtube.com/watch?v=m-0cSdxiLLY#t=222

faragon · on Aug 5, 2015

The problem of "Unix Philosophy for everything" is forgetting that involves multiple buffer copy and parsing. So if you're able to reduce the space problem at first step(s), you can reduce the problem, being that method efficient enough so the buffer copy + parsing become not significant (cheap step(s) for finding needle(s) in a haystack, and then, apply the higher cost operations).

However, for operations not involving filtering, i.e. when processing every line/record, you can increase the throughput avoiding IPC memory copies and parsing on every step (the copy + parsing could take an important portion of processing time).

icefox · on Aug 6, 2015

I often play a game where I first write some shell script in some way that was quick and easy to do. I start it running and see if I can in another terminal figure out the faster "correct" method. I rarely ever win.

rch · on Aug 5, 2015

> you can’t just pipe one database into another, even if they have the same data model

> awk doesn’t know about the format of nginx logs

So the complexity of the implementation is a direct function of the dimension and heterogeneity of the data[-], but it still reduces to:

pgsql notify | client daemon | process | search engine

[-] high throughput is another matter

dkhenry · on Aug 5, 2015

Also you can pipe from one database into another, you just need to have a common format (SQL), or be able to convert to a common format. every database I know has a tool which will take text from stdin and can return text on stdout.

kragen · on Aug 6, 2015

It's true that database servers in the RDBMS ecosystem are not analogous to gzip or wc or even awk in the Unix ecosystem; they are analogous to Unix itself, and they live in an uneasy détente with it. They are the "ground of being" upon which various components can be composed, the role Kleppman proposes here for Kafka (and Hadoop, and all the other nonsense underpinning Kafka). You could argue that in fact database servers are terrible at composability even of the things you can put into them (scripting languages, data types, triggers), and I think that is true. More lower down about why that is.

The question of how to build an "operating system for the internet" is really interesting. How can we make it as easy to write a networked program that can be composed with other networked programs, in the way that you can compose tail and grep?

Kafka and Samza aren’t it; they're cluster-centric, not internet-centric. They assume a single administrator, Kafka's publish-and-subscribe model is inherently unreliable in the presence of spammers or other sources of system overload, and spammers are unavoidable if you can't kick them off the system. (If you're going to run Hadoop in "secure mode", then it uses Kerberos for authentication; I need say no more.)

With regard to system overload, this is exactly backwards:

    As long as the buffer
    fits within Kafka’s available disk space,
    the slow consumer can catch up later.
    This makes the system
    less sensitive to individual slow components,
    and more robust overall.

You can reliably feed terabytes of data through a Unix pipeline consisting of a fast process feeding that data to a slow process (e.g. rot13 | some slow Perl script). Without backpressure, you have to buffer the terabytes of intermediate data, which will fill up your disk, causing your system to fail; or you can drop data, causing your system to fail. Systems without backpressure cannot provide this kind of composability reliably. (Similarly, systems that require you to name your data outputs in a global mutable-object namespace, like Kafka topics, pose obstacles to composability. This, more than anything else, is what limits the composability of queries in traditional SQL databases.)

Don't get me wrong. I think pub-sub is great, and especially for loosely-coupled integration of different systems. I use pub-sub systems every day. I've even written a few. But don't make the mistake of thinking that they're the networked equivalent of Unix.

IPFS might be it. There's a whole panoply of other projects which, like IPFS, are trying to build a decentralized internet OS (MaidSafe, Ethereum, and so on) and probably sooner or later one of them will succeed.

One more minor quibble, which to my mind shows that the author wasn’t very interested in writing true things instead of false things:

     you can pipe
     the output of gunzip
     to wc
     without a second thought,
     even though the authors of those two tools
     probably never spoke to each other

David MacKenzie and Paul Rubin are in fact listed among the contributors to gunzip.

contingencies · on Aug 6, 2015

TLDR; author likens message oriented distributed architectures to the classic unix pipe, while lambasting us for forgetting (in RDBMS design, etc.) the composability lessons of the 1970s.

vegabook · on Aug 5, 2015

It would be even closer to the Unix philosophy if a competitive distributed graph processing framework such as these could became available on something other than the JVM / Java ecosystem. I personally am not using either for this exact reason and hope that Golang's goroutines will get their own distributed processing analogue, or that Mirage OS will take off.

I am amazed that these systems, Storm / Samza / Spark / even the new Flink are all Java / Scala based.

pixelmonkey · on Aug 5, 2015

One of the nice things about Storm, in particular, is that the creator of the project thought about multiple language support from the start. We use Storm via our open source project, streamparse[1], and this enables usage of normal Python programs running under a Storm topology. In our case, we achieve multi-process parallelism and fault tolerance using Storm's simple multi-lang protocol, which is abstracted by our library.

We also use Kafka natively from Python via the other module we released publicly, pykafka[2]. It provides support for nearly all of the Kafka protocol, and we even have an optimized C extension module in the works that may even be faster than the JVM consumer.

My view on this is that infrastructure has to be implemented in some language. I don't lament the fact that Postgres was written in C or that Cassandra was written in Java -- I just use these technologies as infrastructure. Both have very full-featured Python bindings. The only thing missing on the Storm/Kafka/streams side are the full-featured Python bindings. But my team at Parse.ly has been hard at work building those in open source. Help us!

[1]: https://github.com/Parsely/streamparse

[2]: https://github.com/Parsely/pykafka

infinite8s · on Aug 5, 2015

How do you find Storm's scalability and resource efficiency? Its interesting that Twitter themselves have moved away from Storm (and have even published a paper about it - http://blog.acolyer.org/2015/06/15/twitter-heron-stream-proc...)

pixelmonkey · on Aug 6, 2015

I find it scales well and uses resources well, within the limitations of its design. For example, Storm does not attempt to handle concerns like resource rebalancing, contention management, or auto-scaling. It has a simple REBALANCE command, but this amounts to killing all your workers and randomly redistributing them in your cluster. It isn't a smart rebalancer based on load or throughput, or anything.

My understanding is that Twitter's rewrite of Storm (Heron) was mainly to make it work with Mesos as a resource manager. This is probably wise since Mesos handles a lot of the concerns I described above. But Mesos didn't really exist when Storm was written.

They chose to re-implement 100% of Storm's API in doing so, which perhaps shows that the high level concept of the framework has staying power even though you might benefit from a more advanced resource manager at 1,000-node scale. I wish they had reimplemented it as a competing implementation and actually released it as open source. But no, they decided to keep it 100% proprietary. So it goes. I guess once companies become a certain size, they turn their back on open source if it's not directly in their interest any longer.

We run it with 10-15 Storm nodes with 32 cores each and find it to be immensely helpful in this context, keeping each node in the cluster lit up to ~60% CPU utilization and plowing through 10K events/second on a Kafka topic, despite the fact that we are using Python and there isn't a line of code using threads or process management. People are often shocked we pull this off, since, in theory, CPython's GIL means you can't even run on more than one core at a time. But we write simple Python programs that run on Storm and utilize hundreds of cores at once, across multiple machines. And we get the Erlang-style "let it crash" / "fail fast" process supervision for free.

I have a more cynical view of the Heron paper overall -- if you're curious about that, reach out to me directly (@amontalenti on Twitter).

infinite8s · on Aug 6, 2015

Thanks for the detailed reply. Do you run each python bolt as a separate process?

Pykafka looks nice! Any interest in getting it working with asyncio (or the 2.7 version, trollius)?

pixelmonkey · on Aug 6, 2015

Yes, each Python Bolt and Spout runs as a separate process. Though, by changing the paralellism hint, you change the number of Python processes spawned per component (e.g. :p 8 = 8 Python processes running the same bolt code).

This is because, under the hood, it all uses ShellBolt/ShellSpout in the Storm layer, which is all process-based. But I actually find this to be the "purer" way to run Storm, anyway. Why deal with threads when you don't need to.

As Joe Armstrong, the creator of Erlang, once said (paraphrasing): "Processes are isolated environments for code where state can't be shared except through explicit messaging; threads are isolated environments where state is directly -- and dangerously -- shared. Why would you want threads when you could have processes?" Ofc, I realize, there are times when threads' lightweightness matters, but it doesn't in our case, and processes are certainly simpler!

As for your question, yes, we are working on async support for pykafka. The async producer is being worked on in this issue: https://github.com/Parsely/pykafka/issues/124 -- feel free to contribute, or even simply +1 as a vote of confidence!

squarecog · on Aug 13, 2015

[disclaimer: I managed the data platform group, of which the storm/heron team was part, when heron was developed, and I'm still a Twitter employee; however, naturally, this is all my opinion and not an official position of my employer, etc etc].

Hi Andrew,

The Heron rewrite had more to do with what were at the time gross operational inefficiencies of Storm at very high scales, and problems diagnosing failures and bottlenecks. This is Storm 0.9 -- in the years since, I think Storm community in general and the Yahoo folks in particular have been working hard on addressing some of those issues, and a recent blog post from them indicated that some of the stuff we fixed in Heron is on their roadmap. Note that the Heron paper was published a year or so after the first Heron topology went into production inside Twitter. We had a very real problem that we needed to fix very quickly, and writing Heron was faster than making Storm work. Some of that was due to OSS challenges, some due to Storm's architecture fundamentals, some due to the specific people and backgrounds we had on the real-time compute team at that point.

The scales I am talking about are hundreds of nodes, not dozens, and an order of magnitude more messages per second. I am not surprised it works perfectly well for your use case (it worked fine while we were only putting tweets into it in 2013, as well -- that was about the size you are quoting, iirc).

Mesos not only existed when Storm was written, Storm ran inside Twitter on Mesos since before Storm was open-sourced. Mesos went into Apache incubator in 2011, while Storm did so in 2013. I'm not sure why you got the impression any of this was related to Mesos; it's true that we simplified a lot of operational complexity for Storm+Mesos by not writing our own Mesos scheduler, like Storm did, and just using Apache Aurora -- a decision that also meant we were able to use Aurora/Mesos clusters shared with other processes, which was nice; but that wasn't the prime motivation by a long shot.

The API was kept as a trade-off, to make migration of internal customers seamless. There are problems with the API, particularly around back pressure, but at the same time it was a straightforward API that got wide internal adoption, both in raw form and through Summingbird. So it made sense to evolve that part incrementally and deliver our internal customers the performance, observability, and reliability wins first, without having them rewrite a line of code, and tweak APIs over time to address the above-mentioned issues.

I'm biased of course, but I think our track record with open source contributions is still pretty good -- Scalding, Parquet, Aurora, Mesos, Finagle, Zipkin, and many more major projects with very wide adoption in the industry, which we still very actively contribute to and continue to evolve. At the same time, there's definitely a cost to open-sourcing projects, and those decisions get made on a case by case basis.

teacup50 · on Aug 5, 2015

Why do you find this surprising?

What other language and runtime platform provides:

1) Productive, high-level programming languages.

2) Portability to any OS.

3) High runtime performance for the server-side usage model.

4) Real shared-state SMP (necessary for memory and core efficiency).

5) Standalone, single-directory distribution of the applications and all library dependencies.

6) A broad, stable, mature base of well-documented, well-tested libraries.

The CLR has some advantages over the JVM, but it also shares most of the JVM's attributes.

dap · on Aug 6, 2015

I disagree that the JVM is the only platform on which you could (or would) build such a system. I know because we built a similar system at Joyent called Manta, which provides a distributed, Unix-based, map-reduce environment. It's built almost entirely in Node.js. We use nginx for serving raw data from the storage tier and PostgreSQL for storing and querying metadata (as well as compute job state), but job supervision and actual execution are implemented in Node.js.

dap · on Aug 6, 2015

Sorry for the self-reply, but to clarify: the _implementation_ is Node.js. Users can run their actual workloads using any program they want.

vegabook · on Aug 5, 2015

A mature, reliable, performant, but unadventurous platform which attracts corporates. I am not knocking it - I think it's obvious that the JVM is well suited, but isn't there room for at least one or two competitors on other platforms? Do any exist that you know of? I tend to be a bit more adventurous in my coding projects and given that embarking on projects requires one to be enthusiastic about one's tools, I cannot use this as Java leaves me cold. This sounds flippant but really I don't think it is. I love the idea of DAG workflows but I don't want to do it in Java, even with all its strengths.

squarecog · on Aug 5, 2015

Well, first, a lot of this stuff is actually written in Scala, which is like Java's younger and more fun cousin. Still runs on the JVM but has many fun language aspects I won't get into because there's already a billion posts online about that.

But second, not quite a "platform" but I've been enjoying Frank McSherry's blog posts in which he builds up a complete distributed dataflow engine in Rust: http://www.frankmcsherry.org/

dm3 · on Aug 5, 2015

The other benefit of the JVM is the multiple languages that run on it - did you try Clojure or JRuby already? Or do you mean Java as in JVM and you don't want that in your stack for some reason?

vegabook · on Aug 5, 2015

I am not a fan of any compatibility layer technology as I am a bit of a bare-metal purist, and I believe Linux does this already. The original post waxes lyrical about the Unix philosophy, but part of the Unix philosophy is supreme efficiency and this is violated by the JVM. I know I know - performance is comparable but then coding is a bit like art - you have to love what you are doing.

As for using other languages - I played with Python on Storm but I very quickly found that I had to basically use Java because the entire community and documentation was java-centric (last I checked - a few months ago).

I may be being obtuse here - in fact I know I am - but my point is that there is a large community out there which is not Java friendly (logically or not) and so I lament the fact that all the advancement in this new, very exciting, field, is JVM-based. My dislike for Java was sealed by the Bloomberg terminal API, some of which's basic functionality is 5 to 6 dots deep. Seriously I had 80+ character function calls. This.that.that.this.this.finally()!

Perhaps we'll just have to kick in, leave our biases behind and start running with the JVM.

FWIW my use case is streaming fixed income (bond pricing) data analysis. Python/Numpy is running out of steam fast for me and we're using quite a bit of C. I basically come from the scientific computing set. We're applying advanced statistical analysis to pricing (not HFT - we're operating in the 5 minute thru 1-week horizon - not 5ns).

pixelmonkey · on Aug 5, 2015

If you're interested in Python + Storm, we are making it happen via our open source project streamparse. See the project on Github[1] and also my PyCon US 2015 talk[2].

Your point about documentation is well-taken. We are trying to document streamparse + Storm usage from a Pythonista standpoint via our online documentation[3], e.g. here is our detailed Python API documentation[4].

[1]: https://github.com/Parsely/streamparse

[2]: https://www.youtube.com/watch?v=ja4Qj9-l6WQ

[3]: http://streamparse.readthedocs.org/en/stable/

[4]: http://streamparse.readthedocs.org/en/stable/api.html

teacup50 · on Aug 5, 2015

In one breath "I am a bit of a bare-metal purist" and in the next "As for using other languages - I played with Python on Storm".

So which one is it? :-)

srean · on Aug 6, 2015

I am not the OP but it is not necessarily a contradiction.

The underlying goal in Pythonic data-science is that one pushes as much of the computation into pre-packaged loops that have already been written in a close to the metal language (C, C++, Fortran). Well, that's as far as the intent goes, practice deviates from it by degrees.

This does not work quite as well in Java (or JVM) because JNI is just supremely god-awful. JVM semantics are overly strict, this over-specification kills optimization opportunities. Is heavy on memory, I would rather use the memory for loading more data than fill it with overhead. Finally there is this OOP culture that gets in the way.

That said, JVM is one of the most mature, and well engineered VMs out there (Java is another story), but not very well suited for number crunching, because there is more to number crunching than calling BLAS APIs.

vegabook · on Aug 6, 2015

last time I checked it was Java Scala or Python. None is bare metal. We use Python as glue. Any heavy lifting is done using Numpy ecosystem or C.

parasubvert · on Aug 6, 2015

Clojure is a fascinating language. Rich Hickey gives a lot of his rationale for why he picked the JVM for it [1], which boils down to VMs being the platforms of future, not OSes like Linux.

[1] http://clojure.org/rationale

infinite8s · on Aug 5, 2015

What do you find missing in the python/numpy space? Inability to handle analysis over streaming data?

vegabook · on Aug 5, 2015

Inability to scale easily to clusters, unless the problem is embarrassingly parallel (ours isn't always), and no equivalent to the conceptual beauty and efficiency of distributed directed acyclic graphs that Storm amongst others uses. We could hack Python up to do this, but if the problem became TRULY huge, hundreds or thousands of nodes, we'd probably end up spending more time building a processing infrastructure, and repaying technical debt, than doing the actual algorithms, while Spark / Storm etc already have massive scaling ability built in. Still love Python and especially the Numpy ecosystem though and we stream thousands of ticks per second through it no problem, with a bit of help from multiprocessing, MKL libs and Numbapro. But as we get towards tens of thousands we're hitting the buffers, and most importanly, problem complexity is rising so the DAG approach is looking very attractive.

infinite8s · on Aug 5, 2015

I'm looking to build something like Naiad on top of python, disco and numba. Mind if I pick your brain on the kind of tooling you'd like to see around big data python? My email is in my profile.

teacup50 · on Aug 5, 2015

That's why I highlighted the platform strengths -- what others platforms can reasonably compete?

That said, you can use these tools as complete applications without actually dipping your toes into Java (or even the JVM, other than running it). If you do want to write code directly against them, the choices (of which I'm sure you'll well aware) range from Clojure to Scala to Java to JRuby.

Scala is one ugly duck, but personally I find it to be a sufficient language for writing JVM-based code without losing my lunch.

threeseed · on Aug 6, 2015

Scala/Python using Spark Streaming.

a8da6b0c91d · on Aug 5, 2015

I'm sure I'm being stupid and underinformed, but it doesn't seem like these systems do anything you can't replicate easily enough with htcondor/gridengine and so-forth alongside cluster computing filesystems. It almost seems like Java people decided they wanted Java only in their stack, so they ignored all the existing unix cluster computing stuff.

Somebody's downvoting me but I doubt they know anything about streaming and short-job support in gridengine.

srean · on Aug 5, 2015

C++11 perhaps.

teacup50 · on Aug 5, 2015

Precluded by requirement #1 (which I say as someone who has written an awful lot of C++).

herewego · on Aug 6, 2015

That's subjective. I'm quite productive writing system software in C++11/14.