Thrift, Protocol Buffers and JSON comparison (bouncybouncy.net)
32 points by ropiku on March 1, 2009 | 13 comments



I abandoned Protocol Buffers because the Python implementation was too slow. The problem is that Google hasn't written a C-extension yet because it wouldn't be compatible with AppEngine. It's a known problem that has gone undocumented.

http://groups.google.com/group/protobuf/browse_thread/thread...


The tests are using simplejson. Someone over at reddit added cjson results (impressive) to the comparison: http://www.reddit.com/r/programming/comments/811gl/comparing...


The tests apparently were not run with simplejson's C extension speedups compiled in. I did so: http://gist.github.com/72412

Simplejson was slightly faster on two out of three tests. Consistently so, when I re-ran the tests.

Test environment: py2.6 on Mac OS X, with simplejson 2.0.9 and python-cjson 1.0.5

Test script: http://gist.github.com/72413

Also, I changed the test script from using time.time() to time.clock(), which, according to the Python docs, is what should be used for benchmarking on Unix.
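
For reference, the comparison boils down to something like this (a minimal sketch, not the linked gist; the sample record and iteration count here are made up):

    import time
    import simplejson  # 2.0.9, with the C speedups compiled in
    import cjson       # python-cjson 1.0.5

    # Made-up record, standing in for the DNS-style data in the benchmark.
    record = {"host": "example.com", "ttl": 3600,
              "addrs": ["10.0.0.1", "10.0.0.2"]}

    def bench(name, encode, decode, n=100000):
        # time.clock() measures CPU time on Unix, which is what we want
        # for a tight serialization loop.
        start = time.clock()
        for _ in xrange(n):
            decode(encode(record))
        print "%s: %.3fs" % (name, time.clock() - start)

    bench("simplejson", simplejson.dumps, simplejson.loads)
    bench("cjson", cjson.encode, cjson.decode)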


For handling protocol buffers in Python, it is much faster to generate the C++ protocol buffer wrappers and then SWIG them. It is bothersome to regenerate the wrappers every time you change the proto definition, though.


I think the schema is causing this. Lists of strings of non-fixed size aren't going to yield good results, as any serialization framework now has to do work to find the delimiters of each string. In this case you could store IP addrs and all the other DNS fields as ints and you should see a massive speedup. This would probably be closer to the actual workload Google or FB sees - why would they be serializing huge records of data that's already been encoded into a human-readable / string format?
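
For what it's worth, packing an IPv4 address into a fixed-width int is a one-liner with the standard library (a sketch, not taken from the benchmark code):

    import socket
    import struct

    def ip_to_int(addr):
        # "192.168.0.1" -> 3232235521: a fixed-width int instead of a
        # variable-length string, so the serializer has no framing to do.
        return struct.unpack("!I", socket.inet_aton(addr))[0]

    def int_to_ip(n):
        return socket.inet_ntoa(struct.pack("!I", n))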


Your comment brought up a good point, so I ran some more tests with ints.

http://bouncybouncy.net/ramblings/posts/json_vs_thrift_and_p...

Protocol buffers won hands down as far as space used...

The speed issues are still there, but I'm sure that over time things will improve. If the C extension for simplejson can speed up serialization by an order of magnitude, I have no doubt that similar improvements can be made to protocol buffers and thrift.


The string could be prefixed with its length.


I thought this was exactly how protocol buffers worked with non-fixed-length fields. Doesn't it start the field with the length of the string? I'm not sure how thrift works, but probably the same way.

(Not speaking from experience, just from what I remember of the format when I read the specs).
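
That matches my reading of the wire format: string/bytes fields are "length-delimited", i.e. a varint length followed by the raw bytes. A rough sketch of just the length-prefix part, ignoring field tags (not the real protobuf library):

    def encode_varint(n):
        # Protobuf-style base-128 varint: 7 bits per byte, high bit set
        # while more bytes follow.
        out = []
        while True:
            b = n & 0x7f
            n >>= 7
            if n:
                out.append(chr(b | 0x80))
            else:
                out.append(chr(b))
                return "".join(out)

    def length_prefixed(s):
        # A string field's payload: its length as a varint, then the bytes.
        return encode_varint(len(s)) + s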


Yeah, but the whole point is you don't have a fixed-length record. And in the example given you shouldn't be using strings at all - integers will suffice - that's the real problem with this benchmark.

You might as well be testing how quickly thrift / pb / json could serialize / deserialize pickled blobs. You're not giving thrift or pb the data they're designed to perform well with, so the fact that they fall behind isn't surprising.

I'm sure a Tesla would suck compared to an ancient pickup at helping me pick up a couch from Craigslist, but I wouldn't say the pickup is a better car because of that.


I suspect that he forgot to add the speedup option for pb:

option optimize_for = SPEED;


Although I'm not a pb user, from reading around it's been said that this flag doesn't help the Python bindings to PB.


It's hard for me to read "pb" in a Python serialization context as anything other than Perspective Broker.


Any ideas as to why the YAML code is so slow?



