ACK of network transfer is available via the "require_ack_response" option. This option is what decides between at-most-once and at-least-once semantics. You need to choose, and you can choose.
Fluentd provides "buffer_type file" to buffer records on disk, so shutting down won't lose data. If you choose the memory buffer for performance reasons, Fluentd enables the "flush_at_shutdown" option by default.
You may also want to use the <secondary> feature. This lets you write a buffer chunk to another storage if the primary destination is still unavailable after "retry_limit" retries.
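To make this concrete, here is a sketch of an out_forward section combining these options. The match pattern, host, and paths are hypothetical; the option names follow the Fluentd v0.12 out_forward plugin:

```
<match app.**>
  type forward
  require_ack_response true    # wait for the receiver's ACK (at-least-once)
  buffer_type file             # disk buffer survives restarts
  buffer_path /var/log/fluentd/buffer/forward
  retry_limit 5
  <server>
    host 192.168.0.11
    port 24224
  </server>
  <secondary>
    type file                  # fall back to local files after retry_limit
    path /var/log/fluentd/forward-failed
  </secondary>
</match>
```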
Ah, thanks - require_ack_response sounds like what I was missing. Some blogs are from before this was added in 0.12 so I didn't know about it.
I am still interested in forwarder failure cases - I have replied to kiyoto's comment about the HA docs, which still describe some other cases that can lose messages.
In this case:
* The process dies immediately after receiving the events, but before writing them into the buffer.
Is it possible to require acknowledgement that the log event has been written to the buffer? Is that separate from what require_ack_response does?
Disclaimer: I'm the author of MessagePack for C++/Ruby and a committer on the Java one.
As for strings, JSON has to allocate memory and copy to deserialize strings because strings are escaped.
MessagePack doesn't have to allocate or copy because the serialized format of strings is the same as the in-memory format. But whether it actually avoids the allocation and copy depends on the implementation.
The C++ and Ruby implementations try to suppress allocation and copying (zero-copy). But the Java implementation doesn't support zero-copy so far (we have a plan to do so; here is the "TODO" comment: https://github.com/msgpack/msgpack-java/blob/master/src/main...).
As for the other types, the C++ implementation (and the new Ruby implementation, which is under development) has a memory pool which optimizes those memory allocation patterns. But it's hard to implement such optimizations for Java because the JVM (and the Dalvik VM) doesn't allow hooking object allocation.
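The point about string formats can be illustrated with a tiny sketch of MessagePack's fixstr encoding (taken from the published format spec; `pack_str` is a hypothetical helper for illustration, not the library API):

```python
import json

# MessagePack fixstr: one header byte (0xa0 | length), then raw UTF-8 bytes.
def pack_str(s: str) -> bytes:
    data = s.encode("utf-8")
    assert len(data) < 32  # fixstr covers lengths 0..31
    return bytes([0xA0 | len(data)]) + data

encoded = pack_str('say "hi"')

# The payload after the 1-byte header is the string verbatim, so a decoder
# can in principle return a slice of the input buffer without copying.
assert encoded[1:] == b'say "hi"'

# JSON must escape the quotes, so the wire bytes differ from the in-memory
# string and a decoder has to allocate and unescape into a new buffer.
assert json.dumps('say "hi"') == '"say \\"hi\\""'
```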
Interesting, thanks so much for the response.
I'll keep this in mind as I continue to develop my app. It's still between a custom byte stream, JSON, Thrift, etc.
But MsgPack looks interesting as well and, if anything, these blog posts have brought it into the light for me.
I looked at the Java class, and what might help is if you could set a buffer size, store the data in that buffer, and expand it if necessary. But that seems like a lot of work.
But yeah, not sure if you can optimize based on usage patterns due to the constraint you said.
In any case, great stuff and thanks for the info.
We're using MessagePack in a Rails application to log user behaviors and analyze them.
Compared with other serialization libraries such as Protocol Buffers, Avro or BSON, one of the advantages of MessagePack is compatibility with JSON. (In spite of its name, BSON has special types which cause incompatibility with JSON)
It means we can exchange objects sent from browsers (in JSON format) between servers written in different languages without losing information.
I wouldn't use MessagePack with browsers, but it's still useful with web applications.
If JSON compatibility is an issue, have you looked at UBJSON? http://ubjson.org/
May be a bit bigger than msgpack but is damn-near human readable even in its binary format and really easy to encode/decode. Also 1:1 compatibility with JSON.
Compatibility and simplicity were the core design tenets. It may not be the right choice, just throwing it out there in case it helps.
I looked at it. Its design process is not complete yet.
One strong negative point is that it enforces big endian integer encoding.
Another one is that it doesn't use the value space of tags as efficiently as MessagePack. I would use the unused space to encode small string sizes in the tag, since objects (associative arrays) generally have many short identifier strings as keys.
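For comparison, here is how MessagePack already spends its tag space on the short-key case, per the published format spec. This is a minimal, hypothetical encoder covering only the fixmap/fixstr/positive-fixint cases, not the library API:

```python
def pack(obj) -> bytes:
    """Encode small ints, short strings, and small maps the MessagePack way."""
    if isinstance(obj, int) and 0 <= obj <= 127:
        return bytes([obj])                      # positive fixint: 1 byte total
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        assert len(data) < 32
        return bytes([0xA0 | len(data)]) + data  # fixstr: length lives in the tag
    if isinstance(obj, dict):
        assert len(obj) < 16
        out = bytearray([0x80 | len(obj)])       # fixmap: size lives in the tag
        for k, v in obj.items():
            out += pack(k)
            out += pack(v)
        return bytes(out)
    raise TypeError(f"unsupported type: {type(obj)!r}")

# A one-entry map with a 2-char key and a small int costs just 5 bytes,
# versus 9 characters for the JSON text '{"id": 7}'.
assert pack({"id": 7}) == b"\x81\xa2id\x07"
```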
I sent these as comments and requests for change but haven't received any response yet. I don't know how open its design process is.
Sounds cool. I'll take a look. I think what this space (serializers) needs is objective, holistic evaluations of the pros and cons of different approaches. (disclaimer: I am involved with MessagePack, although not a committer on any of its drivers).
MessagePack includes a concept named "type conversion" to support types which are not supported by its wire format.
With the concept, we can serialize/deserialize user-defined classes as well as strings with encodings.
So far, MessagePack for Java, C++ and D implement the concept.
Although the original blog post focuses on JavaScript and browsers, MessagePack itself doesn't mainly focus on them.
A major use case of MessagePack is to store serialized objects in memcached. A blog post written by Pinterest describes this use case (http://engineering.pinterest.com/posts/2012/memcache-games/).
They use MessagePack with Python, which is faster than the JavaScript implementation. They could store more objects in a server without performance declination (e.g. gzip).
It's true that MessagePack is not always faster than JSON (e.g. within browsers), and it's not always smaller than other serialization methods (e.g. with gzip compression). So we should consider which serialization method is right for "my" case.
There are also general tendencies that help in choosing between MessagePack and JSON:
MessagePack is faster at serializing binary data such as thumbnail images.
MessagePack is better at reducing the overhead of exchanging small objects between servers.
JSON is better for use with browsers.
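The binary-data tendency can be sketched by framing raw bytes the way MessagePack's bin 8 type does (header byte 0xc4 plus a 1-byte length, per the format spec; hand-rolled here for illustration) and comparing with the base64 detour JSON requires:

```python
import base64
import json

# MessagePack carries raw bytes directly: a 2-byte header, then the
# payload verbatim -- no transcoding of the bytes themselves.
payload = bytes(range(64))
msgpack_framed = bytes([0xC4, len(payload)]) + payload

# JSON has no binary type, so the bytes must be base64-encoded first,
# inflating them by roughly a third before any JSON framing is added.
json_framed = json.dumps({"img": base64.b64encode(payload).decode("ascii")})

assert len(msgpack_framed) == 66       # 2-byte header + 64 payload bytes
assert len(json_framed) > len(msgpack_framed)
```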
> They could store more objects in a server without performance declination (e.g. gzip).
The performance declination argument is bullshit. Network's a million [0] times slower than gzip.
Truth be told, once you're on the network, you're already screwed w.r.t. most serialization. The only thing efficient compression/decompression is going to buy you is lower CPU (memcached servers run at like 2% CPU util, even under heavy load [1]).
Memcache at Facebook actually uses the ascii protocol, and the memcached implementation is a braindead strtok parser (some of our other stuff uses ragel -- you'll have a hell of a time out-optimizing ragel with the right compiler flags -- I've tried and failed).
Just use whichever serialization format has the best API, because I can say with near certainty that it's not going to be a perf problem for you if you're touching disk, network, etc.
[0] Obviously a made up number, but it's way slower. Especially if you're unlucky and lose a packet or something.
[1] With the exception of weird kernel spin lock contention issues, which can happen if you're not sharding your UDP packets well and trying to reply from 8+ cores on 1 UDP socket. You probably aren't.
I +1 that. I have working experience with MessagePack and I can confirm it works for the following use cases:
* RPC communication between servers where binary data is exchanged and its structure is not always the same (i.e. it's difficult to use something that requires an IDL).
* Serialization and storage of objects that will be sent over the network (note: you can batch MessagePack objects just by concatenating them).
* Communication between a server and a native mobile application. Native applications live in a binary world whereas Web applications live in a text-based world where JSON is better.
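The batching note above can be illustrated: because every MessagePack object is self-delimiting, a batch is just the objects' bytes concatenated. This minimal sketch handles only the positive-fixint case (hypothetical helpers, not the library API):

```python
# Positive fixint: integers 0..127 are encoded as their own single byte.
def pack_int(n: int) -> bytes:
    assert 0 <= n <= 127
    return bytes([n])

# "Batching" is plain concatenation -- no envelope or length prefix needed.
stream = b"".join(pack_int(n) for n in [1, 2, 3])

def unpack_stream(buf: bytes) -> list:
    """Peel self-delimiting objects off the stream one by one."""
    out, i = [], 0
    while i < len(buf):
        tag = buf[i]
        assert tag <= 0x7F  # only the fixint case is handled in this sketch
        out.append(tag)
        i += 1
    return out

assert unpack_stream(stream) == [1, 2, 3]
```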
The human readability argument is poor: the JSON that is sent over a network is not usually human readable, so you would use a prettifier to read it anyway. Moreover, a MessagePack message is standalone and self-describing, i.e. you don't need an IDL description to read it. So in both cases, reading the message is just adding another block to a pipeline...
That test you linked to, which claims that MessagePack is 4x faster, seems to rely on the serialized text staying in-process. The vrefbuffer is only zero-copy as long as you don't need to send it to any API which reads strings or char buffers (e.g. any RPC or network-oriented mechanism). Am I reading it right?
We're also using Fluentd, as well as our original JSON-based logging libraries.
Fluentd deals with JSON-based logs. JSON is good as a human-facing interface because it is human readable and grep-able.
On the other hand, Fluentd handles logs in MessagePack format internally. Msgpack is a serialization format compatible with JSON and can be an efficient replacement for it.
I wrote a plugin for Fluentd that sends those structured logs to Librato Metrics (https://metrics.librato.com/), which provides charting and dashboard features.
With Fluentd, our logs became program-friendly as well as human-friendly.
Spitting out sprintf output means parsing text later. JSON might be slow since it's text, but binary-based formats should be faster, especially when the log includes many integers.
Those concerns are addressed by the documentation: http://docs.fluentd.org/articles/out_forward#buffered-output...