Riemann, a distributed systems monitor (built in Clojure)

snowwindwaves · on Dec 24, 2012

I've been using Mango (http://mango.serotoninsoftware.com) since 2006 to monitor control systems, mostly for small hydro power plants but also communications networks, smart homes, solar arrays, etc. Unfortunately development on the freely available source has stopped, so I no longer get a better product every year.

On bigger projects I've used either Citect or Wonderware, both of which get the job done but show their age and are at times painful to use, although not as horrible as much of the other legacy control system HMI software out there.

Mostly data is collected by polling modbus slaves, although OPC would be another important protocol to support.

It seems like Wonderware or Citect is ready to be replaced by a distributed system that uses the web browser as the display client. Systems monitoring software such as Nagios, OpenNMS, and Cacti overlaps with the control system HMI software arena, as both:

1. display real time data, preferably with some context (e.g. gauges to indicate how close the variable is to its maximum or minimum limits, or to its alarm or shutdown thresholds) and sometimes overlaid on a diagram to assist in visualizing or understanding the process

2. "trend" (log and plot) historical data for analysis and reporting purposes. Better yet would be interactive plotting (zoom etc).

I've been wondering about graphite (which riemann can use) as part of the solution, and people seem to be producing great plots with d3.js. As an aside I've used kst and veusz for desktop interactive plotting with success.

In summary: if riemann supported the modbus protocol it could be useful for control systems.

aphyr · on Dec 24, 2012

1. The dashboard is in a rough spot right now--I haven't quite finished the transition to the next-gen dash--but it does do "realtime" visualization of events matching arbitrary queries with under 50-ms end-to-end latency. It'll push about a thousand events/sec, depending on size and rendering complexity. The websocket protocol is pretty straightforward, if you wanted to build one-off system diagrams with streaming updates.

2. Yeah, that would be great. The historical event store space is pretty terrible right now, and it's such a big problem that I doubt I could realistically tackle it. Librato Metrics and Hosted Graphite are both approaching this as a service, and there's OpenTSDB if you have Hadoop people on staff. Riemann has out-of-the-box integration with Librato and Graphite, but I haven't set up an OpenTSDB cluster yet.

Modbus: that'd be cool. Implementing a Riemann server (i.e. a thing that accepts events from the wire) is pretty straightforward, though I'd need to understand the protocol. If you're interested in building it, I'm happy to discuss how.
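For readers unfamiliar with the protocol being discussed, here is a minimal Python sketch of what polling a Modbus TCP slave involves on the wire: building a "read holding registers" request frame, then turning the returned register values into event maps. The frame layout follows the standard Modbus TCP framing; the helper names (`build_read_request`, `registers_to_events`) and the event field choices are hypothetical, and this is not an existing Riemann feature.

```python
import struct

READ_HOLDING_REGISTERS = 0x03  # standard Modbus function code


def build_read_request(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus TCP 'read holding registers' request frame."""
    # PDU: function code, starting register address, number of registers.
    pdu = struct.pack(">BHH", READ_HOLDING_REGISTERS, start_address, quantity)
    # MBAP header: transaction id, protocol id (always 0 for Modbus),
    # count of bytes that follow the length field (unit id + PDU), unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def registers_to_events(host, start_address, values):
    """Turn polled register values into event maps.

    The field names here are illustrative assumptions, not a defined mapping.
    """
    return [
        {"host": host, "service": f"modbus register {start_address + i}", "metric": v}
        for i, v in enumerate(values)
    ]


frame = build_read_request(transaction_id=1, unit_id=1, start_address=0, quantity=2)
events = registers_to_events("plant1", 100, [5, 7])
```

A small poller daemon would send `frame` over a TCP socket to the slave, decode the reply, and relay the resulting events onward.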

aphyr · on Dec 24, 2012

Thanks for your interest everyone. I'll try to answer any questions here. Going through some rough health stuff at the moment so I won't be on IRC, but I do read the backlog and will respond when I get a chance. Cheers! :)

dgtized · on Dec 24, 2012

I'm having a little trouble following your examples. In particular, some of the examples are wrapped in (streams), which I am interpreting as the data source to query, but then many of the examples are just a bare (where) with a clause. Or is that the final target, i.e. you wrap it in streams if you want to make a new stream? The system looks pretty slick, but I am having some problems understanding some of the core concepts in the query DSL.

aphyr · on Dec 24, 2012

Ah, yeah I should standardize the docs a bit. It'll be clearer when you've looked at the stock config: (streams ...) just denotes the section of the config where streams live. Since most streams are composable, I sometimes omit the context.

It might be easier to think of streams as literal streams, rivulets, deltas, and tributaries, which events flow through, rather than a query language with well-defined clauses like SQL.
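To make the "events flow through streams" picture concrete, here is a rough Python analogy, not Riemann's actual Clojure API. Each stream is a function that takes an event (a plain map) and passes it to child streams. The Python functions `where`, `collect`, and `streams` below are hypothetical stand-ins written for this sketch.

```python
def where(predicate, *children):
    """Return a stream that passes an event to its children only if predicate holds."""
    def stream(event):
        if predicate(event):
            for child in children:
                child(event)
    return stream


def collect(sink):
    """Return a stream that appends every event it sees to the given list."""
    def stream(event):
        sink.append(event)
    return stream


def streams(*top_level):
    """Group top-level streams; every incoming event is handed to each of them."""
    def dispatch(event):
        for s in top_level:
            s(event)
    return dispatch


everything = []
alerts = []
handle = streams(
    collect(everything),
    where(lambda e: e.get("state") == "critical", collect(alerts)),
)

handle({"host": "db1", "service": "disk", "state": "ok"})
handle({"host": "db1", "service": "disk", "state": "critical"})
```

After the two calls, `everything` holds both events and `alerts` holds only the critical one. In this analogy a bare `where(...)` is a stream in its own right, and `streams(...)` is just the place where top-level streams are listed.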

d--b · on Dec 24, 2012

This looks pretty cool. Has anyone tried it and liked it?

ispolin · on Dec 24, 2012

I use it (and like it) to monitor a middleware we wrote. The developer, Kyle Kingsbury, has a pretty good talk about it here: http://vimeo.com/45807716

jondot · on Dec 24, 2012

Looks great, I'd love to read the code. Any idea if this is in the same space as Esper?

aphyr · on Dec 24, 2012

I'd like to add to this: Riemann is more opinionated than Esper. While you can use it as a general event processing system, it makes some assumptions about event structure which are geared towards application monitoring. For instance, Riemann events have a fixed schema with fields like 'host'. You can treat them as open maps and add arbitrary kv pairs in your streams, but that extra information won't necessarily be serialized to fixed-schema formats like the protocol buffers interface.
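A small Python sketch of the point about open maps versus fixed-schema serialization. The field list below is an assumption for illustration (only 'host' is named above), and `to_fixed_schema` is a hypothetical helper, not part of any Riemann client.

```python
# Assumed set of fixed fields, for illustration only.
FIXED_FIELDS = ("host", "service", "state", "time", "description", "tags", "metric", "ttl")


def to_fixed_schema(event):
    """Keep only the keys a fixed-schema wire format would carry."""
    return {k: event[k] for k in FIXED_FIELDS if k in event}


event = {"host": "web1", "service": "http", "metric": 0.2, "datacenter": "us-east"}
wire = to_fixed_schema(event)
```

Here the extra `datacenter` pair is usable inside the stream logic but does not survive the fixed-schema step.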

Riemann is also more general than Esper, in that you can define arbitrary operations on events. It sacrifices having an up-front query language in favor of composable functions with stateful side effects. If you want to write a stream that restarts an EC2 instance on failure, it'll be a composable first-class citizen and you can write it right in the config file. Same goes for a stream that pulls in, say, parallel colt to do some heavy statistical lifting. On the other hand, Riemann doesn't include the full range of Esper queries as builtin streams yet, and the ones that are there haven't been optimized to the same degree.

Hope this helps clarify things. :)

necrodome · on Dec 24, 2012

From the mouth of the project's owner:

From skimming the docs, it looks like Esper is much bigger, much cooler, and with a more abstract version of events. It implements a lot of the primitives I've been considering but haven't built yet. It looks more difficult to set up, and has a commercial offering for support and HA; neither of which are present in Riemann right now.[1]

I recently finished building a monitoring system using Esper and JRuby since the client asked specifically for that, but I wished I had used Riemann from the beginning.

[1] https://groups.google.com/forum/#!msg/riemann-users/GhVMYJow...

jondot · on Dec 24, 2012

Thanks

astine · on Dec 24, 2012

I built something like this once for a client, also in Clojure. I'm going to download and read the code to see how it compares.

Heliosmaster · on Dec 24, 2012

Am I the only one troubled by the name? Riemann's interests were far from monitoring systems...

aphyr · on Dec 24, 2012

Riemann originated as a system for discrete calculus over metric streams; e.g. Riemann sums.
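As a quick illustration of what a Riemann sum over a metric stream means, here is a minimal Python sketch of a left Riemann sum over timestamped samples. It assumes samples arrive as (time, metric) pairs sorted by time; `left_riemann_sum` is a name made up for this example and is not taken from the project's code.

```python
def left_riemann_sum(samples):
    """Approximate the integral of a metric over time.

    samples: list of (time, metric) pairs, sorted by time.
    Each metric value is weighted by the time until the next sample.
    """
    total = 0.0
    for (t0, m0), (t1, _) in zip(samples, samples[1:]):
        total += m0 * (t1 - t0)
    return total


# A rate of 2.0/s for 1 second, then 4.0/s for 2 seconds.
area = left_riemann_sum([(0, 2.0), (1, 4.0), (3, 1.0)])
```

With these samples the sum is 2.0 * 1 + 4.0 * 2 = 10.0, e.g. the total count implied by a sampled rate.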

gjvc · on Dec 24, 2012

systems monitors are often in themselves too complex :-(

aphyr · on Dec 24, 2012

You're right to be concerned about complexity: simple things are easy to understand, easy to predict, and easy to change.

That said, I think you'll find many of the ideas in Riemann to be radically simple. The config file is just a Clojure program. Streams are just functions that take events. Events are just maps of keys to values. Everything is an event: there is no concept of a first-class host or service, no need to update the config when you add a host, and no poller loops.

Riemann tries to draw strong boundaries between the different layers of monitoring. It speaks a simple network protocol and interoperates with other systems for event collection, visualization, alerting, and storage, instead of building in those systems. In many ways Riemann is defined not by what it includes, but by what it leaves out.

That said, there's a lot of work required to make simple abstractions behave correctly, especially around IO and error handling. Wherever possible I try to draw clear internal boundaries to isolate this complexity, but it's still there. If you have specific complaints about code or interfaces which seem too complex to you, I'd be happy to try and explain or change them.

pjscott · on Dec 25, 2012

The tone around here is often negative, but I want to give you a huge compliment: you understand simplicity.

Reducing hairy things to simple abstractions can save weeks of work in a matter of minutes. It is the single most powerful programming technique I know of. And I'm going to seriously consider switching a bunch of stuff over to Riemann.

aphyr · on Dec 25, 2012

Thank you. :)

cinch · on Dec 25, 2012

hi, this project looks really neat! i'm evaluating icinga web + pnp4nagios at work, but this could be a viable alternative. cheers and glad you put this out there :-)

i see that you use Protocol Buffers. from google's page, it seems like they only work with C++, Java or Python. now this could be a problem for us. what if i want to pull events from a bash script, a delphi gui app, SNMP, Dell idrac interface, or any other event? do they have to interface over Protocol Buffers, or am i missing something here? would i have to write a glue layer?

and what if you have two separate networks, and want one server to forward data for its entire lan to the other server, to process and graph it?

aphyr · on Dec 26, 2012

There are protobuf bindings for many languages, though I hear node.js was a bit of a pain. Check the clients page and see if the language you need is there; I can help you build one if not.

For pulling from other tools, I usually write a little daemon to poll and relay the data. See riemann-tools for a collection of existing tools to do just that; and you can require it as a library to write your own in just a few lines of ruby.

Forwarding between servers is built in; it's easy to aggregate events in hierarchies for large-scale analysis.

cinch · on Dec 26, 2012

awesome, thanks for replying :)