
Question as someone new to graph databases: Are there any open source graph databases worth looking into?



We're using TitanDB. One of the main benefits for us is that AWS has provided backend integration with DynamoDB. This affords you practically infinite and painless scaling on a pay-as-you-go model. Love it.

https://aws.amazon.com/blogs/aws/new-store-and-process-graph...


Depends on what kind of data and graph you are going to store/use. Neo4j is quite popular, Cypher isn't very hard to learn, and it has lots of examples. Might be a good choice for a beginner.

https://en.wikipedia.org/wiki/Graph_database#List_of_graph_d...


> Cypher isn't very hard to learn

Oh but I love a challenge. Are there reasons to choose cypher besides a gentle learning curve?


Not really. If you learn Cypher, you should be able to pick up the basics of Gremlin, SPARQL, or other graph query languages in a few hours.

There was a post about enabling SPARQL in Neo4j, but when you install Neo4j it comes with Cypher by default (I'm not sure whether it supports anything else out of the box).

I use Apache Jena + SPARQL, but had to use Neo4j to help with a master's thesis. It took me a few hours of "how the heck do I do the same thing here that I'd do in SPARQL?", plus some reading of the tutorials.

Edit: some old post with example of Cypher, Gremlin and SPARQL: http://kinoshita.eti.br/2014/09/09/cypher-gremlin-and-sparql...


You can definitely run Gremlin queries against Neo4j through a couple of methods.

One example: https://github.com/thinkaurelius/neo4j-gremlin-plugin

You can also use the TinkerPop 3 or Blueprints APIs to access your graph with Gremlin.


Cypher expresses the graph patterns you're looking for in an ASCII-art syntax, so you don't lose sight of the core of your question. On top of that you get filtering, projection, aggregation, and pagination.

The most fun part is in-query dataflow, which lets you pass information from one query part to the next (projected, aggregated, ordered, etc.).

There are also really handy collection and map functions, which save a lot of round trips between client and server.

See: http://neo4j.com/developer/guide-sql-to-cypher/
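
For a flavor of what that dataflow looks like, here's a minimal sketch using the official neo4j Python driver against a hypothetical social graph (the labels, relationship types, and credentials are all made up for illustration):

  from neo4j import GraphDatabase  # pip install neo4j

  query = """
  MATCH (p:Person)-[:FRIEND]->(f:Person)
  WITH p, count(f) AS friends          // aggregate, then flow into the next part
  WHERE friends >= 3
  MATCH (p)-[:LIVES_IN]->(c:City)
  RETURN c.name AS city, collect(p.name) AS people
  ORDER BY city
  """

  driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
  with driver.session() as session:
      for record in session.run(query):
          print(record["city"], record["people"])
  driver.close()

The WITH clause is the dataflow: the aggregate computed by the first MATCH is handed to the second, and collect() returns the names as a list, so the whole answer comes back in one round trip.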


But if you want to scale to more than two instances you will have to pay for a $50k license. Fifty thousand dollars. Which is far too expensive for startups.


Look, I only wrecked a semester of research by chasing the slippery promises of graph databases. Don't wreck your entire startup on them.


Explain


ConceptNet [1] started out as an academic project that I was responsible for for a while. Since then I've left to start a company but I still maintain ConceptNet and build lots of stuff on it.

[1] http://conceptnet5.media.mit.edu

Here's a list of databases, some of them graph databases, some of them barely databases, where I've tried to store and look up edges of ConceptNet:

  - SQLite
  - PostgreSQL
  - MongoDB
  - Some awful IBM quad-store
  - HypergraphDB
  - Tinkerpop
  - Neo4J
  - Solr
  - Riak
  - SQLite with APSW to speed up importing
  - Just a hand-rolled hashtable on disk
Here are the systems that have succeeded to any extent, in that I could do simple things with them and they didn't collapse:

  - PostgreSQL
  - SQLite with APSW to speed up importing
  - Just a hand-rolled hashtable on disk
The time when I tried Tinkerpop, HypergraphDB, and Neo4J because I had a graph and graph databases are supposed to be good at graphs was particularly terrible. Graph databases seem to only be good at dealing with graphs so small that anything can deal with them.

If this has changed, please point me at an open-source graph database that's not terrified of gigabytes. (No trying to sell me SaaS, please.)


Graph DB technology has been advancing fast over the last few years, and more evolutions are coming down the pipe. For example, Titan and Blazegraph are distributed and can handle billions of edges, and Blazegraph can be GPU-accelerated, which "demonstrated a throughput of 32 Billion Traversed Edges Per Second (32 GTEPS), traversing a scale-free graph of 4.3 billion directed edges in 0.15 seconds" (https://www.blazegraph.com/product/gpu-accelerated/).

NB: TinkerPop is not a graph DB -- it's a graph software stack / computing framework for graph DBs (OLTP) and graph analytic systems (OLAP). Since TinkerPop is integrated with almost all of the graph DBs and graph processing engines, its mailing lists are a good place to discuss and get help with graph-related projects.

[1] http://tinkerpop.incubator.apache.org/

[2] TinkerPop / Gremlin Users Mailing List http://groups.google.com/group/gremlin-users

[3] TinkerPop Developer Mailing List http://mail-archives.apache.org/mod_mbox/incubator-tinkerpop...


I like how at least the BlazeGraph people are talking about billions of edges and not thousands, but I'm not sure that's something I could use. That seems to be a "pre-order" page, so it sounds neither open source nor existent. And I'm trying to figure out what their normal non-GPU non-distributed software is, but it seems to mostly be a pile of Javadocs.

Using distributed computing on mere gigabytes of data is silly.

I think TinkerPop was something else back in 2011, but apologies if I've used the wrong terminology.


Knowing the ConceptNet project a little, I still don't understand the workload/algorithms you run on the database. In other words, "this doesn't work for us" doesn't help anybody.

It really depends on the kind of algorithm you run on the database.

Among open-source projects, in read/write mode, no DB will help you, since you end up loading everything into memory. As a noob NLP user, I'd rather use something like AjguDB: https://github.com/amirouche/ajgudb


Interesting. I suppose most graph DB operations could easily enough be broken down into a series of steps performed in a more scalable but slower way.

Did your hand-rolled hashtable have any characteristics that would make its performance difficult for a smarter optimizer (if such a thing existed in Neo4j) to match?

Can you pseudocode an example slow query/operation and indicate how many edges/vertices were being considered at each step?

Sorry to ask these kinds of questions, I'm just really curious about the situation you described.


The failures of these databases were a lot more fundamental than I think you're looking for. And so far it hasn't been a trade-off where a non-graph DB has been more scalable but slower; instead, non-graph DBs have been more scalable and faster.

Here's what I have to be able to do in the database:

1. Import millions of edges from a flat file (time limit: 24 hours)

2. Query any node to return up to 100 edges connected to it (time limit: 100 milliseconds)

3. (nice to have) Find the maximal core of nodes that all have degree at least n to each other (time limit: a few hours)

4. Iterate all the edges between the nodes in a specified subset, such as the degree-3 core, which may still be millions of edges (time limit: a few hours)

#3 is optional, and the alternative is to export all the edges and compute it outside the database. But it's the only thing here that's actually a graph algorithm. However, every open-source graph database I've tried is orders of magnitude too slow at one of the other steps. They either fail at importing, fail at iterating, or fail to respond to trivial queries in a timely manner.
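
For what it's worth, the export-and-compute route for #3 is just the standard peeling algorithm. A rough Python sketch, assuming the exported edges fit in memory as an adjacency dict (names are illustrative):

  def k_core(adj, k):
      # adj: dict mapping node -> set of neighbours (undirected)
      deg = {v: len(ns) for v, ns in adj.items()}
      removed = {v for v, d in deg.items() if d < k}
      queue = list(removed)
      while queue:
          v = queue.pop()
          for u in adj[v]:
              if u not in removed:
                  deg[u] -= 1           # peeling v lowers its neighbours' degrees
                  if deg[u] < k:
                      removed.add(u)
                      queue.append(u)
      return {v for v in adj if v not in removed}

Peeling is linear in the number of edges, which is part of why computing it outside the database is viable within the "few hours" budget.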

I forgot to mention one other non-graph-database system that met my requirements, which is Kyoto Cabinet. The main downside of it is the GPLv3 license.
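
And "a hand-rolled hashtable on disk" is less exotic than it sounds. A rough sketch with Python's stdlib dbm, keyed by node (the ConceptNet-style IDs are made up, and a real million-edge import would batch these writes):

  import dbm, json

  # Import: append each edge to the adjacency list of its start node.
  with dbm.open("edges.db", "c") as db:
      def add_edge(start, rel, end):
          key = start.encode()
          edges = json.loads(db[key]) if key in db else []
          edges.append([rel, end])
          db[key] = json.dumps(edges).encode()

      add_edge("/c/en/cat", "/r/IsA", "/c/en/animal")

  # Query: up to 100 edges connected to a node, in one disk lookup.
  with dbm.open("edges.db", "r") as db:
      print(json.loads(db[b"/c/en/cat"])[:100])

One lookup per query is what keeps requirement #2 comfortably under 100 ms.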


Can you explain how you were using PostgreSQL as a graph database?


FYI: See this recent paper by Google Research and IBM Watson Research:

SQLGraph: An Efficient Relational-Based Property Graph Store http://research.google.com/pubs/archive/43287.pdf

Previous discussion: https://news.ycombinator.com/item?id=11101013


I mean, I was putting a graph in PostgreSQL; I don't know if that makes it a "graph database". Table of nodes, table of edges.
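
For concreteness, that layout is roughly the following (a sketch using Python's stdlib sqlite3 so it's self-contained; the same two tables and index work in PostgreSQL):

  import sqlite3

  con = sqlite3.connect("graph.db")
  con.executescript("""
      CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
      CREATE TABLE IF NOT EXISTS edges (src INTEGER, dst INTEGER, rel TEXT);
      -- this index is what makes "all edges of node X" a fast lookup
      CREATE INDEX IF NOT EXISTS edges_by_src ON edges(src);
  """)
  con.execute("INSERT OR IGNORE INTO nodes(name) VALUES ('a'), ('b')")
  con.execute("""INSERT INTO edges
                 SELECT s.id, d.id, 'linksTo' FROM nodes s, nodes d
                 WHERE s.name = 'a' AND d.name = 'b'""")
  for row in con.execute("""SELECT n2.name, e.rel FROM edges e
                            JOIN nodes n1 ON n1.id = e.src
                            JOIN nodes n2 ON n2.id = e.dst
                            WHERE n1.name = 'a' LIMIT 100"""):
      print(row)

Bulk imports then go through COPY (in PostgreSQL) or executemany, and the index keeps per-node lookups fast.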


I didn't pay $50k. Make them an offer. Seriously.


Not entirely accurate. They have a startup program: Neo4j Enterprise is available for free to startups with up to 20 employees.

Source: http://neo4j.com/startup-program/


Cayley is a good option; we use it in production.

https://github.com/google/cayley


Might I ask what sort of dataset size, servers, etc. you are using? I'm looking for a graph database and Cayley seems the best fit, though I'm not sure what sort of limits on the data there would be in the real world.


Cayley can store 130 million quads (2 nodes + connecting edge) on a 20 GB hard drive. Join #cayley on freenode and https://groups.google.com/forum/#!forum/cayley-users and be part of our community!


Virtuoso does 2 billion in 50 GB, just so you know. And hard graphs like the complete UniProt data at 20 billion+ in 800 GB.


Cayley has been stable for us so far, but I can't vouch for its scalability as our database is very small: less than 6M quads, so an 8 GB machine is more than enough.


What's the story on inserting data into Cayley? I think every single code example I have seen on it only shows traversing graphs with it.

Also, if you don't mind me asking, how does it not being a property graph affect modelling your data and queries? At a glance it seems that queries would get significantly more complex if you wish to take several properties of a vertex into account.


Cayley can be run in two ways: as HTTP service or as a Go library that you import from your Go app. To insert into Cayley when it's running as HTTP service you can do something like: `curl http://localhost:64210/api/v1/write -d '[{ "subject": "Krissy", "predicate": "loves", "object": "Justin Trudeau"}]'`.
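
The same write in Python, for anyone allergic to curl (same endpoint and payload as above; assumes the requests package):

  import requests  # pip install requests

  quad = [{"subject": "Krissy", "predicate": "loves", "object": "Justin Trudeau"}]
  r = requests.post("http://localhost:64210/api/v1/write", json=quad)
  print(r.status_code, r.text)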

To insert in the 'embedded' mode you can do: https://github.com/google/cayley/wiki/Cayley-Go-API-(as-a-Li....

Join #cayley on freenode and https://groups.google.com/forum/#!forum/cayley-users and get help from our community.


Or if you like the idea of using a graph database developed by a secretive organisation full of geniuses dedicated to collecting and organising the world's information, but would rather it was public-sector, there's Gaffer:

https://github.com/GovernmentCommunicationsHeadquarters/Gaff...


Is Cayley mature enough for production? I thought it was still relatively new. Would love to know a little more about how you're using it.


It's been stable for us since deployment eight months ago, although we haven't pushed it too hard - our data is relatively small and the queries not too complex.

From what I've seen, our use has a lot in common with Seed-DB, only in a different economic sector/activity.


There is a similar question on Cayley's google group - https://groups.google.com/forum/#!topic/cayley-users/nirhCbq...

Also, I know it's being used by some companies. You can ask the people who use it directly on IRC - #cayley (freenode).


Yes. I love the TinkerPop stack (http://tinkerpop.incubator.apache.org). I am currently writing a Python library called gizmo (https://github.com/emehrkay/gizmo) as I develop an application around it.


We use http://orientdb.com/orientdb/, seems decent so far.


Neo4j is a very good option.


ArangoDB is a free, open-source, multi-model NoSQL DB that has decent¹ performance with graph support:

https://www.arangodb.com

¹ https://www.arangodb.com/2015/10/benchmark-postgresql-mongod...


"The performance will suffer if the dataset is much bigger than the memory."

That is a huge drawback when compared to relational databases.

A good follow-up question would be: which open-source graph databases can reasonably import and store graph data that's not small -- that is, more data than fits in RAM? Without proprietary extensions?


One of the developers of ArangoDB here.

Let me explain this quotation. When your graph data (including indices) no longer fits into the RAM of a single server, you can either live with the higher latency of loading data from disk, or you can use sharding, which will lead to communication overhead and therefore slower traversals.

That does not mean that things stop working, but performance will suffer: you can no longer visit tens of millions of nodes per second in a traversal, as you can in RAM on a single server.

If you actually only traverse a much smaller hot subgraph, I would probably go for the disk-based single-server approach.

If your graph has a natural, known clustering, then an optimized sharding solution with fine-tuned sharding keys is probably your best bet.

You can do all this with ArangoDB.

However, graph traversals vary greatly in many respects, and your mileage may vary accordingly, with any approach.

I would love to chat in more detail about your use case.


Looks like they have persistent indexes on the roadmap (version 3.0), which might help: https://www.arangodb.com/roadmap/


Blazegraph, Virtuoso, and Jena TDB can all easily load large to very large graphs.


How much data are you dealing with? In my (admittedly limited) experience, it's usually cheaper to throw money at hardware than other options.


Throwing money at RAM because most graph databases haven't figured out how to use the disk effectively is not a good use of money. What's cheaper than buying all the RAM in the universe is figuring out a different system besides a graph database that does the job.

A good example would be the graph of Wikipedia links. About 100 million edges among 5 million nodes, last I checked. The nodes have large differences in degree.

The raw data for this is not the slightest bit large. We're only talking about gigabytes. But it would absolutely destroy Neo4J to even try to import it, to say nothing of running an interesting algorithm that justifies using a graph database on it, and Neo4J seems to be everyone's favorite open-source graph database for some reason.


There are multiple systems out there; however, I have my doubts. It is important that your data does not get corrupted and that your transactions do not get lost. Furthermore, speedups are possible with certain indices. That is why I personally would want to see some more safety/speed analysis and comparisons between the different systems.



Look at Blazegraph, an open-source GPU-accelerated distributed graph database.

See previous discussion: https://news.ycombinator.com/item?id=11197880


In the RDF space there are a whole bunch. Graph as in SPARQL gives you: Virtuoso, Blazegraph, Jena, RDF4J, Ruby.rdf.

There are more, but these are open source and I know them. And many more commercial ones.


http://stingergraph.com/ - From Georgia Tech


Virtuoso


(Full disclosure: I'm the author, and we are VC-backed.) https://github.com/amark/gun is an open-source graph database with Firebase-like realtime synchronization.


We've had this discussion before, but that product is not a graph database; it has no graph traversal features.


Dijkstra's algorithm is wonderful, but it is by no means a requirement for being a graph database. A graph database is just that: a database composed of nodes that can interconnect into a graph. GUN supports this and allows for traversing the graph. We haven't implemented Dijkstra's algorithm, which is what your "discussion" refers to.


We've never talked about Dijkstra as far as I recall. Your product doesn't support graphs any more than, say, MongoDB does, because ultimately all you are doing is loading an object from JSON and then sending it to the consumer. This is in contrast to true graph databases, whose main selling point is their ability to efficiently traverse, filter, and aggregate large graphs to find the answer to some question, and then send the answer to the consumer.

Gun doesn't meet anyone's definition of "graph database" other than your own. If I load some JSON from a URL, and use lodash to pluck some data out of it, is it a graph database?


Dijkstra was your complaint about shortest path.

GUN can do efficient traversal and filtering, and this is going to be even better in our 0.5.x release with lexical cursor support.

By "anyone" do you mean Wikipedia's? https://en.m.wikipedia.org/wiki/Graph_database , because GUN does match its definition. Although we haven't implemented Dijkstra's.

I'm out in France right now and just boarded a plane to Slovenia, so I won't be able to reply again. Have a good one.


Are you inventing your own new, hipster definition of a "graph database" here? I've used graph databases a lot, and I never heard of any of the requirements you've listed.


Not at all, but maybe I'm not being clear. My definition of "graph database" is "a database which can efficiently represent graphs and offers functionality for traversing them, in order to allow queries such as 'find the business relationships between user X and user Y based on who they've worked with' (à la LinkedIn)".
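
To make that concrete, here's the shape of that query as a client-side sketch (hypothetical data; the point of a real graph database is to run the equivalent traversal server-side instead of shipping the whole graph to the client):

  from collections import deque

  def shortest_link(worked_with, x, y):
      # breadth-first search over a dict mapping person -> set of colleagues
      prev, queue = {x: None}, deque([x])
      while queue:
          v = queue.popleft()
          if v == y:                      # walk back to recover the chain
              path = []
              while v is not None:
                  path.append(v)
                  v = prev[v]
              return path[::-1]
          for u in worked_with.get(v, ()):
              if u not in prev:
                  prev[u] = v
                  queue.append(u)
      return None

  print(shortest_link({"X": {"A"}, "A": {"Y"}}, "X", "Y"))  # ['X', 'A', 'Y']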

This is not what gun does. It alternately calls itself "the simplest database out there", "not a database" and "a distributed cache". It provides a mechanism for sharing a list of objects across multiple peers, but must transfer all of the data to each peer. It is conceptually similar to downloading a large chunk of JSON from a server and using lodash, ramda etc to query it, but no one would call that a graph database.


> a database which can efficiently represent graphs and offers functionality for traversing them

The first part matches the commonly accepted definition (the one that has been around for about 50 years). The second part is your own invention.

> This is not what gun does

I haven't even had a chance to take a look at that product yet. So far I'm just puzzled by the graph database definition some people seem to be using in this thread.


I only just discovered graph databases a few years ago, so I have no idea what the definition of "graph database" was in ye olde days, but the parent's definition is the only one I'm familiar with.

And it makes the most sense. After all, as programmers we're rarely concerned about the layout of data in memory, but rather the abstract data type (ADT) that we have to work with. An ADT is defined not by its memory layout (i.e., a set of vertices and a set of edges do not a graph make (a set, of course, also being an ADT) -- there are several possible ways to represent a graph in memory), but by the operations that are defined for the data type and their characteristics (i.e., an adjacency relation (possibly along with an incidence relation) does a graph make). Of course, specialized traversal operations are more of a convenience than a necessity (and they typically allow greater performance than implementing solely in terms of the adjacency relation), but the point stands.

Consider that a list may be represented in several ways: cons cells, classes, closures (just to name a few). But for clients of the list, none of that really matters. It only matters if a list has certain operations: cons, car, cdr (or some equivalent interface). As far as clients are concerned, any object which provides the list interface is a list; inversely, any object which does not provide that interface is not a list.
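
A toy illustration of that point in Python (the representations are made up; clients only ever touch cons/car/cdr):

  # One list "interface", two memory layouts.
  def cons_t(a, d): return (a, d)                         # tuples
  def car_t(p): return p[0]
  def cdr_t(p): return p[1]

  def cons_c(a, d): return lambda pick: a if pick else d  # closures
  def car_c(p): return p(True)
  def cdr_c(p): return p(False)

  for cons, car, cdr in [(cons_t, car_t, cdr_t), (cons_c, car_c, cdr_c)]:
      lst = cons(1, cons(2, None))
      assert car(lst) == 1 and car(cdr(lst)) == 2   # indistinguishable to clients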

In a relational database, it hardly matters how data are stored in memory. What matters is that they provide an interface that allows relational algebra (or some close approximation) to be performed on the data. Likewise, I'd argue that how a graph database stores its data is inconsequential, and the only requirement is that it expose graph operations on that data.


> After all, as programmers we're rarely concerned about the layout of data in memory, but rather the abstract data type (ADT) that we have to work with. An ADT is defined not by its memory layout

Or you can think of it as a trade-off between expressiveness and computing speed (over an area of expertise), so really somewhere in between memory layout and ADT.

> In a relational database, it hardly matters how data are stored in memory.

Of course it does. It depends on where you want to optimise for speed.

> I'd argue that how a graph database stores its data is inconsequential, and the only requirement is that it expose graph operations on that data.

No. It makes different trade-offs for different purposes, so it's not inconsequential.

Graph DBs might be a niche that only a few people have to use, but they're still worth engineering because they help the ADT/expressiveness cause.


I see now that I didn't make this clear in my comment, but I'll reproduce part of my reply to the sibling:

> I don't mean to imply that choice of data layout/representation doesn't matter at all, but that it doesn't matter for the purpose of deciding what constitutes a graph and distinguishing graphs from non-graph objects. Of course, as with any ADT, there are various trade-offs that need to be considered before deciding on a particular memory layout.

The beauty of ADTs is that once you've exposed your operations, you are free to change the memory layout without breaking clients -- or even to supply multiple structures with different layouts at the same time, each optimized for a different use case -- and in doing so, you never change the notion of what constitutes a graph.


A typical graph DBMS exposes node selection (often only allowing you to start at a single entry node, or selecting a set of nodes by tags) and arc selection/filtering. Given that they were mostly used for CAD, such an interface makes a lot of sense.

None of these databases ever featured a query language capable of defining Dijkstra's algorithm.

And, no, for a typical use of a graph database, what matters most is how cheap it is to follow a graph arc. Therefore, representation matters. Otherwise a graph interface on top of relational storage would have been sufficient.


> A typical graph DBMS exposes node selection (often only allowing you to start at a single entry node, or selecting a set of nodes by tags) and arc selection/filtering.

That doesn't contradict my view. In fact, I'd argue that it counts, since graph operations are provided.

> None of these databases ever featured a query language capable of defining Dijkstra's algorithm.

Most modern graph databases do seem to feature some sort of query language. I won't argue that it's strictly necessary, as long as you have well-suited, well-defined operations on graphs. I can't speak to Dijkstra's algorithm -- that was a part of the thread that I overlooked previously.

> And, no, for a typical use of a graph database, what matters most is how cheap it is to follow a graph arc. Therefore, representation matters. Otherwise a graph interface on top of relational storage would have been sufficient.

I don't mean to imply that choice of data layout/representation doesn't matter at all, but that it doesn't matter for the purpose of deciding what constitutes a graph and distinguishing graphs from non-graph objects. Of course, as with any ADT, there are various trade-offs that need to be considered before deciding on a particular memory layout.


> None of these databases ever featured a query language capable of defining Dijkstra's algorithm.

Yes, because Dijkstra's is not the most interesting thing to write against a graph in everyday use. Dijkstra's is a primitive that you use, but that you have to tweak to solve your particular problem. What is the point of optimizing for writing that particular algorithm?

You seem to follow the idea that there is one super-algorithm that defines how the mind works; instead, I think it's many small algorithms with a similar purpose.


Perhaps this is a terminology issue. I'll admit that I'm too young to have used pre-relational graph databases, but presumably they have a way of navigating/jumping between documents/vertices (presumably based on pointers); otherwise what point would there be in having a graph?

I'd encourage you to look at the product and see whether it meets your definition.


Yes, of course you could always select a node (often you'd have to start from a single root node), select arcs, filter the arcs by some criteria, etc.

But I've never seen a complex query language that would allow you to express complex traversal strategies (like Dijkstra's algorithm), and from your wording I concluded that this was your requirement for something to be called a graph database.


A query language isn't a requirement, but exposing the capability to do those kinds of select/filter operations is. It doesn't have to support Dijkstra's algorithm, but it should be possible to implement it (efficiently).


Mind naming any historical (i.e., from the pre-relational era) graph DBMS that had graph traversal features?



