A couple of years ago I spent a lot of time figuring out how to deal with a large graph. My conclusion - there will never be such a thing as a "graph database". There are many efforts in this area, someone here already mentioned SPARQL and RDF, you can google for "triple stores", etc. There are also large-scale graph processing tools on top of Hadoop such as Giraph, or GraphX for Spark.
For the particular project we ended up using Redis and storing the graph as an adjacency list in a machine with 128GB of RAM.
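For the curious, "adjacency list in Redis" amounts to something like the sketch below using redis-py; the adj:<node> key convention and the client-side traversal are illustrative assumptions, not the exact layout we used.

```python
# Minimal sketch of a graph stored as adjacency sets in Redis (redis-py).
# The "adj:<node>" key convention is illustrative, not the original project's schema.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_edge(src, dst):
    # One set per node holding its outgoing neighbours
    r.sadd(f"adj:{src}", dst)

def neighbors(node):
    return r.smembers(f"adj:{node}")

def bfs(start, max_depth=3):
    # Simple breadth-first traversal done client-side
    seen, frontier = {start}, [start]
    for _ in range(max_depth):
        nxt = []
        for node in frontier:
            for nb in neighbors(node):
                if nb not in seen:
                    seen.add(nb)
                    nxt.append(nb)
        frontier = nxt
    return seen
```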
The reason I don't think there ever will be a "graph database" is because there are so many different ways you can store a graph, so many things you might want to do with one. It's trivial to build a "graph database" in a few lines of any programming language - graph traversal is (hopefully) taught in any decent CS course.
Also - the latest versions of PostgreSQL have all the features to support graph storage. It's ironic how PostgreSQL is becoming a SQL database that is gradually taking over the "NoSQL" problem space.
From my point of view, the fact that you can very easily add an expert index to a graph database written in a modern language (i.e. not C/C++) makes it even easier to customize an existing graph database to suit your exact needs. In turn, storage and runtime can be tuned more easily, making it easy to get the performance you need. But at the end of the day, not having to deal with algebra is the best part.
PostgreSQL has supported recursive queries for years, and in Oracle you have CONNECT BY. I have only used a recursive WITH once, and it was just a quick demo, but my understanding is that updates are extremely expensive.
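For anyone who hasn't seen a recursive query, here is a minimal sketch of graph reachability; it uses SQLite via Python for brevity (PostgreSQL accepts essentially the same WITH RECURSIVE syntax), and the edges(src, dst) table is made up for illustration.

```python
# Sketch of graph reachability via WITH RECURSIVE, using SQLite for brevity;
# PostgreSQL accepts essentially the same query. The edges(src, dst) table is made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])

rows = con.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT node FROM reachable
""").fetchall()
print([n for (n,) in rows])   # nodes reachable from 'a': a, b, c, d (order may vary)
```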
We're using TitanDB. One of the main benefits for us is that AWS has provided backend integration with DynamoDB. This affords you practically infinite and painless scaling on a pay-as-you-go model. Love it.
Depends on what kind of data and graph you are going to store/use. Neo4j is quite popular, cypher isn't very hard to learn, and it has lots of examples. Might be a good choice for a beginner.
Not really, if you learn Cypher you should be fine learning the basics of Gremlin, SPARQL, or other languages to operate on graphs in a few hours.
There was some post about enabling SPARQL in Neo4J, but when you install Neo4J it comes with cypher by default (not sure if it supports anything else).
I use Apache Jena + SPARQL, but had to use Neo4J to help in a master thesis. Took me a few hours of "How the heck can I do that same thing I'd do in SPARQL that way?", plus some reading of the tutorials.
Cypher expresses the graph patterns that you're looking for in an ASCII-art syntax, so you don't lose sight of the core of your question. On top of that you get filtering, projection, aggregation, and pagination.
The most fun part is the in-query dataflow, which allows you to pass information from one query part to the next (projected, aggregated, ordered, etc.).
And the really cool collection and map functions save you a lot of roundtrips between client and server.
ConceptNet [1] started out as an academic project that I was responsible for for a while. Since then I've left to start a company but I still maintain ConceptNet and build lots of stuff on it.
Here's a list of databases, some of them graph databases, some of them barely databases, where I've tried to store and look up edges of ConceptNet:
- SQLite
- PostgreSQL
- MongoDB
- Some awful IBM quad-store
- HypergraphDB
- Tinkerpop
- Neo4J
- Solr
- Riak
- SQLite with APSW to speed up importing
- Just a hand-rolled hashtable on disk
Here are the systems that have succeeded to any extent, in that I could do simple things with them and they didn't collapse:
- PostgreSQL
- SQLite with APSW to speed up importing
- Just a hand-rolled hashtable on disk
The time when I tried Tinkerpop, HypergraphDB, and Neo4J because I had a graph and graph databases are supposed to be good at graphs was particularly terrible. Graph databases seem to only be good at dealing with graphs so small that anything can deal with them.
If this has changed, please point me at an open-source graph database that's not terrified of gigabytes. (No trying to sell me SaaS, please.)
Graph DB technology has been advancing fast over the last few years, and more evolutions are coming down the pipe. For example, Titan and Blazegraph are distributed and can handle billions of edges, and Blazegraph can be GPU-accelerated, which "demonstrated a throughput of 32 Billion Traversed Edges Per Second (32 GTEPS), traversing a scale-free graph of 4.3 billion directed edges in 0.15 seconds" (https://www.blazegraph.com/product/gpu-accelerated/).
NB: TinkerPop is not a graph DB -- it's a graph software stack / computing framework for graph DBs (OLTP) and graph analytic systems (OLAP). Since TinkerPop is integrated with almost all of the graph DBs and graph processing engines, its mailing lists are a good place to discuss and get help with graph-related projects.
I like how at least the BlazeGraph people are talking about billions of edges and not thousands, but I'm not sure that's something I could use. That seems to be a "pre-order" page, so it sounds neither open source nor existent. And I'm trying to figure out what their normal non-GPU non-distributed software is, but it seems to mostly be a pile of Javadocs.
Using distributed computing on mere gigabytes of data is silly.
I think TinkerPop was something else back in 2011, but apologies if I've used the wrong terminology.
Even knowing the ConceptNet project a little, I still don't understand the workload/algorithms you run on the database. Put another way, «this doesn't work for us» doesn't help anybody.
It really depends on the kind of algorithm you run on the database.
Based on open source projects, in read/write mode, no db can help you, since you load everything into memory. As a noob NLP user, I'd rather use something like AjguDB https://github.com/amirouche/ajgudb
It's interesting. I suppose most graph db operations could be easily enough broken down into a series of steps that could be performed in a more scalable but slower way.
Did your hand-rolled hashtable have any characteristics that would make its performance difficult for a smarter optimizer (if such a thing existed in Neo4j) to match?
Can you pseudocode an example slow query/operation and indicate how many edges/vertices were being considered at each step?
Sorry to ask these kinds of questions, I'm just really curious about the situation you described.
The failures of these databases were a lot more fundamental than I think you're looking for. And so far it hasn't been a trade-off where a non-graph DB has been more scalable but slower; instead, non-graph DBs have been more scalable and faster.
Here's what I have to be able to do in the database:
1. Import millions of edges from a flat file (time limit: 24 hours)
2. Query any node to return up to 100 edges connected to it (time limit: 100 milliseconds)
3. (nice to have) Find the maximal core of nodes that all have degree at least n to each other (time limit: a few hours)
4. Iterate all the edges between the nodes in a specified subset, such as the degree-3 core, which may still be millions of edges (time limit: a few hours)
#3 is optional, and the alternative is to export all the edges and compute it outside the database. But it's the only thing here that's actually a graph algorithm. However, every open-source graph database I've tried is orders of magnitude too slow at one of the other steps. They either fail at importing, fail at iterating, or fail to respond to trivial queries in a timely manner.
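To make that concrete: requirements #1 and #2 need nothing graph-specific, and a plain edge table with indexes over the endpoints is roughly what the working setups boil down to. A simplified sketch (not the real ConceptNet schema):

```python
# Simplified sketch of requirements 1 and 2 with plain SQLite; not ConceptNet's real schema.
import sqlite3

con = sqlite3.connect("edges.db")
con.execute("PRAGMA journal_mode=WAL")
con.execute("""CREATE TABLE IF NOT EXISTS edges (
                   src TEXT, dst TEXT, weight REAL, data TEXT)""")

# 1. Bulk import: executemany inside one transaction, build the indexes afterwards.
def import_edges(rows):            # rows: iterable of (src, dst, weight, data)
    with con:
        con.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", rows)
    con.execute("CREATE INDEX IF NOT EXISTS idx_src ON edges(src)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_dst ON edges(dst)")

# 2. Up to 100 edges touching a node: an indexed lookup, comfortably under 100 ms.
def edges_for(node, limit=100):
    return con.execute(
        "SELECT * FROM edges WHERE src = ? OR dst = ? LIMIT ?",
        (node, node, limit)).fetchall()
```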
I forgot to mention one other non-graph-database system that met my requirements, which is Kyoto Cabinet. The main downside of it is the GPLv3 license.
Might I ask what sort of dataset size, servers, etc. you are using? I'm looking for a graph database and Cayley seems the best fit, though I'm not sure what sort of limits on the data there would be in the real world.
Cayley has been stable for us so far, but I can't vouch for its scalability, as our database is very small, less than 6M quads, so an 8GB machine is more than enough.
What's the story on inserting data into Cayley? I think every single code example I have seen on it only shows traversing graphs with it.
Also, if you don't mind me asking, how does it not being a property graph affect modelling your data and queries? At a glance it seems that queries would get significantly more complex if you wish to take several properties of a vertex into account.
Cayley can be run in two ways: as HTTP service or as a Go library that you import from your Go app. To insert into Cayley when it's running as HTTP service you can do something like: `curl http://localhost:64210/api/v1/write -d '[{ "subject": "Krissy", "predicate": "loves", "object": "Justin Trudeau"}]'`.
Or if you like the idea of using a graph database developed by a secretive organisation full of geniuses dedicated to collecting and organising the world's information, but would rather it was public-sector, there's Gaffer:
It's been stable for us since deployment eight months ago, although we haven't pushed it too hard - our data is relatively small and queries not too complex.
From what I've seen, our use has a lot in common with Seed-DB, only in a different economic sector/activity.
"The performance will suffer if the dataset is much bigger than the memory."
That is a huge drawback when compared to relational databases.
A good follow-up question would be: which open-source graph databases can reasonably import and store graph data that's not small -- that is, more data than fits in RAM? Without proprietary extensions?
Let me explain this quotation. When your graph data (including indices) no longer fits into the RAM of a single server, you can either live with the higher latency of loading data from disk or you can use sharding, which will lead to communication and therefore slower traversals.
That does not mean that things stop working, but performance degrades: you can no longer visit tens of millions of nodes per second in a traversal, as you can in RAM on a single server.
If you actually only traverse a much smaller hot subgraph, I would probably go for the disk based single server approach.
If your graph has a natural, known clustering, then an optimized sharding solution with fine-tuned sharding keys is probably your best bet.
You can do all this with ArangoDB.
However, graph traversals vary greatly in many respects, and your mileage may vary accordingly, with any approach.
I would love to chat in more detail about your use case.
Throwing money at RAM because most graph databases haven't figured out how to use the disk effectively is not a good use of money. What's cheaper than buying all the RAM in the universe is figuring out a different system besides a graph database that does the job.
A good example would be the graph of Wikipedia links. About 100 million edges among 5 million nodes, last I checked. The nodes have large differences in degree.
The raw data for this is not the slightest bit large. We're only talking about gigabytes. But it would absolutely destroy Neo4J to even try to import it, to say nothing of running an interesting algorithm that justifies using a graph database on it, and Neo4J seems to be everyone's favorite open-source graph database for some reason.
There are multiple systems out there; however, I have my doubts. It is important that your data does not get corrupted and that your transactions do not get lost. Furthermore, speedups are possible with certain indices. That is why I personally would want to see some more safety/speed analysis and comparisons between the different systems.
(Full disclosure: I'm the author, we are VC backed) https://github.com/amark/gun is an Open Source graph database with Firebase like realtime synchronization.
Dijkstra's algorithm is wonderful, but it is by no means a requirement for being a graph database. A graph database is just that: a database composed of nodes that can interconnect into a graph. GUN supports this and allows for traversing the graph. We haven't implemented Dijkstra's algorithm, which is what your "discussion" refers to.
We've never talked about Dijkstra as far as I recall. Your product doesn't support graphs any more than, say, mongodb does because ultimately all you are doing is loading an object from JSON and then sending it to the consumer. This is in contrast to true graph databases whose main selling point is their ability to efficiently traverse, filter and aggregate large graphs to find the answer to some question, and then send the answer to the consumer.
Gun doesn't meet anyone's definition of "graph database" other than your own. If I load some JSON from a URL, and use lodash to pluck some data out of it, is it a graph database?
Are you inventing your own new, hipster definition of a "graph database" here? I've used graph databases a lot, and I never heard of any of the requirements you've listed.
Not at all but maybe I'm not being clear. My definition of "graph database" is "a database which can efficiently represent graphs and offers functionality for traversing them, in order to allow queries such as 'find the business relationships between user X and user Y based on who they've worked with' (ala linkedin)"
This is not what gun does. It alternately calls itself "the simplest database out there", "not a database" and "a distributed cache". It provides a mechanism for sharing a list of objects across multiple peers, but must transfer all of the data to each peer. It is conceptually similar to downloading a large chunk of JSON from a server and using lodash, ramda etc to query it, but no one would call that a graph database.
> a database which can efficiently represent graphs and offers functionality for traversing them
First part is matching the commonly accepted definition (the one that had been around for about 50 years). The second part is your own invention.
> This is not what gun does
I did not even have a chance to take a look at that product yet. So far I'm just puzzled by the graph database definition some people seem to be using in this thread.
I only just discovered graph databases a few years ago, so I have no idea what the definition of "graph database" was in ye olde days, but the parent's definition is the only one I'm familiar with.
And it makes the most sense. After all, as programmers we're rarely concerned about the layout of data in memory, but rather the abstract data type (ADT) that we have to work with. An ADT is defined not by its memory layout (a set of vertices and a set of edges do not a graph make -- a set, of course, is also an ADT, and there are several possible ways to represent a graph in memory), but by the operations that are defined for the data type and their characteristics (i.e., an adjacency relation, possibly along with an incidence relation, does a graph make). Of course, specialized traversal operations are more of a convenience than a necessity (and they typically allow greater performance than implementing solely in terms of the adjacency relation), but the point stands.
Consider that a list may be represented in several ways: cons cells, classes, closures (just to name a few). But for clients of the list, none of that really matters. It only matters if a list has certain operations: cons, car, cdr (or some equivalent interface). As far as clients are concerned, any object which provides the list interface is a list; inversely, any object which does not provide that interface is not a list.
In a relational database, it hardly matters how data are stored in memory. What matters is that they provide an interface that allows relational algebra (or some close approximation) to be performed on the data. Likewise, I'd argue that how a graph database stores its data is inconsequential, and the only requirement is that it expose graph operations on that data.
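A toy illustration of that point: the two classes below use completely different memory layouts, yet either one is a "graph" to any client that only relies on the operations (the names are mine, purely for illustration).

```python
# Two different memory layouts behind the same graph "interface".
# Either object counts as a graph for clients that only use add_edge()/neighbors().
class AdjacencyDictGraph:
    def __init__(self):
        self.adj = {}                       # node -> set of neighbours
    def add_edge(self, u, v):
        self.adj.setdefault(u, set()).add(v)
    def neighbors(self, u):
        return self.adj.get(u, set())

class EdgeSetGraph:
    def __init__(self):
        self.edges = set()                  # flat set of (u, v) pairs
    def add_edge(self, u, v):
        self.edges.add((u, v))
    def neighbors(self, u):
        return {v for (x, v) in self.edges if x == u}

def reachable(graph, start):
    """Works with either representation -- it relies only on the operations."""
    seen, stack = {start}, [start]
    while stack:
        for v in graph.neighbors(stack.pop()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen
```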
> After all, as programmers we're rarely concerned about the layout of data in memory, but rather the abstract data type (ADT) that we have to work with. An ADT is defined not by its memory layout
Alternatively, you can think of it as a tradeoff between expressiveness and computing speed (over an area of expertise), so really somewhere in between memory layout and ADT.
> In a relational database, it hardly matters how data are stored in memory.
Of course it does. It depends on where you want to optimise for speed.
> I'd argue that how a graph database stores its data is inconsequential, and the only requirement is that it expose graph operations on that data.
No. It makes different trade-offs for different purpose so it's not inconsequential.
Graph DBs might be only a niche that only a few people have to use, but they are still worth engineering because they help the ADT/expressiveness cause.
I see now that I didn't make this clear in my comment, but I'll reproduce part of my reply to the sibling:
> I don't mean to imply that choice of data layout/representation doesn't matter at all, but that it doesn't matter for the purpose of deciding what constitutes a graph and distinguishing graphs from non-graph objects. Of course, as with any ADT, there are various trade-offs that need to be considered before deciding on a particular memory layout.
The beauty of ADTs is that once you've exposed your operations, you are free to change the memory layout without breaking clients -- or even to supply multiple structures with different layouts at the same time, each optimized for a different use case -- and in doing so, you never change the notion of what constitutes a graph.
A typical graph dbms exposes a node selection (often only allowing to start at a single entry node or selecting a set of nodes by tags) and arc selection / filtering. Given that they were mostly used for CADs such an interface makes a lot of sense.
None of such databases ever featured a query language capable of defining a Dijkstra algorithm.
And, no, for a typical use of a graph database, it matters most how cheap it is to follow a graph arc. Therefore, representation matters. Otherwise a graph interface on top of a relational storage would have been sufficient.
> A typical graph dbms exposes a node selection (often only allowing to start at a single entry node or selecting a set of nodes by tags) and arc selection / filtering.
That doesn't contradict my view. In fact, I'd argue that it counts, due to the fact that graph operations are provided.
> None of such databases ever featured a query language capable of defining a Dijkstra algorithm.
Most modern graph databases do seem to feature some sort of query language. I won't argue that it's strictly necessary, as long as you have well-suited, well-defined operations on graphs. I can't speak to Dijkstra's algorithm -- that was a part of the thread that I overlooked previously.
> And, no, for a typical use of a graph database, it matters most how cheap it is to follow a graph arc. Therefore, representation matters. Otherwise a graph interface on top of a relational storage would have been sufficient.
I don't mean to imply that choice of data layout/representation doesn't matter at all, but that it doesn't matter for the purpose of deciding what constitutes a graph and distinguishing graphs from non-graph objects. Of course, as with any ADT, there are various trade-offs that need to be considered before deciding on a particular memory layout.
> None of such databases ever featured a query language capable of defining a Dijkstra algorithm.
Yes, because Dijkstra is not the most interesting thing to write against a graph in everyday use. Dijkstra is a primitive that you use but have to tweak to solve your particular problem. What is the point of optimizing for writing that particular algorithm?
You seem to follow the idea that there is a super-algorithm that defines the way the mind works; instead, I think it's many small algorithms with a similar purpose.
Perhaps this is a terminology issue; I'll admit that I'm too young to have used pre-relational graph databases, but presumably they have a way of navigating / jumping between documents/vertices (presumably based on pointers), otherwise what point would there be in having a graph?
I'd encourage you to look at the product and see whether it meets your definition.
Yes, of course you could always select a node (often you'd always have to start from a single root node), select arcs, filter the arcs by some criteria, etc.
But I've never seen a complex query language that would allow you to express complex traversal strategies (like Dijkstra's algorithm), and from your wording I concluded that this was your requirement for something to be called a graph database.
Query language isn't a requirement but exposing the capability to do those kinds of selects / filtering operations is. It doesn't have to support Dijkstra's algorithm but it should be possible to implement it (efficiently).
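To make that concrete, here is Dijkstra written against nothing but a neighbour-lookup callable; `neighbors(node)` returning `(neighbor, weight)` pairs stands in for whatever node/arc selection a given store exposes, and is an assumption rather than any particular product's API.

```python
# Dijkstra against a bare neighbour-lookup callable; "neighbors" stands in for
# whatever select/filter operations the database exposes (an assumption, not
# any particular product's API).
import heapq

def dijkstra(neighbors, source, target):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        for nb, w in neighbors(node):     # one "follow the arcs" call per node
            nd = d + w
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return None                           # target unreachable
```

How efficient this is then comes down entirely to how cheap that neighbour-lookup call is, which is the point being argued above.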
Everybody's focused on graph databases here but let's talk about Cray! One of the most forward-thinking computer technology companies ever to exist is starting to get out there again. If they got a few hundred million dollars from an outside investor, they could do friggin' incredible things. They already do incredible things but not out there in the way it so easily could be.
Cray is a brand name that has been passed around between half a dozen companies (including Sun and SGI), dotted with various kinds of product reboots and commercial failures. Cool stuff, but supercomputing isn't the most financially sound business, it seems. The current name holder is the company previously called Tera, originally famous for making an aggressively multithreaded HPC computer.
"an aggressively multithreaded HPC computer" - still sold as Urika-GD, marketed as a dedicated graph appliance. That's where Cray Graph Engine originated.
I am a huge fan of graph-y stuff. I did several iterations of a graph database written in Python, using files, bsddb, and right now wiredtiger. I also use Gremlin for querying. Have a look at the code: https://github.com/amirouche/ajgudb.
I've seen people using graph databases as a general-purpose backing store for webapps/microservices. What are people's opinions about this?
My feeling is that graph databases are not suitable/ready for — for lack of a better term — the kind of document-like entity relationship graphs we typically use in webapps. Typical data models don't represent data as vertices and edges, but as entities with relationships ("foreign keys" in RDBMS nomenclature) embedded in the entities themselves.
This coincidentally applies to the relational model, in its most pure, formal, normal form, but the web development community has long established conventions of ORMing their way around this. The thing is, you shouldn't need an ORM with a graph database.
It introduces a false dichotomy, "graph vs relational".
In fact, most (if not all) graph algorithms can be expressed using linear algebra (with specific addition and multiplication operations). And matrix multiplication is a select from two matrices, joined with "where i=j", with aggregation over identical result coordinates.
The selection of multiplication and addition operations can account for different "data stored in links and nodes".
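As a concrete example of the correspondence, one step of reachability is a sparse matrix-vector product, and that product is literally "join on matching indices, then aggregate". A plain-Python sketch, with dicts standing in for relations:

```python
# One reachability step as a sparse matrix-vector product, written as
# "join on matching indices, then aggregate" -- the relational reading of
# multiplication. Dicts stand in for relations here; swapping (+, *) for
# (min, +) turns the same shape into a shortest-path relaxation step.
A = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1}   # adjacency "matrix" as a relation
x = {"a": 1}                                        # frontier vector: currently at 'a'

def matvec(A, x):
    y = {}
    for (i, j), a_ij in A.items():             # select from A
        if i in x:                             # join condition: A's row index matches x's index
            y[j] = y.get(j, 0) + a_ij * x[i]   # aggregate over identical result coordinates
    return y

print(matvec(A, x))             # {'b': 1}: nodes one hop from 'a'
print(matvec(A, matvec(A, x)))  # {'c': 1}: two hops
```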
So there is no such dichotomy "graph vs relational".
Strictly speaking yeah. Practically speaking: Not really true.
Just because something can be done doesn't mean it can be done easily or well. I've done a lot of work with relational databases, and I love them for a lot of data sets. But I have also done a lot of work with graph databases - and they make working with graph-shaped data a pleasure. I could do a graph in SQL; it's even moderately straightforward in Postgres these days by using WITH RECURSIVE - but it's still not as simple as just loading Orient or Arango for those tasks.
It's the same reason I keep multiple knives in my kitchen. Sure I could do everything with an 8" chef's knife, but the paring knife and the boning knife just make some tasks easier.
Facebook has a very good paper describing how they do graphs on top of a relational database. Google "facebook tao" for details.
I read that and implemented my own version with SQL in < 500 lines of Python code, and found it just perfect for my own use cases. I can easily query any edges or nodes at web speed (< 10 ms) from databases with millions of nodes and edges, GBs of info.
I am curious what I might be missing with that approach as compared to a real graph database?
From my quick read of TAO, it seems to be doing essentially what graph databases do, but with the data storage layer being SQL rather than some other object store, and with the interface layer being not quite as feature-complete. The query layer in TAO seems to lack a way to follow multiple edges without first returning to the application code, which graph DBs present as a native feature. The other thing that's lacking, which some graph DBs provide, is labeled edges - that is, edges that contain data besides just an "edge type".
I implemented a table for nodes and one for edges. The Nodes table has a JSON field for each node.
If I need more info for a particular edge type, I just add a new node entry of type "Edge_info" that links the edge type to a JSON object containing that info. I found that very flexible, but I have not used any real graph database.
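If I understand the description, it boils down to something like the two-table sketch below; the column names and the JSON-per-row details here are my guesses, purely illustrative.

```python
# Illustrative two-table layout along the lines described above: a nodes table
# with a JSON payload per node, an edges table keyed by type, and extra
# per-edge-type info stored as a special "Edge_info" node. Column names are guesses.
import json, sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, type TEXT, data TEXT);  -- data = JSON blob
    CREATE TABLE edges (src INTEGER, dst INTEGER, type TEXT);
    CREATE INDEX idx_edges_src ON edges(src, type);
    CREATE INDEX idx_edges_dst ON edges(dst, type);
""")

def add_node(node_type, payload):
    cur = con.execute("INSERT INTO nodes (type, data) VALUES (?, ?)",
                      (node_type, json.dumps(payload)))
    return cur.lastrowid

def add_edge(src, dst, edge_type):
    con.execute("INSERT INTO edges VALUES (?, ?, ?)", (src, dst, edge_type))

def out_neighbors(src, edge_type):
    # Indexed lookup of one hop's worth of neighbours, JSON payload included.
    return con.execute("""SELECT n.id, n.data FROM edges e
                          JOIN nodes n ON n.id = e.dst
                          WHERE e.src = ? AND e.type = ?""",
                       (src, edge_type)).fetchall()
```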
Curious, since you seem to have experience with both graph and relational databases (and I have not really worked much with the former)... if I had a graph where I just want to compute the shortest weighted path between two nodes, which model would suit better and how much difference would it make in terms of performance?
Of course you can express anything on top of a relational model. But for graphs such a representation would have been awfully inefficient. For this reason, CADs never even tried to switch to relational data storage once those fancy new relational databases appeared; most of the professional CADs are still using good old graph databases.
I am part of the team developing a Russian CAD system [0]. It uses what one could consider a hypergraph DB (a relation includes many objects), but that DBMS has queries on par with SQL, and they have proven very useful in the development of a CAD.
What you describe can be explained by development inertia. Most CADs are C/C++, and these languages are not very well suited for changes that cut through the whole code base (a change of storage engine and data model).
I also ran experiments during the development of a graph analytics engine at one point. The relational model (actually, the linear algebra model) proved itself very competitive. It allows for easy distribution of data, the operations over distributed data are close to optimal, etc.
I worked on a (very old) ship-building and factory-building CAD, written mostly in PL/I and Fortran. It was built around a graph database. Throughout its long and turbulent history, there were many attempts at moving to relational storage, every time resulting in an orders-of-magnitude drop in performance.
Keep in mind that in such a CAD designs are huge. Think of an aircraft carrier scale of "huge". And it was designed when memory was very limited. Therefore, pretty much all the CAD operations depended on the database access.
So, nobody really cared about the queries, they were insignificant. What people cared about was:
* Performance of following an arc
* Transactions
* Data consistency
* Compactness of representation (remember, disk space is also a limited thing when you're building aircraft carriers).
Has the state of the art changed since the time of their last attempt to move to a relational DB? Has the hardware changed during that time?
You tell us about some old project, written in hard-to-maintain languages, which had many failed attempts to adapt to new tech. This is exactly what to expect.
I am talking about a relatively modern language (C#) using good DB tech (lagging about seven, maybe five, years behind the state of the art). Maybe the story will be different in our case.
Fundamentally, nothing has changed in relational storage. The follow-a-graph-edge operation is as expensive as it used to be (it involves an index lookup; it cannot be cheap).
If you know a relational arrangement suitable for a cheap O(1) edge traversal - please share. But I am very skeptical.
And I cannot see how the host language is relevant at all. C#, Haskell, whatever - none can make data access operations cost less than what the data model predicts.
Use covering indices for edges; they are vastly cheaper than regular indices. Choose a write-optimized storage layer (fractal indices (circa ~2007, BTW!), LSM) for the DB. Don't do singular accesses, use bulk accesses (and now we hit the wall of PL/I and Fortran you mentioned). I can go on.
I cannot help but feel that what you describe is a classic example of technical inertia due to massive technical debt. It cannot prove that relational DBs are bad for CADs.
Is any of the indices you're describing O(1)? Very, very unlikely.
As for the bulk operations, they're in most cases totally useless. Rendering - maybe, but most of the other CAD operations require precise edge following.
And why even go to all the trouble of using this totally unsuitable relational representation when a proper graph DBMS is so much easier? Relational religion is so funny, almost as funny as OOP.
What exactly are these "most CAD operations"? I bet they follow many links from nodes (note the plural!) in most of them. Single node operations are seldom and even then there are many links to follow. This is true for scheme/PCB editor, I bet it is even more true for arch design.
You can look at these indices as O(1) operations. Btrees are just like that.
(if you think that memory access is O(1), you are wrong)
Graph DBs, more often than not, are ad hoc, bug-ridden, slow, poor implementations of one tenth of a relatively complete relational DB.
CSG, place and route (pipes, cables, etc.), design constraint checks, all that stuff.
> I bet they follow many links from nodes (note the plural!) in most of them.
Not that many, mostly single-digit numbers.
> This is true for scheme/PCB editor
Which is very, very different from an oil refinery or an aircraft carrier. Both in a scale and typical operations.
> You can look at these indices as O(1) operations. Btrees are just like that.
WAT?!? Not even close. O(log n) at best. And a multiplier there is huge.
> Graph DBs, more often than not, are ad hoc, bug-ridden, slow, poor implementations of one tenth of a relatively complete relational DB.
What?
Graph DBs are orders of magnitude simpler than any relational pile of a mess. It's really hard to screw them up. Everything is trivial there, including transactions, logging, referential transparency and all that.
Design constraint checks touch A LOT of stuff: usually things have several classes attached to them, and checks need to be performed on the union or intersection of class-related data. Place and route, even for a 2D PCB, also touches a lot of stuff - it needs to access constraints, at the very least. In our case, the constraint system had to return data to the PCB real-time router in under 100us, for several tens of millions of different classes of constraints. I cannot see how you can speak of singular accesses in this context. Or "single-digit" numbers of accesses.
Now I'll leave the conversation. We clearly have different views on almost everything, including, but not limited to, "huge multipliers".
Anybody know dgraph.io?
It's a scalable, distributed, low-latency, high-throughput graph database over terabytes of structured data.
Dgraph supports Facebook's GraphQL as its query language, responds in JSON, and uses Facebook's RocksDB, a very fast key-value store, as its storage engine.
See more at https://github.com/dgraph-io/dgraph
One of the biggest challenges in databases is handling concurrency and sharding; I wish this had talked a bit more about how that changes between a graph database and a relational database.