Well, I'm glad to get text from InfoQ rather than a video.
Still,
"Relational databases are fine things, even for large data sets, up to the point where you have to join. And in every relational database use case that we’ve seen, there’s always a join — and in extreme cases, when an ORM has written and hidden particularly poor SQL, many indiscriminate joins."
It seems like the overall argument is for (what I see as) a step backward from the declarative model to a lower-level imperative model. "You never know what memory your implicit declarations will allocate, better do everything in explicit C-like loops as your data expands."
It's almost like an argument for a return to the world of "hardware is expensive, people are cheap" and for all I know that's what's happening with really big data. But it seems a bit sad to present it as a step forward.
You must have missed something, because that is the opposite of what is being said here. The graph model is more declarative than the so-called relational model.
In the graph, this node here represents Bob, this one Alice, and Alice is Bob's manager: (alice)-[:MANAGES]->(bob). The query costs you one traversal from one node to another, a minuscule cost regardless of size. It's O(1): your cost stays the same no matter how many employees you have.
In a relational database, you have an employees table with a FK_ID for Manager that is indexed, so the lookup is an O(log n) operation. As your employees table gets bigger, your cost increases.
It's an O(log n) operation, but the base matters: we aren't talking log base 2 of 100,000, since it's usually a B+tree with a large fan-out, not a binary tree. In practice, you don't usually end up with a B+tree more than 4 levels deep.
So we are usually talking 3, at most 4, buffer reads for the index, and possibly an extra read from another buffer or from disk if you need non-indexed data, assuming the number of managers is considerably smaller than the number of employees. Follow the 5-minute rule (you should have enough memory for anything you might touch every 5 minutes) and in practice you will probably get performance similar to a graph database.
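To make that concrete, here is roughly the kind of schema and query I'm describing (table and column names are made up for illustration):

    -- Hypothetical self-referencing employees table; manager_id is the FK.
    CREATE TABLE employees (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        manager_id INTEGER REFERENCES employees(id)
    );

    -- Index the FK so manager lookups are a B+tree descent, not a table scan.
    CREATE INDEX idx_employees_manager ON employees(manager_id);

    -- Direct reports of employee 42 (say, Alice): a handful of buffer reads.
    SELECT id, name FROM employees WHERE manager_id = 42;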
You'd be surprised what you can do with recursive common table expressions in SQLite, Postgres, and Oracle too.
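For instance, a sketch along these lines (untested, Postgres/SQLite-flavoured syntax; Oracle's differs slightly, and it assumes an employees table like the one sketched above) walks the whole management chain above one employee:

    -- Everyone above employee 42 in the reporting chain, closest first.
    WITH RECURSIVE chain(id, name, manager_id, depth) AS (
        SELECT id, name, manager_id, 0
        FROM employees
        WHERE id = 42
      UNION ALL
        SELECT e.id, e.name, e.manager_id, c.depth + 1
        FROM employees e
        JOIN chain c ON e.id = c.manager_id
    )
    SELECT id, name, depth
    FROM chain
    WHERE depth > 0
    ORDER BY depth;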
Actually, your cost is O(n) for the graph in the worst case, and the MySQL index is O(log n) in the worst case (as you implicitly point out), so really what we are talking about here is speed in practice. I've found that in practice graph databases generally need more fast random-access storage (SSD / RAM, in gigabytes / terabytes), or very intelligent / lucky data storage strategies. Furthermore, adding data to an already large node can sometimes cause problems, since the database will take steps to avoid fragmentation. This can also be a problem for some kinds of manipulations in a traditional relational database, but there you generally farm that information out into its own table, which makes the problem less likely. Also, I find sharding harder on graph databases of sufficient complexity, although there are cases where it can be easier too.
But if we're talking actual performance, relational databases generally perform as well as or better than graph databases for a traditional SaaS or social network app, which relies more heavily on aggregate querying, something that is easier to make performant at the scales and on the timelines most of these apps operate on.
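By "aggregate querying" I mean the bread-and-butter reporting queries these apps live on, something like this (schema invented, Postgres-flavoured date arithmetic):

    -- Busiest articles of the last week by comment count.
    SELECT a.id, a.title, COUNT(c.id) AS comment_count
    FROM articles a
    LEFT JOIN comments c
           ON c.article_id = a.id
          AND c.created_at >= NOW() - INTERVAL '7 days'
    GROUP BY a.id, a.title
    ORDER BY comment_count DESC
    LIMIT 20;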
As for the slides you linked: joins are considered harmful for certain types of querying at certain scales, but you can usually work around these problems by "doing the join in code", by creating Entity–Attribute–Value (EAV) models for certain cases, or, if you really want, by using the MEMORY storage engine (and getting that O(1) average-case lookup time). Or at that point you can use multiple datastores if you need to.
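If EAV is unfamiliar, the idea is just a narrow three-column table instead of a wide, sparse one; roughly like this (names invented, MySQL-flavoured since the MEMORY engine is MySQL-specific):

    -- Entity-Attribute-Value: one row per (entity, attribute) pair.
    CREATE TABLE entity_attributes (
        entity_id BIGINT       NOT NULL,
        attribute VARCHAR(64)  NOT NULL,
        value     VARCHAR(255),
        PRIMARY KEY (entity_id, attribute),
        KEY idx_attr_value (attribute, value)
    ) ENGINE=InnoDB;
    -- Swap ENGINE=InnoDB for ENGINE=MEMORY to get hash-indexed, in-RAM
    -- lookups (average O(1)), at the price of durability.

    -- All attributes for one entity:
    SELECT attribute, value FROM entity_attributes WHERE entity_id = 42;

    -- All entities with a given attribute value:
    SELECT entity_id FROM entity_attributes
    WHERE attribute = 'colour' AND value = 'red';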
Anyway, I don't really know what my point is. I've used graph DBs before, and they were alright, but I find the cost of giving up SQL to be pretty high, and I actually enjoy some of the constraints that exist at the DB level in something like MySQL or Postgres. Plus, when you're ready to scale, there are a bunch of people who can quickly get you 2x or 3x speed improvements without knowing much about the structure of your data. If you ever want someone to analyze your data to figure out your growth rate, or what have you, it's easier. It's easier for marketing to get weekly Excel exports of the data.
Speaking from personal experience, so YMMV, the trickiest thing about moving from a relational or document DB mindset to a graph DB mindset is remembering that you can store information implicitly in the structure of the graph.
So, as a very simple example, you don't have a Comment node with attributes for the person who wrote the comment and the article the comment is associated with. You just have edges pointing back to those things. Nowhere in the comment, or even in the edges, is there anything that looks like an ID or foreign key.
Unlike a document DB, however, you don't have weirdness once you have something like co-authorship. Just point to both authors, no need to duplicate the data or set up some kind of pseudo foreign key. Once you get the hang of it, it's a really elegant way to store data.
Elegant to store, hell to really query. Continuing with your example of comments, it turns every query you have to do into a map-reduce job, and simple, fast things that used to be easy end up being a pain.
Certainly useful when you really need a graph, but I don't find that it is a cure-all.
I would generally agree that you should only go the graph DBMS route when you actually need to. However, I would attach the caveat that a lot more people need to than you might think.
Almost any ___domain where you want to do some kind of 'deep' traversal or non-trivial pattern matching is going to benefit hugely from a native graph data model.
Not necessarily. If you have an index into the posts and you want all the comments for a given article, you can find the post quickly, and the comments are just one hop away from it. Beyond that it's just like SQL (order by, limit, etc.).
"The problem with a join is that you never know what intermediate set will be produced, meaning you never quite know the memory use or latency of a query with a join. Multiplying that out with several joins means you have enormous potential for queries to run slowly while consuming lots of (scarce) resources."
Anyone with even a meagre understanding of databases will put indexes on join columns. If the data model is complete, the joined columns will be modelled as foreign keys (conceptually the same as a relationship in a graph DB), which force indexes. I think they are talking more about problems with ORMs, where the ORM might construct unexpected queries that don't hit indexes. That's an ORM problem, not a relational DB problem.
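To put that in concrete terms, the baseline I'd expect from any schema is roughly this (names invented; InnoDB will index FK columns for you, Postgres leaves it to you):

    -- Assuming articles(id) and users(id) already exist.
    CREATE TABLE comments (
        id         BIGINT PRIMARY KEY,
        article_id BIGINT NOT NULL,
        author_id  BIGINT NOT NULL,
        body       TEXT,
        FOREIGN KEY (article_id) REFERENCES articles(id),
        FOREIGN KEY (author_id)  REFERENCES users(id)
    );

    -- Explicit indexes on the join columns, so joins to articles/users
    -- are index lookups rather than scans.
    CREATE INDEX idx_comments_article ON comments(article_id);
    CREATE INDEX idx_comments_author  ON comments(author_id);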
One of the major promises of relational DBs was that you could write code describing what you wanted, and the DBMS would figure out how to efficiently find it for you. This promise was derailed by the push to merge relational models with object-oriented models (i.e. ORMs), but it's not dead. What we need is a more powerful SQL, one that doesn't require boilerplate and can do things like recursion (making every database a graph database). We need a SQL that makes the application code seem like boilerplate. We need the equivalent of type inference for joins; let me say A join D and the DBMS infers I mean A->B->C->D and figures out the cardinality of the result. We need result sets that are graphs instead of one-list-fits-all. These are all things that can be modelled in an RDBMS without losing its expressiveness.
You said:
"What we need is a more powerful SQL, one that doesn't require boilerplate and can do things like recursion (making every database a graph database)."
What about SQL's common table expressions (CTEs)? Not powerful enough?
People build interesting things on top of graph databases. I find Structr ( http://structr.org/ ) a very interesting approach to a rich CMS/WCM (based on Neo4j).
Neo4j offers master-slave replication for efficient scaling of reads. Horizontal scaling of graph databases often involves partitioning, which is a hard problem and an active area of research.
I would say this, however:
- If your data and query workload are a natural fit for the graph model, then the speedup you get offsets much of the advantage offered by horizontal write scalability in other DBMSs.
- A single Neo4j instance can store and query a very great deal of data indeed (in personal testing I have imported low 100s of millions of nodes, and I am given to understand it can go much further still). For many use cases this is sufficient.
Well, this sounds nice. That number of nodes is more than sufficient for my needs. The problem would probably be the reads: "give me everything that is tagged X, Y and Z", "give me everything that is tagged A, B, X and G", etc.
Obviously I don't know the specifics of the data you are going to be modelling, but I would suggest thinking of many of the tags/'properties' as part of the topology of the graph.
For instance (warning: contrived example ahead), if you wanted to say "Give me all people that live in Germany", then Germany would be a node (and Lives_In a relationship) rather than a property on each individual person node.
Graph databases are optimised for thinking about data in this way. So you might start your query at the node with the label Country and the name property Germany, then follow all of its connected Lives_In relationships. This obviously considers far fewer nodes than looping through every node with the label Person.
Still, "Relational databases are fine things, even for large data sets, up to the point where you have to join. And in every relational database use case that we’ve seen, there’s always a join — and in extreme cases, when an ORM has written and hidden particularly poor SQL, many indiscriminate joins."
It seems like the overall argument is for (what I see as) a step backward from the declarative model to a lower level imperative model. "You never know what memory your implicit declarations will allocate, better do everything in explicit c-like loops as your data expands."
It's almost like an argument for a return to the world of "hardware is expensive, people are cheap" and for all I know that's what's happening with really big data. But it seems a bit sad to present it as a step forward.