ConceptNet [1] started out as an academic project that I was responsible for for a while. Since then I've left to start a company, but I still maintain ConceptNet and build lots of stuff on it.
Here's a list of databases, some of them graph databases, some of them barely databases, where I've tried to store and look up edges of ConceptNet:
- SQLite
- PostgreSQL
- MongoDB
- Some awful IBM quad-store
- HypergraphDB
- Tinkerpop
- Neo4J
- Solr
- Riak
- SQLite with APSW to speed up importing
- Just a hand-rolled hashtable on disk
Here are the systems that have succeeded to any extent, in that I could do simple things with them and they didn't collapse:
- PostgreSQL
- SQLite with APSW to speed up importing
- Just a hand-rolled hashtable on disk
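The "hand-rolled hashtable on disk" is nothing clever, roughly: each node is a key, and the value is the serialized list of edges touching that node. A minimal sketch of the idea, using Python's built-in dbm as a stand-in (not the actual code, and the example edge is made up):

    import dbm
    import json

    def add_edge(db, start, end, rel, weight):
        # Store the edge under both endpoints so either node can look it up.
        for node in (start, end):
            key = node.encode("utf-8")
            edges = json.loads(db[key]) if key in db else []
            edges.append([start, end, rel, weight])
            db[key] = json.dumps(edges)

    def edges_for_node(db, node, limit=100):
        key = node.encode("utf-8")
        return json.loads(db[key])[:limit] if key in db else []

    with dbm.open("edges", "c") as db:
        add_edge(db, "/c/en/cat", "/c/en/pet", "/r/IsA", 1.0)
        print(edges_for_node(db, "/c/en/cat"))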
The time when I tried TinkerPop, HypergraphDB, and Neo4j, because I had a graph and graph databases are supposed to be good at graphs, was particularly terrible. Graph databases seem to only be good at dealing with graphs so small that anything can deal with them.
If this has changed, please point me at an open-source graph database that's not terrified of gigabytes. (No trying to sell me SaaS, please.)
Graph DB technology has been advancing fast over the last few years, and more evolutions are coming down the pipe. For example, Titan and Blazegraph are distributed and can handle billions of edges, and Blazegraph can be GPU-accelerated, which "demonstrated a throughput of 32 Billion Traversed Edges Per Second (32 GTEPS), traversing a scale-free graph of 4.3 billion directed edges in 0.15 seconds" (https://www.blazegraph.com/product/gpu-accelerated/).
NB: TinkerPop is not a graph DB -- it's a graph software stack / computing framework for graph DBs (OLTP) and graph analytics systems (OLAP). Since TinkerPop is integrated with almost all of the graph DBs and graph processing engines, its mailing lists are a good place to discuss and get help with graph-related projects.
I like how at least the BlazeGraph people are talking about billions of edges and not thousands, but I'm not sure that's something I could use. That seems to be a "pre-order" page, so it sounds neither open source nor existent. And I'm trying to figure out what their normal non-GPU non-distributed software is, but it seems to mostly be a pile of Javadocs.
Using distributed computing on mere gigabytes of data is silly.
I think TinkerPop was something else back in 2011, but apologies if I've used the wrong terminology.
Knowing the ConceptNet project a little, I still don't understand the workload/algorithms you run on the database. In other words, «this doesn't work for us» doesn't help anybody.
It really depends on the kind of algorithm you run on the database.
Based on open-source projects, in read/write mode no DB can help you, since you end up loading everything into memory. As a noob NLP user, I'd rather use something like AjguDB: https://github.com/amirouche/ajgudb
It's interesting. I suppose most graph db operations could be easily enough broken down into a series of steps that could be performed in a more scalable but slower way.
Did your hand-rolled hashtable have any characteristics that would make its performance difficult for a smarter optimizer (if such a thing existed in Neo4j) to match?
Can you pseudocode an example slow query/operation and indicate how many edges/vertices were being considered at each step?
Sorry to ask these kinds of questions, I'm just really curious about the situation you described.
The failures of these databases were a lot more fundamental than I think you're looking for. And so far it hasn't been a trade-off where a non-graph DB has been more scalable but slower; instead, non-graph DBs have been more scalable and faster.
Here's what I have to be able to do in the database:
1. Import millions of edges from a flat file (time limit: 24 hours)
2. Query any node to return up to 100 edges connected to it (time limit: 100 milliseconds)
3. (nice to have) Find the maximal core of nodes that all have degree at least n to each other (time limit: a few hours)
4. Iterate all the edges between the nodes in a specified subset, such as the degree-3 core, which may still be millions of edges (time limit: a few hours)
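For what it's worth, requirements 1, 2, and 4 don't need anything graph-specific. Here's a minimal sketch of what they look like on top of plain SQLite -- illustrative only, assuming a tab-separated flat file of (start, end, relation, weight); it's not the actual ConceptNet schema or code:

    import sqlite3

    conn = sqlite3.connect("edges.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS edges (
            start_node TEXT NOT NULL,
            end_node   TEXT NOT NULL,
            rel        TEXT NOT NULL,
            weight     REAL NOT NULL
        )
    """)
    # For a real bulk load you'd create these after importing, not before.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_start ON edges (start_node)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_end ON edges (end_node)")

    def import_edges(path):
        # Requirement 1: load millions of edges from a tab-separated flat file.
        with open(path) as f:
            rows = (line.rstrip("\n").split("\t") for line in f)
            conn.executemany(
                "INSERT INTO edges (start_node, end_node, rel, weight)"
                " VALUES (?, ?, ?, ?)", rows)
        conn.commit()

    def edges_for_node(node, limit=100):
        # Requirement 2: up to 100 edges touching a node, in milliseconds.
        return conn.execute(
            "SELECT start_node, end_node, rel, weight FROM edges"
            " WHERE start_node = ? OR end_node = ? LIMIT ?",
            (node, node, limit)).fetchall()

    def edges_within(nodes):
        # Requirement 4: iterate every edge whose endpoints are both in a subset.
        node_set = set(nodes)
        for row in conn.execute("SELECT start_node, end_node, rel, weight FROM edges"):
            if row[0] in node_set and row[1] in node_set:
                yield row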
#3 is optional, and the alternative is to export all the edges and compute it outside the database. But it's the only thing here that's actually a graph algorithm. However, every open-source graph database I've tried is orders of magnitude too slow at one of the other steps. They either fail at importing, fail at iterating, or fail to respond to trivial queries in a timely manner.
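And computing #3 outside the database really is simple enough: it's the standard k-core "peeling" algorithm on an exported edge list. A minimal sketch (networkx's k_core does the same job if the graph fits in memory):

    from collections import defaultdict

    def k_core(edges, k):
        # Build an undirected adjacency map from an iterable of (node, node) pairs.
        neighbors = defaultdict(set)
        for u, v in edges:
            neighbors[u].add(v)
            neighbors[v].add(u)

        # Peel off nodes with degree < k; removing one can push its
        # neighbors below k, so keep going until nothing is left to remove.
        to_remove = [n for n, nbrs in neighbors.items() if len(nbrs) < k]
        while to_remove:
            node = to_remove.pop()
            for nbr in neighbors.pop(node, ()):
                nbrs = neighbors.get(nbr)
                if nbrs is None:
                    continue  # already peeled off
                nbrs.discard(node)
                if len(nbrs) < k:
                    to_remove.append(nbr)
        return set(neighbors)  # the nodes of the maximal k-core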
I forgot to mention one other non-graph-database system that met my requirements, which is Kyoto Cabinet. The main downside of it is the GPLv3 license.
[1] http://conceptnet5.media.mit.edu