I originally posted this question to Stack Exchange and am copying it here so I can get some insight from HN too :) The original (extended) post is here: http://dba.stackexchange.com/questions/162008/deciding-on-the-right-database-for-big-data
We've identified the following requirements of a database:
- Thousands of inserts/updates per second
- Has a solid aggregation strategy: aggregating data in N different ways is really important to us, so any hindrance in our ability to aggregate data is going to slow us down.
- Store + query related data (hierarchical/recursive JSON) -- ideally efficiently, though efficiency isn't a hard requirement: getting the data out in a useful format is very important to us.
- Partition tolerant (easily clustered, automated replication)
- Availability (heavy reads + writes): same story as the partition tolerance. Availability requirements scale with the popularity of the video games we're storing data for.
With those requirements in mind, we've been mulling over a few different choices.
- Cassandra: Nails the partition tolerance and availability requirements, but it's very limited in querying capability. There's also the intriguing ScyllaDB, which could give us extra headroom if we need it. With the addition of Spark to our infrastructure, Cassandra may be a good fit for our aggregation needs as well -- there's a rough sketch of that after this list. (And we realize Riak + others are compatible with Spark too.)
- MongoDB: Native support for JSON could be a big plus. It has a built-in aggregation pipeline, but we're unsure of its capability (a small pipeline example follows the list). Not too enthused about the master-slave replication; then again, we're not overly concerned with consistency.
- Postgres: Maybe we don't need NoSQL yet? With the right partitioning strategy, an RDBMS could work for our use case. Postgres has (from what I can tell) excellent support for JSON, and indexing on JSON properties could prove very useful for aggregation queries (see the Postgres sketch after this list).
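To make the Cassandra + Spark idea concrete, here's roughly the kind of aggregation job we have in mind. This is only a sketch: it assumes the spark-cassandra-connector is on the classpath, and the analytics.match_events keyspace/table and its columns are made up.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("match-aggregates")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    # Full-table read out of Cassandra (keyspace/table/columns are hypothetical).
    events = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="analytics", table="match_events")
              .load())

    # Batch aggregation over the whole table.
    (events.groupBy("game_id")
           .agg(F.count("*").alias("events"), F.avg("score").alias("avg_score"))
           .orderBy(F.desc("avg_score"))
           .show())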
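And this is the sort of thing we'd be asking of MongoDB's aggregation pipeline -- again just a sketch, against a hypothetical analytics.match_events collection:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
    events = client["analytics"]["match_events"]       # hypothetical collection

    # Average score and event count per game, highest-scoring first.
    pipeline = [
        {"$group": {
            "_id": "$game_id",
            "avg_score": {"$avg": "$score"},
            "events": {"$sum": 1},
        }},
        {"$sort": {"avg_score": -1}},
    ]
    for row in events.aggregate(pipeline):
        print(row)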
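For comparison, the Postgres version, aggregating directly off JSONB properties, with an expression index for the queries that filter on a single game. Also a sketch: the DSN and the match_events(payload jsonb) table are made up.

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")  # hypothetical DSN
    cur = conn.cursor()

    # Expression index on a JSON property -- handy for per-game queries.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_events_game
            ON match_events ((payload ->> 'game_id'))
    """)

    # Aggregate directly off the JSON properties.
    cur.execute("""
        SELECT payload ->> 'game_id' AS game_id,
               AVG((payload ->> 'score')::numeric) AS avg_score,
               COUNT(*) AS events
          FROM match_events
         GROUP BY 1
         ORDER BY avg_score DESC
    """)
    for game_id, avg_score, n in cur.fetchall():
        print(game_id, avg_score, n)

    conn.commit()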
1. When you say "thousands of inserts/updates per second" do you mean thousands of both? Or just thousands of inserts and maybe a smaller number of updates? It makes a difference because updates require finding the record first, and some data stores (like Hadoop) are basically write-only.
2. Are these "aggregations" going to be run on whole database tables at once (like the top score of each game), or on small chunks of them (like an individual player's scores)? For small chunks, strategies like indexing that speed up queries are essential, but for big batch processes, the MapReduce algorithm is the way to go.
I'm going to give my two cents based on a couple of alternative scenarios. In the first scenario, you're capturing gamers' scores/achievements/whatever and it's really only inserts, no updates or deletes except to correct errors. In that case you might go with a basic star schema implemented in a relational database, or something like BigTable if you need it faster. Read the Kimball book (The Data Warehouse Toolkit) if you don't know the model. Basically it's optimized for aggregation -- sums/averages/counts sliced and diced and pivoted any way you want to see them. At the center of the star schema is a big "fact" table that records every event/transaction in a write-once-then-read-only manner.
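A toy version of that fact-table layout, just to show the shape (sqlite3 so it runs anywhere; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Minimal star schema: an insert-only fact table plus two dimensions.
    cur.executescript("""
        CREATE TABLE dim_player (player_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_game   (game_id   INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE fact_score (
            player_id INTEGER REFERENCES dim_player(player_id),
            game_id   INTEGER REFERENCES dim_game(game_id),
            scored_at TEXT,
            score     INTEGER
        );
    """)
    cur.executemany("INSERT INTO fact_score VALUES (?, ?, ?, ?)", [
        (1, 1, "2017-01-01", 120),
        (1, 1, "2017-01-02", 90),
        (2, 1, "2017-01-02", 200),
    ])

    # Slice the same facts any way you like: per game, per player, per day...
    cur.execute("""
        SELECT game_id, COUNT(*) AS plays, AVG(score) AS avg_score, MAX(score) AS best
          FROM fact_score
         GROUP BY game_id
    """)
    print(cur.fetchall())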
In the second scenario, you're storing gamers' saved games or something, and you really do need updates and deletes as well as inserts, yet you still want to run aggregations at the level of individual gamers and individual games. In that case I think I'd use a NoSQL database with very fast access and redundancy, based on Amazon's Dynamo. Riak is based on that model, and Cassandra's distribution layer is Dynamo-inspired as well. You could have two "tables", one for players and one for games, with each data point inserted twice. Then you can get all the data for a game with one quick hash lookup, or all the data for a player, and run your aggregation at that level of detail. Really big system-wide analyses could run on the whole database using MapReduce (you mentioned Spark, that'd do it).
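Here's roughly what that double-write layout looks like in Cassandra terms. A sketch only: it assumes a local node and a pre-created keyspace, and every name here is made up.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("gamedata")

    # One table per access pattern: partitioned by player, and by game.
    session.execute("""
        CREATE TABLE IF NOT EXISTS scores_by_player (
            player_id text, game_id text, scored_at timestamp, score int,
            PRIMARY KEY (player_id, scored_at))""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS scores_by_game (
            game_id text, player_id text, scored_at timestamp, score int,
            PRIMARY KEY (game_id, scored_at))""")

    def record_score(player_id, game_id, scored_at, score):
        # The same event is written twice, once per table.
        session.execute(
            "INSERT INTO scores_by_player (player_id, game_id, scored_at, score)"
            " VALUES (%s, %s, %s, %s)", (player_id, game_id, scored_at, score))
        session.execute(
            "INSERT INTO scores_by_game (game_id, player_id, scored_at, score)"
            " VALUES (%s, %s, %s, %s)", (game_id, player_id, scored_at, score))

    # All of one player's data comes back from a single partition.
    rows = session.execute(
        "SELECT score FROM scores_by_player WHERE player_id = %s", ("alice",))
    print([r.score for r in rows])

The trade-off is write amplification in exchange for one cheap partition read per question ("everything for this player" or "everything for this game"), which matches the access pattern described above.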