I originally posted this question to Stack Exchange and am copying it here so I can get some insight from HN too :) The original (extended) post is here: http://dba.stackexchange.com/questions/162008/deciding-on-the-right-database-for-big-data
We've identified the following requirements of a database:
- Thousands of inserts/updates per second
- Has a solid aggregation strategy: aggregating data in N different ways is really important to us, so any hindrance in our ability to aggregate data is going to slow us down.
- Store + query related data (hierarchical/recursive JSON) -- ideally efficiently, though efficiency isn't a hard requirement: getting the data out in a useful format is very important to us.
- Partition tolerant (easily clustered, automated replication)
- Availability (heavy reads + writes): same story as the partition tolerance. Availability requirements scale with the popularity of the video games we're storing data for.
With those requirements in mind, we've been mulling over a few different choices.
- Cassandra: Nails the partition tolerance and availability requirements, but it's very limited in querying capability. There's also the intriguing ScyllaDB, which could give us extra headroom if we need it. With the addition of Spark to our infrastructure, Cassandra may be a good fit for our aggregation needs as well -- there's a rough sketch of that after this list. (And we realize Riak + others are compatible with Spark too.)
- MongoDB: Native support for JSON could be a big plus. It has a built-in aggregation pipeline, but we're unsure of its capability (a small pipeline example follows the list). Not too enthused about the master-slave replication; then again, we're not overly concerned with consistency.
- Postgres: Maybe we don't need NoSQL yet? With the right partitioning strategy, an RDBMS could work for our use case. Postgres has (from what I can tell) excellent support for JSON, and indexing on JSON properties could prove very useful for aggregation queries (see the Postgres sketch after this list).
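To make the Cassandra + Spark idea concrete, here's roughly the kind of aggregation job we have in mind. This is only a sketch: it assumes the spark-cassandra-connector is on the classpath, and the analytics.match_events keyspace/table and its columns are made up.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("match-aggregates")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    # Full-table read out of Cassandra (keyspace/table/columns are hypothetical).
    events = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="analytics", table="match_events")
              .load())

    # Batch aggregation over the whole table.
    (events.groupBy("game_id")
           .agg(F.count("*").alias("events"), F.avg("score").alias("avg_score"))
           .orderBy(F.desc("avg_score"))
           .show())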
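And this is the sort of thing we'd be asking of MongoDB's aggregation pipeline -- again just a sketch, against a hypothetical analytics.match_events collection:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
    events = client["analytics"]["match_events"]       # hypothetical collection

    # Average score and event count per game, highest-scoring first.
    pipeline = [
        {"$group": {
            "_id": "$game_id",
            "avg_score": {"$avg": "$score"},
            "events": {"$sum": 1},
        }},
        {"$sort": {"avg_score": -1}},
    ]
    for row in events.aggregate(pipeline):
        print(row)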
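For comparison, the Postgres version, aggregating directly off JSONB properties, with an expression index for the queries that filter on a single game. Also a sketch: the DSN and the match_events(payload jsonb) table are made up.

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")  # hypothetical DSN
    cur = conn.cursor()

    # Expression index on a JSON property -- handy for per-game queries.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_events_game
            ON match_events ((payload ->> 'game_id'))
    """)

    # Aggregate directly off the JSON properties.
    cur.execute("""
        SELECT payload ->> 'game_id' AS game_id,
               AVG((payload ->> 'score')::numeric) AS avg_score,
               COUNT(*) AS events
          FROM match_events
         GROUP BY 1
         ORDER BY avg_score DESC
    """)
    for game_id, avg_score, n in cur.fetchall():
        print(game_id, avg_score, n)

    conn.commit()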
1. When you say "thousands of inserts/updates per second" do you mean thousands of both? Or just thousands of inserts and maybe a smaller number of updates? It makes a difference because updates require finding the record first, and some data stores (like Hadoop) are basically write-only.
2. Are these "aggregations" going to be run on whole database tables at once (like the top score of each game), or on small chunks of them (like an individual player's scores)? For small chunks, strategies like indexing that speed up queries are essential, but for big batch processes, the MapReduce algorithm is the way to go.
I'm going to give my two cents based on a couple of alternative scenarios. In the first scenario, you're capturing gamers' scores/achievements/whatever and it's really only inserts, no updates or deletes except to correct errors. In that case you might go with a basic star schema implemented in a relational database, or something like BigTable if you need it faster. Read the Kimball book (The Data Warehouse Toolkit) if you don't know the model. Basically it's optimized for aggregation -- sums/averages/counts sliced and diced and pivoted any way you want to see them. At the center of the star schema is a big "fact" table that records every event/transaction in a write-once-then-read-only manner.
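A toy version of that fact-table layout, just to show the shape (sqlite3 so it runs anywhere; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Minimal star schema: an insert-only fact table plus two dimensions.
    cur.executescript("""
        CREATE TABLE dim_player (player_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_game   (game_id   INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE fact_score (
            player_id INTEGER REFERENCES dim_player(player_id),
            game_id   INTEGER REFERENCES dim_game(game_id),
            scored_at TEXT,
            score     INTEGER
        );
    """)
    cur.executemany("INSERT INTO fact_score VALUES (?, ?, ?, ?)", [
        (1, 1, "2017-01-01", 120),
        (1, 1, "2017-01-02", 90),
        (2, 1, "2017-01-02", 200),
    ])

    # Slice the same facts any way you like: per game, per player, per day...
    cur.execute("""
        SELECT game_id, COUNT(*) AS plays, AVG(score) AS avg_score, MAX(score) AS best
          FROM fact_score
         GROUP BY game_id
    """)
    print(cur.fetchall())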
In the second scenario, you're storing gamers' saved games or something, and you really do need updates and deletes as well as inserts, yet you still want to run aggregations at the level of individual gamers and individual games. In that case I think I'd use a NoSQL database with very fast access and redundancy, based on Amazon's Dynamo. Riak is based on that model, and Cassandra's distribution layer is Dynamo-inspired as well. You could have two "tables", one for players and one for games, with each data point inserted twice. Then you can get all the data for a game with one quick hash lookup, or all the data for a player, and run your aggregation at that level of detail. Really big system-wide analyses could run on the whole database using MapReduce (you mentioned Spark, that'd do it).
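Here's roughly what that double-write layout looks like in Cassandra terms. A sketch only: it assumes a local node and a pre-created keyspace, and every name here is made up.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("gamedata")

    # One table per access pattern: partitioned by player, and by game.
    session.execute("""
        CREATE TABLE IF NOT EXISTS scores_by_player (
            player_id text, game_id text, scored_at timestamp, score int,
            PRIMARY KEY (player_id, scored_at))""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS scores_by_game (
            game_id text, player_id text, scored_at timestamp, score int,
            PRIMARY KEY (game_id, scored_at))""")

    def record_score(player_id, game_id, scored_at, score):
        # The same event is written twice, once per table.
        session.execute(
            "INSERT INTO scores_by_player (player_id, game_id, scored_at, score)"
            " VALUES (%s, %s, %s, %s)", (player_id, game_id, scored_at, score))
        session.execute(
            "INSERT INTO scores_by_game (game_id, player_id, scored_at, score)"
            " VALUES (%s, %s, %s, %s)", (game_id, player_id, scored_at, score))

    # All of one player's data comes back from a single partition.
    rows = session.execute(
        "SELECT score FROM scores_by_player WHERE player_id = %s", ("alice",))
    print([r.score for r in rows])

The trade-off is write amplification in exchange for one cheap partition read per question ("everything for this player" or "everything for this game"), which matches the access pattern described above.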