Ask HN: What database should we use for our “big data” problem?
3 points by iLoch on Jan 24, 2017 | 3 comments
I originally posted this question to Stack Exchange, and am copying it here so I can get some insight from HN too :) The original (extended) post is here: http://dba.stackexchange.com/questions/162008/deciding-on-the-right-database-for-big-data

We've identified the following requirements of a database:

- Thousands of inserts/updates per second

- Has a solid aggregation strategy: aggregating data in N different ways is really important to us, and any hindrance in our ability to aggregate data is going to slow us down.

- Store + query related data (hierarchical/recursive JSON) -- ideally efficiently, but that's not required: getting the data out in a useful format is very important to us.

- Partition tolerant (easily clustered, automated replication)

- Availability (heavy reads + writes): same story as the partition tolerance. Availability requirements scale with the popularity of the video games we're storing data for.

With those requirements in mind, we've been mulling over a few different choices.

- Cassandra: Nails the partition tolerance and availability requirements, but it's very limited in querying capability. There's also the intriguing ScyllaDB, which could give us more capacity if needed. With the addition of Spark to our infrastructure, Cassandra may be a good fit for our aggregation needs as well (and we realize Riak + others are compatible with Spark too). A rough Spark sketch follows this list.

- MongoDB: Native support for JSON could be a big plus. It has a built-in aggregation pipeline, but we're unsure of its capability; a small pipeline sketch follows this list. Not too enthused about the master-slave replication, but again, we're not overly concerned with consistency.

- Postgres: Maybe we don't need NoSQL yet? With the right partitioning strategy, an RDBMS could work for our use case. Postgres has (from what I can tell) excellent support for JSON, and indexing on JSON properties could prove very useful for aggregation queries; an indexing sketch follows this list.
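
For the Cassandra + Spark option, an aggregation job could look roughly like this. This is only a sketch: it assumes the DataStax spark-cassandra-connector is available, and the gamestats keyspace, match_events table, and column names are made up for illustration.

    # Sketch only: assumes the spark-cassandra-connector package; keyspace,
    # table, and column names below are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("weapon-usage-rollup")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    events = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="gamestats", table="match_events")
              .load())

    # Aggregate the same scan "N different ways", e.g. uses/kills per weapon per day.
    daily_weapon_usage = (events
                          .groupBy("weapon_id", F.to_date("event_time").alias("day"))
                          .agg(F.count("*").alias("uses"),
                               F.sum("kills").alias("kills")))

    daily_weapon_usage.show()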
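
For the MongoDB option, the aggregation pipeline we'd be leaning on looks roughly like this (database, collection, and field names are made up; just a sketch of the shape of the query):

    # Sketch only: database, collection, and field names are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["gamestats"]["match_events"]

    # Roughly "how is each weapon being used over time", grouped per day.
    pipeline = [
        {"$match": {"game_id": "some-game"}},
        {"$group": {
            "_id": {
                "weapon": "$weapon_id",
                "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$event_time"}},
            },
            "uses": {"$sum": 1},
            "kills": {"$sum": "$kills"},
        }},
        {"$sort": {"_id.day": 1}},
    ]

    for row in events.aggregate(pipeline):
        print(row)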
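
For the Postgres option, the JSON indexing + aggregation we have in mind is roughly this (schema and JSON keys are invented for illustration; needs a running Postgres and psycopg2):

    # Sketch only: table layout and JSON keys are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=gamestats user=postgres")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS match_events (
            id       bigserial PRIMARY KEY,
            game_id  text NOT NULL,
            occurred timestamptz NOT NULL,
            payload  jsonb NOT NULL
        )
    """)

    # A GIN index speeds up containment queries on the whole document; an
    # expression index targets one JSON property used in aggregations.
    cur.execute("CREATE INDEX IF NOT EXISTS match_events_payload_gin "
                "ON match_events USING gin (payload)")
    cur.execute("CREATE INDEX IF NOT EXISTS match_events_weapon_idx "
                "ON match_events ((payload ->> 'weapon_id'))")

    # Aggregate directly on a JSON property, e.g. uses per weapon per day.
    cur.execute("""
        SELECT payload ->> 'weapon_id' AS weapon,
               date_trunc('day', occurred) AS day,
               count(*) AS uses
        FROM match_events
        WHERE game_id = %s
        GROUP BY 1, 2
        ORDER BY 2
    """, ("some-game",))
    print(cur.fetchall())
    conn.commit()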




I have a couple of questions for clarification:

1. When you say "thousands of inserts/updates per second", do you mean thousands of both? Or just thousands of inserts and maybe a smaller number of updates? It makes a difference because updates require finding the record first, and some data stores (like Hadoop) are basically write-once.

2. Are these "aggregations" going to be run on whole database tables at once (like the top score of each game), or on small chunks of them (like an individual player's scores)? For small chunks, strategies like indexing that speed up queries are essential, but for big batch processes, the MapReduce algorithm is the way to go.

I'm going to give my two cents based on a couple of alternative scenarios. In the first scenario, you're capturing gamers' scores/achievements/whatever and it's really only inserts, no updates or deletes except to correct errors. In that case you might go with a basic star schema implemented in a relational database, or something like BigTable if you need it faster. Read the Kimball book (The Data Warehouse Toolkit) if you don't know the model. Basically it's optimized for aggregation -- sums/averages/counts sliced and diced and pivoted any way you want to see them. At the center of the star schema is a big "fact" table that basically records all events/transactions in a write-once-then-read-only manner.
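
To make the star schema concrete, here's a toy sketch. All table/column names are invented, and I'm using sqlite3 only so it runs anywhere; the real thing would live in whatever warehouse you pick.

    # Toy star schema: one append-only "fact" table plus small dimension tables.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE dim_player (player_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_game   (game_id   INTEGER PRIMARY KEY, title TEXT);

        -- The fact table: write-once rows, one per scoring event.
        CREATE TABLE fact_score (
            player_id INTEGER REFERENCES dim_player(player_id),
            game_id   INTEGER REFERENCES dim_game(game_id),
            scored_at TEXT,
            points    INTEGER
        );
    """)

    db.executemany("INSERT INTO dim_player VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    db.executemany("INSERT INTO dim_game VALUES (?, ?)", [(10, "Space Shooter")])
    db.executemany("INSERT INTO fact_score VALUES (?, ?, ?, ?)", [
        (1, 10, "2017-01-01", 50),
        (2, 10, "2017-01-01", 80),
        (1, 10, "2017-01-02", 120),
    ])

    # Slice and dice straight off the fact table, e.g. each player's best score per game.
    for row in db.execute("""
            SELECT g.title, p.name, MAX(f.points)
            FROM fact_score f
            JOIN dim_game g   USING (game_id)
            JOIN dim_player p USING (player_id)
            GROUP BY g.title, p.name"""):
        print(row)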

In the second scenario, you're storing gamers' saved games or something, and really need to do updates and deletes as well as inserts. Yet you want to run aggregations on data at the level of individual gamers and on individual games. In this case I think I'd use some NoSQL database with very fast access and redundancy, based on Amazon's Dynamo. Riak is based on that model. I can't remember if Cassandra is. You could have two "tables", one for players and one for games, and each data point would be inserted twice. So you could get all the data for a game with one quick hash lookup, or all the data for a player, and then run your aggregation at that level of detail. Really big system-wide analyses could run on the whole database using MapReduce (you mentioned Spark, that'd do it).
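
To make the "insert each data point twice" idea concrete, here's a rough sketch using the DataStax Python driver and two made-up query tables (events_by_player / events_by_game); the same pattern applies to Riak or whatever store you end up with.

    # Sketch only: assumes a Cassandra cluster and a hypothetical "gamestats"
    # keyspace with two tables holding copies of the same event, keyed differently.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("gamestats")

    insert_by_player = session.prepare(
        "INSERT INTO events_by_player (player_id, event_time, game_id, payload) "
        "VALUES (?, ?, ?, ?)")
    insert_by_game = session.prepare(
        "INSERT INTO events_by_game (game_id, event_time, player_id, payload) "
        "VALUES (?, ?, ?, ?)")

    def record_event(player_id, game_id, event_time, payload):
        # Same event written once per access path, so reads become a single
        # partition lookup by player or by game.
        session.execute(insert_by_player, (player_id, event_time, game_id, payload))
        session.execute(insert_by_game, (game_id, event_time, player_id, payload))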


1.) Over time, the number of new players playing a game decreases, to the point where most writes will be updates. That being said, in some cases we have access to totals via the APIs, so we just do a replace/upsert (a rough sketch of that kind of upsert is below these two points).

2.) We tend to do a bit of both. One of the main uses we have for aggregation right now is global-level analytics (i.e. how certain weapons in the game are being utilized over time).
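
Re: point 1, the kind of replace/upsert I mean would look roughly like this if we went the Postgres route (table and column names are made up for illustration, and ON CONFLICT needs Postgres 9.5+):

    # Sketch only: hypothetical player_totals table keyed by (player_id, game_id).
    import psycopg2

    conn = psycopg2.connect("dbname=gamestats user=postgres")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS player_totals (
            player_id text,
            game_id   text,
            kills     bigint,
            playtime  bigint,
            PRIMARY KEY (player_id, game_id)
        )
    """)

    # Totals come straight from the game's API, so we overwrite rather than increment.
    cur.execute("""
        INSERT INTO player_totals (player_id, game_id, kills, playtime)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (player_id, game_id)
        DO UPDATE SET kills = EXCLUDED.kills, playtime = EXCLUDED.playtime
    """, ("player-123", "some-game", 4821, 360000))
    conn.commit()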

I think the second approach sounds more closely aligned to what we're interested in doing.

The trouble with any of these solutions (and this is where I sort of neglected to make my description of the problem more complex) is that they're fairly one-dimensional. The reality is that one player may have multiple characters, each with their own stats, and we may want to track the statistics of each weapon of each character for each player as well. We may want to track all of this per playlist too. The relationships start to get very complex, as does the data.


Well, one of the reasons they call it Big Data is that it can get big. You can store the same data point multiple times - once in the "player" record, once in the "game" record, once in the "weapon" record, etc. Whatever makes it faster/easier to process.

If you've ever studied relational databases and SQL, you have to forget everything you learned about "normalization". If redundant copies or even multiple platforms and data models allow you to make faster queries, use them!



