Ask HN: How would you rank 500M rows of data in 5 seconds?
2 points by iLoch on June 15, 2016 | 6 comments
We have a database (SQL Server) which has a table filled with stats. The table currently has around 10 million entries in it. The table is organized "horizontally" so that one user has one row, and we have about 50 stats in the table.

We have a leaderboard for every single one of these stats, and we want to have a "live" leaderboard - that is, the ranks are fresh on each request.

Right now we use an indexed RANK() query which can rank the top 100 in like 15ms. Top 10K is 100ms. Top 1M (or, offset 1M and get the next 100) is around 20 seconds.
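For illustration, the pattern described is presumably something like the following (a minimal sketch using Python's bundled SQLite as a stand-in for SQL Server; the table, column, and index names are made up). The index makes the top of the leaderboard cheap, but a deep OFFSET still forces the engine to step past every preceding row, which is why a deep page costs so much more than page 1:

```python
import sqlite3

# Toy stand-in for the stats table: one row per user, one column per
# stat (just one stat here for brevity). Scores are an arbitrary
# permutation of 0..999 so every value is distinct.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (user_id INTEGER PRIMARY KEY, stat_a INTEGER)")
conn.executemany("INSERT INTO stats VALUES (?, ?)",
                 [(i, (i * 37) % 1000) for i in range(1000)])
conn.execute("CREATE INDEX idx_stat_a ON stats(stat_a DESC)")

# Indexed RANK() with offset pagination: the window rank follows the
# index ordering, then OFFSET discards every row before the page.
page = conn.execute("""
    SELECT user_id, stat_a,
           RANK() OVER (ORDER BY stat_a DESC) AS rnk
    FROM stats
    ORDER BY stat_a DESC
    LIMIT 100 OFFSET 200
""").fetchall()
print(page[0])  # (user_id, score, rank) of the 201st-ranked user
```

One common mitigation is keyset pagination (`WHERE stat_a < last_seen_score ORDER BY stat_a DESC`), which seeks directly into the index instead of scanning past the offset, though it gives up true random access by rank.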

At the very least, we'd like to rank 1M within a couple of seconds (a 10x improvement), but ideally we'd be able to serve any number of ranks (up to, say, 10 million) in under a second. This is probably unreasonable, but it's our goal.

Is there any advantage to using a "vertical" table structure where each stat is a row? That's where the >100M figure comes from - if we convert these stats into rows, it's (number of users) x (number of stats).

Is there any specialized way of accomplishing this? Is the database slowing us down? Maybe there's a better way to store all of this stuff than in a DB? I was thinking perhaps storing it in the DB and an in-memory store at the same time, and repopulating the in-memory store when the server starts. But that wouldn't be much use if the in-memory solution doesn't give us a massive performance boost.
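The in-memory idea can be sketched roughly like this (hypothetical names throughout; assumes one stat's scores fit comfortably in RAM and the structure is rebuilt from the DB at startup). A sorted array per stat makes a leaderboard page a slice and a rank lookup a binary search:

```python
import bisect

class StatLeaderboard:
    """Hypothetical in-memory index for a single stat, rebuilt from the
    database at server start."""

    def __init__(self, user_scores):  # user_scores: {user_id: score}
        # (user, score) pairs, highest score first, for serving pages.
        self.board = sorted(user_scores.items(),
                            key=lambda kv: kv[1], reverse=True)
        # Ascending score list for O(log n) rank lookups.
        self.scores = sorted(user_scores.values())

    def page(self, offset, limit=100):
        # Any page of the ranking is a list slice: O(limit), no matter
        # how deep the offset is.
        return self.board[offset:offset + limit]

    def rank(self, score):
        # Competition rank: 1 + the number of strictly higher scores.
        return len(self.scores) - bisect.bisect_right(self.scores, score) + 1

lb = StatLeaderboard({f"user{i}": (i * 7) % 100 for i in range(100)})
top = lb.page(0, 3)
```

At ~10M users per stat this is on the order of tens of MB per stat, but keeping the rankings fresh would mean either periodic rebuilds or a structure that supports cheap insertion (e.g. a balanced tree).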

Right now we use dedicated servers for our application and database needs. We'd like to switch to cloud architecture, but we want to accomplish what is mentioned above while also reducing costs (we pay quite a bit for our current stack).




So to be clear:

    - You have 500M rows x 50 columns of stats for each row.
    - For any stat, you want the ability to display the top 10M.
    - In under one second.
Is that correct?

If so, I think the problem is not clearly described. What does it mean to display a result row? Are you displaying the stat value and a primary key? The entire row? What exactly does it mean to display 10M rows? Are they all presented on one web page? Or is it N at a time with prev/next links? Or is there random access within those 10M (e.g. find me in the ranking for stat #27)? Do you overlap retrieval with the display process (showing the first rows when available)? When someone clicks prev or next, does it recompute the entire result?


Pretty close. 10M rows for each stat -- this number grows linearly with the number of users. There are two cases for showing stats. The first is a leaderboard: random access (because of how we allow users to scroll through the leaderboards), where you'd be fetching 100 consecutive records anywhere from rank 0 of that stat to rank N, where N is the total number of rows for that stat. The second case is where you want to get the rankings across all stats for a particular user.

As for how the data is returned - whatever is fastest but ideally the response is sent back as JSON so it should all be available by the time it is sent back.

As for caching, I think it would be OK to cache the data for a short period of time, though I don't know how efficient it would be to cache 10 million rows at once (per stat).


Scrolling is not random access. Scrolling means starting at the beginning and continuing in order, stopping eventually. Rows are provided in a fixed order. Random access means jumping immediately to any row at any time.

All rankings of one user is an additional requirement, not mentioned in your first post.

I wouldn't necessarily rule out caching, but it isn't yet obvious that it is needed.

This is probably not the best way to discuss your requirements and possible solutions, going back and forth on HN. I consult if you'd like to talk further.


Yeah I suppose I haven't done a great job outlining all of the requirements. Is there a way we can get in touch? Not sure we're looking to hire a consultant yet but it would be good to have as an option if we decide to go that route.


Sure, I'm jao at geophile dot com.


The first thing I would try would be to put each variable in its own table. If that's not enough, you can go all the way to putting each stat on its own server.

With what you have right now, you are going to do 50 sorts on the same data, jumbling it all up unless all the stats are highly correlated. The entity-value structure you propose would probably be worse.

But, you have the data, you can measure these things.
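The per-stat-table suggestion, sketched with made-up names in SQLite: each narrow (user_id, value) table carries its own index, so reading a leaderboard is an ordered index scan with no sort step over a 50-column table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical narrow table for one stat, instead of one wide
# 50-column table. Each stat would get its own table and index.
conn.execute("CREATE TABLE stat_kills (user_id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO stat_kills VALUES (?, ?)",
                 [(i, (i * 13) % 500) for i in range(500)])
conn.execute("CREATE INDEX idx_kills ON stat_kills(value DESC, user_id)")

# Top of this stat's leaderboard: the index already stores rows in
# rank order, so this is a straight index scan.
top = conn.execute(
    "SELECT user_id, value FROM stat_kills ORDER BY value DESC LIMIT 3"
).fetchall()
```

As the comment says, the real answer is to measure: the same experiment can be run against the actual wide table to compare plans and timings.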



