Ask HN: How would you rank 500M rows of data in 5 seconds?
2 points by iLoch on June 15, 2016 | 6 comments
We have a database (SQL Server) which has a table filled with stats. The table currently has around 10 million entries in it. The table is organized "horizontally" so that one user has one row, and we have about 50 stats in the table.

We have a leaderboard for every single one of these stats, and we want to have a "live" leaderboard - that is, the ranks are fresh on each request.

Right now we use an indexed RANK() query which can rank the top 100 in like 15ms. Top 10K is 100ms. Top 1M (or, offset 1M and get the next 100) is around 20 seconds.
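For illustration, the pattern described is presumably something like the following (a minimal sketch using Python's bundled SQLite as a stand-in for SQL Server; the table, column, and index names are made up). The index makes the top of the leaderboard cheap, but a deep OFFSET still forces the engine to step past every preceding row, which is why a deep page costs so much more than page 1:

```python
import sqlite3

# Toy stand-in for the stats table: one row per user, one column per
# stat (just one stat here for brevity). Scores are an arbitrary
# permutation of 0..999 so every value is distinct.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (user_id INTEGER PRIMARY KEY, stat_a INTEGER)")
conn.executemany("INSERT INTO stats VALUES (?, ?)",
                 [(i, (i * 37) % 1000) for i in range(1000)])
conn.execute("CREATE INDEX idx_stat_a ON stats(stat_a DESC)")

# Indexed RANK() with offset pagination: the window rank follows the
# index ordering, then OFFSET discards every row before the page.
page = conn.execute("""
    SELECT user_id, stat_a,
           RANK() OVER (ORDER BY stat_a DESC) AS rnk
    FROM stats
    ORDER BY stat_a DESC
    LIMIT 100 OFFSET 200
""").fetchall()
print(page[0])  # (user_id, score, rank) of the 201st-ranked user
```

One common mitigation is keyset pagination (`WHERE stat_a < last_seen_score ORDER BY stat_a DESC`), which seeks directly into the index instead of scanning past the offset, though it gives up true random access by rank.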

At the very least, we'd like to rank 1M within a couple of seconds (a 10x improvement), but ideally we'd be able to serve any number of ranks (up to, say, 10 million) in under a second. This is probably unreasonable, but it's our goal.

Is there any advantage to using a "vertical" table structure where each stat is a row? That's where the >100M figure comes from - if we convert these stats into rows, it's (number of users) x (number of stats).

Is there any specialized way of accomplishing this? Is the database slowing us down? Maybe there's a better way to store all of this stuff than in a DB? I was thinking perhaps storing it in the DB and an in-memory store at the same time, and repopulating the in-memory store when the server starts. But that wouldn't be much use if the in-memory solution doesn't give us a massive performance boost.
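The in-memory idea can be sketched roughly like this (hypothetical names throughout; assumes one stat's scores fit comfortably in RAM and the structure is rebuilt from the DB at startup). A sorted array per stat makes a leaderboard page a slice and a rank lookup a binary search:

```python
import bisect

class StatLeaderboard:
    """Hypothetical in-memory index for a single stat, rebuilt from the
    database at server start."""

    def __init__(self, user_scores):  # user_scores: {user_id: score}
        # (user, score) pairs, highest score first, for serving pages.
        self.board = sorted(user_scores.items(),
                            key=lambda kv: kv[1], reverse=True)
        # Ascending score list for O(log n) rank lookups.
        self.scores = sorted(user_scores.values())

    def page(self, offset, limit=100):
        # Any page of the ranking is a list slice: O(limit), no matter
        # how deep the offset is.
        return self.board[offset:offset + limit]

    def rank(self, score):
        # Competition rank: 1 + the number of strictly higher scores.
        return len(self.scores) - bisect.bisect_right(self.scores, score) + 1

lb = StatLeaderboard({f"user{i}": (i * 7) % 100 for i in range(100)})
top = lb.page(0, 3)
```

At ~10M users per stat this is on the order of tens of MB per stat, but keeping the rankings fresh would mean either periodic rebuilds or a structure that supports cheap insertion (e.g. a balanced tree).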

Right now we use dedicated servers for our application and database needs. We'd like to switch to cloud architecture, but we want to accomplish what is mentioned above while also reducing costs (we pay quite a bit for our current stack).




So to be clear:

    - You have 500M rows x 50 columns of stats for each row.
    - For any stat, you want the ability to display the top 10M.
    - In under one second.
Is that correct?

If so, I think the problem is not clearly described. What does it mean to display a result row? Are you displaying the stat value and a primary key? The entire row? What exactly does it mean to display 10M rows? Are they all presented on one web page? Or is it N at a time with prev/next links? Or is there random access within those 10M (e.g. find me in the ranking for stat #27)? Do you overlap retrieval with the display process (showing the first rows when available)? When someone clicks prev or next, does it recompute the entire result?


Pretty close. 10M rows for each stat -- this number grows linearly with the number of users. There are two cases for showing stats. The first is a leaderboard: random access (because of how we allow users to scroll through the leaderboards), where you'd be fetching 100 consecutive records anywhere from rank 0 of that stat to rank N, where N is the total number of rows for that stat. The second case is where you want to get the rankings across all stats for a particular user.

As for how the data is returned - whatever is fastest but ideally the response is sent back as JSON so it should all be available by the time it is sent back.

As for caching, I think it would be OK to cache the data for a short period of time, though I don't know how efficient it would be to cache 10 million rows at once (per stat).


Scrolling is not random access. Scrolling means starting at the beginning and continuing in order, stopping eventually. Rows are provided in a fixed order. Random access means jumping immediately to any row at any time.

All rankings of one user is an additional requirement, not mentioned in your first post.

I wouldn't necessarily rule out caching, but it isn't yet obvious that it is needed.

This is probably not the best way to discuss your requirements and possible solutions, going back and forth on HN. I consult if you'd like to talk further.


Yeah I suppose I haven't done a great job outlining all of the requirements. Is there a way we can get in touch? Not sure we're looking to hire a consultant yet but it would be good to have as an option if we decide to go that route.


Sure, I'm jao at geophile dot com.


The first thing I would try would be to put each variable in its own table. If that's not enough, you can go all the way to putting each stat on its own server.

With what you have right now, you are going to do 50 sorts on the same data, jumbling it all up unless all the stats are highly correlated. The entity-value structure you propose would probably be worse.

But, you have the data, you can measure these things.
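The per-stat-table suggestion, sketched with made-up names in SQLite: each narrow (user_id, value) table carries its own index, so reading a leaderboard is an ordered index scan with no sort step over a 50-column table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical narrow table for one stat, instead of one wide
# 50-column table. Each stat would get its own table and index.
conn.execute("CREATE TABLE stat_kills (user_id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO stat_kills VALUES (?, ?)",
                 [(i, (i * 13) % 500) for i in range(500)])
conn.execute("CREATE INDEX idx_kills ON stat_kills(value DESC, user_id)")

# Top of this stat's leaderboard: the index already stores rows in
# rank order, so this is a straight index scan.
top = conn.execute(
    "SELECT user_id, value FROM stat_kills ORDER BY value DESC LIMIT 3"
).fetchall()
```

As the comment says, the real answer is to measure: the same experiment can be run against the actual wide table to compare plans and timings.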



