Ask YC: what's the best way to model a threaded comment system like YC?

kcl · on March 16, 2008

It's best to keep things in-memory if you can.

Since you mention MySQL I assume that's what you're using to persist data. Storing a tree in a flat database is kind of a challenge. This page (http://www.sitepoint.com/article/hierarchical-data-database) suggests two ways: adjacency lists, and modified preorder tree traversal. MPOTT is a cute idea, but that's about it. It will make your database schema more complex, make your algorithms more complex, and require preexisting entries to be changed when new ones arrive. It's not faster than doing things in memory, and it's just asking for trouble.

Try using adjacency lists and reconstructing the tree in memory each time you serve it. This will let you get away with a single query for both reading and writing. The dev.mysql.com link you mention warns against using the adjacency list model, but that's because it's an area where MySQL itself is inadequate. It's fine when you fix things in another language.

Here's a minimal schema:

topic, id, parent, payload

ID is unique and non-zero. Parent is the ID of the parent node (0 if top-level). Topic is shared by all comments in a given topic. When you select, select by topic and then recreate your tree by going through and reestablishing "children" lists for each of your parents. You can then use traditional tree algorithms.

This method is easily modified to support caching.

sah · on March 16, 2008

I've done threaded comments with adjacency lists like this in MySQL in the past, and it worked well. Note also that it's compatible with scores moving siblings around relative to one another -- just calculate sibling position based on both score and date. Your schema might look something like this:

id, date, topic_id, parent_id, score, payload

pg · on March 16, 2008

http://ycombinator.com/arc/arc2.tar

mmcgrana · on March 16, 2008

You might consider how much of the structure and ordering you actually need to maintain in the database, and how much you could compute in your app servers after pulling the raw data from the database.

For example, say that your comments table stored the user id, the post id, the parent comment id (where the parent is the one the user clicked "reply" on), the vote count, and a timestamp. Then to render a page for all of the comments for a particular post, you could just pull all of the comments keyed by that post id and then order/arrange them however you want in your application code, such as by timestamp or number of votes. This doesn't require any tricky db writes and generally removes load from the db (which is desirable since its the hardest part of the app to scale).

The harder part is dealing with a case like the user comments page on neww.yc, where you show all of the user's comments along with of the descendants of those comments. Aside from the traditional approaches to storing hierarchies in relational dbs like the ones you linked to, I know of two other possibilities. One is to store the path of nodes to the root as lexicographically-comparable strings. The other is to create a separate table that stores ancestry relationships, in which for each node you have one row for each of its ancestors in the tree up to the root. Again, both of these approaches would allow you to pull out exactly the comments that you needed, but you would have to do the arranging in the application code.

RyanGWU82 · on March 17, 2008

I worked at a company where we used a decimal number to represent a tree for a threaded forum. It meant that we could use an ORDER BY clause and quickly get all the descendants sorted, like this:

7.401 7.40101 7.40102 7.4010201 7.40103 7.4010301

The downside was that it wasn't nearly this straightforward -- numbers seemed to be assigned haphazardly. And the scheme wasn't documented well, so it was nearly impossible for new programmers to understand.

I would not recommend this approach again. If you need high-performance searches for descendants, a separate table relating ancestors to descendants will be much more straightforward in the long-run.

llimllib · on March 16, 2008

Thoughts: 1) it's not relational data; is a database necessarily the right place for you to store it?

2) If you do decide to use the database, you can increase the write speed of the nested sets model by leaving large gaps between the left and right values.

That is, if the single child of the root node has left and right values of 0 and 1000, and you insert another child, you give it left and right of 1000 and 2000, and update the parent.

Then you'll only have to renumber if you get a wicked long thread.

tyler · on March 16, 2008

I'm not sure I agree that we're not talking about relational data. I mean... each record is -related- to others in a parent-child sort of way. Maybe I'm misunderstanding what you mean by relational?

llimllib · on March 16, 2008

Ooh, bad memories, my boss made the same argument at me last week.

Anyway! The relational model uses tables to model all data. Though you can squeeze a comment tree into a relational data mode any number of ways, it is most clearly expressed by a hierarchical model as in XML, S-exprs or a tree data type in your favorite language.

So I'm not saying it can't be expressed as relational data, what I mean is that that representation is not its most natural or useful form.

At work I use a query like:

    select s1.id, s2.id, s3.id, s4.id ...
    FROM subject s1
    LEFT JOIN subject s2 WHERE s2.id=s1.parent
    LEFT JOIN subject s3 WHERE s3.id=s2.parent
    LEFT JOIN subject s4 WHERE s4.id=s3.parent
    ....
    WHERE s1.id=?

to get all parents of a given node. It's ugly, inefficient, and fragile. You can use the nested set model to simplify this particular query, but inserting rows becomes extremely complex and painful; you really just trade one problem for another.

So, while you can model a tree as relational data, it's not a good match.

noblethrasher · on March 17, 2008

I think what the parent meant is that such a tree is fundamentally relational (not merely that it can be modeled as such). Since a relation is just a mapping between tuples and a set of logical values (TRUE and FALSE in most cases) then clearly there is a well-defined relation among the notion of 'parent' and 'child'. Thus the problem is the de-facto standard query language. It seems to me that we could use new special query language (or an extension to SQL) that could handle creating trees from pure relations in a performant and lexically elegant way. Or maybe it’s already been tried and I missed it. (edit: looks like a lot of smart people have already figured it out)

jbert · on March 17, 2008

Agreed. An interesting alternative approach to a single 'parent' pointer in the record is the use of an ancestor table.

http://www.evolt.org/working_with_hierarchical_data_in_sql_u...

This allows all children (or parents) to be fetched in a single query. A suitable "order by" clause also groups them together for easier processing in one pass.

The cost is the same as any denormalisation - additional updates and integrity issues - so it isn't a no-brainer but it's probably a win in most such situations.

bootload · on March 17, 2008

"... what's the best way to model a threaded comment system like YC? ..."

A classic read is jwz's article on threaded messages ~ http://www.jwz.org/doc/threading.html jwz was a key hacker on mozilla ~ http://en.wikipedia.org/wiki/Jamie_Zawinski

clueless · on March 16, 2008

http://news.ycombinator.com/item?id=33888

clueless · on March 16, 2008

Also here is MY ineffecient way of doing it:

function article($link){

print_story($link->id);

print_comment(0,$link->id);

}

function print_comment($parent_id, $link_id){

$comments = $db->get_col("SELECT comment_id FROM table_comments WHERE comment_link_id=$link_id and comment_parent = $parent_id ORDER BY comment_votes DESC");

if ($comments) {

		$comment = new Comment;
		foreach($comments as $comment) {		
				$comment->print();
			}
				print_comment($comment_id,$link_id);
		}
								
	}

}

TABLE `table_comments` (

  `comment_id`, `comment_parent`, `comment_link_id`, `comment_votes`

)

If you store and manipulate all the data in memory, I don't see why this would be that bad?

domnit · on March 17, 2008

You can fetch all of the comments by comment_link_id in a single query, then work with the result to recursively print the tree.

tlrobinson · on March 16, 2008

One query per comment is terrible... I'm glad you realize it's inefficient :)

brett · on March 16, 2008

I went with more or less what nostrademons described. So far it's working well.

lampy · on March 16, 2008

Here's a way I've previously "validated" some of my designs in this area.

Look at Slashcode (http://www.slashcode.com/), the code that powers Slashdot. They have quite a few threading features and they obviously figured out performance issues.

a-priori · on March 16, 2008

The naïve approach is to give comments a parent (the reverse relationship gives you its children), and a boolean specifying whether it's a "front page" post. When selecting, order by votes. When rendering, do a depth-first search.

Is there any reason why that won't work?

pibefision · on March 16, 2008

It's a little more complicated.

When you store the tree on the DBMS, you have some algorithtms to do it well. For Example, the Nested Set (check the link up).

In that algorithm, the nodes are stored in a hierarchy that does not have any information about the votes.

So, you have a perfect tree, but with the first vote, you must update all the structure to reflect the new order of the child.

Nested set is great for speed reading, but not for writing often.

Sorry if I'm not very clear.

I was testing rails plugins : acts_as_nested_set and acts_as_better_nested and all have the same problem.

a-priori · on March 16, 2008

By its nature this site does not generally have deep nesting. Adjacency lists are inefficient with deep, narrow nesting. I think that performance would be acceptable with that approach.

However, nested sets are more elegant. I believe the best approach would be to do the reordering in the application. Retrieve the comments in arbitrary order (probably chronological), and sort the siblings by votes just before you render.

neelesh · on March 16, 2008

I used a slightly different approach of combining acts_as_tree (for faster updates/inserts) and added a dotted_id for faster reads. The dotted id is a string,as in 0001.0002.0007. Index this column and its fairly fast. The obvious limitation this approach creates is the depth of your comment tree. approx =dotted id col size/#of digits in max pk -1. Not an issue for many. When Im updating the hierarchy , I update root of the hierarchy being moved and then a single UPDATE to update the dotted ids in that subtree. If you are using MySQL, then the max length of a column that can be indexed is also limited. Don't remember the exact length.

tim2 · on March 16, 2008

Make a very simple database model with at least these columns: comment_id int, created timestamp, parent int, score int,...

When you need to view a page of comments that has not been loaded, load it into cache and then use your cache from there on out. When updates come in, write them directly to the cache, re-sort the cache as necessary, and then have the database updates run in the background.

For the cache tree model I use something like:

{'root_comment_id':123,'children_sorted_by_points':[]}

Cache expiration left as an exercise for you.

This can very easily handle something the size of hacker news.

edit: And this easily handles "many reads and many writes and an ever-changing order".

optimal · on March 16, 2008

More thoughts please on a system with both many reads and many writes and an ever-changing order (i.e., sorted by dynamic rankings instead of "time posted," a constant)!

pibefision · on March 16, 2008

When you add the "dynamic ranking" to the rdbms model, it's turns very complicated.

Would be a good idea to model the three on XML and store complete trees on the dbms? (thinking loud!) :)

tlrobinson · on March 16, 2008

Adjacency list in the database, build the tree in memory, cache the result in memcached if you need to.

xirium · on March 16, 2008

Have a message table and an ancestor table. The ancestor table allows you to find all ancestors of a message or all decendents of a message. This allows you to quickly retrieve a branch of the discussion tree.

shabda · on March 16, 2008

MPTT is generally the fastest way to get relational to tree data. If you are udsing Django code.google.com/p/django-mptt/

pibefision · on March 16, 2008

One more idea :

Is here the solution? http://arclanguage.org/item?id=3426