Aguri: Coolest Data Structure You've Never Heard Of

codeslinger · on Jan 22, 2008

The point of the data structure in this case was not the radix trie alone but rather the combination of the use of tries in concert with the aggregation characteristics of IP traffic. This type of innovative combination is common in the network world due to the necessity of providing ever more introspection and functionality over ever higher throughputs and ever lower latencies. For some serious treatment of the subject, check out the book "Network Algorithmics" by George Varghese.

tptacek · on Jan 22, 2008

This is an awesome, awesome, awesome book. Aguri isn't in it; aguri doesn't even rate for this book. That's how awesome this book is.

anupamkapoor · on Jan 22, 2008

network-algorithmics is just soooo cool. really, what you must be reading after stevens.

Xichekolas · on Jan 22, 2008

He says binary trees are better default search table than hash table... binary tree search is O(lg(n)) while hash table is O(1)... so hash table is better for search, binary trees win in other ways (like ordered traversal if you need that), but no way you could say one is always better than the other.

Also confused how he proposes to build his Radix Trie in less than linear time... just seems like he ditches his original premise and launches into routing. There are selection algorithms that will find things in an unsorted array in O(lg(n))... but pretty sure building a tree of any sort is not part of it.

Radix Tries are sweet fun... so is just good old Radix Sort, but he starts with something that doesn't really make sense for the topic (the unsorted array he wants to pull three out of), then makes some weird claims, then jumps into routing with Radix Tries... not sure what is insightful here.

tptacek · on Jan 22, 2008

Hash tables are NOT O(1). You just failed the job interview.

You're like the 4th person I've seen that read (some) of this article and pulled this tree vs. hash thing out of it. It's a throwaway point. Binary trees provide comparable performance to hash tables, but also provide efficient access to ranges of keys. Binary trees support more operations efficiently than hash tables.

The article isn't about some war I've declared on hash tables.

gojomo · on Jan 22, 2008

"Hash tables are NOT O(1)."

For average lookup time, which is a reasonable context to assume in the preceding comment, yes they are.

Xichekolas · on Jan 22, 2008

"tree vs. hash thing ... It's a throwaway point"

If you had read more than (some) of my comment you might have gathered that was my entire point. The first page of your post was a throwaway point... nothing to do with your actual point, which seemed to be Radix Tries. It would've been nice to dispense with the distraction was all I was saying.

I know that given a lot of collisions that hash tables degenerate into linear insertion and lookup... and I did point out that binary trees are better for several things, but I still maintain that you can't just outright claim one is better than the other for every situation.

Oh, and as for failing the job interview... not really sure why that was necessary. You don't make a stronger point by being a dick about it.

tptacek · on Jan 22, 2008

The hash table O(1)/O(n) question is a classic job interview gotcha. I didn't mean to be a dick.

Xichekolas · on Jan 22, 2008

... and I shoulda been nicer in my original comment. Applying radix tries to routing was pretty interesting. Sorry to pick nits.

marketer · on Jan 22, 2008

What do you mean hash tables are not O(1)?

With a good hash function (collision resistant), the amortized cost of lookup is O(1), even if you need to resize the table.

bayareaguy · on Jan 22, 2008

Hash tables are NOT O(1). You just failed the job interview.

tptacek, what do you think they are?

tptacek · on Jan 22, 2008

The bound on lookup from a hash table is O(n).

bayareaguy · on Jan 23, 2008

As is the worst case in a binary tree.

tptacek · on Jan 23, 2008

If you use a simple binary tree, which nobody does, you're right. If you use a red-black tree, like everybody does, you're wrong; worst case search time for a balanced binary tree is O(log n).

Worst case search for a radix trie is always O(k); radix tries don't require balancing.

Where did you go to school?

bayareaguy · on Jan 23, 2008

Well, nobody I know uses hash tables without periodically "rebalancing" them either.

Where I went to school, we were taught not to compare the worst case behavior of one structure to the average of another. We were also taught not to confuse binary trees with red-black trees.

ckuehne · on Jan 23, 2008

I guess your school sucked. Red-black trees ARE binary search trees. Also: the worst case behavior of a hash table is in fact worse than that of a balanced (red-black tree, b-tree, avl-tree) search tree.

bayareaguy · on Jan 23, 2008

Red-black trees ARE binary search trees.

Agreed. However the original poster said binary tree, a specific structure whose worse case behavior is O(n) just like a hash table.

They didn't say binary search tree (which as you observe may refer to one of several possible structures).

tptacek · on Jan 23, 2008

Your school taught you to "balance" hash tables?

bayareaguy · on Jan 23, 2008

The quotes were there for a reason.

aston · on Jan 22, 2008

There aren't any algorithms to find things in an unsorted array in less than worst case linear time. I think you're confusing search (for some value) with select (a value with a given index in the sorted array).

eru · on Jan 23, 2008

If you want to do several searches on the same data - that is initially unsorted - you can get better than O(n) per query. But like you said - not for the first one.

Xichekolas · on Jan 22, 2008

Yeah, yeah I was... was just thinking about that over breakfast.

Tichy · on Jan 22, 2008

For hash tables you need to compute the hash function and execute an equals operation. Computing the hash function might take longer than traversing a tree. For example, if you key is "sdolkfhasodfgkokjhdfgfnaryksfdksfdgkjgf" the hash function will probably take every single letter into account (>20 operations, didn't count), whereas a binary tree might only need to look at the first letter at most nodes. So at least for relatively few entries, trees might perform better than hashtables, O(1) nonwithstanding.

mojuba · on Jan 22, 2008

Radix tries are nice. Use them when

(1) you need fast lookup and fast inserts

(2) you need to preserve the lexicograhpic order of your elements - something that a hash table won't give you

(3) you are afraid of DoS attacks or maybe just "bad" data - hash tables can be attacked easily if the implementation is known

(4) you don't care about memory usage that much.

I compared a simple 8-bit radix tree with some standard hash table implementation - the former took roughly ten times more memory. I then changed my radix to be based on 4 bits (each char is just split into 2 parts) and the memory usage improved twice. Now I'm wondering if radix tries have more room for improvement.

So this is something to consider.

And thanks for the article - never heard of Aguri, it's beautiful.

Upd. sorted lists basically give you same advantages except for fast inserts

tptacek · on Jan 22, 2008

Compressed binary radix tries (also known as PATRICIA trees) give you excellent memory usage characteristics and pretty much optimal search time characteristics: both wins come from only storing the edges corresponding to significant bit differences.

Since hash tables by necessity have to allocate enough storage to keep table load down and minimize collisions, you could find a PATRICIA tree has even better memory footprint than whatever hash table you use.

Of course, all data structures share the DoS problem. There's an input to a PATRICIA tree that pessimizes lookup and memory usage: the one that creates a full 32 bits of fanout for every key. In all cases, the attack degrades performance to linear search.

Of all the "mainstream" data structures, balanced binary trees are probably the hardest to attack this way.

mojuba · on Jan 22, 2008

both wins come from only storing the edges corresponding to significant bit differences.

Of course I used compressed suffixes, but it didn't help much as compared with hash tables.

you could find a PATRICIA tree has even better memory footprint than whatever hash table you use

I don't think you can demonstrate that Patricia has a better memory footprint than a good hashtable. In fact best radix trees are much worse.

Of course, all data structures share the DoS problem.

You can attack radix trees in terms of memory usage, while hash tables are prone to serious performance attacks.

In all cases, the attack degrades performance to linear search.

Not true. I don't think you understand these algorithms and O(n) complexity very well.

The question how a hash table degrades when attacked highly depends on how exactly conflict resolution is implemented. Among other methods is, for example, expansion and rehashing, and you can't tell for sure if it degrades to linear search or not.

And radix never degrades to linear search - never.

Of all the "mainstream" data structures, balanced binary trees are probably the hardest to attack this way.

They are prone to a very specific attack when you make the tree to re-balance often.

tptacek · on Jan 22, 2008

You're ignoring table load factor in your memory comparison, you can attack any dictionary "in terms of memory", I conceded the O(n)/O(k) lookup thing before you wrote this, and why do you think "balancing" a red-black tree is expensive?

mojuba · on Jan 22, 2008

You're ignoring table load factor in your memory comparison

I believe I don't ignore anything, I'm measuring precise memory usage by two identical programs doing same thing but based on different algorithms. I also believe I got the most out of radix in my tests. The 4-bit version of radix wasn't trivial, but it was reasonably fast and the mem usage was twice as better than the 8-bit one, as I said before.

But even speculatively, on paper if you wish, radix consumes more memory than some good hash table implementation.

Upd. forgot to mention, I did some more optimizations as well, like compressing the pointer tables in tree nodes.

tptacek · on Jan 22, 2008

I'm sorry, you're talking about a totally different algorithm than I am either here or in the article you're responding to. There's no such thing as a "4 bit PATRICIA" trie. Obviously, if you want to burn memory, you can increase the arity of your tree nodes to "speed up" (heh) fetches.

tptacek · on Jan 22, 2008

I am, of course, an idiot; you can't make a PATRICIA lookup worse than O(k).

garret · on Jan 22, 2008

D. J. Bernstein uses these too.

http://cr.yp.to/critbit.html

RajendranB · on Jan 24, 2008

The time space trade off between hashtables and binary trees are not touched upon in all the above writings. Hashtables will require continuous memory locations and hence from stack and hence won't work for large sets of data and binary search trees or their derivatives will allocate memory from heap and take more time (logn) for get and put than the constant time(O(1)) for hashtables.

bayareaguy · on Jan 24, 2008

It sounds like you're making a lot of assumptions about the storage hierarchy and addressing scheme.

hisham · on Jan 22, 2008

Wow reading this post was like replaying one of my interviews with Google...

dfranke · on Jan 22, 2008

Knuth has a pretty good treatment of tries in volume 3.

jwp · on Jan 22, 2008

Suffix array.

andrewf · on Jan 22, 2008

A couple of years back, googling for "suffix array" got you advertisements for jobs@google. I'm not sure whether Google HR ended the program, or just migrated to more obscure search terms :)

apathy · on Jan 23, 2008

Burrows-Wheeler Transform maybe?

One of the coolest compression algorithms evar...

eru · on Jan 23, 2008

It's no compression algorithm in its own right. Just a first step.

An interesting first step.