However, for autocomplete you often want a weighted trie, because you have extra information to weight nodes by. With contacts, for example, you usually want to rank recent and frequent contacts higher.
My company has an open source trie implementation here for a client to do weighted contact auto complete: https://github.com/shortwave/trie
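For illustration, here's a minimal sketch of the idea (this is not the shortwave/trie API, just one way a weighted trie can look): each completed word stores a numeric score, e.g. derived from recency and frequency, and suggestions under a prefix are ranked by that score.

// Minimal weighted-trie sketch: scores are supplied by the caller (e.g. a
// blend of recency and frequency) and suggestions are ranked by them.
function makeWeightedTrie() {
  const root = {children: {}, score: null, word: null};

  function insert(word, score) {
    let node = root;
    for (const ch of word) {
      node = node.children[ch] ?? (node.children[ch] = {children: {}, score: null, word: null});
    }
    node.score = score;
    node.word = word;
  }

  function suggest(prefix, limit = 5) {
    let node = root;
    for (const ch of prefix) {
      node = node.children[ch];
      if (!node) return [];
    }
    // Collect every completion under the prefix, then rank by score.
    const hits = [];
    (function walk(n) {
      if (n.word !== null) hits.push(n);
      for (const ch in n.children) walk(n.children[ch]);
    })(node);
    return hits.sort((a, b) => b.score - a.score).slice(0, limit).map((n) => n.word);
  }

  return {insert, suggest};
}

const contacts = makeWeightedTrie();
contacts.insert("alice", 42);  // scores here are made up for the example
contacts.insert("albert", 7);
contacts.suggest("al"); // -> ["alice", "albert"]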
I first learned about tries when I implemented a spell checker, almost 20 years ago now, with basic suggestions. It's amazing how easy it is to get an efficient, 80% spell checker and recommendation engine implemented from scratch once you dig into it. Tries make for an efficient enough in-memory lookup structure (space, storage, and compute), and are an obvious part of a suggestion engine as well. Couple that with a basic soundex, some word mangling, and edit-distance checking and you have a very functional system. Solving the last 20% takes an order of magnitude more work :)
I've learned recently that a trie can also be succinctly encoded, i.e. encoded as a string that contains almost only the node values and ~nothing else. This way, for some languages you can encode all the valid words of that language in less space than is needed for the Hunspell dictionary used to generate them. Plus you get instant start-up, since the dictionary has already been processed at build time.
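As a rough illustration of the flavor (a flat DFS serialization, not a real succinct structure like LOUDS): a nested-object trie where `_` marks word ends, like the JavaScript example further down the thread, can be flattened into a string that is almost nothing but the node characters plus two one-character markers.

// Sketch: serialize a {char: childNode, _: 1} trie into a flat string by DFS.
// '!' marks "end of word", '<' marks "go back up one level", so the output is
// almost entirely the node characters themselves.
function encode(node) {
  let out = "";
  for (const ch in node) {
    if (ch === "_") { out += "!"; continue; }
    out += ch + encode(node[ch]) + "<";
  }
  return out;
}

function decode(encoded) {
  const root = {};
  const stack = [root];
  for (const ch of encoded) {
    const top = stack[stack.length - 1];
    if (ch === "!") top._ = 1;          // end-of-word marker
    else if (ch === "<") stack.pop();   // climb back up
    else stack.push(top[ch] = {});      // descend into a new child
  }
  return root;
}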
There is another aspect of tries that makes them really useful: you can do fast, fuzzy word searches on a trie.
In Typesense [1], I've implemented fuzzy search based on Damerau-Levenshtein distance and it's incredibly fast. I've found this approach to be a much better (faster + more flexible) alternative to Peter Norvig's brute-force spell-checker, which is quite a popular post [2].
Does the trie help in memoizing the edit distance between multiple words? Or do you pay the O(n^2) cost per word? I'd love to hear details of how this works!
With brute force you have no idea what letters could follow a given letter, but with a trie you already know the possible continuations, so you avoid generating needless permutations.
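A sketch of that idea, assuming a plain nested-object trie where `_: 1` marks a complete word (as in the JavaScript example further down the thread), carrying one Levenshtein DP row down each branch. This follows the approach in the Steve Hanov post linked elsewhere in this thread, not necessarily how Typesense does it.

// Fuzzy search over a {char: child, _: 1} trie: one Levenshtein DP row per
// trie edge, pruning a branch when every cell already exceeds maxDist.
function fuzzySearch(root, query, maxDist) {
  const results = [];
  const firstRow = [...Array(query.length + 1).keys()]; // [0, 1, ..., n]

  function walk(node, ch, prevRow, prefix) {
    // Compute the next DP row for the character on this trie edge.
    const row = [prevRow[0] + 1];
    for (let i = 1; i <= query.length; i++) {
      const cost = query[i - 1] === ch ? 0 : 1;
      row.push(Math.min(row[i - 1] + 1, prevRow[i] + 1, prevRow[i - 1] + cost));
    }
    if (node._ === 1 && row[query.length] <= maxDist) {
      results.push({word: prefix, distance: row[query.length]});
    }
    // Prune: if every cell exceeds maxDist, no descendant can match.
    if (Math.min(...row) <= maxDist) {
      for (const next in node) {
        if (next !== "_") walk(node[next], next, row, prefix + next);
      }
    }
  }

  for (const ch in root) {
    if (ch !== "_") walk(root[ch], ch, firstRow, ch);
  }
  return results;
}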
Tries are also used to efficiently store names of functions in macOS / iOS binaries (Mach-O format) and work great for that purpose because a lot of functions have the same prefixes [0].
Indexes in MongoDB use a similar concept (they call it prefix compression [1]), which allows them to store indexes more efficiently.
We use them at Pyroscope, and we wrote a blog post about our storage design [2]. There are even some animations :)
The trie is probably my favourite data structure. I really enjoy this Python implementation I learned while studying leetcode because it's so succinct. Really useful for interviews.
from collections import defaultdict

END = object()

def make_trie():
    return defaultdict(make_trie)

def insert(trie, word):
    for c in word:
        trie = trie[c]
    trie[END] = True
It's such a simple and elegant data structure. If you only care about word lookup you can implement it in just 35 lines of modern JavaScript:
function trieBuilder(word_list) {
  const root = {};
  for (const word of word_list) {
    let node = root;
    for (const char of word) {
      let nextNode = node[char];
      if (!nextNode) node[char] = nextNode = {};
      node = nextNode;
    }
    node._ = 1; // mark the nodes that are endings of real words
  }

  // Depth-first collect up to maxLength complete words below this node.
  function findChildren(node, prefix, list, maxLength) {
    if (node._ === 1) list.push(prefix);
    for (const char in node) {
      findChildren(node[char], prefix + char, list, maxLength);
      if (list.length >= maxLength) return list;
    }
    return list;
  }

  // Walk down to the node for the typed prefix, then collect completions.
  function findSuffixes(prefix, maxLength) {
    prefix = prefix.toLowerCase();
    let node = root;
    for (const char of prefix) {
      let nextNode = node[char];
      if (!nextNode) return [""];
      node = nextNode;
    }
    let words = findChildren(node, prefix, [], maxLength);
    return words;
  }

  return {root, findSuffixes};
}
There is an option to get all suffixes without traversing the subtree, but it comes with extra O(N) memory, where N is the combined length of all stored words. Depending on the case that might be acceptable, since the memory for storing the words themselves is O(N) anyway.
https://stackoverflow.com/a/29966616/2104560 (update 1 and update 3)
Thanks! I just realized it doesn't work with strings containing underscores, but simply using __ instead of _ (so a double underscore) as the end-of-word marker key fixes that.
And thanks for the link, that is an interesting optimization!
EDIT: one fun non-practical application (histogramming the letters in a word is simpler and faster) is an anagram finder using prime numbers:
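A small sketch of the trick: map each letter to a prime and multiply; two words are anagrams exactly when their products are equal. BigInt sidesteps the overflow issue discussed below.

// Prime-product anagram finder (assumes lowercase a-z input).
const PRIMES = [2n, 3n, 5n, 7n, 11n, 13n, 17n, 19n, 23n, 29n, 31n, 37n, 41n,
                43n, 47n, 53n, 59n, 61n, 67n, 71n, 73n, 79n, 83n, 89n, 97n, 101n];

function primeKey(word) {
  let key = 1n;
  for (const ch of word.toLowerCase()) {
    key *= PRIMES[ch.charCodeAt(0) - 97]; // 'a' -> 2, 'b' -> 3, ...
  }
  return key;
}

// Group a word list by key to find anagram sets.
function anagramGroups(words) {
  const groups = new Map();
  for (const w of words) {
    const k = primeKey(w).toString();
    (groups.get(k) ?? groups.set(k, []).get(k)).push(w);
  }
  return [...groups.values()];
}

anagramGroups(["listen", "silent", "enlist", "trie"]);
// -> [["listen", "silent", "enlist"], ["trie"]]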
That's a nice idea. I guess it's better to stay within a primitive type's range (to avoid big-integer arithmetic), so we can "compress" up to 15 letters into a single prime product and keep it below 2^64 - for English words that should be just fine, I don't expect many words with 16 or more letters.
UPD: Sorry, "up to 15" is the wrong phrasing. I once checked how large a primorial (the product of the first primes) fits into a primitive type, and the product of the first 15 primes fits into a long. So it's possible to handle even more symbols if the word is something like "a" repeated 64 times, because the product would be just 2^64.
Note that to save memory, instead of using every character in the word, you may want to cap the depth of your trie at the first N chars. The words that share a prefix are kept in an array at the leaves. During lookups, you'd scan the leaf array to find matches. You'd be trading a bit of speed (negligible when scanning small arrays) for space (far fewer nested objects).
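Something like this, as a sketch (the depth cap of 3 is just an arbitrary choice for illustration):

// Depth-capped trie: only the first DEPTH characters form trie nodes; words
// sharing that prefix sit in a plain array at the leaf and are scanned linearly.
const DEPTH = 3; // hypothetical cap

function insertCapped(root, word) {
  let node = root;
  for (const ch of word.slice(0, DEPTH)) {
    node = node[ch] ?? (node[ch] = {});
  }
  (node.words ?? (node.words = [])).push(word);
}

function lookupCapped(root, word) {
  let node = root;
  for (const ch of word.slice(0, DEPTH)) {
    node = node[ch];
    if (!node) return false;
  }
  return (node.words ?? []).includes(word); // linear scan of the small bucket
}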
Fredkin's contributions were to reversible computation, and since quantum computers are a kind of reversible computer, many ideas and gate notions carried over.
I guess you can ask, can someone be a significant contributor to a field they don't believe in? Didn't Einstein contribute to Quantum Mechanics, while simultaneously harboring grave misgivings? (God doesn't play dice etc)
I wrote a trie based offensive language filter for a chat system used by the NHL's original streaming games broadcasts, way back in '99. It was written in C, and used the PerfectHash program to produce the hash table used by the trie to contain a dictionary of every known curse word, slang variant, and common misspellings in about a dozen languages. About 5 years ago I was looking at some open source chat systems, and I saw my old offensive language filter's entire build system in that open source project, pretty much just as I left it so many years ago. Someone else's name was on it, but in the C comments there's my name in the source files.
I used to hate tries, mostly bc I refused to take the time to understand them, but it is a pretty cool concept with a narrow but very practical use case. It does what it does really well, nothing more.
However, the idea of implementing it in JS gives me the impression that the author is advocating implementing autocomplete on the client (browser) [1]. I would encourage developers not to do so. Depending on the data set to be rendered, that could be a big perf hit or simply not feasible in certain scenarios; but if the data set is not dynamic or has a small memory footprint (e.g., the states in the US), then it should be fine to send that data to the client and process it there.
My general rule of thumb - limit business logic to the server and rendering logic to the browser.
You can implement server-side autocomplete - send the partial search string as a request to the server, which responds with the autocomplete list of strings to be rendered in the UI, debouncing [2] to limit the number of requests to the server (instead of sending one on every change to the partial search string). Of course, this has its trade-offs (API requests introduce their own perf concerns), but you want to go this route if your search data set is dynamic and sufficiently large.
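A minimal sketch of that wiring, where /autocomplete, renderSuggestions, and searchInput are all hypothetical names standing in for your own endpoint and UI code:

// Debounce keystrokes, then hit a (hypothetical) autocomplete endpoint instead
// of querying a client-side trie.
function debounce(fn, waitMs) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

const fetchSuggestions = debounce(async (prefix) => {
  const res = await fetch(`/autocomplete?q=${encodeURIComponent(prefix)}`);
  renderSuggestions(await res.json()); // renderSuggestions = whatever updates your UI
}, 250);

searchInput.addEventListener("input", (e) => fetchSuggestions(e.target.value));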
[1] Not sure if that was the author's intention bc I cannot access the author's blog on my work machine.
I wrote a little Trie in Javascript for practice a while back. It flattens the data into a single array for better locality. I think it ends up being pretty quick for lookups (but not insertions).
If I understand correctly, you're wondering how you might match "a bro do" with "a brown dog" or "a brown door", etc. Similar to what Google does. I've done this before. It's not a single data structure, but a process.
First, you normalize your strings to index (convert to lower case, throw out stop words such as "a", "the", "is", etc.) then you generate prefixes (or n-grams, I suppose). Say you want to do musicians and you have "Bob Dylan" as an entry. That would become "bob dylan" which would generate the prefixes "bo", "bob", "bob d", etc. You include spaces but once you hit a space, you start generating new prefixes. So you would also start doing "dy", "dyl", "dyla", and "dylan". The max length of the prefixes depends on how much memory you're willing to use.
Let's say Bob Dylan is mapped to some musician table in your DB and he has id 1000. I've done this with Redis, but you can use a trie as well. You'd store every prefix in the trie and the data on the node would be an array of musician ids. Now the magic: if you search "bo dyl" you would look up the prefix "bo", grab all the ids, then look up the prefix "dyl" and grab those ids (you would also try the full "bo dyl" first, but let's just assume nothing is there). Once you have those two arrays, you take their intersection, which, in our case, leaves an array containing id = 1000 along with any others that matched both.
That's, in general, how a simple autocomplete works. But you're better off using Redis or Elasticsearch if your data set is large. And if it's not, well... a simple hashmap is probably competitive and easier than tries.
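A rough sketch of that process, simplified to per-token prefixes (the description above also indexes prefixes that span spaces, like "bob d") and using a plain Map from prefix to a Set of ids in place of Redis or a trie:

// Prefix index: every 2+-character prefix of every token maps to a set of ids.
const index = new Map();

function addEntry(id, name) {
  const tokens = name.toLowerCase().split(/\s+/); // normalization kept minimal here
  for (const token of tokens) {
    for (let len = 2; len <= token.length; len++) {
      const prefix = token.slice(0, len);
      (index.get(prefix) ?? index.set(prefix, new Set()).get(prefix)).add(id);
    }
  }
}

function search(query) {
  const parts = query.toLowerCase().split(/\s+/).filter(Boolean);
  let ids = null;
  for (const part of parts) {
    const matches = index.get(part) ?? new Set();
    // Intersect the id sets for each query token.
    ids = ids === null ? new Set(matches) : new Set([...ids].filter((id) => matches.has(id)));
  }
  return ids ? [...ids] : [];
}

addEntry(1000, "Bob Dylan");
search("bo dyl"); // -> [1000]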
There's a searching concept known as edit (Levenshtein) distance that is close to, but not exactly, what you're talking about. To rephrase your question: is there a data structure that can pre-compute the Levenshtein distance for a set of results?
Yes, and it's crazy complicated. See Levenshtein Automata[1]. In reality, I would let the professionals handle this by either using Open/ElasticSearch or Apache Lucene.
An in-between option that is relatively simple and rather fast (though not as fast as the automata) is a trie-based optimization of Levenshtein distance: http://stevehanov.ca/blog/?id=114.
I'd flip that and decide that there is no algorithm that exists and is documented that you can't implement yourself, for amusement and educational value or, given time and the will, more than that. Just making that decision has value for you even if you don't actually do it terribly often.
Same as you can design a cpu in HDL and flash it to an FPGA. (The "Nand to Tetris" course seems popular) Same as you can write an OS. Same as you can write a compiler and/or interpreter.
Believing these things (and they're actually true!) gets us all away from learned helplessness. There's enough of that when it comes to actual silicon...
Nobody understands it all. _You_ understand as much of it as you choose, in the direction that takes you, as deep as you want to go. It's just work, a lot of it, but no more than that.
A quick hack for this is to create variations and include them in your main trie (with links to the canonical value/object/etc.). E.g. you might create a variation w/o vowels.
There are heuristics you can apply here rather than implementing full fuzziness or levenshtein comparators etc. Ultimately it depends on your use-case and what common variations users might try to use.
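For example, something like this (assuming a hypothetical insert(trie, key, value) helper that stores a value at the word's terminal node):

// Index a vowel-stripped variant alongside each word, both pointing at the
// same canonical entry.
function insertWithVariants(trie, word, canonicalId) {
  insert(trie, word, canonicalId);
  const noVowels = word.replace(/[aeiou]/g, "");
  if (noVowels !== word) insert(trie, noVowels, canonicalId);
}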
Not exactly what you want, but a second trie with the starting letters as the key and the complete word as the value might work for you. Your ABD example would work, but something like ABrD would not.
I remember having a class "competition" my freshman year of college to implement the fastest autocomplete system. It had to be Java, so I used JNI to call into a well-optimized set of C routines for constructing and querying a trie. I ended up winning by a lot - no one else even implemented a trie. I spent a lot of time making sure that I would have the fastest trie implementation, but that wasn't even necessary.
I first encountered a trie in code storing Unix network group mappings. Since IP addresses in a ___domain mostly vary at the end, a lot of IP addresses share the same first part. This greatly saved space and lookup time.
What's kind of funny is that I was reading something that talked about tries as if they were an exotic data structure and said something along the lines of "most developers will never implement a trie" - and I had just finished implementing a trie that day.
Yes, that's also the data-structure you want to use when working on a URL router for instance, in a http framework.
Although, unless you have 10,000 different HTTP routes, the performance gains might not be that significant, all things considered.
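For what it's worth, here's a toy sketch of the idea: each path segment becomes a trie node, and ":param" segments match anything. This mirrors how many routers work in spirit, not any particular framework's API.

// Segment-based routing trie with named parameters.
function makeRouter() {
  const root = {children: {}, handler: null, paramName: null};

  function add(path, handler) {
    let node = root;
    for (const seg of path.split("/").filter(Boolean)) {
      const key = seg.startsWith(":") ? ":" : seg;
      node = node.children[key] ?? (node.children[key] = {children: {}, handler: null, paramName: null});
      if (seg.startsWith(":")) node.paramName = seg.slice(1);
    }
    node.handler = handler;
  }

  function match(path) {
    let node = root;
    const params = {};
    for (const seg of path.split("/").filter(Boolean)) {
      const next = node.children[seg] ?? node.children[":"]; // literal match wins
      if (!next) return null;
      if (next.paramName) params[next.paramName] = seg;
      node = next;
    }
    return node.handler ? {handler: node.handler, params} : null;
  }

  return {add, match};
}

const router = makeRouter();
router.add("/users/:id/posts", "listUserPosts");
router.match("/users/42/posts"); // -> {handler: "listUserPosts", params: {id: "42"}}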
I learned about tries by leetcoding. This is exactly why I liked leetcoding. Because tries are awesome, and I was already imagining what one could build with it.
I reckon there's a kind of broken rationale where people think, "ok well this value will mutate over time so I can't possibly use a const", even though a new block scope is initialized on each iteration. The `for(let i in blah)` idiom is basically cargo-culted now without thought.
It's funny that you bring up cargo-culting, because I find the unflinchingly rigid rule of "const for everything unless the code requires a let/var" to be cargo-cultish. I think its proponents vastly overestimate the value you get out of it. In FE dev, mutability is typically an issue at the app-state level, and that's not something you fix with const. It also doesn't stop objects from being modified, and you shouldn't make assumptions about what the underlying JS engine does with it, as that can change.
Take the code in this article as an example. What danger can there possibly be in using a let when iterating over the set of characters in a word? The inside of the for loop is two lines long. The extra information that using const provides to whoever maintains this code is negligible. Using a const doesn't hurt, but that level of nitpicking doesn't add much value, whether in this thread or in code reviews.