However, for autocomplete you often want a weighted trie, because you have extra information to weight nodes by. With contacts, for example, you usually want to rank recent and frequent contacts higher.
My company has an open source trie implementation here for a client to do weighted contact auto complete: https://github.com/shortwave/trie
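For illustration, here's a minimal sketch of the idea (this is not the shortwave/trie API, just one way a weighted trie can look): each completed word stores a numeric score, e.g. derived from recency and frequency, and suggestions under a prefix are ranked by that score.

// Minimal weighted-trie sketch: scores are supplied by the caller (e.g. a
// blend of recency and frequency) and suggestions are ranked by them.
function makeWeightedTrie() {
  const root = {children: {}, score: null, word: null};

  function insert(word, score) {
    let node = root;
    for (const ch of word) {
      node = node.children[ch] ?? (node.children[ch] = {children: {}, score: null, word: null});
    }
    node.score = score;
    node.word = word;
  }

  function suggest(prefix, limit = 5) {
    let node = root;
    for (const ch of prefix) {
      node = node.children[ch];
      if (!node) return [];
    }
    // Collect every completion under the prefix, then rank by score.
    const hits = [];
    (function walk(n) {
      if (n.word !== null) hits.push(n);
      for (const ch in n.children) walk(n.children[ch]);
    })(node);
    return hits.sort((a, b) => b.score - a.score).slice(0, limit).map((n) => n.word);
  }

  return {insert, suggest};
}

const contacts = makeWeightedTrie();
contacts.insert("alice", 42);  // scores here are made up for the example
contacts.insert("albert", 7);
contacts.suggest("al"); // -> ["alice", "albert"]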
I first learned about tries when I implemented a spell checker, almost 20 years ago now, with basic suggestions. It's amazing how easy it is to get an efficient, 80% spell checker and recommendation engine implemented from scratch once you dig into it. Tries make for an efficient enough in-memory lookup structure (space, storage, and compute), and are an obvious part of a suggestion engine as well. Couple that with a basic soundex, some word mangling, and edit-distance checking and you have a very functional system. Solving the last 20% takes an order of magnitude more work :)
I've learned recently that a trie can also be succinctly encoded, i.e. encoded as a string that contains almost only the node values and ~nothing else. This way, for some languages you can encode all the valid words of that language in less space than is needed for the Hunspell dictionary used to generate them. Plus you get instant start-up, since the dictionary has already been processed at build time.
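As a rough illustration of the flavor (a flat DFS serialization, not a real succinct structure like LOUDS): a nested-object trie where `_` marks word ends, like the JavaScript example further down the thread, can be flattened into a string that is almost nothing but the node characters plus two one-character markers.

// Sketch: serialize a {char: childNode, _: 1} trie into a flat string by DFS.
// '!' marks "end of word", '<' marks "go back up one level", so the output is
// almost entirely the node characters themselves.
function encode(node) {
  let out = "";
  for (const ch in node) {
    if (ch === "_") { out += "!"; continue; }
    out += ch + encode(node[ch]) + "<";
  }
  return out;
}

function decode(encoded) {
  const root = {};
  const stack = [root];
  for (const ch of encoded) {
    const top = stack[stack.length - 1];
    if (ch === "!") top._ = 1;          // end-of-word marker
    else if (ch === "<") stack.pop();   // climb back up
    else stack.push(top[ch] = {});      // descend into a new child
  }
  return root;
}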
There is another aspect of tries that makes them really useful: you can do fast, fuzzy word searches on a trie.
In Typesense [1], I've implemented fuzzy search based on Damerau-Levenshtein distance and it's incredibly fast. I've found this approach to be a much better (faster + more flexible) alternative to Peter Norvig's brute-force spell-checker, which is quite a popular post [2].
Does the trie help in memoizing the edit distance between multiple words? Or do you pay the O(n^2) cost per word? I'd love to hear details of how this works!
With brute force you have no idea what letters could follow a given letter, but with a trie you already know the possible continuations, so you avoid generating needless permutations.
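A sketch of that idea, assuming a plain nested-object trie where `_: 1` marks a complete word (as in the JavaScript example further down the thread), carrying one Levenshtein DP row down each branch. This follows the approach in the Steve Hanov post linked elsewhere in this thread, not necessarily how Typesense does it.

// Fuzzy search over a {char: child, _: 1} trie: one Levenshtein DP row per
// trie edge, pruning a branch when every cell already exceeds maxDist.
function fuzzySearch(root, query, maxDist) {
  const results = [];
  const firstRow = [...Array(query.length + 1).keys()]; // [0, 1, ..., n]

  function walk(node, ch, prevRow, prefix) {
    // Compute the next DP row for the character on this trie edge.
    const row = [prevRow[0] + 1];
    for (let i = 1; i <= query.length; i++) {
      const cost = query[i - 1] === ch ? 0 : 1;
      row.push(Math.min(row[i - 1] + 1, prevRow[i] + 1, prevRow[i - 1] + cost));
    }
    if (node._ === 1 && row[query.length] <= maxDist) {
      results.push({word: prefix, distance: row[query.length]});
    }
    // Prune: if every cell exceeds maxDist, no descendant can match.
    if (Math.min(...row) <= maxDist) {
      for (const next in node) {
        if (next !== "_") walk(node[next], next, row, prefix + next);
      }
    }
  }

  for (const ch in root) {
    if (ch !== "_") walk(root[ch], ch, firstRow, ch);
  }
  return results;
}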
Tries are also used to efficiently store names of functions in macOS / iOS binaries (Mach-O format) and work great for that purpose because a lot of functions have the same prefixes [0].
Indexes in MongoDB use a similar concept (they call it prefix compression [1]), which allows them to store indexes more efficiently.
We use them at Pyroscope, and we wrote a blog post about our storage design [2]. There are even some animations :)
The trie is probably my favourite data structure. I really enjoy this Python implementation I learned while studying leetcode because it's so succinct. Really useful for interviews.
from collections import defaultdict

END = object()

def make_trie():
    return defaultdict(make_trie)

def insert(trie, word):
    for c in word:
        trie = trie[c]
    trie[END] = True
It's such a simple and elegant data structure. If you only care about word lookup you can implement it in just 35 lines of modern JavaScript:
function trieBuilder(word_list) {
  const root = {};
  for (const word of word_list) {
    let node = root;
    for (const char of word) {
      let nextNode = node[char];
      if (!nextNode) node[char] = nextNode = {};
      node = nextNode;
    }
    node._ = 1; // mark the nodes that are endings of real words
  }

  // Depth-first collect up to maxLength complete words below this node.
  function findChildren(node, prefix, list, maxLength) {
    if (node._ === 1) list.push(prefix);
    for (const char in node) {
      findChildren(node[char], prefix + char, list, maxLength);
      if (list.length >= maxLength) return list;
    }
    return list;
  }

  // Walk down to the node for the typed prefix, then collect completions.
  function findSuffixes(prefix, maxLength) {
    prefix = prefix.toLowerCase();
    let node = root;
    for (const char of prefix) {
      let nextNode = node[char];
      if (!nextNode) return [""];
      node = nextNode;
    }
    let words = findChildren(node, prefix, [], maxLength);
    return words;
  }

  return {root, findSuffixes};
}
There is an option to get all suffixes without traversing the subtree, but it comes with extra O(N) memory, where N is the combined length of all stored words. Depending on the case that might be acceptable, since the memory for storing the words themselves is O(N) anyway.
https://stackoverflow.com/a/29966616/2104560 (update 1 and update 3)
Thanks! I just realized it doesn't work with strings containing underscores, but simply using __ instead of _ (so a double underscore) as the end-of-word marker key fixes that.
And thanks for the link, that is an interesting optimization!
EDIT: one fun non-practical application (histogramming the letters in a word is simpler and faster) is an anagram finder using prime numbers:
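A small sketch of the trick: map each letter to a prime and multiply; two words are anagrams exactly when their products are equal. BigInt sidesteps the overflow issue discussed below.

// Prime-product anagram finder (assumes lowercase a-z input).
const PRIMES = [2n, 3n, 5n, 7n, 11n, 13n, 17n, 19n, 23n, 29n, 31n, 37n, 41n,
                43n, 47n, 53n, 59n, 61n, 67n, 71n, 73n, 79n, 83n, 89n, 97n, 101n];

function primeKey(word) {
  let key = 1n;
  for (const ch of word.toLowerCase()) {
    key *= PRIMES[ch.charCodeAt(0) - 97]; // 'a' -> 2, 'b' -> 3, ...
  }
  return key;
}

// Group a word list by key to find anagram sets.
function anagramGroups(words) {
  const groups = new Map();
  for (const w of words) {
    const k = primeKey(w).toString();
    (groups.get(k) ?? groups.set(k, []).get(k)).push(w);
  }
  return [...groups.values()];
}

anagramGroups(["listen", "silent", "enlist", "trie"]);
// -> [["listen", "silent", "enlist"], ["trie"]]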
That's a nice idea. I guess it's better to stay within a primitive type's range (to avoid big-integer arithmetic), so we can "compress" up to 15 letters into a single prime product and keep it below 2^64 - for English words that should be just fine, I don't expect many words with 16 or more letters.
UPD: Sorry, "up to 15" is the wrong phrasing. I once checked how large a primorial (the product of the first primes) fits into a primitive type, and the product of the first 15 primes fits into a long. So it's possible to handle even more symbols if the word is something like "a" repeated 64 times, because the product would be just 2^64.
Note that to save memory, instead of using every character in the word, you may want to cap the depth of your trie at the first N chars. The words that share a prefix are kept in an array at the leaves. During lookups, you'd scan the leaf array to find matches. You'd be trading a bit of speed (negligible when scanning small arrays) for space (far fewer nested objects).
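Something like this, as a sketch (the depth cap of 3 is just an arbitrary choice for illustration):

// Depth-capped trie: only the first DEPTH characters form trie nodes; words
// sharing that prefix sit in a plain array at the leaf and are scanned linearly.
const DEPTH = 3; // hypothetical cap

function insertCapped(root, word) {
  let node = root;
  for (const ch of word.slice(0, DEPTH)) {
    node = node[ch] ?? (node[ch] = {});
  }
  (node.words ?? (node.words = [])).push(word);
}

function lookupCapped(root, word) {
  let node = root;
  for (const ch of word.slice(0, DEPTH)) {
    node = node[ch];
    if (!node) return false;
  }
  return (node.words ?? []).includes(word); // linear scan of the small bucket
}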
Fredkin's contributions were to reversible computation, and since quantum computers are a kind of reversible computer, many ideas and gate notions carried over.
I guess you can ask, can someone be a significant contributor to a field they don't believe in? Didn't Einstein contribute to Quantum Mechanics, while simultaneously harboring grave misgivings? (God doesn't play dice etc)
I wrote a trie based offensive language filter for a chat system used by the NHL's original streaming games broadcasts, way back in '99. It was written in C, and used the PerfectHash program to produce the hash table used by the trie to contain a dictionary of every known curse word, slang variant, and common misspellings in about a dozen languages. About 5 years ago I was looking at some open source chat systems, and I saw my old offensive language filter's entire build system in that open source project, pretty much just as I left it so many years ago. Someone else's name was on it, but in the C comments there's my name in the source files.
I used to hate tries, mostly bc I refused to take the time to understand them, but it is a pretty cool concept with a narrow but very practical use case. It does what it does really well, nothing more.
However, the idea of implementing it in JS gives me the impression that the author is advocating implementing autocomplete on the client (browser) [1]. I would encourage developers not to do so. Depending on the data set to be rendered, that could be a big perf hit or simply not feasible in certain scenarios; but if the data set is not dynamic or has a small memory footprint (e.g., the states in the US), then it should be fine to send that data to the client and process it there.
My general rule of thumb - limit business logic to the server and rendering logic to the browser.
You can implement server-side autocomplete - send the partial search string as a request to the server, which responds with the autocomplete list of strings to be rendered in the UI, debouncing [2] to limit the number of requests to the server (instead of sending one on every change to the partial search string). Of course, this has its trade-offs (API requests introduce their own perf concerns), but you want to go this route if your search data set is dynamic and sufficiently large.
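A minimal sketch of that wiring, where /autocomplete, renderSuggestions, and searchInput are all hypothetical names standing in for your own endpoint and UI code:

// Debounce keystrokes, then hit a (hypothetical) autocomplete endpoint instead
// of querying a client-side trie.
function debounce(fn, waitMs) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

const fetchSuggestions = debounce(async (prefix) => {
  const res = await fetch(`/autocomplete?q=${encodeURIComponent(prefix)}`);
  renderSuggestions(await res.json()); // renderSuggestions = whatever updates your UI
}, 250);

searchInput.addEventListener("input", (e) => fetchSuggestions(e.target.value));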
[1] Not sure if that was the author's intention bc I cannot access the author's blog on my work machine.
I wrote a little Trie in Javascript for practice a while back. It flattens the data into a single array for better locality. I think it ends up being pretty quick for lookups (but not insertions).
If I understand correctly, you're wondering how you might match "a bro do" with "a brown dog" or "a brown door", etc. Similar to what Google does. I've done this before. It's not a single data structure, but a process.
First, you normalize your strings to index (convert to lower case, throw out stop words such as "a", "the", "is", etc.) then you generate prefixes (or n-grams, I suppose). Say you want to do musicians and you have "Bob Dylan" as an entry. That would become "bob dylan" which would generate the prefixes "bo", "bob", "bob d", etc. You include spaces but once you hit a space, you start generating new prefixes. So you would also start doing "dy", "dyl", "dyla", and "dylan". The max length of the prefixes depends on how much memory you're willing to use.
Let's say Bob Dylan is mapped to some musician table in your DB and he has id 1000. I've done this with Redis, but you can use a trie as well. You'd store every prefix in the trie and the data on the node would be an array of musician ids. Now the magic: if you search "bo dyl" you would look up the prefix "bo", grab all the ids, then look up the prefix "dyl" and grab those ids (you would also try the full "bo dyl" first, but let's just assume nothing is there). Once you have those two arrays, you take their intersection, which, in our case, leaves an array containing id = 1000 along with any others that matched both.
That's, in general, how a simple autocomplete works. But you're better off using Redis or Elasticsearch if your data set is large. And if it's not, well... a simple hashmap is probably competitive and easier than tries.
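A rough sketch of that process, simplified to per-token prefixes (the description above also indexes prefixes that span spaces, like "bob d") and using a plain Map from prefix to a Set of ids in place of Redis or a trie:

// Prefix index: every 2+-character prefix of every token maps to a set of ids.
const index = new Map();

function addEntry(id, name) {
  const tokens = name.toLowerCase().split(/\s+/); // normalization kept minimal here
  for (const token of tokens) {
    for (let len = 2; len <= token.length; len++) {
      const prefix = token.slice(0, len);
      (index.get(prefix) ?? index.set(prefix, new Set()).get(prefix)).add(id);
    }
  }
}

function search(query) {
  const parts = query.toLowerCase().split(/\s+/).filter(Boolean);
  let ids = null;
  for (const part of parts) {
    const matches = index.get(part) ?? new Set();
    // Intersect the id sets for each query token.
    ids = ids === null ? new Set(matches) : new Set([...ids].filter((id) => matches.has(id)));
  }
  return ids ? [...ids] : [];
}

addEntry(1000, "Bob Dylan");
search("bo dyl"); // -> [1000]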
There's a searching concept known as edit (Levenshtein) distance that is close to, but not exactly, what you're talking about. To rephrase your question: is there a data structure that can pre-compute the Levenshtein distance for a set of results?
Yes, and it's crazy complicated. See Levenshtein Automata[1]. In reality, I would let the professionals handle this by either using Open/ElasticSearch or Apache Lucene.
An in-between option that is relatively simple and rather fast (though not as fast as the automata) is a trie-based optimization of Levenshtein distance: http://stevehanov.ca/blog/?id=114.
I'd flip that and decide that there is no algorithm that exists and is documented that you can't implement yourself, for amusement and educational value or, given time and the will, more than that. Just making that decision has value for you even if you don't actually do it terribly often.
Same as you can design a cpu in HDL and flash it to an FPGA. (The "Nand to Tetris" course seems popular) Same as you can write an OS. Same as you can write a compiler and/or interpreter.
Believing these things (and they're actually true!) gets us all away from learned helplessness. There's enough of that when it comes to actual silicon...
Nobody understands it all. _You_ understand as much of it as you choose, in the direction that takes you, as deep as you want to go. It's just work, a lot of it, but no more than that.
A quick hack for this is to create variations and include them in your main trie (with links to the canonical value/object/etc.). E.g. you might create a variation w/o vowels.
There are heuristics you can apply here rather than implementing full fuzziness or levenshtein comparators etc. Ultimately it depends on your use-case and what common variations users might try to use.
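For example, something like this (assuming a hypothetical insert(trie, key, value) helper that stores a value at the word's terminal node):

// Index a vowel-stripped variant alongside each word, both pointing at the
// same canonical entry.
function insertWithVariants(trie, word, canonicalId) {
  insert(trie, word, canonicalId);
  const noVowels = word.replace(/[aeiou]/g, "");
  if (noVowels !== word) insert(trie, noVowels, canonicalId);
}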
Not exactly what you want, but a second trie with the starting letters as the key and the complete word as the value might work for you. Your ABD example would work, but something like ABrD would not.
I remember having a class "competition" my freshman year of college to implement the fastest autocomplete system. It had to be Java, so I used JNI to call into a well-optimized set of C routines for constructing and querying a trie. I ended up winning by a lot - no one else even implemented a trie. I spent a lot of time making sure that I would have the fastest trie implementation, but that wasn't even necessary.
I first encountered a trie in code storing Unix network group mappings. Since IP addresses in a ___domain mostly vary at the end, a lot of IP addresses share the same first part. This greatly saved space and lookup time.
What's kind of funny is that I was reading something that talked about tries as if they were an exotic data structure and said something along the lines of "most developers will never implement a trie" - and I had just finished implementing a trie that day.
Yes, that's also the data-structure you want to use when working on a URL router for instance, in a http framework.
Although, unless you have 10,000 different HTTP routes, the performance gains might not be that significant, all things considered.
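For what it's worth, here's a toy sketch of the idea: each path segment becomes a trie node, and ":param" segments match anything. This mirrors how many routers work in spirit, not any particular framework's API.

// Segment-based routing trie with named parameters.
function makeRouter() {
  const root = {children: {}, handler: null, paramName: null};

  function add(path, handler) {
    let node = root;
    for (const seg of path.split("/").filter(Boolean)) {
      const key = seg.startsWith(":") ? ":" : seg;
      node = node.children[key] ?? (node.children[key] = {children: {}, handler: null, paramName: null});
      if (seg.startsWith(":")) node.paramName = seg.slice(1);
    }
    node.handler = handler;
  }

  function match(path) {
    let node = root;
    const params = {};
    for (const seg of path.split("/").filter(Boolean)) {
      const next = node.children[seg] ?? node.children[":"]; // literal match wins
      if (!next) return null;
      if (next.paramName) params[next.paramName] = seg;
      node = next;
    }
    return node.handler ? {handler: node.handler, params} : null;
  }

  return {add, match};
}

const router = makeRouter();
router.add("/users/:id/posts", "listUserPosts");
router.match("/users/42/posts"); // -> {handler: "listUserPosts", params: {id: "42"}}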
I learned about tries by leetcoding. This is exactly why I liked leetcoding. Because tries are awesome, and I was already imagining what one could build with it.
I reckon there's a kind of broken rationale where people think, "ok well this value will mutate over time so I can't possibly use a const", even though a new block scope is initialized on each iteration. The `for(let i in blah)` idiom is basically cargo-culted now without thought.
It's funny that you bring up cargo-culting, because I find the unflinchingly rigid rule of "const for everything unless the code requires a let/var" to be cargo-cultish. I think its proponents vastly overestimate the value you get out of it. In FE dev, mutability is typically an issue at the app-state level, and that's not something you fix with const. It also doesn't stop objects from being modified, and you shouldn't make assumptions about what the underlying JS engine does with it, as that can change.
Take the code in this article as an example. What danger can there possibly be in using a let when iterating over the set of characters in a word? The inside of the for loop is two lines long. The extra information that using const provides to whoever maintains this code is negligible. Using a const doesn't hurt, but that level of nitpicking doesn't add much value, whether in this thread or in code reviews.