Revised JavaScript Dictionary Search (ejohn.org)
25 points by atularora on March 22, 2011 | 9 comments


This was a fun (and distracting) challenge. After a lot of iterations, I think I've squeezed out about as much compression and performance as I can. You can play with the Trie/DAWG generator here:

http://lookups.pageforest.com

Source and format documentation here:

http://github.com/mckoss/lookups


I'm curious whether regular expression matching could be used instead of these tricky plain-string or string+object structures: storing a trie in a regular expression (using grouping and alternation) should be both small over the wire and fast for lookups. The only open question is how long compiling that JavaScript would take at startup; if it took too long, the expression could be broken into parts, with each part compiled only when needed and then memoized.
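
For example, here's a minimal sketch of the idea with a made-up five-word list (not John's corpus); a trie over ["car", "cart", "cat", "do", "dog"] collapses shared prefixes into alternation groups:

    var dict = /^(?:ca(?:rt?|t)|dog?)$/;

    dict.test("cart"); // true
    dict.test("ca");   // false -- "ca" is a prefix, not a word

    // To spread out the compile cost, compile one regex per first
    // letter, only on demand, and memoize the result:
    var sources = { c: "ca(?:rt?|t)", d: "dog?" }; // per-letter patterns
    var compiled = {};

    function isWord(word) {
      var bucket = word.charAt(0);
      if (!(bucket in sources)) return false;
      if (!(bucket in compiled)) {
        compiled[bucket] = new RegExp("^(?:" + sources[bucket] + ")$");
      }
      return compiled[bucket].test(word);
    }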

John, do you have a link to the corpus you’re using, so that we can use it for tests? Or should people just send in code for you to test?

Edit: nevermind, it's here: https://github.com/jeresig/trie-js/raw/master/dict/string.tx...


This has been a great set of blog posts.

Since you're covering all the bases, you should look at bloom filters.

Something like this could work and be created on the server:

http://dren.ch/js-bloom-filter/

(Note: the newest post there has compatible Perl and Ruby implementations.)

Ilya Grigorik has a couple of relevant posts:

http://www.igvita.com/2008/12/27/scalable-datasets-bloom-fil...

and a Ruby + Redis backend:

https://github.com/igrigorik/bloomfilter

Here are some of the memory footprints that Ilya reported:

    * 1.0% error rate for 1M items, 10 bits/item: 2.5 MB
    * 1.0% error rate for 150M items, 10 bits/item: 358.52 MB
    * 0.1% error rate for 150M items, 15 bits/item: 537.33 MB


This was discussed a bit in the Hacker News thread after the last blog post, but I continue to feel that bloom filters aren't particularly useful here. Since any solution I use must have no errors, I'd have to place the bloom filter in front of another, exact solution (likely the succinct trie structure that I described).

Let's assume the bloom filter is as fast as a hash lookup (unlikely, but suppose it is): at most it'll save me about 5-46ms of lookup time, at the cost of non-trivial additional memory usage and data transfer. I'm 100% OK with 5-46ms of lookup time if it means I can keep such an incredibly low memory profile. If lookups took somewhere around 200ms then yes, I'd definitely look at layering a bloom filter on top of this. As it stands, though, a bloom filter would only add code and memory usage for a negligible benefit.
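
Concretely, the layering I'm rejecting would look something like this (bloom.test and succinctTrie.contains are stand-in names, not real code from the posts):

    function lookup(word) {
      if (!bloom.test(word)) return false; // definite miss, fast path
      return succinctTrie.contains(word);  // exact check (the 5-46ms path)
    }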


Guaranteeing zero errors is a problem, but if the loaded bloom filter is small you could expand it to get a very low false-positive rate.

The first link would do the lookup in JavaScript, so it would run in the client.

And the storage is just a string, so there's no JavaScript parsing cost on load.
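
For illustration, a client-side test against a string-backed filter could look like this; the FNV hashing and 16-bits-per-character packing are assumptions for the sketch, not dren.ch's actual format:

    // The server packs the filter's bits 16 per character into an
    // ordinary JS string, so loading it is just a string literal --
    // no JSON or object parsing.
    function fnv1a(str, seed) {
      var h = (0x811c9dc5 ^ seed) >>> 0;
      for (var i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        // multiply by the FNV prime (16777619) mod 2^32, using shifts
        // so the intermediate product stays exact in a JS double
        h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
      }
      return h;
    }

    function mayContain(bitString, numHashes, word) {
      var m = bitString.length * 16;
      for (var k = 0; k < numHashes; k++) {
        var bit = fnv1a(word, k) % m;
        if ((bitString.charCodeAt(bit >> 4) & (1 << (bit & 15))) === 0) {
          return false; // a zero bit means the word is definitely absent
        }
      }
      return true; // all bits set: present, up to the false-positive rate
    }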


Why do you require 0% error?


I would prefer that people spell words in my game that are actually words and not abuses of the system. I expect users to be challenging one another, so I'd imagine they wouldn't take too kindly to one person getting an unnecessary (and unexpected) advantage.


If the error rate is 0.1%, it would take a user 1000 tries (on average) to find a non-word that the filter accepts. That's a lot of attempts.


> However the new Succinct Trie ends up being disturbingly slow (taking about 5.5ms to look up a word in Node.js on a fast machine)

Ouch! A succinct trie I implemented in C++ took 2-3 microseconds per lookup. I guess JavaScript isn't the best language for low-level data-structure engineering.
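
For context, the hot operation in a succinct-trie lookup is rank over a bit vector (counting set bits up to a position, to locate a node's children). A naive version in JS shows the shape of the work; real implementations precompute per-block counts so rank is O(1):

    // Naive rank1: number of 1-bits in positions 0..i of the vector.
    // bits is a plain array of 0s and 1s here (a simplification);
    // production code packs bits into integers and adds lookup tables.
    function rank1(bits, i) {
      var count = 0;
      for (var j = 0; j <= i; j++) count += bits[j];
      return count;
    }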



