This was a fun (and distracting) challenge. After a lot of iterations, I' think I've squeezed about as much compression and performance as I can. You can play with the Trie/DAWG generator here:
I’m curious whether regular expression matching could be used instead of these tricky plain string structures or string+object structures; storing a trie in a regular expression (using grouping and or) should be both small in terms of data over the wire and fast for lookups. The only question is how fast compiling the javascript would be at the start; if that took too long, it could be broken into parts and each part only compiled when needed and then memoized.
John, do you have a link to the corpus you’re using, so that we can use it for tests? Or should people just send in code for you to test?
This was discussed a bit in the Hacker News thread after last blog post but I continue to feel that bloom filters aren't particularly useful here. Since any solution I use would require no errors it would mean that I'd have to place the bloom filter in front of another solution (likely the succinct trie structure that I described).
Let's assume that the bloom filter is as fast as a hash lookup (unlikely, but anyway) - at most it'll be saving me about 5-46ms of lookup time, with some non-trivial additional memory usage and data transfer. I'm 100% OK with having 5-46ms of lookup time if it means that I can have such an incredibly low memory profile. If lookups took somewhere around 200ms then yeah, I'd definitely look at layering a bloom filter on top of this. As it stands though using a bloom filter will only add additional code and memory usage for a negligible amount of benefit.
I would prefer that people spell words in my game that are actually words and not abuses of the system. I expect users to be challenging one another so I'd imagine that they wouldn't take too kindly towards one person getting an unnecessary (and unexpected) advantage.
> However the new Succinct Trie ends up being disturbingly slow (taking about 5.5ms to look up a word in Node.js on a fast machine)
Ouch! A succinct trie I implemented in C++ took 2-3 microseconds for lookups. I guess Javascript is not the best language for low-level data structures engineering.
http://lookups.pageforest.com
Source and format documentation here:
http://github.com/mckoss/lookups