Quote is unrelated to what you are saying! *use MapReduce to expand the regex ov...

inerte · on July 7, 2009

Imagine the following code:

money = 5;

This could be tokenized, and each token mapped to a few regex keys, stored as key-value pairs in a datastore. Google is big on MapReduce, which could be seen as computation on sets:

[a-z] = (DOCID_14566, DOCID_15999, DOCID_888)

\d{5} = (DOCID_15, DOCID_15999, DOCID_552)

Then when someone searches for a "number with five digits", the value for the \d{5} key is retrieved. It holds the documents where a match appears, and this set/list is combined with the sets/lists that match the other search keywords, and you have your results.
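A minimal sketch of that lookup step, assuming the index is just a dict from regex key to a set of document IDs. The keys and IDs are the ones from the example above; the `search` helper and the dict layout are made up for illustration, not anything Google is known to do:

```python
# Hypothetical precomputed index: regex key -> set of document IDs.
# Keys and IDs come from the example above; the layout is an assumption.
index = {
    r"[a-z]": {"DOCID_14566", "DOCID_15999", "DOCID_888"},
    r"\d{5}": {"DOCID_15", "DOCID_15999", "DOCID_552"},
}


def search(index, keys):
    """Return the documents listed under every requested key."""
    postings = [index.get(key, set()) for key in keys]
    if not postings:
        return set()
    # Combining the per-key sets is a plain set intersection.
    return set.intersection(*postings)


results = search(index, [r"[a-z]", r"\d{5}"])
```

Here `results` holds only the documents present under both keys. A key that was never precomputed simply yields an empty result.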

On a normal web search, the document is split into words and you can make these words into keys, so you know where they appear, right? (Those are the keys' values.)

Well, if you think that the user will type a regex like "\d{5}", then you can map it to documents too.

It's like pre-processing the possible regex that would match a document token, and searching these regexes instead of words like a normal web search.

It's "just" an extra step, splitting words (tokens) from a document, and assigning possible regexes for these tokens.
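Roughly, that extra indexing step could look like the sketch below. Everything here is an assumption for illustration: the `PATTERNS` list is hand-picked, the tokenizer is deliberately naive, and the `DOC_1`/`DOC_2` documents are invented. In MapReduce terms, the inner loop plays the "map" role (emit a pattern/document pair) and the dict of sets plays the "reduce" role (group by pattern).

```python
import re
from collections import defaultdict

# Hypothetical, hand-picked patterns to precompute as index keys.
PATTERNS = [r"[a-z]+", r"\d{5}", r"\d+"]


def tokenize(text):
    """Naive tokenizer: runs of letters or runs of digits."""
    return re.findall(r"[A-Za-z]+|\d+", text)


def build_index(docs):
    """Map each pattern to the set of documents containing a matching token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            for pattern in PATTERNS:
                # The whole token must match, not just a part of it.
                if re.fullmatch(pattern, token):
                    index[pattern].add(doc_id)
    return index


# Invented sample documents.
docs = {
    "DOC_1": "money = 5;",
    "DOC_2": "zip = 90210;",
}
index = build_index(docs)
```

With these two documents, only `DOC_2` ends up under the `\d{5}` key, since `5` is a single digit, while both land under `[a-z]+` and `\d+`.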

Not saying Google does it :p But I guess that's what the OP meant...

Edited: Before anyone notices how inefficient my example is: I wouldn't implement every possible \d as a key, or [a-b], [a-c], [a-d] etc. as further keys. This is just an example of how sets of precomputed regexes (of any depth) can be (ab)used.

tumult · on July 8, 2009

This makes sense and it could work :) I imagine you would have to spend time tuning the tokens you map out. I think this would break down pretty quickly with moderate or complicated regexes, but those might not be the majority of searches.

Is there any trickery you could do with the inner nodes of the B+ trees? I spent a few minutes thinking about mapped-out tokens and intermediate reductions but didn't come up with anything.

johngunderman · on July 7, 2009

Hey, it's all postulation :)