Well, there's always suffix arrays, so two 64-bit pointers (plus the original char) per char?
Assuming their claim of 116,000,000 matches for /./, and 5k mean document size, 70TB, uncompressed.
Papers like "Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching" [1] lead me to believe that the state of the art could be as low as 4 bits/char, in which case the answer would be a much more tractable 270GB.
As far as partitioning goes, now I'm getting really out of my league. At a first pass, I imagine you could keep the root and its neighbors on a primary server, and subtrees as large as will fit in RAM on a bunch of secondary servers. The first couple of chars get processed on the primary, and then however many subtrees are still alive get delegated in parallel to the secondaries. The secondaries return their answers to the primary, which returns them to the client.
I feel like I'm answering hypothetical questions at a job interview :p
The math doesn't work: 116,000,000 matches x 5k is 580 GB, not 70 TB. It fits on one big disk, and with 5800 machines holding 100 MB of plain text each you could search all of it naively in real time. (grep scans 100 MB in about 50 ms.)
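The back-of-envelope numbers, spelled out (the 50 ms grep figure is taken at face value from the comment):

```python
corpus = 116_000_000 * 5_000   # docs x mean doc size = 580 GB, not 70 TB
shard = 100e6                  # 100 MB of plain text per machine
machines = corpus / shard      # = 5800 machines
grep_time = 0.050              # grep scans one 100 MB shard in ~50 ms
# Every machine greps its own shard simultaneously, so the whole
# corpus is searched naively in ~50 ms wall-clock.
```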
5800 machines is practical for a real product. Hardware probably costs Google $40/month/CPU, so ~$230k/month, about equal to 10 engineers. So if a more clever solution needs a whole team to maintain it, then it's probably better to just use grep.
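The cost comparison, made explicit (the ~$23k/month fully loaded engineer cost is only implied by the "10 engineers" equivalence, not stated):

```python
machines = 5_800
cpu_cost = 40                   # assumed $/CPU/month for hardware
fleet = machines * cpu_cost     # = $232,000/month for the grep fleet
engineer = 23_000               # implied fully loaded $/month/engineer
team_equiv = fleet / engineer   # ~10 engineers' worth of hardware spend
```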
10 engineers to program one data structure? I'd be greatly surprised if 1 PhD and 2 engineers couldn't easily do the additional work beyond distributed grep in less than six months, with maybe one engineer to maintain it. That saves ~$200k/mo, and more if you go over the ~20 queries/second it takes to max out the 5800-CPU fleet.
[1] http://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf