Well, there's always suffix arrays, so two 64-bit pointers (plus the original char) per char?
Assuming their claim of 116,000,000 matches for /./, and 5k mean document size, 70TB, uncompressed.
Papers like "Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching" [1] lead me to believe that the state of the art could be as low as 4 bits/char, in which case the answer would be a much more tractable 270GB.
As far as partitioning goes, now I'm getting really out of my league. At a first pass, I imagine you could keep the root and its neighbors on a primary server, and subtrees as large as will fit in RAM on a bunch of secondary servers. The first couple of chars get processed on the primary, and then however many subtrees are still alive get delegated in parallel to the secondaries. The secondaries return their answers to the primary, which returns them to the client.
I feel like I'm answering hypothetical questions at a job interview :p
The math doesn't work: 116,000,000 matches x 5k is 580 GB, not 70 TB. It fits on one big disk, and with 5800 machines holding 100 MB of plain text each you could search all of it naively in real time. (grep scans 100 MB in about 50 ms.)
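The back-of-envelope numbers, spelled out (the 50 ms grep figure is taken at face value from the comment):

```python
corpus = 116_000_000 * 5_000   # docs x mean doc size = 580 GB, not 70 TB
shard = 100e6                  # 100 MB of plain text per machine
machines = corpus / shard      # = 5800 machines
grep_time = 0.050              # grep scans one 100 MB shard in ~50 ms
# Every machine greps its own shard simultaneously, so the whole
# corpus is searched naively in ~50 ms wall-clock.
```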
5800 machines is practical for a real product. Hardware probably costs Google $40/month/CPU, so ~$230k/month, about equal to 10 engineers. So if a more clever solution needs a whole team to maintain it, then it's probably better to just use grep.
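The cost comparison, made explicit (the ~$23k/month fully loaded engineer cost is only implied by the "10 engineers" equivalence, not stated):

```python
machines = 5_800
cpu_cost = 40                   # assumed $/CPU/month for hardware
fleet = machines * cpu_cost     # = $232,000/month for the grep fleet
engineer = 23_000               # implied fully loaded $/month/engineer
team_equiv = fleet / engineer   # ~10 engineers' worth of hardware spend
```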
10 engineers to program one data structure? I'd be greatly surprised if 1 PhD and 2 engineers couldn't easily do the additional work beyond distributed grep in less than six months, with maybe one engineer to maintain it. That saves ~$200k/mo, and more if you go over the ~20 queries/second it takes to max out the 5800-CPU fleet.
[1] http://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf