That's approximately 200TB of data. I suggest sharding it across multiple machines, because a single SAN doing BDB lookups against one file of that size is still going to be a tad slow. Most large databases nowadays are partitioned by sharding, and since this is key/value data, a BDB (Berkeley DB) is probably one of the better fits. If you modify a lot of the same data frequently, memcache (an in-memory key/value store) can speed things up, but the total size in bytes may be a bigger issue than the actual lookup and indexing of the data.
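For what it's worth, the simplest sharding scheme is just hashing the key and picking a shard from the result. A rough Python sketch, assuming a fixed set of shards (the host names below are placeholders, not anything from this thread):

```python
import hashlib

# Minimal key-hash sharding sketch. Shard host names are made up for illustration.
SHARD_HOSTS = ["kv-shard-00", "kv-shard-01", "kv-shard-02", "kv-shard-03"]

def shard_for_key(key: str) -> str:
    """Hash the key and route it to one of N shards (hash mod N)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_HOSTS[int(digest, 16) % len(SHARD_HOSTS)]

# Every reader/writer using the same function agrees on where a key lives.
print(shard_for_key("ABCDEF"))  # e.g. 'kv-shard-01'
```

The catch with plain hash-mod-N is that adding or removing a shard reshuffles almost every key, which matters at 200TB.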
One cool property of such a large data set (~200TB) is that you are almost certain to see a lot of the same data repeated.
It'd be neat to try to reduce the overall data footprint by assigning some sort of signature to repeats. I.e. if you are storing the sequences (let's imagine for a sec that these are non-trivial sizes) ABC, ABCD, ABCDE and ABCDEF, you could replace ABC with, say, #1 and perhaps save a whole lot of space.
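A toy Python sketch of that signature idea, using the ABC/ABCD/ABCDE/ABCDEF example above (the names make_signature and prefix_table are invented here for illustration):

```python
# Store a repeated prefix once, then reference it by a short id
# in every sequence that starts with it.
prefix_table = {}  # signature id -> full prefix

def make_signature(prefix: str) -> str:
    sig = f"#{len(prefix_table) + 1}"
    prefix_table[sig] = prefix
    return sig

def compress(seq: str, prefix: str, sig: str) -> str:
    # Replace the leading prefix with its signature if it matches.
    return sig + seq[len(prefix):] if seq.startswith(prefix) else seq

sig = make_signature("ABC")
sequences = ["ABC", "ABCD", "ABCDE", "ABCDEF"]
print([compress(s, "ABC", sig) for s in sequences])
# ['#1', '#1D', '#1DE', '#1DEF'] -- the shared "ABC" bytes are stored once.
```

How much this saves obviously depends on how long and how common the repeats are; picking which prefixes to table is the interesting part.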
Yeah, but it's text, so I'm thinking about S3 and some interesting storage optimizations (a lot of it would be tag data, for example, so there are plenty of foreign keys).
I'd shard it based on some kind of distributed scheme... though I'm not sure what that might look like. Good tips though. :)
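One possible shape for that "distributed scheme" is consistent hashing, which avoids reshuffling most keys when you add or remove a node. A small sketch under that assumption (node names are placeholders):

```python
import bisect
import hashlib

class HashRing:
    """Tiny consistent-hashing ring: each node owns many points on the ring,
    and a key is served by the first node point at or after its hash."""

    def __init__(self, hosts, replicas=100):
        self._ring = []  # sorted list of (point, host)
        for host in hosts:
            for i in range(replicas):
                self._ring.append((self._hash(f"{host}:{i}"), host))
        self._ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def host_for(self, key: str) -> str:
        point = self._hash(key)
        idx = bisect.bisect(self._ring, (point, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.host_for("some-tag-key"))
```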