That's approximately 200TB of data. I suggest sharding it across multiple machines, because a single SAN doing BDB lookups against one file of that size is still going to be a tad slow. Most large databases nowadays are partitioned by sharding, and since this is key/value data, a BDB (Berkeley DB) is probably one of the better fits. If you modify a lot of the same data frequently, memcache (an in-memory key/value store) can speed things up, but the total size in bytes may be a bigger issue than the actual lookup and indexing of the data.
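For what it's worth, the simplest sharding scheme is just hashing the key and picking a shard from the result. A rough Python sketch, assuming a fixed set of shards (the host names below are placeholders, not anything from this thread):

```python
import hashlib

# Minimal key-hash sharding sketch. Shard host names are made up for illustration.
SHARD_HOSTS = ["kv-shard-00", "kv-shard-01", "kv-shard-02", "kv-shard-03"]

def shard_for_key(key: str) -> str:
    """Hash the key and route it to one of N shards (hash mod N)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_HOSTS[int(digest, 16) % len(SHARD_HOSTS)]

# Every reader/writer using the same function agrees on where a key lives.
print(shard_for_key("ABCDEF"))  # e.g. 'kv-shard-01'
```

The catch with plain hash-mod-N is that adding or removing a shard reshuffles almost every key, which matters at 200TB.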
One cool property of such a large data set (~200TB) is that you are almost certain to see a lot of the same data repeated.
It'd be neat to try to reduce the overall data footprint by assigning some sort of signature to repeats. I.e. if you are storing the sequences (let's imagine for a sec that these are non-trivial sizes) ABC, ABCD, ABCDE and ABCDEF, you could replace ABC with, say, #1 and perhaps save a whole lot of space.
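A toy Python sketch of that signature idea, using the ABC/ABCD/ABCDE/ABCDEF example above (the names make_signature and prefix_table are invented here for illustration):

```python
# Store a repeated prefix once, then reference it by a short id
# in every sequence that starts with it.
prefix_table = {}  # signature id -> full prefix

def make_signature(prefix: str) -> str:
    sig = f"#{len(prefix_table) + 1}"
    prefix_table[sig] = prefix
    return sig

def compress(seq: str, prefix: str, sig: str) -> str:
    # Replace the leading prefix with its signature if it matches.
    return sig + seq[len(prefix):] if seq.startswith(prefix) else seq

sig = make_signature("ABC")
sequences = ["ABC", "ABCD", "ABCDE", "ABCDEF"]
print([compress(s, "ABC", sig) for s in sequences])
# ['#1', '#1D', '#1DE', '#1DEF'] -- the shared "ABC" bytes are stored once.
```

How much this saves obviously depends on how long and how common the repeats are; picking which prefixes to table is the interesting part.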
Yeah, but it's text, so I'm thinking about S3 and some interesting storage optimizations (a lot of it would be tag data, for example, so there are plenty of foreign keys).
I'd shard it based on some kind of distributed scheme... though I'm not sure what that might look like. Good tips though. :)
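One possible shape for that "distributed scheme" is consistent hashing, which avoids reshuffling most keys when you add or remove a node. A small sketch under that assumption (node names are placeholders):

```python
import bisect
import hashlib

class HashRing:
    """Tiny consistent-hashing ring: each node owns many points on the ring,
    and a key is served by the first node point at or after its hash."""

    def __init__(self, hosts, replicas=100):
        self._ring = []  # sorted list of (point, host)
        for host in hosts:
            for i in range(replicas):
                self._ring.append((self._hash(f"{host}:{i}"), host))
        self._ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def host_for(self, key: str) -> str:
        point = self._hash(key)
        idx = bisect.bisect(self._ring, (point, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.host_for("some-tag-key"))
```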