
Amazon strongly implies that your data is stored with triple redundancy. You'd have to buy more than one disk to get triple redundancy yourself. You'd also need to "scrub" the data regularly to detect bit rot (100 bytes per TB are expected to go bad every year), and probably store the files with redundant coding to protect against it.

Plug: I am working on a startup (submitted to this YC round!) to solve this problem using user^Wcustomer-owned hardware for precisely the sort of reasons you describe. I am looking for co-founders. If anybody wants to talk, email is in my profile.

Plug #2: I wrote and submitted this article about Glacier yesterday but it sank fast: http://psranga.github.com/articles/possible-architecture-of-... Email me if you want to talk about going up against an 800 lb gorilla. :)



One external online provider really only counts as "one copy", ever. This is primarily because you cannot audit the ongoing storage architecture and processes of any given provider. You're looking for SPOFs, not how many disks may hold data replicas. One software error (or site/account hack) can wipe out all of your data. Or an entirely out-of-band error occurs: the provider goes belly-up.

Cloud storage is awesome in many ways. Yet it doesn't replace your backup strategy, it merely complements it.


We regularly run an ad campaign on reddit discussing that very notion:

http://www.reddit.com/comments/hg9oa/your_platform_is_on_aws...

... that a single provider is really just a single "copy".

It is also the reason that we build 's3cmd' into our environment and so many customers use it:

ssh user@rsync.net s3cmd put abc.txt s3://account/abc.txt


Very good points. I actually agree with you on most of them. Which is why I started my startup.


Since you're obviously experienced in this area, can you point me to a tool or article that describes this 'scrubbing'?

Is this something that people should be using on their old files/backups?


'Scrubbing' is just a fancy way of saying that files are read back from the media, their checksums recomputed and compared against stored checksums. If a checksum differs and the data was stored redundantly, recovery is carried out and the corrected data is written back to the media.

I'm too lazy and don't really do it with my multiple DVD, CD, and HDD backup directories. But ideally I should be doing it. My startup will make this sort of thing easy and automatic.
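For the curious, a minimal sketch of what a scrub pass looks like in Python, assuming you saved a manifest of checksums when the backup was made (the `checksum`/`scrub` names here are made up for illustration):

```python
import hashlib


def checksum(path, algo="sha256"):
    """Stream a file through a hash so large files don't load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def scrub(manifest):
    """manifest maps path -> hexdigest recorded at backup time.

    Returns the paths whose current contents no longer match, i.e.
    candidates for repair from a redundant copy.
    """
    return [path for path, expected in manifest.items()
            if checksum(path) != expected]
```

You'd rebuild the manifest at backup time and rerun `scrub` on a schedule; any path it returns gets restored from one of the other copies.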


Search for zfec. It lets you split a file into N chunks, any M of which are enough to reconstitute the original, so it protects against the loss of up to N-M chunks (assuming a corrupted chunk can be detected and discarded, e.g. via a checksum).
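To get a feel for the M-of-N idea without zfec's Reed-Solomon machinery, here's a toy single-parity version (N = M+1, so it survives losing any one chunk). This is an illustration of the principle, not how zfec itself works internally:

```python
def xor_blocks(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(chunks):
    """Given M equal-length data chunks, return M+1 chunks:
    the data plus one XOR parity chunk."""
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_blocks(parity, c)
    return chunks + [parity]


def recover(chunks, lost_index):
    """Rebuild the chunk at lost_index (marked None) by XOR-ing
    the surviving M chunks together."""
    survivors = [c for i, c in enumerate(chunks)
                 if i != lost_index and c is not None]
    out = survivors[0]
    for c in survivors[1:]:
        out = xor_blocks(out, c)
    return out
```

Real erasure codes like zfec's generalize this so you can pick N and M freely (e.g. 10-of-14) instead of being stuck at "any one loss".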


> (100 bytes per TB are expected to go bad every year)

That's unsettling. Source?


Sorry, I think I should retract that statement which I seem to have recalled mistakenly. The error rate seems to be quite a bit lower than that, so I will post an article here after I research it thoroughly.


Sorry, it's 10 bytes, not 100.

1 TB = 2^40 bytes

Amazon claims 99.999999999% durability. https://aws.amazon.com/glacier/faqs/

2^40 * (1 - 99.999999999/100) = 10.99 bytes
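Spelled out in Python, using exact rationals to avoid floating-point noise (this takes the per-byte reading of the durability figure, which the replies below question):

```python
from fractions import Fraction

durability = Fraction(99999999999, 10**11)  # eleven nines, exact
tib = 2 ** 40                               # bytes in one TiB
expected_lost = tib * (1 - durability)      # expected bytes lost per year
print(float(expected_lost))                 # about 10.995
```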


Isn't it more bytes in practice? That's 88 bits, but they could be spread over different bytes, right?

In other words: shouldn't you calculate the loss over the total number of bits?


Sorry, my comment above wasn't well thought out. Amazon's durability guarantee is on a per-object basis, not on a per-byte basis. I will post an article here after I research it thoroughly.



