How long does an average hard drive last? You'd have to spend that 700k again every however many years that is (plus the extra bits you mentioned). Quite an operation, actually.
I actually find that fairly tame. For a point of comparison, Wikipedia gets ~$150M in revenue a year, an "asset rise" (I presume this is what non-profits call profit?) of ~$15M a year, and is sitting on about a quarter billion in the bank.
Not that they want to, but I think Wikipedia could fund this using their current donations if they wanted. Hell, I almost wonder if one of the big storage providers would do it for free if they could do it in their staging environment so they get real traffic. It would be less good than real backups, but extra copies are still extra copies even if they're unreliable.
A good portion of the text on Wikipedia relies on Wayback Machine links to remain verifiable. If they lose that, I guess the editors might have to comb every page for information which would need to be either re-sourced or deleted.
You're right, I guess it is tame and achievable as far as organisations go. I was imagining trying to get some friends together to have a decent percentage of the IA backed up, but that seems out of reach based on this napkin math. Not that that's necessarily demotivating, but it's going to depend on a lot of people intuitively seeing the value and keeping up their share.
Yeah, as a sort of pet project I don’t think backing up the whole thing is possible.
You might be able to back up a significant portion of the unique data in IA if you limited it to text files. I think they probably have the highest information-to-file-size ratio.
It's also probably the most likely to already be backed up, though. Interesting issue; you might also get somewhere by cutting the 50TB up into 10GB torrents (or 100GB or whatever, something reasonable for a consumer hard drive) and maybe adding a script that checks the torrent swarm stats to recommend a torrent to download.
Something where I run it, tell it I want to let it use 600GB, and it hands me torrent files for the least-seeded 600GB (rough sketch at the end of this comment). Maybe a super basic web UI so people can see how well backed up it is?
Unsure if people would sign on or not; I probably would. I’ve got 10 or so TB of NFS I’m not using I could chuck at it. I would guess there are other data hoarders out there who would do the same, but only if it were somewhat easy. I’m probably not going to volunteer to do an hour of rtorrent cleanup a week to make sure I’m backing up the right things.
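Rough sketch of the selection logic in Python; the chunk names, sizes, and seeder counts below are made up, and real swarm stats would have to come from scraping the tracker:

    # Pick the least-seeded chunk torrents that fit a storage budget.
    GIB = 1024 ** 3

    # Hypothetical swarm stats for fixed-size chunk torrents.
    torrents = [
        {"name": "ia-chunk-0001", "size": 10 * GIB, "seeders": 42},
        {"name": "ia-chunk-0002", "size": 10 * GIB, "seeders": 0},
        {"name": "ia-chunk-0003", "size": 10 * GIB, "seeders": 3},
        {"name": "ia-chunk-0004", "size": 10 * GIB, "seeders": 17},
    ]

    def pick_torrents(swarm, budget_bytes):
        """Greedily pick the worst-seeded torrents that still fit the budget."""
        chosen = []
        for t in sorted(swarm, key=lambda t: t["seeders"]):
            if t["size"] <= budget_bytes:
                chosen.append(t)
                budget_bytes -= t["size"]
        return chosen

    # "I want to donate 600GB" -> the 600GB that needs seeding the most.
    for t in pick_torrents(torrents, 600 * GIB):
        print(f"{t['name']}  ({t['seeders']} seeders)")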
This is a great question, and something of a state-of-the-art problem.
HDDs are sold with a lifetime read/write amount and a power-cycle warranty, along with (usually) some environmental operating envelope. Read/write endurance relates to the quality/density of the platters; the power-cycle limit is usually about the actuator & read/write head being reseated/wearing out. Environment is the same as for all other devices in a DC.
Most folks replace drives when they die (reads/writes stall or return garbage) or when the warranty runs out. Some will pay for a warranty extension, and some will just use the drive outside of warranty. How you use the drive, what environment it's in, etc., changes how much you can push things.
I'd say anywhere from 4-8 years, depending on how it's used. In many cases it can be cheaper to run a worse environment for your fleet (thus using less power on HVAC) and replace devices more frequently.
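Plugging that lifetime back into the ~700k figure from upthread, a quick back-of-the-envelope for the annualized replacement cost:

    # Amortized replacement cost for the ~$700k of drives mentioned upthread,
    # assuming the whole fleet turns over every 4-8 years.
    drive_cost = 700_000
    for life_years in (4, 8):
        print(f"{life_years}-year life: ${drive_cost / life_years:,.0f}/year")
    # -> 4-year life: $175,000/year
    # -> 8-year life: $87,500/year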
I tried for 6 weeks. Eventually, it just stops functioning. The same program and arguments spit out "segmentation fault" 33% of the time I run it, with the other 67% working perfectly. The only way I could explain it was that it was in a function outside of main, because when I put the exact same code in main, compiled, and ran it, it worked.
I have no other explanation. At some point, having too many nested loops and variables causes segmentation faults, whereas less complex code functioned without error. I needed to have certain things performed, and they only functioned in main.
Why would you try to do this in C of all languages? It's one of the worst choices, especially for a self-learner and a beginner like you. Consider: choosing another language could, on its own, 100% eliminate any possibility of getting a segfault! With just that, you'd be spared from having to produce an abomination of many thousands of LOC inside a single function, which is never (unless you're Donald Knuth) a good programming practice.
Python is slower but easier, and less likely to segfault out of the blue! You don't even have to have a main() function. If you just have an idea worth demoing quickly, I'd recommend switching to Python 3.
There's also the fact that hard drive capacities keep increasing, and significantly faster than the power required, so sooner or later for very long-term storage it'd become cheaper to migrate all your data from those 5-year-old 4TB drives to more modern 16TB ones. That's assuming you want hot access to the data and don't plan on spinning the drives down as soon as you've written to them, like you'd do for a cold backup of the whole IA.
I remember for a long time (I'm talking 20-ish years back here), every hard drive I bought had double or more the capacity of every drive I'd ever bought previously combined. My first ever 40MB (yes, megabyte) drive got upgraded to an 80MB one, that got upgraded to a 250MB one, then a 750MB, and then a whopping 2GB drive (how would I _ever_ fill that up???) - and so on. That's slowed down some, but I'm currently starting to think about upgrading my 8TB drives (RAID1 pair) with 20TB drives when the prices start to drop a bit more.
Drives do 140-220MB/s depending on the LBA position of the read head, and that's not really changing. 160MB/s is very common.
So for your 8TB drives, assuming 1MiB writes with a 20ms latency at 160MB/s, you can rewrite the drive ~155 times/year. At 20TB this drops to ~62 times/year.
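Quick check of that napkin math with the same assumptions (1MiB writes, 20ms latency each, 160MB/s transfer), in Python:

    # Full-drive rewrites per year: each 1 MiB write pays ~20 ms of
    # seek/rotational latency plus transfer time at 160 MB/s.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    def rewrites_per_year(capacity_tb, latency_s=0.020, mb_per_s=160):
        chunk = 1024 ** 2                             # 1 MiB write size
        writes_per_pass = capacity_tb * 1e12 / chunk  # writes to fill the drive
        secs_per_pass = writes_per_pass * (latency_s + chunk / (mb_per_s * 1e6))
        return SECONDS_PER_YEAR / secs_per_pass

    print(int(rewrites_per_year(8)))   # -> 155
    print(int(rewrites_per_year(20)))  # -> 62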
Do people really replace their drives when the warranty runs out? Hard drive manufacturers won't provide data recovery on drives that fail under warranty[1]. It makes more economical sense to just run a drive until it dies. You'll end up paying the price for a new drive either way, but less often if you ignore the warranty expiring.
1: I discovered this myself when a Seagate drive containing some important data failed under warranty. If you're foolish enough to send them a failed drive with data you need recovered (like I was), all they'll do is throw it in the bin and send you a replacement drive.
If this is a backup, you don't need it to be powered up and available 24x7.
So the question becomes more like "how long can an average hard drive sit powered down and still reliably power back up and be read?".
I'm fairly sure that's a lot longer than the single-digit years that'd be the probable answer to your question.
I wonder if there are useful guidelines for long-term storage of powered-down hard drives? My gut feel is the major failure modes would be electrolytic capacitor failure, bearings sticking as the lubrication ages, and obsolescence of the interfaces. I wonder how hard it'd be to find hardware that'd read my Mac SCSI hard drives from 25 years ago?
> How long does an average hard drive last? You'd have to spend that 700k again every however many years that is (plus the extra bits you mentioned). Quite an operation, actually.
You'd have to spend a lot more, because with that many drives, you need redundancy now.
True, that would be an up-front cost. At the same time, the IA is still live. This initial expense can be softened by building up redundancy over some years rather than trying to do everything at once.
> True, that would be an up-front cost. At the same time, the IA is still live. This initial expense can be softened by building up redundancy over some years rather than trying to do everything at once.
I think with that many drives you'd be losing them constantly, and I suppose you wouldn't know which ones until later (assuming you're doing an offline backup; if you aren't, you have to factor in power costs).
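To put rough numbers on "losing them constantly", a back-of-the-envelope in Python; the ~700k is from upthread, but the drive price and the ~1.5% annualized failure rate are my assumptions:

    # Expected drive failures per year for one full copy.
    budget = 700_000
    cost_per_drive = 225     # assumed price of a 16TB drive
    afr = 0.015              # assumed annualized failure rate

    drives = budget / cost_per_drive
    print(f"~{drives:.0f} drives -> ~{drives * afr:.0f} failures/year")
    # -> ~3111 drives -> ~47 failures/year, i.e. roughly one dead drive a week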
IA stores lots of redundant stuff in 5 file formats and none of them are particularly well-compressed, I think. There are (big) savings to be had, but maybe figuring that out (software dev and compute time) isn't worth it?
Electricity, bandwidth, and generally running a business are not free. Also, for these pay-as-you-go setups you'd need a considerable amount of free space available on demand.
That said, it's not an especially cheap option. Hetzner has storage boxes for EUR 2.5/TB/mo (in fixed 5 and 10TB sizes, though).
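For scale, that per-TB rate scaled up to archive sizes (ignoring the fixed box sizes and any transfer costs):

    # Hetzner storage boxes at EUR 2.5/TB/mo, scaled up per petabyte.
    eur_per_tb_month = 2.5
    tb_per_pb = 1_000
    monthly_per_pb = eur_per_tb_month * tb_per_pb
    print(f"EUR {monthly_per_pb:,.0f}/PB/month, EUR {12 * monthly_per_pb:,.0f}/PB/year")
    # -> EUR 2,500/PB/month, EUR 30,000/PB/year -- it adds up fast at archive scale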