I'm just curious, do all the people using these kinds of paid services just have really tiny hard drives? To back up 1TB of data via this system would cost around $250 a month in storage and transfer charges, and that's assuming very good compression is happening. What am I missing? Or maybe I'm just not rich enough for $3,000 / year?
Or maybe you're just doing a subset of your data, in which case I don't see how snapshotting is such a big win. Nice to have, but not important in the least if you are just doing a small subset of your data. True, it saves a few pennies on transfer charges, but you could save that much by doing the upload to S3 yourself.
My other concern with this is that if the tarsnap server ever goes away, customers risk losing their data, since the server maintains the mapping of S3 objects to blobs. That's worrying. Assurances are not mechanisms.
> do all the people using these kinds of paid services just have really tiny hard drives
Some tarsnap users back up all of their data; others just back up a subset. I'm not sure why you say that snapshotting isn't important -- in addition to the savings in bandwidth and storage, it makes tarsnap much faster and more convenient.
> if the tarsnap server ever goes away, customers risk losing their data, since the server maintains the mapping of S3 objects to blobs
If the tarsnap server dies, I can launch a replacement EC2 instance and regenerate the metadata (as described in the article). If you're concerned about the possibility of me going away... well, if I get hit by a bus the tarsnap server will keep running on its own for at least the immediate future, but I have to concede that at the moment tarsnap won't survive indefinitely without me.
But I do my best to make sure that I won't get hit by a bus. :-)
> My other concern with this is that if the tarsnap server ever goes away, customers risk losing their data, since the server maintains the mapping of S3 objects to blobs. That's worrying. Assurances are not mechanisms.
Agreed. I use duplicity instead, which is a Free program that is similar to tarsnap. It backs up to S3, but uses many fewer PUTs (since archives are several megabytes).
Anyway, a full backup of my homedir (minus music) costs me about $1.40 a month to store (with incremental backups every night). A small price to pay knowing that if my laptop blows up, I can be right back where I started in just a few hours. (Or if I delete a file accidentally, it is back in seconds.)
I also won't store my backups in anything but my own S3 buckets (they are encrypted, so privacy is not the issue). Is duplicity stable in your opinion? I am usually one to use alpha/beta software etc., but this is a long-term need. I am a big rdiff-backup fan, and this looks like a good alternative to my current strategy.
My current strategy is a little lame but works quite well: rsync daily to my home server, and about once or twice a week an EC2 instance is fired up with an Elastic Block Store volume attached and the home server does an rdiff-backup to it.
Tarsnap is really wonderful. I've been using it for about a month now and it's really simple. From your perspective, you're simply creating tar archives. No fuss, no muss.
On the tarsnap side, it makes sure not to duplicate storage or bandwidth for duplicate parts. Anytime you want to get a specific backup back, you just reference it by name. You can list the available archives. It's all encrypted. Pricing is based on what you actually use (rather than being rounded) which makes it ideal for small things as you can pay fractions of a cent.
Great post! I don't understand the paragraph about the cost of the PUTs and GETs though. The saving you get by batching writes seems marginal. Can you give some numbers?
The average block size the tarsnap server sees is about 30 kB (the tarsnap client tries to produce blocks of 64 kB on average, but then it compresses them individually before sending them to the server). This means that for every GB uploaded, there are about 33 thousand blocks.
S3 PUTs cost $0.01 per thousand PUTs, so writing each of the blocks as an individual S3 object would cost $0.33 / GB for PUTs (plus the normal $0.10 / GB for bandwidth).
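To make the arithmetic above concrete, here is a rough sketch in Python. The block size and PUT price are the figures quoted in this thread; the function name and the 4 MB comparison are only for illustration, and the 4 MB case assumes every aggregated object is completely full, which is a best case.

```python
# Back-of-the-envelope PUT cost, using the figures quoted in this thread.
AVG_BLOCK_BYTES = 30_000      # ~30 kB average compressed block
PUT_PRICE_PER_1000 = 0.01     # USD per thousand S3 PUTs, as quoted above
GB = 1_000_000_000


def put_cost_per_gb(object_bytes):
    """USD spent on PUT requests to upload 1 GB in objects of the given size."""
    puts = GB / object_bytes
    return puts / 1000 * PUT_PRICE_PER_1000


# One S3 object per block: about 33 thousand PUTs per GB.
per_block = put_cost_per_gb(AVG_BLOCK_BYTES)

# Blocks packed into full 4 MB objects (best case): 250 PUTs per GB.
batched = put_cost_per_gb(4_000_000)

print(f"one object per block: ${per_block:.4f} / GB")
print(f"4 MB aggregates:      ${batched:.4f} / GB")
```

So the per-request charge goes from roughly three times the bandwidth charge to a fraction of a cent per GB.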
There are several advantages and disadvantages to different block sizes; but most significantly, larger blocks would make tarsnap less efficient at identifying duplicate data in the (very common) case where part of a file is modified. In the end it came down to weighing all the factors and picking a value which worked well.
I aggregate blobs together (up to 4 MB at once) and store them to S3 as a single S3 object. The tarsnap server keeps track of which part of the object corresponds to which blob.
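A minimal sketch of what that kind of aggregation could look like, in Python. This is not the tarsnap server's code: the `Aggregator` class, the `object-N` key names, and the in-memory `objects` dict standing in for S3 are all made up for illustration. The only details taken from the description above are the 4 MB cap and the idea of remembering which part of an object holds which blob.

```python
from dataclasses import dataclass, field

MAX_OBJECT_BYTES = 4 * 1024 * 1024  # "up to 4 MB at once"


@dataclass
class Aggregator:
    """Packs small blobs into larger objects and remembers where each one lives."""
    buffer: bytearray = field(default_factory=bytearray)
    pending: list = field(default_factory=list)  # (blob_id, offset, length) in the open object
    index: dict = field(default_factory=dict)    # blob_id -> (object_key, offset, length)
    objects: dict = field(default_factory=dict)  # hypothetical stand-in for S3
    next_key: int = 0

    def add(self, blob_id, data):
        # Close the current object first if this blob would push it past the cap.
        if self.buffer and len(self.buffer) + len(data) > MAX_OBJECT_BYTES:
            self.flush()
        self.pending.append((blob_id, len(self.buffer), len(data)))
        self.buffer += data

    def flush(self):
        if not self.buffer:
            return
        key = f"object-{self.next_key}"
        self.next_key += 1
        self.objects[key] = bytes(self.buffer)  # a single PUT in a real system
        for blob_id, offset, length in self.pending:
            self.index[blob_id] = (key, offset, length)
        self.buffer = bytearray()
        self.pending = []

    def get(self, blob_id):
        # Look up which object holds the blob, then slice it back out.
        key, offset, length = self.index[blob_id]
        return self.objects[key][offset:offset + length]
```

The `index` dict is the mapping of blobs to S3 objects that the earlier comments are worried about; in this sketch it could be rebuilt only if each stored object also recorded its own blob IDs and lengths, which the sketch leaves out.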
Very interesting. Do you have an estimate of how tarsnap compares to rsync in terms of bandwidth in the typical case that only a few files have been modified?
Tarsnap is more efficient than rsync in that case, because rsync has a significant index overhead (sending a list of files, and sending a list of blocks for each file) while the tarsnap client works locally to identify new data and only uploads the new bits.
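For anyone wondering what "works locally to identify new data" can mean in practice, here is a simplified Python sketch of hash-based deduplication. It is an illustration of the general idea only, not tarsnap's actual algorithm: the function name, the choice of SHA-256, and the fixed 64 kB block boundaries are all assumptions made for this example. Fixed boundaries in particular only cope with in-place modifications, not insertions that shift the rest of the file.

```python
import hashlib


def new_blocks(data, known_hashes, block_size=64 * 1024):
    """Return the blocks of `data` whose SHA-256 is not yet in `known_hashes`.

    `known_hashes` is a local set of digests for blocks already uploaded;
    it is updated in place, so no round trip to a server is needed.
    """
    to_upload = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)
            to_upload.append(block)
    return to_upload


known = set()
first = b"a" * 65536 + b"b" * 65536
second = b"a" * 65536 + b"c" * 65536  # only the second block changed

print(len(new_blocks(first, known)))   # both blocks are new
print(len(new_blocks(second, known)))  # only the changed block is new
```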
> You mention that it's more expensive than JungleDisk
No I don't. Tarsnap isn't more expensive than JungleDisk overall -- yes, the bandwidth and storage cost more, but tarsnap doesn't have per-request costs, a fixed monthly service charge, or an up-front cost for the software. For some people, tarsnap will be more expensive, certainly; but for many others tarsnap will be cheaper.
> you don't say why it's better
Not in this blog post, no -- this post was about how tarsnap uses Amazon Web Services. :-)