How to Automatically Backup a Linux VPS to a Separate Cloud Storage Service

joshstrange · on June 10, 2019

Forgive me if I'm missing something but this appears to just backup files so it would be fine for source code (should be in version control and safe already) and static assets (like user uploads) but doesn't appear to address things like DB backups which I feel like is the number 1 thing lost if you lose access to your host (followed by user uploads). The problem with DB backups is you can't just backup the data directory (like /var/lib/mysql) unless you've shutdown the DB or you can do a dump (mysqldump) but backing that up hourly is not a good solution IMHO. I guess you could have a replica that you shut down at the top of the hour, backup the data directory, then start back up but all of this is to say this post is not a silver bullet to "Automatically backup a Linux VPS".

This is NOT a knock against the author, I just wanted to point out that "backups" are much more complicated than "copy files elsewhere". For DB's I'd probably consider running a replica on 1 or more other clouds. IDK the logistics of replication over the internet but I know for work we do replication from our datacenter down to our local servers and that's over a relatively slow connection so I assume it's possible to do it from cloud-to-cloud.

jakejarvis · on June 10, 2019

Absolutely. Maybe I should have noted that this is more of a guide to make your existing backup procedures more redundant, which implies that you already have local "backups" being made of whatever you want to redundantly store in S3 or B2 or anywhere externally.

In that case, it does become as simple as just copying files elsewhere. (For example, using the Restic steps in my post to backup a folder of hourly database dumps, like you mentioned.) Replicating databases (and other methods made specifically for DBs) is certainly a much, much better route for mission-critical and/or enterprise data.

Covering every permutation of different types of data to backup would have made a long post much longer, but I'll add writing a part two to my to-do list covering rudimentary database backups since that has been brought up here a few times.

Thanks for the feedback! :)

joshstrange · on June 10, 2019

Awesome! I really hope I didn't come across as attacking you/your post, I found it really useful. I just wanted to remind people that the DB wouldn't really be covered by this (except in the case you mentioned where you are dumping the data).

I am definitely looking at this through the lens of where I work where a mysqldump (or equivalent) could take days to complete in full (DB is nearing 2TB in size now). For a number of projects a mysqldump might only take seconds or minutes and would be a perfect candidate for this backup scheme.

jakejarvis · on June 10, 2019

Not at all! I'm really glad you mentioned it, since I wrote this in the mindset of small to medium VPSes used for personal projects and I'll make that more clear in the intro. Backups (like everything else, unfortunately) definitely get exponentially more difficult the more successful you become.

SkyLinx · on June 11, 2019

The tool I mentioned, xtrabackup, is orders of magnitude faster than mysqldump for both backups and restores. Check it out.

apitman · on June 10, 2019

FWIW if your DB supports dumping to stdout restic can import that: https://restic.readthedocs.io/en/latest/040_backup.html#read...
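
For illustration, a minimal Python sketch of that stdin approach. It assumes `mysqldump` and `restic` are on the PATH and that the repository and password are already supplied through the environment (RESTIC_REPOSITORY, RESTIC_PASSWORD). The function names and the `db.sql` filename are made up for this example.

```python
import subprocess


def dump_command(database):
    # --single-transaction gives a consistent dump for InnoDB tables
    # without locking them for the duration of the dump.
    return ["mysqldump", "--single-transaction", database]


def restic_stdin_command(filename="db.sql"):
    # --stdin reads the backup data from standard input;
    # --stdin-filename sets the name it is stored under in the snapshot.
    return ["restic", "backup", "--stdin", "--stdin-filename", filename]


def backup_database(database):
    """Stream the dump into restic without writing it to disk first."""
    dump = subprocess.Popen(dump_command(database), stdout=subprocess.PIPE)
    restic = subprocess.run(restic_stdin_command(), stdin=dump.stdout)
    dump.stdout.close()
    # Treat the backup as failed if either side of the pipe failed.
    return dump.wait() == 0 and restic.returncode == 0
```

Checking both exit codes matters here: a dump that dies halfway still produces a snapshot, just a truncated one.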

tomcam · on June 10, 2019

Damn, son! Good catch! I have been trying to figure out how to deal with just this problem. Thank you.

tracker1 · on June 10, 2019

For most DBs the simplest path is a db dump, assuming you don't have more data than can be backed up in a reasonable time frame. From there, there's a number of file-systems and other integrations for S3, sftp or similar. You can simply copy out from there. The backup utility in question could be used to target your dump file(s) directory.

Replicas aren't actually a backup, though it's probably a good idea to have them, and in the case of too much data to reasonably backup, as close as you are going to come.

Also, depending on data structures, for example if your data models fit into something like MongoDB or not, it's easy enough to trigger a dump for each record. I've setup systems where primary records are json+gz files in S3 or Azure blobs. In practice this has been as part of the process that will update ElasticSearch (or Mongo) from an RDBMS authority.

It was pretty easy to do <base>/collection/{ID}.json.gz .. from there, worst case, I'd have to create a re-population script from the hard files, and might lose a little less important data, but would always have a hard recovery path. YMMV of course.
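
A rough Python sketch of that per-record layout, written against a local directory instead of S3 or Azure blobs to keep it self-contained. The helper names and the assumption that every record carries an `id` key are hypothetical.

```python
import gzip
import json
from pathlib import Path


def record_path(base, collection, record_id):
    # Mirrors the <base>/collection/{ID}.json.gz layout.
    return Path(base) / collection / f"{record_id}.json.gz"


def dump_record(base, collection, record):
    """Write one record as gzip-compressed JSON; 'id' is an assumed key."""
    path = record_path(base, collection, record["id"])
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(record, fh)
    return path


def load_collection(base, collection):
    """Re-read every dumped record, e.g. to feed a re-population script."""
    records = []
    for path in sorted((Path(base) / collection).glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            records.append(json.load(fh))
    return records
```

One file per record means a single changed record costs one small upload, and the recovery path is just a loop over files.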

I've got 3 side projects that I've been muddling on getting started and some of the things that have happened to the original author frankly scare me. I'm also working on legal structure (LLCs etc) before any real launch. There are so many edge cases to consider, and sometimes it's hard just getting started and accepting that you will make mistakes along the way.

fabian2k · on June 10, 2019

A replica isn't necessarily a backup, if the problem isn't a dead server but a missing WHERE clause in your DELETE query, a replica probably won't save you.

For Postgres you can use something like WAL-E to continuously backup to S3 or other cloud storage providers. The underlying mechanism is explained in the Postgres documentation as "Continuous Archiving and Point-in-Time Recovery (PITR)"(https://www.postgresql.org/docs/11/continuous-archiving.html). Using this method you lose a minimal amount of live data when your main server goes down. And if you need to restore, you can also restore at an arbitrary point of time, so e.g. just before you accidentally deleted everything in a table or something like that.

joshstrange · on June 10, 2019

100% agree, I was talking specifically about protecting against your hosting provider shutting down your account. And a replica coupled with hourly (or whenever) shutdowns, backup data directory, and restart will protect you against a missing WHERE clause on a DELETE.

SkyLinx · on June 11, 2019

For MySQL you can use Xtrabackup to take online - non blocking - incremental backups to the filesystem before backing up everything off site with Restic or other.

turrini · on June 10, 2019

Vultr AND Linode.

1) Upload a custom ISO with ZFS (https://github.com/beren12/zfs-iso/)

2) Create a new VPS without OS and boot to your uploaded ISO.

3) Create a ZFS root pool and bootstrap your Debian or another distribution.

4) Enable all cool features: compression, encryption, etc.

5) rsync your zfs snapshots from Vultr to Linode and vice-versa.

This is how I do it. You can even use them as templates for new VPSes.

And for backups, BackBlaze B2 and WASABI with a zfs-snapshot-upload script.

SkyLinx · on June 11, 2019

Isn't a backup of the whole os via snapshot overkill? I can bring up one or more completely configured new servers in a few minutes with Ansible (plus another few with Rancher for Kubernetes). I don't see the point in backing up anything other than the actual data.

fabian2k · on June 10, 2019

I've used rclone for a very similar purpose. Restic, which is used in this post looks very interesting as well.

It's not the topic of the post, but database backups deserve a special mention. You can't just naively copy the database folder this way in most cases, you have to make sure to backup a consistent snapshot of the database. This is still not hard to do at smaller scales, when you can just add an exported dump of the database to your regular backup. But it is a point that needs some attention if you host the database yourself.

tluyben2 · on June 10, 2019

I have many servers with different (versions) of Linux distros on them and I found Duplicity | Restic very annoying to install. Vague (for me as non-Python expert) error messages and options randomly not working as a result. Rclone was absolutely painless to install everywhere.

KAMSPioneer · on June 10, 2019

Were you getting Python errors from Restic? Not terribly familiar with Duplicity, but Restic is written in Go (Github is https://github.com/restic/restic).

tluyben2 · on June 10, 2019

I tried 4 different similar packages; maybe I remember that detail wrong, but I could not get Restic working on older machines for some reason. Rclone was so simple and it just worked, so I did not investigate further. Is Restic much better?

witten · on June 10, 2019

You might be thinking of Borg Backup, which is written in Python: https://borgbackup.readthedocs.io/

apitman · on June 10, 2019

Yeah for me the hardest part about installing restic was remembering how to unzip a bz2 file. I don't know what more you can ask for than a statically linked binary...

tluyben2 · on June 10, 2019

Ok, cannot have been Restic then; thanks, I am going to try that one now. And find out what the other Python one was; I was following an SO recommendation.

apitman · on June 10, 2019

I use restic (for dumping an encrypted deduped backup to a usb drive) and rclone (for pushing to backblaze B2). Both fantastic tools.

ngcc_hk · on June 10, 2019

Very good point. (For source code sync with git is a must.)

And do rehearsal as well. Backup may not work.

fabian2k · on June 10, 2019

The tricky thing with the naive copy as a database backup is that it actually could work if you test it while the database isn't writing at the moment. For example when you only tested this outside production on a test server without load.

But yes, you do have to test and verify that your backup works. It might be configured entirely wrong, the cron job might not be running for some reason, you set it up with encryption years ago but lost the passphrase. There are plenty of ways this can potentially fail.
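
One way to make that verification concrete is to restore into a scratch directory and compare checksums against the live data. The sketch below is an assumption about how one might do that in Python, not a feature of any tool named in this thread. It reads each file fully into memory, so it only suits modest file sizes.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digests[str(path.relative_to(root))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return digests


def compare_restore(source, restored):
    """Return (missing, changed) file lists; both empty means the restore matches."""
    src, dst = tree_digests(source), tree_digests(restored)
    missing = sorted(set(src) - set(dst))
    changed = sorted(name for name in src if name in dst and src[name] != dst[name])
    return missing, changed
```

Files that changed on the live system after the backup ran will show up as "changed", so this is most useful right after a backup or against a quiet data set.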

marceloneil · on June 10, 2019

restic actually supports rclone!

krn · on June 10, 2019

Are there any reasons to prefer Restic over BorgBackup[1]?

A conclusion from one comparison (2017)[2]:

"Restic’s memory requirements makes it unsuitable for backing up a small VPS with limited RAM, and the slow backup verification process makes it impractical on larger servers. But if you are backing up desktop or laptop computers then this may not matter so much, and using Restic means that you don’t have to setup your own storage server."

Is this still true?

[1] https://www.borgbackup.org/

[2] https://stickleback.dk/borg-or-restic/

raimue · on June 10, 2019

For remote backups, BorgBackup always needs to run a server process (usually over SSH). Restic works with a "dumb" storage that only provides get/put/list/delete operations. Therefore restic is way easier to set up with built in support for S3, B2, GCS, and similar services that only offer an API but not shell access.

witten · on June 10, 2019

That's true, although there are now a handful of BorgBackup remote storage vendors (rsync.net, BorgBase, etc.) that you can pay to run the server-side hosting for you. Probably not nearly as cheap as, say, S3.. but it does get closer to "just point your client here and hit go". And they offer additional sauce on top that you'd have to roll yourself with S3.. Backup activity monitoring, etc.

m3nu · on June 11, 2019

Thanks for the mention. BorgBase.com author here.

We're not as cheap as S3 Deep Glacier, but cheaper than standard storage and the same price as B2 and Wasabi, if you get the large plan. So not that much difference to "dumber" storage.

Storage is either RAID6 or Ceph.

Improvotter · on June 10, 2019

If you can use something like samba or any other way of attaching a remote folder, Borg will work without SSH access. So you can also use Borg if you for example mount a Google Drive folder and use that as your repository. Correct me if I'm wrong.

bloopernova · on June 10, 2019

Borgbackup is a fantastic piece of software. I've used it to backup so many different things over SSH, and it's always worked perfectly.

I'm still convinced that its dedupe is magical. I don't know if there's a backup app that is more frugal with disk space, but Borgbackup has served me well in a non-growing 1.5TB backup area for 3+ years now.

m3nu · on June 11, 2019

They split larger files into segments and only back up new segments. This avoids a) uploading files it has seen before and b) re-uploading large files if only part of it has changed.
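
A toy Python sketch of the idea, using fixed-size chunks and a dict as the "repository". This is a simplification for illustration only: real tools generally choose chunk boundaries from the content, and the 4-byte chunk size is just to make the example readable.

```python
import hashlib


def split_chunks(data, size=4):
    # Fixed-size chunks keep the example short; real tools pick boundaries
    # from the content so that an insertion does not shift every later chunk.
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(data, store, size=4):
    """Store only chunks not seen before; return (chunk id list, new chunk count)."""
    ids, new = [], 0
    for chunk in split_chunks(data, size):
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in store:
            store[chunk_id] = chunk
            new += 1
        ids.append(chunk_id)
    return ids, new


def restore(ids, store):
    """Rebuild the original bytes from an ordered list of chunk ids."""
    return b"".join(store[chunk_id] for chunk_id in ids)
```

Backing up `b"aaaabbbbcccc"` and then `b"aaaaXXXXcccc"` into the same store adds three chunks the first time and only one the second time, yet both versions remain restorable.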

Neil44 · on June 10, 2019

I use Duplicity in a similar way to back my Linode stuff up to Backblaze. It does versioning really well and it's been very reliable. I'd still have to configure up a new server somewhere etc but at least I have the data. http://duplicity.nongnu.org/

raimue · on June 10, 2019

I used duplicity in the past, but the main problem with its incremental backups is that in order to be able to prune the backup history, you need to do full backups regularly to start a new backup chain. That means transferring a full copy of the data.

I switched to restic now, which allows incremental backups, but can also remove any snapshot to prune the history. Although it does not support compression, due to its deduplication and removing the need to store multiple full backups, the restic repository takes less space now than duplicity before.
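
The chain constraint can be sketched in a few lines of Python. This is a simplified model, not duplicity's actual logic: backups are a list of "full" or "incr" entries, oldest first, the first entry is assumed to be a full backup, and `keep_last` is assumed not to exceed the number of backups.

```python
def required_for(kinds, index):
    """Indices needed to restore backup `index` in a full/incremental chain."""
    start = index
    # Walk back to the full backup that this chain started from.
    while kinds[start] != "full":
        start -= 1
    return list(range(start, index + 1))


def prunable(kinds, keep_last):
    """Indices that can be deleted while the newest `keep_last` stay restorable."""
    needed = set()
    for index in range(len(kinds) - keep_last, len(kinds)):
        needed.update(required_for(kinds, index))
    return [i for i in range(len(kinds)) if i not in needed]
```

With `["full", "incr", "incr", "incr", "full", "incr"]`, keeping the last backup frees the whole first chain, but keeping the last three frees nothing, because the fourth backup still depends on everything before it.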

OJFord · on June 10, 2019

That's a really good point I hadn't considered at all; I'm glad you mentioned it! I was looking at a benchmark (that I linked in another comment) that makes duplicity look slow, but so much more economical on storage space - i.e. cheaper.

But as you point out, if you don't need a long history, incremental eventually gets more expensive. Unless you could squash older than X, I suppose, but presumably that's so expensive to run (encryption & compression) that it's not supported.

petre · on June 11, 2019

We still use Duplicity because of the ability to rsync the files to other hosts. A repo is sort of weird as it's not a date tagged single file or volume directory.

narag · on June 10, 2019

Couldn't you backup the backup instead? More space, but transfer in a local network would be faster.

SkyLinx · on June 11, 2019

I have never liked duplicity, too slow and requires full backups regularly. And did I mention that it is slow?

PStamatiou · on June 10, 2019

Related - I've been thinking about how to best backup my S3 buckets (some with 50k+ files) off of Amazon. Sure I can setup another bucket with that cross region duplication feature, and I have versioning.. but would really prefer a backup off of Amazon (ie not sending manually created zips in a lightsail/ec2 or something to glacier) in case it ever gets hacked or I accidentally nuke the buckets or something like that.

Currently just doing a combination of s3cmd for a local archive (takes forever to download and then it doesn't seem like incremental syncs are any faster), as well as having Google Console clone my bucket there (but I'm not sure if it's versioned, or as easy as downloading the whole archive).

Never used duplicity -- would it be fast for something like this? Guessing I should just cron it on a remote server instead of running off a local machine frequently.

padelt · on June 10, 2019

Have you had a look at rclone? Pretty sure you can copy or even sync files from one remote storage to another. E.g. copy from S3 to B2. https://rclone.org/commands/rclone_sync/

apitman · on June 10, 2019

+1 for rclone here. It can indeed copy between remote backends. Just keep in mind that that data all has to flow through the rclone process. You could probably get much better performance by running rclone itself on an ec2 instance. Just keep an eye on your throughput usage.

PStamatiou · on June 10, 2019

Thanks, I haven't. Will take a look

tickthokk · on June 10, 2019

Thanks for sharing! While the victims were being scorned by the internet for not having proper backups, nobody was sharing how to achieve that.

dymk · on June 10, 2019

Really? A blog post about how to use a glorified `rsync` was needed to instruct people building services for Fortune 500 companies to back their user data up?

apitman · on June 10, 2019

To be fair there are a dizzying array[0] of OSS backup solutions, and it's very much not apparent which features are most important. A simple post like this that outlines a single good enough solution with a modern tool is valuable IMO.

EDIT: Oh and restic has much more functionality than rsync, including deduplication and encryption. rclone is more of a "glorified rsync", but even then its array of backends makes it truly glorious.

[0] https://wiki.archlinux.org/index.php/Synchronization_and_bac...

z3t4 · on June 10, 2019

Don't forget about practicing restoration (catastrophe scenarios). So that you will know how long time it will take to restore, and if something is missing. Last time I did it I did not remember the password for the encryption key. Sure I had it written down on a piece of paper, but the scenario was that the building had burnt down.

SkyLinx · on June 11, 2019

Good point on testing the backups.

smnrchrds · on June 10, 2019

In a dockerized single-VPS environment, where should cronjobs live? Should they be part of the main Docker container that had the app code, or a separate container that only has all cronjobs, or simply on the host?

jakejarvis · on June 10, 2019

Good question. I have the same setup on one server hosting GitLab, Pi-Hole, Plex, etc., and I have Restic (and its cronjob) installed on the host and only backup the files that I mount to each Docker container, which are all stored in /srv/docker.

In theory, you need to be ready to literally delete every container at any time and pull them from scratch and be 100% fine, since all of your actual data should be stored on the host and mounted as Docker volumes [0]. It's a good Doomsday test if you're looking for one. ;)

[0] https://docs.docker.com/storage/

tracker1 · on June 10, 2019

As mentioned in another post... could manage the backups from another server (not in the network) with a cron job that grabs a snapshot from the docker server's shared volume directory and forwards it to its final destination. Could be done on a really small instance, and this way your backup information and account details aren't on your production server itself.

OJFord · on June 10, 2019

How are you orchestrating them?

With kubernetes (and no more specifics than you've mentioned) you should use a Job.

With docker-compose I think I'd be tempted to have a different service that isn't long-running, and a cron job on the host that runs it.

With swarm, unless it supports something like k8s Jobs, (I don't know if it does or not, only used it once briefly and in anger) I'd probably have a 'cronjob' service which was responsible for launching the short-lived services per compose suggestion above.

witten · on June 10, 2019

I don't know about "should", but one way to do it is to put both the backup script and the cron job to run it into a single, separate, backup-only container. Then tell that container (via volume mounts, etc.) what volumes to backup from other containers. Example container (non-Restic) that does this: https://hub.docker.com/r/b3vis/borgmatic/

kijin · on June 10, 2019

Meh, just another backup solution that requires AWS keys, ssh keys, etc. to be kept on the same server where your data is. What if that server is compromised? The attacker now has all the keys he needs to delete or modify your backups, too.

For maximum peace of mind, always pull backups from a separate server that is not exposed to the world. Don't let your primary server push arbitrary data to the backup store.

This rule is trickier to follow when your backup store can't run scripts, which is why so many tools designed to work with S3 tell you to keep the keys exposed. But if you really want to, you can use an intermediate host to pull backups before pushing them again to S3.

longwave · on June 10, 2019

Borg has an append-only mode [1] that prevents clients from overwriting data.

[1] https://borgbackup.readthedocs.io/en/stable/usage/notes.html...

slig · on June 10, 2019

Can't you set up keys that are only allowed to do GET/PUT?

I know that tarsnap [1] can work with `list`, `write` and `delete` keys.

[1]: http://www.tarsnap.com/man-tarsnap-keymgmt.1.html

akerl_ · on June 10, 2019

As the other commenter noted, this is why you’d give the backup keys the ability to upload data but not delete data.

Combine that w/ S3 Object Versioning and you’ve got a pretty solid approach.
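
As a sketch of what such a key could look like, here is a Python function that builds an IAM policy document granting put, get and list but no delete actions. The bucket name is a placeholder, and whether a particular backup tool runs happily without delete permission is an assumption to test first, since some tools remove their own lock files.

```python
def upload_only_policy(bucket):
    """Build an IAM policy document that allows writes and reads but no deletes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access: upload and download only.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # Bucket-level access: listing keys.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }
```

PutObject still lets a compromised server overwrite existing keys, which is where versioning comes in: the overwritten data stays available as an older version that this key cannot remove.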

Bender · on June 10, 2019

My own personal preference is to simply make VM's on each VPS that has some storage space, then enable chroot sftp and rsnapshot. Then on the client side, I used LFTP (sftp mirror sub-system) which is compatible with chroot sftp and behaves like rsync.

Each VPS backs up to the other. RSnapshot makes daily diffs that use hardlinks to avoid taking up space. This also mitigates tampering, as only root has access to the snapshots.

Demo site using anon login for testing: [1]

[1] - https://tinyvpn.org/sftp/#lftp

cure · on June 10, 2019

+1 for restic. I use it, and it's awesome.

rsync · on June 10, 2019

(I hope) You'll be happy to learn that restic works perfectly with rsync.net:

https://www.rsync.net/products/restic.html

One of the modes of restic is SFTP target and as we run stock, standard OpenSSH, it works perfectly.

EDIT: A sibling comment to yours mentioned 'rclone' and I am happy to informally announce that over the past few months we have rolled out the 'rclone' binary to all of our production fileservers (it requires a server-side binary exe to be in place) and it is being used by rsync.net customers to broker file transfers cloud to cloud to cloud (as rclone is apt to be used for). 'rclone serve' and 'rclone mount' are disallowed for (I think) obvious reasons, but otherwise everything works ...

nickcw · on June 10, 2019

Nice one - I'd love to hear more about this (rclone author!).

rsync · on June 10, 2019

Please email info@rsync.net so we may chat a bit ... I'm really excited to have this functionality in place.

heinrichhartman · on June 10, 2019

Does anyone here have experience with backing up ZFS pools in cloud storage like S3, B2, ...?

I have a bunch of snapshots (https://github.com/jakelee8/zfs-auto-snapshot) that I want to backup along with the active tree. But don't want to keep extra copies of the data.

- Do these services offer snapshotting? ...that can be automated?

- Is there zfs integration, e.g `zpool send | b2 receive`?

conception · on June 10, 2019

https://www.rsync.net/ is the only one I know of.

rys · on June 10, 2019

rsync.net will natively accept ZFS sends

Blackstone4 · on June 10, 2019

One idea I had was to create a service with preconfigured images setup for personal use with VPN, email server and file sync/backup. It could be sold to privacy conscious individuals and could compete with ProtonMail.

The technical side could be hidden from less technical users and it sold as isolated servers so the data would be protected.

I don’t have the skills or the time to work on this so happy for others to use the idea

monkeydust · on June 10, 2019

I am looking for something that can backup Dropbox, Google Drive and Amazon Cloud to a 3rd party service. What do people recommend?

tracker1 · on June 10, 2019

In addition to rclone... you could run a VPS that has the various services as fuse mounts and otherwise sync between them in CRON ... however, probably best to only actually use one of them in practical terms.

It depends on what you want to do, but can probably be accomplished in a <= $5 VPS.

ac29 · on June 10, 2019

rclone

monkeydust · on June 11, 2019

Thanks but appears Amazon stopped issuing api keys for Amazon Drive so stuck, at least for a fully automated solution.

a2tech · on June 10, 2019

Does anyone have a recommendation for a backup client that handles millions of tiny files? I'm using rsnapshot right now, which works but backing up to an NFS share is incredibly slow (most of the time is spent in iterating over the filesystem to get a list of changed files, then running the hardlink process from the previous snapshot).

rsync · on June 10, 2019

You're going to have to walk all those inodes no matter what you do. rsync is as good as anything at that task.

A better way would be to unmount and send the filesystem with 'dd' or something like that, or, to use 'zfs send' but I have a suspicion that neither of those options are available to you ...

I will say that splitting the rsync job (rsnapshot runs rsync underneath) into multiple, smaller jobs, could save you some time if you're running into any resource limits while you walk that big set of inodes... so if you're lucky and you have 4 or 5 or 8 top level dirs that are all roughly the same size, you could do a handful of smaller jobs, one after the other, instead of one huge one ...
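
A small Python sketch of that splitting idea, assuming plain `rsync -a` and one job per top-level directory. The function names are hypothetical, and files sitting directly in the source root are deliberately ignored, so they would need a separate job.

```python
import subprocess
from pathlib import Path


def rsync_jobs(source_root, destination):
    """Build one rsync command per top-level directory under source_root."""
    jobs = []
    for entry in sorted(Path(source_root).iterdir()):
        if entry.is_dir():
            # Trailing slashes make rsync copy the directory's contents
            # into the matching directory on the destination.
            jobs.append(["rsync", "-a", f"{entry}/", f"{destination}/{entry.name}/"])
    return jobs


def run_jobs(jobs):
    # One after the other, as suggested, rather than in parallel.
    for job in jobs:
        subprocess.run(job, check=True)
```

Usage would be something like `run_jobs(rsync_jobs("/srv/data", "backup:/srv/copy"))`, where both paths are placeholders.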

module0000 · on June 10, 2019

>> A better way would be to unmount and send the filesystem with 'dd' or something like that

To add to that.. to avoid having to unmount your filesystem; use LVM. Then you can call `sync`, snapshot your main volume, and `dd` the clean snapshot. Once you're done, remove the snapshot. This strategy avoids downtime while backing up your volume.
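
Spelled out as a Python sketch, the sequence might look like this. The volume group, volume name, snapshot name and 1G snapshot size are all placeholders, and it has to run as root. The snapshot size is the copy-on-write space, so it needs to be large enough to absorb writes made while `dd` is running.

```python
import subprocess


def snapshot_backup_steps(vg, lv, image_path, snap_size="1G"):
    """Return the commands for sync, snapshot, copy and cleanup, in order."""
    snap = f"{lv}-backup-snap"  # hypothetical snapshot name
    return [
        ["sync"],
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap, f"/dev/{vg}/{lv}"],
        ["dd", f"if=/dev/{vg}/{snap}", f"of={image_path}", "bs=4M"],
        ["lvremove", "--force", f"/dev/{vg}/{snap}"],
    ]


def run_snapshot_backup(vg, lv, image_path):
    sync, create, copy, remove = snapshot_backup_steps(vg, lv, image_path)
    subprocess.run(sync, check=True)
    subprocess.run(create, check=True)
    try:
        subprocess.run(copy, check=True)
    finally:
        # Remove the snapshot even if the copy failed.
        subprocess.run(remove, check=True)
```

The `try`/`finally` is there because a forgotten snapshot keeps consuming copy-on-write space on every write to the origin volume.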

dsign · on June 10, 2019

Over the years, I've gotten burned so many times with Linux network filesystems (including SMB/CIFS, which is not all Linux but still) that I would start by recommending not to touch a network filesystem in the same host where anything important happens.

The issue is with how network failures (which always eventually happen) interact with the "uninterruptible" Linux process state. Hell breaks loose, and the failure is anything but obvious.

kijin · on June 10, 2019

Run backups to and from the hosts where the files actually reside. Running rsync over ssh is probably many times faster than running it over NFS.

realusername · on June 10, 2019

I use duplicity for that personally, I backup my whole home with it (and there are so many small files like this) and it works very well.

rebelpixel · on June 10, 2019

I'm currently using restic and used duplicity before for a few years, sending backups to backblaze b2.

What I don't like about duplicity is how it spews weird error messages that I gathered were related to Python versions. It was easier to just start over with restic and haven't had any problems since.

pnutjam · on June 10, 2019

Good directions, everyone should do this.

themodelplumber · on June 10, 2019

I actually like using CPanel's built-in backup settings on servers where I have CPanel installed. Amazingly simple to set up, really intuitive, and supports a variety of services. I have used Amazon and SFTP backups so far and they both work really well.

electriclove · on June 10, 2019

Or use a paid service that handles files and databases for ~$30/year like https://www.dropmysite.com/

(I'm just a customer that has been generally pleased over the past many years)

SkyLinx · on June 11, 2019

I'm surprised that no one has mentioned Duplicacy yet. It's another very solid, reliable and fast alternative. At the moment I use Restic on servers but use Duplicacy on the desktop. It can also be used on servers of course.

OJFord · on June 10, 2019

Restic looks neat. I've been looking at using duplicity [0] for similar purposes recently, which does a similar job.

Just found a good comparison/benchmark of the two at [1] - tl;dr seems to be that restic is fast, and duplicity is small.

[0] - http://duplicity.nongnu.org

[1] - https://github.com/gilbertchen/benchmarking

SkyLinx · on June 11, 2019

Duplicity is not particularly good imo. Like I said in another comment it is much slower than other options and requires full backups regularly, which is a problem with lots of data.

ausjke · on June 10, 2019

trying to backup linode image using dd, which works but not easy, I hope VPS vendors can provide a way for customer to migrate when the time comes.

zerkten · on June 10, 2019

Doesn't Linode have its own backup which lets you restore onto another Linode VPS? It'd be nice to use provider-agnostic tools, but this seems like the most pragmatic option. I'm guessing other VPS providers offer something similar.

ausjke · on June 10, 2019

it can't deal with the case when linode locks up your account, similar to what DO did to the original post, you can't put all eggs in the same basket to be safe?

tracker1 · on June 10, 2019

I don't consider something backed up unless there are at least 3 copies of something in at least 2 locations. Also, replicated dbs are not backups, but may be the best available option when you have too much data to reliably backup en masse.
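
That rule of thumb is simple enough to write down as a check. A minimal Python sketch, where the `(description, location)` pair format is just an assumption for the example:

```python
def is_backed_up(copies, min_copies=3, min_locations=2):
    """copies: list of (description, location) pairs, one per stored copy."""
    locations = {location for _, location in copies}
    return len(copies) >= min_copies and len(locations) >= min_locations
```

Three copies that all live with the same provider fail the check, which is exactly the account-lockout case this thread started from.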

I've done replication on write scenarios. Generally when I've setup mongo or elasticsearch for searching/read performance, I'll also push to a fixed JSON+GZ on S3 or similar. It tends to work pretty well as a fallback for larger data scenarios. Had to use it once, and was so glad to have it.

apitman · on June 10, 2019

TLDR for comments: use rclone and/or restic