Forgive me if I'm missing something, but this appears to just back up files. That's fine for source code (which should be in version control and safe already) and static assets (like user uploads), but it doesn't appear to address things like DB backups, which I feel are the number 1 thing lost if you lose access to your host (followed by user uploads). The problem with DB backups is that you can't just back up the data directory (like /var/lib/mysql) unless you've shut down the DB. You can do a dump (mysqldump), but backing that up hourly is not a good solution IMHO. I guess you could have a replica that you shut down at the top of the hour, back up the data directory, then start back up, but all of this is to say this post is not a silver bullet to "Automatically backup a Linux VPS".
This is NOT a knock against the author, I just wanted to point out that "backups" are much more complicated than "copy files elsewhere". For DB's I'd probably consider running a replica on 1 or more other clouds. IDK the logistics of replication over the internet but I know for work we do replication from our datacenter down to our local servers and that's over a relatively slow connection so I assume it's possible to do it from cloud-to-cloud.
Absolutely. Maybe I should have noted that this is more of a guide to make your existing backup procedures more redundant, which implies that you already have local "backups" being made of whatever you want to redundantly store in S3 or B2 or anywhere externally.
In that case, it does become as simple as just copying files elsewhere. (For example, using the Restic steps in my post to backup a folder of hourly database dumps, like you mentioned.) Replicating databases (and other methods made specifically for DBs) is certainly a much, much better route for mission-critical and/or enterprise data.
Covering every permutation of different types of data to backup would have made a long post much longer, but I'll add writing a part two to my to-do list covering rudimentary database backups since that has been brought up here a few times.
Awesome! I really hope I didn't come across as attacking you/your post, I found it really useful. I just wanted to remind people that the DB wouldn't really be covered by this (except in the case you mentioned where you are dumping the data).
I am definitely looking at this through the lens of where I work, where a mysqldump (or equivalent) could take days to complete in full (DB is nearing 2TB in size now). For a number of projects a mysqldump might only take seconds or minutes and would be a perfect candidate for this backup scheme.
Not at all! I'm really glad you mentioned it, since I wrote this in the mindset of small to medium VPSes used for personal projects and I'll make that more clear in the intro. Backups (like everything else, unfortunately) definitely get exponentially more difficult the more successful you become.
For most DBs the simplest path is a db dump, assuming you don't have more data than can be backed up in a reasonable time frame. From there, there's a number of file-systems and other integrations for S3, sftp or similar. You can simply copy out from there. The backup utility in question could be used to target your dump file(s) directory.
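A minimal sketch of the "dump to a directory, then point the backup tool at it" approach, assuming MySQL and a hypothetical database name and dump directory. `--single-transaction` only gives a consistent snapshot for transactional (InnoDB) tables, so treat this as a starting point rather than a complete solution.

```python
import datetime
import gzip
import pathlib
import subprocess


def dump_command(database: str) -> list[str]:
    """Build a mysqldump invocation for one database.

    --single-transaction takes a consistent snapshot of InnoDB tables
    without locking them; --quick streams rows instead of buffering them.
    """
    return ["mysqldump", "--single-transaction", "--quick", database]


def dump_path(dump_dir: pathlib.Path, database: str,
              now: datetime.datetime) -> pathlib.Path:
    """Timestamped file name so hourly dumps never overwrite each other."""
    return dump_dir / f"{database}-{now:%Y%m%dT%H%M%S}.sql.gz"


def run_dump(database: str, dump_dir: pathlib.Path) -> pathlib.Path:
    """Stream mysqldump output into a gzip file and return its path."""
    dump_dir.mkdir(parents=True, exist_ok=True)
    target = dump_path(dump_dir, database, datetime.datetime.now())
    proc = subprocess.Popen(dump_command(database), stdout=subprocess.PIPE)
    with gzip.open(target, "wb") as out:
        # Copy in 1 MiB pieces so a large dump never sits in memory.
        for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
            out.write(chunk)
    if proc.wait() != 0:
        target.unlink()  # do not leave a truncated dump lying around
        raise RuntimeError(f"mysqldump exited with {proc.returncode}")
    return target
```

Calling `run_dump("appdb", pathlib.Path("/var/backups/db"))` from cron (both names are made up here) leaves a directory of dumps that any file-level backup tool can pick up.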
Replicas aren't actually a backup, though it's probably a good idea to have them, and when there is too much data to reasonably back up, they are as close as you are going to come.
Also, depending on data structures, for example if your data models fit into something like MongoDB or not, it's easy enough to trigger a dump for each record. I've setup systems where primary records are json+gz files in S3 or Azure blobs. In practice this has been as part of the process that will update ElasticSearch (or Mongo) from an RDBMS authority.
It was pretty easy to do <base>/collection/{ID}.json.gz .. from there, worst case, I'd have to create a re-population script from the hard files, and might lose a little less important data, but would always have a hard recovery path. YMMV of course.
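A rough sketch of that per-record layout, with a local directory standing in for the S3 bucket or Azure container. The function names and the shape of the records are assumptions for illustration only.

```python
import gzip
import json
import pathlib


def record_path(base: pathlib.Path, collection: str, record_id: str) -> pathlib.Path:
    """Mirror the <base>/collection/{ID}.json.gz layout."""
    return base / collection / f"{record_id}.json.gz"


def write_record(base: pathlib.Path, collection: str, record_id: str,
                 record: dict) -> pathlib.Path:
    """Write one record as gzip-compressed JSON and return its path."""
    path = record_path(base, collection, record_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(record, fh, sort_keys=True)
    return path


def read_collection(base: pathlib.Path, collection: str):
    """Yield every stored record, e.g. for a re-population script."""
    for path in sorted((base / collection).glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            yield json.load(fh)
```

The re-population script then amounts to looping over `read_collection(...)` and inserting each record back into the primary store.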
I've got 3 side projects that I've been muddling on getting started and some of the things that have happened to the original author frankly scare me. I'm also working on legal structure (LLCs etc) before any real launch. There are so many edge cases to consider, and sometimes it's hard just getting started and accepting that you will make mistakes along the way.
A replica isn't necessarily a backup: if the problem isn't a dead server but a missing WHERE clause in your DELETE query, a replica probably won't save you.
For Postgres you can use something like WAL-E to continuously backup to S3 or other cloud storage providers. The underlying mechanism is explained in the Postgres documentation as "Continuous Archiving and Point-in-Time Recovery (PITR)"(https://www.postgresql.org/docs/11/continuous-archiving.html). Using this method you lose a minimal amount of live data when your main server goes down. And if you need to restore, you can also restore at an arbitrary point of time, so e.g. just before you accidentally deleted everything in a table or something like that.
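To illustrate the underlying mechanism (this is not WAL-E itself): Postgres runs the configured `archive_command` once per finished WAL segment, substituting `%p` with the segment's path and `%f` with its file name, and treats a zero exit status as success. A hedged sketch of such a command, assuming a locally mounted archive directory and a hypothetical script name:

```python
import pathlib
import shutil


def archive_wal(src: pathlib.Path, archive_dir: pathlib.Path, name: str) -> int:
    """Copy one WAL segment; return 0 on success, non-zero otherwise."""
    dest = archive_dir / name
    if dest.exists():
        return 1  # never overwrite a segment that is already archived
    tmp = archive_dir / (name + ".tmp")
    shutil.copyfile(src, tmp)
    # Rename last, so a partial copy never appears under the final name.
    tmp.rename(dest)
    return 0


def main(argv: list[str]) -> int:
    """Expected argv: script name, %p, %f, archive directory."""
    _, wal_path, wal_name, archive_dir = argv
    return archive_wal(pathlib.Path(wal_path), pathlib.Path(archive_dir), wal_name)
```

With `sys.exit(main(sys.argv))` added at the bottom, it could be wired up as something like `archive_command = 'python3 /usr/local/bin/archive_wal.py %p %f /mnt/wal-archive'` (both paths are made up). Tools like WAL-E ship the segments to object storage instead of a local directory.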
100% agree, I was talking specifically about protecting against your hosting provider shutting down your account. And a replica coupled with hourly (or whenever) shutdowns, backup data directory, and restart will protect you against a missing WHERE clause on a DELETE.
For MySQL you can use Xtrabackup to take online - non blocking - incremental backups to the filesystem before backing up everything off site with Restic or other.
Isn't a backup of the whole os via snapshot overkill? I can bring up one or more completely configured new servers in a few minutes with Ansible (plus another few with Rancher for Kubernetes). I don't see the point in backing up anything other than the actual data.
I've used rclone for a very similar purpose. Restic, which is used in this post looks very interesting as well.
It's not the topic of the post, but database backups deserve a special mention. You can't just naively copy the database folder this way in most cases, you have to make sure to backup a consistent snapshot of the database. This is still not hard to do at smaller scales, when you can just add an exported dump of the database to your regular backup. But it is a point that needs some attention if you host the database yourself.
I have many servers with different (versions) of Linux distros on them and I found Duplicity | Restic very annoying to install. Vague (for me as non-Python expert) error messages and options randomly not working as a result. Rclone was absolutely painless to install everywhere.
Were you getting Python errors from Restic? Not terribly familiar with Duplicity, but Restic is written in Go (Github is https://github.com/restic/restic).
I tried 4 different similar packages; maybe I remember that detail wrong, but I could not get Restic working on older machines for some reason. Rclone was so simple and it just worked, so I did not investigate further. Is Restic much better?
Yeah for me the hardest part about installing restic was remembering how to unzip a bz2 file. I don't know what more you can ask for than a statically linked binary...
Ok, cannot have been Restic then; thanks, I am going to try that one now. And find out what the other Python one was; I was following an SO recommendation.
The tricky thing with the naive copy as a database backup is that it actually could work if you test it while the database isn't writing at the moment. For example when you only tested this outside production on a test server without load.
But yes, you do have to test and verify that your backup works. It might be configured entirely wrong, the cron job might not be running for some reason, or you set it up with encryption years ago but lost the passphrase. There are plenty of ways this can potentially fail.
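One cheap check for the "cron job silently stopped" failure mode is to alert when the newest file in the backup directory is older than expected. A small sketch, with hypothetical function names; it only proves that something was written recently and is no substitute for an actual test restore.

```python
import pathlib
import time
from typing import Optional


def newest_backup_age(backup_dir: pathlib.Path,
                      now: Optional[float] = None) -> Optional[float]:
    """Age in seconds of the most recently modified file, or None if empty."""
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    if not mtimes:
        return None
    current = time.time() if now is None else now
    return current - max(mtimes)


def is_stale(backup_dir: pathlib.Path, max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """True when there is no backup at all or the newest one is too old."""
    age = newest_backup_age(backup_dir, now)
    return age is None or age > max_age_seconds
```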
Are there any reasons to prefer Restic over BorgBackup[1]?
A conclusion from one comparison (2017)[2]:
"Restic’s memory requirements makes it unsuitable for backing up a small VPS with limited RAM, and the slow backup verification process makes it impractical on larger servers. But if you are backing up desktop or laptop computers then this may not matter so much, and using Restic means that you don’t have to setup your own storage server."
For remote backups, BorgBackup always needs to run a server process (usually over SSH). Restic works with a "dumb" storage that only provides get/put/list/delete operations. Therefore restic is way easier to set up with built in support for S3, B2, GCS, and similar services that only offer an API but not shell access.
That's true, although there are now a handful of BorgBackup remote storage vendors (rsync.net, BorgBase, etc.) that you can pay to run the server-side hosting for you. Probably not nearly as cheap as, say, S3.. but it does get closer to "just point your client here and hit go". And they offer additional sauce on top that you'd have to roll yourself with S3.. Backup activity monitoring, etc.
We're not as cheap as S3 Deep Glacier, but cheaper than standard storage and the same price as B2 and Wasabi, if you get the large plan. So not that much difference to "dumber" storage.
If you can use something like samba or any other way of attaching a remote folder, Borg will work without SSH access. So you can also use Borg if you for example mount a Google Drive folder and use that as your repository. Correct me if I'm wrong.
Borgbackup is a fantastic piece of software. I've used it to backup so many different things over SSH, and it's always worked perfectly.
I'm still convinced that its dedupe is magical. I don't know if there's a backup app that is more frugal with disk space, but Borgbackup has served me well in a non-growing 1.5TB backup area for 3+ years now.
They split larger files into segments and only back up new segments. This avoids a) uploading files it has seen before and b) re-uploading large files if only part of it has changed.
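A toy version of that idea, assuming fixed-size chunks and an in-memory dict as the "repository". Borg and Restic use content-defined chunking, which also copes with inserted bytes shifting everything after them, so this is a simplification.

```python
import hashlib


def split_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def backup(data: bytes, store: dict[str, bytes],
           chunk_size: int = 4) -> tuple[list[str], int]:
    """Store unseen chunks; return (list of chunk hashes, chunks uploaded)."""
    manifest, uploaded = [], 0
    for chunk in split_chunks(data, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # only new content costs space or bandwidth
            store[digest] = chunk
            uploaded += 1
        manifest.append(digest)
    return manifest, uploaded


def restore(manifest: list[str], store: dict[str, bytes]) -> bytes:
    """Reassemble a file from its list of chunk hashes."""
    return b"".join(store[digest] for digest in manifest)
```

Backing up `b"aaaabbbbcccc"` and then `b"aaaaXXXXcccc"` into the same store uploads three chunks the first time and only one the second time, yet both versions can still be restored.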
I use Duplicity in a similar way to back my Linode stuff up to Backblaze. It does versioning really well and it's been very reliable.
I'd still have to configure up a new server somewhere etc but at least I have the data.
http://duplicity.nongnu.org/
I used duplicity in the past, but the main problem with its incremental backups is that in order to be able to prune the backup history, you need to do full backups regularly to start a new backup chain. That means transferring a full copy of the data.
I've switched to restic now, which lets me take incremental backups but can also remove any snapshot to prune the history. Although it does not support compression, due to its deduplication and removing the need to store multiple full backups, the restic repository takes less space now than duplicity did before.
That's a really good point I hadn't considered at all; I'm glad you mentioned it! I was looking at a benchmark (that I linked in another comment) that makes duplicity look slow, but so much more economical on storage space - i.e. cheaper.
But as you point out, if you don't need a long history, incremental eventually gets more expensive. Unless you could squash older than X, I suppose, but presumably that's so expensive to run (encryption & compression) that it's not supported.
We still use Duplicity because of the ability to rsync the files to other hosts. A repo is sort of weird as it's not a date-tagged single file or volume directory.
Related - I've been thinking about how to best backup my S3 buckets (some with 50k+ files) off of Amazon. Sure I can setup another bucket with that cross region duplication feature, and I have versioning.. but would really prefer a backup off of Amazon (ie not sending manually created zips in a lightsail/ec2 or something to glacier) in case it ever gets hacked or I accidentally nuke the buckets or something like that.
Currently just doing a combination of s3cmd for a local archive (takes forever to download, and it doesn't seem like incremental syncs are any faster), as well as having Google Console clone my bucket there (but I'm not sure if it's versioned, or as easy as downloading the whole archive).
Never used duplicity -- would it be fast for something like this? Guessing I should just cron it on a remote server instead of running off a local machine frequently.
Have you had a look at rclone? Pretty sure you can copy or even sync files from one remote storage to another. E.g. copy from S3 to B2.
https://rclone.org/commands/rclone_sync/
+1 for rclone here. It can indeed copy between remote backends. Just keep in mind that that data all has to flow through the rclone process. You could probably get much better performance by running rclone itself on an ec2 instance. Just keep an eye on your throughput usage.
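For reference, a sketch of driving that from a script. The remote names `s3:` and `b2:` and the bucket names are placeholders that would have to exist in your own rclone config. Note that `rclone sync` makes the destination match the source, deleting extra files there, so the sketch defaults to `--dry-run`.

```python
import subprocess


def rclone_sync_command(source: str, dest: str, dry_run: bool = True) -> list[str]:
    """Build an `rclone sync` call; dry-run by default because sync deletes."""
    cmd = ["rclone", "sync", source, dest]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_sync(source: str, dest: str, dry_run: bool = True) -> None:
    """Run the sync and raise if rclone reports a failure."""
    subprocess.run(rclone_sync_command(source, dest, dry_run), check=True)
```

For example, `run_sync("s3:my-bucket", "b2:my-bucket-copy")` would only print what it would transfer until `dry_run=False` is passed.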
Really? A blog post about how to use a glorified `rsync` was needed to instruct people building services for Fortune 500 companies to back their user data up?
To be fair, there is a dizzying array[0] of OSS backup solutions, and it's very much not apparent which features are most important. A simple post like this that outlines a single good-enough solution with a modern tool is valuable IMO.
EDIT: Oh and restic has much more functionality than rsync, including deduplication and encryption. rclone is more of a "glorified rsync", but even then its array of backends makes it truly glorious.
Don't forget about practicing restoration (catastrophe scenarios), so that you will know how long it will take to restore, and whether something is missing. Last time I did it I did not remember the password for the encryption key. Sure, I had it written down on a piece of paper, but the scenario was that the building had burnt down.
In a dockerized single-VPS environment, where should cronjobs live? Should they be part of the main Docker container that had the app code, or a separate container that only has all cronjobs, or simply on the host?
Good question. I have the same setup on one server hosting GitLab, Pi-Hole, Plex, etc., and I have Restic (and its cronjob) installed on the host and only backup the files that I mount to each Docker container, which are all stored in /srv/docker.
In theory, you need to be ready to literally delete every container at any time and pull them from scratch and be 100% fine, since all of your actual data should be stored on the host and mounted as Docker volumes [0]. It's a good Doomsday test if you're looking for one. ;)
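A sketch of what a host-side cron script for that setup might run, assuming the repository location below (a placeholder) and that the repository password is supplied through the environment, e.g. `RESTIC_PASSWORD_FILE`. The retention numbers are arbitrary.

```python
import subprocess


def restic_commands(repo: str, paths: list[str],
                    keep_hourly: int, keep_daily: int) -> list[list[str]]:
    """Back up the mounted volume directories, then thin out old snapshots."""
    return [
        ["restic", "-r", repo, "backup", *paths],
        ["restic", "-r", repo, "forget",
         "--keep-hourly", str(keep_hourly),
         "--keep-daily", str(keep_daily),
         "--prune"],
    ]


def run_backup(repo: str, paths: list[str]) -> None:
    """Run both steps, stopping at the first failure."""
    for cmd in restic_commands(repo, paths, keep_hourly=24, keep_daily=7):
        subprocess.run(cmd, check=True)
```

`run_backup("b2:my-bucket:docker-host", ["/srv/docker"])` covers only the data mounted into the containers, which is the point: the containers themselves can be pulled again from scratch.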
As mentioned in another post... you could manage the backups from another server (not in the network) with a cron job that grabs a snapshot from the docker server's shared volume directory and forwards it to its final destination. This could be done on a really small instance, and this way your backup information and account details aren't on your production server itself.
With kubernetes (and no more specifics than you've mentioned) you should use a Job.
With docker-compose I think I'd be tempted to have a different service that isn't long-running, and a cron job on the host that runs it.
With swarm, unless it supports something like k8s Jobs, (I don't know if it does or not, only used it once briefly and in anger) I'd probably have a 'cronjob' service which was responsible for launching the short-lived services per compose suggestion above.
I don't know about "should", but one way to do it is to put both the backup script and the cron job to run it into a single, separate, backup-only container. Then tell that container (via volume mounts, etc.) what volumes to backup from other containers. Example container (non-Restic) that does this: https://hub.docker.com/r/b3vis/borgmatic/
Meh, just another backup solution that requires AWS keys, ssh keys, etc. to be kept on the same server where your data is. What if that server is compromised? The attacker now has all the keys he needs to delete or modify your backups, too.
For maximum peace of mind, always pull backups from a separate server that is not exposed to the world. Don't let your primary server push arbitrary data to the backup store.
This rule is trickier to follow when your backup store can't run scripts, which is why so many tools designed to work with S3 tell you to keep the keys exposed. But if you really want to, you can use an intermediate host to pull backups before pushing them again to S3.
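A rough sketch of the pull side, run on the backup host, assuming rsync over SSH and made-up host and path names. It deliberately avoids `--delete` and writes each run into a new dated directory, hardlinking unchanged files against the previous one with `--link-dest`, so a compromised primary cannot erase older snapshots.

```python
import subprocess
from typing import Optional


def pull_command(remote: str, snapshot_root: str, today: str,
                 previous: Optional[str]) -> list[str]:
    """Build an rsync pull into <snapshot_root>/<today>/."""
    cmd = ["rsync", "-a"]
    if previous is not None:
        # Unchanged files become hardlinks into the previous snapshot.
        cmd.append(f"--link-dest={snapshot_root}/{previous}")
    cmd += [remote, f"{snapshot_root}/{today}/"]
    return cmd


def pull(remote: str, snapshot_root: str, today: str,
         previous: Optional[str]) -> None:
    """Run the pull and raise if rsync fails."""
    subprocess.run(pull_command(remote, snapshot_root, today, previous), check=True)
```

The primary only needs to accept a read-only SSH login from the backup host; no storage credentials live on it.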
My own personal preference is to simply make VM's on each VPS that has some storage space, then enable chroot sftp and rsnapshot. Then on the client side, I used LFTP (sftp mirror sub-system) which is compatible with chroot sftp and behaves like rsync.
Each VPS backs up to the other. RSnapshot makes daily diffs that use hardlinks to avoid taking up space. This also mitigates tampering, as only root has access to the snapshots.
One of the modes of restic is SFTP target and as we run stock, standard OpenSSH, it works perfectly.
EDIT: A sibling comment to yours mentioned 'rclone' and I am happy to informally announce that over the past few months we have rolled out the 'rclone' binary to all of our production fileservers (it requires a server-side binary exe to be in place) and it is being used by rsync.net customers to broker file transfers cloud to cloud to cloud (as rclone is apt to be used for). 'rclone serve' and 'rclone mount' are disallowed for (I think) obvious reasons, but otherwise everything works ...
One idea I had was to create a service with preconfigured images setup for personal use with VPN, email server and file sync/backup. It could be sold to privacy conscious individuals and could compete with ProtonMail.
The technical side could be hidden from less technical users and it sold as isolated servers so the data would be protected.
I don’t have the skills or the time to work on this so happy for others to use the idea
In addition to rclone... you could run a VPS that has the various services as fuse mounts and otherwise sync between them in CRON ... however, probably best to only actually use one of them in practical terms.
It depends on what you want to do, but can probably be accomplished in a <= $5 VPS.
Does anyone have a recommendation for a backup client that handles millions of tiny files? I'm using rsnapshot right now, which works but backing up to an NFS share is incredibly slow (most of the time is spent in iterating over the filesystem to get a list of changed files, then running the hardlink process from the previous snapshot).
You're going to have to walk all those inodes no matter what you do. rsync is as good as anything at that task.
A better way would be to unmount and send the filesystem with 'dd' or something like that, or, to use 'zfs send' but I have a suspicion that neither of those options are available to you ...
I will say that splitting the rsync job (rsnapshot runs rsync underneath) into multiple, smaller jobs, could save you some time if you're running into any resource limits while you walk that big set of inodes... so if you're lucky and you have 4 or 5 or 8 top level dirs that are all roughly the same size, you could do a handful of smaller jobs, one after the other, instead of one huge one ...
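As an illustration of that splitting, a sketch that builds one plain rsync job per top-level directory. The destination string is a placeholder, and files sitting directly in the source root are not covered by these jobs, so they would need one extra run.

```python
import pathlib
import subprocess


def per_directory_jobs(source_root: pathlib.Path, dest: str) -> list[list[str]]:
    """One rsync command per top-level directory, in a stable order."""
    jobs = []
    for child in sorted(source_root.iterdir()):
        if child.is_dir():
            jobs.append(["rsync", "-a", f"{child}/", f"{dest}/{child.name}/"])
    return jobs


def run_jobs(source_root: pathlib.Path, dest: str) -> None:
    """Run the smaller jobs one after the other."""
    for job in per_directory_jobs(source_root, dest):
        subprocess.run(job, check=True)
```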
>> A better way would be to unmount and send the filesystem with 'dd' or something like that
To add to that.. to avoid having to unmount your filesystem; use LVM. Then you can call `sync`, snapshot your main volume, and `dd` the clean snapshot. Once you're done, remove the snapshot. This strategy avoids downtime while backing up your volume.
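Spelled out as a script, that sequence might look like the sketch below. The volume group, snapshot name, snapshot size and image path are all assumptions; the snapshot needs enough space to hold whatever gets written to the origin volume while `dd` is running.

```python
import subprocess


def snapshot_backup_commands(volume: str, snapshot_name: str,
                             snapshot_size: str, image_path: str) -> list[list[str]]:
    """Commands for: flush, snapshot, copy the snapshot, drop the snapshot."""
    vg_dir = volume.rsplit("/", 1)[0]  # e.g. /dev/vg0
    snapshot = f"{vg_dir}/{snapshot_name}"
    return [
        ["sync"],
        ["lvcreate", "--snapshot", "--size", snapshot_size,
         "--name", snapshot_name, volume],
        ["dd", f"if={snapshot}", f"of={image_path}", "bs=4M"],
        ["lvremove", "--force", snapshot],
    ]


def run_snapshot_backup(volume: str, snapshot_name: str,
                        snapshot_size: str, image_path: str) -> None:
    """Run the sequence, always removing the snapshot once it exists."""
    flush, create, copy, remove = snapshot_backup_commands(
        volume, snapshot_name, snapshot_size, image_path)
    subprocess.run(flush, check=True)
    subprocess.run(create, check=True)
    try:
        subprocess.run(copy, check=True)
    finally:
        subprocess.run(remove, check=True)
```

The resulting image is crash-consistent, like a disk after a power cut, so a database on that volume still needs its own dump or a clean shutdown to be safely restorable.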
Over the years, I've been burned so many times by Linux network filesystems (including SMB/CIFS, which is not only Linux but still) that I would start by recommending not to touch a network filesystem on the same host where anything important happens.
The issue is with how network failures (which always eventually happen) interact with the "uninterruptible" Linux process state. Hell breaks loose, and the failure is anything but obvious.
I'm currently using restic and used duplicity before for a few years, sending backups to backblaze b2.
What I don't like about duplicity is how it spews weird error messages that I gathered were related to Python versions. It was easier to just start over with restic, and I haven't had any problems since.
I actually like using CPanel's built-in backup settings on servers where I have CPanel installed. Amazingly simple to set up, really intuitive, and supports a variety of services. I have used Amazon and SFTP backups so far and they both work really well.
I'm surprised that no one has mentioned Duplicacy yet. It's another very solid, reliable and fast alternative. At the moment I use Restic on servers but use Duplicacy on the desktop. It can also be used on servers of course.
Duplicity is not particularly good imo. Like I said in another comment, it is much slower than other options and requires full backups regularly, which is a problem with lots of data.
Doesn't Linode have its own backup which lets you restore onto another Linode VPS? It'd be nice to use provider-agnostic tools, but this seems like the most pragmatic option. I'm guessing other VPS providers offer something similar.
It can't deal with the case where Linode locks your account, similar to what DO did in the original post. You can't put all your eggs in the same basket if you want to be safe.
I don't consider something backed up unless there are at least 3 copies of something in at least 2 locations. Also, replicated dbs are not backups, but may be the best available option when you have too much data to reliably backup en masse.
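That rule of thumb is easy to encode as a sanity check in a monitoring script. A tiny sketch, where each list entry is the (made-up) location name of one copy:

```python
def is_backed_up(copy_locations: list[str],
                 min_copies: int = 3, min_locations: int = 2) -> bool:
    """True when there are enough copies spread over enough distinct places."""
    return (len(copy_locations) >= min_copies
            and len(set(copy_locations)) >= min_locations)
```

Three copies on the same VPS fail the check, while one on the VPS plus one each in two external buckets pass it.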
I've done replication-on-write scenarios. Generally when I've set up mongo or elasticsearch for searching/read performance, I'll also push to a fixed JSON+GZ on S3 or similar. It tends to work pretty well as a fallback for larger data scenarios. Had to use it once, and was so glad to have it.