I'm currently managing a Postgres cluster with a petabyte of data running on ZFS on Linux on AWS. Most of the issues we've come across are around us not knowing ZFS.
The first main issue was the arc_shrink_shift default being poor for machines with a large ARC. Our machines have ARCs of several hundred GB, so the default arc_shrink_shift was flushing several GB to disk at a time. This was causing our machines to become unresponsive for several seconds at a time pretty frequently.
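For a rough sense of the arithmetic (the 400 GiB ARC size and the shift value of 11 below are hypothetical, not our actual settings): ZFS on Linux reclaims roughly `arc_size >> zfs_arc_shrink_shift` bytes per shrink pass, so the default shift of 7 means 1/128th of a multi-hundred-GB ARC gets dropped at once.

```shell
# Bytes reclaimed per ARC shrink pass is roughly
# arc_size >> zfs_arc_shrink_shift (hypothetical values below).
arc_bytes=$((400 * 1024 * 1024 * 1024))   # a 400 GiB ARC
echo $((arc_bytes >> 7))    # default shift of 7: ~3.1 GiB per pass
echo $((arc_bytes >> 11))   # shift of 11: ~200 MiB per pass

# To persist a larger shift across reboots (hypothetical value):
#   # /etc/modprobe.d/zfs.conf
#   options zfs zfs_arc_shrink_shift=11
```

A larger shift trades slower ARC shrinkage for much smaller (and less stall-prone) individual reclaim passes.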
The other main issue we encountered was when we tried to delete lots of data at a time. We aren't sure why, but when we tried to delete a lot of data (~200GB from each machine, each of which contains several TB of data), our databases became unresponsive for an hour.
Other than these issues, ZFS has worked incredibly well. The built-in compression has saved us lots of $$$. It's just the unknown unknowns that have been getting us.
Agreed, ZFS has its caveats, but feature-wise and stability-wise ZFS is, to a large degree, what BTRFS should have been.
The licensing is incredibly unfortunate, though. (I don't care about the reasoning for the license, it's just bad that it isn't GPL-compatible so that it could be compatible with the most prolific kernel in the world.)
Anyway, back to BTRFS-vs-ZFS. It seems abundantly clear that a filesystem is no longer a thing where you can just "throw an early idea out there" and hope that others will pick up the slack and fix all the bugs. There's just too much design (not code) that goes into these things; it's not just about code any more.
My (small) bet right now as to the "next gen" FS on Linux is on bcachefs[1, 2]. It sounds much sounder from a design perspective than BTRFS, plus it's built on the already-proven bcache, etc. etc. (Read the page for details.)
According to Canonical, it _is_ GPL compatible. Either way, that shouldn't get in the way of the best file system in existence being used with the kernel of last resort.
Canonical ships ZoL binaries as of April 2016. They claim doing so doesn't violate the GPL since they are shipping it as a module rather than built into the kernel.
No, they're supplied as kernel modules, packaged separately from the kernel. Before Ubuntu 15.10 you could still install it as a DKMS module (such that it compiled on the system it's being installed on). Now they just ship the pre-built .ko's, saving the user compilation time. There are still userland tools (zpool, zfs, etc.) to interact with it.
>We aren't sure why, but when we tried to delete a lot of data (~200GB from each machine which each contain several TB of data), our databases become unresponsive for an hour.
There used to be an issue where users hitting their quota couldn't delete files, since on a copy-on-write filesystem deleting a file requires writing new metadata blocks first. The trick was to find some reasonably large file and `echo 1 > large_file`, which truncates the file and frees up enough space that you can begin removing files. Maybe this kind of trick could help you guys.
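The truncate-in-place idea can be demonstrated on any Linux box; this is just a sketch with a hypothetical path, nothing ZFS-specific:

```shell
# Create a 10 MiB file, then truncate it in place. Truncating frees the
# file's data blocks while keeping the inode, which reportedly succeeded
# where a plain delete failed on a full/quota'd dataset.
dd if=/dev/zero of=/tmp/large_file bs=1M count=10 2>/dev/null
stat -c %s /tmp/large_file    # 10485760
: > /tmp/large_file           # same effect as the echo trick, writes nothing
stat -c %s /tmp/large_file    # 0
```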
That said, it's inadvisable to run a database on a btree file system like ZFS or btrfs if you're keeping an eye on the write performance.
cf Postgres 9.0 High Performance by Gregory Smith (https://www.amazon.com/PostgreSQL-High-Performance-Gregory-S...)
> That said, it's inadvisable to run a database on a btree file system like ZFS or btrfs if you're keeping an eye on the write performance.
Our writes are actually heavily CPU bound because of how we architected the system[0]. We recently made some changes that dramatically improved our write throughput, so AFAICT, we aren't going to need to focus much on write performance in the near future.
Could you elaborate more on your setup? What's in the ZFS pool that supports the performance of running a DB as well as a PB of data without breaking the bank?
It's not a single machine. We have a cluster of machines, each of which has several TB of data. The only parameter I clearly remember changing is recordsize=8k, since postgres works with 8k pages.