I'm currently managing a Postgres cluster with a petabyte of data running on ZFS on Linux on AWS. Most of the issues we've come across are around us not knowing ZFS.
The first main issue was the arc_shrink_shift default being poor for machines with a large ARC. Our machines have ARCs of several hundred GB, so the default arc_shrink_shift was flushing several GB to disk at a time. This was causing our machines to become unresponsive for several seconds at a time pretty frequently.
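For a rough sense of the arithmetic (the 400 GiB ARC size and the shift value of 11 below are hypothetical, not our actual settings): ZFS on Linux reclaims roughly `arc_size >> zfs_arc_shrink_shift` bytes per shrink pass, so the default shift of 7 means 1/128th of a multi-hundred-GB ARC gets dropped at once.

```shell
# Bytes reclaimed per ARC shrink pass is roughly
# arc_size >> zfs_arc_shrink_shift (hypothetical values below).
arc_bytes=$((400 * 1024 * 1024 * 1024))   # a 400 GiB ARC
echo $((arc_bytes >> 7))    # default shift of 7: ~3.1 GiB per pass
echo $((arc_bytes >> 11))   # shift of 11: ~200 MiB per pass

# To persist a larger shift across reboots (hypothetical value):
#   # /etc/modprobe.d/zfs.conf
#   options zfs zfs_arc_shrink_shift=11
```

A larger shift trades slower ARC shrinkage for much smaller (and less stall-prone) individual reclaim passes.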
The other main issue we encountered was when we tried to delete lots of data at a time. We aren't sure why, but when we tried to delete a lot of data (~200GB from each machine, each of which contains several TB of data), our databases became unresponsive for an hour.
Other than these issues, ZFS has worked incredibly well. The built-in compression has saved us lots of $$$. It's just the unknown unknowns that have been getting us.
Agreed, ZFS has its caveats, but feature-wise and stability-wise ZFS is, to a large degree, what BTRFS should have been.
The licensing is incredibly unfortunate, though. (I don't care about the reasoning for the license, it's just bad that it isn't GPL-compatible so that it could be compatible with the most prolific kernel in the world.)
Anyway, back to BTRFS-vs-ZFS. It seems abundantly clear that a filesystem is no longer a thing where you can just "throw an early idea out there" and hope that others will pick up the slack and fix all the bugs. There's just too much design (not code) that goes into these things; it's not just about code any more.
My (small) bet right now as to the "next gen" FS on Linux is on bcachefs[1, 2]. It sounds much sounder from a design perspective than BTRFS, plus it's built on the already-proven bcache, etc. etc. (Read the page for details.)
According to Canonical, it _is_ GPL compatible. Either way, that shouldn't get in the way of the best file system in existence being used with the kernel of last resort.
Canonical ships ZoL binaries as of April 2016. They claim doing so doesn't violate the GPL since they are shipping it as a module rather than built into the kernel.
No, they're supplied as kernel modules, packaged separately from the kernel. Before Ubuntu 15.10 you could still install it as a DKMS module (such that it compiled on the system it's being installed on). Now they just ship the pre-built .ko's, saving the user compilation time. There are still userland tools (zpool, zfs, etc.) to interact with it.
>We aren't sure why, but when we tried to delete a lot of data (~200GB from each machine which each contain several TB of data), our databases become unresponsive for an hour.
There used to be an issue where users hitting their quota couldn't delete files, since on a copy-on-write filesystem deleting a file requires writing new metadata blocks first. The trick was to find some reasonably large file and `echo 1 > large_file`, which truncates the file and frees up enough space that you can begin removing files. Maybe this kind of trick could help you guys.
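The truncate-in-place idea can be demonstrated on any Linux box; this is just a sketch with a hypothetical path, nothing ZFS-specific:

```shell
# Create a 10 MiB file, then truncate it in place. Truncating frees the
# file's data blocks while keeping the inode, which reportedly succeeded
# where a plain delete failed on a full/quota'd dataset.
dd if=/dev/zero of=/tmp/large_file bs=1M count=10 2>/dev/null
stat -c %s /tmp/large_file    # 10485760
: > /tmp/large_file           # same effect as the echo trick, writes nothing
stat -c %s /tmp/large_file    # 0
```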
That said, it's inadvisable to run a database on a btree file system like ZFS or btrfs if you're keeping an eye on the write performance.
cf Postgres 9.0 High Performance by Gregory Smith (https://www.amazon.com/PostgreSQL-High-Performance-Gregory-S...)
> That said, it's inadvisable to run a database on a btree file system like ZFS or btrfs if you're keeping an eye on the write performance.
Our writes are actually heavily CPU bound because of how we architected the system[0]. We recently made some changes that dramatically improved our write throughput, so AFAICT, we aren't going to need to focus much on write performance in the near future.
Could you elaborate more on your setup? What's in the ZFS pool that supports the performance of running a DB as well as a PB of data without breaking the bank?
It's not a single machine. We have a cluster of machines, each of which has several TB of data. The only parameter I clearly remember changing is recordsize=8k, since postgres works with 8k pages.