Apart from the NTP tangent, this sounds like a Linux XFS / ServeRAID M5210 firmware issue. Your XFS filesystems, created with the incorrect block/IO sizes reported by the RAID controller, would have been unmountable on the newer Linux kernel regardless of Ceph.
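To make the failure mode concrete, here is a minimal sketch of the check involved: compare the physical sector size the controller currently reports (e.g. via `blockdev --getpbsz`) with the `sectsz` the filesystem was formatted with (visible in `xfs_info` output). The `check_sectsz` helper and the sample values are hypothetical, just to illustrate the mismatch that makes a mount fail:

```shell
#!/bin/sh
# Hypothetical helper: flag a mismatch between the sector size the RAID
# controller reports and the sector size an XFS filesystem was created with.
# On a real system the inputs would come from something like:
#   blockdev --getpbsz /dev/sda
#   xfs_info /var/lib/ceph/osd/ceph-0 | sed -n 's/.*sectsz=\([0-9]*\).*/\1/p'
check_sectsz() {
    controller_bytes=$1   # sector size the controller/kernel reports now
    xfs_bytes=$2          # sector size XFS was formatted with (sectsz)
    if [ "$controller_bytes" != "$xfs_bytes" ]; then
        echo "MISMATCH: controller=$controller_bytes xfs=$xfs_bytes"
    else
        echo "OK"
    fi
}

# After a firmware update changes the reported size, the mount breaks:
check_sectsz 4096 512
```

On an affected box this is the kind of discrepancy you would see: the filesystem metadata says one sector size, the device (post-firmware-change) reports another, and the kernel refuses the mount.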
Lesson learned: your configuration management also needs to control for firmware versions such that the same issue would have shown up in a dev/test environment before turning into a prod nightmare :/
(Author here) Yes, it was an XFS/controller issue, but Ceph reported the failure. :) (IMO it wasn't really a good decision by Ceph to use 100 MB XFS partitions as a kind of database, but nowadays ceph-disk (which used those XFS partitions) is gone, and ceph-volume takes a different approach via LVM instead.)
Regarding configuration management/firmware versions: yes - especially since you'd also need to rebuild disks in the dev/test environment with an identical configuration (firmware, disks, ...) to ensure it actually matches. And even if we set aside load/capacity/usage issues (some problems only show up under specific workloads), there are further "invisible" layers/components like cables, NICs, switches, ... whose firmware versions are also relevant. Not exactly trivial. :)
> IMO it wasn't really a good decision from Ceph to use 100MB XFS partitions as a kind of database
It has since been shown that you are right, but not because of bugs like the one you encountered: the problem could just as well have happened with a regular XFS filesystem holding a maildir.