Apart from the NTP tangent, this sounds like a Linux XFS / ServeRAID M5210 firmware issue. Your XFS filesystems, created with the incorrect block/IO sizes reported by the RAID controller, would have been unmountable on the newer Linux kernel regardless of Ceph.
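To make the failure mode concrete, here is a minimal sketch of the check involved: compare the physical sector size the controller currently reports (e.g. via `blockdev --getpbsz`) with the `sectsz` the filesystem was formatted with (visible in `xfs_info` output). The `check_sectsz` helper and the sample values are hypothetical, just to illustrate the mismatch that makes a mount fail:

```shell
#!/bin/sh
# Hypothetical helper: flag a mismatch between the sector size the RAID
# controller reports and the sector size an XFS filesystem was created with.
# On a real system the inputs would come from something like:
#   blockdev --getpbsz /dev/sda
#   xfs_info /var/lib/ceph/osd/ceph-0 | sed -n 's/.*sectsz=\([0-9]*\).*/\1/p'
check_sectsz() {
    controller_bytes=$1   # sector size the controller/kernel reports now
    xfs_bytes=$2          # sector size XFS was formatted with (sectsz)
    if [ "$controller_bytes" != "$xfs_bytes" ]; then
        echo "MISMATCH: controller=$controller_bytes xfs=$xfs_bytes"
    else
        echo "OK"
    fi
}

# After a firmware update changes the reported size, the mount breaks:
check_sectsz 4096 512
```

On an affected box this is the kind of discrepancy you would see: the filesystem metadata says one sector size, the device (post-firmware-change) reports another, and the kernel refuses the mount.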
Lesson learned: your configuration management also needs to control for firmware versions such that the same issue would have shown up in a dev/test environment before turning into a prod nightmare :/
(Author here) Yes, it was an XFS/controller issue, but Ceph reported the failure. :) (IMO it wasn't really a good decision by Ceph to use 100 MB XFS partitions as a kind of database, but nowadays ceph-disk (which used those XFS partitions) is gone, and ceph-volume takes a different approach via LVM instead.)
Regarding configuration management/firmware versions: yes - especially since you'd also need to rebuild disks in the dev/test environment with an identical configuration (firmware, disks, ...) to ensure it actually matches. And even if we set aside load/capacity/usage issues (some problems only show up under specific workloads), there are further "invisible" layers/components like cables, NICs, switches, ... whose firmware versions are also relevant. Not exactly trivial. :)
> IMO it wasn't really a good decision from Ceph to use 100MB XFS partitions as a kind of database
It has since been shown that you are right, but not because of bugs like the one you encountered: the problem could just as well have happened with a regular XFS filesystem holding a maildir.