Same here. I still remember the time our Heketi DB got partially corrupted and we had to repair it by exporting it to a massive JSON file, fixing the entries by hand against the actual Gluster state, and importing it again. I can't quite remember the details, but I think it had to do with Gluster snapshots being out of sync with the state in the DB.
It's a system within Facebook called Warm Storage. There have been some public presentations about it, so I'm comfortable mentioning the name, but unfortunately I can't provide many other details about architecture or scale. I'll just warn people that the public information on it is way out of date. Most of it seems to be from 2014, and what it describes is really a separate system from what we have now.
I can recommend atop[1] for that. By default it records a sample every ten minutes and writes lots of information to /var/log/atop/atop_YYYYMMDD. With that you can examine what happened before the crash: just open the file with atop -r /var/log/atop/atop_YYYYMMDD.
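If you do this often, a small wrapper saves some typing. This is only a rough sketch: the function names atop_log_for and atop_replay are made up for illustration, and it assumes the default /var/log/atop location mentioned above (your distro may put the logs elsewhere). It relies on atop's -r option to read a raw log and on -b / -e to limit the replay to a time window.

```shell
#!/bin/sh
# Hypothetical helper: print the atop log path for a given day (YYYYMMDD).
# Assumes the default log directory; adjust if your distro uses another one.
atop_log_for() {
    printf '/var/log/atop/atop_%s\n' "$1"
}

# Hypothetical helper: replay the log for a given day, optionally limited
# to a begin/end time (hh:mm) so you land close to the crash.
atop_replay() {
    day="$1"; begin="$2"; end="$3"
    log="$(atop_log_for "$day")"
    if [ ! -r "$log" ]; then
        echo "no readable atop log at $log" >&2
        return 1
    fi
    if [ -n "$begin" ] && [ -n "$end" ]; then
        # -r reads the raw log, -b and -e set the begin and end time
        atop -r "$log" -b "$begin" -e "$end"
    else
        atop -r "$log"
    fi
}
```

For example, atop_replay 20240115 03:10 03:40 would open the samples recorded around a crash at roughly 03:30 on that day. Inside the viewer, t steps forward to the next sample and T steps back.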