> I know ZFS can be run without ECC and some consumer solutions do. However, it seems ZFS should be run with ECC. I've already experienced observable bitrot with older images and video files, I'd rather not let it progress.
From my understanding, the only risk to your data from non-ECC is a bit flip in RAM before the checksum is calculated. In that unlikely scenario, you commit bad data to disk as good data (with a valid checksum). Bitrot isn't a factor, at all.
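To make that failure mode concrete, here is a minimal sketch in Python. It is not how ZFS is implemented: `write_block` and `verify` are hypothetical helpers, and CRC32 from `zlib` stands in for whatever checksum the filesystem actually uses. It only shows that a flip occurring before the checksum is computed produces a block that verifies cleanly.

```python
import zlib

def write_block(data: bytes, flip_bit=None):
    """Toy write path (hypothetical): optional RAM bit flip, then checksum, then 'commit'."""
    buf = bytearray(data)
    if flip_bit is not None:
        # Simulate a single bit flipping in RAM before the checksum is computed.
        buf[flip_bit // 8] ^= 1 << (flip_bit % 8)
    stored = bytes(buf)
    # The checksum is computed over whatever is in the buffer right now.
    return stored, zlib.crc32(stored)

def verify(stored: bytes, checksum: int) -> bool:
    """Toy scrub/read check: does the stored block match its stored checksum?"""
    return zlib.crc32(stored) == checksum

original = b"family photo bytes"
stored, checksum = write_block(original, flip_bit=5)

print(verify(stored, checksum))  # True: the checksum is valid
print(stored == original)        # False: the data is not what the application wrote
```

A later scrub would report this block as healthy, because the checksum faithfully describes the already-corrupted buffer.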
This means that ECC RAM and ZFS are completely orthogonal concerns.
If your data is important enough to warrant ECC RAM, you should get ECC RAM whether you use ZFS or not.
If you want to use ZFS (for its volume management, compression, mirroring, health checks, what have you), you should do so whether or not you have ECC RAM.
That is bitrot: you save correct data and it’s not retrievable. The fact that it happens in RAM rather than on the storage media, controller, or I/O channel just makes it a different category.
It is also far, far more likely that an uncorrected bit flip happens outside the relatively small portion of time the kernel spends in filesystem code. This is not a ZFS-specific problem by any means.
> From my understanding, the only risk to your data from non-ECC is a bit flip in RAM, pre-checksum calculation. In that unlikely scenario, you commit bad data to disk as good data (valid checksum).
Wouldn't an option to do it twice in different memory regions be nice? I'm pretty sure that in many use cases sacrificing performance for greater reliability wouldn't be an issue. Given how many cores we have available nowadays, it might not even have much impact on performance.
Also, are there any software solutions (like a kernel patch) that would do "software ECC"? I imagine the performance hit would be quite devastating, but it could still be an acceptable trade-off for NAS-like systems where you want lots of RAM for dedup and cache but the system isn't busy.
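As a rough illustration of what redundancy in software could look like, here is a sketch of triple redundancy with a bitwise majority vote. To be clear, this is a swapped-in technique, not real ECC: hardware ECC typically uses a single-error-correct, double-error-detect code with 8 check bits per 64 data bits, whereas this toy costs three times the memory. `TripleStore` is a hypothetical name, not a kernel facility or an existing patch.

```python
class TripleStore:
    """Toy 'software ECC' (hypothetical): keep three copies, majority-vote on read."""

    def __init__(self, data: bytes):
        # Three independent buffers holding the same payload.
        self.copies = [bytearray(data) for _ in range(3)]

    def read(self) -> bytes:
        a, b, c = self.copies
        # Bitwise majority: each output bit is the value held by at least two copies.
        voted = bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
        # Scrub: rewrite all copies with the voted value so errors don't accumulate.
        for copy in self.copies:
            copy[:] = voted
        return voted

store = TripleStore(b"dedup table entry")
store.copies[1][0] ^= 0x10   # simulate a single bit flip in one copy
print(store.read())          # b'dedup table entry'
```

This corrects any flip confined to one copy per bit position, but it does nothing for the memory holding the voting code itself or for data in flight between the vote and its use, which is part of why doing this in software is unattractive.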
There is still a race condition: if you read data from disk into a buffer, make a copy of the buffer, then do 2 checksums, the bit flip can still occur before the 2nd copy is created.
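A small sketch of that race, under the same toy assumptions (CRC32 via `zlib` as a stand-in checksum, and a hypothetical `mismatch_detected` helper): comparing checksums of two copies only catches a flip that lands after the copy was made.

```python
import zlib

def flip(buf: bytearray, bit: int) -> None:
    """Flip one bit in place to simulate a RAM error."""
    buf[bit // 8] ^= 1 << (bit % 8)

def mismatch_detected(data: bytes, flip_before_copy: bool, flip_after_copy: bool) -> bool:
    """Return True if comparing two checksums notices the corruption."""
    first = bytearray(data)
    if flip_before_copy:
        flip(first, 3)            # corruption happens before the second copy exists
    second = bytearray(first)     # copy into a "different memory region"
    if flip_after_copy:
        flip(second, 3)           # corruption hits only one of the two copies
    return zlib.crc32(bytes(first)) != zlib.crc32(bytes(second))

print(mismatch_detected(b"payload", flip_before_copy=False, flip_after_copy=True))  # True
print(mismatch_detected(b"payload", flip_before_copy=True, flip_after_copy=False))  # False
```

In the second case both copies carry the same corrupted bytes, so the two checksums agree and the bad data is accepted.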