OK, so I have a few comments about your experiments:
> I am surprised that mounting worked without error but I guess the device is still active via losetup.
Exactly. `rm` doesn't actually delete the file contents while the file is still open, it just unlinks it from the filesystem tree. So your loopback-mounted disk is still there and all its contents are still available through /dev/loopX.
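The same unlink semantics are easy to demonstrate in a few lines of Python (a minimal sketch of the general POSIX behavior, not btrfs-specific):

```python
import os
import tempfile

# Deleting (unlinking) a file does not free its contents while a file
# descriptor to it is still open -- this is why losetup keeps the loop
# device usable after the backing file is rm'd.
path = tempfile.mktemp()
with open(path, "w") as f:
    f.write("disk image contents")

fd = open(path)      # keep the file open, like losetup does
os.unlink(path)      # "rm" the backing file

assert not os.path.exists(path)             # gone from the filesystem tree...
assert fd.read() == "disk image contents"   # ...but still fully readable
fd.close()
```

The blocks are only actually freed once the last open descriptor is closed, i.e. once you `losetup -d` the device.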
> I'm assuming this would be similar to an actual disk failure though, if the device weren't there maybe btrfs will complain and ask to be mounted with the `-o degraded` flag.
If the /dev/loopX device wasn't there then it would be similar to a complete disk failure, yes.
> In this test about 26% of data is still fully readable
It's true that only 26% of data is still fully readable if you account only for files that are fully intact. But also note that about 78% of files were still completely intact.
This is not clear from your comment, but I'm assuming that you are using 4 devices for the btrfs pool as well?
In this scenario, with such a disk configuration and subsequent disk failure you would expect to lose about 25% of files, while the remaining 75% would be intact (especially if the files are small enough)...
In reality, though, things can turn out quite a bit better or quite a bit worse, depending on a few factors.
For example:
1. If the free space was fragmented. In that case, a significant percentage of files might actually be allocated on more than one disk, so you'd lose more files than expected if a single disk fails. Although I can see that in your later experiment you defragged the btrfs filesystem beforehand, so perhaps this is not the main issue.
2. Depending on how btrfs allocates data, if the files do not completely fill all of the disks, the data can be heavily skewed towards a subset of the disks.
For example, imagine that each of your disks is 1 TB and your files total less than 1 TB.
In this case, all of your files could be allocated on the first disk only, so losing this disk could mean losing 100% of your data.
Or, if your files total less than 2 TB, they might all be allocated on the first 2 disks only, so losing one of those disks would lose a lot more files than you'd expect if files were evenly distributed across all disks.
But on the other hand, if you lost one of the other disks, you might not lose any data whatsoever.
3. Depending on how large files are and how much free space there is on each disk, btrfs might be forced to (or might choose to) span a file across more than 1 disk even on the 'single' profile, even if free space was not fragmented.
4. But more generally, of course, how many files you would lose depends on how btrfs allocates disk space across the disks for each file.
These allocation algorithms can be quite a bit more complex than a naive allocator, mostly for performance reasons.
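The naive 25%-loss intuition, and how badly a skewed allocator can break it, is easy to simulate. This is a toy model, not the real btrfs allocator; the "fill-first" policy is hypothetical:

```python
import random

# Toy model: 4 disks, 100k files, each file allocated wholly to one disk.
# Compare an even (random) allocator against a hypothetical fill-first
# allocator where everything fits on disk 0.
random.seed(0)
NUM_DISKS, NUM_FILES = 4, 100_000

# Even spread: each file goes wholly to a random disk.
even = [random.randrange(NUM_DISKS) for _ in range(NUM_FILES)]
lost_even = sum(d == 0 for d in even) / NUM_FILES

# Fill-first: all files fit on disk 0, so disk 0 holds everything.
lost_fill = 1.0

print(f"even spread: lose ~{lost_even:.0%} of files if disk 0 fails")
print(f"fill-first:  lose {lost_fill:.0%} of files if disk 0 fails")
```

Under the even spread you lose roughly a quarter of the files; under fill-first you lose all of them (or none, if a different disk fails).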
Unfortunately, I know exactly nothing about how btrfs allocates data, so I can't give you more insight than this, sorry!
> 26% of data is still fully readable -- But also note that about 78% of files were still completely intact.
Do you mean partially intact? I did not count that data. Ah, I think I understand what you're saying: 78% of files are fully readable (by number of files), but most of those are small files, which are stored within btrfs metadata ("inlined" extents).
I computed 26% using the quantity of data (i.e. the sum of file sizes), which I feel is a more accurate representation of what is readable. Sure, btrfs `single` mode will leave many files partially readable--if that is an acceptable failure state then I would recommend it.
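The gap between the two metrics is just a consequence of a small-file-heavy distribution; here is a toy illustration with made-up numbers (not my actual test data):

```python
import random

# Why "78% readable by file count" and "26% readable by data size" can both
# be true: most files are tiny (inlined in metadata, so they survive), while
# most of the *bytes* live in a few large files that mostly don't.
random.seed(1)
SMALL, LARGE = 2_000, 500_000_000  # bytes; made-up sizes for illustration
files = [random.choice([SMALL, SMALL, SMALL, LARGE]) for _ in range(10_000)]

# Suppose inlined files (<= 2048 bytes) all survive; large files survive
# only ~25% of the time.
survived = [s for s in files if s <= 2048 or random.random() < 0.25]

by_count = len(survived) / len(files)
by_bytes = sum(survived) / sum(files)
print(f"readable by file count: {by_count:.0%}")
print(f"readable by data size:  {by_bytes:.0%}")
```

By count the survival rate looks high (around 80%), while by bytes it sits near 25%, because the byte total is dominated by the large files.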
I tried the same experiment with a raid0 btrfs config and only the inlined extents were fully readable--less than 1 MB recovered from 80 GB of data.
> you would expect to lose about 25% of files, while the remaining 75% would be intact
That's what I expected from btrfs for the last 9 months, and people online were saying that `single` mode has the same data guarantees as `raid0` mode--which is kind of true but also kind of not, as we can see. It's possible but not likely: in a highly fragmented filesystem the spread of data in `single` could take on a similar shape to `raid0`, and in that case you could only easily recover about the same amount of data (almost none).
What happens in practice is that btrfs allocates 1 GB block groups one drive at a time but, in a multi-disk setup, writes file extents to multiple disks at a time. So at the file level there are no guarantees about one file being on one disk. This is why I was only able to read 20~30% of data rather than the 75% you and I both naively expected from btrfs single mode. It's important to note that this 20~30% is not guaranteed--it depends on how file extents land across the disks, which is probabilistic, not deterministic.
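The extent-level effect is worth making concrete. Under the simplifying (and not btrfs-accurate) assumption that each extent lands independently on one of 4 disks, a file survives one disk failure with probability (3/4) raised to its extent count, so fragmented files die off fast:

```python
import random

# Toy model: each of a file's extents lands independently on one of 4 disks.
# The file is fully readable only if no extent sat on the failed disk 0,
# i.e. with probability (3/4) ** num_extents.
random.seed(2)
NUM_DISKS, TRIALS = 4, 100_000

def survives(num_extents: int) -> bool:
    """True if none of the file's extents landed on the failed disk 0."""
    return all(random.randrange(NUM_DISKS) != 0 for _ in range(num_extents))

for extents in (1, 4, 16):
    rate = sum(survives(extents) for _ in range(TRIALS)) / TRIALS
    print(f"{extents:>2} extents: ~{rate:.0%} of files fully readable")
```

With 1 extent per file you get the naive 75%; at 4 extents it is already down near 32%, and at 16 extents almost nothing survives intact--which matches the 20~30% territory far better than the single-extent intuition.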
> the remaining 75% would be intact (especially if the files are small enough)
If all the files are inlined extents (the default limit is 2048 bytes per file) and you were using the raid1c4 metadata profile, then theoretically you could have 100% of files intact even after losing 3 of 4 disks, regardless of the data profile (which would not be used to store the file data at all)--but you would then have 80 GB of space allocated as "metadata" in btrfs. (I have not tested that scenario, but I think it is likely to be true.) In my test, all of the redundancy for the <2 KB files came from the raid1c3 metadata profile I was actually using, while larger files like the max() 2 GB one were recovered only by the chance that their extents happened to be saved on one or more of the three surviving disks.
> Although I can see that on your latter experiment, you've defragged the btrfs filesystem beforehand
Yes, I think btrfs defrag does not do much differently from how it writes the files initially, but it is still a useful utility in situations where files have been overwritten many times. As I understand it, there are many reasons btrfs will decide to write a file to multiple extents, and there seems to be no option to make it write one file to one disk as much as possible.
> all of your files could be allocated on the first disk only
Maybe a good example would be if I had filled up a disk and then added a new one. btrfs really tries to allocate data fairly, but adding a new disk is a situation where allocation would definitely be skewed toward one disk. I was actually thinking of recreating my filesystem and copying the data over one disk at a time so that the file extents would be written more consolidated on each disk--but even then there would be no mechanism to prevent cross-disk extent writing...
> But on the other hand, if you'd lose one of the other disks, you might not lose any data whatsoever
yep
> even if free space was not fragmented
That would be ideal, but I think it is pretty unlikely in practice. Unless the 1 GB block groups that btrfs allocates per disk are filled immediately and no files are ever appended to or changed, there will be lots of free space within each 1 GB block group for btrfs to find and fill later.
> Unfortunately, ... I can't give you more insight than this
Your comments were helpful and interesting. Hopefully I could share some of my findings as well. I still like btrfs but it certainly acts like a mad chef who is trying to boil 6,827 pots of water to cook spaghetti in this situation.
Multi-disk with the `single` profile is a bit weird. I'm planning on switching my array to individual single-disk btrfs filesystems with `dup` metadata. I will also try MergerFS to group them into one mount, but if MergerFS feels sketchy I'll just interface directly with the individual disks and balance files between them manually.