OK, so I have a few comments about your experiments:
> I am surprised that mounting worked without error but I guess the device is still active via losetup.
Exactly. `rm` doesn't actually delete the file contents while the file is still open, it just unlinks it from the filesystem tree. So your loopback-mounted disk is still there and all its contents are still available through /dev/loopX.
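The same unlink semantics are easy to demonstrate in a few lines of Python (a minimal sketch of the general POSIX behavior, not btrfs-specific):

```python
import os
import tempfile

# Deleting (unlinking) a file does not free its contents while a file
# descriptor to it is still open -- this is why losetup keeps the loop
# device usable after the backing file is rm'd.
path = tempfile.mktemp()
with open(path, "w") as f:
    f.write("disk image contents")

fd = open(path)      # keep the file open, like losetup does
os.unlink(path)      # "rm" the backing file

assert not os.path.exists(path)             # gone from the filesystem tree...
assert fd.read() == "disk image contents"   # ...but still fully readable
fd.close()
```

The blocks are only actually freed once the last open descriptor is closed, i.e. once you `losetup -d` the device.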
> I'm assuming this would be similar to an actual disk failure though, if the device weren't there maybe btrfs will complain and ask to be mounted with the `-o degraded` flag.
If the /dev/loopX device wasn't there then it would be similar to a complete disk failure, yes.
> In this test about 26% of data is still fully readable
It's true that only 26% of data is still fully readable if you account only for files that are fully intact. But also note that about 78% of files were still completely intact.
This is not clear from your comment, but I'm assuming that you are using 4 devices for the btrfs pool as well?
In this scenario, with such a disk configuration and subsequent disk failure you would expect to lose about 25% of files, while the remaining 75% would be intact (especially if the files are small enough)...
In reality, though, things can turn out quite a bit better or quite a bit worse, depending on a few factors.
For example:
1. If the free space was fragmented. In that case, a significant percentage of files might actually be allocated on more than one disk, so you'd lose more files than expected if a single disk fails. Although I can see that in your later experiment you defragged the btrfs filesystem beforehand, so perhaps this is not the main issue.
2. Depending on how btrfs allocates data, if the files do not completely fill all of the disks, the data can be heavily skewed towards a subset of the disks.
For example, imagine that each of your disks is 1 TB and your files total less than 1 TB.
In this case, all of your files could be allocated on the first disk only, so losing this disk could mean losing 100% of your data.
Or, if your files total less than 2 TB, they might all be allocated on the first 2 disks only, so losing one of those disks would lose a lot more files than you'd expect if files were evenly distributed across all disks.
But on the other hand, if you lost one of the other disks, you might not lose any data whatsoever.
3. Depending on how large files are and how much free space there is on each disk, btrfs might be forced to (or might choose to) span a file across more than 1 disk even on the 'single' profile, even if free space was not fragmented.
4. But more generally, of course, how many files you would lose depends on how btrfs allocates disk space across the disks for each file.
These allocation algorithms can be quite a bit more complex than a naive allocator, mostly for performance reasons.
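The naive 25%-loss intuition, and how badly a skewed allocator can break it, is easy to simulate. This is a toy model, not the real btrfs allocator; the "fill-first" policy is hypothetical:

```python
import random

# Toy model: 4 disks, 100k files, each file allocated wholly to one disk.
# Compare an even (random) allocator against a hypothetical fill-first
# allocator where everything fits on disk 0.
random.seed(0)
NUM_DISKS, NUM_FILES = 4, 100_000

# Even spread: each file goes wholly to a random disk.
even = [random.randrange(NUM_DISKS) for _ in range(NUM_FILES)]
lost_even = sum(d == 0 for d in even) / NUM_FILES

# Fill-first: all files fit on disk 0, so disk 0 holds everything.
lost_fill = 1.0

print(f"even spread: lose ~{lost_even:.0%} of files if disk 0 fails")
print(f"fill-first:  lose {lost_fill:.0%} of files if disk 0 fails")
```

Under the even spread you lose roughly a quarter of the files; under fill-first you lose all of them (or none, if a different disk fails).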
Unfortunately, I know exactly nothing about how btrfs allocates data, so I can't give you more insight than this, sorry!
> 26% of data is still fully readable -- But also note that about 78% of files were still completely intact.
Do you mean partially intact? I did not count that data. Ah, I think I understand what you're saying: 78% of files are fully readable (by number of files), but most of those are small files, which are stored within btrfs metadata ("inlined" extents).
I computed 26% using the quantity of data (i.e. the sum of file sizes), which I feel is a more accurate representation of what is readable. Sure, btrfs `single` mode will leave many files partially readable--if that is an acceptable failure state then I would recommend it.
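The gap between the two metrics is just a consequence of a small-file-heavy distribution; here is a toy illustration with made-up numbers (not my actual test data):

```python
import random

# Why "78% readable by file count" and "26% readable by data size" can both
# be true: most files are tiny (inlined in metadata, so they survive), while
# most of the *bytes* live in a few large files that mostly don't.
random.seed(1)
SMALL, LARGE = 2_000, 500_000_000  # bytes; made-up sizes for illustration
files = [random.choice([SMALL, SMALL, SMALL, LARGE]) for _ in range(10_000)]

# Suppose inlined files (<= 2048 bytes) all survive; large files survive
# only ~25% of the time.
survived = [s for s in files if s <= 2048 or random.random() < 0.25]

by_count = len(survived) / len(files)
by_bytes = sum(survived) / sum(files)
print(f"readable by file count: {by_count:.0%}")
print(f"readable by data size:  {by_bytes:.0%}")
```

By count the survival rate looks high (around 80%), while by bytes it sits near 25%, because the byte total is dominated by the large files.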
I tried the same experiment with a raid0 btrfs config and only the inlined extents were fully readable--less than 1 MB recovered from 80 GB of data.
> you would expect to lose about 25% of files, while the remaining 75% would be intact
That's what I expected from btrfs for the last 9 months, and people online were saying that `single` mode has the same data guarantees as `raid0` mode--which is kind of true but also kind of not, as we can see. It's possible but not likely: in a highly fragmented filesystem the spread of data in `single` could take on a similar shape to `raid0`, and in that case you could only easily recover about the same amount of data (almost none).
What happens in practice is that btrfs allocates 1 GB block groups one drive at a time but, in a multi-disk setup, writes file extents to multiple disks at a time. So at the file level there are no guarantees about one file being on one disk. This is why I was only able to read 20~30% of data rather than the 75% you and I both naively expected from btrfs single mode. It's important to note that this 20~30% is not guaranteed--it depends on how file extents land across the disks, which is probabilistic, not deterministic.
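The extent-level effect is worth making concrete. Under the simplifying (and not btrfs-accurate) assumption that each extent lands independently on one of 4 disks, a file survives one disk failure with probability (3/4) raised to its extent count, so fragmented files die off fast:

```python
import random

# Toy model: each of a file's extents lands independently on one of 4 disks.
# The file is fully readable only if no extent sat on the failed disk 0,
# i.e. with probability (3/4) ** num_extents.
random.seed(2)
NUM_DISKS, TRIALS = 4, 100_000

def survives(num_extents: int) -> bool:
    """True if none of the file's extents landed on the failed disk 0."""
    return all(random.randrange(NUM_DISKS) != 0 for _ in range(num_extents))

for extents in (1, 4, 16):
    rate = sum(survives(extents) for _ in range(TRIALS)) / TRIALS
    print(f"{extents:>2} extents: ~{rate:.0%} of files fully readable")
```

With 1 extent per file you get the naive 75%; at 4 extents it is already down near 32%, and at 16 extents almost nothing survives intact--which matches the 20~30% territory far better than the single-extent intuition.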
> the remaining 75% would be intact (especially if the files are small enough)
If all the files are inlined extents (the default limit is 2048 bytes per file) and you were using the raid1c4 metadata profile, then theoretically you could have 100% of files intact even after losing 3 of 4 disks, regardless of the data profile (which would not be used to store the file data at all)--but you would then have 80 GB of space allocated as "metadata" in btrfs. (I have not tested that scenario, but I think it is likely to be true.) In my test, all of the redundancy for the <2 KB files came from the raid1c3 metadata profile I was actually using, while larger files like the max() 2 GB one were recovered only by the chance that their extents happened to be saved on one or more of the three surviving disks.
> Although I can see that on your latter experiment, you've defragged the btrfs filesystem beforehand
Yes, I think btrfs defrag does not do much differently from how it writes the files initially, but it is still a useful utility in situations where files have been overwritten many times. As I understand it, there are many reasons btrfs will decide to write a file to multiple extents, and there seems to be no option to make it write one file to one disk as much as possible.
> all of your files could be allocated on the first disk only
Maybe a good example would be if I had filled up a disk and then added a new one. btrfs really tries to allocate data fairly, but adding a new disk is a situation where allocation would definitely be skewed toward one disk. I was actually thinking of recreating my filesystem and copying the data over one disk at a time so that the file extents would be written more consolidated on each disk--but even then there would be no mechanism to prevent cross-disk extent writing...
> But on the other hand, if you'd lose one of the other disks, you might not lose any data whatsoever
yep
> even if free space was not fragmented
That would be ideal, but I think it is pretty unlikely in practice. Unless the 1 GB block groups that btrfs allocates per disk are filled immediately and no files are ever appended to or changed, there will be lots of free space within each 1 GB block group for btrfs to find and fill later.
> Unfortunately, ... I can't give you more insight than this
Your comments were helpful and interesting. Hopefully I could share some of my findings as well. I still like btrfs but it certainly acts like a mad chef who is trying to boil 6,827 pots of water to cook spaghetti in this situation.
Multi-disk with the `single` profile is a bit weird. I'm planning on switching my array to individual single-disk btrfs filesystems with `dup` metadata. I will also try MergerFS to group them into one mount, but if MergerFS feels sketchy I'll just interface directly with the individual disks and balance files between them manually.