Yeah, there is definitely some merit to more efficient hashing. Trees with a lot of duplicates require a lot of hashing, and for actual duplicates the entire file has to be hashed regardless of whether partial hashes are done or not.
I have one data set where `dedup` was 40% faster than `dupe-krill` and another where `dupe-krill` was 45% faster than `dedup`.
`dupe-krill` uses blake3, which, last I checked, was not hardware-accelerated on M-series processors. What's interesting is that because of hardware acceleration, `dedup` is mostly CPU-idle, waiting on the hash calculation, while `dupe-krill` is maxing out 3 cores.
Hashing the whole file beyond that point is wasteful. You need to read (and hash) only as much as is needed to demonstrate the uniqueness of a file within the set.
The tree concept can be extended to every byte in the file:
https://github.com/kornelski/dupe-krill?tab=readme-ov-file#n...
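To make that concrete, here's a minimal sketch of the lazy-hashing idea. This is not dupe-krill's actual implementation (the README describes a lazily evaluated tree of blake3 hashes); the chunk size and the use of SHA-256 here are arbitrary stand-ins. Files are first partitioned by size, then each group is repeatedly split by the hash of the next chunk, so a file gets read only as far as needed to prove it unique in the set:

```python
# Sketch of lazy prefix hashing: refine candidate groups one chunk at a
# time, and stop reading a file as soon as its group becomes a singleton.
import hashlib
import os
from collections import defaultdict

CHUNK = 1 << 20  # 1 MiB per refinement step (arbitrary choice)

def find_duplicates(paths):
    # Initial partition: files of different sizes can't be duplicates.
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.getsize(p)].append(p)

    offset = 0
    pending = [g for g in groups.values() if len(g) > 1]
    duplicates = []
    while pending:
        next_pending = []
        for group in pending:
            # Split the group by the hash of the next chunk of each file.
            refined = defaultdict(list)
            exhausted = True  # all files in a group share a size
            for p in group:
                with open(p, "rb") as f:
                    f.seek(offset)
                    chunk = f.read(CHUNK)
                if chunk:
                    exhausted = False
                refined[hashlib.sha256(chunk).digest()].append(p)
            for sub in refined.values():
                if len(sub) < 2:
                    continue  # proven unique: never read further
                if exhausted:
                    duplicates.append(sub)  # fully read and still equal
                else:
                    next_pending.append(sub)
        pending = next_pending
        offset += CHUNK
    return duplicates  # list of groups of identical-content paths
```

With a tree full of unique files that differ early, most files get read for only one chunk; only true duplicates get read (and hashed) all the way to the end, which is where the full-file hashing cost from the first comment is unavoidable.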