peterwaller-arm's comments | Hacker News

Author here. This is correct: we set out to do binary diffing, but we soon discovered that if you put similar enough object files together in a stream and then compress the stream, zstandard does a fantastic job of compressing and decompressing quickly with a high compression ratio. The existing binary diffing tools can produce small patches, but they are relatively expensive, both to compute the deltas and to apply the patches.
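To illustrate the principle outside elfshaker (a sketch only, not elfshaker's actual on-disk format; the directory names are hypothetical): putting similar inputs into one stream before compressing lets zstd exploit redundancy across files, which compressing each file separately cannot.

    # Compress two near-identical builds separately:
    tar cf v1.tar build-v1/ && zstd -19 v1.tar
    tar cf v2.tar build-v2/ && zstd -19 v2.tar
    # vs. putting the similar object files into one stream first
    # (--long enables long-distance matching for large trees):
    tar cf both.tar build-v1/ build-v2/ && zstd -19 --long both.tar
    # If the builds differ only slightly, both.tar.zst ends up barely
    # larger than v1.tar.zst alone.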


Author here, I'd like to see such a comparison too, actually, but I'm not in a position to do the work at the moment. We did some preliminary experiments at the beginning, but a lot changed over the course of the project and I don't know how well elfshaker ultimately fares against all the options out there. Some basic tests against git found that git is quite a bit slower (10s vs 100ms) during 'git add' and 'git checkout'. Maybe that can be fixed with some tuning or by finding appropriate options.
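For anyone who wants to reproduce that rough comparison, the shape of the test was something like the following (paths are hypothetical; check `elfshaker --help` for the exact subcommand syntax):

    cd build-tree                     # a directory full of object files
    time git add -A                   # on the order of 10s in our tests
    time elfshaker store snapshot-1   # on the order of 100ms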


It would be interesting to compare to gitoxide tweaked to use zstd compression for packs.


Nice, I bet dwarfs would do well at our use case too. Thanks for sharing.


Author here. Compressed data is unlikely to work well in general, unless it never changes.
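A quick way to see why (a demo with hypothetical files, assuming gzip and zstd are installed): a small change to the uncompressed input typically scrambles the compressed byte stream from that point onward, so two compressed revisions share very little that a downstream compressor can find.

    seq 1 100000 > a.txt
    sed 's/^5$/CHANGED/' a.txt > b.txt   # one small edit, near the start
    gzip -k a.txt b.txt
    # The uncompressed pair dedupes down to roughly one copy:
    cat a.txt b.txt | zstd -19 | wc -c
    # The gzipped pair diverges after the edit, so it stays near two copies:
    cat a.txt.gz b.txt.gz | zstd -19 | wc -c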


Author here. No architecture specific processing currently. Most of the magic happens in zstandard (hat tip to this amazing project).

Please see our new applicability section which explains the result in a bit more detail:

https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...

In manyclangs (which uses elfshaker for storage) we arrange for the object code to have stable addresses under insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!


Thanks, great project!

In this comment, you say "20% compression is pretty good". AFAIK, "X% compression" usually measures the reduction in size, not what remains. By that reading, 0.01% compression sounds almost useless, very different from the 10,000x written next to it: a 10,000x ratio leaves 0.01% of the original size, i.e. a 99.99% reduction.


(Disclosure: I work for Arm, opinions are my own)

Author here. elfshaker itself does not, to our knowledge, depend on any particular architecture. We support the architectures we have a use for; contributions to add missing support are welcome.

manyclangs provides binary pack files for aarch64 because that's what we have immediate use for. If elfshaker and manyclangs prove useful to people, I would love to see resources invested in making them more widely useful.

You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case; a rough invocation is sketched below the link.

[0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...
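As a sketch of the qemu-user route (the sysroot path varies by distro; this assumes the qemu-user package and an aarch64 loader/libraries are installed, and the binary path is hypothetical):

    # Run an aarch64 clang binary on an x86-64 host via user-mode emulation.
    # -L points qemu at a sysroot containing the aarch64 ld.so and libc.
    qemu-aarch64 -L /usr/aarch64-linux-gnu ./bin/clang --version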


Author here. The executables shipped in manyclangs are release builds! The catch is that manyclangs stores object files pre-link. Executables are materialized by relinking after they are extracted with elfshaker.

The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).
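Concretely, the builds look roughly like this (a simplified sketch, not the exact manyclangs invocation; pairing with --gc-sections at link time is my assumption of the usual companion flag, not something stated above):

    # Each function/global gets its own ELF section, so editing one function
    # perturbs only its own section instead of shifting addresses everywhere:
    clang -ffunction-sections -fdata-sections -c foo.c -o foo.o
    # At relink time the linker can also drop unreferenced sections:
    clang foo.o bar.o -Wl,--gc-sections -o prog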

As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.


Thank you for the explanation! So the pre-link storage is one of the magic ingredients; maybe mention this in the README as well?

Is this the reason why manyclangs (using LLVM's CMake-based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency deduced automatically?


> maybe mention this as well in the README?

We've tweaked the README; I hope it's clearer.

It would be great to provide this for gcc too. The project is new and we've just started out. I know less about gcc's build system and how hard it will be to apply these techniques there. It seems as though it should be possible though and I'd love to see it happen.

To infer the object->executable dependencies we currently read the compilation database and produce a stand-alone link.sh shell script, which gets packaged into each manyclangs snapshot.
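As a minimal sketch of that inference (illustrative only; the real link.sh generation is more involved, and the "output" field is optional in compile_commands.json):

    # List the object files recorded in the compilation database; these
    # become the link inputs captured in the generated link.sh.
    jq -r '.[] | .output // empty' compile_commands.json
    # After `elfshaker extract`, the snapshot's link.sh relinks an
    # executable from the stored objects (hypothetical invocation):
    ./link.sh clang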


Ah, the compilation database is where more magic originates from :)


Yes, this is less great than I would like! :( :)


Thanks. I had a use case in mind where LTO is enabled. Unfortunately the LTO step is quite expensive so relinking does not seem like a viable option. If I find some time I'll give it a try though.


ThinLTO can be pretty quick if you have enough cores, so it might work. I'm not sure how well the LTO objects compress against each other when you have small changes to them, but it might work reasonably well.
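For reference, the ThinLTO backend parallelizes across cores at link time, which is what keeps relinking tolerable; a sketch assuming clang with the lld linker:

    # Compile with ThinLTO; per-module summaries land in the .o files:
    clang -flto=thin -O2 -c foo.c -o foo.o
    clang -flto=thin -O2 -c bar.c -o bar.o
    # Link with lld, running e.g. 8 parallel ThinLTO backend jobs:
    clang -flto=thin -fuse-ld=lld -Wl,--thinlto-jobs=8 foo.o bar.o -o prog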

manyclangs is optimized to provide you with a binary quickly. The binary is not necessarily itself optimized to be fast, because it's expected that a developer might want to access any version of it for the purpose of testing whether some input manifests a bug or has a particular codegen output. In that scenario, it's likely that the developer can reduce the size of the input such that the speed of the compiler itself is not terribly significant in the overall runtime.

Therefore, I don't see LTO as such a significant win for manyclangs. We still hope the overall end-to-end runtime is good, and the binaries are optimized, just not with LTO.


Author here. I've used bup, and elfshaker was partially inspired by it! It's great. However, during initial experiments on this project I found bup to be slow, taking quite a long time to snapshot and extract. I think this could in principle be fixed in bup one day.


I have also used bup for a long time, but found that I hit performance problems (in both time and memory usage) on very large server backups.

I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but it is younger and thus lacks some features.

Here is somebody's benchmark of bupstash (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/

The `bupstash` author is super responsive on Gitter/Matrix, it may make sense to join there to discuss approaches/findings together.

I would really like to eventually have deduplication-as-a-library, to make it easier to build into programs like nix, or other software, e.g. for versioned "Save" functionality in programs like Blender or Meshlab that work with huge files, where diff-based incremental saving is more difficult and fragile to implement than deduplicating snapshot-based saving.


I used `bupstash` and evaluated it for a while. I am looking to do 5+ offsite backups of a small personal directory to services that offer 5GB of cloud space for free.

`bupstash` lacked good compression. I settled on `borg` because I could use `zstd` compression with it. I'm currently at 60 snapshots of the directory, and the `borg` repo directory is at ~1.52GB out of the 5GB quota; the source directory is ~12.19GB uncompressed. Very happy with `borg` + `zstd` and how they handle my scenario.
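For anyone wanting to replicate this setup, the relevant commands are roughly the following (paths are hypothetical; borg accepts zstd levels 1-22):

    # One-time repository setup:
    borg init --encryption=repokey /path/to/repo
    # Each snapshot, deduplicated and compressed with zstd:
    borg create --compression zstd,19 /path/to/repo::snap-{now} ~/personal-dir
    borg list /path/to/repo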

I liked `bupstash` a lot, and the author is responsive and friendly. But I won't be giving it another try until it implements much more aggressive compression than it offers now. It's a shame, as I really wanted to use it.

I do recognize that for many other scenarios `bupstash` is very solid though.


Borg has been working great for me with zstd.


Thank you for the good description on the project! Sometimes links from HN lead to a page that takes a few minutes of puzzling to figure out what is going on, but not yours.


Is elfshaker any good for backing up non-text data?


Author here, I agree with xdfgh1112, please take care before using brand new software to store your backups!


Yes, any time I use something new or different (or both) for something as essential as backups, I take great and deliberate care... and test, test, test... well before standardizing on it. ;-)


Author here, this software is young, please don't use it for backups!

But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.


In this age of rampant puffery, it's so... soothing to see somebody be positive and frank about the limits of their creation. Thanks for this and all your comments here!


<3

