By random chance I ended up in the git internals doc^1 today, also lovely refere...

js2 · on Oct 7, 2023

Unpack the files (git-unpack). Maybe it was one large file that someone added, then deleted in a later commit. You'd have to rewrite history to get rid of it entirely. Alternately it might be a bunch of medium sized files that were added and removed. It may take a little while to track down, but I'd start by unpacking.

This stack-overflow looks like it contains a reasonable description about how to rewrite history to remove objects:

https://stackoverflow.com/questions/11050265/remove-large-pa...

It might be easier to declare repo bankruptcy. Seed a new repo from the existing repo's source files. Have the commit message point to the old repo. Stop using the old repo. Yes, you lose history and folks trying to perform repo archeology will have to jump to the old repo.

But rewriting history to remove large files can be equally as awful since references to git commit IDs tend to end up in places you don't expect and when you rewrite history, you change the commit IDs.

Good luck.

charles_f · on Oct 7, 2023

Thanks! Yeah I plan to get to the bottom of it. I will probably propose to just keep a branch with full history somewhere (we need to keep history for auditability) and reset the main branch from a recent state.

masklinn · on Oct 7, 2023

Have you already tried a "gc --aggressive"? It's not exactly fast or cheap, but some repositories are very badly packed and only a full reset will fix them.

An other useful high-level option is git-sizer (https://github.com/github/git-sizer) which tries to expose a few useful trouble spots, there's not much that can be done if the repository is just big (long history of wide working copies with lots of changes), but sometimes it's just that there are a bunch of large binary assets.

This may be more likely if the repository was converted from a centralised VCS where storing large assets or files is less of an issue, likewise the bad compression. Though obviously removing such large assets from the core repository still requires rewriting the entire thing.

js2 · on Oct 7, 2023

That won't shrink the repo. Any reference will keep all the objects alive and they all get packed together. If you only care about reducing clone size see this post:

https://github.blog/2020-12-21-get-up-to-speed-with-partial-...

To be clear, I was not suggesting deleting the old repo. Keep it for historical purposes, whether you rewrite or start fresh.

acemarke · on Oct 7, 2023

If it helps, I wrote a very long and detailed blog post several years ago about the techniques I used to rewrite my team's Git repo history (including stripping out junk files, _and_ actually rewriting source file contents via formatting and codemods for _old_ commits):

https://blog.isquaredsoftware.com/2018/11/git-js-history-rew...

I specifically was looking for techniques that would let me quickly iterate over ~15000 commits.

granted, the repo size I was working with was only a few GB, but hopefully there's some pieces there you can find useful.

tharos47 · on Oct 7, 2023

I can recommend git-filter-repo instead, it's relatively recent and there is a lot of outdated info on the internet about cleaning git repos. The --analyse flag will generate a report about files in your repo even if they were deleted. I used it to cleanup a number of repo and it helped in detecting large files commited by mistake 10 years ago. The history rewrite removed the files and we didn't need to create a new repo (old history still works fine).

js2 · on Oct 7, 2023

This looks like a great tool. I'm not sure if I haven't come across it before or I'd forgotten about it.

In my experience you'll have references to the commits in a repo from outside of the repo: links from Slack, Jira, other repos, etc to specific commit IDs. When you rewrite history, all of the commit IDs change. That's why I recommend archiving the original repo so as not to break any such references. Create the new repo, either rewritten or seeded from the old, in a new location.

It would be neat if git supported a "rewite map" to allow it to redirect from one revision to another, sort of like how `git blame` can be configured to ignore revisions.

MarkSweep · on Oct 7, 2023

RE large pack files: you can remove unused objects with these commands:

git repack -AFd

git prune --expire now

Also related, the initial git clone from a TFS server (as of 2015) can include every object ever pushed to the server, even if it is on no current branch. So the above commands might save significant space locally. I’m not if newer versions of TFS and DevOps improved this behavior.

schacon · on Oct 8, 2023

> It reads like a good friend is trying to explain you something.

As the author of this document, I wanted to let you know that it made me happy to read this. Thank you for the kind words. :)

kaycebasques · on Oct 7, 2023

> I wish all doc was written with such explicit care to be understood. It reads like a good friend is trying to explain you something.

Thanks for this insight. As a technical writer this is a helpful phrase for providing guidelines on how to write docs.

extraduder_ire · on Oct 9, 2023

> I wish all doc was written with such explicit care to be understood. It reads like a good friend is trying to explain you something.

I think this is due to git's early history and the reputation it had for being incomprehensible and difficult to use. Lots and lots of work has been done by many people to make it more developer/user friendly. It really helped that its feature-set made all of this work appealing. e.g. learning git-blame and git-bisect made me want to use git for all of my projects, even if it takes time to explain how to use it.

glandium · on Oct 7, 2023

Check the git verify-pack subcommand, particularly the -s and -v flags.

Forge36 · on Oct 7, 2023

I'm doing analysis with git-filter-repo --analyze

I've found 1gb files in our repository (thankfully a work in progress so we're able to remove it before it goes to main).

It lists everything by size.

conceptme · on Oct 7, 2023

You could try bfg https://rtyley.github.io/bfg-repo-cleaner/

beezlewax · on Oct 7, 2023

Do you have large image files, videos or other file formats that aren't plain text only that might cause git to store weird diffs/duplicates when you change them?