What is in that .git directory? (meain.io)
269 points by Ivoah on Oct 7, 2023 | 41 comments


By random chance I ended up in the git internals doc^1 today, also lovingly referred to as plumbing and porcelain. It's a fantastic read, very well explained. I wish all documentation were written with such explicit care to be understood. It reads like a good friend trying to explain something to you.

What got me into that was a 51 GB ".pack" file that I wanted to understand. If you're wondering, those are pack files, and they're what that "delta compression" message when you push is about^2. The 51 GB file I don't have an explanation for yet; I'm guessing something terrible happened before I joined, and people haven't found the courage to forgo the history just yet. But at least I got an entertaining read out of it.

^1: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Po...

^2: https://git-scm.com/book/en/v2/Git-Internals-Packfiles
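For reference, the pack files live under `.git/objects/pack`, so checking their sizes is a one-liner:

    # list pack files and their sizes
    ls -lh .git/objects/pack/*.pack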


Unpack the files (git unpack-objects). Maybe it was one large file that someone added, then deleted in a later commit; you'd have to rewrite history to get rid of it entirely. Alternately, it might be a bunch of medium-sized files that were added and removed. It may take a little while to track down, but I'd start by unpacking.
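A minimal sketch of that unpacking step (the pack name is a placeholder; move the pack out of the repo first, since `git unpack-objects` skips objects the repository already has):

    # move the pack (and its index) out of the repo, then unpack into loose objects
    mv .git/objects/pack/pack-<hash>.* /tmp/
    git unpack-objects < /tmp/pack-<hash>.pack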

This Stack Overflow question looks like it contains a reasonable description of how to rewrite history to remove objects:

https://stackoverflow.com/questions/11050265/remove-large-pa...

It might be easier to declare repo bankruptcy. Seed a new repo from the existing repo's source files. Have the commit message point to the old repo. Stop using the old repo. Yes, you lose history and folks trying to perform repo archeology will have to jump to the old repo.
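A hypothetical sketch of declaring that bankruptcy (paths and the commit id are placeholders):

    # seed a fresh repo from the old repo's current files
    cp -r old-repo new-repo && rm -rf new-repo/.git
    cd new-repo && git init
    git add .
    git commit -m "Seeded from old-repo at <commit-id>; full history lives there"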

But rewriting history to remove large files can be equally awful, since references to git commit IDs tend to end up in places you don't expect, and when you rewrite history, you change the commit IDs.

Good luck.


Thanks! Yeah I plan to get to the bottom of it. I will probably propose to just keep a branch with full history somewhere (we need to keep history for auditability) and reset the main branch from a recent state.
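Hypothetically, something like this (branch names are placeholders):

    git branch archive/full-history main   # keep the full history around
    git checkout --orphan fresh-main       # restart from the current tree, no parents
    git commit -m "Reset history; see archive/full-history"
    git branch -M fresh-main main          # make the fresh branch the new main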


Have you already tried a "gc --aggressive"? It's not exactly fast or cheap, but some repositories are very badly packed and only a full repack will fix them.
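For reference, the exact invocation I mean (--prune=now additionally drops unreachable objects immediately):

    git gc --aggressive --prune=now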

Another useful high-level option is git-sizer (https://github.com/github/git-sizer), which tries to expose a few common trouble spots. There's not much that can be done if the repository is just big (a long history of wide working copies with lots of changes), but sometimes it's just that there are a bunch of large binary assets.

This may be more likely if the repository was converted from a centralised VCS, where storing large assets or files is less of an issue; likewise the bad compression. Though obviously, removing such large assets from the core repository still requires rewriting the entire thing.


That won't shrink the repo. Any reference will keep all the objects alive and they all get packed together. If you only care about reducing clone size see this post:

https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
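If I remember the post right, the blobless variant looks like this (the URL is a placeholder):

    # fetch commits and trees up front; blobs are downloaded on demand
    git clone --filter=blob:none https://example.com/big-repo.git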

To be clear, I was not suggesting deleting the old repo. Keep it for historical purposes, whether you rewrite or start fresh.


If it helps, I wrote a very long and detailed blog post several years ago about the techniques I used to rewrite my team's Git repo history (including stripping out junk files, _and_ actually rewriting source file contents via formatting and codemods for _old_ commits):

https://blog.isquaredsoftware.com/2018/11/git-js-history-rew...

I was specifically looking for techniques that would let me quickly iterate over ~15,000 commits.

Granted, the repo I was working with was only a few GB, but hopefully there are some pieces there you can find useful.


I can recommend git-filter-repo instead; it's relatively recent, and there is a lot of outdated info on the internet about cleaning git repos. The --analyze flag will generate a report about files in your repo even if they were deleted. I used it to clean up a number of repos, and it helped in detecting large files committed by mistake 10 years ago. The history rewrite removed the files and we didn't need to create a new repo (old history still works fine).
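A minimal sketch of that workflow (the file name is a placeholder):

    git filter-repo --analyze                           # writes a report under .git/filter-repo/analysis
    git filter-repo --path big-file.bin --invert-paths  # rewrite history without that file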


This looks like a great tool. I'm not sure whether I haven't come across it before or had just forgotten about it.

In my experience you'll have references to the commits in a repo from outside of the repo: links from Slack, Jira, other repos, etc. to specific commit IDs. When you rewrite history, all of the commit IDs change. That's why I recommend archiving the original repo, so as not to break any such references. Create the new repo, either rewritten or seeded from the old, in a new location.

It would be neat if git supported a "rewrite map" to allow it to redirect from one revision to another, sort of like how `git blame` can be configured to ignore revisions.
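For reference, that blame analogue is configured like this (the file name is just the common convention):

    # commits listed in this file (one hash per line) are skipped by git blame
    git config blame.ignoreRevsFile .git-blame-ignore-revs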


RE large pack files: you can remove unused objects with these commands:

    git repack -AFd          # repack all; -A keeps unreachable objects loose, -d drops redundant packs
    git prune --expire now   # then delete those loose, unreachable objects immediately

Also related: the initial git clone from a TFS server (as of 2015) can include every object ever pushed to the server, even if it is on no current branch, so the above commands might save significant space locally. I'm not sure if newer versions of TFS and DevOps have improved this behavior.


> It reads like a good friend trying to explain something to you.

As the author of this document, I wanted to let you know that it made me happy to read this. Thank you for the kind words. :)


> I wish all documentation were written with such explicit care to be understood. It reads like a good friend trying to explain something to you.

Thanks for this insight. As a technical writer, I find this a helpful phrase for providing guidelines on how to write docs.


> I wish all documentation were written with such explicit care to be understood. It reads like a good friend trying to explain something to you.

I think this is due to git's early history and the reputation it had for being incomprehensible and difficult to use. Lots and lots of work has been done by many people to make it more developer- and user-friendly. It really helped that git's feature set made all of this work appealing; e.g., learning git-blame and git-bisect made me want to use git for all of my projects, even if it takes time to explain how to use them.


Check the git verify-pack subcommand, particularly the -s and -v flags.
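e.g., a common recipe for listing the largest objects in a pack (the third column of -v output is the object size):

    git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n | tail -10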


I'm doing analysis with git-filter-repo --analyze

I've found 1 GB files in our repository (thankfully a work in progress, so we're able to remove them before they reach main).

It lists everything by size.



Do you have large image files, videos, or other formats that aren't plain text that might cause git to store weird diffs/duplicates when you change them?


If you'd like a more in-depth treatment of the topic, let me suggest chapter 10 of the git book:

https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Po...

> But what gets sent to the other git repo? It is everything that is in objects and under refs.

Not everything under refs. Just the refs that you push. What gets pushed depends on how you configure git, what arguments you provide to `git push`, and how the refspecs are configured for the remote under `.git/config`:

https://git-scm.com/book/en/v2/Git-Internals-The-Refspec

e.g., I regularly use `git push origin +HEAD:develop` to force push the checked out branch to a destination branch named `develop`.
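For reference, you can inspect the refspecs configured for a remote directly:

    # print the fetch refspec(s) for origin
    git config --get-all remote.origin.fetch
    # typical default: +refs/heads/*:refs/remotes/origin/*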

A couple additional points not mentioned:

There are also tag objects. You create these with `git tag -a`. These are also called annotated tags. They carry their own message and point to a commit. Without `-a` you create a so-called lightweight tag which is just an entry under `refs/tags` pointing directly to a commit (as opposed to pointing to a tag object).

https://git-scm.com/docs/git-tag
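A quick way to see the difference (tag names are examples):

    git tag -a v1.0 -m "release"   # annotated: writes a tag object
    git tag v1.0-lw                # lightweight: just a ref to a commit
    git cat-file -t v1.0           # -> tag
    git cat-file -t v1.0-lw        # -> commit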

All those loose objects get packed up into pack files periodically to save space and improve git's speed. You can manually run `git gc` but git will do so for you automatically every so many commits. You'll find the pack files under `.git/objects/pack`:

https://git-scm.com/book/en/v2/Git-Internals-Packfiles
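You can watch the packing happen (a minimal sketch):

    git count-objects -v     # loose and packed object counts
    git gc                   # pack up the loose objects
    ls .git/objects/pack/    # pack-*.pack files with matching .idx indexes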


> Not everything under refs. Just the refs that you push.

Ahh, thanks. I overlooked that detail. I've fixed it now. :D


Nice post, thanks for sharing! I found that another way to learn about Git internals is following a step-by-step re-implementation of Git. It was a really cool and efficient way for me to understand what's in the .git directory.

See for example the ugit [1] "build Git from scratch in Python" series for that.

[1] https://www.leshenko.net/p/ugit/


It's fairly easy to grab info from .git for your own purposes. For example, the program that generates my PS1 peeks there (without wasting precious cycles on shelling out to the git command) to find the current branch we're on:

https://github.com/rollcat/etc/blob/b2fd739/cmd/prompter/mai...
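For illustration, a minimal POSIX-sh sketch of the same idea (the linked program is Go; this handles only the two common shapes of HEAD):

    head=$(cat .git/HEAD)
    if [ "${head#ref: refs/heads/}" != "$head" ]; then
        echo "${head#ref: refs/heads/}"   # on a branch
    else
        echo "$head" | cut -c1-7          # detached HEAD: short hash
    fi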


What's with the random bit flips in pieces that look like they would have been copied from the shell (i.e. likely not typos)?

objects/4c -> objects/5c

2023-07-02 -> 2024-07-02


That is a typo, lemme go fix that :D


just a random comment:

the

    .git/info/exclude
file acts as a personal, private .gitignore you don't have to commit.
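e.g., one appended line is all it takes:

    # ignored in this clone only; never committed, never shared
    echo 'scratch/' >> .git/info/exclude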


Nice! I didn't know about this. It feels like it's in an odd place. This is the first I've ever heard of it, so it must be that not many use it (or admit to using it).


A more suggestive name like “private-ignore” would help.


While I always hate the cluttering of "top" directories, I'd think something like `.gitignore.local` at the top level would be much better than where it's hidden.


Yeah, absolutely. When it is buried inside the repository itself, most people won't find out about it, even with a good name.


The way I learn git internals is through experimenting: execute a git command, then watch the file changes happen in the .git directory. It's pretty fun. I actually wrote a simple CLI util to watch the changes: https://github.com/wong2/meowatch


Great post. Git becomes much less mysterious once you know how it works internally.


I really don't like the "f" in that font. Very jarring.


Yup. As far as modern letterform conventions go, it's just plain wrong.

It's incredibly distracting and I can't imagine why anyone would ever choose to use it for code.

If you're doing some kind of cool alternative graphic design poster, then by all means go nuts! That's precisely where it's fun to play with different forms and be as "wrong" as you want.

But for something like code where legibility is the primary concern, it's a very unfortunate choice.

Our brain recognizes words not just by individual letters but by the shape of the entire word, and inserting a descender where we're not accustomed to one breaks our word-level recognition. It's not a neutral aesthetic choice; it makes text objectively harder to read in a modern context.


Haha, I have heard that from a lot of people. I actually really like that `f` for some reason.


There is a trend toward all these hidden dot folders and files from apps; VS Code is another example. Personally, I do not like this. Couldn't there be another way to handle these config files?


If you don't like .git directories, you can create a bare repo. That puts all of git's internal stuff at the top level and makes it visible. But then you have to set up your working directory somewhere else.

https://stackoverflow.com/questions/7632454/how-do-you-use-g...
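A sketch of that setup (paths are examples; it's the same idea as the classic bare-repo dotfiles trick):

    git init --bare ~/src/repo.git    # git's internals sit visibly at top level
    git --git-dir="$HOME/src/repo.git" --work-tree="$HOME/src/checkout" status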


On the same topic, I usually refer back to this fantastic talk on how to add and commit a file without using git add or git commit: https://www.youtube.com/watch?v=mdvlu_R8EWE
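For reference, the plumbing flow from the Git Internals chapter boils down to roughly this (file name and messages are examples):

    blob=$(echo 'hello' | git hash-object -w --stdin)         # "add": store a blob...
    git update-index --add --cacheinfo 100644 "$blob" hi.txt  # ...and stage it
    tree=$(git write-tree)                                    # snapshot the index as a tree
    commit=$(echo 'initial' | git commit-tree "$tree")        # wrap the tree in a commit
    git update-ref refs/heads/main "$commit"                  # point the branch at it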




sqlite? god please no

sqlite is great when everything is working as designed, but breaks completely on any badness.

The file-based git approach, on the other hand, is incredibly resilient, and "resilient" is exactly what I want from my version control system.

For example, I sync all my computers, including .git dirs, with unison. And sadly, I am not a perfect human being, so I often generate conflicts (like making different commits on the same branch in the same git checkout and then trying to sync this with a file-based sync tool). And git survives such abuse and just works. It also survives partially deleted files and bad transfers... sometimes you need to dig a bit, but you can recover it.


Did you get that backwards?

sqlite is a proper database that actually tests its resilience.

Just because you can't sync a sqlite file doesn't mean it isn't resilient; it just means you need to back it up by pushing to another repo or using the backup command. Syncing by just copying files over while a disk is still being used is fragile in general.


No, they didn't get it backwards.

Testing is great for resilience, but "content files generally only get added, not modified or deleted" is even better.

Copying files around may be fragile but people want to do it and get lots of value out of it.


Have you seen SQLite's official documentation on corruption resistance? https://www.sqlite.org/howtocorrupt.html

supported failure modes, tested and handled:

"application crash, or an operating-system crash, or even a power failure" - so basically proper atomic renames. git does this well.

unsupported failure modes:

"Backup or restore while a transaction is active" - when you backup your machine, do you really treat each sqlite specially? I know I don't.

"Deleting a hot journal" - or, you know, downloading database file and forgetting to grab journal at the same time

"Multiple links to the same file" - did you ever hardlink or bind-mounted a database file? prepare for corruption...

-----

Don't get me wrong, it takes some skill to implement proper safe file handling, and a random person off the street would be better off with sqlite.

But git specifically took the effort and designed the system so that the database is resilient in all sorts of crazy conditions, and even if not, it's easy to recover. Switching git to sqlite would be all downside, no upside.


>when you backup your machine, do you really treat each sqlite specially?

Yes, but only for servers where the database is being used. On my desktop, if the database isn't being used, it is safe to copy, so I don't worry about it. Backing up a git repo while git is writing to it isn't safe either.

>or, you know, downloading database file and forgetting to grab journal at the same time

You should not be downloading an actively used sqlite database anyway. If you back up the sqlite database before downloading it, there won't be a journal file.

>did you ever hardlink or bind-mounted a database file? prepare for corruption...

You just have to link both the database and the WAL file. This is somewhat challenging, since the WAL file will be deleted by default if all processes close the database. It's better to link or mount the directory that contains the database file. If you link only some of the files from .git, git won't work properly either.


Does that break things like cherry-pick? I'm intrigued.



