It's not just GUI usage that leads to bad understanding. The documentation is in...

edejong · on Oct 27, 2016

First line in the description of the 'git-commit (1)' man-page:

Stores the current contents of the index in a new commit along with a log message from the user describing the changes.

Since most people are searching the man-pages for a certain task (in this case, the task is to record the changes), the title makes sense. As a user, you are kindly requested to read at least the description part of the man-page.

antocv · on Oct 27, 2016

Well, you are wrong; the man page is correct.

It is not a snapshot of a directory, if you want that use btrfs snapshot support, or use tar and gzip or duplicity.

A commit records changes to a repository, which can be changes to only one file rather than a whole directory of files. The changes stored are only the delta from the previous commit, not a "snapshot" in the sense of storing two different versions of your files in full (that's a waste of space). It stores the deltas, the patch, and other metadata such as author and dates.

godd2 · on Oct 27, 2016

A commit is not deltas. It contains a hash of a tree object, which in turn contains a list of hashes of blob and tree objects. The tree hashes are of course of tree objects which are your subdirectories, and the blob hashes are the hashes of your full files. When you make a change to a file, and then add it, git saves the new file as a full copy, and gives it its own sha1 hash.
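To make the "full copy with its own hash" point concrete, here is a minimal Python sketch of how a blob's ID is derived. It assumes a repository using the default SHA-1 object format; the function name `git_blob_id` is just an illustrative name, not anything from git itself.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID git would give a blob with this exact content."""
    # The hash covers a small header ("blob <size>\0") plus the full file bytes,
    # not a diff against any earlier version.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

empty_id = git_blob_id(b"")
hello_id = git_blob_id(b"hello world\n")
print(empty_id)
print(hello_id)
```

Because the ID depends only on the content, changing one byte of a file yields a different blob, while an unchanged file keeps the same ID from commit to commit.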

If you go into `.git/objects` and find the file named after the hash of your most recent commit (the first two hex characters are the directory name, the rest is the filename), you can decompress it (zlib inflate) and the first few characters of the file will be something like "commit 485\0tree 6f3eeb2952a...". This tells us that this object is a commit object of size 485 bytes, and then after the null character is the commit itself. If you then take that tree hash and do the same thing, you'll see a list of blob hashes next to filenames, and tree hashes next to directory names (in a tree object, git stores the hashes as raw bytes rather than as hex text, so if you want to follow the hash list, you'll have to convert the hashes to their hex form to find the appropriate object in the object store).
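The walk described above can be sketched in Python. To keep the example self-contained it builds a tiny tree object in memory instead of reading a real `.git/objects` file; the helper names `parse_loose_object` and `parse_tree_body` are hypothetical, and a SHA-1 repository (20-byte raw IDs) is assumed.

```python
import hashlib
import zlib

def parse_loose_object(compressed: bytes):
    """Inflate a loose object and split it into (type, size, body)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")   # header ends at the first NUL
    obj_type, size = header.split(b" ")
    return obj_type.decode(), int(size), body

def parse_tree_body(body: bytes):
    """Yield (mode, name, hex_id) for each entry of a tree object's body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        raw_id = body[nul + 1:nul + 21]      # 20 raw bytes, not hex text
        yield mode, name, raw_id.hex()       # convert to hex to look it up
        pos = nul + 21

# Stand-in for a file under .git/objects: a tree with one empty file, README.
empty_blob_raw = hashlib.sha1(b"blob 0\0").digest()
tree_body = b"100644 README\0" + empty_blob_raw
loose = zlib.compress(b"tree " + str(len(tree_body)).encode() + b"\0" + tree_body)

obj_type, size, body = parse_loose_object(loose)
entries = list(parse_tree_body(body))
print(obj_type, size, entries)
```

Against a real repository you would pass the bytes of a loose object file to `parse_loose_object` instead of the in-memory `loose` value; objects that have already been moved into a packfile will not be found as loose files.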

You are correct that git uses deltas, but it doesn't use them for commits; it uses them when it recompresses your objects into a packfile (which happens when there are too many loose objects or when you pull and push).

Every commit can reconstruct the state of your project at the time the commit was made. Each commit can do this without the help of any other commit.

antocv · on Oct 27, 2016

Wow, that is informative, did not know. Thanks to you, and others!

godd2 · on Oct 27, 2016

No problem! :)

Once I understood this, things like merging and shallow clones made a lot more sense.

Although it makes rebasing more confusing. What's happening there is that git is creating patches on the fly, and then applying them to the new base, and creating new commits. But the resulting commits are snapshots of what the files would have been had you applied the patches yourself.
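A toy Python model of that idea, under heavy simplifying assumptions: snapshots are plain dicts mapping path to content, and the "patch" is computed per whole file rather than per line as git's real machinery does. The names `diff` and `apply_patch` are made up for this sketch.

```python
def diff(old: dict, new: dict) -> dict:
    """File-level patch: path -> new content, or None if the file was deleted."""
    patch = {}
    for path in old.keys() | new.keys():
        if old.get(path) != new.get(path):
            patch[path] = new.get(path)
    return patch

def apply_patch(base: dict, patch: dict) -> dict:
    """Return a NEW full snapshot; the base snapshot is left untouched."""
    result = dict(base)
    for path, content in patch.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result

old_base = {"a.txt": "1"}
feature = {"a.txt": "1", "b.txt": "new"}        # snapshot committed on the branch
new_base = {"a.txt": "1", "c.txt": "upstream"}  # snapshot we rebase onto

patch = diff(old_base, feature)         # computed on the fly, never stored
rebased = apply_patch(new_base, patch)  # the rebased commit is again a snapshot
print(patch, rebased)
```

The patch exists only for the duration of the operation; what ends up stored is the complete `rebased` snapshot.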

Of course, there can still be "merge conflicts" since both branches might make changes to the same place in the same file. But since everything is a snapshot, if you have no pending changes in your working directory, hopping around the commit history is a safe action, so long as you have a branch pointing to where you left off.

adrianN · on Oct 27, 2016

Consider reading the article.

godd2 · on Oct 27, 2016

I don't think it's fair to assume they didn't read the article. It's difficult to know when an explanation is succinct vs analogous, especially if you have a working model of knowledge. "A is just B" can mean more than one thing.

derefr · on Oct 27, 2016

Nah, semantically, the parent commenter is closer to the truth.

There are SCMs that literally store commits as patches or diffs, and then, when you want to check out anything other than HEAD, they have to run history backward by applying those patches, in series, to move through time.

A git commit, on the other hand, is more like a handle to a pure-functional tree data structure that happens to share some of its pointers with earlier versions of the same tree. Each commit is still the whole tree—not a diff; any marginal commit just happens to not take up much space, because a lot of the objects within it are objects that were already entered into the pool in previous commits.

You can easily see that this is true by cloning a large repo, using git's `git checkout --orphan` command to create a new entirely-disconnected branch, and then committing the whole repo to it. If git were diff-based, the size of the .git folder would balloon when you did this and your computer would chug computing the commit. But instead, this operation is approximately free, because your new commit will create a tree object that shares all its child objects with existing objects already in the pool, even though the "diff" of this commit is against a blank-slate state.
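The orphan-branch experiment can be imitated with a toy content-addressed store in Python. This is only a sketch: the `store` dict stands in for `.git/objects`, directories are flat, and the commit body omits the author and committer lines that real commits carry. All function names here are hypothetical.

```python
import hashlib

store = {}  # object id -> raw bytes; toy stand-in for the object pool

def put(kind: str, body: bytes) -> str:
    data = kind.encode() + b" " + str(len(body)).encode() + b"\0" + body
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data  # storing identical content again changes nothing
    return oid

def write_tree(files: dict) -> str:
    """files maps name -> bytes; a flat directory is assumed."""
    body = b""
    for name in sorted(files):
        blob = put("blob", files[name])
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob)
    return put("tree", body)

def write_commit(tree: str, parents: list, message: str) -> str:
    lines = ["tree " + tree] + ["parent " + p for p in parents]
    return put("commit", ("\n".join(lines) + "\n\n" + message + "\n").encode())

files = {"a.txt": b"alpha\n", "b.txt": b"beta\n"}
first = write_commit(write_tree(files), [], "initial")
count_before = len(store)  # 2 blobs + 1 tree + 1 commit

# "Orphan" commit: no parents, same files. Only the commit object is new.
orphan = write_commit(write_tree(files), [], "orphan snapshot")
added = len(store) - count_before
print(count_before, added)
```

Even though the orphan commit's diff against an empty state covers every file, the store grows by a single small object, because the tree and blobs it points to already exist.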

(Mind you, git will chug if you ask it to `git show` this orphan commit; it will have to convert, right then, the (cheap) tree representation into a huge diff for you to look at. But that huge diff is just for your convenience; it has nothing to do with git's data model.)

rajivm · on Oct 27, 2016

Actually, you are incorrect.

A git commit is not a "patch" or "delta" as it may seem from the CLI/UI. A git commit consists of a reference to the previous commit and a reference to a git "tree" object.

A git "tree" object is essentially a directory listing of sub-trees (folders) and blobs (files). Each blob and sub-tree is addressed by a hash of the respective object. So in a sense, each commit, by virtue of the tree it references, represents the entire state of the repository at that commit. You do not need to accumulate the diffs of each commit to find the repo state at that commit.

The way git saves space is twofold: 1) Since blobs are hashed (content-addressable), if the same file exists in two trees or in two locations in the repository, it will not take up additional space. 2) Git occasionally re-packs the objects into packfiles that are compressed and leverage deltas to reduce storage space. This is the closest you get to storing deltas, but this is at the blob level and not the commit level, and is more an aspect of storage than how the Git data model actually works.

edejong · on Oct 27, 2016

A commit in the repo is simpler than that. It is:

- A reference to the hash of a tree object

- Some meta-data (one or more parent commits, author, date, message).

The commit command (`git commit`) basically takes the tree object which you created in the index, your HEAD pointer (the commit you are currently on) and some user-entered information to create a commit object.

So, notably, the commit does not store the changes! It stores a reference to the root of the index tree object (which contains references to the root of other tree objects or blobs). There are no diffs recorded anywhere, only calculated when needed.
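As a rough illustration of "only calculated when needed", the following Python sketch derives a unified diff from two full snapshots at the moment someone asks for it. Snapshots are modeled as plain dicts of path to text, which is an assumption of this example rather than git's storage format, and `diff_snapshots` is a made-up helper name.

```python
import difflib

def diff_snapshots(old: dict, new: dict) -> str:
    """Compute a unified diff between two full snapshots on demand."""
    out = []
    for path in sorted(old.keys() | new.keys()):
        a = old.get(path, "").splitlines(keepends=True)
        b = new.get(path, "").splitlines(keepends=True)
        if a != b:
            out.extend(difflib.unified_diff(a, b, "a/" + path, "b/" + path))
    return "".join(out)

# Two commits' worth of state; each one is complete on its own.
old_snapshot = {"notes.txt": "one\ntwo\n"}
new_snapshot = {"notes.txt": "one\ntwo\nthree\n"}

patch_text = diff_snapshots(old_snapshot, new_snapshot)
print(patch_text)
```

Nothing resembling `patch_text` is kept anywhere between calls; it is rebuilt from the two snapshots every time it is requested.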

randomsearch · on Oct 27, 2016

Given how the above posters all seem to have a different opinion about the internals of git, the point of the article seems to be somewhat negated.