Gits Guts: Part 1

Raju Gandhi
  • August 2014
  • Git

This article originally appeared in NFJS, The Magazine's August, 2014 issue.

Since its inception Git has fast become one of the most popular distributed version control systems in use. Despite its pervasive use, Git often still comes across as arcane, with obtuse commands, many of which seem to do similar things. In this article series we will attempt to unravel the mysteries of Git by taking a deep dive into its internals. We will explore the core data structure Git uses to store our repository’s history and then look at a few commands to see how they mutate and manipulate this data structure. This will give us a better understanding of the workings of Git, and allow us to better leverage Git in our daily use.

The .git directory

As you know, Git is a distributed version control system.

Git stores all of the repository's history inside the .git directory which is _usually_ found at the root level of the Git repository.

We will start our exploration of Git by first taking a peek inside the .git directory.

Before we begin, let us initialize a new Git repository by using Git's init command.

Be sure to navigate to a scratch directory prior to running the following command:

Initialize a new Git repository
$ git init gitsGuts
# Initialized empty Git repository in /Users/looselytyped/Documents/articles/gitsGuts/.git/
$ cd gitsGuts
$ (master) ls -al

....
total 0
drwxr-xr-x   3 looselytyped  staff  102 Jul  6 14:57 .
drwxr-xr-x  14 looselytyped  staff  476 Jul  6 14:57 ..
drwxr-xr-x   9 looselytyped  staff  306 Jul  6 14:57 .git
....

Now that we have our repository set up let us take a quick look at the .git directory's structure.

We can use the Unix tree command to see the structure of the .git directory:

.git directory structure
$ (master) tree .git

....
.git
├── HEAD <1>
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects <2>
│   ├── info
│   └── pack
└── refs <3>
    ├── heads
    └── tags
....
<1> Symbolic Reference
<2> Object datastore
<3> References

Some of the files and directories found within the .git directory serve to help configure and customize the Git repository.

To help us out, I have highlighted a few files and directories that will be of particular interest for us in this article series.

If this seems to be unfamiliar territory, worry not -- we will be more than acquainted with it before we are finished here.

Now that we have a Git repository, let us get a high level overview of the core constructs that make up Git's datastore.

The Git datastore

The Git datastore is made up of four different kinds of objects:

Git's objects
  • Blobs
  • Trees
  • Commits
  • Tags

For the purposes of our discussion it will suffice to look only at blobs, trees and commits.

Before we begin to look at these individually, let us talk about these objects from a 20,000-foot view.

Git objects

As one with an object-oriented background, I remember how my ears perked up when I heard of "Git objects" -- I was already thinking of what their API might look like.

But Git objects are nothing like the objects you may be used to in OO-land.

Rather, when you think of Git objects just think of them as "opaque" (that is "not plain text") records that are stored on the file system (in this case that would be the .git directory, or specifically, inside the .git/objects directory).

Each of these objects is compressed prior to being persisted on disk, and Git uses a SHA-1 hash not only to uniquely identify each object, but also to decide where the object is stored.

I realize that this all seems a little abstract, so let us deep-dive into each object individually and perhaps some of this will come into perspective.

We will start with blobs first.

Blobs

Blobs in Git store the contents of files.

Say it with me - blobs in Git store the contents of files.

To put it another way, no meta-data about the file is stored in a blob -- no names, paths, types of files (regular, executable, symlink) -- none of that is stored in a blob.

When Git creates a blob it uses the contents of a file to produce a SHA-1 hash.

It then uses this hash to both fingerprint the blob as well as determine where to store the blob.

Let us see some of this in action.

We will start by creating some content, and we will attempt to see the hash that Git will use to represent that content within the datastore.

Calculating the hash
$ (master) echo 'Hello Git!' | git hash-object --stdin
# 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e

We use one of Git's built-in commands, hash-object[1], to figure out what hash Git will generate to represent "Hello Git!" -- which turns out to be 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e[2].
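To make this a little more concrete, here is a minimal Python sketch of the calculation. It assumes the commonly documented recipe for Git objects: the hash is taken over a short header (the word blob, the size of the content in bytes, and a NUL byte) followed by the content itself. The function name blob_hash is made up for this illustration and is not part of Git.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Hash content the way a Git blob is assumed to be hashed here."""
    # Header: object type, a space, the content length in bytes, then NUL
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo appends a newline, so the content being hashed is 'Hello Git!\n'
print(blob_hash(b"Hello Git!\n"))
```

If the sketch matches Git's behaviour, it should print the same hash that git hash-object reported above. Note that nothing about a file name or path goes into the calculation.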

Of course this is not usually how we use Git.

So let us write a file with the same content and git-add the file so that Git adds it to its datastore.

We will then use the Unix tree command to inspect the .git/objects directory.

git-add a file to Git
$ (master) echo 'Hello Git!' > README.md
 $ (master) git add README.md
 $ (master) tree .git/objects/

....
.git/objects/
├── 10
│   └── 6287c47fd25ad9a0874670a0d5c6eacf1bfe4e
├── info
└── pack

3 directories, 1 file
....

Recall that the hash that Git created to represent "Hello Git!" was 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e.

After we git-add README.md to add the file to Git's index, we see that Git has created a hierarchy containing one folder and one file under .git/objects.

The name of the directory just happens to be the first two characters of the hash that represents the content, and the name of the file happens to be the remaining 38 characters.
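The naming rule is simple enough to express in a few lines of Python. This is only an illustration of the scheme just described; loose_object_path is a hypothetical helper, not something Git ships.

```python
from pathlib import PurePosixPath


def loose_object_path(object_hash: str) -> PurePosixPath:
    """Return where an object with this 40-character hash would be stored."""
    if len(object_hash) != 40:
        raise ValueError("expected a full 40-character SHA-1 hash")
    # First two characters name the directory, the remaining 38 name the file
    return PurePosixPath(".git/objects") / object_hash[:2] / object_hash[2:]


print(loose_object_path("106287c47fd25ad9a0874670a0d5c6eacf1bfe4e"))
# .git/objects/10/6287c47fd25ad9a0874670a0d5c6eacf1bfe4e
```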

6287c47fd25ad9a0874670a0d5c6eacf1bfe4e happens to be the blob that Git created to store the contents of README.md.

The blob, as I mentioned earlier, is a compressed file that contains the contents of README.md [3].
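As a rough sketch of what "compressed" means here, the following Python snippet packs some content the way a loose object is commonly described (a header with the type and size, a NUL byte, then the content, all deflated with zlib) and then unpacks it again. The helper names are invented for this example, and the layout is an assumption about the on-disk format rather than something this article has shown.

```python
import zlib
from typing import Tuple


def encode_loose_object(obj_type: str, content: bytes) -> bytes:
    """Build the compressed bytes assumed to be written under .git/objects."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return zlib.compress(header + content)


def decode_loose_object(raw: bytes) -> Tuple[str, bytes]:
    """Decompress an object and split it into its type and its content."""
    header, _, content = zlib.decompress(raw).partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, content


raw = encode_loose_object("blob", b"Hello Git!\n")
print(decode_loose_object(raw))
# ('blob', b'Hello Git!\n')
```

This is why opening the file in a text editor shows gibberish: the readable content only appears after decompression.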

Let us use Git to find out a little more about this hash.

Inspecting a Git hash
$ (master) git cat-file -t 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e
# blob
$ (master) git cat-file -p 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e
# Hello Git!
$ (master) git cat-file -p 106287
# Hello Git!

We use the git-cat-file command to ask Git the type of object (using the -t flag) that 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e represents, and Git reports it as a blob.

No surprise there.

We can use the same command to ask Git to pretty-print the contents that the hash represents (this time using the -p flag) and again, no surprise.

Most Git commands that accept hashes as arguments can be supplied with just a prefix of the hash, as long as that prefix is unambiguous within the repository. The first six or seven characters are usually sufficient for Git to know which object you mean.
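One way to picture how an abbreviated hash gets resolved is as a prefix search over the known object names. The sketch below is a simplification written for this discussion, not Git's actual lookup code. resolve_prefix is a hypothetical helper, and apart from the blob we just created, the hashes in the list are invented purely to demonstrate an ambiguous prefix.

```python
def resolve_prefix(prefix, known_hashes):
    """Return the single known hash that starts with prefix, or raise."""
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if not matches:
        raise LookupError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise LookupError(f"{prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


known = [
    "106287c47fd25ad9a0874670a0d5c6eacf1bfe4e",  # our README.md blob
    "1062" + "a" * 36,                           # invented, shares 4 characters
    "f" * 40,                                    # invented
]

print(resolve_prefix("106287", known))
# 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e
```

Asking for the prefix 1062 in this made-up list would raise an error, since two entries share it. The more objects a repository holds, the longer a prefix may need to be.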

One final note -- if you have ever heard anyone call Git a content-addressable store, then perhaps you now see why -- Git uses the contents of a file to determine where it is to be stored.

Feel free to repeat this experiment with another piece of content. Use `git-hash-object` to see what hash Git will generate for it, then see if you can predict where Git will store the blob.

Then simply create a new file with the exact same content, and git-add it to the index.

Inspect the .git/objects directory to see if your guess was correct.

To summarize, blobs represent contents of files.

They are identified by SHA-1 hashes that are generated using the contents of the files themselves, and Git uses this hash to determine where to store the blob.

They contain no metadata about the file itself -- so where does this information get stored?

The answer lies in the tree objects.

Let us look at those next.

Trees

While blobs represent the contents of files, trees represent the directory structure of those files. A tree has pointers to all of the blobs that it contains, and to other trees if there happen to be subdirectories.

Before we dig deeper let us add a bit more structure to our Git repository.

$ (master) mkdir src <1>
$ (master) touch src/Main.java <2>
$ (master) echo '// This is my source code' > src/Main.java <3>
$ (master) git add src/Main.java <4>
$ (master) tree <5>

....
.
├── README.md
└── src
    └── Main.java
....
<1> Add a src sub-directory
<2> Add a file to the sub-directory
<3> Put some contents in the newly created file
<4> git-add the file to the repository
<5> Inspect working directory structure

Quick!

How many blobs exist within our Git datastore?

If you guessed two then that is absolutely correct.

Well done :)

Here is another (albeit trickier) question -- how many directories exist within our working directory?

The correct answer to that question is two!

We have the src directory, *and* we have the working directory itself (represented by . in the tree output).

We will now ask Git to write the directory structure to the datastore so we can see what tree objects look like.

$ (master) git write-tree <1>
 # b81f10b16a08debe2624bdc0233a4c2fe2032616
 $ (master) tree .git/objects/ <2>

....
 .git/objects/
├── 10
│   └── 6287c47fd25ad9a0874670a0d5c6eacf1bfe4e
├── 75
│   └── 460e5f3dd6fa1688922a2b6737dc1143d9bb3f
├── b8
│   └── 1f10b16a08debe2624bdc0233a4c2fe2032616
├── df
│   └── 5044438d88195ccf896bdad3eef8940b31e7de
├── info
└── pack

6 directories, 4 files
....
<1> Add the tree to the datastore
<2> Inspect the .git/objects directory

We use yet another command from Git's repertoire, git-write-tree, which causes Git to write the directory structure currently recorded in the index to the datastore.

Git replies back with yet another hash (this time b81f10b16a08debe2624bdc0233a4c2fe2032616) -- this hash represents the root of the current working directory.

Just like blobs Git will store the tree under the .git/objects directory -- it takes the first two characters of the hash to create a folder (if it does not exist already) and then creates a file with the remaining 38 characters.

We know that there are two directories in our working directory (the root, and src) and we have two files.

We confirm this by inspecting the .git/objects directory.

We know that 6287c47fd25ad9a0874670a0d5c6eacf1bfe4e contains the contents of README.md (in compressed format) and that 1f10b16a08debe2624bdc0233a4c2fe2032616 represents the root directory. The obvious question is: how does Git represent a directory structure? Let us find out.

$ (master) git cat-file -t b81f10b16a08debe2624bdc0233a4c2fe2032616 <1>
 # tree
 $ (master) git cat-file -p b81f10b16a08debe2624bdc0233a4c2fe2032616 <2>
 # 100644 blob 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e  README.md
 # 040000 tree 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f  src

<1> Ask for the type of object that b81f10b16a08debe2624bdc0233a4c2fe2032616 represents
<2> Pretty-print (-p) it
  1. We once again use git-cat-file to ask for the type of object that b81f10b16a08debe2624bdc0233a4c2fe2032616 represents, and Git tells us it is a tree object.
  2. Pretty-printing the same hash reveals something that looks a lot like a directory listing!

Looking over the contents of the pretty print we see a few items that should be familiar.

We know that 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e is a blob representing README.md.

We also see an entry for a tree with the name src with a hash of 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f.

Let us inspect that before we proceed to see what _actually_ happened when Git wrote the tree.

$ (master) git cat-file -p 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f <1>
 # 100644 blob df5044438d88195ccf896bdad3eef8940b31e7de Main.java
 $ (master) git cat-file -p df5044438d88195ccf896bdad3eef8940b31e7de <2>
 # // This is my source code

<1> Pretty-print 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f
<2> Pretty-print df5044438d88195ccf896bdad3eef8940b31e7de
  1. Pretty-printing 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f reveals a listing much like the one we saw for b81f10b16a08debe2624bdc0233a4c2fe2032616, except this one has only one entry in it.
  2. Pretty-printing the blob contained within the src directory reveals that it represents the contents of Main.java.

How does this work?

When we asked Git to write the tree to the datastore it started recursively inspecting the working directory from the root.

It realized that there was a sub-directory (src) under the root directory and first calculated the hash for that directory.

It did so by creating a string that looked like 100644 blob df5044438d88195ccf896bdad3eef8940b31e7de Main.java and then using the SHA-1 algorithm to generate a hash from that string.

It then stuffed that very string (after compressing it) in a file called 460e5f3dd6fa1688922a2b6737dc1143d9bb3f under the 75 directory under .git/objects.

The following listing highlights the constituent parts of the string that represent a tree (or a directory) within Git.

Representing a tree
  • The mode: 100644 represents a regular non-executable file (Git uses several other codes, such as 100755 to represent executable files and 040000 to represent sub-directories, a.k.a. sub-trees)
  • The type: blob, tree, etc.
  • The hash of the current entry
  • The name of the entry

Perhaps now you see where the file (or blob) metadata is stored -- it is in the tree!

Furthermore, Git uses the hash of the blobs (and sub-trees) within a tree to calculate the hash of the tree itself!
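For the curious, here is a hedged Python sketch of how a tree's hash can be derived from its entries. It assumes the commonly documented stored form of a tree, which differs slightly from the pretty-printed listing: each entry is the mode, a space, the name, a NUL byte, and then the 20 raw bytes of the entry's hash, with the sub-directory mode written without its leading zero and the type implied by the mode. tree_hash is a name made up for this example.

```python
import hashlib


def tree_hash(entries):
    """Hash one directory level given (mode, name, hex_hash) tuples."""
    def sort_key(entry):
        mode, name, _ = entry
        # Assumed ordering: by name, with sub-trees compared as if ending in "/"
        suffix = b"/" if mode.lstrip("0") == "40000" else b""
        return name.encode("utf-8") + suffix

    body = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        # Stored form: "<mode> <name>", a NUL byte, then the raw hash bytes
        body += mode.lstrip("0").encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_hash)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The sub-tree is hashed first, and its hash then feeds into the root tree
src = tree_hash([("100644", "Main.java", "df5044438d88195ccf896bdad3eef8940b31e7de")])
root = tree_hash([
    ("100644", "README.md", "106287c47fd25ad9a0874670a0d5c6eacf1bfe4e"),
    ("040000", "src", src),
])
print(src)
print(root)
```

If these assumptions hold, the two printed values should match the hashes Git reported for src and for the root tree. Either way, the sketch shows the dependency: change a single blob and every tree above it gets a new hash.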

Now that Git knows the hash of the src directory it traverses up to the parent directory (or the root directory in our case) and writes out another string that lists all the blobs and trees within that directory.

It uses that string to calculate the hash of the root directory and just like before, stuffs that string in a file called 1f10b16a08debe2624bdc0233a4c2fe2032616 under the b8 directory under .git/objects.

Let us restate what we learned here.

Trees in Git store the metadata (the modes, types, hashes, and names) about the blobs and sub-trees that are contained within them.

The hash of the tree is calculated using a string that looks very much like a directory listing.

If a tree contains a sub-directory, then the hash of the sub-tree is calculated first and then used to calculate the hash of the parent directory.

Phew!

Almost there.

Let us look at commits next.

Commits

Commits are the level of abstraction that we as developers using Git are most familiar with.

The help page of git-commit-tree (via git help commit-tree) tells us:

> While a tree represents a particular directory state of a working directory, a commit represents that state in "time," and explains how to get there.

In other words a commit is a snapshot of the working directory at the time the commit was made.

Just so we are on the same page, let us check our Git status:

Git status
$ (master) git status
 # On branch master

....
Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

  new file:   README.md
  new file:   src/Main.java
....

Excellent! We have two files staged, and ready to participate in the next commit.

Shall we commit?

First commit
$ (master) git commit -m "Initial commit"

....
[master (root-commit) 917408c] Initial commit <1>
 2 files changed, 2 insertions(+)
 create mode 100644 README.md
 create mode 100644 src/Main.java
....
<1> Git reports the hash of the commit
NOTE: If you are playing along you *will* get a different hash even if you have the same commit message as mine.

On a successful commit, Git reports the hash (albeit only the first seven characters) of the newly created commit.

Fear not -- this is the truncated form of the hash and in most Git operations that require a hash, only the first six or seven characters need be supplied.

If you are curious to know the full hash you can use yet another Git command, git-rev-parse, like so:

Using git-rev-parse
$ (master) git rev-parse 917408c
# 917408c8318bb3dc86c3a6d1095e27b97d14f637

Pop quiz time!

Based on what we have learned so far, where do you think Git will store the commit?

If your answer is a sub-directory within the .git/objects directory with the name 91 and a file called 7408c8318bb3dc86c3a6d1095e27b97d14f637, then you are absolutely correct!

Go ahead -- take a look inside .git/objects and see for yourself.

Of course the next question to answer is: "What does the file 7408c8318bb3dc86c3a6d1095e27b97d14f637 contain?"

Let us ask Git.

We will once again request the services of our helpful friend git-cat-file to examine the commit.

$ (master) git cat-file -t 917408c8318bb3dc86c3a6d1095e27b97d14f637 <1>
# commit
$ (master) git cat-file -p 917408c8318bb3dc86c3a6d1095e27b97d14f637 <2>

....
tree b81f10b16a08debe2624bdc0233a4c2fe2032616
author Raju Gandhi  1405795376 -0400
committer Raju Gandhi  1405795376 -0400

Initial commit
....
<1> Ask for the type
<2> Pretty print it
  1. The type of object that 917408c8318bb3dc86c3a6d1095e27b97d14f637 represents is a commit. Again, no surprise there.
  2. Pretty printing it reveals a few details about the commit. We see the hash of the tree that we created earlier using git-write-tree. We also see some author and committer information. This is followed by a blank line followed by the commit message we supplied when we created the commit.

Any guesses as to how the hash of the commit was calculated?

Let us step into Git's shoes and see what happens when we make a commit.

Keep in mind that the first thing we do is add all the files that we want to commit to the index (via git-add).

This, we know, will trigger Git to create the blobs that represent each of the files.

On the commit (via git-commit), Git will internally write the tree to the datastore and then write the commit.

In order to calculate the hash of the commit Git will take the hash of the tree (as is reported by git-write-tree), the author information (as provided by Git's configuration), the committer information (which in our case happens to be the same as the author information, since we are both making the changes and committing them to Git), the current timestamp, and finally the commit message.

It then proceeds to write out a string that looks like so:

Git Commit
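tree <hash of the tree>
author <author name> <author email> <timestamp>
committer <committer name> <committer email> <timestamp>

Commit message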

It proceeds by hashing this string to create the hash of the commit.

Finally, it compresses this string and writes it to a file whose path is dictated by the hash it created.

Just like the hash of a tree is a function of all the blobs and trees beneath it, the hash of a commit is a function of the tree that was written when the commit was created.

I mentioned earlier that if you were playing along you *will* see a different hash than mine.

How was I so sure?

This is because the hash of the commit is a function of a lot more than just the tree hash!

And hopefully, email addresses are unique! :)
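The recipe described in this section can be sketched in Python as follows. This is an approximation under stated assumptions: the text of the commit is laid out as in the pretty-printed output we saw, and, like other objects, it is hashed together with a short header giving the type and size. The function commit_hash and both identities below are hypothetical, made up only to show that changing the author changes the hash.

```python
import hashlib


def commit_hash(tree, author, committer, message):
    """Hash a parentless commit built from the given pieces."""
    lines = [
        f"tree {tree}",
        f"author {author}",
        f"committer {committer}",
        "",            # blank line separates the metadata from the message
        message,
    ]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


tree = "b81f10b16a08debe2624bdc0233a4c2fe2032616"
# Hypothetical identities: name, email, Unix timestamp and timezone offset
jane = "Jane Doe <jane@example.com> 1405795376 -0400"
john = "John Roe <john@example.com> 1405795376 -0400"

print(commit_hash(tree, jane, jane, "Initial commit"))
print(commit_hash(tree, john, john, "Initial commit"))
```

Same tree, same message, same timestamp, and yet the two hashes differ, because the author and committer lines are part of what gets hashed.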

The Git DAG

We now know how a Git commit is created.

We know that the hash of a Git commit is representative of the tree it points to, which in turn is representative of all the blobs and sub-trees it contains.

But there is one more component to a Git commit.

Before we proceed we should note that the commit we made was the first commit in our newly created repository.

Let us make a minor change and make another commit to record that change. We will then interrogate the hash of the commit to see what it looks like.

$ (master) echo "Making another commit" >> README.md <1>
 $ (master) git add README.md <2>
 $ (master) git commit -m "Second commit" <3>
 # [master e4e4b13] Second commit <4>
 # 1 file changed, 1 insertion(+)
 $ (master) git cat-file -p e4e4b13 <5>

....
tree e257f1322a6d1eff27c146860e5bf3db286eceef
parent 917408c8318bb3dc86c3a6d1095e27b97d14f637
author Raju Gandhi  1405799643 -0400
committer Raju Gandhi  1405799643 -0400

Second commit
....
<1> Make a change to README.md
<2> Add README.md to the staging area
<3> Make a commit
<4> Git reports back the hash of the newly created commit
<5> Examine the commit

Compare the output of git cat-file for e4e4b13 against the one we saw previously for 917408c8318bb3dc86c3a6d1095e27b97d14f637.

We see that e4e4b13 has one more entry in it, for parent.

Furthermore, the hash against the parent is the hash of our first commit.

In essence, a Git commit not only points to the tree that represents the working directory, it _also_ points to the hash of the commit that was made just before it.

If a commit does not have a parent, Git knows it to be the initial commit in a repository.

To better visualize this I have created an illustration that might help cement this idea:

Figure 1. Git's Directed Acyclic Graph

The red circles in Figure 1 represent commits in our repository, the triangles represent trees and rectangles represent blobs, and time flows up -- the child commits appear above their predecessors (just like you see them in Git's logs).

Our first commit consisted of the README.md file at the root, and the Main.java inside the src directory.

Our second commit only updated the README.md file.

Here is where things get interesting -- recall that a commit is a snapshot of the working directory at the time the commit was made.

Git knows of Main.java at the time of the second commit, but also realizes that the file was not modified.

So it simply reuses the blob it created the first time around.

But it does record the state of the working directory in every commit.
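The reuse falls out naturally from content addressing, as this small Python sketch tries to show. It models the datastore as a plain dictionary keyed by hash and each commit's snapshot as a mapping from path to blob hash. Both are simplifications invented for this example (real snapshots are trees, as we saw), and the header used in blob_hash is the commonly documented blob header, assumed here rather than taken from the article.

```python
import hashlib

store = {}  # stand-in for .git/objects: hash -> content


def blob_hash(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def add_blob(content: bytes) -> str:
    """Store content under its hash; identical content lands on the same key."""
    object_hash = blob_hash(content)
    store[object_hash] = content
    return object_hash


first_snapshot = {
    "README.md": add_blob(b"Hello Git!\n"),
    "src/Main.java": add_blob(b"// This is my source code\n"),
}
second_snapshot = {
    "README.md": add_blob(b"Hello Git!\nMaking another commit\n"),
    "src/Main.java": add_blob(b"// This is my source code\n"),
}

print(len(store))  # 3
print(first_snapshot["src/Main.java"] == second_snapshot["src/Main.java"])  # True
```

Two snapshots of two files each, but only three blobs in the store: the unchanged Main.java is recorded in both snapshots and stored once.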

You can see in Figure 1 that the commits form a DAG, or directed acyclic graph.

The graph is directed because children point towards their parents but never the other way around, and acyclic because following those parent pointers can never lead back to the commit you started from.

Therefore, each commit is not only a function of the state of the working tree (along with other information) but also of the commits that came before it.
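Since every commit records its parent, the history can be recovered by simply following those pointers back to the first commit. The Python sketch below models our two commits as a dictionary keyed by abbreviated hash. It is an illustration only: history is a hypothetical helper, and it assumes each commit has at most one parent, which is true of our repository so far.

```python
commits = {
    "e4e4b13": {"parent": "917408c", "message": "Second commit"},
    "917408c": {"parent": None, "message": "Initial commit"},
}


def history(start, commits):
    """Yield (hash, message) pairs from start back to the parentless commit."""
    current = start
    while current is not None:
        yield current, commits[current]["message"]
        current = commits[current]["parent"]


for commit_id, message in history("e4e4b13", commits):
    print(commit_id, message)
# e4e4b13 Second commit
# 917408c Initial commit
```

The walk stops at the commit without a parent, and it can only ever move from child to parent.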

Since the hash of a commit depends on all of this information, two different repositories that have the same files with the same names and the same content in the same directory structure (which leads to the same tree hash) *will* still end up with different commit hashes, merely as a function of the authors, committers, and timestamps being different.

Conclusion

Git's power comes from simplicity.

Understanding how commits are created and how they participate in the DAG is foundational to the understanding of Git.

In this article we saw how Git stores the history of our repository within a Directed Acyclic Graph of commits, and how the git-commit command adds to this graph.

In part II of this article series we will take a look at a few more commands such as git-branch, git-checkout, and git-merge to see how they manipulate this graph.

Understanding how a command alters the DAG, and being able to visualize both the current and the final state of the graph as a function of executing such a command will lift the veil of obscurity that seemingly surrounds Git, and is the key to mastery.

Till we meet again, keep "add-ing" to your experience with Git and stay "commit-ted" to learning more. :)

Footnotes
  • #1 : We will see several commands that you might not be familiar with in this article series. You will probably never have to use these in your day-to-day usage of Git, but they will help us understand the underpinnings of Git
  • #2 : I recommend that you copy-paste the command above into your terminal if you are playing along -- otherwise beware that you *must* match the case and white-spacing exactly to get the same hash
  • #3 : Git stores some additional information in the blob, but for the purposes of this discussion you can assume it's the contents of a file zipped up