On the Git version control system

1A curriculum for team members(6w~1m)
2Mental model(199w~1m)
3More information(64w~1m)
4Things to write?(53w~1m)
5Git hash collisions(120w~1m)
6Related tools(11w~1m)
7Speeding up Git(205w~2m)
8Git annoys perfectionists(137w~1m)
9Distributed does not mean no coordination(38w~1m)

1A curriculum for team members

2Mental model

A commit is a snapshot of the working tree.

A reference names a commit so that you can write master instead of da39a3e.

If you are a visual person, you can think about how git commands change the picture shown by gitk (a tool for visualizing Git repositories).

In gitk, a blue circle is a commit, a green box is a reference, and the bold green box is the head.

The head points to the commit that will be the parent of the next commit.

When you git init, Git creates a .git directory.

When you git status, it prints On branch master. It means that the head points to the same commit pointed by the reference named master.

When you git commit, you make a new commit (blue circle), and move the current branch (bold green box) to that new commit.

When you git reset Target, you move the head (bold green box) to Target.

If you are not yet comfortable with Git, back up your data by copying the .git directory. It can get corrupted. Things will go wrong. You may accidentally do something and don't know how to recover. Computers don't understand what you mean. They do what you say, not what you mean.

3More information

A Git object:

is identified by a SHA-1 hash;
is either a blob, tree, or commit;
is stored as a file somewhere in the .git folder.

A commit has zero or more parents. It also refers to a tree.

A tree is a list of references. Every reference points to either a tree or a blob.

For more information, read the Pro Git book or the manpages (man git).

4Things to write?

Git fundamentals:
- Git store things in the .git directory.
- Why merge conflicts? How to resolve them? How to use meld? How to do a three-way merge?
- Avoid changing spaces. Avoid using your IDE to reformat files that are already commited.
Workaround for bad user experience
- Disable git-gui GC warning:
  - https://stackoverflow.com/questions/1106529/how-to-skip-loose-object-popup-when-running-git-gui
Understanding Git Filter-branch and the Git Storage Model

5Git hash collisions

Git hash collision may occur albeit extremely unlikely. Git assumes that if two objects have the same hash, then they are the same object. This is false; the converse is true: if two objects are the same, then they have the same hash. When hash collision occur, Git may silently lose data. Git is an example of software that is incorrect but works for the use cases it was designed for (source code versioning). Git is not meant to be used as an arbitrary database.

Other softwares are incorrect as well. We routinely make software that assumes that there will never be more than 2⁶⁴ rows in a database table.

Is it even possible to write correct software at all?

git-gui, for making commits
gitk, for showing history
meld, for three-way diff/merge

7Speeding up Git

7.1The problem(134w~1m)
7.2Plans(41w~1m)
7.3Non-plans(30w~1m)

7.1The problem

I have a repository with 100,000 files and 1,000,000 objects, but most of them are not mine, and I will never use most of them. I don't even think I have more than 1,000 files in that repository. The problem: Git interactive rebase is too slow in that repository.

<2018-12-05> I solved the problem by extracting my work into its own disjoint subtree, and pushing to a different branch of the same repository.

Hypothesis: How git rebase works.

I guess git rebase --onto TARGET BASE MOVE works like this:

git checkout --orphan TARGET --
git cherry-pick <all commits from BASE to MOVE, excluding BASE, including MOVE>
git checkout -B MOVE

Cherry-pick is also slow. I guess that speeding up cherry-pick will also speed up rebase.

Checkout is also slow. I guess that speeding up checkout will also speed up cherry-pick.

It seems that commit and write-tree are slow.

Git - Environment Variables
- GIT_TRACE_PERFORMANCE=true has no effect. Which git version is it for?

7.2Plans

Plan: Make a rebase that uses only trees and not indexes
- If a tree changes, all its ancestors have to be rewritten.
Plan: Just use subtrees and keep the repository small
- I think this is the least-effort solution that solves (works around) the problem.

7.3Non-plans

atlassian.com: How to manage big Git repositories
- Try git sparse checkout? It seems that sparse-checkout and rebase doesn't mix.
Not recommended: git gc --aggressive (doesn't do what we think it would do).

8Git annoys perfectionists

Is my commit perfect yet? Who the hell cares. But my stupid mistakes will be recorded permanently and published for everyone to see? Who the hell cares.

What do I use Git for? For backup and distribution. Not for history. I don't care about my Git history. In general, I don't care about the past, and I care about the present and the future. I have never needed to look into my Git history to diagnose a programming error. I have never used git-bisect. I try to do what I understand and understand what I do, in order to avoid introducing errors. I only demand that Git preserve the things I think are important.

But Git is a version control system, not a backup solution. It may lose data although the probability is astronomically small (practically improbable).¹

9Distributed does not mean no coordination

Git compares lines of text. Git does not compare the meaning of code fragments.

Git merge introduces defect when the source commits are diff-compatible but not meaning-compatible. Example: adding name-clashing functions at different places.

https://stackoverflow.com/questions/10434326/hash-collision-in-git ↩