# Git Internals (Chapter 8)
## Why this chapter exists / positioning in the book
- Can be read early (curiosity) or late (after learning porcelain)
- Understanding internals helps explain *why* Git behaves as it does
- Tradeoff: powerful insight vs. potential complexity for beginners
- Core premise
- Git = **content-addressable filesystem** + **VCS user interface** layered on top
- Historical note
- Early Git (mostly pre-1.5) UI emphasized filesystem concepts → felt complex
- Modern Git UI refined; early “complex Git” stereotype lingers
- Chapter flow
- Content-addressable storage layer (objects) first
- Then transports (protocols)
- Then maintenance + recovery tasks
## Plumbing and Porcelain
- Porcelain commands (high-level UX)
- Examples: `checkout`, `branch`, `remote`, …
- Most of the book focuses on these
- Plumbing commands (low-level toolkit)
- Designed to be chained (UNIX-style) or used from scripts/tools
- Used here to expose internals and demonstrate implementation
- Often not meant for humans to type frequently
## The `.git` directory (what Git stores/manipulates)
- Created by `git init`
- Backups/clones
- Copying `.git/` elsewhere gives *nearly everything* needed
- Fresh repo typical contents
- `config`
- Project-specific configuration
- `description`
- Used by GitWeb only
- `HEAD`
- Points to the current branch (or directly to a commit when `HEAD` is detached)
- `hooks/`
- Client/server hook scripts (covered elsewhere)
- `info/`
- Holds the `exclude` file: per-repository ignore patterns you don’t want to track in `.gitignore`
- `objects/`
- Object database (content store)
- `refs/`
- Pointers into commits (branches, tags, remotes, …)
- `index` (absent from a fresh repo)
- Staging area data (created when needed)
- “Core” pieces emphasized here
- `objects/` — all stored content
- `refs/` — names/pointers into commit graph
- `HEAD` — what’s checked out
- `index` — staging area snapshot used to build trees/commits
## Git Objects (content-addressable store)
### Concept: a key–value database
- Insert arbitrary data → receive a unique key → retrieve later
- Key is a checksum (SHA-1 in these examples) of:
- a header + the content (details later)
### Creating a blob object with `git hash-object`
- What it does
- hashes content
- optionally writes object into `.git/objects/`
- returns the object id (40 hex chars = SHA-1)
- Key options
- `-w` — write object to object database
- `--stdin` — read content from stdin (otherwise expects a filename)
- Object storage layout on disk (loose objects)
- Path: `.git/objects/<first2>/<remaining38>`
- Directory name = first 2 chars of SHA-1
- Filename = remaining 38 chars
- Inspecting an object
- `git cat-file -p <sha>` — pretty-print content (auto-detect type)
- `git cat-file -t <sha>` — print object type
- Blob objects
- store *only content* (no filename)
- example: versions of `test.txt` stored as different blobs
### Retrieving content
- You can “recreate” a file from a blob by redirecting `cat-file` output
- `git cat-file -p <sha> > test.txt`
- Limitations of blobs alone
- Must remember SHA-1 per version
- No filenames or directory structure
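
The inspection steps above can be approximated in a few lines of Python. This is a minimal sketch, assuming SHA-1 object ids and loose (unpacked) objects only; `loose_object_path` and `read_loose_object` are illustrative names, not Git APIs. It relies on the `<type> <size>\0` header that precedes every object's content.

```python
import zlib
from pathlib import Path


def loose_object_path(git_dir, sha):
    """Map a 40-character object id to objects/<first2>/<remaining38>."""
    return Path(git_dir) / "objects" / sha[:2] / sha[2:]


def read_loose_object(git_dir, sha):
    """Return (type, content) for a loose object, roughly cat-file -t and -p."""
    raw = zlib.decompress(loose_object_path(git_dir, sha).read_bytes())
    # The stored bytes are "<type> <size>\0" followed by the content.
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), content
```

For a blob, `read_loose_object(".git", sha)` should return the same bytes that `git cat-file -p <sha>` prints. Objects that have already been moved into a packfile are out of scope for this sketch.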
## Tree Objects (filenames + directories + grouping)
### What a tree is
- Stores a directory listing-like structure
- Entries contain
- mode
- type (`blob` or `tree`)
- SHA-1 of target object
- filename
- Conceptual model (simplified UNIX-like)
- tree ↔ directory entries
- blob ↔ file contents
### Inspecting trees
- `git cat-file -p master^{tree}`
- shows top-level tree for the last commit on `master`
- example entries include blobs (files) and trees (subdirectories)
- Subtrees
- a directory entry points to another tree object
- Shell quoting pitfalls for `master^{tree}`
- Windows CMD: `^` is escape → use `master^^{tree}`
- PowerShell: quote braces → `git cat-file -p 'master^{tree}'`
- Zsh: `^` is a globbing character → quote the expression → `git cat-file -p "master^{tree}"`
### Building trees manually (via the index)
- Normal Git behavior
- Creates trees from the staging area (index)
- Plumbing commands used
- `git update-index`
- manipulate index entries
- `--add` required if path not in index yet
- `--cacheinfo` used when content isn’t in working tree (already in DB)
- requires: `<mode> <sha> <path>`
- valid file modes for blobs
- `100644` normal file
- `100755` executable
- `120000` symlink
- `git write-tree`
- writes current index to a tree object
- `git read-tree`
- reads a tree into index
- `--prefix=<dir>/` stages it as a subtree
### Example sequence (three trees)
- Tree 1: `test.txt` v1
- stage blob via `update-index --add --cacheinfo 100644 <sha_v1> test.txt`
- `write-tree` → tree1 (contains `test.txt` → blob v1)
- Tree 2: `test.txt` v2 + `new.txt`
- update index to point `test.txt` to blob v2
- add `new.txt`
- `write-tree` → tree2 (two file entries)
- Tree 3: include Tree 1 under `bak/`
- `read-tree --prefix=bak <tree1>`
- `write-tree` → tree3
- tree3 contains
- `bak/` → tree1
- `new.txt` → blob
- `test.txt` → blob v2
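
To make the tree structure concrete, here is a hedged Python sketch that serializes tree entries and computes the resulting tree id. It assumes the on-disk entry layout `<mode> <name>\0<20 raw SHA-1 bytes>` and that subtrees use mode `40000` and sort as if their names ended in `/`; neither detail is spelled out in this chapter outline. `tree_object_id` is an illustrative name.

```python
import hashlib


def tree_object_id(entries):
    """entries: iterable of (mode, name, sha_hex). Return the tree's object id."""
    def sort_key(entry):
        mode, name, _ = entry
        # Assumption: directory names are ordered as if they ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, sha_hex in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(sha_hex)
    # Same recipe as every object: header + content, then SHA-1.
    store = b"tree " + str(len(body)).encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest()


# A tree holding one empty file (e69de29... is the id of the empty blob).
empty_blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(tree_object_id([("100644", "test.txt", empty_blob)]))
```

If the assumptions hold, the printed id should equal what `git write-tree` reports for an index containing only that entry.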
## Commit Objects (snapshots + history + metadata)
### Why commits exist
- Trees represent snapshots but:
- SHA-1s are not memorable
- need who/when/why metadata
- need parent links to form history
### Creating commits with `git commit-tree`
- Inputs
- a tree SHA-1 (snapshot)
- optional parent commit SHA-1(s)
- message from stdin
- Commit object fields
- `tree <tree_sha>`
- `parent <parent_sha>` (none for first commit)
- `author ...` (from `user.name`, `user.email`, timestamp)
- `committer ...` (same source)
- blank line
- commit message
- Note about hashes in book
- your commit hashes will differ from the book's because timestamps and author data differ; substitute your own
### Example history
- Commit 1 points to tree1 (no parent)
- Commit 2 points to tree2, parent = commit1
- Commit 3 points to tree3, parent = commit2
- View history
- `git log --stat <commit3_sha>`
- Key takeaway
- Porcelain `git add`/`git commit` do essentially:
- write blobs for changed content
- update index
- write tree(s)
- write commit referencing tree + parent
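
The commit fields listed above can be assembled by hand. The following is a sketch, not how `git commit-tree` is implemented: `build_commit` is a hypothetical helper, and the identity string format (`Name <email> <unix-timestamp> <tz-offset>`) is an assumption based on what `git cat-file -p` shows for real commits.

```python
import hashlib


def build_commit(tree, parents, author, committer, message):
    """Return (object id, raw content) for a commit object."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # no parent line for a root commit
    lines.append("author " + author)
    lines.append("committer " + committer)
    # A blank line separates the header fields from the message.
    content = ("\n".join(lines) + "\n\n" + message).encode()
    store = b"commit " + str(len(content)).encode() + b"\x00" + content
    return hashlib.sha1(store).hexdigest(), content


who = "A U Thor <author@example.com> 1243040974 -0700"  # placeholder identity
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
first, raw = build_commit(empty_tree, [], who, who, "First commit\n")
second, _ = build_commit(empty_tree, [first], who, who, "Second commit\n")
print(raw.decode())
```

Because the timestamp and identity are part of the hashed content, changing either one changes the commit id, which is why ids differ between machines.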
## Object Storage (how objects are actually stored)
### Common storage recipe
- Each object stored as:
- header + content
- Header format
- `<type> <size>\0`
- type: `blob`, `tree`, `commit`, `tag`
- size: bytes in content
- null byte terminator
- Object id
- SHA-1 of (header + content)
- Compression
- zlib-compressed before writing to disk
### Ruby walk-through (blob example)
- Build content string
- Build header (`"blob #{bytesize}\0"`)
- Concatenate and hash with SHA-1
- matches `git hash-object` (use `echo -n` to avoid newline)
- Compress with zlib
- Write to `.git/objects/<sha[0,2]>/<sha[2,38]>`
- Validate with `git cat-file -p <sha>`
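
The same walk-through translates directly to Python. This is a sketch of the steps listed above rather than the chapter's Ruby code; `write_blob` is an illustrative name, and the repository path is whatever directory you pass in.

```python
import hashlib
import zlib
from pathlib import Path


def write_blob(git_dir, content):
    """Store content (bytes) as a loose blob object and return its object id."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    store = header + content
    sha = hashlib.sha1(store).hexdigest()
    # First two hex characters name the directory, the remaining 38 the file.
    path = Path(git_dir) / "objects" / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return sha
```

Calling `write_blob(".git", b"what is up, doc?")` inside a repository should produce an object that `git cat-file -p` can read back. The returned id should match `echo -n "what is up, doc?" | git hash-object --stdin`.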
## Git References (refs) — naming commits/objects
### What refs are
- Human-friendly names → files containing SHA-1s
- Stored under `.git/refs/`
- `refs/heads/` — branches
- `refs/tags/` — tags
- (later) `refs/remotes/` — remote-tracking refs
### Creating/updating refs
- Direct edit possible but discouraged
- `echo <sha> > .git/refs/heads/master`
- Safer: `git update-ref`
- `git update-ref refs/heads/master <sha>`
- Branch meaning
- A branch is a ref that points to the tip commit of a line of work
- Example: create branch at older commit
- `git update-ref refs/heads/test <sha_of_commit2>`
- `git log test` shows only commits reachable from that ref
## `HEAD` — what you have checked out
### Symbolic reference (usual case)
- `.git/HEAD` commonly contains
- `ref: refs/heads/<branch>`
- On checkout, Git updates `HEAD` to point at chosen branch ref
- Commit parent determination
- `git commit` sets the new commit's parent to the commit that the ref named in `HEAD` points to
### Detached HEAD (special case)
- Sometimes `HEAD` contains a raw SHA-1
- Happens when checking out
- a tag
- a commit
- a remote-tracking branch
### Managing HEAD safely
- `git symbolic-ref HEAD` — read where HEAD points
- `git symbolic-ref HEAD refs/heads/test` — set symbolic HEAD
- Constraint
- cannot point outside `refs/` namespace
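
The two `HEAD` cases (symbolic ref versus raw SHA-1) can be told apart by reading the file. The sketch below assumes loose refs only and ignores `packed-refs`; `resolve_head` is an illustrative name, and real tooling should prefer `git symbolic-ref` or `git rev-parse`.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref name or None, commit id or None) for <git_dir>/HEAD."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD: the file holds a raw SHA-1
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    # A branch with no commits yet has no ref file on disk.
    sha = ref_file.read_text().strip() if ref_file.is_file() else None
    return ref, sha
```

A result such as `("refs/heads/master", "<sha>")` means a branch is checked out; `(None, "<sha>")` means `HEAD` is detached.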
## Tags (lightweight vs annotated)
### Tag object
- Fourth object type: `tag`
- Similar to commit object (tagger/date/message/pointer)
- Usually points to a commit, but can tag any object (blob/tree/commit)
### Lightweight tags
- Just a ref under `refs/tags/` pointing directly to an object
- `git update-ref refs/tags/v1.0 <commit_sha>`
- Never moves (unlike branch tips)
### Annotated tags
- Create a tag object and a ref that points to it
- `git tag -a v1.1 <commit_sha> -m '...'`
- `.git/refs/tags/v1.1` contains SHA-1 of the *tag object*
- Tag object content includes
- `object <target_sha>`
- `type <target_type>`
- `tag <name>`
- `tagger ...`
- message
- Examples mentioned
- Tagging a maintainer’s GPG key stored as a blob
- Kernel repo has an early tag pointing at an initial tree
## Remotes (remote-tracking references)
### What they are
- Refs under `refs/remotes/<remote>/...`
- Store last known state of remote branches after communicating
### Example
- After `git remote add origin ...` and `git push origin master`
- `.git/refs/remotes/origin/master` stores last known remote SHA-1
### Key characteristics
- Read-only from user standpoint
- You can checkout one, but Git won’t set `HEAD` as symbolic ref to it
- They act as bookmarks managed by Git for remote state
## Packfiles (space-efficient object storage)
### Loose objects vs packed objects
- Loose object: one zlib-compressed file per object
- Packfile:
- single `.pack` containing many objects
- `.idx` index mapping SHA-1 → offsets
### When packing happens
- Automatically when:
- many loose objects
- many packfiles
- Manually via `git gc`
- Often during push to a server
### Demonstration scenario (why deltas matter)
- Add large file (`repo.rb`, ~22K) and commit
- file stored as blob
- Modify it slightly and commit again
- creates a whole new blob
- two near-identical large blobs now exist
### `git gc` effects
- Creates pack + index
- Removes many loose objects (reachable ones)
- Leaves dangling/unreachable blobs loose (not in pack)
### Inspecting what’s packed
- `git verify-pack -v <pack>.idx`
- shows objects, sizes, offsets, delta bases
- Delta storage behavior shown
- newer version often stored in full
- older version stored as delta against newer
- optimized for fast access to most recent version
- Repacking
- can happen automatically
- can be triggered any time via `git gc`
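
To see which objects are stored as deltas, the text output of `git verify-pack -v` can be parsed. This sketch assumes the documented column layout (`sha type size size-in-pack offset`, plus `depth base-sha` for deltified objects) and 40-character SHA-1 ids. The sample ids and numbers are made up for illustration.

```python
def parse_verify_pack(output):
    """Parse `git verify-pack -v` text into a list of dicts, one per object."""
    objects = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) not in (5, 7) or len(fields[0]) != 40:
            continue  # skip summary lines such as "non delta: 7 objects"
        objects.append({
            "sha": fields[0],
            "type": fields[1],
            "size": int(fields[2]),
            "size_in_pack": int(fields[3]),
            "offset": int(fields[4]),
            # Deltified objects carry a chain depth and the id of their base.
            "delta_base": fields[6] if len(fields) == 7 else None,
        })
    return objects


full, delta = "a" * 40, "b" * 40  # hypothetical object ids
sample = (
    f"{full} blob   22054 5799 1463\n"
    f"{delta} blob   9 20 7262 1 {full}\n"
    "non delta: 1 object\n"
)
for obj in sorted(parse_verify_pack(sample), key=lambda o: o["size"]):
    print(obj["sha"][:7], obj["size"], obj["delta_base"])
```

In the sample, the second object occupies only a few bytes in the pack because it is stored as a delta against the first.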
## Refspec (ref mapping rules for fetch/push)
### Where it appears
- `.git/config` remote section created by `git remote add`
- `fetch = +refs/heads/*:refs/remotes/origin/*`
### Syntax
- `(+)?<src>:<dst>`
- optional `+` forces update even if not fast-forward
- `<src>`: refs on remote
- `<dst>`: local tracking refs
### Default fetch behavior
- Fetch all remote branches (`refs/heads/*`)
- Track locally as `refs/remotes/origin/*`
- Equivalent references
- `origin/master`
- `remotes/origin/master`
- `refs/remotes/origin/master`
### Custom fetch examples
- Always fetch only `master`
- `fetch = +refs/heads/master:refs/remotes/origin/master`
- One-time fetch to a different local name
- `git fetch origin master:refs/remotes/origin/mymaster`
- Multiple refspecs
- CLI or multiple `fetch =` lines in config
- Fast-forward enforcement and overrides
- non-FF rejected unless `+` used
- Partial globs (Git ≥ 2.6.0)
- `qa*` patterns for multiple branches
- Namespaces/directories for teams
- e.g., `refs/heads/qa/*` → `refs/remotes/origin/qa/*`
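
The mapping a refspec describes can be modeled as a small string transformation. The sketch below is an approximation for illustration, assuming at most one `*` on each side; `parse_refspec` and `map_ref` are hypothetical helpers, not Git functions.

```python
def parse_refspec(spec):
    """Split '(+)?<src>:<dst>' into (force, src, dst)."""
    force = spec.startswith("+")
    src, _, dst = (spec[1:] if force else spec).partition(":")
    return force, src, dst


def map_ref(spec, remote_ref):
    """Return the local ref that remote_ref maps to, or None if it does not match."""
    _, src, dst = parse_refspec(spec)
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, _, suffix = src.partition("*")
    if (len(remote_ref) < len(prefix) + len(suffix)
            or not remote_ref.startswith(prefix)
            or not remote_ref.endswith(suffix)):
        return None
    # Whatever the "*" matched on the source side is substituted on the destination side.
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


default = "+refs/heads/*:refs/remotes/origin/*"
print(map_ref(default, "refs/heads/master"))  # refs/remotes/origin/master
print(map_ref(default, "refs/tags/v1.0"))     # None
```

The `force` flag returned by `parse_refspec` corresponds to the leading `+`, which allows non-fast-forward updates of the destination ref.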
## Pushing refspecs & deleting remote refs
### Pushing into a namespace
- Push local `master` to remote `qa/master`
- `git push origin master:refs/heads/qa/master`
- Configure default push mapping
- `push = refs/heads/master:refs/heads/qa/master`
### Deleting remote references
- Old refspec deletion form
- `git push origin :topic`
- Newer explicit flag (Git ≥ 1.7.0)
- `git push origin --delete topic`
### Note/limitation
- A single refspec cannot fetch from one repository and push to another
## Transfer Protocols (moving data between repositories)
### Two major approaches
- Dumb protocol
- simple, HTTP read-only, no Git server-side logic
- inefficient, hard to secure/private; rarely used now
- Smart protocol
- Git-aware server process
- negotiates what data is needed
- supports pushes
### Dumb protocol (HTTP) — conceptual clone walkthrough
- `git clone http://server/<repo>.git`
- Fetch refs list (requires server-generated metadata)
- `GET info/refs`
- generated by `update-server-info` (often via post-receive hook)
- Fetch HEAD to determine default branch
- `GET HEAD` → `ref: refs/heads/master`
- Walk objects starting from advertised commit SHA
- `GET objects/<sha_prefix>/<sha_rest>` for loose objects
- parse commit → learn `tree` + `parent`
- If tree object not found as loose (404)
- check alternates
- `GET objects/info/http-alternates`
- check available packfiles
- `GET objects/info/packs`
- `GET objects/pack/pack-....idx`
- `GET objects/pack/pack-....pack`
- Once required objects are fetched
- checkout working tree for branch pointed to by downloaded `HEAD`
### Smart protocol — overview
- Upload (push): `send-pack` (client) ↔ `receive-pack` (server)
- Download (fetch/clone): `fetch-pack` (client) ↔ `upload-pack` (server)
#### Uploading data (push)
- SSH transport
- client runs remote command (conceptually)
- `ssh ... "git-receive-pack '<repo>.git'"`
- server advertises
- current refs + SHA-1s
- capabilities appended on the first line after a NUL separator
- pkt-line framing
- each chunk begins with 4 hex chars = length (including those 4 chars)
- `0000` indicates end
- client sends per-ref updates
- `<old_sha> <new_sha> <refname>`
- all zeros on left = create ref
- all zeros on right = delete ref
- client sends a packfile of objects server lacks
- server replies success/failure
- e.g., `unpack ok`
- HTTP(S) transport
- discovery
- `GET .../info/refs?service=git-receive-pack`
- push
- `POST .../git-receive-pack` with update commands + packfile
- note: HTTP may wrap in chunked transfer encoding
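
The pkt-line framing described above is simple enough to sketch. The helpers below (`pkt_line`, `read_pkt_lines`, `update_command`) are illustrative names. The update command omits the capability list that a real client appends to its first command, so treat it as a simplified model.

```python
FLUSH = b"0000"  # flush packet: marks the end of a section


def pkt_line(payload):
    """Frame payload bytes as one pkt-line; the length counts its own 4 hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(data):
    """Yield payloads from a byte string until a flush packet or the end of data."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            return
        yield data[pos + 4:pos + length]
        pos += length


def update_command(old_sha, new_sha, refname):
    """Build a '<old_sha> <new_sha> <refname>' ref update as a pkt-line."""
    return pkt_line(f"{old_sha} {new_sha} {refname}".encode())


print(pkt_line(b"done\n"))  # b'0009done\n'
# All zeros on the left creates the ref; "a" * 40 is a placeholder commit id.
print(update_command("0" * 40, "a" * 40, "refs/heads/topic"))
```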
#### Downloading data (fetch/clone)
- SSH transport
- client runs remote command
- `ssh ... "git-upload-pack '<repo>.git'"`
- server advertises
- refs and capabilities
- `symref=HEAD:refs/heads/master` so client knows default branch
- negotiation
- client sends `want <sha>`
- client sends `have <sha>`
- client sends `done` to request packfile generation
- server returns packfile (optionally multiplexing progress via side-band)
- HTTP(S) transport
- discovery
- `GET .../info/refs?service=git-upload-pack`
- negotiation/data request
- `POST .../git-upload-pack` with want/have data
- response includes packfile
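
On the download side, the first advertised line carries the capability list after a NUL byte. This sketch takes already-unframed pkt-line payloads and extracts the refs, the capabilities, and the default branch from `symref=HEAD:...`. The function names and sample payloads are illustrative assumptions, and the HTTP `# service=...` preamble is not handled.

```python
def parse_advertisement(payloads):
    """Return (refs dict, capability list) from advertised pkt-line payloads."""
    refs, caps = {}, []
    for i, payload in enumerate(payloads):
        line = payload.rstrip(b"\n")
        if i == 0 and b"\x00" in line:
            # Capabilities ride along on the first line, after a NUL separator.
            line, _, cap_bytes = line.partition(b"\x00")
            caps = cap_bytes.decode().split()
        sha, _, name = line.decode().partition(" ")
        refs[name] = sha
    return refs, caps


def default_branch(caps):
    """Return the ref HEAD points to if a symref capability announces it."""
    for cap in caps:
        if cap.startswith("symref=HEAD:"):
            return cap[len("symref=HEAD:"):]
    return None


sha = "c" * 40  # placeholder commit id
payloads = [
    f"{sha} HEAD\x00multi_ack side-band symref=HEAD:refs/heads/master\n".encode(),
    f"{sha} refs/heads/master\n".encode(),
]
refs, caps = parse_advertisement(payloads)
print(default_branch(caps), sorted(refs))
```

With this information a client can decide which `want <sha>` lines to send and which branch to check out after a clone.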
### Protocols summary note
- Only the high-level handshake is covered
- Many capabilities/features (e.g., `multi_ack`, `side-band`) exist beyond this chapter’s scope
## Maintenance and Data Recovery
### Maintenance (`gc`, packing, pruning)
- Auto maintenance
- Git occasionally runs `git gc --auto` by itself
- Usually no-op unless thresholds exceeded
- What `git gc` does
- packs loose objects into packfiles
- consolidates packfiles
- removes unreachable objects older than a few months
- Trigger thresholds (approx)
- ~7,000 loose objects (`gc.auto` defaults to 6700)
- >50 packfiles (`gc.autoPackLimit` defaults to 50)
- Config knobs
- `gc.auto`
- `gc.autopacklimit`
- Manual auto-gc run
- `git gc --auto` (often does nothing)
### Packing refs into `packed-refs`
- Before gc: refs stored as many small files
- `.git/refs/heads/*`, `.git/refs/tags/*`, …
- After gc: moved for efficiency into `.git/packed-refs`
- format lines: `<sha> <refname>`
- annotated tags include a “peeled” line starting with `^`
- indicates the commit the tag ultimately points to
- Updating a ref after packing
- Git writes a new loose ref file under `.git/refs/...`
- doesn’t edit `packed-refs`
- Lookup behavior
- Git checks loose refs first, then `packed-refs` fallback
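
The lookup order above (loose ref first, then `packed-refs`) can be sketched as follows. `parse_packed_refs` and `lookup_ref` are illustrative names; the parser assumes the `<sha> <refname>` line format, `#` header comments, and `^`-prefixed peeled lines that follow the tag they belong to.

```python
from pathlib import Path


def parse_packed_refs(text):
    """Return (refs, peeled) dicts parsed from packed-refs file content."""
    refs, peeled, last = {}, {}, None
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # header comment such as "# pack-refs with: ..."
        if line.startswith("^"):
            if last is not None:
                peeled[last] = line[1:]  # commit the preceding tag points to
            continue
        sha, _, name = line.partition(" ")
        refs[name] = sha
        last = name
    return refs, peeled


def lookup_ref(git_dir, name):
    """Resolve a ref name: loose file first, then the packed-refs fallback."""
    git_dir = Path(git_dir)
    loose = git_dir / name
    if loose.is_file():
        return loose.read_text().strip()
    packed = git_dir / "packed-refs"
    if packed.is_file():
        return parse_packed_refs(packed.read_text())[0].get(name)
    return None
```

Because a newer loose file shadows the packed entry, Git can update a ref without rewriting `packed-refs`.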
### Data Recovery (finding lost commits)
#### Common loss causes
- force-delete a branch containing work you later want
- `git reset --hard` moving a branch tip back, abandoning newer commits
#### Reflog-based recovery
- Reflog records where `HEAD` pointed whenever it changes
- commits, branch switches, resets
- also updated by `git update-ref` (reason to prefer it over manual ref edits)
- Useful commands
- `git reflog` — concise HEAD history
- `git log -g` — reflog shown as a log
- Recovery technique
- find lost commit SHA-1 in reflog
- create a ref/branch pointing to it
- `git branch recover-branch <sha>`
#### Recovery without reflog
- If reflog is missing (e.g., `.git/logs/` removed)
- Use integrity checker
- `git fsck --full`
- shows dangling/unreachable objects
- `dangling commit <sha>`
- Recover similarly
- create a new branch ref pointing to the dangling commit
### Removing objects (purging big files from history)
#### Problem statement
- Git clones fetch full history
- A huge file added once remains in history forever if reachable
- even if deleted next commit
- Especially painful in imported repos (SVN/Perforce)
#### Strong warning
- Destructive: rewrites commit history (new commit IDs)
- Must coordinate contributors (rebase onto rewritten history)
#### Workflow to locate and remove large objects
- Confirm repo size after packing
- `git gc`
- `git count-objects -v` (check `size-pack`)
- Find largest packed objects
- `git verify-pack -v <pack>.idx | sort -k 3 -n | tail -3`
- third field in output is object size
- Map blob SHA to filename
- `git rev-list --objects --all | grep <blob_sha_prefix>`
- Identify commits that touched the path
- `git log --oneline --branches -- <file>`
- Rewrite history to remove the file from every tree
- `git filter-branch --index-filter 'git rm --ignore-unmatch --cached <file>' -- <bad_commit>^..`
- `--index-filter` is fast (no full checkout per commit)
- `git rm --cached` removes the file from the index (and so from each rewritten tree), not from disk
- Remove pointers to old history
- `rm -Rf .git/refs/original`
- `rm -Rf .git/logs/`
- Repack/clean
- `git gc`
- optionally remove remaining loose objects
- `git prune --expire now`
## Environment Variables (controlling Git behavior)
> Chapter note: not exhaustive; highlights the most useful
### Global behavior
- `GIT_EXEC_PATH`
- where Git finds sub-programs (e.g., `git-commit`, `git-diff`)
- inspect via `git --exec-path`
- `HOME`
- where Git finds global config
- can be overridden for portable Git setups
- `PREFIX`
- system-wide config path: `$PREFIX/etc/gitconfig`
- `GIT_CONFIG_NOSYSTEM`
- disable system-wide config
- Output paging/editing
- `GIT_PAGER` (fallback `PAGER`)
- `GIT_EDITOR` (fallback `EDITOR`)
### Repository locations
- `GIT_DIR`
- where `.git` directory is
- if unset, Git walks up directory tree searching
- `GIT_CEILING_DIRECTORIES`
- stops upward search early (useful for slow filesystems)
- `GIT_WORK_TREE`
- working tree root for non-bare repos
- `GIT_INDEX_FILE`
- alternate index path
- Object database
- `GIT_OBJECT_DIRECTORY` — override `.git/objects`
- `GIT_ALTERNATE_OBJECT_DIRECTORIES`
- colon-separated additional object stores (share objects across repos)
### Pathspecs (path matching rules)
- Pathspecs used in `.gitignore` and CLI patterns (e.g., `git add *.c`)
- Wildcard behavior toggles
- `GIT_GLOB_PATHSPECS=1` — wildcards enabled (default)
- `GIT_NOGLOB_PATHSPECS=1` — wildcards literal (e.g., `*.c` matches file named `*.c`)
- Per-argument overrides
- prefix with `:(glob)` or `:(literal)`
- `GIT_LITERAL_PATHSPECS`
- disables wildcard matching and override prefixes
- `GIT_ICASE_PATHSPECS`
- case-insensitive pathspec matching
### Committing (author/committer identity)
- Used primarily by `git-commit-tree` (then falls back to config)
- Author fields
- `GIT_AUTHOR_NAME`
- `GIT_AUTHOR_EMAIL`
- `GIT_AUTHOR_DATE`
- Committer fields
- `GIT_COMMITTER_NAME`
- `GIT_COMMITTER_EMAIL`
- `GIT_COMMITTER_DATE`
- `EMAIL`
- fallback email if `user.email` is unset
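
One practical use of these variables is producing reproducible commits from a script. The sketch below only builds the environment; `commit_env` is a hypothetical helper, and the date string uses the `<unix-timestamp> <tz-offset>` form, which is an assumption about a format Git accepts.

```python
import os


def commit_env(name, email, date, base=None):
    """Return an environment dict that pins author and committer identity."""
    env = dict(os.environ if base is None else base)
    for role in ("AUTHOR", "COMMITTER"):
        env[f"GIT_{role}_NAME"] = name
        env[f"GIT_{role}_EMAIL"] = email
        env[f"GIT_{role}_DATE"] = date
    return env


env = commit_env("A U Thor", "author@example.com", "1243040974 -0700")
# Inside a repository, the dict could be passed to a plumbing command, e.g.:
#   subprocess.run(["git", "commit-tree", tree_sha, "-m", "msg"], env=env, check=True)
print(env["GIT_AUTHOR_DATE"])
```

With identical tree, parents, message, and these six variables, repeated runs should yield the same commit id.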
### Networking (HTTP behavior)
- `GIT_CURL_VERBOSE`
- emit libcurl debug messages
- `GIT_SSL_NO_VERIFY`
- skip SSL cert verification (self-signed/setup scenarios)
- Low-speed abort settings
- `GIT_HTTP_LOW_SPEED_LIMIT`
- `GIT_HTTP_LOW_SPEED_TIME`
- override `http.lowSpeedLimit` / `http.lowSpeedTime`
- `GIT_HTTP_USER_AGENT`
- override user-agent string
### Diffing and merging
- `GIT_DIFF_OPTS`
- only supports unified context count: `-u<n>` / `--unified=<n>`
- `GIT_EXTERNAL_DIFF`
- program invoked instead of built-in diff
- Batch diff metadata for external diff tool
- `GIT_DIFF_PATH_COUNTER`
- `GIT_DIFF_PATH_TOTAL`
- `GIT_MERGE_VERBOSITY` (recursive merge)
- 0: only errors
- 1: conflicts only
- 2: + file changes (default)
- 3: + skipped unchanged
- 4: + all processed paths
- 5+: deep debug
### Debugging/tracing (observability)
- Output destinations
- `"true"`, `"1"`, `"2"` → stderr
- absolute path `/...` → write to file
- `GIT_TRACE`
- general tracing (alias expansion, sub-program exec)
- `GIT_TRACE_PACK_ACCESS`
- pack access tracing: packfile + offset
- `GIT_TRACE_PACKET`
- packet-level tracing for network operations
- `GIT_TRACE_PERFORMANCE`
- timing for each internal step/subcommand
- `GIT_TRACE_SETUP`
- shows discovered repo paths (`git_dir`, `worktree`, `cwd`, `prefix`, ...)
### Miscellaneous
- `GIT_SSH`
- program used instead of `ssh`
- invoked as: `$GIT_SSH [user@]host [-p <port>] <command>`
- wrapper script often needed for extra args; `~/.ssh/config` may be easier
- `GIT_ASKPASS`
- program to prompt for credentials (returns answer on stdout)
- `GIT_NAMESPACE`
- namespaced refs (like `--namespace`), often server-side
- `GIT_FLUSH`
- stdout buffering
- `1` flush frequently; `0` buffer
- `GIT_REFLOG_ACTION`
- custom text written to reflog entries (action descriptor)
## Summary (what you should now understand)
- Git internals = object database + refs + a UI on top
- Main object types
- blob (content), tree (directories), commit (history + metadata), tag (named pointer + metadata)
- Refs and `HEAD` provide human-friendly naming and current-state tracking
- Packfiles optimize storage through compression and deltas
- Refspecs control fetch/push mappings and enable namespaced workflows
- Transfer protocols
- dumb: simple HTTP reads (rare)
- smart: negotiated pack exchange (common) for fetch/push
- Maintenance/recovery tools
- `gc`, `packed-refs`, `reflog`, `fsck`, `filter-branch`, `prune`
- Environment variables provide control, portability, and deep debugging capabilities