```markmap
# Git Internals (Chapter 8)

## Why this chapter exists / positioning in the book
- Can be read early (curiosity) or late (after learning porcelain)
- Understanding internals helps explain *why* Git behaves as it does
- Tradeoff: powerful insight vs. potential complexity for beginners
- Core premise
  - Git = **content-addressable filesystem** + **VCS user interface** layered on top
- Historical note
  - Early Git (mostly pre-1.5) UI emphasized filesystem concepts → felt complex
  - Modern Git UI refined; early “complex Git” stereotype lingers
- Chapter flow
  - Content-addressable storage layer (objects) first
  - Then transports (protocols)
  - Then maintenance + recovery tasks

## Plumbing and Porcelain
- Porcelain commands (high-level UX)
  - Examples: `checkout`, `branch`, `remote`, …
  - Most of the book focuses on these
- Plumbing commands (low-level toolkit)
  - Designed to be chained (UNIX-style) or used from scripts/tools
  - Used here to expose internals and demonstrate implementation
  - Often not meant for humans to type frequently

## The `.git` directory (what Git stores/manipulates)
- Created by `git init`
- Backups/clones
  - Copying `.git/` elsewhere gives *nearly everything* needed
- Fresh repo typical contents
  - `config`
    - Project-specific configuration
  - `description`
    - Used by GitWeb only
  - `HEAD`
    - Points to current branch (or object in detached HEAD)
  - `hooks/`
    - Client/server hook scripts (covered elsewhere)
  - `info/`
    - Global excludes (patterns you don’t want in `.gitignore`)
  - `objects/`
    - Object database (content store)
  - `refs/`
    - Pointers into commits (branches, tags, remotes, …)
  - `index` (not shown initially)
    - Staging area data (created when needed)
- “Core” pieces emphasized here
  - `objects/`: all stored content
  - `refs/`: names/pointers into commit graph
  - `HEAD`: what’s checked out
  - `index`: staging area snapshot used to build trees/commits

## Git Objects (content-addressable store)
### Concept: a key–value database
- Insert arbitrary data → receive a unique key → retrieve later
- Key is a checksum (SHA-1 in these examples) of:
  - a header + the content (details later)
### Creating a blob object with `git hash-object`
- What it does
  - hashes content
  - optionally writes object into `.git/objects/`
  - returns the object id (40 hex chars = SHA-1)
- Key options
  - `-w`: write object to object database
  - `--stdin`: read content from stdin (otherwise expects a filename)
- Object storage layout on disk (loose objects)
  - Path: `.git/objects/<first 2 chars>/<remaining 38 chars>`
    - Directory name = first 2 chars of SHA-1
    - Filename = remaining 38 chars
- Inspecting an object
  - `git cat-file -p <sha>`: pretty-print content (auto-detect type)
  - `git cat-file -t <sha>`: print object type
- Blob objects
  - store *only content* (no filename)
  - example: versions of `test.txt` stored as different blobs
### Retrieving content
- You can “recreate” a file from a blob by redirecting `cat-file` output
  - `git cat-file -p <sha> > test.txt`
- Limitations of blobs alone
  - Must remember SHA-1 per version
  - No filenames or directory structure

## Tree Objects (filenames + directories + grouping)
### What a tree is
- Stores a directory listing-like structure
- Entries contain
  - mode
  - type (`blob` or `tree`)
  - SHA-1 of target object
  - filename
- Conceptual model (simplified UNIX-like)
  - tree ↔ directory entries
  - blob ↔ file contents
### Inspecting trees
- `git cat-file -p master^{tree}`
  - shows top-level tree for the last commit on `master`
  - example entries include blobs (files) and trees (subdirectories)
- Subtrees
  - a directory entry points to another tree object
- Shell quoting pitfalls for `master^{tree}`
  - Windows CMD: `^` is escape → use `master^^{tree}`
  - PowerShell: quote braces → `git cat-file -p 'master^{tree}'`
  - ZSH: `^` globbing → quote expression → `git cat-file -p "master^{tree}"`
### Building trees manually (via the index)
- Normal Git behavior
  - Creates trees from the staging area (index)
- Plumbing commands used
  - `git update-index`
    - manipulate index entries
    - `--add` required if path not in index yet
    - `--cacheinfo` used when content isn’t in working tree (already in DB)
      - requires: `<mode> <sha> <path>`
    - valid file modes for blobs
      - `100644` normal file
      - `100755` executable
      - `120000` symlink
  - `git write-tree`
    - writes current index to a tree object
  - `git read-tree`
    - reads a tree into index
    - `--prefix=<dir>/` stages it as a subtree
### Example sequence (three trees)
- Tree 1: `test.txt` v1
  - stage blob via `update-index --add --cacheinfo 100644 <sha> test.txt`
  - `write-tree` → tree1 (contains `test.txt` → blob v1)
- Tree 2: `test.txt` v2 + `new.txt`
  - update index to point `test.txt` to blob v2
  - add `new.txt`
  - `write-tree` → tree2 (two file entries)
- Tree 3: include Tree 1 under `bak/`
  - `read-tree --prefix=bak <tree1-sha>`
  - `write-tree` → tree3
  - tree3 contains
    - `bak/` → tree1
    - `new.txt` → blob
    - `test.txt` → blob v2

## Commit Objects (snapshots + history + metadata)
### Why commits exist
- Trees represent snapshots but:
  - SHA-1s are not memorable
  - need who/when/why metadata
  - need parent links to form history
### Creating commits with `git commit-tree`
- Inputs
  - a tree SHA-1 (snapshot)
  - optional parent commit SHA-1(s)
  - message from stdin
- Commit object fields
  - `tree <sha>`
  - `parent <sha>` (none for first commit)
  - `author ...` (from `user.name`, `user.email`, timestamp)
  - `committer ...` (same source)
  - blank line
  - commit message
- Note about hashes in book
  - commit hashes differ due to timestamps/author data; use your own
### Example history
- Commit 1 points to tree1 (no parent)
- Commit 2 points to tree2, parent = commit1
- Commit 3 points to tree3, parent = commit2
- View history
  - `git log --stat <sha>`
- Key takeaway
  - Porcelain `git add`/`git commit` do essentially:
    - write blobs for changed content
    - update index
    - write tree(s)
    - write commit referencing tree + parent

## Object Storage (how objects are actually stored)
### Common storage recipe
- Each object stored as:
  - header + content
- Header format
  - `<type> <size>\0`
    - type: `blob`, `tree`, `commit`, `tag`
    - size: bytes in content
    - null byte terminator
- Object id
  - SHA-1 of (header + content)
- Compression
  - zlib-compressed before writing to disk
### Ruby walk-through (blob example)
- Build content string
- Build header (`"blob #{bytesize}\0"`)
- Concatenate and hash with SHA-1
  - matches `git hash-object` (use `echo -n` to avoid newline)
- Compress with zlib
- Write to `.git/objects/<first 2 chars>/<remaining 38 chars>`
- Validate with `git cat-file -p <sha>`

## Git References (refs): naming commits/objects
### What refs are
- Human-friendly names → files containing SHA-1s
- Stored under `.git/refs/`
  - `refs/heads/`: branches
  - `refs/tags/`: tags
  - (later) `refs/remotes/`: remote-tracking refs
### Creating/updating refs
- Direct edit possible but discouraged
  - `echo <sha> > .git/refs/heads/master`
- Safer: `git update-ref`
  - `git update-ref refs/heads/master <sha>`
- Branch meaning
  - A branch is a ref that points to the tip commit of a line of work
- Example: create branch at older commit
  - `git update-ref refs/heads/test <sha>`
  - `git log test` shows only commits reachable from that ref

## `HEAD`: what you have checked out
### Symbolic reference (usual case)
- `.git/HEAD` commonly contains
  - `ref: refs/heads/<branch>`
- On checkout, Git updates `HEAD` to point at chosen branch ref
- Commit parent determination
  - `git commit` uses commit pointed to by ref that `HEAD` references
### Detached HEAD (special case)
- Sometimes `HEAD` contains a raw SHA-1
- Happens when checking out
  - a tag
  - a commit
  - a remote-tracking branch
### Managing HEAD safely
- `git symbolic-ref HEAD`: read where HEAD points
- `git symbolic-ref HEAD refs/heads/test`: set symbolic HEAD
- Constraint
  - cannot point outside `refs/` namespace

## Tags (lightweight vs annotated)
### Tag object
- Fourth object type: `tag`
- Similar to commit object (tagger/date/message/pointer)
- Usually points to a commit, but can tag any object (blob/tree/commit)
### Lightweight tags
- Just a ref under `refs/tags/` pointing directly to an object
  - `git update-ref refs/tags/v1.0 <sha>`
- Never moves (unlike branch tips)
### Annotated tags
- Create a tag object and a ref that points to it
  - `git tag -a v1.1 -m '...'`
  - `.git/refs/tags/v1.1` contains SHA-1 of the *tag object*
- Tag object content includes
  - `object <sha>`
  - `type <type>`
  - `tag <name>`
  - `tagger ...`
  - message
- Examples mentioned
  - Tagging a maintainer’s GPG key stored as a blob
  - Kernel repo has an early tag pointing at an initial tree

## Remotes (remote-tracking references)
### What they are
- Refs under `refs/remotes/<remote>/...`
- Store last known state of remote branches after communicating
### Example
- After `git remote add origin ...` and `git push origin master`
  - `.git/refs/remotes/origin/master` stores last known remote SHA-1
### Key characteristics
- Read-only from user standpoint
- You can checkout one, but Git won’t set `HEAD` as symbolic ref to it
- They act as bookmarks managed by Git for remote state

## Packfiles (space-efficient object storage)
### Loose objects vs packed objects
- Loose object: one zlib file per object
- Packfile:
  - single `.pack` containing many objects
  - `.idx` index mapping SHA-1 → offsets
### When packing happens
- Automatically when:
  - many loose objects
  - many packfiles
- Manually via `git gc`
- Often during push to a server
### Demonstration scenario (why deltas matter)
- Add large file (`repo.rb`, ~22K) and commit
  - file stored as blob
- Modify it slightly and commit again
  - creates a whole new blob
  - two near-identical large blobs now exist
### `git gc` effects
- Creates pack + index
- Removes many loose objects (reachable ones)
- Leaves dangling/unreachable blobs loose (not in pack)
### Inspecting what’s packed
- `git verify-pack -v <pack>.idx`
  - shows objects, sizes, offsets, delta bases
- Delta storage behavior shown
  - newer version often stored in full
  - older version stored as delta against newer
  - optimized for fast access to most recent version
- Repacking
  - can happen automatically
  - can be triggered any time via `git gc`

## Refspec (ref mapping rules for fetch/push)
### Where it appears
- `.git/config` remote section created by `git remote add`
  - `fetch = +refs/heads/*:refs/remotes/origin/*`
### Syntax
- `(+)?<src>:<dst>`
  - optional `+` forces update even if not fast-forward
  - `<src>`: refs on remote
  - `<dst>`: local tracking refs
### Default fetch behavior
- Fetch all remote branches (`refs/heads/*`)
- Track locally as `refs/remotes/origin/*`
- Equivalent references
  - `origin/master`
  - `remotes/origin/master`
  - `refs/remotes/origin/master`
### Custom fetch examples
- Fetch only master always
  - `fetch = +refs/heads/master:refs/remotes/origin/master`
- One-time fetch to a different local name
  - `git fetch origin master:refs/remotes/origin/mymaster`
- Multiple refspecs
  - CLI or multiple `fetch =` lines in config
- Fast-forward enforcement and overrides
  - non-FF rejected unless `+` used
- Partial globs (Git ≥ 2.6.0)
  - `qa*` patterns for multiple branches
- Namespaces/directories for teams
  - e.g., `refs/heads/qa/*` → `refs/remotes/origin/qa/*`

## Pushing refspecs & deleting remote refs
### Pushing into a namespace
- Push local `master` to remote `qa/master`
  - `git push origin master:refs/heads/qa/master`
- Configure default push mapping
  - `push = refs/heads/master:refs/heads/qa/master`
### Deleting remote references
- Old refspec deletion form
  - `git push origin :topic`
- Newer explicit flag (Git ≥ 1.7.0)
  - `git push origin --delete topic`
### Note/limitation
- Refspecs can’t fetch from one repo and push to another (as a single refspec trick)

## Transfer Protocols (moving data between repositories)
### Two major approaches
- Dumb protocol
  - simple, HTTP read-only, no Git server-side logic
  - inefficient, hard to secure/private; rarely used now
- Smart protocol
  - Git-aware server process
  - negotiates what data is needed
  - supports pushes
### Dumb protocol (HTTP): conceptual clone walkthrough
- `git clone http://server/<repo>.git`
- Fetch refs list (requires server-generated metadata)
  - `GET info/refs`
    - generated by `update-server-info` (often via post-receive hook)
- Fetch HEAD to determine default branch
  - `GET HEAD` → `ref: refs/heads/master`
- Walk objects starting from advertised commit SHA
  - `GET objects/<first 2 chars>/<remaining 38 chars>` for loose objects
  - parse commit → learn `tree` + `parent`
- If tree object not found as loose (404)
  - check alternates
    - `GET objects/info/http-alternates`
  - check available packfiles
    - `GET objects/info/packs`
    - `GET objects/pack/pack-....idx`
    - `GET objects/pack/pack-....pack`
- Once required objects are fetched
  - checkout working tree for branch pointed to by downloaded `HEAD`
### Smart protocol: overview
- Upload (push): `send-pack` (client) ↔ `receive-pack` (server)
- Download (fetch/clone): `fetch-pack` (client) ↔ `upload-pack` (server)
#### Uploading data (push)
- SSH transport
  - client runs remote command (conceptually)
    - `ssh ... "git-receive-pack '<repo>.git'"`
  - server advertises
    - current refs + SHA-1s
    - capabilities appended on the first line after a NUL separator
  - pkt-line framing
    - each chunk begins with 4 hex chars = length (including those 4 chars)
    - `0000` indicates end
  - client sends per-ref updates
    - `<old-sha> <new-sha> <ref>`
    - all zeros on left = create ref
    - all zeros on right = delete ref
  - client sends a packfile of objects server lacks
  - server replies success/failure
    - e.g., `unpack ok`
- HTTP(S) transport
  - discovery
    - `GET .../info/refs?service=git-receive-pack`
  - push
    - `POST .../git-receive-pack` with update commands + packfile
  - note: HTTP may wrap in chunked transfer encoding
#### Downloading data (fetch/clone)
- SSH transport
  - client runs remote command
    - `ssh ... "git-upload-pack '<repo>.git'"`
  - server advertises
    - refs and capabilities
    - `symref=HEAD:refs/heads/master` so client knows default branch
  - negotiation
    - client sends `want <sha>`
    - client sends `have <sha>`
    - client sends `done` to request packfile generation
  - server returns packfile (optionally multiplexing progress via side-band)
- HTTP(S) transport
  - discovery
    - `GET .../info/refs?service=git-upload-pack`
  - negotiation/data request
    - `POST .../git-upload-pack` with want/have data
    - response includes packfile
### Protocols summary note
- Only the high-level handshake is covered
- Many capabilities/features (e.g., `multi_ack`, `side-band`) exist beyond this chapter’s scope

## Maintenance and Data Recovery
### Maintenance (`gc`, packing, pruning)
- Auto maintenance
  - Git may run `auto gc` occasionally
  - Usually no-op unless thresholds exceeded
- What `git gc` does
  - packs loose objects into packfiles
  - consolidates packfiles
  - removes unreachable objects older than a few months
- Trigger thresholds (approx)
  - ~7000 loose objects
  - more than 50 packfiles
- Config knobs
  - `gc.auto`
  - `gc.autopacklimit`
- Manual auto-gc run
  - `git gc --auto` (often does nothing)
### Packing refs into `packed-refs`
- Before gc: refs stored as many small files
  - `.git/refs/heads/*`, `.git/refs/tags/*`, …
- After gc: moved for efficiency into `.git/packed-refs`
  - format lines: `<sha> <refname>`
  - annotated tags include a “peeled” line starting with `^`
    - indicates the commit the tag ultimately points to
- Updating a ref after packing
  - Git writes a new loose ref file under `.git/refs/...`
  - doesn’t edit `packed-refs`
- Lookup behavior
  - Git checks loose refs first, then `packed-refs` fallback
### Data Recovery (finding lost commits)
#### Common loss causes
- force-delete a branch containing work you later want
- `git reset --hard` moving a branch tip back, abandoning newer commits
#### Reflog-based recovery
- Reflog records where `HEAD` pointed whenever it changes
  - commits, branch switches, resets
  - also updated by `git update-ref` (reason to prefer it over manual ref edits)
- Useful commands
  - `git reflog`: concise HEAD history
  - `git log -g`: reflog shown as a log
- Recovery technique
  - find lost commit SHA-1 in reflog
  - create a ref/branch pointing to it
    - `git branch recover-branch <sha>`
#### Recovery without reflog
- If reflog is missing (e.g., `.git/logs/` removed)
- Use integrity checker
  - `git fsck --full`
    - shows dangling/unreachable objects
    - `dangling commit <sha>`
- Recover similarly
  - create a new branch ref pointing to the dangling commit
### Removing objects (purging big files from history)
#### Problem statement
- Git clones fetch full history
- A huge file added once remains in history forever if reachable
  - even if deleted next commit
- Especially painful in imported repos (SVN/Perforce)
#### Strong warning
- Destructive: rewrites commit history (new commit IDs)
- Must coordinate contributors (rebase onto rewritten history)
#### Workflow to locate and remove large objects
- Confirm repo size after packing
  - `git gc`
  - `git count-objects -v` (check `size-pack`)
- Find largest packed objects
  - `git verify-pack -v <pack>.idx | sort -k 3 -n | tail -3`
    - third field in output is object size
- Map blob SHA to filename
  - `git rev-list --objects --all | grep <sha>`
- Identify commits that touched the path
  - `git log --oneline --branches -- <path>`
- Rewrite history to remove the file from every tree
  - `git filter-branch --index-filter 'git rm --ignore-unmatch --cached <path>' -- <commit>^..`
  - `--index-filter` is fast (no full checkout per commit)
  - `git rm --cached` removes from index/tree, not just working dir
- Remove pointers to old history
  - `rm -Rf .git/refs/original`
  - `rm -Rf .git/logs/`
- Repack/clean
  - `git gc`
  - optionally remove remaining loose objects
    - `git prune --expire now`

## Environment Variables (controlling Git behavior)
> Chapter note: not exhaustive; highlights the most useful
### Global behavior
- `GIT_EXEC_PATH`
  - where Git finds sub-programs (e.g., `git-commit`, `git-diff`)
  - inspect via `git --exec-path`
- `HOME`
  - where Git finds global config
  - can be overridden for portable Git setups
- `PREFIX`
  - system-wide config path: `$PREFIX/etc/gitconfig`
- `GIT_CONFIG_NOSYSTEM`
  - disable system-wide config
- Output paging/editing
  - `GIT_PAGER` (fallback `PAGER`)
  - `GIT_EDITOR` (fallback `EDITOR`)
### Repository locations
- `GIT_DIR`
  - where `.git` directory is
  - if unset, Git walks up directory tree searching
- `GIT_CEILING_DIRECTORIES`
  - stops upward search early (useful for slow filesystems)
- `GIT_WORK_TREE`
  - working tree root for non-bare repos
- `GIT_INDEX_FILE`
  - alternate index path
- Object database
  - `GIT_OBJECT_DIRECTORY`: override `.git/objects`
  - `GIT_ALTERNATE_OBJECT_DIRECTORIES`
    - colon-separated additional object stores (share objects across repos)
### Pathspecs (path matching rules)
- Pathspecs used in `.gitignore` and CLI patterns (e.g., `git add *.c`)
- Wildcard behavior toggles
  - `GIT_GLOB_PATHSPECS=1`: wildcards enabled (default)
  - `GIT_NOGLOB_PATHSPECS=1`: wildcards literal (e.g., `*.c` matches file named `*.c`)
- Per-argument overrides
  - prefix with `:(glob)` or `:(literal)`
- `GIT_LITERAL_PATHSPECS`
  - disables wildcard matching and override prefixes
- `GIT_ICASE_PATHSPECS`
  - case-insensitive pathspec matching
### Committing (author/committer identity)
- Used primarily by `git-commit-tree` (then falls back to config)
- Author fields
  - `GIT_AUTHOR_NAME`
  - `GIT_AUTHOR_EMAIL`
  - `GIT_AUTHOR_DATE`
- Committer fields
  - `GIT_COMMITTER_NAME`
  - `GIT_COMMITTER_EMAIL`
  - `GIT_COMMITTER_DATE`
- `EMAIL`
  - fallback email if `user.email` is unset
### Networking (HTTP behavior)
- `GIT_CURL_VERBOSE`
  - emit libcurl debug messages
- `GIT_SSL_NO_VERIFY`
  - skip SSL cert verification (self-signed/setup scenarios)
- Low-speed abort settings
  - `GIT_HTTP_LOW_SPEED_LIMIT`
  - `GIT_HTTP_LOW_SPEED_TIME`
  - override `http.lowSpeedLimit` / `http.lowSpeedTime`
- `GIT_HTTP_USER_AGENT`
  - override user-agent string
### Diffing and merging
- `GIT_DIFF_OPTS`
  - only supports unified context count: `-u<n>` / `--unified=<n>`
- `GIT_EXTERNAL_DIFF`
  - program invoked instead of built-in diff
- Batch diff metadata for external diff tool
  - `GIT_DIFF_PATH_COUNTER`
  - `GIT_DIFF_PATH_TOTAL`
- `GIT_MERGE_VERBOSITY` (recursive merge)
  - 0: only errors
  - 1: conflicts only
  - 2: + file changes (default)
  - 3: + skipped unchanged
  - 4: + all processed paths
  - 5+: deep debug
### Debugging/tracing (observability)
- Output destinations
  - `"true"`, `"1"`, `"2"` → stderr
  - absolute path `/...` → write to file
- `GIT_TRACE`
  - general tracing (alias expansion, sub-program exec)
- `GIT_TRACE_PACK_ACCESS`
  - pack access tracing: packfile + offset
- `GIT_TRACE_PACKET`
  - packet-level tracing for network operations
- `GIT_TRACE_PERFORMANCE`
  - timing for each internal step/subcommand
- `GIT_TRACE_SETUP`
  - shows discovered repo paths (`git_dir`, `worktree`, `cwd`, `prefix`, ...)
### Miscellaneous
- `GIT_SSH`
  - program used instead of `ssh`
  - invoked as: `$GIT_SSH [user@]host [-p <port>] <command>`
  - wrapper script often needed for extra args; `~/.ssh/config` may be easier
- `GIT_ASKPASS`
  - program to prompt for credentials (returns answer on stdout)
- `GIT_NAMESPACE`
  - namespaced refs (like `--namespace`), often server-side
- `GIT_FLUSH`
  - stdout buffering
  - `1` flush frequently; `0` buffer
- `GIT_REFLOG_ACTION`
  - custom text written to reflog entries (action descriptor)

## Summary (what you should now understand)
- Git internals = object database + refs + a UI on top
- Main object types
  - blob (content), tree (directories), commit (history + metadata), tag (named pointer + metadata)
- Refs and `HEAD` provide human-friendly naming and current-state tracking
- Packfiles optimize storage through compression and deltas
- Refspecs control fetch/push mappings and enable namespaced workflows
- Transfer protocols
  - dumb: simple HTTP reads (rare)
  - smart: negotiated pack exchange (common) for fetch/push
- Maintenance/recovery tools
  - `gc`, `packed-refs`, `reflog`, `fsck`, `filter-branch`, `prune`
- Environment variables provide control, portability, and deep debugging capabilities
```
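
The storage recipe in the "Object Storage" branch (header, SHA-1 over header plus content, zlib compression, two-character directory split) can be sketched in Python instead of the Ruby used in the chapter's walk-through. This is a minimal illustration and not Git's own implementation; the function names `git_object_id` and `loose_object` are made up for this sketch, and nothing is written to disk.

```python
import hashlib
import zlib
from typing import Tuple


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 hex digest of '<type> <size>' + NUL + content."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def loose_object(obj_type: str, content: bytes) -> Tuple[str, bytes]:
    """Return (path relative to .git/objects, zlib-compressed bytes).

    The path uses the first 2 hex chars as the directory name and the
    remaining 38 as the filename, as described in the outline.
    """
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    store = header + content
    sha = hashlib.sha1(store).hexdigest()
    return f"{sha[:2]}/{sha[2:]}", zlib.compress(store)


if __name__ == "__main__":
    # Same bytes that `echo 'test content'` would feed to git hash-object.
    path, data = loose_object("blob", b"test content\n")
    print(path, len(data))
```

For the same bytes, `git_object_id("blob", ...)` should agree with `git hash-object --stdin`; remember that plain `echo` appends a newline, which is why the outline suggests `echo -n` when comparing.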
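
The pkt-line framing mentioned under the smart protocol (four hex characters giving the total length including themselves, with `0000` marking the end of a section) can be illustrated as follows. This is a sketch under simplifying assumptions: the helper names `pkt_line` and `read_pkt_lines` are hypothetical, and capabilities, side-band multiplexing, and Git's actual size limits are ignored.

```python
from typing import Iterator, Optional

FLUSH_PKT = b"0000"  # special packet marking the end of a section


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as 4 hex digits."""
    length = len(payload) + 4
    if length > 0xFFFF:
        # Four hex digits cannot express a larger length.
        raise ValueError("payload too large for one pkt-line")
    return f"{length:04x}".encode("ascii") + payload


def read_pkt_lines(stream: bytes) -> Iterator[Optional[bytes]]:
    """Yield each payload in order; yield None for a flush packet."""
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4].decode("ascii"), 16)
        if length == 0:
            yield None
            pos += 4
            continue
        if length < 4:
            raise ValueError("invalid pkt-line length")
        yield stream[pos + 4:pos + length]
        pos += length


if __name__ == "__main__":
    # A toy fragment of a fetch negotiation; the object id is a placeholder.
    message = pkt_line(b"want " + b"0" * 40 + b"\n") + FLUSH_PKT + pkt_line(b"done\n")
    for payload in read_pkt_lines(message):
        print(payload)
```

The `done` line encodes to `0009done\n`: five payload bytes plus the four length characters.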
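
The refspec syntax `(+)?<src>:<dst>` and the way a pattern such as `+refs/heads/*:refs/remotes/origin/*` maps remote branches to local tracking refs can be sketched like this. It is an illustration only, assuming fully qualified ref names and at most one `*` per side; the names `Refspec`, `parse_refspec`, and `map_ref` are hypothetical, and Git's handling of short names such as `master` is not modelled.

```python
from typing import NamedTuple, Optional


class Refspec(NamedTuple):
    force: bool  # True when the refspec starts with '+'
    src: str     # pattern on the source side
    dst: str     # pattern on the destination side


def parse_refspec(spec: str) -> Refspec:
    """Split '(+)?<src>:<dst>' into its three parts."""
    force = spec.startswith("+")
    if force:
        spec = spec[1:]
    src, _, dst = spec.partition(":")
    return Refspec(force, src, dst)


def map_ref(refspec: Refspec, ref: str) -> Optional[str]:
    """Return the destination ref for `ref`, or None if src does not match."""
    if "*" not in refspec.src:
        return refspec.dst if ref == refspec.src else None
    prefix, _, suffix = refspec.src.partition("*")
    if len(ref) < len(prefix) + len(suffix):
        return None
    if not (ref.startswith(prefix) and ref.endswith(suffix)):
        return None
    # The part of the name matched by '*' is substituted into dst.
    matched = ref[len(prefix):len(ref) - len(suffix)]
    return refspec.dst.replace("*", matched, 1)


if __name__ == "__main__":
    default_fetch = parse_refspec("+refs/heads/*:refs/remotes/origin/*")
    print(map_ref(default_fetch, "refs/heads/master"))
```

With this model, the deletion form `:topic` parses to an empty `src`, which matches the outline's description of pushing "nothing" onto a remote ref.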
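
The ref lookup rules from the `HEAD` and `packed-refs` branches (symbolic versus detached `HEAD`, loose refs checked before `packed-refs`, peeled `^` lines for annotated tags) can be combined into one small model. This is a sketch that works on in-memory strings and dictionaries rather than a real `.git` directory; the helper names and the all-`a`/`b`/`c` object ids are placeholders for illustration.

```python
from typing import Dict, Optional, Tuple


def parse_packed_refs(text: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Parse packed-refs content into (ref -> sha, ref -> peeled sha)."""
    refs: Dict[str, str] = {}
    peeled: Dict[str, str] = {}
    last_ref: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or header comment
        if line.startswith("^"):
            # Peeled line: the commit the preceding annotated tag points to.
            if last_ref is not None:
                peeled[last_ref] = line[1:]
            continue
        sha, name = line.split(" ", 1)
        refs[name] = sha
        last_ref = name
    return refs, peeled


def read_head(head_text: str) -> Tuple[str, str]:
    """Return ('symbolic', refname) or ('detached', sha)."""
    head_text = head_text.strip()
    if head_text.startswith("ref: "):
        return "symbolic", head_text[len("ref: "):]
    return "detached", head_text


def resolve_ref(name: str, loose: Dict[str, str], packed: Dict[str, str]) -> Optional[str]:
    """Loose refs win; packed-refs is only the fallback."""
    if name in loose:
        return loose[name]
    return packed.get(name)


if __name__ == "__main__":
    sample = (
        "# pack-refs with: peeled\n"
        + "a" * 40 + " refs/heads/master\n"
        + "b" * 40 + " refs/tags/v1.1\n"
        + "^" + "c" * 40 + "\n"
    )
    packed, peeled = parse_packed_refs(sample)
    kind, target = read_head("ref: refs/heads/master\n")
    print(kind, target, resolve_ref(target, {}, packed))
```

Keeping loose refs ahead of `packed-refs` in `resolve_ref` mirrors the outline's point that updating a packed ref writes a new loose file instead of editing `packed-refs`.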