# Git Internals (Chapter 8)
## Why this chapter exists / positioning in the book
- Can be read early (curiosity) or late (after learning porcelain)
- Understanding internals helps explain *why* Git behaves as it does
- Tradeoff: powerful insight vs. potential complexity for beginners
- Core premise
  - Git = **content-addressable filesystem** + **VCS user interface** layered on top
- Historical note
  - Early Git (mostly pre-1.5) UI emphasized filesystem concepts → felt complex
  - Modern Git UI refined; early “complex Git” stereotype lingers
- Chapter flow
  - Content-addressable storage layer (objects) first
  - Then transports (protocols)
  - Then maintenance + recovery tasks

## Plumbing and Porcelain
- Porcelain commands (high-level UX)
  - Examples: `checkout`, `branch`, `remote`, …
  - Most of the book focuses on these
- Plumbing commands (low-level toolkit)
  - Designed to be chained (UNIX-style) or used from scripts/tools
  - Used here to expose internals and demonstrate implementation
  - Often not meant for humans to type frequently

## The `.git` directory (what Git stores/manipulates)
- Created by `git init`
- Backups/clones
  - Copying `.git/` elsewhere gives *nearly everything* needed
- Fresh repo typical contents
  - `config`
    - Project-specific configuration
  - `description`
    - Used by GitWeb only
  - `HEAD`
    - Points to current branch (or object in detached HEAD)
  - `hooks/`
    - Client/server hook scripts (covered elsewhere)
  - `info/`
    - Global excludes (patterns you don't want to track in `.gitignore`)
  - `objects/`
    - Object database (content store)
  - `refs/`
    - Pointers into commits (branches, tags, remotes, …)
  - `index` (not shown initially)
    - Staging area data (created when needed)
- “Core” pieces emphasized here
  - `objects/`: all stored content
  - `refs/`: names/pointers into commit graph
  - `HEAD`: what's checked out
  - `index`: staging area snapshot used to build trees/commits

## Git Objects (content-addressable store)
### Concept: a key-value database
- Insert arbitrary data → receive a unique key → retrieve later
- Key is a checksum (SHA-1 in these examples) of:
  - a header + the content (details later)

### Creating a blob object with `git hash-object`
- What it does
  - hashes content
  - optionally writes object into `.git/objects/`
  - returns the object id (40 hex chars = SHA-1)
- Key options
  - `-w` — write object to object database
  - `--stdin` — read content from stdin (otherwise expects a filename)
- Object storage layout on disk (loose objects)
  - Path: `.git/objects/<first2>/<remaining38>`
  - Directory name = first 2 chars of SHA-1
  - Filename = remaining 38 chars
- Inspecting an object
  - `git cat-file -p <sha>` — pretty-print content (auto-detect type)
  - `git cat-file -t <sha>` — print object type
- Blob objects
  - store *only content* (no filename)
  - example: versions of `test.txt` stored as different blobs

### Retrieving content
- You can “recreate” a file from a blob by redirecting `cat-file` output
  - `git cat-file -p <sha> > test.txt`
- Limitations of blobs alone
  - Must remember SHA-1 per version
  - No filenames or directory structure

## Tree Objects (filenames + directories + grouping)
### What a tree is
- Stores a directory listing-like structure
- Entries contain
  - mode
  - type (`blob` or `tree`)
  - SHA-1 of target object
  - filename
- Conceptual model (simplified UNIX-like)
  - tree ↔ directory entries
  - blob ↔ file contents

### Inspecting trees
- `git cat-file -p master^{tree}`
  - shows top-level tree for the last commit on `master`
  - example entries include blobs (files) and trees (subdirectories)
- Subtrees
  - a directory entry points to another tree object
- Shell quoting pitfalls for `master^{tree}`
  - Windows CMD: `^` is escape → use `master^^{tree}`
  - PowerShell: quote braces → `git cat-file -p 'master^{tree}'`
  - ZSH: `^` globbing → quote expression → `git cat-file -p "master^{tree}"`

### Building trees manually (via the index)
- Normal Git behavior
  - Creates trees from the staging area (index)
- Plumbing commands used
  - `git update-index`
    - manipulate index entries
    - `--add` required if path not in index yet
    - `--cacheinfo` used when content isn't in working tree (already in DB)
    - requires: `<mode> <sha> <path>`
    - valid file modes for blobs
      - `100644` normal file
      - `100755` executable
      - `120000` symlink
  - `git write-tree`
    - writes current index to a tree object
  - `git read-tree`
    - reads a tree into index
    - `--prefix=<dir>/` stages it as a subtree

### Example sequence (three trees)
- Tree 1: `test.txt` v1
  - stage blob via `update-index --add --cacheinfo 100644 <sha_v1> test.txt`
  - `write-tree` → tree1 (contains `test.txt` → blob v1)
- Tree 2: `test.txt` v2 + `new.txt`
  - update index to point `test.txt` to blob v2
  - add `new.txt`
  - `write-tree` → tree2 (two file entries)
- Tree 3: include Tree 1 under `bak/`
  - `read-tree --prefix=bak <tree1>`
  - `write-tree` → tree3
  - tree3 contains
    - `bak/` → tree1
    - `new.txt` → blob
    - `test.txt` → blob v2
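
The three-tree sequence can be modeled in a few lines of Python. This is a sketch under assumptions, not what `git write-tree` runs internally: the file contents (`version 1`, `version 2`, `new file`) are stand-ins, the helper names are hypothetical, and the binary entry layout (`<mode> <name>\0` followed by the 20 raw SHA-1 bytes, sorted by name) is how Git serializes trees as far as I know, which is more detail than the outline itself gives.

```python
import hashlib


def object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 of '<type> <size>\\0' + content."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def tree_content(entries):
    """Serialize (mode, name, sha_hex) entries in tree-object form."""
    def sort_key(entry):
        mode, name, _ = entry
        # Subtrees sort as if their name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    out = b""
    for mode, name, sha_hex in sorted(entries, key=sort_key):
        out += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(sha_hex)
    return out


def write_tree(entries) -> str:
    return object_id("tree", tree_content(entries))


blob_v1 = object_id("blob", b"version 1\n")
blob_v2 = object_id("blob", b"version 2\n")
blob_new = object_id("blob", b"new file\n")

tree1 = write_tree([("100644", "test.txt", blob_v1)])
tree2 = write_tree([("100644", "test.txt", blob_v2), ("100644", "new.txt", blob_new)])
# Stands in for `read-tree --prefix=bak <tree1>` followed by `write-tree`.
tree3_entries = [
    ("40000", "bak", tree1),
    ("100644", "new.txt", blob_new),
    ("100644", "test.txt", blob_v2),
]
tree3 = write_tree(tree3_entries)
print(tree1, tree2, tree3, sep="\n")
```

Note that `bak/` costs nothing extra: tree3 simply stores the 20-byte id of the already existing tree1.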

## Commit Objects (snapshots + history + metadata)
### Why commits exist
- Trees represent snapshots but:
  - SHA-1s are not memorable
  - need who/when/why metadata
  - need parent links to form history

### Creating commits with `git commit-tree`
- Inputs
  - a tree SHA-1 (snapshot)
  - optional parent commit SHA-1(s)
  - message from stdin
- Commit object fields
  - `tree <tree_sha>`
  - `parent <parent_sha>` (none for first commit)
  - `author ...` (from `user.name`, `user.email`, timestamp)
  - `committer ...` (same source)
  - blank line
  - commit message
- Note about hashes in book
  - commit hashes differ due to timestamps/author data; use your own

### Example history
- Commit 1 points to tree1 (no parent)
- Commit 2 points to tree2, parent = commit1
- Commit 3 points to tree3, parent = commit2
- View history
  - `git log --stat <commit3_sha>`
- Key takeaway
  - Porcelain `git add`/`git commit` do essentially:
    - write blobs for changed content
    - update index
    - write tree(s)
    - write commit referencing tree + parent
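
As a rough illustration of the commit fields and parent links described above, the following Python sketch builds three chained commit objects in memory and walks them the way `git log` follows parents. The identity string, the in-memory `objects` dictionary, and the use of the empty tree for every commit are simplifying assumptions, not part of the original example.

```python
import hashlib

# Hypothetical identity and timestamp; real values come from config or environment.
IDENT = "A U Thor <author@example.com> 1243040974 -0700"
EMPTY_TREE = hashlib.sha1(b"tree 0\x00").hexdigest()

objects = {}  # commit id -> raw commit content (stand-in for .git/objects)


def commit_tree(tree: str, message: str, parents=()) -> str:
    """Assemble a commit object and return its id."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {IDENT}", f"committer {IDENT}", "", message, ""]
    content = "\n".join(lines).encode()
    sha = hashlib.sha1(b"commit %d\x00" % len(content) + content).hexdigest()
    objects[sha] = content
    return sha


def log(sha):
    """Follow first parents from `sha` back to the root commit."""
    history = []
    while sha is not None:
        headers, _, message = objects[sha].decode().partition("\n\n")
        history.append((sha, message.strip()))
        parents = [line.split(" ", 1)[1]
                   for line in headers.splitlines() if line.startswith("parent ")]
        sha = parents[0] if parents else None
    return history


c1 = commit_tree(EMPTY_TREE, "First commit")
c2 = commit_tree(EMPTY_TREE, "Second commit", parents=[c1])
c3 = commit_tree(EMPTY_TREE, "Third commit", parents=[c2])
for sha, subject in log(c3):
    print(sha[:7], subject)
```

Changing the timestamp or identity changes every id downstream, which is why commit hashes never match between two people running the same steps.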

## Object Storage (how objects are actually stored)
### Common storage recipe
- Each object stored as:
  - header + content
- Header format
  - `<type> <size>\0`
    - type: `blob`, `tree`, `commit`, `tag`
    - size: bytes in content
    - null byte terminator
- Object id
  - SHA-1 of (header + content)
- Compression
  - zlib-compressed before writing to disk

### Ruby walk-through (blob example)
- Build content string
- Build header (`"blob #{bytesize}\0"`)
- Concatenate and hash with SHA-1
  - matches `git hash-object` (use `echo -n` to avoid newline)
- Compress with zlib
- Write to `.git/objects/<sha[0,2]>/<sha[2,38]>`
- Validate with `git cat-file -p <sha>`
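
For readers who prefer Python to Ruby, here is a comparable sketch of the same steps. It writes into a temporary directory standing in for `.git`, so nothing touches a real repository; `write_loose_object` and `read_loose_object` are illustrative names, not Git APIs.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(git_dir: str, obj_type: str, content: bytes) -> str:
    """Build header + content, hash it, zlib-compress it, write it to disk."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    store = header + content
    sha = hashlib.sha1(store).hexdigest()
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return sha


def read_loose_object(git_dir: str, sha: str):
    """Inverse operation: decompress, split off the header, return (type, content)."""
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


git_dir = tempfile.mkdtemp()  # stand-in for a real .git directory
sha = write_loose_object(git_dir, "blob", b"what is up, doc?")
print(sha)
print(read_loose_object(git_dir, sha))
```

Pointing `git_dir` at a real repository's `.git` should produce an object that `git cat-file -p <sha>` can read, though that is worth trying on a throwaway repository first.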

## Git References (refs) — naming commits/objects
### What refs are
- Human-friendly names → files containing SHA-1s
- Stored under `.git/refs/`
  - `refs/heads/` — branches
  - `refs/tags/` — tags
  - (later) `refs/remotes/` — remote-tracking refs

### Creating/updating refs
- Direct edit possible but discouraged
  - `echo <sha> > .git/refs/heads/master`
- Safer: `git update-ref`
  - `git update-ref refs/heads/master <sha>`
- Branch meaning
  - A branch is a ref that points to the tip commit of a line of work
- Example: create branch at older commit
  - `git update-ref refs/heads/test <sha_of_commit2>`
  - `git log test` shows only commits reachable from that ref

## `HEAD` — what you have checked out
### Symbolic reference (usual case)
- `.git/HEAD` commonly contains
  - `ref: refs/heads/<branch>`
- On checkout, Git updates `HEAD` to point at chosen branch ref
- Commit parent determination
  - `git commit` uses commit pointed to by ref that `HEAD` references

### Detached HEAD (special case)
- Sometimes `HEAD` contains a raw SHA-1
- Happens when checking out
  - a tag
  - a commit
  - a remote-tracking branch

### Managing HEAD safely
- `git symbolic-ref HEAD` — read where HEAD points
- `git symbolic-ref HEAD refs/heads/test` — set symbolic HEAD
- Constraint
  - cannot point outside `refs/` namespace
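
The two forms of `HEAD` can be told apart by reading the file. The sketch below is a simplified model that works on a temporary directory, ignores `packed-refs`, and uses hypothetical helper names; the sample SHA-1 is only a placeholder value.

```python
import os
import tempfile


def read_head(git_dir: str):
    """Return ('symbolic', refname) or ('detached', sha)."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        value = f.read().strip()
    if value.startswith("ref: "):
        return "symbolic", value[len("ref: "):]
    return "detached", value


def set_symbolic_head(git_dir: str, ref: str) -> None:
    """Mimic the constraint that HEAD must stay inside refs/."""
    if not ref.startswith("refs/"):
        raise ValueError("refusing to point HEAD outside of refs/")
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write(f"ref: {ref}\n")


def resolve_head(git_dir: str):
    """Return the commit id HEAD leads to, or None if the branch has no commits."""
    kind, value = read_head(git_dir)
    if kind == "detached":
        return value
    ref_file = os.path.join(git_dir, *value.split("/"))
    if not os.path.isfile(ref_file):
        return None
    with open(ref_file) as f:
        return f.read().strip()


git_dir = tempfile.mkdtemp()  # stand-in for a real .git directory
os.makedirs(os.path.join(git_dir, "refs", "heads"))
sample = "1a410efbd13591db07496601ebc7a059dd55cfe9"  # placeholder commit id
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
    f.write(sample + "\n")

set_symbolic_head(git_dir, "refs/heads/master")
print(read_head(git_dir), resolve_head(git_dir))
```

In a real repository, prefer `git symbolic-ref` over writing the file directly, for the same reasons `git update-ref` is preferred for branch refs.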

## Tags (lightweight vs annotated)
### Tag object
- Fourth object type: `tag`
- Similar to commit object (tagger/date/message/pointer)
- Usually points to a commit, but can tag any object (blob/tree/commit)

### Lightweight tags
- Just a ref under `refs/tags/` pointing directly to an object
  - `git update-ref refs/tags/v1.0 <commit_sha>`
- Never moves (unlike branch tips)

### Annotated tags
- Create a tag object and a ref that points to it
  - `git tag -a v1.1 <commit_sha> -m '...'`
- `.git/refs/tags/v1.1` contains SHA-1 of the *tag object*
- Tag object content includes
  - `object <target_sha>`
  - `type <target_type>`
  - `tag <name>`
  - `tagger ...`
  - message
- Examples mentioned
  - Tagging a maintainer's GPG key stored as a blob
  - Kernel repo has an early tag pointing at an initial tree
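
The difference between the two tag kinds is easiest to see side by side. In this sketch (an in-memory model with hypothetical names, a placeholder commit body, and a made-up tagger identity), a lightweight tag is a ref holding the commit id directly, while an annotated tag is a ref holding the id of a separate `tag` object that must be "peeled" to reach the commit.

```python
import hashlib

objects = {}  # object id -> (type, content); stand-in for the object database


def store(obj_type: str, content: bytes) -> str:
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    sha = hashlib.sha1(header + content).hexdigest()
    objects[sha] = (obj_type, content)
    return sha


def make_tag(target: str, name: str, tagger: str, message: str) -> str:
    """Create a tag object with the fields listed above and return its id."""
    target_type = objects[target][0]
    content = (f"object {target}\ntype {target_type}\ntag {name}\n"
               f"tagger {tagger}\n\n{message}\n")
    return store("tag", content.encode())


def peel(sha: str) -> str:
    """Follow tag objects until a non-tag object is reached."""
    while objects[sha][0] == "tag":
        first_line = objects[sha][1].decode().splitlines()[0]
        sha = first_line.split(" ", 1)[1]
    return sha


# Placeholder body: not a valid commit, only here to have something to tag.
commit = store("commit", b"placeholder commit content\n")
refs = {"refs/tags/v1.0": commit}  # lightweight: ref -> commit
refs["refs/tags/v1.1"] = make_tag(  # annotated: ref -> tag object -> commit
    commit, "v1.1", "A U Thor <author@example.com> 1243040974 -0700", "Test tag")

for name, sha in refs.items():
    print(name, objects[sha][0], peel(sha) == commit)
```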

## Remotes (remote-tracking references)
### What they are
- Refs under `refs/remotes/<remote>/...`
- Store last known state of remote branches after communicating

### Example
- After `git remote add origin ...` and `git push origin master`
  - `.git/refs/remotes/origin/master` stores last known remote SHA-1

### Key characteristics
- Read-only from user standpoint
- You can check one out, but Git won't set `HEAD` as a symbolic ref to it
- They act as bookmarks managed by Git for remote state

## Packfiles (space-efficient object storage)
### Loose objects vs packed objects
- Loose object: one zlib file per object
- Packfile:
  - single `.pack` containing many objects
  - `.idx` index mapping SHA-1 → offsets

### When packing happens
- Automatically when:
  - many loose objects
  - many packfiles
- Manually via `git gc`
- Often during push to a server

### Demonstration scenario (why deltas matter)
- Add large file (`repo.rb`, ~22K) and commit
  - file stored as blob
- Modify it slightly and commit again
  - creates a whole new blob
  - two near-identical large blobs now exist

### `git gc` effects
- Creates pack + index
- Removes many loose objects (reachable ones)
- Leaves dangling/unreachable blobs loose (not in pack)

### Inspecting what's packed
- `git verify-pack -v <pack>.idx`
  - shows objects, sizes, offsets, delta bases
- Delta storage behavior shown
  - newer version often stored in full
  - older version stored as delta against newer
  - optimized for fast access to most recent version
- Repacking
  - can happen automatically
  - can be triggered any time via `git gc`

## Refspec (ref mapping rules for fetch/push)
### Where it appears
- `.git/config` remote section created by `git remote add`
  - `fetch = +refs/heads/*:refs/remotes/origin/*`

### Syntax
- `(+)?<src>:<dst>`
  - optional `+` forces update even if not fast-forward
  - `<src>`: refs on remote
  - `<dst>`: local tracking refs

### Default fetch behavior
- Fetch all remote branches (`refs/heads/*`)
- Track locally as `refs/remotes/origin/*`
- Equivalent references
  - `origin/master`
  - `remotes/origin/master`
  - `refs/remotes/origin/master`

### Custom fetch examples
- Fetch only master always
  - `fetch = +refs/heads/master:refs/remotes/origin/master`
- One-time fetch to a different local name
  - `git fetch origin master:refs/remotes/origin/mymaster`
- Multiple refspecs
  - CLI or multiple `fetch =` lines in config
- Fast-forward enforcement and overrides
  - non-FF rejected unless `+` used
- Partial globs (Git ≥ 2.6.0)
  - `qa*` patterns for multiple branches
- Namespaces/directories for teams
  - e.g., `refs/heads/qa/*` → `refs/remotes/origin/qa/*`
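
To make the mapping rules concrete, here is a small Python sketch of how a fetch refspec could translate a remote ref name into a local one. It is an approximation for illustration only: `parse_refspec` and `map_ref` are hypothetical names, only full ref names with at most one `*` are handled, and shorthand forms such as `master:...` are not expanded.

```python
from typing import NamedTuple, Optional


class Refspec(NamedTuple):
    force: bool  # leading "+": update even if not a fast-forward
    src: str     # pattern matched against refs on the remote
    dst: str     # where matching refs are written locally


def parse_refspec(text: str) -> Refspec:
    force = text.startswith("+")
    if force:
        text = text[1:]
    src, _, dst = text.partition(":")
    return Refspec(force, src, dst)


def map_ref(spec: Refspec, ref: str) -> Optional[str]:
    """Return the local name for `ref`, or None if the refspec does not match."""
    if "*" not in spec.src:
        return spec.dst if ref == spec.src else None
    prefix, suffix = spec.src.split("*", 1)
    if len(ref) < len(prefix) + len(suffix):
        return None
    if not (ref.startswith(prefix) and ref.endswith(suffix)):
        return None
    matched = ref[len(prefix):len(ref) - len(suffix)]
    return spec.dst.replace("*", matched, 1)


default = parse_refspec("+refs/heads/*:refs/remotes/origin/*")
qa_only = parse_refspec("+refs/heads/qa*:refs/remotes/origin/qa*")

for ref in ["refs/heads/master", "refs/heads/qa-1", "refs/tags/v1.0"]:
    print(ref, "->", map_ref(default, ref), "|", map_ref(qa_only, ref))
```

With several `fetch =` lines, each ref would be tried against every refspec in turn; that loop is left out here.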

## Pushing refspecs & deleting remote refs
### Pushing into a namespace
- Push local `master` to remote `qa/master`
  - `git push origin master:refs/heads/qa/master`
- Configure default push mapping
  - `push = refs/heads/master:refs/heads/qa/master`

### Deleting remote references
- Old refspec deletion form
  - `git push origin :topic`
- Newer explicit flag (Git ≥ 1.7.0)
  - `git push origin --delete topic`

### Note/limitation
- Refspecs can't fetch from one repo and push to another (as a single refspec trick)

## Transfer Protocols (moving data between repositories)
### Two major approaches
- Dumb protocol
  - simple, HTTP read-only, no Git server-side logic
  - inefficient, hard to secure/private; rarely used now
- Smart protocol
  - Git-aware server process
  - negotiates what data is needed
  - supports pushes

### Dumb protocol (HTTP) — conceptual clone walkthrough
- `git clone http://server/<repo>.git`
- Fetch refs list (requires server-generated metadata)
  - `GET info/refs`
  - generated by `update-server-info` (often via post-receive hook)
- Fetch HEAD to determine default branch
  - `GET HEAD` → `ref: refs/heads/master`
- Walk objects starting from advertised commit SHA
  - `GET objects/<sha_prefix>/<sha_rest>` for loose objects
  - parse commit → learn `tree` + `parent`
- If tree object not found as loose (404)
  - check alternates
    - `GET objects/info/http-alternates`
  - check available packfiles
    - `GET objects/info/packs`
    - `GET objects/pack/pack-....idx`
    - `GET objects/pack/pack-....pack`
- Once required objects are fetched
  - checkout working tree for branch pointed to by downloaded `HEAD`
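
The first steps of the walkthrough can be sketched without any network access. The code below assumes the responses to `GET info/refs` and `GET HEAD` are already in hand (the sample SHA-1 is a placeholder) and derives the path of the next request; the function names are invented for this example.

```python
def parse_info_refs(text: str) -> dict:
    """Parse '<sha>\\t<refname>' lines as served from info/refs."""
    refs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        sha, _, name = line.partition("\t")
        refs[name.strip()] = sha
    return refs


def parse_head(text: str):
    """Return the branch ref HEAD names, or None for a detached HEAD."""
    text = text.strip()
    return text[len("ref: "):] if text.startswith("ref: ") else None


def loose_object_path(sha: str) -> str:
    """Relative URL path of a loose object on a dumb HTTP server."""
    return f"objects/{sha[:2]}/{sha[2:]}"


# Placeholder responses, as if already downloaded from the server.
info_refs = "ca82a6dff817ec66f44342007202690a93763949\trefs/heads/master\n"
head = "ref: refs/heads/master\n"

refs = parse_info_refs(info_refs)
branch = parse_head(head)
next_request = loose_object_path(refs[branch])
print("GET", next_request)
```

From here the client would decompress the commit, read its `tree` and `parent` lines, and repeat the same path computation for each id, falling back to alternates and packfiles on a 404.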

### Smart protocol — overview
- Upload (push): `send-pack` (client) ↔ `receive-pack` (server)
- Download (fetch/clone): `fetch-pack` (client) ↔ `upload-pack` (server)

#### Uploading data (push)
- SSH transport
  - client runs remote command (conceptually)
    - `ssh ... "git-receive-pack '<repo>.git'"`
  - server advertises
    - current refs + SHA-1s
    - capabilities appended on the first line after a NUL separator
  - pkt-line framing
    - each chunk begins with 4 hex chars = length (including those 4 chars)
    - `0000` indicates end
  - client sends per-ref updates
    - `<old_sha> <new_sha> <refname>`
    - all zeros on left = create ref
    - all zeros on right = delete ref
  - client sends a packfile of objects server lacks
  - server replies success/failure
    - e.g., `unpack ok`
- HTTP(S) transport
  - discovery
    - `GET .../info/refs?service=git-receive-pack`
  - push
    - `POST .../git-receive-pack` with update commands + packfile
  - note: HTTP may wrap in chunked transfer encoding
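
The pkt-line framing and the per-ref update lines are simple enough to sketch in Python. This is an illustrative encoder, not a working push client: it omits the capability list and the packfile that would follow, and the helper names and sample ids are placeholders.

```python
ZERO_ID = "0" * 40
FLUSH = b"0000"


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its length as 4 hex digits (length includes the prefix)."""
    return b"%04x" % (len(payload) + 4) + payload


def classify(old: str, new: str) -> str:
    """Interpret the all-zero ids used in update commands."""
    if old == ZERO_ID:
        return "create"
    if new == ZERO_ID:
        return "delete"
    return "update"


def update_commands(updates) -> bytes:
    """Encode (old, new, refname) tuples as pkt-lines followed by a flush."""
    out = b""
    for old, new, ref in updates:
        out += pkt_line(f"{old} {new} {ref}\n".encode())
    return out + FLUSH


updates = [
    ("ca82a6dff817ec66f44342007202690a93763949",
     "15027957951b64cf874c3557a0f3547bd83b3ff6", "refs/heads/master"),
    (ZERO_ID, "cdfdb42577e2506715f51cfc7e5d16bd4e3cf3ce", "refs/heads/experiment"),
]
for old, new, ref in updates:
    print(classify(old, new), ref)
print(update_commands(updates))
```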

#### Downloading data (fetch/clone)
- SSH transport
  - client runs remote command
    - `ssh ... "git-upload-pack '<repo>.git'"`
  - server advertises
    - refs and capabilities
    - `symref=HEAD:refs/heads/master` so client knows default branch
  - negotiation
    - client sends `want <sha>`
    - client sends `have <sha>`
    - client sends `done` to request packfile generation
  - server returns packfile (optionally multiplexing progress via side-band)
- HTTP(S) transport
  - discovery
    - `GET .../info/refs?service=git-upload-pack`
  - negotiation/data request
    - `POST .../git-upload-pack` with want/have data
  - response includes packfile
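
The fetch side can be sketched the same way: decode the server's ref advertisement, pick out the `symref` capability, and encode a `want`/`have`/`done` request. Everything here is a simplified model under assumptions: the advertisement is built locally rather than read from a server, the capability list is a made-up sample, the HTTP service prelude is ignored, and a flush is placed between wants and haves as I understand the pack protocol to require.

```python
FLUSH = b"0000"


def pkt_line(payload: bytes) -> bytes:
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(stream: bytes):
    """Split a byte stream into payloads; a flush packet becomes None."""
    lines, pos = [], 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:
            lines.append(None)
            pos += 4
        else:
            lines.append(stream[pos + 4:pos + length])
            pos += length
    return lines


def parse_advertisement(stream: bytes):
    """Return ({refname: sha}, [capabilities]) from a ref advertisement."""
    refs, capabilities = {}, []
    for i, payload in enumerate(read_pkt_lines(stream)):
        if payload is None:
            break
        line = payload.rstrip(b"\n")
        if i == 0 and b"\x00" in line:
            line, caps = line.split(b"\x00", 1)
            capabilities = caps.decode().split()
        sha, name = line.decode().split(" ", 1)
        refs[name] = sha
    return refs, capabilities


def default_branch(capabilities):
    for cap in capabilities:
        if cap.startswith("symref=HEAD:"):
            return cap[len("symref=HEAD:"):]
    return None


def negotiation_request(wants, haves) -> bytes:
    out = b"".join(pkt_line(f"want {sha}\n".encode()) for sha in wants)
    out += FLUSH
    out += b"".join(pkt_line(f"have {sha}\n".encode()) for sha in haves)
    return out + pkt_line(b"done\n")


TIP = "ca82a6dff817ec66f44342007202690a93763949"  # placeholder id
advert = (
    pkt_line(f"{TIP} HEAD\0side-band symref=HEAD:refs/heads/master\n".encode())
    + pkt_line(f"{TIP} refs/heads/master\n".encode())
    + FLUSH
)
refs, caps = parse_advertisement(advert)
print(refs, default_branch(caps))
print(negotiation_request([refs["HEAD"]], []))
```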

### Protocols summary note
- Only the high-level handshake is covered
- Many capabilities/features (e.g., `multi_ack`, `side-band`) exist beyond this chapter's scope

## Maintenance and Data Recovery
### Maintenance (`gc`, packing, pruning)
- Auto maintenance
  - Git may run `auto gc` occasionally
  - Usually no-op unless thresholds exceeded
- What `git gc` does
  - packs loose objects into packfiles
  - consolidates packfiles
  - removes unreachable objects older than a few months
- Trigger thresholds (approx)
  - ~7000 loose objects
  - >50 packfiles
- Config knobs
  - `gc.auto` (loose-object threshold; default 6700)
  - `gc.autopacklimit` (packfile threshold; default 50)
- Manual auto-gc run
  - `git gc --auto` (often does nothing)

### Packing refs into `packed-refs`
- Before gc: refs stored as many small files
  - `.git/refs/heads/*`, `.git/refs/tags/*`, …
- After gc: moved for efficiency into `.git/packed-refs`
  - format lines: `<sha> <refname>`
  - annotated tags include a “peeled” line starting with `^`
    - indicates the commit the tag ultimately points to
- Updating a ref after packing
  - Git writes a new loose ref file under `.git/refs/...`
  - doesn't edit `packed-refs`
- Lookup behavior
  - Git checks loose refs first, then `packed-refs` fallback
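
A short Python sketch of the format and lookup order described above. The sample file content and ids are placeholders, and `parse_packed_refs` and `lookup` are hypothetical names; loose refs are modeled as a plain dictionary rather than files.

```python
def parse_packed_refs(text: str):
    """Return ({refname: sha}, {refname: peeled_sha}) from packed-refs content."""
    refs, peeled = {}, {}
    last_name = None
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # header comment or blank line
        if line.startswith("^"):
            # Peeled line: belongs to the annotated tag on the previous line.
            if last_name is not None:
                peeled[last_name] = line[1:]
            continue
        sha, _, name = line.partition(" ")
        refs[name] = sha
        last_name = name
    return refs, peeled


def lookup(loose: dict, packed: dict, name: str):
    """Loose refs win; packed-refs is only a fallback."""
    return loose.get(name, packed.get(name))


# Placeholder content in the shape of a .git/packed-refs file.
PACKED = """\
# pack-refs with: peeled fully-peeled
cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment
ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master
9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1
^1a410efbd13591db07496601ebc7a059dd55cfe9
"""

packed, peeled = parse_packed_refs(PACKED)
# A commit made after packing shows up as a new loose ref file.
loose = {"refs/heads/master": "484a59275031909e19aadb7c92262719cfcdf19a"}

print(lookup(loose, packed, "refs/heads/master"))      # loose value
print(lookup(loose, packed, "refs/heads/experiment"))  # packed fallback
print(peeled["refs/tags/v1.1"])
```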

### Data Recovery (finding lost commits)
#### Common loss causes
- force-delete a branch containing work you later want
- `git reset --hard` moving a branch tip back, abandoning newer commits

#### Reflog-based recovery
- Reflog records where `HEAD` pointed whenever it changes
  - commits, branch switches, resets
  - also updated by `git update-ref` (reason to prefer it over manual ref edits)
- Useful commands
  - `git reflog` — concise HEAD history
  - `git log -g` — reflog shown as a log
- Recovery technique
  - find lost commit SHA-1 in reflog
  - create a ref/branch pointing to it
    - `git branch recover-branch <sha>`
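
To show what the reflog holds, the sketch below parses lines in the shape stored under `.git/logs/HEAD` (old id, new id, identity, timestamp, then a tab and the message) and lists ids that `HEAD` once pointed at but no longer does. The ids, identity, and messages are invented, and the candidate list is only a starting point for inspection, since some of those commits may still be reachable.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one reflog line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    ident, timestamp, tz = rest.rsplit(" ", 2)
    return {"old": old, "new": new, "ident": ident,
            "time": int(timestamp), "tz": tz, "message": message}


def recovery_candidates(entries):
    """Ids HEAD pointed at earlier (newest first) that differ from the current one."""
    current = entries[-1]["new"]
    seen, result = set(), []
    for entry in reversed(entries):
        sha = entry["new"]
        if sha != current and sha not in seen:
            seen.add(sha)
            result.append((sha, entry["message"]))
    return result


# Invented ids and identity; the file is ordered oldest entry first.
ZERO, A, B, C = "0" * 40, "a" * 40, "b" * 40, "c" * 40
WHO = "A U Thor <author@example.com>"
REFLOG = "\n".join([
    f"{ZERO} {A} {WHO} 1243040974 -0700\tcommit (initial): First commit",
    f"{A} {B} {WHO} 1243041000 -0700\tcommit: Second commit",
    f"{B} {C} {WHO} 1243041100 -0700\tcommit: Third commit",
    f"{C} {A} {WHO} 1243041200 -0700\treset: moving to HEAD~2",
])

entries = [parse_reflog_line(line) for line in REFLOG.splitlines()]
for sha, message in recovery_candidates(entries):
    print(sha[:7], message)
```

Each id printed is something that could be passed to `git branch recover-branch <sha>`.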

#### Recovery without reflog
- If reflog is missing (e.g., `.git/logs/` removed)
- Use integrity checker
  - `git fsck --full`
  - shows dangling/unreachable objects
    - `dangling commit <sha>`
- Recover similarly
  - create a new branch ref pointing to the dangling commit

### Removing objects (purging big files from history)
#### Problem statement
- Git clones fetch full history
- A huge file added once remains in history forever if reachable
  - even if deleted next commit
- Especially painful in imported repos (SVN/Perforce)

#### Strong warning
- Destructive: rewrites commit history (new commit IDs)
- Must coordinate contributors (rebase onto rewritten history)

#### Workflow to locate and remove large objects
- Confirm repo size after packing
  - `git gc`
  - `git count-objects -v` (check `size-pack`)
- Find largest packed objects
  - `git verify-pack -v <pack>.idx | sort -k 3 -n | tail -3`
  - third field in output is object size
- Map blob SHA to filename
  - `git rev-list --objects --all | grep <blob_sha_prefix>`
- Identify commits that touched the path
  - `git log --oneline --branches -- <file>`
- Rewrite history to remove the file from every tree
  - `git filter-branch --index-filter 'git rm --ignore-unmatch --cached <file>' -- <bad_commit>^..`
  - `--index-filter` is fast (no full checkout per commit)
  - `git rm --cached` removes from index/tree, not just working dir
- Remove pointers to old history
  - `rm -Rf .git/refs/original`
  - `rm -Rf .git/logs/`
- Repack/clean
  - `git gc`
  - optionally remove remaining loose objects
    - `git prune --expire now`
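
The "find largest packed objects" step can also be done in Python instead of `sort | tail`. This sketch parses text in the shape printed by `git verify-pack -v` and returns the biggest entries; the sample lines are illustrative values, and `largest_objects` is a made-up helper. In practice the input would come from running the command, for example via `subprocess`.

```python
import re

# id, type, size, size-in-pack, offset (delta lines carry two extra fields).
OBJECT_LINE = re.compile(
    r"^([0-9a-f]{40})\s+(commit|tree|blob|tag)\s+(\d+)\s+(\d+)\s+(\d+)")


def largest_objects(verify_pack_output: str, count: int = 3):
    """Return the `count` largest objects as (sha, type, size), smallest first."""
    rows = []
    for line in verify_pack_output.splitlines():
        match = OBJECT_LINE.match(line)
        if match:  # skips summary lines such as "non delta: 4 objects"
            rows.append((match.group(1), match.group(2), int(match.group(3))))
    return sorted(rows, key=lambda row: row[2])[-count:]


# Illustrative sample; real output comes from `git verify-pack -v <pack>.idx`.
SAMPLE = """\
dadf7258d699da2c8d89b09ef6670edb7d5f91b4 commit 229 159 12
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   22044 5792 4977696
82c99a3e86bb1267b236a4b6eff7868d97489af1 blob   4975916 4976258 1438
d368d0ac0678cbe6cce505be58126d3526706e54 tree   36 46 4983488
non delta: 4 objects
"""

for sha, obj_type, size in largest_objects(SAMPLE):
    print(size, obj_type, sha)
```

The id on the last line printed is the one to feed into `git rev-list --objects --all | grep <blob_sha_prefix>`.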

## Environment Variables (controlling Git behavior)
> Chapter note: not exhaustive; highlights the most useful

### Global behavior
- `GIT_EXEC_PATH`
  - where Git finds sub-programs (e.g., `git-commit`, `git-diff`)
  - inspect via `git --exec-path`
- `HOME`
  - where Git finds global config
  - can be overridden for portable Git setups
- `PREFIX`
  - system-wide config path: `$PREFIX/etc/gitconfig`
- `GIT_CONFIG_NOSYSTEM`
  - disable system-wide config
- Output paging/editing
  - `GIT_PAGER` (fallback `PAGER`)
  - `GIT_EDITOR` (fallback `EDITOR`)

### Repository locations
- `GIT_DIR`
  - where `.git` directory is
  - if unset, Git walks up directory tree searching
- `GIT_CEILING_DIRECTORIES`
  - stops upward search early (useful for slow filesystems)
- `GIT_WORK_TREE`
  - working tree root for non-bare repos
- `GIT_INDEX_FILE`
  - alternate index path
- Object database
  - `GIT_OBJECT_DIRECTORY` — override `.git/objects`
  - `GIT_ALTERNATE_OBJECT_DIRECTORIES`
    - colon-separated additional object stores (share objects across repos)

### Pathspecs (path matching rules)
- Pathspecs used in `.gitignore` and CLI patterns (e.g., `git add *.c`)
- Wildcard behavior toggles
  - `GIT_GLOB_PATHSPECS=1` — wildcards enabled (default)
  - `GIT_NOGLOB_PATHSPECS=1` — wildcards literal (e.g., `*.c` matches file named `*.c`)
- Per-argument overrides
  - prefix with `:(glob)` or `:(literal)`
- `GIT_LITERAL_PATHSPECS`
  - disables wildcard matching and override prefixes
- `GIT_ICASE_PATHSPECS`
  - case-insensitive pathspec matching

### Committing (author/committer identity)
- Used primarily by `git-commit-tree` (then falls back to config)
- Author fields
  - `GIT_AUTHOR_NAME`
  - `GIT_AUTHOR_EMAIL`
  - `GIT_AUTHOR_DATE`
- Committer fields
  - `GIT_COMMITTER_NAME`
  - `GIT_COMMITTER_EMAIL`
  - `GIT_COMMITTER_DATE`
- `EMAIL`
  - fallback email if `user.email` is unset

### Networking (HTTP behavior)
- `GIT_CURL_VERBOSE`
  - emit libcurl debug messages
- `GIT_SSL_NO_VERIFY`
  - skip SSL cert verification (self-signed/setup scenarios)
- Low-speed abort settings
  - `GIT_HTTP_LOW_SPEED_LIMIT`
  - `GIT_HTTP_LOW_SPEED_TIME`
  - override `http.lowSpeedLimit` / `http.lowSpeedTime`
- `GIT_HTTP_USER_AGENT`
  - override user-agent string

### Diffing and merging
- `GIT_DIFF_OPTS`
  - only supports unified context count: `-u<n>` / `--unified=<n>`
- `GIT_EXTERNAL_DIFF`
  - program invoked instead of built-in diff
- Batch diff metadata for external diff tool
  - `GIT_DIFF_PATH_COUNTER`
  - `GIT_DIFF_PATH_TOTAL`
- `GIT_MERGE_VERBOSITY` (recursive merge)
  - 0: only errors
  - 1: conflicts only
  - 2: + file changes (default)
  - 3: + skipped unchanged
  - 4: + all processed paths
  - 5+: deep debug

### Debugging/tracing (observability)
- Output destinations
  - `"true"`, `"1"`, `"2"` → stderr
  - absolute path `/...` → write to file
- `GIT_TRACE`
  - general tracing (alias expansion, sub-program exec)
- `GIT_TRACE_PACK_ACCESS`
  - pack access tracing: packfile + offset
- `GIT_TRACE_PACKET`
  - packet-level tracing for network operations
- `GIT_TRACE_PERFORMANCE`
  - timing for each internal step/subcommand
- `GIT_TRACE_SETUP`
  - shows discovered repo paths (`git_dir`, `worktree`, `cwd`, `prefix`, ...)

### Miscellaneous
- `GIT_SSH`
  - program used instead of `ssh`
  - invoked as: `$GIT_SSH [user@]host [-p <port>] <command>`
  - wrapper script often needed for extra args; `~/.ssh/config` may be easier
- `GIT_ASKPASS`
  - program to prompt for credentials (returns answer on stdout)
- `GIT_NAMESPACE`
  - namespaced refs (like `--namespace`), often server-side
- `GIT_FLUSH`
  - stdout buffering
  - `1` flush frequently; `0` buffer
- `GIT_REFLOG_ACTION`
  - custom text written to reflog entries (action descriptor)

## Summary (what you should now understand)
- Git internals = object database + refs + a UI on top
- Main object types
  - blob (content), tree (directories), commit (history + metadata), tag (named pointer + metadata)
- Refs and `HEAD` provide human-friendly naming and current-state tracking
- Packfiles optimize storage through compression and deltas
- Refspecs control fetch/push mappings and enable namespaced workflows
- Transfer protocols
  - dumb: simple HTTP reads (rare)
  - smart: negotiated pack exchange (common) for fetch/push
- Maintenance/recovery tools
  - `gc`, `packed-refs`, `reflog`, `fsck`, `filter-branch`, `prune`
- Environment variables provide control, portability, and deep debugging capabilities