```markmap
# Git Internals (Chapter 8)
## Why this chapter exists / positioning in the book
- Can be read early (curiosity) or late (after learning porcelain)
- Understanding internals helps explain *why* Git behaves as it does
- Tradeoff: powerful insight vs. potential complexity for beginners
- Core premise
- Git = **content-addressable filesystem** + **VCS user interface** layered on top
- Historical note
- Early Git (mostly pre-1.5) UI emphasized filesystem concepts → felt complex
- Modern Git UI refined; early “complex Git” stereotype lingers
- Chapter flow
- Content-addressable storage layer (objects) first
- Then transports (protocols)
- Then maintenance + recovery tasks
## Plumbing and Porcelain
- Porcelain commands (high-level UX)
- Examples: `checkout`, `branch`, `remote`, …
- Most of the book focuses on these
- Plumbing commands (low-level toolkit)
- Designed to be chained (UNIX-style) or used from scripts/tools
- Used here to expose internals and demonstrate implementation
- Often not meant for humans to type frequently
## The `.git` directory (what Git stores/manipulates)
- Created by `git init`
- Backups/clones
- Copying `.git/` elsewhere gives *nearly everything* needed
- Fresh repo typical contents
- `config`
- Project-specific configuration
- `description`
- Used by GitWeb only
- `HEAD`
- Points to the current branch (or directly to a commit when `HEAD` is detached)
- `hooks/`
- Client/server hook scripts (covered elsewhere)
- `info/`
- Exclude file (`info/exclude`) for ignore patterns you don't want to track in `.gitignore`
- `objects/`
- Object database (content store)
- `refs/`
- Pointers into commits (branches, tags, remotes, …)
- `index` (not shown initially)
- Staging area data (created when needed)
- “Core” pieces emphasized here
- `objects/` — all stored content
- `refs/` — names/pointers into commit graph
- `HEAD` — what's checked out
- `index` — staging area snapshot used to build trees/commits
## Git Objects (content-addressable store)
### Concept: a key-value database
- Insert arbitrary data → receive a unique key → retrieve later
- Key is a checksum (SHA-1 in these examples) of:
- a header + the content (details later)
### Creating a blob object with `git hash-object`
- What it does
- hashes content
- optionally writes object into `.git/objects/`
- returns the object id (40 hex chars = SHA-1)
- Key options
- `-w` — write object to object database
- `--stdin` — read content from stdin (otherwise expects a filename)
- Object storage layout on disk (loose objects)
- Path: `.git/objects/<first2>/<remaining38>`
- Directory name = first 2 chars of SHA-1
- Filename = remaining 38 chars
- Inspecting an object
- `git cat-file -p <sha>` — pretty-print content (auto-detect type)
- `git cat-file -t <sha>` — print object type
- Blob objects
- store *only content* (no filename)
- example: versions of `test.txt` stored as different blobs
### Retrieving content
- You can “recreate” a file from a blob by redirecting `cat-file` output
- `git cat-file -p <sha> > test.txt`
- Limitations of blobs alone
- Must remember SHA-1 per version
- No filenames or directory structure
## Tree Objects (filenames + directories + grouping)
### What a tree is
- Stores a directory listing-like structure
- Entries contain
- mode
- type (`blob` or `tree`)
- SHA-1 of target object
- filename
- Conceptual model (simplified UNIX-like)
- tree ↔ directory entries
- blob ↔ file contents
### Inspecting trees
- `git cat-file -p master^{tree}`
- shows top-level tree for the last commit on `master`
- example entries include blobs (files) and trees (subdirectories)
- Subtrees
- a directory entry points to another tree object
- Shell quoting pitfalls for `master^{tree}`
- Windows CMD: `^` is escape → use `master^^{tree}`
- PowerShell: quote braces → `git cat-file -p 'master^{tree}'`
- ZSH: `^` globbing → quote expression → `git cat-file -p "master^{tree}"`
### Building trees manually (via the index)
- Normal Git behavior
- Creates trees from the staging area (index)
- Plumbing commands used
- `git update-index`
- manipulate index entries
- `--add` required if path not in index yet
- `--cacheinfo` used when content isn't in working tree (already in DB)
- requires: `<mode> <sha> <path>`
- valid file modes for blobs
- `100644` normal file
- `100755` executable
- `120000` symlink
- `git write-tree`
- writes current index to a tree object
- `git read-tree`
- reads a tree into index
- `--prefix=<dir>/` stages it as a subtree
### Example sequence (three trees)
- Tree 1: `test.txt` v1
- stage blob via `update-index --add --cacheinfo 100644 <sha_v1> test.txt`
- `write-tree` → tree1 (contains `test.txt` → blob v1)
- Tree 2: `test.txt` v2 + `new.txt`
- update index to point `test.txt` to blob v2
- add `new.txt`
- `write-tree` → tree2 (two file entries)
- Tree 3: include Tree 1 under `bak/`
- `read-tree --prefix=bak <tree1>`
- `write-tree` → tree3
- tree3 contains
- `bak/` → tree1
- `new.txt` → blob
- `test.txt` → blob v2
## Commit Objects (snapshots + history + metadata)
### Why commits exist
- Trees represent snapshots but:
- SHA-1s are not memorable
- need who/when/why metadata
- need parent links to form history
### Creating commits with `git commit-tree`
- Inputs
- a tree SHA-1 (snapshot)
- optional parent commit SHA-1(s)
- message from stdin
- Commit object fields
- `tree <tree_sha>`
- `parent <parent_sha>` (none for first commit)
- `author ...` (from `user.name`, `user.email`, timestamp)
- `committer ...` (same source)
- blank line
- commit message
- Note about hashes in book
- commit hashes differ due to timestamps/author data; use your own
### Example history
- Commit 1 points to tree1 (no parent)
- Commit 2 points to tree2, parent = commit1
- Commit 3 points to tree3, parent = commit2
- View history
- `git log --stat <commit3_sha>`
- Key takeaway
- Porcelain `git add`/`git commit` do essentially:
- write blobs for changed content
- update index
- write tree(s)
- write commit referencing tree + parent
## Object Storage (how objects are actually stored)
### Common storage recipe
- Each object stored as:
- header + content
- Header format
- `<type> <size>\0`
- type: `blob`, `tree`, `commit`, `tag`
- size: bytes in content
- null byte terminator
- Object id
- SHA-1 of (header + content)
- Compression
- zlib-compressed before writing to disk
### Ruby walk-through (blob example)
- Build content string
- Build header (`"blob #{bytesize}\0"`)
- Concatenate and hash with SHA-1
- matches `git hash-object` (use `echo -n` to avoid newline)
- Compress with zlib
- Write to `.git/objects/<sha[0,2]>/<sha[2,38]>`
- Validate with `git cat-file -p <sha>`
## Git References (refs) — naming commits/objects
### What refs are
- Human-friendly names → files containing SHA-1s
- Stored under `.git/refs/`
- `refs/heads/` — branches
- `refs/tags/` — tags
- (later) `refs/remotes/` — remote-tracking refs
### Creating/updating refs
- Direct edit possible but discouraged
- `echo <sha> > .git/refs/heads/master`
- Safer: `git update-ref`
- `git update-ref refs/heads/master <sha>`
- Branch meaning
- A branch is a ref that points to the tip commit of a line of work
- Example: create branch at older commit
- `git update-ref refs/heads/test <sha_of_commit2>`
- `git log test` shows only commits reachable from that ref
## `HEAD` — what you have checked out
### Symbolic reference (usual case)
- `.git/HEAD` commonly contains
- `ref: refs/heads/<branch>`
- On checkout, Git updates `HEAD` to point at chosen branch ref
- Commit parent determination
- `git commit` uses commit pointed to by ref that `HEAD` references
### Detached HEAD (special case)
- Sometimes `HEAD` contains a raw SHA-1
- Happens when checking out
- a tag
- a commit
- a remote-tracking branch
### Managing HEAD safely
- `git symbolic-ref HEAD` — read where HEAD points
- `git symbolic-ref HEAD refs/heads/test` — set symbolic HEAD
- Constraint
- cannot point outside `refs/` namespace
## Tags (lightweight vs annotated)
### Tag object
- Fourth object type: `tag`
- Similar to commit object (tagger/date/message/pointer)
- Usually points to a commit, but can tag any object (blob/tree/commit)
### Lightweight tags
- Just a ref under `refs/tags/` pointing directly to an object
- `git update-ref refs/tags/v1.0 <commit_sha>`
- Never moves (unlike branch tips)
### Annotated tags
- Create a tag object and a ref that points to it
- `git tag -a v1.1 <commit_sha> -m '...'`
- `.git/refs/tags/v1.1` contains SHA-1 of the *tag object*
- Tag object content includes
- `object <target_sha>`
- `type <target_type>`
- `tag <name>`
- `tagger ...`
- message
- Examples mentioned
- Tagging a maintainer's GPG key stored as a blob
- Kernel repo has an early tag pointing at an initial tree
## Remotes (remote-tracking references)
### What they are
- Refs under `refs/remotes/<remote>/...`
- Store last known state of remote branches after communicating
### Example
- After `git remote add origin ...` and `git push origin master`
- `.git/refs/remotes/origin/master` stores last known remote SHA-1
### Key characteristics
- Read-only from user standpoint
- You can check one out, but Git won't set `HEAD` as a symbolic ref to it
- They act as bookmarks managed by Git for remote state
## Packfiles (space-efficient object storage)
### Loose objects vs packed objects
- Loose object: one zlib file per object
- Packfile:
- single `.pack` containing many objects
- `.idx` index mapping SHA-1 → offsets
### When packing happens
- Automatically when:
- many loose objects
- many packfiles
- Manually via `git gc`
- Often during push to a server
### Demonstration scenario (why deltas matter)
- Add large file (`repo.rb`, ~22K) and commit
- file stored as blob
- Modify it slightly and commit again
- creates a whole new blob
- two near-identical large blobs now exist
### `git gc` effects
- Creates pack + index
- Removes many loose objects (reachable ones)
- Leaves dangling/unreachable blobs loose (not in pack)
### Inspecting what's packed
- `git verify-pack -v <pack>.idx`
- shows objects, sizes, offsets, delta bases
- Delta storage behavior shown
- newer version often stored in full
- older version stored as delta against newer
- optimized for fast access to most recent version
- Repacking
- can happen automatically
- can be triggered any time via `git gc`
## Refspec (ref mapping rules for fetch/push)
### Where it appears
- `.git/config` remote section created by `git remote add`
- `fetch = +refs/heads/*:refs/remotes/origin/*`
### Syntax
- `(+)?<src>:<dst>`
- optional `+` forces update even if not fast-forward
- `<src>`: refs on remote
- `<dst>`: local tracking refs
### Default fetch behavior
- Fetch all remote branches (`refs/heads/*`)
- Track locally as `refs/remotes/origin/*`
- Equivalent references
- `origin/master`
- `remotes/origin/master`
- `refs/remotes/origin/master`
### Custom fetch examples
- Fetch only master always
- `fetch = +refs/heads/master:refs/remotes/origin/master`
- One-time fetch to a different local name
- `git fetch origin master:refs/remotes/origin/mymaster`
- Multiple refspecs
- CLI or multiple `fetch =` lines in config
- Fast-forward enforcement and overrides
- non-FF rejected unless `+` used
- Partial globs (Git ≥ 2.6.0)
- `qa*` patterns for multiple branches
- Namespaces/directories for teams
- e.g., `refs/heads/qa/*` → `refs/remotes/origin/qa/*`
## Pushing refspecs & deleting remote refs
### Pushing into a namespace
- Push local `master` to remote `qa/master`
- `git push origin master:refs/heads/qa/master`
- Configure default push mapping
- `push = refs/heads/master:refs/heads/qa/master`
### Deleting remote references
- Old refspec deletion form
- `git push origin :topic`
- Newer explicit flag (Git ≥ 1.7.0)
- `git push origin --delete topic`
### Note/limitation
- Refspecs can't fetch from one repo and push to another (as a single refspec trick)
## Transfer Protocols (moving data between repositories)
### Two major approaches
- Dumb protocol
- simple, HTTP read-only, no Git server-side logic
- inefficient, hard to secure/private; rarely used now
- Smart protocol
- Git-aware server process
- negotiates what data is needed
- supports pushes
### Dumb protocol (HTTP) — conceptual clone walkthrough
- `git clone http://server/<repo>.git`
- Fetch refs list (requires server-generated metadata)
- `GET info/refs`
- generated by `update-server-info` (often via post-receive hook)
- Fetch HEAD to determine default branch
- `GET HEAD` → `ref: refs/heads/master`
- Walk objects starting from advertised commit SHA
- `GET objects/<sha_prefix>/<sha_rest>` for loose objects
- parse commit → learn `tree` + `parent`
- If tree object not found as loose (404)
- check alternates
- `GET objects/info/http-alternates`
- check available packfiles
- `GET objects/info/packs`
- `GET objects/pack/pack-....idx`
- `GET objects/pack/pack-....pack`
- Once required objects are fetched
- checkout working tree for branch pointed to by downloaded `HEAD`
### Smart protocol — overview
- Upload (push): `send-pack` (client) ↔ `receive-pack` (server)
- Download (fetch/clone): `fetch-pack` (client) ↔ `upload-pack` (server)
#### Uploading data (push)
- SSH transport
- client runs remote command (conceptually)
- `ssh ... "git-receive-pack '<repo>.git'"`
- server advertises
- current refs + SHA-1s
- capabilities appended on the first line after a NUL separator
- pkt-line framing
- each chunk begins with 4 hex chars = length (including those 4 chars)
- `0000` indicates end
- client sends per-ref updates
- `<old_sha> <new_sha> <refname>`
- all zeros on left = create ref
- all zeros on right = delete ref
- client sends a packfile of objects server lacks
- server replies success/failure
- e.g., `unpack ok`
- HTTP(S) transport
- discovery
- `GET .../info/refs?service=git-receive-pack`
- push
- `POST .../git-receive-pack` with update commands + packfile
- note: HTTP may wrap in chunked transfer encoding
#### Downloading data (fetch/clone)
- SSH transport
- client runs remote command
- `ssh ... "git-upload-pack '<repo>.git'"`
- server advertises
- refs and capabilities
- `symref=HEAD:refs/heads/master` so client knows default branch
- negotiation
- client sends `want <sha>`
- client sends `have <sha>`
- client sends `done` to request packfile generation
- server returns packfile (optionally multiplexing progress via side-band)
- HTTP(S) transport
- discovery
- `GET .../info/refs?service=git-upload-pack`
- negotiation/data request
- `POST .../git-upload-pack` with want/have data
- response includes packfile
### Protocols summary note
- Only the high-level handshake is covered
- Many capabilities/features (e.g., `multi_ack`, `side-band`) exist beyond this chapter's scope
## Maintenance and Data Recovery
### Maintenance (`gc`, packing, pruning)
- Auto maintenance
- Git occasionally runs "auto gc" (`git gc --auto`) on its own
- Usually no-op unless thresholds exceeded
- What `git gc` does
- packs loose objects into packfiles
- consolidates packfiles
- removes unreachable objects older than a few months
- Trigger thresholds (approx)
- ~7000 loose objects
- >50 packfiles
- Config knobs
- `gc.auto`
- `gc.autopacklimit`
- Manual auto-gc run
- `git gc --auto` (often does nothing)
### Packing refs into `packed-refs`
- Before gc: refs stored as many small files
- `.git/refs/heads/*`, `.git/refs/tags/*`, …
- After gc: moved for efficiency into `.git/packed-refs`
- format lines: `<sha> <refname>`
- annotated tags include a “peeled” line starting with `^`
- indicates the commit the tag ultimately points to
- Updating a ref after packing
- Git writes a new loose ref file under `.git/refs/...`
- doesn't edit `packed-refs`
- Lookup behavior
- Git checks loose refs first, then `packed-refs` fallback
### Data Recovery (finding lost commits)
#### Common loss causes
- force-delete a branch containing work you later want
- `git reset --hard` moving a branch tip back, abandoning newer commits
#### Reflog-based recovery
- Reflog records where `HEAD` pointed whenever it changes
- commits, branch switches, resets
- also updated by `git update-ref` (reason to prefer it over manual ref edits)
- Useful commands
- `git reflog` — concise HEAD history
- `git log -g` — reflog shown as a log
- Recovery technique
- find lost commit SHA-1 in reflog
- create a ref/branch pointing to it
- `git branch recover-branch <sha>`
#### Recovery without reflog
- If reflog is missing (e.g., `.git/logs/` removed)
- Use integrity checker
- `git fsck --full`
- shows dangling/unreachable objects
- `dangling commit <sha>`
- Recover similarly
- create a new branch ref pointing to the dangling commit
### Removing objects (purging big files from history)
#### Problem statement
- Git clones fetch full history
- A huge file added once remains in history forever if reachable
- even if deleted next commit
- Especially painful in imported repos (SVN/Perforce)
#### Strong warning
- Destructive: rewrites commit history (new commit IDs)
- Must coordinate contributors (rebase onto rewritten history)
#### Workflow to locate and remove large objects
- Confirm repo size after packing
- `git gc`
- `git count-objects -v` (check `size-pack`)
- Find largest packed objects
- `git verify-pack -v <pack>.idx | sort -k 3 -n | tail -3`
- third field in output is object size
- Map blob SHA to filename
- `git rev-list --objects --all | grep <blob_sha_prefix>`
- Identify commits that touched the path
- `git log --oneline --branches -- <file>`
- Rewrite history to remove the file from every tree
- `git filter-branch --index-filter 'git rm --ignore-unmatch --cached <file>' -- <bad_commit>^..`
- `--index-filter` is fast (no full checkout per commit)
- `git rm --cached` removes from index/tree, not just working dir
- Remove pointers to old history
- `rm -Rf .git/refs/original`
- `rm -Rf .git/logs/`
- Repack/clean
- `git gc`
- optionally remove remaining loose objects
- `git prune --expire now`
## Environment Variables (controlling Git behavior)
> Chapter note: not exhaustive; highlights the most useful
### Global behavior
- `GIT_EXEC_PATH`
- where Git finds sub-programs (e.g., `git-commit`, `git-diff`)
- inspect via `git --exec-path`
- `HOME`
- where Git finds global config
- can be overridden for portable Git setups
- `PREFIX`
- system-wide config path: `$PREFIX/etc/gitconfig`
- `GIT_CONFIG_NOSYSTEM`
- disable system-wide config
- Output paging/editing
- `GIT_PAGER` (fallback `PAGER`)
- `GIT_EDITOR` (fallback `EDITOR`)
### Repository locations
- `GIT_DIR`
- where `.git` directory is
- if unset, Git walks up directory tree searching
- `GIT_CEILING_DIRECTORIES`
- stops upward search early (useful for slow filesystems)
- `GIT_WORK_TREE`
- working tree root for non-bare repos
- `GIT_INDEX_FILE`
- alternate index path
- Object database
- `GIT_OBJECT_DIRECTORY` — override `.git/objects`
- `GIT_ALTERNATE_OBJECT_DIRECTORIES`
- colon-separated additional object stores (share objects across repos)
### Pathspecs (path matching rules)
- Pathspecs used in `.gitignore` and CLI patterns (e.g., `git add *.c`)
- Wildcard behavior toggles
- `GIT_GLOB_PATHSPECS=1` — wildcards enabled (default)
- `GIT_NOGLOB_PATHSPECS=1` — wildcards literal (e.g., `*.c` matches file named `*.c`)
- Per-argument overrides
- prefix with `:(glob)` or `:(literal)`
- `GIT_LITERAL_PATHSPECS`
- disables wildcard matching and override prefixes
- `GIT_ICASE_PATHSPECS`
- case-insensitive pathspec matching
### Committing (author/committer identity)
- Primary identity source for `git-commit-tree`; config values are the fallback when these are unset
- Author fields
- `GIT_AUTHOR_NAME`
- `GIT_AUTHOR_EMAIL`
- `GIT_AUTHOR_DATE`
- Committer fields
- `GIT_COMMITTER_NAME`
- `GIT_COMMITTER_EMAIL`
- `GIT_COMMITTER_DATE`
- `EMAIL`
- fallback email if `user.email` is unset
### Networking (HTTP behavior)
- `GIT_CURL_VERBOSE`
- emit libcurl debug messages
- `GIT_SSL_NO_VERIFY`
- skip SSL cert verification (self-signed/setup scenarios)
- Low-speed abort settings
- `GIT_HTTP_LOW_SPEED_LIMIT`
- `GIT_HTTP_LOW_SPEED_TIME`
- override `http.lowSpeedLimit` / `http.lowSpeedTime`
- `GIT_HTTP_USER_AGENT`
- override user-agent string
### Diffing and merging
- `GIT_DIFF_OPTS`
- only supports unified context count: `-u<n>` / `--unified=<n>`
- `GIT_EXTERNAL_DIFF`
- program invoked instead of built-in diff
- Batch diff metadata for external diff tool
- `GIT_DIFF_PATH_COUNTER`
- `GIT_DIFF_PATH_TOTAL`
- `GIT_MERGE_VERBOSITY` (recursive merge)
- 0: only errors
- 1: conflicts only
- 2: + file changes (default)
- 3: + skipped unchanged
- 4: + all processed paths
- 5+: deep debug
### Debugging/tracing (observability)
- Output destinations
- `"true"`, `"1"`, `"2"` → stderr
- absolute path `/...` → write to file
- `GIT_TRACE`
- general tracing (alias expansion, sub-program exec)
- `GIT_TRACE_PACK_ACCESS`
- pack access tracing: packfile + offset
- `GIT_TRACE_PACKET`
- packet-level tracing for network operations
- `GIT_TRACE_PERFORMANCE`
- timing for each internal step/subcommand
- `GIT_TRACE_SETUP`
- shows discovered repo paths (`git_dir`, `worktree`, `cwd`, `prefix`, ...)
### Miscellaneous
- `GIT_SSH`
- program used instead of `ssh`
- invoked as: `$GIT_SSH [user@]host [-p <port>] <command>`
- wrapper script often needed for extra args; `~/.ssh/config` may be easier
- `GIT_ASKPASS`
- program to prompt for credentials (returns answer on stdout)
- `GIT_NAMESPACE`
- namespaced refs (like `--namespace`), often server-side
- `GIT_FLUSH`
- stdout buffering
- `1` flush frequently; `0` buffer
- `GIT_REFLOG_ACTION`
- custom text written to reflog entries (action descriptor)
## Summary (what you should now understand)
- Git internals = object database + refs + a UI on top
- Main object types
- blob (content), tree (directories), commit (history + metadata), tag (named pointer + metadata)
- Refs and `HEAD` provide human-friendly naming and current-state tracking
- Packfiles optimize storage through compression and deltas
- Refspecs control fetch/push mappings and enable namespaced workflows
- Transfer protocols
- dumb: simple HTTP reads (rare)
- smart: negotiated pack exchange (common) for fetch/push
- Maintenance/recovery tools
- `gc`, `packed-refs`, `reflog`, `fsck`, `filter-branch`, `prune`
- Environment variables provide control, portability, and deep debugging capabilities
```
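
The "Object Storage" recipe in the map (header + content, SHA-1 over both, zlib compression, two-character directory split) can be sketched in a few lines of Python. This is a minimal illustration, not Git's implementation; the helper names `git_object` and `loose_path` are hypothetical, and nothing is written to disk.

```python
import hashlib
import zlib


def git_object(obj_type, content):
    """Return (object id, zlib-compressed bytes) for a loose object.

    Hypothetical helper: mirrors the "<type> <size>\\0" + content recipe.
    """
    # Header: object type, a space, the content size in bytes, then a NUL byte.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    store = header + content
    # The object id is the SHA-1 of header + content (not of the content alone).
    oid = hashlib.sha1(store).hexdigest()
    # Loose objects are zlib-compressed before being written.
    return oid, zlib.compress(store)


def loose_path(oid):
    """Path of a loose object: first 2 hex chars = directory, remaining 38 = file."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


# Same bytes that `echo 'test content' | git hash-object --stdin` would hash
# (echo appends the trailing newline).
oid, compressed = git_object("blob", b"test content\n")
print(oid)              # d670460b4b4aece5915caf5c68d12f560a9fe3e4
print(loose_path(oid))  # .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```

Decompressing `compressed` gives back `blob 13\0test content\n`, which is the header that `git cat-file -t` and `git cat-file -p` rely on to report the type and the content.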
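
The pkt-line framing and the per-ref update commands listed under "Uploading data (push)" can be illustrated with a small encoder and decoder. This is a sketch under simplifying assumptions: capabilities, side-band multiplexing, and the packfile itself are left out, and the names `pkt_line`, `read_pkt_lines`, `update_command`, and `classify` are hypothetical.

```python
FLUSH = b"0000"   # special packet that marks the end of a section
ZERO_ID = "0" * 40  # all-zero SHA-1 used for "no object"


def pkt_line(payload):
    """Frame a payload: 4 hex digits giving the total length (prefix included)."""
    length = len(payload) + 4
    if length > 0xFFFF:
        raise ValueError("payload too large for a single pkt-line")
    return f"{length:04x}".encode("ascii") + payload


def read_pkt_lines(stream):
    """Split a byte string into payloads; a flush packet is returned as None."""
    packets, pos = [], 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4].decode("ascii"), 16)
        if length == 0:            # "0000" carries no payload
            packets.append(None)
            pos += 4
            continue
        if length < 4:
            raise ValueError("invalid pkt-line length")
        packets.append(stream[pos + 4:pos + length])
        pos += length
    return packets


def update_command(old_id, new_id, refname):
    """Build one '<old_sha> <new_sha> <refname>' line as sent by the client."""
    return pkt_line(f"{old_id} {new_id} {refname}\n".encode("ascii"))


def classify(old_id, new_id):
    """Zeros on the left create a ref, zeros on the right delete it."""
    if old_id == ZERO_ID:
        return "create"
    if new_id == ZERO_ID:
        return "delete"
    return "update"


# Placeholder object id, for illustration only.
new_id = "a" * 40
stream = update_command(ZERO_ID, new_id, "refs/heads/topic") + FLUSH
for packet in read_pkt_lines(stream):
    print(packet)
print(classify(ZERO_ID, new_id))  # create
```

Because the length prefix counts its own four characters, the smallest data packet is `0004` plus payload, and `0000` can never be confused with data.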
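
The refspec syntax `(+)?<src>:<dst>` from the "Refspec" branch can be made concrete with a small parser that maps a remote ref name to its local tracking name. This is a hedged sketch, not Git's matching logic: it assumes at most one `*` per side, does not expand short names such as `master` into `refs/heads/master`, and the names `Refspec`, `parse_refspec`, and `map_ref` are hypothetical.

```python
from typing import NamedTuple, Optional


class Refspec(NamedTuple):
    force: bool  # True when the spec starts with "+"
    src: str     # pattern for refs on the source side
    dst: str     # pattern for refs on the destination side


def parse_refspec(spec):
    """Split '(+)?<src>:<dst>' into its three parts."""
    force = spec.startswith("+")
    if force:
        spec = spec[1:]
    src, _, dst = spec.partition(":")
    return Refspec(force, src, dst)


def map_ref(refspec, ref) -> Optional[str]:
    """Return the destination name for `ref`, or None if the source doesn't match."""
    if "*" not in refspec.src:
        return refspec.dst if ref == refspec.src else None
    prefix, suffix = refspec.src.split("*", 1)
    long_enough = len(ref) >= len(prefix) + len(suffix)
    if not (long_enough and ref.startswith(prefix) and ref.endswith(suffix)):
        return None
    # Whatever the "*" matched on the source side replaces "*" in the destination.
    matched = ref[len(prefix):len(ref) - len(suffix)]
    return refspec.dst.replace("*", matched, 1)


default = parse_refspec("+refs/heads/*:refs/remotes/origin/*")
print(default.force)                           # True
print(map_ref(default, "refs/heads/master"))   # refs/remotes/origin/master
print(map_ref(default, "refs/tags/v1.0"))      # None

qa = parse_refspec("+refs/heads/qa*:refs/remotes/origin/qa*")
print(map_ref(qa, "refs/heads/qa-main"))       # refs/remotes/origin/qa-main
```

Under the same model, the deletion form `:topic` parses to an empty source and the destination `topic`, which is why pushing "nothing" to a ref removes it.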