Git Enrichment Pipeline
tea-rags enriches every indexed code chunk with git-derived quality signals. The pipeline runs in two phases — file-level and chunk-level — both executing asynchronously in the background after indexing returns.
For metric definitions and research context, see Code Churn: Theory & Research. For practical usage (filtering, reranking), see Git Enrichments.
Pipeline Overview
Key Design Decisions
- No git blame: all metrics derive from commit history, not per-line attribution.
- No process spawns for commit data: isomorphic-git reads `.git/objects/pack/` directly.
- Single CLI call: only `git log --all --numstat` for line-level stats (one spawn total).
- Background execution: both phases run asynchronously after indexing returns.
- HEAD-based caching: results are cached and invalidated when HEAD changes.
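The HEAD-based caching decision can be illustrated with a minimal sketch. The class and method names (`HeadCache`, `getOrCompute`) are hypothetical and chosen for illustration; they are not taken from the tea-rags codebase, and the real cache may persist to disk or store more than one value.

```typescript
// Hypothetical sketch of a cache keyed by the HEAD commit SHA.
// Names are illustrative assumptions, not tea-rags APIs.
class HeadCache<T> {
  private headSha: string | null = null;
  private hasValue = false;
  private value: T | undefined = undefined;

  // Returns the cached value while HEAD is unchanged; recomputes when it moves.
  getOrCompute(currentHead: string, compute: () => T): T {
    if (this.hasValue && this.headSha === currentHead) {
      return this.value as T;
    }
    const fresh = compute();
    this.headSha = currentHead;
    this.value = fresh;
    this.hasValue = true;
    return fresh;
  }
}
```

Under this sketch, re-running enrichment without a new commit reuses the previous result, while any operation that moves HEAD (commit, checkout, pull) triggers a recompute.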
Phase 1: File-Level Enrichment
Reads git history via isomorphic-git (bounded by TRAJECTORY_GIT_LOG_MAX_AGE_MONTHS, default 12 months), with CLI fallback on timeout (TRAJECTORY_GIT_LOG_TIMEOUT_MS).
git log (isomorphic-git, reads .git directly)
-> per-file CommitInfo[] + linesAdded/linesDeleted
-> computeFileMetadata()
-> GitFileMetadata (stored on all chunks of the file)
Output: GitFileMetadata containing commitCount, relativeChurn, recencyWeightedFreq, changeDensity, churnVolatility, bugFixRate, two parallel ownership families (recentDominantAuthor / recentDominantAuthorPct / recentAuthors / recentContributorCount from the configurable recent commit window, and blameDominantAuthor / blameDominantAuthorPct / blameAuthors / blameContributorCount from git blame HEAD), and other signals. Stored on all chunks of the file via the git.* payload namespace.
The two ownership families capture distinct semantics: recent* reflects who's been actively committing in the recent window (good for activity / review routing), while blame* reflects who currently owns the live lines (good for authority and knowledge-silo analysis). When a long-time owner stops committing, the two diverge, and the divergence itself carries information.
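As a rough illustration of the aggregation step, the sketch below derives a few of the listed fields from a per-file commit list and flags the divergence between the two ownership families. It is a simplified stand-in for `computeFileMetadata()`, not its actual implementation: the `CommitInfo` shape, the bug-fix regex, and the 0-100 percentage scale are all assumptions, and the blame-derived author is passed in because it cannot be computed from commit history alone.

```typescript
// Assumed shape of a per-file commit record (illustrative only).
interface CommitInfo {
  sha: string;
  author: string;
  timestamp: number; // Unix seconds
  message: string;
}

interface FileMetadataSketch {
  commitCount: number;
  bugFixRate: number;
  recentDominantAuthor: string | null;
  recentDominantAuthorPct: number; // assumed to be on a 0-100 scale
  recentContributorCount: number;
}

// Assumed heuristic: a commit is a bug fix if its message mentions fix/bug/hotfix.
const BUG_FIX_PATTERN = /\b(fix(es|ed)?|bug|hotfix)\b/i;

function summarizeFile(commits: CommitInfo[]): FileMetadataSketch {
  const perAuthor = new Map<string, number>();
  let bugFixes = 0;
  for (const c of commits) {
    perAuthor.set(c.author, (perAuthor.get(c.author) ?? 0) + 1);
    if (BUG_FIX_PATTERN.test(c.message)) bugFixes++;
  }
  // Pick the author with the most commits in the window.
  let dominant: string | null = null;
  let dominantCount = 0;
  perAuthor.forEach((count, author) => {
    if (count > dominantCount) {
      dominant = author;
      dominantCount = count;
    }
  });
  const total = commits.length;
  return {
    commitCount: total,
    bugFixRate: total === 0 ? 0 : bugFixes / total,
    recentDominantAuthor: dominant,
    recentDominantAuthorPct: total === 0 ? 0 : (dominantCount / total) * 100,
    recentContributorCount: perAuthor.size,
  };
}

// True when the most active recent committer is not the owner of the live lines.
function ownershipDiverged(recentDominant: string | null, blameDominant: string | null): boolean {
  return recentDominant !== null && blameDominant !== null && recentDominant !== blameDominant;
}
```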
Phase 2: Chunk-Level Churn Overlay
Walks recent commits, diffs trees, reads blobs, and computes line-level patches to determine which chunks were affected by each commit.
git log (last N commits, isomorphic-git)
-> for each commit: diffTrees(parent, commit) -> changed files
-> filter to files with >1 chunk in index
-> readBlob(parent) + readBlob(commit) -> structuredPatch (jsdiff)
-> hunks with line numbers -> overlaps(hunk, chunk)
-> per-chunk accumulators -> ChunkChurnOverlay
-> batchSetPayload with dot-notation merge
(git.chunkCommitCount, etc.)
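The `overlaps(hunk, chunk)` step and the per-chunk accumulators can be sketched as follows. The `newStart` and `newLines` fields match the hunk objects that jsdiff's `structuredPatch` returns; everything else (`ChunkRange`, `CommitHunks`, `countChunkCommits`, and treating a pure-deletion hunk as touching one line) is an assumption made for illustration.

```typescript
// Chunk boundaries as 1-based inclusive line numbers (assumed representation).
interface ChunkRange { id: string; startLine: number; endLine: number }

// Subset of a jsdiff hunk: position and length in the new version of the file.
interface Hunk { newStart: number; newLines: number }

interface CommitHunks { sha: string; hunks: Hunk[] }

// Two inclusive ranges overlap when each starts before the other ends.
// A hunk with newLines = 0 (pure deletion) is treated as touching one line.
function overlaps(hunk: Hunk, chunk: ChunkRange): boolean {
  const hunkEnd = hunk.newStart + Math.max(hunk.newLines, 1) - 1;
  return hunk.newStart <= chunk.endLine && chunk.startLine <= hunkEnd;
}

// Counts, per chunk, how many commits touched it. A commit with several
// hunks inside the same chunk still counts once for that chunk.
function countChunkCommits(chunks: ChunkRange[], commits: CommitHunks[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const chunk of chunks) counts.set(chunk.id, 0);
  for (const commit of commits) {
    for (const chunk of chunks) {
      if (commit.hunks.some((h) => overlaps(h, chunk))) {
        counts.set(chunk.id, (counts.get(chunk.id) ?? 0) + 1);
      }
    }
  }
  return counts;
}
```

One caveat of this sketch: hunk ranges include surrounding context lines, so a patch generated with a non-zero context setting can attribute a change to a neighbouring chunk.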
Output: ChunkChurnOverlay containing chunkCommitCount, chunkChurnRatio, git.chunk.recentContributorCount, git.chunk.blameContributorCount, chunkBugFixRate, chunkLastModifiedAt, chunkAgeDays. Merged into existing git.* payload using dot-notation to avoid overwriting file-level data.
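To show why dot-notation matters, the sketch below applies a path-based write to an in-memory payload. It only emulates the merge semantics locally; `setPath` is a hypothetical helper, not the storage client's API, and the field values are made up.

```typescript
type Payload = Record<string, unknown>;

// Writes `value` at a dotted path, creating intermediate objects as needed
// and leaving sibling keys untouched.
function setPath(target: Payload, path: string, value: unknown): void {
  const keys = path.split(".");
  const last = keys.pop();
  if (last === undefined) return;
  let node: Payload = target;
  for (const key of keys) {
    const next = node[key];
    if (typeof next === "object" && next !== null) {
      node = next as Payload;
    } else {
      const created: Payload = {};
      node[key] = created;
      node = created;
    }
  }
  node[last] = value;
}

// File-level data already stored by phase 1 (illustrative values).
const payload: Payload = { git: { commitCount: 12, bugFixRate: 0.25 } };

// Phase 2 adds chunk-level fields without replacing the whole `git` object.
setPath(payload, "git.chunkCommitCount", 3);
setPath(payload, "git.chunkBugFixRate", 0.5);
```

Writing `{ git: overlay }` as a whole object would instead drop `commitCount` and `bugFixRate`, which is the overwrite the dot-notation merge avoids.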
Performance
For a typical project (~2000 files, ~200 commits):
File-level enrichment: Typically 0.5-2s for small repos.
Chunk-level churn:
- 200 commits x ~5 changed files/commit x 60% in index x filter (>1 chunk) = ~400 file diffs
- Each: 2 blob reads (pack cache ~1ms) + 1 structuredPatch (~0.5ms) = ~2.5ms
- With 10 concurrent workers: ~100ms
- Total overhead: < 1s on top of file-level enrichment
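The estimate above can be reproduced as a short calculation. The multi-chunk share of roughly two thirds is not stated explicitly; it is inferred here from 600 indexed file diffs shrinking to about 400, and the function name is hypothetical.

```typescript
interface ChurnEstimate {
  fileDiffs: number;   // diffs that survive both filters
  serialMs: number;    // cost if processed one at a time
  parallelMs: number;  // cost spread across concurrent workers
}

// Back-of-the-envelope cost model for chunk-level churn.
function estimateChunkChurn(
  commits: number,
  changedFilesPerCommit: number,
  indexedShare: number,
  multiChunkShare: number, // assumed ~2/3, inferred from 600 -> ~400
  workers: number,
): ChurnEstimate {
  const blobReadMs = 1;        // per blob, served from the pack cache
  const structuredPatchMs = 0.5;
  const msPerDiff = 2 * blobReadMs + structuredPatchMs; // 2.5 ms
  const fileDiffs = Math.round(commits * changedFilesPerCommit * indexedShare * multiChunkShare);
  const serialMs = fileDiffs * msPerDiff;
  return { fileDiffs, serialMs, parallelMs: serialMs / workers };
}

// 200 commits x 5 files x 60% indexed x ~2/3 multi-chunk, 10 workers.
const estimate = estimateChunkChurn(200, 5, 0.6, 2 / 3, 10);
console.log(estimate); // about 400 diffs, about 1000 ms serial, about 100 ms parallel
```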
Both phases are cached by HEAD SHA and run in background (non-blocking to indexing).
Skip Conditions
Chunk-level analysis is automatically skipped for:
- Single-chunk files: chunk equals file, no granularity benefit.
- Files with 1 commit: all chunks would get identical data.
- Files exceeding `TRAJECTORY_GIT_CHUNK_MAX_FILE_LINES`: performance guard.
- Binary files: blob read fails gracefully.
- Root commits: no parent to diff against.
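The skip conditions can be read as a guard applied before any diffing. The following is a sketch under assumptions: the `FileCandidate` shape and function names are hypothetical, and binary detection is modelled as a flag even though the pipeline is described as relying on the blob read failing gracefully.

```typescript
// Assumed per-file facts available before chunk-level analysis.
interface FileCandidate {
  chunkCount: number;
  commitCount: number;
  lineCount: number;
  isBinary: boolean;
}

type SkipReason = "single-chunk" | "single-commit" | "too-large" | "binary" | null;

// Returns why a file is skipped, or null when it should be analyzed.
// maxFileLines mirrors the TRAJECTORY_GIT_CHUNK_MAX_FILE_LINES default.
function chunkAnalysisSkipReason(file: FileCandidate, maxFileLines = 10000): SkipReason {
  if (file.chunkCount <= 1) return "single-chunk";
  if (file.commitCount <= 1) return "single-commit";
  if (file.lineCount > maxFileLines) return "too-large";
  if (file.isBinary) return "binary";
  return null;
}

// Root commits have no parent, so there is nothing to diff against.
function canDiffCommit(parentShas: string[]): boolean {
  return parentShas.length > 0;
}
```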
GIT SESSIONS — Squash-Aware Grouping
Why it exists. Agent-driven development produces bursts of micro-commits: a single refactor session might land as 15–20 "fix typo", "adjust", "wip" commits within a few minutes. Treating each as an independent commit wrecks every churn-based signal — a 20-commit session looks identical to 20 separate production incidents.
What it does. When TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS=true, the pipeline
groups commits by (author, time gap). Any silence gap larger than
TRAJECTORY_GIT_SESSION_GAP_MINUTES (default 30 min) starts a new session.
Session count — not raw commit count — then feeds churn-related signals.
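A minimal sketch of the grouping rule, assuming the gap is tracked per author and measured between consecutive commits by that author. The `SessionCommit` type and `countSessions` function are hypothetical names.

```typescript
interface SessionCommit {
  author: string;
  timestamp: number; // Unix seconds
}

// Counts sessions: a commit starts a new session when its author has no
// earlier commit, or when the silence since that author's previous commit
// is larger than the gap.
function countSessions(commits: SessionCommit[], gapMinutes = 30): number {
  const gapSeconds = gapMinutes * 60;
  const lastSeen = new Map<string, number>();
  const sorted = commits.slice().sort((a, b) => a.timestamp - b.timestamp);
  let sessions = 0;
  for (const commit of sorted) {
    const previous = lastSeen.get(commit.author);
    if (previous === undefined || commit.timestamp - previous > gapSeconds) {
      sessions++;
    }
    lastSeen.set(commit.author, commit.timestamp);
  }
  return sessions;
}
```

With this rule, twenty commits by one author spaced a minute apart count as a single session, while the same author returning after an hour starts a second one.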
Where it matters most:
- Solo devs pair-programming with an agent (single human + single agent author)
- Teams adopting AI-assisted workflows where agents produce fine-grained commits
- Any project where `git log --oneline | wc -l` is misleading because most commits are agent checkpoints, not logical deliverables
Impact on signals. commitCount, chunkCommitCount, bugFixRate,
churnVolatility, and relativeChurn all use the deduplicated session count
when this mode is on. recentDominantAuthor, blameDominantAuthor, and
taskIds are unaffected — sessions affect counts, not who owns lines or who
mentioned which ticket.
Default is false — opt in per project. Enable via the environment variable
or the setup wizard (/tea-rags-setup:install step 7 — "Configure git
analytics").
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `TRAJECTORY_GIT_ENABLED` | `true` | Enable git enrichment during indexing. Set to `false` for non-git projects or fast iteration |
| `TRAJECTORY_GIT_LOG_MAX_AGE_MONTHS` | `12` | Time window for file-level git analysis (months). `0` = no age limit |
| `TRAJECTORY_GIT_LOG_TIMEOUT_MS` | `60000` | Timeout for `git log --numstat` (ms); falls back to native CLI on expiry |
| `TRAJECTORY_GIT_CHUNK_MAX_AGE_MONTHS` | `6` | Time window for chunk-level churn analysis (months). `0` = no age limit |
| `TRAJECTORY_GIT_CHUNK_CONCURRENCY` | `10` | Parallel commit processing for chunk churn |
| `TRAJECTORY_GIT_CHUNK_TIMEOUT_MS` | `120000` | Timeout for chunk churn CLI pathspec (ms) |
| `TRAJECTORY_GIT_CHUNK_MAX_FILE_LINES` | `10000` | Skip files larger than this for chunk analysis |
| `TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS` | `false` | Group commits into sessions (squash noise reduction) |
| `TRAJECTORY_GIT_SESSION_GAP_MINUTES` | `30` | Gap between commits to split sessions |
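
For reference, the defaults in the table can be expressed as a small loader. This is an illustrative sketch, not the tea-rags configuration code: the `GitEnrichmentConfig` field names and the parsing rules (case-insensitive `true`, fallback to the default on an unparsable number) are assumptions. The environment is passed in as a plain object so the function can be exercised without touching the real process environment.

```typescript
type Env = Record<string, string | undefined>;

interface GitEnrichmentConfig {
  enabled: boolean;
  logMaxAgeMonths: number;
  logTimeoutMs: number;
  chunkMaxAgeMonths: number;
  chunkConcurrency: number;
  chunkTimeoutMs: number;
  chunkMaxFileLines: number;
  squashAwareSessions: boolean;
  sessionGapMinutes: number;
}

// Parses a boolean flag; anything other than "true" (case-insensitive) is false.
function readBool(raw: string | undefined, fallback: boolean): boolean {
  return raw === undefined ? fallback : raw.trim().toLowerCase() === "true";
}

// Parses an integer; falls back to the default when unset or not a number.
function readInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const parsed = parseInt(raw, 10);
  return isNaN(parsed) ? fallback : parsed;
}

function loadGitConfig(env: Env): GitEnrichmentConfig {
  return {
    enabled: readBool(env["TRAJECTORY_GIT_ENABLED"], true),
    logMaxAgeMonths: readInt(env["TRAJECTORY_GIT_LOG_MAX_AGE_MONTHS"], 12),
    logTimeoutMs: readInt(env["TRAJECTORY_GIT_LOG_TIMEOUT_MS"], 60000),
    chunkMaxAgeMonths: readInt(env["TRAJECTORY_GIT_CHUNK_MAX_AGE_MONTHS"], 6),
    chunkConcurrency: readInt(env["TRAJECTORY_GIT_CHUNK_CONCURRENCY"], 10),
    chunkTimeoutMs: readInt(env["TRAJECTORY_GIT_CHUNK_TIMEOUT_MS"], 120000),
    chunkMaxFileLines: readInt(env["TRAJECTORY_GIT_CHUNK_MAX_FILE_LINES"], 10000),
    squashAwareSessions: readBool(env["TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS"], false),
    sessionGapMinutes: readInt(env["TRAJECTORY_GIT_SESSION_GAP_MINUTES"], 30),
  };
}
```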