
Git Enrichment Pipeline

tea-rags enriches every indexed code chunk with git-derived quality signals. The pipeline runs in two phases — file-level and chunk-level — both executing asynchronously in the background after indexing returns.

For metric definitions and research context, see Code Churn: Theory & Research. For practical usage (filtering, reranking), see Git Enrichments.

Pipeline Overview

Key Design Decisions

  • No git blame — all metrics derive from commit history, not per-line attribution.
  • No process spawns for commit data — isomorphic-git reads .git/objects/pack/ directly.
  • Single CLI call — only git log --all --numstat for line-level stats (one spawn total).
  • Background execution — both phases run asynchronously after indexing returns.
  • HEAD-based caching — results are cached and invalidated when HEAD changes.
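To illustrate the HEAD-based caching decision, here is a minimal sketch of a cache that reuses a result while HEAD is unchanged and recomputes when it moves. `HeadKeyedCache` and its method are hypothetical names for illustration only, not tea-rags APIs, and the real cache may store more than one entry.

```typescript
// Hypothetical single-entry cache keyed by the HEAD commit SHA.
interface CacheEntry<T> {
  headSha: string;
  value: T;
}

class HeadKeyedCache<T> {
  private entry: CacheEntry<T> | undefined;

  // Return the cached value when HEAD is unchanged; otherwise recompute and store.
  get(headSha: string, compute: () => T): T {
    if (this.entry !== undefined && this.entry.headSha === headSha) {
      return this.entry.value;
    }
    const value = compute();
    this.entry = { headSha: headSha, value: value };
    return value;
  }
}
```

Keying on the HEAD SHA keeps invalidation simple: any new commit, checkout, or reset changes the key, so stale enrichment data is never served for a different history.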

Phase 1: File-Level Enrichment

Reads git history via isomorphic-git (bounded by TRAJECTORY_GIT_LOG_MAX_AGE_MONTHS, default 12 months), with CLI fallback on timeout (TRAJECTORY_GIT_LOG_TIMEOUT_MS).

git log (isomorphic-git, reads .git directly)
-> per-file CommitInfo[] + linesAdded/linesDeleted
-> computeFileMetadata()
-> GitFileMetadata (stored on all chunks of the file)
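The age bound on the history read can be sketched as a simple date filter. This is an illustration only: `windowStart` and `isInWindow` are hypothetical helpers, and the treatment of 0 as "no age limit" follows the convention documented for TRAJECTORY_GIT_LOG_MAX_AGE_MONTHS.

```typescript
// Hypothetical helper: oldest commit date still inside the analysis window.
// Returns null when maxAgeMonths is 0 (documented as "no age limit").
function windowStart(now: Date, maxAgeMonths: number): Date | null {
  if (maxAgeMonths <= 0) return null;
  const start = new Date(now.getTime());
  // Day-of-month overflow (for example the 31st) follows JavaScript Date rollover rules.
  start.setUTCMonth(start.getUTCMonth() - maxAgeMonths);
  return start;
}

// True when the commit should be included in file-level analysis.
function isInWindow(commitDate: Date, now: Date, maxAgeMonths: number): boolean {
  const start = windowStart(now, maxAgeMonths);
  return start === null || commitDate.getTime() >= start.getTime();
}
```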

Output: GitFileMetadata containing commitCount, relativeChurn, recencyWeightedFreq, changeDensity, churnVolatility, bugFixRate, and other signals, plus two parallel ownership families. The recent* family (recentDominantAuthor, recentDominantAuthorPct, recentAuthors, recentContributorCount) comes from the configurable recent commit window; the blame* family (blameDominantAuthor, blameDominantAuthorPct, blameAuthors, blameContributorCount) comes from git blame HEAD. All fields are stored on every chunk of the file via the git.* payload namespace.

The two ownership families capture distinct semantics: recent* reflects who's been actively committing in the recent window (good for activity / review routing), while blame* reflects who currently owns the live lines (good for authority and knowledge-silo analysis). When a long-time owner stops committing, the two diverge, and the divergence itself carries information.
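As a rough sketch of how the recent* family could be derived from commits in the recent window: count commits per author, take the top author as dominant, and report their share. The `AuthoredCommit` shape, the tie-break by name, and the 0-100 percentage scale are assumptions made for this example; the actual computeFileMetadata() logic is not shown in this document.

```typescript
// Hypothetical minimal commit shape; the real CommitInfo has more fields.
interface AuthoredCommit {
  author: string;
}

interface RecentOwnership {
  recentDominantAuthor: string | null;
  recentDominantAuthorPct: number; // assumed to be on a 0-100 scale
  recentAuthors: string[];
  recentContributorCount: number;
}

function computeRecentOwnership(commits: AuthoredCommit[]): RecentOwnership {
  // Tally commits per author.
  const counts: Record<string, number> = {};
  for (const commit of commits) {
    counts[commit.author] = (counts[commit.author] || 0) + 1;
  }

  // Rank authors by commit count, breaking ties alphabetically (an assumption).
  const ranked = Object.keys(counts)
    .map((author) => ({ author: author, count: counts[author] || 0 }))
    .sort((a, b) => b.count - a.count || a.author.localeCompare(b.author));

  const top = ranked[0];
  if (top === undefined) {
    return {
      recentDominantAuthor: null,
      recentDominantAuthorPct: 0,
      recentAuthors: [],
      recentContributorCount: 0,
    };
  }

  return {
    recentDominantAuthor: top.author,
    recentDominantAuthorPct: Math.round((100 * top.count) / commits.length),
    recentAuthors: ranked.map((r) => r.author),
    recentContributorCount: ranked.length,
  };
}
```

Comparing recentDominantAuthor with blameDominantAuthor for the same file is one simple way to surface the divergence described above.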

Phase 2: Chunk-Level Churn Overlay

Walks recent commits, diffs trees, reads blobs, and computes line-level patches to determine which chunks were affected by each commit.

git log (last N commits, isomorphic-git)
-> for each commit: diffTrees(parent, commit) -> changed files
-> filter to files with >1 chunk in index
-> readBlob(parent) + readBlob(commit) -> structuredPatch (jsdiff)
-> hunks with line numbers -> overlaps(hunk, chunk)
-> per-chunk accumulators -> ChunkChurnOverlay
-> batchSetPayload with dot-notation merge (git.chunkCommitCount, etc.)
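The overlap test and per-chunk accumulation in the flow above can be sketched as follows. The `DiffHunk` interface mirrors the `newStart` / `newLines` fields that jsdiff's structuredPatch returns on each hunk; `ChunkRange`, `CommitHunks`, and `countCommitsPerChunk` are hypothetical names, and treating a pure-deletion hunk (`newLines` of 0) as touching one line is an assumption of this sketch.

```typescript
// Subset of a jsdiff structuredPatch hunk (new-file side only).
interface DiffHunk {
  newStart: number; // 1-based first line of the hunk in the new file
  newLines: number; // number of lines the hunk spans in the new file
}

// Hypothetical indexed chunk: 1-based inclusive line range.
interface ChunkRange {
  id: string;
  startLine: number;
  endLine: number;
}

// Hypothetical per-commit result for one file.
interface CommitHunks {
  sha: string;
  hunks: DiffHunk[];
}

// Two inclusive ranges overlap when each starts before the other ends.
function overlaps(hunk: DiffHunk, chunk: ChunkRange): boolean {
  // Assumption: a pure deletion (newLines = 0) still touches one line.
  const hunkEnd = hunk.newStart + Math.max(hunk.newLines, 1) - 1;
  return hunk.newStart <= chunk.endLine && hunkEnd >= chunk.startLine;
}

// Count, for each chunk, how many commits touched it at least once.
function countCommitsPerChunk(
  chunks: ChunkRange[],
  commits: CommitHunks[]
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const chunk of chunks) {
    counts[chunk.id] = 0;
  }
  for (const commit of commits) {
    for (const chunk of chunks) {
      if (commit.hunks.some((hunk) => overlaps(hunk, chunk))) {
        counts[chunk.id] = (counts[chunk.id] || 0) + 1;
      }
    }
  }
  return counts;
}
```

A commit with several hunks inside the same chunk still counts once for that chunk, which is why the inner check uses `some` rather than summing hunks.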

Output: ChunkChurnOverlay containing chunkCommitCount, chunkChurnRatio, git.chunk.recentContributorCount, git.chunk.blameContributorCount, chunkBugFixRate, chunkLastModifiedAt, chunkAgeDays. Merged into existing git.* payload using dot-notation to avoid overwriting file-level data.
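To show why a dot-notation merge leaves file-level data intact, here is a small sketch that applies a dotted key to a nested payload object. `setByPath` is a hypothetical helper written for this example; it is not the batchSetPayload implementation, and the payload values are made up.

```typescript
type Payload = { [key: string]: unknown };

// Set a single nested field addressed by a dotted path, creating
// intermediate objects as needed and leaving sibling fields untouched.
function setByPath(target: Payload, path: string, value: unknown): void {
  const keys = path.split(".");
  const last = keys.pop();
  if (last === undefined) return;

  let node: Payload = target;
  for (const key of keys) {
    const next = node[key];
    if (typeof next !== "object" || next === null) {
      node[key] = {};
    }
    node = node[key] as Payload;
  }
  node[last] = value;
}

// File-level data written in phase 1 (illustrative values).
const examplePayload: Payload = { git: { commitCount: 12 } };

// Phase 2 adds a chunk-level field without replacing the whole git object.
setByPath(examplePayload, "git.chunkCommitCount", 3);
```

Writing `{ git: { chunkCommitCount: 3 } }` as a whole object would instead replace `git` and drop `commitCount`, which is the overwrite the dot-notation merge avoids.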

Performance

For a typical project (~2000 files, ~200 commits):

File-level enrichment: Typically 0.5-2s for small repos.

Chunk-level churn:

  • 200 commits x ~5 changed files/commit x 60% in index x filter (>1 chunk) = ~400 file diffs
  • Each: 2 blob reads (pack cache ~1ms) + 1 structuredPatch (~0.5ms) = ~2.5ms
  • ~400 diffs x ~2.5ms = ~1s sequential; with 10 concurrent workers: ~100ms
  • Total overhead: < 1s on top of file-level enrichment
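The estimate above can be reproduced with a few lines of arithmetic. The multi-chunk filter ratio is not stated in the text; the value of 2/3 used here is inferred from the step from 600 to roughly 400 diffs, and the model assumes ideal scaling across workers.

```typescript
interface ChurnEstimateInput {
  commits: number;
  changedFilesPerCommit: number;
  indexedRatio: number;      // share of changed files present in the index
  multiChunkRatio: number;   // share of those with more than one chunk (assumed)
  blobReadMs: number;        // per blob read, two reads per diff
  patchMs: number;           // per structuredPatch call
  workers: number;
}

interface ChurnEstimate {
  fileDiffs: number;
  perDiffMs: number;
  sequentialMs: number;
  parallelMs: number;
}

function estimateChunkChurn(input: ChurnEstimateInput): ChurnEstimate {
  const fileDiffs = Math.round(
    input.commits * input.changedFilesPerCommit * input.indexedRatio * input.multiChunkRatio
  );
  const perDiffMs = 2 * input.blobReadMs + input.patchMs;
  const sequentialMs = fileDiffs * perDiffMs;
  // Assumes work divides evenly across workers with no coordination cost.
  const parallelMs = sequentialMs / input.workers;
  return { fileDiffs: fileDiffs, perDiffMs: perDiffMs, sequentialMs: sequentialMs, parallelMs: parallelMs };
}

const typicalProject = estimateChunkChurn({
  commits: 200,
  changedFilesPerCommit: 5,
  indexedRatio: 0.6,
  multiChunkRatio: 2 / 3,
  blobReadMs: 1,
  patchMs: 0.5,
  workers: 10,
});
// typicalProject: 400 file diffs, 2.5 ms each, 1000 ms sequential, 100 ms with 10 workers
```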

Both phases are cached by HEAD SHA and run in background (non-blocking to indexing).

Skip Conditions

Chunk-level analysis is automatically skipped for:

  • Single-chunk files — chunk equals file, no granularity benefit.
  • Files with 1 commit — all chunks would get identical data.
  • Files exceeding TRAJECTORY_GIT_CHUNK_MAX_FILE_LINES — performance guard.
  • Binary files — blob read fails gracefully.
  • Root commits — no parent to diff against.
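The skip conditions can be expressed as a small predicate. This is a sketch under assumptions: `FileCandidate`, `skipReason`, and `isRootCommit` are hypothetical names, the order of checks is arbitrary, and in the real pipeline binary files are detected by a failed blob read rather than a flag.

```typescript
// Hypothetical per-file facts gathered before chunk-level analysis.
interface FileCandidate {
  chunkCount: number;
  commitCount: number;
  lineCount: number;
  isBinary: boolean;
}

// Return the reason a file is skipped, or null when it should be analyzed.
function skipReason(file: FileCandidate, maxFileLines: number): string | null {
  if (file.isBinary) return "binary";
  if (file.chunkCount <= 1) return "single-chunk";
  if (file.commitCount <= 1) return "single-commit";
  if (file.lineCount > maxFileLines) return "too-large";
  return null;
}

// Root commits have no parent tree to diff against, so they are skipped per commit.
function isRootCommit(parentShas: string[]): boolean {
  return parentShas.length === 0;
}
```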

Git Sessions: Squash-Aware Grouping

Why it exists. Agent-driven development produces bursts of micro-commits: a single refactor session might land as 15–20 "fix typo", "adjust", "wip" commits within a few minutes. Treating each as an independent commit wrecks every churn-based signal — a 20-commit session looks identical to 20 separate production incidents.

What it does. When TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS=true, the pipeline groups commits by (author, time gap). Any silence gap larger than TRAJECTORY_GIT_SESSION_GAP_MINUTES (default 30 min) starts a new session. Session count — not raw commit count — then feeds churn-related signals.
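A minimal sketch of the grouping rule, assuming the gap is measured between consecutive commits by the same author and that a gap exactly equal to the threshold does not split a session. `SessionCommit` and `countSessions` are hypothetical names for this example.

```typescript
// Hypothetical minimal commit shape for session grouping.
interface SessionCommit {
  author: string;
  timestampMs: number; // commit time in milliseconds since the epoch
}

// Count sessions: a new session starts for an author whenever the silence
// since that author's previous commit is larger than the gap.
function countSessions(commits: SessionCommit[], gapMinutes: number): number {
  const gapMs = gapMinutes * 60000;
  const sorted = commits.slice().sort((a, b) => a.timestampMs - b.timestampMs);

  const lastSeen: Record<string, number> = {};
  let sessions = 0;
  for (const commit of sorted) {
    const previous = lastSeen[commit.author];
    if (previous === undefined || commit.timestampMs - previous > gapMs) {
      sessions += 1;
    }
    lastSeen[commit.author] = commit.timestampMs;
  }
  return sessions;
}
```

With a 30-minute gap, three commits by one author at minutes 0, 5, and 10 collapse into a single session, while a fourth commit at minute 50 opens a second one.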

Where it matters most:

  • Solo devs pair-programming with an agent (single human + single agent author)
  • Teams adopting AI-assisted workflows where agents produce fine-grained commits
  • Any project where git log --oneline | wc -l is misleading because most commits are agent checkpoints, not logical deliverables

Impact on signals. commitCount, chunkCommitCount, bugFixRate, churnVolatility, and relativeChurn all use the deduplicated session count when this mode is on. recentDominantAuthor, blameDominantAuthor, and taskIds are unaffected — sessions affect counts, not who owns lines or who mentioned which ticket.

Default is false — opt in per project. Enable via the environment variable or the setup wizard (/tea-rags-setup:install step 7 — "Configure git analytics").

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| TRAJECTORY_GIT_ENABLED | true | Enable git enrichment during indexing. Set to false for non-git projects or fast iteration |
| TRAJECTORY_GIT_LOG_MAX_AGE_MONTHS | 12 | Time window for file-level git analysis (months). 0 = no age limit |
| TRAJECTORY_GIT_LOG_TIMEOUT_MS | 60000 | Timeout for git log --numstat (ms); falls back to native CLI on expiry |
| TRAJECTORY_GIT_CHUNK_MAX_AGE_MONTHS | 6 | Time window for chunk-level churn analysis (months). 0 = no age limit |
| TRAJECTORY_GIT_CHUNK_CONCURRENCY | 10 | Parallel commit processing for chunk churn |
| TRAJECTORY_GIT_CHUNK_TIMEOUT_MS | 120000 | Timeout for chunk churn CLI pathspec (ms) |
| TRAJECTORY_GIT_CHUNK_MAX_FILE_LINES | 10000 | Skip files larger than this for chunk analysis |
| TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS | false | Group commits into sessions (squash noise reduction) |
| TRAJECTORY_GIT_SESSION_GAP_MINUTES | 30 | Gap between commits to split sessions |
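
For reference, the variables and defaults above could be read into a typed object along these lines. The variable names and defaults come from the table; the `GitEnrichmentConfig` shape, the helper names, and the parsing rules (for example, falling back to the default on an unparsable number) are assumptions of this sketch, not documented tea-rags behavior.

```typescript
type Env = Record<string, string | undefined>;

interface GitEnrichmentConfig {
  enabled: boolean;
  logMaxAgeMonths: number;
  logTimeoutMs: number;
  chunkMaxAgeMonths: number;
  chunkConcurrency: number;
  chunkTimeoutMs: number;
  chunkMaxFileLines: number;
  squashAwareSessions: boolean;
  sessionGapMinutes: number;
}

// Parse an integer variable, using the default when unset or unparsable (assumed rule).
function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = parseInt(raw, 10);
  return isNaN(parsed) ? fallback : parsed;
}

// Parse a boolean variable; only the string "true" enables it (assumed rule).
function readBool(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  if (raw === undefined) return fallback;
  return raw.trim().toLowerCase() === "true";
}

function loadGitConfig(env: Env): GitEnrichmentConfig {
  return {
    enabled: readBool(env, "TRAJECTORY_GIT_ENABLED", true),
    logMaxAgeMonths: readInt(env, "TRAJECTORY_GIT_LOG_MAX_AGE_MONTHS", 12),
    logTimeoutMs: readInt(env, "TRAJECTORY_GIT_LOG_TIMEOUT_MS", 60000),
    chunkMaxAgeMonths: readInt(env, "TRAJECTORY_GIT_CHUNK_MAX_AGE_MONTHS", 6),
    chunkConcurrency: readInt(env, "TRAJECTORY_GIT_CHUNK_CONCURRENCY", 10),
    chunkTimeoutMs: readInt(env, "TRAJECTORY_GIT_CHUNK_TIMEOUT_MS", 120000),
    chunkMaxFileLines: readInt(env, "TRAJECTORY_GIT_CHUNK_MAX_FILE_LINES", 10000),
    squashAwareSessions: readBool(env, "TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS", false),
    sessionGapMinutes: readInt(env, "TRAJECTORY_GIT_SESSION_GAP_MINUTES", 30),
  };
}
```

Passing the environment in as a plain object (for example `loadGitConfig(process.env)` in Node.js) keeps the loader easy to test without touching global state.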