
TEA: Trajectory Enrichment Awareness

Standard code RAG systems embed source code as text and retrieve by semantic similarity alone. This works for "find code that looks like X" but ignores the history of how code evolved.

Trajectory enrichment attaches signals about code evolution to each chunk at index time — at the chunk level (individual functions/methods/classes), not just the file level. TeaRAGs ships two trajectory providers out of the box:

  • Git trajectory — signals derived from version control history: churn, authorship, volatility, bug-fix rates, task traceability
  • Static trajectory — signals derived from code structure itself: imports (blast radius), documentation weight, heading relevance, path risk, chunk density

A third provider — topological trajectory (symbol graphs, cross-file coupling, full blast radius analysis) — is on the roadmap.

Git Trajectory Signals

These signals describe how the code was developed over time:

| Signal Category | Examples | What It Captures |
| --- | --- | --- |
| Temporal | ageDays, lastModifiedAt, firstCreatedAt | When code was written and last changed |
| Churn | commitCount, relativeChurn, changeDensity, churnVolatility | How frequently and erratically code changes |
| Authorship | recentDominantAuthor / recentContributorCount / recentDominantAuthorPct (commit-window) and blameDominantAuthor / blameContributorCount / blameDominantAuthorPct (live-line) | Two parallel families: who's been actively committing lately vs who currently owns the live lines |
| Quality | bugFixRate, chunkBugFixRate | How often changes are bug fixes |
| Traceability | taskIds | Which tickets/issues drove the changes |

Two Granularity Levels

  • File-level: all chunks from a file share the same git metadata (e.g., blameDominantAuthor, recentContributorCount, commitCount)
  • Chunk-level: commits are mapped to specific line ranges via diff hunk analysis, giving per-function/method churn, bug-fix rate, and age (e.g., chunkCommitCount, chunkBugFixRate, chunkAgeDays) — distinguishes hot functions from stable ones within the same file
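The chunk-level mapping can be pictured as an interval-overlap check between diff hunks and chunk line ranges. The sketch below is illustrative only: the `Chunk` and `Hunk` types and the `chunk_commit_counts` helper are hypothetical names, and it assumes hunk line numbers have already been translated into the current file's coordinates, which a real implementation would have to handle as lines shift across history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    name: str
    start_line: int  # inclusive, 1-indexed
    end_line: int    # inclusive


@dataclass(frozen=True)
class Hunk:
    commit: str
    start_line: int  # first touched line, in current-file coordinates (assumption)
    line_count: int  # number of lines the hunk covers; 0 for a pure deletion


def chunk_commit_counts(chunks, hunks):
    """Count distinct commits whose hunks overlap each chunk's line range."""
    touched = {c.name: set() for c in chunks}
    for h in hunks:
        # Treat a zero-length hunk as touching its anchor line.
        h_end = h.start_line + max(h.line_count, 1) - 1
        for c in chunks:
            if h.start_line <= c.end_line and h_end >= c.start_line:
                touched[c.name].add(h.commit)
    return {name: len(commits) for name, commits in touched.items()}


chunks = [Chunk("parse", 1, 20), Chunk("render", 21, 60)]
hunks = [
    Hunk("a1", 5, 3),   # lines 5-7, inside parse
    Hunk("b2", 18, 6),  # lines 18-23, straddles both chunks
    Hunk("b2", 40, 1),  # same commit again, counted once for render
    Hunk("c3", 50, 0),  # pure deletion anchored at line 50
    Hunk("d4", 30, 2),  # lines 30-31, inside render
]
counts = chunk_commit_counts(chunks, hunks)
print(counts)
```

Counting distinct commits rather than hunks keeps a single large commit from inflating a chunk's churn.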

Confidence Dampening

Statistical signals (ownership, bugFixRate, volatility) are confidence-dampened when commit counts are low, preventing noisy data from dominating results. For example, a function with 1 commit and a 100% bugFixRate is weighted less than one with 20 commits and a 100% bugFixRate, because a single commit is too small a sample to trust.
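One common way to express this kind of dampening is a shrinkage factor that grows with sample size. The source does not state the formula TeaRAGs uses, so the `dampen` function and its `k` parameter below are assumptions chosen purely for illustration.

```python
def dampen(raw_value: float, commit_count: int, k: float = 5.0) -> float:
    """Scale a raw signal by a confidence factor in [0, 1).

    confidence = n / (n + k), so it approaches 1 as commit_count grows.
    The value of k is a hypothetical tuning constant, not a documented one.
    """
    if commit_count <= 0:
        return 0.0
    confidence = commit_count / (commit_count + k)
    return raw_value * confidence


one_commit = dampen(1.0, commit_count=1)       # 100% bugFixRate, 1 commit
twenty_commits = dampen(1.0, commit_count=20)  # 100% bugFixRate, 20 commits
print(round(one_commit, 3), round(twenty_commits, 3))
```

Under these assumptions, the single-commit function contributes far less to ranking than the twenty-commit one, even though both have the same raw rate.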

How This Differs from Standard Code RAG

| Aspect | Standard Code RAG | Trajectory-Enriched RAG |
| --- | --- | --- |
| Index time | Embed code text as vectors | Embed code text + attach git trajectory metadata per chunk |
| Retrieval | Rank by cosine similarity | Rank by similarity, then rerank using trajectory signals |
| "Find risky code" | Not possible (no risk signals) | rerank: "hotspots" (boost high-churn, high-bugfix chunks) |
| "Who owns this?" | Not possible | rerank: "ownership" (surface single-author knowledge silos via live-line ownership, i.e. git blame HEAD) |
| "Who's been committing here lately?" | Not possible | rerank: "recentActivityConcentration" (surface code with one dominant recent committer) |
| "What changed for ticket X?" | Not possible | taskId: "TD-1234" (trace code to requirements) |
| "Find stable examples" | Return whatever is most similar | rerank: "stable" (boost low-churn, well-established code) |
| Chunk granularity | Same score for all chunks in a file | Per-chunk churn overlay: each function/method tracked independently |
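The "rank by similarity, then rerank" step can be sketched as a weighted blend of the similarity score and a few trajectory signals. This is a minimal illustration, not the project's scoring code: the `rerank` function, the `alpha` blend factor, the `HOTSPOTS` weights, and the assumption that signals are normalized to [0, 1] are all hypothetical.

```python
def rerank(results, weights, alpha=0.7):
    """Re-order results by alpha * similarity + (1 - alpha) * weighted signals."""
    def score(result):
        signal = sum(
            weight * result["signals"].get(name, 0.0)
            for name, weight in weights.items()
        )
        return alpha * result["similarity"] + (1 - alpha) * signal

    return sorted(results, key=score, reverse=True)


# Hypothetical weights for a "hotspots"-style preset: favor churn and bug fixes.
HOTSPOTS = {"relativeChurn": 0.5, "bugFixRate": 0.5}

results = [
    {"id": "stable_util", "similarity": 0.90,
     "signals": {"relativeChurn": 0.1, "bugFixRate": 0.0}},
    {"id": "hot_parser", "similarity": 0.80,
     "signals": {"relativeChurn": 0.9, "bugFixRate": 0.8}},
]

ranked = rerank(results, HOTSPOTS)
print([r["id"] for r in ranked])
```

With these sample numbers, the slightly less similar but much hotter chunk moves to the top, which is the behavior the "Find risky code" row describes.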

The git trajectory layer is enabled by default (TRAJECTORY_GIT_ENABLED=true, legacy name CODE_ENABLE_GIT_METADATA still works). Set it to false to opt out — the system then operates as a standard semantic code search with AST-aware chunking, hybrid (BM25 + vector) retrieval, and structural signals from the static trajectory. The layer is also silently skipped for non-git directories.
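The flag resolution described above might look like the following sketch. The precedence of the new name over the legacy one, and treating only the literal string "false" as an opt-out, are assumptions; the source only says that both names work and that the default is enabled.

```python
import os


def git_trajectory_enabled(env=None) -> bool:
    """Resolve the git trajectory flag, defaulting to enabled.

    Assumes TRAJECTORY_GIT_ENABLED wins over the legacy
    CODE_ENABLE_GIT_METADATA when both are set.
    """
    env = os.environ if env is None else env
    raw = env.get(
        "TRAJECTORY_GIT_ENABLED",
        env.get("CODE_ENABLE_GIT_METADATA", "true"),
    )
    return raw.strip().lower() != "false"


print(git_trajectory_enabled({}))
print(git_trajectory_enabled({"CODE_ENABLE_GIT_METADATA": "false"}))
```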

Agentic Data-Driven Engineering

Trajectory enrichment opens the path to agentic data-driven engineering: a paradigm where AI coding agents make engineering decisions backed by empirical evidence from version control history rather than pattern-matching intuition.

Standard code RAG retrieves by semantic similarity: "find code that looks like X." An agent copies the first match without knowing if that code is stable, bug-prone, or written by an intern on their first day. With trajectory-enriched retrieval, every search result carries quality signals. An agent can reason about what to copy, what to avoid, and why code exists — before writing a single line.

Five core strategies:

  1. Stable Pattern Recognition (rerank: "stable") — find battle-tested, low-bug code as templates (low churn, low bugFixRate, survived production)
  2. Anti-Pattern Avoidance (rerank: "hotspots") — identify high-churn, bug-prone code to avoid (high bugFixRate, high churnVolatility)
  3. Style Consistency (rerank: "ownership") — match the live-line owner's patterns for a code area (their lines are still in the file, so their style is what's actually there)
  4. Historical Context (taskIds, metaOnly: true) — understand feature intent through ticket references
  5. Risk Assessment (rerank: "techDebt") — identify legacy code requiring defensive modification

This transforms code generation from artistic guesswork into data-driven engineering.
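An agent could encode the five strategies as a small lookup from intent to search parameters. The parameter names `rerank` and `metaOnly` and the preset values come from the list above; the `build_search_request` helper, the strategy keys, and the `query` field are hypothetical and do not describe the actual request schema.

```python
# Strategy name -> search parameters, using the presets named in the list above.
STRATEGY_PARAMS = {
    "stable_pattern": {"rerank": "stable"},
    "anti_pattern": {"rerank": "hotspots"},
    "style_consistency": {"rerank": "ownership"},
    "historical_context": {"metaOnly": True},  # read taskIds from the metadata
    "risk_assessment": {"rerank": "techDebt"},
}


def build_search_request(query: str, strategy: str) -> dict:
    """Combine a free-text query with the parameters for one strategy."""
    if strategy not in STRATEGY_PARAMS:
        raise ValueError(f"unknown strategy: {strategy}")
    return {"query": query, **STRATEGY_PARAMS[strategy]}


request = build_search_request("retry logic for HTTP clients", "stable_pattern")
print(request)
```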

👉 Full explanation with examples

Planned: Topological Trajectory Enrichment

In addition to the already-shipped git and static trajectories, a third provider — topological trajectory — is planned. It will add signals derived from deep code-structure analysis:

  • Symbol dependency graph — which functions/classes call or depend on each other
  • Cross-file coupling — files that frequently change together (logical coupling from commit co-occurrence)
  • Full blast radius — number of transitive dependents affected by changing a symbol (the current static trajectory ships a shallow approximation via the imports signal)

These signals will feed the same reranking layer, enabling queries like "find high-impact code with many dependents" or "find tightly coupled modules."
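Since this provider is only planned, the following is a conceptual sketch of two of the signals rather than a preview of the implementation. The function names and input shapes are assumptions: `co_change_counts` takes a list of per-commit file lists, and `blast_radius` takes a mapping from each symbol to its direct dependents.

```python
from collections import Counter, deque
from itertools import combinations


def co_change_counts(commits):
    """Count how often each pair of files appears in the same commit."""
    counts = Counter()
    for files in commits:
        for a, b in combinations(sorted(set(files)), 2):
            counts[(a, b)] += 1
    return counts


def blast_radius(dependents, symbol):
    """Number of transitive dependents reachable from a symbol (BFS)."""
    seen = set()
    queue = deque([symbol])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, ()):
            if dep not in seen and dep != symbol:
                seen.add(dep)
                queue.append(dep)
    return len(seen)


commits = [["a.ts", "b.ts"], ["a.ts", "b.ts", "c.ts"], ["c.ts"]]
coupling = co_change_counts(commits)

dependents = {"parse": ["load", "validate"], "load": ["main"], "validate": ["main"]}
radius = blast_radius(dependents, "parse")
print(coupling[("a.ts", "b.ts")], radius)
```

The `seen` set keeps shared dependents such as `main` from being counted twice and guards against cycles in the dependency graph.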

👉 Code Quality Metrics: Theory & Research