
Data Model

TeaRAGs stores everything in a single Qdrant collection per indexed codebase. Every indexed point is a chunk — a function, class, markdown section, or block — with a dense vector (and optionally a sparse vector for hybrid search) plus a rich payload of structural and git-derived signals.

This page is the authoritative field catalog. Source of truth: src/core/domains/trajectory/static/payload-signals.ts (base), src/core/domains/trajectory/git/payload-signals.ts (git), and StaticPayloadBuilder#buildPayload().

Collection Structure

A Qdrant collection contains:

  1. Chunk points — one per indexed chunk, carrying vector(s) + payload (all fields below).
  2. Schema metadata point — a reserved point with _type: "schema_metadata" holding collection-level bookkeeping (see Schema Versioning below).
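
Because chunk points and the schema metadata point share one collection, any code that scans raw payloads has to tell them apart. A minimal sketch, assuming payloads arrive as plain objects; `RawPayload`, `isSchemaMetadata`, and `chunkPayloadsOnly` are hypothetical names for illustration, not TeaRAGs APIs:

```typescript
// Hypothetical payload shape: only the `_type` discriminator is taken from this page.
type RawPayload = Record<string, unknown>;

// True for the reserved schema metadata point, false for ordinary chunk points.
function isSchemaMetadata(payload: RawPayload): boolean {
  return payload["_type"] === "schema_metadata";
}

// Keep only chunk payloads when iterating over a batch of points.
function chunkPayloadsOnly(payloads: RawPayload[]): RawPayload[] {
  return payloads.filter((p) => !isSchemaMetadata(p));
}
```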

Collection names are derived from a hash of the codebase's absolute path; see src/core/infra/collection-name.ts (resolveCollectionName).

Chunk Payload

Organized by namespace: base (structural), git.file.* (file-level git signals), git.chunk.* (chunk-level git signals).

Base — Always Present

Written by StaticPayloadBuilder on every chunk:

| Field | Type | Description |
| --- | --- | --- |
| content | string | The actual code or documentation text |
| contentSize | number | Character count of content |
| relativePath | string | Path relative to the codebase root |
| fileExtension | string | Extension with dot (e.g. ".ts") |
| language | string | Programming language (e.g. "typescript", "ruby", "markdown") |
| codebasePath | string | Absolute path of the codebase root (used for resolution) |
| startLine | number | First line of the chunk in its file |
| endLine | number | Last line of the chunk in its file |
| chunkIndex | number | Chunk position within the file (0-based) |
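
The always-present fields map naturally onto a TypeScript interface. The sketch below is illustrative only: `BasePayload`, `ChunkInput`, and `buildBasePayload` are hypothetical names, and the real logic lives in StaticPayloadBuilder#buildPayload(), which may derive these values differently.

```typescript
// Hypothetical interface mirroring the "always present" fields listed above.
interface BasePayload {
  content: string;
  contentSize: number;
  relativePath: string;
  fileExtension: string;
  language: string;
  codebasePath: string;
  startLine: number;
  endLine: number;
  chunkIndex: number;
}

// What this sketch expects from a chunker; the two derived fields are computed below.
type ChunkInput = Omit<BasePayload, "contentSize" | "fileExtension">;

function buildBasePayload(input: ChunkInput): BasePayload {
  // Extension of the final path segment, including the dot (e.g. ".ts").
  // Dotfiles such as ".gitignore" and extensionless files yield "".
  const fileName = input.relativePath.split("/").pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  const fileExtension = dot > 0 ? fileName.slice(dot) : "";
  // String length counts UTF-16 code units, used here as the character count.
  return { ...input, contentSize: input.content.length, fileExtension };
}
```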

Base — Conditional

Written only when relevant — absent for chunks where they don't apply:

| Field | Type | When written | Description |
| --- | --- | --- | --- |
| name | string | Code chunks with an identifier | Class/function/symbol name |
| chunkType | string | Chunks emitted by AST chunker | "function", "class", "interface", "block" |
| symbolId | string | Named code chunks | Unique ID: Class#method (instance), Class.method (static), functionName (top-level), doc:<hash> (docs) |
| parentSymbolId | string | Methods inside a class | Parent class/module name |
| parentType | string | Methods inside a class | Parent AST node type ("class_declaration", etc.) |
| isDocumentation | boolean | Markdown / doc chunks | true for doc sections |
| isTest | boolean | Test files | true when file matches test naming for the language |
| imports | string[] | Code chunks with file-level imports | File-level imports inherited by every chunk of the file |
| headingPath | {depth, text}[] | Doc chunks | Heading hierarchy leading to this chunk (used by documentationRelevance preset) |
| navigation | {prevSymbolId?, nextSymbolId?} | Chunks with adjacent symbols | Enables chunk-to-chunk navigation without re-reading the file |
| methodLines | number | Function chunks | Original method line count before chunk splitting (used by decomposition preset) |
| methodDensity | number | Function chunks | Characters per line, dampened for small chunks (a code-density heuristic) |
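
The symbolId conventions in the table can be made concrete with a small formatter. This is a sketch under assumptions: `SymbolInfo` and `formatSymbolId` are hypothetical names, and how TeaRAGs computes the doc hash is not described on this page, so the hash is simply passed in.

```typescript
// Hypothetical description of a symbol; field names are illustrative.
type SymbolInfo =
  | { kind: "method"; className: string; name: string; isStatic: boolean }
  | { kind: "function"; name: string }
  | { kind: "doc"; hash: string };

// Formats an ID following the conventions listed for symbolId:
// Class#method (instance), Class.method (static), functionName, doc:<hash>.
function formatSymbolId(sym: SymbolInfo): string {
  switch (sym.kind) {
    case "method":
      return `${sym.className}${sym.isStatic ? "." : "#"}${sym.name}`;
    case "function":
      return sym.name;
    case "doc":
      return `doc:${sym.hash}`;
  }
}
```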

git.file.* — File-Level Git Signals

Written by the git enrichment pipeline (phase 1) on every chunk of the file. All chunks of the same file share identical git.file.* values. See Git Enrichment Pipeline for computation details.

Primary signals (used by reranker)

| Field | Type | Label thresholds | Description |
| --- | --- | --- | --- |
| git.file.commitCount | number | low / typical / high / extreme | Total commits modifying this file |
| git.file.ageDays | number | recent / typical / old / legacy | Days since last modification |
| git.file.recentDominantAuthor | string | | Author with most commits in the recent commit window |
| git.file.recentAuthors | string[] | | All contributing authors in the recent commit window |
| git.file.recentDominantAuthorPct | number | shared / concentrated / silo / deep-silo | % of recent-window commits by the recent dominant author (0–100) |
| git.file.recentContributorCount | number | solo / pair / team / crowd | Distinct contributors in the recent commit window |
| git.file.blameDominantAuthor | string | | Author owning the most live lines according to git blame HEAD |
| git.file.blameAuthors | string[] | | All authors with at least one live line according to git blame HEAD |
| git.file.blameDominantAuthorPct | number | shared / concentrated / silo / deep-silo | % of live lines owned by the blame dominant author (0–100) |
| git.file.blameContributorCount | number | solo / pair / team / crowd | Distinct authors of currently-live lines |
| git.file.fileChurnCount | number | minimal / moderate / significant / massive | Total lines churned (added + deleted) |
| git.file.relativeChurn | number | normal / high | (linesAdded + linesDeleted) / currentLines |
| git.file.recencyWeightedFreq | number | normal / burst | Recency-weighted commit frequency |
| git.file.changeDensity | number | calm / active / intense | Commits per month |
| git.file.churnVolatility | number | stable / erratic | Standard deviation of commit-interval days |
| git.file.bugFixRate | number | healthy / concerning / critical | % of commits classified as bug fixes (0–100) |
| git.file.taskIds | string[] | | Task/ticket IDs extracted from commit messages (JIRA, GitHub, AzDO) |

Two ownership families: recent* vs blame*

Ownership is captured by two parallel signal families with different semantics:

  • recent* (commit-based, from a configurable recent commit window) — answers "who has been actively committing here lately?" Useful for code-review routing, activity hotspots, and detecting feature-in-progress by a sole recent committer.
  • blame* (live-line ownership from git blame HEAD) — answers "who actually owns the code that is in the file right now?" Useful for authority ("who must approve this change"), knowledge silo / bus-factor analysis, and style copy when generating code (the owner's lines are still there).

When a long-time owner stops contributing, blame* still attributes ownership to them (their lines remain), while recent* highlights newer committers. This divergence is itself information: it indicates a knowledge handoff in progress.
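
The divergence described above can be checked directly from the payload. A minimal sketch, assuming the git.file.* fields have already been read into an object; `OwnershipSignals` and `looksLikeHandoff` are hypothetical names, and the 50% cut-off is an arbitrary illustrative default rather than a TeaRAGs setting:

```typescript
// Subset of git.file.* fields used here; property names match the field catalog.
interface OwnershipSignals {
  recentDominantAuthor?: string;
  blameDominantAuthor?: string;
  blameDominantAuthorPct?: number; // 0 to 100
}

// True when live-line ownership and recent activity point at different people
// and the blame owner still holds a substantial share of the file.
function looksLikeHandoff(s: OwnershipSignals, minBlamePct = 50): boolean {
  if (!s.recentDominantAuthor || !s.blameDominantAuthor) return false;
  const differentOwner = s.recentDominantAuthor !== s.blameDominantAuthor;
  return differentOwner && (s.blameDominantAuthorPct ?? 0) >= minBlamePct;
}
```

Raising `minBlamePct` narrows the check to files where the original owner's code still dominates, which is where a handoff carries the most risk.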

Provenance fields (not used by rerank, kept for debugging)

| Field | Type | Description |
| --- | --- | --- |
| git.file.recentDominantAuthorEmail | string | Email of the recent dominant author (commit-window) |
| git.file.blameDominantAuthorEmail | string | Email of the blame dominant author (live-line owner) |
| git.file.lastModifiedAt | number | Unix timestamp of last commit |
| git.file.firstCreatedAt | number | Unix timestamp of first commit |
| git.file.lastCommitHash | string | SHA of the last commit touching the file |
| git.file.linesAdded | number | Cumulative lines added across all commits |
| git.file.linesDeleted | number | Cumulative lines deleted across all commits |
| git.file.enrichedAt | ISO string | When this payload was enriched |
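
Several primary signals can be recomputed from these provenance fields, which is what makes them useful for debugging. The sketch below follows the definitions given for git.file.relativeChurn and git.file.ageDays. It assumes lastModifiedAt is expressed in seconds (this page does not state the unit) and takes currentLines as an input because it is not a payload field; both function names are hypothetical.

```typescript
const SECONDS_PER_DAY = 86400;

// (linesAdded + linesDeleted) / currentLines, as defined for git.file.relativeChurn.
// Returns 0 for an empty file to avoid dividing by zero (a choice made for this sketch).
function relativeChurn(linesAdded: number, linesDeleted: number, currentLines: number): number {
  return currentLines > 0 ? (linesAdded + linesDeleted) / currentLines : 0;
}

// Whole days between the last commit and `nowSeconds`, never negative.
// Assumes both arguments are Unix timestamps in seconds.
function ageDays(lastModifiedAt: number, nowSeconds: number): number {
  return Math.max(0, Math.floor((nowSeconds - lastModifiedAt) / SECONDS_PER_DAY));
}
```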

git.chunk.* — Chunk-Level Git Signals

Written by the git enrichment pipeline (phase 2) only when chunk-level analysis applies: the file has more than one chunk and more than one commit, and is within the TRAJECTORY_GIT_CHUNK_MAX_AGE_MONTHS age limit. The fields are merged into the existing git.* payload via dot-notation so that file-level data is not clobbered.

| Field | Type | Label thresholds | Description |
| --- | --- | --- | --- |
| git.chunk.churnRatio | number | normal / concentrated | Chunk's share of file churn (0–1) |
| git.chunk.commitCount | number | low / typical / high / extreme | Commits touching this specific chunk |
| git.chunk.ageDays | number | recent / typical / old / legacy | Days since last modification to the chunk |
| git.chunk.recentContributorCount | number | solo / pair / team / crowd | Distinct recent-window contributors to the chunk |
| git.chunk.blameContributorCount | number | solo / pair / team / crowd | Distinct authors of currently-live lines in the chunk |
| git.chunk.bugFixRate | number | healthy / concerning / critical | Chunk-level bug-fix rate (0–100) |
| git.chunk.relativeChurn | number | normal / high | Churn relative to chunk size |
| git.chunk.recencyWeightedFreq | number | normal / burst | Chunk-level recency-weighted frequency |
| git.chunk.changeDensity | number | active / intense | Chunk commits per month |
| git.chunk.churnVolatility | number | stable / erratic | Standard deviation of chunk commit intervals |
| git.chunk.taskIds | string[] | | Task IDs from commits touching this chunk |
| git.chunk.lastModifiedAt | number | | Unix timestamp of the last chunk-touching commit |
| git.chunk.enrichedAt | ISO string | | When the chunk overlay was enriched |

Chunk vs file alpha-blending

The reranker blends chunk and file signals via confidence-weighted alpha — see ChunkChurnSignal and related derived signals. When chunk data is missing (single-chunk file, short history, old commit), file-level signals carry 100% of the weight automatically.
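
One plausible reading of the blend is a linear interpolation between the two levels. The sketch below assumes exactly that; `blendSignal` is a hypothetical name, and how TeaRAGs actually derives the confidence weight is not described on this page.

```typescript
// Blends a chunk-level and a file-level value.
// `alpha` is the confidence in the chunk signal, clamped to [0, 1].
// When no chunk value exists, the file signal carries all of the weight.
function blendSignal(fileValue: number, chunkValue: number | undefined, alpha: number): number {
  if (chunkValue === undefined) return fileValue;
  const a = Math.min(1, Math.max(0, alpha));
  return a * chunkValue + (1 - a) * fileValue;
}
```

With this shape the fallback needs no special handling by callers: a chunk without git.chunk.* data is scored from its file-level values alone.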

Vectors

Each chunk point stores:

  • Dense vector — embedding of content, dimension depends on provider (ONNX default 768, OpenAI text-embedding-3-small 1536, etc.). Used for semantic similarity.
  • Sparse vector (optional) — BM25-style term frequencies. Enabled when enableHybrid=true (default). Powers hybrid_search.

Collections created before hybrid became the default still work dense-only — use reindex_changes to migrate (it auto-enables hybrid).
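
A client that talks to both old and new collections has to pick a search mode per collection and make sure query embeddings match the stored dimension. A minimal sketch, assuming the collection's vector configuration has already been summarized into a plain object; `CollectionVectors`, `availableSearchMode`, and `assertDimension` are hypothetical names:

```typescript
// Hypothetical summary of what a collection stores per point.
interface CollectionVectors {
  denseSize: number;  // e.g. 768 or 1536, depending on the embedding provider
  hasSparse: boolean; // true when the collection was created with hybrid enabled
}

type SearchMode = "hybrid" | "dense";

// Hybrid search needs the sparse vector; older collections fall back to dense-only.
function availableSearchMode(v: CollectionVectors): SearchMode {
  return v.hasSparse ? "hybrid" : "dense";
}

// Reject query embeddings whose dimension does not match the collection.
function assertDimension(v: CollectionVectors, queryVector: number[]): void {
  if (queryVector.length !== v.denseSize) {
    throw new Error(`expected ${v.denseSize} dimensions, got ${queryVector.length}`);
  }
}
```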

Labels and Percentile Stats

Numeric signals declare stats.labels mapping percentiles to human-readable names (e.g. p75 → "high"). The indexer computes per-codebase percentile thresholds and stores them in the StatsCache, scoped by (language, signal, source|test).

The reranker uses these thresholds to attach labels to ranking overlays:

{
  "commitCount": { "value": 12, "label": "high" }
}

Thresholds differ per codebase — a TypeScript file with 8 commits is "high" in one project and "typical" in another. Retrieve the full threshold table via get_index_metrics or the tea-rags://schema/signal-labels resource.
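
To illustrate why the same value earns different labels in different projects, the sketch below derives thresholds from a sample and then labels a value. It is not the StatsCache implementation: `percentile`, `labelFor`, and `LabelThreshold` are hypothetical names, the nearest-rank method is an assumption, and only the p75 → "high" mapping comes from this page (the p95 → "extreme" mapping in the usage note is invented for the example).

```typescript
// One entry per label boundary: values at or above `min` earn `label`.
interface LabelThreshold {
  min: number;
  label: string;
}

// Nearest-rank percentile of a sample (p between 0 and 100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("empty sample");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Picks the label of the highest threshold the value reaches, else the fallback.
function labelFor(value: number, thresholds: LabelThreshold[], fallback: string): string {
  let best: LabelThreshold | undefined;
  for (const t of thresholds) {
    if (value >= t.min && (best === undefined || t.min > best.min)) best = t;
  }
  return best ? best.label : fallback;
}
```

For example, given commit counts `[1, 2, 3, 4, 5, 6, 8, 12]`, the p75 threshold is 6, so a file with 8 commits is labeled "high". In a busier codebase where p75 sits at 20, the same file would fall back to the default label.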

Schema Versioning

Every collection contains one reserved point (ID SCHEMA_METADATA_ID) with payload:

{
  "_type": "schema_metadata",
  "schemaVersion": 4,
  "sparseVersion": 1,
  "migratedAt": "2026-04-20T14:58:39.612Z",
  "indexes": ["language", "relativePath", "git.file.commitCount", "..."]
}

The server bumps schemaVersion when the Qdrant payload indexes change (adding a new indexed field). On startup, SchemaManager reads the stored version and reconciles indexes.

New payload fields (without new indexes) are handled by the separate SchemaDriftMonitor: it detects that code defines a field the stored points lack, warns the agent that a full reindex will populate them, and lets the user pick when to reindex. See Schema Drift vs Migrations for the philosophy.
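
The two mechanisms can be contrasted in a few lines. This is a deliberately naive sketch, not the SchemaManager or SchemaDriftMonitor code: `missingIndexes` and `driftedFields` are hypothetical helpers, and a real drift check would presumably need to account for conditional fields that are legitimately absent from some chunks.

```typescript
// Mirrors the metadata payload shown above (only the fields this sketch needs).
interface SchemaMetadata {
  schemaVersion: number;
  indexes: string[];
}

// Migration side: indexes to create so a stored collection catches up with the code.
// Returns an empty list when the stored version is already current.
function missingIndexes(stored: SchemaMetadata, current: SchemaMetadata): string[] {
  if (stored.schemaVersion >= current.schemaVersion) return [];
  const have = new Set(stored.indexes);
  return current.indexes.filter((field) => !have.has(field));
}

// Drift side: payload fields the code defines but a sampled stored point lacks.
// Nothing is changed here; the result would only feed a warning.
function driftedFields(definedFields: string[], storedPayloadKeys: string[]): string[] {
  const present = new Set(storedPayloadKeys);
  return definedFields.filter((field) => !present.has(field));
}
```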

Where Code Lives

| Concern | Source |
| --- | --- |
| Base payload builder | src/core/domains/trajectory/static/provider.ts (StaticPayloadBuilder) |
| Base signal catalog | src/core/domains/trajectory/static/payload-signals.ts (BASE_PAYLOAD_SIGNALS) |
| Git signal catalog | src/core/domains/trajectory/git/payload-signals.ts (gitPayloadSignalDescriptors) |
| Git enrichment pipeline | src/core/domains/trajectory/git/ + src/core/domains/ingest/pipeline/enrichment/ |
| Schema versioning | src/core/adapters/qdrant/schema-manager.ts (SchemaManager) |
| Percentile thresholds | src/core/infra/stats-cache.ts (StatsCache) |