Indexing Pipeline
The indexing pipeline converts a codebase directory into a populated Qdrant collection. It runs in two flavours — full index (first time, or forceReindex: true) and incremental reindex (subsequent runs) — both built on the same stages.
For the payload written per chunk, see Data Model. For how git.* signals are computed after chunks land in Qdrant, see Git Enrichment Pipeline.
High-Level Flow
Indexing returns as soon as dense chunks are stored. Git enrichment continues in the background and populates git.* payload asynchronously — search works immediately, trajectory-aware signals come online shortly after.
Stages
1. Scan
FileScanner walks the codebase root and yields file paths. It reads:
.gitignore— standard git ignore rules.contextignore— TeaRAGs-specific overrides (e.g. exclude generated code)--ignorePatterns— per-call overrides from theindex_codebasetool- Language heuristics — binary files, lockfiles, and
dist//build/folders skipped by default
Output: FileList — absolute paths, totals, per-extension counts.
2. Setup Collection
IndexPipeline.setupCollection decides whether to:
- Create fresh — first time, or after
forceReindex. Allocates a new versioned collection ({name}_{embeddingModelId}_{schemaVersion}) and an alias pointing at it. - Reuse — the alias already points at a valid collection matching the current embedding model and schema version.
- Migrate — collection exists but schema drifted (new indexes).
SchemaManagercreates missing indexes and bumpsschemaVersion.
A snapshot of the current index state is written before any mutation, so interrupted runs can resume without data loss.
3. Process (Chunk → Embed → Upsert)
IndexPipeline.processAndTrack streams files through the ChunkPipeline worker pool. Inside the pipeline:
- Chunk — AST-aware chunker splits code by language-specific hooks (Ruby uses
alwaysExtractChildren, TypeScript usescomment-capture+class-body-chunker, etc.). Markdown is split by heading hierarchy. - Build payload —
StaticPayloadBuilderwrites the base payload (content, relativePath, symbolId, imports, navigation, …). - Embed — the embedding provider (ONNX / Ollama / OpenAI / Cohere / Voyage) computes a dense vector per chunk. Configurable batch size via
EMBEDDING_TUNE_BATCH_SIZE. - Sparse vectors — when hybrid is enabled, BM25 token frequencies are computed alongside.
- Batch upsert — dense + sparse + payload written to Qdrant in configurable batches. Concurrency controlled by
INGEST_PIPELINE_CONCURRENCY(default 1 — most providers are bottlenecked inside, not in flight).
Each batch triggers onBatchUpserted → EnrichmentCoordinator.onChunksStored, queuing git enrichment asynchronously so indexing throughput isn't blocked by git log parsing.
4. Finalize Alias
Once all chunks are upserted, finalizeAlias atomically switches the public alias from the old collection (if any) to the new one. Search traffic sees zero downtime — clients query the alias, not the underlying collection name.
5. Snapshot
A compact snapshot of file hashes is persisted to ~/.tea-rags/snapshots/{collection}.json (sharded for large repos). Subsequent reindex_changes calls diff the snapshot against the current working tree to find only-changed files.
Incremental Reindex
ReindexPipeline.reindexChanges skips the full walk:
- Prepare context — load snapshot, resolve collection via alias.
- Run migrations — reconcile any schema drift without re-embedding.
- Diff files — compare current hashes against snapshot → three lists:
added,modified,deleted. - Execute parallel pipelines —
ParallelSynchronizerrunsDeletionStrategy(remove deleted/modified chunks) and a regularChunkPipeline(index added/modified) concurrently on the same collection. - Finalize — refresh alias, save updated snapshot.
Incremental reindex is typically 10–100× faster than a full run because embedding (the dominant cost) only runs on changed files.
Enrichment Handoff
The enrichment pipeline is structurally separate from indexing:
| Aspect | Indexing | Enrichment |
|---|---|---|
| Trigger | index_codebase / reindex_changes | onChunksStored hook (after each batch) |
| Blocks return? | Yes — must finish before alias switch | No — async, continues after |
| Writes to | Base payload (content, structure) | git.* payload via batchSetPayload |
| Failure mode | Aborts the index | Logged, chunk keeps base payload |
This separation means users get working search within seconds even on fresh indexes of millions-of-lines repos — ranking by trajectory signals just warms up in the background. Check status with get_index_status.
Parallelism Summary
| Axis | Mechanism | Tuning |
|---|---|---|
| File discovery | Sequential (IO-bound, fast) | — |
| Chunking | Worker pool | INGEST_TUNE_CHUNKER_POOL_SIZE |
| Embedding | Provider batches + optional concurrency | EMBEDDING_TUNE_BATCH_SIZE, INGEST_PIPELINE_CONCURRENCY |
| File-level concurrency | BaseIndexingPipeline | INGEST_TUNE_FILE_CONCURRENCY |
| Qdrant upserts | Async batch queue | INGEST_BATCH_SIZE |
| Git enrichment | Chunk-level worker pool | TRAJECTORY_GIT_CHUNK_CONCURRENCY |
See Performance Tuning for recommended values per hardware profile.
Where Code Lives
| Stage | Source |
|---|---|
| Orchestration | src/core/api/internal/facades/ingest-facade.ts (IngestFacade) |
| Full index | src/core/domains/ingest/indexing.ts (IndexPipeline) |
| Incremental | src/core/domains/ingest/reindexing.ts (ReindexPipeline) |
| File scanning | src/core/domains/ingest/pipeline/scanner.ts (FileScanner) |
| Chunk pipeline | src/core/domains/ingest/pipeline/chunk-pipeline.ts (ChunkPipeline) |
| Chunker hooks | src/core/domains/ingest/pipeline/chunker/hooks/ |
| Payload builder | src/core/domains/trajectory/static/provider.ts |
| Enrichment hook | src/core/domains/ingest/pipeline/enrichment/coordinator.ts |
| Alias / snapshot | src/core/domains/ingest/sync/, src/core/domains/ingest/alias-cleanup.ts |
Related
- Data Model — payload fields produced by this pipeline
- Git Enrichment Pipeline — what runs after
onChunksStored - Indexing Repositories — user-facing guide with environment variables