Indexing Pipeline

The indexing pipeline converts a codebase directory into a populated Qdrant collection. It runs in two flavours — full index (first time, or forceReindex: true) and incremental reindex (subsequent runs) — both built on the same stages.

For the payload written per chunk, see Data Model. For how git.* signals are computed after chunks land in Qdrant, see Git Enrichment Pipeline.

High-Level Flow

Indexing returns as soon as dense chunks are stored. Git enrichment continues in the background and populates git.* payload asynchronously — search works immediately, trajectory-aware signals come online shortly after.

Stages

1. Scan

FileScanner walks the codebase root and yields file paths. It reads:

  • .gitignore — standard git ignore rules
  • .contextignore — TeaRAGs-specific overrides (e.g. exclude generated code)
  • --ignorePatterns — per-call overrides from the index_codebase tool
  • Language heuristics — binary files, lockfiles, and dist/ and build/ folders are skipped by default

Output: FileList — absolute paths, totals, per-extension counts.
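
The filtering rules above can be sketched as follows. This is a minimal illustration, not the actual FileScanner: the `IgnoreRules` shape, the `shouldSkip` and `countByExtension` helpers, and the default values are all hypothetical, and the rule format is far simpler than real .gitignore syntax.

```typescript
// Hypothetical, simplified ignore rules; real .gitignore syntax is richer.
interface IgnoreRules {
  dirs: Set<string>;       // directory names skipped anywhere in the path
  fileNames: Set<string>;  // exact file names, e.g. lockfiles
  extensions: Set<string>; // extensions including the dot, e.g. binaries
}

// Assumed defaults, loosely mirroring "binary files, lockfiles, dist/ and build/".
const defaultRules: IgnoreRules = {
  dirs: new Set(["dist", "build"]),
  fileNames: new Set(["package-lock.json", "yarn.lock"]),
  extensions: new Set([".png", ".jpg", ".zip"]),
};

// Returns the extension including the dot, or "" when there is none.
function extensionOf(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(dot) : "";
}

function shouldSkip(relativePath: string, rules: IgnoreRules): boolean {
  const parts = relativePath.split("/");
  const fileName = parts[parts.length - 1] ?? "";
  const dirNames = parts.slice(0, -1);
  if (dirNames.some((d) => rules.dirs.has(d))) return true;
  if (rules.fileNames.has(fileName)) return true;
  const ext = extensionOf(fileName);
  return ext !== "" && rules.extensions.has(ext);
}

// Summarise surviving files per extension, similar in spirit to FileList.
function countByExtension(paths: string[], rules: IgnoreRules): Map<string, number> {
  const counts = new Map<string, number>();
  for (const p of paths) {
    if (shouldSkip(p, rules)) continue;
    const ext = extensionOf(p.slice(p.lastIndexOf("/") + 1)) || "(none)";
    counts.set(ext, (counts.get(ext) ?? 0) + 1);
  }
  return counts;
}
```

Keeping the skip decision a pure function of the path and the rules makes it easy to layer the sources in order (defaults, then .gitignore, then .contextignore, then per-call patterns) by merging rule sets before the walk starts.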

2. Setup Collection

IndexPipeline.setupCollection decides whether to:

  • Create fresh — first time, or after forceReindex. Allocates a new versioned collection ({name}_{embeddingModelId}_{schemaVersion}) and an alias pointing at it.
  • Reuse — the alias already points at a valid collection matching the current embedding model and schema version.
  • Migrate — collection exists but schema drifted (new indexes). SchemaManager creates missing indexes and bumps schemaVersion.

A snapshot of the current index state is written before any mutation, so interrupted runs can resume without data loss.
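
The three outcomes can be read as a small decision function. The sketch below is an assumption about the logic, not the code in IndexPipeline.setupCollection: the `decideSetup` and `versionedName` helpers, the numeric `schemaVersion`, and the choice to create a fresh collection on an embedding-model mismatch are all hypothetical.

```typescript
type SetupAction = "create" | "reuse" | "migrate";

interface ExistingCollection {
  embeddingModelId: string;
  schemaVersion: number;
}

interface SetupInput {
  existing: ExistingCollection | null; // null when the alias resolves to nothing
  embeddingModelId: string;            // model configured for this run
  schemaVersion: number;               // version the current code expects
  forceReindex: boolean;
}

// Builds a name in the {name}_{embeddingModelId}_{schemaVersion} shape.
function versionedName(name: string, modelId: string, schemaVersion: number): string {
  return `${name}_${modelId}_${schemaVersion}`;
}

function decideSetup(input: SetupInput): SetupAction {
  if (input.forceReindex || input.existing === null) return "create";
  // Assumption: vectors from a different embedding model cannot be reused.
  if (input.existing.embeddingModelId !== input.embeddingModelId) return "create";
  // Schema drift: only indexes are missing, so no re-embedding is needed.
  if (input.existing.schemaVersion < input.schemaVersion) return "migrate";
  return "reuse";
}
```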

3. Process (Chunk → Embed → Upsert)

IndexPipeline.processAndTrack streams files through the ChunkPipeline worker pool. Inside the pipeline:

  1. Chunk — AST-aware chunker splits code by language-specific hooks (Ruby uses alwaysExtractChildren, TypeScript uses comment-capture + class-body-chunker, etc.). Markdown is split by heading hierarchy.
  2. Build payload — StaticPayloadBuilder writes the base payload (content, relativePath, symbolId, imports, navigation, …).
  3. Embed — the embedding provider (ONNX / Ollama / OpenAI / Cohere / Voyage) computes a dense vector per chunk. Configurable batch size via EMBEDDING_TUNE_BATCH_SIZE.
  4. Sparse vectors — when hybrid is enabled, BM25 token frequencies are computed alongside.
  5. Batch upsert — dense + sparse + payload written to Qdrant in configurable batches. Concurrency is controlled by INGEST_PIPELINE_CONCURRENCY (default 1, because most providers are bottlenecked internally rather than by the number of requests in flight).

Each batch triggers onBatchUpserted → EnrichmentCoordinator.onChunksStored, queuing git enrichment asynchronously so indexing throughput isn't blocked by git log parsing.
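
The batch-then-hook ordering can be sketched as below. Everything here is illustrative: `processChunks`, `toBatches`, and `fakeEmbed` are hypothetical names, the embedder is a deterministic stand-in rather than a real model, and the sketch runs synchronously where the real pipeline uses a worker pool.

```typescript
interface Chunk { id: string; content: string; }
interface Point { id: string; vector: number[]; payload: { content: string }; }

// Split an array into fixed-size batches; the last one may be shorter.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("batch size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Placeholder embedder: NOT a real model, just a deterministic stand-in.
function fakeEmbed(text: string): number[] {
  return [text.length, text.split(/\s+/).filter(Boolean).length];
}

// Embeds and stores chunks batch by batch; returns the number of stored points.
function processChunks(
  chunks: Chunk[],
  batchSize: number,
  upsert: (points: Point[]) => void,
  onBatchUpserted: (chunkIds: string[]) => void,
): number {
  let stored = 0;
  for (const batch of toBatches(chunks, batchSize)) {
    const points: Point[] = batch.map((c) => ({
      id: c.id,
      vector: fakeEmbed(c.content),
      payload: { content: c.content },
    }));
    upsert(points);
    // Hand off to enrichment only after the batch has been stored.
    onBatchUpserted(points.map((p) => p.id));
    stored += points.length;
  }
  return stored;
}
```

Calling the hook after each upsert, rather than once at the end, lets enrichment start on early batches while later ones are still being embedded.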

4. Finalize Alias

Once all chunks are upserted, finalizeAlias atomically switches the public alias from the old collection (if any) to the new one. Search traffic sees zero downtime — clients query the alias, not the underlying collection name.
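
Qdrant applies all actions in a single update-aliases request (POST /collections/aliases) atomically, which is what makes a swap like this possible. The sketch below only builds such a request body; `buildAliasSwitch` is a hypothetical helper, and whether finalizeAlias sends exactly this payload is an assumption.

```typescript
// Action shapes accepted by Qdrant's update-aliases endpoint.
type AliasAction =
  | { delete_alias: { alias_name: string } }
  | { create_alias: { collection_name: string; alias_name: string } };

// Builds one request body that repoints `alias` at `newCollection`.
function buildAliasSwitch(
  alias: string,
  newCollection: string,
  aliasAlreadyExists: boolean,
): { actions: AliasAction[] } {
  const actions: AliasAction[] = [];
  // Drop the old binding first; both actions travel in the same request.
  if (aliasAlreadyExists) actions.push({ delete_alias: { alias_name: alias } });
  actions.push({ create_alias: { collection_name: newCollection, alias_name: alias } });
  return { actions };
}
```

On a first index there is no old binding, so the body contains only the create action.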

5. Snapshot

A compact snapshot of file hashes is persisted to ~/.tea-rags/snapshots/{collection}.json (sharded for large repos). Subsequent reindex_changes calls diff the snapshot against the current working tree to find only the files that changed.

Incremental Reindex

ReindexPipeline.reindexChanges skips the full walk:

  1. Prepare context — load snapshot, resolve collection via alias.
  2. Run migrations — reconcile any schema drift without re-embedding.
  3. Diff files — compare current hashes against snapshot → three lists: added, modified, deleted.
  4. Execute parallel pipelines — ParallelSynchronizer runs DeletionStrategy (remove deleted/modified chunks) and a regular ChunkPipeline (index added/modified) concurrently on the same collection.
  5. Finalize — refresh alias, save updated snapshot.

Incremental reindex is typically 10–100× faster than a full run because embedding (the dominant cost) only runs on changed files.
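
The diff step and the split into deletion and indexing work can be sketched as follows. This is a minimal illustration that assumes the snapshot is a map from relative path to content hash; `diffSnapshot` and `planWork` are hypothetical names, not functions from ReindexPipeline.

```typescript
interface FileDiff { added: string[]; modified: string[]; deleted: string[]; }

// Compare the stored snapshot with freshly computed hashes.
function diffSnapshot(previous: Map<string, string>, current: Map<string, string>): FileDiff {
  const diff: FileDiff = { added: [], modified: [], deleted: [] };
  current.forEach((hash, path) => {
    const old = previous.get(path);
    if (old === undefined) diff.added.push(path);
    else if (old !== hash) diff.modified.push(path);
  });
  previous.forEach((_hash, path) => {
    if (!current.has(path)) diff.deleted.push(path);
  });
  return diff;
}

// Modified files appear on both sides: stale chunks are removed, new ones indexed.
function planWork(diff: FileDiff): { toDelete: string[]; toIndex: string[] } {
  return {
    toDelete: [...diff.deleted, ...diff.modified].sort(),
    toIndex: [...diff.added, ...diff.modified].sort(),
  };
}
```

Unchanged files land in neither list, so they are never re-embedded, which is where the speedup over a full run comes from.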

Enrichment Handoff

The enrichment pipeline is structurally separate from indexing:

| Aspect | Indexing | Enrichment |
| --- | --- | --- |
| Trigger | index_codebase / reindex_changes | onChunksStored hook (after each batch) |
| Blocks return? | Yes — must finish before alias switch | No — async, continues after |
| Writes to | Base payload (content, structure) | git.* payload via batchSetPayload |
| Failure mode | Aborts the index | Logged, chunk keeps base payload |

This separation means users get working search within seconds, even on a fresh index of a multi-million-line repository; ranking by trajectory signals warms up in the background. Check status with get_index_status.
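
The enrichment failure mode can be illustrated with a small sketch: an error while computing git signals for one chunk is logged and leaves that chunk's base payload untouched. The `enrichAll` helper and the `commitCount` field are hypothetical, chosen only to make the example concrete, and the real pipeline writes through batchSetPayload rather than mutating objects in memory.

```typescript
interface StoredChunk {
  id: string;
  // `git` stays undefined until enrichment succeeds for this chunk.
  payload: { content: string; git?: { commitCount: number } };
}

// Applies enrichment chunk by chunk; returns how many chunks were enriched.
function enrichAll(
  chunks: StoredChunk[],
  compute: (chunk: StoredChunk) => { commitCount: number },
  log: (message: string) => void,
): number {
  let enriched = 0;
  for (const chunk of chunks) {
    try {
      chunk.payload.git = compute(chunk);
      enriched += 1;
    } catch (err) {
      // Failure is isolated: the chunk keeps its base payload and stays searchable.
      log(`enrichment failed for ${chunk.id}: ${String(err)}`);
    }
  }
  return enriched;
}
```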

Parallelism Summary

| Axis | Mechanism | Tuning |
| --- | --- | --- |
| File discovery | Sequential (IO-bound, fast) | |
| Chunking | Worker pool | INGEST_TUNE_CHUNKER_POOL_SIZE |
| Embedding | Provider batches + optional concurrency | EMBEDDING_TUNE_BATCH_SIZE, INGEST_PIPELINE_CONCURRENCY |
| File-level concurrency | BaseIndexingPipeline | INGEST_TUNE_FILE_CONCURRENCY |
| Qdrant upserts | Async batch queue | INGEST_BATCH_SIZE |
| Git enrichment | Chunk-level worker pool | TRAJECTORY_GIT_CHUNK_CONCURRENCY |

See Performance Tuning for recommended values per hardware profile.

Where Code Lives

| Stage | Source |
| --- | --- |
| Orchestration | src/core/api/internal/facades/ingest-facade.ts (IngestFacade) |
| Full index | src/core/domains/ingest/indexing.ts (IndexPipeline) |
| Incremental | src/core/domains/ingest/reindexing.ts (ReindexPipeline) |
| File scanning | src/core/domains/ingest/pipeline/scanner.ts (FileScanner) |
| Chunk pipeline | src/core/domains/ingest/pipeline/chunk-pipeline.ts (ChunkPipeline) |
| Chunker hooks | src/core/domains/ingest/pipeline/chunker/hooks/ |
| Payload builder | src/core/domains/trajectory/static/provider.ts |
| Enrichment hook | src/core/domains/ingest/pipeline/enrichment/coordinator.ts |
| Alias / snapshot | src/core/domains/ingest/sync/, src/core/domains/ingest/alias-cleanup.ts |