Code Vectorization

Code vectorization transforms your source code into searchable vector embeddings, enabling:

  • Natural language search — ask questions about your code in plain English
  • Semantic understanding — find code by intent, not just keywords
  • Cross-language search — find similar patterns across different languages
  • Git-aware context — understand authorship, code age, and task history

The Indexing Pipeline

1. File Discovery

The indexer scans your project, respecting:

  • .gitignore patterns
  • .contextignore patterns (project-specific overrides)
  • Built-in exclusions (node_modules, vendor, etc.)
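As a rough illustration of how the three pattern sources could be combined, here is a minimal sketch. It assumes simplified glob matching (no negation or anchoring rules), treats .contextignore entries as additional patterns, and uses a hypothetical built-in exclusion list. The names `BUILTIN_EXCLUDES`, `is_ignored`, and `discover` are illustrative and not part of the tool.

```python
import fnmatch
from pathlib import PurePosixPath

# Hypothetical built-in exclusions; the full real list is not given here.
BUILTIN_EXCLUDES = ["node_modules/", "vendor/", ".git/"]


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Return True if a POSIX-style relative path matches any pattern.

    Simplified rules (an assumption, not real .gitignore semantics):
    a pattern ending in "/" excludes a directory of that name at any depth;
    any other pattern is matched against the file name and the full path.
    """
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch.fnmatch(parts[-1], pattern) or fnmatch.fnmatch(path, pattern):
            return True
    return False


def discover(paths: list[str], gitignore: list[str], contextignore: list[str]) -> list[str]:
    """Keep only the paths that survive all three pattern sources."""
    patterns = BUILTIN_EXCLUDES + gitignore + contextignore
    return [p for p in paths if not is_ignored(p, patterns)]
```

For example, `discover(files, ["dist/", "*.log"], ["generated/"])` would drop build output, log files, and anything under a `generated` directory, in addition to the built-in exclusions.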

2. AST-Aware Chunking

Code is split with language-aware parsers (tree-sitter for most languages, see the table below). Chunks follow semantic boundaries such as functions, classes, and methods, rather than being cut at arbitrary line counts.

Language              | Parser      | Features
TypeScript/JavaScript | tree-sitter | Classes, functions, interfaces, types, imports
Python                | tree-sitter | Classes, functions, decorators, async
Ruby                  | tree-sitter | Classes, modules, methods, Rails DSL groups
Go                    | tree-sitter | Structs, functions, interfaces
Java/C#/C++           | tree-sitter | Classes, methods, namespaces
Rust                  | tree-sitter | Structs, impl blocks, traits
PHP                   | tree-sitter | Classes, functions, traits
Markdown              | remark      | Sections, code blocks
Others                | Line-based  | Fallback chunking with configurable size
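The line-based fallback for unsupported languages can be pictured with a short sketch. This assumes the configurable size is a maximum line count. The `overlap` parameter and the `startLine`/`endLine`/`content` field names are assumptions added for illustration, not documented options.

```python
def chunk_by_lines(text: str, max_lines: int = 50, overlap: int = 0) -> list[dict]:
    """Split text into windows of at most `max_lines` lines.

    `overlap` (hypothetical) repeats that many trailing lines at the start
    of the next chunk so context is not lost at the cut.
    """
    if max_lines <= 0 or not 0 <= overlap < max_lines:
        raise ValueError("need max_lines > 0 and 0 <= overlap < max_lines")

    lines = text.splitlines()
    chunks = []
    step = max_lines - overlap
    for start in range(0, len(lines), step):
        window = lines[start:start + max_lines]
        chunks.append({
            "startLine": start + 1,            # 1-based, inclusive
            "endLine": start + len(window),    # 1-based, inclusive
            "content": "\n".join(window),
        })
        if start + max_lines >= len(lines):
            break  # the last window already reached the end of the file
    return chunks
```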

Each chunk carries:

  • Semantic boundaries: functions, classes, and methods stay intact
  • Parent context: parentName and parentType for nested code (e.g., a method inside a class)
  • Location info: file path and start/end line numbers
  • Language metadata: used for filtering by language
  • Symbol ID: a unique identifier such as MyClass.processData
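To make the chunk metadata concrete, the sketch below chunks Python source along function and class boundaries. It uses Python's built-in `ast` module as a stand-in for tree-sitter, so it shows the idea rather than the indexer's actual parser. Apart from `parentName` and `parentType`, the field names (`filePath`, `startLine`, `endLine`, `chunkType`, `symbolId`) are assumptions.

```python
import ast


def chunk_python(source: str, file_path: str) -> list[dict]:
    """Emit one chunk per class, method, and top-level function."""
    chunks = []

    def visit(node, parent_name, parent_type):
        for child in ast.iter_child_nodes(node):
            if not isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                continue
            is_class = isinstance(child, ast.ClassDef)
            if is_class:
                chunk_type = "class"
            elif parent_type == "class":
                chunk_type = "method"
            else:
                chunk_type = "function"
            symbol_id = f"{parent_name}.{child.name}" if parent_name else child.name
            # Start at the first decorator so decorators stay with their target.
            start = min([child.lineno] + [d.lineno for d in child.decorator_list])
            chunks.append({
                "filePath": file_path,
                "language": "python",
                "chunkType": chunk_type,
                "symbolId": symbol_id,
                "parentName": parent_name,
                "parentType": parent_type,
                "startLine": start,
                "endLine": child.end_lineno,
            })
            if is_class:
                # Descend into classes only; nested functions stay inside
                # their enclosing function's chunk.
                visit(child, symbol_id, "class")

    visit(ast.parse(source), None, None)
    return chunks
```

In this sketch a class chunk overlaps the chunks of its methods. Whether a real indexer emits both, or only the methods plus a class header, is a design choice this page does not specify.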

3. Vector Embedding

Chunks are converted to vectors using your configured embedding provider:

Provider             | Type  | Privacy                                 | Best For
Ollama (recommended) | Local | Full (code never leaves your machine)   | Production, privacy-sensitive
OpenAI               | Cloud | API                                     | Quick setup
Cohere               | Cloud | API                                     | General text
Voyage AI            | Cloud | API                                     | Code-specialized models
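Whichever provider is configured, the embedding step amounts to sending chunk text in batches and pairing each chunk with the vector that comes back. The sketch below assumes a pluggable `embed` callable. `toy_embed` is a deterministic stand-in that carries no semantic meaning and only exists so the example runs without a provider. All names here are hypothetical.

```python
import hashlib
from typing import Callable

# Any provider can be wrapped as: list of texts in, list of vectors out.
EmbedFn = Callable[[list[str]], list[list[float]]]


def toy_embed(texts: list[str], dim: int = 8) -> list[list[float]]:
    """Stand-in embedder: pseudo-vectors derived from SHA-256 (not semantic)."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


def embed_chunks(chunks: list[dict], embed: EmbedFn, batch_size: int = 32) -> list[tuple[dict, list[float]]]:
    """Embed chunk contents in batches and pair each chunk with its vector."""
    results = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        vectors = embed([chunk["content"] for chunk in batch])
        if len(vectors) != len(batch):
            raise ValueError("provider returned a different number of vectors")
        results.extend(zip(batch, vectors))
    return results
```

Swapping providers then means swapping the `embed` callable. The batching and pairing logic stays the same.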

4. Storage in Qdrant

Vectors are stored in Qdrant with full metadata payloads (file path, language, chunk type, parent info, git metadata). Payload indexes enable fast filtered search.

5. Incremental Indexing

After initial indexing, reindex_changes detects:

  • Added files — new files since last index
  • Modified files — changed content (content-hash based detection)
  • Deleted files — removed files

Only affected chunks are updated. Hash-based change detection uses a two-level Merkle tree with consistent hashing across sharded snapshots, enabling fast diff computation even for large codebases.
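The idea behind two-level hash comparison can be sketched as follows: file hashes are grouped into shards, each shard gets its own hash, and a root hash covers all shards. Unchanged shards are skipped without looking at individual files. This is a simplified illustration: it assigns files to shards with a plain modulo rather than consistent hashing, and the snapshot layout and function names are assumptions.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def shard_of(path: str, shard_count: int) -> int:
    """Stable shard assignment by path (plain modulo, not consistent hashing)."""
    return int(sha(path.encode("utf-8"))[:8], 16) % shard_count


def build_snapshot(files: dict[str, bytes], shard_count: int = 4) -> dict:
    """Level 1: per-file content hashes grouped by shard. Level 2: shard hashes and root."""
    shards = [dict() for _ in range(shard_count)]
    for path, content in files.items():
        shards[shard_of(path, shard_count)][path] = sha(content)
    shard_hashes = [
        sha("".join(f"{p}:{h}\n" for p, h in sorted(s.items())).encode("utf-8"))
        for s in shards
    ]
    return {"root": sha("".join(shard_hashes).encode("utf-8")),
            "shardHashes": shard_hashes, "shards": shards}


def diff(old: dict, new: dict) -> dict:
    """Compare snapshots built with the same shard_count."""
    added, modified, deleted = [], [], []
    if old["root"] != new["root"]:
        for i, (old_hash, new_hash) in enumerate(zip(old["shardHashes"], new["shardHashes"])):
            if old_hash == new_hash:
                continue  # whole shard unchanged: skip its files
            old_files, new_files = old["shards"][i], new["shards"][i]
            for path, file_hash in new_files.items():
                if path not in old_files:
                    added.append(path)
                elif old_files[path] != file_hash:
                    modified.append(path)
            deleted.extend(p for p in old_files if p not in new_files)
    return {"added": sorted(added), "modified": sorted(modified), "deleted": sorted(deleted)}
```

Because a path always maps to the same shard, a change to one file only invalidates that shard's hash, so the per-file comparison is limited to shards that actually differ.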

Quick Start

# Index your codebase
# Ask your agent: "Index this codebase for semantic search"

# Update after changes
# Ask your agent: "Reindex changes in this project"

For detailed configuration (chunk sizes, batch sizes, custom extensions, ignore patterns), see Configuration and Indexing Repositories.