Code Search & Retrieval

Retrieval research has a long history in general IR (document search, question answering). Code search inherits the machinery but differs in important ways. This page surveys the differences and explains which of them shaped TeaRAGs' design.

For the general RAG primer, see RAG Fundamentals. For known failure modes of semantic search on code specifically, see Semantic Search Criticism.


How Code Is Different from Prose

Code looks like text to a tokenizer but behaves differently for retrieval:

| Property | Prose | Code |
| --- | --- | --- |
| Vocabulary | Open, but distribution follows Zipf | Highly skewed: a few identifiers dominate a file, many rare local names |
| Repetition | Paraphrasing is natural | Exact tokens (UserService, processPayment) carry strong signal |
| Structure | Paragraph-level | AST-level — a function is a unit, not "the next 300 words" |
| Semantics | Meaning from lexical surface + discourse | Meaning from lexical surface + control flow + data flow |
| Test of relevance | "Answers my question" | "Compiles / runs / passes tests" |

The practical consequence: a retrieval system tuned for prose does worse on code than one that respects code structure. AST-aware chunking (what TeaRAGs does) preserves function/class boundaries. Character-window splits — standard in general-purpose RAG — split functions in half and degrade every downstream step.
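
To make the contrast concrete, here is a minimal sketch comparing a fixed character window with a boundary-respecting split. It is not TeaRAGs' chunker: the brace-depth scan is only a crude stand-in for a real parser (it ignores braces inside strings and comments), and the names `charWindowChunks` and `topLevelBlockChunks` are hypothetical.

```typescript
// Fixed-size character windows: the general-purpose RAG default.
function charWindowChunks(source: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < source.length; i += size) {
    chunks.push(source.slice(i, i + size));
  }
  return chunks;
}

// Stand-in for AST-aware chunking (assumption: brace-delimited language).
// Cuts only where brace depth returns to zero, so every top-level block
// stays inside a single chunk.
function topLevelBlockChunks(source: string): string[] {
  const chunks: string[] = [];
  let depth = 0;
  let start = 0;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "{") depth++;
    if (ch === "}") {
      depth--;
      if (depth === 0) {
        chunks.push(source.slice(start, i + 1).trim());
        start = i + 1;
      }
    }
  }
  const tail = source.slice(start).trim();
  if (tail.length > 0) chunks.push(tail);
  return chunks;
}

const sampleSource = [
  "function add(a, b) {",
  "  return a + b;",
  "}",
  "",
  "function mul(a, b) {",
  "  if (a === 0) { return 0; }",
  "  return a * b;",
  "}",
].join("\n");

const windowChunks = charWindowChunks(sampleSource, 30);
const blockChunks = topLevelBlockChunks(sampleSource);
```

With this input, the first 30-character window ends in the middle of the `return` statement of `add`, while the boundary-respecting split yields one chunk per function.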


Research on Code Search Quality

Benchmarks

Dense retrieval for code

Lexical and hybrid approaches

Evaluating on real tasks


Retrieval Pipeline — Code-Specific Considerations

1. Indexable units

Tree-sitter-parsed chunks at function/class granularity are the sweet spot for most languages. Exceptions:

  • Ruby-like DSL — class bodies are often pure declarations (associations, validations). Raw AST chunking produces oversized "body" chunks. TeaRAGs addresses this with custom hooks (class-body-chunker.ts). See RFC 0005.
  • Markdown — heading hierarchy replaces AST. Each section is a chunk with a headingPath.
  • Data / config — no semantic AST. Character chunker as fallback; ranking suffers but search still works.
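
The Markdown case can be sketched as follows. This is an illustration under assumptions, not the TeaRAGs implementation: only the `headingPath` field name comes from the text above, while `MarkdownChunk`, `chunkMarkdown`, and the choice to exclude the heading line from the chunk text are hypothetical. The sketch also does not skip `#` lines inside fenced code blocks, which a real chunker would need to do.

```typescript
interface MarkdownChunk {
  headingPath: string[]; // headings from the document root down to this section
  text: string;          // section body, without the heading line itself
}

function chunkMarkdown(markdown: string): MarkdownChunk[] {
  const chunks: MarkdownChunk[] = [];
  const path: string[] = []; // stack of currently open headings
  let body: string[] = [];

  // Emit the accumulated body as one chunk under the current heading path.
  const flush = (): void => {
    const text = body.join("\n").trim();
    if (text.length > 0) {
      chunks.push({ headingPath: path.slice(), text });
    }
    body = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const level = (match[1] ?? "").length;
      const title = (match[2] ?? "").trim();
      path.splice(level - 1); // close headings at this level or deeper
      path.push(title);
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}

const markdownSample = [
  "# Guide",
  "Intro text.",
  "## Install",
  "Run the installer.",
  "## Usage",
  "### CLI",
  "Call the binary.",
].join("\n");

const markdownChunks = chunkMarkdown(markdownSample);
```

For the sample, the last chunk carries the path Guide, Usage, CLI, so a hit on "Call the binary." can be shown with its full heading context. Sections with no body of their own (here "Usage") produce no chunk.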

2. What to embed

Three choices in the literature:

  • Code body only — what most systems do. Implicit assumption: the embedding model learned code semantics.
  • Code + docstring — concatenate. Works if the model was trained bimodally (CodeBERT et al.).
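  • Natural-language description generated by an LLM — have a model write a description of the chunk and embed that instead of the code. Higher quality on some benchmarks, but brittle: retrieval then depends on the description being accurate.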

TeaRAGs embeds the code body. This is the fastest, cheapest, and most predictable option, and code-specialized embedders handle bare code well.

3. Beyond dense retrieval

Pure vector search has two known weaknesses on code:

  • Rare identifiers — a query containing parseSyntacticallyAwareUnicodeText rarely surfaces its implementation via cosine similarity because identifier names don't compose the way words do. BM25 handles this trivially, since it matches the exact token.
  • Cross-file concepts — "the auth flow" touches many files. Single-chunk retrieval fragments the picture. TeaRAGs mitigates this with navigation links in the chunk payload, which let a client walk adjacent chunks after retrieval.

Hybrid retrieval (dense + BM25, fused via reciprocal rank fusion) is the practical answer to the first. TeaRAGs enables hybrid by default for new collections.
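
As a rough illustration of the fusion step, reciprocal rank fusion gives each result a score of 1 / (k + rank) in every list it appears in and sums those scores. The sketch below assumes two ranked lists of chunk ids and uses k = 60, the constant from the original RRF paper. The source does not say which constant or API TeaRAGs uses, so `reciprocalRankFusion` and the sample ids are hypothetical.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Reciprocal rank fusion over any number of ranked id lists.
// Ranks are 1-based; ids missing from a list contribute nothing for it.
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: FusedResult[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  return fused.sort((a, b) => b.score - a.score);
}

// Hypothetical result lists: chunkD is found only by the lexical retriever,
// as would happen for a rare identifier.
const denseRanking = ["chunkA", "chunkB", "chunkC"];
const bm25Ranking = ["chunkD", "chunkA", "chunkB"];

const fusedRanking = reciprocalRankFusion([denseRanking, bm25Ranking]);
const fusedIds = fusedRanking.map((r) => r.id);
```

Here the fused order is chunkA, chunkB, chunkD, chunkC. Chunks found by both retrievers rise to the top, and the BM25-only hit still outranks the weakest dense-only hit. Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.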

4. Reranking

Cross-encoders (transformers that score the query and candidate jointly) deliver large quality gains on benchmarks (+5–10 nDCG@10 points), at 10–100× the latency. In practice, they are a rare choice for interactive tools.

Feature-based reranking, a weighted sum of signals, is the pragmatic alternative. It trades one big scoring model for many cheap features:

  • Similarity (the original retriever's score)
  • Lexical features (BM25 overlap with query)
  • Trajectory features (churn, ownership, bug-fix rate) — TeaRAGs' contribution
  • Structural features (function size, cyclomatic complexity when available)

Each feature gets a weight; presets are curated weight configurations (techDebt, hotspots, etc.). See Reranking for the full model.
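
A weighted-sum reranker of this kind can be sketched in a few lines. Everything specific here is an assumption for illustration: the feature names (`similarity`, `churn`, `bugFixRate`), the weight values, and the function names are hypothetical, and only the preset name techDebt comes from the text. Features are assumed to be normalised to [0, 1] beforehand.

```typescript
// Feature values are assumed to be pre-normalised to [0, 1].
type FeatureVector = Record<string, number>;
type WeightPreset = Record<string, number>;

interface RerankCandidate {
  id: string;
  features: FeatureVector;
}

// Hypothetical weights; a real preset would be tuned, not guessed.
const similarityOnlyPreset: WeightPreset = { similarity: 1.0 };
const techDebtPreset: WeightPreset = { similarity: 0.4, churn: 0.3, bugFixRate: 0.3 };

// Weighted sum over the features named in the preset.
// A feature missing from a candidate counts as 0.
function weightedScore(features: FeatureVector, weights: WeightPreset): number {
  let score = 0;
  for (const name of Object.keys(weights)) {
    score += (weights[name] ?? 0) * (features[name] ?? 0);
  }
  return score;
}

// Returns candidate ids ordered by descending weighted score.
function rerank(candidates: RerankCandidate[], weights: WeightPreset): string[] {
  return candidates
    .map((c) => ({ id: c.id, score: weightedScore(c.features, weights) }))
    .sort((a, b) => b.score - a.score)
    .map((c) => c.id);
}

const rerankCandidates: RerankCandidate[] = [
  { id: "format.pad", features: { similarity: 0.9, churn: 0.1, bugFixRate: 0.0 } },
  { id: "billing.processPayment", features: { similarity: 0.7, churn: 0.9, bugFixRate: 0.8 } },
];

const bySimilarity = rerank(rerankCandidates, similarityOnlyPreset);
const byTechDebt = rerank(rerankCandidates, techDebtPreset);
```

With similarity alone, `format.pad` ranks first. Under the illustrative techDebt weights, the frequently changed and frequently fixed `billing.processPayment` overtakes it, which is the point of a preset: the same candidates, reordered for a different question.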


Evaluation — How Do You Know Code Search Is Good?

Standard IR metrics apply (Recall@K, MRR, nDCG), but the ground truth is harder to obtain:

  • Doc-comment pairs (CodeSearchNet) — cheap, but assumes that which functions get documented is random. In practice, documented code is a biased sample.
  • Stack Overflow pairs — query = SO question, answer = accepted code snippet. Closer to real user intent. Used in StaQC and CoNaLa; CoSQA is similar in spirit but draws its queries from web search logs.
  • Bug-fix pairs — query = bug report, relevant = file touched by the fix. Good for "find the broken thing" queries specifically.
  • Downstream task performance — run an agent on a task, measure task success conditional on retrieval quality. Expensive, but the most honest signal. RepoBench takes this approach for repository-level code completion.

TeaRAGs doesn't ship a benchmark harness out of the box. For project-specific evaluation, the typical approach is:

  1. Collect 20–50 real queries your team ran this week
  2. Record which result they actually clicked / used
  3. Compute MRR / nDCG against that small labelled set
  4. Tune rerank presets to optimise that specific distribution
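
Step 3 can be sketched as below, assuming binary relevance (a result was either clicked or used, or it was not) and unique ids within each result list. The `LabelledQuery` shape and the function names are hypothetical, not part of TeaRAGs.

```typescript
// One labelled query: the ranked ids the search returned, plus the ids the
// user actually clicked or used.
interface LabelledQuery {
  results: string[];
  relevant: string[];
}

// Reciprocal rank: 1 / position of the first relevant result, 0 if none.
function reciprocalRank(q: LabelledQuery): number {
  for (let i = 0; i < q.results.length; i++) {
    const id = q.results[i];
    if (id !== undefined && q.relevant.indexOf(id) !== -1) return 1 / (i + 1);
  }
  return 0;
}

// nDCG@k with binary gains: DCG of the actual ranking divided by the DCG of
// an ideal ranking that puts every relevant result first.
function ndcgAtK(q: LabelledQuery, k: number): number {
  let dcg = 0;
  q.results.slice(0, k).forEach((id, i) => {
    if (q.relevant.indexOf(id) !== -1) dcg += 1 / Math.log2(i + 2);
  });
  let idcg = 0;
  for (let i = 0; i < Math.min(k, q.relevant.length); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Three hypothetical labelled queries: hit at rank 1, hit at rank 2, miss.
const labelledSet: LabelledQuery[] = [
  { results: ["a", "b", "c"], relevant: ["a"] },
  { results: ["x", "y", "z"], relevant: ["y"] },
  { results: ["p", "q", "r"], relevant: ["s"] },
];

const mrr = mean(labelledSet.map(reciprocalRank));
const meanNdcgAt3 = mean(labelledSet.map((q) => ndcgAtK(q, 3)));
```

For this toy set, MRR is 0.5 and mean nDCG@3 is about 0.544. With only 20–50 queries the numbers are noisy, so they are better suited to comparing two preset configurations on the same set than to reporting an absolute quality figure.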

Further Reading

Inside this knowledge base

Landmark papers

  • Husain et al. 2019 (CodeSearchNet) — the starting point for any code-search survey
  • Feng et al. 2020 (CodeBERT) — the dense retrieval turning point for code
  • Guo et al. 2021 (GraphCodeBERT) — the data-flow-aware refinement

Venues

  • ESEC/FSE, ICSE — primary software-engineering venues
  • SIGIR — the information-retrieval side
  • MSR (Mining Software Repositories) conference — overlap between retrieval and software evolution research