Semantic Search: Criticism and Responses
Semantic code search is not without critics. Understanding the real limitations — and the established counter-arguments — helps teams make informed adoption decisions and use the tool correctly.
Criticism 1: RAG Results Are Incomplete or False
What's Actually Being Criticized
"Embedding-only RAG doesn't understand code" — The criticism targets using vector similarity as the primary relevance criterion. For code, similarity shows "looks like this textually" but not "participates in this execution path." An agent may find semantically correct but unused or secondary code fragments.
"RAG creates false confidence" — When a model cites retrieved chunks, the answer looks "grounded" even if retrieval was incomplete or irrelevant. This is especially dangerous in code generation and refactoring: errors look convincing and are rarely questioned.
"Single-vector embeddings are fundamentally limited" — One embedding per chunk poorly encodes combinatorial and multi-hop queries ("all places where A and B under condition C"). This is a mathematical limitation, not a model quality issue.
"Code is a graph, not a document" — Semantic search over chunks ignores relationships: who calls whom, in what order, under what conditions. For understanding system behavior, this is critical — and why RAG over "text" is seen as insufficient.
Counter-Arguments
- Semantic search is for discovery, not for answers. After semantic search, there's always verification through code search (grep, symbols, call-sites). Best practice: treat RAG as a candidate-zone generator, not as proof.
- Mandatory verification step. The workflow is: RAG → hypothesis → code search → confirmation. The agent is prohibited from drawing conclusions without confirmed call-sites or side effects. RAG becomes a "lead," not an "argument."
- Hybrid retrieval. Dense (semantic) + sparse (keyword/rg) + structural signals. Embeddings remain the first filter; precision is added through symbols, grep, and graphs.
- Semantic search is the entry point to the graph, not its analysis. Real understanding is built through call-sites, symbols, and execution paths obtained via code search.
TeaRAGs approach: Hybrid search (BM25 + vector via RRF) combined with git trajectory signals. Results are not just "similar" — they carry empirical quality indicators (stability, churn, ownership) that help the agent assess confidence.
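The fusion step mentioned above can be sketched with the standard reciprocal rank fusion (RRF) formula. This is a minimal illustration of the generic algorithm, not TeaRAGs internals: the file names, the `k = 60` constant, and the two input rankings are all assumptions for the example.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the two retrievers:
bm25_ranking = ["auth.py", "session.py", "utils.py"]
dense_ranking = ["session.py", "token.py", "auth.py"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
# Documents ranked well by BOTH retrievers rise to the top.
```

Because RRF works on ranks rather than raw scores, the keyword and vector retrievers never need their scores put on a common scale, which is why it is a popular default for hybrid search.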
Criticism 2: Semantic Search Is Worse Than Planning Mode
What's Actually Being Criticized
Planning mode is compared to RAG as an alternative because it better controls context and reduces noise. In this comparison, RAG is presented as a static, imprecise mechanism that "serves up similar stuff" rather than conducting research. This criticism typically applies to naive RAG that is invoked automatically on every request, adding context "just in case."
Counter-Arguments
- Planning and semantic search serve different roles. Planning manages steps; semantic search accelerates discovery. 2025/2026 best practice: planning decides when to call RAG rather than replacing it.
- Planning mode doesn't solve discovery problems. It works poorly when entry points are unknown, the project carries significant legacy, and naming is inconsistent.
- Cursor explicitly positions semantic search as a discovery tool for large codebases.
TeaRAGs approach: TeaRAGs is designed to be called by an agent as part of a planned workflow — not as a naive context injection layer. The agent decides when semantic search adds value, uses appropriate rerank presets for the task at hand, and verifies results through complementary tools.
The Bottom Line
Semantic code search is not a silver bullet. It's a discovery accelerator that works best when:
- Combined with verification tools (grep, symbols, call-sites)
- Used as part of a structured agent workflow, not as automatic context injection
- Enriched with quality signals (git metrics, reranking) to reduce false confidence
- Applied to the right problems (large codebases, unfamiliar code, pattern discovery)
The criticisms are valid for naive, embedding-only RAG. TeaRAGs addresses them through hybrid search, trajectory enrichment, and composable reranking — moving from "find similar text" to "find the right code to learn from."
These Principles in Practice
The verification workflow and multi-tool cascade are implemented as concrete agent instructions across the documentation:
- Exact-Match Verification — the mandatory ripgrep step after code generation, with failure examples and correct workflow
- The Three-Tool Cascade — TeaRAGs (meaning) → tree-sitter (structure) → ripgrep (exact text), with anti-patterns
- Semantic Search is NOT a Grep Replacement — the core verification principle with Mermaid workflow diagram
Calibrating Reranking Weights Per Codebase
Preset reranking uses hardcoded normalization bounds (e.g., maxCommitCount = 50, maxAgeDays = 365, maxBugFixRate = 100). These defaults work for many codebases, but every codebase has a unique profile: a young startup repo, where commitCount = 10 already signals high churn, looks very different from a 10-year enterprise monorepo where commitCount = 200 is normal.
The problem: If your codebase's median commitCount is 3, the hotspots preset will barely distinguish between files — everything is "low churn" relative to the normalization ceiling of 50. Conversely, if your codebase's median commitCount is 80, the signal saturates and everything looks like a hotspot.
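The saturation effect is easy to see in code. This is a minimal sketch assuming the preset applies a clamp-style normalization (`min(value / ceiling, 1.0)`); the exact formula TeaRAGs uses is an assumption here, but any bounded normalization behaves similarly at the extremes.

```python
def normalize(value, ceiling):
    # Clamp-style normalization: values at or above the ceiling saturate to 1.0.
    return min(value / ceiling, 1.0)

MAX_COMMIT_COUNT = 50  # hardcoded preset bound from the text above

# Repo with median commitCount = 3: every signal is crushed near zero.
low_churn_repo = [normalize(c, MAX_COMMIT_COUNT) for c in (1, 3, 7)]

# Repo with median commitCount = 80: every signal saturates at 1.0.
high_churn_repo = [normalize(c, MAX_COMMIT_COUNT) for c in (60, 80, 200)]
```

In the first repo all files look equally "cold"; in the second they all look equally "hot." Either way the signal stops discriminating between files.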
The solution: Sample your codebase's metadata distribution, then adjust custom weights and interpretation thresholds accordingly.
Discovery prompt
Use this prompt with your AI agent to profile your codebase's metric distribution. The agent will sample metadata from your index and compute meaningful percentiles:
```
Profile the codebase for reranking calibration.

Step 1: Get a metadata sample.
Run semantic_search with:
- query: "core business logic"
- metaOnly: true
- limit: 100 (or as high as practical)

Repeat with 2-3 different broad queries ("data processing", "API handlers",
"utility functions") to get a representative sample.

Step 2: From the collected git metadata, compute for each signal:
- Minimum, maximum, median, P75, P95 values for:
  commitCount, ageDays, bugFixRate, relativeChurn, churnVolatility,
  contributorCount, dominantAuthorPct
- If chunk-level data is available:
  chunkCommitCount, chunkChurnRatio, chunkBugFixRate, chunkAgeDays

Step 3: Report findings as a table:
| Signal | Min | Median | P75 | P95 | Max |
with interpretation notes for this specific codebase.

Step 4: Recommend adjusted thresholds:
- What counts as "high churn" in THIS codebase? (P75 of commitCount)
- What counts as "old code"? (P75 of ageDays)
- What counts as "buggy"? (P75 of bugFixRate)
- Are chunk-level metrics available and meaningful?

Step 5: Suggest custom rerank weights optimized for this codebase:
- A "hotspots" variant using codebase-specific signal distribution
- A "stable template" variant for finding the best code to copy
- Which signals have too little variance to be useful (skip them)
```
What to look for
| Codebase profile | Typical signals | Calibration advice |
|---|---|---|
| Young repo (< 1 year, < 50K LOC) | Low commitCount across the board, few contributors | chunkChurn and bugFix are the most useful discriminators. age is uninformative — everything is young. Focus custom weights on bugFix + volatility. |
| Mature monorepo (5+ years, 500K+ LOC) | Wide distribution in all signals, many outliers | All signals are useful. Set custom thresholds at P75 rather than hardcoded values. relativeChurn is the strongest defect predictor — use it over raw commitCount. |
| High-velocity team (daily deploys, CI/CD) | Low churnVolatility, high changeDensity | volatility is uninformative — everyone commits regularly. Focus on bugFix + chunkChurnRatio for quality signals. |
| Legacy codebase (infrequent changes) | High ageDays, low commitCount | recency and burstActivity become the key discriminators — any recent change is significant. Use custom weights with burstActivity: 0.4, bugFix: 0.3, pathRisk: 0.3. |
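For concreteness, the legacy-codebase row's suggested weights can be written out in the same `rerank.custom` config shape used in the calibrated example later in this section (assuming the key names match the signal names in the table):

```json
{
  "rerank": {
    "custom": {
      "burstActivity": 0.4,
      "bugFix": 0.3,
      "pathRisk": 0.3
    }
  }
}
```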
Example: Calibrated vs default
A 3-year enterprise monorepo with commitCount median = 45:
// Default "hotspots" preset — poor discrimination
// (maxCommitCount = 50, so everything above 50 saturates)
// Calibrated custom weights for this codebase:
{
"rerank": {
"custom": {
"chunkChurn": 0.25,
"bugFix": 0.3,
"volatility": 0.25,
"chunkRelativeChurn": 0.2
}
}
}
By shifting from absolute churn (which saturates at P50 in this codebase) to chunkRelativeChurn (which measures the chunk's share of file churn, always 0-1) and volatility (which captures erratic patterns regardless of scale), the search produces meaningful differentiation even when raw commit counts are uniformly high.
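The difference is visible on two hypothetical files from such a monorepo. The clamp normalization and the linear weighting below are illustrative assumptions, not the exact scoring formula:

```python
# Both files exceed the default ceiling of 50, so absolute churn saturates.
file_a = {"commitCount": 60, "chunkRelativeChurn": 0.15, "volatility": 0.2}
file_b = {"commitCount": 200, "chunkRelativeChurn": 0.85, "volatility": 0.7}

def default_signal(f, ceiling=50):
    # Default preset view: absolute churn clamped to a hardcoded ceiling.
    return min(f["commitCount"] / ceiling, 1.0)

def calibrated_signal(f, w_churn=0.2, w_vol=0.25):
    # Scale-free signals are already in 0-1 and keep discriminating
    # even when raw commit counts are uniformly high.
    return w_churn * f["chunkRelativeChurn"] + w_vol * f["volatility"]
```

Under the default preset both files score an identical 1.0; under the scale-free signals, file_b clearly outranks file_a, which is exactly the differentiation the calibration aims for.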