
Signal Scoring Methods

How does a code search engine decide that one result is more relevant than another? Raw numbers from git history — like "142 days old" or "23 commits" — are not directly comparable. They live on different scales, have different distributions, and carry different levels of reliability.

This article explains the five scoring methods that TeaRAGs uses to transform raw code metrics into meaningful, comparable scores. We build from the simplest concept (normalization) to the most sophisticated (adaptive bounds), using git metrics as a running example.


1. Normalization — Making Numbers Comparable

Without Normalization

Imagine you want to rank code by a combination of age and commit count. You have three files:

| File | Age (days) | Commits | Raw sum |
|---|---|---|---|
| auth.ts | 142 | 23 | 165 |
| utils.ts | 10 | 48 | 58 |
| config.ts | 300 | 3 | 303 |

If you just add the raw numbers, config.ts wins — but only because age is measured in days (large numbers) while commits are small numbers. Age dominates the score by accident of scale, not because it's more important. Swapping to hours (142 × 24 = 3408) would change the ranking entirely.

With Normalization

Normalization squeezes any number into the range 0 to 1, where 0 means "minimum" and 1 means "maximum." The formula is:

$$\text{normalized} = \min\!\Bigl(1,\;\frac{\text{value}}{\text{bound}}\Bigr)$$

The bound is the upper limit — any value at or above it maps to 1.0.

Now the same three files:

| File | Age | age / 365 | Commits | commits / 50 | Sum |
|---|---|---|---|---|---|
| auth.ts | 142 | 0.39 | 23 | 0.46 | 0.85 |
| utils.ts | 10 | 0.03 | 48 | 0.96 | 0.99 |
| config.ts | 300 | 0.82 | 3 | 0.06 | 0.88 |

Both signals contribute fairly. utils.ts ranks highest because it genuinely has high activity, not because of unit scale.

Example

If the bound for age is 365 days:

| Raw age (days) | Calculation | Normalized |
|---|---|---|
| 0 | 0 / 365 | 0.00 |
| 30 | 30 / 365 | 0.08 |
| 142 | 142 / 365 | 0.39 |
| 365 | 365 / 365 | 1.00 |
| 500 | min(1, 500/365) | 1.00 (clamped) |

Inversion

Some signals are "better when lower." For example, recency — code that was modified recently (low age) should score high. We simply flip the result:

$$\text{recency} = 1 - \text{normalize}(\text{ageDays},\ 365)$$

| Age (days) | normalize(age, 365) | recency = 1 − normalized |
|---|---|---|
| 7 | 0.02 | 0.98 (very recent) |
| 142 | 0.39 | 0.61 |
| 300 | 0.82 | 0.18 (old) |
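
Normalization and inversion can be sketched in a few lines. This is an illustrative TypeScript sketch; the function names `normalize` and `recency` are assumptions, not TeaRAGs' actual identifiers.

```typescript
/** Scale a raw value into [0, 1]; anything at or above `bound` clamps to 1. */
function normalize(value: number, bound: number): number {
  if (bound <= 0) return 0; // guard against a degenerate bound
  return Math.min(1, value / bound);
}

/** "Better when lower" signals are flipped: low age → high recency. */
function recency(ageDays: number, bound = 365): number {
  return 1 - normalize(ageDays, bound);
}

console.log(normalize(142, 365).toFixed(2)); // → "0.39"
console.log(recency(7).toFixed(2));          // → "0.98"
```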

2. Weighted Scoring — Combining Multiple Signals

Without Weighted Scoring

After normalization, you could simply average all signals:

| File | similarity | recency | churn | Average |
|---|---|---|---|---|
| auth.ts | 0.85 | 0.61 | 0.46 | 0.64 |
| utils.ts | 0.40 | 0.98 | 0.96 | 0.78 |

But this treats every signal as equally important. For a "tech debt" analysis, you care about code age and churn far more than how well it matches the search query. With equal weights, semantic similarity drowns out the signals that actually matter for the task.

With Weighted Scoring

Weighted scoring lets each analysis preset prioritize different signals. A tech debt preset might give similarity only 20% influence, while churn and age get 15% each:

$$\text{score} = \frac{\sum_{i} w_i \times s_i}{\sum_{i} |w_i|}$$

where $w_i$ is the weight and $s_i$ is the signal value.

Example: Tech Debt Preset

| Signal | Weight ($w$) | Value ($s$) | Contribution ($w \times s$) |
|---|---|---|---|
| similarity | 0.20 | 0.85 | 0.170 |
| age | 0.15 | 0.70 | 0.105 |
| churn | 0.15 | 0.60 | 0.090 |
| bugFix | 0.15 | 0.40 | 0.060 |
| volatility | 0.10 | 0.30 | 0.030 |
| knowledgeSilo | 0.10 | 1.00 | 0.100 |
| density | 0.10 | 0.25 | 0.025 |
| blockPenalty | −0.05 | 0.00 | 0.000 |
$$\text{score} = \frac{0.170 + 0.105 + 0.090 + 0.060 + 0.030 + 0.100 + 0.025 + 0.000}{0.20 + 0.15 + 0.15 + 0.15 + 0.10 + 0.10 + 0.10 + 0.05} = \frac{0.580}{1.00} = 0.58$$

The absolute weights sum to 1.0, so the division keeps the score in range. If the weights don't sum to 1.0, the formula still normalizes correctly: it's the ratios between weights that matter, not their absolute values.

Negative Weights

The blockPenalty signal uses a negative weight (−0.05). This means: when this signal is high, the score goes down. It's used to suppress low-quality code chunks that lack reliable git data.
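
The weighted combination above can be sketched directly from the formula. This is an illustrative sketch; `weightedScore` and the preset object are assumed names, though the weights and values mirror the tech debt table above.

```typescript
type Weights = Record<string, number>;

/** score = Σ(w·s) / Σ|w| — ratios between weights matter, not magnitudes. */
function weightedScore(signals: Record<string, number>, weights: Weights): number {
  let num = 0;
  let denom = 0;
  for (const [name, w] of Object.entries(weights)) {
    num += w * (signals[name] ?? 0); // a missing signal contributes nothing
    denom += Math.abs(w);            // absolute weights keep the range sane
  }
  return denom === 0 ? 0 : num / denom;
}

const techDebt: Weights = {
  similarity: 0.2, age: 0.15, churn: 0.15, bugFix: 0.15,
  volatility: 0.1, knowledgeSilo: 0.1, density: 0.1, blockPenalty: -0.05,
};
const signals = {
  similarity: 0.85, age: 0.7, churn: 0.6, bugFix: 0.4,
  volatility: 0.3, knowledgeSilo: 1.0, density: 0.25, blockPenalty: 0,
};
console.log(weightedScore(signals, techDebt).toFixed(2)); // → "0.58"
```

Note the `Math.abs(w)` in the denominator: a negative weight subtracts from the numerator but still counts toward the normalizing sum.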


3. Confidence Dampening — Trusting Reliable Data

Without Dampening

Consider a search for bug-prone code. Three files come back:

| File | Commits | Bug fixes | Bug fix rate | Normalized |
|---|---|---|---|---|
| auth.ts | 50 | 20 | 40% | 0.40 |
| utils.ts | 2 | 1 | 50% | 0.50 |
| config.ts | 100 | 30 | 30% | 0.30 |

utils.ts ranks highest, but its 50% rate is based on just 2 commits. One of them happened to be a fix. That's not a pattern, that's noise. With 50 more commits at the codebase's typical fix rate, that 50% would likely collapse to a few percent. The ranking is dominated by statistically meaningless data.

With Dampening

Confidence dampening reduces a signal's strength when the underlying data is too sparse to be reliable. After dampening (threshold = 8):

| File | Bug fix rate | Commits | Dampening $(n/k)^2$ | Dampened score |
|---|---|---|---|---|
| auth.ts | 0.40 | 50 | 1.00 | 0.40 |
| utils.ts | 0.50 | 2 | 0.063 | 0.031 |
| config.ts | 0.30 | 100 | 1.00 | 0.30 |

Now auth.ts correctly ranks first. utils.ts is suppressed to near-zero because 2 commits provide almost no statistical confidence.

The formula is:

$$\text{dampening} = \begin{cases} 1 & \text{if } n \geq k \\[4pt] \left(\dfrac{n}{k}\right)^{2} & \text{if } n < k \end{cases}$$

where $n$ is the commit count and $k$ is the confidence threshold.

The final signal value is multiplied by this dampening factor:

$$\text{dampened} = \text{signal} \times \text{dampening}$$

Why Quadratic?

The exponent of 2 (squaring) makes dampening aggressive for low commit counts but gentle near the threshold:

| Commits ($n$) | Threshold ($k$) | $n/k$ | $(n/k)^2$ | Effect |
|---|---|---|---|---|
| 1 | 8 | 0.125 | 0.016 | Signal almost eliminated |
| 2 | 8 | 0.250 | 0.063 | ~6% of full strength |
| 4 | 8 | 0.500 | 0.250 | Quarter strength |
| 6 | 8 | 0.750 | 0.563 | Half strength |
| 8 | 8 | 1.000 | 1.000 | Full strength |
| 20 | 8 | 2.500 | 1.000 | Full strength (clamped) |

A linear formula (n/kn/k) would give 50% strength at 4 commits — too generous for such sparse data. The quadratic curve is stricter, reaching 50% only around 6 commits.
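
The piecewise formula is a few lines of code. This sketch is illustrative; the name `dampening` is an assumption, and the utils.ts numbers come from the table above.

```typescript
/** Returns 1 when n ≥ k; otherwise (n/k)², punishing sparse history hard. */
function dampening(commitCount: number, threshold: number): number {
  if (commitCount >= threshold) return 1; // enough data: full strength
  const ratio = commitCount / threshold;
  return ratio * ratio;                   // quadratic penalty below threshold
}

// utils.ts: 50% bug-fix rate, but only 2 commits behind it (threshold k = 8)
const damped = 0.5 * dampening(2, 8);
console.log(damped.toFixed(3)); // → "0.031"
```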

Where Does the Threshold Come From?

The threshold kk is determined from collection-wide statistics. After indexing a codebase, TeaRAGs computes the 25th percentile (p25) of commit counts across all indexed chunks. This becomes the dampening threshold.

Why p25? It represents the boundary of "low data" — code below this threshold has fewer commits than 75% of the codebase, so its statistical signals are unreliable.

If collection stats are not yet computed (e.g., first search after indexing), each signal has a fallback threshold — a hardcoded safe default.

Example

Given collection p25 = 8 commits:

| File | Bug fix rate | Commits | Dampening | Dampened score |
|---|---|---|---|---|
| auth.ts | 40% → 0.40 | 50 | 1.00 | 0.40 |
| utils.ts | 50% → 0.50 | 2 | 0.063 | 0.031 |
| config.ts | 30% → 0.30 | 6 | 0.563 | 0.169 |

Despite having the highest raw bug fix rate (50%), utils.ts scores lowest after dampening — because 2 commits provide almost no statistical confidence.


4. Alpha-Blending — Chunk vs. File Granularity

Without Alpha-Blending

TeaRAGs indexes code at two levels — files and chunks (functions/blocks). Suppose you search for bug-prone code and a file payment.ts has two functions:

| Level | Bug fix rate | Commits |
|---|---|---|
| File (payment.ts) | 35% | 80 |
| Chunk (processRefund) | 100% | 1 |

Using chunk data only: processRefund scores 1.0 — but its 100% rate is based on a single commit (which happened to be a fix). Misleading.

Using file data only: processRefund scores 0.35 — accurate for the file, but ignores that this specific function might genuinely be different from the rest of the file.

Neither approach is right. We need a way to gradually trust chunk data as it matures.

With Alpha-Blending

Alpha-blending mixes chunk and file values based on how mature and representative the chunk data is:

$$\text{effective} = \alpha \times \text{chunk} + (1 - \alpha) \times \text{file}$$

where $\alpha$ is a blending factor between 0 and 1:

$$\alpha = \min\!\Bigl(1,\;\underbrace{\frac{\text{chunkCommits}}{\text{fileCommits}}}_{\text{coverage}} \times \underbrace{\min\!\Bigl(1,\;\frac{\text{chunkCommits}}{3}\Bigr)}_{\text{maturity}}\Bigr)$$

Alpha has two components:

  1. Coverage = what fraction of the file's history does this chunk represent?
  2. Maturity = does this chunk have enough commits to be statistically meaningful? (Threshold: 3 commits)

Example

| Scenario | Chunk commits | File commits | Coverage | Maturity | $\alpha$ | Meaning |
|---|---|---|---|---|---|---|
| New function | 1 | 50 | 0.02 | 0.33 | 0.007 | Almost pure file signal |
| Growing function | 5 | 50 | 0.10 | 1.00 | 0.100 | 10% chunk, 90% file |
| Mature function | 20 | 50 | 0.40 | 1.00 | 0.400 | Significant chunk influence |
| Dominant function | 45 | 50 | 0.90 | 1.00 | 0.900 | Mostly chunk signal |

Why Maturity Matters

Without the maturity factor, a chunk with 1 commit in a file with 2 commits would get $\alpha = 0.5$ — equal weight to chunk and file data. But 1 commit tells us almost nothing. The maturity threshold of 3 prevents low-commit chunks from having outsized influence.

When Chunk Data Is Missing

If a chunk has no git-specific data (e.g., the chunk was never individually tracked), $\alpha = 0$ and the formula falls back to pure file-level values. This is the safe default — file data is always available.
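
The blend is straightforward to sketch. The names `alphaFor`, `blend`, and `MATURITY_THRESHOLD` are illustrative assumptions, not TeaRAGs' identifiers; the example numbers are the processRefund case from above.

```typescript
const MATURITY_THRESHOLD = 3; // commits needed before chunk data is trusted

/** α = min(1, coverage × maturity); 0 when chunk or file history is empty. */
function alphaFor(chunkCommits: number, fileCommits: number): number {
  if (chunkCommits <= 0 || fileCommits <= 0) return 0; // fall back to file data
  const coverage = chunkCommits / fileCommits;
  const maturity = Math.min(1, chunkCommits / MATURITY_THRESHOLD);
  return Math.min(1, coverage * maturity);
}

/** effective = α·chunk + (1−α)·file */
function blend(chunkValue: number, fileValue: number, alpha: number): number {
  return alpha * chunkValue + (1 - alpha) * fileValue;
}

// processRefund: 100% bug-fix rate, but only 1 of the file's 80 commits
const alpha = alphaFor(1, 80);                   // tiny: ≈ 0.004
console.log(blend(1.0, 0.35, alpha).toFixed(2)); // → "0.35" — file dominates
```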


5. Adaptive Bounds — Adjusting to the Data

Without Adaptive Bounds

In section 1, we used a fixed bound of 365 for age normalization. This works for a project that's about a year old. But consider two different projects:

Young project (2 months old):

| File | Age (days) | normalize(age, 365) |
|---|---|---|
| main.ts | 60 | 0.16 |
| utils.ts | 45 | 0.12 |
| config.ts | 30 | 0.08 |

All values are squeezed into 0.08–0.16. The age signal is practically useless — it can't distinguish files that are meaningfully different in age for this project.

Old project (5 years old):

| File | Age (days) | normalize(age, 365) |
|---|---|---|
| legacy.ts | 1800 | 1.00 (clamped) |
| core.ts | 900 | 1.00 (clamped) |
| api.ts | 400 | 1.00 (clamped) |

Everything clamps to 1.0. Again, no discrimination — you can't tell "5 years old" from "1 year old."

With Adaptive Bounds

Adaptive bounds compute the normalization bound dynamically for each search query, based on the actual values in the result set:

$$\text{bound} = \max\!\bigl(\text{p95}_{\text{batch}},\;\text{p95}_{\text{collection}}\ \text{or}\ \text{defaultBound}\bigr)$$

The process:

  1. Collect raw values from all results in the current search batch
  2. Compute p95 — the 95th percentile of those values
  3. Floor with collection-level p95 (from pre-computed stats) or the hardcoded default bound

Why p95 and Not Mean or Median?

The bound is the denominator in normalization: value / bound. The choice of statistic determines how values distribute across the [0, 1] range:

| Statistic | What happens | Consequence |
|---|---|---|
| Mean | ~40–50% of values clamp to 1.0 | Upper half indistinguishable |
| Median (p50) | ~50% of values clamp to 1.0 | Half the results are identical |
| p95 | Only ~5% of values clamp to 1.0 | 95% of values spread across [0, 1] |

p95 gives the best discrimination — almost all values get a unique position in the normalized range. Only true outliers (top 5%) are clamped, and those extreme values shouldn't distort the scale anyway.

Why Floor with a Default?

The floor prevents pathological cases:

  • Tiny batch (3 results): p95 is unreliable — flooring with the default keeps bounds sensible
  • Homogeneous batch (all values similar): p95 would be very small, making normalization hypersensitive to noise
  • All zeros: the default prevents division by zero

Example

A search returns 10 results with ageDays values:

[5, 10, 20, 35, 50, 80, 120, 200, 300, 500]

| Method | Bound | normalize(200, bound) | normalize(50, bound) |
|---|---|---|---|
| Fixed default (365) | 365 | 0.55 | 0.14 |
| p95 of batch (460) | 460 | 0.43 | 0.11 |
| Mean (132) | 132 | 1.00 (clamped) | 0.38 |
| Median (65) | 65 | 1.00 (clamped) | 0.77 |

With p95, both values get meaningful, distinguishable scores. With mean or median, the higher value is clamped to 1.0 — you lose the ability to tell "200 days" apart from "300 days."
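
A sketch of the bound computation, under assumptions: `percentile` uses common linear interpolation between sorted neighbors, which may differ from TeaRAGs' exact interpolation scheme (so the batch p95 here won't match the table's 460 exactly), and `adaptiveBound` is an assumed name.

```typescript
/** p-th percentile (0–100) via linear interpolation over the sorted values. */
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, non-mutating
  const idx = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return sorted[lo] + (idx - lo) * (sorted[hi] - sorted[lo]);
}

/** bound = max(batch p95, collection p95 or the hardcoded default). */
function adaptiveBound(
  batchValues: number[],
  collectionP95: number | undefined,
  defaultBound: number,
): number {
  const floor = collectionP95 ?? defaultBound; // fallback when stats are missing
  return Math.max(percentile(batchValues, 95), floor);
}

const ages = [5, 10, 20, 35, 50, 80, 120, 200, 300, 500];
console.log(Math.round(adaptiveBound(ages, undefined, 365))); // → 410
```

With a tiny or homogeneous batch the `Math.max` floor takes over, which is exactly the pathological-case protection described above.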


How It All Fits Together

Each raw signal value passes through a pipeline of these methods. Here's the complete flow:

```
Raw payload value (e.g., ageDays = 142, commitCount = 23)
│
├─ 1. Alpha-blending ──────── Merge chunk + file values
│      α = coverage × maturity
│      effective = α × chunk + (1−α) × file
│
├─ 2. Adaptive bounds ─────── Compute per-query bound
│      bound = max(batchP95, collectionP95 or default)
│
├─ 3. Normalization ───────── Scale to [0, 1]
│      normalized = min(1, effective / bound)
│      (optional: invert with 1 − normalized)
│
├─ 4. Confidence dampening ── Reduce unreliable signals
│      dampened = normalized × (commitCount / threshold)²
│
└─ 5. Weighted scoring ────── Combine into final score
       score = Σ(weight × signal) / Σ|weight|
```
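
The per-signal part of the flow (steps 1, 3, and 4, with the bound from step 2 passed in) can be sketched end-to-end. Everything here is an illustrative assumption: the name `pipelineScore`, the parameter shape, and in particular the choice of feeding file-level commits into dampening, which the article does not pin down.

```typescript
function pipelineScore(opts: {
  chunkValue: number; fileValue: number;
  chunkCommits: number; fileCommits: number;
  bound: number;            // from step 2 (adaptive bounds)
  dampenThreshold: number;  // collection p25 or the hardcoded fallback
  invert?: boolean;         // for "better when lower" signals
}): number {
  const { chunkValue, fileValue, chunkCommits, fileCommits, bound, dampenThreshold } = opts;
  // 1. Alpha-blend chunk and file values
  const alpha = fileCommits > 0
    ? Math.min(1, (chunkCommits / fileCommits) * Math.min(1, chunkCommits / 3))
    : 0;
  const effective = alpha * chunkValue + (1 - alpha) * fileValue;
  // 3. Normalize into [0, 1], optionally inverting
  let normalized = Math.min(1, effective / bound);
  if (opts.invert) normalized = 1 - normalized;
  // 4. Confidence dampening (assumption: file-level commit count as n)
  const ratio = Math.min(1, fileCommits / dampenThreshold);
  return normalized * ratio * ratio;
}

// Untracked chunk (α = 0) in a thin 4-commit file: file value, heavily damped
console.log(pipelineScore({
  chunkValue: 0, fileValue: 80, chunkCommits: 0, fileCommits: 4,
  bound: 100, dampenThreshold: 8,
})); // → 0.2
```

Step 5 then combines the per-signal outputs with the preset's weights.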

Not every signal uses every method. The table below shows which methods apply to each signal.


Appendix: Signal Method Matrix

Git Signals

| Signal | Normalization | Adaptive Bound | Alpha-Blend | Dampening | Inversion | Dampening Fallback |
|---|---|---|---|---|---|---|
| recency | ageDays / bound | 365 | blendSignal | | 1 − ... | |
| age | ageDays / bound | 365 | blendSignal | | | |
| stability | commitCount / bound | 50 | blendSignal | | 1 − ... | |
| churn | commitCount / bound | 50 | blendSignal | | | |
| bugFix | bugFixRate / bound | 100 | blendSignal | $(n/k)^2$ | | 8 |
| volatility | churnVolatility / bound | 60 | | $(n/k)^2$ | | 8 |
| density | changeDensity / bound | 20 | blendSignal | $(n/k)^2$ | | 5 |
| ownership | dominantAuthorPct / 100 | | | $(n/k)^2$ | | 5 |
| knowledgeSilo | step function | | blendSignal | $(n/k)^2$ | | 5 |
| relativeChurnNorm | relativeChurn / bound | 5.0 | blendSignal | $(n/k)^2$ | | 5 |
| burstActivity | recencyWeightedFreq / bound | 10.0 | blendSignal | | | |
| chunkChurn | chunk.commitCount / bound | 30 | × alpha | | | |
| chunkRelativeChurn | chunk.churnRatio / bound | 1.0 | × alpha | | | |
| blockPenalty | 1 − alpha | | | | | |

Structural Signals

| Signal | Normalization | Adaptive Bound | Alpha-Blend | Dampening | Inversion |
|---|---|---|---|---|---|
| similarity | passthrough (vector score) | | | | |
| chunkSize | (endLine − startLine) / bound | 500 | | | |
| documentation | binary (0 or 1) | | | | |
| imports | imports.length / bound | 20 | | | |
| pathRisk | binary (0 or 1) | | | | |

Legend:

  • Adaptive Bound column shows the defaultBound (floor value); actual bound is computed per-query via p95
  • blendSignal = full alpha-blending of chunk + file values
  • × alpha = signal value multiplied directly by alpha (chunk-only signals)
  • 1 − alpha = inverse alpha (penalty for low-quality chunks)
  • Dampening Fallback = hardcoded threshold used when collection stats are not yet available