Code Quality Metrics: Theory & Research

Abstract

This page surveys the academic foundations, industrial systems, and practical implementation strategies for code quality metrics in the context of code intelligence tools. It covers coupling and dependency metrics, churn and complexity models, composite risk metrics, and defect prediction research — the scientific basis for trajectory enrichment awareness in TeaRAGs.


1. Foundations: Coupling and Dependency Metrics

1.1 Fan-In / Fan-Out (Henry & Kafura 1981)

The original information flow metrics defined by Henry and Kafura measure:

  • Fan-In -- number of modules that call or depend on a given module
  • Fan-Out -- number of modules that a given module calls or depends on

At the file level, these map to import/dependency relationships:

  • Structural Fan-In (SFIN) = count of files that import this file
  • Structural Fan-Out (SFOUT) = count of files this file imports
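
A minimal TypeScript sketch of how SFIN and SFOUT could be derived from a map of file imports (the ImportMap shape and function name are illustrative, not existing tea-rags structures):

type ImportMap = Map<string, string[]>; // file -> list of files it imports

function fanMetrics(imports: ImportMap): Map<string, { sfin: number; sfout: number }> {
  const result = new Map<string, { sfin: number; sfout: number }>();
  const entry = (f: string) => {
    if (!result.has(f)) result.set(f, { sfin: 0, sfout: 0 });
    return result.get(f)!;
  };
  for (const [file, deps] of imports) {
    entry(file).sfout = deps.length;              // SFOUT: count of files this file imports
    for (const dep of deps) entry(dep).sfin += 1; // SFIN: one more importer of dep
  }
  return result;
}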

A longitudinal study of fan-in and fan-out in open-source systems (Counsell et al.) established that:

  • Fan-in follows a power-law distribution -- a few files are critically important
  • Fan-out follows a log-normal distribution -- most files have moderate dependencies
  • Refactored classes tend to have larger fan-in and fan-out values

References:

  • Henry, S. & Kafura, D. (1981). "Software Structure Metrics Based on Information Flow." IEEE TSE, 7(5), 510--518.
  • Counsell, S. et al. (2010). "An Evolutionary Study of Fan-in and Fan-out Metrics in OSS." IEEE CSMR.

1.2 Robert Martin's Package Metrics (2002)

Robert C. Martin formalized package-level coupling metrics in Agile Software Development: Principles, Patterns, and Practices:

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Ca (Afferent Coupling) | count of external classes that depend on this package | Responsibility -- who needs me |
| Ce (Efferent Coupling) | count of external classes this package depends on | Dependency -- who I need |
| Instability (I) | Ce / (Ce + Ca) | 0 = maximally stable, 1 = maximally unstable |
| Abstractness (A) | abstract classes / total classes | 0 = concrete, 1 = abstract |
| Distance (D) | \|A + I - 1\| | distance from the "main sequence"; 0 = balanced abstraction and stability |

The Stable Dependencies Principle states: depend in the direction of stability (low I). High fan-in modules should be stable (low I) and abstract (high A).

Key Insight for Blast Radius

Instability I = Ce / (Ce + Ca) elegantly combines fan-in and fan-out into a single metric. Modules with low instability (high fan-in relative to fan-out) have the largest blast radius.
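
A direct transcription in TypeScript (the zero-coupling fallback to 0 is a convention chosen here, not part of Martin's definition):

function instability(ca: number, ce: number): number {
  return ca + ce === 0 ? 0 : ce / (ce + ca); // I = Ce / (Ce + Ca)
}

function mainSequenceDistance(abstractness: number, i: number): number {
  return Math.abs(abstractness + i - 1); // D = |A + I - 1|
}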

References:

  • Martin, R.C. (2002). Agile Software Development: Principles, Patterns, and Practices. Prentice Hall.
  • Almugrin, S. & Melton, A. (2015). "A Validation of Martin's Metric." Journal of Object Technology.

1.3 Graph-Theoretic Metrics

When the entire dependency graph is available, network analysis metrics apply:

| Metric | What it measures | Relevance to blast radius |
| --- | --- | --- |
| In-Degree | direct dependents (= fan-in) | First-order blast radius |
| Betweenness Centrality | how often a node lies on shortest paths between other nodes | "Gatekeeper" files -- changing them disrupts information flow |
| PageRank | importance weighted by the importance of incoming links | Files depended on by important files are more important |
| Closeness Centrality | average distance to all other nodes | Files close to everything propagate changes widely |
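
For illustration, a PageRank sketch over such an import graph using power iteration; the damping factor, iteration count, and the decision to drop dangling-node mass are simplifying assumptions:

// An edge A -> B means "A imports B", so importance flows toward the
// files being depended on.
function pageRank(edges: Map<string, string[]>, damping = 0.85, iters = 50): Map<string, number> {
  const nodes = new Set<string>();
  for (const [src, deps] of edges) { nodes.add(src); for (const d of deps) nodes.add(d); }
  const n = nodes.size;
  let rank = new Map([...nodes].map((f) => [f, 1 / n] as [string, number]));
  for (let k = 0; k < iters; k++) {
    const next = new Map([...nodes].map((f) => [f, (1 - damping) / n] as [string, number]));
    for (const [src, deps] of edges) {
      if (deps.length === 0) continue; // dangling node: its mass is dropped here
      const share = (damping * rank.get(src)!) / deps.length;
      for (const dep of deps) next.set(dep, next.get(dep)! + share);
    }
    rank = next;
  }
  return rank;
}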

He et al. (2013) applied centrality metrics to Java software dependency networks and found a statistically significant relationship between a class's graph importance (degree, PageRank, HITS) and its bug-proneness.

The node2defect approach (ASE 2018) used network embedding on dependency graphs to predict defects, outperforming traditional metrics alone.

References:

  • Zimmermann, T. & Nagappan, N. (2008). "Predicting Defects Using Network Analysis on Dependency Graphs." ICSE 2008.
  • He, Z. et al. (2013). "Using Software Dependency to Bug Prediction." Mathematical Problems in Engineering.
  • Qu, Y. et al. (2018). "node2defect: Using Network Embedding to Improve Software Defect Prediction." ASE 2018.


2. Churn x Complexity: The Hotspot Model

2.1 Hotspot Analysis (Tornhill 2013--2024)

Adam Tornhill's Your Code as a Crime Scene methodology, implemented in CodeScene, defines hotspots as:

Hotspot = High Complexity x High Change Frequency

The key insight: complexity only matters if you need to deal with it frequently. A 500-line function that never changes is not a problem. A 50-line function that changes every week is.
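
One way to operationalize this, as a sketch -- CodeScene's exact scoring is not public, so the min-max normalization and bounds here are assumptions:

function hotspotScore(changes: number, maxChanges: number,
                      complexity: number, maxComplexity: number): number {
  const churnNorm = maxChanges > 0 ? changes / maxChanges : 0;
  const compNorm = maxComplexity > 0 ? complexity / maxComplexity : 0;
  return churnNorm * compNorm; // high only when BOTH dimensions are high
}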

Empirical Validation

In a case study of a 400 KLOC system with 89 developers and 18,000+ commits, hotspot analysis identified 7 of the 8 most defect-dense modules; 4% of the code was responsible for 72% of all defects.

References:

  • Tornhill, A. (2015). Your Code as a Crime Scene. Pragmatic Bookshelf.
  • Tornhill, A. (2024). Your Code as a Crime Scene, Second Edition. Pragmatic Bookshelf.

2.2 Temporal Coupling (CodeScene)

Beyond direct import dependencies, CodeScene measures temporal coupling -- files that change together in commits:

  • Two files that consistently appear in the same commits have high temporal coupling
  • This reveals hidden dependencies not visible in import graphs
  • Use cases: detecting copy-paste code, evaluating test relevance, finding architectural decay

Filter: Commits touching more than 50 files are excluded (reorganizations create false positives).
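
A sketch of the co-change computation; the coupling strength shown (shared commits over the average of each file's total commits) is one common formulation, not necessarily CodeScene's exact one:

// commits: each entry is the list of files touched by one commit.
function temporalCoupling(commits: string[][], maxFilesPerCommit = 50): Map<string, number> {
  const fileCommits = new Map<string, number>();
  const pairCounts = new Map<string, number>();
  for (const files of commits) {
    if (files.length > maxFilesPerCommit) continue; // skip mass reorganizations
    for (const f of files) fileCommits.set(f, (fileCommits.get(f) ?? 0) + 1);
    for (let i = 0; i < files.length; i++)
      for (let j = i + 1; j < files.length; j++) {
        const key = [files[i], files[j]].sort().join('|');
        pairCounts.set(key, (pairCounts.get(key) ?? 0) + 1);
      }
  }
  const coupling = new Map<string, number>();
  for (const [key, shared] of pairCounts) {
    const [a, b] = key.split('|');
    const avgTotal = (fileCommits.get(a)! + fileCommits.get(b)!) / 2;
    coupling.set(key, shared / avgTotal); // 1.0 = the pair always changes together
  }
  return coupling;
}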

References:

  • CodeScene Documentation. "Temporal Coupling." docs.enterprise.codescene.io.

2.3 Relative Code Churn (Nagappan & Ball, Microsoft, ICSE 2005)

The foundational Microsoft Research study on Windows Server 2003 demonstrated that relative churn measures are far better defect predictors than absolute ones:

| Metric | Formula | Type |
| --- | --- | --- |
| ChurnedLOC / TotalLOC | changed lines / file size | Relative |
| ChurnCount / FileCount | churns / files in component | Relative |
| ChurnedLOC / ChurnCount | lines per churn | Relative |

Results: relative code churn predicted defect density with 89% accuracy on Windows Server 2003 binaries.
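
A sketch of the per-file relative measures above (field names are descriptive, not the paper's variable names; the component-level ChurnCount / FileCount measure is omitted):

interface FileChurnData {
  churnedLoc: number; // lines added or modified across the period
  totalLoc: number;   // current file size in lines
  churnCount: number; // number of changes (commits touching the file)
}

function relativeChurn(f: FileChurnData) {
  return {
    churnedToTotal: f.totalLoc > 0 ? f.churnedLoc / f.totalLoc : 0,  // ChurnedLOC / TotalLOC
    locPerChurn: f.churnCount > 0 ? f.churnedLoc / f.churnCount : 0, // ChurnedLOC / ChurnCount
  };
}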

References:

  • Nagappan, N. & Ball, T. (2005). "Use of Relative Code Churn Measures to Predict System Defect Density." ICSE 2005, 284--292.


3. Composite Risk Metrics

3.1 CRAP: Change Risk Anti-Patterns

The CRAP metric combines cyclomatic complexity with test coverage:

CRAP(m) = comp(m)^2 * (1 - cov(m)/100)^3 + comp(m)
| Complexity | Required coverage for CRAP < 30 |
| --- | --- |
| 5 | 0% |
| 10 | 42% |
| 25 | 80% |
| 30+ | Impossible -- must reduce complexity |

The CRAP threshold of 30 was chosen empirically: below 30 = low maintenance risk; above 30 = "crappy" code.
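
The formula transcribes directly; coverage is a percentage:

function crap(complexity: number, coveragePct: number): number {
  const uncovered = 1 - coveragePct / 100;
  return complexity ** 2 * uncovered ** 3 + complexity;
}
// e.g. crap(10, 42) ≈ 29.5 (just under the threshold); crap(10, 0) = 110.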

References:

  • Crap4j Project. crap4j.org. "The Code C.R.A.P. Metric Hits the Fan."
  • NDepend Blog. "CRAP Metric Is a Thing And It Tells You About Risk in Your Code."

3.2 SonarQube Technical Debt Model

SonarQube defines technical debt as remediation cost (in minutes) and rates maintainability:

| Rating | Debt Ratio | Interpretation |
| --- | --- | --- |
| A | 0--5% | Minimal debt |
| B | 6--10% | Manageable |
| C | 11--20% | Attention needed |
| D | 21--50% | Significant |
| E | >50% | Critical |
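
A minimal sketch of the rating lookup implied by the table (the debt ratio itself is remediation cost divided by estimated development cost):

function maintainabilityRating(debtRatioPct: number): 'A' | 'B' | 'C' | 'D' | 'E' {
  if (debtRatioPct <= 5) return 'A';
  if (debtRatioPct <= 10) return 'B';
  if (debtRatioPct <= 20) return 'C';
  if (debtRatioPct <= 50) return 'D';
  return 'E';
}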

SonarQube measures both cyclomatic complexity (path count) and cognitive complexity (human readability). These are complementary: cyclomatic measures testability, cognitive measures comprehensibility.

References:

  • SonarSource. "Understanding Measures and Metrics." docs.sonarsource.com.

3.3 Defect Prediction via Combined Metrics

Large-scale empirical studies consistently show that combining metric categories outperforms single-category models:

| Metric Category | Examples | Predictive Power (alone) |
| --- | --- | --- |
| Process metrics | churn, change frequency, author count | High |
| Code metrics | complexity, LOC, coupling | Medium-High |
| Network metrics | fan-in, centrality, PageRank | Medium |
| Combined | all of the above | Highest |

Nagappan & Ball (Microsoft) demonstrated that software dependencies and churn measures together are efficient predictors of post-release failures.

CCC metrics (coupling, cohesion, complexity) were shown to correlate with security vulnerabilities in Mozilla Firefox at statistically significant levels.

References:

  • Nagappan, N. & Ball, T. (2007). "Using Software Dependencies and Churn Metrics to Predict Field Failures." ISSRE 2007.
  • Shin, Y. et al. (2010). "Can Complexity, Coupling, and Cohesion Metrics Be Used as Early Indicators of Vulnerabilities?" ACM SAC 2010.
  • Rebro, D.A. et al. (2023). "Source Code Metrics for Software Defects Prediction." arXiv:2301.08022.


4. Industrial Systems

4.1 CodeScene

The most mature blast-radius-aware system. Combines:

  • Hotspot analysis (complexity x change frequency)
  • Temporal coupling (co-change analysis)
  • X-Ray (method-level temporal coupling within hotspots)
  • Code health scoring (nesting, function length, coupling)

Does not compute structural fan-in/fan-out from imports. Relies on temporal coupling as a proxy for hidden dependencies.

4.2 SonarQube

Focuses on static analysis: complexity, code smells, duplications, test coverage. Does not incorporate git history, churn, or dependency graphs. Maintainability rating is based on remediation cost, not change risk.

4.3 Google (Tricorder + ML Code Review)

Google's internal tools process millions of code review comments per year. ML models predict needed code edits from reviewer comments. Risk assessment is implicit in the review process, with models trained on change history, code complexity, and developer activity.

4.4 NDepend / JArchitect

Static analysis tools that implement Martin's metrics (Ca, Ce, Instability, Abstractness, Distance from Main Sequence) and compute dependency graphs with fan-in/fan-out at method, class, and namespace levels.


5. Metric Tiers for tea-rags Implementation

Tier 1: Core (highest ROI, implement first)

| Metric | Signal | Source | Rationale |
| --- | --- | --- | --- |
| importedByCount | importedBy | reverse dependency scan | Direct blast radius. Power-law: a few files dominate. |
| Instability | instability | Ce / (Ce + Ca) | Martin's metric. Single number combining fan-in + fan-out. |
| isHub | isHub | fanIn > t1 AND fanOut > t2 | Hub files = maximum change risk. Boolean flag. |
| Cyclomatic Complexity | complexity | tree-sitter AST | Foundation for the hotspot model. Every study uses it. |
| Composite Hotspot | (preset) | churn x complexity | CodeScene proved: 4% of code = 72% of defects. |

Tier 2

| Metric | Signal | Source | Rationale |
| --- | --- | --- | --- |
| Cognitive Complexity | cognitive | tree-sitter AST | SonarQube: better than cyclomatic for readability. |
| Change Risk Score | (preset) | churn x complexity x blastRadius | Composite of all three dimensions. |
| isLeaf | isLeaf | fanIn = 0 AND fanOut >= 1 | Leaf nodes are safe to change (no dependents). |
| fanOut | fanOut | imports.length (already exists) | Efferent coupling -- how much this file depends on others. |

Tier 3: Situational (useful in specific contexts)

| Metric | When useful | Why not Tier 1 |
| --- | --- | --- |
| Temporal Coupling | hidden dependency detection, copy-paste | Requires a co-change analysis pipeline |
| Betweenness Centrality | architectural "bridge" detection | Requires full graph + graph algorithms |
| PageRank | "importance" with quality weighting | Overkill for file-level analysis |
| Distance from Main Sequence | package health checks | Requires abstractness data |
| CRAP Score | risk = complexity x (1 - coverage) | No test coverage data available |
| Transitive Fan-In | cascading blast radius (N levels deep) | Exponential computation cost |

6. Proposed Reranker Signals

Building on the existing signals in the tea-rags reranker:

// New signals (Tier 1-2)
importedBy: normalize(deps.importedByCount, maxImportedBy),  // fan-in
fanOut: normalize(imports.length, maxFanOut),                // fan-out (using existing data)
isHub: deps.isHub ? 1.0 : 0.0,                               // high fan-in AND high fan-out
instability: deps.instability,                               // Ce / (Ce + Ca), already 0-1
complexity: normalize(complexity.cyclomatic, maxComplexity), // future: from AST
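
The normalize helper above is assumed to be a simple bounded scaling; a minimal sketch:

// Assumed helper: scale a raw value into [0, 1] against an upper bound.
const normalize = (value: number, max: number): number =>
  max > 0 ? Math.min(value / max, 1) : 0;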

New Presets

blastRadius

blastRadius: {
  similarity: 0.30,
  importedBy: 0.25, // fan-in = direct blast radius
  fanOut: 0.15,     // efferent coupling
  isHub: 0.15,      // hub flag = max risk
  churn: 0.15,      // existing churn signal
}

changeRisk

changeRisk: {
  similarity: 0.20,
  churn: 0.15,
  importedBy: 0.15, // blast radius dimension
  complexity: 0.15, // complexity dimension (when available)
  bugFix: 0.10,
  volatility: 0.10,
  knowledgeSilo: 0.10,
  burstActivity: 0.05,
}

Updated impactAnalysis

The current impactAnalysis preset uses similarity: 0.5, imports: 0.5. The proposed update:

impactAnalysis: {
  similarity: 0.30,
  importedBy: 0.30, // was: imports 0.5 (fan-out only)
  fanOut: 0.15,
  isHub: 0.15,
  churn: 0.10,
}

7. Payload Schema Extension

// New top-level payload field: deps
interface DepsPayload {
  importedByCount: number; // structural fan-in
  importedBy?: string[];   // importing file paths (optional; can be large)
  fanOut: number;          // = imports.length
  isHub: boolean;          // importedByCount > t1 AND fanOut > t2
  isLeaf: boolean;         // importedByCount = 0 AND fanOut >= 1
  instability: number;     // Ce / (Ce + Ca), range 0-1
}

// Future: complexity payload
interface ComplexityPayload {
  cyclomatic: number; // McCabe cyclomatic complexity
  cognitive: number;  // SonarQube cognitive complexity
}

8. Implementation Phases

Phase A: AST Import Extraction

Replace regex-based import extraction with tree-sitter AST parsing. Expand language support beyond JS/TS/Python. Extract structured ImportEntry objects with kind classification (named, default, namespace, side-effect).

Phase B: importedBy Computation

Post-indexing pass: build reverse dependency map from existing imports[] data. Store deps.* payload via batchSetPayload. Compute instability, isHub, isLeaf.
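
A sketch of this pass, reusing the DepsPayload shape from Section 7; the function name and threshold defaults are placeholders, and the batchSetPayload write is left out:

function computeDeps(imports: Map<string, string[]>, t1 = 10, t2 = 10): Map<string, DepsPayload> {
  // Invert the import map to get the reverse dependency map.
  const importedBy = new Map<string, string[]>();
  for (const [file, deps] of imports)
    for (const dep of deps)
      importedBy.set(dep, [...(importedBy.get(dep) ?? []), file]);

  const result = new Map<string, DepsPayload>();
  for (const [file, deps] of imports) {
    const ca = importedBy.get(file)?.length ?? 0; // fan-in (afferent)
    const ce = deps.length;                       // fan-out (efferent)
    result.set(file, {
      importedByCount: ca,
      fanOut: ce,
      isHub: ca > t1 && ce > t2,
      isLeaf: ca === 0 && ce >= 1,
      instability: ca + ce === 0 ? 0 : ce / (ce + ca),
    });
  }
  return result;
}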

Phase C: blastRadius Preset

Wire new signals into the reranker. Add blastRadius and update impactAnalysis preset. Add normalization bounds.

Phase D: Complexity Metrics

Compute cyclomatic and cognitive complexity from tree-sitter AST. Store in complexity.* payload. Add reranker signal.
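
A sketch of the cyclomatic side: count decision-point nodes in the AST and add one. The node type names are roughly tree-sitter-javascript's and vary by grammar, so treat them as illustrative:

const DECISION_NODE_TYPES = new Set([
  'if_statement', 'for_statement', 'while_statement', 'do_statement',
  'switch_case', 'catch_clause', 'ternary_expression',
]);

interface AstNode { type: string; children: AstNode[]; }

function decisionPoints(node: AstNode): number {
  let count = DECISION_NODE_TYPES.has(node.type) ? 1 : 0;
  for (const child of node.children) count += decisionPoints(child);
  return count;
}

// cyclomatic complexity of a function = 1 + decisionPoints(functionBody)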

Phase E: Change Risk Score

Composite preset combining all three dimensions: churn, blast radius, complexity.


References

  1. Henry, S. & Kafura, D. (1981). "Software Structure Metrics Based on Information Flow." IEEE TSE, 7(5), 510--518.
  2. Martin, R.C. (2002). Agile Software Development: Principles, Patterns, and Practices. Prentice Hall.
  3. Nagappan, N. & Ball, T. (2005). "Use of Relative Code Churn Measures to Predict System Defect Density." ICSE 2005, 284--292.
  4. Nagappan, N. & Ball, T. (2007). "Using Software Dependencies and Churn Metrics to Predict Field Failures." ISSRE 2007.
  5. Zimmermann, T. & Nagappan, N. (2008). "Predicting Defects Using Network Analysis on Dependency Graphs." ICSE 2008.
  6. Hassan, A.E. (2009). "Predicting Faults Using the Complexity of Code Changes." ICSE 2009.
  7. Shin, Y. et al. (2010). "Can Complexity, Coupling, and Cohesion Metrics Be Used as Early Indicators of Vulnerabilities?" ACM SAC 2010.
  8. Counsell, S. et al. (2010). "An Evolutionary Study of Fan-in and Fan-out Metrics in OSS." IEEE CSMR.
  9. He, Z. et al. (2013). "Using Software Dependency to Bug Prediction." Mathematical Problems in Engineering.
  10. Tornhill, A. (2015). Your Code as a Crime Scene. Pragmatic Bookshelf.
  11. Qu, Y. et al. (2018). "node2defect: Using Network Embedding to Improve Software Defect Prediction." ASE 2018.
  12. Rebro, D.A. et al. (2023). "Source Code Metrics for Software Defects Prediction." arXiv:2301.08022.
  13. Tornhill, A. (2024). Your Code as a Crime Scene, Second Edition. Pragmatic Bookshelf.