Agent-Augmented Development

What happens to software engineering metrics, workflows, and code quality when a substantial fraction of commits are produced by AI agents. This page summarises the emerging research and explains which of those effects motivate TeaRAGs' design choices — particularly GIT SESSIONS.


The Shift

Agentic development is qualitatively different from human coding, not just "faster typing":

  • Commit cadence: human engineers commit in logical units (a feature, a fix). Agents commit in micro-increments (pass the test, adjust one line, pass again). A 20-commit agent session is functionally equivalent to one human commit.
  • Authorship distribution: solo devs working with an agent produce bimodal histories: mostly human, with bursts of agent commits. Team ownership heuristics built for human-only histories misinterpret this.
  • Code volume: generated code outpaces reviewed code. Without tooling that explicitly flags agent-authored regions, review rigor falls behind generation speed.
  • Search patterns: agents search heavily before editing, which amplifies the cost of a bad search: they tend to act on the first relevant-looking result, not the best one.

These aren't predictions — they're observed in empirical studies of GitHub activity since 2023. TeaRAGs' design assumes all of them.


Academic and Industry Research

Measuring AI-driven productivity

  • Peng et al. (2023). "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." Randomized controlled trial across 95 developers. Copilot users completed tasks 55.8% faster than control. Caveat: single-task benchmark, not sustained workflow.
  • Kalliamvakou et al. (2022). "Research: Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness." Self-report survey. 88% of respondents said they felt more productive, but this is perception, not measured throughput.
  • Ziegler et al. (2022). "Productivity Assessment of Neural Code Completion." Adoption correlates with productivity self-assessment; acceptance rate is the strongest individual predictor.

Code quality with AI assistance

Churn-prediction models meet agentic commits

  • Nagappan & Ball (2005). "Use of Relative Code Churn Measures to Predict System Defect Density." Classic result: relative churn measures (churn normalised by size, e.g. lines changed / file size) predict defect density far better than absolute churn counts. See Code Churn Research for the full treatment.
  • Tornhill (2024). Your Code as a Crime Scene (2nd ed.). Pragmatic Bookshelf. The "hotspot" model: complexity × change frequency. Works well on human commits; over-flags agent burst commits as hotspots.
  • Bird et al. (2011). "Don't Touch My Code! Examining the Effects of Ownership on Software Quality." Concentrated ownership correlates with fewer defects. Agent-co-authored commits dilute commit-based ownership signals; recentDominantAuthorPct loses fidelity. Live-line ownership (blameDominantAuthorPct) is more robust on agent-heavy repos because it counts who currently owns the lines, not how many commits each name appeared in.
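
The three measures cited above can be made concrete with a minimal sketch. The function names and sample data below are hypothetical illustrations, not TeaRAGs' implementation; the sketch only shows why a commit-based ownership share and a live-line ownership share can disagree on an agent-heavy file.

```python
from collections import Counter


def relative_churn(lines_changed: int, file_lines: int) -> float:
    """Lines changed divided by current file size (0.0 for an empty file)."""
    if file_lines <= 0:
        return 0.0
    return lines_changed / file_lines


def hotspot_score(complexity: float, change_count: int) -> float:
    """Hotspot model as summarised above: complexity x change frequency."""
    return complexity * change_count


def dominant_author_pct(owners: list[str]) -> tuple[str, float]:
    """Return the most frequent name and its share in percent."""
    if not owners:
        raise ValueError("owners must not be empty")
    author, count = Counter(owners).most_common(1)[0]
    return author, 100.0 * count / len(owners)


# Commit-based view: one entry per commit that touched the file.
commit_authors = ["alice"] * 3 + ["agent"] * 17
# Live-line view: one entry per line currently in the file (as git blame reports).
line_owners = ["alice"] * 80 + ["agent"] * 20

print(dominant_author_pct(commit_authors))  # ('agent', 85.0)
print(dominant_author_pct(line_owners))     # ('alice', 80.0)
```

In this made-up history the agent dominates by commit count while a human still owns most of the surviving lines, which is the divergence the ownership bullet describes.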

Why GIT SESSIONS Exists

The research above lines up into a single problem: commit-level churn metrics systematically misrepresent agent-heavy codebases.

TeaRAGs addresses this via the GIT SESSIONS mode (TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS=true). It groups commits by (author, time gap) — any silence gap larger than TRAJECTORY_GIT_SESSION_GAP_MINUTES (default 30) starts a new session. Session count, not raw commit count, feeds churn signals.
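
The grouping rule can be sketched as follows. This is a simplified illustration of the (author, time gap) rule described above, not the actual TeaRAGs code; the `Commit` type and `count_sessions` function are hypothetical names, and only the 30-minute default comes from the text.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Commit(NamedTuple):
    author: str
    timestamp: datetime


def count_sessions(commits: list[Commit], gap_minutes: int = 30) -> int:
    """Count sessions: for each author, a silence longer than the gap starts a new one."""
    gap = timedelta(minutes=gap_minutes)
    last_seen: dict[str, datetime] = {}
    sessions = 0
    for commit in sorted(commits, key=lambda c: c.timestamp):
        previous = last_seen.get(commit.author)
        # First commit by this author, or a silence strictly larger than the gap.
        if previous is None or commit.timestamp - previous > gap:
            sessions += 1
        last_seen[commit.author] = commit.timestamp
    return sessions


start = datetime(2025, 1, 1, 9, 0)
# 20 agent commits, two minutes apart: one burst.
burst = [Commit("agent", start + timedelta(minutes=2 * i)) for i in range(20)]
# One more commit 45 minutes after the burst ends.
later = Commit("agent", burst[-1].timestamp + timedelta(minutes=45))

print(count_sessions(burst))            # 1
print(count_sessions(burst + [later]))  # 2
```

The 20-commit burst collapses to a single session, and only the commit after the 45-minute silence opens a second one.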

Effect on each compromised metric:

Signal          | Raw problem                                      | Session-aware fix
commitCount     | 20 micro-commits = "hotspot"; false positive     | 20 → 1 session
bugFixRate      | Agent fix-and-retry loop inflates rate           | Counted once per session
churnVolatility | Agent bursts produce extreme stddev              | Sessions smooth the burstiness
relativeChurn   | Cumulative lines changed across retries inflate  | Deduplicated at session boundaries

recentDominantAuthor / blameDominantAuthor and taskIds are unaffected by session-mode deduplication — they're inherently per-author-per-ticket (or per live-line) and stay meaningful.
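
As a rough illustration of the bugFixRate row, the sketch below compares a per-commit rate with a per-session rate on commits that are already grouped into sessions. How TeaRAGs actually detects fix commits is not stated here, so the message regex and both function names are assumptions made for the example.

```python
import re

# Assumption: a commit counts as a fix if its message contains one of these words.
FIX_PATTERN = re.compile(r"\b(fix|bug|hotfix)\b", re.IGNORECASE)


def is_fix(message: str) -> bool:
    return bool(FIX_PATTERN.search(message))


def raw_bug_fix_rate(sessions: list[list[str]]) -> float:
    """Share of individual commits that look like fixes."""
    messages = [m for session in sessions for m in session]
    if not messages:
        return 0.0
    return sum(is_fix(m) for m in messages) / len(messages)


def session_bug_fix_rate(sessions: list[list[str]]) -> float:
    """Share of sessions containing at least one fix (each session counted once)."""
    if not sessions:
        return 0.0
    return sum(any(is_fix(m) for m in s) for s in sessions) / len(sessions)


sessions = [
    # One agent fix-and-retry loop: five commits, four of them "fix".
    ["add parser", "fix test", "fix test again", "fix off-by-one", "fix lint"],
    ["add CLI flag"],
    ["update docs"],
    ["refactor config loader"],
]

print(raw_bug_fix_rate(sessions))      # 0.5
print(session_bug_fix_rate(sessions))  # 0.25
```

A single retry loop accounts for every fix commit in this sample, so counting it once per session halves the apparent rate.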

See Git Enrichment Pipeline → GIT SESSIONS for the implementation detail and default tuning.


Practical Implications for Agent Workflows

A few consequences worth building your agent's behaviour around:

  1. Don't learn from your own hotspots. If the agent sees a file with commitCount=40 from its own recent session, that's not "important code" — that's churn from last hour's TDD loop. TeaRAGs' relativeChurn normalises by file size, and session mode further de-noises.
  2. Pair generation with retrieval. Agents have an unfair advantage: they can query the index cheaply before writing. Use it — Agentic Data-Driven Engineering shows the retrieval-first generation pattern.
  3. Ownership beats recency. ageDays changes the moment an agent touches a file. blameDominantAuthor (live-line ownership) doesn't move just because someone re-saved a file; it moves only when actual lines change owner. Prefer live-line authorship signals over time signals for stability judgements on agent-heavy repos.
  4. Trust tests, not "it looks right". AI-generated code often compiles and looks plausible but silently fails edge cases. Trajectory signals like chunkBugFixRate are proxy indicators for "this function has been wrong before" — useful even when no test is failing right now.

Further Reading

Inside this knowledge base

Where to dig deeper

  • ACM Conference on AI for Software Engineering (AISE) proceedings: the current venue for this research
  • NeurIPS / ICSE workshops on "LLMs for Code": annual reviews of the state of the art
  • The CodeSearchNet benchmark (GitHub, 2019): still widely used for code-retrieval evaluation