Agent-Augmented Development
What happens to software engineering metrics, workflows, and code quality when a substantial fraction of commits are produced by AI agents. This page summarises the emerging research and explains which of those effects motivate TeaRAGs' design choices — particularly GIT SESSIONS.
The Shift
Agentic development is qualitatively different from human coding, not just "faster typing":
- Commit cadence — human engineers commit in logical units (a feature, a fix). Agents commit in micro-increments (pass the test, adjust one line, pass again). A 20-commit agent session is functionally equivalent to one human commit.
- Authorship distribution: solo devs working with an agent produce bimodal histories, mostly human with bursts of agent. Team ownership heuristics built for human-only histories misinterpret this.
- Code volume: generated code outpaces reviewed code. Without tooling that explicitly flags agent-authored regions, review rigor diverges from generation speed.
- Search patterns — agents search exhaustively before editing. The cost of a bad search is amplified: they'll act on the first relevant-looking result, not the best one.
These aren't predictions — they're observed in empirical studies of GitHub activity since 2023. TeaRAGs' design assumes all of them.
Academic and Industry Research
Measuring AI-driven productivity
- Peng et al. (2023). "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." Randomized controlled trial across 95 developers. Copilot users completed tasks 55.8% faster than control. Caveat: single-task benchmark, not sustained workflow.
- Kalliamvakou et al. (2022). "Research: Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness." Self-report survey. 88% of respondents said they felt more productive, but this is perception, not measured throughput.
- Ziegler et al. (2022). "Productivity Assessment of Neural Code Completion." Adoption correlates with productivity self-assessment; acceptance rate is the strongest individual predictor.
Code quality with AI assistance
- Hicks et al. (2024). "Does AI-Assisted Coding Deliver? An Empirical Study of Code Churn, Refactoring, and Bug-Fixing Rates." Analysis of 1.5M GitHub commits. AI-assisted commits show higher churn and higher subsequent fix-commit ratios than non-AI commits within the same repositories. Implication: naive commitCount/bugFixRate metrics on AI-heavy repos systematically misread "thrashing" as "hotspot".
- Denny et al. (2023). "Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language." Multi-round prompt-fix cycles are the norm, not the exception. Code generated in one shot is rare.
- Dakhel et al. (2023). "GitHub Copilot AI pair programmer: Asset or Liability?" Copilot suggestions were correct 28% of the time on fundamental algorithmic problems. Partial correctness + plausible appearance = latent bug risk.
Churn-prediction models meet agentic commits
- Nagappan & Ball (2005). "Use of Relative Code Churn Measures to Predict System Defect Density." Classic result: relative churn (lines changed / file size) is the strongest single defect predictor. See Code Churn Research for the full treatment.
- Tornhill (2024). Your Code as a Crime Scene (2nd ed.). Pragmatic Bookshelf. The "hotspot" model: complexity × change frequency. Works well on human commits; over-flags agent burst commits as hotspots.
- Bird et al. (2011). "Don't Touch My Code! Examining the Effects of Ownership on Software Quality." Concentrated ownership correlates with fewer defects. Agent-co-authored commits dilute commit-based ownership signals; recentDominantAuthorPct loses fidelity. Live-line ownership (blameDominantAuthorPct) is more robust on agent-heavy repos because it counts who currently owns the lines, not how many commits each name appeared in.
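The difference between commit-based and live-line ownership can be illustrated with a minimal sketch. It assumes both signals reduce to "share held by the most frequent author", computed over commit authors in one case and over per-line blame authors in the other. The helper `dominant_author_pct` and the sample data are hypothetical, the recency window implied by recentDominantAuthorPct is ignored, and none of this is TeaRAGs' actual implementation.

```python
from collections import Counter


def dominant_author_pct(authors):
    """Return (author, percentage) for the most frequent name in `authors`.

    Hypothetical helper: pass one entry per commit for a commit-based
    signal, or one entry per live line (as reported by blame) for a
    live-line signal.
    """
    if not authors:
        return None, 0.0
    name, count = Counter(authors).most_common(1)[0]
    return name, 100.0 * count / len(authors)


# Assumed history: alice wrote most of the file in 5 commits, then an
# agent made 20 micro-commits that together touched only 10 lines.
commit_authors = ["alice"] * 5 + ["agent"] * 20
line_authors = ["alice"] * 90 + ["agent"] * 10

commit_based = dominant_author_pct(commit_authors)
live_line = dominant_author_pct(line_authors)

print("commit-based owner:", commit_based)  # ('agent', 80.0)
print("live-line owner:  ", live_line)      # ('alice', 90.0)
```

Under these assumed numbers the commit-based view hands ownership to the agent, while the live-line view still attributes the file to the human who wrote most of the surviving lines.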
Why GIT SESSIONS Exists
The research above lines up into a single problem: commit-level churn metrics systematically misrepresent agent-heavy codebases.
TeaRAGs addresses this via the GIT SESSIONS mode (TRAJECTORY_GIT_SQUASH_AWARE_SESSIONS=true). It groups commits by (author, time gap) — any silence gap larger than TRAJECTORY_GIT_SESSION_GAP_MINUTES (default 30) starts a new session. Session count, not raw commit count, feeds churn signals.
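A minimal sketch of the grouping rule described above, assuming commits carry only an author and a timestamp and that a gap strictly larger than the threshold opens a new session. The `Commit` type and `group_into_sessions` function are hypothetical names for illustration, not TeaRAGs' API.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Commit(NamedTuple):
    author: str
    timestamp: datetime


def group_into_sessions(commits, gap_minutes=30):
    """Group commits into per-author sessions split on silence gaps.

    A commit joins its author's current session when it arrives within
    `gap_minutes` of that author's previous commit; a larger gap starts
    a new session.
    """
    gap = timedelta(minutes=gap_minutes)
    sessions = []
    open_session = {}  # author -> that author's most recent session
    for commit in sorted(commits, key=lambda c: c.timestamp):
        current = open_session.get(commit.author)
        if current is not None and commit.timestamp - current[-1].timestamp <= gap:
            current.append(commit)
        else:
            current = [commit]
            sessions.append(current)
            open_session[commit.author] = current
    return sessions


start = datetime(2024, 1, 1, 9, 0)
# 20 agent micro-commits two minutes apart (last one at 09:38) ...
burst = [Commit("agent", start + timedelta(minutes=2 * i)) for i in range(20)]
# ... one more agent commit after 45 minutes of silence ...
later = [Commit("agent", start + timedelta(minutes=83))]
# ... and one human commit in the middle of the burst.
human = [Commit("dev", start + timedelta(minutes=10))]

all_commits = burst + later + human
sessions = group_into_sessions(all_commits)
print(len(all_commits), "commits ->", len(sessions), "sessions")  # 22 commits -> 3 sessions
```

Keying sessions by author keeps an interleaved human commit from splitting or joining the agent's burst.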
Effect on each compromised metric:
| Signal | Raw problem | Session-aware fix |
|---|---|---|
| commitCount | 20 micro-commits = "hotspot"; false positive | 20 → 1 session |
| bugFixRate | Agent fix-and-retry loop inflates rate | Counted once per session |
| churnVolatility | Agent bursts produce extreme stddev | Sessions smooth the burstiness |
| relativeChurn | Cumulative lines changed across retries inflate | Deduplicated at session boundaries |
recentDominantAuthor / blameDominantAuthor and taskIds are unaffected by session-mode deduplication — they're inherently per-author-per-ticket (or per live-line) and stay meaningful.
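The first two rows of the table can be sketched as follows. This assumes sessions are already grouped, that each commit carries a hypothetical `is_fix` flag, and that "counted once per session" means a session counts as a fix session if any of its commits is a fix. The function names and the sample data are illustrative only.

```python
def raw_metrics(sessions):
    """Commit-level view: every commit counts on its own."""
    commits = [c for session in sessions for c in session]
    if not commits:
        return {"commitCount": 0, "bugFixRate": 0.0}
    fixes = sum(1 for c in commits if c["is_fix"])
    return {"commitCount": len(commits), "bugFixRate": fixes / len(commits)}


def session_metrics(sessions):
    """Session-level view: a session counts once, however many commits it holds."""
    if not sessions:
        return {"commitCount": 0, "bugFixRate": 0.0}
    fix_sessions = sum(1 for s in sessions if any(c["is_fix"] for c in s))
    return {"commitCount": len(sessions), "bugFixRate": fix_sessions / len(sessions)}


# Assumed history: one agent session of 20 commits, 12 of them
# fix-and-retry commits, plus three single-commit human sessions.
agent_session = [{"is_fix": i >= 8} for i in range(20)]
human_sessions = [[{"is_fix": False}] for _ in range(3)]
history = [agent_session] + human_sessions

raw = raw_metrics(history)
aware = session_metrics(history)
print("raw:          ", raw["commitCount"], round(raw["bugFixRate"], 2))      # 23 0.52
print("session-aware:", aware["commitCount"], round(aware["bugFixRate"], 2))  # 4 0.25
```

With these assumed inputs the raw view reports a file where more than half of all commits are fixes, while the session view reports one troubled session out of four.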
See Git Enrichment Pipeline → GIT SESSIONS for the implementation detail and default tuning.
Practical Implications for Agent Workflows
A few consequences worth building your agent's behaviour around:
- Don't learn from your own hotspots. If the agent sees a file with commitCount=40 from its own recent session, that's not "important code"; it's churn from the last hour's TDD loop. TeaRAGs' relativeChurn normalises by file size, and session mode further de-noises.
- Pair generation with retrieval. Agents have an unfair advantage: they can query the index cheaply before writing. Use it: Agentic Data-Driven Engineering shows the retrieval-first generation pattern.
- Freshness beats recency. ageDays changes the moment an agent touches a file. blameDominantAuthor (live-line ownership) doesn't move just because someone re-saved a file, only when actual lines change owner. Prefer live-line authorship signals over time signals for stability judgements on agent-heavy repos.
- Trust tests, not "it looks right". AI-generated code often compiles and looks plausible but silently fails edge cases. Trajectory signals like chunkBugFixRate are proxy indicators for "this function has been wrong before", useful even when no test is failing right now.
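To make the first point concrete, here is a minimal sketch of why a size-normalised signal can rank files differently from a raw commit count. It assumes relative churn is lines changed divided by file size in lines, following the Nagappan & Ball definition cited earlier. The `FileStats` type, the file paths, and the numbers are all hypothetical, and TeaRAGs' exact relativeChurn formula may differ.

```python
from typing import NamedTuple


class FileStats(NamedTuple):
    path: str
    commit_count: int
    lines_changed: int
    file_size_lines: int


def relative_churn(stats):
    """Lines changed divided by file size (assumed definition)."""
    if stats.file_size_lines <= 0:
        return 0.0
    return stats.lines_changed / stats.file_size_lines


files = [
    # Many tiny agent commits on a large file: high count, little real change.
    FileStats("src/parser.py", commit_count=40, lines_changed=60, file_size_lines=1200),
    # Few commits that rewrote most of a small file.
    FileStats("src/billing.py", commit_count=6, lines_changed=450, file_size_lines=500),
]

top_by_commits = max(files, key=lambda f: f.commit_count)
top_by_churn = max(files, key=relative_churn)

print("top by commitCount:  ", top_by_commits.path)  # src/parser.py
print("top by relativeChurn:", top_by_churn.path)    # src/billing.py
```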
Further Reading
Inside this knowledge base
- Code Churn Research — the underlying theory
- Semantic Search Criticism — where agent search tends to fail
- Signal Scoring Methods — how raw signals compose into rerank scores
Where to dig deeper
- ACM Conference on AI for Software Engineering (AISE) proceedings — the current venue for this research
- NeurIPS / ICSE workshops on "LLMs for Code": annual reviews of the state of the art
- The CodeSearchNet benchmark (GitHub, 2019) — still widely used for code-retrieval evaluation