TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

2026-03-18

Software Engineering · Artificial Intelligence
AI summary

The authors studied how AI coding tools often fix software problems but sometimes introduce new bugs in the process. They created TDAD, a tool that helps AI agents figure out which tests might break after a code change by analyzing the structure of the code and its tests. In evaluation, TDAD substantially reduced change-induced bugs and improved problem-solving success. Interestingly, simply telling the AI to follow test-driven development steps actually increased bugs, suggesting that giving AI the right context is more helpful than instructions alone. The authors also demonstrated an automatic improvement loop that drastically boosted success while preventing new bugs.

AI coding agents, software regressions, test-driven development, abstract syntax tree (AST), code-test graph, regression testing, benchmark, auto-improvement loop, GraphRAG, contextual information
Authors
Pepe Alonso
Abstract
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
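To make the core idea concrete, here is a minimal sketch (not the actual TDAD implementation; all function names and the weighting scheme are illustrative assumptions) of AST-based code-test graph construction with a simple weighted impact analysis: test functions are linked to the source functions they call, and tests touching fewer functions are weighted as stronger signals for a given change.

```python
# Illustrative sketch only -- TDAD's real graph construction and weights differ.
import ast
from collections import defaultdict

def called_names(func_node):
    """Collect names of functions called inside a function body."""
    names = set()
    for node in ast.walk(func_node):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            names.add(node.func.id)
    return names

def build_graph(source_code, test_code):
    """Map each source function to the tests that directly exercise it."""
    src_funcs = {n.name for n in ast.walk(ast.parse(source_code))
                 if isinstance(n, ast.FunctionDef)}
    graph = defaultdict(list)  # source function -> [(test name, weight)]
    for node in ast.walk(ast.parse(test_code)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            calls = called_names(node) & src_funcs
            for target in calls:
                # Assumed heuristic: a test that touches fewer source
                # functions is more focused, so a hit on it weighs more.
                graph[target].append((node.name, 1.0 / len(calls)))
    return graph

def impacted_tests(graph, changed_func):
    """Return tests likely affected by a change, highest weight first."""
    return sorted(graph.get(changed_func, []), key=lambda t: -t[1])

source = "def add(a, b):\n    return a + b\n\ndef mul(a, b):\n    return a * b\n"
tests = (
    "def test_add():\n    assert add(1, 2) == 3\n\n"
    "def test_both():\n    assert add(1, 1) == mul(1, 2)\n"
)
graph = build_graph(source, tests)
print(impacted_tests(graph, "add"))  # [('test_add', 1.0), ('test_both', 0.5)]
```

An agent would surface this ranked list as context before proposing a patch, which is the "which tests to verify" signal the abstract contrasts with purely procedural TDD instructions.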