TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

2026-07-02, Software Engineering

Software Engineering, Artificial Intelligence, Computation and Language
AI summary

The authors created TestEvo-Bench, a benchmark for evaluating how well automated agents keep tests up to date when code changes. Unlike earlier benchmarks, it anchors each task to a real commit history and checks that tests actually execute and relate to the code change. It includes tasks for both writing new tests and updating existing failing ones, drawn from 152 open-source Java projects. The four agents evaluated reach success rates of up to 77.5% on test generation and 74.6% on test update, with lower rates on the most recent tasks and under limited per-task cost. The benchmark gives a more direct measure of how well automated tools adapt tests as code evolves.

software testing, benchmark, test generation, test update, code evolution, commit history, test execution, mutation score, open-source Java, automation
Authors
Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie
Abstract
Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model's training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost.
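The live-benchmark idea in the abstract, restricting evaluation to tasks that postdate a model's training cutoff, can be illustrated with a minimal sketch. The `Task` record, its field names, and the `tasks_after_cutoff` helper are hypothetical and are not taken from TestEvo-Bench. The sketch also assumes that a task counts as postdating the cutoff only when both its code change and its test change are later than the cutoff.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Task:
    # Hypothetical record layout; the benchmark's real schema may differ.
    task_id: str
    track: str  # "generation" or "update"
    code_changed_at: datetime
    test_changed_at: datetime


def tasks_after_cutoff(tasks, cutoff):
    """Keep tasks whose code AND test changes both postdate the cutoff.

    Assumption: the earlier of the two timestamps decides eligibility.
    """
    return [
        t for t in tasks
        if min(t.code_changed_at, t.test_changed_at) > cutoff
    ]


def utc(year, month, day):
    # Small helper to build timezone-aware timestamps for the example data.
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up example tasks, for illustration only.
tasks = [
    Task("a", "generation", utc(2025, 1, 10), utc(2025, 1, 12)),
    Task("b", "update", utc(2026, 3, 1), utc(2026, 3, 2)),
    Task("c", "update", utc(2025, 12, 30), utc(2026, 1, 5)),
]

cutoff = utc(2026, 1, 1)  # hypothetical training cutoff
recent = tasks_after_cutoff(tasks, cutoff)
print([t.task_id for t in recent])  # ['b']
```

Task `c` is excluded because its code change predates the cutoff even though its test change does not. A looser policy, such as requiring only the later of the two timestamps to postdate the cutoff, would keep it.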