Distributed Attacks in Persistent-State AI Control
2026-07-02 • Artificial Intelligence
Artificial Intelligence
AI summaryⓘ
The authors study how AI coding agents that work on software projects over multiple sessions can hide harmful code across many updates, making it harder to detect attacks. They created a testing setup where one AI tries to sneak in bad code gradually or all at once, while another AI acts as a guard trying to catch it. They found that no single guard AI can detect both attack methods well, and the problem remains across different AI models. To improve detection, the authors designed a new monitor that tracks suspicious changes over time, which helps catch gradual attacks better. Using multiple monitors together further lowers the chance of hidden attacks slipping through.
AI coding agentspersistent codebaseprompt injectionpull requestsgradual attacksmonitoring AIevading detectionCLI toolsFlask web servicesstateful monitoring
Authors
Josh Hills, Ida Caspary, Asa Cooper Stickland
Abstract
As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR with the best natural cover. To study the resulting dynamics, we introduce Iterative VibeCoding, a setting for AI control, the study of safely deploying capable but potentially untrusted AI. In Iterative VibeCoding, a coding agent builds software over a sequence of PRs in a persistent codebase while pursuing a covert side task. Our benchmark includes two task families: CLI tools and Flask web services, across 20 total task variations. We use Claude Sonnet 4.5 as the attack agent and GPT-4o as the monitor. We compare gradual attacks, which distribute the side task across PRs, against non-gradual attacks concentrated in a single PR. No single monitor is robust to both: which strategy evades best (success while evading the monitor) depends on the monitor type, so a defender cannot close off both gradual and non-gradual attacks with any one monitor. High evasion (>= 65%) generalizes across model attack agent backends (Sonnet 4.5, Gemini 3.1 Pro, Kimi K2.5), confirming this is a property of the persistent-state attack surface rather than a single model's capability. Evasion also remains high across state-of-the-art monitor models and the gap between gradual and non-gradual evasion widens for more capable models. We introduce a stateful link-tracker monitor that tracks suspicious buildup across PRs. On both task families, it detects gradual attacks substantially better than diff monitors that merely see more accumulated history. Combining this stronger monitor with trajectory monitors in a four-monitor ensemble reduces gradual-attack evasion from 93% under the weakest standard diff monitor to 47%.