Stateful Online Monitoring Catches Distributed Agent Attacks

2026-05-29 · Cryptography and Security

Cryptography and Security, Artificial Intelligence
AI summary

The authors found that attackers can hide harmful activity by splitting it across many user accounts, so that each account looks safe on its own. Traditional safety monitors score one agent context at a time, so they miss attacks that are visible only in aggregate. To address this, the authors built a monitoring system that watches many users at once and spots suspicious patterns earlier, adding negligible latency for about 99% of user traffic. In large-scale simulations it caught distributed attacks 30% earlier than standard monitors, though the advantage narrows as benign background traffic grows very large. After red-teaming, they also found that it catches standard jailbreaks, because adaptive attackers reuse attack variants across accounts.

language models, software vulnerabilities, cybersecurity, agent-based attacks, distributed attacks, safety monitoring, clustering, stateful monitoring, red-teaming, jailbreak detection
Authors
Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong, Ivan Zhang, Matan Shtepel, Steffi Chern, Alexander Robey, Eric Wong, Hamed Hassani
Abstract
Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.
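The abstract describes the monitor only at a high level, so the following is a minimal illustrative sketch of the general idea rather than the authors' implementation. It assumes each transcript arrives with a precomputed embedding vector and a weak per-transcript suspicion score. It groups similar transcripts with a simple greedy online clustering rule (a stand-in for whatever real-time clustering the paper uses), and signals escalation only when a cluster's accumulated suspicion crosses a budget and spans several accounts. All names (`StatefulMonitor`, `observe`, `sim_threshold`, `suspicion_budget`, `min_accounts`) and all threshold values are hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Cluster:
    centroid: list
    count: int = 0
    suspicion: float = 0.0                      # sum of weak per-transcript scores
    accounts: set = field(default_factory=set)  # distinct accounts seen in this cluster
    escalated: bool = False

    def add(self, vec, score, account):
        # Running-mean update keeps the centroid at the average of its members.
        self.count += 1
        self.centroid = [c + (v - c) / self.count for c, v in zip(self.centroid, vec)]
        self.suspicion += score
        self.accounts.add(account)


class StatefulMonitor:
    """Hypothetical cross-account monitor: cluster online, escalate rarely."""

    def __init__(self, sim_threshold=0.9, suspicion_budget=1.0, min_accounts=3):
        self.sim_threshold = sim_threshold        # assumed value, not from the paper
        self.suspicion_budget = suspicion_budget  # assumed value, not from the paper
        self.min_accounts = min_accounts          # assumed value, not from the paper
        self.clusters = []

    def observe(self, account, vec, score):
        """Record one transcript; return True if its cluster should be escalated."""
        best, best_sim = None, self.sim_threshold
        for cluster in self.clusters:
            sim = cosine(cluster.centroid, vec)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No sufficiently similar cluster exists, so start a new one.
            best = Cluster(centroid=list(vec))
            self.clusters.append(best)
        best.add(vec, score, account)
        ready = (
            not best.escalated
            and best.suspicion >= self.suspicion_budget
            and len(best.accounts) >= self.min_accounts
        )
        if ready:
            best.escalated = True  # escalate each cluster at most once
        return ready


monitor = StatefulMonitor()
# Three accounts each submit a similar, individually low-scoring transcript.
flags = [monitor.observe(acct, [1.0, 0.0], 0.4) for acct in ("a", "b", "c")]
# An unrelated transcript from a fourth account lands in its own cluster.
unrelated = monitor.observe("d", [0.0, 1.0], 0.4)
```

In this toy run no single transcript exceeds the budget, but the third similar transcript from a distinct account pushes the shared cluster over it, so only that observation would be handed to a more expensive cross-account check (in the paper's setting, a language model). Requiring several distinct accounts is a design choice of this sketch that targets the distributed case specifically; a deployed system would presumably keep a per-transcript monitor alongside it.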