SecureClaw: Clawing Back Control of LLM Agents

2026-06-08

Cryptography and Security, Artificial Intelligence
AI summary

The authors introduce SecureClaw, a system designed to keep tool-using large language models (LLMs) safe from two main security problems: unauthorized actions and leaks of sensitive information during processing. Their approach uses two protections—one controls what information the model can read by replacing secrets with safe summaries, and the other controls what actions the model can finalize by requiring trusted approval before changes happen. This design lets the model plan and operate normally without exposing sensitive data or performing unauthorized actions. In tests, SecureClaw successfully blocked attacks while still letting the models perform useful tasks.

large language model (LLM), tool-using agents, security boundaries, plaintext confinement, authorization, trusted gateway, PREVIEW→COMMIT protocol, declassification, attack success rate (ASR), Agent Security Bench (ASB)
Authors
Yuhan Ma, Stefan Schmid
Abstract
Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW→COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0% attack success rate (ASR) on ASB, 0.64% ASR on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.
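
The two boundaries described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class names `Gateway` and `Executor`, the use of a SHA-256 digest over sorted-key JSON as the "canonical request", the single-use commit token, and the recipient allowlist policy are all assumptions made for illustration only.

```python
import hashlib
import json
import secrets


class Gateway:
    """Hypothetical trusted gateway: keeps raw values, hands out opaque handles."""

    def __init__(self):
        self._vault = {}  # handle -> raw value; never returned to the runtime

    def read(self, raw_value, summary):
        handle = "h_" + secrets.token_hex(8)
        self._vault[handle] = raw_value
        # The runtime receives only a handle and a bounded summary
        # (the 80-character bound is an arbitrary illustrative choice).
        return {"handle": handle, "summary": summary[:80]}

    def resolve(self, value):
        # Intended to be called only by the trusted executor.
        return self._vault.get(value, value)


def canonical_digest(request):
    """Digest of a canonical (sorted-key, compact) JSON encoding of the request."""
    encoded = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class Executor:
    """Hypothetical trusted executor implementing PREVIEW -> COMMIT."""

    def __init__(self, gateway, policy):
        self._gateway = gateway
        self._policy = policy      # callable: request -> bool
        self._approved = set()     # digests of requests authorized at PREVIEW
        self.effects = []          # stands in for external side effects

    def preview(self, request):
        """Return a commit token if policy authorizes the request, else None."""
        if not self._policy(request):
            return None
        digest = canonical_digest(request)
        self._approved.add(digest)
        return digest

    def commit(self, request, token):
        """Perform the effect only for the exact request approved at PREVIEW."""
        digest = canonical_digest(request)
        if token != digest or digest not in self._approved:
            return False
        self._approved.discard(digest)  # assumption: tokens are single use
        # Handles are dereferenced here, inside the trusted boundary.
        resolved = {k: self._gateway.resolve(v) for k, v in request.items()}
        self.effects.append(resolved)
        return True


def allowlist_policy(request):
    # Hypothetical policy: only email to one known recipient is allowed.
    return request.get("tool") == "send_email" and request.get("to") == "alice@example.com"


if __name__ == "__main__":
    gateway = Gateway()
    executor = Executor(gateway, allowlist_policy)
    ref = gateway.read("acct 1234-5678", "a bank account number")
    request = {"tool": "send_email", "to": "alice@example.com", "body": ref["handle"]}
    token = executor.preview(request)
    print("committed:", executor.commit(request, token))
```

In this sketch the runtime plans over `ref["summary"]` and passes `ref["handle"]` as a symbolic reference; it never holds the raw value, and a request altered after PREVIEW no longer matches the approved digest, so COMMIT refuses it.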