Forensic Trajectory Signatures for Agent Memory Poisoning Detection

2026-06-29 • Cryptography and Security

Cryptography and Security • Machine Learning

AI summary

The authors identify a consistent behavioral pattern in large language model (LLM) agents under 'memory poisoning' attacks: for the attack to succeed, the agent must retrieve stored information from memory before sending an email. This pattern is not a coincidence; the attack depends on it. They build detection methods on this behavior that achieve very high accuracy, and show that the attack leaves several signs in the agent's actions, not just one. The method works across different models and can even distinguish memory-poisoning attacks from other prompt-injection attacks using only tool-usage logs.

large language models, memory poisoning, behavioral invariant, memory recall, tool invocations, prompt injection, attack detection, random forest classifier, trajectory features, forensic analysis

Authors

Jun Wen Leong

Abstract

We discover a behavioral invariant in LLM agents under persistent memory poisoning: in architectures where routing information is retrieved through observable memory-tool invocations, successful attacks require calling memory_recall_fact before email_send_email, a transition that non-exfiltrating sessions rarely exhibit. Under the evaluated architecture, this invariant follows from the attack's information-retrieval dependency rather than being merely an empirical correlation, and suppressing it breaks the attack. A simple rule exploiting this invariant alone achieves AUC = 0.9563. A Random Forest classifier over 19 trajectory features refines it to AUC = 0.9904 (BCa 95% CI [0.987, 0.993], N=10,000 resamples), demonstrating that the attack imprints on multiple independent behavioral channels. The signature is overdetermined: removing all recall-related features (half the feature set) leaves AUC unchanged at 0.990, confirming that memory poisoning induces a distributed trajectory signature rather than a single observable anomaly. Cross-model hold-out on 9 models (7B-120B parameters) confirms AUC = 1.000 on 6/9 hold-out splits, with all three exceptions mechanistically explained. The invariant generalizes to frontier models (GPT-4.1, GPT-4o) without retraining. A strictly prefix-only variant achieves AUC = 0.934, suggesting that real-time blocking is feasible with moderate degradation. The boundary is forensically useful: prompt-injection attacks that bypass memory produce a distinct trajectory (score = 0.541), enabling incident responders to distinguish memory-channel attacks from prompt-injection attacks using tool-call logs alone.
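The simple rule described in the abstract can be illustrated with a minimal sketch. It assumes a session's trajectory is available as an ordered list of tool names, and that "before" means the recall call occurs anywhere earlier in the session than the send call (the abstract does not say whether the two calls must be adjacent). The function name `recall_before_send` and the tool `calendar_list_events` used in the example are hypothetical; only `memory_recall_fact` and `email_send_email` come from the abstract. This is not the authors' implementation, and it does not reproduce the reported AUC.

```python
from typing import Iterable

# Tool names taken from the abstract.
RECALL_TOOL = "memory_recall_fact"
SEND_TOOL = "email_send_email"


def recall_before_send(tool_calls: Iterable[str]) -> bool:
    """Return True if a memory recall occurs earlier than some email send.

    Assumption: any earlier recall counts, not only an immediately
    preceding one. The check only looks at calls seen so far, so it can
    also be evaluated at the moment a send is attempted.
    """
    seen_recall = False
    for name in tool_calls:
        if name == RECALL_TOOL:
            seen_recall = True
        elif name == SEND_TOOL and seen_recall:
            return True
    return False


# Hypothetical trajectories for illustration only.
flagged = recall_before_send(
    ["memory_recall_fact", "calendar_list_events", "email_send_email"]
)
not_flagged = recall_before_send(["email_send_email", "memory_recall_fact"])
```

Because the check consumes the trajectory in order, the same logic could gate an `email_send_email` call at the moment it is issued, which is in the spirit of the prefix-only variant the abstract mentions.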
