When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents
2026-06-22 • Machine Learning
AI summary
The authors study hidden-state probing, which trains a linear classifier on a vision-language model's internal activations to flag indirect prompt injection before the agent acts. They find that a high AUC on a clean-vs-attack split does not, by itself, show that the probe detects malicious content. They propose two post-hoc diagnostics for interpreting such results and caution against reading a high AUC as clear malicious-content detection. The resulting reporting guidelines rest on a single model and benchmark (Qwen2.5-VL-7B on Mind2Web).
hidden-state probing, vision-language model, indirect prompt injection, linear classifier, AUC, multimodal agents, Mind2Web, Qwen2.5-VL-7B, post-hoc diagnostics, malicious content detection
Authors
Yanhang Li, Zhichao Fan, Zexin Zhuang
Abstract
Hidden-state probing -- a linear classifier on a frozen vision-language model's internal activations -- has emerged as an attractive evaluation tool for flagging indirect prompt injection (IPI) in multimodal computer-use agents before the agent emits a corrupted action. We argue, from a single-backbone cautionary case study (Qwen2.5-VL-7B on Mind2Web, teacher-forced replay), that a high probing AUC on a clean-vs-attack split is not, on its own, evidence of malicious-content detection. Two post-hoc diagnostics -- a paired-construction scalar baseline on text-side injections, and same-step nuisance-matched visual controls on the overlay surface -- do not license an unqualified malicious-content interpretation of the headline AUC, while leaving room for partly-semantic readings. We package the diagnostics as a candidate control set with reporting heuristics for what a high clean-vs-attack AUC does and does not license. Labels are injection-surface-present, not attack success; generalisation beyond this backbone and benchmark is a conjecture.
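The gap between a high AUC and a content-level interpretation can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' pipeline: the "hidden states" are synthetic Gaussian vectors rather than Qwen2.5-VL-7B activations, the nuisance scalar is a hypothetical input length that differs between clean and attacked examples by construction, and the probe is a ridge-regularised least-squares linear classifier standing in for whatever linear probe a real study would use.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); ties count as half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def fit_linear_probe(X, y, ridge=1e-2):
    """Ridge least-squares linear probe with a bias term (targets in {-1, +1})."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (2.0 * y - 1.0))

def probe_scores(w, X):
    """Apply a fitted probe to a batch of activation vectors."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w

rng = np.random.default_rng(0)
n, d = 400, 16
y = np.repeat([0, 1], n // 2)  # 0 = clean, 1 = injection surface present
rng.shuffle(y)

# Hypothetical nuisance scalar: attacked inputs are longer by construction.
length = rng.normal(100 + 40 * y, 10)

# Synthetic "hidden states": noise plus one direction that encodes length only.
# No feature here represents malicious content.
direction = rng.normal(size=d)
z = (length - length.mean()) / length.std()
X = rng.normal(size=(n, d)) + np.outer(z, direction)

# Fit on the first half, evaluate on the second half.
train = np.arange(n) < n // 2
test = ~train
w = fit_linear_probe(X[train], y[train])

probe_auc = auc(probe_scores(w, X[test]), y[test])
scalar_auc = auc(length[test], y[test])  # scalar baseline: length alone
print(f"probe AUC = {probe_auc:.3f}, scalar-baseline AUC = {scalar_auc:.3f}")
```

In this toy setup both the probe and the length-only baseline separate the classes well, even though the activations contain nothing but a length signal. Reporting a scalar baseline next to the probe AUC is one way to make that kind of confound visible before the headline number is interpreted.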