Internal-State Probes Read the Situation, Not the Action: Three Negative Results for Pre-Action Misalignment Monitoring

2026-06-29 Machine Learning

AI summary

The authors study how internal signals in AI models might predict harmful actions before they actually happen. They test three different techniques on several large models but find that these signals are not reliable early warnings, often failing to generalize across different situations or being too influenced by unrelated factors. The authors propose a method to rigorously check such internal predictions by making sure they work both before harmful actions and across diverse scenarios. Their results are mostly negative, showing that current internal probes cannot yet serve as strong pre-action monitors for harmful behavior in AI.

AI safety, model interpretability, internal probes, pre-action monitoring, large language models, generalization, semantic legibility, fine-tuning, behavior prediction, concept specificity
Authors
Max Fomin, Elad David, Amit LeVi
Abstract
Probes on model internals could help monitor agentic systems if they identify harmful text or tool actions before those actions are generated. We ask when an internal readout supports this stronger pre-action claim, rather than merely describing the prompt, construction contrast, or current trajectory. We test three methods across three model families: a Qwen2.5-Coder-32B-Instruct fine-tune/base direction, Llama-3.1-8B-Instruct probes at the last token of unsafe prefills, and Gemma-3-27B-IT emotion-concept vectors used for projection and steering in a blackmail tool-action scenario. Across these cases, construction validity, semantic legibility, and steering effects do not become robust pre-action monitors: each is undercut by a generalization or specificity check. The Qwen direction separates fine-tune from base at AUC 1.000, yet crosses its threshold on 0/143 audited pre-assistant turn contexts and on 0/342 Qwen prefill rows where the model continues the unsafe trajectory. The Llama features decode prompt domain almost perfectly (AUC 0.999), while the best future-behavior probe reaches AUC 0.801 and only +5.1 pp accuracy lift over majority; single-source cross-domain transfer is non-positive on five of six ordered pairs. Gemma emotion projections are semantically meaningful, but a shared-prefix minimal pair has indistinguishable states before the first differing input, and steering specificity weakens against unrelated learned directions such as cats, weather, sports, and geography. We contribute a methodology for converting internal-readout claims into pre-action tests, and report scoped negative results: monitor claims must survive both scenario/action generalization and concept-specificity controls. Code is released at https://github.com/maxf-zn/misalignment_monitoring
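
The quantities the abstract reports (an AUC on a construction contrast, a threshold-crossing rate on pre-action contexts, and an accuracy lift over the majority class) can be illustrated with a minimal sketch. This is not the authors' code, which lives in the linked repository. It assumes a difference-of-means direction, a midpoint threshold, and "lift" defined as accuracy minus majority-class frequency, and it runs on synthetic Gaussian activations. All function names, dimensions, and data here are hypothetical.

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

def diff_of_means_direction(acts_a, acts_b):
    """Unit vector from the mean of acts_b to the mean of acts_a (assumed construction)."""
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def crossing_rate(acts, direction, threshold):
    """Fraction of rows whose projection onto the direction exceeds the threshold."""
    return float((acts @ direction > threshold).mean())

def lift_over_majority(y_true, y_pred):
    """Accuracy minus the accuracy of always predicting the majority class (binary labels)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = (y_true == y_pred).mean()
    majority = max(y_true.mean(), 1.0 - y_true.mean())
    return float(accuracy - majority)

# Synthetic stand-ins for activations; NOT data from the paper.
rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 10.0  # hypothetical fine-tune offset along one axis

base_acts = rng.normal(size=(200, dim))
finetune_acts = rng.normal(size=(200, dim)) + shift
# Pre-action contexts that, in this toy setup, do not carry the offset.
pre_action_acts = rng.normal(size=(200, dim))

direction = diff_of_means_direction(finetune_acts, base_acts)
threshold = 0.5 * ((finetune_acts @ direction).mean() + (base_acts @ direction).mean())

construction_auc = auc(finetune_acts @ direction, base_acts @ direction)
pre_action_rate = crossing_rate(pre_action_acts, direction, threshold)

print(f"construction AUC: {construction_auc:.3f}")
print(f"pre-action crossing rate: {pre_action_rate:.3f}")
```

The toy setup mirrors the shape of the Qwen result: a direction can separate the two populations it was built from while almost never crossing its threshold on the contexts a monitor would actually need to flag. The gap between those two numbers is what the pre-action tests are designed to expose.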