Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models

2026-06-22 · Machine Learning

Machine Learning, Artificial Intelligence, Cryptography and Security
AI summary

The authors study how certain AI systems imagine future events before acting, which makes their imagined steps vulnerable to attacks. They found that it’s easier for attackers to slightly change these imagined futures in harmful ways than to precisely control them. To detect such attacks, the authors propose a method that spots unrealistic imagined futures without needing extra parameters. Their tests show that while random damage is easy to detect, targeted attacks are harder but still limited, and some systems relying heavily on imagination can fail significantly under attack.

vision-language-action (VLA) policies, world-action model (WAM), latent trajectory, imagine-then-act, adversarial attacks, L-infinity perturbation, projected gradient descent, model-predictive-control (MPC), denoiser detector, Fisher p-value
Authors
Linghan Chen, Kaiyan Ji, Minyu Guo
Abstract
Many recent vision-language-action (VLA) policies adopt an imagine-then-act design. A world-action model (WAM) first imagines a short future as a latent trajectory z~, on which the action is then conditioned. We identify this trusted imagination, rather than the reactive policy, as the exposed attack surface. A downstream oracle, such as a safety gate, a visual model-predictive-control (MPC) planner, or an imagine-then-check verifier, consumes z~ as a prediction of the future. The robustness of the policy therefore does not entail the robustness of systems that rely on the WAM. The underlying phenomenon is an asymmetry. Corrupting the imagination is easy, since it requires only displacing z~ from its natural-future manifold. Steering it precisely is hard, since it must reach a specified on-manifold target. We adopt a capability-based threat model with an L-infinity-bounded observation perturbation. The attacker applies projected gradient descent through the fully differentiable observation-to-imagination map. The same off-manifold property motivates a parameter-free denoiser detector. We evaluate three targets: RynnVLA-002, LingBot-VA, and LaDi-WM. Untargeted corruption is roughly 60x stronger than random and is detected at AUC 1.0. Targeted control remains bounded. An adaptive attacker evades detection only by forgoing corruption. The reactive policy remains robust to corrupted imagination. A native imagination-driven MPC, however, exhibits the first adversary-specific task failure (at epsilon=0.01, success 0.70 versus 0.05; Fisher p < 10^-4).
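To make the threat model concrete, the following is a minimal sketch of untargeted projected gradient descent under an L-infinity budget. It is not the authors' implementation. The one-layer `tanh` map `imagine`, the weight matrix `W`, the squared-displacement objective, and all step sizes are assumptions chosen so that the gradient can be written by hand; a real WAM would be differentiated with an autodiff framework, and image observations would also be clipped to their valid pixel range.

```python
import numpy as np

def imagine(W, obs):
    """Toy stand-in (hypothetical) for the observation-to-imagination map."""
    return np.tanh(W @ obs)

def displacement_and_grad(W, obs, delta, z_clean):
    """Squared displacement of the imagined latent and its gradient w.r.t. delta."""
    z_adv = imagine(W, obs + delta)
    diff = z_adv - z_clean
    loss = float(diff @ diff)
    # Chain rule through tanh: d tanh(u)/du = 1 - tanh(u)^2
    grad = W.T @ (2.0 * diff * (1.0 - z_adv ** 2))
    return loss, grad

def pgd_untargeted(W, obs, eps, steps=50, step_size=None, seed=0):
    """Maximise latent displacement subject to ||delta||_inf <= eps.

    Returns (best_delta, random_delta), where random_delta is the random
    starting point and serves as a same-budget random baseline.
    """
    rng = np.random.default_rng(seed)
    step_size = eps / 4.0 if step_size is None else step_size
    z_clean = imagine(W, obs)

    # Random start: the gradient of the displacement is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=obs.shape)
    random_delta = delta.copy()
    best_delta = delta.copy()
    best_loss, _ = displacement_and_grad(W, obs, delta, z_clean)

    for _ in range(steps):
        _, grad = displacement_and_grad(W, obs, delta, z_clean)
        # Signed ascent step, then projection back onto the L-infinity ball.
        delta = np.clip(delta + step_size * np.sign(grad), -eps, eps)
        loss, _ = displacement_and_grad(W, obs, delta, z_clean)
        if loss > best_loss:
            best_loss, best_delta = loss, delta.copy()
    return best_delta, random_delta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(8, 16)) / 4.0   # hypothetical map weights
    obs = rng.normal(size=16)            # hypothetical observation
    adv, rand = pgd_untargeted(W, obs, eps=0.01)
    z = imagine(W, obs)
    print("PGD displacement:   ", np.linalg.norm(imagine(W, obs + adv) - z))
    print("random displacement:", np.linalg.norm(imagine(W, obs + rand) - z))
```

Because the sketch keeps the best perturbation seen so far, the returned displacement is never smaller than that of the random starting point. The size of the gap in this toy setting says nothing about the 60x figure reported for the real targets.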