Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
2026-03-13 • Computer Vision and Pattern Recognition
AI summary
The authors study whether video models that generate "worlds" from 2D frame observations can correctly predict how a scene changes while the camera is not observing it. They introduce a benchmark, STEVO-Bench, that inserts occluders, turns off the light, or points the camera away to test whether a process keeps evolving while unobserved. Their evaluation shows that these models often fail to decouple state evolution from observation. The authors also propose an evaluation protocol that automatically detects and disentangles failure modes, pointing to potential data and architecture biases in current video models.
video world models, state evolution, observation control, benchmark, STEVO-Bench, camera occlusion, temporal dynamics, model evaluation, natural processes, data bias
Authors
Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari
Abstract
Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture bias of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/
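As a rough illustration of the observation-control idea in the abstract, the sketch below pairs one evolving process with each of the three control types named there (occluder insertion, turning off the light, camera lookaway). It is not the benchmark's code: the names `ObservationControl`, `ControlledCase`, `TEMPLATES`, and `build_cases`, and the wording of every instruction template, are hypothetical. The abstract specifies lookaway as a camera trajectory; it is written as text here only to keep the sketch simple.

```python
from dataclasses import dataclass
from enum import Enum


class ObservationControl(Enum):
    """The three observation-control types named in the abstract."""
    OCCLUDER = "occluder"
    LIGHTS_OFF = "lights_off"
    LOOKAWAY = "lookaway"


# Hypothetical instruction templates: the benchmark's actual wording
# is not given in the abstract.
TEMPLATES = {
    ObservationControl.OCCLUDER: "{process}. An opaque object moves in front of the scene, then moves away.",
    ObservationControl.LIGHTS_OFF: "{process}. The light turns off, then turns back on.",
    ObservationControl.LOOKAWAY: "{process}. The camera pans away from the scene, then pans back.",
}


@dataclass(frozen=True)
class ControlledCase:
    process: str                 # description of the evolving process
    control: ObservationControl  # how observation is interrupted
    instruction: str             # full instruction given to the video model


def build_cases(process: str) -> list[ControlledCase]:
    """Pair one evolving process with every observation-control type."""
    return [
        ControlledCase(process, control, TEMPLATES[control].format(process=process))
        for control in ObservationControl
    ]


cases = build_cases("Water pours into a glass")
```

Under this framing, a model that decouples state evolution from observation should show the process further along (for example, a fuller glass) once the scene becomes visible again, whichever control was applied.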