YoCausal: How Far is Video Generation from World Model? A Causality Perspective

2026-05-28 · Computer Vision and Pattern Recognition

AI summary

The authors studied whether video diffusion models really understand cause-and-effect or just learn patterns of change over time. They created a new test called YoCausal that uses real videos played backward to check if models notice when time is reversed. Their test has two parts: one measures if models sense the arrow of time, and the other checks if models truly grasp causality beyond just timing. Testing 13 top models showed that recognizing time's direction doesn't mean the models understand causality, and humans still do much better at this. This work helps highlight the difference between noticing time patterns and understanding real causes in videos.

video diffusion models, causality, arrow of time, Violation of Expectation, counterfactual samples, denoising loss, Reverse Surprise Index, Causality Cognition Index, video-language models
Authors
You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang
Abstract
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
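The reversal-based protocol described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas for RSI or CCI, so everything below is an assumption made for illustration only: `toy_loss` is a hypothetical stand-in for a VDM's denoising loss, `relative_surprise` is one plausible way to compare the loss on a reversed clip with the loss on the original, and `subset_gap` is one plausible way to contrast causal and non-causal subsets. None of these names or definitions come from the paper.

```python
from statistics import mean
from typing import Callable, Sequence

# Toy representation (assumption): a video is one scalar per frame, in temporal order.
Video = Sequence[float]
LossFn = Callable[[Sequence[float]], float]


def reverse_video(video: Video) -> list:
    """Build the counterfactual sample by playing the clip backward."""
    return list(reversed(video))


def toy_loss(video: Video) -> float:
    """Hypothetical stand-in for a denoising loss.

    It is always >= 1.0 and grows when frame values decrease over time,
    mimicking a model that finds "backward" motion harder to explain.
    """
    return 1.0 + sum(max(0.0, a - b) for a, b in zip(video, video[1:]))


def relative_surprise(video: Video, loss_fn: LossFn) -> float:
    """Assumed score: normalized loss increase under temporal reversal.

    Positive values mean the reversed clip is more surprising than the original.
    Requires a strictly positive loss so the denominator is non-zero.
    """
    forward = loss_fn(video)
    backward = loss_fn(reverse_video(video))
    return (backward - forward) / (backward + forward)


def subset_gap(causal: Sequence[Video], non_causal: Sequence[Video],
               loss_fn: LossFn) -> float:
    """Assumed contrast: mean surprise on causal clips minus non-causal clips."""
    causal_score = mean([relative_surprise(v, loss_fn) for v in causal])
    non_causal_score = mean([relative_surprise(v, loss_fn) for v in non_causal])
    return causal_score - non_causal_score
```

In an actual evaluation, `toy_loss` would be replaced by the denoising loss of the model under test, and the causal and non-causal subsets would come from the VLM-based stratification mentioned in the abstract.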