Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

2026-06-01 | Artificial Intelligence

AI summary

The authors argue that current benchmarks for models combining vision and language mainly reward predicting the next word rather than understanding real physical actions. They introduce Causal-Plan-Bench, a benchmark that measures whether models can reason about physical cause and effect, and Causal-Plan-1M, a million-scale dataset of step-by-step reasoning traces built from egocentric videos. Their experiments show that leading models still struggle with physical planning (Gemini 3 Pro scores only 38.18), while their Causal Planner, built on Qwen3-VL-8B, performs better by learning physical logic. They also find that scaling the causal training data to one million instances raises the score from 33.22 to 45.28, a 36.3% relative gain.

embodied vision-language planning, next-token prediction, causal reasoning, physical autonomy, benchmark, egocentric videos, reasoning traces, training data scaling, next-state estimation, Causal Scaling Law
Authors
Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li
Abstract
Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.
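
The relative gain quoted for the Causal Scaling Law can be checked from the two reported scores. The snippet below is a minimal sketch of that arithmetic only; the helper name `relative_gain` is hypothetical and not taken from the paper, and the scores 33.22 and 45.28 are the values stated in the abstract.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Scores reported in the abstract: before and after scaling the causal
# training data to one million instances.
gain = relative_gain(33.22, 45.28)
print(f"{gain:.1f}%")  # prints "36.3%"
```

The absolute improvement is 12.06 points; dividing by the 33.22 baseline gives the 36.3% relative figure, so the two numbers in the abstract are consistent with each other.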