Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model

2026-06-25 · Computer Vision and Pattern Recognition

AI summary

The authors address a problem in action-conditioned world models for dexterous control, where large motions and small but important movements occur together. Current methods compress the entire action sequence into a single representation, which loses fine-grained detail. They propose DexAC-WM, which splits actions into per-dimension tokens and aligns them more closely with the dynamics seen in video. They also add a semantic branch that supplies object and scene context, helping the model predict video more accurately under high-DoF actions. Their experiments show improved video quality and action accuracy, and the approach transfers to other backbone models, highlighting the importance of structured action modeling and semantic grounding for complex control tasks.

action-conditioned world model, high-DoF control, action tokenization, semantic grounding, video prediction, visual-temporal realism, FID (Fréchet Inception Distance), FVD (Fréchet Video Distance), PCK (Percentage of Correct Keypoints)
Authors
Zizhao Yuan, Zhengtu Liang, Taowen Wang, Qiwei Liang, Yichi Wang, Yunheng Wang, Yuetong Fang, Lusong Li, Zecui Zeng, Renjing Xu
Abstract
Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While this progress is often driven by stronger visual representations and larger model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous: their components span multiple orders of magnitude, and large-scale motions coexist with subtle but important signals. When these components are uniformly aggregated, optimization becomes imbalanced across them, which hinders the modeling of fine-grained effects and degrades action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, enabling the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.
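
The contrast between global compression and dimension-level tokenization can be illustrated with a small NumPy sketch. This is not the DexAC implementation, which the abstract does not specify. The array sizes, the split into 6 large-scale and 42 small-scale dimensions, the random projection weights, and the function names `global_compress` and `tokenize_per_dimension` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: timesteps, action dimensions, hidden width.
T, D, H = 8, 48, 16

# Toy high-DoF action chunk (assumption): 6 large-scale dims (e.g. wrist pose)
# and 42 small-scale dims (e.g. finger joints) two orders of magnitude smaller.
scales = np.concatenate([np.full(6, 1.0), np.full(D - 6, 1e-2)])
actions = rng.normal(size=(T, D)) * scales  # shape (T, D)

# Share of total squared magnitude carried by the 6 large-scale dims.
# A value close to 1 means the small dims barely register once everything
# is aggregated uniformly.
energy = (actions ** 2).sum(axis=0)  # per-dimension energy, shape (D,)
large_share = energy[:6].sum() / energy.sum()

def global_compress(actions, W):
    """Baseline: flatten the whole chunk and project it to ONE vector."""
    return actions.reshape(-1) @ W  # shape (H,)

def tokenize_per_dimension(actions, value_proj, dim_embed):
    """Illustrative alternative: one token per (timestep, dimension).

    Each scalar is lifted to H channels and tagged with an embedding that
    identifies which action dimension it came from, so dimension identity
    survives into the conditioning signal.
    """
    return actions[..., None] * value_proj + dim_embed  # shape (T, D, H)

# Random stand-ins for parameters that would be learned in practice.
W = rng.normal(size=(T * D, H))
value_proj = rng.normal(size=(H,))
dim_embed = rng.normal(size=(D, H))

single_vector = global_compress(actions, W)
tokens = tokenize_per_dimension(actions, value_proj, dim_embed)

print("large-dim energy share:", large_share)
print("global conditioning shape:", single_vector.shape)
print("tokenized conditioning shape:", tokens.shape)
```

In the baseline, all `T * D` values collapse into one vector, so a downstream model cannot tell which dimension a given signal came from. The tokenized form keeps a separate, identifiable entry per dimension, which is the property the abstract refers to as preserving dimension-level semantics.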