Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation

2026-06-22

Robotics
AI summary

The authors focus on improving how computers learn to create reward rules for robots without humans manually setting them. They point out that current methods rely too much on guessing from feedback and can be fooled by misleading clues. To fix this, they introduce a model that understands cause-and-effect relationships between parts of tasks and physical robot actions by learning from lots of examples in advance. This helps the system make better decisions about rewards without repeated trial and error and works well even on new tasks and robots. Their approach makes designing robot skills faster and more reliable.

Automated Reward Design, Reinforcement Learning, Large Language Models, Causal Modeling, Reward Function, Offline Pre-training, Robotic Control, Zero-shot Learning, Multi-task Learning, Causal Inference
Authors
Yang Yang, Yuchuang Tong, Zhengtao Zhang, Xu Ding, Ning Yang, Yifan Zhang, Haipeng Li, Kehu Yang, Miao Xin
Abstract
Automated Reward Design (ARD) aims to replace manual reward engineering in reinforcement learning with language-driven reward function synthesis. However, existing approaches based on large language models (LLMs) remain inherently correlation-driven, relying on iterative environmental feedback to refine reward hypotheses for each specific task. This paradigm not only results in inefficient reasoning but also makes LLMs susceptible to semantically plausible yet causally spurious reward components, leading to ineffective optimization. To address these limitations, we propose the Causal Reward World Model (CRWM), which explicitly models the causal topological relationships between candidate reward components and task-targeted physical variables through offline pre-training on multi-task interaction data. Based on a coarse-to-fine pre-training strategy, we introduce a joint optimization module that integrates Explicit Mechanism Decoupling with Confidence-Aware Soft Fusion to refine coarse structural priors using micro-level trajectories, thereby constructing a robust and interpretable causal skeleton. During inference, LLMs leverage CRWM as a task-agnostic causal prior to constrain reward generation, enabling zero-shot reward function design. Our work opens up a new white-box paradigm for the ARD problem. Extensive experiments on complex continuous control benchmarks demonstrate that CRWM generates executable reward functions without feedback-driven reward refinement, significantly reducing the design latency for acquiring new robotic skills while matching or surpassing state-of-the-art performance, and further exhibits strong generalization capabilities across unseen tasks and diverse robotic embodiments.
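
To make the idea of a causal prior constraining reward generation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the causal skeleton can be represented as a set of directed edges with confidence scores, and it keeps only those candidate reward components that have a sufficiently confident causal path to the task's target variable. The variable names, edge confidences, threshold, and the helper functions `causal_ancestors` and `filter_reward_components` are all hypothetical and chosen purely for illustration.

```python
from collections import deque

# Hypothetical causal skeleton: (cause, effect) -> confidence in [0, 1].
# These variables and numbers are invented for illustration only.
CAUSAL_EDGES = {
    ("gripper_to_handle_distance", "handle_grasped"): 0.9,
    ("handle_grasped", "drawer_opening"): 0.8,
    ("arm_joint_velocity", "energy_used"): 0.7,
    ("camera_brightness", "drawer_opening"): 0.1,  # plausible-looking but weak
}


def causal_ancestors(edges, target, min_confidence=0.5):
    """Return every variable with a directed path to `target`, using only
    edges whose confidence is at least `min_confidence`."""
    # Build a reverse adjacency map: effect -> list of confident causes.
    parents = {}
    for (cause, effect), confidence in edges.items():
        if confidence >= min_confidence:
            parents.setdefault(effect, []).append(cause)

    # Breadth-first search backwards from the target variable.
    ancestors = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in ancestors:
                ancestors.add(parent)
                queue.append(parent)
    return ancestors


def filter_reward_components(candidates, edges, target, min_confidence=0.5):
    """Keep candidates that are the target itself or one of its causal ancestors."""
    allowed = causal_ancestors(edges, target, min_confidence) | {target}
    return [c for c in candidates if c in allowed]


candidates = [
    "gripper_to_handle_distance",
    "camera_brightness",
    "arm_joint_velocity",
    "drawer_opening",
]
kept = filter_reward_components(candidates, CAUSAL_EDGES, "drawer_opening")
print(kept)  # ['gripper_to_handle_distance', 'drawer_opening']
```

In this toy setup, `camera_brightness` is dropped because its only edge to the target falls below the confidence threshold, and `arm_joint_velocity` is dropped because it has no path to the target at all. A real system would presumably pass the surviving components to the LLM as constraints on the reward function it writes; that step is outside the scope of this sketch.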