Policy and World Modeling Co-Training for Language Agents

2026-06-01 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors found a way to teach language-based AI agents not just which actions earn rewards, but also what those actions do to the environment. They observed that the rollouts collected during standard reinforcement learning already record how each action changes the environment, so they built a method called PaW that uses this information as an extra training signal for the same policy, without changing how the agent runs at inference time. PaW selects world-modeling data based on action entropy, uses a noise-tolerant loss, and balances the two learning signals according to reward. On three agentic task benchmarks, it consistently improved on strong RL baselines across models and RL algorithms.

reinforcement learning, large language models, world modeling, on-policy rollouts, policy training, auxiliary supervision, loss functions, agentic tasks, reward signals, noise tolerance

Authors

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang

Abstract

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
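
The central observation, that each rollout transition already pairs an action with its resulting next observation, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the selection rule, the loss, or the balancing formula, so the `Transition` fields, the entropy threshold and its direction (keeping high-entropy actions), and the `wm_weight` schedule below are all hypothetical choices made for illustration. The noise-tolerant WM loss is not sketched, since the abstract does not describe its form.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    """One step of an on-policy rollout (field names are hypothetical)."""
    observation: str
    action: str
    next_observation: str
    action_entropy: float  # assumed: mean token entropy of the sampled action


def build_wm_examples(
    rollout: List[Transition], min_entropy: float
) -> List[Tuple[str, str]]:
    """Turn transitions into (prompt, target) pairs for next-observation prediction.

    Assumption: transitions whose action entropy is at least `min_entropy`
    are kept. The abstract only says selection is action-entropy-based.
    """
    examples = []
    for t in rollout:
        if t.action_entropy >= min_entropy:
            prompt = (
                f"Observation: {t.observation}\n"
                f"Action: {t.action}\n"
                "Next observation:"
            )
            examples.append((prompt, t.next_observation))
    return examples


def wm_weight(mean_reward: float, base_weight: float = 0.1) -> float:
    """Hypothetical reward-adaptive weight: shrink the WM term as reward rises."""
    clipped = min(max(mean_reward, 0.0), 1.0)
    return base_weight * (1.0 - clipped)


def total_loss(policy_loss: float, wm_loss: float, mean_reward: float) -> float:
    """Combine the RL objective with the auxiliary WM loss on the same policy."""
    return policy_loss + wm_weight(mean_reward) * wm_loss
```

Because the WM examples are built from the same rollouts the RL update already uses, a scheme of this shape needs no separate simulator, and the auxiliary term affects training only, leaving inference unchanged.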
