Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

2026-05-25 • Artificial Intelligence

AI summary

The authors introduce TC-WM, a method for building world models that let AI agents predict their environment using compact, task-relevant internal representations. Instead of planning directly on raw pixels or on high-dimensional visual features, the approach takes embeddings from a pretrained visual foundation model and linearly projects them into a compact latent space that is easier to use for planning and control. Part of this latent space is aligned with the agent's physical state through contrastive learning, and the embeddings are reconstructed from the latent to retain useful visual structure. This matters most in reward-free offline settings, where the model must learn from fixed trajectories without reward signals or online interaction. Experiments in simulated environments such as Robomimic and D4RL report better world-modeling quality and more precise control than state-of-the-art approaches.

world models, latent representation, visual embeddings, contrastive learning, offline learning, task-centric, control, planning, pretrained models, Robomimic

Authors

Minghao Fu, Fan Feng, Nicklas Hansen, Biwei Huang

Abstract

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.
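
The three ingredients named in the abstract (linear projection, contrastive alignment of a latent subspace with the physical state, and embedding reconstruction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the InfoNCE form of the contrastive loss, the linear decoder, the linear state head, and all names (`W_proj`, `W_dec`, `W_state`, `k`, `temperature`) and dimensions are assumptions chosen for illustration, and the latent dynamics model is omitted.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """InfoNCE loss: row i of `a` and row i of `b` are a positive pair,
    all other rows in the batch act as negatives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def projection_losses(emb, state, W_proj, W_dec, W_state, k):
    """Hypothetical loss terms for a compact latent built on frozen embeddings."""
    z = emb @ W_proj                    # linear projection to compact latent (B, d_z)
    z_phys = z[:, :k]                   # subspace assumed to carry physical state
    align = info_nce(z_phys, state @ W_state)   # contrastive alignment term
    recon = np.mean((z @ W_dec - emb) ** 2)     # reconstruct the original embeddings
    return z, align, recon

# Toy data standing in for foundation-model embeddings and proprioceptive state.
rng = np.random.default_rng(0)
B, d_e, d_z, d_s, k = 8, 64, 16, 7, 4   # illustrative sizes only
emb = rng.normal(size=(B, d_e))
state = rng.normal(size=(B, d_s))
W_proj = rng.normal(size=(d_e, d_z)) / np.sqrt(d_e)
W_dec = rng.normal(size=(d_z, d_e)) / np.sqrt(d_z)
W_state = rng.normal(size=(d_s, k)) / np.sqrt(d_s)

z, align, recon = projection_losses(emb, state, W_proj, W_dec, W_state, k)
print(z.shape, float(align), float(recon))
```

In a training loop these two terms would typically be weighted and summed with a latent dynamics loss and minimized over the projection and decoder weights; the weighting and optimizer are left out because the abstract does not specify them.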
