UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
2026-04-21 • Robotics
Robotics · Artificial Intelligence
AI summary
The authors address the problem that humanoid robots don't have enough direct data to learn from, so they try to use lots of human action videos instead. They created UniT, a method that links human and robot movements by focusing on the shared visual outcomes of actions rather than exact body movements. This approach lets robots learn from human data more efficiently and even perform new tasks without extra training. Their experiments show UniT helps transfer skills from humans to robots both in simulations and the real world.
humanoid robots · egocentric human data · cross-embodiment transfer · kinematic mismatch · latent space · policy learning · world modeling · zero-shot transfer · action representation · visual anchoring
Authors
Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, Yixiao Ge
Abstract
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
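The tri-branch idea in the abstract can be illustrated with a toy sketch: one branch maps actions to predicted visual outcomes, a second maps vision back to actions, and a fusion branch vector-quantizes the combined features into a discrete token. This is a minimal numpy illustration of the structure only; all dimensions, the random linear maps standing in for learned networks, and the codebook are hypothetical assumptions, not the paper's actual architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper)
ACT_DIM, VIS_DIM, LATENT_DIM, CODEBOOK_SIZE = 8, 16, 4, 32

# Random linear maps standing in for learned encoder networks
W_act2vis = rng.normal(size=(ACT_DIM, VIS_DIM))            # branch 1: action -> predicted visual outcome
W_vis2act = rng.normal(size=(VIS_DIM, ACT_DIM))            # branch 2: vision -> reconstructed action
W_fuse = rng.normal(size=(ACT_DIM + VIS_DIM, LATENT_DIM))  # branch 3: fuse both purified modalities
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))    # shared discrete latent space

def tokenize(action, vision):
    """Map an (action, vision) pair to a discrete 'physical intent' token id."""
    vis_pred = action @ W_act2vis   # anchor kinematics to their visual consequences
    act_rec = vision @ W_vis2act    # recover action content, filtering visual confounders
    fused = np.concatenate([act_rec, vis_pred]) @ W_fuse
    # Vector-quantize: the nearest codebook entry is the embodiment-agnostic token
    return int(np.argmin(np.linalg.norm(codebook - fused, axis=1)))

# A human clip and a humanoid clip would each yield a token from the same codebook
human_token = tokenize(rng.normal(size=ACT_DIM), rng.normal(size=VIS_DIM))
robot_token = tokenize(rng.normal(size=ACT_DIM), rng.normal(size=VIS_DIM))
print(human_token, robot_token)
```

Because both embodiments are tokenized against the same codebook, a downstream policy or world model can condition on the token ids alone, which is the sense in which the latent space is "embodiment-agnostic" here.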