TEXEDO: Test Time Scaling for Controller-aware Language-conditioned Humanoid Motion Generation

2026-06-22

Robotics
AI summary

The authors present TEXEDO, a method that helps robots follow text instructions better by choosing motions they can actually perform. Instead of relying only on human motion data that might not fit robot abilities well, TEXEDO picks the best motion from several options based on how well the motion aligns with the instruction and whether the robot can physically do it. They test their method in simulations and on a real robot, showing improvements in movement accuracy and instruction matching. This approach ensures generated motions are both meaningful and practical for robots to execute.

text-conditioned motion generation, humanoid robots, motion retargeting, dynamic feasibility, tracking controllers, semantic alignment, reward model, simulation, robot motion planning, physical executability
Authors
Jianuo Cao, Yuxin Chen, Yuzhen Song, Masayoshi Tomizuka, Chenran Li, Thomas Tian
Abstract
Text-conditioned motion generation is a promising interface for programming humanoid robots, yet current generators are often trained on human motion datasets retargeted to robot morphologies. Although such data provides rich semantic and kinematic priors, it fails to capture the nuances of whole-body tracking controllers, including balance, contact dynamics, actuation limits, and controller-specific failure modes. As a result, generated motions can be semantically plausible but difficult or impossible for the robot to execute. We introduce TEXEDO, a test-time scaling framework for humanoid motion generation that improves motion quality without requiring a stronger underlying generator. Given a text prompt, TEXEDO samples multiple candidate motions from a pretrained text-conditioned generator and selects the best motion that is both executable and task-aligned. The reward model combines a dynamic feasibility verifier, distilled from whole-body tracking rollouts to predict physical executability, with a semantic alignment verifier that measures text-motion alignment in a learned co-embedding space. Our pipeline treats dynamic feasibility as a hard constraint and semantic alignment as the selection objective within the feasible set. Through large-scale simulation studies and real-world deployment on a Unitree G1 humanoid robot, we show that TEXEDO consistently improves both tracking fidelity and text alignment. These results demonstrate that grounded verification is an effective path toward deployable language-guided humanoid motion generation. Project website: https://jianuocao.github.io/TEXEDO/
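The selection rule described in the abstract (feasibility as a hard constraint, semantic alignment as the objective within the feasible set) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `select_motion`, the scoring callables, the `threshold` value, and the choice to return `None` when no candidate is feasible are all hypothetical and chosen only for illustration. In practice the two scorers would be the learned feasibility and alignment verifiers, and the candidates would be samples from the pretrained generator.

```python
from typing import Callable, Optional, Sequence, TypeVar

Motion = TypeVar("Motion")


def select_motion(
    candidates: Sequence[Motion],
    feasibility: Callable[[Motion], float],
    alignment: Callable[[Motion], float],
    threshold: float = 0.5,  # assumed cutoff; the abstract does not give one
) -> Optional[Motion]:
    """Best-of-N selection with a hard feasibility filter.

    feasibility(m): hypothetical predicted probability that motion m
        can be tracked by the whole-body controller.
    alignment(m): hypothetical text-motion alignment score for the prompt.
    """
    # Hard constraint: discard candidates predicted to be non-executable.
    feasible = [m for m in candidates if feasibility(m) >= threshold]
    if not feasible:
        # Assumption: the source does not say what happens in this case.
        return None
    # Objective: among feasible candidates, maximize semantic alignment.
    return max(feasible, key=alignment)


# Toy candidates standing in for sampled motions (values are made up).
candidates = [
    {"name": "a", "feasible_prob": 0.9, "align": 0.40},
    {"name": "b", "feasible_prob": 0.2, "align": 0.95},
    {"name": "c", "feasible_prob": 0.8, "align": 0.70},
]

best = select_motion(
    candidates,
    feasibility=lambda m: m["feasible_prob"],
    alignment=lambda m: m["align"],
)
print(best["name"])  # prints "c"
```

In the toy data, candidate "b" has the highest alignment score but is filtered out by the feasibility check, so the best-aligned feasible candidate "c" is returned. Ranking by a weighted sum of the two scores would instead let a highly aligned but non-executable motion win, which is the failure mode the hard constraint is meant to avoid.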