Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

2026-06-02 | Robotics

Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors present Humanoid-GPT, a GPT-style Transformer trained on a 2B-frame corpus of retargeted human motion for whole-body humanoid control. Earlier trackers were shallow MLPs that were limited by scarce data and had to trade agility against generalization. Scaling both the data and the model lets a single Transformer track fast, complex movements while also generalizing zero-shot to motions and control tasks it has not seen before. The authors' experiments and scaling analyses indicate that it sets a new performance frontier for generalizable whole-body control.

Transformer, causal attention, motion capture (mocap), pre-training, zero-shot generalization, whole-body control, motion synthesis, deep learning, transformer scaling
Authors
Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi
Abstract
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier on both fronts: robust zero-shot generalization and tracking of highly dynamic, complex motions.
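
For readers unfamiliar with the term, the following is a minimal sketch of what "causal attention" over a sequence of motion frames means: each frame may attend only to itself and to earlier frames. It is a generic single-head illustration in NumPy, not the authors' architecture; the function name, the weight matrices `w_q`, `w_k`, `w_v`, and all dimensions are assumptions chosen for the example.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention (illustrative, not the paper's model).

    x:   (T, d) array, one embedding per motion frame (hypothetical input).
    w_*: (d, d_head) projection matrices (hypothetical parameters).
    Returns the attended values (T, d_head) and the attention weights (T, T).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Mask out future frames: entry (i, j) with j > i is set to -inf,
    # so frame i cannot see frame j.
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because of the mask, changing later frames leaves the outputs for earlier frames untouched, which is the property that allows a GPT-style model to be run step by step at control time.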